The graphic considers two features of the houses in the dataset. First, we compute the price per square foot and look at how this quantity differs across the three cities. Houses in Beverly Hills clearly have an average price per square foot that is dramatically higher than in the other two cities. We also show the average square footage per house in each city. Here the difference is even more extreme: Beverly Hills homes have, on average, roughly three times the square footage of houses in the other two cities.
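The exact code behind the graphic is not shown, but a minimal sketch of how these summaries might be computed is below, assuming a data frame named homes with hypothetical columns city, price, and sqft (the actual column names in the dataset may differ).

library(dplyr)

# Hypothetical columns: city, price (sale price in dollars), sqft (square footage)
homes %>%
  group_by(city) %>%
  summarise(
    avg_price_per_sqft = mean(price / sqft),
    avg_sqft           = mean(sqft)
  )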
We define the groups as follows:
Group 1: no exercise and male
Group 2: exercise and male
Group 3: no exercise and female
Group 4: exercise and female
## # A tibble: 4 × 2
## category `mean(wtdesire - weight)`
## <dbl> <dbl>
## 1 1 -13.897161
## 2 2 -9.781941
## 3 3 -22.125638
## 4 4 -16.593541
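The code that built these groups is not shown; a minimal sketch of one way to construct them and reproduce the summary above follows, assuming the data frame is the cdc survey data with columns exerany (1 = exercise, 0 = no exercise), gender ("m"/"f"), weight, and wtdesire. The variable names and coding are assumptions.

library(dplyr)

# Assumed encoding: exerany (0/1) and gender ("m"/"f")
cdc_groups = cdc %>%
  mutate(category = case_when(
    exerany == 0 & gender == "m" ~ 1,   # group 1: no exercise, male
    exerany == 1 & gender == "m" ~ 2,   # group 2: exercise, male
    exerany == 0 & gender == "f" ~ 3,   # group 3: no exercise, female
    exerany == 1 & gender == "f" ~ 4    # group 4: exercise, female
  ))

# Mean difference between desired and current weight in each group
cdc_groups %>%
  group_by(category) %>%
  summarise(mean(wtdesire - weight))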
As seen in the four plots above, the general trend across the groups is that people’s desired weight is lower than their current weight. Looking more closely within each group (in the table above), group 3, the women who do not exercise, shows the largest discrepancy between current and desired weight, while the men who exercise (group 2) show the smallest. In addition, as people’s current weight grows, their desired weight grows as well, though at a slower rate. This means that people generally have reasonable expectations of themselves, but some of the heavier people may have overly ambitious goals of losing a lot of weight.
Since we use a smoother instead of a regression line, the curves are generally not linear. Instead, they are able to capture the smaller trend that very heavy people tend to have very low desired weights relative to their current weights. This is particularly obvious in group 2, where the curve bends back down as the current weight surpasses 300 pounds. In general, the smoothing curve is more sensitive to the data and allows for more flexibility.
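The plots themselves are not reproduced here, but a sketch of one way to draw them with ggplot2 is below, assuming the hypothetical cdc_groups data frame from the earlier sketch. geom_smooth() uses a flexible smoother (loess or GAM) by default, while method = "lm" gives the straight regression line being contrasted.

library(ggplot2)

# Desired vs. current weight in each group: flexible smoother plus a linear fit for comparison
ggplot(cdc_groups, aes(x = weight, y = wtdesire)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE) +                                      # default smoother
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +  # straight-line fit
  facet_wrap(~ category)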
# True regression function: a cubic polynomial
f = function(x) {
  5 + 2 * x + 1.5 * x^2 + 0.4 * x^3
}

# Simulated response: f(x) plus Gaussian noise with sd = 10
y = function(x) {
  eps = rnorm(n = length(x), mean = 0, sd = 10)
  f(x) + eps
}

# Draw one simulated set of responses at the design points x
sim = function(x) {
  y(x)
}
iters = 1000        # number of simulated training sets
num_models = 7      # polynomial orders 1 through 7
x = rep(0:10, 5)    # design points: 0 through 10, five copies each
x_test = 3          # test point at which the fits are evaluated
y_test = matrix(0, nrow = num_models, ncol = iters)  # predictions at x_test, one row per order
set.seed(123)
for (i in 1:iters) {
  y_train = sim(x)                 # new noisy responses for this iteration
  for (order in 1:num_models) {    # fit 7 models for each iteration
    model = lm(y_train ~ poly(x, degree = order, raw = TRUE))
    y_test[order, i] = predict(model, data.frame(x = x_test))
  }
}
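The code that produced the table below is not shown; a sketch of one way to compute these quantities from y_test follows. Since the noise has mean zero, the true value at x_test is f(x_test); the bias is taken here as the absolute difference between the average prediction and that true value.

# Bias, variance, and standard deviation of the predictions at x_test, by polynomial order
bias           = abs(rowMeans(y_test) - f(x_test))
model_variance = apply(y_test, 1, var)
std_dev        = sqrt(model_variance)
results = data.frame(order = 1:num_models, bias, model_variance, std_dev)
results[order(results$bias), ]    # sort in ascending order of bias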
## order bias model_variance std_dev
## 6 6 7.187328e-04 10.133804 3.183364
## 7 7 1.794416e-03 10.180514 3.190692
## 5 5 1.913180e-02 7.674790 2.770341
## 4 4 2.477467e-02 5.873021 2.423432
## 3 3 3.100144e-02 5.832488 2.415055
## 2 2 1.106427e+01 3.199567 1.788733
## 1 1 3.395148e+01 2.387868 1.545273
As seen in the table above, which is sorted in ascending order by bias, the 6th order polynomial has the smallest bias. This is not surprising: a higher-order polynomial allows a lot of flexibility, so we expect the bias of more complicated models to be small.
We use the expected test MSE, given by
\[ \mathrm{E} \left( y_0 - \hat{f}(x_0) \right)^2 = \mathrm{Var}\left( \hat{f}(x_0)\right) + \left[ \mathrm{Bias} \left( \hat{f}(x_0) \right)\right]^2 + \mathrm{Var}\left( \varepsilon \right) \]
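A sketch of how the expected test MSE could be assembled from these components, assuming the results data frame from the sketch above; here \(\mathrm{Var}(\varepsilon) = 10^2 = 100\), since the simulated noise has standard deviation 10.

# Expected test MSE = variance of the fit + squared bias + irreducible noise variance
results$mse = results$model_variance + results$bias^2 + 10^2
results[order(results$mse), c("order", "mse")]    # sort in ascending order of MSE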
## order mse
## 3 3 105.8334
## 4 4 105.8736
## 5 5 107.6752
## 6 6 110.1338
## 7 7 110.1805
## 2 2 225.6176
## 1 1 1255.0910
As seen in the calculated test MSEs, the third order polynomial has the lowest test MSE, which is not surprising because we know that the true function \(f\) is a cubic polynomial. Thus, the fitted model should have a small test MSE when using a 3rd order approximation.
These results are consistent with the bias-variance tradeoff, which says that, in general, as the model grows in complexity, the variance grows as well. Thus, even though the bias suggested that the 6th order polynomial was a good choice, the low MSE of the 3rd order model shows that the choice of model is not determined solely by the bias. The variance of the predictions, and therefore the test MSE, is also an important consideration.
# Training observations and their class labels
x1 = c(0, 2, 0, 0, -1, 1)
x2 = c(3, 0, 1, 1, 0, 1)
x3 = c(0, 0, 3, 2, 1, 1)
y = c("red", "red", "red", "green", "green", "red")
train = data.frame(x1, x2, x3, y)

# Test point at the origin
x = c(0, 0, 0)

# Euclidean (L2) distance from the test point to each training observation
l2_dist = sqrt((x[1] - train$x1)^2 + (x[2] - train$x2)^2 + (x[3] - train$x3)^2)
l2_dist
## [1] 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
The Euclidean distance from the test point to each of the observations is given in the output above.
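One way to pair the distances with the class labels and sort them in ascending order, producing a table like the one below, is sketched here.

# Attach the distances to the training labels and sort by distance
neighbors = data.frame(y = train$y, l2_dist = l2_dist)
neighbors[order(neighbors$l2_dist), ]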
## y l2_dist
## 5 green 1.414214
## 6 red 1.732051
## 2 red 2.000000
## 4 green 2.236068
## 1 red 3.000000
## 3 red 3.162278
By calculating the Euclidean distance between each observation and the test point and sorting the distances in ascending order, we can see which class (green or red) is closest to the test point. If we use \(K = 1\) nearest neighbor, then we predict that the test point is in the green class since the “closest neighbor” to the test point is green.
If we use \(K = 3\), then we consider the three closest neighbors. Again looking at the table above, we see that although the closest neighbor is green, the second and third closest neighbors are red. These points “vote,” and since the red neighbors outnumber the green neighbor, we predict that the test point is red.
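As a cross-check, a sketch using the knn() function from the class package, which should reproduce the \(K = 1\) and \(K = 3\) predictions above (assuming the train data frame defined earlier).

library(class)

# K = 1: the single nearest neighbor is green, so the prediction is green
knn(train = train[, c("x1", "x2", "x3")],
    test = data.frame(x1 = 0, x2 = 0, x3 = 0),
    cl = train$y, k = 1)

# K = 3: the three nearest neighbors (green, red, red) vote, so the prediction is red
knn(train = train[, c("x1", "x2", "x3")],
    test = data.frame(x1 = 0, x2 = 0, x3 = 0),
    cl = train$y, k = 3)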