The graphic considers two features of the houses in the dataset. First, we compute the price per square foot and look at how this quantity differs across the three cities. Houses in Beverly Hills clearly have an average price per square foot that is dramatically higher than in the other two cities. We also show the average square footage per house in each city. Here the difference is even more extreme: Beverly Hills homes have, on average, roughly three times the square footage of houses in the other two cities.
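The exact code behind the graphic is not shown, but a minimal sketch of how these summaries might be computed is below, assuming a data frame named homes with hypothetical columns city, price, and sqft (the actual column names in the dataset may differ).

library(dplyr)

# Hypothetical columns: city, price (sale price in dollars), sqft (square footage)
homes %>%
  group_by(city) %>%
  summarise(
    avg_price_per_sqft = mean(price / sqft),
    avg_sqft           = mean(sqft)
  )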
We define the groups as follows:
Group 1: no exercise and male
Group 2: exercise and male
Group 3: no exercise and female
Group 4: exercise and female
## # A tibble: 4 × 2
## category `mean(wtdesire - weight)`
## <dbl> <dbl>
## 1 1 -13.897161
## 2 2 -9.781941
## 3 3 -22.125638
## 4 4 -16.593541
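The code that built these groups is not shown; a minimal sketch of one way to construct them and reproduce the summary above follows, assuming the data frame is the cdc survey data with columns exerany (1 = exercise, 0 = no exercise), gender ("m"/"f"), weight, and wtdesire. The variable names and coding are assumptions.

library(dplyr)

# Assumed encoding: exerany (0/1) and gender ("m"/"f")
cdc_groups = cdc %>%
  mutate(category = case_when(
    exerany == 0 & gender == "m" ~ 1,   # group 1: no exercise, male
    exerany == 1 & gender == "m" ~ 2,   # group 2: exercise, male
    exerany == 0 & gender == "f" ~ 3,   # group 3: no exercise, female
    exerany == 1 & gender == "f" ~ 4    # group 4: exercise, female
  ))

# Mean difference between desired and current weight in each group
cdc_groups %>%
  group_by(category) %>%
  summarise(mean(wtdesire - weight))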
As seen in the four plots above, the general trend across the groups is that people’s desired weight is lower than their current weight. Looking more closely within each group (in the table above), group 3, the women who do not exercise, shows the largest discrepancy between current and desired weight, while the men who exercise (group 2) show the smallest. In addition, as people’s current weight grows, their desired weight grows as well, though at a slower rate. This means that people generally have reasonable expectations of themselves, but some of the heavier people may have overly ambitious goals of losing a lot of weight.
Since we use a smoother instead of a regression line, the curves are generally not linear. Instead, they are able to capture the smaller trend that very heavy people tend to have very low desired weights relative to their current weights. This is particularly obvious in group 2, where the curve bends back down as the current weight surpasses 300 pounds. In general, the smoothing curve is more sensitive to the data and allows for more flexibility.
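The plots themselves are not reproduced here, but a sketch of one way to draw them with ggplot2 is below, assuming the hypothetical cdc_groups data frame from the earlier sketch. geom_smooth() uses a flexible smoother (loess or GAM) by default, while method = "lm" gives the straight regression line being contrasted.

library(ggplot2)

# Desired vs. current weight in each group: flexible smoother plus a linear fit for comparison
ggplot(cdc_groups, aes(x = weight, y = wtdesire)) +
  geom_point(alpha = 0.2) +
  geom_smooth(se = FALSE) +                                      # default smoother
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +  # straight-line fit
  facet_wrap(~ category)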
# True regression function: a cubic polynomial
f = function(x) {
  5 + 2 * x + 1.5 * x^2 + 0.4 * x^3
}

# Simulated response: f(x) plus Gaussian noise with sd = 10
y = function(x) {
  eps = rnorm(n = length(x), mean = 0, sd = 10)
  f(x) + eps
}

# Draw one simulated set of responses at the design points x
sim = function(x) {
  y(x)
}
iters = 1000        # number of simulated training sets
num_models = 7      # polynomial orders 1 through 7
x = rep(0:10, 5)    # design points: 0 through 10, five copies each
x_test = 3          # test point at which the fits are evaluated
y_test = matrix(0, nrow = num_models, ncol = iters)  # predictions at x_test, one row per order
set.seed(123)
for (i in 1:iters) {
  y_train = sim(x)                 # new noisy responses for this iteration
  for (order in 1:num_models) {    # fit 7 models for each iteration
    model = lm(y_train ~ poly(x, degree = order, raw = TRUE))
    y_test[order, i] = predict(model, data.frame(x = x_test))
  }
}
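The code that produced the table below is not shown; a sketch of one way to compute these quantities from y_test follows. Since the noise has mean zero, the true value at x_test is f(x_test); the bias is taken here as the absolute difference between the average prediction and that true value.

# Bias, variance, and standard deviation of the predictions at x_test, by polynomial order
bias           = abs(rowMeans(y_test) - f(x_test))
model_variance = apply(y_test, 1, var)
std_dev        = sqrt(model_variance)
results = data.frame(order = 1:num_models, bias, model_variance, std_dev)
results[order(results$bias), ]    # sort in ascending order of bias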
## order bias model_variance std_dev
## 6 6 7.187328e-04 10.133804 3.183364
## 7 7 1.794416e-03 10.180514 3.190692
## 5 5 1.913180e-02 7.674790 2.770341
## 4 4 2.477467e-02 5.873021 2.423432
## 3 3 3.100144e-02 5.832488 2.415055
## 2 2 1.106427e+01 3.199567 1.788733
## 1 1 3.395148e+01 2.387868 1.545273
As seen in the table above, which is sorted in ascending order by bias, the 6th order polynomial has the smallest bias. This is not surprising: a higher-order polynomial allows a lot of flexibility, so we expect the bias of more complicated models to be small.
We use the expected test MSE, given by
\[ \mathrm{E} \left( y_0 - \hat{f}(x_0) \right)^2 = \mathrm{Var}\left( \hat{f}(x_0)\right) + \left[ \mathrm{Bias} \left( \hat{f}(x_0) \right)\right]^2 + \mathrm{Var}\left( \varepsilon \right) \]
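A sketch of how the expected test MSE could be assembled from these components, assuming the results data frame from the sketch above; here \(\mathrm{Var}(\varepsilon) = 10^2 = 100\), since the simulated noise has standard deviation 10.

# Expected test MSE = variance of the fit + squared bias + irreducible noise variance
results$mse = results$model_variance + results$bias^2 + 10^2
results[order(results$mse), c("order", "mse")]    # sort in ascending order of MSE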
## order mse
## 3 3 105.8334
## 4 4 105.8736
## 5 5 107.6752
## 6 6 110.1338
## 7 7 110.1805
## 2 2 225.6176
## 1 1 1255.0910
As seen in the calculated test MSEs, the third order polynomial has the lowest test MSE, which is not surprising because we know that the true function \(f\) is a cubic polynomial. Thus, the fitted model should have a small test MSE when using a 3rd order approximation.
These results are consistent with the bias-variance tradeoff, which says that, in general, as the model grows in complexity, the variance grows as well. Thus, even though the bias suggested that the 6th order polynomial was a good choice, the low MSE of the 3rd order model shows that the choice of model is not determined solely by the bias. The variance of the predictions, and therefore the test MSE, is also an important consideration.
# Training observations and their class labels
x1 = c(0, 2, 0, 0, -1, 1)
x2 = c(3, 0, 1, 1, 0, 1)
x3 = c(0, 0, 3, 2, 1, 1)
y = c("red", "red", "red", "green", "green", "red")
train = data.frame(x1, x2, x3, y)

# Test point at the origin
x = c(0, 0, 0)

# Euclidean (L2) distance from the test point to each training observation
l2_dist = sqrt((x[1] - train$x1)^2 + (x[2] - train$x2)^2 + (x[3] - train$x3)^2)
l2_dist
## [1] 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
The Euclidean distance from the test point to each of the observations is given in the output above.
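One way to pair the distances with the class labels and sort them in ascending order, producing a table like the one below, is sketched here.

# Attach the distances to the training labels and sort by distance
neighbors = data.frame(y = train$y, l2_dist = l2_dist)
neighbors[order(neighbors$l2_dist), ]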
## y l2_dist
## 5 green 1.414214
## 6 red 1.732051
## 2 red 2.000000
## 4 green 2.236068
## 1 red 3.000000
## 3 red 3.162278
By calculating the Euclidean distance between each observation and the test point and sorting the distances in ascending order, we can see which class (green or red) is closest to the test point. If we use \(K = 1\) nearest neighbor, then we predict that the test point is in the green class since the “closest neighbor” to the test point is green.
If we use \(K = 3\), then we consider the three closest neighbors. Again looking at the table above, we see that although the closest neighbor is green, the second and third closest neighbors are red. These points “vote,” and since the red neighbors outnumber the green neighbor, we predict that the test point is red.
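As a cross-check, a sketch using the knn() function from the class package, which should reproduce the \(K = 1\) and \(K = 3\) predictions above (assuming the train data frame defined earlier).

library(class)

# K = 1: the single nearest neighbor is green, so the prediction is green
knn(train = train[, c("x1", "x2", "x3")],
    test = data.frame(x1 = 0, x2 = 0, x3 = 0),
    cl = train$y, k = 1)

# K = 3: the three nearest neighbors (green, red, red) vote, so the prediction is red
knn(train = train[, c("x1", "x2", "x3")],
    test = data.frame(x1 = 0, x2 = 0, x3 = 0),
    cl = train$y, k = 3)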