
Exercise 5.4.9

a. Using the \(\texttt{Boston}\) data set, provide an estimate for the population mean of \(\texttt{medv}\).

library(MASS)  # contains the Boston housing data set
data(Boston)
b = Boston
mu_hat = mean(b$medv) # estimate of the population mean of medv

The estimate for the population mean of \(\texttt{medv}\) is \(\hat{\mu} = 22.533\).

b. Provide an estimate of the standard error of \(\hat{\mu}\). Interpret the result.

n = nrow(b)                  # sample size
se = sd(b$medv) / sqrt(n)    # plug-in estimate of SE(mu_hat)

An estimate for the standard error of \(\hat{\mu}\) is \(\hat{se} = 0.409\). This value approximates the standard deviation of the sampling distribution of the sample mean: it tells us roughly how far, on average, the sample mean \(\hat{\mu}\) will fall from the true population mean if the sampling were repeated.
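
Written out, the plug-in estimate computed above is (with \(s = \mathrm{sd}(\texttt{medv}) \approx 9.197\) and \(n = 506\)):

\[ \hat{se}(\hat{\mu}) = \frac{s}{\sqrt{n}} = \frac{9.197}{\sqrt{506}} \approx 0.409 \]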

c. Estimate the standard error of \(\hat{\mu}\) using the bootstrap.

library(boot)  # for boot()
set.seed(1)
iters = 1e4

# statistic passed to boot(): the mean of the bootstrap sample selected by ind
boot.fn = function(d, ind) {
  mean(d[ind])
}
boot_result = boot(b$medv, boot.fn, R = iters)
boot_result
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = b$medv, statistic = boot.fn, R = iters)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 22.53281 0.0005436561   0.4054831

The bootstrap estimate of the standard error is \(\hat{se} = 0.4054\), which is very close to the plug-in estimate from part (b).
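
The same estimate can also be reproduced with an explicit resampling loop; a minimal sketch without the boot() wrapper (the exact value depends on the seed):

# Manual bootstrap: resample medv with replacement, recompute the mean each
# time, and take the standard deviation of the resampled means.
set.seed(1)
boot_means = replicate(iters, mean(sample(b$medv, size = n, replace = TRUE)))
sd(boot_means) # roughly 0.40-0.41, in line with the std. error from boot()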

d. Provide a 95% confidence interval for the mean of \(\texttt{medv}\).

se_hat = 0.40548                      # bootstrap estimate of the standard error
low = mu_hat - 2 * se_hat
high = mu_hat + 2 * se_hat
boot_ci = c(low, high)                # approximate 95% CI: mu_hat +/- 2 * SE

medv_test = t.test(b$medv)
ci = c(medv_test$conf.int[1], medv_test$conf.int[2]) # t-based 95% CI

The bootstrap confidence interval is given by:

\[ \left( 21.722, 23.344 \right) \]

The confidence interval generated using a t-test is given by:

\[ \left( 21.730, 23.336 \right) \]

The two confidence intervals are very close to each other. The bootstrap not only gives a very close approximation to the true standard error but also a close approximation to the 95% confidence interval for the population mean.
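
The boot package can also produce these intervals directly from the bootstrap object; for example, the normal-approximation and percentile intervals (the exact endpoints may differ slightly from the hand-computed ones above):

# Normal-approximation and percentile bootstrap intervals for the mean of medv
boot.ci(boot_result, conf = 0.95, type = c("norm", "perc"))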

Exercise 6.8.2

a. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The tuning parameter \(\lambda\) makes the model less flexible, since coefficients are shrunk toward (and some exactly to) 0. In terms of the bias-variance tradeoff, this means that although there is more bias (less flexibility to adapt to the data), the resulting model is less sensitive to variability in the training data, and thus gives higher prediction accuracy whenever the reduction in variance outweighs the added bias (see the simulation sketch after part (c)).

b. Ridge regression, relative to least squares, is also less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance, for the same reason as the lasso (though ridge shrinks coefficients toward 0 without setting any exactly to 0).

c. Nonlinear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias. Greater flexibility lets the model adapt closely to the training data, driving the bias down. However, flexible fits tend to have high variance: small changes in the training data can change the fitted model substantially, which translates into higher test error unless the reduction in bias is large enough to compensate.
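
To make the tradeoff in (a) and (b) concrete, here is a small illustrative simulation (not part of the exercise; the sample size, sparsity pattern, and seed are arbitrary choices). With many predictors and few observations, least squares has high variance, and the lasso's extra bias is more than offset by its variance reduction on test data:

library(glmnet)
set.seed(42)
n_sim = 60; p_sim = 50
X = matrix(rnorm(n_sim * p_sim), n_sim, p_sim)
beta_true = c(rep(2, 5), rep(0, p_sim - 5))  # only the first 5 predictors matter
y_tr = drop(X %*% beta_true + rnorm(n_sim))
X_te = matrix(rnorm(n_sim * p_sim), n_sim, p_sim)
y_te = drop(X_te %*% beta_true + rnorm(n_sim))

ols = lm(y_tr ~ X)                          # least squares: 51 parameters, 60 observations
mse_ols = mean((y_te - cbind(1, X_te) %*% coef(ols))^2)

cv = cv.glmnet(X, y_tr, alpha = 1)          # lasso with lambda chosen by CV
mse_lasso = mean((y_te - predict(cv, X_te, s = "lambda.min"))^2)
c(ols = mse_ols, lasso = mse_lasso)         # the lasso's test MSE should be much smaller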

Exercise 6.8.8

a, b. In the R code chunk below, we generate simulated data: a predictor \(X\) with 100 observations, a noise vector \(\epsilon\), and a response \(Y\), according to the model:

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon \]

where we choose \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3) = (-2, -1, 1, 2)\).

set.seed(123)
beta = c(-2, -1, 1, 2)
x = rnorm(100)      # predictor
eps = rnorm(100)    # noise term

# true model: Y = beta0 + beta1*X + beta2*X^2 + beta3*X^3 + eps
f = function(x, eps) {
  beta[1] + beta[2] * x + beta[3] * x^2 + beta[4] * x^3 + eps
}

y = f(x, eps)
d = data.frame(x, y)

c. Best subset selection

library(leaps)  # for regsubsets()
regfit.full = regsubsets(y ~ poly(x, 10, raw = T),
                         data = d, nvmax = 10, method = "exhaustive")
full.sum = summary(regfit.full)

which.min(full.sum$bic)   # 3
which.max(full.sum$adjr2) # 7
which.min(full.sum$cp)    # 3
multiplot(plotCriteria(regfit.full), cols = 2) # generate plots for all 3 criteria

coefficients(regfit.full, id = 3)
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            -2.0296061            -1.0795538             0.9084569 
## poly(x, 10, raw = T)3 
##             2.0204363
coefficients(regfit.full, id = 7)
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##             -2.2068250             -1.1092516              3.2758848 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)4  poly(x, 10, raw = T)6 
##              2.0530514             -4.0243410              2.2138377 
##  poly(x, 10, raw = T)8 poly(x, 10, raw = T)10 
##             -0.4870719              0.0373538

According to \(C_p\) and \(BIC\), the best model uses 3 of the predictors. As seen in the coefficients() call in the code chunk above, that model uses the order 1, 2, and 3 terms. If we use the adjusted \(R^2\) instead, the chosen model uses 7 predictors: \(X, X^2, X^3, X^4, X^6, X^8, X^{10}\). The plots are shown above, and we can see visually that both \(C_p\) and \(BIC\) are minimized with a subset of 3 predictors. Note that even though the adjusted \(R^2\) is maximized with 7 predictors, it changes very little after the 3rd predictor is added, so it may be reasonable to use 3 predictors under this criterion as well. The coefficient estimates from the 3-predictor model are very close to the true values of \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3) = (-2, -1, 1, 2)\), so it is reasonable to accept this as a good model.
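
Note that plotCriteria() and multiplot() used above are local helper functions not shown in this write-up; an equivalent view of the three criteria can be produced with base graphics, for example:

# Plot Cp, BIC, and adjusted R^2 against model size and mark the chosen sizes.
par(mfrow = c(1, 3))
plot(full.sum$cp, type = "b", xlab = "Number of predictors", ylab = "Cp")
points(which.min(full.sum$cp), min(full.sum$cp), col = "red", pch = 19)
plot(full.sum$bic, type = "b", xlab = "Number of predictors", ylab = "BIC")
points(which.min(full.sum$bic), min(full.sum$bic), col = "red", pch = 19)
plot(full.sum$adjr2, type = "b", xlab = "Number of predictors", ylab = "Adjusted R^2")
points(which.max(full.sum$adjr2), max(full.sum$adjr2), col = "red", pch = 19)
par(mfrow = c(1, 1))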

d. Forward and backward stepwise selection; compare the answers to those from (c).

# forward
fwd.fit = regsubsets(y ~ poly(x, 10, raw = T), method = 'forward', data = d, nvmax = 10)
fwd.sum = summary(fwd.fit)
which.min(fwd.sum$bic)   # 3
which.max(fwd.sum$adjr2) # 4
which.min(fwd.sum$cp)    # 3

# backward
bkwd.fit = regsubsets(y ~ poly(x, 10, raw = T), method = "backward", data = d, nvmax = 10)
bkwd.sum = summary(bkwd.fit)
which.min(bkwd.sum$bic)   # 3
which.max(bkwd.sum$adjr2) # 7
which.min(bkwd.sum$cp)    # 3
fwd_plots = plotCriteria(fwd.fit, col = "red")
bkwd_plots = plotCriteria(bkwd.fit, col = "blue")

fwd_bkwd = append(fwd_plots, bkwd_plots)
multiplot(fwd_bkwd, cols = 2)

coefficients(fwd.fit, id = 3)
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            -2.0296061            -1.0795538             0.9084569 
## poly(x, 10, raw = T)3 
##             2.0204363
coefficients(bkwd.fit, id = 3)
##           (Intercept) poly(x, 10, raw = T)1 poly(x, 10, raw = T)2 
##            -2.0296061            -1.0795538             0.9084569 
## poly(x, 10, raw = T)3 
##             2.0204363
coefficients(bkwd.fit, id = 7) # same predictors chosen as in best subset
##            (Intercept)  poly(x, 10, raw = T)1  poly(x, 10, raw = T)2 
##             -2.2068250             -1.1092516              3.2758848 
##  poly(x, 10, raw = T)3  poly(x, 10, raw = T)4  poly(x, 10, raw = T)6 
##              2.0530514             -4.0243410              2.2138377 
##  poly(x, 10, raw = T)8 poly(x, 10, raw = T)10 
##             -0.4870719              0.0373538
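
Comparing to (c): \(C_p\) and \(BIC\) again select the 3-predictor model under both forward and backward selection, with exactly the same coefficients as best subset selection. The adjusted \(R^2\) criterion differs: forward selection prefers 4 predictors, while backward selection recovers the same 7-predictor model as best subset. A quick check that the 3-predictor fits agree (using the objects fitted above):

# Verify that best subset, forward, and backward selection choose the same
# 3-predictor model (orders 1, 2, 3) with identical coefficient estimates.
all.equal(coefficients(regfit.full, id = 3), coefficients(fwd.fit, id = 3))
all.equal(coefficients(regfit.full, id = 3), coefficients(bkwd.fit, id = 3))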

e. Fit a lasso model and use cross-validation to select the optimal value of \(\lambda\).

library(glmnet)  # for cv.glmnet()
x_mat = model.matrix(y ~ poly(x, 10, raw = T), data = d)[, -1]
y = d$y

cv.lasso = cv.glmnet(x = x_mat, y = y, alpha = 1)
plot(cv.lasso) # cross-validation MSE vs. log(lambda)

lambda_star = cv.lasso$lambda.min # find optimal lambda via CV
predict(cv.lasso, type = "coefficients", s = lambda_star) # coefficient estimates
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## (Intercept)            -1.955366728
## poly(x, 10, raw = T)1  -0.963705978
## poly(x, 10, raw = T)2   0.744397361
## poly(x, 10, raw = T)3   1.936282462
## poly(x, 10, raw = T)4   .          
## poly(x, 10, raw = T)5   0.013011465
## poly(x, 10, raw = T)6   0.009279842
## poly(x, 10, raw = T)7   .          
## poly(x, 10, raw = T)8   .          
## poly(x, 10, raw = T)9   .          
## poly(x, 10, raw = T)10  .

The plot of the cross-validation error as a function of \(\lambda\) is shown above. The resulting coefficient estimates are reasonable approximations: the estimates for the intercept and the first three orders are very close to the true values \(\beta = (-2, -1, 1, 2)\). There are also small non-zero estimates for the 5th and 6th order coefficients, but these are tiny compared to the first four and have little influence. Overall, the lasso does a good job of approximating the coefficients: it includes a bit of extra information (the order 5 and 6 terms), but the main estimates are accurate.
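
As a side note, cv.glmnet also reports lambda.1se, the largest \(\lambda\) whose cross-validation error is within one standard error of the minimum; extracting coefficients at that value typically gives an even sparser model (whether the order 5 and 6 terms survive depends on the CV folds):

# Coefficients at the more conservative lambda.1se; the small high-order terms
# are usually shrunk all the way to zero here.
predict(cv.lasso, type = "coefficients", s = cv.lasso$lambda.1se)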

f. Generate a response vector \(Y\) according to the model:

\[ Y = \beta_0 + \beta_7 X^7 + \epsilon, \quad \beta_0 = -2, \quad \beta_7 = -1 \]

and perform best subset selection and the lasso.

## best subset selection
# beta0 = -2, beta7 = -1
y0 = beta[1] + beta[2] * x^7 + eps
d0 = data.frame(y = y0, x = x)
bss = regsubsets(y0 ~ poly(x, 10, raw = TRUE),
                 data = d0, method = "exhaustive", nvmax = 10)
bss_sum = summary(bss)
which.min(bss_sum$cp)    # 3
which.min(bss_sum$bic)   # 1
which.max(bss_sum$adjr2) # 4
coef(bss, id = 3) # Cp model: orders 2, 6, 7
##              (Intercept) poly(x, 10, raw = TRUE)2 poly(x, 10, raw = TRUE)6 
##             -1.958221056             -0.257807002              0.009422817 
## poly(x, 10, raw = TRUE)7 
##             -0.999773816
coef(bss, id = 1) # BIC model: intercept and X^7, close to the true coefficients
##              (Intercept) poly(x, 10, raw = TRUE)7 
##                -2.106749                -1.000296
coef(bss, id = 4) # adjusted R^2 model: orders 4, 6, 7, 8
##              (Intercept) poly(x, 10, raw = TRUE)4 poly(x, 10, raw = TRUE)6 
##              -1.96762983              -0.41711570               0.18075304 
## poly(x, 10, raw = TRUE)7 poly(x, 10, raw = TRUE)8 
##              -1.00062525              -0.01966157
## lasso
mat0 = model.matrix(y ~ poly(x, 10, raw = TRUE), data = d0)[, -1]
set.seed(123)
lasso0 = cv.glmnet(mat0, y0, alpha = 1)
lambda0_star = lasso0$lambda.min
predict(lasso0, type = "coefficients", s = lambda0_star)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)               -2.1721173
## poly(x, 10, raw = TRUE)1   .        
## poly(x, 10, raw = TRUE)2   .        
## poly(x, 10, raw = TRUE)3   .        
## poly(x, 10, raw = TRUE)4   .        
## poly(x, 10, raw = TRUE)5   .        
## poly(x, 10, raw = TRUE)6   .        
## poly(x, 10, raw = TRUE)7  -0.9760871
## poly(x, 10, raw = TRUE)8   .        
## poly(x, 10, raw = TRUE)9   .        
## poly(x, 10, raw = TRUE)10  .

Using best subset selection, we get models with 3, 1, and 4 predictors, as chosen by \(C_p\), \(BIC\), and adjusted \(R^2\), respectively. The best of these three is the model chosen by \(BIC\), which correctly selects only the 7th order predictor; its coefficient estimates \((-2.11, -1.00)\) are close to the true values of \(\beta = (-2, -1)\). The lasso also correctly picks out the 7th order predictor, and its estimates \((-2.17, -0.98)\) are close to the true coefficients. We can conclude that the lasso does a good job of estimating the coefficients without adding much extra information, whereas best subset selection depends on which criterion is used to choose the model, so its accuracy is less consistent.
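
As a hypothetical extension (not asked for in the exercise), we could simulate fresh data from the same \(Y = -2 - X^7 + \epsilon\) model and compare the two selected models on prediction error; both should do well here, since both recover the true predictor:

# Out-of-sample check on newly simulated data from the same model.
set.seed(456)
x_new = rnorm(100)
y_new = beta[1] + beta[2] * x_new^7 + rnorm(100)

# BIC-selected best subset model: intercept + X^7
b_bss = coef(bss, id = 1)
pred_bss = b_bss[1] + b_bss[2] * x_new^7

# Lasso at its CV-selected lambda
mat_new = model.matrix(~ poly(x_new, 10, raw = TRUE))[, -1]
pred_lasso = predict(lasso0, newx = mat_new, s = lambda0_star)

c(bss = mean((y_new - pred_bss)^2), lasso = mean((y_new - pred_lasso)^2))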