library(ggplot2)

# Plot a model-selection criterion (Cp or BIC) against subset size.
plotCriteria = function(summ, crit, p) {
    preds = 1:p
    if (crit == "cp") {
        criteria = summ$cp
    } else {
        criteria = summ$bic
    }
    model_info = data.frame(predictors = preds, criteria)
    g = ggplot(model_info, aes(x = predictors, y = criteria)) + geom_point() +
        geom_line() + labs(x = "Subset Size", y = crit) + theme_bw() +
        scale_x_continuous(breaks = 1:p)
    return(g) # renamed from p so the plot object does not shadow the argument p
}


# Prediction method for regsubsets objects (the leaps package provides none).
predict.regsubsets = function(object, newdata, id, ...) {
  form  = as.formula(object$call[[2]])  # recover the formula from the call
  mat   = model.matrix(form, newdata)   # design matrix for the new data
  coefi = coef(object, id = id)         # coefficients of the id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi                # predict using only the selected columns
}

50 Years of Data Science Response

I think there is a distinct line between data science and statistics. In recent years, “data science” has become more or less a buzzword describing a set of skills possessed by someone with statistical training. However, the term is used too loosely, often in the context of merely being able to load packages and call pre-written functions. Statistics is the purer field that encompasses the theoretical foundations of probability, distribution theory, simulation, etc., which are often beyond the scope of what it takes to be a modern-day “data scientist.” The two should be made more distinct, as data science is beginning to detract from what it means to be well-versed in statistics.

Boston Dataset Problem

Note that in fitting all of the following models, we use the \(\log\) of the response variable because the raw values are heavily skewed. As a result, when calculating the test MSE, we first exponentiate the predicted values so that they are on the same scale as the response variable in the test data.
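As a small illustration of this convention (the helper below is not from the original code, just a sketch), the back-transformed test MSE can be wrapped in a function:

# Hypothetical helper: predictions are made on the log scale, so they are
# exponentiated before being compared with the raw response.
testMSE = function(log_preds, actual) {
    mean((exp(log_preds) - actual)^2)
}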

I. Best subsets with Cp (Least Squares)

library(leaps)

n_pred = ncol(train) - 1 # number of candidate predictors (13 for the Boston data)

bestsub.fit = regsubsets(log(crim) ~ ., method = "exhaustive",
                         nvmax = n_pred, data = train)
summ = summary(bestsub.fit)
plotCriteria(summ, "cp", n_pred) # 5 predictors

coef(bestsub.fit, id = 5)
##  (Intercept)           zn          nox          rad        black 
## -4.900241548 -0.011484897  5.238125189  0.133050426 -0.001419616 
##        lstat 
##  0.040586516
cp.mod  = lm(log(crim) ~ zn + nox + rad + black + lstat, data = train)
preds   = predict(cp.mod, newdata = test)
cp.mse  = mean((exp(preds) - test$crim)^2)

Looking at the plot, we see that after 5 predictors the \(C_p\) does not decrease appreciably, so using 5 predictors for the model makes sense. The 5 most important predictors, as seen in the \(\texttt{coef()}\) output, are: zn, nox, rad, black, and lstat. The test MSE is 86.574 when using \(C_p\) as the criterion for best subset selection.
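One way to double-check the visual elbow (a sketch using the summ object from above; the outright minimum of \(C_p\) may occur at a larger subset size, but the gains past 5 predictors are small):

which.min(summ$cp)      # subset size that minimizes Cp outright
round(diff(summ$cp), 2) # change in Cp as each predictor is added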

II. Best subsets with BIC (Least Squares)

plotCriteria(summ, "bic", n_pred); # 4 predictors

coef(bestsub.fit, id = 4);
## (Intercept)          zn         nox         rad       lstat 
## -5.57141400 -0.01115101  5.37554839  0.13755634  0.04370372
bic.mod = lm(log(crim) ~ zn + nox + rad + lstat, data = train);
preds   = predict(bic.mod, newdata = test);
bic.mse = mean((exp(preds) - test$crim)^2)

Looking at the plot of the \(BIC\), we see that 4 is the optimal subset size for the model. The 4 selected predictors are: zn, nox, rad, and lstat. The test MSE using this model is 87.355.
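The membership of the selected model can also be read directly from the summary object rather than from \(\texttt{coef()}\) (a sketch using summ from above):

# Row 4 of summ$which flags the variables in the best 4-predictor model;
# the first entry is the intercept, dropped here.
names(which(summ$which[4, ]))[-1]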

III. Best subsets with 10-fold CV (Least Squares)

d = rbind(train, test)
n = dim(d)[1]
k = 10
set.seed(123)
folds = sample(1:k, n, replace = TRUE)
cv.errors = matrix(NA, nrow = k, ncol = n_pred, dimnames = list(NULL, paste(1:n_pred)))
# rows will hold the mse from each fold
# cols will represent the number of variables in the model

for (j in 1:k) {     # iterate through folds
  best.fit = regsubsets(log(crim) ~ ., data = d[folds != j,], nvmax = n_pred)

  for(i in 1:n_pred) { # i identifies the number of variables in the model
    pred = predict.regsubsets(best.fit, d[folds == j,], id = i)  
    cv.errors[j, i] = mean((d$crim[folds == j] - exp(pred))^2)
  }

}
avg.mse       = apply(cv.errors, 2, mean)  # average CV MSE for each model size
min_mse_model = unname(which.min(avg.mse))
cv4_mse       = unname(avg.mse[4]) # MSE corresponding to the model using 4 predictors
mse_diff      = abs(avg.mse[8] - avg.mse[4]) # renamed to avoid masking base::diff
avg.mse
##        1        2        3        4        5        6        7        8 
## 49.10832 55.57642 53.17082 47.18480 50.66407 47.47351 47.41003 46.63842 
##        9       10       11       12       13 
## 46.90862 47.07618 47.02154 46.95992 46.93782

Using 10-fold cross validation, we can estimate the test MSE corresponding to each of the model sizes (output shown above). We see that the model with the minimum estimated MSE is the one with 8 predictors, at 46.638. However, the estimate for the model that uses four predictors differs by only 0.546, so we can accept a small increase in MSE in exchange for a simpler model. The most important predictors in a 4-predictor model are: zn, nox, rad, and lstat. The test MSE is 47.185.
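Cross-validation only chooses the subset size; to read off the members of the 4-predictor model, the natural step (a sketch; presumably what the original analysis did) is to refit best subsets on the full data d:

full.fit = regsubsets(log(crim) ~ ., data = d, nvmax = n_pred)
coef(full.fit, id = 4)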

IV. Lasso

library(glmnet)

set.seed(123)
train_X  = model.matrix(log(crim) ~ ., data = train) # note: retains the intercept column
test_X   = model.matrix(log(crim) ~ ., data = test)
y        = train[, 1] # crim is the first column

# find best lambda via CV
cv.lasso    = cv.glmnet(x = train_X, y = log(y), alpha = 1)
lambda_star = cv.lasso$lambda.min

# fit lasso model using best lambda
lasso.fit = glmnet(x = train_X, y = log(y), alpha = 1, lambda = lambda_star)
preds     = predict(lasso.fit, s = lambda_star, newx = test_X)
lasso.mse = mean((exp(preds) - test$crim)^2)

coef(lasso.fit) # 9 predictors (the duplicated intercept row comes from the model.matrix column)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -4.6299037724
## (Intercept)  .           
## zn          -0.0086746603
## indus        0.0004541887
## chas         .           
## nox          4.3272087629
## rm           .           
## age          0.0027278590
## dis         -0.0367943749
## rad          0.1233470729
## tax          0.0005606161
## ptratio      .           
## black       -0.0011619577
## lstat        0.0348237631
## medv         .

Using the lasso (with cross-validation to find the optimal \(\lambda\)), we see the chosen model has 9 predictors: zn, indus, nox, age, dis, rad, tax, black, and lstat. Predictors selected by the previous models also show up in this set. The test MSE is 87.103.
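The active set can also be extracted programmatically instead of being read off the printed matrix (a sketch based on the fit above):

# List predictors with nonzero lasso coefficients, dropping intercept rows.
b = as.matrix(coef(lasso.fit))
setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")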

V. Ridge

set.seed(123)
cv.ridge     = cv.glmnet(x = train_X, y = log(y), alpha = 0)
lambda_ridge = cv.ridge$lambda.min
ridge.fit    = glmnet(x = train_X, y = log(y), alpha = 0, lambda = lambda_ridge)

# predict test data
preds = predict(ridge.fit, s = lambda_ridge, newx = test_X)
predictors = coef(ridge.fit)[2:15, ] # drop the first (glmnet) intercept row
data.frame(magnitude = sort(abs(predictors), decreasing = TRUE))
##               magnitude
## nox         3.582928389
## rm          0.099419796
## rad         0.092859694
## chas        0.059908104
## dis         0.045823001
## lstat       0.037705785
## medv        0.013487678
## zn          0.009456475
## indus       0.004521191
## age         0.004292967
## ptratio     0.002816637
## tax         0.002020438
## black       0.001826191
## (Intercept) 0.000000000
ridge.mse = mean((exp(preds) - test$crim)^2)

Using ridge regression, we get a model with all of the predictors, since ridge does not zero out any coefficients, but we can examine the magnitudes of the coefficients to gauge importance. The predictors sorted by coefficient magnitude are shown above (the zero (Intercept) entry is the redundant intercept column carried along from model.matrix). By this measure the 5 most important predictors are nox, rm, rad, chas, and dis. The test MSE is 90.157.
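Raw coefficient magnitudes depend on each predictor's scale, so this ranking should be read with some care. A rough standardized comparison (a sketch, not part of the original analysis) multiplies each coefficient by its predictor's standard deviation:

# Rank ridge coefficients after scaling each by its predictor's SD.
betas = coef(ridge.fit)[3:15, ]           # skip the two intercept rows
sds   = apply(train_X[, names(betas)], 2, sd)
sort(abs(betas * sds), decreasing = TRUE)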

VI. Principal Components

library(pls)

set.seed(123)
pcr.fit = pcr(log(crim) ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

comps    = 5:10
mse_comp = data.frame(comps = comps, test_mse = 0)
for (n in comps) {
    pcr.n = pcr(log(crim) ~ ., data = train, scale = TRUE, ncomp = n)
    # request predictions using exactly n components (the default returns all 1:n)
    preds = predict(pcr.n, newdata = test, ncomp = n)
    mse_comp[n - 4, 2] = mean((exp(preds) - test$crim)^2)
}

pcr.mse = mse_comp[4, 2] # test MSE of the 8-component model
# 8 principal components
fit.8 = pcr(log(crim) ~ ., data = train, scale = TRUE, ncomp = 8)
fit.8$loadings
## 
## Loadings:
##         Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## zn      -0.262 -0.132 -0.396        -0.316  0.423 -0.346  0.423
## indus    0.341  0.108                       0.118  0.227  0.702
## chas            0.348  0.258  0.841 -0.313                     
## nox      0.335  0.275        -0.145         0.175         0.120
## rm      -0.225  0.439 -0.360         0.243 -0.201 -0.405       
## age      0.305  0.283  0.174 -0.185               -0.617       
## dis     -0.308 -0.371 -0.149  0.149 -0.131        -0.104       
## rad      0.302        -0.448  0.226  0.110  0.218        -0.442
## tax      0.326        -0.403  0.157  0.101  0.278  0.116       
## ptratio  0.205 -0.377         0.354  0.575 -0.277 -0.318  0.195
## black   -0.201         0.434         0.459  0.715              
## lstat    0.319 -0.181  0.133        -0.348  0.128 -0.348 -0.219
## medv    -0.288  0.434 -0.155         0.173         0.123       
## 
##                Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
## Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615

Using principal components regression, we can see from the plot that the model with 8 principal components has a sufficiently low test MSE. Even though adding more components continues to lower the MSE, the improvement does not warrant incorporating them. The test MSE corresponding to this model is 91.655. We cannot perform feature selection, since the principal components are linear combinations of all the predictors, but we can inspect the loadings to see what constitutes each component. In the first component, indus, nox, tax, lstat, dis, age, and rad all have loadings of comparable, relatively large magnitude (roughly 0.30 to 0.34).
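For reference, the candidate test MSEs and the variance captured per component can be inspected directly (explvar() is from the pls package; this is a sketch, not output from the original run):

mse_comp                 # test MSE for each candidate number of components
cumsum(explvar(pcr.fit)) # cumulative % of predictor variance captured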

Model Recommendation

I would recommend the 4-predictor model chosen via best subsets with cross-validation, because it had the lowest estimated test MSE (keeping in mind that this figure is a 10-fold CV estimate over the combined data, while the other MSEs come from a single held-out test set). Ultimately, it depends on the criteria by which we evaluate the model. If we wanted a model that is better at certain things, such as reducing dimensionality and explaining the variation in the data, then perhaps PCR would be ideal; but in terms of prediction accuracy as measured by test MSE, the 4-predictor CV best subset model is the best of the models we tried.
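For a side-by-side view, the test MSE estimates computed above can be gathered into one table (cv4_mse is a CV estimate over the combined data, as noted):

data.frame(model = c("Best subset (Cp)", "Best subset (BIC)",
                     "Best subset (CV, 4 pred.)", "Lasso",
                     "Ridge", "PCR (8 comp.)"),
           test_mse = c(cp.mse, bic.mse, cv4_mse,
                        lasso.mse, ridge.mse, pcr.mse))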