HW9

Problem 1

(a) To which of these cultures does your final project belong, and why? What are examples of models that could be applied to these data and fit into this culture?

The final project falls within the culture of algorithmic modeling. As Breiman describes, this culture focuses on finding what gives good predictive accuracy and adopts the view that “nature produces data in a black box whose insides are complex, mysterious, and unknowable.” In the project we are given a set of predictors and asked to predict the time variable; along the way we explore which predictors are more influential than others and compare different methods to improve prediction accuracy. Models that could be applied to these data and that fit this culture include boosting, neural networks, and random forests, all of which emphasize predictive accuracy rather than an interpretable model of the underlying mechanism.
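As a rough sketch only (the final-project data set is not shown here, so project_data and its numeric time response below are placeholders), a random forest fit in this culture would be judged by held-out prediction error and relative variable importance:

# Sketch only: 'project_data' with a numeric 'time' response is a placeholder,
# not part of the assignment files.
library(randomForest)

set.seed(1)
idx = sample(1:nrow(project_data), nrow(project_data) / 2)
rf.fit = randomForest(time ~ ., data = project_data[idx, ], importance = TRUE)

# Judge the model purely by held-out prediction error
rf.pred = predict(rf.fit, newdata = project_data[-idx, ])
mean((rf.pred - project_data$time[-idx])^2)

# Which predictors the forest relies on most
importance(rf.fit)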

(b) What sort of questions might we ask of this data if, instead, the final project belonged to the other culture? What are examples of models we might apply in that case?

If the final project belonged to the data modeling culture, we would be more interested in the underlying structure of the data. We would ask what stochastic mechanism generated the data, fit parametric models, and draw conclusions (for example, about coefficients and their significance) using the theoretical results that accompany those models. A canonical example of this type of model is linear regression.
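By contrast, a minimal data-modeling sketch on the same hypothetical project_data would fit a parametric linear model and draw inferential conclusions from the fitted coefficients:

# Sketch only: same hypothetical 'project_data' as above.
lm.fit = lm(time ~ ., data = project_data)

# Inference about the assumed mechanism: coefficients, standard errors, p-values
summary(lm.fit)
confint(lm.fit)

# Residual diagnostics to check the parametric assumptions
par(mfrow = c(2, 2))
plot(lm.fit)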

Problem 2

Perform gradient boosting to predict median value of owner-occupied homes for the Boston data set. Report the mean squared error on your testing data set.

library(MASS)     # Boston housing data (medv is column 14)
library(xgboost)  # gradient boosting

data(Boston)
set.seed(100)
train = sample(1:nrow(Boston), nrow(Boston) / 2)  # use half of Boston for training
b_train = Boston[train, ]
b_test  = Boston[-train, ]

# Fit a gradient-boosted model of medv on the other 13 predictors
xg.boston = xgboost(data = data.matrix(b_train[, -14]),
                    label = b_train[, 14],
                    nrounds = 5000, eta = 0.1, verbose = 0)

# Predictions on the test data from the fitted boosting model
preds.xgb = predict(xg.boston, newdata = data.matrix(b_test[, 1:13]))

# Test MSE based on the predictions from the boosting model
mse.xgb = mean((preds.xgb - b_test[, 14])^2)

Using gradient boosting to predict the median value of owner-occupied homes in the Boston data set, the resulting test MSE is 9.768.
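As an optional check beyond what the problem asks, the number of boosting rounds could be cross-validated with xgb.cv and the fitted model's variable importance inspected; a sketch using the objects defined above:

# Optional check: cross-validate the number of boosting rounds and inspect
# variable importance for the fitted model (uses b_train and xg.boston above).
dtrain = xgb.DMatrix(data = data.matrix(b_train[, -14]), label = b_train[, 14])

set.seed(100)
cv = xgb.cv(data = dtrain, nrounds = 5000, nfold = 5, eta = 0.1,
            early_stopping_rounds = 50, verbose = 0)
cv$best_iteration  # number of rounds suggested by cross-validated RMSE

imp = xgb.importance(model = xg.boston)  # gain-based predictor importance
xgb.plot.importance(imp)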

Problem 3

(a) Split the data set into a training set and a test set

library(ISLR)  # Carseats data (400 x 11)
data(Carseats)

# Predict Sales using regression trees
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
train.car = Carseats[train, ]
test.car  = Carseats[-train, ]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the result. What test MSE do you obtain?

library(tree)  # fitting regression trees

tree.car = tree(Sales ~ ., data = train.car)
# summary(tree.car)  # variables actually used and number of terminal nodes
plot(tree.car)
text(tree.car, pretty = 0)

yhat = predict(tree.car, newdata = test.car)  # predictions on the test data
tree.mse = mean((yhat - test.car$Sales)^2)    # 4.1489

Examining the regression tree above, we see that ShelveLoc, Price, Age, Advertising, Income, and CompPrice are the variables used in constructing the tree. The first split is made on ShelveLoc, and the next split is made on Price. The test MSE for the regression tree is 4.149.
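Although the problem does not ask for it, one way to probe this fit further is to cross-validate the tree size and see whether pruning changes the test MSE; a sketch using cv.tree and prune.tree:

# Optional extension: cross-validate the tree size and prune if it helps.
set.seed(1)
cv.car = cv.tree(tree.car)                        # CV deviance by subtree size
best.size = cv.car$size[which.min(cv.car$dev)]

pruned.car = prune.tree(tree.car, best = best.size)
plot(pruned.car)
text(pruned.car, pretty = 0)

yhat.pruned = predict(pruned.car, newdata = test.car)
mean((yhat.pruned - test.car$Sales)^2)            # compare with the unpruned MSE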