UCLA Statistics HW1

Question 1.

a. 4 questions that could be answered with the data (2 questions about estimating parameters/inference, 2 questions about prediction)

1. Parameter/Inference Question 1: What is the effect of location (\(\texttt{city}\)) on the \(\texttt{price}\) of a house?

2. Parameter/Inference Question 2: Which variable is most strongly associated with the \(\texttt{price}\) of the house?

3. Prediction Question 1: Using the variables \(\texttt{beds, baths, sqft}\), can we predict the \(\texttt{price}\) of the house?

4. Prediction Question 2: Using the variables \(\texttt{beds, baths, sqft, price}\), can we predict the \(\texttt{city}\) of the house?

b. Answer one of the questions from part a.

We address the second inference question: which variable is most strongly associated with the \(\texttt{price}\) of the house? After fitting a linear model with \(\texttt{price}\) as the response and \(\texttt{beds, baths, sqft}\) as predictors, we find that \(\texttt{sqft}\) is the only predictor significant at the 0.01 level (in fact \(p < 2 \times 10^{-16}\)), so we conclude that square footage is the variable most strongly associated with the price of the house.

# load the LA real estate data and regress price on beds, baths, and sqft
la = read.csv('LArealestate.csv', stringsAsFactors = FALSE);
fit = lm(la$price ~ la$beds + la$baths + la$sqft)
summary(fit)
## 
## Call:
## lm(formula = la$price ~ la$beds + la$baths + la$sqft)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7540896 -1102669  -304272   318099 21398538 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1069120.6   519961.8  -2.056   0.0408 *  
## la$beds        95354.1   229703.1   0.415   0.6784    
## la$baths     -180255.2   216356.3  -0.833   0.4056    
## la$sqft         1503.0      138.9  10.820   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3006000 on 249 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7233, Adjusted R-squared:   0.72 
## F-statistic:   217 on 3 and 249 DF,  p-value: < 2.2e-16
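
As a quick check of this claim (a sketch, not part of the original solution), one can compare standardized coefficients, which put \(\texttt{beds, baths, sqft}\) on a common scale; it reuses the \(\texttt{la}\) data frame loaded above and drops the two rows with missing values noted in the summary.

# compare standardized coefficients; sqft should have the largest magnitude
la_cc = na.omit(la[, c("price", "beds", "baths", "sqft")]);
fit_std = lm(scale(price) ~ scale(beds) + scale(baths) + scale(sqft), data = la_cc);
summary(fit_std)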

Question 2.

a. Fit 5 models. Model 5 should be a fifth-order polynomial, model 1 a first-order model. List the \(MSE_{\mathrm{training}}\) for each.

d = read.csv("hw1.csv", stringsAsFactors = FALSE);

# fit polynomial models of degree 1 through 5
m1 = lm(y ~ x, data = d);
m2 = lm(y ~ x + I(x^2), data = d);
m3 = lm(y ~ x + I(x^2) + I(x^3), data = d);
m4 = lm(y ~ x + I(x^2) + I(x^3) + I(x^4), data = d);
m5 = lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5), data = d);

mean(m1$residuals^2) # MSE = 9688.946
mean(m2$residuals^2) # MSE = 7597.056
mean(m3$residuals^2) # MSE = 6829.43
mean(m4$residuals^2) # MSE = 2034.409
mean(m5$residuals^2) # MSE = 2032.052

b. Based on \(MSE_{\mathrm{training}}\), which would you choose?

Based on the \(MSE_{\mathrm{training}}\) for each of the models, I would choose the 5th-order fit, which has the smallest training \(MSE\) of the 5 fitted models: 2032.05.
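
For reference, the five fits and the training-MSE comparison can also be written compactly (a sketch that assumes the same data frame \(\texttt{d}\); \(\texttt{poly(..., raw = TRUE)}\) reproduces the \(\texttt{I()}\) power terms used above).

# compact refit of the five polynomial models and selection by training MSE
models = lapply(1:5, function(k) lm(y ~ poly(x, k, raw = TRUE), data = d));
train_mse = sapply(models, function(m) mean(residuals(m)^2));
round(train_mse, 2)    # should reproduce the five values reported above
which.min(train_mse)   # 5: the fifth-order fit has the smallest training MSE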

c. Generate a new data set, the testing data set, using this R code. Use the generated y-values to compute the \(MSE_{\mathrm{testing}}\) for each of the 5 models. Then write your own function to compute the \(MSE\).

set.seed(456);
x = seq(0, 4, by = 0.5);

f = function(x) {
  500 + 200 * x + rnorm(length(x), 0, 100);
}

y_actual = f(x)  # test responses generated from the true model (signal plus noise)

# predicted values from each model
y_m1 = predict(m1, newdata = data.frame(x = x));
y_m2 = predict(m2, newdata = data.frame(x = x));
y_m3 = predict(m3, newdata = data.frame(x = x));
y_m4 = predict(m4, newdata = data.frame(x = x));
y_m5 = predict(m5, newdata = data.frame(x = x));

# function to calculate the MSE
mse = function(y, y_fit) {
  mean((y - y_fit)^2);
}

mse(y_actual, y_m1); # 10991.10 --- lowest testing MSE
mse(y_actual, y_m2); # 14714.35
mse(y_actual, y_m3); # 17088.13
mse(y_actual, y_m4); # 14897.54
mse(y_actual, y_m5); # 15006.96

d. Describe how the \(MSE_{\mathrm{testing}}\) and \(MSE_{\mathrm{training}}\) compare.

Now that you know the true model,

\[y = 500 + 200 \cdot x + \texttt{rnorm(length(x), 0, 100)},\]

do the \(MSE\)s make sense?

As expected, the testing error for each model is greater than its training error, and the smallest increase occurs for the linear model. Since we are given the true model, the resulting \(MSE_{\mathrm{testing}}\) values make sense: the true model specifies a linear relationship between \(y\) and \(x\), so it is natural that the linear fit gives the lowest test \(MSE\). Even though the training error for the 5th-order fit was very low, its high testing error indicates that the fitted model was overfitting the data.
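
To see the comparison at a glance, the two sets of errors can be tabulated side by side (a sketch reusing the objects defined above; the printed values should match those reported in parts a and c).

# side-by-side training and testing MSE for the five polynomial fits
train_mse = sapply(list(m1, m2, m3, m4, m5), function(m) mean(residuals(m)^2));
test_mse  = sapply(list(y_m1, y_m2, y_m3, y_m4, y_m5), function(p) mse(y_actual, p));
data.frame(degree = 1:5, train_mse = round(train_mse, 2), test_mse = round(test_mse, 2))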

Question 3. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).

a. Collect data on the top 500 firms in the US: record profit, number of employees, industry, and the CEO salary. We want to understand which factors affect the CEO salary.

This is a regression problem, and since we are investigating the factors that affect the CEO salary, we are interested in inference rather than prediction. There are 500 firms from which we are collecting data, so \(n = 500\). Within each company, we look at the predictor variables profit, number of employees, and industry, so \(p = 3\).

b. Launching a product – success or failure. We collect data on 20 similar products that were previously launched. We record: success or failure, price charged for product, marketing budget, competition price, and ten other variables.

This is a binary classification problem, where the two outcomes/classes are success or failure. Since we want to try to model the success or failure of the product, we are interested in prediction. We consider data from 20 other products, so \(n = 20\). The variables used in the prediction are: price charged for product, marketing budget, competition price, and ten other variables, for a total of \(p = 13\) predictor variables.

c. Predicting the \(\%\) change in the US dollar in relation to weekly changes in the world stock markets. We collect weekly data for all of 2012. For each week, we record the \(\%\) change in the dollar, the \(\%\) change in the US market, the \(\%\) change in the British market, and the \(\%\) change in the German market.

Since we are interested in the percent change in the US dollar in relation to weekly changes, this is a regression problem, and we are interested in prediction. We are collecting data from each week of the year so \(n = 52\) weeks. From each week, we consider the variables: the \(\%\) change in the US market, the \(\%\) change in the British market, and the \(\%\) change in the German market, giving a total of \(p = 3\) predictor variables.

Question 4. The least squares regression estimates are examples of Best Linear Unbiased Estimators (BLUE).

a. Review the Gauss-Markov theorem and explain under which conditions this desirable quality of unbiasedness is achieved.

Assuming the model is linear in the parameters, the desirable quality of unbiasedness is achieved under the following conditions on the errors (a short simulation illustrating this follows the list):

  1. The set of errors \(\{\varepsilon_i\}\) have 0 mean, \(\mathbb{E}[\varepsilon_i] = 0\)

  2. The errors have the same finite variance, \(\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty\)

  3. Distinct errors are uncorrelated, \(\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \forall i \neq j\)
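
As a quick illustration (a simulation sketch, not part of the assignment), we can repeatedly generate data from the true model of Question 2, whose errors satisfy these conditions, refit by least squares, and verify that the slope estimates average to the true slope.

# simulate many data sets with zero-mean, constant-variance, uncorrelated errors
set.seed(123);
x_sim = seq(0, 4, by = 0.5);
true_slope = 200;
slopes = replicate(5000, {
  y_sim = 500 + true_slope * x_sim + rnorm(length(x_sim), 0, 100);
  coef(lm(y_sim ~ x_sim))[2];
});
mean(slopes)   # close to the true slope of 200, illustrating unbiasedness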

b. Give an example of a situation in which the GM theorem is not satisfied.

Consider fitting a linear model to predict the price of a particularly volatile stock from the month of the year. The size of the price swings differs across the year, so the error variance is not constant (violating condition 2), and prices in nearby months tend to move together, so the errors are correlated (violating condition 3); in addition, the price-versus-month relationship is unlikely to be linear. Hence the Gauss-Markov conditions are not satisfied.
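
As an illustration (a hypothetical sketch, not part of the assignment), we can simulate monthly prices whose noise level grows through the year; the widening vertical spread reflects the non-constant error variance described above.

# hypothetical example: error standard deviation grows with the month,
# violating the constant-variance (homoscedasticity) condition
set.seed(1);
month = rep(1:12, each = 20);                 # 20 simulated observations per month
eps = rnorm(length(month), 0, 5 * month);     # error sd increases with the month
price = 100 + 2 * month + eps;
plot(month, price)                            # spread widens as the year progresses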