We address the second inference question: what is the variable most strongly associated with the \(\texttt{price}\) of the house? After fitting a linear model with price as the response variable and \(\texttt{beds, baths, sqft}\) as predictor variables, we find that \(\texttt{sqft}\) is significant at the 0.01 level, and we claim that the square footage has the most influence on the price of the house.
la = read.csv('LArealestate.csv', stringsAsFactors = FALSE);
fit = lm(la$price ~ la$beds + la$baths + la$sqft)
summary(fit)
##
## Call:
## lm(formula = la$price ~ la$beds + la$baths + la$sqft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7540896 -1102669 -304272 318099 21398538
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1069120.6 519961.8 -2.056 0.0408 *
## la$beds 95354.1 229703.1 0.415 0.6784
## la$baths -180255.2 216356.3 -0.833 0.4056
## la$sqft 1503.0 138.9 10.820 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3006000 on 249 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7233, Adjusted R-squared: 0.72
## F-statistic: 217 on 3 and 249 DF, p-value: < 2.2e-16
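Statistical significance alone does not establish which predictor matters most, since the raw coefficients depend on each variable's units. One hedged way to compare predictors on a common scale is to refit after standardizing all variables. The sketch below uses simulated data whose effect sizes merely mimic the fit above; it is not the actual LArealestate.csv:

```r
# Simulated stand-in for the real-estate data (coefficient values are
# hypothetical, chosen to resemble the fit reported above).
set.seed(1)
n     = 250
beds  = rpois(n, 3)
baths = rpois(n, 2)
sqft  = rnorm(n, 1800, 400)
price = -1e6 + 9.5e4 * beds - 1.8e5 * baths + 1500 * sqft + rnorm(n, 0, 3e5)
d     = data.frame(price, beds, baths, sqft)

# Refit on standardized columns: each coefficient is now the expected
# change in price (in SDs) per one-SD change in that predictor.
fit_std = lm(price ~ beds + baths + sqft, data = as.data.frame(scale(d)))
round(coef(fit_std), 3)  # sqft has the largest absolute standardized coefficient
```

On the real data, the analogous comparison would be `lm(scale(price) ~ scale(beds) + scale(baths) + scale(sqft), data = la)`.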
d = read.csv("hw1.csv", stringsAsFactors = FALSE);
m1 = lm(d$y ~ d$x);
m2 = lm(d$y ~ d$x + I(d$x^2));
m3 = lm(d$y ~ d$x + I(d$x^2) + I(d$x^3));
m4 = lm(d$y ~ d$x + I(d$x^2) + I(d$x^3) + I(d$x^4));
m5 = lm(d$y ~ d$x + I(d$x^2) + I(d$x^3) + I(d$x^4) + I(d$x^5));
mean(m1$residuals^2) # MSE = 9688.946
mean(m2$residuals^2) # MSE = 7597.056
mean(m3$residuals^2) # MSE = 6829.43
mean(m4$residuals^2) # MSE = 2034.409
mean(m5$residuals^2) # MSE = 2032.052
Based on the \(MSE_{\mathrm{training}}\) for each of the models, I would choose the 5th-order fit, which has the smallest training \(MSE\) of the 5 fitted models: 2032.05.
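Note that selecting by training \(MSE\) will always favor the most complex fit: for nested models such as these polynomials, adding a term can never increase the training error. A small sketch on simulated data (standing in for hw1.csv, with a linear truth as in this exercise) illustrates the monotone decrease:

```r
set.seed(123)
x = runif(60, 0, 4)
y = 500 + 200 * x + rnorm(60, 0, 100)   # linear true model, as in this exercise

# Training MSE for polynomial fits of degree 1 through 5
train_mse = sapply(1:5, function(k) mean(resid(lm(y ~ poly(x, k)))^2))
round(train_mse, 1)
diff(train_mse) <= 0   # all TRUE: training MSE never increases with degree
```

This is why training error alone cannot be used for model selection.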
set.seed(456);
x = seq(0, 4, by = 0.5);
f = function(x) {
500 + 200 * x + rnorm(length(x), 0, 100);
}
y_actual = f(x) # test responses simulated from the true model (noise included)
# predicted values from each model
y_m1 = predict(m1, data.frame(x));
y_m2 = predict(m2, data.frame(x));
y_m3 = predict(m3, data.frame(x));
y_m4 = predict(m4, data.frame(x));
y_m5 = predict(m5, data.frame(x));
# function to calculate the MSE
mse = function(y, y_fit) {
mean((y - y_fit)^2);
}
mse(y_actual, y_m1); # 10991.10 --- lowest testing MSE
mse(y_actual, y_m2); # 14714.35
mse(y_actual, y_m3); # 17088.13
mse(y_actual, y_m4); # 14897.54
mse(y_actual, y_m5); # 15006.96
Recall that the test data were generated from the true model \[y = 500 + 200 \cdot x + \varepsilon, \qquad \varepsilon = \texttt{rnorm(length(x), 0, 100)}.\]
As expected, the testing error for each of the models is greater than its respective training error, and the smallest increase was in the MSE of the linear model. Since the true model gives a linear relationship between \(y\) and \(x\), it makes sense that the linear model we fit attains the lowest test \(MSE\). Even though the training error for the 5th-order fit was very low, its high testing error indicates that the fitted model was overfitting the data.
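One R subtlety worth flagging in this workflow: when a model is fit with a formula that references columns as \(\texttt{d\$y}\) and \(\texttt{d\$x}\), \(\texttt{predict}\) cannot match a \(\texttt{newdata}\) column by name, and it silently falls back to the training data. Fitting with a \(\texttt{data}\) argument avoids this. A minimal sketch, with simulated data standing in for hw1.csv:

```r
set.seed(456)
train = data.frame(x = runif(40, 0, 4))
train$y = 500 + 200 * train$x + rnorm(40, 0, 100)

# Formula uses bare column names plus a `data` argument ...
m = lm(y ~ x, data = train)

# ... so `newdata` is matched by column name at prediction time.
test_grid = data.frame(x = seq(0, 4, by = 0.5))
y_hat = predict(m, newdata = test_grid)
length(y_hat)   # one prediction per test point
```

With this pattern, the test-set predictions are guaranteed to be evaluated at the new \(x\) values rather than the training ones.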
This is a regression problem, and since we are investigating the factors that affect the CEO salary, we are interested in inference rather than prediction. There are 500 firms from which we are collecting data, so \(n = 500\). Within each company, we consider the predictor variables profit, number of employees, and industry, so \(p = 3\).
This is a binary classification problem, where the two outcomes/classes are success or failure. Since we want to try to model the success or failure of the product, we are interested in prediction. We consider data from 20 other products, so \(n = 20\). The variables used in the prediction are: price charged for product, marketing budget, competition price, and ten other variables, for a total of \(p = 13\) predictor variables.
Since we are interested in the weekly percent change in the US dollar in relation to changes in world stock markets, this is a regression problem, and we are interested in prediction. We are collecting data from each week of the year, so \(n = 52\) weeks. From each week, we consider the variables: the \(\%\) change in the US market, the \(\%\) change in the British market, and the \(\%\) change in the German market, giving a total of \(p = 3\) predictor variables.
By the Gauss-Markov theorem, the least-squares estimator is unbiased (indeed, the best linear unbiased estimator) under the following conditions:
The set of errors \(\{\varepsilon_i\}\) has mean zero, \(\mathbb{E}[\varepsilon_i] = 0\);
The errors have the same finite variance, \(\mathrm{Var}(\varepsilon_i) = \sigma^2 < \infty\);
Distinct errors are uncorrelated, \(\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0\ \forall i \neq j\).
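These conditions can be checked empirically with a small simulation: drawing errors that satisfy all three (independent, mean-zero, constant variance) and refitting many times, the average least-squares slope should land on the true value. A hedged sketch, with all numbers invented for illustration:

```r
set.seed(1)
x = seq(1, 10, length.out = 50)

# Refit 2000 times with fresh errors that meet all three conditions
slope_hats = replicate(2000, {
  y = 2 + 3 * x + rnorm(50, mean = 0, sd = 5)  # i.i.d. mean-zero, equal variance
  coef(lm(y ~ x))[2]
})
mean(slope_hats)   # close to the true slope, 3
```

The empirical mean of the slope estimates sits essentially on the true slope, consistent with unbiasedness.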
Consider a situation in which we are trying to fit a model to predict the price of a particularly volatile stock given the month of the year. The prices exhibit large, month-dependent swings, so the error variance cannot be assumed constant across observations, and the equal-variance condition fails; hence the Gauss-Markov conditions are not satisfied.
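Month-to-month swings in volatility violate the equal-variance condition specifically: if the error spread depends on the month, \(\mathrm{Var}(\varepsilon_i)\) is not a single \(\sigma^2\). A hypothetical illustration (the month-dependent standard deviations below are invented):

```r
set.seed(2)
month = rep(1:12, each = 20)
# Error SD grows with the month: a deliberate violation of equal variance
eps = rnorm(length(month), mean = 0, sd = 10 * month)

sds = tapply(eps, month, sd)  # per-month empirical error SD
range(sds)                    # far from equal across months
```

Plotting residuals against month in a real analysis would reveal the same fan-shaped spread.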