college = read.csv("College.csv")   # load the College data set
rownames(college) = college[, 1]    # use the college names as row names
college = college[, -1]             # drop the name column now stored as row names
fix(college)                        # inspect the data in R's spreadsheet editor
To answer the first question, we first fit a linear model using the predictor variables mentioned in the question and found that the acceptance rate, whether or not the school is private, and the percentage of new students from the top 10 percent of their high school class (\(\texttt{Top10perc}\)) were all significant. Taking this analysis one step further, we look at the correlations between these variables, as seen in the correlation heat map below. Graduation rate is most positively correlated with \(\texttt{Top10perc}\) and \(\texttt{Private}\), and most negatively correlated with the acceptance rate and the student-to-faculty ratio. The correlation heat map also illustrates some interesting relationships among the other variables; for example, there is very little correlation between \(\texttt{Top10perc}\) and \(\texttt{Private}\).
d = college %>% mutate(AcceptRate = Accept / Apps * 100) # 777 x 19
grad_fit = lm(d$Grad.Rate ~ d$S.F.Ratio + d$AcceptRate + d$Private + d$Top10perc)
summary(grad_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.17380483 4.57887625 12.9232156 1.000143e-34
## d$S.F.Ratio -0.06211168 0.15579158 -0.3986844 6.902360e-01
## d$AcceptRate -0.14602323 0.04009231 -3.6421755 2.884020e-04
## d$PrivateYes 10.67481808 1.31274186 8.1316963 1.683949e-15
## d$Top10perc 0.37408616 0.03556972 10.5169847 2.878934e-24
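One way the correlation heat map referenced above could be generated is sketched below (this is not the original plotting code; the helper name heat_vars and the 0/1 recoding of Private are assumptions made for illustration).
# Sketch: correlation heat map of the variables discussed above,
# with Private recoded as a 0/1 indicator.
heat_vars = d[, c("Grad.Rate", "Top10perc", "AcceptRate", "S.F.Ratio")]
heat_vars$Private = as.numeric(d$Private == "Yes")
heatmap(cor(heat_vars), symm = TRUE, margins = c(8, 8))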
To answer the second question, we first generate a scatter plot of the graduation rate versus the cost of attending the school, where the cost is as defined above. Since there appears to be a roughly linear relationship between the two variables, we fit a simple linear regression to see whether this relationship is significant. Indeed, the cost variable is highly significant, although the small \(R^2\) (about 0.054) indicates that cost explains only a small fraction of the variation in graduation rate.
cost_fit = lm(cost$Grad.Rate ~ cost$total_cost)
summary(cost_fit)
##
## Call:
## lm(formula = cost$Grad.Rate ~ cost$total_cost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.526 -11.168 -0.065 12.259 53.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.486e+01 3.140e+00 14.286 < 2e-16 ***
## cost$total_cost 3.297e-03 4.934e-04 6.683 4.48e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.71 on 775 degrees of freedom
## Multiple R-squared: 0.05448, Adjusted R-squared: 0.05326
## F-statistic: 44.66 on 1 and 775 DF, p-value: 4.48e-11
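The scatter plot described above could be produced along these lines (a sketch, not the original plotting code; it assumes the cost data frame contains the Grad.Rate and total_cost columns used in the fit).
# Sketch: graduation rate versus total cost, with the fitted line overlaid.
plot(cost$total_cost, cost$Grad.Rate,
     xlab = "Total cost of attendance", ylab = "Graduation rate (%)")
abline(cost_fit, col = "red")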
Show that the following are equivalent \[ \begin{align} p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\ \frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X} \end{align} \] Using the first equality, we can write the following expression for \(1 - p(X)\) \[ 1 - p(X) = 1 - \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{\beta_0 + \beta_1 X}} \]
\[ \Rightarrow \frac{p(X)}{1 - p(X)} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \cdot \left(1 + e^{\beta_0 + \beta_1 X} \right) = e^{\beta_0 + \beta_1 X} \] which is exactly the second equation given above.
If we wish to predict a test observation’s response using only observations within the 10% of the range of \(X\) closest to the test observation, then on average, we will use \[\frac{(X + 0.05) - (X - 0.05)}{1 - 0} = \frac{1}{10}\] of the available observations to make the prediction.
If we have \(p = 2\) features, \(X_1, X_2\), and use 10% of the range of each of the predictors to generate a prediction, then on average we will use \[ \frac{\left((X_1 + 0.05) - (X_1 - 0.05) \right) \cdot \left( (X_2 + 0.05) - (X_2 - 0.05)\right)}{(1 - 0) \cdot (1 - 0)} = \frac{1}{100} \] of the total observations.
If we have \(p = 100\) features and we use the 10% of each feature's range closest to the test observation to generate a prediction, then on average we will use
\[ \frac{\prod_{i = 1}^{100} \left((X_i + 0.05) - (X_i - 0.05) \right)}{1} = \prod_{i = 1}^{100} 0.1 = 0.1^{100} = \frac{1}{10^{100}} \] of the total observations.
From parts (a)-(c), we see that as \(p\) increases, the fraction of the observations that are “close” to any given test observation decreases exponentially in \(p\). This means that for KNN with many predictors, very few training observations lie near a given test observation, so the "nearest" neighbors used for prediction are in fact far away and the predictions become unreliable.
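As a quick numerical illustration of this exponential decay (a small sketch, not part of the original analysis):
# Fraction of observations in a neighborhood spanning 10% of the range
# of each of p uniformly distributed predictors.
p = c(1, 2, 10, 100)
data.frame(p = p, fraction = 0.1^p)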
We can calculate the length of each side of the hypercube by requiring that \(\mathrm{length}^p = 0.1\). Thus:
\[ p = 1 \quad \Rightarrow \quad \mathrm{length} = 0.1 \]
\[ p = 2 \quad \Rightarrow \quad \mathrm{length} = 0.1^{\frac{1}{2}} \approx 0.316 \]
\[ p = 100 \quad \Rightarrow \quad \mathrm{length} = 0.1^{\frac{1}{100}} \approx 0.977 \]
Note that for \(p = 100\) the side length is nearly 1, so the hypercube containing 10% of the training observations spans almost the entire range of every feature and is hardly “local” at all.
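These side lengths can be verified directly (a quick check, not part of the original write-up):
# Side length of a hypercube containing 10% of the observations, for each p.
0.1 ^ (1 / c(1, 2, 100))   # 0.1000000 0.3162278 0.9772372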
Suppose \(X_1 =\) hours studied, \(X_2 =\) undergrad GPA, and \(Y=\) received an A. We fit a logistic regression and produce the estimated coefficients \[\hat{\beta_0} = -6, \quad \hat{\beta_1} = 0.05, \quad \hat{\beta_2} = 1\]
Since we are given the estimated coefficients, we know that the probability can be written \[ \hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2}}{1 + e^{\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2}} = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}}, \quad X = \left( X_1, X_2 \right) \]
Since we are given \(X = \left( 40, 3.5 \right)\), we can compute the probability that the student gets an A in the class as follows \[ \hat{p}(X) = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}} = \frac{e^{-6 + 0.05 \cdot 40 + 3.5}}{1 + e^{-6 + 0.05 \cdot 40 + 3.5}} = 0.378 \]
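As a quick numerical check of this value (a sketch, not part of the original code):
# Probability of an A for 40 hours studied and a 3.5 GPA.
eta = -6 + 0.05 * 40 + 1 * 3.5   # linear predictor
exp(eta) / (1 + exp(eta))        # 0.3775407; equivalently plogis(eta)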
Recall that the logistic model can be expressed in terms of the odds as \[ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}, \quad X = \left( X_1, X_2 \right) \] We are given \(p(X) = 0.5\), so we can plug this value in, take the \(\log\) of both sides, and solve for \(X_1\)
\[ \begin{align} 1 &= e^{-6 + 0.05 X_1 + 3.5} \\ 0 &= -6 + 0.05 X_1 + 3.5 \\ X_1 &= 50 \end{align} \] Thus, the student must study 50 hours in order to have a 50% chance of getting an A in the class.
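Again as a quick check (a sketch, not part of the original code):
# Hours of study required for a 50% chance of an A.
(6 - 3.5) / 0.05                 # 50
plogis(-6 + 0.05 * 50 + 3.5)     # 0.5, confirming the answer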
summary(Weekly)   # Weekly data set from the ISLR package
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
From the numerical summary above, we see that each of the Lag variables has approximately the same minimum and maximum values, with the means differing slightly for Lag3, Lag4, and Lag5. It may therefore be worth looking at how the Lag values vary over each year. The numerical summary of Volume is also given, and we can break the volume down by year as well. From the plots above, we see that volume steadily increases until about 2009, where it drops back down. Looking at the figure with the Lag variables plotted by year, we see that within each year the Lag variables occupy roughly the same range of values, while there is somewhat more variation across years.
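The year-by-year view of Volume described above could be produced with something like the following (assumed plotting code, not the original):
# Sketch: distribution of trading volume by year.
boxplot(Volume ~ Year, data = Weekly, xlab = "Year", ylab = "Volume")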
glm_dir = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm_dir) # Lag 2 is statistically significant
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
After fitting a logistic regression with Direction as the response and the five lag variables plus Volume as predictors, we find that Lag2 is the only statistically significant predictor, with a p-value of about 0.03.
We compute the confusion matrix, as shown below. This tabulates the true classes against the classes predicted by the fitted model. From the matrix, we see that the model correctly classified 54 of the Down weeks and 557 of the Up weeks, but misclassified the remaining 478 weeks (430 Down weeks predicted as Up and 48 Up weeks predicted as Down). This gives an overall prediction accuracy of 0.5611 and shows that the model predicts Up for the large majority of weeks.
conf_mat
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
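The confusion matrix above could be obtained with code along these lines (a sketch; glm.probs is an assumed intermediate name, while glm.pred and conf_mat match the names used above):
# Sketch: predicted classes from the fitted logistic regression and the
# resulting confusion matrix and overall accuracy.
glm.probs = predict(glm_dir, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
conf_mat = table(glm.pred, Weekly$Direction)
mean(glm.pred == Weekly$Direction)   # about 0.561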
We then fit a logistic regression model using Lag2 as the only predictor, training on the period from 1990 through 2008 and evaluating the model on the held-out data from 2009 and 2010. The confusion matrix is below, and we see that the test prediction accuracy is \((9 + 56)/104 = 0.625\).
table(test.pred, test$Direction)
##
## test.pred Down Up
## Down 9 5
## Up 34 56
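A sketch of how this split and fit could be carried out (the object names train, glm_lag2, and test.probs are assumptions; test and test.pred match the names used above):
# Sketch: train on 1990-2008, predict on 2009-2010.
train = Weekly[Weekly$Year <= 2008, ]
test  = Weekly[Weekly$Year >= 2009, ]
glm_lag2 = glm(Direction ~ Lag2, data = train, family = binomial)
test.probs = predict(glm_lag2, newdata = test, type = "response")
test.pred = ifelse(test.probs > 0.5, "Up", "Down")
mean(test.pred == test$Direction)   # 0.625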
We then repeat (d) using LDA. The confusion matrix is given below; it is identical to the one from the logistic regression model, so the overall prediction accuracy is again 0.625.
table(lda.class, test$Direction)
##
## lda.class Down Up
## Down 9 5
## Up 34 56
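A sketch of the LDA fit, using MASS::lda on the same train/test split (lda_fit is an assumed name; lda.class matches the output above):
# Sketch: LDA with Lag2 as the only predictor.
library(MASS)
lda_fit = lda(Direction ~ Lag2, data = train)
lda.class = predict(lda_fit, newdata = test)$class
mean(lda.class == test$Direction)   # 0.625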
We repeat (d) using KNN with \(K = 1\), and we see that the overall prediction accuracy drops to 0.5.
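A sketch of the KNN fit, using class::knn on the same split (knn.pred is an assumed name, and the exact accuracy can depend on how distance ties are broken):
# Sketch: KNN with K = 1 on the same train/test split.
library(class)
set.seed(1)   # knn() breaks distance ties at random
knn.pred = knn(train = as.matrix(train$Lag2),
               test  = as.matrix(test$Lag2),
               cl    = train$Direction, k = 1)
mean(knn.pred == test$Direction)   # about 0.5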
After fitting the logistic regression model, LDA, and KNN, I think that the logistic regression model would provide the best results for the data. Of the three models, KNN gives the worst test results, at just 50% prediction accuracy. While logistic regression and LDA give the same prediction accuracy, I think logistic regression is the better model because it makes fewer assumptions about the data. LDA assumes that the observations are drawn from Gaussian distributions with a common covariance matrix, which may not necessarily hold here; logistic regression can therefore give better results when this assumption is violated.