college = read.csv("College.csv")   # load the College data set
rownames(college) = college[, 1]    # use the college names as row names
college = college[, -1]             # drop the name column now stored as row names
fix(college)                        # inspect the data in R's spreadsheet editor
To answer the first question, we first fit a linear model using the predictor variables mentioned in the question and found that the acceptance rate, whether or not the school is private, and the percentage of new students from the top 10 percent of their high school class (\(\texttt{Top10perc}\)) were all significant. Taking this analysis one step further, we look at the correlations between these variables, as seen in the correlation heat map below. Graduation rate is most positively correlated with \(\texttt{Top10perc}\) and \(\texttt{Private}\), and most negatively correlated with the acceptance rate and the student-to-faculty ratio. The correlation heat map also illustrates some interesting relationships among the other variables; for example, there is very little correlation between \(\texttt{Top10perc}\) and \(\texttt{Private}\).
d = college %>% mutate(AcceptRate = Accept / Apps * 100) # 777 x 19
grad_fit = lm(d$Grad.Rate ~ d$S.F.Ratio + d$AcceptRate + d$Private + d$Top10perc)
summary(grad_fit)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.17380483 4.57887625 12.9232156 1.000143e-34
## d$S.F.Ratio -0.06211168 0.15579158 -0.3986844 6.902360e-01
## d$AcceptRate -0.14602323 0.04009231 -3.6421755 2.884020e-04
## d$PrivateYes 10.67481808 1.31274186 8.1316963 1.683949e-15
## d$Top10perc 0.37408616 0.03556972 10.5169847 2.878934e-24
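One way the correlation heat map referenced above could be generated is sketched below (this is not the original plotting code; the helper name heat_vars and the 0/1 recoding of Private are assumptions made for illustration).
# Sketch: correlation heat map of the variables discussed above,
# with Private recoded as a 0/1 indicator.
heat_vars = d[, c("Grad.Rate", "Top10perc", "AcceptRate", "S.F.Ratio")]
heat_vars$Private = as.numeric(d$Private == "Yes")
heatmap(cor(heat_vars), symm = TRUE, margins = c(8, 8))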
To answer the second question, we first generate a scatter plot of the graduation rate versus the cost of attending the school, where the cost is as defined above. Since there appears to be a roughly linear relationship between the two variables, we fit a simple linear regression to see whether this relationship is significant. Indeed, the cost variable is highly significant, although the small \(R^2\) (about 0.054) indicates that cost explains only a small fraction of the variation in graduation rate.
cost_fit = lm(cost$Grad.Rate ~ cost$total_cost)
summary(cost_fit)
##
## Call:
## lm(formula = cost$Grad.Rate ~ cost$total_cost)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.526 -11.168 -0.065 12.259 53.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.486e+01 3.140e+00 14.286 < 2e-16 ***
## cost$total_cost 3.297e-03 4.934e-04 6.683 4.48e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.71 on 775 degrees of freedom
## Multiple R-squared: 0.05448, Adjusted R-squared: 0.05326
## F-statistic: 44.66 on 1 and 775 DF, p-value: 4.48e-11
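The scatter plot described above could be produced along these lines (a sketch, not the original plotting code; it assumes the cost data frame contains the Grad.Rate and total_cost columns used in the fit).
# Sketch: graduation rate versus total cost, with the fitted line overlaid.
plot(cost$total_cost, cost$Grad.Rate,
     xlab = "Total cost of attendance", ylab = "Graduation rate (%)")
abline(cost_fit, col = "red")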
Show that the following are equivalent \[ \begin{align} p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\ \frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X} \end{align} \] Using the first equality, we can write the following expression for \(1 - p(X)\) \[ 1 - p(X) = 1 - \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} = \frac{1}{1 + e^{\beta_0 + \beta_1 X}} \]
\[ \Rightarrow \frac{p(X)}{1 - p(X)} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \cdot \left(1 + e^{\beta_0 + \beta_1 X} \right) = e^{\beta_0 + \beta_1 X} \] which is exactly the second equation given above.
If we wish to predict a test observation’s response using only observations within the 10% of the range of \(X\) closest to the test observation, then on average, we will use \[\frac{(X + 0.05) - (X - 0.05)}{1 - 0} = \frac{1}{10}\] of the available observations to make the prediction.
If we have \(p = 2\) features, \(X_1, X_2\), and use 10% of the range of each of the predictors to generate a prediction, then on average we will use \[ \frac{\left((X_1 + 0.05) - (X_1 - 0.05) \right) \cdot \left( (X_2 + 0.05) - (X_2 - 0.05)\right)}{(1 - 0) \cdot (1 - 0)} = \frac{1}{100} \] of the total observations.
If we have \(p = 100\) features and we use the 10% of each feature's range closest to the test observation to generate a prediction, then on average we will use
\[ \frac{\prod_{i = 1}^{100} \left((X_i + 0.05) - (X_i - 0.05) \right)}{1} = \prod_{i = 1}^{100} 0.1 = 0.1^{100} = \frac{1}{10^{100}} \] of the total observations.
From parts (a)-(c), we see that as \(p\) increases, the fraction of the observations that are “close” to any given test observation decreases exponentially in \(p\). This means that for KNN with many predictors, very few training observations lie near a given test observation, so the "nearest" neighbors used for prediction are in fact far away and the predictions become unreliable.
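As a quick numerical illustration of this exponential decay (a small sketch, not part of the original analysis):
# Fraction of observations in a neighborhood spanning 10% of the range
# of each of p uniformly distributed predictors.
p = c(1, 2, 10, 100)
data.frame(p = p, fraction = 0.1^p)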
We can calculate the length of each side of the hypercube by requiring that \(\mathrm{length}^p = 0.1\). Thus:
\[ p = 1 \quad \Rightarrow \quad \mathrm{length} = 0.1 \]
\[ p = 2 \quad \Rightarrow \quad \mathrm{length} = 0.1^{\frac{1}{2}} \approx 0.316 \]
\[ p = 100 \quad \Rightarrow \quad \mathrm{length} = 0.1^{\frac{1}{100}} \approx 0.977 \]
Note that for \(p = 100\) the side length is nearly 1, so the hypercube containing 10% of the training observations spans almost the entire range of every feature and is hardly “local” at all.
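These side lengths can be verified directly (a quick check, not part of the original write-up):
# Side length of a hypercube containing 10% of the observations, for each p.
0.1 ^ (1 / c(1, 2, 100))   # 0.1000000 0.3162278 0.9772372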
Suppose \(X_1 =\) hours studied, \(X_2 =\) undergrad GPA, and \(Y=\) received an A. We fit a logistic regression and produce the estimated coefficients \[\hat{\beta_0} = -6, \quad \hat{\beta_1} = 0.05, \quad \hat{\beta_2} = 1\]
Since we are given the estimated coefficients, we know that the probability can be written \[ \hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2}}{1 + e^{\hat{\beta_0} + \hat{\beta_1} X_1 + \hat{\beta_2} X_2}} = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}}, \quad X = \left( X_1, X_2 \right) \]
Since we are given \(X = \left( 40, 3.5 \right)\), we can compute the probability that the student gets an A in the class as follows \[ \hat{p}(X) = \frac{e^{-6 + 0.05 X_1 + X_2}}{1 + e^{-6 + 0.05 X_1 + X_2}} = \frac{e^{-6 + 0.05 \cdot 40 + 3.5}}{1 + e^{-6 + 0.05 \cdot 40 + 3.5}} = 0.378 \]
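As a quick numerical check of this value (a sketch, not part of the original code):
# Probability of an A for 40 hours studied and a 3.5 GPA.
eta = -6 + 0.05 * 40 + 1 * 3.5   # linear predictor
exp(eta) / (1 + exp(eta))        # 0.3775407; equivalently plogis(eta)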
Recall that the logistic model can be expressed in terms of the odds as \[ \frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2}, \quad X = \left( X_1, X_2 \right) \] We are given \(p(X) = 0.5\), so we can plug this value in, take the \(\log\) of both sides, and solve for \(X_1\)
\[ \begin{align} 1 &= e^{-6 + 0.05 X_1 + 3.5} \\ 0 &= -6 + 0.05 X_1 + 3.5 \\ X_1 &= 50 \end{align} \] Thus, the student must study 50 hours in order to have a 50% chance of getting an A in the class.
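Again as a quick check (a sketch, not part of the original code):
# Hours of study required for a 50% chance of an A.
(6 - 3.5) / 0.05                 # 50
plogis(-6 + 0.05 * 50 + 3.5)     # 0.5, confirming the answer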
summary(Weekly)   # Weekly data set from the ISLR package
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202
## Median : 0.2380 Median : 0.2340 Median :1.00268
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821
## Today Direction
## Min. :-18.1950 Down:484
## 1st Qu.: -1.1540 Up :605
## Median : 0.2410
## Mean : 0.1499
## 3rd Qu.: 1.4050
## Max. : 12.0260
From the numerical summary above, we see that each of the Lag variables has approximately the same minimum and maximum values, with the means differing slightly for Lag3, Lag4, and Lag5. It may therefore be worth looking at how the Lag values vary over each year. The numerical summary of Volume is also given, and we can break the volume down by year as well. From the plots above, we see that volume steadily increases until about 2009, where it drops back down. Looking at the figure with the Lag variables plotted by year, we see that within each year the Lag variables occupy roughly the same range of values, while there is somewhat more variation across years.
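The year-by-year view of Volume described above could be produced with something like the following (assumed plotting code, not the original):
# Sketch: distribution of trading volume by year.
boxplot(Volume ~ Year, data = Weekly, xlab = "Year", ylab = "Volume")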
glm_dir = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = Weekly, family = binomial)
summary(glm_dir) # Lag 2 is statistically significant
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
After fitting a logistic regression with Direction as the response and the five lag variables plus Volume as predictors, we find that Lag2 is the only statistically significant predictor, with a p-value of about 0.03.
We compute the confusion matrix, as shown below. This tabulates the true classes against the classes predicted by the fitted model. From the matrix, we see that the model correctly classified 54 of the Down weeks and 557 of the Up weeks, but misclassified the remaining 478 weeks (430 Down weeks predicted as Up and 48 Up weeks predicted as Down). This gives an overall prediction accuracy of 0.5611 and shows that the model predicts Up for the large majority of weeks.
conf_mat
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
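The confusion matrix above could be obtained with code along these lines (a sketch; glm.probs is an assumed intermediate name, while glm.pred and conf_mat match the names used above):
# Sketch: predicted classes from the fitted logistic regression and the
# resulting confusion matrix and overall accuracy.
glm.probs = predict(glm_dir, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "Up", "Down")
conf_mat = table(glm.pred, Weekly$Direction)
mean(glm.pred == Weekly$Direction)   # about 0.561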
We then fit a logistic regression model using Lag2 as the only predictor, training on the period from 1990 through 2008 and evaluating the model on the held-out data from 2009 and 2010. The confusion matrix is below, and we see that the test prediction accuracy is \((9 + 56)/104 = 0.625\).
table(test.pred, test$Direction)
##
## test.pred Down Up
## Down 9 5
## Up 34 56
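A sketch of how this split and fit could be carried out (the object names train, glm_lag2, and test.probs are assumptions; test and test.pred match the names used above):
# Sketch: train on 1990-2008, predict on 2009-2010.
train = Weekly[Weekly$Year <= 2008, ]
test  = Weekly[Weekly$Year >= 2009, ]
glm_lag2 = glm(Direction ~ Lag2, data = train, family = binomial)
test.probs = predict(glm_lag2, newdata = test, type = "response")
test.pred = ifelse(test.probs > 0.5, "Up", "Down")
mean(test.pred == test$Direction)   # 0.625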
We then repeat (d) using LDA. The confusion matrix is given below; it is identical to the one from the logistic regression model, so the overall prediction accuracy is again 0.625.
table(lda.class, test$Direction)
##
## lda.class Down Up
## Down 9 5
## Up 34 56
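A sketch of the LDA fit, using MASS::lda on the same train/test split (lda_fit is an assumed name; lda.class matches the output above):
# Sketch: LDA with Lag2 as the only predictor.
library(MASS)
lda_fit = lda(Direction ~ Lag2, data = train)
lda.class = predict(lda_fit, newdata = test)$class
mean(lda.class == test$Direction)   # 0.625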
We repeat (d) using KNN with \(K = 1\), and we see that the overall prediction accuracy drops to 0.5.
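A sketch of the KNN fit, using class::knn on the same split (knn.pred is an assumed name, and the exact accuracy can depend on how distance ties are broken):
# Sketch: KNN with K = 1 on the same train/test split.
library(class)
set.seed(1)   # knn() breaks distance ties at random
knn.pred = knn(train = as.matrix(train$Lag2),
               test  = as.matrix(test$Lag2),
               cl    = train$Direction, k = 1)
mean(knn.pred == test$Direction)   # about 0.5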
After fitting the logistic regression model, LDA, and KNN, I think that the logistic regression model would provide the best results for the data. Of the three models, KNN gives the worst test results, at just 50% prediction accuracy. While logistic regression and LDA give the same prediction accuracy, I think logistic regression is the better model because it makes fewer assumptions about the data. LDA assumes that the observations are drawn from Gaussian distributions with a common covariance matrix, which may not necessarily hold here; logistic regression can therefore give better results when this assumption is violated.