1. What effect does the sample size have on precision and accuracy? In this exercise, you’ll be asked to design a simulation that helps you answer these questions.
(a) What does it mean to say that an estimator is accurate? What does it mean to say it is precise?
An estimator is accurate if its estimates are centered on the true value of the quantity it is trying to estimate. An estimator is precise if it consistently produces estimates that are close to one another. Precision, however, does not imply accuracy: a precise estimator can consistently produce estimates that are clustered tightly around the wrong value.
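As an illustration (a small optional sketch with made-up parameters, not part of the required answer), the first simulated estimator below is accurate but imprecise, while the second is precise but inaccurate:
# Optional illustration: the quantity being estimated is 0 in both cases.
set.seed(1)
# accurate but imprecise: unbiased, but with a large spread
est_accurate = replicate(1000, mean(rnorm(10, mean = 0, sd = 5)))
# precise but inaccurate: tightly clustered, but systematically off by 2
est_precise = replicate(1000, 2 + mean(rnorm(1000, mean = 0, sd = 5)))
c(mean(est_accurate), sd(est_accurate))  # centered near 0, spread about 5 / sqrt(10)
c(mean(est_precise), sd(est_precise))    # centered near 2, spread about 5 / sqrt(1000)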
(b) We wish to estimate the mean of a population. Suppose we take a random sample of n measurements from this population and calculate the sample mean plus 1. In other words, our estimator is \(\overline{X} + 1\). Is this estimator biased? Explain.
Suppose the true mean of the population is \(\mu\). We take the expectation of the estimator and use linearity of expectation:
\[
\mathbb{E} [ \overline{X} + 1] =
\mathbb{E} \left[ \frac{1}{n} \sum_{i=1}^n X_i \right] + 1
= \frac{1}{n} \sum_{i = 1}^n \mathbb{E} [X_i] + 1 =
\frac{1}{n} \cdot n \cdot \mu + 1 = \mu + 1 \neq \mu
\]
Since the expectation of \(\overline{X} + 1\) is not equal to the true mean of the population, we conclude that the estimator is biased.
(c) Write a simulation that demonstrates the bias of this estimator. Draw a sample of size n = 100 from a Normal population with mean 100 and standard deviation 10. (Pretend that we are sampling IQ’s.) Show us a graph of the sampling distribution of your estimator (\(\overline{X}\) + 1) and explain why this shows evidence that the estimator is or is not biased.
library(ggplot2)

# simulation settings
n = 100;        # sample size
mu = 100;       # true population mean
sigma = 10;     # true population standard deviation
iters = 10000;  # number of simulated samples
mu_hat = c();   # simulated estimates (xbar + 1)
se = c();       # estimated standard errors of xbar
for (i in c(1:iters)) {
  x = rnorm(n, mu, sigma);
  mu_hat[i] = mean(x) + 1; # xbar + 1
  se[i] = sd(x) / sqrt(n);
}
ggplot(data.frame(mu_hat), aes(mu_hat)) + geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = 96:105) + theme_bw()

As seen in the histogram above, the estimates are biased: the sampling distribution is centered around 101, whereas the true mean is 100.
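A quick numerical check of the bias, using the simulated values from the loop above:
# average of the simulated estimates; close to mu + 1 = 101
mean(mu_hat)
# estimated bias of xbar + 1
mean(mu_hat) - mu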
(d) Use the estimated standard error to measure the precision of your estimator in (c). How does it compare to the precision of the standard estimator, \(\overline{X}\)?
# average the estimated standard errors across simulations to measure the precision of the estimator
std_err = mean(se); # approximately 0.996
The estimated standard error of the estimator from part (c) is 0.996. Since the variance of a random variable is invariant to constant shifts, \(\mathrm{Var}( \overline{X} + 1) = \mathrm{Var} ( \overline{X} )\), so the two estimators have the same standard error; adding 1 affects accuracy, not precision.
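This can also be checked empirically from the simulated values (a small optional check): subtracting 1 from each estimate recovers the plain \(\overline{X}\) estimates, so the two sampling distributions have identical spread.
# spread of the sampling distribution of xbar + 1
sd(mu_hat)
# the plain estimator xbar is just mu_hat - 1, so its spread is identical
sd(mu_hat - 1)
# both are close to the theoretical standard error sigma / sqrt(n) = 1
sigma / sqrt(n)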
(e) Repeat (c) but use \(n = 10000\). What happened to the bias? What happened to the precision? Again, provide histograms of the sampling distribution.
# repeat part (c) with a larger sample size
n = 10000;
for (i in c(1:iters)) {
  x = rnorm(n, mu, sigma);
  mu_hat[i] = mean(x) + 1; # xbar + 1
  se[i] = sd(x) / sqrt(n);
}
ggplot(data.frame(mu_hat), aes(mu_hat)) + geom_histogram(binwidth = 0.01) +
scale_x_continuous(limits = c(100, 102)) + theme_bw()

As we can see from the histogram above, the estimator is still biased, with the distribution centered around 101, but with the larger sample size it becomes much more precise; the precision can be summarized by the estimated standard error, which is now about 0.1. All of the estimates lie very close to 101, with none below 100 and none above 102. This contrasts with the smaller sample size, where the sampling distribution ranged from roughly 98 to 104.
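The same numerical summaries as before (using the objects from the code above) confirm this:
# bias is essentially unchanged, but the spread is much smaller
mean(mu_hat) - mu   # still approximately 1
sd(mu_hat)          # approximately sigma / sqrt(10000) = 0.1
mean(se)            # average estimated standard error, also approximately 0.1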
(f) What effect does increasing the sample size have on accuracy? On precision?
Increasing the sample size appears to have no effect on the accuracy of the estimator, but it does improve its precision. When we increased the sample size from 100 to 10,000, the estimator still centered around 101 rather than the true mean of 100, but the estimates based on 10,000 observations were tightly concentrated around 101, whereas the smaller sample size produced estimates with much higher variance.
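To make this concrete, a small optional sketch (reusing mu and sigma from above) tabulates the empirical bias and standard error of \(\overline{X} + 1\) for several sample sizes; the bias stays near 1 while the standard error shrinks like \(\sigma / \sqrt{n}\):
# optional check: bias and precision of xbar + 1 across several sample sizes
sizes = c(100, 1000, 10000)
results = sapply(sizes, function(n) {
  est = replicate(2000, mean(rnorm(n, mu, sigma)) + 1)
  c(bias = mean(est) - mu, std_error = sd(est))
})
colnames(results) = sizes
round(results, 3)   # bias stays near 1; std_error is roughly sigma / sqrt(n)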
(g) Suppose you are fitting a model with a data set of size n and you estimate the testing MSE. Then, you get twice as much data. What do you think will change in the estimate of the testing MSE?
If we get twice as much data, the variance component of the estimated testing MSE will decrease, whereas the bias will remain roughly the same, so the estimated testing MSE should decrease somewhat. This is consistent with the simulation above: the histogram became much narrower when we increased the sample size, which is indicative of smaller variance.
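A small simulation sketch can illustrate this (a purely hypothetical setup, not the births data: a fixed-flexibility polynomial fit to a made-up nonlinear signal), comparing the estimated testing MSE when the training set size is doubled:
# Illustrative sketch only: hypothetical data, not part of the assignment.
# Estimate the testing MSE of a fixed-flexibility model trained on n and on 2n points.
sim_test_mse = function(n_train) {
  x = runif(n_train)
  y = sin(2 * pi * x) + rnorm(n_train, sd = 0.5)           # training data
  fit = lm(y ~ poly(x, 10))                                # moderately flexible model
  x_test = runif(2000)
  y_test = sin(2 * pi * x_test) + rnorm(2000, sd = 0.5)    # large test set
  mean((y_test - predict(fit, data.frame(x = x_test)))^2)  # estimated testing MSE
}
mean(replicate(200, sim_test_mse(100)))   # test MSE estimated with n = 100
mean(replicate(200, sim_test_mse(200)))   # with 2n = 200: smaller on average, since the
                                          # variance component shrinks while the bias changes little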
2. Exercise 8.3.3. Consider the Gini index, classification error, and cross-entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\).
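For reference, in the two-class case with \(\hat{p} = \hat{p}_{m1}\), the classification error \(E\), Gini index \(G\), and cross-entropy \(D\) reduce to
\[
E = 1 - \max(\hat{p},\, 1 - \hat{p}), \qquad
G = 2\,\hat{p}\,(1 - \hat{p}), \qquad
D = -\hat{p}\log\hat{p} - (1 - \hat{p})\log(1 - \hat{p}),
\]
which are exactly the quantities computed in the code below.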
# Plot displaying the Gini index, classification error, and cross-entropy
# against phat_m1 in a two-class classification setting
library(reshape2) # for melt()
p = seq(0, 1, 0.01)
gini = p * (1 - p) * 2                          # Gini index: 2 * p * (1 - p)
entropy = -(p * log(p) + (1 - p) * log(1 - p))  # cross-entropy; NaN at p = 0 and 1, so those two points are dropped from the plot
class.err = 1 - pmax(p, 1 - p)                  # classification error
d = data.frame(p, gini, entropy, class.err)
long_d = melt(d, id = "p")                      # reshape to long format for ggplot
ggplot(long_d, aes(x = p, y = value, colour = variable)) + geom_point() +
theme_bw()

3. Births Data Set
(a) Cut your data into Training and Testing. Each should have 1000 observations. Use the seed number 1234. Use a tree (not pruned) to predict whether a baby will be born prematurely. What is the testing misclassification error?
library(tree)
# (assumes the births data frame has already been loaded into the session)
set.seed(1234);
n = dim(births)[1];          # 1998 observations
train_ind = sample(n, n / 2);
train = births[train_ind, ]; # 999 x 21
test = births[-train_ind, ]; # 999 x 21
# fit an unpruned tree to predict whether a baby will be born prematurely
tree_mod = tree(formula = Premie ~ ., data = train)
# plot(tree_mod)
# text(tree_mod, pretty = TRUE)
# testing misclassification error
preds = predict(tree_mod, newdata = test, type = "class")
conf.matrix = table(preds, test$Premie)
misclass = (conf.matrix[2, 1] + conf.matrix[1, 2]) / sum(conf.matrix) # 0.078
Using a tree, we are able to predict whether a baby will be born prematurely with a testing misclassification error of about 0.08.
(b) Use cross-validation to determine if the tree can be improved through pruning. If so, prune the tree to the appropriate size and provide a plot.
# CV to determine if tree can be improved through pruning
cv.train = cv.tree(tree_mod, FUN = prune.misclass)
# prune tree to appropriate size, provide plot
plot(cv.train$dev~cv.train$size)

# fit pruned model
pruned_fit = prune.misclass(tree_mod, best = 2)
plot(pruned_fit)
text(pruned_fit, pretty = TRUE)

# use the pruned model on the test data
preds_pruned = predict(pruned_fit, newdata = test, type = "class")
# results
conf.matrix = table(preds_pruned, test$Premie)
misclass = (conf.matrix[2, 1] + conf.matrix[1,2]) / sum(conf.matrix) # 0.078
Using cross-validation, we determine that pruning the tree to 2 terminal nodes would improve the predictive ability of the tree model. This is shown in the first plot above (cross-validated deviance versus tree size). The second plot shows the single split that the pruned tree uses to make classifications.
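The same size can also be read off programmatically from the cv.tree output (a small optional check):
# smallest tree size attaining the minimum cross-validated deviance (2 here)
min(cv.train$size[cv.train$dev == min(cv.train$dev)])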
(c) Interpret your pruned tree (or your tree in (a) if you did not need to prune). In particular, does it tell us whether smoking is a potential cause of premature births? What factors are associated with premature births?
The pruned tree uses only 2 terminal nodes. When applied to the testing data, it gives the same misclassification error rate as the unpruned tree, so we should prefer the simpler model to avoid overfitting. The only split in the pruned tree is whether weight is less than 87.5, so the model identifies weight as the factor most strongly associated with premature births. Since smoking does not appear in the tree, the pruned model alone cannot tell us whether smoking is a potential cause of premature births.
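As an optional check, summary() on the fitted tree objects lists the variables actually used in tree construction, which makes it easy to confirm whether the smoking variable appears in either tree:
# variables actually used in each tree
summary(pruned_fit)  # pruned tree: only the weight split remains
summary(tree_mod)    # unpruned tree, for comparison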
(d) What is the testing misclassification error rate of your pruned tree? Keep in mind that approximately 9% of all births are premature. This means that if a doctor simply predicts “not premature” ALWAYS, he or she will have only a 9% misclassification error. Did you do better?
Using the pruned tree with 2 terminal nodes results in a testing misclassification error rate of 0.078, which is slightly better than a doctor who simply predicts “not premature” for every baby.
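For reference, the baseline can be checked directly from the test set: always predicting “not premature” misclassifies exactly the proportion of premature births.
# proportion of premature births in the test set; this is the error rate of
# always predicting "not premature"
prop.table(table(test$Premie))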