Why estimate ƒ?
1. Prediction
● Make predictions of Y: Ŷ = ƒ̂(X), where ƒ̂ is treated as a black box
● The accuracy of Ŷ depends on
○ Reducible error: ƒ̂ is an imperfect estimate of ƒ; this error can potentially be reduced by using a better estimation technique
○ Irreducible error: ε cannot be predicted from X, so it places a limit on the accuracy of Ŷ
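These two terms come from the standard decomposition of the expected squared prediction error (treating ƒ̂ and X as fixed):

```latex
E\big[(Y - \hat{Y})^2\big]
  = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
```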
2. Inference
● Understand the association between Y and X1, …, Xp
● Estimate ƒ
○ Which predictors are associated with the response?
○ What is the relationship between the response and each predictor?
○ Is the relationship linear or more complicated?
Measuring the quality of fit
● For regression we use the mean squared error: MSE = (1/n) Σᵢ (yᵢ − ƒ̂(xᵢ))²
○ Computed on the training set, this gives the training MSE
○ Computed on a test set, the test MSE
○ As flexibility increases, the training MSE decreases monotonically, but the test MSE traces out a U-shape
■ A low training MSE paired with a high test MSE indicates overfitting
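This behaviour is easy to reproduce with a small simulation — a sketch assuming a sine-shaped true ƒ and polynomial fits of increasing degree (flexibility):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)
y = np.sin(x) + rng.normal(0, 0.3, 50)            # Y = f(X) + eps, with f = sin
x_test = rng.uniform(-3, 3, 200)
y_test = np.sin(x_test) + rng.normal(0, 0.3, 200)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

train_mse, test_mse = [], []
for degree in range(1, 11):                        # increasing flexibility
    coefs = np.polyfit(x, y, degree)
    train_mse.append(mse(y, np.polyval(coefs, x)))
    test_mse.append(mse(y_test, np.polyval(coefs, x_test)))
# train_mse is non-increasing in degree (nested least-squares fits)
```

Plotting `train_mse` and `test_mse` against `degree` typically shows the training curve falling monotonically while the test curve bottoms out and rises again.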
Bias-Variance Tradeoff
○ Expected test MSE decomposes as E[(y₀ − ƒ̂(x₀))²] = Var(ƒ̂(x₀)) + [Bias(ƒ̂(x₀))]² + Var(ε)
■ Variance: the amount by which ƒ̂ would change if we estimated it
using a different training set
● More flexible methods have higher variance
■ Bias: the error that is introduced by approximating a real-life problem
● More flexible models have lower bias
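The variance half of the tradeoff can be illustrated by refitting a rigid and a flexible model on many fresh training sets and watching how much their predictions at a fixed point move (a sketch; the sine-shaped ƒ, sample size, and degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_and_predict(degree, x0=0.5, n=20):
    """Draw a fresh training set, fit a degree-`degree` polynomial,
    and return its prediction at the fixed point x0."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # true f(x) = sin(3x)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x0)

# Variance of f-hat(x0) across 300 independent training sets
var_rigid = np.var([fit_and_predict(1) for _ in range(300)])
var_flexible = np.var([fit_and_predict(10) for _ in range(300)])
# The flexible fit's prediction swings far more from training set to training set
```

Comparing the average predictions to the true value sin(3 · 0.5) in the same simulation would show the flip side: the rigid model carries the larger bias.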
Resampling Methods
The Validation Set Approach
1. Randomly divide the available set of observations into a training set and a
validation set (hold-out set).
2. Fit the model to the training set.
3. Predict the responses in the validation set.
4. The resulting validation set error rate provides an estimate of the test error rate.
- The validation estimate of the test error can be highly variable, depending on which
observations end up in the training set versus the validation set.
- Only a subset of the observations is used to fit the model, and statistical methods tend to
perform worse when trained on fewer observations, so the validation error may overestimate the test error.
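A minimal sketch of the approach using a plain numpy 50/50 split (the cubic true ƒ and noise level are illustrative assumptions); rerunning it with different seeds shows the variability noted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, n)
y = x**3 - 2 * x + rng.normal(0, 0.5, n)

def validation_mse(seed, degree=3):
    """One random 50/50 train/validation split; fit on train, score on validation."""
    idx = np.random.default_rng(seed).permutation(n)
    train, val = idx[: n // 2], idx[n // 2 :]
    coefs = np.polyfit(x[train], y[train], degree)
    return float(np.mean((y[val] - np.polyval(coefs, x[val])) ** 2))

# The test-error estimate changes noticeably from split to split
estimates = [validation_mse(seed) for seed in range(10)]
```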
Leave-One-Out Cross-Validation (LOOCV)
1. Split the observations into two parts: a training set with all but one observation, and
a validation set containing the single held-out observation (x1, y1).
2. The method is fit on the n - 1 observations and the remaining observation is
predicted.
3. MSE₁ = (y₁ − ŷ₁)² is calculated.
4. The procedure is repeated n − 1 more times, each time with a different observation as the
validation set; the LOOCV estimate is the average CV(n) = (1/n) Σᵢ MSEᵢ.
+ Far less bias than the validation set approach, since each training set contains n − 1 observations.
+ Always yields the same result when repeated: there is no randomness in the splits.
- Can be expensive: the model has to be fit n times, which is time consuming for large n.
Shortcut for least squares linear or polynomial regression (only one fit needed):
CV(n) = (1/n) Σᵢ ((yᵢ − ŷᵢ) / (1 − hᵢ))², where hᵢ is the leverage of observation i.
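The shortcut can be checked against the brute-force loop — for least squares the two agree exactly (a numpy sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2 * x + 1 + rng.normal(size=30)
X = np.column_stack([np.ones(30), x])          # design matrix with intercept

# Full least-squares fit and leverages h_i = diag(H), H = X (X'X)^-1 X'
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Shortcut: CV(n) = (1/n) * sum(((y_i - yhat_i) / (1 - h_i))^2)
cv_shortcut = float(np.mean((resid / (1 - h)) ** 2))

# Brute-force LOOCV: refit n times, each time leaving one observation out
errs = []
for i in range(30):
    mask = np.arange(30) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append((y[i] - X[i] @ b) ** 2)
cv_loop = float(np.mean(errs))
```

The identity holds because (yᵢ − ŷᵢ)/(1 − hᵢ) is exactly the residual of observation i under the fit that leaves it out.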
k-Fold Cross-Validation
1. Randomly divide the set of observations into k groups (folds) of approximately equal
size.
2. The first fold is treated as a validation set, and the method is fit on the remaining k - 1
folds.
3. MSE₁ is computed on the observations in the held-out fold.
4. This procedure is repeated k times, each time treating a different fold as the validation set.
5. The k-fold CV estimate is calculated: CV(k) = (1/k) Σᵢ MSEᵢ
+ Shorter computation than LOOCV, since the model has to be fitted only k (usually 5 or 10) times
instead of n times.
+ Often gives more accurate estimates of the test error rate than LOOCV.
- There is an intermediate level of bias: with k = 5 or k = 10, each training set
contains roughly (k − 1)n/k observations, fewer than the n − 1 used by LOOCV, so the bias is higher than LOOCV's.
+ LOOCV has higher variance than k-fold CV with k < n. The n LOOCV training sets
overlap almost completely, so the n fitted models are highly correlated; the
k-fold training sets overlap less, so their fitted models are less correlated.
The mean of many highly correlated quantities has higher variance than the mean
of quantities that are less correlated.
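The five steps above can be sketched directly in numpy (the data-generating model and polynomial degree are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 5
x = rng.uniform(-2, 2, n)
y = np.sin(2 * x) + rng.normal(0, 0.3, n)

# Step 1: randomly assign observations to k folds of equal size
folds = rng.permutation(n) % k

fold_mses = []
for fold in range(k):
    val = folds == fold                            # step 2: hold out one fold
    coefs = np.polyfit(x[~val], y[~val], 5)        # fit on the remaining k-1 folds
    mse_i = float(np.mean((y[val] - np.polyval(coefs, x[val])) ** 2))
    fold_mses.append(mse_i)                        # steps 3-4: MSE_i for each fold

cv_k = float(np.mean(fold_mses))                   # step 5: CV(k) = (1/k) * sum MSE_i
```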