Data Science Methods Overview
CHoogteijling

Contents

1 Model Evaluation
  1.1 Linear Models for Regression
  1.2 Generalization Error
  1.3 The Bias-Variance Decomposition
  1.4 Estimating the Expected Prediction Error
  1.5 In-Sample Measures for Generalization Error: AIC and BIC
  1.6 K-fold Cross-Validation (CV)
2 Shrinkage Methods
  2.1 Ridge Regression
  2.2 Lasso Regression
3 Dimension Reduction
  3.1 Curse of Dimensionality
  3.2 Feature Selection
  3.3 Principal Component Analysis
  3.4 Selecting the Number of Factors L
  3.5 PCA versus Factor Analysis
4 Nonparametric Regression: k-Nearest Neighbors and Kernel Regression
  4.1 k-Nearest Neighbors Method
  4.2 Kernel Regression
  4.3 The MSE of the NW Estimator
  4.4 Local Linear Methods
5 Linear Discriminant Analysis
  5.1 Classification
  5.2 Decision Theory for Classification
  5.3 Linear Methods for Classification
  5.4 Linear Probability Model
  5.5 LDA for Classification
  5.6 Reduced Rank LDA
  5.7 Fisher’s Linear Discriminant
  5.8 QDA and Regularized Discriminant Analysis
  5.9 Model Evaluation Applied to Classification Problems
6 Logistic Regression and Stochastic Gradient Descent
  6.1 Logistic Regression
  6.2 Training Logistic Regression Models
  6.3 Regularisation of Logistic Regression Models
  6.4 Comparison of Logistic Regression and LDA
  6.5 Newton-Raphson Method
  6.6 Stochastic Gradient Descent
7 Clustering Methods
  7.1 K-Means Clustering
  7.2 Hierarchical Clustering
8 Bayesian Updating
  8.1 Bayes’ Rule
  8.2 Bayes Estimators
  8.3 Bayesian Learning: Recursion
  8.4 Bayesian Learning and Ridge Regression
9 Model Averaging
  9.1 Weighting Schemes
  9.2 Consistency and Asymptotic RMSE Optimality
  9.3 Model Averaging for Gaussian Mixture Model
A Background
  A.1 Jensen’s Inequality
  A.2 Rayleigh Quotient
  A.3 Logarithm Cribsheet
  A.4 Distributions
  A.5 Eigenvectors and Eigenvalues
  A.6 The Lagrangian Method
  A.7 Matrix Inverse
B Test Questions
1 Model Evaluation
Model performance is measured by how well a model generalizes. Model evaluation has two potential objectives, and both can play a role.
• Model selection is comparing the performance of different models to identify the best model.
• Model assessment is estimating the ability of a model to perform on new data.
In data-rich situations we can use a train-validation-test split; when data are insufficient we can use cross-validation.
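As a minimal sketch of the data-rich case (hypothetical data; the 60/20/20 proportions and the array names are illustrative assumptions, not a prescription):

```python
import numpy as np

# Hypothetical data set: 1000 observations, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Shuffle the indices once, then cut into 60% train / 20% validation / 20% test.
idx = rng.permutation(len(y))
n_train, n_val = int(0.6 * len(y)), int(0.2 * len(y))
train, val, test = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train], y[train]  # fit the candidate models here
X_val, y_val = X[val], y[val]          # model selection: compare candidates
X_test, y_test = X[test], y[test]      # model assessment: estimate err once, at the end
```

Keeping the test set untouched until the very end is what makes the final error estimate honest: reusing it during model selection would turn it into a second validation set.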
1.1 Linear Models for Regression
Suppose we have p features X = (X_1, . . . , X_p)ᵀ in the feature space and the target variable Y. We consider the regression model of the form
Y = f(X, β) + ε,  with

f(X, β) = Σ_{m=0}^{M} β_m h_m(X),   ε ~ N(0, σ_ε²) the error term
• A linear regression model has basis functions h_m(X), m = 1, . . . , M, as the features, spanning an M-dimensional feature space.
We set up the log-likelihood function to derive the least squares problem and then maximize with respect to the noise variance:
1. We have the likelihood function, which determines the model parameters β_m and σ_ε. Here X is the N × (M + 1) matrix with elements X_nm = h_m(x_n) and y = (y_1, . . . , y_N)ᵀ.

   P(y | X, β, σ_ε) = ∏_{n=1}^{N} N(y_n | f(x_n, β), σ_ε²)
2. We take the logarithm of P(y | X, β, σ_ε), where E_D(β) is the sum-of-squared-errors function. This shows that maximizing the likelihood with respect to the β_m is equivalent to minimizing the sum of squared errors.

   ln P(y | X, β, σ_ε) = −N ln σ_ε − (N/2) ln(2π) − E_D(β)/σ_ε²

   E_D(β) = (1/2) Σ_{n=1}^{N} (y_n − f(x_n, β))² = (1/2) Σ_{n=1}^{N} (y_n − βᵀh(x_n))²
3. We differentiate the log-likelihood function with respect to β_m:

   ∂/∂β_m ln P(y | X, β, σ_ε) = (1/σ_ε²) Σ_{n=1}^{N} (y_n − βᵀh(x_n)) h_m(x_n)
4. We set these derivatives to zero for m = 0, . . . , M and solve for β_m; this yields the normal equations for the least squares problem:

   β̂ = (XᵀX)⁻¹Xᵀy = X⁺y

   X⁺ = (XᵀX)⁻¹Xᵀ    Moore–Penrose pseudoinverse
5. We maximize the log-likelihood function with respect to the noise variance σ_ε²:

   σ̂_ε² = (1/N) Σ_{n=1}^{N} (y_n − β̂ᵀh(x_n))²
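Steps 4 and 5 above can be checked numerically. The sketch below assumes a hypothetical cubic basis h_m(x) = x^m and simulated data; it is an illustration of the normal equations, not a production fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: scalar inputs, cubic basis h_m(x) = x^m, m = 0, ..., M.
N, M = 200, 3
x = rng.uniform(-1, 1, size=N)
beta_true = np.array([0.5, -1.0, 2.0, 0.3])     # assumed coefficients
X = np.vander(x, M + 1, increasing=True)        # N x (M+1) design matrix, X[n, m] = h_m(x_n)
y = X @ beta_true + rng.normal(0, 0.1, size=N)  # targets with Gaussian noise

# Step 4: normal equations via the Moore-Penrose pseudoinverse X^+.
beta_hat = np.linalg.pinv(X) @ y                # equals (X^T X)^{-1} X^T y when X^T X is invertible

# Step 5: the ML estimate of the noise variance is the mean squared residual.
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)
```

With N = 200 observations and noise standard deviation 0.1, beta_hat lands close to beta_true and sigma2_hat close to the true variance 0.01.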
1.2 Generalization Error
We have the loss functions for a trained regression model f̂(X):

L(Y, f̂(X)) = (Y − f̂(X))²    squared error
L(Y, f̂(X)) = |Y − f̂(X)|     absolute error

The generalization error shows how well the model predicts responses for new data independently drawn from the same population distribution. For the data set T = {(x_n, y_n)}_{n=1}^{N}:
err_T = E_{(X,Y)}[L(Y, f̂(X)) | T]

The expected prediction error quantifies how well a predictive model is expected to perform on new, unseen data.

err = E_{T,(X,Y)}[L(Y, f̂(X))] = E_T[err_T]
The training error is the average loss on the set T the model was trained on:

err_train = (1/N) Σ_{n=1}^{N} L(y_n, f̂(x_n))
• The prediction error is the average discrepancy between the model’s predictions and the true values
of the dependent variable for new observations.
• The prediction error is the generalization error averaged over all possible sets of observations T; this averaging is meaningful because the observations are drawn independently from the same joint distribution as (X, Y).
• The generalization error should be small to ensure low prediction error on unseen data.
• The generalization error can often not be estimated directly, so we use an estimate of the expected prediction error instead.
• The training error is not a reliable indicator of generalization performance, as we can make the training error arbitrarily small without improving generalization performance.
• Overfitting occurs when the model is too tailored to the specifics of the noise in the training set.
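These points can be made concrete with a small simulation; the true function sin(2x), the noise level, and the polynomial degrees below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Hypothetical data-generating process: f(x) = sin(2x) plus Gaussian noise.
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(2 * x) + rng.normal(0, 0.3, size=n)

x_tr, y_tr = make_data(30)        # small training set T
x_new, y_new = make_data(5000)    # large fresh sample approximates err

def errors(degree):
    coef = np.polyfit(x_tr, y_tr, degree)                  # least squares polynomial fit
    tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)     # training error
    new = np.mean((y_new - np.polyval(coef, x_new)) ** 2)  # error on unseen data
    return tr, new

tr3, new3 = errors(3)
tr15, new15 = errors(15)
# The degree-15 fit has the smaller training error, but its error on
# fresh data stays well above it: the extra flexibility fits the noise.
```

Because the degree-3 basis is nested in the degree-15 basis, the training error can only decrease as the degree grows, while the fresh-data error does not follow.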
1.3 The Bias-Variance Decomposition
The prediction error can be decomposed into three terms: the bias (squared) of the estimated model,
plus the variance of the estimated model, plus the variance of the Gaussian noise.
• The bias term measures how much on average our estimated model deviates from the true mean,
given by the function f (X).
• The variance term is the expected (squared) deviation of the estimated model around its mean.
• The third term is an irreducible error, due to the inherent variance in the data-generating process
around its true mean f (X).
err[x_0] = E[(Y − f̂(X))² | X = x_0]
         = (E[f̂(x_0)] − f(x_0))² + E[(f̂(x_0) − E[f̂(x_0)])²] + σ_ε²
         = bias²(f̂(x_0)) + Var(f̂(x_0)) + σ_ε²
         = bias² + variance + σ_ε²
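The decomposition can be verified by Monte Carlo simulation. The sketch below assumes a true function sin(2x) and deliberately underfits with a straight line, so the bias term is clearly visible; all numerical choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2 * x)     # assumed true regression function

sigma = 0.3                  # noise standard deviation (sigma_eps)
x0, n, reps = 0.5, 50, 4000  # evaluation point, training size, replications

preds = np.empty(reps)       # f_hat(x0) across training sets
sq_err = np.empty(reps)      # (Y - f_hat(x0))^2 for a new Y at x0
for r in range(reps):
    x = rng.uniform(-1, 1, size=n)         # fresh training set T
    y = f(x) + rng.normal(0, sigma, size=n)
    coef = np.polyfit(x, y, 1)             # deliberately underfit: a line
    preds[r] = np.polyval(coef, x0)
    y0 = f(x0) + rng.normal(0, sigma)      # new observation at X = x0
    sq_err[r] = (y0 - preds[r]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2        # squared bias of f_hat at x0
variance = preds.var()                     # variance of f_hat at x0
decomposed = bias2 + variance + sigma**2   # the three terms
direct = sq_err.mean()                     # Monte Carlo estimate of err[x0]
# decomposed and direct agree up to Monte Carlo error.
```

Refitting with a more flexible model would shrink bias2 and inflate variance; the irreducible sigma**2 term is untouched either way.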