Comprehensive Answers Graded A 2024-2025
How are parameters estimated for linear regression? - ✔️✔️Method of least squares
What is the multiple linear regression model? - ✔️✔️Yi = B0 + B1Xi1 + ... + BpXip + ei
- ei are random errors
- Yi is the response for the ith case
- There are p predictor variables X1, X2, ..., Xp
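A minimal sketch of fitting this model in Python (the simulated data and the use of scikit-learn's LinearRegression are illustrative assumptions, not part of the original notes):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                              # n = 100 cases, p = 3 predictors
    y = 2.0 + X @ np.array([1.5, -0.5, 0.3]) + rng.normal(scale=0.5, size=100)

    fit = LinearRegression().fit(X, y)                         # least-squares estimates
    print(fit.intercept_, fit.coef_)                           # B0 hat and (B1, B2, B3) hat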
Least squares estimates - ✔️✔️The parameter values that minimize the sum of squared residuals
Residual - ✔️✔️Actual - Predicted value
Residual sum of squares - ✔️✔️RSS = sum of the squared residuals, (Yi − μˆi)^2, summed over
all cases
We want a small RSS, which indicates a better-fitting model (least squares minimizes it)
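A quick illustration of computing residuals and the RSS (the array names and made-up numbers are hypothetical):

    import numpy as np

    y = np.array([3.1, 4.8, 6.2, 7.9])        # observed responses (made-up numbers)
    y_hat = np.array([3.0, 5.0, 6.0, 8.0])    # fitted values from some regression model
    residuals = y - y_hat                      # actual - predicted
    rss = np.sum(residuals ** 2)               # residual sum of squares
    print(rss)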
Factor - ✔️✔️A categorical variable that can be incorporated into regression models by
dummy variables. A factor with k levels is coded with k−1 dummy variables, since one level
serves as the reference level
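One way to see the k−1 dummy coding, sketched with pandas (the column name "treatment" and its levels are invented for illustration):

    import pandas as pd

    df = pd.DataFrame({"treatment": ["A", "B", "C", "A", "B"]})   # factor with k = 3 levels
    # drop_first=True keeps k - 1 = 2 dummy columns; level "A" becomes the reference level
    dummies = pd.get_dummies(df["treatment"], drop_first=True)
    print(dummies)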
What is matrix formulation of the linear model? - ✔️✔️Y = XB + e
- Y is the response vector
- X is the design matrix
- B is the vector of p+1 (including intercept) regression parameters
- e is the vector of error terms
Ex: Y = [Y1, Y2, ..., Yn]
B = [B0, B1, ..., Bp]
e = [e1, e2, ..., en]
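A sketch of the matrix formulation in numpy, solving for B by least squares (the simulated data and the use of np.linalg.lstsq are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 2
    predictors = rng.normal(size=(n, p))
    X = np.column_stack([np.ones(n), predictors])     # design matrix with intercept column
    beta = np.array([1.0, 2.0, -1.0])                  # true B = [B0, B1, B2]
    e = rng.normal(scale=0.3, size=n)                  # error vector
    Y = X @ beta + e                                   # Y = XB + e

    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)      # least-squares estimate of B
    print(B_hat)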
F-test hypotheses - ✔️✔️Ho: the response is not related to any of the predictors (B1 = B2 = ... = Bp = 0)
Ha: the response Y is related to at least one of the predictors in the model (at least one Bj ≠ 0)
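A hedged sketch of the overall F-test using statsmodels (the data are simulated; fvalue and f_pvalue are attributes of the fitted OLS results):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 3))
    y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=80)

    res = sm.OLS(y, sm.add_constant(X)).fit()
    print(res.fvalue, res.f_pvalue)   # F statistic and p-value for Ho: B1 = B2 = B3 = 0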
Bias-Variance tradeoff - ✔️✔️As model complexity increases (more predictors), bias
decreases but variance increases
- We want to control both bias and variance
Bias - ✔️✔️Bias(μˆ) = E[μˆ] − μ
Bias arises from model misspecification (e.g., the true relationship is banana-shaped but a
linear regression is fit) - THE WRONG MODEL
Variance - ✔️✔️Var(μˆ) = SE(μˆ)^2
Arises due to noise in estimating regression coefficients
Mean squared error (MSE) - ✔️✔️Overall error in estimation:
MSE(μˆ) = E[(μˆ − μ)^2] = Bias(μˆ)^2 + Var(μˆ)
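The standard expansion behind "Bias^2 + Variance" (a routine derivation, not taken from the original notes), in LaTeX:

    \begin{aligned}
    \mathrm{MSE}(\hat\mu) &= E\big[(\hat\mu - \mu)^2\big]
      = E\big[(\hat\mu - E[\hat\mu] + E[\hat\mu] - \mu)^2\big] \\
    &= \underbrace{E\big[(\hat\mu - E[\hat\mu])^2\big]}_{\mathrm{Var}(\hat\mu)}
      + \underbrace{\big(E[\hat\mu] - \mu\big)^2}_{\mathrm{Bias}(\hat\mu)^2}
    \end{aligned}

    (the cross term vanishes because E[\hat\mu - E[\hat\mu]] = 0)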
Xij means - ✔️✔️The value of variable (predictor) j for the ith individual record, i.e. row i,
column j of the design matrix
What happens to bias and variance when the sample size (n) increases? - ✔️✔️The
variance decreases and bias stays the same
"Sample size increase squashes variance but does nothing to bias"
Collinearity - ✔️✔️One of the columns in the design matrix is (exactly or nearly) a linear
combination of the others. This makes it very difficult to distinguish between the effects of the
variables in the model, and hard to get good estimates of the parameters with least squares.
- Collinearity gets worse as the ratio p/n increases
- HIGH VARIANCE AS A RESULT
- You have perfect collinearity when p >= n (the design matrix cannot have full column rank)
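A sketch of checking for collinearity with variance inflation factors (the simulated columns are invented; variance_inflation_factor comes from statsmodels):

    import numpy as np
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(3)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)       # nearly a linear copy of x1 -> collinear
    X = np.column_stack([np.ones(200), x1, x2])       # design matrix with intercept

    # Large VIFs (say > 10) flag predictors whose effects are hard to separate
    print([variance_inflation_factor(X, i) for i in range(1, X.shape[1])])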
p > n - ✔️✔️When you have more predictors (p) than the sample size (n), you cannot get
standard errors for the regression coefficients or unique least-squares parameter estimates
(see MRNA data example)
- The method of least squares will not work for parameter estimation: you get a residual sum
of squares of zero, with no usable parameter estimates or standard errors. Regularized
regression techniques like ridge and lasso will give you a unique parameter estimate even
when p > n. VERY HELPFUL!!!
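A minimal sketch of ridge and lasso when p > n using scikit-learn (the dimensions and penalty values are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    rng = np.random.default_rng(4)
    n, p = 30, 100                       # more predictors than observations (p > n)
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

    ridge = Ridge(alpha=1.0).fit(X, y)   # both give a unique estimate despite p > n
    lasso = Lasso(alpha=0.1).fit(X, y)   # lasso also sets many coefficients exactly to zero
    print(np.count_nonzero(lasso.coef_))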
Each time we add a variable to the model, RSS will... - ✔️✔️never increase.
So adding a variable to a model will improve the fit (less bias) but will increase
variance. So tradeoff!!!
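A small demonstration that adding a predictor cannot increase the RSS (simulated data; the nested-model comparison is just an illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 60
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1 + 2 * x1 + rng.normal(size=n)

    def rss(design, y):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.sum((y - design @ beta) ** 2)

    small = np.column_stack([np.ones(n), x1])       # model with x1 only
    big = np.column_stack([np.ones(n), x1, x2])     # adds x2, even though it is pure noise
    print(rss(small, y) >= rss(big, y))              # True: RSS never increases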
Principle of Parsimony (Occam's Razor) - ✔️✔️All other things being equal, simple models
are better than complex ones
Complex models have what? - ✔️✔️High variance, but low bias
Variable selection - ✔️✔️It is a way of trying to find the balance between model fit and
model complexity
Information Criteria - ✔️✔️- Can think of improvement in model fit in terms of information
about the response
- Only add a variable if it contributes enough additional information about the response
to warrant the additional model complexity
- Q measures the overall quality of the model (WANT SMALL Q):
Q = badness of fit + k * (number of predictors) + constant
- We choose the model that minimizes Q
AIC - ✔️✔️Akaike Information Criterion
For a linear model with unknown error variance:
AIC = n*log(RSS/n) + 2p + constant
Since adding a predictor can never increase RSS, adding a single numeric predictor to the
regression model cannot increase the AIC by more than 2 units
- AIC for linear models is equivalent to Mallows' Cp variable selection. WE CHOOSE THE
MODEL WITH THE SMALLEST AIC
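A sketch of the AIC formula above applied to two nested models (simulated data; the constant term is dropped since it cancels when comparing models):

    import numpy as np

    rng = np.random.default_rng(6)
    n = 100
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1 + 2 * x1 + rng.normal(size=n)

    def aic(design, y, p):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = np.sum((y - design @ beta) ** 2)
        return n * np.log(rss / n) + 2 * p           # AIC = n*log(RSS/n) + 2p (+ constant)

    m1 = aic(np.column_stack([np.ones(n), x1]), y, p=1)
    m2 = aic(np.column_stack([np.ones(n), x1, x2]), y, p=2)
    print(m1, m2)                                    # choose the model with the smaller AIC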
Stepwise Variable Selection with AIC - ✔️✔️- An alternative to searching for the set of
predictors with the minimum AIC over all possible models. That all-subsets approach is very
time consuming because there are 2^p different models to consider, so with p = 10 predictors
there are already 2^10 = 1,024 models to consider
- INSTEAD, DO STEPWISE VARIABLE SELECTION:
1. Choose an initial model (typically the null model with no predictors or the full model with all
predictors)