Maria Andrade
Stats II Study Guide
Week 1: Revision Stats I & Dummy Coding
Revision Stats I
Linear Regression
● Dependent variable → Y
● Independent variable(s) → X
● Function of linear regression: Yi = B0 + B1 · Xi + Ei
○ B0 → population y-intercept
○ B1 → population slope coefficient
○ Xi → independent variable
○ Ei → random error
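A minimal sketch of how B0 and B1 can be estimated from a sample with ordinary least squares, for one predictor only. The function name and the data are made up for illustration; in practice a statistics package would do this.

```python
def fit_simple_regression(x, y):
    """Return (b0, b1), the least-squares estimates of B0 and B1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1


# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_regression(x, y)

# The residuals are the sample counterpart of the random error Ei
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, the residuals of a least-squares fit sum to zero.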
Eg: Interpretation of betas
● Eg: pricei = B0 + B1 · squaremeteri + B2 · bedroomsi + Ei
● B0: the predicted house price when the number of bedrooms is 0 and the number of square meters is 0
● B1: the increase in the predicted house price for every additional square meter, given that the number of bedrooms remains constant
● B2: the increase in the predicted house price for every additional bedroom, given that the number of square meters remains constant
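The interpretation of the betas can be checked with a small sketch. The coefficient values below are hypothetical, chosen only to make the arithmetic easy to follow; they are not estimates from real data.

```python
# Hypothetical coefficients, for illustration only
B0 = 50_000   # predicted price at 0 square meters and 0 bedrooms
B1 = 2_000    # change in predicted price per extra square meter
B2 = 10_000   # change in predicted price per extra bedroom


def predicted_price(square_meters, bedrooms):
    """Predicted price from the regression equation (error term left out)."""
    return B0 + B1 * square_meters + B2 * bedrooms


base = predicted_price(80, 2)
one_more_sqm = predicted_price(81, 2)      # bedrooms held constant
one_more_bedroom = predicted_price(80, 3)  # square meters held constant
```

Here `one_more_sqm - base` equals B1 and `one_more_bedroom - base` equals B2, which is exactly what "holding the other predictor constant" means.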
P-Values
● Alpha = 0.05 → the probability of a Type I error (rejecting a true H0) that we are willing to accept
● Compare the p-value with alpha → if the p-value is lower than alpha, you reject H0
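The decision rule is simple enough to write as a one-line function. This is only a sketch; the function name is made up and the p-values in the usage lines are illustrative.

```python
def reject_h0(p_value, alpha=0.05):
    """Return True when the p-value is lower than alpha, i.e. H0 is rejected."""
    return p_value < alpha


decision_small_p = reject_h0(0.03)   # p below alpha: reject H0
decision_large_p = reject_h0(0.20)   # p above alpha: do not reject H0
```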
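Model Fit: To test model fit you have SST, SSR and SSM
Model Fit | Description | Formula | Variance explained
SST | Difference between the observed data and the mean of Y | SST = Σ (yi - ȳ)² | Total unstandardized variance
SSR | Difference between the observed data and the model | SSR = Σ (yi - ŷi)² | Unexplained unstandardized variance → variation not accounted for in the model
SSM | Difference between the mean value of Y and the model | SSM = Σ (ŷi - ȳ)² | Explained unstandardized variance → variation accounted for in the model
Here yi is the observed value, ŷi the value predicted by the model and ȳ the mean of Y. For a least-squares model with an intercept, SST = SSM + SSR.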
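A short sketch of the three sums of squares for a one-predictor regression, using the standard definitions (squared differences between observed data, model predictions and the mean of Y). The function name and data are made up for illustration.

```python
def sums_of_squares(x, y):
    """Return (SST, SSR, SSM) for a least-squares line fitted to x and y."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least-squares slope and intercept
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    y_hat = [b0 + b1 * xi for xi in x]

    sst = sum((yi - mean_y) ** 2 for yi in y)               # observed vs mean
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # observed vs model
    ssm = sum((yh - mean_y) ** 2 for yh in y_hat)           # model vs mean
    return sst, ssr, ssm


# Made-up data, for illustration only
sst, ssr, ssm = sums_of_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

For this data SST is 6, SSR is 2.4 and SSM is 3.6 (up to floating-point rounding), so the total splits into an explained and an unexplained part.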
F-Ratio
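● F-ratio: the ratio between the standardized SSM (MSM) and the standardized SSR (MSR)
○ Formula: F = MSM / MSR
■ MSM formula: MSM = SSM / dfM, with dfM = k (the number of predictors)
● MSM stands for the standardized explained variance (mean squares of the model)
■ MSR formula: MSR = SSR / dfR, with dfR = n - k - 1 (n = number of cases)
● MSR stands for the standardized unexplained variance (mean squares of the residuals)
○ When the F-ratio is high → the explained variance is high relative to the unexplained variance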
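A sketch of the F-ratio calculation, assuming the usual degrees of freedom: MSM = SSM / k and MSR = SSR / (n - k - 1), with k predictors and n cases. The input numbers are illustrative, not results from a real dataset.

```python
def f_ratio(ssm, ssr, n, k):
    """F = MSM / MSR for a regression with k predictors and n cases."""
    msm = ssm / k              # mean squares of the model
    msr = ssr / (n - k - 1)    # mean squares of the residuals
    return msm / msr


# Illustrative values: SSM = 3.6, SSR = 2.4, 5 cases, 1 predictor
f_value = f_ratio(ssm=3.6, ssr=2.4, n=5, k=1)
```

Here MSM = 3.6 and MSR = 2.4 / 3 = 0.8, so F is about 4.5. Whether that is significant depends on the F distribution with (k, n - k - 1) degrees of freedom, which this sketch does not compute.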
R^2
● R^2: the proportion of explained variance over total variance
○ Formula: R^2 = SSM / SST
● Can be used to compare models, to see if one is better than the other
● The higher the R^2, the more variance is explained
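As a quick sketch, R^2 can be computed directly from the sums of squares as SSM / SST. The function name and the input values are illustrative.

```python
def r_squared(ssm, sst):
    """Proportion of the total variation that the model explains."""
    return ssm / sst


# Illustrative values: SSM = 3.6 out of SST = 6.0
r2 = r_squared(ssm=3.6, sst=6.0)
```

The result here is about 0.6, meaning the model accounts for roughly 60% of the variation in Y.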
Assumptions of Linear Regression
● If the assumptions are not met, the inferences drawn from the results may be invalid.
● Linearity
○ Meaning: yi is a linear function of the predictors
○ Check: residuals plot with X = ZPRED, Y = ZRESID; the assumption holds if the residuals are symmetric around 0
○ Fix: transform the data or change the model
● Independence of errors
○ Meaning: the errors are independent of each other
○ Check: Durbin-Watson test if the data are a time series; not for cross-sectional data
○ Fix: multilevel modeling or clustered SEs
● Normality (errors)
○ Meaning: the errors are normally distributed
○ Check:
■ 1) Histograms
■ 2) PP/QQ plots: the PP plot magnifies deviations in the middle, the QQ plot magnifies deviations in the tails
■ 3) KS/SW tests (Kolmogorov-Smirnov, Shapiro-Wilk)
■ 4) Skew & kurtosis: S / SEskewness and K / SEkurtosis
○ Fix: the SEs are biased; correct through a transformation or bootstrapping
● Homoscedasticity
○ Meaning: the errors have equal variance (equality of variance of the errors)
○ Check: ZPRED-ZRESID plot; Levene's test
○ Fix: the SEs are biased; transform or bootstrap
● Multicollinearity
○ Meaning: 2 or more predictors are highly correlated with each other, so they explain the same variance
○ Check: VIF (> 10) or tolerance (< 0.1); average VIF "much larger" than 1
○ Fix: remove variables
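A sketch of the VIF and tolerance check for multicollinearity, restricted to the simplest case of exactly two predictors. In that case the R^2 from regressing one predictor on the other equals their squared correlation, so tolerance = 1 - r^2 and VIF = 1 / tolerance. The function name and data are made up.

```python
def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) for a model with exactly two predictors."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)   # squared Pearson correlation
    tolerance = 1 - r_squared
    # Note: perfectly correlated predictors give tolerance 0 and a division error
    return 1 / tolerance, tolerance


# Made-up predictors, for illustration only
vif, tolerance = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

These predictors correlate at r = 0.8, giving a tolerance of 0.36 and a VIF of about 2.78, which is below the VIF > 10 and above the tolerance < 0.1 warning levels. With more than two predictors, each VIF needs a full regression of that predictor on all the others.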
Outliers
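● An outlier is an extreme value in y
● It's a cause for concern when:
○ >5% of the data lie more than 1.96 SD away (in absolute value)
○ >1% of the data lie more than 2.58 SD away
○ any case lies more than 3.29 SD away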
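The three rules of thumb can be sketched as a check on a list of z-scores. This assumes the "sd" values refer to standardized residuals from the regression, which the notes do not state explicitly; the function name, keys and data are made up.

```python
def outlier_concerns(standardized_residuals):
    """Flag each of the three rules of thumb (True = cause for concern)."""
    n = len(standardized_residuals)
    abs_z = [abs(z) for z in standardized_residuals]
    share_beyond_196 = sum(z > 1.96 for z in abs_z) / n
    share_beyond_258 = sum(z > 2.58 for z in abs_z) / n
    return {
        "over_5pct_beyond_1.96": share_beyond_196 > 0.05,
        "over_1pct_beyond_2.58": share_beyond_258 > 0.01,
        "any_beyond_3.29": any(z > 3.29 for z in abs_z),
    }


# Made-up standardized residuals for 20 cases
clean = outlier_concerns([0.5] * 19 + [2.0])
suspicious = outlier_concerns([0.1] * 18 + [2.0, -3.5])
```

In the first list exactly 5% of the cases exceed 1.96, which is not more than 5%, so nothing is flagged. In the second list all three rules are triggered.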
Influential Cases
● A case which influences any part of the regression analysis
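● It's an extreme in x → pushes the regression line
● Diagnostics:
○ Leverage → measures potential to influence the regression (how far a case's predictor values are from those of the other cases)
○ Mahalanobis distance → measures potential to influence the regression (distance of a case from the means of the predictors)
○ DFFIT(s) → difference in the predicted y for a case when the model is estimated including and excluding that case
○ SDFBeta → standardized change in one regression coefficient after exclusion of the case
○ Cook's Distance → the overall change in all regression coefficients after exclusion of the case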
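As an illustration of one of these diagnostics, the sketch below computes leverage for a regression with a single predictor, where it has the closed form h_i = 1/n + (x_i - mean of x)^2 / Σ (x_j - mean of x)^2. The function name and data are made up; the other diagnostics need refitting the model without each case and are not shown.

```python
def leverage_simple(x):
    """Leverage (hat value) of each case in a one-predictor regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Made-up predictor values; the last case is extreme in x
x = [1, 2, 3, 4, 20]
h = leverage_simple(x)
most_extreme = h.index(max(h))
```

The case with x = 20 gets a leverage of about 0.98, far above the others, so it has the most potential to pull the regression line. The leverages always sum to the number of estimated coefficients (here 2: intercept and slope).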
Dummy Coding
Dummy coding → used to include a categorical predictor with multiple categories in a regression
Steps:
1. Recode a variable into dummies
2. Number of dummies = categories - 1
3. A dummy is 0 or 1 for a particular category
4. Reference category is 0 for all dummies
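The steps above can be sketched as a small function. The function name, the example variable and its categories are made up for illustration.

```python
def dummy_code(values, reference):
    """Recode a categorical variable into (categories - 1) 0/1 dummies."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference category does not occur in the data")
    # One dummy per category, except the reference category
    dummy_names = [c for c in categories if c != reference]
    # Each case scores 1 on the dummy of its own category and 0 elsewhere,
    # so the reference category scores 0 on all dummies
    return [{name: int(v == name) for name in dummy_names} for v in values]


# Made-up variable with 3 categories, so 2 dummies
colors = ["red", "blue", "green", "blue"]
coded = dummy_code(colors, reference="blue")
```

With "blue" as the reference, each case gets a "green" and a "red" dummy; the two "blue" cases score 0 on both. In a regression, each dummy's coefficient is then the difference in predicted Y between that category and the reference category.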