CHAPTER 22: Multiple Linear Regression, Model Violations
Motivation:
• The market-model example:
(Y = ‘daily stock price of Heineken’ on X = ‘daily price of AEX’)
-model requirements were checked graphically
-transformation of Y and X into daily returns (%) was suggested
-visual observations can be misleading
-proper tests are needed
• Amazon ebook sales: no checks have been done!
(Y = ‘dollar sales from published ebooks’ on X = ‘ebook price’)
• Baseball teams’ performance: no checks have been done!
(Y = ‘runs per season’ on X = ‘on-base and slugging percentages’)
• Wage differences: no significant differences detected (H0 not rejected). Is this because H0 is valid, because the sample size is small, or because the assumptions are invalid?
22.1 Collinearity (= a very strong correlation between one explanatory variable and a linear combination of some of the other explanatory variables)
-does not influence SSE and hence the usefulness of the model
-but interpretation of the individual regression coefficients becomes harder
-the standard errors of the coefficients are inflated, so the values of the t-tests are pushed towards zero
-proving the individual significances may therefore be hard
What can be done? (against collinearity)
-only take action if necessary (collinearity is a possibility, not a certainty)
-possible action: remove an offending variable from the model, or transform the variables into linearly independent components
-if caused by squared or interaction terms, the problem can occasionally be solved by switching to centered variables (if it is possible), that is, using X − x̄ (the variable minus its sample mean) in place of X
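A minimal sketch of why centering can help with squared terms. The regressor values are hypothetical (not lecture data), and plain Python is used only to keep the example self-contained; the notes themselves work in R. It compares the correlation between a variable and its square before and after centering.

```python
from statistics import mean

def corr(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

# Hypothetical regressor values (assumption, not from the lecture)
x = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Raw variable versus its square: almost perfectly correlated
r_raw = corr(x, [v ** 2 for v in x])

# Centered variable versus its square
xc = [v - mean(x) for v in x]
r_centered = corr(xc, [v ** 2 for v in xc])

print(f"corr(X, X^2)                 = {r_raw:.4f}")
print(f"corr(X - mean, (X - mean)^2) = {r_centered:.4f}")
```

The correlation vanishes here because the hypothetical values are symmetric around their mean; with skewed data, centering usually reduces the correlation but does not remove it.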
22.3: Non-linearity
Is the linearity in the basic assumption E(Y) = β₀ + β₁X appropriate?
Consequences? If linearity is violated, the model and its estimates are incorrect!
What can be done? Find a correct model specification (for example logarithms, dummies, etc.)
Non-linearity can often be detected by studying the residuals
The existence of non-linearity can be tested as follows:
-estimate the original model E(Y) = β₀ + β₁X₁ + … + βₖXₖ
-create the variable of the accompanying predictions ŷ
-extend the original model by including the square of the prediction (for example, with coefficient γ = gamma!): E(Y) = β₀ + β₁X₁ + … + βₖXₖ + γŷ²
-test H₀: γ = 0; rejecting H₀ indicates non-linearity
[R output not reproduced] First estimate the normal model, then extend the model with PREDICT2 (the squared predictions), added using the cbind function. Conclusion: the model should be extended to a non-linear one!
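The test with the squared prediction can be sketched as follows. This is an illustration under assumptions, not the lecture's R code: the data are hypothetical (generated from a quadratic relation plus small fixed noise), `ols` is a small helper written for this sketch, and `predict2` mirrors the PREDICT2 variable from the notes. Only the t-value of γ is computed; its p-value would come from a t-table.

```python
def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given regressor columns.

    Returns (coefficients, standard errors, fitted values).
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented matrix [X'X | I | X'y]; Gauss-Jordan turns it into
    # [I | (X'X)^-1 | coefficients]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [1.0 if c == r else 0.0 for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))  # partial pivoting
        A[k], A[piv] = A[piv], A[k]
        A[k] = [v / A[k][k] for v in A[k]]
        for r in range(p):
            if r != k:
                f = A[r][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][-1] for r in range(p)]
    inv_diag = [A[r][p + r] for r in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    s2 = sse / (n - p)  # estimated error variance
    se = [(s2 * d) ** 0.5 for d in inv_diag]
    return beta, se, fitted

# Hypothetical data (assumption): the true relation is quadratic in x
x = [float(v) for v in range(1, 13)]
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.5, 0.4, -0.2, 0.0, 0.3, -0.1]
y = [5 + 0.5 * xi ** 2 + e for xi, e in zip(x, noise)]

# Step 1: estimate the original linear model
_, _, y_hat = ols([x], y)

# Step 2: create the squared predictions
predict2 = [f ** 2 for f in y_hat]

# Step 3: extend the model with predict2 and look at gamma
beta_ext, se_ext, _ = ols([x, predict2], y)
gamma, se_gamma = beta_ext[2], se_ext[2]
t_gamma = gamma / se_gamma
print(f"gamma = {gamma:.5f}, t-value = {t_gamma:.1f}")
```

A large absolute t-value for γ leads to rejecting H0: γ = 0, which is the signal that the linear specification is not adequate for these hypothetical data.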
22.2: Heteroskedasticity (if homoskedasticity is violated!)
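Heteroskedasticity can be tested by checking the usefulness of an auxiliary model for the squared errors, E(ε²) = γ₀ + γ₁X₁ + … + γₖXₖ, or of its second-order counterpart with interactions. H₀: E(ε²) = γ₀ (all other γ's equal to 0) corresponds to homoskedasticity. If the auxiliary model is useful (H₀ is rejected, so not all of these γ's are 0), this indicates the presence of heteroskedasticity.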
What can be done?
- Heteroskedasticity-consistent standard errors can be used to obtain confidence intervals/tests for parameter values
- Weighted least squares (not addressed here!)
(Not discussed in the lecture, because there is homoskedasticity here!)
The auxiliary model is a linear or a quadratic function of the explanatory variable: it is γ₀ + γ₁X₁, or γ₀ + γ₁X₁ + γ₂X₁²
Third step: regress the squared residuals on the e-book price (first option above). Alternative: regress them on the e-book price and the square of the e-book price (= second option above!). Look at the F-statistic and its p-value to check whether the auxiliary model is useful
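The auxiliary regression (first option, linear in X) can be sketched as follows. The data are hypothetical, not the Amazon data: the error spread is made to grow with x on purpose. `simple_ols` is a helper written for this sketch.

```python
from statistics import mean

def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, residuals)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    return b0, b1, residuals

# Hypothetical data (assumption): the error size is proportional to x
x = [float(v) for v in range(1, 21)]
y = [10 + 2 * xi + 0.5 * xi * (-1) ** i for i, xi in enumerate(x, start=1)]

# Fit the original model and square its residuals
_, _, res = simple_ols(x, y)
res2 = [r ** 2 for r in res]

# Auxiliary regression of the squared residuals on x
_, _, aux_res = simple_ols(x, res2)
n, k = len(x), 1
sse = sum(r ** 2 for r in aux_res)
sst = sum((v - mean(res2)) ** 2 for v in res2)
r2 = 1 - sse / sst

# F-statistic for the usefulness of the auxiliary model
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(f"R^2 = {r2:.3f}, F = {f_stat:.1f} on ({k}, {n - k - 1}) df")
```

The F-value is compared with the F(1, 18) critical value (about 4.41 at the 5% level); a larger value means the auxiliary model is useful, which points to heteroskedasticity.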
Possible solutions, since H₀: γ = 0 is rejected (because p-value < any reasonable alpha!):
- Heteroskedasticity-consistent standard errors
- Weighted least squares estimation, that is, standardizing the data so that the errors become homoskedastic
This is still the Amazon example, and now we know there is heteroskedasticity!
Standard output: the coefficient estimates are valid under homo- AND heteroskedasticity! BUT the standard errors, t-values and p-values are only valid under homoskedasticity (if obtained with the lm command!)
Alternative procedure: gives standard errors that are also valid under heteroskedasticity! (THE ESTIMATES ARE THE SAME FOR BOTH!)
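A sketch of the difference between the two kinds of standard errors for the slope of a simple regression. The notes do not show which procedure was used, so this uses White's HC0 estimator as one common choice (an assumption), on hypothetical data in which the error spread grows with the distance of x from its mean.

```python
from statistics import mean

# Hypothetical data (assumption, not the Amazon data)
x = [float(v) for v in range(1, 21)]
mx = mean(x)
y = [10 + 2 * xi + 0.5 * (xi - mx) * (-1) ** i
     for i, xi in enumerate(x, start=1)]

# Least-squares estimates: the same under both procedures
n = len(x)
my = mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
b0 = my - b1 * mx
res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Conventional standard error of the slope (valid under homoskedasticity)
se_conventional = (sum(r ** 2 for r in res) / (n - 2) / sxx) ** 0.5

# White HC0 standard error of the slope (also valid under heteroskedasticity)
se_hc0 = (sum((xi - mx) ** 2 * r ** 2 for xi, r in zip(x, res)) / sxx ** 2) ** 0.5

print(f"slope estimate  = {b1:.4f}")
print(f"conventional SE = {se_conventional:.4f}")
print(f"HC0 SE          = {se_hc0:.4f}")
```

Only the standard errors change, so t-values and p-values change with them while the slope estimate stays the same. For these hypothetical data the HC0 standard error is the larger one, but that ordering is not guaranteed in general.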
22.3 Non-normality (= not crucial for outcome!)
Consequences:
-the LS estimators are generally not normally distributed
-the LS estimators are not optimal anymore
-the statistical conclusions thus cannot be trusted
-however, these problems are less serious for large sample sizes (CLT implies that the LS-estimators
are approximately normal) with the main exception being prediction intervals
Non-normality can be detected with the Kolmogorov-Smirnov, Shapiro-Wilk, or Lilliefors test and
other test procedures (see chapter 24)
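As an illustration of one of these tests, the sketch below computes the Lilliefors test statistic: the largest distance between the empirical distribution of the residuals and a normal distribution with mean and standard deviation estimated from the same data. The two samples are hypothetical. Critical values are not included and would have to be taken from Lilliefors tables or from statistical software.

```python
from math import exp
from statistics import NormalDist, mean, stdev

def lilliefors_statistic(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, v in enumerate(sorted(sample)):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i + 1)/n at v
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Hypothetical residuals (assumption): one close to normal, one strongly skewed
n = 30
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
near_normal = z
skewed = [exp(2 * v) for v in z]

d_normal = lilliefors_statistic(near_normal)
d_skewed = lilliefors_statistic(skewed)
print(f"D (near-normal sample) = {d_normal:.3f}")
print(f"D (skewed sample)      = {d_skewed:.3f}")
```

A larger statistic means a larger departure from normality; whether it is significant depends on the sample size and the tabulated critical value.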
What can be done?
- A perfect remedy does not exist