Assignment 1
Question 1
1.1) In statistics, the variance inflation factor (VIF) is the quotient of the
variance in a model with multiple terms by the variance of a model with one
term alone.[1] It quantifies the severity of multicollinearity in an ordinary
least square’s regression analysis. It provides an index that measures how
much the variance (the square of the estimate's standard deviation) of an
estimated regression coefficient is increased because of collinearity.
Cuthbert Daniel claims to have invented the concept behind the variance
inflation factor, but did not come up with the name.
We can calculate k different VIFs (one for each Xi) in three steps:
Step one
First we run an ordinary least square regression that has Xi as a function of all
the other explanatory variables in the first equation.
If i = 1, for example, equation would be :
where is a constant and e is the error term.
Step two
Then, calculate the VIF factor for with the following formula:
where R2i is the coefficient of determination of the regression equation in step
one, with on the left-hand side, and all other predictor variables (all the
other X variables) on the right-hand side.
Step three
Analyse the magnitude of multicollinearity by considering the size of the
. A rule of thumb is that if then multicollinearity is high
(a cut-off of 5 is also commonly used).
Some software instead calculates the tolerance which is just the reciprocal
of the VIF. The choice of which to use is a matter of personal preference.
, 1.2) The coefficient of determination, R^2, is used to analyse how differences
in one variable can be explained by a difference in a second variable. For
example, when a person gets pregnant has a direct relation to when they give
birth.
Step 1: Find the correlation coefficient, r (it may be given to you in the
question). Example, r = 0.543.
Step 2: Square the correlation coefficient.
0.5432 = .295
Step 3: Convert the correlation coefficient to a percentage.
.295 = 29.5%
1.3) The C-statistic (sometimes called the “concordance” statistic or C-index)
is a measure of goodness of fit for binary outcomes in a logistic regression
model. In clinical studies, the C-statistic gives the probability a randomly
selected patient who experienced an event (e.g. a disease or condition) had a
higher risk score than a patient who had not experienced the event. It is
equal to the area under the Receiver Operating Characteristic (ROC) curve
and ranges from 0.5 to 1.