STATISTICS
END-TERM
Distributions overview
1. Normal distribution and T-Distribution
Both are symmetric and bell-shaped, but the normal distribution applies when the population
standard deviation is known (one-sample Z test) and for the proportion test.
The T-distribution is used when the standard deviation is estimated from a sample; its exact
form depends on the number of degrees of freedom (e.g. n – 1 for a one-sample T test), and it
approaches the normal distribution as the degrees of freedom grow.
2. 𝛘²-distribution and F-distribution
Both are asymmetrical, right-skewed distributions with only positive values (W and F).
The exact shape of the 𝛘²-distribution depends on one degrees-of-freedom parameter: v. Here,
you look up W-values based on 𝛂 and v.
The exact shape of the F-distribution depends on two degrees-of-freedom parameters: v1 and v2.
Here, you look up F-values based on 𝛂, v1 and v2.
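Looking up a critical value in a table is the same as evaluating the inverse cumulative distribution at 1 – 𝛂 (or 1 – 𝛂/2 for a two-sided test). A minimal sketch for the normal case, using only the Python standard library; the function name `z_critical` is made up for this illustration. The T, 𝛘² and F lookups work the same way but need an external library (for example SciPy's `scipy.stats.chi2.ppf`), which is not used here.

```python
from statistics import NormalDist


def z_critical(alpha: float, two_sided: bool = True) -> float:
    """Critical Z-value of the standard normal distribution for level alpha."""
    tail = alpha / 2 if two_sided else alpha
    # inv_cdf returns the value below which a fraction (1 - tail) of the area lies
    return NormalDist().inv_cdf(1 - tail)


z_two_sided = z_critical(0.05)                   # two-sided test, alpha = 0.05
z_one_sided = z_critical(0.05, two_sided=False)  # one-sided test, alpha = 0.05
print(round(z_two_sided, 3), round(z_one_sided, 3))  # 1.96 1.645
```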
Multiple Linear Regression Model
= There is one dependent variable (Y), and multiple independent variables (X1, X2, etc.).
- Basic assumption multiple regression:
Y = 𝛃0 + 𝛃1X1 + 𝛃2X2 + … + 𝛃kXk + 𝛆
With E(𝛆) = 0
It could also be written as E(Y) = 𝛃0 + … + 𝛃kXk
Reminder: the formula always needs exactly one of the two: either the error term at the end
(+𝛆) or the expected value at the beginning (E(Y)), never both.
→ Here, the interpretation of the slope b1 = the average change in Y (in the example: the
statistics result) when X1 increases by 1 (a student studies 1 hour more), ceteris paribus
(= everything else remaining the same: one IV changes by +1 and the other IVs stay constant).
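The ceteris paribus interpretation can be checked numerically: raise one IV by 1, keep the others fixed, and the predicted value changes by exactly that IV's slope. A small sketch with made-up coefficients (the values in `b` and the helper `predict` are assumptions for illustration, not taken from any output in these notes):

```python
# Hypothetical estimates b0, b1, b2 (illustration only).
b = [2.0, 0.5, -0.3]


def predict(b, x):
    """Estimated E(Y) = b0 + b1*x1 + ... + bk*xk."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


base = predict(b, [10, 4])      # X1 = 10 hours of study, X2 = 4
one_more = predict(b, [11, 4])  # X1 one unit higher, X2 unchanged
change = one_more - base        # equals b1, up to floating-point rounding
print(change)
```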
Quantitative variables: take numerical values.
Qualitative variables: indicate which option applies (e.g. gender = ‘man’). → In
regression:
- Dummy variables: can only take the value 0 or 1.
The regression model always contains ‘the number of options – 1’ dummies; one option is omitted.
→ you add the dummy variables in the basic assumption multiple regression, after
the IVs as: … + 𝛃5D1 + 𝛃6D2
→ The ‘slope’ of a dummy is then the difference between its option and the
omitted option:
Slope of D1 = average Y for D1 – average Y for the omitted dummy
Slope of D2 = average Y for D2 – average Y for the omitted dummy
→ The omitted option ‘disappears’ into the intercept b0. The slopes of the
remaining options are then the mean differences from the omitted option.
→ Interpretation of a dummy slope: someone in the category of the dummy scores, on average,
“slope” higher/lower than someone in the omitted category, ceteris paribus.
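A sketch of the dummy logic with invented data for three options, where option "A" is taken as the omitted one. It assumes a model that contains only the dummies; in that case the intercept equals the mean of the omitted option and each slope equals a plain difference in means. With other IVs in the model, the slopes become differences ceteris paribus and no longer equal the raw mean differences. All names and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical scores per option; "A" is the omitted (reference) option.
scores = {
    "A": [6.0, 7.0, 8.0],
    "B": [7.0, 8.0, 9.0],
    "C": [4.0, 5.0, 6.0],
}
reference = "A"

b0 = mean(scores[reference])  # the omitted option sits in the intercept
slopes = {opt: mean(vals) - b0 for opt, vals in scores.items() if opt != reference}


def dummies(option):
    """0/1 coding with (number of options - 1) dummies."""
    return [1 if option == opt else 0 for opt in slopes]


def predict(option):
    """b0 + slope_B * D_B + slope_C * D_C"""
    return b0 + sum(s * d for s, d in zip(slopes.values(), dummies(option)))


print(b0, slopes)
```

For the omitted option all dummies are 0, so the prediction is just b0.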
Second-order 𝛃k
A regression model measures linear relationships. Still, it is possible to include a second-order
(squared) relationship in the model.
→ In the basic assumption: you note the normal, linear slope 𝛃4 of the independent
variable X4, and at the end, you also note the slope 𝛃8 of the squared independent
variable X4².
- So: there are 2 slope coefficients for the same variable: the linear one and the non-linear
(second-order, squared) one.
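With a squared term, the effect of one extra unit of X4 is no longer constant: it depends on the level of X4. A short sketch with invented coefficients (the names `b4` and `b8` follow the numbering used above; the values are assumptions):

```python
# Hypothetical estimates: intercept, linear slope of X4, slope of X4 squared.
b0, b4, b8 = 1.0, 2.0, -0.1


def predict(x4):
    """Estimated E(Y) = b0 + b4*X4 + b8*X4^2 (other IVs left out for brevity)."""
    return b0 + b4 * x4 + b8 * x4 ** 2


def effect_of_one_more(x4):
    """Change in the prediction when X4 goes from x4 to x4 + 1.

    Algebraically this is b4 + b8 * (2*x4 + 1).
    """
    return predict(x4 + 1) - predict(x4)


low = effect_of_one_more(2)    # effect at a low level of X4
high = effect_of_one_more(10)  # effect at a high level of X4
print(low, high)
```

With a negative b8, the effect shrinks as X4 grows and can eventually change sign.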
Interaction 𝛃x
Additionally, it is possible to include an interaction slope. This means that the magnitude
of the effect of Xk on Y depends on another X.
- Xk * the other X = ‘the interaction between Xk in … and X’
→ In the basic assumption: you note the normal slopes of both independent
variables 𝛃1X1 + … + 𝛃4X4, and at the end, you also note the slope of the interaction
effect 𝛃9X1X4.
- You can make an interaction with quantitative and qualitative (dummy) variables.
Then:
When value of dummy = 0 → interaction effect is ‘not active’.
When value of dummy = 1 → interaction effect is ‘active’.
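What ‘active’ means in numbers: when the dummy is 1, the slope of the quantitative variable changes by the interaction coefficient. A sketch with invented values; the coefficient names (`b1` for X1, `b5` for the dummy D, `b9` for the interaction X1*D) are assumptions chosen for this illustration.

```python
# Hypothetical estimates: intercept, slope of X1, slope of dummy D, interaction slope.
b0, b1, b5, b9 = 3.0, 0.5, 1.0, 0.25


def predict(x1, d):
    """Estimated E(Y) = b0 + b1*X1 + b5*D + b9*X1*D"""
    return b0 + b1 * x1 + b5 * d + b9 * x1 * d


def slope_of_x1(d):
    """Effect of one extra unit of X1 for a given dummy value."""
    return predict(1, d) - predict(0, d)


inactive = slope_of_x1(0)  # dummy = 0: only b1 remains
active = slope_of_x1(1)    # dummy = 1: b1 + b9
print(inactive, active)
```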
SPSS output multiple regression
(Don’t look at the numbers in this example, they don’t make any sense):
→ Std. Error of the Estimate = S𝛆
→ look at df (= degrees of freedom):
Regression df = k
Residual df = n-(k+1)
Total df = n-1
→ The left dark-grey column lists the independent variables (here k = 5).
In Model 1, the row ‘(Constant)’ shows the estimate of 𝛃0.
Footnote a below the table names the dependent variable.
- Know how to calculate everything in the SPSS output. Here already the calculation of MSE:
MSE = SSE / dfe or MSE = S𝛆² or MSE = MSR / F
r² = SSR/SST or 1 – SSE/SST
Example question to find r²:
Solve: F = (r²/dfr) / ((1 – r²)/dfe)
Or: solve it via the regression test of the complete model
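The relations between the quantities in the ANOVA table can be put together in one small calculation. The sketch below uses invented sums of squares (SSR = 60, SSE = 40, n = 26, k = 5); the function names are made up for this illustration. Solving F = (r²/dfr) / ((1 – r²)/dfe) for r² gives r² = F·dfr / (F·dfr + dfe), which is what `r2_from_f` computes.

```python
from math import sqrt


def anova_summary(ssr, sse, n, k):
    """Fill in the ANOVA quantities from SSR, SSE, sample size n and k IVs."""
    df_r = k            # regression df
    df_e = n - (k + 1)  # residual df
    df_t = n - 1        # total df
    msr = ssr / df_r
    mse = sse / df_e
    return {
        "df_r": df_r, "df_e": df_e, "df_t": df_t,
        "MSR": msr, "MSE": mse,
        "S_eps": sqrt(mse),         # Std. Error of the Estimate
        "F": msr / mse,
        "r2": ssr / (ssr + sse),    # SST = SSR + SSE
    }


def r2_from_f(f, df_r, df_e):
    """Solve F = (r2/df_r) / ((1 - r2)/df_e) for r2."""
    return f * df_r / (f * df_r + df_e)


s = anova_summary(ssr=60.0, sse=40.0, n=26, k=5)
print(s["F"], s["r2"], r2_from_f(s["F"], s["df_r"], s["df_e"]))
```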