chapter 10 introduction to multivariate relationships
causal relationships are asymmetrical → 𝑥 causes 𝑦; establishing causality requires 3 criteria:
- association between variables
o as 𝑥 changes, the distribution of 𝑦 should change in some way
o association does NOT imply causation
- appropriate time order
- elimination of alternative explanations
o observational studies can never prove that 1 variable is a cause of another
- anecdotal evidence is not enough to disprove causality unless it refutes 1 of the 3 criteria
- randomized experiments are the standard for establishing causality, although this isn’t
always possible in social research
in multivariate analysis, a variable is said to be controlled when its influence is removed
- randomized experiments inherently control other variables in a probabilistic sense
statistical control: approximating an experimental type of control by grouping observations
with equal/similar values on the control variables in observational research
control variable: any variable that is held constant
lurking variable: a variable that is not measured in a study but does influence the association
multivariate associations
- spurious: both 𝑥1 and 𝑦 are dependent on 𝑥2 , but their association disappears when 𝑥2
is controlled
- chain relationship: the relationship between 𝑥1 and 𝑦 exists but is indirect. 𝑥2 is an
intervening variable or mediator
- multiple causes: can either be independent or dependent (= there exists a relationship
between the causes themselves)
- suppressor: when controlling for a suppressor variable, the association between 2
variables increases
- interaction: an association has diff strengths and/or directions at diff values of the
control variable
Simpson’s paradox: the possibility that, after controlling for a variable, the partial associations have the opposite direction from the bivariate association
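a minimal numeric sketch of this (hypothetical numbers, Python/numpy assumed): within each level of the control variable 𝑥2 the 𝑥1–𝑦 association is positive, but the pooled bivariate association is negative

```python
# minimal Simpson's paradox sketch with made-up numbers
import numpy as np

# two groups defined by the control variable x2
x1_a = np.array([1, 2, 3, 4]); y_a = np.array([6, 7, 8, 9])   # group A
x1_b = np.array([6, 7, 8, 9]); y_b = np.array([1, 2, 3, 4])   # group B

r_a = np.corrcoef(x1_a, y_a)[0, 1]   # +1.0 within group A
r_b = np.corrcoef(x1_b, y_b)[0, 1]   # +1.0 within group B
r_pooled = np.corrcoef(np.concatenate([x1_a, x1_b]),
                       np.concatenate([y_a, y_b]))[0, 1]      # negative when x2 is ignored
print(r_a, r_b, r_pooled)
```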
confounding: when 2 explanatory variables both have effects on a response variable but are
also associated with each other
- omitted variable bias: the bias that arises when a study neglects to measure a confounding variable that explains a major part of the effect
chapter 9 linear regression and correlation
non-directional: 𝑥 predicts 𝑦
directional:
- pos association: higher 𝑥 predicts higher 𝑦
- neg association: higher 𝑥 predicts lower 𝑦
linear regression model: 𝑦̂ = 𝑎 + 𝑏𝑥
- predicted criterion value → 𝑦̂
- 𝑦-intercept → 𝑎
- slope → 𝑏
o pos when high 𝑥-values coincide with high 𝑦-values, and vice versa
o neg when low 𝑥-values coincide with high 𝑦-values, and vice versa
o we can’t use 𝑏 to interpret the strength of the association between 𝑥 and 𝑦
▪ 𝑏 depends on the scale
we consider 3 types of 𝑦:
- 𝑦: observed outcome value of an individual
- 𝑦̅: avg outcome value (mean of 𝑦)
- 𝑦̂: individual’s predicted outcome value based on model
least squares estimation: finds the straight line falling closest to all data points in the scatterplot, i.e. the line that minimizes the sum of squared residuals
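a minimal sketch (made-up 𝑥, 𝑦 values, Python/numpy assumed) of how the least squares estimates of 𝑎 and 𝑏 can be computed:

```python
# minimal least squares sketch with made-up data (values are hypothetical)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# least squares estimates: b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), a = ybar - b*xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x                       # predicted values from the fitted line

print(a, b)
print(np.polyfit(x, y, 1))              # cross-check: returns [b, a] for a degree-1 fit
```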
Pearson’s correlation: 𝑏* = 𝑟 = (𝑠𝑥 / 𝑠𝑦) 𝑏
- interpretation: 0 < negligible < .10 ≤ small < .30 ≤ moderate < .50 ≤ large
- both 𝑟 and 𝑏* are measures of effect size
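a minimal sketch (same made-up data as above, restated so the block is self-contained) checking that rescaling 𝑏 by 𝑠𝑥/𝑠𝑦 gives 𝑟:

```python
# minimal sketch: r equals the slope rescaled by s_x / s_y (made-up data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r_from_b = (np.std(x, ddof=1) / np.std(y, ddof=1)) * b   # b* = r = (s_x / s_y) * b
r_direct = np.corrcoef(x, y)[0, 1]                       # Pearson's r computed directly
print(r_from_b, r_direct)                                # both give the same value
```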
residual (𝒆): vertical distance between observed 𝑦 and predicted 𝑦̂
- 𝑒 = 𝑦 − 𝑦̂
- we can use this residual to determine how well the model performs in predicting 𝑦
total sum of squares: 𝑇𝑆𝑆 = ∑(𝑦 − 𝑦̅)² → how much variation is there in the dependent variable to be explained (marginal variation)
sum of squared errors: 𝑆𝑆𝐸 = ∑(𝑦 − 𝑦̂)² → how much variation is still unexplained after adding the independent variable (conditional variation)
regression sum of squares: 𝑅𝑆𝑆 = ∑(𝑦̂ − 𝑦̅)² → how much variation is explained by adding the independent variable
the smaller the 𝑆𝑆𝐸, the better the prediction → 𝑆𝑆𝐸 = 𝑇𝑆𝑆 − 𝑅𝑆𝑆
we use the different sums of squares to inspect the explanatory power of the model and to test its significance
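a minimal sketch (made-up data) computing the three sums of squares and checking 𝑆𝑆𝐸 = 𝑇𝑆𝑆 − 𝑅𝑆𝑆:

```python
# minimal sketch of TSS, SSE and RSS (made-up data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

tss = np.sum((y - y.mean()) ** 2)       # marginal variation
sse = np.sum((y - y_hat) ** 2)          # unexplained (conditional) variation
rss = np.sum((y_hat - y.mean()) ** 2)   # explained variation
print(tss, sse, rss, tss - rss)         # sse equals tss - rss (up to rounding)
```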
coefficient of determination (𝑅²): proportion of variation in 𝑦 that is explained by the model
- 𝑅² = (𝑇𝑆𝑆 − 𝑆𝑆𝐸) / 𝑇𝑆𝑆 = (∑(𝑦 − 𝑦̅)² − ∑(𝑦 − 𝑦̂)²) / ∑(𝑦 − 𝑦̅)²
- 0 ≤ 𝑅² ≤ 1
- the closer to 1, the stronger the linear relationship
- interpretation: 0 < negligible < .02 ≤ small < .13 ≤ moderate < .26 ≤ large
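- worked example (hypothetical numbers): if 𝑇𝑆𝑆 = 50 and 𝑆𝑆𝐸 = 20, then 𝑅² = (50 − 20)/50 = .60 → 60% of the variation in 𝑦 is explained by the model (a large effect by the guideline above)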
inferential statistics: using sample data to make inferences about the population parameters
- we can’t confirm hypotheses, but we can falsify
o by inspecting the probability of finding 𝑏 (or 𝑟) when the null hypothesis was true
o null hypothesis: no association between variables (independent)
▪ 𝐻0: 𝛽 = 0
o alternative hypothesis: association between variables (dependent)
▪ 𝐻𝑎: 𝛽 ≠ 0
▪ if directional: 𝛽 < 0 or 𝛽 > 0
- check significance of 𝑏 using the 𝑡-statistic
o under 𝐻0: 𝛽 = 0, 𝑡 = 𝑏 / 𝑠𝑒 with 𝑑𝑓 = 𝑛 − 2
- check significance of 𝑅² using the 𝐹-statistic
o 𝐹 = (𝑅²/1) / ((1 − 𝑅²)/(𝑛 − 2)) = ((𝑇𝑆𝑆 − 𝑆𝑆𝐸)/1) / (𝑆𝑆𝐸/(𝑛 − 2)) = (𝑅𝑆𝑆/1) / (𝑆𝑆𝐸/(𝑛 − 2)) = 𝑀𝑆𝑅/𝑀𝑆𝐸
▪ 𝑑𝑓1 = 𝑘 = 1 (𝑘 = number of regression parameters 𝑏)
▪ 𝑑𝑓2 = 𝑛 − 𝑘 − 1 = 𝑛 − 2
- based on the 𝑡- or 𝐹-statistic, determine the 𝑝-value:
o what is the probability of finding a result this extreme, when the 𝐻0 is true?
- 𝐹 = 𝑡² → both options yield the same conclusion
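a minimal sketch (made-up data, scipy assumed) of the 𝑡-test for 𝑏 and the equivalent 𝐹-test:

```python
# minimal sketch: t-test for the slope and the equivalent F-test (made-up data)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
sse = np.sum((y - (a + b * x)) ** 2)
tss = np.sum((y - y.mean()) ** 2)

se_b = np.sqrt(sse / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of b
t = b / se_b
p_t = 2 * stats.t.sf(abs(t), df=n - 2)          # two-sided p-value, df = n - 2

r2 = (tss - sse) / tss
F = (r2 / 1) / ((1 - r2) / (n - 2))
p_F = stats.f.sf(F, 1, n - 2)                   # same p-value as the t-test

print(t ** 2, F)                                # F = t^2
print(p_t, p_F)
```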
4 scenarios are possible, depending on the decision and on whether 𝐻0 is actually true
- 2x erroneous decision (which we want to avoid)
o type 1 error: probability of rejecting 𝐻0 when it is true
▪ determined by the selected 𝛼-level (.05)
▪ if observed 𝑝-value < 𝛼 : reject 𝐻0
o type 2 error (𝛽): probability of not rejecting 𝐻0 when it is false
▪ determined by:
• strength of association/diff in population
• sample size of study
• selected 𝛼-level
o trade-off: the smaller the type 1 error, the larger the type 2 error
- 2x correct decision
o 1 − 𝛽 = power → probability of correctly rejecting 𝐻0
▪ typically aim for 80%
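a minimal simulation sketch (all settings hypothetical: 𝛼 = .05, 𝑛 = 50, true slope .4) of the type 1 error rate and power:

```python
# minimal simulation sketch of type 1 error and power (all settings are hypothetical)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 50, 2000

def reject_h0(true_slope):
    """Draw one sample with the given population slope and test H0: beta = 0."""
    x = rng.normal(size=n)
    y = true_slope * x + rng.normal(size=n)
    return stats.linregress(x, y).pvalue < alpha

type1 = np.mean([reject_h0(0.0) for _ in range(n_sims)])  # H0 true: rejection rate ~ alpha
power = np.mean([reject_h0(0.4) for _ in range(n_sims)])  # H0 false: rejection rate = power
print(type1, power)
```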
assumptions of linear regression:
- representativeness: analyses are based on a random sample
- functional form: relation between 𝑥 and 𝑦 is linear
- homoscedasticity: the conditional variance of 𝑦 around the regression line is equal for all 𝑥
- normal distribution: the conditional distribution of 𝑦 for each 𝑥 is normal
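a minimal sketch (made-up data) of rough residual checks for the last two assumptions; a residual plot would normally be inspected as well:

```python
# minimal residual-check sketch (made-up data)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

# homoscedasticity (rough check): |residuals| should not grow or shrink with x
print(np.corrcoef(x, np.abs(residuals))[0, 1])
# normality of the conditional distribution: e.g. a Shapiro-Wilk test on the residuals
print(stats.shapiro(residuals))
```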