Notes – Statistics 2 2023
Lecture 01: 04/09/2023
Bivariate Relationships between Continuous Variables
Covariance
- Variance: how much do observations deviate from the central tendency?
- Covariance: how much do variables vary together?
- When one variable changes, how does this affect the other variable?
cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
- Covariance does not have a set range (it depends on the variables’ scales).
- Covariance is an unstandardised measure, so we cannot compare covariances when variables have very different scales.
- The covariance statistic depends on the variances of x and y.
- We therefore use Correlation Coefficients: standardised covariance statistic.
- Or use Linear regression models: not standardised, but with other advantages.
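A minimal sketch of the covariance formula above (Python with NumPy, made-up example data):

```python
import numpy as np

# Hypothetical example data: any two paired samples would do
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Sample covariance: sum of cross-deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; element [0, 1] is cov(x, y)
print(cov_xy, np.cov(x, y)[0, 1])  # both print the same value
```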
The correlation coefficient, which always takes values between -1 and 1, describes the strength of the linear
relationship between two variables. We denote the correlation by r.
- The correlation coefficient is a standardised measure of the linear association between two continuous
variables. Its sign gives the direction (positive or negative) of the relationship.
- r = 1 -> a perfect positive linear relationship. All observations fall on a positively sloped line.
- r = 0 -> no linear relationship.
- r = -1 -> a perfect negative linear relationship. All observations fall on a negatively sloped line.
- Nonlinear trends, even when strong, can produce correlations that do not reflect the strength of the relationship.
- Always plot the data to see the distribution of the data.
- Interpreting the correlation (see the helper sketch after this list):
- |r| < 0.1 : very small
- 0.1 <= |r| < 0.3 : small
- 0.3 <= |r| < 0.5 : moderate
- |r| >= 0.5 : large
- Correlation does not imply causation. Even if two variables have a strong correlation, it does not mean
that one causes the other.
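As an illustrative helper (hypothetical function; thresholds taken from the rule of thumb above):

```python
def correlation_size(r: float) -> str:
    """Label the size of a correlation using the rule of thumb above."""
    size = abs(r)
    if size < 0.1:
        return "very small"
    elif size < 0.3:
        return "small"
    elif size < 0.5:
        return "moderate"
    return "large"

print(correlation_size(-0.35))  # moderate
```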
Pearson’s r correlation
r = cov(x, y) / (SD(x) ∗ SD(y))
Assumptions
- Interval-ratio (continuous) variables.
- Linear relationship between variables.
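A minimal sketch (Python with NumPy/SciPy, made-up data) showing that Pearson’s r is just the covariance standardised by the two standard deviations:

```python
import numpy as np
from scipy import stats

# Hypothetical paired samples
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# r = cov(x, y) / (SD(x) * SD(y))
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# scipy.stats.pearsonr also returns the p-value used when reporting
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```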
Reporting correlations:
- Higher levels of economic inequality are associated with lower levels of electoral democracy (r = -0.35).
This association is moderate in size and statistically significant (p < 0.01).
Spearman’s rho correlation
- Measures the strength and direction of association between two ranked variables.
- Primarily used for discrete ordinal variables and when the assumptions of Pearson’s r are violated (see the sketch below).
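A minimal sketch (SciPy, made-up monotonic data): Spearman’s rho is equivalent to Pearson’s r computed on the ranks of the observations:

```python
import numpy as np
from scipy import stats

# Hypothetical data with a perfectly monotonic but nonlinear trend
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])

rho, p_value = stats.spearmanr(x, y)
print(rho)  # 1.0: perfectly monotonic, even though the trend is not linear

# Equivalent: Pearson's r on the ranks of x and y
print(stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0])  # also 1.0
```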
Sample vs. Population
- Population
- Observations of relevance for our research questions.
- Sample
- Selection of observations we analyse.
We use our sample to make inferences about the population.
Linear regression is the statistical method for fitting a line to data where the relationship between two variables,
x and y, can be modelled by a straight line with some error.
- The prediction line tells us how we expect the mean/ average value of Y to change when X changes by one
unit.
A statistical model is an abstraction/ simplification that may be useful for answering our questions.
- Linear regression is a method that allows researchers to summarise how predictions or average values of
an outcome vary across observations defined by a set of predictors.
- What is our best guess about one variable if we know what the other variable equals?
𝑦𝑖 = 𝑏0 + 𝑏1 ∗ 𝑥𝑖 + 𝜖𝑖
The values 𝑏0 and 𝑏1 represent the model’s parameters, and the error is represented by 𝜖.
- i represents the individual observation.
- 𝑏0 represents the intercept/ constant term (the average value of Y we expect to observe when X = 0).
- 𝑏1 represents the slope (how we expect the mean of Y to change when X increases by one unit).
- The DV needs to be a continuous variable while the IV can have any form.
- The data fall around a straight line, even if none of the observations fall exactly on the line.
- Dependent variable
- What we want to predict
- Common labels: Y, DV, outcome variable
- Independent variable
- What we are using to predict the DV
- Common labels: X, IV, predictor variable
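A minimal sketch (NumPy, made-up data) of fitting the bivariate model above and reading off the intercept and slope; np.polyfit is just one of several ways to obtain the least-squares estimates:

```python
import numpy as np

# Hypothetical data: y roughly follows 2 + 0.5 * x, plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x + rng.normal(0, 1, size=x.size)

# Least-squares fit of y = b0 + b1 * x (polyfit returns the highest degree first)
b1, b0 = np.polyfit(x, y, deg=1)

print(f"intercept b0 = {b0:.2f}")  # expected mean of Y when X = 0
print(f"slope b1 = {b1:.2f}")      # expected change in the mean of Y per one-unit increase in X

# Prediction: our best guess for Y at a given X within the observed range
print(f"predicted y at x = 5: {b0 + b1 * 5.0:.2f}")
```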
Main purposes of regression
- Making predictions, including for new data.
- Describing relationships.
- Studying causal relationships: causal inference.
Extrapolation describes the fallacy of applying a model estimate to values outside the range of the original
data. It can be unreliable, as it assumes that the linear relationship continues indefinitely; the sketch below illustrates this.
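A small self-contained illustration (hypothetical coefficients, echoing the fit sketched above):

```python
# Hypothetical least-squares estimates from data observed only on x in [0, 10]
b0, b1 = 2.1, 0.48

# Within the observed range: a prediction the data can support
print(b0 + b1 * 5.0)

# Far outside the observed range: extrapolation. The model still returns
# a number, but the data give no evidence that the linear trend holds here.
print(b0 + b1 * 100.0)
```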
Lecture 02: 11/09/2023
Bivariate Linear Regression
Ordinary Least Squares (OLS) regression
Least squares regression aims to find the best-fitting linear relationship by minimising the sum of squared
residuals.
𝑦𝑖 = 𝑏0 + 𝑏1 ∗ 𝑥𝑖 + 𝜖𝑖
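A minimal sketch (NumPy, made-up data) of the closed-form bivariate OLS solution: the slope equals cov(x, y) / var(x), and the intercept makes the line pass through the point of means:

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for y = b0 + b1 * x + e
b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope: cov(x, y) / var(x)
b0 = y.mean() - b1 * x.mean()                # line passes through (x-bar, y-bar)

residuals = y - (b0 + b1 * x)
print(b0, b1)
print(residuals.sum())         # ~0: residuals sum to zero when an intercept is fitted
print((residuals ** 2).sum())  # the sum of squares that OLS minimises
```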
Error/ Residual
- e -> the difference between the actual value of Y for observation i and the model’s prediction for that observation.
- Represents variation in Y not explained by our model.
- Positive error/ residual -> the actual value is higher than our predicted value (above the regression line).
- Negative error/ residual -> the actual value is lower than our predicted value (below the regression line).
Reporting OLS regression:
- A discussion about the direction of the relationship (positive or negative coefficient).
- Higher values of X are associated with higher/ lower values of Y.
- Name the value of the effect.
- Based on this model, we expect Y to increase/ decrease by … (value) on average with each one
unit increase in X.
- If it is a bivariate OLS regression: we only interpret the intercept if the predictor variable is scaled such
that the value 0 refers to a category of relevance -> the intercept is then the expected mean of Y when X = 0.
- A conclusion about the null hypothesis with reference to the p-value or the confidence interval.
- This association is (not) statistically significant (p ...).
Residuals are the leftover variation in the data after accounting for the model fit:
𝐷𝑎𝑡𝑎 = 𝐹𝑖𝑡 + 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙
- Each observation has a residual.
- Residuals can be used to detect outliers (large residuals show us the outliers).
- The sum of the residuals in a linear regression model with an intercept is (essentially) zero by construction.
- The model is not systematically overestimating or underestimating the observed values.
First, we compute the model’s prediction for a given point. Then we subtract the predicted value from the observed value.
Residual = Observed Value − Predicted Value
Prediction for example point (77.0, 85.3): 𝑦̂ = 41 + 0.59𝑥 = 41 + 0.59 ∗ 77.0 = 86.4
𝑒 = 𝑦 − 𝑦̂ = 85.3 − 86.4 = −1.1
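The worked example can be checked directly (Python; the coefficients 41 and 0.59 are taken from the example model above):

```python
# Example model from the notes: y-hat = 41 + 0.59 * x
b0, b1 = 41.0, 0.59

x_obs, y_obs = 77.0, 85.3
y_hat = b0 + b1 * x_obs   # 41 + 0.59 * 77.0 = 86.43
residual = y_obs - y_hat  # observed minus predicted

print(round(y_hat, 1))     # 86.4
print(round(residual, 1))  # -1.1
```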
Residuals are helpful in evaluating how well a linear model fits a data set. Residuals can be displayed in a residual
plot where the vertical coordinate is the value of the residual.
- A residual plot where the residuals scatter randomly around zero indicates a good model fit.
- Other patterns (curves, funnels) in the residual plot can suggest violations of the regression assumptions.
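A minimal residual-plot sketch (NumPy and matplotlib, made-up data); the horizontal axis is x and the vertical coordinate is the residual:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data generated from a genuinely linear relationship
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 + 1.5 * x + rng.normal(0, 1, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Residuals scattered evenly around zero suggest a good linear fit;
# curves or funnels would suggest nonlinearity or non-constant variance.
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("Residual")
plt.show()
```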