Lectures by Chip Huisman
Semester 1, Block 3 2019-2020
Lecture 1 – 06/01/2020
Relationship between 2 variables
We call the analysis of the relationship between 2 variables ‘bivariate analysis’.
Association = Correlation = Relation
- Dependent and independent variable
- Response and explanatory variable
- Outcome and predictor variable
- Y and x variable
We only look at interval/ratio variables.
The relationship between variables can be studied and analyzed by generating and looking
at a scatter plot.
Step-by-step plan for drawing a distribution diagram/scatter plot:
1. Draw the axes and determine which variable goes on which axis
2. Determine the range of the values and mark them on the axes
3. Place a dot for each pair of scores
4. (If necessary, give the dots a name)
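A minimal sketch of these steps in Python (matplotlib is not part of the lecture, just a common plotting choice; the variables hours_studied and exam_score are made-up example data):

```python
import matplotlib.pyplot as plt

# Made-up example data: one (x, y) pair of scores per student
hours_studied = [2, 4, 5, 7, 8, 10]       # x (explanatory/predictor) variable
exam_score    = [55, 60, 62, 70, 74, 80]  # y (response/outcome) variable

# Steps 1-2: choose which variable goes on which axis and label the axes
plt.xlabel("Hours studied (x)")
plt.ylabel("Exam score (y)")

# Step 3: place a dot for each pair of scores
plt.scatter(hours_studied, exam_score)

plt.title("Scatter plot of exam score against hours studied")
plt.show()
```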
The correlation coefficient (Pearson r)
- Displays the linear relationship between 2 interval/ratio variables
- A positive number indicates positive relation. A negative number a negative relation
- The value lies between -1 (perfect negative correlation) and +1 (perfect positive
correlation). 0 means no correlation at all
- Correlation does not depend on original units of measurement
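A short sketch of how Pearson r can be computed with numpy (np.corrcoef returns the full correlation matrix; the off-diagonal element is r). The data are the same made-up scores as in the scatter plot sketch above:

```python
import numpy as np

hours_studied = np.array([2, 4, 5, 7, 8, 10])
exam_score = np.array([55, 60, 62, 70, 74, 80])

# np.corrcoef returns a 2x2 correlation matrix; element [0, 1] is Pearson r
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(r)  # a value between -1 and +1; close to +1 here (strong positive relation)

# r does not depend on the units of measurement:
# rescaling x (e.g. hours -> minutes) leaves r unchanged
r_minutes = np.corrcoef(hours_studied * 60, exam_score)[0, 1]
print(np.isclose(r, r_minutes))  # True
```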
Linear relationships
Linear function: y = α + βx
This formula expresses the values on the y-axis as a linear function of the values on the x-axis. The formula describes a straight line with slope β (beta) and y-intercept α.
The slope β (beta) = a number that indicates how much the value of y increases or decreases when x increases by one.
The y-intercept α = a number that indicates where the line crosses the y-axis. This is also
called the constant.
Linear means rectilinear/straight.
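A small worked example of a linear function (the values α = 10 and β = 2 are made up purely for illustration):

```python
# y = alpha + beta * x, with made-up values alpha = 10 (intercept) and beta = 2 (slope)
alpha, beta = 10, 2

def linear(x):
    return alpha + beta * x

print(linear(0))  # 10 -> the line crosses the y-axis at the intercept alpha
print(linear(1))  # 12
print(linear(2))  # 14 -> each increase of x by 1 increases y by beta = 2
```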
Intermezzo
Nominal + order = ordinal
Ordinal + differences equally large = interval
Interval + zero point = ratio
What is a MODEL?
A model is an approximation to reality.
A statistical model is an approximation of a characteristic of individuals within a population.
Everyone within a population has an age. But for a very large population this is very
inconvenient to display. So you give an approximation by calculating the average/mean age.
Ergo, the average/mean is a statistical model.
Similarly, a relationship between two variables within a population can be expressed with a
model.
This relationship between two variables can be represented by a linear function.
Taken together, this is called a linear model.
Least squares prediction equation
Prediction refers to the formal/mathematical aspect of a model. You put data in your model
and your model predicts an outcome.
Estimation refers to the statistical application of a model. You apply a model to sample data
in order to say something about a population. Based on sample data you can estimate a
linear model.
What we try to estimate is the line (a linear model) that best fits the data. The least squares method (OLS = Ordinary Least Squares) turns out to be the most suitable method for this.
Prediction and estimation are used interchangeably by many people but there is a
difference.
Estimating a line based on a cloud of observed data points
We want to find the line that best summarizes data in a line (linear model).
How do we do that?
We need a prediction equation: ŷ = a + bx
ŷ (y-hat) is the predicted value of y given the value of x.
Where we have to calculate the a and the b with:
b = s_xy / s_x² = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
a = ȳ − b·x̄
Intermezzo
Lower case Greek letters are used for population parameters.
Roman letters are used for sample statistics.
The μ (Greek mu) and σ (Greek lower case sigma) indicate the mean and the standard deviation of a population (these are often unknown).
ȳ and s indicate the mean and standard deviation of a sample. These are therefore variables whose value depends on the sample.
μ and σ are constants because they are related to observations of the entire population.
ȳ and s are often used to estimate the often unknown μ and σ.
ŷ (y-hat) is the predicted value of y given the value of x within a prediction equation.
Formula for the b-coefficient or slope
b = s_xy / s_x² = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
If we divide the covariance by the variance we get the b-coefficient or slope.
Deviation score x = (xᵢ − x̄)
Deviation score y = (yᵢ − ȳ)
Σ (Greek capital sigma) means that you have to add things up.
Step-by-step plan for calculating the b-coefficient (see the sketch below):
1. Calculate the means of x and y
2. Calculate all the individual deviations (deviation scores) for x and y
3. Square each deviation score of x
4. Multiply each deviation score of x by the corresponding deviation score of y
5. Calculate the sum of the squared deviation scores of x
6. Calculate the sum of the products of the deviation scores of x and y
7. Divide the sum of the products by the sum of the squared deviation scores of x
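A sketch that follows this step-by-step plan literally in Python (the data are again the made-up example scores, not data from the lecture):

```python
import numpy as np

x = np.array([2, 4, 5, 7, 8, 10], dtype=float)
y = np.array([55, 60, 62, 70, 74, 80], dtype=float)

# Step 1: means of x and y
x_mean, y_mean = x.mean(), y.mean()

# Step 2: deviation scores for x and y
dx = x - x_mean
dy = y - y_mean

# Steps 3 and 5: sum of the squared deviation scores of x
ss_x = np.sum(dx ** 2)

# Steps 4 and 6: sum of the products of the deviation scores of x and y
sp_xy = np.sum(dx * dy)

# Step 7: slope b = sum of products / sum of squared deviations of x
b = sp_xy / ss_x
# Intercept a = y-bar - b * x-bar
a = y_mean - b * x_mean

print(a, b)  # the least squares prediction equation is y-hat = a + b*x
```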
Beware of outliers
An outlier is an extreme value which can have a strong influence on the slope of the
regression line.
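A quick sketch (same made-up data, plus one invented extreme observation) showing how a single outlier can change the slope of the least squares line:

```python
import numpy as np

x = np.array([2, 4, 5, 7, 8, 10], dtype=float)
y = np.array([55, 60, 62, 70, 74, 80], dtype=float)

def slope(x, y):
    # b = sum of products of deviation scores / sum of squared deviations of x
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sum(dx ** 2)

print(slope(x, y))           # slope without the outlier (clearly positive)

# Add one extreme observation (an outlier) and recompute
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)   # a very low y for a very high x
print(slope(x_out, y_out))   # the slope drops sharply and even turns negative here
```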
The prediction equation has the least squares property
Why is that useful/relevant?
You want the line ŷ = a + bx that best fits our observed cloud of data points.
Therefore you want the smallest sum of squared errors (SSE).
The SSE is a measure of the discrepancy between the line ŷ = a + bx and the cloud of observed data points.
Properties:
- The sum of the residuals is zero
- The line always passes through the centre of the data: the point (x̄, ȳ)
What does 'least squares' mean, and what is the sum of the squared errors?
The line through a point cloud is a model for that point cloud, and you want that model to represent the point cloud as well as possible.
Real Titanic / Model of Titanic
So, you go look for the best matching/fitting line to the point cloud.
But which line is that? It is the line for which the distances between the predicted values of y and the observed values of y are the smallest. That difference is called the prediction error (residual).
Point cloud with regression line and residuals -> the most appropriate line is the line where
the sum of the squared residuals is the smallest.
The prediction equation has the least squares property
- The prediction errors are called residuals: