Outline
1. Recap of Foundational Concepts in Statistics
2. The Problem We Want to Solve
3. Example Problem
4. Correlation Analysis
5. Simple Linear Regression
Recap of Foundational Concepts in Statistics
Population vs Sample
• A numerical descriptor of a population is called a parameter, whereas a
numerical descriptor of a sample is called a statistic.
• We use sample statistics to approximate population parameters.
Statistical Inference
• Statistical inference is the attempt to reach a conclusion concerning a complete
set of observations (the population) using only a subset thereof (a sample).
• The sample needs to be representative of the population for the inference to
be accurate.
• We use sampling distributions to make inferences.
• Statistical inference is conducted with the help of hypothesis testing.
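The idea of a sampling distribution can be illustrated with a small simulation. The sketch below is illustrative only: the population is simulated (hypothetical values, not data from the lecture), and the sample size of 25 and the 2,000 repetitions are arbitrary choices. It draws repeated samples, computes the mean of each, and compares the spread of those means with σ/√n.

```python
import random
import statistics

# Hypothetical population of 10,000 simulated values (not lecture data).
rng = random.Random(42)
population = [rng.gauss(60, 12) for _ in range(10_000)]

# Population parameters (in practice these are usually unknown).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)


def sample_mean(n):
    """Draw a simple random sample of size n and return its mean (a statistic)."""
    return statistics.fmean(rng.sample(population, n))


# Approximate the sampling distribution of the mean by repeated sampling.
n = 25
means = [sample_mean(n) for _ in range(2_000)]

centre = statistics.fmean(means)   # should sit close to mu
spread = statistics.stdev(means)   # should sit close to sigma / sqrt(n)

print(f"population mean (parameter): {mu:.2f}")
print(f"mean of the sample means:    {centre:.2f}")
print(f"std. dev. of sample means:   {spread:.2f}")
print(f"sigma / sqrt(n):             {sigma / n ** 0.5:.2f}")
```

Each individual sample mean is a statistic that approximates the parameter; the collection of them shows how much such a statistic varies from sample to sample.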
Hypothesis Testing
Hypothesis testing allows us to make statements about a population from a sample of
that population. It involves the following basic steps:
Step 1: Define the null hypothesis (H0). This is the hypothesis of no statistical
significance.
Step 2: Define the alternative hypothesis (Ha). This is the hypothesis of statistical
significance.
Step 3: Define the significance level (α). This is the type one error rate (the probability
of falsely rejecting H0). Typically, α = 0.05 or α = 0.01 is sufficiently low.
Step 4: Calculate the test statistic. This is calculated differently depending on the test
being conducted.
Step 5: Find the p-value. This is the probability of getting a result as or more extreme
than the observed test statistic, assuming H0 is true. A precise p-value can be
generated using software, or an approximate one found by hand using tables.
Step 6: Make a conclusion. If the p-value is ≤ α, we reject H0 and conclude that our
result is statistically significant. Otherwise, we fail to reject H0 and conclude no
statistical significance: the sample does not provide sufficient evidence to make a
statement about the population (this is not the same as showing that H0 is true).
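The six steps can be sketched in code. The example below is a minimal illustration, not a test used in the lecture: it assumes a two-sided one-sample z-test with a known population standard deviation, and the function name `one_sample_z_test`, the marks, the hypothesised mean of 60 and the value σ = 8 are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test.

    Step 1: H0: mu = mu0
    Step 2: Ha: mu != mu0
    Step 3: alpha is the significance level (type one error rate).
    Assumes the population standard deviation sigma is known.
    """
    n = len(sample)
    # Step 4: test statistic (standardised distance of the sample mean from mu0).
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    # Step 5: two-sided p-value under H0, from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 6: reject H0 when the p-value is at most alpha.
    reject = p_value <= alpha
    return z, p_value, reject


# Hypothetical course marks for ten students (illustrative values only).
marks = [62, 71, 58, 66, 74, 69, 63, 77, 60, 70]
z, p, reject = one_sample_z_test(marks, mu0=60, sigma=8, alpha=0.05)

print(f"z = {z:.3f}, p-value = {p:.4f}")
print("reject H0" if reject else "fail to reject H0")
```

Only Step 4 changes when a different test is used; the surrounding logic (hypotheses, α, p-value, decision) stays the same.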
The problem we want to solve
Describing the relationship between two variables
• How strong is the relationship? (so we want to be able to quantify it)
• Is this observed relationship likely real or just due to chance?
• Can we explain the impact that changing one variable has on another variable?
• Can we predict the value of one variable from another variable?
Lecture example
As part of an experiment, a lecturer recorded the overall course marks and number of
lectures attended for 20 students in the course that they teach. The results of this
experiment are shown below:
Correlation analysis as a method to solve our problem
Correlation is a measure of the strength and direction of a linear relationship between
two variables.
• Correlation is bounded between -1 and 1.
• Correlation does not have a unit.
• Correlation cannot be used to predict one variable from another.
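The first two properties can be checked numerically. The sketch below uses a small hand-written helper (`pearson_r`, a hypothetical name) that computes the standard Pearson coefficient, and made-up study-hours and marks values that are not the lecture data. It shows that changing the unit of a variable leaves the coefficient unchanged, and that an exact linear relationship gives the boundary value 1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data (illustrative values only).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
marks = [52.0, 55.0, 61.0, 60.0, 68.0]

r_hours = pearson_r(hours, marks)

# Same data with hours converted to minutes: the coefficient has no unit,
# so rescaling a variable does not change it.
minutes = [h * 60 for h in hours]
r_minutes = pearson_r(minutes, marks)

# An exact increasing linear relationship reaches the upper bound of 1.
r_perfect = pearson_r(hours, [2 * h + 1 for h in hours])

# Flipping the sign of one variable flips the direction of the correlation.
r_negative = pearson_r(hours, [-m for m in marks])

print(f"r (hours)   = {r_hours:.4f}")
print(f"r (minutes) = {r_minutes:.4f}")
print(f"r (perfect) = {r_perfect:.4f}")
print(f"r (negated) = {r_negative:.4f}")
```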
Correlation coefficient
• Correlation is measured using the correlation coefficient (typically the Pearson
correlation coefficient).
• The population correlation coefficient (ρ) measures the direction and strength of
the association between two variables across the entire population.
• The sample correlation coefficient (r) is an estimate of ρ and measures the
direction and strength of the association between the two variables in a sample
of the population.
• The sample correlation coefficient is given by:
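A standard way of writing the Pearson sample correlation coefficient is shown below; the arrangement used in the lecture may differ (for example, a computational form in terms of Σxy, Σx and Σy). The symbols are assumptions of this sketch: x_i and y_i are the paired observations, x̄ and ȳ their sample means, and n the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales it by the variation in each variable separately, which is why r has no unit and lies between -1 and 1.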
Test your understanding
Calculate the correlation coefficient between X and Y for the following 3 observations:
[Figure: examples of data with different correlation coefficients]
Correlation analysis with our example