1. Gathering and exploring data
1.1. Using data to answer statistical questions
Statistical problem solving is an investigative process that involves 4 components:
- Formulate a statistical question
- Collect data
- Analyse data
- Interpret results
3 main components of statistics for answering a statistical question:
- Design = stating the goals and/or statistical question of interest and planning how to obtain
data that will address them
- Description = summarizing and analysing the data that are obtained
- Inference = making decisions and predictions based on the data for answering the statistical
question
Probability = framework for quantifying how likely various possible outcomes are
1.2. Sample versus population
Subjects = the entities measured in a study
Population = total set of all the subjects of interest
Sample = subset of the population for whom we (plan to) have data
Descriptive statistics refers to methods for summarizing the collected data. The summaries usually
consist of graphs and numbers such as averages and percentages.
Inferential statistics refers to methods of making decisions or predictions about a population, based
on data obtained from a sample of that population.
- An important aspect of this involves reporting the likely precision of a prediction: how close
is the sample value to the true value of the population? This is expressed by the margin of error.
Parameter = numerical summary of the population
Statistic = numerical summary of a sample taken from the population
Random sampling = every subject in the population has the same chance of being included in the
sample
- Allows us to make powerful inferences about populations
Randomness is also crucial to performing experiments well (randomization)
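The definitions of population, sample, parameter, statistic and random sampling can be illustrated with a minimal Python sketch. The population below is made up for illustration (10,000 hypothetical ages), and the names `population`, `sample`, `population_mean` and `sample_mean` are assumptions, not part of the course material.

```python
import random
import statistics

# Hypothetical population: ages of 10,000 subjects (made-up numbers).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.randint(18, 90) for _ in range(10_000)]

# Parameter = numerical summary of the whole population.
population_mean = statistics.mean(population)

# Simple random sample: every subject has the same chance of being included.
sample = rng.sample(population, k=200)

# Statistic = numerical summary of the sample.
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

In practice the parameter is usually unknown; the sketch computes it only to show that the statistic from a random sample tends to land close to it.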
Margin of error = measure of the expected variability from one random sample to the next random
sample
‘Very likely’ typically means 95 times out of 100 (95% confidence interval)
Approximate margin of error = $\frac{1}{\sqrt{n}} \times 100\%$, where n is the sample size
Random variation is roughly the size of the margin of error (above formula)
- The difference expected through ordinary random variation is smaller with larger samples
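The approximate margin of error formula can be evaluated directly. The sketch below is only an illustration of that rule of thumb; the function name `approximate_margin_of_error` and the chosen sample sizes are assumptions.

```python
import math


def approximate_margin_of_error(n: int) -> float:
    """Approximate margin of error, in percentage points, for a random sample of size n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return 100 / math.sqrt(n)


# Larger samples give a smaller margin of error.
for n in (100, 400, 2500):
    print(f"n = {n:>4}: about ±{approximate_margin_of_error(n):.1f}%")
```

A sample of 100 gives roughly ±10%, a sample of 400 roughly ±5%, and a sample of 2500 roughly ±2%: quadrupling the sample size halves the margin of error.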
Statistically significant = the difference between the results of the treatment and control groups is so
large that it would be rare to see such a difference by ordinary random variation
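One way to get a feel for "ordinary random variation" is to simulate it. The sketch below is a hypothetical illustration, not a method taken from these notes: it assumes a treatment group and a control group of 100 subjects each with the same underlying success rate of 50%, so any difference between them is due to chance alone. All names and numbers are assumptions.

```python
import random


def simulate_differences(n_per_group, success_rate, n_simulations, seed=0):
    """Simulate absolute differences in success proportion between two
    groups that share the same underlying success rate."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_simulations):
        treatment = sum(rng.random() < success_rate for _ in range(n_per_group))
        control = sum(rng.random() < success_rate for _ in range(n_per_group))
        differences.append(abs(treatment - control) / n_per_group)
    return differences


def fraction_at_least(differences, observed):
    """Fraction of simulated differences at least as large as the observed one."""
    return sum(d >= observed for d in differences) / len(differences)


diffs = simulate_differences(n_per_group=100, success_rate=0.5, n_simulations=2000)
small = fraction_at_least(diffs, 0.05)  # a 5 percentage point difference
large = fraction_at_least(diffs, 0.30)  # a 30 percentage point difference

print(f"Chance alone gives a difference >= 5 points in {small:.1%} of simulations")
print(f"Chance alone gives a difference >= 30 points in {large:.1%} of simulations")
```

Under these assumptions a 5 point difference shows up often by chance, whereas a 30 point difference is rare; a difference of that size would be a candidate for being called statistically significant.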
1.3. Using calculators and computers
To make statistical analysis easier, large sets of data are organised in a data file
Two basic rules for constructing a data file:
- Any one row contains measurements for a particular subject
- Any one column contains measurements for a particular characteristic
Database = archived collection of data files
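The two rules for a data file (one row per subject, one column per characteristic) can be sketched in Python. The file contents and the column names `subject`, `age`, `gender` and `hours_of_sleep` are made up for illustration; an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical data file: each row is one subject,
# each column is one characteristic measured on that subject.
data_file = io.StringIO(
    "subject,age,gender,hours_of_sleep\n"
    "1,21,F,7.5\n"
    "2,34,M,6.0\n"
    "3,27,F,8.0\n"
)

rows = list(csv.DictReader(data_file))

# Reading across a row gives all measurements for one subject.
print(rows[0])

# Reading down a column gives one characteristic for all subjects.
ages = [int(row["age"]) for row in rows]
print(ages)
```

Keeping to this layout means that any single characteristic can be pulled out as one list of values, which is what graphs and numerical summaries operate on.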
2. Exploring data with graphs and numerical summaries
2.1. Different types of data
Variable = any characteristic observed in a study
- A variable is called quantitative if observations on it take numerical values that represent
different magnitudes of the variable
o Key features to describe: center and variability (AKA spread)
o A quantitative variable is either:
  Discrete = its possible values form a set of separate numbers
  Continuous = its possible values form an interval (an infinite continuum of possible values)
- A variable is called categorical if each observation belongs to one of a set of distinct
categories.
o Key feature to describe: the relative number of observations in the various categories
Observations = data values that we observe for a variable
The distribution of a variable describes how the observations fall (are distributed) across the range
of possible values
- Can be displayed by a graph or a table
- Features to look for in distribution of categorical variables:
o Modal category = the category with the largest frequency
o And more generally how frequently each category was observed
- Features to look for in distribution of quantitative variables:
o Shape = do observations cluster in certain intervals and/or are they spread thin in
others?
o Center = where does a typical observation fall?
o Variability = how tightly are the observations clustering around a center?
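The features listed above can be computed with Python's standard library. The sketch below uses made-up observations; the variable names are assumptions, and the mean and standard deviation are just one possible choice of measures for center and variability.

```python
from collections import Counter
import statistics

# Categorical variable (hypothetical observations): favourite colour.
favourite_colours = ["blue", "red", "blue", "green", "blue", "red"]
counts = Counter(favourite_colours)

# Modal category = the category with the largest frequency.
modal_category, modal_count = counts.most_common(1)[0]

# Relative number of observations in each category.
proportions = {colour: count / len(favourite_colours) for colour, count in counts.items()}

print(f"Modal category: {modal_category} ({modal_count} observations)")
print(f"Proportions: {proportions}")

# Quantitative variable (hypothetical observations): hours of sleep.
hours_of_sleep = [6, 7, 7, 8, 5, 9, 7]

center = statistics.mean(hours_of_sleep)        # where a typical observation falls
variability = statistics.stdev(hours_of_sleep)  # how tightly observations cluster around it

print(f"Center (mean): {center}")
print(f"Variability (standard deviation): {variability:.2f}")
```

Shape is easiest to judge from a graph, so it is not computed here.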