STAT0021 notes
Lecture 1: Summary statistics
Make a cheat sheet with everything you need to know.
Understanding uncertainty
Draw conclusions for a population, but only have data from a sample
P(Heads) or P(Tails) is less clear-cut because coin flips don’t have a defined population
A sample may be representative of stats students at UCL but not of students elsewhere in the world.
The number calculated from the sample is not the population number.
Random sample – depends on what the population is
Categorical data (hair colour)
Different categories that can be selected.
Drop down menu choice
Mode helps to tell what is the most frequent category.
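A minimal sketch of finding the mode of categorical data. The hair-colour responses are made up for illustration and are not from the lecture.

```python
from collections import Counter

# Hypothetical hair-colour responses (made up for illustration).
hair_colours = ["brown", "black", "brown", "blonde", "brown", "black", "red"]

counts = Counter(hair_colours)

# most_common(1) returns a list holding the single most frequent
# (category, count) pair.
mode_category, mode_count = counts.most_common(1)[0]
print(mode_category, mode_count)
```

The mean or median would make no sense here because the categories have no numeric value or order.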
Ordinal data (months)
Categories with a clear ranking, such as birth month.
Ordered in terms of first to last.
Not quite quantitative
Discrete data (shoes)
Whole numbers where the quantity is meaningful.
Counts: how many of something do I have?
Boxplot for pairs of shoes captures many useful summary stats for quantitative data
Calculating summary stats
Put data in order
Median – middle value
Lower and upper quartile – ¼ and ¾
Interquartile range – difference between upper and lower quartile
To find lower quartile – take the number of observations in the sample (n), divide by 4, and
round up if not an integer. The lower quartile is the value at that position in the ordered data.
Reasonable rule of thumb: use the box plot to flag outliers if there is no other way to look at
the data.
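The steps above can be sketched in Python. The shoe counts are hypothetical. Three things are assumptions rather than lecture content: the upper quartile uses the same rule with 3n/4, an integer position is used as-is, and outliers are flagged with the common 1.5 × IQR box plot convention.

```python
def median(ordered):
    """Middle value of already-ordered data (mean of the two middle values if n is even)."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartile(ordered, k):
    """Value at position k*n/4, rounded up, counting from 1 (k=1 lower, k=3 upper)."""
    n = len(ordered)
    position = (k * n + 3) // 4  # integer ceiling of k*n/4
    return ordered[position - 1]


# Hypothetical pairs-of-shoes data (made up for illustration).
shoes = [3, 5, 2, 8, 4, 6, 10, 5, 7, 4, 30]

ordered = sorted(shoes)          # step 1: put data in order
med = median(ordered)
lower_q = quartile(ordered, 1)
upper_q = quartile(ordered, 3)
iqr = upper_q - lower_q

# Assumed convention: points more than 1.5 * IQR beyond the quartiles are outliers.
low_fence = lower_q - 1.5 * iqr
high_fence = upper_q + 1.5 * iqr
outliers = [x for x in ordered if x < low_fence or x > high_fence]

print(med, lower_q, upper_q, iqr, outliers)
```

Software packages use several different quartile conventions, so their answers can differ slightly from this hand rule.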
Continuous data
Frequency histogram – group the values into ranges (bins) and plot the count in each
Different from a bar chart, where the categories are not grouped into ranges.
Density histogram – sum of the column areas is 1. Bin widths can differ, which changes how
the heights look: height = proportion of data in the bin / bin width.
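A small sketch of how density histogram heights could be computed when the bin widths differ. The data values and bin edges are invented for illustration.

```python
def density_heights(values, edges):
    """Height of each bin = (count in bin) / (n * bin width).

    Bins are [left, right), except the last bin, which also includes its right edge.
    """
    n = len(values)
    heights = []
    last_index = len(edges) - 2
    for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        count = sum(
            1 for v in values
            if left <= v < right or (i == last_index and v == right)
        )
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical continuous data and unequal-width bins (made up for illustration).
values = [1, 2, 2, 3, 4, 6, 7, 9, 12, 18]
edges = [0, 5, 10, 20]

heights = density_heights(values, edges)
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
areas = [h * w for h, w in zip(heights, widths)]

print(heights)
print(sum(areas))
```

The last bin is twice as wide as the others, so its column is drawn lower than its count alone would suggest. The areas, not the heights, carry the proportions.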
Data set is skewed to the left when the left tail is longer than the right.
Mean
Measures of average are often measures of location
How spread out are the data
Variance – find the mean of the data set, measure how far each data point is from the mean,
square those distances, and average the squares
For a data set where all numbers are 5, the variance is 0.
Use the denominator (n-1) rather than n because the population mean is unknown and has to be
estimated from this sample.
Standard deviation is the square root of the variance.
Around 95% of the data lie within 2 standard deviations of the mean (rule of thumb for roughly
bell-shaped data)
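A minimal sketch of the sample variance and standard deviation with the (n-1) denominator. The second data set is made up for illustration; the all-5s data set is the one from the notes.

```python
import math


def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_sd(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))


all_fives = [5, 5, 5, 5]
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean is 5

var_fives = sample_variance(all_fives)
var_data = sample_variance(data)
sd_data = sample_sd(data)

print(var_fives)
print(var_data, sd_data)
```

For `data` the squared distances from the mean add up to 32, so the variance is 32/7. Python's standard library `statistics.variance` uses the same (n-1) denominator.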
Lecture 2: Elementary Probability Theory
Build a probabilistic model for the data and use it to draw conclusions about the question.
Level of objectivity.
How much room for randomness is there in the final model you arrive at? In what ways can the
number be influenced by how you collected the data?
Basic rules of probability
An event is any outcome or set of outcomes, e.g. the event that the coin lands heads.
Any probability has to be between 0 and 1
Mutually exclusive – cannot occur together.
Events are drawn as regions inside a rectangle; the area of each region represents the
probability of that event. Mutually exclusive events don’t overlap in the example. Total area is 1.
For overlapping events, P(A or B) = P(A) + P(B) - P(A and B). Must subtract the middle area
because P(A) + P(B) includes the overlap twice. Subtract it once so that the middle is counted
once instead of twice.
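The addition rule can be checked by counting. This sketch assumes a standard 52-card deck and uses "king" and "red card" as the two overlapping events; the helper names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event):
    """Probability of an event = matching cards / all cards (each card equally likely)."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))


def is_king(card):
    return card[0] == "K"


def is_red(card):
    return card[1] in ("hearts", "diamonds")


p_king = prob(is_king)
p_red = prob(is_red)
p_both = prob(lambda card: is_king(card) and is_red(card))    # the overlap
p_either = prob(lambda card: is_king(card) or is_red(card))   # counted directly

# Addition rule: subtract the overlap once so it is not counted twice.
p_rule = p_king + p_red - p_both

print(p_either, p_rule)
```

The two red kings sit in both events, so adding P(king) and P(red) without the subtraction would count them twice.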
Conditional probability
Includes ‘if’: the probability of A if (given that) B has happened, written P(A | B).
Area of the entire rectangle is 1. B has happened, so what is the probability of A? The B
circle now becomes the whole space (area 1). Take the area of the overlap of A and B and
divide by the area of B: P(A | B) = P(A and B) / P(B).
In general, swapping the two events in a conditional changes the probability: P(red card | king)
differs from P(king | red card).
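A short sketch of the card example, assuming a standard 52-card deck. Conditioning on an event is done by shrinking the set of cards to those where the event has happened, then counting within it. The function and variable names are invented for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs


def prob(event, given=None):
    """P(event | given): restrict to cards where `given` holds, then count `event`."""
    space = DECK if given is None else [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in space if event(card)), len(space))


def is_king(card):
    return card[0] == "K"


def is_red(card):
    return card[1] in ("hearts", "diamonds")


p_red_given_king = prob(is_red, given=is_king)   # 2 red kings out of 4 kings
p_king_given_red = prob(is_king, given=is_red)   # 2 red kings out of 26 red cards

print(p_red_given_king, p_king_given_red)
```

Both conditionals share the same overlap (the two red kings) but divide by different totals, which is why they differ.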
Bayes’ theorem
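For reference, the standard statement of the theorem, which links the two conditionals. The worked line reuses the card example and assumes a standard 52-card deck (4 kings, 26 red cards, 2 red kings).

```latex
\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0
\]
\[
P(\text{king} \mid \text{red})
  = \frac{P(\text{red} \mid \text{king})\, P(\text{king})}{P(\text{red})}
  = \frac{\tfrac{1}{2} \cdot \tfrac{1}{13}}{\tfrac{1}{2}}
  = \frac{1}{13}
\]
```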
Independent events
Knowing that one event has happened does not influence your guess about the other.
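One way to check independence by counting: A and B are independent when P(A and B) = P(A) × P(B). The two-dice example and the function names are assumptions chosen for illustration, not from the lecture.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
ROLLS = list(product(range(1, 7), repeat=2))


def prob(event):
    return Fraction(sum(1 for roll in ROLLS if event(roll)), len(ROLLS))


def first_is_six(roll):
    return roll[0] == 6


def sum_is_seven(roll):
    return roll[0] + roll[1] == 7


def sum_is_twelve(roll):
    return roll[0] + roll[1] == 12


def independent(a, b):
    """True when P(a and b) equals P(a) * P(b)."""
    p_both = prob(lambda roll: a(roll) and b(roll))
    return p_both == prob(a) * prob(b)


six_and_seven = independent(first_is_six, sum_is_seven)
six_and_twelve = independent(first_is_six, sum_is_twelve)

print(six_and_seven, six_and_twelve)
```

Learning that the first die is a six leaves the chance of a total of seven at 1/6, so those events are independent. It raises the chance of a total of twelve from 1/36 to 1/6, so those are not.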
Lecture 3: Discrete Probability (18/1)
Research Q – collecting data to answer the question