Week 2: Descriptive Statistics
Descriptive statistics vs. inferential statistics
~ Descriptive statistics:
- Statistics used to describe (sample) data without further conclusions
- Measures of central tendency: Mean, median, mode
- Measures of variation (or spread): range, IQR, variance, standard deviation
~ Inferential statistics:
- Describe data of sample in order to infer patterns in the population
- Statistical tests: t-test, χ²-test, etc.
~ Sample vs. population
- Studying the whole population is (almost always) practically impossible
- A sample is a (selected) subset of the population and thus more accessible
- Selecting a representative sample is very important
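The relation between a sample and its population can be illustrated with a small simulation. This is only a sketch: the population below is simulated, and the names `population` and `smp` as well as the chosen mean and standard deviation are assumptions made for illustration, not data from the course.

```r
# Hypothetical population: 10,000 simulated test scores (not real data)
set.seed(1)
population <- rnorm(10000, mean = 100, sd = 15)

# Draw a simple random sample of 50 cases (without replacement)
smp <- sample(population, size = 50)

mean(population)  # population mean (normally unknown in practice)
mean(smp)         # sample mean, used as an estimate of the population mean
```

With a random sample the sample mean usually lies close to the population mean. A non-representative selection (e.g., only the highest scores) would not have this property.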
(Types of) variables
~ Tabular representation of data:
- Each case is shown in a row
- Each variable is in a column
- [Table: example data set]
~ Nominal (categorical) scale: unordered categories
- Gender (frequently binary: two categories), Native language, etc.
~ Ordinal: ordered (ranked) scale, but amount of difference unclear
- Rank of English proficiency (in class), Likert scale (Rate on a scale from 1 to 5...)
~ Interval scale: numerical with meaningful difference but no true 0
- Year of birth, temperature in Celsius
~ Ratio scale: numerical with meaningful difference and true 0
- Number of questions correct, age
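The four measurement scales map onto R data types roughly as sketched below. The data frame `dat_example` and all of its values are invented for illustration; they are not the course data set.

```r
# Hypothetical example data (values invented for illustration)
dat_example <- data.frame(
  native_language = factor(c("Dutch", "German", "Dutch", "English")),  # nominal
  proficiency = factor(c("low", "high", "mid", "mid"),
                       levels = c("low", "mid", "high"),
                       ordered = TRUE),                                # ordinal
  birth_year = c(1999, 2001, 2000, 1998),                              # interval
  n_correct = c(12, 18, 15, 15)                                        # ratio
)

str(dat_example)                    # one row per case, one column per variable
table(dat_example$native_language)  # counting is meaningful for nominal data

# Ordering is meaningful for ordinal data ("low" < "high")
dat_example$proficiency[1] < dat_example$proficiency[2]

# Ratios are only meaningful on a ratio scale (18 correct is 1.5 times 12 correct)
dat_example$n_correct[2] / dat_example$n_correct[1]
```

An unordered `factor` suits nominal variables, an ordered `factor` suits ordinal variables, and plain numeric vectors hold interval and ratio variables. R does not distinguish the last two, so which computations make sense remains the analyst's decision.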
Distribution of a variable
~ Normal distribution
- [Figure: normal distribution curve with red and green shaded areas]
- Has convenient characteristics
- Completely symmetric
- Red area: (about) 80%
- Red and green area: (about) 95%
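The proportion of a normal distribution that lies within a given distance of the mean can be computed with `pnorm()`. The sketch below assumes that the shaded areas are central areas, symmetric around the mean; the helper name `central_area` is made up for this example.

```r
# Proportion of a normal distribution within k standard deviations of the mean
central_area <- function(k) pnorm(k) - pnorm(-k)

central_area(1)     # about 0.68
central_area(1.28)  # about 0.80
central_area(1.96)  # about 0.95
central_area(2)     # about 0.954

# Conversely: how many standard deviations cover a central proportion p?
qnorm(1 - (1 - 0.80) / 2)  # about 1.28
qnorm(1 - (1 - 0.95) / 2)  # about 1.96
```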
~ Frequency of values (distribution of variables shows variability)
- table(dat$english_grade)
~ Histogram (shows the frequency of all values, in groups)
- hist(dat$english_grade, xlab = "English grade", main = "")
~ Density curve (shows area proportional to the relative frequency)
- plot(density(dat$english_grade), main = "", xlab = "English grade")
- The total area under a density curve is equal to 1
- A density curve does not provide information about the frequency of a single value
o E.g., there might be no one who has scored a grade of exactly 6.1
- It only provides information about an interval
o E.g., more than 50% of the grades lie between 5.5 and 7.5
~ A distribution can also be characterized by measures of center and variation
- (Skewness measures the asymmetry of the distribution; not covered in this course)
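The points about frequencies and density curves can be checked numerically. The sketch below uses a small invented vector `grades` as a stand-in for `dat$english_grade`, and approximates the area under the estimated density with the trapezoidal rule (a choice made for this example).

```r
# Hypothetical grades (invented for illustration)
grades <- c(5.5, 6, 6, 6.5, 7, 7, 7, 7.5, 8, 8.5)

table(grades)              # absolute frequency of each distinct value
prop.table(table(grades))  # relative frequencies, which sum to 1

d <- density(grades)       # kernel density estimate

# Approximate the total area under the curve with the trapezoidal rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area                       # close to 1

# Information about an interval: proportion of grades between 5.5 and 7.5
prop_interval <- mean(grades >= 5.5 & grades <= 7.5)
prop_interval
```

The height of the density curve at a single point is not a proportion. Only areas over intervals can be read as relative frequencies.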
Measures of central tendency
~ Mode
- most frequent element (for nominal data: only meaningful measure)
- my_mode <- function(x) {
    counts <- table(x)                   # frequency of each distinct value
    names(which(counts == max(counts)))  # most frequent value(s), returned as character
  }
  my_mode(dat$english_grade)
~ Median
- when the data are sorted from small to large, it is the middle value (for an even number of values: the mean of the two middle values)
- median(dat$english_grade)
~ Mean
- arithmetic average: sum of all values divided by the number of values
- mean(dat$english_grade)
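The three measures can be compared on a small example. The vector `scores` below is invented for illustration; `my_mode()` is the helper defined in these notes. The example also shows that the mean is far more sensitive to an extreme value than the median.

```r
# Hypothetical scores (invented for illustration)
scores <- c(4, 6, 6, 7, 7, 7, 8, 9)

# Mode helper from the notes: most frequent value(s), returned as character
my_mode <- function(x) {
  counts <- table(x)
  names(which(counts == max(counts)))
}

my_mode(scores)  # "7"
median(scores)   # 7 (mean of the two middle values, 7 and 7)
mean(scores)     # 6.75

# Adding one extreme value shifts the mean much more than the median
scores_out <- c(scores, 40)
median(scores_out)  # 7
mean(scores_out)    # about 10.44
```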
Measures of variation
~ Quantiles: cutpoints that divide the sorted data into subsets of equal size
- Quartiles: three cutpoints that divide the data into four equal-sized sets
o q1 (1st quartile): cutpoint between 1st and 2nd group
o q2 (2nd quartile): cutpoint between 2nd and 3rd group (= median!)
o q3 (3rd quartile): cutpoint between 3rd and 4th group
- Percentiles: cutpoints that divide the data into one hundred equal-sized subsets
o q1 = 25th percentile
o q2 (= median) = 50th percentile
o q3 = 75th percentile
o A score at the nth percentile is higher than (about) n% of the scores
- quantile(dat$english_grade)
~ Minimum, maximum: lowest and highest value
- min(dat$english_grade); max(dat$english_grade)
~ Range: difference between maximum and minimum
- range(dat$english_grade)  # returns the minimum and the maximum
- diff(range(dat$english_grade))  # returns their difference
~ Interquartile range (IQR): q3 - q1
- IQR(dat$english_grade)
~ Box plot: visualizes the variation of a variable
- Box (IQR): q1 (bottom), median (thickest line), q3 (top)
o (In example below, q1 and median have the same value)
- Whiskers: maximum (top) and minimum (bottom) non-outlier value
- Circle(s): outliers (more than 1.5 × IQR away from the box)
- boxplot(dat$english_grade, col = "red")
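The quartiles, the IQR, and the 1.5 × IQR outlier rule can be traced by hand. The vector `grades` and the names `lower_fence` and `upper_fence` are invented for this sketch. Note that `boxplot()` bases its box on hinges, which can differ slightly from the quartiles returned by `quantile()`; for this example they coincide.

```r
# Hypothetical grades (invented for illustration)
grades <- c(5, 6, 6, 6, 7, 7, 8, 8, 9, 15)

q <- quantile(grades)   # minimum, q1, median, q3, maximum
q1 <- q[["25%"]]
q3 <- q[["75%"]]
iqr <- q3 - q1          # same value as IQR(grades)

range(grades)           # minimum and maximum
diff(range(grades))     # the range as a single number

# Values more than 1.5 * IQR below q1 or above q3 are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- grades[grades < lower_fence | grades > upper_fence]
outliers

# boxplot.stats() reports the outliers that boxplot() would draw as circles
boxplot.stats(grades)$out
```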
~ Deviation: difference between an individual value and the mean (xᵢ − x̄)
~ Variance: average squared deviation
- Squared in order to make negative differences positive
- Population variance (with μ = population mean, N = population size):
- $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$
- As the sample mean (x̄) is only an approximation of the population mean (μ), the sample variance formula divides by n − 1 instead of n (which results in a slightly higher variance):
- $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- var(dat$english_grade)
~ Standard deviation: square root of the variance
- $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- sd(dat$english_grade)
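Computing the variance step by step shows what `var()` and `sd()` do. The vector `x` is invented for illustration. The population formula divides the sum of squared deviations by n, the sample formula by n − 1.

```r
# Hypothetical scores (invented for illustration)
x <- c(4, 6, 7, 7, 8, 10)
n <- length(x)

deviations <- x - mean(x)                # deviation of each value from the mean
pop_var <- sum(deviations^2) / n         # divide by n (population formula)
samp_var <- sum(deviations^2) / (n - 1)  # divide by n - 1 (sample formula)

samp_var        # 4
var(x)          # same value: var() uses the n - 1 denominator
sqrt(samp_var)  # 2
sd(x)           # same value: sd() is the square root of var()
pop_var         # about 3.33, slightly smaller than the sample variance
```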
Standardized scores
~ Standardization facilitates interpretation
- E.g., how to interpret: "Emma's score is 112" and "Tom's score is 105"
~ Interpretation should be done with respect to mean μ and standard deviation σ
- Raw scores can be transformed to standardized scores (z-scores or z-values)
- $z = \frac{x - \mu}{\sigma}$
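A z-score expresses a raw score as its distance from the mean in units of standard deviations, z = (x − μ) / σ. The sketch below applies this to the two scores mentioned above. The means and standard deviations of the two tests are NOT given in these notes; the values used here are assumptions chosen purely for illustration, as are the helper name `z_score` and the vector `scores`.

```r
# z-score: distance from the mean in units of standard deviations
z_score <- function(x, mu, sigma) (x - mu) / sigma

# Hypothetical test parameters (NOT from the notes; chosen for illustration)
z_emma <- z_score(112, mu = 100, sigma = 15)  # 0.8
z_tom  <- z_score(105, mu = 95, sigma = 5)    # 2

# Standardizing a whole vector with its own mean and standard deviation
scores <- c(90, 100, 105, 112, 118)
z <- (scores - mean(scores)) / sd(scores)
mean(z)  # approximately 0
sd(z)    # 1
```

Under these assumed parameters, Tom's lower raw score is the more exceptional one, because it lies two standard deviations above the mean of his test. Raw scores from different tests are only comparable after such a transformation.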