STATS 2244 FINAL STUDY
REVIEW
December 10 2022
Summarizing and Exploring Data
Data Stage: collect, monitor the quality of, and conduct a preliminary exploration of the data
Does the data collection method need “tweaking” to ensure quality (monitoring)?
Are there patterns, trends, or associations apparent in the data?
Are there any outliers or missing values? If so, how will you handle them?
Selecting a Summary
How many variables do you have?
o Univariate: 1 variable
Will describe the distribution of this one variable
o Bivariate: 2 variables
o Multivariate: three or more variables
Can explore relationships between variables
What types of variables do you have?
o Explanatory / response
o Quantitative / categorical
What characteristic(s) or relationship do you want to emphasize?
o Parameter, Measures of Spread, Relationship
Measures of Spread
Measures of Spread: characterize the variability in a distribution
Range
Range = maximum – minimum
Inflated by outliers and skew
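A minimal sketch of how a single outlier inflates the range. The scores are made-up illustration values, and `value_range` is a hypothetical helper name, not something from the course.

```python
# Hypothetical exam scores, invented for illustration only.
scores = [62, 68, 71, 74, 75, 79, 83]
scores_with_outlier = scores + [12]  # one unusually low score


def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


print(value_range(scores))               # 83 - 62 = 21
print(value_range(scores_with_outlier))  # 83 - 12 = 71
```

One extreme value more than triples the range here, even though the other seven scores are unchanged.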
5-Number Summary
5-number summary splits a distribution into 4 quarters
Minimum, Q1, x̃, Q3, maximum
Q1 = 25th percentile
x̃ = median
o Centermost value: order the dataset smallest→largest then take the middle value
Q3 = 75th percentile
Interquartile Range (IQR): Q3-Q1
IQR = Q3 – Q1
Q3 = third quartile = 75th percentile
Q1 = first quartile = 25th percentile
IQR contains the middle 50% of the data: the 25% just below the median and the 25% just above it
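The 5-number summary and IQR can be sketched with Python's standard `statistics` module. The data values and the helper name `five_number_summary` are assumptions for illustration. Several conventions exist for interpolating quartiles, so other software (or a by-hand method) may give slightly different Q1 and Q3 for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole distribution, so the
    # quartiles never fall outside the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


data = [1, 3, 5, 7, 9, 11, 13]  # hypothetical values
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # IQR = Q3 - Q1

print(minimum, q1, median, q3, maximum)  # 1 4.0 7.0 10.0 13
print("IQR =", iqr)                      # IQR = 6.0
```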
Percentiles
Percentile: a value below which a particular percentage of the distribution lies
Quartiles are percentiles which divide the distribution into 4 equal size sections
o Q1 = first quartile = 25th percentile = 25% of distribution lies below this value
o Q2 = second quartile = 50th percentile = 50% of distribution lies below this value
o Q3 = third quartile = 75th percentile = 75% of distribution lies below this value
If a value is at the 90th percentile, 90% of the distribution lies below it, so it is in the top 10% of the distribution
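As a rough illustration of the definition, the percentage of a distribution lying below a value can be counted directly. The helper `percent_below` and the data (the integers 1 to 100) are assumptions chosen to keep the arithmetic obvious.

```python
def percent_below(values, x):
    """Percentage of the values that lie strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


values = list(range(1, 101))  # hypothetical distribution: 1, 2, ..., 100

# 90 of the 100 values (1 through 90) lie below 91,
# so 91 sits at the 90th percentile, in the top 10%.
print(percent_below(values, 91))  # 90.0
```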
Variance
Takes into account all the data we have
Sample variance
Sample variance is a statistic
The larger the s², the more variable the data (the wider the spread)
Sums the squared differences from the sample mean (x̄) and divides by n – 1: s² = Σ(xᵢ – x̄)² / (n – 1)
R’s var() uses this equation to calculate variance (it assumes we’re working with
sample data, not population data)
Population variance
Population variance is a parameter
The larger the σ², the more variable the data (the wider the spread)
Calculates the average of the squared differences from the population mean (µ): σ² = Σ(xᵢ – µ)² / N
o Subtracts the population mean from every value in the distribution
o Squares the differences (between values and mean) to get rid of the negatives
o Divides by the total number of values in the distribution (N)
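The steps above can be sketched in Python to show the one difference between the two formulas: the population variance divides by N, while the sample variance divides by n – 1. The data values and function names are assumptions for illustration. The standard-library `statistics.pvariance` and `statistics.variance` are printed alongside as a cross-check.

```python
import statistics


def population_variance(values):
    """Average squared difference from the mean: divide by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; mean is 5

# Squared differences from the mean sum to 32.
print(population_variance(data))  # 32 / 8 = 4.0
print(sample_variance(data))      # 32 / 7, about 4.571

# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```

Because n – 1 is smaller than N, the sample variance is always a little larger than the population formula would give for the same numbers.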
Standard Deviation
Square root of the sample variance
o Gets rid of the squaring and returns variance to original units
Suitable for distributions without extreme outliers and/or skew
o Extreme outliers can make the data seem widely spread when the large standard deviation is really just due to the outliers
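A short sketch of both points: the standard deviation is the square root of the sample variance, and one extreme value can inflate it considerably. The data are made up, and the outlier value of 50 is an arbitrary assumption.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
data_with_outlier = data + [50]  # same values plus one extreme outlier

# Standard deviation = square root of the sample variance.
sd = math.sqrt(statistics.variance(data))
sd_with_outlier = statistics.stdev(data_with_outlier)

print(round(sd, 3))               # about 2.138
print(round(sd_with_outlier, 3))  # far larger, driven by the single outlier
```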
Measures of Center
Measures of center: tell us the “typical” value of a distribution
Mean
Mean (average): add up all the values and divide by the total number of values
Affected by outliers
Median
Median: arrange values smallest → largest and take centermost value
50th percentile: 50% of distribution below, 50% of the distribution above
Resistant to outliers / extreme values (changes little, if at all, when they are present)
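A small sketch comparing how the mean and the median respond to an outlier, using Python's `statistics` module. The income figures (in thousands) are invented for illustration.

```python
import statistics

incomes = [30, 32, 35, 38, 40]        # hypothetical incomes, in thousands
incomes_with_outlier = incomes + [500]  # add one extreme value

print(statistics.mean(incomes), statistics.median(incomes))  # 35 35
print(statistics.mean(incomes_with_outlier))    # 675 / 6 = 112.5
print(statistics.median(incomes_with_outlier))  # (35 + 38) / 2 = 36.5
```

The outlier drags the mean from 35 to 112.5, while the median only moves from 35 to 36.5.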
Describing Shape of a Distribution
Can describe the shape of a distribution when it is represented as a histogram
o Histogram: shows frequency distribution for univariate quantitative data
All values for variable on x-axis; frequency on y-axis
Symmetry
Symmetry: the degree to which the distribution looks like a mirror image when split down the
center
Opposite of symmetric is skewed
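A rough sketch of a text histogram and a strict mirror-image check, assuming integer-valued data. The helper names and both datasets are hypothetical. Real data are rarely exactly symmetric, so in practice symmetry is judged by eye as a matter of degree rather than with an exact test like this one.

```python
from collections import Counter


def bin_counts(values):
    """Frequency of each integer from the minimum to the maximum value."""
    counts = Counter(values)  # missing values count as 0
    return [counts[v] for v in range(min(values), max(values) + 1)]


def print_histogram(values):
    """Print one row per value: the value, then one '*' per occurrence."""
    for v, count in zip(range(min(values), max(values) + 1), bin_counts(values)):
        print(f"{v:>2} | {'*' * count}")


def looks_symmetric(values):
    """True if the frequencies read the same left-to-right and right-to-left."""
    counts = bin_counts(values)
    return counts == counts[::-1]


symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # hypothetical, mirror image about 3
skewed = [1, 1, 1, 2, 2, 3, 4, 7, 9]     # hypothetical, long tail to the right

print_histogram(symmetric)
print(looks_symmetric(symmetric))  # True
print(looks_symmetric(skewed))     # False
```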