Measures of central tendency
● Mean
○ Pool all the money (for each firm) and then divide it equally among the customers
○ If I were to assume that all customers are spending the same amount of money, it would be the mean
○ Make the data uniform --> find the mean
○ Mean decreases when the total spend is the same but divided among more people
● Median
○ Middle value when data is arranged from smallest to largest
○ What does median mean?
■ Half of the customers in the sample spend #80 (the median) or less
■ Divides the dataset into 2 equal halves
○ Use the median when there are large outliers in the data
○ Outliers affect the total, which may inflate or deflate the mean
● Mode
○ The most frequently occurring value
● Variance and standard deviation
○ Measure how spread out the data is around the mean
○ Gives a sense of how representative the mean is
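○ Population variance - average squared deviation from the mean, dividing by the number of values N
○ Sample variance - sum of squared deviations from the sample mean, dividing by n - 1
○ Standard deviation is the square root of the variance, so it is in the same units as the underlying data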
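The measures above can be sketched with Python's built-in `statistics` module. This is only an illustration: the `spend` values are made up, and the variable names are assumptions rather than anything from the notes.

```python
import statistics

# Hypothetical customer spend values (made up for illustration).
spend = [60, 70, 80, 80, 110]

mean_spend = statistics.mean(spend)      # pooled total (400) divided equally among 5 customers
median_spend = statistics.median(spend)  # middle value of the sorted data
mode_spend = statistics.mode(spend)      # most frequently occurring value

# Same total spend divided among more people gives a lower mean.
mean_more_people = sum(spend) / 8

# One large outlier inflates the mean but leaves the median where it was.
with_outlier = spend + [1000]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

# Population variance divides by n; sample variance divides by n - 1.
pop_var = statistics.pvariance(spend)
samp_var = statistics.variance(spend)

# Standard deviation is the square root of the variance (same units as the data).
samp_sd = statistics.stdev(spend)

print(mean_spend, median_spend, mode_spend)
print(mean_out, median_out)
print(pop_var, samp_var, samp_sd)
```

With the outlier added, the mean moves far from the typical customer while the median stays put, which is why the median is preferred when outliers are large.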
Data types
Variable
● Numerical - any number you can do arithmetic operations on (addition, subtraction, etc.)
○ Discrete - whole numbers (e.g. # of students, children, books, pets)
○ Continuous - can take any value within a range (e.g. height, income, age)
● Categorical - data that represents categories, even when it is presented numerically
○ Binary - 2 categories (e.g. yes/no, 1/0)
○ Nominal - sequence does not matter (e.g. gender, country of origin, eye color)
○ Ordinal - there is an order and hierarchy (e.g. customer satisfaction, rating, education levels)
■ One value ranks higher than another
■ Ordinal --> Order
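As a rough illustration of the nominal vs. ordinal distinction, the sketch below counts nominal values and sorts ordinal ones. The survey responses and the `rank` mapping are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration).
eye_color = ["brown", "blue", "brown", "green"]    # nominal: no order
satisfaction = ["low", "high", "medium", "high"]   # ordinal: ordered levels

# Nominal data: counting categories is meaningful, ordering them is not.
color_counts = Counter(eye_color)

# Ordinal data: an explicit rank makes the order usable for sorting and comparison.
rank = {"low": 0, "medium": 1, "high": 2}
sorted_satisfaction = sorted(satisfaction, key=rank.get)
highest = max(satisfaction, key=rank.get)

print(color_counts.most_common(1))
print(sorted_satisfaction, highest)
```

The rank numbers only encode order. The gap between "low" and "medium" is not assumed to equal the gap between "medium" and "high".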
Measurement scales
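● Interval - equal intervals between values, but no true zero (proper zero)
○ E.g. time of day: zero doesn't mean the absence of something here
■ There's no 0 o'clock on a 12-hour clock, and 00:00 on a 24-hour clock is just a label for midnight
○ Differences (addition, subtraction) are meaningful on interval scales, but ratios (multiplication, division) are not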
● Ratio - the interval between each value is equal, and there is a true zero
○ True zero - zero means there is no value
■ Zero is the absence of something
■ Makes all arithmetic operations meaningful, including ratios
○ E.g. income, age
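A small sketch of the difference between the two scales, using time of day for interval and income for ratio. The specific times and incomes are made up. Python's `datetime` supports subtraction but not division, which mirrors the point.

```python
from datetime import datetime

# Interval scale: time of day (hypothetical values on an arbitrary date).
t1 = datetime(2024, 1, 1, 2, 0)   # 02:00
t2 = datetime(2024, 1, 1, 4, 0)   # 04:00

# Differences are meaningful: the two times are 2 hours apart.
gap_hours = (t2 - t1).total_seconds() / 3600

# Ratios are not: "04:00 is twice 02:00" has no meaning, and datetime rejects it.
try:
    t2 / t1
    ratio_defined = True
except TypeError:
    ratio_defined = False

# Ratio scale: income has a true zero, so ratios are meaningful.
income_a, income_b = 2000, 4000
income_ratio = income_b / income_a   # b earns twice as much as a

print(gap_hours, ratio_defined, income_ratio)
```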
Types of dataset
1. Cross-sectional data - many observations (units) at one point in time
2. Time-series data - one unit observed across many points in time
3. Panel data - many units observed at many points in time
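The three dataset types can be told apart by counting distinct units and distinct time periods. The sketch below assumes rows with hypothetical `firm` and `year` fields, and `classify` is an illustrative helper, not a standard function.

```python
# Hypothetical rows: each has a unit ("firm"), a period ("year") and a value.
cross_section = [
    {"firm": "A", "year": 2023, "revenue": 100},
    {"firm": "B", "year": 2023, "revenue": 80},
    {"firm": "C", "year": 2023, "revenue": 120},
]
time_series = [
    {"firm": "A", "year": 2021, "revenue": 85},
    {"firm": "A", "year": 2022, "revenue": 90},
    {"firm": "A", "year": 2023, "revenue": 100},
]
panel = [
    {"firm": "A", "year": 2022, "revenue": 90},
    {"firm": "A", "year": 2023, "revenue": 100},
    {"firm": "B", "year": 2022, "revenue": 70},
    {"firm": "B", "year": 2023, "revenue": 80},
]


def classify(rows):
    """Label a dataset by how many distinct units and periods it contains."""
    units = {r["firm"] for r in rows}
    periods = {r["year"] for r in rows}
    if len(units) > 1 and len(periods) > 1:
        return "panel"
    if len(periods) > 1:
        return "time-series"
    # Remaining case: a single period (one or more units).
    return "cross-sectional"


for name, rows in [("cross_section", cross_section),
                   ("time_series", time_series),
                   ("panel", panel)]:
    print(name, "->", classify(rows))
```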
Dataset formats - Wide vs. long data format
● Wide data format
○ Looks neat to a human reader, but Excel doesn't work with this layout easily
○ Each subject has a single row; different variables or time points are shown in separate columns
○ Better for comparison across variables (side-by-side)
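To make the wide layout concrete, here is a minimal sketch with made-up data: one row per firm and one column per year. It also reshapes the rows into the long layout named in the heading (one row per firm-year pair). The field names `firm`, `year` and `revenue` are assumptions for illustration.

```python
# Wide format: one row per subject, one column per time point (hypothetical data).
wide = [
    {"firm": "A", "2022": 90, "2023": 100},
    {"firm": "B", "2022": 70, "2023": 80},
]

# Side-by-side comparison across columns is straightforward in wide format.
growth = {row["firm"]: row["2023"] - row["2022"] for row in wide}

# Reshape to long format: one row per (subject, time point) pair.
long = [
    {"firm": row["firm"], "year": year, "revenue": value}
    for row in wide
    for year, value in row.items()
    if year != "firm"
]

print(growth)
for row in long:
    print(row)
```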