Lecture 4: Cluster Analysis
Cluster Analysis: Step by Step
• Step 1 | Defining the objectives
• Step 2 | Designing the study
• Step 3 | Checking assumptions
• Step 4 | Estimating the model and assessing fit
• Step 5 | Interpreting the results
• Step 6 | Validating the results
Step 1: Defining the objectives
Purposes:
→ Develop taxonomies, simplify data
→ Often used for segmentation purposes
Examples:
o Which client groups can ING distinguish based on product usage, and how can they be targeted?
o Hello Fresh Meal Boxes: What socio-demographic subgroups of current customers can be distinguished, and
how is this linked to usage frequency?
o Amazon.com: What products can be recommended to online users based on their previous clicks and
purchases?
o Market analysis: Which different car types can be distinguished?
No causal link is assumed → interdependence method: look at a set of variables and try to find a structural
outcome, without distinguishing drivers (no dependent vs. independent variables)
• Cluster analysis = technique to combine ‘objects’ or ‘persons’ into groups, based on a pre-defined set of
characteristics/variables → forms the groups in such a way that the objects within a group resemble one another:
- Objects within a group are ‘similar’
- Objects across groups are ‘different’
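The "similar within, different across" idea can be made concrete with distances. The sketch below is only an illustration: the six objects, their two variables and the group labels are hypothetical, and the grouping is assumed rather than estimated. It compares the average Euclidean distance between pairs in the same group with the average distance between pairs in different groups.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical toy data: each object is described by two variables.
objects = {
    "A": (1.0, 2.0),
    "B": (1.5, 1.8),
    "C": (1.2, 2.2),
    "D": (8.0, 9.0),
    "E": (8.5, 9.5),
    "F": (7.8, 8.7),
}

# Hypothetical grouping, assumed here (not the result of a clustering run).
groups = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2, "F": 2}

within, across = [], []
for a, b in combinations(objects, 2):
    d = dist(objects[a], objects[b])  # Euclidean distance between two objects
    if groups[a] == groups[b]:
        within.append(d)   # pair belongs to the same group
    else:
        across.append(d)   # pair spans two different groups

avg_within = mean(within)
avg_across = mean(across)
print(f"average distance within groups: {avg_within:.2f}")
print(f"average distance across groups: {avg_across:.2f}")
```

A good grouping shows a small average distance within groups relative to the average distance across groups.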
Objectives
So, cluster analysis …
• Inputs: characterization of objects (subjects) on a number of variables
• Outputs: assignment of objects (subjects) to different groups
Step 2: Designing the Cluster Analysis: Inputs
• What to decide on?
- Data: sample size and outliers
  o There is no fixed sample size rule; make sure the sample is representative of the population
  o Look for outliers upfront; outliers may also surface automatically once the clusters are formed
- Variable selection and measurement
- Measures of similarity between objects
2.1 Variable selection and measurement
• Which variables to use as a basis for grouping?
- Depends on the researcher’s interest/objectives
- Do the variables differentiate between objects?
Segmentation base: the variables on which you choose to form the segments
ING data → how the groups are formed: based on the clients’ usage of financial products (do they have a
savings account, a mortgage, how much money is in the account)
Rows → objects
Columns → variables (e.g. the characteristics of the car in the car example)
Factor analysis: reduces the number of columns by grouping them into factors
Cluster analysis: groups the rows of the data set, using all the columns that we think are relevant
• What is the measurement scale?
- Metric or non-metric
• Should the variables be standardized?
- Make sure the ‘order of magnitude’ is similar
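Why the order of magnitude matters can be seen in a small sketch. The three clients and their values below are hypothetical, and z-scores with the population standard deviation are one common way to standardize, assumed here for illustration. On the raw data the account balance (in euros) dominates the Euclidean distance and age barely counts; after standardizing, both variables contribute on a comparable scale.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical clients: (age in years, account balance in euros).
clients = [(25, 20000.0), (60, 21000.0), (27, 60000.0)]


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]  # population SD, an assumption of this sketch
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, sds))
        for row in rows
    ]


z = standardize(clients)

# Clients 0 and 1 differ mainly in age; clients 0 and 2 differ mainly in balance.
raw_01 = dist(clients[0], clients[1])
raw_02 = dist(clients[0], clients[2])
z_01 = dist(z[0], z[1])
z_02 = dist(z[0], z[2])

print(f"raw distances:          {raw_01:.1f} vs {raw_02:.1f}")
print(f"standardized distances: {z_01:.2f} vs {z_02:.2f}")
```

On the raw scale the 35-year age gap is swamped by the balance differences, so any distance-based grouping would effectively cluster on balance alone.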