Lecture 5: Logistic Regression
Logistic Regression: Step by Step
• Step 1 | Defining the objectives
• Step 2 | Designing the study
• Step 3 | Checking assumptions
• Step 4 | Estimating the model and assessing fit
• Step 5 | Interpreting the results
• Step 6 | Validating the results
Step 1: Defining Objectives
The purpose of logistic regression is to
´ Predict the likelihood that an event will or will not occur
´ Assess variables that affect occurrence of the event
- Direction of influence
- Magnitude/importance (how strong the effect of a change in a variable is)
Examples:
o What is the probability that a person will respond to a Neckerman direct mailing, and how can Neckerman
adjust its mail content to increase this probability?
o Does improved waiting time at the checkout increase the likelihood of visiting a C1000 store?
o What makes people more likely to donate to Foster Parents charity?
All three examples look for a causal relationship:
o A set of variables influences an outcome → the likelihood of an event happening
o Dependence method: there is a clear outcome variable (whether something happens or not), and we are interested in
how a set of variables affects the likelihood that it happens (as in ANOVA and linear regression)
Step 2: Designing the analysis
´ … involves decisions on:
- The variables to be included
- The sample
o Size
o Composition
o Estimation versus holdout
Logistic Regression vs. ANOVA/Linear Regression:
o The outcome variable for logistic regression is a 0/1 variable: a dummy
o The dummy can take on two values that are mutually exclusive (they cannot occur together)
o The two values are exhaustive (it is always one or the other)
Variables:
´ The dependent variable can be …
- “naturally” dichotomous (0/1: two mutually exclusive/exhaustive options) (a mailing is responded
to or it isn’t), or
- Reconstructed from a metric variable (polar extremes approach → recoded into a 0/1 variable)
´ The independent variables (can be metric or non-metric, but non-metric variables must be transformed):
- Can be metric or dummy variables
- Selection based on theory or intuition
Polar extremes approach (see the sketch after these steps):
1. Rank order your observations
2. Split the data into three parts (low/mid/high values)
3. Drop the mid-values and keep the low and high values (recode them into 0 and 1)
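As a rough illustration, the three steps above could be carried out in pandas as follows; the DataFrame and the metric column name "satisfaction" are hypothetical, not from the lecture data.

```python
# Sketch of the polar extremes approach with pandas (hypothetical data).
import pandas as pd

df = pd.DataFrame({"satisfaction": [2.1, 7.8, 5.0, 9.3, 1.4, 6.2, 8.8, 3.3, 4.9]})

# Steps 1-2: rank-order the observations and cut them into three equal-sized groups.
df["group"] = pd.qcut(df["satisfaction"], q=3, labels=["low", "mid", "high"])

# Step 3: drop the middle group and recode the extremes into a 0/1 dummy.
extremes = df[df["group"] != "mid"].copy()
extremes["satisfied"] = (extremes["group"] == "high").astype(int)

print(extremes[["satisfaction", "satisfied"]])
```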
Example in lecture: Advertising through in-store screens
o Do consumers notice ad messages on TV screens in store?
o How does this depend on message characteristics?
o Which consumers are more likely to notice the message?
o Store intercept survey: 879 respondents
o Dependent variable: in-store display seen/not seen (measured via recall)
o Independent variables (explanatory variables; a model specification sketch follows this list):
- Consumer: store visit frequency, spending on electronics, education (3 levels), home TV ad seen (yes/no)
- Message: length (3 levels), sound (yes/no)
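For illustration, a model along these lines could be specified with statsmodels' formula interface. The column names and the simulated data below are assumptions made for this sketch; the lecture's actual data file and coding may differ.

```python
# Sketch of a logit specification for the in-store screen example (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 879  # number of respondents in the store intercept survey
df = pd.DataFrame({
    "seen": rng.integers(0, 2, n),        # 1 = display seen, 0 = not seen
    "visit_freq": rng.poisson(3, n),      # store visit frequency
    "spending": rng.gamma(2.0, 50.0, n),  # spending on electronics
    "education": rng.integers(1, 4, n),   # 3 levels
    "tv_ad_seen": rng.integers(0, 2, n),  # home TV ad seen (yes/no)
    "length": rng.integers(1, 4, n),      # message length, 3 levels
    "sound": rng.integers(0, 2, n),       # sound (yes/no)
})

# C(...) turns the multi-level factors (education, length) into dummy variables.
model = smf.logit(
    "seen ~ visit_freq + spending + C(education) + tv_ad_seen + C(length) + sound",
    data=df,
)
print(model.fit().summary())
```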
´ Sample:
- Ratio of observations to independent variables (at least 20 observations in your data set for each
explanatory variable)
- Group sizes (the 'yes' and 'no' groups): representative? oversampling? Oversample only for a rare
event → construct the data set so that it contains relatively more 'yes' cases, which gives a better
chance of finding the effect (but be careful when interpreting the results)
- Analysis versus Holdout sample
- Proportionally stratified subsamples → the proportion of 'yes' and 'no' cases is the same in
your estimation sample and your holdout sample (see the split sketch below)
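A proportionally stratified estimation/holdout split can be sketched with scikit-learn, reusing the simulated DataFrame `df` from the previous sketch (the outcome column name "seen" is an assumption).

```python
# Sketch of a proportionally stratified estimation/holdout split.
from sklearn.model_selection import train_test_split

estimation, holdout = train_test_split(
    df,
    test_size=0.3,          # e.g. 70% estimation, 30% holdout
    stratify=df["seen"],    # keep the same share of 1's and 0's in both subsamples
    random_state=0,
)
# The proportions of 'yes' cases should be (almost) identical in both subsamples.
print(estimation["seen"].mean(), holdout["seen"].mean())
```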
Step 3: Checking assumptions
´ Two groups for outcome variable
- (more than two groups: extension → Multinomial Logit Model)
´ Robust to deviations from multivariate normality and homoscedasticity (no need to check these; the model is
robust to them)
- Homoscedasticity: equal variances of the error terms in the 0-group versus the 1-group
´ Check multicollinearity: Overlap between explanatory variables
´ Outcome = a probability (positive, between 0 and 1)
- Must lie between zero and one
- Need to adjust model form: S-shaped
If we used a regular linear regression model we would have a problem: there is no guarantee that, whatever values the
explanatory variables (the x's) take, the outcome stays between 0 and 1 → so we use a different, S-shaped function
(the logistic function, sketched below).
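That S-shaped function is the logistic function, P(y = 1) = 1 / (1 + e^-(b0 + b1·x1 + … + bk·xk)), which maps any value of the linear predictor into the (0, 1) range. Below is a minimal numerical sketch of why this keeps the outcome between 0 and 1; the coefficients b0 and b1 are arbitrary values chosen for illustration.

```python
# A linear predictor can fall outside [0, 1], but its logistic transform
# is always a valid probability.
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -2.0, 0.8                # arbitrary illustration coefficients
x = np.linspace(-10, 10, 9)
linear = b0 + b1 * x              # linear model: unbounded predictions
probs = logistic(linear)          # logistic model: S-shaped, bounded predictions

for xi, li, pi in zip(x, linear, probs):
    print(f"x = {xi:6.2f}   linear = {li:6.2f}   P(y=1) = {pi:.3f}")
```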