Formulas

Variance -> SD:
s = √Var

Standard error:
SE = s / √n

Confidence interval:
Estimate ± z × SE   (two-tailed z; 1.96 for a 95% interval)
Note: H0 corresponds to "no significant effect"; significance is judged from the p-value.

From value to %:
Percentage = (x / N) × 100

From % to population value:
x = (Percentage / 100) × N
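A minimal R sketch of these formulas; the sample values are made up for illustration:

# Hypothetical sample: compute SD, SE and a 95% confidence interval
x <- c(4, 7, 5, 6, 8, 5, 7, 6)       # made-up data
s <- sqrt(var(x))                    # s = sqrt(Var), same as sd(x)
SE <- s / sqrt(length(x))            # SE = s / sqrt(n)
estimate <- mean(x)
c(lower = estimate - 1.96 * SE,      # Estimate ± z × SE
  upper = estimate + 1.96 * SE)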
Logistic regression

A logistic regression model is built to study the likelihood of purchasing health insurance (YES = 1, NO = 0) based on age.
Example Q: What is the likelihood that someone who is 60 years old will purchase the insurance?
Hypothesis: H0: Beta = 0; H1: Beta ≠ 0 (there is an effect of age on the willingness to buy an insurance).

Tips:
1: Check which variable needs to be squared (the scale one; especially big = squared!).
2: When there are 3 categories, only use 2 of them (1 is the reference).

#1st: Write down the intercept (b0), the slope (b1) and x:
b0 = -3.125
b1 = 0.076
x = 20

#2nd: Fill in the formulas below:
logodds = b0 + b1*x
p = exp(logodds) / (1 + exp(logodds))

! Expected probability, example question: What is the expected probability of smoking for someone who is 50 years old?
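The two steps as runnable R, using the b0, b1 and x given above (for the 50-year-old question you would set x <- 50 with that model's own coefficients):

# Expected probability for x = 20
b0 <- -3.125
b1 <- 0.076
x  <- 20
logodds <- b0 + b1 * x                   # log-odds = -1.605
p <- exp(logodds) / (1 + exp(logodds))   # back-transform to a probability
p                                        # ≈ 0.17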
Addition model

Calculate:
β1 = slope = indicates how much ŷ changes when x increases by 1 unit.

Slope:
slope = (y2 − y1) / (x2 − x1)

#Addition model in R
model_name = data_name %>%
  lm(y ~ x1 + x2, .)
summary(model_name)

P-value:
< 0.05: Reject H0, accept HA: significant effect.
≥ 0.05: Do not reject H0: no significant effect.

Interpretation:
P-values: used to assess the significance of the coefficients.
• P-value < 0.05: we reject H0.
• P-value > 0.05: we do not reject H0.

F-test (whole model – is the model useful?)
• H0: Age and education have no effect (B1 = 0 and B2 = 0)
• HA: At least one has an effect (B1 ≠ 0 or B2 ≠ 0)
If the p-value < 0.05: the model fits the data.

t-test of b-coefficients (testing specific expectations, for one of the variables)
• H0: B1 = 0 → no effect
• HA: B1 ≠ 0 → significant effect
• Direction: positive (+) or negative (–) effect
• Size: how much Y changes when X increases by 1
Example: Age = –0.08 → when age increases by 1, ageism decreases by 0.08.

Explained variance: used to assess the overall quality of the model.
R squared = 0.2471: this is the explained variance: 24.71% of the variance of ageism can be explained by this model.
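A runnable sketch of the addition model; the data frame df and the columns ageism, age and education are assumptions based on the example above:

library(dplyr)

# Hypothetical data frame df with columns ageism, age, education
model_add = df %>%
  lm(ageism ~ age + education, .)
summary(model_add)
# In the output:
# - the F-statistic's p-value answers "is the model useful?"
# - each coefficient's t-test p-value answers "does this variable have an effect?"
# - Multiple R-squared is the explained variance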
Interaction

Equation:
y = b0 + b1·x + b2·group + b3·x·group
Group = 1 if it is not the reference category.

Interpretation:
B0: value of the intercept. Starting point: intercept = y when x is 0.
B1: value of the b-coefficient associated with the variable "x". Slope of the reference group.
B2: value of the b-coefficient associated with the group variable (the dummy). Difference with the other group (+ or –) when x = 0 in the reference group.
B3: value of the b-coefficient associated with the interaction. The lines start where they cross → difference between the lines when x increases by 1.
  If the reference group is the lower line: (+)
  If the reference group is the upper line: (–)

1: What is the effect (coefficient) of age on ageism? Or: what is the effect/coefficient among a specific group?
coefficient of the variable mentioned + β3 × dummy1 × dummy2

2: What is the effect (coefficient) of campaign for educated people (education = 1)?
2.11(camp) − 1.93(1)(camp) = (2.11 − 1.93)(camp)
Coefficient: 2.11 − 1.93 = 0.18.

3: What is the expected value of Y?
y = b0 + b1·x1 + b2·x2 → fill in the dummy values for the x's. If female is the reference, men = 1.

Does the question compare groups (e.g. men vs. women)? → YES → look at x:z (the interaction term) in the R output.
If asked about SIGNIFICANCE: we now focus on the population, e.g. the effect of leadership style on employee motivation among men.

#Interaction model in R
model_interaction = data_name %>%
  lm(y ~ x*z, .)
summary(model_interaction)

2nd option: two models to visualize the interaction:
model2_female = df %>%
  filter(gender == "Female") %>%
  lm(y ~ x, .)
summary(model2_female)
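A runnable sketch of both options; the data frame df and the columns motivation, leadership and gender are hypothetical names based on the example above:

library(dplyr)

# Option 1: one model with the interaction term
model_interaction = df %>%
  lm(motivation ~ leadership * gender, .)
summary(model_interaction)   # the leadership:gender row tests the group difference

# Option 2: separate models per group, to see each group's slope
model_female = df %>% filter(gender == "Female") %>% lm(motivation ~ leadership, .)
model_male   = df %>% filter(gender == "Male")   %>% lm(motivation ~ leadership, .)
summary(model_female)
summary(model_male)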
Influential cases and Multicollinearity

A case is influential if it has:
• High leverage: far out on the x-axis
• High residual: far out on the y-axis
• High impact: removing/including it would change the slope & estimate

# Cook's distance in R studio:
1st: Create a model, for example: model <- lm(y ~ x, data = data563)
2nd: Plot the graph using one of the following:
o plot(model)    # then hit Enter 4 times to step through the diagnostic plots
o plot(model, 4)

If no line appears, there are no influential cases. If a case has high leverage, a high residual and high impact, its Cook's distance will be > 0.5 and may even exceed 1.
!! Do not remove influential cases without further analysis of the cases themselves.

Multicollinearity refers to a statistical phenomenon in regression analysis where two or more predictor variables in a model are highly correlated with each other. It indicates a strong linear relationship between the independent variables, which can cause problems in the regression analysis.

# in R (vif() comes from the car package)
1st: Create a model: model_original <- lm(y ~ V1 + V2 + V3 + V4 + V5, data = jouw_dataset)
2nd: Compute the VIF values: vif(model_original)
Two variables have multicollinearity if VIF > 4.
How to solve that problem? Delete one of the variables that is highly correlated with another one, or combine those 2 variables into one.
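Both checks as one runnable sketch; the dataset jouw_dataset and the variable names are placeholders carried over from the steps above:

library(car)   # for vif()

# Hypothetical model with several predictors
model_original <- lm(y ~ V1 + V2 + V3 + V4 + V5, data = jouw_dataset)

# Influential cases: Cook's distance per observation
cooks <- cooks.distance(model_original)
which(cooks > 0.5)        # flag cases above the 0.5 rule of thumb
plot(model_original, 4)   # Cook's distance plot

# Multicollinearity: variance inflation factors
vif(model_original)       # values > 4 signal multicollinearity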