Interaction equations: reference and group 1
How the questions are asked in the exam:
B0: Value of the intercept = 2
B1: Value of the b-coefficient associated with the variable "x" = -1
B2: Value of the b-coefficient associated with the group variable (the dummy) = -1
B3: Value of the b-coefficient associated with the interaction = 1
If the group value is below the reference, the coefficient is negative (-).
If the group value is above the reference, the coefficient is positive (+).
1st: Start with the reference and find:
Intercept (of the reference category): B0 = 2 -> when x = 0, the value of y = +2
Slope: B1 = -1 -> when x increases by 1, y decreases by 1.
2nd: For group 2: Coefficient B2 = difference between the lines (group - reference) when x = 0. In this example, at x = 0 there is a difference of 1 square between the lines: group - reference = 1 - 2 = -1
3rd: The interaction coefficient: See graphically by how much the difference between the lines changes when x increases by 1.

Tests in RStudio
Proportion test/binomial test:
  table(data_name$x)
  binom.test(n_yes, total)
Goodness-of-fit test (obs = observed counts, exp = expected proportions summing to 1):
  obs = c(...)
  exp = c(...)
  chisq.test(obs, p = exp)
One-sample t-test: We compare our sample mean with the population mean, for example 6.5.
  t.test(data_name$y, mu = 6.5)
  Compare the CI with the population mean (6.5), not with 0.
Paired samples: Assume we have 2 paired samples, e.g. results on the exam and the retake.
  1st: Compute the differences in the right order (after - before):
  diff = data_name$retake - data_name$exam
  2nd: Do a one-sample t-test:
  t.test(diff)
  ** You can also compute a paired t-test directly: t.test(data_name$retake, data_name$exam, paired = TRUE)
2 samples: We measure the difference in reading skills between two teaching methods.
  Welch: t.test(data_name$y ~ data_name$x, var.equal = FALSE)
  Two sample: t.test(data_name$y ~ data_name$x, var.equal = TRUE)
More than 2 samples:
  Welch anova: oneway.test(y ~ x, data = data_name, var.equal = FALSE)
  Anova: 1. model = lm(y ~ x, data = data_name) 2. summary(model)
To choose the reference group yourself, create a dummy for each group:
  data123$group1 = ifelse(data123$group == "Group 1", 1, 0)
  data123$group2 = ifelse(data123$group == "Group 2", 1, 0)
  data123$group3 = ifelse(data123$group == "Group 3", 1, 0)
Now run steps 1 and 2 again, leaving out the dummy of the group you want as reference (here group3):
  Referencegroup <- lm(dependent_variable ~ group1 + group2, data = data123)
  summary(Referencegroup)

Addition model with two scale variables
Suppose we hypothesize that ageism (negative attitudes towards people on the basis of their age) is negatively affected by education (measured on a 10-point scale) and also negatively affected by age (18 to 80).
  library(tidyverse)
  model_name = data_name %>%
    lm(dependent ~ independent1 + independent2, data = .)
  summary(model_name)
Interpretations of output:
Explained variance: Used to assess the overall quality of the model. R squared = 0.2471: 24.71% of the variance of ageism can be explained by this model.
F test with associated P-value: Overall quality of the model. If P-value < 0.05, the model fits the data.
b-coefficients: Used to assess the direction and size of the relationships. For age, it is negative, so for an increase of 1 in age, ageism decreases by 0.08. For education it is positive, so for an increase of 1 in education, ageism increases by 0.10.
P-values: Used to assess the significance of those coefficients.
Linear equation from output: y = β0 + β1 * x1(age) + β2 * x2(education). With values: y = 7.25 - 0.08 * x1(age) + 0.10 * x2(education)

Addition model with one scale and one dummy variable
Linear equation: y = β0 + β1 * x1 + β2 * x2, with x1 = age and x2 = education as dummy.
Education dummy coded as (0) = no education, (1) = yes education.
For no education (x2 = 0): y = β0 + β1 * x1(age) + β2 * 0 --> y = β0 + β1 * x1(age)
For yes education (x2 = 1): y = β0 + β1 * x1(age) + β2 * 1
Typical questions:
A - What is the effect (coefficient) of age on ageism? -> -0.0877
B - What is the expected level of ageism of someone who is 20 years old and had access to education?
Y = 7.44 - 0.087(Age) + 0.814(edu_dummy) --> Y = 7.44 - 0.087(20) + 0.814(1) = 6.51
C - What is the expected level of ageism of someone who is 40 years old and had no access to education? With a dummy variable (0-1), a term drops out of the equation when the dummy is 0 and stays in when it is 1.
Y = 7.44 - 0.087(Age) + 0.814(edu_dummy) --> Y = 7.44 - 0.087(40) = 3.96

Code interaction model:
  library(tidyverse)
  model_name = data_name %>%
    lm(dependent ~ independent1 * independent2, data = .)
  summary(model_name)

Logarithmic regression: understanding the output
Hypothesis: H0: Beta = 0, H1: Beta ≠ 0 (there is an effect of age on the willingness to buy an insurance)
Coefficient of age = 0.076 means that there is a positive relationship between age and the likelihood of purchasing health insurance.
P-value < 0.05, so the relationship is significant.

Interaction model with a dummy and nominal variable
You can calculate the expected levels of support for each group, depending on whether they have been exposed to the campaign or not.
For example, among those with theoretical education:
Y = 7.4634 + 0.4033(1) = 7.8667 if exposed to the campaign (camp = 1)
Y = 7.4634 + 0.4033(0) = 7.4634 if not exposed to the campaign (camp = 0)
You can read the effect of the campaign within each group: 2.6, 1.2 and 0.40 for practical, medium and theoretical education respectively.
The effect of the campaign is significant for students following practical or medium education, but not significant for those with a theoretical level of education (P-value = 0.455).
When you want to filter on a word (a text value) rather than a number:
  model = dataset %>% filter(variable == "type variable") %>%
    lm(dependent ~ independent, data = .)
  summary(model)
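The filtering tip and the per-group campaign effect can be sketched as follows. This is a minimal illustration on simulated data, not the course dataset: the data frame `dataset`, its columns `education`, `camp` and `support`, and all generated values are assumptions, and the three effect sizes are only borrowed from the notes to drive the simulation. It uses base R `subset()` as a stand-in for `filter()` so that it runs without extra packages.

```r
# Hypothetical data: names and values are invented for illustration.
set.seed(1)
dataset <- data.frame(
  education = rep(c("practical", "medium", "theoretical"), each = 40),
  camp      = rep(c(0, 1), times = 60)
)
# Simulated support score; the campaign effect differs per education level.
true_effect <- c(practical = 2.6, medium = 1.2, theoretical = 0.4)
dataset$support <- 5 + true_effect[dataset$education] * dataset$camp +
  rnorm(nrow(dataset))

# Keep only the rows whose education equals the word "theoretical"
# (base-R equivalent of filter(education == "theoretical")).
theo <- subset(dataset, education == "theoretical")

# Effect of the campaign within that one group.
model <- lm(support ~ camp, data = theo)
summary(model)

# With a single 0/1 predictor, the slope is the difference in group means.
effect_camp <- unname(coef(model)["camp"])
mean_diff <- mean(theo$support[theo$camp == 1]) -
  mean(theo$support[theo$camp == 0])
```

Repeating the filter for "practical" and "medium" gives the campaign effect in the other two groups, which is the same information the interaction model reports in one table.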
Interaction model with two dummies
Suppose that you study the effect of a new information campaign (campaign) about
climate change on attitudes towards sustainable behaviour and sustainable policies
(support). You basically expect this effect will be biggest among those with a low
level of education, mainly because people with a high level of education already had
that info and already support these policies (education).
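A minimal sketch of how such a two-dummy interaction model could be fitted and read, assuming simulated data. The variable `edu_high` (1 = high education, 0 = low education) and every number in the simulation are hypothetical; only the names `campaign` and `support` come from the notes.

```r
# Simulated data; all values are invented for illustration.
set.seed(42)
n <- 200
d <- data.frame(
  campaign = rep(c(0, 1), each = n / 2),   # 0 = not exposed, 1 = exposed
  edu_high = rep(c(0, 1), times = n / 2)   # 0 = low, 1 = high education
)
# Campaign effect is built to be larger for low education (negative interaction).
d$support <- 4 + 2 * d$campaign + 3 * d$edu_high -
  1.5 * d$campaign * d$edu_high + rnorm(n)

# campaign * edu_high expands to both main effects plus their interaction.
model <- lm(support ~ campaign * edu_high, data = d)
b <- coef(model)

# Expected support in each of the four groups, built from the coefficients:
# a dummy that is 0 drops its term, a dummy that is 1 keeps it.
expected <- c(
  low_no_camp  = b[["(Intercept)"]],
  low_camp     = b[["(Intercept)"]] + b[["campaign"]],
  high_no_camp = b[["(Intercept)"]] + b[["edu_high"]],
  high_camp    = b[["(Intercept)"]] + b[["campaign"]] + b[["edu_high"]] +
                 b[["campaign:edu_high"]]
)

# With two dummies and their interaction, these equal the observed group means.
cell_means <- with(d, tapply(support, list(campaign, edu_high), mean))
```

The campaign effect for the low-education group is the `campaign` coefficient on its own; for the high-education group it is `campaign` plus the interaction coefficient.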
TP = 100: True positives are the patients that suffer from migraines and that were correctly identified by the
algorithm/model
TN = 150: True negatives are the patients that did not suffer from migraines and that were correctly identified
by the algorithm/model
FN = 30: False negatives are the patients that suffer from migraines, but the algorithm/model stated they did
not
FP = 20: False positives are the patients that do not have migraines, but the algorithm/model said they did
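The notes list only the four counts. As a small sketch, the usual summary measures can be computed from them as below; which of these measures the exam asks for is an assumption, while the formulas themselves are the standard definitions.

```r
# Counts taken from the notes above.
TP <- 100  # migraine, correctly identified
TN <- 150  # no migraine, correctly identified
FN <- 30   # migraine, missed by the model
FP <- 20   # no migraine, wrongly flagged

total <- TP + TN + FN + FP

accuracy    <- (TP + TN) / total   # share of all patients classified correctly
sensitivity <- TP / (TP + FN)      # share of migraine patients that were found
specificity <- TN / (TN + FP)      # share of non-migraine patients that were cleared
precision   <- TP / (TP + FP)      # share of positive predictions that were right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
# accuracy 0.833, sensitivity 0.769, specificity 0.882, precision 0.833
```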
Testing normal distribution and equal variance (HOMO: equal, HETERO: not)
Statistically: analyzing the P-value
Normality: use the Shapiro-Wilk test and check the P-value.
H0: Normal distribution vs HA: Not normal
If P-value < 0.05 we reject H0: NO normal distribution.
Equal variance: We choose the Levene test to check equal variance in a model where all independent variables are categorical. We choose Breusch-Pagan if at least one of the independent variables is scale.
Shapiro-Wilk test: shapiro.test(data_name$res)
Breusch-Pagan test: library(lmtest), bptest(model1)
Levene's test: library(car), leveneTest(model2)
This method only works with one single independent variable, which is enough for you to know for this course.
Graphically: using plots
Normality:
Normal QQ plot: residuals follow the line
Histogram: only one peak and normal shaped
Equal variance:
Residuals plots: equally spread dots
Boxplots: spread should be similar

Example: Suppose you are interested in the relationship between crime and punishment: you expect that crime is strongly related to the level of punishment, and that other factors do not play a role (the severity of crime is positively associated with the level of punishment).
Y = Punishment (dependent variable)
X = Crime (independent variable)
Steps to check the assumptions in R:
1st: Create the model:
  model1 <- data_name %>% lm(punish ~ crime, data = .)
2nd: Find residuals and predicted values:
  res <- model1$residuals
  pred <- model1$fitted.values
  Sometimes in the exam they ask to include residuals and predicted values in the dataset, in that case:
  data_name$res = model1$residuals
  data_name$pred = model1$fitted.values
3rd: Check the assumptions graphically:
  For normality: hist(res) or hist(data_name$res), and plot(model1, 2)
  For equal variance: plot(model1, 1) and plot(model1, 3)
Outputs + codes, respectively: plot(model1, 2) ; plot(model1, 1) ; plot(model1, 3)
From the normal QQ plot (1st plot), we can see that there are quite a lot of deviations from the straight dotted line; this is a violation of the normal distribution. From the 2nd plot (residuals vs fitted) it seems that the equal variance is mostly fine. However, from the 3rd plot (scale-location) we can see that the equal variance is not met.
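The statistical side of these checks can be sketched end to end as below. This is an illustration on simulated data only: the values of `crime` and `punish` are invented (with unequal variance built in on purpose), and the Breusch-Pagan step assumes the `lmtest` package is installed, so it is skipped otherwise.

```r
# Simulated stand-in for the course dataset; values are invented.
set.seed(7)
data_name <- data.frame(crime = runif(100, min = 1, max = 10))
# Noise grows with crime, so equal variance is violated by construction.
data_name$punish <- 2 + 1.5 * data_name$crime +
  rnorm(100, sd = 0.5 * data_name$crime)

# 1st: create the model.
model1 <- lm(punish ~ crime, data = data_name)

# 2nd: store residuals and predicted values in the dataset.
data_name$res  <- model1$residuals
data_name$pred <- model1$fitted.values

# Normality of the residuals: H0 = normal distribution.
sw <- shapiro.test(data_name$res)
sw$p.value   # below 0.05 -> reject normality

# Equal variance with a scale predictor: Breusch-Pagan (H0 = equal variance).
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(model1))
}
```

The plot calls from the steps above (`hist(data_name$res)`, `plot(model1, 2)`, `plot(model1, 1)`, `plot(model1, 3)`) can be run on the same `model1` to compare the graphical and statistical conclusions.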