Recap year 1
Scientific Method - observe->question->hypothesis->experiment->conclusion
Also, falsifiability of hypotheses is important (it must be possible to show that a hypothesis is incorrect)
Reliability - internal/external
Validity, confounds, bias, etc.
Participant information sheet for research:
https://swanseachhs.eu.qualtrics.com/jfe/form/SV_9KwCUp9GA63clXD
Section 1
Type 1 error: say there’s a difference but there isn't
Type 2 error: say there's no difference but there is
The .05 alpha level is essentially the accepted likelihood of making a Type 1 error: rejecting the null hypothesis when it is actually true.
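A minimal simulation sketch (not from the notes; the group sizes and population values are illustrative) showing that when the null is true by construction, roughly 5% of tests still come out significant at alpha = .05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
false_positives = 0
n_experiments = 10_000

for _ in range(n_experiments):
    # Both groups come from the SAME population, so the null is true.
    a = rng.normal(loc=100, scale=15, size=30)
    b = rng.normal(loc=100, scale=15, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1  # a Type 1 error

print(false_positives / n_experiments)  # should be close to 0.05
```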
Standard normal distribution -
The formula for creating the standard score, or z score, is: z = (X - μ) / σ
X represents the individual score. μ is the mean for all those scores in your sample. σ is the standard
deviation of the sample. All that means is that, for each participant, you take the average score away from
their own score and divide the result by the standard deviation of the sample.
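A quick sketch of that formula in Python (the scores are made up for illustration):

```python
import numpy as np

scores = np.array([12, 15, 9, 20, 14], dtype=float)

# z = (X - mean) / SD, using the sample standard deviation (ddof=1)
z = (scores - scores.mean()) / scores.std(ddof=1)
print(z)  # each participant's score in standard-deviation units from the mean
```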
Parametric assumptions -
Assumption of interval/ratio data (necessity for DV scores, easily verified)
Assumption of independent scores
Assumption of normality (central limit theorem)
Assumption of homogeneity of variance (Levene’s test)
Non-parametric statistics are used when the assumptions are too badly violated (e.g. nominal or ordinal data, skewed distributions); they don't require these assumptions.
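A hedged sketch of checking two of these assumptions with scipy (the groups and values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(50, 10, size=40)
group_b = rng.normal(55, 10, size=40)

# Normality: Shapiro-Wilk per group (p > .05 -> no evidence against normality)
print(stats.shapiro(group_a).pvalue, stats.shapiro(group_b).pvalue)

# Homogeneity of variance: Levene's test, as mentioned in the notes
print(stats.levene(group_a, group_b).pvalue)  # p > .05 -> variances look equal
```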
Two correlation tests: Pearson’s r (correlation based on z/standard scores, parametric), Spearman’s rank
(correlation based on ranks, non-parametric)
Zero correlation – can happen if the measure of one of the variables is too hard or too easy (a ceiling or floor effect restricts the range of scores)
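Both correlation tests are available in scipy; a small illustrative sketch (the data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.8, size=50)  # build in a moderate relationship

r, p = stats.pearsonr(x, y)          # parametric: based on standard scores
rho, p_rho = stats.spearmanr(x, y)   # non-parametric: based on ranks
print(r, rho)
```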
Central limit theorem – in many situations, when independent random variables are summed up, their
properly normalized sum tends toward a normal distribution, even if the original variables themselves are
not normally distributed.
The Central Limit Theorem does not always rescue you: check the normality of your samples, otherwise the
test might go wrong.
As sample size increases, the sample mean will be normally distributed and hypothesis tests will be robust
against the violation of normality.
Standard normal distribution has a mean of zero and a standard deviation of 1.
68% of scores fall within 1 standard deviation either side of the mean in a normal distribution,
95% within 2 standard deviations either side of the mean, and 99.7% within 3 standard
deviations of the mean - this is the empirical rule.
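A quick simulation sketch confirming the empirical rule (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0, scale=1, size=1_000_000)  # standard normal draws

for k in (1, 2, 3):
    within = np.mean(np.abs(x) < k)  # proportion within k SDs of the mean
    print(f"within {k} SD: {within:.3f}")  # ~0.683, ~0.954, ~0.997
```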
Our alpha level is .05, or 5%, meaning that a score is significantly different from the mean if it
falls more than about 2 standard deviations (1.96, exactly) away.
A bigger sample size results in greater power.
It takes more power to find an effect that is small than an effect that is large.
The more conservative your cut-off for significance, the more power you'll need to reach it (i.e. it takes more
power to get to p < .001 than it does to get to p < .05).
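These power points can be illustrated with statsmodels' power calculator for an independent-samples t-test (the effect sizes and targets below are illustrative, not from the notes):

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()

# Small effects need many more participants than large ones for the same power:
n_small = power.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
n_large = power.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(round(n_small), round(n_large))  # roughly 394 vs 26 per group

# A more conservative alpha pushes the required sample size up further:
n_strict = power.solve_power(effect_size=0.2, alpha=0.001, power=0.8)
print(round(n_strict))  # substantially larger than n_small (roughly 850+ per group)
```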
r value (correlation coefficient) and its interpretation:
0.3 = weak
0.5 = moderate
0.7 = strong
Regression
Regression tests are a set of tests used to predict what will happen in the future: we expect a
participant to behave in a certain way given the information we already have on them.
Method of least squares – for each candidate line, note the residuals (the distance between the line and
each actual data point), square each residual, and add them up. Whichever line has the smallest total is
the best line.
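A small sketch of that idea (the data and the candidate slopes/intercepts are made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_squared_residuals(slope, intercept):
    predicted = slope * x + intercept
    return np.sum((y - predicted) ** 2)  # square each residual, add them up

print(sum_squared_residuals(2.0, 0.0))  # a good candidate line: small total
print(sum_squared_residuals(1.0, 1.0))  # a worse line: larger total

# numpy can find the line with the smallest total directly:
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```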
We use inferential regression to know whether the regression equation describes a line that fits the
data well. We assess whether there is significantly less error when we predict scores based on
the regression than when we make the simplest prediction possible (that everyone scores the mean).
Strength of correlation: r
The more closely related two variables are, the stronger the correlation and the greater the r squared:
more of the variance in one is explained by the variance in the other.
The R squared value tells us how much of the variance is shared between the predictor and the outcome.
It also tells us the proportion of the total variance by which you've improved your prediction by
including the additional information.
To see whether the line we draw to predict what participants will score on the outcome variable results
in significantly less error when we include the predictor variable, we use Analysis of Variance
(ANOVA).
How to tell whether your regression has significantly improved the accuracy of your prediction:
1. Work out the error between the baseline model (the mean) and an individual data point, square it,
and repeat for every data point. Then add them all up (SStotal)
2. Work out the error between the model including the predictor and an individual data point, square it,
and repeat for every data point. Then add them all up (SSresidual)
3. SStotal – SSresidual = SSmodel (amount of reduction in error after adding in new info)
Sums of squares depend on how many values you've added up, so each is divided by its degrees of freedom
to give a mean square:
MSmodel = SSmodel / number of variables in the model, not including the constant (model degrees of freedom)
MSresidual = SSresidual / (number of observations - number of betas being estimated) (residual degrees of freedom)
In a simple linear regression:
The number of variables in the model that are not the mean is only ever going to be 1
There will only ever be two betas (the intercept and the slope)
MSmodel / MSresidual gives the F ratio (with the same distribution of critical values that we use for ANOVA)
If the probability is less than .05, we have a significant F value, so the prediction we've made when
including our new variable is significantly better than if we were to just use the mean.
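A worked sketch of steps 1-3 and the F ratio on made-up data (scipy's F distribution supplies the p value):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)      # error of the baseline (mean) model
ss_residual = np.sum((y - predicted) ** 2)  # error of the model with the predictor
ss_model = ss_total - ss_residual           # reduction in error from the predictor

df_model = 1              # one predictor in a simple linear regression
df_residual = len(y) - 2  # two betas estimated (intercept and slope)

ms_model = ss_model / df_model
ms_residual = ss_residual / df_residual
f_ratio = ms_model / ms_residual
p = stats.f.sf(f_ratio, df_model, df_residual)  # p value from the F distribution
print(f_ratio, p)
```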
Assumptions of simple linear regression:
Linearity: use a scatterplot; if the relationship isn't linear, the prediction won't hold and you can't
use the test without transforming the data to make it linear
Independent values of outcome variable (should come from separate participants)
We usually report simple linear regression by saying:
Whether the prediction was significantly improved by the inclusion of the predictor variable
& How much of the variance in the outcome variable is explained by the predictor
(e.g. "The addition of the predictor variable significantly improved the prediction [F(df, df) = whatever it
equals, p < .05]. The final model was able to explain x% of the variance in the outcome variable.")
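For a simple linear regression, scipy.stats.linregress returns the pieces this write-up needs; a small sketch on made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.2])

result = stats.linregress(x, y)
r_squared = result.rvalue ** 2  # proportion of variance in y explained by x
print(f"R^2 = {r_squared:.3f}, p = {result.pvalue:.4g}")
```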