one or more predictor variables.
5 key concepts: the SPINE
- Standard error
- Parameters
- Interval estimates (confidence intervals)
- Null hypothesis significance testing
- Estimation
Testing hypotheses involves building stat models of the phenomenon of interest.
Unlike engineers, we have no direct access to the real-world situation: we can only infer
things about psychological, societal, biological, or economic processes from the models
we build.
The degree to which a stat model represents the data collected is known as the fit of the model.
Some similarities but big differences = moderate fit.
Excellent representation of the real-world situation = good fit.
Everything in the stat book boils down to this equation:
outcome = model + error
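A minimal sketch of this equation, using made-up scores and the mean as the simplest possible model:

```python
# Hypothetical data: the "model" is just the mean, and each score's
# error (residual) is its deviation from that mean.
from statistics import mean

scores = [4, 6, 7, 5, 8]
model = mean(scores)                      # the model predicts 6 for everyone
errors = [x - model for x in scores]      # outcome = model + error

# Every observed outcome is recovered exactly as model + error:
for x, e in zip(scores, errors):
    assert abs((model + e) - x) < 1e-9
```

Note that the errors always sum to zero around the mean; the question of fit is about how *large* they are.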
Most scientists are interested in generalizing their findings to the wider population.
Since scientists cannot collect data from every human being, we collect data from a
smaller subset of the pop called a sample.
The larger the sample, the more likely it is to reflect the whole population.
Stat models are made up of variables and parameters: variables are measured constructs that
vary, whereas parameters are constants.
Parameters represent some fundamental truth about the relations between the variables in the model.
Examples include the mean, the median, correlation coefficients, and regression coefficients.
We can predict values of the outcome based on a model.
There will always be error in predictions, and there will always be parameters that tell us about the
shape or form of the model.
Because we use the parameters of the sample and not of the population, we can only say that we
are computing an estimate of the true pop parameter.
The mean is a hypothetical value: it is created to summarize the data, so there will be errors in
prediction.
When we estimate parameters we focus on minimizing the sum of squared errors.
This is known as the method of least squares, or ordinary least squares (OLS).
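A quick sketch (with hypothetical data) of why the mean is the least-squares estimate: no other candidate value gives a smaller sum of squared errors.

```python
# The mean minimizes the sum of squared errors (SSE) when a single
# value is used to predict every score.
from statistics import mean

scores = [4, 6, 7, 5, 8]

def sse(data, b):
    """Sum of squared errors when every score is predicted by b."""
    return sum((x - b) ** 2 for x in data)

best = mean(scores)  # 6
# Every other candidate value does worse:
for candidate in [5.0, 5.5, 6.5, 7.0]:
    assert sse(scores, best) < sse(scores, candidate)
```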
Sampling variation: the samples vary because they contain different members of the population.
Sampling distribution: frequency distribution of sample means from the same population.
We use the standard deviation as a measure of how representative the mean is of the
observed data.
The mean of the sample means is the same as the population mean.
The standard deviation of the sample means is called the standard error.
Since we would need hundreds or thousands of samples, it is impractical to build the sampling
distribution empirically and compute (sample mean - overall mean)^2 for each one.
So statisticians demonstrated the central limit theorem: as samples get larger (N greater than
about 30), the sampling distribution approaches a normal distribution with a mean equal to the
pop mean and
Standard error = s / sqrt(N)
For small samples (N less than 30), the sampling distribution is a t-distribution.
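The claims above can be checked by simulation. This sketch uses a made-up normal population; it draws many samples, and shows that the SD of the sample means (the standard error) matches s / sqrt(N):

```python
# Simulation: the sampling distribution of the mean is centred on the
# population mean, with SD approximately sigma / sqrt(N).
import random
from statistics import mean, pstdev

random.seed(1)
population = [random.gauss(100, 15) for _ in range(100_000)]

N = 50
sample_means = [mean(random.sample(population, N)) for _ in range(2_000)]

empirical_se = pstdev(sample_means)                 # SD of the sample means
predicted_se = pstdev(population) / N ** 0.5        # sigma / sqrt(N), about 2.12
```

With 2,000 simulated samples the two values agree closely, without ever having to square each (sample mean - overall mean) by hand.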
Confidence interval: boundaries within which we believe the population value will fall.
What does a 95% confidence interval mean?
Imagine you collect 100 samples and compute a 95% CI for each: in 95 of them the true pop mean
will fall within the interval, while in 5 it won't.
The problem is that, for any given interval, we don't know whether it is one of the 95% or one of the 5%.
But the good thing is that we know the sampling distribution is normal, and if the scores don't
already have a mean of 0 and an SD of 1, we can convert them so that they do (the standard
normal, or z, distribution):
z = (X - X̄) / s
Lower boundary of CI = X̄ - (1.96 × SE)
Upper boundary of CI = X̄ + (1.96 × SE)
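A short sketch of these formulas on hypothetical data, using the 1.96 critical value from the standard normal distribution:

```python
# 95% CI via the normal approximation: X-bar +/- 1.96 * SE,
# where SE = s / sqrt(N). Data are made up for illustration.
from statistics import mean, stdev

scores = [98, 102, 95, 101, 99, 104, 97, 100, 103, 96]
n = len(scores)
x_bar = mean(scores)
se = stdev(scores) / n ** 0.5

lower = x_bar - 1.96 * se
upper = x_bar + 1.96 * se
```

The interval is symmetric around the sample mean, and its width shrinks as N grows (because SE shrinks).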
T-distribution: this distribution changes shape as the sample size gets bigger, approaching the normal distribution.
Lower boundary of CI = X̄ - (t(n-1) × SE)
Upper boundary of CI = X̄ + (t(n-1) × SE)
For a 95% confidence interval, we find the value of t for a two-tailed test with a probability of 0.05,
for the appropriate degrees of freedom (df = n - 1).
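The same CI computed with the t critical value instead of 1.96, a sketch assuming SciPy is available for the t quantile (the data are hypothetical):

```python
# Small-sample 95% CI: replace 1.96 with the two-tailed critical t
# value on n - 1 degrees of freedom.
from statistics import mean, stdev
from scipy.stats import t

scores = [98, 102, 95, 101, 99, 104, 97, 100, 103, 96]  # N = 10
n = len(scores)
x_bar = mean(scores)
se = stdev(scores) / n ** 0.5

t_crit = t.ppf(0.975, df=n - 1)   # two-tailed 0.05 -> 0.975 quantile, df = 9
lower = x_bar - t_crit * se
upper = x_bar + t_crit * se
```

Because t(9) ≈ 2.26 is larger than 1.96, the t-based interval is wider, reflecting the extra uncertainty of estimating s from a small sample.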
Two 95% confidence intervals that don't overlap suggest two possibilities:
- Both intervals contain the population mean, but the samples come from different populations.
- Both samples come from the same population, but one interval does not contain the population
mean (something that should happen in only about 5% of cases).