Greenland et al. – Statistical tests, P values, confidence intervals, and power: a
guide to misinterpretations
Statistical significance is the classification of results as “significant” or not on the basis of a P value.
Some scientific journals now ban such statistical tests. In most scientific settings the
arbitrary classification of results into significant and non-significant is unnecessary for, and
often damaging to, valid interpretation of data; estimation of the size of effects and of
the uncertainty surrounding our estimates is far more important for scientific inference
and sound judgment than any such classification.
Statistical models, hypotheses and tests
Every statistical method rests on an underlying statistical model: a mathematical
representation of data variability that ideally captures all sources of that variability.
Devices such as random sampling and randomization are meant to justify the model, but in
practice full randomization is often difficult or impossible.
Also, the scope of a model should cover the observed data; yet decisions about analysis
choices are often made only after the data have been collected, which the model does not reflect.
The underlying assumptions are difficult to understand and assess, and they are usually
presented poorly or not at all. Many assumptions thus go unremarked and unrecognized.
In most tests we examine a null hypothesis (i.e., that there is no difference). We may also
test the hypothesis that the effect does or does not fall within a specific range, i.e., a
one-sided or dividing hypothesis. The procedure is too often called “Null Hypothesis
Significance Testing”, even though the tested hypothesis is not always a null hypothesis.
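As a concrete illustration (a minimal sketch with made-up data, not an example from the paper), the same data can be tested against the two-sided null of no difference or against a one-sided (dividing) hypothesis about the direction of the effect; scipy's `alternative` argument to `ttest_ind` supports this (available since scipy 1.6):

```python
# Minimal sketch with illustrative (made-up) data: the tested hypothesis
# need not be the two-sided null of "no difference".
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, 30)   # hypothetical group A
b = rng.normal(0.5, 1.0, 30)   # hypothetical group B

# Test hypothesis: the means are equal (the classic null hypothesis).
print("two-sided P:", stats.ttest_ind(a, b).pvalue)

# Test hypothesis: mean(a) >= mean(b); the alternative is mean(a) < mean(b).
# This is a one-sided (dividing) hypothesis (requires scipy >= 1.6).
print("one-sided P:", stats.ttest_ind(a, b, alternative='less').pvalue)
```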
Uncertainty, probability, and statistical significance
The probabilities supplied by statistical tests are frequency probabilities: quantities that
are hypothetical frequencies of data patterns under an assumed statistical model. Because they
are referred to simply as “probabilities”, these frequency probabilities are easily
misinterpreted as hypothesis probabilities, i.e., the probability that a hypothesis is true.
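To make the distinction concrete, here is a minimal simulation sketch (the 90% share of true nulls, the effect size, and the sample size are illustrative assumptions, not figures from the paper): even among results with P < 0.05, the tested null is often true, so P is clearly not the probability that the hypothesis holds.

```python
# Minimal sketch (illustrative assumptions): a P value is a frequency
# probability of data patterns, not the probability that a hypothesis
# is true. Assume 90% of tested nulls are actually true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_studies = 30, 20_000
null_true = rng.random(n_studies) < 0.9       # assumed: 90% true nulls
effect = np.where(null_true, 0.0, 0.5)        # assumed effect when false

pvals = np.empty(n_studies)
for i in range(n_studies):
    x = rng.normal(0.0, 1.0, n)
    y = rng.normal(effect[i], 1.0, n)
    pvals[i] = stats.ttest_ind(x, y).pvalue

sig = pvals < 0.05
print("fraction of 'significant' results whose null is actually true:",
      round(null_true[sig].mean(), 3))        # far from 0.05
```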
The P value is approached in this paper as a statistical summary of the compatibility
between the observed data and what we would predict or expect to see if the
entire statistical model (i.e., all the assumptions used to compute the P value) were
correct.
In tests such as the t-test and the chi-squared test, the P value is the probability that the
chosen test statistic would have been at least as large as its observed value if every model
assumption, including the test hypothesis, were correct. In practice, however, the P value is
frequently read as a statement about the test hypothesis alone (e.g., the null hypothesis),
rather than about the model as a whole.
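That tail-probability definition can be checked directly by simulation; the following minimal sketch (made-up data, with normality and equal variances assumed) recovers the analytic two-sample t-test P value as the frequency of statistics at least as extreme as the observed one under the full model:

```python
# Minimal sketch (illustrative data): the P value as the tail probability
# of the test statistic when *every* model assumption holds -- here
# independence, normality, equal variances, and equal means (the test
# hypothesis).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.5, 1.0, 30)

t_obs, p_analytic = stats.ttest_ind(a, b)

# Regenerate the statistic under the complete model, test hypothesis included.
sims = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).statistic
    for _ in range(20_000)
])

# Two-sided P: how often the statistic is at least as extreme as observed.
p_sim = np.mean(np.abs(sims) >= abs(t_obs))
print(f"analytic P = {p_analytic:.4f}, simulated P = {p_sim:.4f}")
```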
A small P value does tell us that the data would be unusual if every single
assumption were correct, but it does not tell us which assumption (if any) is incorrect. P may,
for example, be small (or large) because study protocols were violated, not because the test
hypothesis is false; see the sketch below.
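As one illustration of an assumption failure (a minimal sketch with invented cluster-correlated data, not an example from the paper): here the test hypothesis of no difference is true, yet small P values become common because the independence assumption is violated.

```python
# Minimal sketch (invented data): the null is TRUE, but observations are
# cluster-correlated, violating the t-test's independence assumption.
# Small P values then occur far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def clustered_group(n_clusters=6, per_cluster=5, cluster_sd=1.0):
    # A shared cluster effect induces correlation within each group.
    cluster_fx = np.repeat(rng.normal(0, cluster_sd, n_clusters), per_cluster)
    return cluster_fx + rng.normal(0, 1.0, n_clusters * per_cluster)

pvals = np.array([
    stats.ttest_ind(clustered_group(), clustered_group()).pvalue
    for _ in range(5_000)
])
print("rate of P < 0.05 under a true null with broken independence:",
      round((pvals < 0.05).mean(), 3))   # well above 0.05
```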
Also, the significance (alpha) level is conventionally 0.05. Alpha is fixed in advance and is
thus part of the study design, unchanged in light of the data. P, by contrast, is a number
computed from the data and thus an analysis result, unknown until it is computed.
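The division of labour between the two can be seen in a small simulation (a minimal sketch under idealized assumptions): alpha is set before any data exist, and when the full model including the null holds, P is uniformly distributed, so the pre-set cutoff rejects at rate alpha.

```python
# Minimal sketch (idealized assumptions): alpha is a design constant,
# P is a per-study result. With every assumption (null included) true,
# P is uniform on [0, 1], so P < alpha occurs in about alpha of studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05  # fixed in advance, part of the design

pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(10_000)
])
print("rejection rate at alpha = 0.05:", round((pvals < alpha).mean(), 3))
```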
Moving from tests to estimates
Usually, P values are discussed only for the null hypothesis of no effect, which obscures the
close relationship between P values and confidence intervals, as well as the weaknesses
they share.
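That relationship is test inversion: a 95% confidence interval collects exactly those effect sizes whose test hypothesis yields P ≥ 0.05. A minimal sketch (made-up data; the shifting trick and the grid of candidate values are my own illustration):

```python
# Minimal sketch (illustrative data): the 95% CI as the set of
# hypothesized mean differences 'delta' that the data do not reject
# at the 0.05 level (test inversion).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 30)
b = rng.normal(0.5, 1.0, 30)

def p_for(delta):
    # Test the hypothesis "mean(b) - mean(a) == delta" by shifting b.
    return stats.ttest_ind(a, b - delta).pvalue

deltas = np.linspace(-1.0, 2.0, 2001)
kept = [d for d in deltas if p_for(d) >= 0.05]   # compatible values
print(f"95% CI by test inversion: [{min(kept):.3f}, {max(kept):.3f}]")
```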
Read part in actual article
What P values, confidence intervals, and power calculations don’t tell us