What a p-value actually tells you, and what it does not

Behind a large share of the health claims that reach the public sits a single small number: the p-value. A study is called significant, a treatment is called effective, a risk is called real, and somewhere upstream a p-value crossed a threshold. The number is everywhere. The standard reading of it is usually wrong.

That is not a fringe complaint. The misunderstanding is common enough, and consequential enough, that in 2016 the American Statistical Association (ASA) took the unusual step of issuing a formal statement on how p-values should and should not be used. The goal of this piece is narrow and practical: to define what a p-value really measures, to name clearly what it does not, and to leave a careful reader able to interpret one honestly.

What a p-value is

To define a p-value, start with the idea it is built on. A null hypothesis is a specific baseline claim that there is no effect: no difference between two treatments, no association between an exposure and an outcome, nothing happening beyond ordinary variation. Most statistical tests are constructed to evaluate data against this baseline.

A p-value is the probability of observing data at least as extreme as the data actually collected, assuming the null hypothesis is true. That conditional clause is the whole point. The calculation begins by granting, for the sake of argument, that there is no real effect, and then asks how surprising the observed result would be in that world.

A small p-value means the data would be unusual if the null hypothesis were true. A large p-value means the data are unremarkable under that assumption. As the ASA puts it, a p-value indicates how incompatible the data are with a specified statistical model. That is its job, and it is a real and useful one. It is also a great deal less than what the number is routinely asked to mean.
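The definition can be made concrete with a small calculation. The sketch below assumes a hypothetical paired comparison in which 60 of 100 patients did better on a new treatment, and a null hypothesis under which each patient is equally likely to do better on either treatment. The data, the function name, and the choice of "at least as extreme" (any outcome no more probable under the null than the observed one) are illustrative assumptions, not taken from any study.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """P-value for a count of successes, computed as if the null were true."""

    def pmf(k: int) -> float:
        # Probability of exactly k successes in the no-effect world.
        return comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)

    observed = pmf(successes)
    # "At least as extreme" is taken here to mean every outcome that is
    # no more probable under the null than the one actually observed.
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * tolerance)
    return min(1.0, total)


# Hypothetical data: 60 of 100 patients did better on the new treatment.
p_value = two_sided_binomial_p(60, 100)
print(f"p = {p_value:.3f}")
```

Every step of the calculation takes place inside the no-effect world. Nothing in it estimates how likely that world is.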

What a p-value is not

The ASA statement is built around six principles, and several of them exist specifically to correct widespread misreadings.

Wasserstein RL, Lazar NA (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. DOI: 10.1080/00031305.2016.1154108

The first and most important correction concerns probability. A p-value is not the probability that the null hypothesis is true, and it is not the probability that the studied hypothesis is false. It cannot be, because of how it is calculated: it already assumes the null hypothesis is true and works forward from there. A number computed by assuming something cannot also tell you the chance that the assumption is wrong. The ASA states this directly: p-values do not measure the probability that the studied hypothesis is true.

A closely related error is to call a p-value the probability that the result happened by chance. The same statement rejects this reading: a p-value does not measure the probability that the data were produced by random chance alone. The data were produced by whatever process actually generated them. The p-value only describes how the observed result compares to what the no-effect model would typically produce.

Perhaps the most stubborn misreading attaches to the familiar value of 0.05. It is widely said that a p-value of 0.05 means there is a 5 percent chance the finding is a false positive. This is incorrect, and the error is not subtle. The false-positive rate of a study depends on factors a p-value does not contain, including how plausible the hypothesis was before the data arrived. The mistake is well documented enough that reviews of statistical reporting single it out, noting that it has appeared even in the formal reporting standards of respected publications.

Andrade C (2019). The P Value and Statistical Significance: Misunderstandings, Explanations, Challenges, and Alternatives. Indian Journal of Psychological Medicine (NCBI / PMC). DOI: 10.4103/IJPSYM.IJPSYM_193_19
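A short calculation shows why the 5 percent reading fails. The sketch below is a simplified model, not a description of any real literature: it assumes every study tests at the 0.05 threshold, has the same power, and that a known share of tested hypotheses (the "prior") are true. All three inputs are illustrative assumptions. It also treats "significant" as any p-value at or below the threshold, not a p-value of exactly 0.05.

```python
def false_positive_share(prior: float, power: float, alpha: float = 0.05) -> float:
    """Share of significant results that are false positives (simplified model)."""
    false_positives = alpha * (1 - prior)  # no real effect, yet the test is significant
    true_positives = power * prior         # real effect, and the test detects it
    return false_positives / (false_positives + true_positives)


# Assumed inputs: 80 percent power, three levels of prior plausibility.
for prior in (0.5, 0.1, 0.01):
    share = false_positive_share(prior, power=0.8)
    print(f"prior plausibility {prior:.2f}: {share:.0%} of significant results are false")
```

Under these assumed inputs the threshold never changes, yet the share of significant findings that are false rises from about 6 percent to about 36 percent to about 86 percent as the hypotheses become less plausible. The p-value contains none of that information.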

Finally, a p-value says nothing about how large or how important an effect is. A statistically significant result can correspond to a difference far too small to matter to any patient. Statistical significance and the size of an effect are separate questions, and the ASA lists this separation as one of its principles: a p-value does not measure the size of an effect or the importance of a result.

Why 0.05 is a convention, not a law of nature

The threshold of 0.05 has acquired an authority it was never meant to hold. It traces to early twentieth-century statistical practice, where it was offered as a convenient, round benchmark, not as a natural boundary between findings that are real and findings that are not.

Treating it as such a boundary causes real harm. When results are sorted into significant and not significant by whether they fall on one side of 0.05, a great deal of information is thrown away. A p-value of 0.049 and a p-value of 0.051 describe nearly identical evidence, yet the bright-line habit declares one a discovery and the other a non-event. The ASA explicitly cautions against basing scientific conclusions on whether a p-value passes a specific threshold. The statistical community has continued to press this point: a 2021 ASA President's Task Force statement on statistical significance and replicability reaffirmed that p-values and significance tests, properly used and interpreted, remain valuable tools, while emphasizing that they assess results relative to sampling variation rather than practical importance.
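One way to see how close 0.049 and 0.051 really are is to convert each back into the test statistic that would have produced it. The sketch below assumes a two-sided test based on a standard normal statistic, which is an illustrative choice; other tests would give different numbers but the same picture.

```python
from statistics import NormalDist


def z_for_two_sided_p(p: float) -> float:
    """Size of the standard normal statistic that yields a two-sided p-value of p."""
    return NormalDist().inv_cdf(1 - p / 2)


z_just_under = z_for_two_sided_p(0.049)
z_just_over = z_for_two_sided_p(0.051)
print(f"p = 0.049 corresponds to z = {z_just_under:.3f}")
print(f"p = 0.051 corresponds to z = {z_just_over:.3f}")
print(f"difference in z: {z_just_under - z_just_over:.3f}")
```

The two statistics differ by less than 0.02 standard errors, a gap far smaller than the ordinary noise in any estimate, yet the bright line files them under opposite verdicts.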

The dichotomy also distorts what gets published. Results that clear the threshold are more likely to appear in print, which inflates the apparent strength of effects across an entire literature. A single threshold, applied mechanically, ends up shaping science rather than merely summarizing it.
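The inflation can be illustrated with a small simulation. The sketch below assumes a hypothetical literature in which every study estimates the same modest true effect with the same standard error, and only results that clear the 0.05 threshold are "published". The effect size, standard error, number of studies, and the shortcut of drawing each estimate directly from a normal distribution are all illustrative assumptions.

```python
import random
from statistics import mean


def simulate_literature(true_effect=0.2, std_error=0.2, studies=20000, seed=1):
    """Return (mean of all estimates, mean of published estimates, share published)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each study's estimate is the true effect plus sampling noise.
    estimates = [rng.gauss(true_effect, std_error) for _ in range(studies)]
    # "Published" means two-sided significance at the 0.05 level.
    published = [e for e in estimates if abs(e / std_error) > 1.96]
    return mean(estimates), mean(published), len(published) / studies


all_mean, published_mean, share = simulate_literature()
print(f"mean effect across all studies:  {all_mean:.2f}")
print(f"mean effect in published subset: {published_mean:.2f}")
print(f"share of studies published:      {share:.0%}")
```

In this setup the full set of studies recovers the true effect, while the published subset overstates it by more than a factor of two, because only the estimates that happened to land far from zero passed the filter.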

Statistical significance is not clinical importance

The separation between significance and importance deserves its own emphasis, because it runs in both directions.

In a very large study, even a trivial difference can become statistically significant. With enough participants, a difference of no practical consequence (a fraction of a point on a scale, a sliver of a percentage) can produce a small p-value simply because the study had the power to detect almost anything. Significance here indicates only that the data are hard to reconcile with a difference of exactly zero, not that the difference is worth acting on.
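The effect of sample size alone can be sketched numerically. The example below holds a hypothetical difference fixed at half a point on a scale whose standard deviation is 10 points, and varies only the number of participants per group. It assumes a two-sided z-test with a known standard deviation and equal group sizes, which are simplifications chosen for illustration.

```python
from math import erfc, sqrt


def two_sided_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided z-test p-value for a difference in means (known sd, equal groups)."""
    std_error = sd * sqrt(2 / n_per_group)
    z = abs(diff) / std_error
    # erfc(z / sqrt(2)) equals the two-sided tail area of a standard normal.
    return erfc(z / sqrt(2))


# Same half-point difference every time; only the sample size changes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n per group = {n:>7,}: p = {two_sided_p(0.5, 10, n):.4g}")
```

The difference never changes in size or importance, but the p-value falls from well above 0.05 to effectively zero as the groups grow.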

The reverse happens in small studies. A genuinely important effect can fail to reach significance when the sample is too small to distinguish it from noise. A p-value above 0.05 in that setting is not evidence that nothing is there. It may show only that the study could not tell. This is why the meaning of a result lives less in the p-value and more in two other quantities: the effect size (the magnitude of the difference) and the confidence interval (the range of values reasonably compatible with the data). Those quantities answer the question a reader usually cares about: not merely whether there is an effect, but how big it plausibly is and how precisely it was measured.
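A small worked case shows what the confidence interval adds. The sketch below assumes a hypothetical study with 10 participants per group, an observed difference of 5 points, and a known standard deviation of 10 points. It uses a normal approximation for simplicity; with samples this small a t-based interval would normally be used and would be somewhat wider.

```python
from math import erfc, sqrt
from statistics import NormalDist


def summarize(diff: float, sd: float, n_per_group: int, level: float = 0.95):
    """Return (lower bound, upper bound, two-sided p) under a normal approximation."""
    std_error = sd * sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    p_value = erfc(abs(diff) / std_error / sqrt(2))
    return diff - z_crit * std_error, diff + z_crit * std_error, p_value


# Hypothetical small study: 5-point difference, sd of 10, 10 per group.
low, high, p_value = summarize(diff=5, sd=10, n_per_group=10)
print(f"estimate 5.0, 95% interval {low:.1f} to {high:.1f}, p = {p_value:.2f}")
```

The result is not significant, but the interval runs from a small harm to a benefit well over twice the observed estimate. Read that way, the study has not shown an absence of effect. It has shown that it was too imprecise to say.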

How to read a p-value responsibly

None of this means p-values are useless. It means a p-value is one piece of evidence, to be read alongside others, never a verdict on its own. A few habits keep its interpretation honest.

Look past the p-value to the effect size and its confidence interval. A narrow interval around a meaningful effect is worth far more than a small p-value alone.
Ask about the study design. Was the comparison randomized or observational? How large was the sample? What was measured, and what might have been missed?
Ask whether the finding replicates. A result that recurs across different populations and methods is far more trustworthy than a single striking p-value, and replicability is precisely the concern the statistical community has emphasized most in recent years.
Resist the bright line. Treat 0.049 and 0.051 as the near-equals they are, and read the strength of evidence as a continuum rather than a verdict.

This piece is educational and is not a substitute for personal medical advice. Its purpose is to explain how to read a number that appears throughout health research, not to guide any decision about a particular treatment or exposure.

The deeper lesson is one of proportion. A p-value answers a single, modest question about compatibility between data and a model. Asked to carry more than that, to certify truth, to measure importance, to draw a line between real and unreal, it fails, and the science built on the overreading inherits the failure. Read narrowly and in context, it remains a genuinely useful tool. For readers who want to see how questions of design, measurement, and inference are worked through in practice, the peer-reviewed publications page collects studies where exactly this kind of careful reading was required.