Imagine the American Physical Society (APS) convening a panel of experts to issue a missive to the scientific community on the difference between weight and mass. And imagine that the motivation for such a message was the recognition that engineers and builders had been confusing these concepts for decades, making bridges, buildings, and other structures much weaker than previously suspected.
This, in a sense, is what has happened with the recent release of a statement from the American Statistical Association (ASA), with the deceptively innocuous title, “ASA statement on statistical significance and P-values” [see excellent article attached]. The scientific measure in need of clarification was the P-value––perhaps the most ubiquitous statistical index that is used in scientific research to help decide what is “significant” and what is not. The ASA saw misunderstanding, and misuse, of statistical significance as a factor in the rise in concern about the credibility of many scientific claims (which has recently been called the “reproducibility crisis”). Thus, the ASA hopes that its official statement on the matter will help set scientists on the right course.
The formal definition of the P-value is the probability, computed under a specified mathematical model and hypothesis (usually called the "null hypothesis"), of obtaining a data summary (e.g. an average) at least as extreme as the one actually observed. The problem is that this index, by itself, is not of particular interest. What scientists want is a measure of the credibility of their conclusions, based on observed data. The P-value neither measures that nor is part of a formula that provides it.
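To make the definition concrete, here is a minimal sketch of one standard way a P-value is computed: an exact permutation test for a difference in group means. The data, variable names, and the choice of test are illustrative assumptions, not from the article; the point is only that the P-value answers "how often would a difference at least this extreme arise under the null hypothesis of no group difference?", not "how credible is my conclusion?".

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Exact two-sided permutation P-value for a difference in means.

    Under the null hypothesis that group labels are exchangeable, every
    relabeling of the pooled observations into groups of the original
    sizes is equally likely. The P-value is the fraction of relabelings
    whose absolute mean difference is at least as extreme as the
    observed one.
    """
    pooled = a + b
    n = len(a)
    observed = abs(mean(a) - mean(b))
    indices = range(len(pooled))
    total = 0
    at_least_as_extreme = 0
    for group_a in combinations(indices, n):  # enumerate all relabelings
        total += 1
        ga = [pooled[i] for i in group_a]
        gb = [pooled[i] for i in indices if i not in group_a]
        # small tolerance so the observed split counts despite float error
        if abs(mean(ga) - mean(gb)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Hypothetical data: the two groups are completely separated, so only
# the observed split and its mirror image are as extreme.
treated = [2.1, 2.4, 2.3, 2.2, 2.5, 2.6]
control = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95]
p = permutation_p_value(treated, control)
print(p)  # 2/924 ≈ 0.0022
```

Note that even this small P-value says nothing directly about the probability that the treatment works; it describes the rarity of the data pattern under the null model, which is exactly the gap between what the index provides and what scientists want.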
R. A. Fisher revolutionized statistical inference and experimental design in the 1920s and '30s by establishing a comprehensive framework for statistical reasoning and writing the first statistical best-seller for experimenters. He formalized an approach to inference involving P-values and assessment of significance, based on the frequentist notion of probability, defined in terms of verifiable frequencies of repeatable events. He wanted to avoid the subjectivity of the Bayesian approach, in which the probability of a hypothesis ("inverse probability"), neither repeatable nor observable, was central.
Fisher was a champion of P-values as one of several tools to aid the fluid, inductive process of scientific reasoning, not to substitute for it. Fisher used "significance" merely to indicate that an observation was worth following up, with refutation of the null hypothesis justified only if further experiments "rarely failed" to achieve significance. This is in stark contrast to the modern practice of making claims based on a single demonstration of statistical significance.
Science 3 June 2016; 352: 1180–1181 [Insights section]