Recently there has been a great deal of controversy over the inability to reproduce published results, especially in fields such as the social and biomedical sciences. Now a group of 72 prominent researchers [see attached article] is pointing to one cause of the problem: weak statistical standards of evidence for claiming new discoveries. Statistical significance is most commonly judged by the P-value, which is used to test (and dismiss) a ‘null hypothesis’ that the effect being tested for does not exist. The smaller the P-value for a set of results, the less likely it is that those results are due to chance alone. Results are almost always deemed ‘statistically significant’ when this value falls below 0.05, a threshold loosely interpreted as a one-in-twenty chance of a false positive.
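A quick simulation (illustrative, not from the article) makes the 0.05 threshold concrete: when two groups are drawn from the same distribution, so that the null hypothesis is true by construction, a t-test at P < 0.05 still declares a ‘significant’ difference about 5% of the time. All sample sizes and seeds below are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 20_000

# Two groups drawn from the SAME distribution: any apparent effect is noise.
a = rng.normal(0.0, 1.0, size=(n_experiments, 30))
b = rng.normal(0.0, 1.0, size=(n_experiments, 30))
_, pvals = stats.ttest_ind(a, b, axis=1)

# Fraction of null experiments that clear the conventional cutoff.
rate = float(np.mean(pvals < 0.05))
print(f"False-positive rate at alpha = 0.05: {rate:.3f}")  # close to 0.05
```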

However, many scientists are concerned that this threshold lets too many false positives into the literature — a problem exacerbated by a practice called “P hacking”, in which researchers gather data without first specifying a hypothesis to test, and then hunt for patterns in the results that can be reported as statistically significant. So, in a provocative manuscript posted on the PsyArXiv preprint server on 22 July, Benjamin et al. argue that the P-value threshold should be lowered to 0.005 for the social and biomedical sciences, which would cut the chance of a false positive to one in 200. The final paper will be published in Nat Hum Behav.
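The arithmetic behind the P-hacking worry is simple (this calculation is my illustration, not from the article): if a researcher can try k different analyses under a true null and report only the best one, the chance that at least one reaches P < alpha is 1 − (1 − alpha)^k, which grows quickly with k.

```python
alpha = 0.05

# Chance of at least one "significant" result among k independent looks
# at data where no real effect exists.
for k in (1, 5, 20):
    chance = 1 - (1 - alpha) ** k
    print(f"{k:2d} analyses -> chance of a spurious hit: {chance:.2f}")
```

With 20 looks at the data, the odds of a spurious ‘discovery’ are close to two in three, not one in twenty.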

One problem with decreasing P-value thresholds is that it may increase the odds of a false negative — that is, concluding that an effect does not exist when in fact it does. To counter this, Benjamin and his colleagues suggest that researchers increase sample sizes by 70%; they say this would avoid raising the rate of false negatives while still dramatically reducing the rate of false positives. In practice, though, only well-funded scientists would have the means to do so.
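The 70% figure can be checked with a standard normal-approximation power formula (my sketch, assuming a two-sided two-sample test at 80% power; the paper's own calculation may differ in detail). Required sample size scales as (z_{α/2} + z_β)², so the ratio of sizes at α = 0.005 versus α = 0.05 is independent of the effect size.

```python
from scipy.stats import norm

power = 0.80
z_beta = norm.ppf(power)           # quantile for 80% power
z_05 = norm.ppf(1 - 0.05 / 2)      # two-sided critical value at alpha = 0.05
z_005 = norm.ppf(1 - 0.005 / 2)    # two-sided critical value at alpha = 0.005

# Sample-size inflation needed to keep power fixed at the stricter threshold.
ratio = ((z_005 + z_beta) / (z_05 + z_beta)) ** 2
print(f"Sample-size inflation: {ratio:.2f}x, i.e. about {(ratio - 1) * 100:.0f}% more subjects")
```

The ratio comes out near 1.7, matching the authors' 70% figure.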

Argamon, a computer scientist at the Illinois Institute of Technology in Chicago, says there is no simple answer to the problem, because “no matter what confidence level you choose, if there are enough different ways to design your experiment, it becomes highly likely that at least one of them will give a statistically significant result just by chance”. Two decades ago, geneticists recognized the importance of multilocus and perturbation testing, and established a threshold of 5 × 10⁻⁸ for genome-wide association studies (GWAS), which search across hundreds of thousands of DNA-nucleotide variants for differences between people with a disease and those without.
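As a back-of-the-envelope check (the one-million-variant count is a common assumption, not a figure from the article), the GWAS cutoff of 5 × 10⁻⁸ is what a Bonferroni-style correction of the usual 0.05 threshold gives for roughly a million independent tests.

```python
# Bonferroni-style correction: divide the overall error budget by the
# assumed number of independent tests.
m = 1_000_000
corrected = 0.05 / m
print(f"Per-variant threshold: {corrected:.0e}")

# With the stricter per-test cutoff, the chance of ANY false positive
# across all m tests stays near 0.05 instead of approaching certainty.
family_wise = 1 - (1 - corrected) ** m
print(f"Family-wise false-positive risk: {family_wise:.3f}")
```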

Still other scientists have abandoned P-values in favor of more sophisticated statistical tools, such as Bayesian tests, which require researchers to define and compare two alternative hypotheses. Of course, not all researchers will have the technical expertise to carry out Bayesian tests. And P-values can still be useful for gauging whether a hypothesis is supported by evidence; the P-value by itself is not necessarily evil.
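A toy example of the Bayesian approach (my illustration; the article does not specify any particular test): a Bayes factor for n coin flips with k heads, comparing H0 “the coin is fair” against H1 “the coin has an unknown bias with a uniform prior”. Under a uniform Beta(1, 1) prior, the marginal likelihood of the data under H1 is 1/(n + 1); under H0 it is C(n, k) · 0.5ⁿ.

```python
from math import comb

def bayes_factor(k: int, n: int) -> float:
    """BF10: evidence for 'biased coin' (H1) over 'fair coin' (H0)."""
    m1 = 1.0 / (n + 1)           # marginal likelihood, uniform prior on bias
    m0 = comb(n, k) * 0.5 ** n   # likelihood under the point null p = 0.5
    return m1 / m0

# 60 heads in 100 flips: the data barely discriminate between the hypotheses.
print(f"60/100 heads: BF10 = {bayes_factor(60, 100):.2f}")
# 90 heads in 100 flips: overwhelming evidence for a biased coin.
print(f"90/100 heads: BF10 = {bayes_factor(90, 100):.2e}")
```

Unlike a bare P-value, the Bayes factor quantifies how much the data favor one hypothesis over the other.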

Nature 3 Aug 2017; 548: 16–17