Statistical P values may have their flaws, but they’re unlikely to go away anytime soon. That was the observation of two Clinical Chemistry associate editors, James C. Boyd, MD, and Thomas M. Annesley, PhD, in their analysis of an article published earlier this year in the journal Nature. The Boyd-Annesley analysis appears in the July issue of Clinical Chemistry.

The Nature article, written by Regina Nuzzo, describes how the P value arose from competing models of several well-known statisticians. Ronald Fisher, a statistician from the United Kingdom, initially proposed the P value as a way to judge informally whether evidence needed a second look; “he did not mean it to be a definitive test,” Nuzzo wrote. Fisher’s idea essentially was to see if experiment results “were consistent with what random chance might produce,” she explained.
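Fisher's question, whether a result is consistent with what random chance might produce, can be made concrete with a small calculation. The sketch below is an illustration only, not taken from Nuzzo's article or the Clinical Chemistry analysis: the function name `binomial_p_value` and the coin-flip numbers are hypothetical, and it assumes a simple one-sided binomial setting.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float = 0.5) -> float:
    """One-sided P value for a binomial experiment.

    Returns the probability of seeing at least `successes` hits in `trials`
    attempts if only random chance (probability `chance` per attempt) is at work.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical example: 16 heads in 20 flips of a coin assumed to be fair.
p = binomial_p_value(16, 20)
print(f"P = {p:.4f}")  # about 0.0059
```

A small P value here says only that such a lopsided result would be rare for a fair coin. In Fisher's informal sense, that makes it a prompt for a second look rather than a verdict.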

Fisher’s approach contrasts with an alternative introduced by U.K. statistician Egon Pearson and Polish mathematician Jerzy Neyman that incorporated false negatives, false positives and statistical power. Researchers who wrote statistics manuals for working scientists, but who lacked a firm grasp of either approach, ended up creating a hybrid of the two models, cramming “Fisher’s easy-to-calculate P value into Neyman and Pearson’s reassuringly rigorous rule-based system,” Nuzzo wrote. The result was the “seemingly indiscriminate use of a P value of <0.05 to establish statistical significance,” Boyd and Annesley noted in their Clinical Chemistry article.

P values have been seen as a gold standard for statistical validity, yet critics have questioned this method’s reliability. “Nuzzo points out that P values ignore the underlying plausibility that a real effect exists in the first place, and thereby, they often lead to false-positive findings that cannot be replicated in subsequent experiments,” wrote Boyd and Annesley. “In addition, P values fail to take the size of an effect into account, sometimes attracting attention to very small effects that have little real-world importance,” they added.
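Both criticisms can be illustrated with back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration, not a calculation from either article: the function names, the 80% power figure, the prior plausibility values and the effect sizes are all hypothetical, and the second function assumes a two-sided z-test on a standardized difference between two equal-sized groups.

```python
from math import erfc, sqrt


def false_positive_share(prior: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Share of 'significant' findings that are false positives.

    `prior` is the assumed fraction of tested hypotheses that are really true,
    `alpha` the significance threshold, `power` the chance of detecting a real effect.
    """
    false_pos = alpha * (1 - prior)  # true nulls that still cross the threshold
    true_pos = power * prior         # real effects that are detected
    return false_pos / (false_pos + true_pos)


def two_sample_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided P value for an observed standardized mean difference
    between two equal-sized groups (z-test, variance assumed known)."""
    z = abs(effect_size) * sqrt(n_per_group / 2)
    return erfc(z / sqrt(2))  # equals 2 * (1 - Phi(z))


# Plausibility: the same P < 0.05 threshold, different prior odds.
print(false_positive_share(prior=0.5))  # about 0.06
print(false_positive_share(prior=0.1))  # about 0.36

# Effect size: a tiny effect becomes "significant" once the sample is huge.
print(two_sample_p_value(0.02, 100))      # far above 0.05
print(two_sample_p_value(0.02, 100_000))  # far below 0.05
```

Under these assumed numbers, a long-shot hypothesis that clears P < 0.05 is still wrong about a third of the time, and an effect too small to matter in practice can produce a very small P value simply because the sample is large.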

P values nevertheless are a staple in scientific literature and, as Clinical Chemistry’s editors point out, have endured despite the arguments that call for their removal.

Statisticians have sought alternatives. As an example, many “advocate replacing the P value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity — something that the statistical pioneers were trying to avoid,” Nuzzo’s article stated. “But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.”
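The updating that Nuzzo describes can be sketched in a few lines. This is a minimal illustration of Bayes' rule for a single yes-or-no hypothesis, not a method proposed in either article; the function name `bayes_update`, the starting plausibility of 0.2 and the likelihoods are all hypothetical.

```python
def bayes_update(prior: float, p_data_if_true: float, p_data_if_false: float) -> float:
    """Apply Bayes' rule once.

    Returns the updated plausibility of a hypothesis after seeing data that has
    probability `p_data_if_true` when the hypothesis holds and
    `p_data_if_false` when it does not.
    """
    numerator = prior * p_data_if_true
    evidence = numerator + (1 - prior) * p_data_if_false
    return numerator / evidence


# Hypothetical example: a hypothesis judged 20% plausible at the outset, followed
# by three supportive results, each assumed four times likelier if it is true.
belief = 0.2
for study in range(1, 4):
    belief = bayes_update(belief, p_data_if_true=0.8, p_data_if_false=0.2)
    print(f"after study {study}: {belief:.2f}")  # 0.50, then 0.80, then 0.94
```

The subjectivity Nuzzo mentions sits in the starting value: a different observer could begin at a figure other than 0.2 and reach a different conclusion from the same evidence.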

Recognizing that conventional statistics can only go so far, Nuzzo’s article encouraged researchers to “exercise scientific judgment about the plausibility of a given hypothesis and underlying study limitations,” the Clinical Chemistry analysis stated.

Boyd and Annesley concurred with the recommendations in Nuzzo’s article. “Although it is hard to be proscriptive when it comes to using statistical methods, we encourage potential authors to have their statistical analyses reviewed by professional statisticians tuned to the strengths and weaknesses of the statistical approaches used,” they advised.