Back in the 1940s, economist Milton Friedman was part of the war effort. He performed statistical studies on high-temperature alloys for jet engines. Friedman, who eventually won a Nobel Prize in economics, used regression analysis on data about alloy strength versus temperature. His statistics predicted that a couple of as-yet-untried alloys would last about 200 hours, noteworthy because those tried thus far had failed after only about 20 hours.

Surprise: When metallurgists cooked up the new alloys, the samples went to pieces in less than 3 hours.

The lesson in Friedman’s experience is that you can’t derive engineering facts from statistics alone. That is a point amplified by Steve Ziliak, an economics professor at Roosevelt University who coauthored a book called *The Cult of Statistical Significance*. Ziliak is among a number of researchers who warn that statistical significance, as given by Student’s t-test and p-values, is sometimes misused as a proxy for important scientific results. And as Milton Friedman discovered early in his career, reliance on statistics alone can lead to astoundingly bad conclusions.

Ziliak says confusion about statistical significance is widespread even among researchers who should know better. He reached this conclusion by combing through papers published in a number of prestigious economics, operations research, and medical journals. He found numerous instances of researchers who used statistical significance as if it were the same as correlation. “They confuse the probability measure with a measure of correlation or effect size. But they are two very different things. It is almost embarrassing because it is such an elementary point,” he says.

Ziliak’s discovery is much more than just pedantic statistical minutiae. In medical research, for example, confusion about significance levels can lead to rejecting good drugs in favor of others that have less oomph. “Suppose you have two diet pills which differ only in the size of the effect they have on dieters,” he says. “One pill takes off 20 pounds, plus or minus 10. The other takes off 5 pounds, plus or minus a half pound. Ninety percent of scientists in medicine would choose the second pill because they think its effects are more significant, though the first pill takes off more weight. That’s because the first pill has a signal-to-noise ratio of just two (20/10) while the second pill’s ratio is 10 (5/0.5).”
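The arithmetic behind Ziliak’s two pills is easy to check for yourself. Here is a rough sketch in Python, offered only as an illustration: the `Pill` class and its field names are made up for this example, and the "plus or minus" figures are treated as the noise term exactly as the quote does, with no assumption about sample sizes or which test a researcher might run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pill:
    """Hypothetical record for one of the two diet pills in the quote."""
    name: str
    mean_loss_lb: float  # average weight taken off, in pounds (the effect size)
    spread_lb: float     # the "plus or minus" figure, in pounds (the noise)

    @property
    def signal_to_noise(self) -> float:
        # Ratio of effect size to its spread, as in the quote (20/10 and 5/0.5).
        return self.mean_loss_lb / self.spread_lb


pills = [
    Pill("first pill", mean_loss_lb=20.0, spread_lb=10.0),
    Pill("second pill", mean_loss_lb=5.0, spread_lb=0.5),
]

# Ranking by signal-to-noise rewards the most precisely measured effect.
best_by_ratio = max(pills, key=lambda p: p.signal_to_noise)

# Ranking by effect size rewards the pill that takes off the most weight.
best_by_effect = max(pills, key=lambda p: p.mean_loss_lb)

for p in pills:
    print(f"{p.name}: loses {p.mean_loss_lb} lb, signal-to-noise {p.signal_to_noise}")
print("Chosen by signal-to-noise:", best_by_ratio.name)
print("Chosen by effect size:", best_by_effect.name)
```

The two rankings disagree. Sorting on the ratio picks the second pill, while sorting on pounds lost picks the first, which is the whole of Ziliak’s point in two lines of code.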

The irony is that researchers interested in losing weight would likely have no trouble picking the pill that was more effective, low signal-to-noise or not. People can effortlessly solve a problem in a social setting but struggle when it is presented as an abstract dilemma.

Interestingly enough, engineering research tends to be free of such misconceptions. “One reason engineers didn’t go down this path is that they use Monte Carlo and other types of simulations as well as different quantitative methods that don’t require inferential statistics,” says Ziliak. “And even when engineers do use inferential statistics, their practices have been shaped by people like W. Edwards Deming and other engineers who were around at the birth of modern statistics. Deming in particular saw that significance testing was not going to be relevant for most engineering purposes.”

All in all, if you find yourself wondering why the latest economic theories seem to work no better than Milton Friedman’s high-temperature-alloy predictions, consider the possibility that they were hatched by someone unable to recognize an effective diet pill from its statistics.

*— Leland Teschler, Editor*