Explainer: Correlation, causation, coincidence and more

Statistics don’t always say what people think they might

Over a 10-year period, Americans’ fondness for margarine correlated strongly with the divorce rate in Maine. Yet there’s no reason to think one caused the other. It’s an instance of two unrelated data sets showing a coincidental pattern.

Tyler Vigen/“Spurious Correlations”/(CC BY 4.0)

Eating more mozzarella cheese shouldn’t make engineering schools hand out more diplomas. Yet between 2000 and 2009, the more mozzarella that Americans downed, the more doctorates in civil engineering that U.S. universities awarded. Over a 10-year period, as levels of one went up, so did the other. The two showed a strong positive correlation. Yet almost certainly this happened by coincidence. One did not cause the other.

This is a cheesy example. Still, it makes an important point about statistics: Correlation is not the same thing as causation. Correlation means only that two things rise or fall together. Causation means that one of them actually brings about the other.
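To make "strong positive correlation" concrete, here is a minimal Python sketch of the standard Pearson correlation coefficient, a number that runs from -1 to 1. The two series are invented for illustration; they are not the real mozzarella or doctorate figures, and the names `series_a` and `series_b` are placeholders.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much the two series move together around their averages.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each series moves on its own.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)

# Invented yearly values (assumption: illustration only, not real data).
series_a = [1.0, 1.2, 1.1, 1.5, 1.7, 1.6, 2.0, 2.3, 2.2, 2.5]
series_b = [40, 43, 45, 44, 50, 53, 52, 58, 60, 61]

r = pearson_r(series_a, series_b)
print(f"correlation: {r:.2f}")  # a value close to 1 means the series rise together
```

A value near 1 says only that the two lists climb together. Nothing in the calculation knows, or can tell, whether one list has anything to do with the other.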

Another complication: Many events or trends can have multiple causes. And sometimes two variables might both be due to a third factor. All of this can sometimes confuse, or confound, a statistical study. (Statistics involves collecting and analyzing numerical data in large quantities and interpreting their meaning.)
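A small simulation can show how a third factor produces a correlation between two things that never touch each other. This is a sketch under made-up assumptions: `hidden`, `a` and `b` are hypothetical variables drawn from random numbers, not measurements of anything.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)

random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# A hidden third factor drives both a and b. Neither a nor b affects the other.
hidden = [random.gauss(0, 1) for _ in range(n)]
a = [h + random.gauss(0, 1) for h in hidden]
b = [h + random.gauss(0, 1) for h in hidden]

# Across everyone, a and b look clearly related.
r_all = pearson_r(a, b)

# Now look only at cases where the hidden factor is nearly the same.
band = [i for i in range(n) if abs(hidden[i]) < 0.1]
r_band = pearson_r([a[i] for i in band], [b[i] for i in band])

print(f"all cases:           r = {r_all:.2f}")
print(f"hidden factor fixed: r = {r_band:.2f}")
```

Once the hidden factor is held roughly constant, the apparent link between `a` and `b` should mostly disappear. That is the signature of confounding.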

Experiments can rule out such other, or confounding, causes by comparing a test group with a control group that differs in only one way. But running such an experiment is not always possible or ethical. For example, researchers would not want to expose children to toxic chemicals just to see what bad effects might follow.

Fortunately, statistics offers mathematical tools that can account for possible confounders. That allows scientists to see how much a change in one variable might be linked to differences in something else.

Researchers built such a tool into their computer model for a recent study about lead. The model had data about lead in children’s blood and scores on a third-grade test. The researchers wanted to look for any link between those two variables. In addition, the model had data on family income, ethnicity and other things.

The statistical tool used math to account for possible effects from those other factors. That let the model measure just the relationship between lead and test scores. Compared with children with no lead poisoning, children with even low levels of lead in their blood were more likely to fail the reading and math portions of the test. Environmental Health published the research on April 7, 2015.
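One common way to account for another factor is regression adjustment. The Python sketch below illustrates that general idea only. It is not the lead study's model and it uses no real data: `exposure`, `outcome`, `covariate` and the planted `TRUE_EFFECT` are all hypothetical. The trick it uses is to strip the covariate's straight-line influence out of both variables, then measure what relationship remains.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Least-squares slope of ys on xs (a straight-line fit with an intercept)."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def residuals(ys, xs):
    """What is left of ys after removing its best straight-line fit on xs."""
    b = slope(xs, ys)
    mx, my = mean(xs), mean(ys)
    return [y - (my + b * (x - mx)) for x, y in zip(xs, ys)]

random.seed(1)  # fixed seed so the run is repeatable
n = 5000
TRUE_EFFECT = -2.0  # assumption: the effect planted in this simulated data

# The covariate influences both the exposure and the outcome.
covariate = [random.gauss(0, 1) for _ in range(n)]
exposure = [-1.0 * c + random.gauss(0, 1) for c in covariate]
outcome = [TRUE_EFFECT * e + 3.0 * c + random.gauss(0, 1)
           for e, c in zip(exposure, covariate)]

# Ignoring the covariate mixes its influence into the estimate.
naive = slope(exposure, outcome)

# Adjusting: remove the covariate from both sides, then fit what remains.
adjusted = slope(residuals(exposure, covariate), residuals(outcome, covariate))

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}  (planted effect: {TRUE_EFFECT})")
```

In this simulation the naive estimate overstates the effect, while the adjusted one should land near the planted value. Real studies typically adjust for several factors at once and can only adjust for confounders they have measured.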

Kathiann Kowalski reports on all sorts of cutting-edge science. Previously, she practiced law with a large firm. Kathi enjoys hiking, sewing and reading. She also enjoys travel, especially family adventures and beach trips.