Statistical factoids which are actually true

I have often discussed with my students the hilarity that ensues when one conflates correlation with causation. The lesson, of course, is that the two should never be confused; otherwise you can conclude things like “World hunger is caused by a lack of television sets.” Get it? Our first world has lots of TV sets and not much hunger. The third world lacks TV sets and has a great deal of hunger. So, the solution is to ship our TV sets to the third world, and world hunger will be eradicated. While I don’t have the numbers, I would assume the correlation exists. Whether causation exists is a whole other matter, and is much more difficult to prove.

So, to explain the title of this blog, I mean “actually true” in the sense that the correlations are real, but not necessarily the causation.

I am proposing here to make a few (not too many) posts in honor of a website called Spurious Correlations.

Currently, its front page reassures us that while government spending on science and technology correlates positively, and strongly, with death by suicide, we ought to stop short of curbing spending on science research, as some of that science might be about how to reduce the incidence of suicide.

Further down that page, I am not sure what to make of the high correlation shown between per capita cheese consumption and the number of people who died after becoming tangled in their bedsheets. Or of the relationship between the number of people who fell out of their fishing boat and drowned, and the marriage rate in Kentucky.

But one of the perks of the website is its ability to conjure up statistics based on user choices. Here is the result of some playing around I did on their website:

They appear to prefer to show all of their graphs as time series, which still show the data more or less rising and falling together, but a scatterplot makes a linear correlation easier to see. They offered the data that was used to plot this graph, and I was able to use that data to make my own scatterplot relating the two variables to each other rather than to time, showing that the data have the same r value:
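The r value survives the move from time series to scatterplot because Pearson's r is computed from the paired values alone; the year never enters the formula. Here is a minimal Python sketch of that idea. The numbers are made up for illustration and are not the website's data, and `pearson_r` is my own helper, not anything the website provides.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical yearly values (NOT the website's data).
series_a = [10.0, 12.0, 13.5, 15.0, 16.0, 18.5]
series_b = [3.1, 3.4, 3.9, 4.1, 4.6, 5.0]

# r with the years in their original order.
r_in_time_order = pearson_r(series_a, series_b)

# Scramble the order of the years, keeping each year's pair together.
pairs = list(zip(series_a, series_b))
random.Random(0).shuffle(pairs)
shuffled_a, shuffled_b = zip(*pairs)
r_shuffled = pearson_r(shuffled_a, shuffled_b)

# The two values agree up to floating-point rounding.
print(r_in_time_order, r_shuffled)
```

Scrambling the years changes what a time-series plot looks like, but it leaves the scatterplot, and therefore r, unchanged.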

Now I can feel confident in saying that if there are, say, a total of 84,000 deaths due to cancer on the 52 Thursdays of any given year, there will also be 15,650 lawyers practising in Tennessee that year. I have also worked out that if we got rid of all of the Tennessee lawyers, we would save the lives of just over 20,000 cancer patients per year. Isn’t statistics great?

When you take the square root of the coefficient of determination, you get 0.971299, which agrees with the r value offered on their website. The data, according to the website, originate from the Centers for Disease Control and the American Bar Association.
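For anyone who wants to check the square-root relationship for themselves: with a straight-line least-squares fit of one variable on another, the coefficient of determination equals r squared, so its square root gives the size of r (the sign comes from the slope of the line). A minimal sketch follows, again using made-up numbers rather than the website's data; all variable names are my own.

```python
import math

# Hypothetical paired observations (NOT the website's data).
xs = [10.0, 12.0, 13.5, 15.0, 16.0, 18.5]
ys = [3.1, 3.4, 3.9, 4.1, 4.6, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line: y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Coefficient of determination: 1 - (residual SS / total SS).
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy

# Pearson's r computed directly from the pairs.
r = sxy / math.sqrt(sxx * syy)

# sqrt(R^2) matches |r|; attach the slope's sign to recover r itself.
r_from_r_squared = math.copysign(math.sqrt(r_squared), slope)
print(r, r_from_r_squared)
```

This only holds for a straight-line fit with an intercept; for other kinds of fitted curves, the square root of the coefficient of determination is not Pearson's r.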