AFFINITY - Stats Made Simple Part One – What the Marathon Can Teach Us About Normal Distribution

Every fortnight in The Lab, we host a presentation on an interesting topic. Sometimes we bring in experts, but we also have quite a few experts of our own. Past presentations have covered everything from Above The Line Media & Data to Workplace Mental Health. This week, one of our very own big thinkers, Caspar Yuill, presented to the team on the Dark Art of Statistics: what they are, how they work, and what problems they can solve. In part one of a three-part series, Caspar explains why something most of us will remember from high school has far-reaching applications for marketers.

The field of statistics can be confusing. I should know: as a psychology student who only scraped by the first time, I found it arcane and impenetrable. But I soon realised it didn’t have to be. Once I started applying my knowledge, I quickly developed a deeper appreciation for the power of statistics, particularly in solving the sorts of problems strategists come up against every day.

One of the fundamental principles of statistics is the Normal Distribution. More popularly known as the Bell Curve, it helps us understand what our average is, where our outliers are sitting (and what we can learn from them), and why it’s important to test and learn.

The Normal Distribution

In the 1904 Olympics, the winner of the marathon completed the roughly 40km course in 3 hours, 28 minutes, and 53 seconds. More than a century later at the 2016 Rio Olympics, Eliud Kipchoge ran the now-standard 42.195km in 2 hours, 8 minutes, and 44 seconds (and subsequently broke the 2hr mark in 2019, in a specially staged event that did not count as an official record).

Which raises the question: why have athletes gotten faster? The normal distribution has the answer.

A normal distribution is a way of looking at a collection of data. It plots the size or quality of a certain thing along the horizontal axis, and the number of things that have that size or quality on the vertical axis, with the average sitting in the middle. It looks like this:

Normal Distribution Graph

The remarkable thing about the normal distribution is that it describes, at least approximately, a great many naturally occurring properties of a population. Graph the height of the world’s population, for example, and you’ll have the majority of people in the middle (the average), and ever-dwindling quantities of taller and shorter people as we get further out, until we reach the ends of the graph, where we have the outliers who are remarkably tall or remarkably short.

And it’s not just nature. The same principle can apply to everything from the performance of a series of marketing campaigns, to the size of customer home loans.

We describe the spread of a data set as its variation. For example, if we took the household incomes of a suburb in Sydney, we’d see some variation, but most of them would be clustered relatively close to the average. If we took the income per household for the entire world, the data would be much more spread out, and we’d have far more outliers.

We can describe a data set’s variation in mathematical terms by its standard deviation: roughly, how much the members of the group typically differ from the average. A low standard deviation indicates the data is clustered close to the average, like the incomes of a suburb. A high standard deviation indicates the data is spread out.
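To make the idea concrete, here is a minimal Python sketch comparing two small samples that share the same average but have very different spreads. The income figures are invented purely for illustration (they are not real data), and the helper name `describe` is our own.

```python
from statistics import mean, pstdev

# Hypothetical weekly household incomes in dollars, invented for illustration.
suburb_incomes = [1800, 1900, 2000, 2100, 2200]  # tightly clustered
world_incomes = [200, 600, 2000, 3400, 3800]     # widely spread

def describe(values):
    """Return the average and the population standard deviation."""
    return mean(values), pstdev(values)

suburb_avg, suburb_sd = describe(suburb_incomes)
world_avg, world_sd = describe(world_incomes)

print(f"Suburb: average {suburb_avg:.0f}, standard deviation {suburb_sd:.0f}")
print(f"World:  average {world_avg:.0f}, standard deviation {world_sd:.0f}")
```

Both samples average 2,000, but the second has a standard deviation roughly ten times larger, which is the "spread out" case described above.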

The uncanny thing about the normal distribution is that we can predict what share of the data points falls within each standard deviation of the average. As the graph below shows, roughly 68% of the population falls within one standard deviation, 95% within two, and 99.73% within three.

68-95-99.7 Rule Graph

Source: towardsdatascience
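If you would like to check those percentages yourself, the sketch below does so with `NormalDist` from Python's standard library. The helper name `share_within` is our own, and the calculation assumes a perfectly normal distribution, which real data only ever approximates.

```python
from statistics import NormalDist

# Standard normal distribution: average 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal population within k standard deviations of the average."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.2%}")
```

This prints 68.27%, 95.45% and 99.73%, the figures behind the 68-95-99.7 rule.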

To put that in perspective, let’s look at average male height globally. The average adult male height is 177.8cm, with a standard deviation of 10.2cm. That means roughly 68% of all men (2.65 billion) are between 167.6cm and 188cm. Expand to two standard deviations, and 95% of all men (3.7b) are between 157.4cm and 198.2cm. Once we hit three standard deviations, you guessed it: 99.73% of men (3.89b) are between 147.2cm and 208.4cm. Seven-footers (213.4cm) sit beyond even that range, which is why they are so rare: there may only be 2,800 of them on the planet.
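The height bands above can be reproduced with a few lines of Python. This is a rough sketch: the average and standard deviation are the figures quoted in this article, while the male population of 3.9 billion is an assumption on our part (it is the total implied by the 3.7b and 3.89b headcounts). The function name `band` is hypothetical.

```python
from statistics import NormalDist

# Average and standard deviation as quoted in the article.
AVERAGE_CM = 177.8
STD_DEV_CM = 10.2
# Assumption: total male population implied by the article's headcounts.
MALE_POPULATION = 3.9e9

heights = NormalDist(mu=AVERAGE_CM, sigma=STD_DEV_CM)

def band(k):
    """Height range, share and headcount within k standard deviations."""
    low = AVERAGE_CM - k * STD_DEV_CM
    high = AVERAGE_CM + k * STD_DEV_CM
    share = heights.cdf(high) - heights.cdf(low)
    return low, high, share, share * MALE_POPULATION

for k in (1, 2, 3):
    low, high, share, men = band(k)
    print(f"{k} SD: {low:.1f}cm to {high:.1f}cm, {share:.2%}, about {men / 1e9:.2f}b men")
```

Changing the three constants at the top lets you run the same calculation for any roughly normal measure, such as order values or session lengths.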

Which brings us back to our marathon question: why do records keep falling? David Epstein, author of The Sports Gene, tackled this very question. He concluded that while factors like technology had affected performance, of greater importance was the growth in the possible talent pool. In the early 1900s, television didn’t exist, radio wasn’t widespread, and the only thing going viral was the Spanish Flu. Today, far more of the world’s population can access sports information and participate, resulting in a far larger talent pool. With this comes the possibility of more outliers, like Eliud Kipchoge, completing a feat most people thought impossible.
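The talent pool argument is easy to sketch in code. The example below assumes, purely for illustration, that running ability is normally distributed, and asks how many people we would expect to find more than four standard deviations above the average as the pool grows. The pool sizes and the four standard deviation threshold are hypothetical choices of ours, not figures from Epstein's book.

```python
from statistics import NormalDist

standard_normal = NormalDist()

def expected_outliers(pool_size, threshold_sd):
    """Expected number of people more than threshold_sd standard
    deviations above the average, in a pool of pool_size people."""
    tail_share = 1 - standard_normal.cdf(threshold_sd)
    return pool_size * tail_share

# Hypothetical pool sizes, for illustration only.
for pool_size in (100_000, 10_000_000, 1_000_000_000):
    count = expected_outliers(pool_size, threshold_sd=4)
    print(f"Pool of {pool_size:,}: about {count:,.1f} people beyond 4 standard deviations")
```

The share of extreme performers stays the same (roughly 3 in every 100,000), so the expected number of them grows in direct proportion to the size of the pool.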

Applying this idea to marketing can also be powerful. It explains why you should keep running a campaign that is performing above historic benchmarks, instead of tinkering with it in the hope it will do better. (In fact, it will probably perform worse thanks to regression to the mean, but that’s for another day.) It helps you gauge how much of an outlier your best customers are, alongside rules of thumb like the expectation that 80% of revenue comes from 20% of your customers. And it tells us the value of experimenting with different creative in the early stages of a campaign (especially a digital one), as you’re far more likely to find an outlier that performs well above average when you test 3-5 variants than when you run just one.
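The case for testing several creative variants can be put in numbers. The sketch below assumes each variant's performance is an independent draw from the same normal distribution, and it treats "more than one standard deviation above average" as the bar for a standout. Both are simplifying assumptions of ours, and the function name `chance_of_standout` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()

def chance_of_standout(variants, threshold_sd=1.0):
    """Probability that at least one of `variants` independent creatives
    scores more than threshold_sd standard deviations above average."""
    below = standard_normal.cdf(threshold_sd)  # chance a single variant falls short
    return 1 - below ** variants

for variants in (1, 3, 5):
    print(f"{variants} variant(s): {chance_of_standout(variants):.0%} chance of a standout")
```

Under these assumptions, a single variant has about a 16% chance of being a standout, while testing three lifts the odds to about 40% and five to about 58%.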

In Part Two, we’ll look at the Experimental Paradigm, and how statistical tools have been designed around it.