Median vs mean vs mode

I do a lot of statistical analysis in my day job. Though my job title is no longer security analyst, I literally analyze computer security issues and make recommendations for a living. You couldn’t study information security when I was in college, because the field barely existed then. My formal training is in journalism. But my journalism degree means I have more formal training in statistics than most people I know. So let’s look at median vs mean vs mode, and when to use each of them.

Median, mean, and mode are three different approaches to trying to answer the same question. Out of all the numbers you collected, what is typical?

As both a security professional and a former journalist, I wish people understood statistics better than they do. Understanding statistics makes it much easier to understand price guides, which helps keep you from getting ripped off. It also makes you a lot harder to lie to. And believe me, there are plenty of vendors, marketers, propagandists, and politicians trying to lie to you.

Median, mean, and mode

When it comes to median vs mean vs mode, all of them attempt to pick out the most typical value in a population. But it can be hard to know which one best measures the most typical value.

Let’s start with simple definitions of median, mean, and mode. Understanding when to use them begins with knowing what they are. Each of them takes a very different approach to solving what seems like a straightforward problem.

Definition of median

The median of a collection of numbers is the value that falls exactly in the middle. If you have an even number of values, it’s the average of the two numbers that fall exactly in the middle.

There is no simple arithmetic formula for the median the way there is for the mean. You arrange the values in order from lowest to highest, then take the middle value. If you’re using a computer, you can calculate it with a spreadsheet, or with a software library if you’re a programmer. The Excel function for calculating a median is simply median(). You can use it one of two ways. You can plug all the numbers you have, separated by commas, into the formula, like this:

=median(3,5,0,1,2,5)

If you have a large collection of numbers, it’s more helpful to paste all of them into a column. Then you can calculate it with a formula like this:

=median(a1:a999)

If you don’t know the exact beginning and end, you can use this:

=median(a:a)

which will calculate the median of all of the non-empty cells in column A.
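If you’d rather see the sort-then-take-the-middle procedure spelled out, here is a minimal Python sketch of it. The function name median_by_hand is my own invention for illustration, and the sample numbers are the same ones used in the Excel formula above. Python’s standard statistics module already provides a median() function, so treat the hand-rolled version as a teaching aid rather than something you need to write yourself.

```python
import statistics

def median_by_hand(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty collection is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[mid]
    # Even count: average the two values that straddle the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

sample = [3, 5, 0, 1, 2, 5]
print(median_by_hand(sample))     # 2.5
print(statistics.median(sample))  # 2.5, same answer from the standard library
```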

Definition of mean

The mean of a collection of numbers is the average value. Mean and average are synonyms.

You can calculate the mean with a mathematical formula. Simply add all the individual values together, then divide the sum by the count.

It’s important to note that you cannot get the mean of multiple means by adding all of the means together and dividing by the count. You have to go back to the original values. This is a mistake I see my peers make a lot. We’ll have several severities, and they’ll average them together and wonder why that number doesn’t match my total average severity. I may have one mean calculated off a population of 2, and another one calculated off a population of 20,000. To get a total mean, I have to calculate off the full population of 20,002. Otherwise, the population of 2 weighs far more heavily than it should.
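Here is a small Python sketch of that mistake. The severity scores are made up purely for illustration, and the group sizes of 2 and 20,000 mirror the example above. It compares the naive mean of means with the mean of the full population, and shows that weighting each group’s mean by its count gets you back to the right answer.

```python
def mean(values):
    """Add everything up, then divide the sum by the count."""
    return sum(values) / len(values)

# Made-up severity scores, for illustration only.
small_group = [10.0, 10.0]    # population of 2, mean 10.0
large_group = [2.0] * 20_000  # population of 20,000, mean 2.0

# Wrong: averaging the two means treats both groups as equally large.
naive = mean([mean(small_group), mean(large_group)])

# Right: go back to the original values, all 20,002 of them.
pooled = mean(small_group + large_group)

# Also right: weight each group's mean by the size of its population.
weighted = (
    mean(small_group) * len(small_group)
    + mean(large_group) * len(large_group)
) / (len(small_group) + len(large_group))

print(naive)               # 6.0
print(round(pooled, 4))    # 2.0008
print(round(weighted, 4))  # 2.0008
```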

Definition of mode

Mode is the statistical measure I encounter least in the real world, and I think that may be a problem. The mode of a collection of numbers is the most frequently occurring value.

As with the median, you cannot calculate the mode with a simple arithmetic formula. It’s tedious to do by hand because you have to tally up all the values, then report the value with the highest tally. But it’s easy to do with a computer, using a spreadsheet or, if you’re a programmer, a software library. The Excel function for calculating the mode is simply mode(). You can use it one of two ways. You can plug all the numbers you have, separated by commas, into the formula, like this:

=mode(3,5,0,1,2,5)

If you have a large collection of numbers, it’s more helpful to paste all of them into a column. Then you can calculate it with a formula like this:

=mode(a1:a999)

If you don’t know the exact beginning and end, you can use this:

=mode(a:a)

which will calculate the mode of all of the non-empty cells in column A.
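The tallying procedure is easy to sketch in Python. The function name modes_by_tally is hypothetical, and the sample is the same one used in the Excel formula above. One thing the sketch makes visible is that a collection can have more than one mode when two values tie for the highest tally, which is why this version returns a list.

```python
from collections import Counter
import statistics

def modes_by_tally(values):
    """Tally every value, then report whichever values have the highest tally."""
    tally = Counter(values)
    if not tally:
        raise ValueError("the mode of an empty collection is undefined")
    highest = max(tally.values())
    return [value for value, count in tally.items() if count == highest]

sample = [3, 5, 0, 1, 2, 5]
print(modes_by_tally(sample))           # [5]
print(statistics.mode(sample))          # 5, same answer from the standard library
print(modes_by_tally([1, 1, 2, 2, 3]))  # [1, 2], a tie between two values
```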

Advantages and disadvantages of median vs mean vs mode

Between median, mean, and mode, I can’t tell you which one is the best because it varies. The advantage of the median is that it pays little attention to outliers, which are extreme values. That’s good when your data has a lot of outliers in it. Consider salaries, for example. Jeff Bezos makes $8.9 million an hour. In Georgia in 2021, the minimum wage is a miserable $5.15 an hour, more than $2 less than the federal minimum wage of $7.25. Wyoming is almost as poorly paid, with a minimum wage of $5.17. Both Jeff Bezos and the 16-year-old who has the misfortune of living in Georgia or Wyoming are outliers, though there are many more minimum-wage workers in Georgia than there are Jeff Bezoses. There’s only one of him.

In 2019, the median US salary was $68,703 a year, while the mean US salary was $51,916 a year. But the most common income, the mode, was between $5,000 and $9,999.

This is why it’s so easy to lie with statistics. I can hide things by reporting one and ignoring the other two.

Also, all three of them share one weakness. All of them require a large enough sample to be meaningful. This is the problem with the advice to check eBay to see what something is worth. Looking at three months’ worth of eBay sales usually doesn’t give you enough sales to draw a reliable conclusion.

Let’s take a look at all three.

Advantages and disadvantages of mean

First, let’s look at the advantages of the mean. It’s easy to calculate, with or without a computer. It considers every value, which usually means extreme values have a more difficult time hiding in the mean than in the median.

But the disadvantage of the mean is that extreme values can skew it. And if the mean is all you know, then you don’t know which extreme values are skewing it harder. I’ll bet most people assume that Jeff Bezos’ crazy wealth skews the average US income. But since I know the median, mean, and the mode of US income, I know that low-wage workers skew the average US income more than the billionaires do.

That said, in places where there isn’t a lot of variance, the mean works extremely well. Baseball statistics are a good example. This is an oversimplification, but a bad baseball player succeeds 20 percent of the time, a good one succeeds 30 percent of the time, and the ones who succeed 35 percent of the time are absolute legends.

Advantages and disadvantages of median

The main advantage of the median is that outliers don’t skew it nearly as much as they skew the mean. Generally speaking, the median gives a more honest measure of the typical value in a population.
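To see how differently the two measures react to a single extreme value, here is a quick Python sketch. The salaries are made up for illustration and are not real figures. Adding one very large value drags the mean far away from everyone else, while the median barely moves.

```python
import statistics

# Made-up annual salaries, for illustration only.
salaries = [30_000, 35_000, 40_000, 45_000, 50_000]

# The same list plus one deliberately extreme outlier.
with_outlier = salaries + [10_000_000]

print(f"{statistics.mean(salaries):,.0f}")        # 40,000
print(f"{statistics.median(salaries):,.0f}")      # 40,000
print(f"{statistics.mean(with_outlier):,.0f}")    # 1,700,000
print(f"{statistics.median(with_outlier):,.0f}")  # 42,500
```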

But the disadvantage of the median is that outliers can hide in it. And when the outliers aren’t evenly distributed, the more numerous outliers influence the median much more than the outliers on the other end. That means low-wage workers in Georgia and Wyoming influence the median much more than the proverbial one percent that contains billionaires and multi-millionaires. The other disadvantage is you can’t calculate it with a simple arithmetic formula. You either need a computer, or a lot of manual work to calculate the median of a large population.

That said, I like the median, because in my field of computer security, we pay too much attention to outliers. This is understandable, since the outliers are where problems are more likely to occur. But if we only look at the outliers, we end up with an intellectually dishonest assessment of the situation. And obviously, I have a computer to slice and dice the data any number of different ways.

Advantages and disadvantages of mode

The mode is a specialized statistic. It tells you what value occurs most frequently. That means it’s less generally useful than the median or the mean. But if you want to really know your data, knowing the mode can give you useful insights.

Depending on where you get your news, you may believe that US salaries exist on a bell curve. If you know the mode, then you know that’s not the case. In my field, people believe that computer security vulnerabilities are evenly distributed based on severity or that they exist on a bell curve. Having analyzed hundreds of millions of vulnerabilities that occurred in the real world, I can tell you they’re skewed toward the top. Knowing that puts me in a better position to solve the problem.

The advantage of the mode is the insight it gives you into the median and the mean. There’s a joke that says knowledge is knowing that tomato is a fruit, and wisdom is not putting it in a fruit salad. If you’re seeking wisdom, calculate the mode and find out why that value is the most frequently occurring value.

The disadvantage of the mode is that if it’s all you know, it doesn’t tell you a lot.

How to know when to use the median vs mean vs mode

To know whether the median or the mean is the more fair measure of normal, you need to know both, plus the mode. You can use the mode to determine which is more fair. The answer of whether to use median vs mean vs mode is to use all three. Then you can decide which one you’ll report, if you can report only one.

If the mode is extremely high or low, then you know you have a lot of outliers. In that case, the median is more fair than the mean.

If the mode isn’t all that extreme but instead appears fairly random, that’s a pretty good indicator that the mean is a fair representation of the typical value.
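If you work in Python rather than a spreadsheet, computing all three at once takes only a few lines. This is a minimal sketch: the function name typical_values and the price list are hypothetical, made up for illustration. In this made-up data the mode sits at the very bottom of the range, and the median lands much closer to most of the values than the mean does.

```python
import statistics

def typical_values(values):
    """Report the mean, the median, and every mode of a collection."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # a list, since modes can tie
    }

# Made-up sale prices, for illustration only.
prices = [5, 5, 5, 5, 6, 7, 8, 40, 45]

summary = typical_values(prices)
print(f"{summary['mean']:.1f}")    # 14.0
print(f"{summary['median']:.1f}")  # 6.0
print(summary["modes"])            # [5]
```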

If you really want to know your data, look at how many outliers exist on both extremes. If your outliers are skewed, that’s important to know. I’m under the impression that people believe most things exist on a bell curve. That may or may not be true, but some things definitely don’t. And if you’re interested in solving a problem, knowing how your statistics are distributed definitely helps you know where to start. It also helps you demonstrate a knowledge of your data that simply reporting the median or the mean won’t demonstrate on its own.
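One way to count the outliers on each extreme is sketched below in Python. It rests on an assumption I’m adding, not something from the discussion above: it treats a value as an outlier when it falls more than 1.5 times the interquartile range below the first quartile or above the third quartile, which is a common rule of thumb but only one of several possible definitions. The function name count_outliers and the data are made up for illustration.

```python
import statistics

def count_outliers(values, k=1.5):
    """Count values beyond k interquartile ranges below Q1 and above Q3.

    The 1.5 x IQR cutoff is an assumed rule of thumb, not the only
    reasonable definition of an outlier.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    low = sum(1 for v in values if v < low_fence)
    high = sum(1 for v in values if v > high_fence)
    return low, high

# Made-up data with one low outlier and two high ones.
data = [1, 20, 21, 22, 23, 24, 25, 26, 27, 90, 95]

low, high = count_outliers(data)
print(low, high)  # 1 2
```

If the two counts are lopsided, your outliers are skewed toward one end, and that is worth reporting alongside whichever typical value you choose.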

And no matter what your field, whether it’s computer security, public policy, or something else, understanding the data that relates to the problem you want to solve is one of the most important things you can do if you want results. Otherwise you can easily make the problem worse, in spite of your good intentions. If you were to ask my current and former coworkers about me, they may or may not like me very much. But most of them will say I got results.
