A former colleague contacted me some time ago with an interesting conundrum. I thought his problem and the solution would be worth sharing, because the situation is not at all uncommon. He manages a network of, let’s say, 22,000 computers. But he has licenses to scan only 8,800 of them. The question is, what can he do?

Don’t get me wrong. Having incomplete scan data limits your options from a security perspective. If something is happening on a particular system, you really want a recent scan of that exact system so that you can figure out what your adversary has to work with. And in this situation, with 8,800 of 22,000 machines scanned, there’s only a two in five chance that he will have that.

But this incomplete scan data still has its uses. I figured if I could tell him what the incomplete data can tell us, that gives his higher-ups what they need to know. If they accept the risk, that’s on them. At least he can say they knowingly accepted the risk, rather than guessing.

## Is my sample size statistically significant?

Statisticians make inferences from samples all the time. You see it every election night. As soon as the polls close, you start seeing counts. The counts tell you how many votes each candidate got, and what percentage of the votes have been counted or remain to be counted. And depending on how close the race is, as the night progresses, you start to see a prediction.

But people don’t like this when their candidate doesn’t win. So let’s look at another example. Insurance companies use the same methodology. It’s how they know how much to charge you for homeowners insurance, auto insurance, and the extended warranty on your toaster oven. Through the use of statistical modeling, they know how many cars will be stolen in your ZIP code, how many cars and homes will be damaged by hail storms, and how many toaster ovens just like yours will break. They just don’t know which ones. But since they know how many, all they have to do is calculate the payout, divide that by the number of policy holders, and then figure out the profit margin they want. Then they know how much to charge.

If statistics were hooey, insurance companies would be going out of business left and right.

So in this case, let’s bring up a simple statistical significance calculator and plug in the values we know. I took a statistics class in college, but that was in 1994, so I just remember the basics. If you want the nitty-gritty of how all this works, pick up any college textbook on the subject of elementary statistics.

### Analyzing the scan data with a statistical calculator

In our case, we plug in what we know. The size of our population is 22,000. Industry standard is 95% confidence with a 5% margin of error, so we will leave those numbers as is. We will take the defaults for everything else as well, because I sold back my statistics textbook in 1995 and we aren’t calling a presidential election here. We’re just giving a high-level estimate of the quality of the data we have.

And it turns out, to have a 95% degree of confidence with a 5% margin of error, we are scanning more than enough machines. So let’s adjust some numbers upward to get a better idea of what level of confidence we have in this data and what we can infer from it.
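For readers who would rather see the arithmetic than trust a web form, here is a minimal sketch of what such a calculator most likely does. It assumes Cochran's sample-size formula with a finite population correction and the usual worst-case proportion of 0.5; I have not confirmed that any particular online calculator uses exactly this method, and the function name is my own.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Estimate how many machines must be scanned (assumed method:
    Cochran's formula plus a finite population correction)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink it to account for the finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(required_sample_size(22_000))
```

Under these assumptions the requirement comes out to a few hundred machines, far below the 8,800 the licenses allow.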

There was a time I could do this on paper, but that was 1994, and I haven’t had to do it since. So we’ll do this by trial and error with an online calculator. And what I found is that scanning 8,800 machines gives us a 99.999% confidence level with a margin of error of roughly 2%. That sounds pretty good. What am I 99.999% confident of?
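The trial and error can also be run in reverse: fix the sample size and confidence level, then solve for the margin of error. The sketch below assumes the standard formula for a proportion with a finite population correction and a worst-case proportion of 0.5. The function name is hypothetical, and an online calculator may round or correct differently.

```python
import math
from statistics import NormalDist


def margin_of_error(sample, population, confidence, p=0.5):
    """Margin of error for a proportion, given a fixed sample size."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Standard error of a proportion, ignoring population size.
    standard_error = math.sqrt(p * (1 - p) / sample)
    # Finite population correction: we sampled a large share of the network.
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * fpc


for level in (0.95, 0.99, 0.99999):
    moe = margin_of_error(8_800, 22_000, level)
    print(f"{level:.3%} confidence: +/- {moe:.2%}")
```

With these assumptions, the 99.999% line lands a little under 2%, which is in line with what the calculator reported.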

## What I can infer from incomplete data

You can infer high-level information about your systems based on what you do know, as long as the machines you scanned are a representative sample of the network, ideally chosen at random. If you divide your vulnerability total by 8,800, you can be 99.999% confident that the resulting average per machine holds for the rest of the population, give or take the margin of error. The margin of error is how far the true figure can plausibly sit from your estimate, which is about 2% in this case. You can divide it up further. You can break the vulnerabilities down into critical, high, medium, and low, take the same totals, and have the same degree of confidence that those averages apply across the entire population.
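As a rough illustration of that division, the sketch below projects per-severity totals from the scanned machines onto the whole network. The totals are made-up numbers chosen to keep the arithmetic easy, not real scan results, and the projection only holds if the scanned machines are representative of the rest.

```python
SCANNED = 8_800       # machines covered by the licenses
POPULATION = 22_000   # machines on the network


def project_to_population(sample_totals, scanned=SCANNED, population=POPULATION):
    """Turn per-severity totals from the sample into network-wide estimates."""
    projection = {}
    for severity, total in sample_totals.items():
        per_host = total / scanned  # average findings per scanned machine
        projection[severity] = {
            "per_host": per_host,
            "estimated_total": round(per_host * population),
        }
    return projection


# Hypothetical totals, for illustration only.
hypothetical_totals = {"critical": 4_400, "high": 17_600, "medium": 44_000, "low": 26_400}

for severity, estimate in project_to_population(hypothetical_totals).items():
    print(severity, estimate)
```

The output is an estimate of how much work is out there, not a list of which machines need it.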

You can get a bit more nuance as well, if you have enough data. If you know how many systems you have of any given type, and how many of each type you scanned, then you can break that down. Just make sure that you calculate the degree of confidence for each subpopulation separately, because a smaller group with fewer scans will have a wider margin of error than the network as a whole.
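Here is a small sketch of that per-group check. The groups and their counts are hypothetical, picked only so that they add up to 22,000 machines and 8,800 scans, and the formula is the same assumed proportion-based margin of error with a finite population correction.

```python
import math
from statistics import NormalDist

# Hypothetical breakdown: (machines in the group, machines scanned).
GROUPS = {
    "servers": (2_000, 1_200),
    "workstations": (20_000, 7_600),
}


def group_margins(groups, confidence=0.95, p=0.5):
    """Compute a separate margin of error for each subpopulation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margins = {}
    for name, (population, sample) in groups.items():
        standard_error = math.sqrt(p * (1 - p) / sample)
        fpc = math.sqrt((population - sample) / (population - 1))
        margins[name] = z * standard_error * fpc
    return margins


for name, moe in group_margins(GROUPS).items():
    print(f"{name}: +/- {moe:.2%}")
```

In this made-up split, the server group has the higher scan coverage but still ends up with the wider margin, because its sample is much smaller in absolute terms.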

But you don’t have the specific scan details that a proper Qualys or Nessus scan of every system would have.

## Intellectual honesty

The problem with this methodology is you don’t know which systems are above average and which ones are below average, and you don’t know which vulnerabilities they have. So the data isn’t very actionable. The solution: quantify the problem. State the problem with the data so that the people who control the purse strings can decide what to do about it.

Ideally, in this situation, you budget to scan your whole network next year so that you have actionable data. You don’t worry too much about outliers in statistics, and in some cases you want to get rid of the outliers. But when it comes to cybersecurity, your outliers are where the incidents happen. So that’s why you use statistics to quantify security problems, not to solve them.