I am 75% confident your vulnerability management metrics are too complicated. I’m only 75% confident because I’d need to see examples from about twice as many organizations as I’ve seen to be 95% confident. Still, I’ve probably seen 150 more samples than most people. And I have more bad news for you. I’m also 75% confident your vulnerability management metrics are too simplistic. How can they be both? Measuring the wrong things puts you in situations like that. So let’s talk about NIST’s recommended vulnerability management metrics, and how to align more closely with their recommendations.
Why use NIST 800-40?
NIST has a standard for enterprise patch management, which is the IT remediation side of vulnerability management. This distinction is important. Vulnerability management measures the success of patch management. It has little to no control over the actual outcome.
The system administrators are doing the actual work. A good vulnerability analyst is measuring the outcome, and hopefully can make some recommendations. But they are like a coach on a sports team. They aren’t the ones on the field scoring the points. But hopefully they are able to figure out who is best positioned to score so the team wins. Earl Weaver wasn’t a good enough baseball player to make the major leagues even as a bench player, but he became one of the greatest managers of all time. Ted Williams was one of the best players ever, but Weaver could beat him with a less talented team. So good management makes a difference.
I was a very successful system administrator, especially when it came to remediating vulnerabilities. I had to get the occasional risk acceptance, but when you exclude the risk acceptances, I was fixing vulnerabilities with a success rate of 100%, and my MTTR, or mean time to remediate, was between 14 and 30 days.
And when I had a risk acceptance, by the time the risk acceptance expired, I had those vulnerabilities remediated as well. I was fixing things like Java. Yes, I even figured out how to fix Java without causing disasters.
I wasn’t always successful. My patching before 2005 was mediocre. I didn’t have the tools to measure my success rate and find where things were failing, so my success rate couldn’t have been much higher than 70%. And I didn’t have enough maintenance windows for my MTTR to be anywhere near 30 days.
I got good when I took a job at an organization that used NIST 800-40 to govern patch management. When I had those guidelines to follow, I went from being mediocre to fixing 800,000 vulnerabilities a year in a time frame that would make anyone jealous. The only reason I fixed just 800,000 was that it was the number of opportunities I had.
I knew I was okay, maybe even pretty good, but I wasn’t the one producing the metrics, so I didn’t know until a few years later, when I had become a vulnerability analyst myself, how to quantify how good or bad I was at patching.
Why your vulnerability metrics are too complicated
I am 75% confident that your vulnerability management metrics are too complicated because I only need to answer two questions to be able to tell you if you have a successful vulnerability management program. Yes, just two questions.
The questions are whether you are patching the right things, and whether you are patching them fast enough.
Usually when metrics are overly complex, it’s because someone made the problem more complex than it needs to be, or it’s because they know they aren’t doing those two things, but they are trying to make the metrics look good.
The problem with that is without solving the underlying problem, there isn’t much you can do to make those metrics look good. And at worst, you obscure the objective.
The other problem with overly complicated metrics is that they are extremely intensive in computational power, labor, or both. And I can quantify that. I spent 75% of my time at one job producing metrics. I was spending 3 weeks out of the month putting together metrics, which only left a week out of the month to work with my teams on doing something about them. If we could have reversed that, I would have been a lot more effective.
At a later job, I figured out how to produce comparable metrics in a week with better tools. And then I figured out how to produce something about 80% as good programmatically, and it took 15 minutes. But then I had to figure out how to close that gap, and once I closed the gap, my program took 6 hours to run. That’s still a lot better than a week, let alone 3 weeks, but using 24 times as much computing power didn’t improve the outcome by a factor of 24. It didn’t improve the outcome at all. All I was doing was wasting electricity.
Why your vulnerability management metrics are too simplistic
How can you be too simplistic and too complicated at once? Partly because you are probably defining the problem incorrectly, but also because you probably don’t have the right data.
Security policy is usually based on the criticality of the vulnerability. It usually doesn’t factor asset context in at all. Part of the reason for that is sometimes companies don’t know what computers they have, let alone whether a given computer is business critical or not. You can’t define criticality when you don’t know for certain if something exists. This is partly due to shadow IT, but also partly due to poor asset inventory.
But even if you have a subpar inventory, you can wing it for a while on asset criticality. If it’s on your disaster recovery plan, it’s a high criticality asset. If you’re not allowed to touch it because you might break it, it’s a high criticality asset. It’s going to blow your metrics, but that’s a problem for multiple IT VPs or maybe even the CIO to solve. If you know it’s a test system, its asset criticality is low. And if you don’t know what it should be, let it default to medium. You’ll end up with too many mediums, but IT will have motivation to help you fix it now.
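Those rules of thumb are simple enough to write down as a few lines of code. Here’s a minimal sketch, assuming a made-up `Asset` record with three yes/no flags; the field names and sample hosts are hypothetical, and your inventory will expose something different. I’m also assuming that when a system qualifies as both high and low, high wins.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    # Hypothetical fields; a real inventory will look different.
    name: str
    in_dr_plan: bool = False   # listed in the disaster recovery plan
    hands_off: bool = False    # "don't touch it, you might break it"
    is_test: bool = False      # known test system


def asset_criticality(asset: Asset) -> str:
    """Apply the rules of thumb in order; 'high' wins any conflict (an assumption)."""
    if asset.in_dr_plan or asset.hands_off:
        return "high"
    if asset.is_test:
        return "low"
    # Anything nobody has classified yet defaults to medium.
    return "medium"


assets = [
    Asset("payroll-db", in_dr_plan=True),
    Asset("legacy-fax", hands_off=True),
    Asset("qa-web-01", is_test=True),
    Asset("mystery-box"),
]
levels = {a.name: asset_criticality(a) for a in assets}
print(levels)
```

The unclassified box lands in medium, which is exactly the pile you want IT to help you shrink.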
Once you define those criticality levels, tag your assets by criticality in your vulnerability management system, and then you just need to report three stinking numbers for each combination of asset and vulnerability criticality.
You might even be able to get by with two: the percentage of your vulnerabilities that were fixed before the due date, and your mean time to remediate.
NIST also recommends collecting the median time to remediate, although they don’t explain why. I will explain the significance.
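To show how little computation those numbers take, here’s a rough sketch that groups remediated vulnerabilities by asset and vulnerability criticality and reports the percentage fixed on time, the mean, and the median. The record layout and the sample dates are made up for illustration, and I’m assuming “on time” means fixed on or before the due date.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median

# Hypothetical remediated findings:
# (asset criticality, vuln criticality, detected, due, fixed)
remediated = [
    ("high", "critical", date(2024, 1, 2), date(2024, 1, 16), date(2024, 1, 10)),
    ("high", "critical", date(2024, 1, 2), date(2024, 1, 16), date(2024, 1, 20)),
    ("high", "critical", date(2024, 1, 5), date(2024, 1, 19), date(2024, 1, 15)),
    ("medium", "high", date(2024, 1, 1), date(2024, 1, 31), date(2024, 1, 21)),
    ("medium", "high", date(2024, 1, 1), date(2024, 1, 31), date(2024, 3, 1)),
]


def summarize(records):
    """Return the three numbers for each (asset, vuln) criticality pair."""
    groups = defaultdict(list)
    for asset_crit, vuln_crit, detected, due, fixed in records:
        days_to_fix = (fixed - detected).days
        # Assumption: fixed on the due date still counts as on time.
        groups[(asset_crit, vuln_crit)].append((days_to_fix, fixed <= due))

    report = {}
    for key, rows in groups.items():
        days = [d for d, _ in rows]
        on_time = sum(1 for _, ok in rows if ok)
        report[key] = {
            "pct_on_time": round(100 * on_time / len(rows), 1),
            "mean_days": round(mean(days), 1),
            "median_days": median(days),
        }
    return report


report = summarize(remediated)
for key, numbers in report.items():
    print(key, numbers)
```

That’s one pass over the data and no trending, which is a big part of why this approach runs in minutes instead of hours.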
Why median and mean?
If your mean is more than your median, that suggests you have a vulnerability backlog and a significant number of the vulnerabilities you are remediating are older, because a handful of very old vulnerabilities pull the mean up much further than they pull the median. If the mean is at or below your median, you probably don’t have a backlog.
The mean tells you if you typically are meeting your remediation policy. The median adds the nuance of better measuring how frequently you meet the policy. Here’s more background on median and mean if you need it.
Comparing the two gives you the same insight that trending does, without the computational intensity of trending. It’s ingenious. And that’s a good thing, because trending is the most time-consuming and computationally difficult part of vulnerability metrics. Being able to make each month’s metrics more or less stand on their own is the holy grail.
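If you want to see how the two numbers react to old vulnerabilities, here’s a minimal sketch using made-up days-to-remediate figures for a single month. One list is routine patching only; the other adds three long-open vulnerabilities that finally got closed. None of this data comes from NIST or a real program.

```python
from statistics import mean, median

# Hypothetical days-to-remediate for everything closed in one month.
steady_month = [9, 11, 12, 14, 15, 16, 18, 20]
# The same month, plus three long-open vulnerabilities finally closed.
backlog_month = steady_month + [240, 310, 400]


def summarize(days):
    """Mean and median days-to-remediate for one month's closures."""
    return {"mean": round(mean(days), 1), "median": median(days)}


steady = summarize(steady_month)
backlog = summarize(backlog_month)
print("steady: ", steady)
print("backlog:", backlog)
```

With these numbers, the median only moves from 14.5 to 16, while the mean jumps from 14.4 to 96.8. The gap between the two shows up in a single month’s data, with no need to compare against prior months.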
You don’t need a manifesto on security metrics 300 pages long to understand vulnerability management metrics. You’re failing because you are making it too damn complicated. Pardon the strong language, but you are ruining careers. The United States Government, the creators of the IRS long form, defined this problem and the solution in its entirety in 28 pages. The whole problem, not just the metrics. The metrics part of the publication is two pages.