The 2021 Security Outcomes report and better research methods
Something extraordinary happened recently in Information Security research reporting. Why it's so extraordinary might have passed you by, unless you geek out on statistical methods in opinion polling as I do. The report is Cisco's 2021 Security Outcomes report, produced in collaboration with the Cyentia Institute; it's the only report in recent memory that uses sound statistical methods to conduct survey-based opinion research. What does that mean, and why is it so important? Glad you asked! <soapbox on>
The Current Problem
Much information security research is survey-based, meaning it asks a group of people their opinion on something, aggregates the results, draws conclusions, then presents the findings. This is a very common -- and often illuminating -- way to perform research. It reveals the preferences and opinions of a group of people, which can then serve as a valuable input to an analysis. This method is called opinion polling, and we are all very familiar with this kind of research, mostly from political opinion polls that ask a group of people how they're voting.
The single most important thing to keep in mind when evaluating the veracity of an opinion poll is knowing how respondents are selected.
If the people that comprise the group are selected at random, one can extrapolate the results to the general population.
If the people that comprise the group are not selected at random, the survey results only apply to the group itself.
Here’s the problem: most, if not all, survey-based information security reports make no effort to randomize the sample. As a consequence, the results are skewed. If a headline reads “40% of Ransomware Victims Pay Attackers” and the underlying research is non-random opinion polling, like a Twitter poll or emailing a customer list, the headline is misleading. The results only apply to the people that filled out the survey, not the general population. If you’re not following, let me use a more tangible example.
Let's suppose you are a campaign worker for one of the candidates in the 2020 US Presidential campaign. You want to know who is ahead in California - Trump or Biden. The way to figure this out is to ask a group of people how they plan on voting on Election Day. You want to extrapolate the results to the whole of California - not just one area or demographic. It's not feasible to ask every single Californian voter how they're voting, so you ask a representative sample instead. Which one of these polling methods do you think will yield the most accurate results?
1. Stand in front of a grocery store in Anaheim (the most conservative city in California) and ask 2,000 people how they are voting.
2. Stand in front of a grocery store in San Francisco (the most liberal city in California) and ask 2,000 people how they are voting.
3. Ask 2,000 people on your Twitter account and offer them a $5 Amazon gift card to answer a poll on how they're voting. It's the honor system that they're actually from California.
All three options will introduce significant error and bias, especially if the results are applied to how all Californians will vote. This is called selection bias, and it occurs when the group sampled is systematically different from the wider population being studied. Whether you work for the Biden or the Trump campaign, you can't use these survey results. Option 1 will skew toward Trump and option 2 will skew toward Biden - giving both teams an inaccurate view of how all Californians will vote. Option 3 would yield odd results, with people from all over the world completing the survey just to get the gift card.
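Selection bias is easy to see in a quick simulation. The sketch below uses a made-up electorate (the city sizes, leanings, and candidate support percentages are all assumptions for illustration, not real polling data) to show how polling only one city misses the statewide picture, while a simple random sample lands close to the true value:

```python
import random

random.seed(0)

# Hypothetical electorate: two cities with opposite leanings.
# 1 = votes for candidate X, 0 = votes for the other candidate.
city_a = [1] * 70_000 + [0] * 30_000   # 70% for X (the "conservative city")
city_b = [1] * 20_000 + [0] * 80_000   # 20% for X (the "liberal city")
state = city_a + city_b                # true statewide support: 45%

def poll(group, n=2000):
    """Ask a simple random sample of n people from `group` how they'll vote."""
    sample = random.sample(group, n)
    return sum(sample) / n

true_support = sum(state) / len(state)
print(f"true statewide support: {true_support:.1%}")   # 45.0%
print(f"poll only city A:       {poll(city_a):.1%}")   # near 70% - biased high
print(f"poll only city B:       {poll(city_b):.1%}")   # near 20% - biased low
print(f"random statewide poll:  {poll(state):.1%}")    # near the true 45%
```

The biased polls aren't wrong about the people they sampled - they're wrong the moment you extrapolate them to the whole state. That's exactly the failure mode of a non-random security survey.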
Every survey-based security research report I've seen uses non-randomized samples, like the examples above, and is therefore subject to selection bias. If you're a risk analyst like me, you can't use reports like this - the results are just too questionable. I won't use them. Junk research is the death of the defensibility of a risk assessment.
What Cyentia did
Let’s add a 4th option to the list above:
4. Obtain a list of all likely California voters, randomly sample enough people so that the likely response rate will be ~2000, and ask them how they plan on voting.
This method is much better; this gets us closer to a more accurate (or less wrong, following George Box’s aphorism) representation of how a larger group of people will vote.
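Option 4 can be sketched in a few lines. The numbers here are assumptions for illustration (the voter-list size and the 10% expected response rate are hypothetical, not figures from the Cisco/Cyentia study): to land near ~2,000 completed surveys, you oversample the invitations and draw them uniformly at random from the full list.

```python
import random

# Assumed numbers for illustration only.
target_responses = 2_000
expected_response_rate = 0.10   # hypothetical; tune from past surveys
invites_needed = round(target_responses / expected_response_rate)

# Stand-in for the real list of all likely California voters.
voter_list = [f"voter_{i}" for i in range(1_000_000)]

# Simple random sample, without replacement: every voter on the list
# has the same chance of being invited.
invited = random.sample(voter_list, invites_needed)

print(invites_needed)       # 20000 invitations
print(len(set(invited)))    # 20000 -- no duplicates, sampled without replacement
```

Because everyone on the list had an equal chance of selection, the ~2,000 eventual responses can (with care about non-response bias) be extrapolated to the full population - which none of options 1-3 allows.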
Here’s the extraordinary thing: Cisco and Cyentia actually did it! They produced a research report based on opinion polling that adheres to sampling and statistical methods. They didn’t just ask 500 randos on Twitter with the lure of gift cards to answer a survey, like everyone else does. They went through the hard work to get a list, randomly sample it, debias the questions, and present the results in a transparent, usable way. This is the first time I’ve ever seen this in our industry and I truly hope it’s not the last.
A sea change?
Nearly all survey-based Information Security research reports cut corners, use poor methods, and typically use the results to scare and sell, rather than inform. The result is, at best, junk and at worst, actively harmful to your company if used to make decisions. I know that doing the right thing is hard and expensive, but it’s worth doing. Wade Baker of Cyentia wrote a blog post detailing the painstaking complexity of the sampling methodology. As a consumer of this research, I want Cyentia and Cisco to know that their hard work isn’t for nothing. The hard work means I and many others can use the results in a risk analysis.
I truly hope this represents a sea change in how security research is conducted. Thanks to everyone involved for going the extra mile - the results are remarkable.
Further Reading
Finally - a properly sampled security survey | blog post by Wade Baker on how Cisco's 2021 Security Outcomes report was designed
The good, the bad, and the ugly of public opinion polls | paper by Russell D. Renka
Sampling Methods Review | Khan Academy's overview of sampling techniques
Internet Surveys: Bad Data, Bad Science and Big Bias | blog post by Ian Chadwick
20 Questions A Journalist Should Ask About Poll Results | great post on opinion polling and ethics