How to Lie with Statistics, Information Security Edition
There is terror in numbers. Perhaps we suffer from a trauma induced by grade-school arithmetic. -Darrell Huff
Have you ever finished reading a vendor whitepaper or a research institution’s annual security report and felt your Spidey sense begin to tingle with doubt or disbelief after reading some of the conclusions or research methodology? What you are probably sensing is a manipulation of statistics, an age-old hoodwink that has been occurring as long as numbers have been used to convey information.
Welcome to the very first post in a long series about all the ways statistics, measurements, facts and data visualizations are used and abused in the Information Security space. The series is based on a monumental book titled “How to Lie with Statistics,” by Darrell Huff, published in 1954. Huff was the editor of Better Homes and Gardens in the 1940s and ’50s and had a lifelong passion for statistics. Even though he was not a statistician, his book has become the most widely read statistics book to date. In an accessible, easy-to-understand way, Huff introduced the general public to the common ways statistics are used to manipulate the facts. He called this manipulation of statistics statisticulation.
For example, Huff took aim at the misuse of survey research by the ad men on Madison Avenue and exposed their claims. One example is an ad claiming that “Users report 23% fewer cavities with Doakes’s toothpaste,” when the sample of “users” is so small that someone with sufficient motivation can find patterns in randomness where no pattern really exists. This type of numerical trickery would be a familiar sight to anyone who has spent just an hour in the RSA or Black Hat vendor expo halls.
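To see how easily a tiny sample produces an impressive-looking number, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the function name `chance_of_spurious_drop`, the group sizes, and the toy model in which every person gets 0 to 4 cavities at random. Both groups use the same "toothpaste," so any improvement is pure chance.

```python
import random

def chance_of_spurious_drop(users_per_group, trials=500, threshold=0.23, seed=1954):
    """Fraction of trials in which an identical toothpaste appears to cut
    cavities by at least `threshold`, purely by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both groups draw from the SAME distribution: 0 to 4 cavities per person.
        control = sum(rng.randint(0, 4) for _ in range(users_per_group))
        doakes = sum(rng.randint(0, 4) for _ in range(users_per_group))
        if control > 0 and (control - doakes) / control >= threshold:
            hits += 1
    return hits / trials

# Hypothetical group sizes: a dozen users versus a thousand.
small = chance_of_spurious_drop(users_per_group=12)
large = chance_of_spurious_drop(users_per_group=1000)
print(f"12 users per group:   {small:.1%} of trials show a 23%+ 'improvement'")
print(f"1000 users per group: {large:.1%} of trials show a 23%+ 'improvement'")
```

With a dozen users per group, a motivated advertiser only has to rerun the "study" a few times to get the headline number. With a thousand users per group, that trick stops working.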
This series takes the foundation Huff created over 60 years ago and updates the concepts and examples for the contemporary Information Security field. Readers will find that Huff’s work is just as relevant, and important, today as it was in the past.
Examples of Statisticulation in Security
Surveys
Did you know that 73% of CISOs agree that… wait, which CISOs?
Research based on survey results is prolific in the Information Security space. The most commonplace surveys used to steer the reader to a particular conclusion are vendor-sponsored security reports, a pervasive industry example of using surveys to influence a purchaser’s perception of the efficacy of, or need for, a certain product. The majority of these security reports have one or more very serious biases that would make any reader with an elementary sense of the math behind sampling question the results.
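As a rough illustration of the math behind sampling, the sketch below applies the standard 95% margin-of-error approximation for a proportion to a hypothetical "73% of CISOs" finding. The sample sizes are made up, the function name `margin_of_error` is mine, and the formula assumes a simple random sample, which a vendor survey of self-selected respondents usually is not.

```python
import math

def margin_of_error(proportion, sample_size, z=1.96):
    """Approximate 95% margin of error for a proportion,
    assuming a simple random sample (z=1.96 for 95% confidence)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical sample sizes for a survey reporting that 73% of CISOs agree.
for n in (30, 300, 3000):
    moe = margin_of_error(0.73, n)
    print(f"n={n:>4}: 73% +/- {moe:.1%}")
```

Keep in mind that this only captures random sampling error. If the respondents were hand-picked or self-selected, a larger sample does not remove the bias.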
Posts on surveys:
The “Gee Whiz” Graph: How Pie Charts, Line and Bar Graphs Distort Reality
Visual learners become overwhelmed with rows of numbers, so one quick way to represent data in a way anyone can understand is through the use of colorful pie, bar and line charts. Excel makes it simple for anyone to turn drab statistics into eye-catching displays, but this ease also introduces new problems. Chart types are often incorrectly viewed as interchangeable, but not every chart type is appropriate for every type of data. Finally, this part of the series will examine several data visualization techniques that are commonly used to emphasize (or de-emphasize) unfavorable conclusions.
Posts on Gee Whiz Graphs:
Coming soon!
The Semi-Attached Figure
The semi-attached figure is a situation in which one idea cannot be proven, so the author pulls the old bait-and-switch, stating a completely different idea and pretending it is the same thing. This is seen in two areas: when security vendors are trying to sell a product or service, and when Information Security professionals are trying to communicate risk to management. An example of this is when a vendor wants to make a problem seem bigger than it really is. You may receive a product pitch that claims there are one hundred times more cyber attacks today than there were in 2005. The problem with this statement is that the detection of, and response to, attacks is far better than it was 10 years ago, so part of the increase reflects attacks that are now being noticed and counted, not attacks that are new. A vendor trying to sell you a product would like you to believe the sky is falling, and while the figure may be technically correct, it does not tell the full story.
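The arithmetic behind that pitch can be sketched in a few lines. The numbers below are entirely hypothetical and chosen only to show the mechanism: the true attack volume stays flat while the share of attacks that anyone detects improves. The function `reported_attacks` is an illustrative name, not a real metric from any report.

```python
def reported_attacks(actual_attacks, detection_rate):
    """Attacks that show up in the statistics: only the ones somebody detected."""
    return actual_attacks * detection_rate

# Hypothetical numbers, chosen only to illustrate the arithmetic.
actual = 10_000                                          # same true volume in both years
then = reported_attacks(actual, detection_rate=0.005)   # assume 0.5% were detected
now = reported_attacks(actual, detection_rate=0.50)     # assume 50% are detected

print(f"Reported then: {then:.0f}, reported now: {now:.0f}")
print(f"Apparent growth: {now / then:.0f}x, actual growth: 1x")
```

The "one hundred times more attacks" figure is attached to what was counted, not to what actually happened.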
Posts on the Semi-Attached Figure:
Logical Fallacies
One prevalent example, the post hoc fallacy, often summarized as “correlation does not imply causation,” is examined very closely for two reasons: it is the most common, and it is perhaps the most damaging manipulation of data because it is easy to perpetrate, often by accident. It rears its ugly head in reports, surveys, risk analyses, reports to the Board, assigning attribution for an attack, and many other places. This occurs when two data sets are presented and it is falsely implied that one caused the other. This and other logical fallacies will be examined, along with ways to spot them.
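Here is a small sketch of how two data sets can move together without one causing the other. The scenario is invented for illustration: a hidden third factor, company size, drives both the number of security tools a company deploys and the number of incidents it reports. All coefficients, the seed, and the `pearson` helper are assumptions of this example.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(2015)
# Hypothetical model: company size (a hidden third factor) drives BOTH numbers.
sizes = [rng.uniform(1, 100) for _ in range(2000)]
tools = [s * 0.5 + rng.gauss(0, 5) for s in sizes]       # security tools deployed
incidents = [s * 0.3 + rng.gauss(0, 5) for s in sizes]   # incidents reported

overall = pearson(tools, incidents)

# Hold size roughly constant: look only at companies of similar size.
similar = [(t, i) for s, t, i in zip(sizes, tools, incidents) if 45 <= s <= 55]
within = pearson([t for t, _ in similar], [i for _, i in similar])

print(f"All companies:           r = {overall:.2f}")
print(f"Similar-sized companies: r = {within:.2f}")
```

Across all companies, tools and incidents look strongly linked, which invites the headline "more security tools lead to more incidents." Among companies of similar size the link largely disappears, because in this toy model neither number influences the other.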
Posts on Logical Fallacies:
Why?
“…but 74% of hacks come from insiders, not hacking groups,” the auditor said as he held a slick printed infographic in one hand, and wagged a finger at me with the other. “Your write-up on insider threats says the exact opposite!”
The quizzical look on my face betrayed my confusion. Noting the company that performed the research, I checked the stats myself when I got back to my desk, and sure enough, the research did report that 74% of hacks come from inside the enterprise. A closer look, however, revealed a serious problem: the sample size was impossibly small, the survey was not performed to survey science standards, and methodology and bias were not disclosed. Essentially, the results apply only to those who took the survey and cannot be extrapolated to apply to all companies! Other research that uses real incident data, as opposed to an opinion poll that relies on the survey taker’s memory, showed that my original probabilistic assessment was correct. It was too late: the damage was done, and a company decision was made using really bad data.
Truth and accuracy matter, especially in an industry where so much value is placed on honesty and integrity. The misuse of statistics is widespread everywhere (not just in security), and misused statistics are often employed to sell products or steer people to a certain conclusion.
One might think of this blog series as a handbook on how to lie and cheat by misusing statistics and graphics. That is not the intent; you don’t need me to learn that. The goal is the same one that leads information security professionals to learn hacking techniques: defense. Spotting the use and abuse of numbers is easy… once you know what to look for.
I’ve been interested in this topic and gathering material and examples since 2014. I first spoke on this topic at BSides San Francisco 2015 (video | slides) and again at CircleCityCon 2018 (video | slides). Check this post often for links to new posts; more will be added on a regular basis!