We’re constantly flooded with “facts”. We’re presented with conclusions drawn from artifacts and preached at that data doesn’t lie. But how much time do we spend analyzing the actual data rather than its interpretations? How often do we draw our own conclusions? In a fast-paced environment of perishable information, having the time to perform analysis is a luxury.
The title of this article (clickbait as it may be) paraphrases an interesting book, How to Lie with Statistics, by Darrell Huff. The book outlines common errors in the interpretation of statistics, and how those errors can lead to incorrect conclusions. Funnily (and dangerously) enough, Darrell himself told lies with statistics when defending the tobacco industry in the 50s and 60s. It goes to show that we’re all susceptible to telling, or being told, lies with data. Let’s raise awareness and become less prone to such errors.
I once attended a board meeting where a security executive was trying to convince management to allocate more funds to his area. His pitch was centered on the fact that the number of unauthorized physical access incidents had doubled over the course of the previous year. The incidents related to the company’s data center facilities and were treated seriously at the board level. His graph looked something like this:
So far so good, right? Yes, until you look at the actual numbers.
Year 1 had a total of 2 incidents; year 2 had 4. In year 1 the company managed about 10 data centers, while during year 2 that number increased to more than 25. If we put these two data points in perspective, we see that the relative number of incidents per data center actually decreased year over year.
You would think the board members would have been quick to catch the missing data. Well, in that particular case, the only question they asked was about the financial impact of the incidents. That question is important, especially at the board level, but it missed the misconstructed fact. (Note: all incidents were contained without financial impact thanks to additional controls.)
Furthermore, the average rate of such incidents across the industry at the time was much higher: roughly one incident for every two data centers per year.
Let’s redraw the chart with all the data elements.
This tells a different story. What omissions made the first version of the chart so misleading, and what questions should we ask when presented with partial information?
- Lack of absolute values. When offered comparative values expressed as year-over-year percentages, always ask for the absolute values.
- No scale for reference. How does this relate to the impacted assets? In our case, how many incidents per data center?
- No industry average comparison. How does this relate to what other companies are experiencing?
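The arithmetic behind the corrected chart is simple enough to sketch. The snippet below uses the numbers from the story (2 incidents across ~10 data centers in year 1, 4 across ~25 in year 2, against an assumed industry rate of one incident per two data centers per year) to normalize the raw counts:

```python
# Absolute values from the story; year 2 had "more than 25" data centers,
# so 25 is a conservative lower bound for illustration.
incidents = {"Year 1": 2, "Year 2": 4}
data_centers = {"Year 1": 10, "Year 2": 25}

# Industry average at the time: one incident per two data centers per year.
INDUSTRY_RATE = 0.5

for year in incidents:
    rate = incidents[year] / data_centers[year]
    print(f"{year}: {incidents[year]} incidents across "
          f"{data_centers[year]} data centers "
          f"-> {rate:.2f} per data center "
          f"(industry average: {INDUSTRY_RATE:.2f})")
```

Even though the raw count doubled, the normalized rate fell from 0.20 to 0.16 incidents per data center, and both years sit well below the industry average of 0.50.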
This article is part of a series exploring the most common misuses of data to construct “facts” in the Information / Cyber Security space. The main purpose is to raise awareness and make us less susceptible to such “facts”.
As the series progresses, I will add the links here.
This article was first published on LinkedIn.