Being data-driven doesn't protect you from being fooled by false narratives

When understanding is critical, it is important to be hypothesis-driven, not data-driven.

DCL

2022-01-07

Everyone loves to be told (sold) a story

If you work in data/statistics/DS/ML/AI/etc. and are active on LinkedIn, you’ve probably seen the image posted above. It typically receives a large number of positive comments. This makes sense: data by itself isn’t informative without a layer of interpretation. Young analysts often get told that they

Shouldn’t simply present data or even a summary of it. Stakeholders, managers, decision-makers want the story.

It is even common for data science job positions to ask for:

A storyteller who can visually bring things to life.

Unfortunately, in this increasingly data-driven world, the data you have access to can tell an incredibly misleading story. Given the collection of LEGO (data) in the image above, countless creations could have been constructed (plausible stories). You could have built a spaceship or a robot with those LEGO bricks and it would have looked equally valid (perhaps even more so?). Additionally, paying too much attention to the data you have collected will result in an overfit interpretation that will not generalise to other scenarios.

Without a tested, validated model of the world, this kind of ‘storytelling from data’ can result in an erroneous narrative, one that may even give management an incorrect worldview and lead to counterproductive actions.

Without a valid model of the world, ‘telling a story from data’ can result in an erroneous post hoc rationalisation of events, built on noisy, incomplete data that is not representative of the wider population.

A simple example

Below is a very simple demonstration of how being purely data-driven can result in erroneous or ambiguous stories. Let’s say we’ve collected some data on a feature’s relationship with an outcome of interest.

What is the story here? As our feature X increases, the outcome Y decreases, so the two appear to be negatively related. Without any other data or theory to reason about, it is very tempting to state

Our story is that ‘increasing X has a negative impact on the outcome Y’

A great data-driven insight? But are we suffering from omitted variable bias? Observe what happens when we collect some more data. The plot below colours the samples based on another feature, Z.

Controlling for Z, we now observe the feature X having the opposite effect: within each group, Y increases with X. We might be tempted to change our story:

‘Increasing X has a positive impact on the outcome Y’

But even this story may be flawed. If X influences Z (that is, the value of X influences which group/colour you are in), then our first story is still reasonable. But if X and Z are independent, or if Z influences X, then the second story may be the reasonable one. Importantly

We can’t determine which story is true based on the data alone.
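The reversal in this example is easy to reproduce with simulated data. The sketch below is a hypothetical illustration, not the data behind the plots: the three groups, the coefficients and the noise levels are all invented, and the helper `ols` is just a thin wrapper around NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three groups (Z = 0, 1, 2) with 200 samples each.
n_per_group = 200
z = np.repeat([0, 1, 2], n_per_group)
x = 2.0 * z + rng.normal(0.0, 0.5, size=z.size)            # groups differ in X
y = 1.0 * x - 5.0 * z + rng.normal(0.0, 0.5, size=z.size)  # Y rises with X inside a group

def ols(columns, target):
    """Least-squares coefficients for target ~ intercept + columns."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

slope_ignoring_z = ols([x], y)[1]     # the first story: Y ~ X
slope_given_z = ols([x, z], y)[1]     # the second story: Y ~ X + Z

print(f"Y ~ X:     slope on X = {slope_ignoring_z:+.2f}")
print(f"Y ~ X + Z: slope on X = {slope_given_z:+.2f}")
```

In this simulation Z was constructed to drive X, so the within-group slope happens to be the causal one. An analyst who only had `x`, `y` and `z` in hand would not know that, which is exactly the problem.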

How do we avoid making erroneous conclusions?

Being 100% data-driven gets you into trouble; you need to be hypothesis-driven to better understand what is going on. That is, to

Collect data.
Form a hypothesis that is consistent with the data.
Conduct an experiment to test the hypothesis.

For the situation presented above, a simple controlled experiment might be to randomly allocate X and/or Z (if possible) to subjects and then perform a regression to assess the size and, perhaps more importantly, the direction (sign) of the effects.
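To make the value of random allocation concrete, here is a minimal sketch under invented assumptions. It defines two hypothetical worlds, one where Z influences X and one where X influences Z, with made-up coefficients chosen so that both produce a negative slope in observational data. The function `simulate` and its `randomise_x` flag are illustrative names, not part of any library.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    return np.polyfit(x, y, 1)[0]

def simulate(world, rng, n, randomise_x=False):
    """Draw (x, y) from one of two hypothetical worlds.

    "z_causes_x": Z -> X, Z -> Y, X -> Y
    "x_causes_z": X -> Z, Z -> Y, X -> Y
    """
    if world == "z_causes_x":
        z = rng.integers(0, 3, size=n)
        x = 2.0 * z + rng.normal(0.0, 0.5, size=n)
        if randomise_x:
            # Random allocation: X no longer depends on Z.
            x = rng.uniform(-1.0, 5.0, size=n)
    elif world == "x_causes_z":
        # X is not caused by anything else here, so randomising it
        # leaves the data-generating process unchanged.
        x = rng.uniform(-1.0, 5.0, size=n)
        z = np.clip(np.floor((x + 1.0) / 2.0), 0, 2)  # group depends on X
    else:
        raise ValueError(f"unknown world: {world}")
    y = x - 5.0 * z + rng.normal(0.0, 0.5, size=n)
    return x, y

rng = np.random.default_rng(1)
results = {}
for world in ("z_causes_x", "x_causes_z"):
    x_obs, y_obs = simulate(world, rng, 5000)
    x_exp, y_exp = simulate(world, rng, 5000, randomise_x=True)
    results[world] = (slope(x_obs, y_obs), slope(x_exp, y_exp))
    print(f"{world}: observational {results[world][0]:+.2f}, "
          f"experimental {results[world][1]:+.2f}")
```

The observational slopes are negative in both worlds, so the collected data cannot separate them. Under random allocation the slope turns positive in the world where Z influences X and stays negative in the world where X influences Z, so the experiment can distinguish the two stories.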

Using directed acyclic graphs (DAGs) can help us think about how the data might have been generated, which in turn gives insight into how we might design an experiment that tests a formulated hypothesis.
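As a rough illustration of writing candidate DAGs down explicitly, the sketch below encodes two competing graphs for the X, Z, Y example as plain dictionaries and labels the role Z plays in each. The function `role_of_z` is a hypothetical helper that only covers this three-variable case; it is not a general causal identification procedure.

```python
# Each candidate DAG maps a node to the set of nodes it points at.
candidate_dags = {
    "Z influences X": {"Z": {"X", "Y"}, "X": {"Y"}, "Y": set()},
    "X influences Z": {"X": {"Z", "Y"}, "Z": {"Y"}, "Y": set()},
}

def descendants(dag, node):
    """All nodes reachable from `node` by following arrows."""
    seen = set()
    stack = list(dag[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dag[current])
    return seen

def role_of_z(dag):
    """Label Z's role relative to the X -> Y relationship (three-node case only)."""
    if "X" in dag["Z"] and "Y" in dag["Z"]:
        return "confounder: adjust for Z when estimating the effect of X on Y"
    if "Z" in descendants(dag, "X") and "Y" in descendants(dag, "Z"):
        return "mediator: do not adjust for Z when estimating the total effect of X on Y"
    return "unclear: this simple check does not cover this graph"

for name, dag in candidate_dags.items():
    print(f"{name}: {role_of_z(dag)}")
```

Writing the graphs out this way does not tell us which one is true, but it makes explicit which assumption each story depends on and therefore what an experiment needs to test.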

Stories can be persuasive even if they are wrong

Amongst many great points, Morgan Housel from the Collaborative Fund states

The person who tells the most compelling story wins. Not the best idea. Just the story that catches people’s attention and gets them to nod their heads.
The world is governed by probability, but people think in black and white, right or wrong – did it happen or did it not? – because it’s easier.
Tell people what they want to hear and you can be wrong indefinitely without penalty.
Simple explanations are appealing even when they’re wrong. “It’s complicated” isn’t persuasive even when it’s right.

We have to accept that stories can be incredibly persuasive without being correct. Finding data to support a pre-existing story doesn’t necessarily make the story more probable, but it can certainly make the story sound more plausible and more likely to be believed.

The same is true even if we start from the data first. A dataset analysed in isolation, without theory and untested by experiment, can tell many contradictory stories.
