UN Data Forum: Data Journalism (live blog)

This is a liveblog written by Rahul Bhargava at the 2017 UN World Data Forum. This serves as a summary of what the speakers spoke about, not an exact recording. With that in mind, any errors or omissions are likely my fault, not the speakers'. This was a virtual session, with all the speakers calling in via video.


John Bailer: News & Numbers is an old idea. Cohn's book targeted journalists to help them communicate with a broader community. Alberto Cairo's The Truthful Art is a more recent example of this. John runs the Stats & Stories podcast to explore these questions as well.

Trevor Butterworth: Trevor is an Irish journalist with a background in the arts. He wrote for major publications as a freelancer about cultural issues, back when this was called “computer-assisted reporting”.

Rebecca Goldin: Trained as a mathematician, Rebecca worked as a professor of mathematics. She later turned to look at how people talk about numbers and statistics. Now she supports the educational needs of journalists, and studies how people think and communicate about statistics.

Brian Tarran: A journalist by training, Brian received no training on numbers. He ended up working with the Royal Statistics Society and that’s how he ended up working on stats.

David Spiegelhalter: Coming from a background as a mathematician and medical statistician, he is now a Professor for the Public Understanding of Risk. His job is to do outreach to the press and public. David does statistical communication, focused on risk. Numbers are used to persuade people, so we need to do this better, informing people so they can think slowly about a problem (instead of manipulating their emotions).

Idrees Kahloon: Idrees is a practicing data journalist at the Economist, having studied mathematics and statistics. At the Economist he works on building statistical models.

How do you make sure what you're doing with statistics will work?

Idrees: Runs into this quite a bit, sitting between academics and journalists. This means applying rigorous methods, but on a deadline. It's hard to explain a logistic regression to a lay audience. You have to be statistically sound, but also explainable. The challenge is to straddle this boundary.

David: Influenced by the risk communication field, but there is no easy answer there.  So you decide what you want to do, and then test if it is working the way you want. Use basic visual best practices, and then the crucial thing is to test the materials. Evaluate it.

Brian: At Significance Magazine, a membership/outreach magazine, the goal is to bring people into statistics. There are guidelines to follow, around engagement and ease of reading. The goal is to encourage authors to draw analogies to things they understand.  One example is in an upcoming issue about paleo-climatology; focusing on climate proxies in recent history. The author explains this by comparing it to how Netflix creates recommendations to users. That kind of metaphor is the best way to get these things across.

Rebecca: As David hinted at, you have to know your audience. The first step is to understand who it is you are writing for, and what their background is. So perhaps instead of logistic regression, you might need to focus on explaining the outcome (i.e. not the process). With journalists in a workshop, the main challenge is understanding how to express uncertainty. This is the greatest challenge people face. Pictures and stories are often the best techniques here, rather than technical language.

Trevor: Our statistical understanding is very nascent. To build a better foundation, surveying journalists helps you understand what journalists do and do not know about science and statistics.  Journalists assume researchers know how to design a study and analyze results. You have to understand that isn’t necessarily the case. You have to ask basic questions about study design, data collection, and data analysis techniques.  One of the goals is to build a network of statisticians to help journalists do this.  So a parallel project is to help researchers understand these statistical concepts.

Examples of successful and/or unsuccessful communication, and why?

Trevor: Sense About Science USA created this network of statisticians at academic institutions around the US, and journalists are using an online widget to ask them questions. That interaction is a great success to build on. Science that supports a policy is taken up by various constituencies, and filtered by values. When studies turn out to be poorly done, communicating that gets really hard. People who have adopted knowledge to promote it are not equipped to make judgements about what process or technique was wrong. So they try to shoot you down from an ad hominem point of view. In the US, talking about policy with evidence without becoming tribal has become too hard. So the question of "is this a good study" gets lost very quickly, replaced by a partisan/political interpretation of who you are and your motives for critiquing a study.

Rebecca: When a journalist does have more than an hour to sort through a concept is when we have an opportunity for great success. For example, Rebecca worked with a journalist looking at false positives vs. false negatives. The journalist created a graphic that ended up on 538. The conversation helped her clarify what the mathematics would tell her. Some failures involve speaking with a journalist who just can't wrap their head around an idea, when they can't slow down enough to understand something like an inference. This is the difference between writing about a certainty (which journalists want to do) and a quantified uncertainty. Other times there are just knowledge disconnects in the mathematics, like explaining a confidence interval without the listener understanding what an interval is. There are lots of requests coming in, which points to a shortage of people with these skills in the newsroom. So lots of people are recognizing this need.
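(An aside from me: the false-positive confusion Rebecca mentions usually comes down to base rates. Here's a minimal sketch — the prevalence and test-accuracy numbers below are invented purely for illustration, not from her example.)

```python
# Hypothetical numbers: a 99%-sensitive test for a condition
# that only 1 in 1,000 people actually have.
population = 100_000
prevalence = 0.001          # 100 people have the condition
sensitivity = 0.99          # true-positive rate
false_positive_rate = 0.01  # 1% of healthy people still test positive

sick = population * prevalence
healthy = population - sick

true_positives = sick * sensitivity
false_positives = healthy * false_positive_rate

# Of everyone who tests positive, only a small share are real cases:
share_real = true_positives / (true_positives + false_positives)
print(f"{share_real:.0%} of positive tests are true cases")  # → 9%
```

Even a very accurate test produces mostly false positives when the condition is rare — the intuition behind many "certainty vs. quantified uncertainty" mix-ups.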

Brian: The expertise didn't exist in the newsroom 15 years ago. In his first year, Brian wrote about councils surveying citizens about an issue. This ended up putting citizens and council at odds, because the journalists couldn't explain what the survey told them, or better ways to do this. We just did a terrible job of explaining the fundamentals in a way that could build bridges between people. As for successes: in magazine form it is too hard to convey the details that would help people do statistics themselves. We need to show people how to think like a statistician. This is about a process, and the questions you ask. There is a new column called "Ask a Statistician" which tries to get at this directly. Hopefully over time this will build to something great.

David: One success is keeping certain stories out of the news that don't have good science behind them. Another is the translation of relative risk to absolute risk. If there is a change in risk, you need to show the baseline risk. There was a story about eating bacon sandwiches and how it increased the risk of some disease. The morning story was terrible, but by the evening, after much prompting, the story was told correctly, indicating this would only add 1 case out of 100. Even though the BBC training introduces this, journalists cannot do it on their own. Another story reported how a study said sex was decreasing in the UK, due to phones and technology. David made a joke about this being due to Game of Thrones, but a journalist didn't get the joke and wrote up the headline "no sex by 2030 due to Game of Thrones". This is the danger of clickbait, produced by secondary outlets republishing with a crazy headline.

Idrees: The polls in the last year are a great example of both how to do it well and poorly. There were many models in the US about the election outcome; some set out what the uncertainty was (like 538 giving Trump a 30% chance of winning), but others did not (like the Princeton Election Consortium). Some think it is ok to just report the margin of error, and ignore whether the sample is good. Idrees shared a paper about 50,000 tweets about the death of Jo Cox. To test the claim, the researchers gathered a population of tweets, sampled it, and measured how many were celebratory. Their data shows the true number was an order of magnitude less.
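(A quick note from me: the margin of error for a sampled proportion, like the tweet study's, is a one-line calculation. The counts below are invented, not from the paper.)

```python
import math

# Suppose we hand-coded a random sample of 1,000 tweets and
# judged 30 of them to be celebratory (hypothetical numbers).
n = 1000
p_hat = 30 / n

# Normal-approximation 95% margin of error for the true proportion:
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.1%} celebratory, +/- {margin:.1%}")
```

Of course, as Idrees says, the margin of error is meaningless if the sample itself is biased — it only quantifies sampling noise, not a bad sampling frame.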


Responding to David and Rebecca's comments, we've found that we need to separate percentages and chance. Has anyone come across guidelines about how to describe chance? Many believe you should do it in terms of "1 in 100" type language.

David: This is a disputed area. Words like "probability" and "chance" are difficult, so people use an expected frequency – "of 100 people like you, 5 would have it". This is slightly better than "1 in 100" language. There is always metaphor and analogy involved. The phrase you use also depends on the imagery and its appropriateness for the audience.

Rebecca: When talking about 1 person having something and 99 not having it, you have to say "of people like you". This is a critical piece that stops people from arguing against these types of descriptions. You must express what the denominator is… precisely who we are talking about. Visual depictions can help this a lot. Also, comparing risks or frequencies can help: how does each option affect your risks and outcomes? It is important to pair these.
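(The relative-to-absolute translation the panel keeps returning to is simple arithmetic. A sketch with invented numbers, in the "of 100 people like you" framing:)

```python
# Hypothetical: a baseline risk of 5 in 100, and a headline
# claiming the risk "goes up 20%".
baseline = 0.05
relative_increase = 0.20

absolute = baseline * (1 + relative_increase)
extra_per_100 = (absolute - baseline) * 100

print(f"Of 100 people like you, {baseline * 100:.0f} would have it anyway;")
print(f"with the exposure, {absolute * 100:.0f} would – "
      f"{extra_per_100:.0f} extra case per 100.")
```

A "20% increase" sounds alarming; "one extra case per 100 people like you" is the same fact with the denominator made explicit.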

For Trevor and Rebecca, who have been training journalists: what is the most important single skill for reporters to better work with data?

Trevor: To be pessimistic, most journalists can’t visualize the concepts in statistics.  Especially for probability, uncertainty, and distributions. You have to start with design of the data gathering effort. This leads to a certain approach of doing reporting. The best thing to do is to bring journalists and statisticians together.

Rebecca: In terms of basic numeracy, the most important thing is understanding absolute vs. relative risk. Journalists understand proportions and percentages, so they could understand this distinction in a short amount of time. So many studies present results this way now, and people know how to interpret it. The intuition is there. This is attainable.

Brian: Read The Tiger That Isn't. If everyone read it and appreciated the ways numbers can be misinterpreted, it would improve things a lot.

Idrees: The idea of being able to understand a distribution of outcomes. This is about getting across an expected value and a bell curve.  This is all tangled together though, so it is hard to understand one bit and not another.  Hard to see one silver bullet.

David: To agree with Rebecca, changing relative to absolute risk is vital. Then doing it in whole numbers, and so on. Journalists are intelligent; they are used to critiquing, and their intuition is good. They often lack the confidence to go with their intuition when data comes in. They should go with their guts.

John: Look at some of the questions in the News & Numbers book mentioned earlier.

A key theme here has been counting people who aren't usually counted. What alternative data sources do you use to capture and explain these populations?

David: Using mobile phone data is probably one piece of the discussion that is relevant.

John: The US census tried to enumerate populations like the homeless with formal study design… like looking at a proxy of people receiving services related to their status. Probably the audience is better informed than the panel.

A few years ago, we found that in 40% of journal articles data was presented incorrectly in graphics. We have to start really young to get people's brains working differently. This goes beyond numeracy.

David: the Teaching Probability book is aimed at 10 to 15 year olds. It uses the metaphor of expected frequency as a basis; if you do that, it leads to probability. Converting relative to absolute risk is included in this, based on the idea of what a result means for 100 people. In the UK, probability has been taken out of the primary school curriculum. Recent psychological research says statistical literacy underlies general decision-making skills; it is crucial.

Trevor: The kind of information literacy we teach children is quite poor. Cultural change is possible. The News & Numbers book, despite nailing the problems, had little effect on the culture of journalism. New outlets like Wonkblog, the Upshot, 538, Vox and others suggest that cultural change around the importance of data is happening. There is a danger of naivete, though: the wrong idea that we don't need statistics anymore because we have big data.

John: We need to be training the trainers, to help teachers be equipped to communicate these ideas.

Brian: At their local school they discuss improving the teaching of mathematics, but none of the teachers are confident enough to do this.  They need more confidence. People are too willing to accept the idea that you’re “bad at math”; we need to break that down.

Closing Remarks

Rebecca: The takeaway is to tell a story.  Veer a little from the technical truth to try and tell a story that frames the information in a way that is non-technical. Don’t be scared to say something a little bit incorrectly, to better convey what you want to say.  People will remember better what you say, and become more curious.

Idrees: Data journalism is kind of a new thing, so we will have wrinkles. If you write to an editor about something that is egregious, they actually listen.

Brian: We want to be telling a story, like a feature article not an academic paper. Tell a story the way you want to be told a story. Present your work in that way, with a story structure that feels good.

Trevor: Statistics should not be dry; try to have a real conversation. Numbers don't speak for themselves. Also, recognize the limits of your own background. Think like a designer who communicates knowledge. The name of the game is collaboration.

David: Respect a journalistic approach. That means working with journalists, but at a minimum it means working out the crucial points, developing a story, and trying it out with people.

John: This has been an outstanding conversation.

Finding Data Stories

Many people have written about techniques for telling data-driven stories (1). However, I'm struggling to find a similar list of techniques to help people find stories in their data. To do that you need to have a sense of what kinds of data stories can be told. Here's my current take on a few categories of data stories (expanding on earlier thoughts I had written about). I use this list to help community groups find the stories in their data that they want to tell. Each includes a real example based on data scraped from the Somerville tree audit (the town I live in). All of these techniques benefit from existing statistical techniques that can be used to back up the patterns they illustrate. You can find stories of factoids, connections, comparisons, changes over time, and personal connections in your data.

Factoid Stories

There's only one Eastern Redbud tree in all of Somerville! What's the story of that tree? Turns out its leaves change to bright pink in fall, while everything else is yellow and orange.

An Eastern Redbud tree (from Wikipedia – not the actual tree in Somerville)

Sometimes in large sets of data you find the most interesting thing is the story of one particular point. This could be an "outlier" (a data point not like the others), like the Redbud example above, or it could be the data point that is most common (can we tap more of the Maple trees that dominate Somerville?). Going in depth on one particular piece of your data can be a type of data story that fascinates and surprises people.
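Finding these factoid candidates usually just takes a frequency count. Here's a minimal sketch — the species list is a made-up stand-in for a column scraped from a tree audit:

```python
from collections import Counter

# Stand-in for a species column scraped from a tree audit:
species = ["Maple", "Maple", "Oak", "Maple", "Eastern Redbud", "Oak"]

counts = Counter(species)
most_common, n_most = counts.most_common(1)[0]
one_of_a_kind = [s for s, n in counts.items() if n == 1]

print(f"Most common: {most_common} ({n_most} trees)")  # the dominant story
print(f"One-of-a-kind trees: {one_of_a_kind}")         # the outlier story
```

Both ends of the count are story leads: the species that dominates the city, and the lone tree nobody knew was there.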

Connection Stories

How come Somerville Ave has so many trees in the best condition? Oh, it was recently renovated… that is why those are all new trees. There's a story there about the aesthetic outcomes of big street resurfacing projects.

A map of Somerville with healthy trees in green (created in Tableau Public)

When two aspects of your data seem related, you can tell a story about their connection. The fancy name for this is "correlation", and you of course need to be careful attributing causes for the connection. That said, finding a connection between two aspects of your data can lead to a good story that links things people otherwise don't think about together.
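A quick way to check whether a connection is worth chasing is to compute a correlation coefficient. Here's a sketch with invented numbers (years since a street was renovated vs. the share of its trees in good condition):

```python
# Invented paired data: one row per street.
years_since_renovation = [1, 2, 3, 5, 8, 10]
share_good_condition = [0.95, 0.90, 0.85, 0.70, 0.55, 0.40]

n = len(years_since_renovation)
mean_x = sum(years_since_renovation) / n
mean_y = sum(share_good_condition) / n

# Pearson correlation, computed by hand:
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(years_since_renovation, share_good_condition))
var_x = sum((x - mean_x) ** 2 for x in years_since_renovation)
var_y = sum((y - mean_y) ** 2 for y in share_good_condition)
r = cov / (var_x * var_y) ** 0.5

print(f"Pearson r = {r:.2f}")  # strongly negative here, but correlation != causation
```

A strong r tells you the two columns move together; it doesn't tell you why, which is where the careful reporting comes in.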

Comparison Stories

Walking down Somerville Ave. gives you a good sense of the most common trees across the city. That street is a good representative of the tree population in the city as a whole. Is your street different?

Comparison of tree populations in the city and along one street (large bubbles mean more trees)

Comparing between sections of your data can be a good way to find an illustrative story to tell. Often one part of your data tells one story, but another part tells a totally different story. Or, as in the example above, maybe there is a more human slice of your data that serves as an exemplar of an overall pattern.

Stories of Change

Turns out there was a big die-off of trees in 2008.  Was the climate weird that year? (I made this up since I don’t have any time-based data)

People like thinking about things changing over time. We experience and think about the world based on how we interact with it over time. Telling a story about change over time appeals to people's interest in understanding what caused the change.

“You” Stories

You live on Highland Rd? Did you know that ALL 9 Spruce trees in Somerville are on Highland Rd? Maybe we should rename it “Spruce Rd”?

Map of spruce trees on Highland Rd, colored by tree health (created in Tableau Public)

Another way to find a story in data is to think about how it relates to your life. People with map literacy like maps because they can place themselves on them. This personalization of the story creates a connection to the real-world meaning of the data and can be a powerful type of story for small audiences. Stories about your personal experiences can be grounding and real.

In Conclusion…

This is just one take on the types of data stories that can be told. Please let me know how you think about this! Telling a story effectively is a whole different topic, but I find the story-finding exercise much easier when I introduce a set of categories like this. Most of these benefit from multiple sets of data, so remember to go data "shopping" during your story-finding process.


(1) For instance, I'm a huge fan of Segel and Heer's Narrative Visualization paper, where they give a catalog of visual storytelling techniques. Also good is Marije Rooze's thesis work (particularly the tagged gallery of visualizations from the Guardian and New York Times).

Tools for Data Scraping and Visualization

Over the last few weeks I co-taught a short-course on data scraping and data presentation. It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.

In my Data Therapy work I don’t usually introduce tools, because there are loads of YouTube tutorials and written tutorials.  However, while co-teaching a short-course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.

There are a myriad of tools that support these efforts, so I was forced to pick just a handful to introduce to these students. I wanted to share the short lists of tools I chose.

Data Scraping:

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  There are constantly new tools being built, but I recommend these:

  • Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
  • Import.io: Still nascent, but this is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s very early, and buggy, but on many simple webpages it works well!
  • Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  It lets you define a pattern and find it in any large document.  Sure the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
  • jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing. From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array.
  • ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting twitter followers, and a few others.  Otherwise this is a good engine for software coding.
  • Software Development: If you are a coder, and the website you need to scrape has javascript and logins and such, then you might need to go this route (ugh).  If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize).  I would use Watir if you want to do this in Ruby.
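To make the regular-expressions option concrete, here's a minimal Python sketch. The HTML snippet is a made-up stand-in for rows from something like the tree audit; real pages are messier, which is when the other tools above earn their keep:

```python
import re

# A made-up fragment of a webpage listing street trees:
html = """
<tr><td>Maple</td><td>Somerville Ave</td><td>Good</td></tr>
<tr><td>Spruce</td><td>Highland Rd</td><td>Fair</td></tr>
"""

# One non-greedy pattern pulls (species, street, condition) from each row:
row = re.compile(r"<td>(.*?)</td><td>(.*?)</td><td>(.*?)</td>")
trees = row.findall(html)

print(trees)  # → [('Maple', 'Somerville Ave', 'Good'), ('Spruce', 'Highland Rd', 'Fair')]
```

The same pattern works in a text editor's find-and-replace; the point is that one cryptic line can turn a wall of markup into rows of data.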

Data Interrogation and Visualization:

There are even more tools that help you here.  I picked a handful of single-purpose tools, and some generic ones to share.

  • Tabula: There are a few PDF-cleaning tools, but this one has worked particularly well for me. If your data is in a PDF, and selectable, then I recommend this! (disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula's development as well)
  • OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more!  The School of Data has written well about this – read their OpenRefine handbook.
  • Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis.  They give a nice visual representation of how frequently words appear in quotes, writing, etc.
  • Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Much nicer than the output of Excel.
  • TimelineJS: Need an online timeline?  This is an awesome tool. Disclosure: another Knight-funded project.
  • Google Fusion Tables: This tool has empowered loads of folks to create maps online.  I’m not a big user, but lots of folks recommend it to me.
  • TileMill: Google maps isn’t the only way to make a map.  TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.
  • Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables.  You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.

I hope those are helpful in your data scraping and story-finding adventures!

Curious for More Tools?

Keep your eye on the School of Data and Tactical Technology Collective.

Getting Started in Data Journalism

My friends at PenPlusBytes asked me to speak at their second annual bootcamp for student journalists.  There are many people doing great work in this field, so I drew on their experience to build a short talk. I gave my thoughts, examples of techniques for data-driven journalism, and some tips & tricks.  Working with journalists has always been part of the Data Therapy project, but it was nice to get a chance to focus on it more!

Here is an audio recording of the talk, and the Prezi I used to show visuals.