Getting Data to Answer Your Questions

I often introduce the idea that when you start with a dataset, you should begin by asking your data some questions.  For instance, in this dataset about food waste in Massachusetts, students in my Data Storytelling Studio course brainstormed a number of questions they wanted to ask:

  • is there more food waste in rich areas?
  • do more expensive restaurants waste more food?
  • do restaurants with more waste go out of business at a higher rate?
  • are certain towns more wasteful than others?

This process of asking questions helps you move beyond the data you have to getting the data you need to answer your questions.  This question-centric approach is critical for making sure the dataset you have in hand doesn’t become a constraint that stops you from finding an interesting story.

An Example of Getting More Data

So how do you go from these questions to more data?  I encourage folks to go “data shopping” (a term I enjoy stealing from my colleagues at the Tactical Technology Collective).  This involves taking each of your questions and thinking about what other data you need to answer it, and where you might get that data.  Returning to the food waste example above, to answer the question of whether more expensive restaurants waste more food, you need to categorize restaurants as expensive or not.  My students remembered that most restaurant review sites, like Yelp, have a dollar-bill scale that tells you how expensive a restaurant is.

How could you get that data? You could do it by hand, but that would take a while for all the restaurants in the food waste spreadsheet.  Instead, they pointed out that Yelp has an API, and you could write some software to query that and ask Yelp for the dollar-rating of each restaurant on the list.
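To make that idea concrete, here is a minimal sketch of what such software might look like.  It assumes the search endpoint, query parameters, and "price" field ("$" to "$$$$") of Yelp's Fusion API as I understand them, so check Yelp's current documentation before relying on it.  The function names and the restaurant list are made up for illustration, and you would need your own API key.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Yelp Fusion business search endpoint.
YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_url(name, town):
    """Build a search URL asking for the single best match for a restaurant."""
    query = urllib.parse.urlencode({"term": name, "location": town, "limit": 1})
    return f"{YELP_SEARCH_URL}?{query}"


def fetch_json(url, api_key):
    """Call the API with a bearer token and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def price_level(search_result):
    """Turn a dollar-sign string such as "$$" into a number, or None if missing."""
    businesses = search_result.get("businesses", [])
    if not businesses:
        return None
    price = businesses[0].get("price")
    return len(price) if price else None


def lookup_prices(restaurants, api_key, fetch=fetch_json):
    """Map each (name, town) pair to its price level.

    `fetch` can be swapped out so the logic can be tried without network access.
    """
    prices = {}
    for name, town in restaurants:
        result = fetch(build_search_url(name, town), api_key)
        prices[(name, town)] = price_level(result)
    return prices
```

You would call lookup_prices with the (name, town) pairs from the food waste spreadsheet and then add the resulting numbers as a new column.  Taking the top search hit is a shortcut; a real run would need to check that the match is actually the right restaurant.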

Types of Data Sources

This example uses one source of data – a private company.  There are, of course, others. Here’s the list I tend to introduce:

  • Private Companies – There is tons of data collected and stored by private companies, and sometimes they will give or sell it to you.
  • Governments – There is loads of official data collected by government agencies, and you have a right to the vast majority of it (depending on where you live).
  • Non-Profits or Advocacy Groups – Interest groups typically collect datasets to back up and inform the advocacy they are doing.
  • Crowdsourcing / Do-It-Yourself – Sometimes the data isn’t there, so you need to make it yourself!

That’s the list I use.  Am I missing a category?

Ways to Get Data

Fine, so there is data in a lot of places… how do we get it?  Here’s my list of techniques:

  • Download Open Data – Yes, sometimes the data is just out there waiting for you to find and download it.  This doesn’t mean it is usable, but it is often there.  Usually large non-profits and governments have big data repositories you can poke around.  Sometimes it will be stuck in a PDF or HTML table, but you can still get it out.
  • Ask For It – I mean it. Sometimes you just need to make a phone call and ask. A little social engineering goes a long way!
  • Scrape It – Far too often the data is out there, but not in a nicely usable form… you need to scrape it from a website.  Scraping involves taking data that is scattered around a website and using a process to get it all in one place in the same format. Nowadays there are lots of tools to help you scrape websites.
  • Manually Collect It – If the data isn’t there, you gotta make it yourself.  This might involve crowd-sourced data collection, a focus group, or asking on social media.
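As a small illustration of the scraping technique, here is a sketch that pulls the rows out of an HTML table using only Python's standard library.  The class and function names are my own, and real pages are usually messier (nested tables, merged cells), so treat this as a starting point rather than a finished scraper.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    scraper = TableScraper()
    scraper.feed(html)
    return scraper.rows
```

Feeding it the HTML of a page gives you a list of rows that you can write out to a spreadsheet, which is the "all in one place in the same format" step described above.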

Answering Your Questions

I introduce these two lists, of data sources and ways to get data, in order to support the data shopping process.  With a richer set of data in hand, you’re better positioned to find the most interesting and meaningful stories in your data.

Map-Making for the Masses

Here’s a short story about helping my friends at the Metrowest Regional Center for Healthier Communities create some maps, and my reflections about existing efforts to make map-making easier.  Short story – it worked, but being a big computer dork helped.

The issue at hand was their desire to create a map of the Community Health Network Areas (CHNAs) in Massachusetts, colored by a variety of data indicators.  They had various goals and audiences in mind.  Many Eyes makes it easier to map towns in Massachusetts, but these CHNA borders don’t line up with towns so we couldn’t use that.  I decided to try another tool, Google Fusion Tables, because I knew it could import arbitrary geographic shapes.  After some digging I found that the Massachusetts Oliver online GIS tool had a layer for CHNA boundaries.  Even better, Oliver has KML output! Bingo. After looking through the various files I downloaded from the Oliver website, I was able to guess which one I needed to upload to Fusion Tables.  With that, and some text changes in the resulting table, I was able to create a template my colleagues could use to create colored map visualizations for the CHNAs.  Here’s an example map with some random fake data.  Success!

So what’s the point?  Well, I like to talk about how the barrier to entry for creating data presentations has been lowered by new technologies.  Mapping is one area where this is particularly true – the idea that anyone can make and share a map using tools like Google Maps is truly astounding.  That said, there is often a rocky transition when you try to deal with real data.  This map was much easier to generate thanks to Fusion Tables, but it still required:

  • learning the Fusion Tables model and user interface for data and visualizing
  • understanding what GIS layers are
  • navigating the GIS-centric Oliver website to find the CHNA layer that I cared about
  • understanding the difference between the GIS files to know which KML to import into Fusion Tables

…and more.  So it was convenient that I’m a computer geek who didn’t have too hard a time figuring that stuff out.
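One way to take some of the guesswork out of choosing between downloaded KML files is to peek inside each one and list the shapes it contains.  Here is a rough sketch of that using Python's standard library.  It assumes the file uses the standard KML 2.2 namespace (older files may declare a different one), and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Assumption: the file declares the standard KML 2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def placemark_names(kml_text):
    """Return the name of every Placemark (one shape or point) in a KML document."""
    root = ET.fromstring(kml_text)
    names = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        # Placemarks without a <name> element get a visible placeholder.
        name = placemark.findtext("kml:name", default="(unnamed)", namespaces=KML_NS)
        names.append(name.strip())
    return names
```

If the names that come back look like the regions you are trying to map, you have probably found the right file to import.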

Tools have made it easier, but as I’ve pointed out before, you still need to learn a lot.  This is why I don’t call tools like Fusion Tables “easy to use” on my tool matrix.  When the rubber hits the road for map-making, sometimes you need to put on your GIS hat and pretend you know what you’re doing.