I’m working on a neat new OpenGov project that has grown out of a few things I’ve worked on the past.
I’m hoping to launch a new SMS-based service this week that will let anyone quickly and easily find important locations in their neighborhood using an ordinary cell phone (i.e., a non-smart phone).
I’ll launch the new service in Philadelphia, but I plan to open source the code for it so that anyone that wants to launch a similar service in another location will have a code base to start from.
As a data source for the new service, I’m planning on using data sets that are available in PHLAPI. The only drawback to this approach, currently, is that there aren’t a whole heck of a lot of point data sets available in PHLAPI.
That’s not necessarily a deal breaker, as there are plenty of data sets out there just waiting to be liberated and added to the PHLAPI instance. So, as a way of getting the process rolling, I decided to start hacking away at liberating some open data.
One of the more interesting data sets I came across in looking for candidates (pun intended) was a listing of polling locations in the City of Philadelphia. I found a PDF document containing a detailed list of locations (prepared one presumes for the primary election in Philly last week) on the site of the Committee of Seventy.
The PDF document I started with can be found here. There is also a web-based app available on the Committee of Seventy site for finding polling locations, but I wanted the data in as raw a format possible (ideally, CSV).
It turned out to be surprisingly easy to do this. Here’s how I did it.
Converting PDF to Text
Since I run a variety of *NIX machines in my home office, it was pretty easy to use pfdtotext to convert the Committee of Seventy PDF document into a text document. Using the -layout option with pdftotext allowed me to maintain the nice table layout of the original document, and helped with further processing.
From Plain Text to CSV
Once I had the document in text format, it was time to fire up Google Refine. This is an indispensable tool in the arsenal of any OpenGov hacker, and it is enormously powerful for cleaning up and enhancing messy data.
Google refine comes with a built in scripting language called Google Refine Expression Language (GREL) – very much like JavaScript in it’s syntax – that lets you manipulate and refine the data in a project.
I won’t go over all of the steps I used to convert the plain text file I started with to CSV, but if you spend any time playing around with Google Refine you’ll see how easy it is to enhance a data set.
Adding Locational Information
Next it was time to geocode the address of each polling location – something that is not included in the original Committee of Seventy document. One of the really awesome features of Google Refine is that it allows you to add a new column to a data set that is created by making a call to a web service.
This functionality makes it pretty straightforward to use Google’s Geocoding API to get the latitude and longitude for each address in the original file. There are some really good screencasts and demos of this technique on the web, and if you look for them you’ll find some good stuff.
Exporting to CSV
One of the last steps in the process, once the data was cleaned up and enhanced, was to simply export it to CSV.
I’ve loaded a copy of the CSV file I created to GitHub. This format is nice because just about any software program or development tool can consume it (you could, for example, simply open this file in Excel).
But what I really hoped to do was to be able to make this data easy to import into a CouchDB instance, like PHLAPI. Not only would this support my new OpenGov project, it would let anyone that wanted to use the data simply replicate it from a publicly available CouchDB instance.
To do this, I turned to Node.js.
Inserting into CouchDB
What I needed to do with my CSV file was to parse it and turn each row into a JSON-formated document that I could then insert into CouchDB via HTTP POST. This a snap to do with Node.js and a few of the handy modules made available by the Node community (most notably, node-csv and the cradle module)
Here is the script I used (also available on GitHub):
A bit of a hack, but it works just fine. So I now have all of the polling places in the City of Philadelphia in one of my CouchDB instances.
Want to replicate it and use it for yourself? Just run this at the command line (assumes you are running CouchDB locally, and have created a DB named phl_polling_places):
curl -X POST "http://127.0.0.1:5984/_replicate"
-H 'Content-type: application/json'
-d '{"source": "http://markh.couchone.com/phl_polling_places", "target": "phl_polling_places"}'
If this post has inspired you to try and liberate some data yourself, let me know. I’d love to help with more efforts like this.
I’m also still in the process of checking my converted data, to ensure that everything looks correct. Once this is done, I’ll work on getting it inserted in PHLAPI.
Stay tuned for my new service, launching soon…!
UPDATE:
Right after I posted this I learned of an even easier way to export data from Google Refine to CouchDB. Max Ogden‘s Refine Uploader – worth checking out if you are using data from Google Refine to a CouchDB instance.