Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in PDF reports; if we're lucky, it may instead be in a structured form (HTML) or a semi-structured one (e.g. email). Extracting data from PDFs requires translating it into plain text while preserving as much of the structure as possible.

One open source tool for extracting text from PDFs is the Apache project Tika. Tika does a reasonably good job of extracting text while preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document, we can see that each incident has a structure of location, date, and description. Buried in the description is the source, in parentheses.

Aurora, CO
July 20, 2012
Twelve people were killed and 58 were injured in Aurora, Colorado during a sold-out midnight
premier of the new Batman movie "The Dark Knight Rises" when 24-year-old James Holmes
unloaded four weapons' full of ammunition into the unsuspecting crowd. He detonated multiple
smoke bombs, and then began firing at viewers in the sold-out auditorium, Ten members of
"The Dark Knight Rises" audience were killed in theater, while two others died later at area
hospitals. Numerous patrons were in critical condition at six local hospitals, the Aurora
police said. (Colorado Movie Theater Shooting: 70 Victims The Largest Mass Shooting, ABC,
July 20, 2012)

To extract the text from the PDF to plain text, use Tika. Note that you will need Java installed in order to run Tika.

# the -t option extracts to plain text
# redirect the output to major-shootings.txt
java -jar tika-app-1.1.jar -t major-shootings.pdf > major-shootings.txt

The plain text file also contains the footer from the document. Since this document is only about 62 pages, I found it faster to remove the footer text with a text editor such as vim, emacs, or TextMate. If the document were longer than 100 pages, I would be more clever and use a Unix utility such as sed or grep to find and remove those lines.
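For a longer document, the sed approach might look like the following sketch. The footer pattern and file names here are illustrative; you would substitute whatever text actually repeats at the bottom of each extracted page.

```shell
# A tiny stand-in for the Tika output (the real file would be major-shootings.txt).
printf 'Aurora, CO\nPage footer text here\nJuly 20, 2012\n' > sample.txt

# Delete every line matching the (hypothetical) footer pattern
# and write the remaining lines to a cleaned copy.
sed '/Page footer text/d' sample.txt > sample-clean.txt
```

The same one-liner scales to any number of pages, which is the advantage over hand-editing.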

Once the title, footer, and extraneous text are removed, we have incidents — each a location, date, and description — separated by one or more blank lines. The blank lines act as separators between incidents. The parse_brady.rb script reads the file line by line; if the line is not empty, it adds the line as an element to the ary array. When the line is empty, it processes ary: it joins all the description lines into a single string, slices the location and date out of ary into the row array, then appends the joined description to row, so that row contains exactly three elements.

The first element of row is the location, which is split into place and state, and the geocode function is called with both. The geocode function returns the coordinates if the lookup succeeds, or an array of two nil members otherwise. The coordinates are appended to row, and row is written as a tab-delimited string to stdout, where it can be redirected to a file.

require 'rubygems'
require 'sqlite3'

# Look up a populated place in the GNIS database and return its
# [lat, lon] pair, or [nil, nil] when there is no match.
def geocode(place, state)
  row = $db.get_first_row(
    "select primary_lat_dec, primary_lon_dec from national " \
    "where feature_class = 'Populated Place' " \
    "and feature_name = ? and state_alpha = ?",
    place, state)
  row || [nil, nil]
end

$db = SQLite3::Database.new("GNIS_National")
ary = []

File.foreach('major-shootings.txt') do |r|
  if !r.strip.empty?
    # Accumulate the lines of one incident until a blank separator line.
    ary.push(r.chomp)
  else
    next if ary.empty?
    # ary holds [location, date, description line 1, line 2, ...]:
    # join the description lines, then rebuild a three-element row.
    description = ary[2..-1].join(" ")
    row = ary.slice(0..1)
    row << description
    place, state = row[0].split(",")
    coords = geocode(place, state.gsub(/\s+/, ""))
    row << coords
    # Array#join flattens the nested coords into five tab-separated fields.
    puts row.join("\t")
    ary.clear
  end
end

Inspecting the resulting file shows that some coordinates are missing and that the date formatting is inconsistent. Some of the geocoding failures are due to misspellings (St.Louis versus St. Louis), others to places, such as counties, that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.
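The kind of cleanup done by hand could also be scripted. Here is a minimal sketch; the misspelling table is illustrative (only the St. Louis fix comes from the actual data), and normalize_date simply reparses each date string into a single ISO 8601 format.

```ruby
require 'date'

# Illustrative misspelling fixes; a real run would list every
# variant found in the extracted text.
FIXES = { "St.Louis" => "St. Louis" }

# Apply each known fix to a line of text.
def clean_line(line)
  FIXES.each { |bad, good| line = line.gsub(bad, good) }
  line
end

# Normalize a date string like "July 20, 2012" to "2012-07-20",
# leaving anything unparseable untouched.
def normalize_date(str)
  Date.parse(str).iso8601
rescue ArgumentError
  str
end
```

Running every line through clean_line and every date field through normalize_date would make the second parser pass reproducible instead of manual.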


While there are numerous ways to make a map from a tab-delimited file, such as Google Fusion Tables, TileMill, and GeoCommons, I decided to use CartoDB to try out their offering. The process is simple: create an account, upload your data, and style it accordingly.




In a follow-up post, I'll go through the steps to make the map more legible, using TileMill as a Carto style editor and the CartoDB API to make the map more interactive.


Part 2: Visualizing from a PDF