One open source tool for extracting text from PDFs is Apache Tika. Tika does a reasonably good job of extracting text while preserving structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document, we can see that each incident has a structure of location, date, and description; buried in the description is the source, in parentheses.
Aurora, CO
July 20, 2012
Twelve people were killed and 58 were injured in Aurora, Colorado during a sold-out midnight
premier of the new Batman movie "The Dark Knight Rises" when 24-year-old James Holmes
unloaded four weapons' full of ammunition into the unsuspecting crowd. He detonated multiple
smoke bombs, and then began firing at viewers in the sold-out auditorium, Ten members of
"The Dark Knight Rises" audience were killed in theater, while two others died later at area
hospitals. Numerous patrons were in critical condition at six local hospitals, the Aurora
police said. (Colorado Movie Theater Shooting: 70 Victims The Largest Mass Shooting, ABC,
July 20, 2012)
To extract the text from the PDF to plain text, use Tika. Note that you will need Java installed in order to run Tika:
# the -t option extracts to plain text
# redirect the output to major-shootings.txt
java -jar tika-app-1.1.jar -t major-shootings.pdf > major-shootings.txt
The plain text file also contains the footer from the document. Since this document is only about 62 pages, I found it faster to remove the footer text using a text editor such as vim, emacs, or TextMate. If the document were longer than 100 pages, I would be more clever and use a Unix utility such as sed or grep to find and remove these lines.
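For a longer document, the repeated footer lines could also be stripped programmatically. Here is a minimal Ruby sketch; the footer pattern below is hypothetical, and you would substitute whatever text Tika actually leaves at each page break:

```ruby
# Hypothetical footer pattern -- replace with the footer text that Tika
# leaves behind in the extracted plain text.
FOOTER = /\A(?:Brady Campaign|Page \d+)/

# Return the lines with any footer lines removed.
def strip_footer(lines)
  lines.reject { |line| line =~ FOOTER }
end

# Usage (filenames assumed from the Tika step above):
# File.write('major-shootings-clean.txt',
#            strip_footer(File.readlines('major-shootings.txt')).join)
```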
Once the title, footer, and extraneous text are removed, the file contains one incident per block: location, date, and description, with blocks separated by one or more blank lines. The blank lines act as separators between incidents. The parse_brady.rb script reads the file line by line; if a line is not empty, it appends the line to the ary array. When it reaches an empty line, it processes ary: it joins the wrapped description lines into a single line, slices the location and date elements from ary into the row array, then appends the single-line description to row, so that row contains exactly three elements.
The first element of row is the location, which is split into place and state before the geocode function is called. geocode returns either the coordinates, if successful, or an array with two nil members. The coordinates are appended to row, and row is written as a tab-delimited string to stdout, where it can be redirected to a file.
require 'rubygems'
require 'sqlite3'

# Look up coordinates for a populated place in the GNIS database.
# Returns [lat, lon] on success, or [nil, nil] when there is no match.
def geocode(place, state)
  query = "select primary_lat_dec, primary_lon_dec from national " +
          "where feature_class = ? and feature_name = ? and state_alpha = ?"
  row = $db.get_first_row(query, "Populated Place", place, state)
  row || [nil, nil]
end

ary = []
$db = SQLite3::Database.new("GNIS_National")

File.foreach('major-shootings.txt') do |r|
  if !r.strip.empty?
    # accumulate the lines of the current incident
    ary.push(r.chomp)
  else
    # a blank line ends the incident, so process what we have
    next if ary.empty?                   # skip runs of consecutive blank lines
    description = ary[2..-1].join(" ")   # rejoin the wrapped description lines
    row = ary.slice(0..1)                # location and date
    row << description
    location = row[0].split(",")         # "Aurora, CO" -> place, state
    coords = geocode(location[0], location[1].gsub(/\s+/, ""))
    row << coords                        # join flattens the coordinate pair
    puts row.join("\t")
    row.clear
    ary.clear
  end
end
Inspecting the resulting file shows that some coordinates are missing and that the date formatting is inconsistent. Some of the geocoding failures are due to misspellings (e.g., "St.Louis" versus "St. Louis") or to places, such as counties, that are not in the GNIS database. I manually cleaned up the misspellings and dates, ran the parser again, and then added the county coordinates from Wikipedia.
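The date cleanup could be partly automated. As a sketch (not part of the original workflow), Ruby's Date.parse handles most common date spellings, and normalizing everything to ISO 8601 makes later sorting and charting easier:

```ruby
require 'date'

# Normalize a date string such as "July 20, 2012" to ISO 8601.
# Strings that Date.parse cannot interpret are returned unchanged so
# they can be fixed by hand.
def normalize_date(str)
  Date.parse(str).strftime('%Y-%m-%d')
rescue ArgumentError
  str
end

normalize_date("July 20, 2012")  # => "2012-07-20"
```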
While there are numerous ways to make a map from a tab-delimited file, such as Google Fusion Tables, TileMill, and GeoCommons, I decided to use CartoDB to try out their offering. The process is simple: create an account, upload your data, and style it accordingly.
In a follow up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor and by using the CartoDB API to make the map more interactive.
Part 2: Visualizing from a PDF