Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in PDF reports; if we're lucky, it may instead be in a structured form (HTML) or a semi-structured form (e.g., email). Extracting data from PDFs requires translating the data into plain text while preserving the structure as much as possible.

One open source tool for extracting text from PDFs is Apache Tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document, we can see that each incident has a structure of location, date, and description. Buried in the description is the source, in parentheses.

Aurora, CO
July 20, 2012
Twelve people were killed and 58 were injured in Aurora, Colorado during a sold-out midnight
premier of the new Batman movie "The Dark Knight Rises" when 24-year-old James Holmes
unloaded four weapons' full of ammunition into the unsuspecting crowd. He detonated multiple
smoke bombs, and then began firing at viewers in the sold-out auditorium, Ten members of
"The Dark Knight Rises" audience were killed in theater, while two others died later at area
hospitals. Numerous patrons were in critical condition at six local hospitals, the Aurora
police said. (Colorado Movie Theater Shooting: 70 Victims The Largest Mass Shooting, ABC,
July 20, 2012)

To extract the text from the PDF to plain text, use Tika. Note that you will need to have Java installed in order to run Tika.

# the -t option extracts to plain text
# redirect the output to major-shootings.txt
java -jar tika-app-1.1.jar -t major-shootings.pdf > major-shootings.txt

The plain text file also contains the footer from the document. In the case of this document, which has approximately 62 pages, I find it faster to remove the footer text using a text editor such as vim, emacs, or TextMate. If the document were longer than 100 pages, I would try to be more clever and use a Unix utility such as sed or grep to find and remove these lines.
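For a longer document, the same cleanup can be scripted. Here's a minimal sketch in Ruby that drops every line containing the footer text; the string "Brady Campaign" and the sample file are stand-ins for whatever the real footer and extracted text look like (the grep equivalent would be `grep -v 'Brady Campaign' major-shootings.txt`):

```ruby
# a stand-in sample of extracted text with a repeated footer line
File.write('sample.txt', "Aurora, CO\nJuly 20, 2012\nBrady Campaign footer\n")

# keep every line that does NOT contain the footer text
cleaned = File.readlines('sample.txt').reject { |line| line.include?('Brady Campaign') }
File.write('sample-clean.txt', cleaned.join)
```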

Once the title, footer, and extraneous text are removed, we are left with incidents, each consisting of a location, a date, and a description, separated by one or more newlines. The blank lines act as separators between incidents. The parse_brady.rb script reads the file line by line; if the line is not empty, it adds the line as an element to the ary array. When the line is empty, it processes ary: it joins all the description lines into a single line, slices the location and date elements from ary into the row array, and then appends the single-line description to row, so that row contains exactly three elements.

The first element of row is the location, which is split into place and state, and the geocode function is called. The geocode function returns either the coordinates, if the lookup succeeds, or an array of two nil members. The coordinates are added to row, and row is written as a tab-delimited string to stdout, where it can be redirected to a file.

require 'rubygems'
require 'sqlite3'

# Look up a populated place in the GNIS database and return its
# [latitude, longitude], or [nil, nil] if there is no match.
def geocode(place, state)
  query = "select primary_lat_dec, primary_lon_dec from national " +
          "where feature_class = 'Populated Place' " +
          "and feature_name = ? and state_alpha = ?"
  row = $db.get_first_row(query, place, state)
  row.nil? ? [nil, nil] : row
end

ary = []
$db = SQLite3::Database.new("GNIS_National")

File.foreach('major-shootings.txt') do |r|
  if !r.chomp.empty?
    ary.push(r.chomp)
  else
    # join the wrapped description lines back into a single line
    description = ary[2, ary.size].join(" ")
    row = ary.slice(0..1)
    row << description
    # location is "City, ST"; strip the whitespace around the state code
    location = row[0].split(",")
    coords = geocode(location[0], location[1].gsub(/\s+/, ""))
    row << coords
    puts row.join("\t")
    ary.clear
  end
end

Inspecting the resulting file shows that some coordinates are missing, and that the date formatting is inconsistent. Some of the geocoding failures are due to misspellings (e.g., "St.Louis" versus "St. Louis"), others to places, such as counties, that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.
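Had the cleanup been too large to do by hand, the date inconsistencies could have been normalized in Ruby. This is a sketch of that idea, not the script I actually ran; it leans on Date.parse, which handles most common English date formats:

```ruby
require 'date'

# parse a date in whatever format it appears and reformat it consistently;
# return nil for strings Date.parse can't handle, so they can be fixed by hand
def normalize_date(str)
  Date.parse(str).strftime('%Y-%m-%d')
rescue ArgumentError
  nil
end
```

Any value that comes back nil gets flagged for manual review instead of silently producing a bad date.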


While there are numerous ways to make a map from a tab-delimited file, such as Google Fusion Tables, TileMill, and GeoCommons, I decided to use CartoDB to try out their offering. The process is simple: create an account, upload your data, and style it accordingly.




In a follow-up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor, and to make the map more interactive using the CartoDB API.


Part 2: Visualizing from a PDF 

Sunday, July 29, 2012

My Review of PostgreSQL: Up and Running

Originally submitted at O'Reilly

A Practical Guide to the Advanced Open Source Database


Good overview + new features in 9.1-9.2

By spara from San Antonio, TX on 7/29/2012

 

5 out of 5

Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise

Cons: A little too colloquial

Best Uses: Reasonably experienced, Intermediate

Describe Yourself: Data monkey, Occasional devops, Developer

PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable with RDBMSs such as Oracle or SQL Server. This is definitely not a book for someone starting out with PostgreSQL.

For me, the book is a practical reference that begins to answer questions, then directs the reader to other resources, hyperlinked in the book, for more detailed information. I'm a long-time user of PostgreSQL 8.4, so the book provided an overview of features that are or will be available in 9.1 and 9.2. For example, the serial data type, which auto-increments integers, is now available; previously, you had to create a sequence object and use it when creating the table. I also learned about arrays for the first time, which have been around since 8.3. Certainly a very handy data type to have.

A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run sql scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.

The book also covers standard SQL tables, indexes, and views as implemented in PostgreSQL. Writing your own functions and aggregates is touched on, and examples are provided. Query performance tuning is covered, and typical SQL errors, such as overusing subqueries (guilty), are discussed with examples for improving performance. Basic replication is also discussed, as well as options for hosting PostgreSQL.

PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, PL/Python for writing functions in Python (or in other supported languages), native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.

I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers much ground, but goes into sufficient depth to gain an understanding, and it provides additional resources for more in-depth treatments of topics. It's a mile wide, deep enough, and a good reference for an experienced user.


Thursday, July 26, 2012

Setting up postgres.app with PostGIS

I should know better, but I'm a sucker for an upgrade. I upgraded to OS X Mountain Lion, and a number of apps I use wouldn't start. A brew-installed version of PostgreSQL couldn't start because the postgres user had been changed to nobody, and for some reason I couldn't su to nobody or create a new postgres user using dscl.

The data in the database wasn't particularly large, and I had the scripts to reload it, so I decided to try Heroku's Postgres.app build of PostgreSQL.

Installation is simple. Download the app and unzip it into the Applications directory. Double-click to start, and the postgres icon will appear in the menu bar.

In order to use the command-line tools, you will need to set the path to the Postgres.app binaries. Add this to your .bash_profile or wherever your paths are set in your dot files. (Note: @mweissman suggested this is unnecessary on OS X Lion 10.7 and higher because PostgreSQL and the client tools are already part of the operating system.)

PATH="/Applications/Postgres.app/Contents/MacOS/bin:$PATH"

In the terminal, connect to postgres using psql and create a database. Then connect to the new database and add the PostGIS extension.

# connect to postgres (from the shell)
psql -h localhost
-- create a database
CREATE DATABASE mydatabase;
-- switch to your database
\connect mydatabase
-- add the postgis extension
CREATE EXTENSION postgis;

Now that the PostGIS extension has been added to your database, creating a table is straightforward. In this example, I'm creating a table from a tab-delimited file with point geometry, then creating a spatial index on the geometry.


-- create the table
CREATE TABLE brady (
location varchar,
date date,
description text,
latitude real,
longitude real);
-- read the file into the table (tab delimited, with a header row)
COPY brady (location, date, description, latitude, longitude)
FROM '/Users/sparafina/projects/gun_violence/major-shootings_geocoded.csv'
WITH DELIMITER E'\t'
CSV HEADER;
-- add geometry column
SELECT AddGeometryColumn('public', 'brady', 'geom', 4326, 'POINT', 2);
-- create points from latitude and longitude fields
UPDATE brady SET geom = ST_SetSRID(ST_MakePoint(longitude,latitude),4326);
-- create a spatial index on the geometry
CREATE INDEX brady_idx ON brady USING GIST ( geom );

And that's it, quick and simple.