Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in pdf reports, and if we're lucky it may be in a structured (HTML) or semi-structured form (e.g. email). Extracting data from pdfs requires translating the data into plain text while preserving the structure as much as possible.

One open source tool for extracting text from pdfs is the Apache Project tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document we can see that it has a structure of location, date and description. Buried in the description is the source in parentheses.

To extract the text from the pdf to plain text use tika. Note that you will need to have java installed in order to run tika

The plain text file also contains the footer from the document. In the case of this document which has approximately 62 pages, I find it faster to remove the footer text using a text editor such as vim, emacs, or textmate. If the document was longer than 100 pages, then I would try to be more clever and use a unix utility such as sed or grep to find and remove these lines.

Once the title, footer, and extraneous text are removed separated by one to several newlines. So we have incidents, that include location, date and description, separated by one or several newlines. The new lines act as separators between incidents. The parse_brady.rb script reads the file line-by-line, if the line is not empty then it adds the line as an element to the ary array. When the line is empty, it processes ary by joining all the lines from description into a single line, slices the description array elements form ary and puts them into the row array, then add the single line description back to row, so that it only contains three elements. 

The first element of row is the location, this split into place and city and the geocode function is called. The geocode function returns either the coordinates if successful or an array with two nil members. The coordinates are added to row. The row is written as a tab delimited string to stdout where it can be redirected to a file.

Inspecting the resulting file shows that there are coordinates missing as well as inconsistencies in the date formatting. Some of the geocodes failures are due to misspellings (St.Louis versus St.Louis) or to places such as counties that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.

While there are numerous ways to make a map from a tab delimited file, such as Google Fusion tables, TileMill and GeoCommons; I decided to use CartoDB to try out their offering. The process is simple, create an account, upload your data, and style it accordingly.

In a follow up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor and by using the CartoDB API to make the map more interactive.

Part 2: Visualizing from a PDF 

Sunday, July 29, 2012

My Review of PostgreSQL: Up and Running

Originally submitted at O'Reilly

A Practical Guide to the Advanced Open Source Database

Good overview + new features in 9.1-9.2

By spara from San Antonio, TX on 7/29/2012


5out of 5

Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise

Cons: A little too colloquial

Best Uses: Reasonably experienced, Intermediate

Describe Yourself: Data monkey, Occasional devops, Developer

PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable around RDBMS such as Oracle or SQLSever. This is definitely not a book for someone starting out with PostgreSQL.

For me, the book is a practical reference that begins to answer questions but directs the reader for more detailed information from other resources that are hyperlinked in the book. I'm a long time user of PostgreSQL 8.4, so the book provides an overview of features that are/will be available in 9.1 and 9.2. For example, the serial data type auto-increments integers is now available. Previously, you had to create a sequence object and use it when creating the table. I also learned about arrays for the first time which have been around since 8.3. Certainly a very handy datatype to have.

A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run sql scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.

The book also covers the standard SQL tables, indexes and views as implemented in PostgreSQL. Writing your own functions and aggregates are touched on and examples are provide. Query performance tuning is covered and typical sql errors, such as overusing subqueries (guilty), are discussed and examples for improving performance are given. Basic replication is also discussed as well options for hosting PostgreSQL.

PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, plPython for writing functions in Python or in other supported languages, native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.

I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers much ground, but goes into sufficient depth to gain an understanding, and it provides additional resources for more in depth treatments of topics. It's a mile wide, deep enough and a good reference for an experienced user.


Thursday, July 26, 2012

Setting up postgres.app with PostGIS

I should know better, but I'm a sucker for an upgrade. I upgraded to OS X Mountain Lion and a number of apps I use wouldn't start. A brew installed version of PostgreSQL couldn't start because the postgres user was changed to nobody. For some reason, I couldn't su to nobody or create a new postgres user using dscl.

The data on the database wasn't particularly large and I also had the scripts to reload the data so I decided to try Heroku's postgres.app install of postgres.
Installation is simple. Download the app and unzip into the Applications directory. Double click to start, and the postgres icon will appear on the task bar.

In order to use the commandline tools, you will need to set the path to the postgres.app binaries. Add this to your .bash_profile or wherever your paths are set in you .dot files. (note: @mweissman suggested this is unnecessary on OSX Lion 10.7 and higher because PostgreSQL and the client tools are already part of the operating system.)

In the terminal, connect to the postgres using psql and create a database. Connect to the database and add the PostGIS extension.

Now that the PostGIS extension has been added to your database, creating table is straight forward. In this example, I'm creating a table from a CSV file with point geometry and creating a spatial index on the geometry.

And that's it, quick and simple.