Tuesday, December 31, 2013

What's HOT for the GeoHipster in 2014


Skybox Imaging and Planet Labs have launched imaging satellites, expect bunch of cool new image products and imagery derived data in 2014. Also note that Frank Warmerdam is at Planet Labs. 

But wait, there's more! There's another readily available source of imagery data, it's in the photos people are posting to Instagram, Flickr and Facebook. Expect tools to exploit this source of imagery.

Hardware hacking

Arduino and Raspberrypi are moving out of their respective blinky lights infancy. Geohipsters will be connecting them to sensors and talking to via node.js. Expect to see other hardware platforms such as Tessel making inroads on the hardware hacking movement. 

Car hacking is still in it's infancy with blue tooth ODBII modules. But as more cars roll out as mobile platforms replete with a API, car modding will be more than just chip modding for performance.

Thursday, March 21, 2013

A little data for geocoding

What's a geocoder to do without data? Fortunately, there's tons of it and more and more produced every day. I have a project where I need to verify the addresses of non-profits. The IRS provides the Statement of Information (SOI) Tax Statistics for Exempt Organizations Business Master File Extract. The data is provided as both Excel files and as fixed width delimited text files. The fixed width files contain all the records and there is one per state.

Using the same technique I used for importing 2010 Census headers, I imported each line/record as a single field into a temporary table. Using the SQL substring function, I extracted the data into their respective fields. Information about the file structure and fields are available in The Instruction Booklet on the download page. 

Below is the script for importing the data.

When all is said and done, you will have a table containing California tax exempt organizations. The next step may seem a little backward, but I exported the data back to a tab delimited text file.

This may seem a step backward, but until there is a built in Postgres geocoder, handling text files is simpler and faster than write code that extracts data for geocoding using an external service.

Saturday, February 2, 2013

Data Science Tool Kit on Vagrant

Pete Warden has released a version of the Data Science Tool Kit on Vagrant. DSTK is a website for munging all sorts of data and includes a geocoder based on TIGER 2010. The website can be unreliable, requiring an occasional restart, so running a VM is a nice option. The vagrant version upgrades the geocoder to TIGER2012 and is a drop in replacement for Google geocoder requests. To run the DSTK locally

Install vagrant from http://www.vagrantup.com/. Create a directory to hold the vagrantfile, then run the following:

Go to to http://localhost:8080 to start using the DSTK.

Tuesday, November 13, 2012

Hacking a Common Operating Picture

Where’s the common operational picture? How can you get timely and accurate reporting without it?"

I've assisted on several disaster relief efforts since Hurricane Katrina and they have all needed a way to communicate where resources are and where they are needed. In military jargon, this is called a common operating picture or a COP. The most common interface for a COP is a map because it is easy to reference, although there is a case to made for a COP that also shows resources and needs by priority. 

Google Maps has made mapping accessible to the public. I first saw this during Katrina when people used Google Earth to link photos to locations in KMLs so that Katrina refugees could see flooded and damaged areas. I've also seen the availability of other mapping platforms and services such as Ushahidi and CartoDB increase, but it seems most people turn to Google Maps/Earth to create maps and data. While Google Earth has lowered the barrier to creating maps, it still requires a modicum of training in order to keep the map updated with the latest information.

I responded a request to help out with Occupy Sandy Staten Island with their map of resources. They had a map but things were rapidly changing and they wanted a volunteer to be able to update the map. The current map was in Google Maps, so I downloaded a KML of the map as a starting point. 

Consistent Look and Feel
Although there are many mapping platforms available, I decided to stick with Google infrastructure because Occupy Sandy Staten Island were already using it. To make the KML editable, I imported it into Google Fusion Tables which brought in the data but not the styling. I wanted the icons and style for each zone (polygonal area) to be consistent with the existing maps. That means that drop off points, shelters, and operations centers would always have the same symbology without the user having to pick and choose an icon or style every time a new point was added to the map.

Google Fusion Tables supports setting styles based on a column value. Google provides a tutorial showing how to merge tables to apply map styles by column. The tutorial shows how to create a columns in both the styles table and the data table that act as a key. In this case, I created a column called type and populated them descriptive keys such as dropoff, shelter, etc.  Although the tutorial shows how to set icons, map styles for polygons and lines can also be styled by column values. Merging style table to the data table results in a data table, row updates (new entries) to either the styles or data table will be displayed in the merge table. In this case, volunteers can just enter the type of point and it will be displayed with the correct symbology.

One of the primary functions of a COP is to locate things. Fusion table supports several types of geometry encoding ranging from KML geometry to street addresses. Street address are automatically converted into geometry. Even a partial street address without city, state, or zipcode are intelligently gecoded based on the bounding box of all the features.

Adding lines or polygonal data requires digitizing and is a bit beyond the scope. I did add one polygonal area using QGIS and OpenStreet Map as a background.  I originally exported the polygon as a KML, but for some reason QGIS outputs KML geometry in Google's World Mercator instead of WGS84 latitude/longitude coordinates in decimal degrees which is what Fusion Tables wants in the way of geography. The workaround was to export as GML and tweak the tags appropriately.

I wanted to have a spreadsheet interface for updating the map. Why a spreadsheet and not a form or just use Fusion Tables tables directly? Most people have used a spreadsheet so the learning curve is very small. In my experience forms get in the way of data entry, having to submit an entry is just a waste of time. Fusion Tables looks like a spreadsheet but it doesn't behave like one, so that adds to the learning curve. 

John McGrath wrote a script to update Fusion Tables from a spreadsheet; it's available on github. The script works but has shortcomings such as deleting all the entries in the Fusion Table before updating it. That means that you must keep and select all the previous entries in the spreadsheet instead of just updating the new entries.

Closing Comments
This was a quick hack done over a several hours. It's certainly far from perfect but it is functional and accomplishes the basic goal of creating a consistant map that can be updated by volunteers. Here are some takeaways:
  • use existing infrastructure; try to avoid having to install, configure or maintain software
  • use existing code; if you need a function, it's probably in github already
  • make data input simple; as a volunteer translator for Project 4636, I found forms getting in the way of entering information, so I used a spreadsheet
  • make data input convenient; for example, Fusion Tables can geocode an address automatically

Monday, September 10, 2012

Code for America Austin Hackathon: Buses and fun with GTFS

Last Saturday, I participated in the Code for America Austin Hackathon and was impressed by the people who attended. The crowd ranged from very experienced coders to folks who understood and could articulate a real problem as well as policies surrounding the problem. When I said well attended, did I mention that the City of Austin CIO and some of his staff were also there to help out? (Wink, wink, nudge, nudge, say no more to @COSAGOV and rising Democratic political star San Antonio Mayor Julian Castro @juliancastro).

Projects proposed at the hackathon ranged from automating campaign finance forms for the City of Austin, mapping bike routes and amenities, creating a mobile app and API for the  Aunt Bertha service which matches people to services, an online form for clients of the Homeless RV project, and improving bus transportation for riders.

I fell in with the bus group and the problem was presented by Glenn Gabois of Movability Austin. The problem we focused on was notifying riders the location and time of arrival of their bus while they wait at their stop.  Capital Metro, Austin's transportation agency, currently does not provide a feed for their buses while they are in transit. So we divide the problem into three componentes:
  • a crowd sourced bus tracking application
  • the server side infrastructure to provide bus data
  • a mobile application that tells rider a bus' current stop and the number of stops away
Crowd sourcing real time bus locations was an interesting discussion, the group discussed how a mobile app could be used to track buses. Some of the features of the app included:

  • would not require user to turn while on the bus or turn off when off the bus
  • determine which stop where the user boarded the bus
  • disambiguate between buses that used the same stops
The group discussion resulted in a mobile application that ran as a background process on a smart phone that would start sending coordinates when the accelerometer registered a change in speed. Determining which bus a rider boarded was a tougher problem and solutions ranged from QR codes on buses (rejected because it required the user to interact with the application), RFID (problematic because of equipment costs), and NFC (Near Field Communication is relatively new an not a common features on smart phones). In the end, the group centered on algorithmically determining the bus and the route based on data sent by the mobile application. Although this was an interesting discussion, the group decided not to address this problem

In order to use crowd sourced bus tracking data, the group discussed a service architecture to receive the bus tracking coordinates and compare them against the current transit data which Capital Metro provided as GTFS (General Transit File Specification) and as ESRI shape files.  We didn't spend much time discussing this part of the solution and I didn't feel up to the task of setting up a cloud based instance of PostgreSQL and importing the GTFS data into the data base (however, I later proved myself wrong and imported the GTFS data in PostgreSQL and created a sql dump of the database for anyone who wants to use it). For a quick feel good burst of "we're doing something", I uploaded the shape file data to Google Fusion Tables for a quick map. We had five hours to build and demo something at the end of the day.

The group decided to focus on the bus stop application (working title of Bus Time). Naqi Sayed and Kevin Reilly of Hidaya Solutions (@HidayaSolutions) worked on an Android application for determining how many stops away from a rider's current bus stop. With the clock ticking, the group decided that the application would work of just one trip of a single route. My contribution to the group was to rework the GTFS files into a single workable data sets. The GTFS data consisted of the following files

Using the GTFS Reference, I was able to decode how to get from a route to a list of stops. In our case, we picked a route and a trip on that routed, we then needed to find the stops for that trip. Since stops are sequential we just needed a sorted list of stops of the application.

To get the list of stops, I went through the GTFS Reference to find the keys that would allow me to associated stops with a specific trip. The joins between the data tables looked like this.

  1. Routes are comprised of trips (northbound, southbound, and time). Select a route from routes.txt, then select a trip by the route_id.
  2. A single trip list stop_times. Select stop_times by trip_id.
  3. Stop_times are related to stops by stop_id in the stops.txt file. Select the the stops using the stop_id.
I produced a csv file that was an ordered list of the stops for the application using a combination of sed/grep/awk/bash/ruby scripting. It wasn't pretty but in worked under our time constraint.

Naqi already had the barebones of an Android app with geolocation so he incorporated the data into the app. I did not work on the app but I think it worked this way:
  • the app determined the location of the rider via WiFi location or GPS and choose the closest stop
  • the bus position feed was simulated and was snapped to the closest stop
  • the app compared the rider's stop to the bus' stop, since it was an ordered list it was matter of counting the between the two stops
In Lean Startup speak, the minimum viable product (MVP) was this application.

To me, it was amazing that a group of people who had just met could hammer out an application in six hours. Everyone brought something to the table that contributed to the overall goal. I learned lots of new things from other folks and also some technical lessons along the way, such as importing GTFS data into a database using these scripts

We did what we did because we had to explore a problem, generate solutions, and implement one of then in six hours. Now that I've had a little more time to think about it, I think we could vastly improve the application by importing the GTFS data into Sqlite and using it directly in the application. The application would then have every route, trip, fares, and amenities available. The application could then just subscribe to real-time transit feed without having to deploy application back end. Additionally, we could revisit the QR code idea. Instead of riders with a crowd sourcing app, the QR code would be on the bus and the reader would be at each stop, as buses load and unload passengers the QR code reader could send the bus information to the real-time transit feed.

As you can see the ideas keep flowing, but it's important to remember that creating a mobile application would have been impossible with open data in the form of GTFS. 

Saturday, August 11, 2012

Notes on getting CartoDB VM working in OSX VirtualBox

I've been wanting to play some more with CartoDB but I've been traveling and my Internet connection has been provided courtesy of whatever cell tower is close by. A VMWare Ubuntu virtual machine is available at aux 2 centimes. Great if your operating system is Windows or linux because VMWare provides a free player to run the VM. On OSX, VMware Fusion can run the VM but it's not free.

The free alternative is Virtual Box, but it requires converting the VMWare vmdk file to an the open ovf format. Before converting the vmdk file, you will need to use your favorite OSX rar archive tool (MacPar, UnRarX) to reassemble the CartoDB vm which is split into three rar files. 

Download the ovftool from VMWare and install. Before converting the vmdk to an ovf file you will need to edit the vmx file so that the path to the vmdk is not hardcoded.

Convert the vmdk to an ova file.

Import the ova into VirtualBox and accept the default settings, they can be changed later.

The next step is to configure the VirtualBox network so that CartoDB is accessible to your host machine. To do this setup the Host-only Adapter. The Host-only Adapter creates a private network that between the host machine and guest machines. 

The vboxnet0 Host-only Adapter is configured from the VirtualBox Preferences menu.

Click on the 'Edit' or screwdriver icon to display the configuration menu. Note that the adapter's address is, which is the IP address that the guest machine accesses the host.

Now that we know the adapter's address, we can configure the network interface of the guest to have a static IP that is addressable from the host machine. Edit the /etc/network/interfaces file as root to set the static address. It should look like this:

Reboot the guest to start the network with the assigned static IP.

The CartoDB virtual machine is configured with two accounts, a devuser account and an admin account. To access these accounts, two URLs have been configured: http://devuser.localhost.lan (user/password: devuser/devuser) and http://admin.localhost.lan (user/password: admin/admin). 

To access these accounts configure the /private/etc/hosts file on the host side (OSX) by adding these entries:

After updating the hosts file, we can access our local instance of CartoDB through a browser. Note: add 'http://' to the URL so that the browser knows you are looking for the local CartoDB site. The devuser map looks like this:

Happy mapping!

Friday, August 3, 2012


A friend in grad school used to go around the office and exhort us to 'go eat' lunch together. Eventually this just became "gweet."

Similarly, an engineer I worked with would build applications using java jars built by another part of our company. I would ask him how his project was going after the jar update and he would invariably say "It's broke." Over time, this elided into 'sproke.

*True story: One Christmas, I bought everyone a computer bag with NPE embroidered on it because everything was 'sproke.

Thursday, August 2, 2012

Visualizing data from a PDF

In my previous post, I scraped the data from a PDF and wrote it to a CSV file. I uploaded the data to CartoDB to make a quick map using the "Share This Map" function on the CartoDB site.

I received a number of comments on the map, mostly about making it more legible by separating out the years and providing more information about each incident. In the spirit of Brian Timoney's post on how web maps are actually used, I decided to keep the map as simple as possible.

After reading the API documentation and looking at the examples, I used the Data Interaction example as the starting point for my map. One of the comments I received was to separate the incidents by year. The interesting part of the example is that it shows how to dynamically change what is presented on a map by changing the values in a SQL query. That's very powerful because you have the data handling and geoprocessing capabilities of PostgreSQL and PostGIS available in your map client. Here's the map with a control to show incidents by year. (The blog form squeezes the map, the see the full map here.)

The script for the map is below.

One of the nice things about the map is that it doesn't use external mapping client libraries. It's simple and shows the user where the incidents occurred. However, I would like to show more information about each incident through a popup. CartoDB plays well with leaflet, modestmaps.js, and wax, and it's easy to extend the basic map with these libraries. In the following map, I used leaflet and the CartoDB popup script to make a popup containing the description of each incident when a marker is clicked. (As with the other map, the full size map can be viewed here.)

The script for the clickable map is below:

The script essentially the same as the previous script but it uses leaflet.js to add a base map from MapBox and the leaflet-carto.js to add the Major Shootings layer from CartoDB. Switching between years is handled differently from the previous example. The resetLayer function uses the setOptions function to update the query for each year. Note that I cheated a little and put the operator as part of the value, e.g. '= 2005". It was the most direct solution to handling the 'all' case in the SQL statement, which required that query for all the records as part of a SQL WHERE clause. I admit that putting the number of incidents on the buttons is cheesy; and if I had more time, I would put a label that would change on top of the map that read "x incidents in xxxx" when a different year is selected.. 

I like CartoDB because it's flexible and agnostic with regards to client mapping libraries. Building a simple mapHowever, what I find exciting is that CartoDB puts an API in front of PostgreSQL and PostGIS. This opens up all sorts of possibilities and removes the need to preprocess or reformat data to create a visualization.

The project file is available here.

Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in pdf reports, and if we're lucky it may be in a structured (HTML) or semi-structured form (e.g. email). Extracting data from pdfs requires translating the data into plain text while preserving the structure as much as possible.

One open source tool for extracting text from pdfs is the Apache Project tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document we can see that it has a structure of location, date and description. Buried in the description is the source in parentheses.

To extract the text from the pdf to plain text use tika. Note that you will need to have java installed in order to run tika

The plain text file also contains the footer from the document. In the case of this document which has approximately 62 pages, I find it faster to remove the footer text using a text editor such as vim, emacs, or textmate. If the document was longer than 100 pages, then I would try to be more clever and use a unix utility such as sed or grep to find and remove these lines.

Once the title, footer, and extraneous text are removed separated by one to several newlines. So we have incidents, that include location, date and description, separated by one or several newlines. The new lines act as separators between incidents. The parse_brady.rb script reads the file line-by-line, if the line is not empty then it adds the line as an element to the ary array. When the line is empty, it processes ary by joining all the lines from description into a single line, slices the description array elements form ary and puts them into the row array, then add the single line description back to row, so that it only contains three elements. 

The first element of row is the location, this split into place and city and the geocode function is called. The geocode function returns either the coordinates if successful or an array with two nil members. The coordinates are added to row. The row is written as a tab delimited string to stdout where it can be redirected to a file.

Inspecting the resulting file shows that there are coordinates missing as well as inconsistencies in the date formatting. Some of the geocodes failures are due to misspellings (St.Louis versus St.Louis) or to places such as counties that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.

While there are numerous ways to make a map from a tab delimited file, such as Google Fusion tables, TileMill and GeoCommons; I decided to use CartoDB to try out their offering. The process is simple, create an account, upload your data, and style it accordingly.

In a follow up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor and by using the CartoDB API to make the map more interactive.

Part 2: Visualizing from a PDF 

Sunday, July 29, 2012

My Review of PostgreSQL: Up and Running

Originally submitted at O'Reilly

A Practical Guide to the Advanced Open Source Database

Good overview + new features in 9.1-9.2

By spara from San Antonio, TX on 7/29/2012


5out of 5

Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise

Cons: A little too colloquial

Best Uses: Reasonably experienced, Intermediate

Describe Yourself: Data monkey, Occasional devops, Developer

PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable around RDBMS such as Oracle or SQLSever. This is definitely not a book for someone starting out with PostgreSQL.

For me, the book is a practical reference that begins to answer questions but directs the reader for more detailed information from other resources that are hyperlinked in the book. I'm a long time user of PostgreSQL 8.4, so the book provides an overview of features that are/will be available in 9.1 and 9.2. For example, the serial data type auto-increments integers is now available. Previously, you had to create a sequence object and use it when creating the table. I also learned about arrays for the first time which have been around since 8.3. Certainly a very handy datatype to have.

A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run sql scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.

The book also covers the standard SQL tables, indexes and views as implemented in PostgreSQL. Writing your own functions and aggregates are touched on and examples are provide. Query performance tuning is covered and typical sql errors, such as overusing subqueries (guilty), are discussed and examples for improving performance are given. Basic replication is also discussed as well options for hosting PostgreSQL.

PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, plPython for writing functions in Python or in other supported languages, native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.

I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers much ground, but goes into sufficient depth to gain an understanding, and it provides additional resources for more in depth treatments of topics. It's a mile wide, deep enough and a good reference for an experienced user.