Saturday, May 26, 2012

Shaping data and re-thinking web mapping architecture

Like many IT folks who came of age in the '90s, I was trained in the intricacies of relational databases, jamming data into third normal form, creating entity-relationship diagrams, and reveling in joins and views. Boyce-Codd semper fi. I never questioned why we tortured data to conform, just assumed that it was the way it was done; i.e. I get paid because I know how to do it. 

I can't remember how many times I've ranted about dumb users sending an Excel file as a database. After watching Max Schireson's presentation on Reinventing the Database, I am shamed. The gist of Schireson's presentation is that we spent a lot of time strongly and parsimoniously typing data to accomodate the technological constraints of computers when memory and disk were expensive.  Obviously, those constraints are no longer valid.
I work with the TreeKit project and they collect their data in this format.

The project collects data about trees and treebeds along a street by measuring along the street. The begin and end coordinates of the street are extracted from a GIS and volunteers using tape measures to collect the distance between tree beds, height and width, and tree data. Using this data, we can calculate the coordinates of each treebed. The data is collected on paper and entered into a PostgreSQL database via an online form. The online form allows users to see the data as entered allowing them to make corrections during data entry, but for the sake of simplicity (i.e. network not required) volunteers use paper forms for data collection.

Collecting tree data along a street in the form of a list is a very natural way for volunteers to gather information. A street is the basic entity that contains trees and treebeds and it constitutes a single record. However, data in this format violates first normal form in that holding the trees along with the street data creates repeating groups with in a single record.

In an ideal relational database there would be a street table, a treebed table, and a tree table with one-to-many relationships defined between the tables. However, the data is entered into a table with repeating groups. An external java process (using GeoTools) calculates the coordinates of trees and treebeds and inserts them into Postgres/PostGIS tables so they can be displayed.  So the mapping stack in this case is Postgres with a TileMill frontend for creating maps, which is served through MapBox. 

In this architecture, Postgres is grossly underutilized. Sure it stores both the raw data and the spatial data but that's all it does. The external java process that creates the spatial data is legacy code (written in an earlier iteration) can be written in something else and in fact is replicated in the data entry form in javascript. In it's simplest formulation, the data can be expressed in JSON.

The raw data would be further transformed into tree and treebeds that are used for drawing maps. The current process calculates a geohash of the centroid of each treebed which is used as the key between trees and treebeds. This is useful for the other analysis that uses the TreeKit data. The tree and treebed data can be stored as GeoJSON, shapefiles, or any other format that TileMill supports. Postgres can be removed from the stack, because it only adds overhead and no advantages.

There are situations where Postgres/PostGIS would be advantageous, such as where data changes or where additional processing or sophisticated querying is needed. Storing the data in simple format that can be easily consumed by most web applications has several advantages over storing it in a database. First, GeoJSON can be consumed directly my many web mapping clients; second, the overhead of maintaining another piece of software is removed; and finally the data can be transformed into other formats easily. So if someone hands you a Excel or csv file, try using a simpler format such as GeoJSON and simplify your web mapping stack.

No comments:

Post a Comment