Tuesday, November 13, 2012

Hacking a Common Operating Picture

"Where’s the common operational picture? How can you get timely and accurate reporting without it?"

I've assisted on several disaster relief efforts since Hurricane Katrina, and all of them have needed a way to communicate where resources are and where they are needed. In military jargon, this is called a common operating picture, or COP. The most common interface for a COP is a map because it is easy to reference, although there is a case to be made for a COP that also shows resources and needs by priority.

Google Maps has made mapping accessible to the public. I first saw this during Katrina when people used Google Earth to link photos to locations in KMLs so that Katrina refugees could see flooded and damaged areas. I've also seen the availability of other mapping platforms and services such as Ushahidi and CartoDB increase, but it seems most people turn to Google Maps/Earth to create maps and data. While Google Earth has lowered the barrier to creating maps, it still requires a modicum of training in order to keep the map updated with the latest information.

I responded to a request from Occupy Sandy Staten Island to help out with their map of resources. They had a map, but things were rapidly changing, and they wanted volunteers to be able to update the map. The current map was in Google Maps, so I downloaded a KML of the map as a starting point.

Consistent Look and Feel
Although there are many mapping platforms available, I decided to stick with Google infrastructure because Occupy Sandy Staten Island was already using it. To make the KML editable, I imported it into Google Fusion Tables, which brought in the data but not the styling. I wanted the icons and style for each zone (polygonal area) to be consistent with the existing maps. That means drop-off points, shelters, and operations centers would always have the same symbology without the user having to pick and choose an icon or style every time a new point was added to the map.

Google Fusion Tables supports setting styles based on a column value, and Google provides a tutorial showing how to merge tables to apply map styles by column. The tutorial shows how to create a column in both the styles table and the data table that acts as a key. In this case, I created a column called type and populated it with descriptive keys such as dropoff, shelter, etc. Although the tutorial shows how to set icons, map styles for polygons and lines can also be set by column values. Merging the styles table with the data table results in a merged table; row updates (new entries) to either the styles or data table will be displayed in the merged table. In this case, volunteers can just enter the type of a point and it will be displayed with the correct symbology.
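The effect of the merge is a keyed join between the two tables. A minimal sketch of the idea in Python (the icon names and point names here are made up for illustration):

```python
# Styles table: one row per feature type, keyed on 'type'.
styles = {"dropoff": "blue_pin", "shelter": "red_pin", "ops": "star_pin"}

# Data table: volunteers only ever fill in a name and a type.
points = [
    {"name": "Example Dropoff", "type": "dropoff"},
    {"name": "Example Shelter", "type": "shelter"},
]

# The merged table picks up the symbology automatically from the key.
merged = [dict(p, icon=styles[p["type"]]) for p in points]
print([p["icon"] for p in merged])  # ['blue_pin', 'red_pin']
```

Adding a new point is just a matter of supplying a known type; the style follows from the key.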

One of the primary functions of a COP is to locate things. Fusion Tables supports several types of geometry encoding, ranging from KML geometry to street addresses. Street addresses are automatically converted into geometry. Even a partial street address without city, state, or zip code is intelligently geocoded based on the bounding box of all the features.

Adding line or polygonal data requires digitizing and is a bit beyond the scope of this post. I did add one polygonal area using QGIS with OpenStreetMap as a background. I originally exported the polygon as KML, but for some reason QGIS output the KML geometry in Google's spherical Mercator instead of WGS84 latitude/longitude coordinates in decimal degrees, which is what Fusion Tables wants in the way of geography. The workaround was to export as GML and tweak the tags appropriately.
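For reference, the conversion I had to work around is simple math. A sketch of the inverse spherical Mercator transform, assuming the standard sphere radius Google's projection uses:

```python
import math

# Spherical Mercator radius (an assumption of this sketch; the actual
# export may declare a different spatial reference).
R = 6378137.0

def mercator_to_wgs84(x, y):
    """Convert spherical Mercator meters to WGS84 decimal degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat

print(mercator_to_wgs84(0.0, 0.0))  # (0.0, 0.0)
```

Running each coordinate pair through something like this is an alternative to the GML tag-tweaking route.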

I wanted a spreadsheet interface for updating the map. Why a spreadsheet and not a form, or just using Fusion Tables directly? Most people have used a spreadsheet, so the learning curve is very small. In my experience, forms get in the way of data entry; having to submit each entry is just a waste of time. Fusion Tables looks like a spreadsheet but doesn't behave like one, so that adds to the learning curve.

John McGrath wrote a script to update Fusion Tables from a spreadsheet; it's available on GitHub. The script works but has shortcomings, such as deleting all the entries in the Fusion Table before updating it. That means you must keep and select all the previous entries in the spreadsheet instead of just adding the new entries.
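One way such a script could avoid the delete-everything step is to diff the spreadsheet against the table and push only the rows that aren't there yet. A sketch with invented row values:

```python
# Rows already in the table and rows currently in the spreadsheet
# (values are made up for illustration).
existing = [("Shelter A", "dropoff"), ("Shelter B", "shelter")]
spreadsheet = [("Shelter A", "dropoff"), ("Shelter B", "shelter"),
               ("New Center", "ops")]

# Push only the rows not already present.
seen = set(existing)
new_rows = [row for row in spreadsheet if row not in seen]
print(new_rows)  # [('New Center', 'ops')]
```

This treats whole rows as the identity, which is enough for append-only updates; edits to existing rows would need a real key column.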

Closing Comments
This was a quick hack done over several hours. It's certainly far from perfect, but it is functional and accomplishes the basic goal of creating a consistent map that can be updated by volunteers. Here are some takeaways:
  • use existing infrastructure; try to avoid having to install, configure or maintain software
  • use existing code; if you need a function, it's probably in github already
  • make data input simple; as a volunteer translator for Project 4636, I found forms getting in the way of entering information, so I used a spreadsheet
  • make data input convenient; for example, Fusion Tables can geocode an address automatically

Monday, September 10, 2012

Code for America Austin Hackathon: Buses and fun with GTFS

Last Saturday, I participated in the Code for America Austin Hackathon and was impressed by the people who attended. The crowd ranged from very experienced coders to folks who understood and could articulate a real problem as well as the policies surrounding it. And did I mention that the City of Austin CIO and some of his staff were also there to help out? (Wink, wink, nudge, nudge, say no more to @COSAGOV and rising Democratic political star San Antonio Mayor Julian Castro @juliancastro).

Projects proposed at the hackathon ranged from automating campaign finance forms for the City of Austin, mapping bike routes and amenities, creating a mobile app and API for the  Aunt Bertha service which matches people to services, an online form for clients of the Homeless RV project, and improving bus transportation for riders.

I fell in with the bus group, and the problem was presented by Glenn Gabois of Movability Austin. The problem we focused on was notifying riders of the location and arrival time of their bus while they wait at their stop. Capital Metro, Austin's transportation agency, currently does not provide a feed for their buses while they are in transit. So we divided the problem into three components:
  • a crowd sourced bus tracking application
  • the server side infrastructure to provide bus data
  • a mobile application that tells a rider a bus's current stop and the number of stops away
Crowd sourcing real-time bus locations made for an interesting discussion; the group talked through how a mobile app could be used to track buses. Some of the desired features of the app:

  • would not require the user to turn it on while on the bus or turn it off when off the bus
  • determine the stop where the user boarded the bus
  • disambiguate between buses that used the same stops
The group discussion resulted in a mobile application that ran as a background process on a smartphone and would start sending coordinates when the accelerometer registered a change in speed. Determining which bus a rider boarded was a tougher problem, and solutions ranged from QR codes on buses (rejected because it required the user to interact with the application), to RFID (problematic because of equipment costs), to NFC (Near Field Communication is relatively new and not a common feature on smartphones). In the end, the group centered on algorithmically determining the bus and the route based on data sent by the mobile application. Although this was an interesting discussion, the group decided not to address this problem.

In order to use crowd-sourced bus tracking data, the group discussed a service architecture to receive the bus tracking coordinates and compare them against the current transit data, which Capital Metro provides as GTFS (General Transit Feed Specification) and as ESRI shapefiles. We didn't spend much time discussing this part of the solution, and I didn't feel up to the task of setting up a cloud-based instance of PostgreSQL and importing the GTFS data into the database (however, I later proved myself wrong, imported the GTFS data into PostgreSQL, and created a SQL dump of the database for anyone who wants to use it). For a quick feel-good burst of "we're doing something", I uploaded the shapefile data to Google Fusion Tables for a quick map. We had five hours to build and demo something at the end of the day.

The group decided to focus on the bus stop application (working title: Bus Time). Naqi Sayed and Kevin Reilly of Hidaya Solutions (@HidayaSolutions) worked on an Android application for determining how many stops a bus is away from a rider's current stop. With the clock ticking, the group decided that the application would work off just one trip of a single route. My contribution was to rework the GTFS files into a single workable data set. The GTFS data consists of several related text files.

Using the GTFS Reference, I was able to work out how to get from a route to a list of stops. In our case, we picked a route and a trip on that route; we then needed to find the stops for that trip. Since stops are sequential, we just needed a sorted list of stops for the application.

To get the list of stops, I went through the GTFS Reference to find the keys that would allow me to associate stops with a specific trip. The joins between the data tables looked like this:

  1. Routes comprise trips (northbound, southbound, and by time). Select a route from routes.txt, then select a trip by route_id.
  2. A single trip lists stop_times. Select stop_times by trip_id.
  3. Stop_times are related to stops by stop_id in the stops.txt file. Select the stops using stop_id.
I produced a CSV file that was an ordered list of the stops for the application using a combination of sed/grep/awk/bash/ruby scripting. It wasn't pretty, but it worked under our time constraint.
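The joins above can also be sketched in SQL. Here's a toy version using Python's built-in SQLite, with made-up sample rows (the column names follow the GTFS Reference):

```python
import sqlite3

# Minimal GTFS-shaped tables; the sample rows are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE routes (route_id TEXT, route_short_name TEXT);
CREATE TABLE trips (route_id TEXT, trip_id TEXT, trip_headsign TEXT);
CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, stop_sequence INTEGER);
CREATE TABLE stops (stop_id TEXT, stop_name TEXT);
""")
db.executemany("INSERT INTO routes VALUES (?, ?)", [("1", "Route 1")])
db.executemany("INSERT INTO trips VALUES (?, ?, ?)", [("1", "t1", "Northbound")])
db.executemany("INSERT INTO stop_times VALUES (?, ?, ?)",
               [("t1", "s2", 2), ("t1", "s1", 1), ("t1", "s3", 3)])
db.executemany("INSERT INTO stops VALUES (?, ?)",
               [("s1", "First St"), ("s2", "Second St"), ("s3", "Third St")])

# Route -> trip -> stop_times -> stops, ordered by stop_sequence.
rows = db.execute("""
SELECT s.stop_name
FROM trips t
JOIN stop_times st ON st.trip_id = t.trip_id
JOIN stops s ON s.stop_id = st.stop_id
WHERE t.route_id = '1' AND t.trip_id = 't1'
ORDER BY st.stop_sequence
""").fetchall()
print([name for (name,) in rows])  # ['First St', 'Second St', 'Third St']
```

With the GTFS files loaded into a real database, the same three joins replace the sed/grep/awk pipeline.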

Naqi already had the barebones of an Android app with geolocation so he incorporated the data into the app. I did not work on the app but I think it worked this way:
  • the app determined the location of the rider via WiFi location or GPS and chose the closest stop
  • the bus position feed was simulated and was snapped to the closest stop
  • the app compared the rider's stop to the bus's stop; since the stops were an ordered list, it was a matter of counting the stops between the two
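Once the ordered stop list exists, the stops-away calculation reduces to list arithmetic. A sketch with invented stop names:

```python
# Ordered list of stops for the trip (names are made up).
stops = ["First St", "Second St", "Third St", "Fourth St", "Fifth St"]

def stops_away(rider_stop, bus_stop):
    """How many stops the bus is from the rider, given the ordered list."""
    return stops.index(rider_stop) - stops.index(bus_stop)

# Bus snapped to Second St, rider waiting at Fourth St:
print(stops_away("Fourth St", "Second St"))  # 2
```

A negative result would mean the bus has already passed the rider's stop.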
In Lean Startup speak, the minimum viable product (MVP) was this application.

To me, it was amazing that a group of people who had just met could hammer out an application in six hours. Everyone brought something to the table that contributed to the overall goal. I learned lots of new things from other folks, along with some technical lessons, such as importing GTFS data into a database using these scripts.

We did what we did because we had to explore a problem, generate solutions, and implement one of them in six hours. Now that I've had a little more time to think about it, I believe we could vastly improve the application by importing the GTFS data into SQLite and using it directly in the application. The application would then have every route, trip, fare, and amenity available. The application could then just subscribe to a real-time transit feed without having to deploy an application back end. Additionally, we could revisit the QR code idea. Instead of riders with a crowd-sourcing app, the QR code would be on the bus and the reader would be at each stop; as buses load and unload passengers, the QR code reader could send the bus information to the real-time transit feed.

As you can see, the ideas keep flowing, but it's important to remember that creating this mobile application would have been impossible without open data in the form of GTFS.

Saturday, August 11, 2012

Notes on getting CartoDB VM working in OSX VirtualBox

I've been wanting to play some more with CartoDB, but I've been traveling and my Internet connection has been provided courtesy of whatever cell tower is close by. A VMware Ubuntu virtual machine of CartoDB is available. Great if your operating system is Windows or Linux, because VMware provides a free player to run the VM. On OSX, VMware Fusion can run the VM, but it's not free.

The free alternative is VirtualBox, but it requires converting the VMware vmdk file to the open OVF format. Before converting the vmdk file, you will need to use your favorite OSX rar archive tool (MacPar, UnRarX) to reassemble the CartoDB VM, which is split into three rar files.

Download the ovftool from VMware and install it. Before converting the vmdk to an OVF file, you will need to edit the vmx file so that the path to the vmdk is not hardcoded.

Convert the vmdk to an OVA file.

Import the OVA into VirtualBox and accept the default settings; they can be changed later.

The next step is to configure the VirtualBox network so that CartoDB is accessible to your host machine. To do this, set up the Host-only Adapter. The Host-only Adapter creates a private network between the host machine and guest machines.

The vboxnet0 Host-only Adapter is configured from the VirtualBox Preferences menu.

Click on the 'Edit' (screwdriver) icon to display the configuration menu. Note the adapter's IPv4 address; this is the IP address at which the guest machine accesses the host.

Now that we know the adapter's address, we can configure the network interface of the guest to have a static IP that is addressable from the host machine. Edit the /etc/network/interfaces file as root to set the static address. It should look like this:
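For example, assuming the host-only adapter sits at and you give the guest (the interface name and addresses are assumptions; match them to your adapter):

auto eth0
iface eth0 inet static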

Reboot the guest to start the network with the assigned static IP.

The CartoDB virtual machine is configured with two accounts, a devuser account and an admin account. To access these accounts, two URLs have been configured: http://devuser.localhost.lan (user/password: devuser/devuser) and http://admin.localhost.lan (user/password: admin/admin). 

To access these accounts configure the /private/etc/hosts file on the host side (OSX) by adding these entries:
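For example, assuming the guest's static IP is the entries would be:

``` devuser.localhost.lan admin.localhost.lan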

After updating the hosts file, we can access our local instance of CartoDB through a browser. Note: add 'http://' to the URL so that the browser knows you are looking for the local CartoDB site. The devuser map looks like this:

Happy mapping!

Friday, August 3, 2012


A friend in grad school used to go around the office and exhort us to 'go eat' lunch together. Eventually this just became "gweet."

Similarly, an engineer I worked with would build applications using java jars built by another part of our company. I would ask him how his project was going after the jar update and he would invariably say "It's broke." Over time, this elided into 'sproke.

*True story: One Christmas, I bought everyone a computer bag with NPE embroidered on it because everything was 'sproke.

Thursday, August 2, 2012

Visualizing data from a PDF

In my previous post, I scraped the data from a PDF and wrote it to a CSV file. I uploaded the data to CartoDB to make a quick map using the "Share This Map" function on the CartoDB site.

I received a number of comments on the map, mostly about making it more legible by separating out the years and providing more information about each incident. In the spirit of Brian Timoney's post on how web maps are actually used, I decided to keep the map as simple as possible.

After reading the API documentation and looking at the examples, I used the Data Interaction example as the starting point for my map. One of the comments I received was to separate the incidents by year. The interesting part of the example is that it shows how to dynamically change what is presented on a map by changing the values in a SQL query. That's very powerful because you have the data handling and geoprocessing capabilities of PostgreSQL and PostGIS available in your map client. Here's the map with a control to show incidents by year. (The blog format squeezes the map; see the full map here.)

The script for the map is below.

One of the nice things about the map is that it doesn't use external mapping client libraries. It's simple and shows the user where the incidents occurred. However, I would like to show more information about each incident through a popup. CartoDB plays well with leaflet, modestmaps.js, and wax, and it's easy to extend the basic map with these libraries. In the following map, I used leaflet and the CartoDB popup script to make a popup containing the description of each incident when a marker is clicked. (As with the other map, the full size map can be viewed here.)

The script for the clickable map is below:

The script is essentially the same as the previous one, but it uses leaflet.js to add a base map from MapBox and leaflet-carto.js to add the Major Shootings layer from CartoDB. Switching between years is handled differently from the previous example: the resetLayer function uses the setOptions function to update the query for each year. Note that I cheated a little and put the operator as part of the value, e.g. '= 2005'. It was the most direct solution to handling the 'all' case, which required querying for all the records as part of a SQL WHERE clause. I admit that putting the number of incidents on the buttons is cheesy; if I had more time, I would put a label on top of the map that changed to read "x incidents in xxxx" when a different year is selected.

I like CartoDB because it's flexible and agnostic with regard to client mapping libraries. However, what I find exciting is that CartoDB puts an API in front of PostgreSQL and PostGIS. This opens up all sorts of possibilities and removes the need to preprocess or reformat data to create a visualization.

The project file is available here.

Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in PDF reports; if we're lucky, it may instead be in a structured (HTML) or semi-structured (e.g. email) form. Extracting data from PDFs requires translating the data into plain text while preserving the structure as much as possible.

One open source tool for extracting text from PDFs is Apache Tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document, we can see that it has a structure of location, date, and description. Buried in the description is the source, in parentheses.

To extract the text from the PDF to plain text, use Tika. Note that you will need to have Java installed in order to run Tika.

The plain text file also contains the footer from the document. In the case of this document, which is approximately 62 pages, I found it faster to remove the footer text using a text editor such as vim, emacs, or TextMate. If the document were longer than 100 pages, I would try to be more clever and use a Unix utility such as sed or grep to find and remove these lines.

Once the title, footer, and extraneous text are removed, we have incidents, each with a location, date, and description, separated by one or more newlines. The newlines act as separators between incidents. The parse_brady.rb script reads the file line by line; if the line is not empty, it adds the line as an element to the ary array. When the line is empty, it processes ary: it joins all the description lines into a single line, slices the location and date elements from ary into the row array, then adds the single-line description to row, so that row contains only three elements.
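The original script is Ruby; here is an illustrative Python re-sketch of that record-splitting logic, using placeholder text rather than real incident data:

```python
# Records are separated by blank lines: the first line is the location,
# the second the date, and the remaining lines are the description.
text = """Example City, TX
May 29, 2005
A placeholder description line one.
A placeholder description line two. (Source A)

Other Town, NY
June 1, 2005
Another placeholder description. (Source B)
"""

records, ary = [], []
for line in text.splitlines():
    if line.strip():
        ary.append(line.strip())
    elif ary:
        # blank line: flush the accumulated record as [location, date, desc]
        records.append([ary[0], ary[1], " ".join(ary[2:])])
        ary = []
if ary:  # flush the last record if the text doesn't end with a blank line
    records.append([ary[0], ary[1], " ".join(ary[2:])])

print(len(records))   # 2
print(records[0][0])  # Example City, TX
```

Each flushed record then goes through geocoding before being written out.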

The first element of row is the location; this is split into city and state, and the geocode function is called. The geocode function returns the coordinates if successful, or an array with two nil members. The coordinates are added to row. The row is written as a tab-delimited string to stdout, where it can be redirected to a file.

Inspecting the resulting file shows that there are missing coordinates as well as inconsistencies in the date formatting. Some of the geocoding failures are due to misspellings (St.Louis versus St. Louis) or to places, such as counties, that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.

While there are numerous ways to make a map from a tab-delimited file, such as Google Fusion Tables, TileMill, and GeoCommons, I decided to use CartoDB to try out their offering. The process is simple: create an account, upload your data, and style it accordingly.

In a follow up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor and by using the CartoDB API to make the map more interactive.

Part 2: Visualizing from a PDF 

Sunday, July 29, 2012

My Review of PostgreSQL: Up and Running

Originally submitted at O'Reilly

A Practical Guide to the Advanced Open Source Database

Good overview + new features in 9.1-9.2

By spara from San Antonio, TX on 7/29/2012


5 out of 5

Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise

Cons: A little too colloquial

Best Uses: Reasonably experienced, Intermediate

Describe Yourself: Data monkey, Occasional devops, Developer

PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable around an RDBMS such as Oracle or SQL Server. This is definitely not a book for someone starting out with PostgreSQL.

For me, the book is a practical reference that begins to answer questions but directs the reader to more detailed information in other resources that are hyperlinked in the book. I'm a long-time user of PostgreSQL 8.4, so the book provided an overview of features that are or will be available in 9.1 and 9.2. For example, the serial data type, which auto-increments integers, is now available; previously, you had to create a sequence object and use it when creating the table. I also learned about arrays for the first time, which have been around since 8.3. Certainly a very handy datatype to have.

A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run sql scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.

The book also covers standard SQL tables, indexes, and views as implemented in PostgreSQL. Writing your own functions and aggregates is touched on, and examples are provided. Query performance tuning is covered, and typical SQL errors, such as overusing subqueries (guilty), are discussed, with examples for improving performance. Basic replication is also discussed, as well as options for hosting PostgreSQL.

PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, plPython for writing functions in Python or in other supported languages, native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.

I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers much ground, but goes into sufficient depth to gain an understanding, and it provides additional resources for more in depth treatments of topics. It's a mile wide, deep enough and a good reference for an experienced user.


Thursday, July 26, 2012

Setting up postgres.app with PostGIS

I should know better, but I'm a sucker for an upgrade. I upgraded to OS X Mountain Lion and a number of apps I use wouldn't start. A brew-installed version of PostgreSQL couldn't start because the postgres user had been changed to nobody. For some reason, I couldn't su to nobody or create a new postgres user using dscl.

The data in the database wasn't particularly large, and I also had the scripts to reload it, so I decided to try Heroku's postgres.app install of Postgres.
Installation is simple. Download the app and unzip into the Applications directory. Double click to start, and the postgres icon will appear on the task bar.

In order to use the command-line tools, you will need to set the path to the postgres.app binaries. Add this to your .bash_profile or wherever your paths are set in your dotfiles. (Note: @mweissman suggested this is unnecessary on OSX Lion 10.7 and higher because PostgreSQL and the client tools are already part of the operating system.)
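The line looks something like this (the exact path is an assumption; check where the app's binaries live in your install):

```
export PATH="/Applications/Postgres.app/Contents/MacOS/bin:$PATH"
```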

In the terminal, connect to the postgres using psql and create a database. Connect to the database and add the PostGIS extension.

Now that the PostGIS extension has been added to your database, creating tables is straightforward. In this example, I'm creating a table from a CSV file with point geometry and creating a spatial index on the geometry.
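A sketch of such a session in psql (the database, table, and column names are invented; the PostGIS functions are standard):

```sql
-- create a database and enable PostGIS (9.1+)
CREATE DATABASE gisdb;
\c gisdb

-- load points from a CSV, build geometry, and index it
CREATE TABLE places (name text, lon double precision, lat double precision);
\copy places FROM 'places.csv' WITH (FORMAT csv, HEADER true)
ALTER TABLE places ADD COLUMN geom geometry(Point, 4326);
UPDATE places SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
CREATE INDEX places_geom_idx ON places USING GIST (geom);
```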

And that's it, quick and simple.

Saturday, May 26, 2012

Shaping data and re-thinking web mapping architecture

Like many IT folks who came of age in the '90s, I was trained in the intricacies of relational databases, jamming data into third normal form, creating entity-relationship diagrams, and reveling in joins and views. Boyce-Codd semper fi. I never questioned why we tortured data to conform, just assumed that it was the way it was done; i.e. I get paid because I know how to do it. 

I can't remember how many times I've ranted about dumb users sending an Excel file as a database. After watching Max Schireson's presentation on Reinventing the Database, I am shamed. The gist of Schireson's presentation is that we spent a lot of time strongly and parsimoniously typing data to accommodate the technological constraints of computers when memory and disk were expensive. Obviously, those constraints are no longer valid.
I work with the TreeKit project and they collect their data in this format.

The project collects data about trees and treebeds by measuring along a street. The begin and end coordinates of the street are extracted from a GIS, and volunteers use tape measures to collect the distance between tree beds, their height and width, and tree data. Using this data, we can calculate the coordinates of each treebed. The data is collected on paper and entered into a PostgreSQL database via an online form. The online form lets users see the data as entered, allowing them to make corrections during data entry, but for the sake of simplicity (i.e. no network required) volunteers use paper forms for data collection.
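The coordinate calculation can be sketched simply if we treat the street as a straight segment on a flat plane (the real GeoTools process is more involved; the numbers here are invented):

```python
import math

def point_along(begin, end, distance_m):
    """Interpolate a point distance_m meters from begin toward end.
    Assumes planar coordinates in meters -- a simplification."""
    (x0, y0), (x1, y1) = begin, end
    length = math.hypot(x1 - x0, y1 - y0)
    t = distance_m / length
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

# A 100 m street; a treebed measured 25 m from its start:
print(point_along((0.0, 0.0), (100.0, 0.0), 25.0))  # (25.0, 0.0)
```

With real data, the begin/end points come from the GIS and the distance comes from the volunteers' tape measurements.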

Collecting tree data along a street in the form of a list is a very natural way for volunteers to gather information. A street is the basic entity that contains trees and treebeds, and it constitutes a single record. However, data in this format violates first normal form in that holding the trees along with the street data creates repeating groups within a single record.

In an ideal relational database there would be a street table, a treebed table, and a tree table with one-to-many relationships defined between the tables. However, the data is entered into a table with repeating groups. An external Java process (using GeoTools) calculates the coordinates of trees and treebeds and inserts them into Postgres/PostGIS tables so they can be displayed. So the mapping stack in this case is Postgres with a TileMill frontend for creating maps, which is served through MapBox.

In this architecture, Postgres is grossly underutilized. Sure, it stores both the raw data and the spatial data, but that's all it does. The external Java process that creates the spatial data is legacy code (written in an earlier iteration) that could be written in something else, and in fact it is replicated in javascript in the data entry form. In its simplest formulation, the data can be expressed in JSON.
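A sketch of what a street record might look like in JSON (the field names and values are invented):

```json
{
  "street": "Example St",
  "begin": [-73.950, 40.680],
  "end": [-73.949, 40.681],
  "treebeds": [
    {
      "offset_m": 12.5,
      "length_m": 1.2,
      "width_m": 0.9,
      "tree": {"species": "pin oak", "dbh_cm": 18}
    }
  ]
}
```

The repeating groups that violate first normal form are just a nested array here, which is exactly how the volunteers think about the data.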

The raw data would be further transformed into the trees and treebeds used for drawing maps. The current process calculates a geohash of the centroid of each treebed, which is used as the key between trees and treebeds; this is also useful for other analyses that use the TreeKit data. The tree and treebed data can be stored as GeoJSON, shapefiles, or any other format that TileMill supports. Postgres can be removed from the stack, because it adds only overhead and no advantages.
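For reference, geohashing is a standard algorithm and is easy to carry along without a database. A minimal encoder (the standard base32 scheme):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=9):
    """Standard base32 geohash of a lat/lon pair."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    code = []
    even, ch, bit = True, 0, 0  # even bits bisect longitude, odd bits latitude
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:  # every 5 bits becomes one base32 character
            code.append(BASE32[ch])
            bit, ch = 0, 0
    return "".join(code)

# A well-known test point:
print(geohash(57.64911, 10.40744, 11))  # u4pruydqqvj
```

Hashing each treebed centroid this way yields the same kind of stable string key the current process uses.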

There are situations where Postgres/PostGIS would be advantageous, such as where data changes or where additional processing or sophisticated querying is needed. But storing the data in a simple format that can be easily consumed by most web applications has several advantages over storing it in a database. First, GeoJSON can be consumed directly by many web mapping clients; second, the overhead of maintaining another piece of software is removed; and finally, the data can be transformed into other formats easily. So if someone hands you an Excel or CSV file, try using a simpler format such as GeoJSON and simplify your web mapping stack.

Wednesday, May 16, 2012

ARE2012 Keynote: Serendipity

Slide 1:

Serendipity is another way to say good luck. It's a concept and a belief that something fortuitous occurs from a confluence of factors.

Slide 2:

Waldo Tobler's First Law of Geography states that things near you have more influence than things farther away. This idea has been applied in a number of situations.

Hecht, Brent and Emily Moxley. "Terabytes of Tobler: Evaluating the First Law in a Massive, Domain-Neutral Representation of World Knowledge." COSIT'09: Proceedings of the 9th International Conference on Spatial Information Theory, 2009, pp. 88-105.

Sui, D. "Tobler’s First Law of Geography: A Big Idea for a Small World?" Annals of the Association of American Geographers, 94(2), 2004, pp. 269–277.

Tobler, W. , "On the First Law of Geography: A Reply," Annals of the Association of American Geographers, 94, 2004, pp. 304-310.

Slide 3:

Will Wright recently gave a talk at O'Reilly's Where conference titled "Gaming Reality." One of his points was related to Tobler's First Law of Geography: things that are closest are most likely to be of interest.

Shute, T. Ugotrade. Will Wright, “Gaming Reality,” Where 2012

Slide 4:

Proximity can be measured along many different dimensions: spatial, temporal, social network, and conceptual.

Slide 5:

Developers are building mobile applications based on these ideas. For example, GeoLoqi implements geofencing to notify users of events when they are inside a defined area. Forecast is another application, which broadcasts when and where you will be to your friends, increasing the likelihood that you will meet. Other applications can notify you of events and sales as you pass through an area.

Brownlee, J. This Creepy App Isn’t Just Stalking Women Without Their Knowledge, It’s A Wake-Up Call About Facebook Privacy. Cult of Mac, March 30, 2012.
Huffington, A. GPS for the Soul: A Killer App for Better Living. Huffington Post, 04/16/2012.

Slide 6:

Social media, such as Twitter, have been analyzed to determine states of emotion, which Dan Zarrella has mapped. This is just one example of how data can be used to find a person's proximity to an emotion based on location and time.

Zarrella, D. Using Twitter Data to Map Emotions Geographically. The Social Media Scientist, May 7, 2012.

Slide 7:

Connections between people in the form of social networks or social graphs provide a rich source of data for measuring conceptual phenomena. For example, Klout declares that it is a measure of influence, LinkedIn can be a measure of a person's professional sphere, and Pinterest can reflect the material culture of a person or a group.

Stevenson, S. What Your Klout Score Really Means. Wired, April 24, 2012.

Slide 8:

Will Wright postulated that there are at least 50 different dimensions in which proximity creates a value gradient. The closer something is to a person, the greater its value along that gradient. These gradients can be emotions, communities of interest, school affiliations, or any number of factors that influence a person's behavior and choices. By bringing all these dimensions to bear on a person, it could be possible to build game dynamics that take advantage of physical-world behaviors.

Slide 9:

Measurement of the value gradient is the first step in engineering serendipity. There are a number of ways of quantifying the value gradient, but proximity is often modeled on a network structure. Nodes in the network represent people and possible dimensions of interest and the connections (or links) between nodes can measure the gradient.
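One lightweight way to represent such a network is a dictionary of weighted edges, where the weight stands in for the gradient along a dimension. The names and weights below are invented for illustration:

```python
# Proximity network: nodes are people or dimensions of interest; weighted
# edges encode closeness (a higher weight means a steeper value gradient).
graph = {
    ("alice", "bob"): 0.9,   # social-network proximity
    ("alice", "jazz"): 0.7,  # conceptual proximity to an interest
    ("bob", "jazz"): 0.6,
}

def proximity(graph, a, b):
    """Symmetric edge-weight lookup; 0.0 when no link exists."""
    return graph.get((a, b), graph.get((b, a), 0.0))

def shared_dimensions(graph, a, b, threshold=0.5):
    """Dimensions both nodes connect to above a cutoff -- candidate axes
    along which proximity could create value for the pair."""
    nodes = {n for edge in graph for n in edge}
    return [d for d in nodes - {a, b}
            if proximity(graph, a, d) >= threshold
            and proximity(graph, b, d) >= threshold]
```

Here `shared_dimensions(graph, "alice", "bob")` returns `["jazz"]`: both people sit close to the same interest, making it a plausible dimension for engineered serendipity.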

Slide 10:

Will Wright suggested Central Place Theory as one model for understanding the effects of proximity. It is a classic geographic model, proposed by Walter Christaller, for explaining the hierarchy of places. When applied to influencing serendipity, the concepts of threshold and range are key to using the model to measure the influence of proximity. Threshold is the minimum interaction along a dimension needed to influence a person, whereas range is the maximum distance a person will 'travel' to acquire something.

Dempsey, C. Distance Decay and Its Use in GIS. GIS Lounge, March 15, 2012.
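The threshold-and-range idea reduces to a simple predicate; the numbers below are invented for illustration:

```python
def influences(interaction, distance, threshold, range_limit):
    """Central Place Theory analogy: a dimension influences a person only
    when interaction meets the threshold AND the person is within range."""
    return interaction >= threshold and distance <= range_limit

# Hypothetical community-of-interest dimension: threshold of 3
# interactions per week, a 'range' of 10 km
assert influences(interaction=5, distance=4, threshold=3, range_limit=10)
assert not influences(interaction=2, distance=4, threshold=3, range_limit=10)
assert not influences(interaction=5, distance=12, threshold=3, range_limit=10)
```

A richer model would replace the hard cutoff with a distance-decay function, but the two parameters are the same.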

Slide 11:

There are a number of ways to measure the effects and importance of links. Google's PageRank algorithm is perhaps the most famous: PageRank indicates the importance of a page based on the number and importance of its incoming links. Another form of link analysis, used by the intelligence community, focuses on the transactions between people, organizations, places, and times, as exemplified by Palantir's software.

Holden, C. Osama Bin Laden Letters Analyzed. Analysis Intelligence. May 4, 2012.

Holden, C. From the Bin Laden Letters: Mapping OBL’s Reach into Yemen. Analysis Intelligence. May 11, 2012.

Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
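As a sketch of the PageRank idea (not Google's production algorithm), here is a small power-iteration implementation; the tiny link graph is invented:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank: `links` maps each page to its outgoing links."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = damping * ranks[page] / len(outs)
                for out in outs:
                    new[out] += share
            else:
                # dangling page: spread its rank evenly over all pages
                for p in pages:
                    new[p] += damping * ranks[page] / n
        ranks = new
    return ranks

# Tiny web: every page links to 'hub', so it accrues the highest rank
ranks = pagerank({"a": ["hub"], "b": ["hub"], "hub": ["a"]})
assert max(ranks, key=ranks.get) == "hub"
```

The ranks always sum to 1, so a node's score can be read as the probability that a random surfer lands on it.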

Slide 12:

Ultimately, all these measures of proximity are attempting to answer the question, "If your friend Joey jumped off a bridge, would you jump?" That is, would you jump because everyone is doing it (social influence/contagion), or would you jump because you are similar to Joey (homophily)? A recent paper, Homophily and Contagion are Generally Confounded in Observational Network Studies, poses both the subject and the answer in its title.

Shalizi, C. and A. Thomas. Homophily and Contagion are Generally Confounded in Observational Network Studies. Sociological Methods and Research, vol. 40 (2011), pp. 211-239.

Slide 13:

The comic XKCD manages to summarize the result in a single panel. We don't know.

Munroe, R. Cat Proximity, xkcd.

Slide 14:

The maxim, "All models are wrong, but some are useful," has been a truism in research. The idea that models are not only wrong, but that research can be successful without them, is starting to gain currency in the era of Big Data. Access to very large datasets and the capability to manipulate them inexpensively is changing how research is performed.

Allen, R. Data as seeds of content. O'Reilly Radar. April 5, 2012.

Slide 15:

With large numbers on our side, petabytes or even yottabytes of data can reveal patterns not detectable in sampled data.

Shaw, A. Big Data, Gamification and Obama 2012. OWNI.EU. April 4, 2012.

Slide 16:

Flip Kromer of Infochimps illustrates how a preponderance of data can determine the boundaries of the places called Paris, and which location is meant in a particular context.

Kromer, F. On Being Wrong In Paris: Finding Truth in Wrong Answers. The Infochimps Blog. Dec 1, 2011.

Slide 17:

Third party agents are continuously collecting information about people from social media, social networks, and ecommerce. This provides a wealth of data about people from a third-party perspective. In addition, the quantified self is a concept where individuals document every aspect of their lives in order to optimize their day-to-day interactions.

However, Goodhart's law stipulates that any indicator used to influence a particular behavior will decrease in usefulness as an indicator. In other words, users will game the system, degrading the quality of the information, in order to achieve the objective.

Doctorow, C. Goodhart's Law: Once you measure something, it changes. boingboing.net, April 29, 2010.

Sharwood, S. Social networks breeding spatial junk. The Register. March 6, 2012.

Slide 18:

There is an emerging corollary to the quantified self. Rather than continuous collection of data, there is an alternate source: information selected and shared, but not for the purpose of participating in social networks, i.e., a view into a person's internal life. For example, Amazon collects highlighted phrases from Kindle users, as well as wish lists, which represent material culture.

Carrigan, M. Mass observation, quantified self, and human nature. markcarrigan.net. April 19, 2012.

Currion, P. The Qualified Self. The Unforgiving Minute. November 30, 2011.

Slide 19:

To bring it back to serendipity, perhaps it's time to re-evaluate how we understand the way multiple factors affect an individual's choices. Models based on physical properties such as proximity may lack the nuance necessary to explain a behavior. Simply creating a confluence of events across many possible proximal dimensions may not be enough to explain or to influence. However, a new alternative is possible through big data and the tools of machine learning, using algorithms to describe behavior. We should harness these tools to better understand the factors that affect serendipity, and let go of Newtonian models that reduce the rich interplay of social factors.

Tuesday, May 1, 2012

Where are the gazetteers?

Last time I checked it was 2012, and given all the excitement about open data in the US government, I would have expected the USGS (or anyone) to provide a friendly GNIS-based gazetteer service: I send the service a place name and a state, and it returns a coordinate pair.

Sure, there's the Google Maps API, geonames.org, Nominatim, The National Map, and a whole host of other services, but they require agreeing to an end user license, compile a number of data sources into one, return more than I need or want, aren't open data, aren't open source, and so on.

In less time than I spent searching for a service, I rolled my own gazetteer of sorts: I downloaded the GNIS national file and created a sqlite database.

And here's a quick ruby script to query the database.
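The original scripts aren't reproduced here, but as a rough sketch of the same load-and-query idea (in Python's stdlib sqlite3 rather than Ruby), it comes down to this; the column names (FEATURE_NAME, STATE_ALPHA, PRIM_LAT_DEC, PRIM_LONG_DEC) follow the GNIS national file's pipe-delimited layout:

```python
import csv
import sqlite3

def build_gazetteer(gnis_path, db_path="gazetteer.db"):
    """Load the pipe-delimited GNIS national file into a sqlite table."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS places
                  (name TEXT, state TEXT, lat REAL, lon REAL)""")
    with open(gnis_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="|"):
            if not row["PRIM_LAT_DEC"] or not row["PRIM_LONG_DEC"]:
                continue  # some features lack primary coordinates
            db.execute("INSERT INTO places VALUES (?, ?, ?, ?)",
                       (row["FEATURE_NAME"], row["STATE_ALPHA"],
                        float(row["PRIM_LAT_DEC"]), float(row["PRIM_LONG_DEC"])))
    db.commit()
    return db

def lookup(db, name, state):
    """Place name and state in, (lat, lon) pair out (or None if not found)."""
    return db.execute("SELECT lat, lon FROM places WHERE name = ? AND state = ?",
                      (name, state)).fetchone()
```

An index on (name, state) makes lookups fast enough to serve as a local geocoder.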

Wednesday, April 11, 2012

FOSS4G-NA: Demographic analysis using open source

My FOSS4G-NA talk expands on an earlier blog post on loading SF1 data into PostgreSQL by looking at how to create a population pyramid and how to compare population change between the 2000 and 2010 census. The scripts I used for this presentation are available on github.

Database Preparation

I've updated the original scripts to correct a few errors with regard to character sets and data types. We will load the SF1 files for 2000 and 2010, the Census Tract Relationship files, and the census tracts of interest from TIGER.

The following scripts create the 2010 and 2000 SF1 tables in a PostgreSQL database:
Loading data into the tables is covered in a previous post.

Census geography changes between each decennial census. The Census Bureau provides Census Tract Relationship files that describe these changes. The following SQL scripts create the relationship files as tables.
The data can be downloaded from the Census Bureau and loaded into the tables using the PostgreSQL COPY command, as in the previous example.

The Relationship Between SF1 and TIGER

Summary File 1 (SF1)
"...contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." 
SF1 data for either the 2000 or 2010 census comprises 40-plus files with thousands of cryptically named columns. To understand which file contains the variables of interest, you will need the SF1 Table Matrix. In addition, to understand how to query for the level of geography (block group, tract, county, etc.), you will need the Summary Level Sequence chart.

There are a number of ways to retrieve the logrecno for a given census geography. The first example shows how to retrieve the logrecno from a set of TIGER census tracts loaded into PostGIS. The census tract table is called tract_d3.

You can also get the logrecno from a list of census tracts. This example retrieves the logrecno number from the 2000 census tracts.

This query can be incorporated as subquery for other queries such as building a population pyramid.

Building a population pyramid

The following query builds a population pyramid for females for a group of 2010 census tracts. The table sf1_00017 contains the counts for each age; the query sums the counts in five-year increments, which is how population pyramids are typically constructed. The subquery selects tracts based on logrecno.
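The five-year aggregation the query performs can be sketched in plain Python (SQL does the same with SUM() and integer-division grouping); the single-year counts below are invented:

```python
def five_year_bins(age_counts):
    """Collapse single-year-of-age counts into the 5-year groups
    conventionally used in population pyramids."""
    bins = {}
    for age, count in age_counts.items():
        lo = (age // 5) * 5  # 0-4, 5-9, 10-14, ...
        label = f"{lo}-{lo + 4}"
        bins[label] = bins.get(label, 0) + count
    return bins

# Invented single-year counts for one sex in a tract group
counts = {0: 120, 1: 118, 2: 121, 3: 119, 4: 122, 5: 110, 6: 108}
assert five_year_bins(counts) == {"0-4": 600, "5-9": 218}
```

Run once per sex, the binned counts become the left and right bars of the pyramid.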

A script to retrieve the data for the male portion of the population pyramid is also available.

There are a number of sources for building an online population pyramid. Most of them are based on the d3 JavaScript visualization library. Here are several:
If you use OSX, the Population Pyramid Generator by the Perihelion Group can quickly generate a standard population pyramid.

Demographic Change

Measuring change between decennial censuses is a typical demographic task. Comparison requires standardizing geographic boundaries over time. There are four types of change:
  • no change - geographic boundaries remain the same
  • revision - a small change that does not affect a large portion of the population
  • merge - previous geographic boundaries are joined to form the current boundaries
  • split - current boundaries are created by splitting the previous boundaries
These 2010 census tracts belong to a city council district in San Antonio, Texas. This example exhibits all four types of change:

  • The light blue/whitish lines are the 2010 tract boundaries that haven't changed from the previous census. 
  • The red lines represent the 2000 tract boundaries. The tracts with the red line between them are a merger of 2000 tracts that form a 2010 tract outlined in light blue.
  • The cyan lines represent new 2010 tracts that result from splitting 2000 boundaries. This is noticeable in the bottom-most tract, which shows the cyan line from a split and the old 2000 boundary in red.
  • The scale of the image hides a revision in one of the tracts.

There are several standardization methods for each type of change:

  • merge - locate merges using the first 4 digits of the tract, manually edit the tract, or spatial joins by polygon or centroid
  • split - area allocation and union (assumes that population is equally distributed)
  • all types - use the Census Relationship files, which track the changes in census geography

Using the Census Relationship files, the methodology is:
  • query for tracts without change
  • query for tracts that split
  • query for tracts that have merged
  • combine the results to a base geography
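The final combine step amounts to summing the pieces onto a base geography. As a sketch, here is 2010 population aggregated onto 2000 tract boundaries; the tract numbers and populations are invented:

```python
def pop2010_on_2000_tracts(relationship_rows):
    """Aggregate 2010 population onto 2000 tract boundaries.
    Each row: (tract_2000, tract_2010, pop_2010). No-change rows map 1:1,
    splits contribute several 2010 tracts to one 2000 tract, and a merge
    spreads one 2010 tract's pieces across several 2000 tracts."""
    totals = {}
    for t2000, t2010, pop in relationship_rows:
        totals[t2000] = totals.get(t2000, 0) + pop
    return totals

rows = [
    ("1101", "1101", 4200),     # no change
    ("1205", "1205.01", 1800),  # split: two 2010 tracts ...
    ("1205", "1205.02", 2100),  # ... from one 2000 tract
    ("1301", "1300", 900),      # merge: one 2010 tract's pieces ...
    ("1302", "1300", 1100),     # ... spread across two 2000 tracts
]
assert pop2010_on_2000_tracts(rows) == {
    "1101": 4200, "1205": 3900, "1301": 900, "1302": 1100}
```

In the actual workflow the rows come from the relationship-file queries, and the result joins back to the 2000 tract boundaries in QGIS.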
The following query is for the total population of 2000 census tracts with no change; the list of tracts was selected by matching census tract numbers in QGIS.

A similar script for 2010 census tracts without change is available. The Census Relationship files contain the splits that occurred between each census.  The following query retrieves the 2000 tracts, 2010 tracts, and the 2010 population from the relationship files.

Since there was only one 2010 tract that was merged, it was possible to retrieve the values with a direct query to the table. All the results were merged into a csv file with both the 2000 and 2010 tract numbers and joined to the 2000 tract boundaries using QGIS. Below are the population changes using the 2000 census boundaries.

2000 Population by 2000 tract boundaries
2010 population by 2000 tract boundaries
The dark blue tracts indicate the most populous areas in a quantile classification, and the white tracts represent the least populous. The maps show an overall increase in population in the selected tracts. There are also increases in the smaller tracts, indicating increasing density in those areas.


I've covered some basic techniques for retrieving data for demographic analysis. The sql scripts provide basic building blocks for typical demographic processes such as calculating fertility, mortality, and migration. PostgreSQL and QGIS were used in combination to perform these queries and display the results.

Thursday, April 5, 2012

Whereconf 2012: The Dead Pool and Also-Rans

My talk at Whereconf 2012 was originally about companies in the location business that failed or were acquired for less than the money they raised. My source of information was CrunchBase, an online database of companies, people, and investors that is freely editable. Since it is crowdsourced, the quality of the data can vary, and for the most part the dead pool and acquisition information was not sufficient for analysis. I decided instead to look at acquisitions and IPOs to see how the location industry was doing.


My first task was to gather a list of location-oriented companies. CrunchBase tags companies with terms such as location, geo, or gps. Although I could have used these tags to create a pool of location-oriented companies, the tags varied too much to determine whether a company was a location company, so I used a different method. Each company entry in CrunchBase includes a list of competitors, so I wrote a depth-first search program to gather companies, starting from a seed list of well-known companies such as Foursquare, Skyhook, Loopt, and SimpleGeo. As expected, this process generated many duplicates; with a bit of experimentation, I determined that three iterations generated a sufficient number of companies once duplicates were removed. Running the seed list through the depth-first search yielded 1945 unique companies. A number of those entries lacked basic information such as founding date or money raised; eliminating them reduced the list to 1505 companies with sufficient data for analysis.
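The crawl can be sketched as a bounded depth-first search; `fetch` stands in for the CrunchBase competitor lookup, and the mock graph below is invented:

```python
def crawl_competitors(seeds, fetch, max_depth=3):
    """Depth-first search over a competitor graph. `fetch(company)` returns
    that company's competitor list (in the original, a CrunchBase API call);
    three levels deep was enough to saturate the list in practice."""
    seen = set()

    def visit(company, depth):
        if company in seen or depth > max_depth:
            return
        seen.add(company)
        for competitor in fetch(company):
            visit(competitor, depth + 1)

    for seed in seeds:
        visit(seed, 1)
    return sorted(seen)

# Mock competitor graph standing in for API responses
graph = {"foursquare": ["loopt", "gowalla"], "loopt": ["foursquare"],
         "gowalla": ["brightkite"], "brightkite": []}
companies = crawl_competitors(["foursquare"], lambda c: graph.get(c, []))
assert companies == ["brightkite", "foursquare", "gowalla", "loopt"]
```

The `seen` set handles the duplicates the post mentions, and `max_depth` caps the iterations.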

Initially, I attempted to segment the companies into groups based on descriptive tags using hierarchical clustering, but the tags were too idiosyncratic to yield meaningful clusters. Fortunately, companies can be assigned to a category in CrunchBase, and this is a handy way to segment the data set. The pie chart below shows the different categories.

Location companies by CrunchBase category

Refining the location business segments

I wanted to narrow the data set further by focusing on the companies that comprise the majority of the location industry, so I ran hierarchical clustering again using category. The result was five clusters represented by six categories.

The higher a category or term appears in the plot, the more frequently it occurs in the data set. Terms that are closer to each other are more related; the plot shows that the categories are quite discrete.

A look at the raw counts and percentages shows that roughly 20% of companies are acquired and 5% reach an initial public offering.

Raw counts



A closer look at the companies that had an IPO revealed several in some categories that were clearly not location companies (including major mobile carriers such as Verizon). These were removed and the counts adjusted.

Adjusted counts and percentages

Who's acquiring companies?

I created a word cloud to show which companies are acquiring businesses in the location space. Companies such as Google, Microsoft, and Yahoo are doing most of the acquiring, but the number of acquirers with "Media" in their names stands out.

CrunchBase includes a field for the media source announcing acquisitions. Google, Social Media, and Video are prominent terms.

How much money did these companies raise prior to acquisition?
For the top three categories (web, games, and mobile), I generated descriptive statistics for the amount of money raised. Note that I removed all companies without data for money raised.
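The summary comes down to a few stdlib calls once missing values are dropped; the sample amounts below are invented (the original analysis was done in R):

```python
import statistics

def funding_summary(raised):
    """Descriptive statistics for money raised, skipping missing values."""
    values = [v for v in raised if v is not None]
    return {
        "n": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # what 'most companies' raised
    }

# Invented, skewed sample: one outsized round dominates the total
sample = [1e6, 1e6, 5e6, 1e6, None, 2.5e8]
summary = funding_summary(sample)
assert summary["mode"] == 1e6       # typical raise
assert summary["total"] == 258e6    # headline number, driven by the outlier
```

The gap between the total and the mode is exactly the skew discussed below: a handful of huge rounds inflate the aggregate while most companies raise far less.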

Amount of money raised by web companies 



The results are impressive, with over $9 billion raised; however, they are skewed because they include Facebook, Twitter, Groupon, and AOL, which all had multiple large rounds. The important statistic to look at is the mode, which indicates what most companies raised.

Amount of money raised by game/video companies


The mode, $5 million, is quite a bit larger than for companies in the web category. This is most likely due to web companies having long tails on both ends of the distribution.

Amount of money raised by mobile companies


Of note, the mode is similar to companies in the game/video category.

Closing thoughts 

CrunchBase is essentially a wiki, with all the strengths and weaknesses of crowdsourced data. It is a good place to begin exploring the performance of technology businesses, especially location-based startups. While there are databases for established companies, there are few openly available resources on startups.

Using DFS on business competitors is a good way to generate a list of businesses and startups in the location space, but an extra step is needed to filter out businesses adjacent to, but not in, the location space. Whether this can be done with clever text mining of tags or other ancillary data remains to be seen. Using the predetermined categories to segment the pool of location businesses is a reasonable approach and allows for comparison across other business verticals.

Basic descriptive statistics have been provided, but conclusions can't be drawn from the data presented. However, this provides a first sketch of the location-based startups and businesses.

Data availability

Data and the scripts used to generate the data and analyses are available at: https://github.com/spara/whereconf2012


  1. Word clouds generated at http://www.wordle.net/
  2. crunchbase gem was forked from https://github.com/tylercunnion/crunchbase
  3. R data loading script from http://www.cloudstat.org/index.php?do=/kaichew/blog/crunchbase-api-scrapping-companies-info-from-crunchbase-with-r-packages/
  4. R hierarchical clustering script from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/