Monday, September 10, 2012

Code for America Austin Hackathon: Buses and fun with GTFS

Last Saturday, I participated in the Code for America Austin Hackathon and was impressed by the people who attended. The crowd ranged from very experienced coders to folks who understood a real problem and could articulate both the problem and the policies surrounding it. And it was well attended in other ways: did I mention that the City of Austin CIO and some of his staff were also there to help out? (Wink, wink, nudge, nudge, say no more to @COSAGOV and rising Democratic political star San Antonio Mayor Julian Castro @juliancastro.)

Projects proposed at the hackathon included automating campaign finance forms for the City of Austin, mapping bike routes and amenities, creating a mobile app and API for the Aunt Bertha service, which matches people to services, building an online form for clients of the Homeless RV project, and improving bus transportation for riders.

I fell in with the bus group, whose problem was presented by Glenn Gabois of Movability Austin. The problem we focused on was notifying riders of the location and arrival time of their bus while they wait at their stop. Capital Metro, Austin's transportation agency, does not currently provide a feed of bus locations while buses are in transit. So we divided the problem into three components:
  • a crowd-sourced bus tracking application
  • the server-side infrastructure to provide bus data
  • a mobile application that tells a rider the bus's current stop and how many stops away it is
Crowd-sourcing real-time bus locations sparked an interesting discussion about how a mobile app could be used to track buses. Some of the features of the app included:

  • would not require the user to turn it on when boarding the bus or turn it off after leaving the bus
  • determine the stop where the user boarded the bus
  • disambiguate between buses that use the same stops
The group discussion resulted in a mobile application that runs as a background process on a smartphone and starts sending coordinates when the accelerometer registers a change in speed. Determining which bus a rider boarded was a tougher problem; solutions ranged from QR codes on buses (rejected because they required the user to interact with the application), to RFID (problematic because of equipment costs), to NFC (Near Field Communication is relatively new and not a common feature on smartphones). In the end, the group settled on algorithmically determining the bus and the route from the data sent by the mobile application. Although this was an interesting discussion, the group decided not to address this part of the problem.

In order to use crowd-sourced bus tracking data, the group discussed a service architecture to receive the bus tracking coordinates and compare them against the current transit data, which Capital Metro provides as GTFS (General Transit Feed Specification) and as ESRI shapefiles. We didn't spend much time discussing this part of the solution, and I didn't feel up to the task of setting up a cloud-based instance of PostgreSQL and importing the GTFS data into the database (however, I later proved myself wrong, imported the GTFS data into PostgreSQL, and created a SQL dump of the database for anyone who wants to use it). For a feel-good burst of "we're doing something," I uploaded the shapefile data to Google Fusion Tables to make a quick map. We had five hours to build and demo something at the end of the day.

The group decided to focus on the bus stop application (working title: Bus Time). Naqi Sayed and Kevin Reilly of Hidaya Solutions (@HidayaSolutions) worked on an Android application for determining how many stops away the bus is from a rider's current bus stop. With the clock ticking, the group decided that the application would work off just one trip on a single route. My contribution to the group was to rework the GTFS files into a single workable data set. The GTFS data consisted of the following files:

agency.txt
amenities.txt
calendar.txt
calendar_dates.txt
fare_attributes.txt
fare_rules.txt
routes.txt
shapes.txt
stop_times.txt
stops.txt
trips.txt
Using the GTFS Reference, I was able to work out how to get from a route to a list of stops. In our case, we picked a route and a trip on that route; we then needed to find the stops for that trip. Since stops are sequential, we just needed a sorted list of stops for the application.

To get the list of stops, I went through the GTFS Reference to find the keys that would allow me to associate stops with a specific trip. The joins between the data tables looked like this:


  1. Routes are made up of trips (northbound, southbound, and time of day). Select a route from routes.txt, then select a trip from trips.txt by route_id.
  2. A single trip lists stop times. Select records from stop_times.txt by trip_id.
  3. Stop times are related to stops by stop_id in the stops.txt file. Select the stops using the stop_id.
Using a combination of sed/grep/awk/bash/Ruby scripting, I produced a CSV file containing an ordered list of the stops for the application. It wasn't pretty, but it worked under our time constraint.
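
Expressed as SQL, the joins above look roughly like this (a sketch, assuming the GTFS text files have been loaded into tables of the same names; the route and trip IDs are placeholders):

    -- ordered list of stops for one trip on one route
    SELECT st.stop_sequence, s.stop_id, s.stop_name, s.stop_lat, s.stop_lon
    FROM trips t
    JOIN stop_times st ON st.trip_id = t.trip_id
    JOIN stops s ON s.stop_id = st.stop_id
    WHERE t.route_id = 'ROUTE_ID_HERE'
      AND t.trip_id = 'TRIP_ID_HERE'
    ORDER BY st.stop_sequence;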

Naqi already had the bare bones of an Android app with geolocation, so he incorporated the data into the app. I did not work on the app, but I think it worked this way:
  • the app determined the location of the rider via WiFi location or GPS and chose the closest stop
  • the bus position feed was simulated and was snapped to the closest stop
  • the app compared the rider's stop to the bus's stop; since the stops formed an ordered list, it was a matter of counting the stops between the two
In Lean Startup speak, the minimum viable product (MVP) was this application.


To me, it was amazing that a group of people who had just met could hammer out an application in six hours. Everyone brought something to the table that contributed to the overall goal. I learned lots of new things from other folks and also picked up some technical lessons along the way, such as importing GTFS data into a database using these scripts.

We did what we did because we had to explore a problem, generate solutions, and implement one of them in six hours. Now that I've had a little more time to think about it, I believe we could vastly improve the application by importing the GTFS data into SQLite and using it directly in the application. The application would then have every route, trip, fare, and amenity available. It could then just subscribe to a real-time transit feed without having to deploy an application back end. Additionally, we could revisit the QR code idea: instead of riders running a crowd-sourcing app, the QR code would be on the bus and the reader would be at each stop; as buses load and unload passengers, the QR code reader could send the bus information to the real-time transit feed.
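
As a rough sketch of the SQLite idea (assuming the GTFS stop_times table is bundled in the app and both the bus's and the rider's positions have already been snapped to stops on the same trip), the "stops away" count reduces to a difference in stop_sequence:

    -- stop and trip IDs are placeholders
    SELECT rider.stop_sequence - bus.stop_sequence AS stops_away
    FROM stop_times AS bus
    JOIN stop_times AS rider ON rider.trip_id = bus.trip_id
    WHERE bus.trip_id = 'TRIP_ID_HERE'
      AND bus.stop_id = 'BUS_STOP_ID'
      AND rider.stop_id = 'RIDER_STOP_ID';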

As you can see, the ideas keep flowing, but it's important to remember that creating a mobile application like this would have been impossible without open data in the form of GTFS.

Saturday, August 11, 2012

Notes on getting CartoDB VM working in OSX VirtualBox

I've been wanting to play some more with CartoDB, but I've been traveling and my Internet connection has been provided courtesy of whatever cell tower is close by. A VMware Ubuntu virtual machine is available from aux 2 centimes. Great if your operating system is Windows or Linux, because VMware provides a free player to run the VM. On OS X, VMware Fusion can run the VM, but it's not free.

The free alternative is VirtualBox, but it requires converting the VMware vmdk file to the open OVF format. Before converting the vmdk file, you will need to use your favorite OS X rar archive tool (MacPar, UnRarX) to reassemble the CartoDB VM, which is split into three rar files.

Download the ovftool from VMware and install it. Before converting the vmdk to an ova file, you will need to edit the vmx file so that the path to the vmdk is not hardcoded.

Convert the vmdk to an ova file.

Import the ova into VirtualBox and accept the default settings; they can be changed later.



The next step is to configure the VirtualBox network so that CartoDB is accessible to your host machine. To do this, set up the Host-only Adapter. The Host-only Adapter creates a private network between the host machine and guest machines.



The vboxnet0 Host-only Adapter is configured from the VirtualBox Preferences menu.



Click on the 'Edit' (screwdriver) icon to display the configuration menu. Note that the adapter's address is 192.168.56.1, which is the IP address at which the guest machine reaches the host.



Now that we know the adapter's address, we can configure the network interface of the guest to have a static IP that is addressable from the host machine. Edit the /etc/network/interfaces file as root to set the static address. It should look like this:
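
Something like the following should work, assuming the guest is given 192.168.56.10 (any unused address on the host-only subnet will do) and the host-only adapter shows up as eth0 in the guest; it may be eth1 if a NAT adapter is also attached:

    auto eth0
    iface eth0 inet static
        address 192.168.56.10
        netmask 255.255.255.0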


Reboot the guest to start the network with the assigned static IP.


 
The CartoDB virtual machine is configured with two accounts, a devuser account and an admin account. To access these accounts, two URLs have been configured: http://devuser.localhost.lan (user/password: devuser/devuser) and http://admin.localhost.lan (user/password: admin/admin). 


To access these accounts, configure the /private/etc/hosts file on the host side (OS X) by adding these entries:
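
Assuming the guest was given the static address 192.168.56.10 as in the interfaces example above, the entries would be:

    192.168.56.10   devuser.localhost.lan
    192.168.56.10   admin.localhost.lan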


After updating the hosts file, we can access our local instance of CartoDB through a browser. Note: add 'http://' to the URL so that the browser knows you are looking for the local CartoDB site. The devuser map looks like this:



Happy mapping!

Friday, August 3, 2012

Etymology

A friend in grad school used to go around the office and exhort us to 'go eat' lunch together. Eventually this just became "gweet."


Similarly, an engineer I worked with would build applications using java jars built by another part of our company. I would ask him how his project was going after the jar update and he would invariably say "It's broke." Over time, this elided into 'sproke.


*True story: One Christmas, I bought everyone a computer bag with NPE embroidered on it because everything was 'sproke.

Thursday, August 2, 2012

Visualizing data from a PDF

In my previous post, I scraped the data from a PDF and wrote it to a CSV file. I uploaded the data to CartoDB to make a quick map using the "Share This Map" function on the CartoDB site.



I received a number of comments on the map, mostly about making it more legible by separating out the years and providing more information about each incident. In the spirit of Brian Timoney's post on how web maps are actually used, I decided to keep the map as simple as possible.



After reading the API documentation and looking at the examples, I used the Data Interaction example as the starting point for my map. One of the comments I received was to separate the incidents by year. The interesting part of the example is that it shows how to dynamically change what is presented on a map by changing the values in a SQL query. That's very powerful, because you have the data handling and geoprocessing capabilities of PostgreSQL and PostGIS available in your map client. Here's the map with a control to show incidents by year. (The blog format squeezes the map; see the full map here.)



The script for the map is below.



One of the nice things about the map is that it doesn't use external mapping client libraries. It's simple and shows the user where the incidents occurred. However, I would like to show more information about each incident through a popup. CartoDB plays well with leaflet, modestmaps.js, and wax, and it's easy to extend the basic map with these libraries. In the following map, I used leaflet and the CartoDB popup script to make a popup containing the description of each incident when a marker is clicked. (As with the other map, the full size map can be viewed here.)



The script for the clickable map is below:



The script is essentially the same as the previous one, but it uses leaflet.js to add a base map from MapBox and leaflet-carto.js to add the Major Shootings layer from CartoDB. Switching between years is handled differently from the previous example: the resetLayer function uses the setOptions function to update the query for each year. Note that I cheated a little and put the operator as part of the value, e.g. '= 2005'. It was the most direct solution for handling the 'all' case, which required querying for all the records as part of a SQL WHERE clause. I admit that putting the number of incidents on the buttons is cheesy; if I had more time, I would put a label on top of the map that changed to read "x incidents in xxxx" when a different year is selected.
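
The queries sent to CartoDB end up looking something like this (the table and column names here are illustrative, not the actual ones in my account):

    -- a year button passes the fragment '= 2005'
    SELECT * FROM shootings WHERE incident_year = 2005
    -- the 'all' button passes a fragment that is always true, such as '> 0'
    SELECT * FROM shootings WHERE incident_year > 0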


I like CartoDB because it's flexible and agnostic with regard to client mapping libraries. Building a simple map is easy; however, what I find exciting is that CartoDB puts an API in front of PostgreSQL and PostGIS. This opens up all sorts of possibilities and removes the need to preprocess or reformat data to create a visualization.


The project file is available here.

Tuesday, July 31, 2012

From a PDF to a map

Data is often locked up in pdf reports, and if we're lucky it may be in a structured (HTML) or semi-structured form (e.g. email). Extracting data from pdfs requires translating the data into plain text while preserving the structure as much as possible.

One open source tool for extracting text from PDFs is Apache Tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document, we can see that it has a structure of location, date, and description. Buried in the description is the source, in parentheses.


To extract the text from the PDF to plain text, use Tika. Note that you will need to have Java installed in order to run Tika.


The plain text file also contains the footer from the document. In the case of this document, which has approximately 62 pages, I found it faster to remove the footer text using a text editor such as vim, emacs, or TextMate. If the document were longer than 100 pages, I would try to be more clever and use a Unix utility such as sed or grep to find and remove those lines.

Once the title, footer, and extraneous text are removed, we are left with incidents, each consisting of a location, date, and description, separated by one or more newlines. The newlines act as separators between incidents. The parse_brady.rb script reads the file line by line; if a line is not empty, it adds the line as an element to the ary array. When the line is empty, it processes ary: it joins the description lines into a single line, slices the description elements out of ary and puts the remaining elements into the row array, then adds the single-line description back to row so that row contains only three elements.

The first element of row is the location; this is split into place and city, and the geocode function is called. The geocode function returns the coordinates if successful, or an array with two nil members if not. The coordinates are added to row, and row is written as a tab-delimited string to stdout, where it can be redirected to a file.


Inspecting the resulting file shows that there are missing coordinates as well as inconsistencies in the date formatting. Some of the geocoding failures are due to misspellings (e.g., St.Louis versus St. Louis) or to places, such as counties, that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.


While there are numerous ways to make a map from a tab-delimited file, such as Google Fusion Tables, TileMill, and GeoCommons, I decided to use CartoDB to try out their offering. The process is simple: create an account, upload your data, and style it accordingly.




In a follow-up post, I'll go through the steps to make the map more legible by using TileMill as a Carto style editor, and I'll use the CartoDB API to make the map more interactive.


Part 2: Visualizing from a PDF 

Sunday, July 29, 2012

My Review of PostgreSQL: Up and Running

Originally submitted at O'Reilly

A Practical Guide to the Advanced Open Source Database


Good overview + new features in 9.1-9.2

By spara from San Antonio, TX on 7/29/2012

 

5 out of 5

Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise

Cons: A little too colloquial

Best Uses: Reasonably experienced, Intermediate

Describe Yourself: Data monkey, Occasional devops, Developer

PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable around an RDBMS such as Oracle or SQL Server. This is definitely not a book for someone starting out with PostgreSQL.

For me, the book is a practical reference that begins to answer questions but directs the reader to other resources, hyperlinked in the book, for more detailed information. I'm a long-time user of PostgreSQL 8.4, so the book provides an overview of features that are or will be available in 9.1 and 9.2. For example, the serial data type, which auto-increments integers, is now available; previously, you had to create a sequence object and use it when creating the table. I also learned for the first time about arrays, which have been around since 8.3. Certainly a very handy datatype to have.
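
A quick illustration of both (my own sketch, not an example from the book):

    CREATE TABLE notes (
        id   serial PRIMARY KEY,   -- auto-incrementing integer backed by a sequence
        tags text[]                -- array column
    );
    INSERT INTO notes (tags) VALUES (ARRAY['postgres', 'arrays']);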

A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run SQL scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.

The book also covers standard SQL tables, indexes, and views as implemented in PostgreSQL. Writing your own functions and aggregates is touched on, and examples are provided. Query performance tuning is covered; typical SQL mistakes, such as overusing subqueries (guilty), are discussed, and examples for improving performance are given. Basic replication is also discussed, as well as options for hosting PostgreSQL.

PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, PL/Python for writing functions in Python (among other supported languages), native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.

I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers a lot of ground but goes into sufficient depth to build an understanding, and it provides additional resources for more in-depth treatments of topics. It's a mile wide, deep enough, and a good reference for an experienced user.


Thursday, July 26, 2012

Setting up postgres.app with PostGIS

I should know better, but I'm a sucker for an upgrade. I upgraded to OS X Mountain Lion, and a number of apps I use wouldn't start. A brew-installed version of PostgreSQL couldn't start because the postgres user had been changed to nobody. For some reason, I couldn't su to nobody or create a new postgres user using dscl.

The data in the database wasn't particularly large, and I also had the scripts to reload the data, so I decided to try Heroku's postgres.app install of Postgres.
Installation is simple: download the app and unzip it into the Applications directory. Double-click to start, and the postgres icon will appear on the task bar.

In order to use the command-line tools, you will need to set the path to the postgres.app binaries. Add this to your .bash_profile or wherever your paths are set in your dot files. (Note: @mweissman suggested this is unnecessary on OS X Lion 10.7 and higher because PostgreSQL and the client tools are already part of the operating system.)


In the terminal, connect to the postgres using psql and create a database. Connect to the database and add the PostGIS extension.
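
From a psql session, it looks something like this (the database name is just an example):

    CREATE DATABASE geodata;
    \c geodata
    CREATE EXTENSION postgis;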


Now that the PostGIS extension has been added to your database, creating a table is straightforward. In this example, I'm creating a table from a CSV file with point geometry and creating a spatial index on the geometry.
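
Something along these lines (table and column names are illustrative; the CSV is assumed to have name, longitude, and latitude columns):

    CREATE TABLE places (
        name text,
        lon  double precision,
        lat  double precision
    );
    \copy places FROM 'places.csv' WITH CSV HEADER
    ALTER TABLE places ADD COLUMN geom geometry(Point, 4326);
    UPDATE places SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
    CREATE INDEX places_geom_idx ON places USING GIST (geom);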



And that's it, quick and simple.

Saturday, May 26, 2012

Shaping data and re-thinking web mapping architecture

Like many IT folks who came of age in the '90s, I was trained in the intricacies of relational databases, jamming data into third normal form, creating entity-relationship diagrams, and reveling in joins and views. Boyce-Codd semper fi. I never questioned why we tortured data to conform, just assumed that it was the way it was done; i.e. I get paid because I know how to do it. 

I can't remember how many times I've ranted about dumb users sending an Excel file as a database. After watching Max Schireson's presentation on Reinventing the Database, I am shamed. The gist of Schireson's presentation is that we spent a lot of time strongly and parsimoniously typing data to accommodate the technological constraints of computers when memory and disk were expensive. Obviously, those constraints are no longer valid.
I work with the TreeKit project and they collect their data in this format.

The project collects data about trees and treebeds along a street by measuring along the street. The begin and end coordinates of the street are extracted from a GIS, and volunteers use tape measures to collect the distance between tree beds, the beds' height and width, and tree data. Using this data, we can calculate the coordinates of each treebed. The data is collected on paper and entered into a PostgreSQL database via an online form. The online form lets users see the data as entered, allowing them to make corrections during data entry, but for the sake of simplicity (i.e., no network required) volunteers use paper forms for data collection.

Collecting tree data along a street in the form of a list is a very natural way for volunteers to gather information. A street is the basic entity that contains trees and treebeds, and it constitutes a single record. However, data in this format violates first normal form in that holding the trees along with the street data creates repeating groups within a single record.

In an ideal relational database there would be a street table, a treebed table, and a tree table, with one-to-many relationships defined between the tables. However, the data is entered into a table with repeating groups. An external Java process (using GeoTools) calculates the coordinates of trees and treebeds and inserts them into Postgres/PostGIS tables so they can be displayed. So the mapping stack in this case is Postgres with a TileMill frontend for creating maps, which is served through MapBox.

In this architecture, Postgres is grossly underutilized. Sure, it stores both the raw data and the spatial data, but that's all it does. The external Java process that creates the spatial data is legacy code (written in an earlier iteration); it could be written in something else and, in fact, is already replicated in JavaScript in the data entry form. In its simplest formulation, the data can be expressed in JSON.
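
For example, a single street record with its repeating tree groups might look something like this (field names and values are made up for illustration):

    {
      "street": "Macdougal St",
      "begin": [-74.0006, 40.7295],
      "end":   [-73.9999, 40.7282],
      "trees": [
        { "distance_along": 12.5, "bed_length": 6.0, "bed_width": 4.0, "species": "honeylocust" },
        { "distance_along": 48.0, "bed_length": 5.0, "bed_width": 3.5, "species": "ginkgo" }
      ]
    }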

The raw data would be further transformed into trees and treebeds that are used for drawing maps. The current process calculates a geohash of the centroid of each treebed, which is used as the key between trees and treebeds. This is useful for other analyses that use the TreeKit data. The tree and treebed data can be stored as GeoJSON, shapefiles, or any other format that TileMill supports. Postgres can be removed from the stack, because it only adds overhead and no advantages.

There are situations where Postgres/PostGIS would be advantageous, such as where the data changes or where additional processing or sophisticated querying is needed. But storing the data in a simple format that can be easily consumed by most web applications has several advantages over storing it in a database. First, GeoJSON can be consumed directly by many web mapping clients; second, the overhead of maintaining another piece of software is removed; and finally, the data can be transformed into other formats easily. So if someone hands you an Excel or CSV file, try using a simpler format such as GeoJSON and simplify your web mapping stack.

Wednesday, May 16, 2012

ARE2012 Keynote: Serendipity



Slide 1:


Serendipity is another way to say good luck. It's a concept and belief that something fortuitous occurs from a confluence of factors.


Slide 2:


Waldo Tobler's First Law of Geography states that things near you have more influence than things farther away. This idea has been applied in a number of situations.


Hecht, Brent and Emily Moxley. "Terabytes of Tobler: Evaluating the First Law in a Massive, Domain-Neutral Representation of World Knowledge." COSIT'09: Proceedings of the 9th International Conference on Spatial Information Theory, 2009, pp. 88-105.


Sui, D. "Tobler’s First Law of Geography: A Big Idea for a Small World?" Annals of the Association of American Geographers, 94(2), 2004, pp. 269–277.


Tobler, W. , "On the First Law of Geography: A Reply," Annals of the Association of American Geographers, 94, 2004, pp. 304-310.

Slide 3:


Will Wright recently gave a talk at O'Reilly's Whereconf titled "Gaming Reality." One of his points was related to Tobler's First Law of Geography: things that are closest are most likely to be of interest.


Shute, T. Ugotrade. Will Wright, “Gaming Reality,” Where 2012



Slide 4:


Proximity can be measured along many different dimensions: spatial, temporal, social network, and conceptual.


Slide 5:


Developers are building mobile applications based on these ideas. For example, GeoLoqi implements geofencing to notify users of events when they are inside a defined area. Forecast is another application, which broadcasts when and where you will be to your friends, increasing the likelihood that you will meet. Other applications can notify users of events and sales as they pass through an area.


Brownlee, J. This Creepy App Isn’t Just Stalking Women Without Their Knowledge, It’s A Wake-Up Call About Facebook Privacy. Cult of Mac, March 30, 2012.
Huffington, A. GPS for the Soul: A Killer App for Better Living. Huffington Post, April 16, 2012.

Slide 6:

Social media, such as Twitter, have been analyzed to determine states of emotion and mapped by Dan Zarrella. This is just one example of how data can be used to find a person's proximity to an emotion based on location and time.


Zarrella, D. Using Twitter Data to Map Emotions Geographically. The Social Media Scientist, May 7, 2012.


Slide 7:


Connections between people in the form of social networks or social graphs provide a rich source of data for measuring conceptual phenomena. For example, Klout declares that it is a measure of influence, LinkedIn can be a measure of a person's professional sphere, and Pinterest can reflect the material culture of a person or a group.

Stevenson, S. What Your Klout Score Really Means. Wired, April 24, 2012.


Slide 8:


Will Wright postulated that there are at least 50 different dimensions where proximity creates a value gradient. The closer to a person, the greater the value along the value gradient. These gradients can be emotions, communities of interest, school affiliations, or any number of factors that can influence a person's behavior and choices. By bringing all these dimensions to bear on a person, it could be possible to build game dynamics that take advantage of physical world behaviors.


Slide 9:


Measurement of the value gradient is the first step in engineering serendipity. There are a number of ways of quantifying the value gradient, but proximity is often modeled on a network structure. Nodes in the network represent people and possible dimensions of interest and the connections (or links) between nodes can measure the gradient.


Slide 10:


Will Wright suggested Central Place Theory as one model for understanding the effects of proximity. It is a classic geographic model proposed by Walter Christaller for explaining the hierarchy of places. When applied to influencing serendipity, the concepts of threshold and range are key to using the model to measure the influence of proximity. Threshold is the minimum interaction along a dimension needed to influence a person, whereas range is the maximum distance a person will 'travel' to acquire something.


Dempsey, C. Distance Decay and Its Use in GIS. GIS Lounge, March 15, 2012.


Slide 11:


There are a number of ways to measure the effects and/or the importance of links. Google's PageRank algorithm is perhaps the most famous: PageRank indicates the importance of a page based on the number of incoming links. Another form of link analysis, used by the intelligence community, focuses on the transactions between people, organizations, places, and time, as exemplified by Palantir software.


Holden, C. Osama Bin Laden Letters Analyzed. Analysis Intelligence. May 4, 2012.


Holden, C. From the Bin Laden Letters: Mapping OBL’s Reach into Yemen. Analysis Intelligence. May 11, 2012.


Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.


Slide 12:


Ultimately, all these measures of proximity are attempting to answer this question: "If your friend Joey jumped off a bridge, would you jump?" That is, would you jump off a bridge because everyone is doing it (social influence/contagion), or would you jump because you are similar to Joey (homophily)? A recent paper, Homophily and Contagion are Generally Confounded in Observational Network Studies, posits both the subject and the answer in its title.


Shalizi, C. and A. Thomas. Homophily and Contagion are Generally Confounded in Observational Network Studies. Sociological Methods and Research, vol. 40 (2011), pp. 211-239.


Slide 13:


The comic XKCD manages to summarize the result in a single panel. We don't know.


Munroe, R. Cat Proximity, xkcd.

Slide 14:


The maxim, "Models are wrong, but are useful" has been a truism in research. The idea that models are not only wrong, but that research can be successful without them is starting to gain currency in the era of Big Data. Access to very large datasets and the capability to manipulate them inexpensively is changing how research is performed.

Allen, R. Data as seeds of content. O'Reilly Radar. April 5, 2012.


Slide 15:

With large numbers on our side, petabytes or even yottabytes of data can reveal patterns not possible with sampled data.

Shaw, A. Big Data, Gamification and Obama 2012. OWNI.EU. April 4, 2012.

Slide 16:

Flip Kromer of Infochimps illustrates how a preponderance of data lets us determine the boundaries of the places called Paris, and which of those locations is the one meant in a particular context.

Kromer, F. On Being Wrong In Paris: Finding Truth in Wrong Answers. The Infochimps Blog. Dec 1, 2011.

Slide 17:

Third-party agents are continuously collecting information about people from social media, social networks, and ecommerce. This provides a wealth of data about people from a third-party perspective. In addition, the quantified self is a concept where individuals document every aspect of their lives in order to optimize their day-to-day interactions.

However, Goodhart's law stipulates that any indicator used to influence a particular behavior will decrease in usefulness as an indicator. In other words, users will game the system and degrade the quality of the information in order to achieve the objective.

Doctorow, C. Goodhart's Law: Once you measure something, it changes. boingboing.net. April 29, 2010.

Sharwood, S. Social networks breeding spatial junk. The Register. March 6, 2012.

Slide 18:

There is an emerging corollary to the concept of the quantified self. Rather than a continuous collection of data, there is an alternate source of data that reflects information selected and shared, but not for the purpose of participating in social networks, i.e., a view into a person's internal life. For example, Amazon collects highlighted phrases from Kindle users, as well as wish lists, which represent material culture.

Carrigan, M. Mass observation, quantified self, and human nature. markcarrigan.net. April 19, 2012.


Currion, P. The Qualified Self. The Unforgiving Minute. November 30, 2011.



Slide 19:

To bring it back to serendipity, perhaps it's time to re-evaluate how we understand the way multiple factors affect an individual's choices. Models based on physical properties such as proximity may lack the nuance necessary to explain a behavior. Simply creating a confluence of events within many possible proximal dimensions may not be enough to explain or influence it. However, a new alternative is possible through the use of big data and the tools of machine learning and algorithms to describe behavior. We should harness these tools to better understand the factors that affect serendipity and let go of Newtonian models that reduce the rich interplay of social factors.




Tuesday, May 1, 2012

Where are the gazetteers?

Last time I checked it was 2012, and given all the excitement about open data in the US government, I would have expected the USGS (or anyone) to provide a friendly GNIS-based gazetteer service: I send the service a place name and a state, and it returns a coordinate pair.


Sure, there's the Google Maps API, geonames.org, Nominatim, The National Map, and a whole host of other services, but they require agreeing to an end-user license, compile a number of data sources into one, return more than I need or want, aren't open data, aren't open source, and so on.


In less time than I spent searching for a service, I rolled my own gazetteer of sorts: I downloaded the GNIS national file and created a SQLite database.
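
The national file is pipe-delimited, so after creating a table whose columns match the file's header it loads with sqlite3's .separator and .import commands. A lookup is then a query along these lines (the column names come from the GNIS file; the table name is whatever you choose):

    SELECT prim_lat_dec, prim_long_dec
    FROM gnis
    WHERE feature_name = 'San Antonio'
      AND state_alpha = 'TX'
      AND feature_class = 'Populated Place';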

And here's a quick ruby script to query the database.