Saturday, May 26, 2012

Shaping data and re-thinking web mapping architecture

Like many IT folks who came of age in the '90s, I was trained in the intricacies of relational databases, jamming data into third normal form, creating entity-relationship diagrams, and reveling in joins and views. Boyce-Codd semper fi. I never questioned why we tortured data to conform; I just assumed that was the way it was done, i.e., I get paid because I know how to do it.

I can't remember how many times I've ranted about dumb users sending an Excel file as a database. After watching Max Schireson's presentation on Reinventing the Database, I am shamed. The gist of Schireson's presentation is that we spent a lot of time strongly and parsimoniously typing data to accommodate the technological constraints of computers when memory and disk were expensive. Obviously, those constraints no longer hold.
I work with the TreeKit project, and they collect their data in the format described below.

The project collects data about trees and treebeds along a street by measuring along the street. The begin and end coordinates of the street are extracted from a GIS, and volunteers use tape measures to collect the distance between treebeds, treebed height and width, and tree data. Using this data, we can calculate the coordinates of each treebed. The data is collected on paper and entered into a PostgreSQL database via an online form. The online form displays the data as entered, allowing users to make corrections during data entry, but for the sake of simplicity (i.e., no network required) volunteers use paper forms for data collection.
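Under simplifying assumptions (a straight street segment and planar coordinates), the coordinate calculation reduces to linear interpolation along the street. This is a sketch of the idea with hypothetical function names, not the project's actual GeoTools code:

```python
import math

def point_along(begin, end, distance):
    """Coordinate `distance` units from `begin` along the straight
    segment begin -> end (a planar approximation of the street)."""
    (x0, y0), (x1, y1) = begin, end
    t = distance / math.hypot(x1 - x0, y1 - y0)
    return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

def treebed_spans(begin, end, measurements):
    """`measurements` is a list of (gap_before_bed, bed_length) pairs,
    in the order the volunteers walked the street."""
    spans, offset = [], 0.0
    for gap, bed_length in measurements:
        offset += gap                 # walk to the start of the bed
        start = point_along(begin, end, offset)
        offset += bed_length          # walk past the bed itself
        spans.append((start, point_along(begin, end, offset)))
    return spans
```

The same logic runs in JavaScript in the data entry form, which is part of why the external process is redundant.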

Collecting tree data along a street in the form of a list is a very natural way for volunteers to gather information. A street is the basic entity that contains trees and treebeds, and it constitutes a single record. However, data in this format violates first normal form: holding the trees along with the street data creates repeating groups within a single record.

In an ideal relational database there would be a street table, a treebed table, and a tree table, with one-to-many relationships defined between them. Instead, the data is entered into a table with repeating groups. An external Java process (using GeoTools) calculates the coordinates of trees and treebeds and inserts them into Postgres/PostGIS tables so they can be displayed. The mapping stack in this case is Postgres with a TileMill frontend for creating maps, which are served through MapBox.

In this architecture, Postgres is grossly underutilized. Sure, it stores both the raw data and the spatial data, but that's all it does. The external Java process that creates the spatial data is legacy code (written in an earlier iteration); it could be rewritten in something else, and in fact its logic is already replicated in JavaScript in the data entry form. In its simplest formulation, the data can be expressed as JSON.
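A sketch of what a single street record might look like as JSON; the field names here are hypothetical, not TreeKit's actual schema. The repeating treebed group simply becomes a nested list:

```python
import json

# Hypothetical field names -- a sketch, not TreeKit's actual schema.
street = {
    "street": "Bedford Ave",
    "begin": [-73.9565, 40.7179],   # street endpoints, from the GIS
    "end":   [-73.9571, 40.7144],
    "beds": [                       # the repeating group: one entry per treebed
        {"gap": 10.0, "length": 4.0, "width": 3.0,
         "tree": {"species": "Ginkgo biloba", "dbh": 6}},
        {"gap": 22.5, "length": 5.0, "width": 3.0,
         "tree": None},             # an empty treebed
    ],
}

print(json.dumps(street, indent=2))
```

What the relational model forces into three tables and two joins, JSON holds in one record shaped the way the volunteers actually collected it.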

The raw data would be further transformed into the trees and treebeds used for drawing maps. The current process calculates a geohash of the centroid of each treebed, which is used as the key between trees and treebeds; this is also useful for other analyses that use the TreeKit data. The tree and treebed data can be stored as GeoJSON, shapefiles, or any other format that TileMill supports. Postgres can be removed from the stack, because it adds overhead and no advantages.
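The join key can be sketched in a few lines. This is the standard geohash algorithm (interleaved longitude/latitude bisection, packed into base-32 characters), not TreeKit's specific code, and the centroid here is a simple vertex average, which is adequate for rectangular treebeds:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=9):
    """Standard geohash: interleave longitude/latitude bisection bits,
    then pack every 5 bits into one base-32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True            # a geohash starts with a longitude bit
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

def centroid(points):
    """Vertex average -- fine for the rectangular treebed corners."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```

Hashing each treebed's centroid yields a short string that both the tree records and the treebed records can share as a key, with no database required to enforce it.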

There are situations where Postgres/PostGIS would be advantageous, such as where the data changes frequently or where additional processing or sophisticated querying is needed. But storing the data in a simple format that can be easily consumed by most web applications has several advantages over a database. First, GeoJSON can be consumed directly by many web mapping clients; second, the overhead of maintaining another piece of software is removed; and finally, the data can be transformed into other formats easily. So if someone hands you an Excel or CSV file, try using a simpler format such as GeoJSON and simplify your web mapping stack.
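For instance, each tree could be emitted as a minimal GeoJSON Feature that TileMill or any web mapping client can read directly; the property names are illustrative:

```python
import json

def tree_feature(lon, lat, species, bed_key):
    """A minimal GeoJSON Feature for one tree. `bed_key` stands in for
    the geohash that links the tree to its treebed (names are mine)."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"species": species, "bed": bed_key},
    }

collection = {
    "type": "FeatureCollection",
    "features": [tree_feature(-73.9565, 40.7179, "Ginkgo biloba", "dr5rt7h")],
}
print(json.dumps(collection, indent=2))
```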

Wednesday, May 16, 2012

ARE2012 Keynote: Serendipity



Slide 1:


Serendipity is another way to say good luck: the belief that something fortuitous occurs from a confluence of factors.


Slide 2:


Waldo Tobler's First Law of Geography states that things near you have more influence than things farther away. This idea has been applied in a number of situations.


Hecht, Brent and Emily Moxley. "Terabytes of Tobler: Evaluating the First Law in a Massive, Domain-Neutral Representation of World Knowledge." COSIT'09: Proceedings of the 9th International Conference on Spatial Information Theory, 2009, pp. 88-105.


Sui, D. "Tobler’s First Law of Geography: A Big Idea for a Small World?" Annals of the Association of American Geographers, 94(2), 2004, pp. 269–277.


Tobler, W. , "On the First Law of Geography: A Reply," Annals of the Association of American Geographers, 94, 2004, pp. 304-310.

Slide 3:


Will Wright recently gave a talk at O'Reilly's Whereconf titled "Gaming Reality." One of his points related to Tobler's First Law of Geography: the things that are closest are the most likely to be of interest.


Shute, T. Will Wright, "Gaming Reality." Ugotrade, Where 2012.



Slide 4:


Proximity can be measured along many different dimensions: spatial, temporal, social network, and conceptual.


Slide 5:


Developers are building mobile applications based on these ideas. For example, GeoLoqi implements geofencing to notify users of events when they are inside a defined area. Forecast is another application, one which broadcasts when and where you will be to your friends, increasing the likelihood that you will meet. Other applications can notify users of events and sales as they pass through an area.


Brownlee, J. This Creepy App Isn’t Just Stalking Women Without Their Knowledge, It’s A Wake-Up Call About Facebook Privacy. Cult of Mac, March 30, 2012.
Huffington, A. GPS for the Soul: A Killer App for Better Living. Huffington Post, April 16, 2012.

Slide 6:

Social media, such as Twitter, have been analyzed to determine states of emotion, which Dan Zarrella has mapped. This is just one example of how data can be used to find a person's proximity to an emotion based on location and time.


Zarrella, D. Using Twitter Data to Map Emotions Geographically. The Social Media Scientist, May 7, 2012.


Slide 7:


Connections between people in the form of social networks or social graphs provide a rich source of data for measuring conceptual phenomena. For example, Klout declares itself a measure of influence, LinkedIn can be a measure of a person's professional sphere, Twitter can capture a person's interests and conversations, while Pinterest can reflect the material culture of a person or a group.

Stevenson, S. What Your Klout Score Really Means. Wired, April 24, 2012.


Slide 8:


Will Wright postulated that there are at least 50 different dimensions in which proximity creates a value gradient: the closer something is to a person, the greater its value along that gradient. These gradients can be emotions, communities of interest, school affiliations, or any number of factors that influence a person's behavior and choices. By bringing all these dimensions to bear on a person, it could be possible to build game dynamics that take advantage of physical-world behaviors.


Slide 9:


Measuring the value gradient is the first step in engineering serendipity. There are a number of ways to quantify it, but proximity is often modeled on a network structure: nodes represent people and possible dimensions of interest, and the connections (or links) between nodes measure the gradient.
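A toy illustration of that structure, with made-up nodes and weights (people and an interest mixed in one graph, higher weight meaning nearer):

```python
# Toy proximity network: nodes are people or dimensions of interest,
# and link weights encode closeness (higher = nearer). Values are made up.
graph = {
    "alice": {"bob": 0.9, "jazz": 0.4},
    "bob":   {"alice": 0.9, "jazz": 0.7, "carol": 0.2},
    "carol": {"bob": 0.2},
    "jazz":  {"alice": 0.4, "bob": 0.7},
}

def gradient(graph, node):
    """Everything linked to `node`, nearest (strongest link) first."""
    return sorted(graph[node].items(), key=lambda kv: kv[1], reverse=True)

print(gradient(graph, "bob"))
```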


Slide 10:


Will Wright suggested Central Place Theory as one model for understanding the effects of proximity. It is a classic geographic model, proposed by Walter Christaller, for explaining the hierarchy of places. When applied to influencing serendipity, the concepts of threshold and range are key to measuring the influence of proximity: threshold is the minimum interaction along a dimension needed to influence a person, whereas range is the maximum distance a person will 'travel' to acquire something.
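A minimal sketch of how threshold and range might bound a distance-decay influence function; the inverse-square decay and the default parameters are my assumptions for illustration, not Christaller's model verbatim:

```python
def influence(weight, distance, threshold=0.1, rng=10.0):
    """Distance-decayed influence bounded by Christaller's two concepts:
    beyond `rng` a person won't 'travel' at all, and anything below
    `threshold` is too weak to change behavior."""
    if distance > rng:
        return 0.0                       # outside the range: no interaction
    value = weight / (1.0 + distance) ** 2
    return value if value >= threshold else 0.0
```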


Dempsey, C. Distance Decay and Its Use in GIS. GIS Lounge, March 15, 2012.


Slide 11:


There are a number of ways to measure the effects or the importance of links. Google's PageRank algorithm is perhaps the most famous: PageRank indicates the importance of a page based on the number and importance of its incoming links. Another form of link analysis, used by the intelligence community, focuses on the transactions between people, organizations, places, and time, as exemplified by Palantir's software.
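PageRank itself fits in a short power-iteration sketch; this is a simplified version of the algorithm described in the Page et al. report, not Google's production implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                 # dangling page: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:                        # otherwise split rank among outlinks
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

In this toy graph "c" collects links from both "a" and "b", so it ends up with the highest rank, which is exactly the importance-from-incoming-links idea.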


Holden, C. Osama Bin Laden Letters Analyzed. Analysis Intelligence. May 4, 2012.


Holden, C. From the Bin Laden Letters: Mapping OBL’s Reach into Yemen. Analysis Intelligence. May 11, 2012.


Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, 1999.


Slide 12:


Ultimately, all these measures of proximity are attempting to answer one question: "If your friend Joey jumped off a bridge, would you jump?" That is, would you jump off a bridge because everyone is doing it (social influence/contagion), or would you jump because you are similar to Joey (homophily)? A recent paper, "Homophily and Contagion are Generally Confounded in Observational Network Studies," posits both the subject and the answer in its title.


Shalizi, C. and A. Thomas. "Homophily and Contagion are Generally Confounded in Observational Network Studies." Sociological Methods and Research, vol. 40 (2011), pp. 211-239.


Slide 13:


The webcomic xkcd manages to summarize the result in a single panel: we don't know.


Munroe, R. Cat Proximity, xkcd.

Slide 14:


The maxim "all models are wrong, but some are useful" has long been a truism in research. The idea that models are not only wrong, but that research can be successful without them, is starting to gain currency in the era of Big Data. Access to very large datasets, and the capability to manipulate them inexpensively, is changing how research is performed.

Allen, R. Data as seeds of content. O'Reilly Radar. April 5, 2012.


Slide 15:

With large numbers on our side, petabytes or even yottabytes of data can reveal patterns that sampled data cannot.

Shaw, A. Big Data, Gamification and Obama 2012. OWNI.EU. April 4, 2012.

Slide 16:

Flip Kromer of Infochimps illustrates how a preponderance of data can settle both the boundaries of the many places called Paris and which one is meant in a particular context.

Kromer, F. On Being Wrong In Paris: Finding Truth in Wrong Answers. The Infochimps Blog. Dec 1, 2011.

Slide 17:

Third-party agents are continuously collecting information about people from social media, social networks, and e-commerce, providing a wealth of data about people from a third-party perspective. In addition, the quantified self is the practice of individuals documenting every aspect of their lives in order to optimize their day-to-day interactions.

However, Goodhart's law stipulates that any indicator used to influence a particular behavior will decrease in usefulness as an indicator. In other words, users will game the system, degrading the quality of the information, in order to achieve the objective.

Doctorow, C. Goodhart's Law: Once you measure something, it changes. boingboing.net, April 29, 2010.

Sharwood, S. Social networks breeding spatial junk. The Register. March 6, 2012.

Slide 18:

There is an emerging corollary to the quantified self. Rather than continuous collection of data, there is an alternate source: information selected and shared, but not for the purpose of participating in social networks; that is, a view into a person's internal life. For example, Amazon collects highlighted phrases from Kindle users, as well as wish lists, which represent material culture.

Carrigan, M. Mass observation, quantified self, and human nature. markcarrigan.net. April 19, 2012.


Currion, P. The Qualified Self. The Unforgiving Minute, November 30, 2011.



Slide 19:

To bring it back to serendipity, perhaps it's time to re-evaluate how we understand the way multiple factors affect an individual's choices. Models based on physical properties such as proximity may lack the nuance necessary to explain a behavior, and simply creating a confluence of events within many possible proximal dimensions may not be enough to explain or influence it. However, an alternative is now possible: using big data and the tools of machine learning to describe behavior. We should harness these tools to better understand the factors that affect serendipity, and let go of Newtonian models that reduce the rich interplay of social factors.




Tuesday, May 1, 2012

Where are the gazetteers?

Last time I checked it was 2012, and given all the excitement about open data in the US government, I would have expected the USGS (or anyone) to provide a friendly GNIS-based gazetteer service: I send the service a place name and a state, and it returns a coordinate pair.


Sure, there's the Google Maps API, geonames.org, Nominatim, The National Map, and a whole host of other services, but they require agreeing to an end user license, compile a number of data sources into one, return more than I need or want, aren't open data, aren't open source, and on and on.


In less time than I spent searching for a service, I rolled my own gazetteer of sorts: I downloaded the GNIS national file and created a SQLite database.

And here's a quick ruby script to query the database.
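That Ruby script isn't reproduced here, so here is a self-contained Python sketch of both steps instead. The sample rows are made up, and the GNIS column names (taken from the GNIS file-format documentation) should be verified against the header row of the file you actually download:

```python
import csv
import sqlite3

def load_gnis(lines, db=None):
    """Load the pipe-delimited GNIS national file into a SQLite table,
    keeping only the columns a gazetteer needs."""
    db = db or sqlite3.connect(":memory:")
    db.execute("CREATE TABLE gnis (name TEXT, state TEXT, lat REAL, lon REAL)")
    rows = csv.DictReader(lines, delimiter="|")
    db.executemany(
        "INSERT INTO gnis VALUES (?, ?, ?, ?)",
        ((r["FEATURE_NAME"], r["STATE_ALPHA"],
          float(r["PRIM_LAT_DEC"]), float(r["PRIM_LONG_DEC"])) for r in rows))
    db.execute("CREATE INDEX gnis_name_state ON gnis (name, state)")
    db.commit()
    return db

def lookup(db, name, state):
    """Gazetteer query: place name + state -> (lat, lon), or None."""
    return db.execute(
        "SELECT lat, lon FROM gnis WHERE name = ? AND state = ?",
        (name, state)).fetchone()

# Made-up sample rows; in practice: db = load_gnis(open("NationalFile.txt"))
sample = [
    "FEATURE_NAME|STATE_ALPHA|PRIM_LAT_DEC|PRIM_LONG_DEC",
    "Springfield|IL|39.8017|-89.6437",
    "Springfield|MA|42.1015|-72.5898",
]
db = load_gnis(sample)
print(lookup(db, "Springfield", "IL"))
```

Send it a place name and a state, get back a coordinate pair: the whole friendly gazetteer service, minus the service.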