Thursday, March 21, 2013
Using the same technique I used for importing 2010 Census headers, I imported each line/record as a single field into a temporary table. Using the SQL substring function, I extracted the data into their respective fields. Information about the file structure and fields are available in The Instruction Booklet on the download page.
Below is the script for importing the data.
When all is said and done, you will have a table containing California tax exempt organizations. The next step may seem a little backward, but I exported the data back to a tab delimited text file.
This may seem a step backward, but until there is a built in Postgres geocoder, handling text files is simpler and faster than write code that extracts data for geocoding using an external service.
Saturday, February 2, 2013
Install vagrant from http://www.vagrantup.com/. Create a directory to hold the vagrantfile, then run the following:
Go to to http://localhost:8080 to start using the DSTK.
Tuesday, November 13, 2012
I've assisted on several disaster relief efforts since Hurricane Katrina and they have all needed a way to communicate where resources are and where they are needed. In military jargon, this is called a common operating picture or a COP. The most common interface for a COP is a map because it is easy to reference, although there is a case to made for a COP that also shows resources and needs by priority.
Google Maps has made mapping accessible to the public. I first saw this during Katrina when people used Google Earth to link photos to locations in KMLs so that Katrina refugees could see flooded and damaged areas. I've also seen the availability of other mapping platforms and services such as Ushahidi and CartoDB increase, but it seems most people turn to Google Maps/Earth to create maps and data. While Google Earth has lowered the barrier to creating maps, it still requires a modicum of training in order to keep the map updated with the latest information.
I responded a request to help out with Occupy Sandy Staten Island with their map of resources. They had a map but things were rapidly changing and they wanted a volunteer to be able to update the map. The current map was in Google Maps, so I downloaded a KML of the map as a starting point.
Consistent Look and Feel
Although there are many mapping platforms available, I decided to stick with Google infrastructure because Occupy Sandy Staten Island were already using it. To make the KML editable, I imported it into Google Fusion Tables which brought in the data but not the styling. I wanted the icons and style for each zone (polygonal area) to be consistent with the existing maps. That means that drop off points, shelters, and operations centers would always have the same symbology without the user having to pick and choose an icon or style every time a new point was added to the map.
Google Fusion Tables supports setting styles based on a column value. Google provides a tutorial showing how to merge tables to apply map styles by column. The tutorial shows how to create a columns in both the styles table and the data table that act as a key. In this case, I created a column called type and populated them descriptive keys such as dropoff, shelter, etc. Although the tutorial shows how to set icons, map styles for polygons and lines can also be styled by column values. Merging style table to the data table results in a data table, row updates (new entries) to either the styles or data table will be displayed in the merge table. In this case, volunteers can just enter the type of point and it will be displayed with the correct symbology.
One of the primary functions of a COP is to locate things. Fusion table supports several types of geometry encoding ranging from KML geometry to street addresses. Street address are automatically converted into geometry. Even a partial street address without city, state, or zipcode are intelligently gecoded based on the bounding box of all the features.
Adding lines or polygonal data requires digitizing and is a bit beyond the scope. I did add one polygonal area using QGIS and OpenStreet Map as a background. I originally exported the polygon as a KML, but for some reason QGIS outputs KML geometry in Google's World Mercator instead of WGS84 latitude/longitude coordinates in decimal degrees which is what Fusion Tables wants in the way of geography. The workaround was to export as GML and tweak the tags appropriately.
I wanted to have a spreadsheet interface for updating the map. Why a spreadsheet and not a form or just use Fusion Tables tables directly? Most people have used a spreadsheet so the learning curve is very small. In my experience forms get in the way of data entry, having to submit an entry is just a waste of time. Fusion Tables looks like a spreadsheet but it doesn't behave like one, so that adds to the learning curve.
John McGrath wrote a script to update Fusion Tables from a spreadsheet; it's available on github. The script works but has shortcomings such as deleting all the entries in the Fusion Table before updating it. That means that you must keep and select all the previous entries in the spreadsheet instead of just updating the new entries.
This was a quick hack done over a several hours. It's certainly far from perfect but it is functional and accomplishes the basic goal of creating a consistant map that can be updated by volunteers. Here are some takeaways:
- use existing infrastructure; try to avoid having to install, configure or maintain software
- use existing code; if you need a function, it's probably in github already
- make data input simple; as a volunteer translator for Project 4636, I found forms getting in the way of entering information, so I used a spreadsheet
- make data input convenient; for example, Fusion Tables can geocode an address automatically
Monday, September 10, 2012
Projects proposed at the hackathon ranged from automating campaign finance forms for the City of Austin, mapping bike routes and amenities, creating a mobile app and API for the Aunt Bertha service which matches people to services, an online form for clients of the Homeless RV project, and improving bus transportation for riders.
I fell in with the bus group and the problem was presented by Glenn Gabois of Movability Austin. The problem we focused on was notifying riders the location and time of arrival of their bus while they wait at their stop. Capital Metro, Austin's transportation agency, currently does not provide a feed for their buses while they are in transit. So we divide the problem into three componentes:
- a crowd sourced bus tracking application
- the server side infrastructure to provide bus data
- a mobile application that tells rider a bus' current stop and the number of stops away
- would not require user to turn while on the bus or turn off when off the bus
- determine which stop where the user boarded the bus
- disambiguate between buses that used the same stops
agency.txtUsing the GTFS Reference, I was able to decode how to get from a route to a list of stops. In our case, we picked a route and a trip on that routed, we then needed to find the stops for that trip. Since stops are sequential we just needed a sorted list of stops of the application.
To get the list of stops, I went through the GTFS Reference to find the keys that would allow me to associated stops with a specific trip. The joins between the data tables looked like this.
- Routes are comprised of trips (northbound, southbound, and time). Select a route from routes.txt, then select a trip by the route_id.
- A single trip list stop_times. Select stop_times by trip_id.
- Stop_times are related to stops by stop_id in the stops.txt file. Select the the stops using the stop_id.
- the app determined the location of the rider via WiFi location or GPS and choose the closest stop
- the bus position feed was simulated and was snapped to the closest stop
- the app compared the rider's stop to the bus' stop, since it was an ordered list it was matter of counting the between the two stops
Saturday, August 11, 2012
The vboxnet0 Host-only Adapter is configured from the VirtualBox Preferences menu.
Click on the 'Edit' or screwdriver icon to display the configuration menu. Note that the adapter's address is 192.168.56.1, which is the IP address that the guest machine accesses the host.
Now that we know the adapter's address, we can configure the network interface of the guest to have a static IP that is addressable from the host machine. Edit the /etc/network/interfaces file as root to set the static address. It should look like this:
Reboot the guest to start the network with the assigned static IP.
The CartoDB virtual machine is configured with two accounts, a devuser account and an admin account. To access these accounts, two URLs have been configured: http://devuser.localhost.lan (user/password: devuser/devuser) and http://admin.localhost.lan (user/password: admin/admin).
To access these accounts configure the /private/etc/hosts file on the host side (OSX) by adding these entries:
After updating the hosts file, we can access our local instance of CartoDB through a browser. Note: add 'http://' to the URL so that the browser knows you are looking for the local CartoDB site. The devuser map looks like this:
Friday, August 3, 2012
Similarly, an engineer I worked with would build applications using java jars built by another part of our company. I would ask him how his project was going after the jar update and he would invariably say "It's broke." Over time, this elided into 'sproke.
*True story: One Christmas, I bought everyone a computer bag with NPE embroidered on it because everything was 'sproke.
Thursday, August 2, 2012
I received a number of comments on the map, mostly about making it more legible by separating out the years and providing more information about each incident. In the spirit of Brian Timoney's post on how web maps are actually used, I decided to keep the map as simple as possible.
After reading the API documentation and looking at the examples, I used the Data Interaction example as the starting point for my map. One of the comments I received was to separate the incidents by year. The interesting part of the example is that it shows how to dynamically change what is presented on a map by changing the values in a SQL query. That's very powerful because you have the data handling and geoprocessing capabilities of PostgreSQL and PostGIS available in your map client. Here's the map with a control to show incidents by year. (The blog form squeezes the map, the see the full map here.)
The script for the map is below.
One of the nice things about the map is that it doesn't use external mapping client libraries. It's simple and shows the user where the incidents occurred. However, I would like to show more information about each incident through a popup. CartoDB plays well with leaflet, modestmaps.js, and wax, and it's easy to extend the basic map with these libraries. In the following map, I used leaflet and the CartoDB popup script to make a popup containing the description of each incident when a marker is clicked. (As with the other map, the full size map can be viewed here.)
The script for the clickable map is below:
The script essentially the same as the previous script but it uses leaflet.js to add a base map from MapBox and the leaflet-carto.js to add the Major Shootings layer from CartoDB. Switching between years is handled differently from the previous example. The resetLayer function uses the setOptions function to update the query for each year. Note that I cheated a little and put the operator as part of the value, e.g. '= 2005". It was the most direct solution to handling the 'all' case in the SQL statement, which required that query for all the records as part of a SQL WHERE clause. I admit that putting the number of incidents on the buttons is cheesy; and if I had more time, I would put a label that would change on top of the map that read "x incidents in xxxx" when a different year is selected..
I like CartoDB because it's flexible and agnostic with regards to client mapping libraries. Building a simple mapHowever, what I find exciting is that CartoDB puts an API in front of PostgreSQL and PostGIS. This opens up all sorts of possibilities and removes the need to preprocess or reformat data to create a visualization.
The project file is available here.
Tuesday, July 31, 2012
One open source tool for extracting text from pdfs is the Apache Project tika. Tika does a reasonably good job of extracting text and preserving the structure. For this example, we'll parse the Brady Campaign's Mass Shootings in the United States Since 2005 document. Looking at the document we can see that it has a structure of location, date and description. Buried in the description is the source in parentheses.
To extract the text from the pdf to plain text use tika. Note that you will need to have java installed in order to run tika
The plain text file also contains the footer from the document. In the case of this document which has approximately 62 pages, I find it faster to remove the footer text using a text editor such as vim, emacs, or textmate. If the document was longer than 100 pages, then I would try to be more clever and use a unix utility such as sed or grep to find and remove these lines.
Once the title, footer, and extraneous text are removed separated by one to several newlines. So we have incidents, that include location, date and description, separated by one or several newlines. The new lines act as separators between incidents. The parse_brady.rb script reads the file line-by-line, if the line is not empty then it adds the line as an element to the ary array. When the line is empty, it processes ary by joining all the lines from description into a single line, slices the description array elements form ary and puts them into the row array, then add the single line description back to row, so that it only contains three elements.
The first element of row is the location, this split into place and city and the geocode function is called. The geocode function returns either the coordinates if successful or an array with two nil members. The coordinates are added to row. The row is written as a tab delimited string to stdout where it can be redirected to a file.
Inspecting the resulting file shows that there are coordinates missing as well as inconsistencies in the date formatting. Some of the geocodes failures are due to misspellings (St.Louis versus St.Louis) or to places such as counties that are not in the GNIS database. I manually cleaned up the misspellings and dates and ran the parser again, then added the county coordinates from Wikipedia.
While there are numerous ways to make a map from a tab delimited file, such as Google Fusion tables, TileMill and GeoCommons; I decided to use CartoDB to try out their offering. The process is simple, create an account, upload your data, and style it accordingly.
In a follow up post, I'll go through the steps to make the map more legible using TileMill as a Carto style editor and by using the CartoDB API to make the map more interactive.
Part 2: Visualizing from a PDF
Sunday, July 29, 2012
A Practical Guide to the Advanced Open Source Database
Good overview + new features in 9.1-9.2
Pros: Well-written, Helpful examples, Accurate, Easy to understand, Concise
Cons: A little too colloquial
Best Uses: Reasonably experienced, Intermediate
Describe Yourself: Data monkey, Occasional devops, Developer
PostgreSQL: Up and Running is a book for experienced PostgreSQL users and people comfortable around RDBMS such as Oracle or SQLSever. This is definitely not a book for someone starting out with PostgreSQL.
For me, the book is a practical reference that begins to answer questions but directs the reader for more detailed information from other resources that are hyperlinked in the book. I'm a long time user of PostgreSQL 8.4, so the book provides an overview of features that are/will be available in 9.1 and 9.2. For example, the serial data type auto-increments integers is now available. Previously, you had to create a sequence object and use it when creating the table. I also learned about arrays for the first time which have been around since 8.3. Certainly a very handy datatype to have.
A nice feature in 9.1 is the ability to load new extensions with the CREATE EXTENSION command. I use PostGIS frequently, and previously you would have to run sql scripts to create a template database and use that to create a spatial database. With CREATE EXTENSION, you can add PostGIS data types, objects, and functions to an existing database.
The book also covers the standard SQL tables, indexes and views as implemented in PostgreSQL. Writing your own functions and aggregates are touched on and examples are provide. Query performance tuning is covered and typical sql errors, such as overusing subqueries (guilty), are discussed and examples for improving performance are given. Basic replication is also discussed as well options for hosting PostgreSQL.
PostgreSQL: Up and Running also gives a sneak preview of new features such as Foreign Data Wrappers for working with data from web APIs, plPython for writing functions in Python or in other supported languages, native JSON output, and the list goes on. The book shows that PostgreSQL is flexible and extensible.
I enjoyed reading the book. It was easy to understand, had good examples, and I learned about new capabilities. The book covers much ground, but goes into sufficient depth to gain an understanding, and it provides additional resources for more in depth treatments of topics. It's a mile wide, deep enough and a good reference for an experienced user.
Thursday, July 26, 2012
The data on the database wasn't particularly large and I also had the scripts to reload the data so I decided to try Heroku's postgres.app install of postgres.
Installation is simple. Download the app and unzip into the Applications directory. Double click to start, and the postgres icon will appear on the task bar.
In order to use the commandline tools, you will need to set the path to the postgres.app binaries. Add this to your .bash_profile or wherever your paths are set in you .dot files. (note: @mweissman suggested this is unnecessary on OSX Lion 10.7 and higher because PostgreSQL and the client tools are already part of the operating system.)
In the terminal, connect to the postgres using psql and create a database. Connect to the database and add the PostGIS extension.
Now that the PostGIS extension has been added to your database, creating table is straight forward. In this example, I'm creating a table from a CSV file with point geometry and creating a spatial index on the geometry.
And that's it, quick and simple.