Wednesday, April 11, 2012

FOSS4G-NA: Demographic analysis using open source

My FOSS4G-NA talk expands on an earlier blog post on loading SF1 data into PostgreSQL by looking at how to create a population pyramid and how to compare population change between the 2000 and 2010 censuses. The scripts I used for this presentation are available on GitHub.


Database Preparation


I've updated the original scripts to correct a few errors regarding character sets and data types. We will load the SF1 files for 2000 and 2010, the Census Tract Relationship files, and the census tracts of interest from TIGER.


The following scripts create the 2010 and 2000 SF1 tables in a PostgreSQL database.
Loading data into the tables is covered in a previous post.

Census geography changes with each decennial census. The Census Bureau provides Census Tract Relationship files that describe these changes. The following SQL scripts create these files as tables.
The data can be downloaded from the Census Bureau and loaded into the tables using the PostgreSQL COPY command, as in the previous example.


The Relationship Between SF1 and TIGER


Summary File 1 (SF1)
"...contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." 
SF1 data for either the 2000 or 2010 census comprises more than 40 files and thousands of cryptically named columns. To understand which file contains the variables of interest, you will need the SF1 Table Matrix. In addition, to understand how to query for a level of geography (block group, tract, county, etc.), you will need the Summary Level Sequence chart.


There are a number of ways to retrieve the logrecno for a given census geography. The first example shows how to retrieve the logrecno from a set of TIGER census tracts loaded into PostGIS; the census tract table is called tract_d3.




You can also get the logrecno from a list of census tract numbers. This example retrieves the logrecno for the 2000 census tracts.
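The idea can be sketched in Python, with sqlite3 standing in for PostgreSQL; the table and column names below are simplified stand-ins for the SF1 geographic header (in SF1, summary level 140 identifies census tract records):

```python
import sqlite3

# Toy stand-in for the SF1 geographic header table; the real header
# is a fixed-width file with many more fields.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sf1_geoheader (
        logrecno TEXT, sumlev TEXT, county TEXT, tract TEXT)""")
conn.executemany("INSERT INTO sf1_geoheader VALUES (?,?,?,?)", [
    ("0001234", "140", "029", "110100"),
    ("0001235", "140", "029", "110300"),
    ("0001236", "150", "029", "110100"),  # block-group record, excluded
])

# Summary level 140 restricts the match to tract-level records; the
# tract numbers would come from your tracts of interest.
cur = conn.execute("""
    SELECT logrecno FROM sf1_geoheader
    WHERE sumlev = '140' AND tract IN (?, ?)
    ORDER BY logrecno""", ("110100", "110300"))
logrecnos = [row[0] for row in cur]
```

The resulting logrecno values are the keys that join the geographic header to every SF1 data file.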




This query can be incorporated as a subquery in other queries, such as building a population pyramid.


Building a population pyramid


The following query builds a population pyramid for females for a group of 2010 census tracts. The table sf1_00017 contains the counts for each single year of age; the query sums the counts in five-year increments, which is how population pyramids are typically constructed. The subquery selects tracts based on logrecno.
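The binning can be sketched as follows, with sqlite3 standing in for PostgreSQL and an illustrative single-year-of-age table in place of the actual sf1_00017 layout:

```python
import sqlite3

# Illustrative stand-in for sf1_00017: female counts by single year
# of age, keyed by logrecno (column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sf1_00017 (logrecno TEXT, age INTEGER, females INTEGER)")
conn.executemany("INSERT INTO sf1_00017 VALUES (?,?,?)",
                 [("0001234", a, 10 + a % 3) for a in range(20)])

# Integer division groups single years of age into 0-4, 5-9, ... bins;
# in practice the WHERE clause would be the logrecno subquery.
cur = conn.execute("""
    SELECT (age / 5) * 5 AS bin_start, SUM(females)
    FROM sf1_00017
    WHERE logrecno IN ('0001234')
    GROUP BY bin_start
    ORDER BY bin_start""")
pyramid = dict(cur.fetchall())
```

Each key is the starting age of a five-year cohort, ready to plot as one side of the pyramid.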

A script to retrieve the data for the male portion of the population pyramid is also available.


There are a number of sources for building an online population pyramid. Most of them are based on the D3 JavaScript visualization library. Here are several:
If you use OSX, the Population Pyramid Generator by the Perihelion Group can quickly generate a standard population pyramid.



Demographic Change


Measuring change between decennial censuses is a typical demographic task. Comparison requires standardizing geographic boundaries that change over time. There are four types of change:
  • no change - geographic boundaries remain the same
  • revision - a small change that does not affect a large portion of the population
  • merge - previous geographic boundaries are joined to form the current boundaries
  • split - current boundaries are created by splitting the previous boundaries
These 2010 census tracts belong to a city council district in San Antonio, Texas. This example exhibits all four types of change:



  • The light blue/whitish lines are the 2010 tract boundaries that haven't changed from the previous census. 
  • The red lines represent the 2000 tract boundaries. The tracts with a red line between them are a merger of 2000 tracts that forms a 2010 tract outlined in light blue.
  • The cyan lines represent new 2010 tracts that are the result of splitting 2000 boundaries. This is noticeable in the bottommost tract, which shows the cyan line from a split and the old 2000 boundary in red.
  • The scale of the image hides a revision in one of the tracts.

There are several standardization methods for each type of change:

  • merge - locate merges using the first four digits of the tract number, edit the tracts manually, or use spatial joins by polygon or centroid
  • split - area allocation and union (assumes that population is equally distributed)
  • any change - use the Census Tract Relationship files, which track the changes in census geography
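The area-allocation method for splits can be sketched as a simple proportional calculation; the tract numbers and areas below are hypothetical:

```python
def allocate_by_area(pop_2000, overlap_areas):
    """Divide a 2000 tract's population among the 2010 tracts that
    split it, in proportion to overlapping area. This assumes the
    population is uniformly distributed across the tract."""
    total = sum(overlap_areas.values())
    return {tract: pop_2000 * area / total
            for tract, area in overlap_areas.items()}

# A 2000 tract of 4,000 people split into two 2010 tracts whose
# overlaps cover 75% and 25% of its area.
parts = allocate_by_area(4000, {"110101": 7.5, "110102": 2.5})
```

In practice the overlap areas would come from a PostGIS intersection of the 2000 and 2010 tract polygons.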
Methodology

Using the Census Relationship files, the methodology is:
  • query for tracts without change
  • query for tracts that split
  • query for tracts that have merged
  • combine the results to a base geography
The following query retrieves the total population of the 2000 census tracts with no change; the list of tracts was selected by matching census tract numbers in QGIS.

A similar script for 2010 census tracts without change is available. The Census Relationship files record the splits that occurred between censuses. The following query retrieves the 2000 tracts, 2010 tracts, and the 2010 population from the relationship files.
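The split lookup can be sketched like this, with sqlite3 standing in for PostgreSQL; the columns tract00, tract10, and pop10 are simplified stand-ins for the actual relationship-file fields:

```python
import sqlite3

# Toy stand-in for the Census Tract Relationship file: one row per
# overlapping 2000/2010 tract pair, with the 2010 population of the part.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trf (tract00 TEXT, tract10 TEXT, pop10 INTEGER)")
conn.executemany("INSERT INTO trf VALUES (?,?,?)", [
    ("110100", "110100", 4200),  # unchanged tract
    ("110500", "110501", 2100),  # a 2000 tract split into
    ("110500", "110502", 1800),  #   two 2010 tracts
])

# A 2000 tract matched to more than one 2010 tract was split; summing
# pop10 over the parts standardizes 2010 population to 2000 boundaries.
cur = conn.execute("""
    SELECT tract00, COUNT(tract10), SUM(pop10)
    FROM trf
    GROUP BY tract00
    HAVING COUNT(tract10) > 1""")
splits = cur.fetchall()
```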

Since only one 2010 tract was the result of a merge, it was possible to retrieve the values with a direct query to the table. All the results were merged into a CSV file with both the 2000 and 2010 tract numbers and joined to the 2000 tract boundaries using QGIS. Below are the population changes using the 2000 census boundaries.


2000 Population by 2000 tract boundaries
2010 Population by 2000 tract boundaries
The dark blue tracts indicate the most populous areas in a quantile classification, and the white tracts represent the least populous. The maps show an overall increase in population in the selected tracts. There are also increases in some of the smaller tracts, indicating increasing density in those areas.


Summary


I've covered some basic techniques for retrieving data for demographic analysis. The SQL scripts provide basic building blocks for typical demographic processes such as calculating fertility, mortality, and migration. PostgreSQL and QGIS were used in combination to perform the queries and display the results.





Thursday, April 5, 2012

Whereconf 2012: The Dead Pool and Also-Rans

My talk at Whereconf 2012 was originally about companies in the location business that failed or were acquired for less than the money they raised. My source of information was CrunchBase, a freely editable online database of companies, people, and investors. Since it is crowdsourced, the quality of the data can vary, and for the most part the dead pool and acquisition information was not sufficient for analysis. I decided instead to look at acquisitions and IPOs to see how the location industry was doing.


Methodology


My first task was to gather a list of location-oriented companies. CrunchBase tags companies with terms such as location or geo or gps. Although I could have used these tags to create a pool of location-oriented companies, the tags varied too much to reliably determine whether a company was in the location business, so I took a different approach. Each company entry in CrunchBase includes a list of competitors, so I wrote a depth-first search program that follows competitor links, starting from a seed list of well-known companies such as FourSquare, SkyHook, Loopt, and SimpleGeo. As expected, this process generated a list with many duplicates; with a bit of experimentation, I determined that three iterations generated a sufficient number of companies once duplicates were removed. Running the seed list through the depth-first search yielded 1945 unique companies. A number of those entries lacked basic information such as founding date or money raised; eliminating them reduced the list to 1505 companies with sufficient data for analysis.
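The competitor walk can be sketched as follows; collect_competitors and the toy graph below stand in for the actual CrunchBase API calls:

```python
def collect_competitors(seeds, competitors_of, max_depth=3):
    """Depth-first walk over competitor links, stopping after
    max_depth hops from the seeds; returns the unique companies
    reached (duplicates are removed by the seen set)."""
    seen = set()
    stack = [(s, 0) for s in seeds]
    while stack:
        company, depth = stack.pop()
        if company in seen:
            continue
        seen.add(company)
        if depth < max_depth:
            for rival in competitors_of.get(company, []):
                stack.append((rival, depth + 1))
    return seen

# Tiny hypothetical competitor graph.
graph = {
    "FourSquare": ["Loopt", "SimpleGeo"],
    "Loopt": ["FourSquare", "SkyHook"],
    "SimpleGeo": ["SkyHook"],
}
pool = collect_competitors(["FourSquare"], graph)
```

Cycles in the graph (FourSquare and Loopt list each other) are handled by the seen set, which is what removes the duplicates mentioned above.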


Initially, I attempted to segment the companies into groups based on descriptive tags using hierarchical clustering, but the tags were too idiosyncratic to produce meaningful clusters. Fortunately, companies can be assigned to a category in CrunchBase, and this is a handy way to segment the data set. The pie chart below shows the different categories.



Location companies by CrunchBase category


Refining the location business segments


I wanted to further narrow the data set by focusing on the companies that comprise the majority of the location industry, so I ran hierarchical clustering again, this time on category. The result was five clusters represented by six categories.




The higher a category or term appears in the plot, the more frequently it occurs in the data set. Terms that are closer to each other are more related; the plot shows that the categories are quite discrete.


A look at the raw counts and percentages shows that roughly 20% of companies are acquired and about 5% reach an initial public offering.


Raw counts

Category      Total  Acquired  IPO
advertising     110        29    5
ecommerce       114        18    8
games_video     221        46    8
mobile          182        34    6
software        185        36   10
web             693       136   13


Percentages

Category      Acquired  IPO
advertising        26%   5%
ecommerce          16%   7%
games_video        21%   4%
mobile             19%   3%
software           19%   5%
web                20%   2%


A closer look at the companies that had an IPO revealed several in some categories that were clearly not location companies (these included major mobile carriers such as Verizon). These were removed and the counts adjusted.


Adjusted counts and percentages

Category      IPOs  Percentage
advertising      5          5%
ecommerce        4          4%
games_video      8          4%
mobile           0          0%
software         7          4%
web              4          1%



Who's acquiring companies 


I created a word cloud to show which companies are acquiring businesses in the location space. Companies such as Google, Microsoft, and Yahoo are doing most of the acquiring, but the number of companies with "Media" as part of their name stands out.

CrunchBase includes a field for the media source announcing acquisitions. Google, Social Media, and Video are prominent terms.



How much money did these companies raise prior to acquisition?
For the top three categories (web, games/video, and mobile), I generated descriptive statistics for the amount of money raised. Note that I removed all companies without data for money raised.
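The filtering and summary statistics can be sketched with Python's statistics module; the dollar figures below are hypothetical:

```python
from statistics import mean, median, mode

# Hypothetical money-raised figures; None marks a company with no
# funding data, which is dropped before computing statistics.
raised = [15_000, 1_000_000, 1_000_000, 5_900_000, None, 31_000_000]
raised = [r for r in raised if r is not None]

stats = {
    "count": len(raised),
    "mean": mean(raised),
    "median": median(raised),  # robust to the largest raises
    "mode": mode(raised),      # the most common amount raised
}
```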

Amount of money raised by web companies 


Count      300
Mean       $31,183,226
Median     $5,900,000
Mode       $1,000,000
Minimum    $15,000
Maximum    $1,160,166,511
Range      $1,160,151,511
Sum        $9,354,967,849

The results are impressive, with over $9 billion raised; however, they are skewed because they include Facebook, Twitter, Groupon, and AOL, all of which had multiple large rounds. The more telling statistic is the mode, which indicates what most companies raised.

Amount of money raised by game/video companies


Count      99
Mean       $21,756,267
Median     $11,015,719
Mode       $5,000,000
Minimum    $15,000
Maximum    $183,240,000
Range      $183,225,000
Sum        $2,153,870,470

The mode, $5 million, is quite a bit larger than for companies in the web category. This is most likely due to web companies having long tails on both ends of the distribution.

Amount of money raised by mobile companies


Count      92
Mean       $22,577,161
Median     $11,075,000
Mode       $4,000,000
Minimum    $11,000
Maximum    $168,000,000
Range      $167,989,000
Sum        $2,077,098,772

Of note, the mode is similar to that of the games/video category.

Closing thoughts 


CrunchBase is essentially a wiki, with all the strengths and weaknesses of crowdsourced data. It is a good place to begin exploring the performance of technology businesses, especially location-based startups. While there are databases for established companies, there are few openly available resources on startups.


Using depth-first search on business competitors is a good way to generate a list of businesses and startups in the location space, but an extra step is needed to filter out businesses that are adjacent to, but not in, the location space. Whether this can be done with clever text mining of tags or other ancillary data remains to be seen. Using the predetermined CrunchBase categories to segment the pool of location businesses is a reasonable approach and allows for comparison across other business verticals.


Basic descriptive statistics have been provided, but conclusions can't be drawn from the data presented. It does, however, offer a first sketch of the landscape of location-based startups and businesses.


Data availability


Data and the scripts used to generate the data and analyses are available at: https://github.com/spara/whereconf2012


Acknowledgments

  1. Word clouds generated at http://www.wordle.net/
  2. crunchbase gem was forked from https://github.com/tylercunnion/crunchbase
  3. R data loading script from http://www.cloudstat.org/index.php?do=/kaichew/blog/crunchbase-api-scrapping-companies-info-from-crunchbase-with-r-packages/
  4. R hierarchical clustering script from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/