Wednesday, April 11, 2012

FOSS4G-NA: Demographic analysis using open source

My FOSS4G-NA talk expands on an earlier blog post on loading S1 data into PostgreSQL by looking at how to create a population pyramid and how to compare population change between the 2000 and 2010 census. The scripts I used for this presentation are available on github.

Database Preparation

I've updated the original scripts to correct a few errors with regards to character set and data types. We will load the SF1 files for 2000 and 2010, the Census Tract Relationship files and the census tracts of interest from TIGER.

The following scripts create the 2010 and 2000 SF1 tables in a postgresql database:
Loading data into the tables is covered in a previous post.

Census geography changes between each decennial census. The Census Bureau provides Census Tract Relationship files that describe these changes. The following sql scripts create the files as tables.
The data can be downloaded from Census and loaded into the tables using the postgres copy command as in the previous example.

The Relationship Between SF1 and TIGER

Summary File 1 (SF1)
"...contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied)." 
SF1 data for either the 2000 or 2010 census includes 40 plus files (for either decennial census) and thousands of columns cryptically named. In order to understand which file contains variables of interest you will need SF1 Table Matrix. In addition, to understand how to query for the level of geography (blockgroup, tract, county, etc) you will need Summary Level Sequence chart.

There a number of ways to retrieve the logrecno for a given census geography. The first example shows how to retrieve the logrecno from a set of TIGER census tracts loaded into PostGIS. The census tract table is called tract_d3.

You can also get the logrecno from a list of census tracts. This example retrieves the logrecno number from the 2000 census tracts.

This query can be incorporated as subquery for other queries such as building a population pyramid.

Building a population pyramid

The following query builds a population pyramid for females for a group of 2010 census tracts. The table sf1_00017 contains the counts for each age, the query sums the counts in five year increments which is how population pyramids are typically constructed. The subquery selects tracts based on logrecno.

A script retrieve the data for the male portion of the population pyramid is also available.

There are a number of sources for building an online population pyramid. Most of the are based on the d3 javascript visualization library. Here are several:
If you use OSX, the Population Pyramid Generator by the Perihelion Group can quickly generate a standard population pyramid.

Demographic Change

Measuring change between decennial census is a typical demographic task. Comparison require standardizing changes to geographic boundaries over time. There are four types of change:
  • no change, geographic boundaries remain the same
  • revision - a small change that does not affect a larger portion of the population
  • merge - previous geographic boundaries are joined to form the current boundaries
  • split - current boundaries are created by splitting the previous boundaries
These 2010 census tracts belong to a city council district in San Antonio, Texas. This example exhibits all four types of change:

  • The light blue/whitish lines are the 2010 tract boundaries that haven't changed from the previous census. 
  • The red lines represent the 2000 tract boundaries. The tracts with the red line between them is a merger of the 2000 tracts that form a 2010 tract outlined in a light blue.
  • The cyan lines represent new 2010 tracts that are the result of splitting 2000 boundaries. This is noticeable at the bottom most tract which shows the cyan line from a split and the old 2000 boundary in red.
  • The scale of the image hides a revision in one of the tracts.

There are several standardization methods for each type of change:

  • merge - locate merges using the first 4 digits of the tract, manually edit the tract, or spatial joins by polygon or centroid
  • split - area allocation and union (assumes that population is equally distributed)
  • use the Census Relationship files which tracks the changes in census geography

Using the Census Relationship files, the methodology is:
  • query for tracts without change
  • query for tracts that split
  • query for tracts that have merged
  • combine the results to a base geography
The following query is for the total population of 2000 census tracts with the no change; the list of tracts were selected by matching census tract numbers in QGIS.

A similar script for 2010 census tracts without change is available. The Census Relationship files contain the splits that occurred between each census.  The following query retrieves the 2000 tracts, 2010 tracts, and the 2010 population from the relationship files.

Since there was only one 2010 tract that was merged, it was possible to retrieve the values with a direct query to the table. All the results were merged into a csv file with both the 2000 and 2010 tract numbers and joined to the 2000 tract boundaries using QGIS. Below are the population changes using the 2000 census boundaries.

2000 Population by 2000 tract boundaries
2010 population by 2000 tract boundaries
The dark blue tracts indicates the most populous areas in a quantile classification and white tracts represent the least populous. The maps show an overall increase in population in the selected tracts. There are some increases in the smaller tracts indicating increasing density in those areas.


I've covered some basic techniques for retrieving data for demographic analysis. The sql scripts provide basic building blocks for typical demographic processes such as calculating fertility, mortality, and migration. PostgreSQL and QGIS were used in combination to perform these queries and display the results.

No comments:

Post a Comment