Tuesday, January 10, 2012

Importing 2010 SF1 Census in PostgreSQL

The Census Bureau provides a template MS Access database for importing fixed width/comma delimited  SF1 files. Nice if you have MS Acess, which I don't.

With the help of MDBTools, I extracted the schema (using the mdb-schema command) from the Census provided template database. With a little bit of sed/awk/vi munging I converted the the schema into valid sql script that creates the SF1 tables in PostgreSQL.

Create a data base and run the script to create the tables
createdb SF1_2010
psql SF1_2010 -f create_SF1_tables.sql
You can import all or just the ones you need. For example, I wanted the total population count by age and sex by census tract. The 2010 Summary File 1 document shows the file containing the data I wanted (Chapter 6 The Data Dictionary), which was in xx000172010.sf1. Use the postgres copy command to import the data tables. 
COPY sf1_00017 FROM '/path/to/xx000172010.sf1' CSV;
The SF1 database also provides a geo_header table common to all SF1 data files. Unlike the SF1 data files, which are comma delimited, the geo_header file uses fixed length records. This can be imported with a bit file munging and some sql foo. The first step is to read the geoheader into a temporary table with each line as a record. There's a script for this also.

CREATE TABLE geo_header_staging (data text);
COPY geo_header_staging FROM '/path/to/dcgeo2010.sf1' DELIMITER AS '|';
Note that "|" is used as a delimiter to read the entire line.  

The second part is to use PostgreSQL's string functions to extract the values and insert them into the correct field in the geo_header table. Writing the sql would have taken some time, but fortunately the Census provides a SAS import script which I used as a starting point. With some more sed/awk/cut/paste/vi munging, I produced an import script.
psql SF1_2010 -f geo_header_to_postgresql.sql
This will create a PostgreSQL database with a valid geo_header table that you can use for querying by census geography. The primary key is logrecno, which ties together the geography with the data. Query for the logrecno for the geography of interest, then query for the data using the logrecno.
SELECT logrecno from geo_header_sf1 WHERE tract ='192200' AND sumlev ='140' AND name = ' Census Tract 1922';

SELECT * FROM sf1_00017 WHERE logrecno ='69229';
After selecting and summarizing the data, I produced a population pyramid. Whew!

* Thank you to @richgibson for corrections