Thursday, April 5, 2012

Whereconf 2012: The Dead Pool and Also-Rans

My talk at Whereconf 2012 was originally about companies in the location business that failed or were acquired for less than the money they raised. My source of information was CrunchBase, an online database of companies, people, and investors that is freely editable. Since it is crowdsourced, the quality of the data can vary, and for the most part the dead pool and acquisition information was not sufficient for analysis. Instead, I decided to look at acquisitions and IPOs to see how the location industry was doing.


Methodology


My first task was to gather a list of location-oriented companies. CrunchBase tags companies with terms such as location, geo, or gps. Although I could have used these tags to create a pool of location-oriented companies, the tags varied too much to reliably determine whether a company was in the location business. I decided on a different method. Each company entry in CrunchBase includes a list of competitors, so I wrote a depth-first search program that walks those competitor lists to gather companies. I used a set of well-known companies such as FourSquare, SkyHook, Loopt, and SimpleGeo as the seed list. As expected, this process generated a list with many duplicates; with a bit of experimentation, I determined that three iterations generated a sufficient number of companies once duplicates were removed. Running the seed list through the depth-first search program yielded 1945 unique companies. A number of those company entries lacked basic information such as founding date or money raised; eliminating them reduced the list to 1505 companies with sufficient data for analysis.
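For the curious, the crawl amounts to a bounded expansion over competitor lists. The original was a depth-first search written in Ruby with the crunchbase gem; the shell sketch below is a level-by-level version that reaches the same set of companies after three iterations. The seed permalinks and the fetch_competitors helper (imagined here as printing one competitor permalink per line, e.g., from the CrunchBase API) are illustrative assumptions, not the actual code.

# A hedged sketch, not the original script. fetch_competitors is a hypothetical
# helper that prints one competitor permalink per line for a given company.
frontier="foursquare skyhook-wireless loopt simplegeo"
companies="$frontier"
for depth in 1 2 3; do
    next=""
    for c in $frontier; do
        next="$next $(fetch_competitors "$c")"
    done
    companies="$companies $next"
    frontier="$next"
done
# de-duplicate to get the final list of unique companies
printf '%s\n' $companies | sort -u > location_companies.txt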


Initially, I attempted to segment the companies into groups by running hierarchical clustering on their descriptive tags, but the tags were too idiosyncratic to produce meaningful clusters. Fortunately, each company can be assigned to a category in CrunchBase, and this is a handy way to segment the data set. The pie chart below shows the different categories.



Location companies by CrunchBase category


Refining the location business segments


I wanted to further narrow the data set by focusing on the companies that make up the majority of the location industry, so I ran hierarchical clustering again, this time on category. The result was five clusters represented by six categories.




The higher a category or term appears in the plot, the more frequently it occurs in the data set. Terms that are closer to each other are more related; the plot shows that the categories are quite discrete.


A look at the raw counts and percentages shows that roughly 20% of companies are acquired and 5% reach an initial public offering.


Raw counts

Category      Total  Acquired  IPO
advertising     110        29    5
ecommerce       114        18    8
games_video     221        46    8
mobile          182        34    6
software        185        36   10
web             693       136   13


Percentages

Category      Acquired  IPO
advertising        26%   5%
ecommerce          16%   7%
games_video        21%   4%
mobile             19%   3%
software           19%   5%
web                20%   2%


A closer look at the companies that had an IPO revealed several companies in some categories that were clearly not location companies (these included major mobile carriers such as Verizon). These were removed and the counts were adjusted.


Adjusted counts and percentages

Category      IPOs (adjusted)  Percentage
advertising                 5          5%
ecommerce                   4          4%
games_video                 8          4%
mobile                      0          0%
software                    7          4%
web                         4          1%



Who's acquiring companies 


I created a word cloud to show which companies are acquiring businesses in the location space. Companies such as Google, Microsoft, and Yahoo are doing most of the acquiring, but the number of companies with "Media" as part of their name stands out.

CrunchBase includes a field for the media source announcing acquisitions. Google, Social Media, and Video are prominent terms.



How much money did these companies raise prior to acquisition?
For the top three categories (web, games, and mobile), I generated descriptive statistics for the amount of money raised. Note that I removed all companies that lack data on money raised.

Amount of money raised by web companies 


Count      300
Mean       $31,183,226
Median     $5,900,000
Mode       $1,000,000
Minimum    $15,000
Maximum    $1,160,166,511
Range      $1,160,151,511
Sum        $9,354,967,849

The results are impressive, with over $9 billion raised; however, they are skewed because they include Facebook, Twitter, Groupon, and AOL, all of which had multiple large rounds. The important statistic to look at is the mode, which indicates what most companies raised.

Amount of money raised by game/video companies


Count      99
Mean       $21,756,267
Median     $11,015,719
Mode       $5,000,000
Minimum    $15,000
Maximum    $183,240,000
Range      $183,225,000
Sum        $2,153,870,470

The mode, $5 million, is quite a bit larger than for companies in the web category. This is most likely because the web companies have long tails at both ends of the distribution.

Amount of money raised by mobile companies


Count      92
Mean       $22,577,161
Median     $11,075,000
Mode       $4,000,000
Minimum    $11,000
Maximum    $168,000,000
Range      $167,989,000
Sum        $2,077,098,772

Of note, the mode is similar to that of companies in the game/video category.

Closing thoughts 


CrunchBase is essentially a wiki, with all the strengths and weaknesses of crowdsourced data. It is a good place to begin exploring the performance of technology businesses, especially location-based startups. While there are databases for established companies, there are few openly available resources on startups.


Using DFS on business competitors is a good way to generate a list of businesses and startups in the location space, but an extra step is needed to filter out businesses that are adjacent to, but not in, the location space. Whether this can be done with clever text mining of tags or other ancillary data remains to be seen. Using the predetermined categories to segment the pool of location businesses is a reasonable approach and allows for comparison across other business verticals.


Basic descriptive statistics have been provided, but conclusions can't be drawn from the data presented. However, this provides a first-approximation sketch of location-based startups and businesses.


Data availability


Data and the scripts used to generate the data and analyses are available at: https://github.com/spara/whereconf2012


Acknowledgments

  1. Word clouds generated at http://www.wordle.net/
  2. crunchbase gem was forked from https://github.com/tylercunnion/crunchbase
  3. R data loading script from http://www.cloudstat.org/index.php?do=/kaichew/blog/crunchbase-api-scrapping-companies-info-from-crunchbase-with-r-packages/
  4. R hierarchical clustering script from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/






Wednesday, March 7, 2012

This is important

Listen up, this is important. If you've seen my WhereCampDC video "WMS is Dead", then you know that what is commonly called web mapping is what I consider a chart or a static graphic plotting information along arbitrary x and y axes. In my view, this means projections aren't important, so we might as well get on with our lives and use web mercator.

But what if we could have real interactive cartography, so that for any location at any scale we could view it using a carefully curated set of projections to preserve area and shape? I have a secret. Somewhere in my black heart is a tiny undernourished cartographic flame starved by nearly two decades of shite web mapping. This is the part where that flame goes, "Wooosh!" 

Two projects showed up in my Twitter feed. The first is Scale Adaptive Web Maps, which changes the projection based on scale. It's pretty cool to watch the projections change while zooming, and hey, there's a projection you can use to display OCONUS. Alaska and Hawaii thank you.


The second project is kartograph.js which posits a convincing argument for yet another mapping framework. I think the work speaks for itself.




However, the question that remains is "Does it scale?" At O'Reilly's StrataConf last week, Mano Marks and Chris Broadfoot presented work showing WebGL rendering features with 6 million vertices in the browser. When asked for a public demo, Mano said that it wasn't quite ready for release. Despite that, I'm optimistic that this may be a new way forward for interactive digital cartography.

Monday, February 6, 2012

Mapping Software for the 84%

This article covers downloading, installing, and loading spatial data into PostGIS using QGIS. It also covers working with tabular data using unix command line tools. The goal of these articles is to perform typical GIS tasks using data published on the web with open source Windows mapping tools. As with many GIS tasks, the final output will be maps suitable for printing or for display in a web browser. Although the final output is maps, I want to push a little beyond styling a pre-processed set of data and into the area of creating new data through analysis.


Download and Install
Download and install the following software: QGIS, TileMill, and Postgres with PostGIS.
The QGIS and TileMill installers are straightforward. Postgres installs first (take note of your username and password), then launches the Application Stack Builder for installing PostGIS. Check off PostGIS; the ODBC connector is also handy when connecting to Microsoft formats.


Download data
We will use data from several sources for our analysis and maps: the US Census, the Texas Education Agency (TEA), and CloudMade's OpenStreetMap extract of Texas.
Download the shape files and unzip them into a single directory. Note that the TEA site assigns the same name to both the campus and district dropout rate CSV files, so use the 'Save as' option to give them unique names when downloading. Also important are the metadata describing the TEA files.
Loading spatial data into PostGIS
I'll start by saying PostGIS isn't necessary for spatial analysis. Shapefiles, GeoJSON, SpatiaLite, etc. are all perfectly acceptable data formats that can be consumed by QGIS and/or TileMill. The main advantage of using PostGIS in this context is data management, with the added bonus of being able to perform spatial operations if you are willing to learn SQL. We'll touch on using SQL in subsequent articles, but as always there is more than one way to do any task. For those not quite ready to jump into learning SQL, QGIS can act as a front end for PostGIS.

Buckets

Postgres is a relational database and there are a number of best practices associated with building and maintaining transactional databases. We're going to chuck all of that aside, because like the honey badger, spatial data doesn't care about 3rd normal form or referential integrity. We're going to treat databases as big buckets of gooey geo goodness.

First we're going to create a database using PGAdmin III, which is a graphical frontend for Postgres included in the install.


Your Postgres instance should already be connected, but in case it isn't, please see this page. Add a new database by right-clicking on 'Databases'>'New Database'
Name the new database 'texas'
Click on the 'Definition' tab, select 'template_postgis' as the Template, and click 'OK'. Template_postgis is a predefined database that contains the PostGIS tables and spatial functions. Congratulations, you've made a bucket!
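For those who prefer SQL, the same bucket can be created from the PGAdmin SQL window or psql with a single statement (this assumes the template_postgis template created by the Stack Builder install):

-- create the 'texas' database from the PostGIS template
CREATE DATABASE texas TEMPLATE template_postgis;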
Reprojecting data in QGIS
Previously we downloaded a number of shape files from the Census, the TEA, and CloudMade's OpenStreetMap extract of Texas. It's always a good idea to check the .prj or metadata files associated with a shape file. The OSM and Census files are both in WGS84, but the TEA files are in the Texas Centric Mapping System projection. For the purposes of this project, we'll use QGIS to reproject the TEA files to WGS84, then load the shape files using SPIT in QGIS.

First add the TEA shape files. Click on Layer>Add Vector Layer

Browse to the directory the TEA files are in and select them. Click OK to complete.
Select 'Schools2010', right-click, and select 'Save as...'
Browse to the directory to save the file and name it 'tea_school_campus'
Next we'll set the output CRS (projection) to WGS84. Click 'Browse' on the CRS field and select WGS84 under the 'Geographic Coordinate Systems' list or use the 'Search' box to find it. Click 'Ok' with WGS84 selected. Then click 'OK' to reproject and save the file.
Repeat the same process for the Districts_10_11 shape file.
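If you happen to have GDAL's ogr2ogr on hand (it isn't part of the QGIS workflow above, so treat this as an optional alternative), the same reprojection can be done from the command line; the output name for the district file is my own invention:

# reproject the TEA shape files to WGS84 (EPSG:4326)
ogr2ogr -t_srs EPSG:4326 tea_school_campus.shp Schools2010.shp
ogr2ogr -t_srs EPSG:4326 tea_district.shp Districts_10_11.shp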
From QGIS to PostGIS
We'll use SPIT (Shapefile to PostGIS Import Tool) to import the shape files into PostGIS. Click 'Database'>'Spit'>'Import Shapefiles to PostGIS'
In the Spit window, create a connection to PostGIS by clicking on 'New'. Fill out the 'Name' (texas), 'Host' (localhost), 'Database' (texas), 'Username' (postgres), and 'Password' (your password). Save your username and password if desired. Click on 'Test Connect' to test the connection; if successful, click 'OK' to save the connection.

Now that we have a connection to the database, click 'Connect' to establish the connection. We will use the default SRID, since all the shape files are in the same projection. Click 'Add' and select the OSM, Census, and reprojected TEA shape files. Click 'OK' to load them into PostGIS. This may take a bit of time, especially for the larger files such as the OSM highways file.
You can check that the data was imported using PGAdmin III. Press 'F5' to refresh the view and click on the 'texas' database to open the tree view, then click on 'Tables' to view them.
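As an alternative to SPIT, the shp2pgsql loader that ships with PostGIS does the same job from the command line. A sketch for one of the files, assuming the connection details used above:

# -s sets the SRID, -I builds a spatial index on the geometry column
shp2pgsql -s 4326 -I tea_school_campus.shp tea_school_campus | psql -U postgres -d texas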


Data prep using command line tools
Advocacy for open data has made more data available on the web, but often the data is in an as-is state that requires additional refinement to make it usable. The TEA dropout rate data is no different, and it is important to look at the data to determine what the individual columns mean and whether they are in a format suitable for analysis.


Things to look out for are whether the data types are consistent and whether the data can be accurately and consistently represented as continuous or categorical values. CSV files come in many flavors, and importing the data into any database requires a degree of data sleuthing to ensure it is imported correctly. There are many tools for examining and modifying data, ranging from text editors to spreadsheets. My preferred method is to use unix utilities such as sed, awk, and tr because they don't have the file size or row limits found in text editors or spreadsheets.


If you have a preferred method, feel free to skip ahead to the loading CSV into Postgres section. This part describes how to use unix utilities to correct the CSV files and to write SQL scripts that create tables to hold the TEA dropout rate data. One of many pleasant surprises of QGIS for Windows is that it includes MSYS, a collection of GNU utilities such as sed, awk, tr, and vi. Unlike Cygwin, it is not a full environment that includes the gcc compiler; it has a smaller footprint and provides just what is needed for manipulating data.


If we look at the header and first line of data in the campus dropout rate CSV file, we see that each entry is quoted and comma delimited. That's OK for the header, but quoted data means that it will be imported as text into a database. Furthermore, if we look throughout the data set, we see that many of the fields contain a mixture of numeric and text data. If you read through the site metadata, it says that some fields are purposely obfuscated to preserve privacy; that's the reason some of the data contains a less-than sign in front of the value. Here are a couple of examples of how to use head and tail to look at various parts of a file.
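A sketch of those spot checks; 'campus_dropout.csv' stands in for whatever name you gave the campus file when downloading:

head -1 campus_dropout.csv              # the header line
head -2 campus_dropout.csv | tail -1    # the first data line
tail -5 campus_dropout.csv              # the last five lines
wc -l campus_dropout.csv                # total number of lines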


One of the problems with the files is that null or empty values are represented by ".", which won't do. We'll fix this by removing those values using sed, a utility that edits a file as a stream instead of opening the entire file at once. You can find and replace values using substitution. The general form of substitution in sed is 's/text to replace/new text/g'; the 's' means substitution and the 'g' means global, i.e., replace every occurrence rather than just the first. In the example below, the period needs to be escaped with a backslash because it is a special character in regular expressions. The version of sed in MSYS does not support in-place editing, so we redirect the output using '>' to another file. Finally, we remove the column header by copying the file to a tmp file, then using sed to delete the first line. The file 'campus_dropout_clean.csv' is now ready for importing into Postgres. Repeat the same process for the district file; a sketch of the campus commands follows.
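Here is roughly what that looks like for the campus file (a sketch; the input name 'campus_dropout.csv' and the intermediate file names are assumptions):

# replace the quoted "." placeholders with empty quoted values
sed 's/"\."/""/g' campus_dropout.csv > campus_dropout_noperiod.csv
# copy to a tmp file, then delete the header line ('1d' deletes line 1)
cp campus_dropout_noperiod.csv tmp
sed '1d' tmp > campus_dropout_clean.csv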


However, before we can import the data, we need to define and create the table structure in Postgres using SQL. The CREATE TABLE statement is straightforward and can be typed into psql or the PGAdmin SQL console. I'll be the first to admit that I'm a terrible typist and very prone to error, so to overcome my lack of typing skills I use sed to generate the SQL.


I start with the header line of the CSV file because it contains all the column names. I use head to grab the first line and save it to the 'campus_header' file. This file has all the column names on a single line, so I use sed to add a newline after each one and write the output to the 'campus_dropout_raw.sql' file. Note that I use an actual carriage return instead of the newline escape ('\n') because of a limitation in this version of sed. Use 'cat' to see the new file.
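A sketch of those steps. It deviates slightly from the description in that it strips the quotes and appends the column type while splitting, which keeps the later steps simple; GNU sed would accept '\n' in the replacement, while the MSYS sed mentioned above needs a backslash followed by a literal line break, as shown:

# grab the header line and strip the quotes around the column names
head -1 campus_dropout.csv | tr -d '"' > campus_header
# split the single line into one column per line, adding a type as we go
sed 's/,/ varchar,\
/g' campus_header > campus_dropout_raw.sql
cat campus_dropout_raw.sql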


To finish up the SQL, we add the CREATE TABLE statement and the closing parenthesis using sed. We want to operate only on the first line of the file, so we use '1s' to perform the substitution. Note that the output is redirected to a temporary file because this version of sed lacks in-place editing. We then take the temp file and add the type definition and closing parenthesis for the last column using sed again; this time we want to operate only on the last line, so we use '$s'. The output is redirected to the 'campus_dropout_raw.sql' file.
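Continuing the sketch (the table name campus_dropout_raw is an assumption):

# '1s' prepends the CREATE TABLE statement to the first line only
sed '1s/^/CREATE TABLE campus_dropout_raw (\
/' campus_dropout_raw.sql > tmp
# '$s' appends the type and closing parenthesis to the last line only
sed '$s/$/ varchar);/' tmp > campus_dropout_raw.sql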


Now we have a SQL file that creates the table holding the raw TEA campus dropout rates. One problem with the raw file is that the data is a mix of numeric and text types, especially for the fields used to calculate the rate. Fortunately, the dropout rate fields contain the calculated rate as numeric values, so we can create a second table with the dropout rates defined as numeric values instead of character strings. Later on, we will use the raw data to populate the actual dropout rate table.


In the previous example we wanted all the column names, but in this example we want just the column names for the different dropout rates. We use tr to add a newline after each comma, then send the result to sed to extract the fields of interest. Luckily, the dropout rates occur in a regular pattern, so we use '2~3p' to specify that, starting on the second line, sed should print every third line in the file. We then send the output to a file called campus_dropout_rate.sql.
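A sketch of that extraction; note that the '2~3' first~step address is a GNU sed extension, and -n suppresses the default printing so only the matched lines are written:

# one column name per line
tr ',' '\n' < campus_header > columns_tmp
# starting at line 2, print every third line (the dropout rate fields)
sed -n '2~3p' columns_tmp > campus_dropout_rate.sql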


If we look at the output file using cat, we can see that it extracted the desired lines plus a few extra at the end. To get a clean file of only the dropout rate fields, we extract the first 21 lines and redirect them to a 'tmp' file. (To get the count of the lines we want, use wc to get the total number of lines and subtract the lines we don't want.) Then add the numeric type definitions using sed and redirect the output to the 'first' file.
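Continuing the sketch:

wc -l campus_dropout_rate.sql           # total lines; the first 21 are the rate fields
head -21 campus_dropout_rate.sql > tmp
# append the numeric type to each rate column
sed 's/$/ numeric,/' tmp > first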


The remaining nine fields in the campus_header file are useful fields that contain the campus identifier as well as other important information such as the school name. We want those fields, so we extract them using tr and tail and write the results to a 'tmp' file. As in the previous example, add the varchar type definition using sed and write the output to the 'last' file. We glue the 'first' and 'last' files together using the cat command and redirect the output to the campus_dropout_rate.sql file. Finally, we finish by adding the CREATE TABLE statement and closing parenthesis as we did in the previous example.
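Put together, the remaining steps of the sketch look like this (file and table names are assumptions):

# the last nine columns hold the campus identifier and other descriptive fields
tr ',' '\n' < campus_header | tail -9 > tmp
sed 's/$/ varchar,/' tmp > last
# glue the numeric and varchar column lists together
cat first last > campus_dropout_rate.sql
# wrap with the CREATE TABLE statement and close the parenthesis on the last line
sed '1s/^/CREATE TABLE campus_dropout_rate (\
/' campus_dropout_rate.sql > tmp
sed '$s/ varchar,$/ varchar);/' tmp > campus_dropout_rate.sql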

In this section, we used command line utilities to clean up the files and to generate SQL scripts that create tables in Postgres. Repeat the process for the district dropout rate data. Once the cleaning and SQL scripts for the district data are done, we'll be ready to import the data into Postgres.


Importing CSV files into Postgres
In this section we will use PGAdmin and the SQL editor to load the data. Open PGAdmin III, select the 'texas' database, and click on the 'SQL' button to launch the SQL editor.


In the SQL query window, click on 'File'>'Open' to open the campus_dropout_raw.sql file. Load the file, then click on 'Query'>'Execute' to run the script.


Repeat the same process for the campus_dropout_rate.sql, district_dropout_raw.sql, district_dropout_rate.sql files. You should have four new tables listed when you are finished.

The final step is to import the raw data and insert the cleaned data into the campus and district rate tables. Postgres provides the COPY command to import text files into tables. You can use the SQL window to write and execute SQL. Clear the window by clicking on 'Edit'>'Clear window' and type the following, substituting the appropriate path.
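The statement looks something like this; the path is a placeholder for wherever you saved the cleaned file:

COPY campus_dropout_raw FROM 'C:/data/tea/campus_dropout_clean.csv' CSV;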




The next step is to populate the campus_dropout_rate table from the table containing the raw data. To do this, we use an INSERT statement that copies the data and converts it to a numeric type. Using the same techniques as in the previous example, I created a script for this.
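The real script lists every field; its shape is roughly the sketch below, with hypothetical column names standing in for the 21 rate fields and nine identifier fields (NULLIF keeps the blanked-out values from breaking the cast):

INSERT INTO campus_dropout_rate (dropout_rate_all, campus, campname)
SELECT CAST(NULLIF(dropout_rate_all, '') AS numeric),
       campus,
       campname
FROM campus_dropout_raw;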

Copy and paste the script into the SQL window and run it.
Repeat the same process for the district data. That's it for loading the tabular TEA data on campus and district dropout rates. Later on, we will join the tables to the school and district PostGIS layers for analysis.


Summary
We've done a lot: installed software, created a database, reprojected and loaded shape files into PostGIS using QGIS, used command line tools to clean up CSV files and create SQL scripts, and finally created tables and loaded data in Postgres. While some of this may seem complicated (especially the command line portion), this is the type of workflow I would use if someone asked me to make a map of data pulled from the Internet. It's the kind of workflow I used as a GIS analyst in a previous life to overcome the shortcomings of spreadsheets and text editors as data cleaning tools. Overall, I'm impressed that I can do all of this in Windows with just QGIS and Postgres.


The following article will cover spatial analysis in QGIS and Postgres. We'll look at QGIS plug-ins and exercise the spatial functions in PostGIS. 



Friday, February 3, 2012

Game Changer: Open Source Mapping in Windows

One of the trepidations I have about teaching mapping courses with open source software is that it usually requires some modicum of programming, which is always way beyond the scope of any beginner class about making maps. In addition, open source tends to favor linux or unix based tool chains that require the config/make/make install tap dance before starting anything. This is akin to telling folks that they need to forge their tools before they can start building a dog house. Finally, 84% of the computers and (in my experience) 100% of the training facilities in the world run some variant of Windows, which means that people have to depend on the kindness of strangers who compile and distribute Windows binaries.


While the sheer number of mapping related programs has exploded, the learning curve has yet to flatten significantly so that the average Windows user can produce maps without sending sizable checks and bits of their soul to Redlands. Until now. 


Two tools, QGIS and Postgres/PostGIS, have had Windows implementations for some time and are well on their way to being polished products with fantastic user communities that extend their capabilities. The QGIS plug-in architecture has fostered a number of repositories with plugins ranging from basic geometry manipulation to domain-specific analysis. The Windows 1.7.3 Wroclaw version also includes GRASS and MSYS, a shell environment containing common unix tools such as sed and awk, mainstays in my data crunching/munging tool kit.


Postgres/PostGIS installation is also very polished thanks to EnterpriseDB's Stack Builder installer, which simplifies PostGIS installation and provides the ODBC drivers needed for communicating with Microsoft products. PostGIS 2.0 has not been released and is not included in the current installer, but the final release will include routing, geocoding, and many improvements that are already available in the 2.0 alpha.


The real game changer is the release of TileMill for Windows. TileMill is a cartographic studio that uses Mapnik, which is described as follows:
"Mapnik is about making beautiful maps. It uses the AGG library and offers world class anti-aliasing rendering with subpixel accuracy for geographic data." 
That means no more of the jagged lines you see on both web maps and print maps produced by commercial GIS programs. TileMill uses MSS (Map Style Sheets) to make maps. It's rather bare bones in that there is no tool to select a feature and set the line style, color, or fill from a menu; you have to create each style in the TileMill editor using MSS. Fortunately, the syntax is simple and mirrors CSS, so designers can pick it up quickly. Calling TileMill a cartographic studio is a stretch; for example, it lacks tools for automatically symbolizing categorical or continuous variables and doesn't include a basic set of cartographic markers. What it does do very well is produce beautiful maps. Labeling is generally well done in terms of distribution, collision avoidance, and kerning along curves. Output formats include png, svg, pdf, and mbtiles, a compact tile format that is gaining traction, especially for mobile devices.


Red dot fever is a meme that won't die*
The lack of a really good cartographic tool has been the Achilles heel for the adoption of open source mapping, but TileMill plugs that deficiency. Windows users, henceforth known as the 84%, can now store and manipulate geospatial data in PostGIS, analyze and modify the data in QGIS, and publish it using TileMill.


This is the introduction to a series of articles on using open source mapping software in Windows. The next installment will cover installation, finding data, and loading it into PostGIS.


*Yes, that's an OSX screenshot, that's what happens when I stay up past my bedtime. On second thought, let's just say I'm lazy.


UPDATE: Second part in this series

Tuesday, January 10, 2012

Importing 2010 SF1 Census in PostgreSQL

The Census Bureau provides a template MS Access database for importing the fixed width/comma delimited SF1 files. Nice if you have MS Access, which I don't.


With the help of MDBTools, I extracted the schema (using the mdb-schema command) from the Census-provided template database. With a little bit of sed/awk/vi munging, I converted the schema into a valid SQL script that creates the SF1 tables in PostgreSQL.
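The extraction step looks something like this; mdb-schema accepts a backend name such as postgres, and the template file name here is a placeholder:

# dump the table definitions from the Access template as Postgres-flavored DDL
mdb-schema SF1_template.mdb postgres > create_SF1_tables_raw.sql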


Create a database and run the script to create the tables
createdb SF1_2010
psql SF1_2010 -f create_SF1_tables.sql
You can import all of the files or just the ones you need. For example, I wanted the total population count by age and sex by census tract. The 2010 Summary File 1 documentation (Chapter 6, the Data Dictionary) shows which file contains the data I wanted: xx000172010.sf1. Use the Postgres COPY command to import the data tables.
COPY sf1_00017 FROM '/path/to/xx000172010.sf1' CSV;
The SF1 release also provides a geo_header file common to all SF1 data files. Unlike the SF1 data files, which are comma delimited, the geo_header file uses fixed-length records. It can be imported with a bit of file munging and some SQL foo. The first step is to read the geoheader into a temporary table with each line as a single record. There's a script for this as well.

CREATE TABLE geo_header_staging (data text);
COPY geo_header_staging FROM '/path/to/dcgeo2010.sf1' DELIMITER AS '|';
Note that "|" is used as a delimiter to read the entire line.  


The second part uses PostgreSQL's string functions to extract the values and insert them into the correct fields in the geo_header table. Writing the SQL by hand would have taken some time, but fortunately the Census provides a SAS import script, which I used as a starting point. With some more sed/awk/cut/paste/vi munging, I produced an import script.
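The script's INSERT follows this pattern. The sketch below covers only a few fields, with offsets taken from my reading of the 2010 SF1 geoheader record layout (FILEID 1-6, STUSAB 7-8, SUMLEV 9-11, LOGRECNO 19-25) and column names assumed to match the template schema:

INSERT INTO geo_header_sf1 (fileid, stusab, sumlev, logrecno)
SELECT trim(substring(data from 1 for 6)),
       substring(data from 7 for 2),
       substring(data from 9 for 3),
       substring(data from 19 for 7)
FROM geo_header_staging;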
psql SF1_2010 -f geo_header_to_postgresql.sql
This gives you a PostgreSQL database with a valid geo_header table that you can use for querying by census geography. The primary key is logrecno, which ties the geography to the data: query for the logrecno of the geography of interest, then query for the data using that logrecno.
SELECT logrecno from geo_header_sf1 WHERE tract ='192200' AND sumlev ='140' AND name = ' Census Tract 1922';

SELECT * FROM sf1_00017 WHERE logrecno ='69229';
After selecting and summarizing the data, I produced a population pyramid. Whew!



* Thank you to @richgibson for corrections

Wednesday, October 26, 2011

Conferences and Workshop Materials

At FOSS4G 2011, I gave a workshop on Deploying Map Services on Amazon Web Services. The materials are here: https://github.com/spara/aws_foss4g_2011


I'll be presenting "Mapping: What in the Wide, Wide World of Sports Is Going On?" at the 2011 Texas GIS Forum. The presentation is here: http://prezi.com/mc7wstzgky5-/mapping-what-in-the-wide-wide-world-of-sports-is-going-on/

Sunday, July 31, 2011

Loading shapefiles into Elasticsearch

Elasticsearch is an open source (Apache 2), distributed, RESTful search engine built on top of Lucene. Elasticsearch supports point data through the geo_point data type, which enables several types of spatial search, such as distance, bounding box, and polygon filters.
Elasticsearch provides a REST API for configuring, loading, and querying data. It also has a bulk loading interface. To load shapefiles into Elasticsearch, I wrote a Ruby script that converts shapefiles to Elasticsearch's bulk loading format. Here's the process for loading multiple shapefiles as multiple types into a single index (think of an index as a 'database' and types as 'tables').

If you haven't installed Elasticsearch, download it, unzip it, and read the README file to start the node.


1. Create the index
curl -XPUT 'http://localhost:9200/geo/'
2. Create a mapping file (place.json) in your favorite editor. This example is for a shapefile called place
{
    "place" : {
        "properties" : {
            "geometry": {
                "properties": {
                    "coordinates": {
                        "type": "geo_point"
                    }
                }
            }
        }
    }
}
curl -XPUT http://localhost:9200/geo/place/_mapping -d @place.json
3. Convert the shapefile to the ES bulk format.
ruby shapefile2elasticsearch.rb > place_data.json
4. Bulk load the data.
curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @place_data.json
5. Test query.
curl -XGET 'http://localhost:9200/geo/place/_search' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_distance": {
                    "distance": "20km",
                    "geometry.coordinates": {
                        "lat": 32.25,
                        "lon": -97.75
                    }
                }
            }
        }
    }
}'
6. Add a second type definition for zipcodes:
{
    "zipcode" : {
        "properties" : {
            "geometry": {
                "properties": {
                    "coordinates": {
                        "type": "geo_point"
                    }
                }
            }
        }
    }
}
curl -XPUT http://localhost:9200/geo/zipcode/_mapping -d @zipcode.json
7. Convert the zipcode shapefile as in step 3, then bulk load the data.
ruby shapefile2elasticsearch.rb > zipcode_data.json
curl -XPUT 'http://localhost:9200/_bulk/' --data-binary @zipcode_data.json
8. Test query.
curl -XGET 'http://localhost:9200/geo/zipcode/_search' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_distance": {
                    "distance": "20km",
                    "geometry.coordinates": {
                        "lat": 32.25,
                        "lon": -97.75
                    }
                }
            }
        }
    }
}'