Thursday, April 5, 2012

Whereconf 2012: The Dead Pool and Also-Rans

My talk at Whereconf 2012 was originally meant to cover companies in the location business that failed or were acquired for less than the money they raised. My source of information was CrunchBase, a freely editable online database of companies, people, and investors. Because it is crowdsourced, the quality of the data varies, and for the most part the dead pool and acquisition information was not sufficient for that analysis. I decided instead to look at acquisitions and IPOs to see how the location industry was doing.


Methodology


My first task was to gather a list of location-oriented companies. CrunchBase tags companies with terms such as location, geo, or gps. Although I could have used these tags to build a pool of location-oriented companies, the tags varied too much to reliably determine whether a company was a location company. I decided to use a different method. Each company entry in CrunchBase includes a list of competitors, so I wrote a depth-first search program that follows these competitor lists to gather companies. I used a set of well-known companies such as FourSquare, SkyHook, Loopt, and SimpleGeo as seeds. As expected, this process generated a list with many duplicates; with a bit of experimentation, I determined that three iterations generated a sufficient number of companies once the duplicates were removed. Running the seed list through the depth-first search program yielded 1945 unique companies. A number of those entries lacked basic information such as founding date or money raised; I eliminated them, which reduced the list to 1505 companies with sufficient data for analysis.
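The competitor crawl described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script used for the talk: the COMPETITORS dictionary is a made-up stand-in for CrunchBase competitor lists, the rival_* names and every link between companies are hypothetical, and collect_companies and max_depth are names chosen here for illustration, with max_depth playing the role of the "three iterations" limit.

```python
# Hypothetical stand-in for CrunchBase competitor lists (company -> competitors).
# None of these links are real data; they only give the search something to walk.
COMPETITORS = {
    "foursquare": ["loopt", "rival_a"],
    "loopt": ["foursquare", "rival_b"],
    "rival_a": ["foursquare"],
    "rival_b": ["rival_c"],
    "rival_c": [],
    "skyhook": ["simplegeo"],
    "simplegeo": ["skyhook", "rival_d"],
    "rival_d": [],
}


def collect_companies(seeds, get_competitors, max_depth=3):
    """Depth-first walk of competitor lists, at most max_depth hops from a seed.

    Companies are kept in a dict keyed by name, so duplicates collapse
    automatically. The value is the shortest hop count found so far.
    """
    best_depth = {name: 0 for name in seeds}
    stack = list(seeds)
    while stack:
        name = stack.pop()
        depth = best_depth[name]
        if depth >= max_depth:
            continue  # do not expand past the depth limit
        for rival in get_competitors(name):
            # Visit new companies, or revisit ones now reachable by a shorter path.
            if rival not in best_depth or depth + 1 < best_depth[rival]:
                best_depth[rival] = depth + 1
                stack.append(rival)
    return set(best_depth)


def lookup(name):
    """Return the competitor list for a company, or an empty list if unknown."""
    return COMPETITORS.get(name, [])


companies = collect_companies(["foursquare", "skyhook"], lookup, max_depth=3)
print(len(companies), sorted(companies))
```

Recording the shortest hop count, rather than a plain visited set, keeps the result independent of the order in which the stack happens to be explored. In a real crawl, lookup would be replaced by a call to whatever API client is in use.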


Initially, I attempted to segment the companies into groups based on descriptive tags using hierarchical clustering, but the tags were too varied to produce meaningful clusters. Fortunately, companies can be assigned to a category in CrunchBase, which is a handy way to segment the data set. The pie chart below shows the different categories.



Location companies by CrunchBase category


Refining the location business segments


I wanted to further narrow the data set by focusing on the companies that make up the majority of the location industry, so I ran hierarchical clustering again, this time on category. The result was five clusters represented by six categories.




The higher a category or term occurs in the plot, the more frequently it appears in the data set. Terms that are closer to each other are more related; the plot shows that the categories are quite discrete.


A look at the raw counts and percentages shows that roughly 20% of companies are acquired and, depending on the category, 2% to 7% reach an initial public offering.


Raw counts

Category      Total  Acquired  IPO
advertising     110        29    5
ecommerce       114        18    8
games_video     221        46    8
mobile          182        34    6
software        185        36   10
web             693       136   13


Percentages

Category      Acquired  IPO
advertising        26%   5%
ecommerce          16%   7%
games_video        21%   4%
mobile             19%   3%
software           19%   5%
web                20%   2%
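As a quick check, the percentages can be recomputed from the raw counts. The sketch below is in Python and uses only the numbers from the raw counts table; the names RAW_COUNTS, percentages, and overall are chosen here for illustration and are not from the original scripts.

```python
# Raw counts from the table above: category -> (total, acquired, ipo).
RAW_COUNTS = {
    "advertising": (110, 29, 5),
    "ecommerce": (114, 18, 8),
    "games_video": (221, 46, 8),
    "mobile": (182, 34, 6),
    "software": (185, 36, 10),
    "web": (693, 136, 13),
}


def percentages(counts):
    """Return category -> (acquired %, IPO %), rounded to whole percents."""
    return {
        category: (round(100 * acquired / total), round(100 * ipo / total))
        for category, (total, acquired, ipo) in counts.items()
    }


def overall(counts):
    """Return (total companies, acquired %, IPO %) across all categories."""
    total = sum(row[0] for row in counts.values())
    acquired = sum(row[1] for row in counts.values())
    ipo = sum(row[2] for row in counts.values())
    return total, round(100 * acquired / total), round(100 * ipo / total)


for category, (acquired_pct, ipo_pct) in percentages(RAW_COUNTS).items():
    print(f"{category:<12} {acquired_pct:>3}% {ipo_pct:>3}%")
print(overall(RAW_COUNTS))  # (1505, 20, 3)
```

The category totals add up to the 1505 companies in the data set. Across all categories, about 20% were acquired and about 3% had an IPO.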


A closer look at the companies that had an IPO revealed several companies in some of the categories that were clearly not location companies (these included major mobile carriers such as Verizon). These were removed and counts were adjusted.


Adjusted IPO counts and percentages

Category      IPO  Percentage
advertising     5          5%
ecommerce       4          4%
games_video     8          4%
mobile          0          0%
software        7          4%
web             4          1%



Who's acquiring companies 


I created a word cloud to show which companies are acquiring businesses in the location space. Companies such as Google, Microsoft, and Yahoo are doing most of the acquiring, but a lot of companies with "Media" as part of their name stand out.
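A word cloud is a picture of term frequencies, so the same view can be approximated with a simple count. The sketch below is in Python and is only an illustration: the acquirer list is made up ("Example Media Group" and "Sample Media" are placeholders, and the number of times each name appears is arbitrary), and term_frequencies is a name chosen here, not part of the original scripts.

```python
import re
from collections import Counter


def term_frequencies(names):
    """Count how often each lowercase word appears across a list of names."""
    counts = Counter()
    for name in names:
        counts.update(re.findall(r"[a-z0-9]+", name.lower()))
    return counts


# Hypothetical acquirer names, for illustration only.
acquirers = [
    "Google",
    "Example Media Group",
    "Google",
    "Sample Media",
    "Microsoft",
    "Yahoo",
]

for term, count in term_frequencies(acquirers).most_common(3):
    print(term, count)
```

Counting words rather than whole names is what makes a shared term such as "Media" stand out, even when no single company with that word in its name makes many acquisitions.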

CrunchBase includes a field for the media source announcing acquisitions. Google, Social Media, and Video are prominent terms.



How much money did these companies raise prior to acquisition?
For the top three categories (web, games, and mobile), I generated descriptive statistics for the amount of money they raised. Note that I removed all companies that do not have data for money raised.

Amount of money raised by web companies 


Count                  300
Mean           $31,183,226
Median          $5,900,000
Mode            $1,000,000
Minimum            $15,000
Maximum     $1,160,166,511
Range       $1,160,151,511
Sum         $9,354,967,849

The results are impressive, with over $9 billion raised. However, the results are skewed because they include Facebook, Twitter, Groupon, and AOL, each of which raised multiple large rounds. The statistic to focus on is the mode, which shows the amount companies most commonly raised.
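To see how a few very large raises distort the mean while leaving the median and mode alone, here is a small Python sketch using the standard statistics module. The amounts in raised are invented for illustration and are not taken from the CrunchBase data; describe is a name chosen here.

```python
from statistics import mean, median, multimode


def describe(amounts):
    """Return the same descriptive statistics reported in the tables."""
    return {
        "count": len(amounts),
        "mean": mean(amounts),
        "median": median(amounts),
        "mode": multimode(amounts),  # list of the most common value(s)
        "minimum": min(amounts),
        "maximum": max(amounts),
        "range": max(amounts) - min(amounts),
        "sum": sum(amounts),
    }


# Invented funding totals in dollars: six modest raises and one very large one.
raised = [15_000, 1_000_000, 1_000_000, 2_500_000,
          5_900_000, 12_000_000, 1_000_000_000]

for label, value in describe(raised).items():
    print(f"{label:<8} {value}")
```

In this made-up sample, the single billion-dollar raise pulls the mean to roughly $146 million, while the median stays at $2.5 million and the mode at $1 million.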

Amount of money raised by game/video companies


Count                 99
Mean         $21,756,267
Median       $11,015,719
Mode          $5,000,000
Minimum          $15,000
Maximum     $183,240,000
Range       $183,225,000
Sum       $2,153,870,470

The mode, at $5 million, is quite a bit larger than for companies in the web category. This is most likely because web companies have long tails on both ends of the distribution.

Amount of money raised by mobile companies


Count                 92
Mean         $22,577,161
Median       $11,075,000
Mode          $4,000,000
Minimum          $11,000
Maximum     $168,000,000
Range       $167,989,000
Sum       $2,077,098,772

Of note, the mode is similar to companies in the game/video category.

Closing thoughts 


CrunchBase is essentially a wiki, with all the strengths and weaknesses of crowdsourced data. It is a good place to begin exploring the performance of technology businesses, especially location-based startups. While there are databases for established companies, there are few openly available resources on startups.


Using DFS on business competitors is a good way to generate a list of businesses and startups in the location space, but an extra step is needed to filter out businesses that are adjacent to the location space but not in it. Whether this can be done with clever text mining of tags or other ancillary data remains to be seen. Using the predetermined categories to segment the pool of location businesses is a reasonable approach and allows for comparison across other business verticals.


Basic descriptive statistics have been provided, but conclusions can't be drawn on the basis of the data presented. However, this provides a first rough sketch of location-based startups and businesses.


Data availability


Data and the scripts used to generate the data and analyses are available at: https://github.com/spara/whereconf2012


Acknowledgments

  1. Word clouds generated at http://www.wordle.net/
  2. crunchbase gem was forked from https://github.com/tylercunnion/crunchbase
  3. R data loading script from http://www.cloudstat.org/index.php?do=/kaichew/blog/crunchbase-api-scrapping-companies-info-from-crunchbase-with-r-packages/
  4. R hierarchical clustering script from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/





