Methodology
My first task was to gather a list of location-oriented companies. CrunchBase tags companies with terms such as location, geo, or gps. Although I could have used these tags to create a pool of location-oriented companies, they varied too much to reliably indicate whether a company was a location company, so I used a different method. Each company entry in CrunchBase includes a list of competitors, so I wrote a depth-first search program to gather companies by following those competitor links. I used a set of well-known companies such as Foursquare, Skyhook, Loopt, and SimpleGeo to generate a seed list. As expected, this process generated a list with many duplicates; with a bit of experimentation, I determined that three iterations generated a sufficient number of companies once the duplicates were removed. I then ran my seed list through the depth-first search program, which yielded 1945 unique companies. A number of those entries lacked basic information such as founding date or money raised; I eliminated them, which reduced the list to 1505 companies with sufficient data for analysis.
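The competitor crawl described above can be sketched roughly as follows. This is a minimal illustration in Python, not the program actually used (which relied on a forked crunchbase Ruby gem). The function name `collect_competitors`, the `competitors_of` lookup, the toy graph, and the reading of "three iterations" as a depth limit of three are all assumptions made for the example.

```python
def collect_competitors(seeds, competitors_of, max_depth=3):
    """Depth-limited DFS over competitor links; returns the unique names found.

    `competitors_of` is a hypothetical callable mapping a company name to
    the list of competitor names on its entry.
    """
    best_depth = {}  # company name -> shallowest depth at which it was reached
    stack = [(name, 0) for name in seeds]
    while stack:
        name, depth = stack.pop()
        # Skip a company already reached at the same or a shallower depth.
        if name in best_depth and best_depth[name] <= depth:
            continue
        best_depth[name] = depth
        if depth < max_depth:
            for rival in competitors_of(name):
                stack.append((rival, depth + 1))
    # The dict keys are unique by construction, so duplicates are gone.
    return set(best_depth)


# Hypothetical competitor graph standing in for CrunchBase entries.
TOY_GRAPH = {
    "seed_co": ["rival_a", "rival_b"],
    "rival_a": ["seed_co", "rival_c"],
    "rival_b": ["rival_c"],
    "rival_c": ["rival_d"],
    "rival_d": ["rival_e"],
    "rival_e": [],
}


def toy_lookup(name):
    return TOY_GRAPH.get(name, [])


found = collect_competitors(["seed_co"], toy_lookup, max_depth=3)
print(sorted(found))
```

With a depth limit of three, `rival_e` (four links away from the seed) is not collected, which is how the limit keeps the crawl from wandering too far from the seed companies.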
Initially, I attempted to segment the companies into groups based on their descriptive tags using hierarchical clustering, but the tags varied too much from company to company to form meaningful clusters. Fortunately, each company in CrunchBase can be assigned to a category, which is a handy way to segment the data set. The pie chart below shows the different categories.
Location companies by CrunchBase category
Refining the location business segments
I wanted to narrow the data set further by focusing on the companies that make up the majority of the location industry, so I ran the hierarchical clustering again, this time using category. The result was five clusters represented by six categories.
The higher a category or term occurs in the plot, the more frequently it appears in the data set. Terms that are closer to each other are more related; the plot shows that the categories are quite discrete.
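For readers unfamiliar with hierarchical clustering, the idea of grouping terms by how often they occur together can be sketched in a few lines of Python. This is a simplified single-linkage version using Jaccard distance, not the R script used for the plots. The term data, the function names, and the 0.5 distance cutoff are hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of companies two terms have in common."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def agglomerate(term_to_companies, max_distance=0.5):
    """Single-linkage agglomerative clustering of terms.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `max_distance`.
    """
    clusters = [frozenset([term]) for term in sorted(term_to_companies)]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(
            jaccard_distance(term_to_companies[x], term_to_companies[y])
            for x in c1
            for y in c2
        )

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical terms, each mapped to the companies tagged with it.
TERM_TO_COMPANIES = {
    "geo": {"co1", "co2", "co3"},
    "gps": {"co2", "co3", "co4"},
    "maps": {"co1", "co2", "co3", "co4"},
    "games": {"co5", "co6"},
    "video": {"co6", "co7"},
}

for cluster in agglomerate(TERM_TO_COMPANIES):
    print(sorted(cluster))
```

Terms that share most of their companies merge early, while terms with little or no overlap stay in separate clusters, which is the pattern a dendrogram makes visible.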
The raw counts and percentages show that roughly 20% of companies were acquired and that between 2% and 7%, depending on category, reached an initial public offering.
Raw counts
Category | Total | Acquired | IPO
advertising | 110 | 29 | 5
ecommerce | 114 | 18 | 8
games_video | 221 | 46 | 8
mobile | 182 | 34 | 6
software | 185 | 36 | 10
web | 693 | 136 | 13
Percentages
Category | Acquired | IPO
advertising | 26% | 5%
ecommerce | 16% | 7%
games_video | 21% | 4%
mobile | 19% | 3%
software | 19% | 5%
web | 20% | 2%
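As a quick check, the percentages can be recomputed from the raw counts. The short Python sketch below copies the counts from the raw counts table; the names `RAW_COUNTS`, `to_percent`, `rates_by_category`, and `overall_rates` are introduced here for illustration only and are not part of the original scripts.

```python
# Raw counts copied from the table above: (total, acquired, ipo).
RAW_COUNTS = {
    "advertising": (110, 29, 5),
    "ecommerce": (114, 18, 8),
    "games_video": (221, 46, 8),
    "mobile": (182, 34, 6),
    "software": (185, 36, 10),
    "web": (693, 136, 13),
}


def to_percent(part, whole):
    """Share of `whole` as a whole-number percentage."""
    return round(100 * part / whole)


def rates_by_category(counts):
    """Map each category to (acquired %, IPO %)."""
    return {
        category: (to_percent(acquired, total), to_percent(ipo, total))
        for category, (total, acquired, ipo) in counts.items()
    }


def overall_rates(counts):
    """(acquired %, IPO %) across all categories combined."""
    total = sum(t for t, _, _ in counts.values())
    acquired = sum(a for _, a, _ in counts.values())
    ipo = sum(i for _, _, i in counts.values())
    return to_percent(acquired, total), to_percent(ipo, total)


for category, (acquired_pct, ipo_pct) in rates_by_category(RAW_COUNTS).items():
    print(f"{category}: acquired {acquired_pct}%, IPO {ipo_pct}%")
print("overall:", overall_rates(RAW_COUNTS))
```

Computing the combined rate from the summed counts, rather than averaging the six category percentages, keeps the large web category from being underweighted.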
A closer look at the companies that had an IPO revealed several companies in some of the categories that were clearly not location companies (these included major mobile carriers such as Verizon). These were removed and counts were adjusted.
Adjusted counts and percentages
Category | IPO count (adjusted) | IPO percentage (adjusted)
advertising | 5 | 5%
ecommerce | 4 | 4%
games_video | 8 | 4%
mobile | 0 | 0%
software | 7 | 4%
web | 4 | 1%
Who's acquiring companies
I created a word cloud to show which companies are acquiring businesses in the location space. Companies such as Google, Microsoft, and Yahoo are doing most of the acquiring, but many companies with "Media" in their name also stand out.
CrunchBase includes a field for the media source announcing acquisitions. Google, Social Media, and Video are prominent terms.
How much money did these companies raise prior to acquisition?
For the top three categories (web, games, and mobile), I generated descriptive statistics for the amount of money they raised. Note that I removed all companies that do not have data for money raised.
Amount of money raised by web companies
Count | 300
Mean | $31,183,226
Median | $5,900,000
Mode | $1,000,000
Minimum | $15,000
Maximum | $1,160,166,511
Range | $1,160,151,511
Sum | $9,354,967,849
The total is impressive, with over $9 billion raised; however, the results are skewed because they include Facebook, Twitter, Groupon, and AOL, all of which raised multiple large rounds. The more informative statistic is the mode, which shows the amount companies most commonly raised.
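To see why a handful of very large raises distorts the mean but leaves the median and mode largely untouched, consider the following Python sketch. The amounts in `SAMPLE_RAISED` are hypothetical and chosen only for illustration; they are not CrunchBase data, and `describe` is a name made up for this example.

```python
import statistics


def describe(amounts):
    """Descriptive statistics matching the rows of the tables in this post."""
    return {
        "count": len(amounts),
        "mean": statistics.mean(amounts),
        "median": statistics.median(amounts),
        "mode": statistics.mode(amounts),
        "minimum": min(amounts),
        "maximum": max(amounts),
        "range": max(amounts) - min(amounts),
        "sum": sum(amounts),
    }


# Hypothetical funding totals in dollars; the last value plays the role of
# a single company with several very large rounds.
SAMPLE_RAISED = [
    1_000_000,
    1_000_000,
    2_000_000,
    5_000_000,
    6_000_000,
    10_000_000,
    500_000_000,
]

summary = describe(SAMPLE_RAISED)
for name, value in summary.items():
    print(f"{name}: {value:,}")

# Dropping the single outlier changes the mean drastically,
# while the mode stays the same.
without_outlier = describe(SAMPLE_RAISED[:-1])
print("mean without outlier:", f"{without_outlier['mean']:,}")
```

In this toy sample the mean is many times the median, which is the same pattern seen in the web category, where the mean is roughly five times the median.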
Amount of money raised by game/video companies
Count | 99
Mean | $21,756,267
Median | $11,015,719
Mode | $5,000,000
Minimum | $15,000
Maximum | $183,240,000
Range | $183,225,000
Sum | $2,153,870,470
At $5 million, the mode is considerably larger than for companies in the web category. This is most likely because the web category has long tails at both ends of its distribution.
Amount of money raised by mobile companies
Count | 92
Mean | $22,577,161
Median | $11,075,000
Mode | $4,000,000
Minimum | $11,000
Maximum | $168,000,000
Range | $167,989,000
Sum | $2,077,098,772
Of note, the mode is similar to companies in the game/video category.
Closing thoughts
CrunchBase is essentially a wiki, with all the strengths and weaknesses of crowdsourced data. It is a good place to begin exploring the performance of technology businesses, especially location-based startups. While there are databases for established companies, there are few openly available resources on startups.
Using DFS on business competitors is a good way to generate a list of businesses and startups in the location space, but an extra step is needed to filter out businesses that are adjacent to the location space but not in it. Whether this can be done with clever text mining of tags or other ancillary data remains to be seen. Using the predetermined categories to segment the pool of location businesses is a reasonable approach and allows for comparison across other business verticals.
Basic descriptive statistics have been provided, but conclusions can't be drawn on the basis of the data presented. However, this provides a first sketch of location-based startups and businesses.
Data availability
Data and the scripts used to generate the data and analyses are available at: https://github.com/spara/whereconf2012
Acknowledgments
- Word clouds generated at http://www.wordle.net/
- crunchbase gem was forked from https://github.com/tylercunnion/crunchbase
- R data loading script from http://www.cloudstat.org/index.php?do=/kaichew/blog/crunchbase-api-scrapping-companies-info-from-crunchbase-with-r-packages/
- R hierarchical clustering script from http://heuristically.wordpress.com/2011/04/08/text-data-mining-twitter-r/