In Search of Hadoop

During a panel discussion at INFORMS 2011, a question was posed about the significance of Hadoop for graduate programs in Analytics like ours. My response, based on a general impression from conversations with the 100+ companies that have sought to recruit our students, was to suggest that Hadoop is important and should be looked at, but right now it shows up mostly on the “West Coast” where its use is common in Internet-based companies. By comparison, Hadoop is less common today in other parts of the country, among banks, insurers, pharmaceutical companies and other data-dominated industries.

One source of data that could be used to examine this question is Linkedin: members who use Hadoop may be inclined to mention it in their profiles. By no means a perfect measure, it can perhaps offer insight into the geographic and industry distributions of Hadoop users.

While it would be ideal to work with Linkedin’s data scientists to perform the analysis, a quick and easy approach is to use the advanced search function on the site. “Hadoop” is reasonably unambiguous as far as keywords go. A search of Linkedin members yields the following frequency distributions.

Michael Rappa (updated November 25, 2011)

Please send comments to me.


(1) The Linkedin members represented in the data may be associated with more than one school in their profile, to account for both undergraduate and graduate degrees earned. Whether or not a member actually used Hadoop while in school cannot be inferred from the data.

(2) Current company designations in a profile do not always correlate with a single geo-location. IBM, for example, has large technical organizations on both coasts; however, the west coast operation shows roughly three times as many member profiles mentioning Hadoop (46 percent) as the next largest concentration, in the NYC area (14 percent).

(3) The second most frequent country with Hadoop profile mentions is India, and within India the single most frequent company designation is Yahoo! (about 10 percent of the total).

(4) Clearly Hadoop is very important to a certain industry segment clustered mostly around the S.F. Bay area. However, this does not mean Hadoop is unimportant elsewhere or that its importance will not grow in the future.

(5) The data represent a single snapshot in time. The growing significance of Hadoop may become more apparent when another snapshot is taken, perhaps 12 months from now.

(6) Data used here are self-reported by Linkedin’s users. Their accuracy and freshness depend on individual users and the frequency of their profile updates. Including the keyword “Hadoop” in a profile may imply different things depending on the user. Talent recruiters, for example, may list the technical areas in which they specialize. Conversely, some individuals who are deeply involved with Hadoop may neglect to mention it in their profiles.

(7) The analysis of Yahoo! migration patterns admittedly pushes the limit of what can be done with Linkedin data using the advanced search function. The numbers will vary slightly depending on the approach, though the basic distributions remain largely intact.

(8) Linkedin data have tremendous potential for understanding all manner of technological trends and the migration patterns of skills among technical professionals. Surely, the Linkedin data science team is working on it!
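
As a rough illustration of the arithmetic behind note (2), the sketch below turns per-category counts into percentage shares and compares two of them. It is only a sketch under stated assumptions: the function names percentage_shares and share_ratio are hypothetical, and the input is expressed per 100 profiles using the two percentages quoted in note (2), with the remainder lumped into a single "Elsewhere" category. It is not a record of how the search results were actually tabulated.

```python
from typing import Dict


def percentage_shares(counts: Dict[str, float]) -> Dict[str, float]:
    """Convert counts per category into percentage shares, largest first."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {label: 100.0 * n / total for label, n in ordered}


def share_ratio(shares: Dict[str, float], first: str, second: str) -> float:
    """Return how many times larger the share of `first` is than that of `second`."""
    if shares[second] == 0:
        raise ValueError("cannot compare against a zero share")
    return shares[first] / shares[second]


if __name__ == "__main__":
    # Per-100-profile figures taken from the percentages in note (2);
    # "Elsewhere" is simply the remainder (100 - 46 - 14), an assumption
    # made here so that the shares sum to 100.
    ibm_counts = {"West Coast": 46, "NYC area": 14, "Elsewhere": 40}

    shares = percentage_shares(ibm_counts)
    for label, pct in shares.items():
        print(f"{label}: {pct:.0f} percent")

    ratio = share_ratio(shares, "West Coast", "NYC area")
    print(f"West Coast vs. NYC area: {ratio:.1f}x")
```

The ratio is computed between two named categories rather than between the two largest shares, because a lumped remainder such as "Elsewhere" is not a single geographic concentration.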


NC State University
