In Search of Data Scientists

Being asked recently about the emergence of data scientists (who they are and what they do) gave me an opportunity to think about how employers delineate occupational roles. As a job title, “data scientist” is becoming more common in some industry circles, but not yet common enough to be defined in Wikipedia! Having spent the last five years educating analytics professionals, I was curious to see how the roles of data scientist and analytics professional might overlap or diverge in practice.

Wanting to learn more, I did what I suppose a data scientist would do: I went looking for data. An obvious place to start is Linkedin. Not because the company itself is one of the premier places where you’ll find employees who hold the title data scientist (indeed it is), but because Linkedin is a treasure trove of real-time data about occupations, how people define their professional lives, and how those definitions evolve over time. Scraping data from Linkedin is relatively easy and, I confess, something that has become a habit for me (see my exploration of talent migration surrounding Hadoop). To be sure, Linkedin’s data scientists are in a better position to analyze their own data, and perhaps someday they will look into the subject more comprehensively than I can here.

Using Linkedin’s search functionality, I queried the phrase “data scientist” in the job title field (current or past title). At the time of the search (October 2011), it identified a total of 394 Linkedin members located in the U.S. Limiting the search to individuals with “data scientist” in their current title reduced the number to 287 members. The search phrase could be all or part of the job title (e.g., “Principal Data Scientist”).

Q1: “data scientist” in current or past title = 394

Four hundred or so is not a particularly large number as far as occupations go. To put “data scientist” in context, members with “data” or “scientist” in their title are vastly larger populations.

Q2: “data” in current or past title = 261,717

Q3: “scientist” in current or past title = 212,540

To add further context, I ran a number of other searches on descriptive terms or phrases in job titles that deal with data analysis. The occupational role of “statistician” is of particular interest. By the standard definition (from Wikipedia and elsewhere), statistics is the study of the collection, organization, analysis, and interpretation of data.

Q4: “statistician” as current or past title = 10,766

It’s fair to say that statisticians—an occupation with a long and distinguished history and one of the oldest professional societies in the country—are close to the ground when it comes to data analysis. Whether data scientists are the new statisticians or something different, I will leave it to others to sort out.

Not to intentionally stir the pot, but I couldn’t resist the temptation to search for how many Linkedin members in the data world identify their occupation as a “scientist” or an “engineer.” Data engineers currently outnumber data scientists by more than 4 to 1.

Q5: “data engineer” as current or past title = 1,652
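For readers who want to check the arithmetic behind these comparisons, here is a minimal Python sketch. The counts are the ones reported in the queries above and in the October 2011 search; the names `counts`, `current_data_scientists`, and `ratio` are illustrative only and are not part of any Linkedin tool.

```python
# Counts reported in this post (U.S. Linkedin members, October 2011),
# each for a term appearing in a current or past job title.
counts = {
    "data scientist": 394,   # Q1
    "data": 261_717,         # Q2
    "scientist": 212_540,    # Q3
    "statistician": 10_766,  # Q4
    "data engineer": 1_652,  # Q5
}

# Members with "data scientist" in their *current* title only.
current_data_scientists = 287


def ratio(term: str, baseline: str = "data scientist") -> float:
    """Return how many members hold `term` for each member holding `baseline`."""
    return counts[term] / counts[baseline]


def current_share() -> float:
    """Return the fraction of all data scientists who hold the title currently."""
    return current_data_scientists / counts["data scientist"]


if __name__ == "__main__":
    for term in ("data engineer", "statistician"):
        print(f"{term}: {ratio(term):.2f} per data scientist")
    print(f"share with current title: {current_share():.1%}")
```

Under these figures, data engineers come out at a little over four per data scientist, which is consistent with the "more than 4 to 1" statement above.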

More interesting to me, as the director of an educational program in analytics, is the expansive use of the term “analytics” in job titles.

Q6: “analytics” in current or past title = 34,077

This suggests that analytics, as an occupational role that builds on math and statistics, goes well beyond the organizational space inhabited by both data scientists and statisticians, especially in industry.

Back to the question of who data scientists are and where we can find them. Looking more closely at the data, it becomes apparent that the occupation “data scientist” shows up in two distinctly different industry contexts. In the pharmaceutical industry, where usage of the phrase is most frequent, it appears in job titles much earlier in time. However, digging deeper into individual member profiles, one discovers the predominant title in the pharmaceutical industry is “clinical data scientist” (which accounts for about one-quarter of all data scientists in the identified population).

The occupational role of clinical data scientist is defined in a way that is quite different from today’s discussion of data scientists. Clinical data scientists have lower levels of education, on average, and their education is frequently in a branch of information and (to use an earlier term) library science. Today’s data scientist in the computer software and Internet sector tends to have a high level of education, typically a master’s or Ph.D., in fields such as computer science and statistics.

(Note: the classification of industry in Linkedin is user defined and subject to inconsistencies in the data.)

The split between software/Internet and the pharmaceutical industry can be seen in the company-level data.

The geographic location of data scientists is split largely between the East Coast (dominated by pharmaceutical companies) and West Coast, particularly in Silicon Valley (dominated by computer software and Internet companies).

On a global basis, the geographic distribution of data scientists is predominantly in the U.S., followed by the UK. The number of data scientists located in other countries drops off significantly (into the single digits), though it should be noted that language differences outside of predominantly English-speaking countries could mask the true numbers.

Where did (or do) data scientists go to school? Linkedin data provides insight into this question, as well.

The term “science” is sometimes misused as a label in occupations or fields that have little or no actual resemblance to formal science. While I am not advocating or discouraging its usage, I certainly understand the current attraction to embrace the data scientist as a new occupational category by which we define individuals skilled in the art and science of deriving insights from vast quantities and varieties of data.

Michael Rappa (October 29, 2011) Comments welcome