Cusick Awarded Ph.D. – Master of Science in Analytics

Congratulations to doctoral candidate Mark Cusick, who successfully defended his Ph.D. dissertation, “Human Generated Topics: A Gold Standard for Automated Topic Evaluation,” which aims to make better sense of the vast amount of text, such as product reviews, generated on the web. Specializing in natural language processing, machine learning, and text mining, Cusick has worked as a research assistant to the Institute’s director, Dr. Michael Rappa, since enrolling in the Ph.D. program in Computer Science in 2011. In August, Dr. Cusick will join Amazon.com as a data scientist in the company’s Cambridge (Mass.) research office.

Dissertation Abstract: With the ever-increasing amount of large-scale online text corpora, including product reviews, news articles, research papers, and social media posts, researchers have proposed numerous automated topic generation approaches to help users understand and navigate this information. In this dissertation we propose an approach that improves upon previous research, and we introduce a means to directly and accurately compare the results of automated approaches against human judgments of usefulness.

Many automated approaches utilize strong statistical methods but often produce topics poorly aligned with human judgments. Other approaches use taxonomy resources to consistently produce well-aligned topics, but at the cost of weak or explicit means of term selection. We introduce our Extensible Topic Model (ETM): an approach combining statistical term selection with topic generation based on an external taxonomy. Based on a comparative evaluation, ETM successfully combines these two forms of topic generation and produces topics aligned with human judgments.

Automated topic generation approaches often evaluate their results by using either easily manipulated statistical functions or costly human evaluation studies. We introduce Human Generated Topics (HGT): a repeatable, low-cost approach for generating a set of gold-standard topics for any text corpus, requiring only non-expert judges. We use HGT to create a gold standard of topics within the domain of online retail product reviews, and we validate and demonstrate the ability of HGT to directly and accurately compare the results of automated approaches.

Committee:
Dr. Michael Rappa (Chair)
Dr. Christopher Healey
Dr. Aric LaBarr
Dr. Oleg Veryovka
Dr. Benjamin Watson