Fighting Big Tobacco with Big Data

Institution:	Stanford University
Investigator(s):	Robert Proctor, PhD
Award Cycle:	2016 (Cycle 25)
Grant #:	25IP-0017
Award:	$229,057
Subject Area:	State and Local Tobacco Control Policy Research
Award Type:	High Impact Pilot Award

Abstracts

Initial Award Abstract

For over twenty years, the study of millions of pages of formerly secret tobacco industry documents has been one of the main drivers of tobacco control research. Containing internal memos, plans, and research reports, these documents have enabled researchers, journalists, and policymakers to prove widespread fraud and deception by the industry. This research has depended on technological advances: once placed online, for example, the documents became accessible to researchers (and to anyone with an Internet connection) from all over the world. More recently, optical character recognition (OCR) has enabled targeted searches for documents matching particular keywords or word strings. For the past ten years or so, however, methods of working with these documents have remained essentially the same: search for documents using keywords or word strings, read those documents, and find new documents or new nodes in a social network using a "snowball" approach.
What if, instead, you could go to a website capable of plotting out the social network of a person of interest, grouping similar people together, and identifying the words that best describe the connections? What if an algorithm, given 30,000 newspaper articles containing the word “nicotine,” could automatically identify topics in these articles, for example the FDA’s attempt to regulate cigarettes as nicotine delivery devices, the Waxman hearings, or the rising number of lawsuits brought against the tobacco industry? What if it could then also provide links to the documents most representative of each topic, prioritized by degree of affinity?

We propose to use machine learning algorithms to enable a new way to explore the tobacco documents corpus. Machine learning is a field of computer science that studies and creates algorithms that automatically detect meaningful patterns in data. Applied to large text corpora such as the Truth Tobacco Documents Library (TTDL), machine learning can be used to automatically classify documents, organize them into clusters, identify representative documents within these clusters, and find key terms across hundreds of thousands of documents. Algorithms of this sort can guide the exploration of a new topic, organize documents into coherent clusters, and suggest documents to study more closely. This approach cannot, of course, replace the detailed study of individual documents, but it can complement traditional methods, enabling new kinds of inquiries and new kinds of scholarly end-products. Thanks to ever-declining costs for computing power, it has now become feasible to apply machine learning algorithms not just to sub-samples of a few thousand documents, but to all of the 14 million documents constituting the TTDL archive. Such an approach can assist scholars working in the area of tobacco control; tools of the sort we are developing for use in the tobacco context could also prove useful for understanding other large-scale sets of texts from industries involved in pharmaceutical, petrochemical, or automotive production.
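
To make the idea of finding key terms and suggesting related documents more concrete, here is a minimal Python sketch. It is not the project's method: TF-IDF weighting and cosine similarity are used only as simple, well-known stand-ins for the techniques described above, the four sample texts are invented, and all function names (`tfidf_vectors`, `key_terms`, `most_similar`) are hypothetical.

```python
import math
from collections import Counter

# Invented toy texts standing in for OCR'd documents (assumption, not TTDL data).
DOCS = {
    "doc_a": "nicotine addiction research memo nicotine delivery",
    "doc_b": "nicotine delivery device regulation hearing",
    "doc_c": "advertising campaign youth market advertising budget",
    "doc_d": "advertising budget memo youth campaign",
}


def tokenize(text):
    """Lowercase and split on whitespace; real OCR text would need more cleanup."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return {doc_id: {term: TF-IDF weight}} for a dict of raw texts."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    n_docs = len(docs)
    vectors = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        vectors[doc_id] = {
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def key_terms(vector, k=3):
    """Return the k highest-weighted terms of one document vector."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def most_similar(doc_id, vectors):
    """Return the id of the other document closest to doc_id by cosine similarity."""
    scored = [
        (cosine(vectors[doc_id], vec), other)
        for other, vec in vectors.items()
        if other != doc_id
    ]
    return max(scored)[1]


if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    for doc_id in DOCS:
        print(doc_id, key_terms(vectors[doc_id]), "->", most_similar(doc_id, vectors))
```

On a corpus the size of the TTDL, a sketch like this would have to be replaced by sparse-matrix libraries and approximate nearest-neighbor search, but the underlying idea is the same: weight terms by how distinctive they are, then compare documents by the terms they share.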

We will make the tools that we develop publicly available on a website already under development: www.tobacco-analytics.org. In their original design, these applications are grounded in our needs as historians and public health advocates. The methods we are developing should enable us to better understand how the tobacco industry coordinated misinformation campaigns across nominally competing companies, and how knowledge held in confidence by the industry differed from that available publicly. We see tobacco control research as a collaborative and interdisciplinary endeavor, however, and therefore want to make these tools available publicly online, so other researchers can use them to answer their own questions.