Fighting Big Tobacco with Big Data
Investigator(s): Robert Proctor, PhD
Award Cycle: 2016 (Cycle 25)
Grant #: 25IP-0017
Award: $229,057
Subject Area: State and Local Tobacco Control Policy Research
Award Type: High Impact Pilot Award
Initial Award Abstract
For over twenty years, the study of millions of pages of formerly secret tobacco industry documents has been one of the main drivers of tobacco control research. Containing internal memos, plans, and research reports, these documents have enabled researchers, journalists and policymakers to prove widespread fraud and deception by the industry. This research has depended on technological advances: once placed online, for example, the documents became accessible to researchers (and to anyone with an Internet connection) from all over the world. More recently, optical character recognition (OCR) has enabled targeted searches for documents matching particular keywords or word strings. For the past ten years or so, however, methods of working with these documents have remained essentially the same: search for documents using keywords or word strings, read those documents, and find new documents or new nodes in a social network using a "snowball" approach.
We propose to use machine learning algorithms to enable a new way to explore the tobacco documents corpus. Machine learning is a field of computer science that studies and creates algorithms that automatically detect meaningful patterns in data. Applied to large text corpora such as the Truth Tobacco Documents Library (TTDL), machine learning can be used to automatically classify documents, organize them into clusters, identify representative documents from these clusters, and find key terms across hundreds of thousands of documents. Algorithms of this sort can guide the exploration of a new topic and suggest documents to study more closely. This approach cannot, of course, replace the detailed study of individual documents, but it can complement traditional methods, enabling new kinds of inquiries and new kinds of scholarly end-products. Thanks to ever-declining costs for computing power, it has now become feasible to apply machine learning algorithms not just to sub-samples of a few thousand documents, but to all of the 14 million documents constituting the TTDL archive. Such an approach can assist scholars working in tobacco control; the tools we are developing for the tobacco context could also prove useful for understanding other large-scale sets of texts from industries such as pharmaceutical, petrochemical, or automotive production.
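As a rough illustration of what "finding key terms" and relating similar documents can mean in practice, the sketch below scores terms with TF-IDF and compares documents by cosine similarity. It is a minimal example under stated assumptions, not the project's actual method: the three short documents, their names, and the helper functions `tokenize`, `tfidf`, `top_terms`, and `cosine` are all hypothetical and are not drawn from the TTDL.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in documents (assumption: NOT real TTDL content).
DOCS = {
    "memo_a": "nicotine delivery research shows nicotine levels in filter design",
    "memo_b": "advertising campaign plan targets youth market advertising budget",
    "memo_c": "filter design research and nicotine delivery report",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return {doc_name: {term: weight}} using a plain TF-IDF weighting."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    weights = {}
    for name, words in tokens.items():
        counts = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


def top_terms(weights, name, k=3):
    """Return the k highest-weighted terms for one document (ties sorted A-Z)."""
    ranked = sorted(weights[name].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


weights = tfidf(DOCS)
```

With these toy inputs, `top_terms(weights, "memo_b", k=1)` picks out the repeated word in the marketing memo, and `cosine` rates the two research memos as more alike than either is to the marketing memo. At the scale of millions of documents, an established library and a proper clustering algorithm would be the more realistic choice; this sketch only shows the underlying idea.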
We will make the tools we develop publicly available on a website already under development: www.tobacco-analytics.org. The original design of these applications is grounded in our needs as historians and public health advocates. The methods we are developing should help us better understand how the tobacco industry coordinated misinformation campaigns across nominally competing companies, and how knowledge held in confidence by the industry differed from what was publicly available. We see tobacco control research as a collaborative and interdisciplinary endeavor, however, and so want other researchers to be able to use these tools online to answer their own questions.