Tackling the Law of Text and Data Mining for Computational Research

Guest post by Dave Hansen, Executive Director of Authors Alliance (and a former Duke Libraries staff member) and co-PI of “Text and Data Mining: Demonstrating Fair Use,” a project supported by the Mellon Foundation.


Over the last several years, Duke, like many other institutions, has made a significant investment in computational research, recognizing that such research techniques can have wide-ranging benefits. From translational research in the biomedical sciences to the digital humanities, this work can be, and has been, transformative. Much of this work relies on researchers being able to engage in text and data mining (TDM) to produce the datasets necessary for large-scale computational analysis. For the sciences, this can range from compiling research data across a whole series of research projects to collecting large numbers of research articles for computer-aided systematic reviews. For the humanities, it may mean assembling a corpus of digitized books, DVDs, music, or images to analyze how language, literary themes, or depictions have changed over time.

The Law of Text and Data Mining

The techniques and tools for text and data mining have advanced rapidly, but one constant for TDM researchers has been a fear of legal risk. For datasets composed of copyrighted works, the risk of liability can seem staggering. With copyright’s statutory damages set as high as $150,000 per work infringed, a corpus of several hundred works can cause real concern.

However, the risks of simply avoiding copyrighted works are also high. Given the extensive reach of copyright law, avoiding protected or unlicensed works can mean narrowing research to extremely limited datasets, which can in turn lead to biased and incomplete results. For example, avoiding copyright means, for many researchers, using very old public domain source materials, which skews their scholarship toward works written by authors who do not represent the diverse voices found in modern publications.

Thankfully, there is a legal pathway forward for TDM researchers. Unlike the situation in most other nations, where text and data mining has benefited from special enabling legislation, the United States has instead relied on fair use, the flexible copyright doctrine that has been key to US innovation policy. While fair use has the reputation of being nebulous and confusing (you might recall hearing it described as the “right to hire a lawyer”), there are good reasons to believe that, with appropriate safeguards, non-commercial academic research is reliably protected by fair use. Only a handful of recent efforts have focused on helping researchers better understand the scope of these fair use rights for TDM research. For example, UC Berkeley spearheaded an NEH-funded project to build legal literacies for text and data mining in 2020. I’m happy to say that Authors Alliance, a nonprofit that supports authors who research and write for the public benefit, is working to further advance understanding of fair use as applied to TDM research through new resources and direct consultation with researchers under a new Mellon Foundation-supported project titled “Text and Data Mining: Demonstrating Fair Use.”

Unfortunately, the fair use question isn’t the only legal hurdle for text and data mining research. For researchers who seek to use modern digital works (for example, ebooks available only in ePub format, or movies available only on DVD), a whole series of other laws can stand in the way. In particular, under the Digital Millennium Copyright Act (the “DMCA,” a creature of late-90s copyright and information policy), Congress created a special set of restrictions on users of digital materials, seeking to let copyright owners protect their works with digital locks, such as DRM, to prevent online piracy. The DMCA imposes significant liability on users of copyrighted works who circumvent technological protection measures (e.g., the Content Scramble System for DVDs) unless those users comply with one of a series of complex exemptions promulgated by the U.S. Copyright Office.

In 2021, Authors Alliance, the Association of Research Libraries, and the American Association of University Professors joined together to successfully petition the U.S. Copyright Office for such a DMCA exemption for text and data mining in support of academic research. That exemption now allows researchers to circumvent technological protection measures that restrict access to literary works and motion pictures. Like other exemptions, it is complicated and includes requirements such as the implementation of strict security measures. But it is not impenetrable, especially with clear guidance.

An Invitation to Learn with Us About Legal Issues in Text and Data Mining

To that end, I’m pleased that Duke University Libraries, the Franklin Humanities Institute, and other units at Duke are working with Authors Alliance to take the lead in helping researchers overcome legal obstacles to TDM. Together, this spring we will host a series of workshops for faculty, librarians, and others at Duke as well as other Triangle-area universities. On March 23, we’ll host a workshop focused on legal issues in TDM using textual materials, and then on April 4, another workshop on TDM with visual and audiovisual materials. Each workshop will offer an overview of the state of the law as applied to TDM, practical tips and guidance, and substantial hands-on discussion about how to address particular challenges. We also plan to use these workshops to gather feedback about where the law is confusing or, in its current state, inadequate for researchers. That work is done with an eye toward identifying ways to improve the law to make computational research using TDM techniques more accessible and efficient.

All are invited to join. You can register for these workshops below.

Legal Issues in Text and Data Mining: Literature and Text-Based Works

Thursday, March 23
12:00 – 1:00 p.m. (Lunch Provided)
The Edge Workshop Room (Bostock Library 127)
Register to attend

Legal Issues in Computational Research Using Images and Audiovisual Works

Tuesday, April 4
2:00 – 3:00 p.m.
Ahmadieh Family Lecture Hall (Smith Warehouse, Bay 4, C105)
Register to attend