Data Lost, but not Forgotten

Intern Experience: Kaylee Alexander (Data & Visualization Services)

This post by Kaylee Alexander, 2019 Humanities Unbounded Graduate Assistant, is part of a series on graduate students’ “Intern Experience” at Duke University Libraries. 

With the growing popularity of digital humanities projects, the question of how humanists should manage data, and specifically missing data and data limitations, is of increasing importance. Often the glittering possibilities of integrating technology and data-driven research methods into historical analysis make us forget that we are still dealing with imperfect information, albeit processed in new and meaningful ways. In my own research on 19th-century funerary monuments in Paris, the issue of survival bias has been pervasive, as very few tombs (only the most expensive) have survived into the present day.

Survival bias occurs when we focus on people or things that have passed through a selection process and overlook those that haven’t. In 1943, for example, damaged bomber planes returning from combat were being studied to identify areas that needed additional reinforcements. However, these planes had survived. What about those that didn’t? Where had they sustained damage? This was the question posed by statistician Abraham Wald, who argued that damage to returned planes represented not where improvements were needed, but rather where planes could sustain damage and still return safely. It was the undamaged areas that were more telling.

Diagram showing areas of damage to returned WWII bomber planes (red) and recommended areas for reinforcement based on Wald’s analysis.

Historical studies are, not surprisingly, prone to such survival biases. Objects and documents get lost or damaged; others are not deemed worthy of being kept. Some information is just never recorded. But, just like Wald’s returned bomber planes, what does survive can be used to consider what we’ve lost. This is a concept that I work with all of the time, and a bias that my work specifically tries to overcome through data-driven practices. However, it is not something that I had yet considered in the context of inherited datasets.

As a Humanities Unbounded Graduate Assistant with Duke Libraries’ Data and Visualization Services, I began working with members of the Representing Migration Humanities Lab in preparation for their Data+ project, “Remembering the Middle Passage.” The project was led by English professor Charlotte Sussman, and one of its original goals was to use data representing nearly 36,000 transatlantic slave voyages to see if it would be possible to map a reasonable location for a deep-sea memorial to the transatlantic slave trade. Data on these voyages had been compiled and made openly accessible online by a team of researchers working with the Emory Center for Digital Scholarship (among others). The promises of these data were great; we just had to figure out how to use them.

My primary task was getting to know the data and providing support in preparation for the upcoming Data+ session. So, I began with the Slave Voyages website.

Home page for the Slave Voyages website: https://www.slavevoyages.org/

The landing page for the database boasts that “this digital memorial raises questions about the largest slave trades in history and offers access to the documentation available to answer them.” Here, you can view and download data on these voyages as well as access summary tables and interactive data visualizations, timelines, and maps, allowing users to easily interact with a wealth of information. Clearly labeled columns, filled with rows of data, project an image of endless research possibilities with all the data you could ever need.

Web-based interface to voyages data in the Trans-Atlantic Slave Trade Database.

However, the online interactive database only represents about half of the variables included in the full version of the database, which can be downloaded, but certainly isn’t as user-friendly as the front-facing version. One of the most glaring things I noticed when I first opened this file was all of the empty cells.

Excel sheet showing the full version of the Trans-Atlantic Slave Trade Database downloaded from https://slavevoyages.org/voyage/downloads.

It soon became clear that the online version only included a selection of the most complete variables (many of which were estimates based on original sources).

One of the first things I do when working with a new dataset in my own work is to create an overview of all of my variables and the percentage of records that have a value for each variable. This provides me with useful insights into how complete my data are, and also how reliable certain variables will be for the types of questions I want to ask. I find this to be particularly useful when working with data that I have not compiled myself, even when a codebook already exists, as it helps you quickly become familiar with exactly what you have and what might be possible. More often than not, I end up revising my research questions as a result of this process. So, I wondered how this might help the Data+ team set their goals.
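For readers who want to try this kind of overview themselves, here is a minimal sketch in Python, assuming the data have been exported to a CSV file. It is only an illustration, not the workflow used for the project: the function name `completeness_overview`, the column names, and the sample values are all invented placeholders rather than the database's actual variables, and the choice to count blank or whitespace-only cells as missing is an assumption.

```python
import csv
import io


def completeness_overview(rows, fieldnames):
    """Return (variable, percent of records with a non-empty value) pairs,
    sorted from most to least complete."""
    total = len(rows)
    overview = []
    for name in fieldnames:
        # Assumption: None, empty strings, and whitespace-only cells
        # all count as missing.
        filled = sum(1 for row in rows if (row.get(name) or "").strip())
        percent = 100.0 * filled / total if total else 0.0
        overview.append((name, percent))
    return sorted(overview, key=lambda pair: pair[1], reverse=True)


# Invented sample standing in for a downloaded CSV file. The column names
# and values are placeholders, not the database's actual variables.
SAMPLE_CSV = """record_id,vessel_name,year,people_embarked
1,Alpha,1790,300
2,,1791,
3,Beta,,
4,,1795,
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
overview = completeness_overview(rows, reader.fieldnames)

# Print one line per variable, most complete first.
for name, percent in overview:
    print(f"{name:<16} {percent:5.1f}%")
```

To run this against a real file, the `io.StringIO(SAMPLE_CSV)` argument could be swapped for an opened file object. Sorting from most to least complete puts the sparsest variables at the bottom of the list, which makes it easier to see which research questions the data can and cannot support.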

While the original questions of the project had been formed around mortality and how to map the experiences of enslaved people on board these voyages, a reconsideration of the data showed that answers to these questions would only be attainable for a fraction of the voyages in the database, and would tell us nothing about any voyages that hadn’t been accounted for.

The question of all this missing data then became an essential part of the research project. How could all these gaps inform us about what isn’t there? Why were data missing, and how could we use this to think more broadly about erasure in the context of the slave trade? If our goal was to memorialize lives lost, how could we best and most appropriately accomplish this given the data we didn’t have?

There is still much work to be done before we can even begin answering these questions, and I leave that in the capable hands of the Data+ team and the Representing Migration Lab. But until then, my takeaway is this: missing data should not become forgotten data. Knowing what we’re working with, whether it be inherited data or data we’ve constructed, and being aware of the data we’re missing, allows us to reformulate our research objectives in new and more meaningful ways.

Kaylee P. Alexander is a Ph.D. Candidate in the Department of Art, Art History & Visual Studies, where she is also a research assistant with the Duke Art, Law & Markets Initiative (DALMI). Her dissertation research focuses on the visual culture of the cemetery and the market for funerary monuments in nineteenth-century Paris. In the spring of 2019, she served as a Humanities Unbounded graduate assistant with Data and Visualization Services at Duke University Libraries. Follow her on Twitter @kpalex91