Good News from the DPC: Digitization of Behind the Veil Tapes is Underway

This post was written by Jen Jordan, a graduate student at Simmons University studying Library Science with a concentration in Archives Management. She is the Digital Collections intern with the Digital Collections and Curation Services Department. Jen will complete her master's degree in December 2021.

The Digital Production Center (DPC) is thrilled to announce that work is underway on a three-year National Endowment for the Humanities (NEH) grant-funded project to digitize the entirety of Behind the Veil: Documenting African-American Life in the Jim Crow South, an oral history project that produced 1,260 interviews spanning more than 1,800 audio cassette tapes. Accompanying the 2,000-plus hours of audio is a sizable collection of visual materials (e.g., photographic prints and slides) that connects with the recorded voices.

We are here to summarize the logistical details of digitizing this incredible collection. To learn more about its historical significance and the grant that is funding this project, titled “Documenting African American Life in the Jim Crow South: Digital Access to the Behind the Veil Project Archive,” please take some time to read the July announcement written by John Gartrell, Director of the John Hope Franklin Research Center and Principal Investigator for this project. Giao Luong Baker, Digital Production Services Manager, is the grant’s Co-Principal Investigator.

Digitizing Behind the Veil (BTV) will require, in part, the services of outside vendors to handle the audio digitization and subsequent captioning of the recordings. While the DPC regularly digitizes audio recordings, we are not equipped to do so at this scale (while balancing other existing priorities). The folks at Rubenstein Library have already been hard at work double-checking the inventory to ensure that each cassette tape and case is labeled with identifiers. The DPC then received the tapes, filling 48 archival boxes, along with a digitization guide (i.e., an Excel spreadsheet) containing detailed metadata for each tape in the collection. Upon receiving the tapes, DPC staff set to work boxing them for shipment to the vendor. As of this writing, the boxes are snugly wrapped on a pallet in Perkins Shipping & Receiving, where they will soon begin their journey to a digital format.

The wait has begun! In eight to twelve weeks we anticipate receiving the digital files, at which point we will perform quality control (QC) on each one before sending them off for captioning. As the captions are returned, we will run through a second round of QC. From there, the files will be ingested into the Duke Digital Repository, at which point our job is complete. Of course, we still have the visual materials to contend with, but we’ll save that for another blog! 

As we creep closer to the two-year mark of the COVID-19 pandemic and the varying degrees of restrictions that have come with it, the DPC will continue to focus on fulfilling patron reproduction requests, which have comprised the bulk of our work for some time now. We are proud to support researchers by facilitating digital access to materials, and we are equally excited to have begun work on a project of the scale and cultural impact that is Behind the Veil. When finished, this collection will be accessible for all to learn from and meditate on—and that’s what it’s all about. 

 

Auditing Archival Description for Harmful Language: A Computer and Community Effort

This post was written by Miriam Shams-Rainey, a third-year undergraduate at Duke studying Computer Science and Linguistics with a minor in Arabic. As a student employee in the Rubenstein's Technical Services Department in the summer of 2021, Miriam helped build a tool to audit archival description in the Rubenstein for potentially harmful language. In this post, she summarizes her work on that project.

The Rubenstein Library has collections ranging across centuries. Its collections are massive and often contain rare manuscripts or one-of-a-kind data. However, with this wide-ranging history often comes language that is dated and harmful, and often racist, sexist, homophobic, and/or colonialist. As important as it is to find and remediate these instances of potentially harmful language, there is a lot of data that must be searched.

With over 4,000 collection guides (finding aids) and roughly 12,000 catalog records describing archival collections, archivists would need to spend months combing through their metadata to find harmful or problematic language before even starting to find ways to handle it. That is, unless there were a way to optimize this workflow.

Working under Noah Huffman’s direction and the imperatives of the Duke Libraries’ Anti-Racist Roadmap, I developed a Python program capable of finding occurrences of potentially harmful language in library metadata and recording them for manual analysis and remediation. What would have taken months of work can now be done in a few button clicks and ten minutes of processing time. Moreover, the tools I have developed are accessible to any interested parties via a GitHub repository to modify or expand upon.

Although these gains in speed push metadata language remediation efforts at the Rubenstein forward significantly, a computer can only take this process so far; once uses of this language have been identified, the responsibility of determining the impact of the term in context falls onto archivists and the communities their work represents. To this end, I have also outlined categories of harmful language occurrences to act as a starting point for archivists to better understand the harmful narratives their data uphold and developed best practices to dismantle them.

Building an automated audit tool

Audit Tool GUI Screenshot
The simple, yet user-friendly interface that allows archivists to customize the search audit to their specific needs.

I created an executable that allows users to interact with the program regardless of their familiarity with Python or with using their computer’s command line. With an executable, all a user must do is click on the program (titled “description_audit.exe”) and the script will load with all of its dependencies in a self-contained environment. There’s nothing that a user needs to install, not even Python.
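Bundling a script and its dependencies into a single executable like this is typically done with a packaging tool such as PyInstaller; the post doesn’t name the tool used, so take this one-line build command as an illustrative assumption rather than the project’s actual build step:

pyinstaller --onefile description_audit.py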

Within this executable, I also created a user interface to allow users to set up the program with their specific audit parameters. To use this program, users should first create a CSV file (spreadsheet) containing each list of words they want to look for in their metadata.

Snippet of Lexicon CSV
Snippet from a sample lexicon CSV file containing harmful terms to search

In this CSV file of “lexicons,” each category of terms should have its own column. For example, RaceTerms could be the header of a column of terms such as “colored” or “negro,” and GenderTerms could be the header of a column of gendered terms such as “homemaker” or “wife.” See these lexicon CSV file examples.
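For instance, a lexicon CSV with just those two categories might look like this (purely illustrative; real lexicons run much longer):

RaceTerms,GenderTerms
colored,homemaker
negro,wife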

Once this CSV has been created, users can select it in the program’s user interface and then choose which columns of terms the program should use when searching across the source metadata. Users can either use all lexicon categories (all columns) by default or specify a subset by typing out those column headers. For the Rubenstein’s purposes, there is also a rather long lexicon called HateBase (from a regional, multilingual database of potential hate speech terms often used in online moderation) that is only enabled when a checkbox is checked; users from other institutions can download the HateBase lexicon for themselves and use it, or they can simply ignore it.

In the CSV reports that are output by the program, matches for harmful terms and phrases will be tagged with the specific lexicon category the match came from, allowing users to filter results to certain categories of potentially harmful terms.

Users also need to designate a folder on their desktop where report outputs should be stored, along with the folder containing their source EAD records in .xml format and their source MARCXML file containing all of the MARC records they wish to process as a single XML file. Results from MARC and EAD records are reported separately, so only one type of record is required to use the program; however, both can be provided in the same session.

How archival metadata is parsed and analyzed

Once users submit their input parameters in the GUI, the program begins by accessing the specified lexicons from the given CSV file. For each lexicon, a “rule” is created for a SpaCy rule-based matcher, using the column name (e.g. RaceTerms or GenderTerms) as the name of the specific rule. The same SpaCy matcher object identifies matches to each of the several lexicons or “rules”. Once the matcher has been configured, the program assesses whether valid MARC or EAD records were given and starts reading in their data.
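The full implementation lives in the GitHub repository mentioned above; as a minimal sketch of this step, assuming SpaCy’s PhraseMatcher and a lexicon file named lexicons.csv (both stand-ins, not necessarily the tool’s exact choices), the rule setup might look like this:

import csv
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # only SpaCy's tokenizer is needed here
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # match regardless of case

# Each column header in the lexicon CSV (e.g. RaceTerms) names one rule.
lexicons = defaultdict(list)
with open("lexicons.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for category, term in row.items():
            if term:  # columns can have different lengths
                lexicons[category].append(term)

for category, terms in lexicons.items():
    matcher.add(category, [nlp.make_doc(term) for term in terms])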

To access important pieces of data from each of these records, I used a Python library called BeautifulSoup to parse the XML files. For each individual record, the program parses the call numbers and collection or entry name so that information can be included in the CSV reports. For EAD records, the collection title and component titles are also parsed to be analyzed for matches to the lexicons, along with any data that is in a paragraph (<p>) tag. For MARC records, the program also parses the author or creator of the item, the extent of the collection, and the timestamp of when the description of the item was last updated. In each MARC record, the 520 field (summary) and 545 field (biography/history note) are concatenated and analyzed as a single entity.
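Here is a hedged sketch of the EAD side of that parsing (the file name is a placeholder, and the real tool’s field choices may differ; the tag names follow the EAD standard):

from bs4 import BeautifulSoup  # plus lxml installed for the "xml" parser

with open("collection_ead.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "xml")

unitid = soup.find("unitid")        # call number
unittitle = soup.find("unittitle")  # collection title
record = {
    "call_number": unitid.get_text(strip=True) if unitid else "",
    "collection_title": unittitle.get_text(strip=True) if unittitle else "",
    # every <p> element (scope and content notes, etc.) is collected too
    "paragraphs": [p.get_text(" ", strip=True) for p in soup.find_all("p")],
}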

Data from each record is stored in a Python dictionary with the names of fields (as strings) as keys mapping to the collection title, call number, etc. Each of these dictionaries is stored in a list, with a separate structure for EAD and MARC records.

Once data has been parsed and stored, each record is checked for matches to the given lexicons using the SpaCy rule-based matcher. For each record, any matches that are found are then stored in the dictionary with the matching term, the context of the term (the entire field or surrounding few sentences, depending on length), and the rule the term matches (such as RaceTerms). These matches are found using simple tokenization from SpaCy, which allows matches to be identified quickly and without regard for punctuation, capitalization, etc.
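Continuing the matcher sketch above (nlp and matcher are defined there), matches and their context could be collected like so; the sentencizer supplies the sentence boundaries that span.sent needs:

nlp.add_pipe("sentencizer")  # enables span.sent on a blank pipeline

doc = nlp("An album of photographs of colored troops at camp.")
matches = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    matches.append({
        "rule": nlp.vocab.strings[match_id],  # e.g. RaceTerms
        "term": span.text,
        "context": span.sent.text,            # surrounding sentence
    })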

Although this process doesn’t necessarily use the cutting edge of natural language processing that the SpaCy library makes accessible, it is adaptable in ways that matching procedures like regular expressions often aren’t. Moreover, identifying and remedying harmful language is a fundamentally human process which, at the end of the day, needs a significant level of input both from historically marginalized communities and from archivists.

Matches to any of the lexicons, along with all other associated data (the record’s call number, title, etc.), are then written into CSV files for further analysis and categorization by users. You can see sample CSV audit reports here. The second phase of manual categorization is still a lengthy process, yielding roughly 14,600 matches from the Rubenstein Library’s EAD data and 4,600 from its MARC data that must still be read through and analyzed by hand, but the process of identifying these matches has been computerized to take a mere ten minutes, where it could otherwise be a months-long process.
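A final sketch of that reporting step, again with assumed column names (the real reports carry more fields), building on the record and matches from the sketches above:

import csv

with open("ead_audit_report.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["call_number", "collection_title", "rule", "term", "context"],
    )
    writer.writeheader()
    for match in matches:
        writer.writerow({
            "call_number": record["call_number"],
            "collection_title": record["collection_title"],
            **match,
        })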

Categorizing matches: an archivist and community effort

An excerpt of initial data returned by the audit program for EAD records. This data should be further categorized manually to ensure a comprehensive and nuanced understanding of these instances of potentially harmful language.

To better understand these matches and create a strategy to remediate the harmful language they represent, it is important to consider each match in several different facets.

Looking at the context provided with each match allows archivists to understand the way in which the term was used. The remediation strategy for the use of a potentially harmful term in a proper noun used as a positive, self-identifying term, such as the National Association for the Advancement of Colored People, for example, is vastly different from that of a white person using the word “colored” as a racist insult.

The three ways in which I proposed we evaluate context are as follows:

  1. Match speaker: who was using the term? Was the sensitive term being used as a form of self-identification or reclaiming by members of a marginalized group, was it being used by an archivist, or was it used by someone with privilege over the marginalized group the term targets (e.g. a white person using an anti-Black term or a cisgender straight person using an anti-LGBTQ+ term)? Within this category, I proposed three potential categories for uses of a term: in-group, out-group, and archivist. If a term is used by a member (or members) of the identity group it references, its use is considered an in-group use. If the term is used by someone who is not a member of the identity group the term references, that usage of the term is considered out-group. Paraphrasing or dated term use by archivists is designated simply as archivist use.
  2. Match context: how was the term in question being used? Modifying the text used in a direct quote or a proper noun constitutes a far greater liberty by the archivist than removing a paraphrased section or completely archivist-written section of text that involved harmful language. Although this category is likely to evolve as more matches are categorized, my initial proposed categories are: proper noun, direct quote, paraphrasing, and archivist narrative.
  3. Match impact: what was the impact of the term? Was this instance a false positive, wherein the use of the term was in a completely unrelated and innocuous context (e.g. the use of the word “colored” to describe the colors used in visual media), or was the use of the term in fact harmful? Was the use of the term derogatory, or was it merely a mention of politicized identities? In many ways, determining the impact of a particular term or use of potentially harmful language is a community effort; if a community member with a marginalized identity says that the use of a term in that particular context is harmful to people with that identity, archivists are in no position to disagree or invalidate those feelings and experiences. The categories that I’ve laid out initially–dated original term, dated Rubenstein term, mention of marginalized issues, mention of marginalized identity, downplaying bias (e.g. calling racism and discrimination an issue with “race relations”), dehumanization of marginalized people, false positive–only hope to serve as an entry point and rudimentary categorization of these nuances to begin this process.
A short excerpt of categorized EAD metadata

Here you can find more documentation on the manual categorization strategy.

Categorizing each of these instances of potentially harmful language remains a time-consuming, meticulous process. Although much of this work can be computerized, decolonization is a fundamentally human and fundamentally community-centered practice. No computer can dismantle the colonial, white supremacist narratives that archival work often upholds. This work requires our full attention and, for better or for worse, a lot of time, even with the productivity boost technology gives us.

Once categories have been established, at least on a preliminary level, I found that about 100-200 instances of potentially harmful language could be manually parsed and categorized in an hour.

Conclusion

Decolonization and anti-racist efforts in archival work are an ongoing process. It is bound to take active learning, reflection, and lots of remediation. However, using technology to start this process creates a much less daunting entry point. Anti-racism work is essential in archival spaces.

The ways we talk about history can either work to uphold traditional white supremacist, racist, ableist, etc. narratives, or they can work to dismantle them. In many ways, archival work has often upheld these narratives in the past; however, this audit represents the sincere beginnings of work to further equitable narratives in the future.

On Protecting Patron Privacy

First, a bit of history

Back in the summer of 2018, calls for applications for the National Web Privacy Forum started circulating around the library community. I’ll be honest — at that point I knew almost nothing about how libraries protect patron privacy. That summer I’d been conducting a library data inventory, interviewing stakeholders of various data systems across the library, and I had just gotten my first hints of some of the processes we use to protect the data we collect from patrons.

Long story short, Duke Libraries submitted an application to the Forum, we were selected, and I attended. The experience was really meaningful, and it gave me a nice overview of the various issues that affect a library’s ability to protect patron privacy. The following spring (2019), the leaders of the National Forum released an action handbook that recommended conducting a data privacy audit, and DUL undertook such an audit during the Fall of 2019. The results of that audit suggested that we still have a bit of work to do to make sure all of our systems are working together to protect our patrons.

Forming a task force

In response to the audit report, Duke Libraries charged a task force called the Data Privacy and Retention Task Force. Despite the pandemic and lockdown, this task force started meeting in the spring of 2020, and we met biweekly for the rest of the year. Our goals were to develop guiding principles and priorities around data privacy and retention, as well as to recommend specific project work that should be undertaken to improve our systems.

The task force included staff members from across the various divisions of the library. Pretty quickly, we determined that we all come with different experiences around patron privacy. We decided to begin with a sort of book club, identifying and reviewing introductory materials related to different components of patron privacy, from web analytics to the GDPR to privacy in archives and special collections. Once we all felt a bit more knowledgeable, we turned our attention to creating a statement of our priorities and principles.

Defining our values

There are a lot of existing statements of library values, and many make mention of patron privacy. Other documents that cover privacy values include regulatory documents and organizational privacy statements. Some of the statements we reviewed include:

  • Duke University Libraries' Strategic Plan

While these statements are all relevant, the task force found some of them far too general to truly guide action for an organization. We were looking to create a document that outlined more specifics and helped us make decisions about how to organize our work. At Duke Libraries, we already have one document we use to organize our work and make decisions — our strategic plan.

When we reviewed the strategic plan, we noticed that for each section of the plan, a focus on patron privacy resulted in a set of implications for our work. To express these implications, we devised a rough hierarchy of directed action, indicating our ability and obligation to undertake certain actions.  We use the following terms in our final report:

For actions within our sphere of influence:

  • obligation: DUL should devote significant time and resources toward this work
  • responsibility: DUL should make a concerted effort toward this work, but the work may not receive the same attention and resources as that devoted to our obligations

For actions outside our sphere of influence:

  • commitment: DUL will need to partner with other groups to perform this work and thus cannot promise to accomplish all tasks

An example of our principles and priorities

One section from our strategic plan is Strategic Priority #2: Our Libraries Teach and Support Emerging Literacies. Within this priority, the strategic plan identifies the following goals:

  1. Expand the presence of library staff in the student experience in order to understand and support emerging scholarship, information, data, and literacy needs.
  2. Mentor first-year students in scholarly research and learning practices, embracing and building upon their diverse backgrounds, prior knowledge, literacies, and expectations as they begin their Duke experience.
  3. Partner with faculty to develop research methods, curricula, and collaborative projects connecting their courses to our collections.
  4. Enhance the library instruction curriculum, focusing on standards and best practices for pedagogy that will prepare users for lifelong learning in a global and ever-changing research environment.

In our final report, Priorities and Guiding Principles for Protecting Patron Privacy, we identify the following actions for this same strategic priority:

  • We have an obligation to communicate in plain language what data we and our partners collect while providing our services.
  • We have a responsibility to provide education, tools, and collection materials to shed light on the general processes of information exchange behind technology systems.
  • We commit to partnering with researchers seeking to understand the effects of information exchange processes and related policy interventions.

We now have the strategic plan, which outlines types of activities we might undertake, and the new report on protecting patron privacy, which adds to that list new activities and methods to achieve patron privacy protections in each area.

Next steps

The final work of the task force was to propose new project work based on our identified priorities and principles. The task force will share a list of recommended projects with library administration, who will start the hard work of evaluating these projects and identifying staff to undertake them. In the meantime, we hope the report will offer immediate guidance to staff for considerations they should be taking in different areas of their work, as well as serving as a model for future documents that guide our efforts.

Automated Accessibility Testing and Continuous Integration

We use several different tools and strategies at the Duke Libraries to ensure that our web interfaces are accessible. Our aim is to comply with WCAG 2.0 AA and Section 508 guidelines.

One of our favorite accessibility checking tools in our toolbox is the axe DevTools Browser Extension by Deque Systems. It’s easy to use, whether on a live site, or in our own local development environments while we’re working on building new features for our applications. Simply open your browser’s Developer Tools (F12), click the Scan button, and get an instant report of any violations, complete with recommendations about how to fix them.

screenshot of axe DevTools browser extension in action
An axe DevTools test result for our archival finding aids homepage.

Now, it’s one thing to make software compliant; it’s another to keep it that way. Duke’s Web Accessibility office astutely notes:

Keeping a website compliant is a continuous effort. Websites are living things. Content changes, features are added. When a site becomes compliant it does not stay compliant.

One of the goals we set out to accomplish in 2021 was to figure out how to add automated, continuous accessibility testing for our ArcLight software, which powers our archival finding aids search and discovery application. We got it implemented successfully a few months ago and we’re pleased with how it has been working so far.

The solution: using the Deque Systems Axe Core RSpec gem, along with a Selenium Standalone Chrome Docker image, in our GitLab CI pipeline.

Say what?

Let me back up a bit and give some background on the concepts and tools upon which this solution depends. Namely: Continuous Integration, RSpec, Capybara, Selenium, and Docker.

Continuous Integration

For any given software project, we may have several developers making multiple changes to a shared codebase in the same day. Continuous Integration is the practice of ensuring that these code changes 1) don’t break any existing functionality, and 2) comply with established guidelines for code quality.

At Duke Libraries, we use GitLab to host code repositories for most of our software (e.g., our ArcLight application). It’s similar to GitHub, with built-in tooling for DevOps. One such tool is GitLab CI/CD, which we use to manage continuous integration. It’s common for us to configure pipelines to run automated tasks for each new code branch or tag, in at least a few stages, e.g.: build > test > deploy.

screenshot of GitLab CI pipeline stages
Evidence of a GitLab CI pipeline that has run — and passed all stages — for a code change.

For accessibility testing, we knew we needed to add something to our existing CI pipeline that would check for compliance and flag any issues.

RSpec Testing Framework

Many of our applications are built using Ruby on Rails. We write tests for our code using RSpec, a popular testing framework for Rails applications. Developers write tests (using the framework’s DSL / domain-specific language) to accompany their code changes. Those tests all execute as part of our CI pipeline (see above). Any failing tests will prevent code from being merged or getting deployed to production.

RSpec logo

There are many different types of tests.  On one end of the spectrum, there are “unit tests,” which verify that one small piece of code (e.g., one method) returns what we expect it to when it is given different inputs. On the other, there are “feature tests,” which typically verify that several pieces of code are working together as intended in different conditions. This often simulates the use of a feature by a person (e.g., when a user clicks this button, test that they get to this page and verify this link gets rendered, etc.). Feature tests might alternatively be called “integration tests,” “acceptance tests,” or even “system tests” — the terminology is both squishy and evolving.

At any rate, accessibility testing is a specific kind of feature test.

Capybara

On its own, RSpec unfortunately doesn’t support feature tests; it requires a companion piece of software called Capybara, which can simulate a user interacting with a web interface. Capybara brings with it a DSL to visit pages, fill out forms, or click on elements within RSpec tests, and special matchers to check that the page is behaving as intended.

screenshot of Capybara homepage
Homepage for Capybara, featuring an actual capybara.

When configuring Capybara, you set up the driver you want it to use when running different kinds of tests. Its default driver is RackTest, which is fast, but it can’t execute JavaScript like a real web browser can. Our ArcLight UI, for instance, uses a bunch of JavaScript. So we knew that any accessibility tests would have to be performed using a driver for an actual browser; Capybara’s default driver alone wouldn’t cut it.

Selenium

Writing code to drive various browsers was probably a nightmare until Selenium came along. Selenium is an “umbrella project for a range of tools and libraries that enable and support the automation of web browsers.” Its WebDriver platform gives you a language-agnostic coding interface that is now compatible with all the major web browsers. That makes it a valuable component in automated browser testing.

screenshot of Selenium WebDriver website
Documentation for Selenium WebDriver for browser automation

The best way we could find to get Capybara to control a real browser in an RSpec test was to add the selenium-webdriver gem to our project’s Gemfile.

Docker

Over the past few years, we have evolved our DevOps practice and embraced containerizing our applications using Docker. Complex applications that have a lot of interwoven infrastructure dependencies used to be quite onerous to build, run, and share. Getting one’s local development environment into shape to successfully run an application used to be a whole-day affair. Worse still, the infrastructure on the production server used to bear little resemblance to what developers were working with on their local machines.

systems diagram for ArcLight infrastructure
A systems diagram depicting the various services that run to support our ArcLight app.

Docker helps a dev team configure in their codebase all of these system dependencies in a way that’s easily reproducible in any environment. It builds a network of “containers” on a single host, each running a service that’s crucial to the application. Now a developer can simply check out the code, run a couple of commands, wait a few minutes, and they’re good to go.

The same basic setup also applies in a production environment (with a few easily-configurable differences). And that simplicity also carries over to the CI environment where the test suite will run.

Systems diagram for ArcLight, highlighting containerized components of the infrastructure
Orange boxes depict each service / container we have defined in our Docker configuration.

So what we needed to do was add another container to our existing Docker configuration that would be dedicated to running any JavaScript-dependent feature tests — including accessibility tests — in a browser controlled by Selenium WebDriver.

Keeping this part containerized would hopefully ensure that the exact same browser version and drivers get used whether the tests run in a developer’s local environment or in the CI pipeline. We can steer clear of any “well, it worked on my machine” issues.

Putting it All Together

Phew. OK, with all of that background out of the way, let’s look closer at how we put all of these puzzle pieces together.

Gems

We had to add the following three gems to our Gemfile‘s test group:

group :test do
  gem 'axe-core-rspec' # accessibility testing
  gem 'capybara'
  gem 'selenium-webdriver'
end

A Docker Container for Selenium

The folks at Selenium HQ host “standalone” browser Docker images in Docker Hub, complete with the browser software and accompanying drivers. We found a tagged version of their Standalone Chrome image that worked well, and pull that into our newly-defined “selenium” container for our test environments.

In docker-compose.test.yml

services:
  selenium:
    image: selenium/standalone-chrome:3.141.59-xenon
    ports:
      - 4444:4444
    volumes:
      - /dev/shm:/dev/shm
    environment:
      - JAVA_OPTS=-Dwebdriver.chrome.whitelistedIps=
      - START_XVFB=false
[...]

Since this is new territory for us, and already fairly complex, we’re starting with just one browser: Chrome. We may be able to add more in the future.

Capybara Driver Configuration

The next thing we needed to do was tell Capybara that whenever it encounters any JavaScript-dependent feature tests, it should run them in our standalone Chrome container (using a “remote” driver). The Selenium WebDriver gem lets us set some options for how we want Chrome to run.

The key setting here is “headless” — that is, run Chrome, but more efficiently, without all the fancy GUI stuff that a real user might see.

In spec_helper.rb

Capybara.javascript_driver = :selenium_remote

Capybara.register_driver :selenium_remote do |app|
  capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
    chromeOptions: { args: [
      'headless',
      'no-sandbox',
      'disable-gpu',
      'disable-infobars',
      'window-size=1400,1000',
      'enable-features=NetworkService,NetworkServiceInProcess'
    ] }
  )

  Capybara::Selenium::Driver.new(app,
                                 browser: :remote,
                                 desired_capabilities: capabilities,
                                 url: 'http://selenium:4444/wd/hub')
end

That last URL http://selenium:4444/wd/hub is the location of our Chrome driver within our selenium container.

There are a few other important Capybara settings configured in spec_helper.rb that are needed to get our app and selenium containers to play nicely together.

Capybara.server = :puma, { Threads: '1:1' }
Capybara.server_port = '3002'
Capybara.server_host = '0.0.0.0'
Capybara.app_host = "http://app:#{Capybara.server_port}"
Capybara.always_include_port = true
Capybara.default_max_wait_time = 30 # our ajax responses are sometimes slow
Capybara.enable_aria_label = true
[...]

The server_port, server_host and app_host variables are the keys here. Basically, we’re saying:

  • Capybara (which runs in our app container) should start up Puma to run the test app, listening on http://0.0.0.0:3002 for requests beyond the current host during a test.
  • The selenium container (where the Chrome browser resides) should access the application under test at http://app:3002 (since it’s in the app container).

Some Actual RSpec Accessibility Tests

Here’s the fun part, where we actually get to write the accessibility tests. The axe-core-rspec gem makes it a breeze. The be_axe_clean matcher ensures that if we have a WCAG 2.0 AA or Section 508 violation, it’ll trip the wire and report a failing test.

In accessibility_spec.rb

require 'spec_helper'
require 'axe-rspec'

RSpec.describe 'Accessibility (WCAG, 508, Best Practices)', type: :feature, js: true, accessibility: true do
  describe 'homepage' do
    it 'is accessible' do
      visit '/'
      expect(page).to be_axe_clean
    end
  end

  [...]
end

With type: :feature and js: true we signal to RSpec that this block of tests should be handled by Capybara and must be run in our headless Chrome Selenium container.

The example above is the simplest case: 1) visit the homepage and 2) do an Axe check. We also make sure to test several different kinds of pages, and with some different variations on UI interactions. E.g., test after clicking to open the Advanced Search modal.

The CI Pipeline

We started out having accessibility tests run along with all our other RSpec tests during the test stage in our GitLab CI pipeline. But we eventually determined it was better to keep accessibility tests isolated in a separate job, one that would not block a code merge or deployment in the event of a failure. We use the accessibility: true tag in our RSpec accessibility test blocks (see the above example) to distinguish them from other feature tests.

No, we don’t condone pushing inaccessible code to production! It’s just that we sometimes get false positives — violations are reported where there are none — particularly in JavaScript-heavy pages. There are likely some timing issues there that we’ll work to refine with more configuration.
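A job along these lines can take advantage of GitLab CI’s allow_failure setting so that violations surface as warnings without blocking the pipeline. Here’s a rough sketch; the job definition below is illustrative, not our exact configuration.

In .gitlab-ci.yml

accessibility:
  stage: test
  allow_failure: true  # report violations without blocking merge or deploy
  script:
    - bundle exec rspec --tag accessibility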

A Successful Accessibility Test

Here’s a completed job in our CI pipeline logs where the accessibility tests all passed:

Screenshot of passing CI pipeline, including accessibility

Screenshot from successful accessibility test job in GitLab CI
Output from a successful automated accessibility test job run in a GitLab CI pipeline.

Our GitLab CI logs are not publicly available, so here’s a brief snippet from a successful test.

An Accessibility Test With Failures

Screenshot displaying a failed accessibility test job

Here’s a CI pipeline for a code branch that adds two buttons to the homepage with color contrast and aria-label violations. The axe tests flag the issues as FAILED and recommend revisions (see snippet from the logs).

Concluding Thoughts

Automation and accessibility testing are both rapidly evolving areas, and the setup that’s working for our ArcLight app today might look considerably different within the next several months. Still, I thought it’d be useful to pause and reflect on the steps we took to get automated accessibility testing up and running. This strategy would be reasonably reproducible for many other applications we support.

A lot of what I have outlined could also be accomplished with variations in tooling. Don’t use GitLab CI? No problem — just substitute your own CI platform. The five most important takeaways here are:

  1. Accessibility testing is important to do, continually
  2. Use continuous integration to automate testing that used to be manual
  3. Containerizing helps streamline continuous integration, including testing
  4. You can run automated browser-based tests in a ready-made container
  5. Deque’s open source Axe testing tools are easy to use and pluggable into your existing test framework

Many thanks to David Chandek-Stark (Duke) for architecting a large portion of this work. Thanks also to Simon Choy (Duke), Dann Bohn (Penn St.), and Adam Wead (Penn St.) for their assistance helping us troubleshoot and understand how these pieces fit together.


The banner image in this post uses three icons from the FontAwesome Free 5 icon set, unchanged, and licensed under a CC-BY 4.0 license.


REVISION 7/29/21: This post was updated, adding links to snippets from CI logs that demonstrate successful accessibility tests vs. those that reveal violations.

Curating for a community: joining the DCN

At DUL, we talk quite a lot about the value of research data curation. The Libraries provide a curatorial review of all data packages submitted to the Research Data Repository for publication. This review can help to enhance a researcher’s dataset by enabling a second or third pair of eyes to look over the data and ensure that all documentation is as complete as possible and that the dataset as a whole has been optimized for long-term reuse. Although it’s not necessary to have expertise in the domain of the data under review, such expertise can give the curator a fuller picture of what is needed to help make those data FAIR. While data curators working in the Libraries possess a wealth of knowledge about general research data-related best practices, and are especially well-versed in the vagaries of social sciences data, they may not always have all the information they need to sufficiently assess the state of a dataset from a researcher.

As I discussed in a blog post back in 2019, for the last few years Duke has been part of a project designed to address gaps in domain proficiency that are a natural part of a curation program of our size. The Data Curation Network has functioned as a grant-supported consortium of data curation professionals located in research institutions who have pooled their knowledge to provide enhanced review for data that fall outside the expertise of local curators. Partner institutions can submit datasets to the Network, where they are matched with a DCN curator with the relevant domain experience. Beyond providing curation services, the DCN generates a variety of community resources pertaining to data curation, including a standardized set of curation steps and workflow, a list of essential data curation activities, and a growing roster of instructional primers to support the curation of various kinds of data.

The DCN has grown since my last post, and now includes curators from 11 institutions and the Dryad research data repository. DCN curators work with data from disciplines ranging from aerospace engineering to urban and regional planning and tackle data types from qualitative survey responses to machine learning model training datasets.

Updated for 2021!

Although two members have worked with the DCN for a few years, the rest of the DUL research data curation team is now getting in on the action. Last week, the two Repository Services Analysts embedded with the curation team began the process of onboarding to serve as DCN curators. While we have been able to contribute to local curation of datasets for the RDR, this new opportunity presents us with a chance to not only gain valuable experience working with some practiced curators, but also to contribute back to the community that has helped to support our work. We are very excited to expand and deepen our DCN participation!

Indexing variant names from the Library of Congress Name Authority File (LCNAF) in TRLN Discovery

You might or might not have noticed a TRLN Discovery feature announcement in the February TRLN News Roundup. It mentioned that we are now indexing variant names from the Library of Congress Name Authority File in TRLN Discovery. I thought in this post I would expand on what this change means for Duke’s Books & Media catalog, add some details about the technical implementation, and discuss some related features we might add in the future based on this work.

What is it?

First, the practical matter: what does this feature mean for people who search the catalog? Our catalog records contain authoritative forms of creator names, that is, the specific form of a person’s name chosen as authoritative by the Library of Congress. For example, the authoritative form of Emily Dickinson’s name is “Dickinson, Emily, 1830-1886.” If you search the Books & Media catalog using this form of the poet’s name, you will find all records associated with her name (example search with the authoritative name). Previously, if you had searched the catalog and added the poet’s middle name, “Elizabeth,” you likely would have missed many relevant results, because “Elizabeth” is not included in the authoritative form of the name. It is, however, included in one of the variant names in the LC Name Authority File. The full list of variant names for Emily Dickinson is:

  • Dickinson, Emilia, 1830-1886
  • Dickinson, Emily Elizabeth, 1830-1886
  • Dickinson, Emily (Emily Elizabeth), 1830-1886
  • Dikinson, Ėmili, 1830-1886
  • D̲ikinson, Emily, 1830-1886
  • Ti-chin-sen, Ai-mi-li, 1830-1886
  • דיקינסון, אמילי, 1830־1886
  • דיקינסון, אמילי, 1886־1830
  • Dykinsan, Ėmili, 1830-1886

Emily Dickinson

Since we are now indexing these forms in TRLN Discovery, you get much better results if you happen to add Emily Dickinson’s middle name to your search (example search including a variant form of the name). Additionally, various romanizations and vernacular forms are indexed (example search for “דיקינסון, אמילי”).

If you clicked through to the example searches you may have noticed that the result counts and result order are slightly different when searching the authoritative form vs. the variant forms.

The variant forms are only indexed on records that include a URI that references the LC Name Authority File. If this URI reference is missing, the variant names are not indexed for that record. Additionally, some records may not have been updated since we implemented this feature. In time, all records that include URIs for names will have variant names indexed.

The difference in result order is due to how the variant names are indexed. For the authoritative form of the name we distinguish between creators, editors, contributors, etc. and give matches in these categories different boosts in the relevance ranking. At the moment, the variant names from the LCNAF file are indexed in a single field and so we lose the nuance needed for more granular relevance ranking. This is something that could be revised in the future if needed.

How does it work?

This feature relies on the fact that our MARC records include URI references to the LC Name Authority File. As an example, here’s a MARC XML 100 Main Entry-Personal Name field for Emily Dickinson with a URI reference to the authority file.

<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Dickinson, Emily,</subfield>
<subfield code="d">1830-1886.</subfield>
<subfield code="0">http://id.loc.gov/authorities/names/n79054166</subfield>
</datafield>

We store this URI reference in the TRLN Discovery name field and then use it at ingest time to look up and index the variant names from a local cache. Here’s the stored name for Emily Dickinson in the TRLN Discovery index.

names_a: ["{\"name\":\"Dickinson, Emily, 1830-1886\",\"rel\":\"author\",\"type\":\"creator\",\"id\":\"http://id.loc.gov/authorities/names/n79054166\"}"]

The TRLN Discovery ingest service keeps its own cache of the name identifiers and variant names for efficient lookup at ingest time. We use Redis, an open-source, in-memory (very fast) data store, to make the variant names available when records are ingested. This local cache is built from the LC Name Authority File. Since the name authority file changes over time, we will refresh our local cache of the data every three months to keep it up to date. We’ve written a script (a Rails Rake task) that automates this update process.
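The ingest service itself is Ruby, but the caching pattern is simple enough to sketch in a few lines of Python (the key layout and variable names here are illustrative assumptions, not the production schema):

import json
import redis

r = redis.Redis(decode_responses=True)

# Cache-building pass, rerun every few months from the LCNAF data:
uri = "http://id.loc.gov/authorities/names/n79054166"
variants = [
    "Dickinson, Emily Elizabeth, 1830-1886",
    "Dikinson, Ėmili, 1830-1886",
]
r.set(uri, json.dumps(variants))

# Ingest-time lookup for each name URI found in a record:
cached = r.get(uri)
variant_names = json.loads(cached) if cached else []
# variant_names are then written to a single variant-name index field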

What’s next?

The addition of stored name authority URIs in the TRLN Discovery index opens up opportunities to add more features in the future. I’m especially interested in displaying more contextual information about creators in our catalog. We could also expose “See also” references from the authority files to make it easier to find works by the same person published under different names (“Twain, Mark, 1835-1910” being a good example):

  • Clemens, Samuel Langhorne, 1835-1910
  • Conte, Louis de, 1835-1910
  • Snodgrass, Quintus Curtius, 1835-1910

Mark Twain

As always, we continue to add features and make incremental improvements to TRLN Discovery, and your feedback is critical. Please let us know how things are working for you using the feedback form available on every page of the Books & Media Catalog.

Library study space design: Intentional, inclusive, flexible

In the Assessment & User Experience department, one of our ongoing tasks is to gather and review patron feedback in order to identify problems and suggest improvements. While the libraries offer a wide variety of services to our patrons, one of the biggest and trickiest areas to get right is the design of our physical spaces. Typically inhabited by students, our library study spaces come in a variety of sizes and shapes and are distributed somewhat haphazardly throughout our buildings. How can we design our study spaces to meet the needs of our patrons? When we have study spaces with different features, how can we let our patrons know about them?

These questions and the need for a deeper assessment of library study space design inspired the formation of a small team – the Spaces With Intentional Furniture Team (or SWIFT). This team was charged with identifying best practices in study space furniture arrangement, as well as making recommendations on opportunities for improvements to existing spaces and outreach efforts. The team reviewed and summarized relevant literature on library study space design in a report (public version now available). In this post, we will share a few of the most surprising and valuable suggestions from our literature review.

Increase privacy in large, open spaces

Some of the floors in our library buildings have large, open study spaces that can accommodate many patrons. Because study space is limited, we are highly motivated to make the most of the space we have. The way a space is designed, however, influences how comfortable patrons feel spending a lot of time in it.

With large open spaces, the topic of privacy came up across several different studies. In this context, privacy relates both to visibility in a space and to the ability to make noise without being overheard. Even when policies allow for noise in a space, a lack of privacy can make students hesitant to be noisy. For spaces where silence is the norm, a lack of privacy can make patrons feel on display and especially nervous about any movements or sounds they might make.

The literature suggests that there are ways to improve privacy in open spaces. For group spaces, placing dividers or partitions between group table arrangements may both offer privacy and provide useful amenities, like writeable surfaces. For quiet spaces, privacy can be improved by varying the type and height of furniture and by turning furniture in different directions so individuals are not facing each other. Seating density should also be restricted in quiet spaces.

Isolate noisy zones from quiet zones

Controlling noise is a common topic in the literature. Libraries are some of the only spaces on campus that offer a quiet study environment, but the need for quiet spaces must be balanced with the need to engage in the increasingly collaborative work required by modern classes. Libraries are often in central locations on campus and offer prime real estate for groups to meet in between or after classes. How can we provide enough quiet space for people who need to work without distractions while still accommodating group work and socializing?

One strategy is to make sure that people feel comfortable making noise in spaces where it is encouraged. Libraries can position noisy spaces to take advantage of other sources of noise to provide some noise “cover” – for example, a staff service desk, copy machines, elevators, and meeting rooms. Quiet spaces should be isolated from these sources of noise, perhaps by placing them on separate floors. Stacks can also help separate spaces, as books provide some sound absorption, and the visual obstruction reduces distractions for students studying quietly.

Reservable private study rooms meet several needs

Sometimes, enforcing noise policies to keep spaces quiet only solves part of the problem. Quiet study spaces reduce distractions caused by noise, but students can be sensitive to other kinds of distractions – visual distractions, strong or chemical smells, etc. For students needing spaces completely free of distractions, libraries might consider creating reservable rooms available for individual study.

This kind of service is useful for more than low-distraction study needs. Pandemic exceptions aside, libraries often employ a first-come, first-served approach to seats in study spaces. Patrons with mobility issues or limited time to study would benefit greatly from being able to reserve a study space in advance. Identifying reservable study spaces for individuals, either within a larger study space or as part of a set of reservable private rooms, might meet a variety of currently unmet needs.

Physical spaces need web presences

As SWIFT begins to think about recommendations, we know we have to address our outreach around spaces. Patrons currently have few options for learning about our spaces. We have some signage in our buildings to identify different noise policies, and we have a few websites that give a basic overview of the spaces, but patrons are often reduced to simply performing exhaustive circuits around the buildings to discover all that we have available. More likely, students find a few of our spaces either by chance or by word of mouth, and if those spaces don’t meet their needs, they may not return.

One detailed review (Brunskill, 2020) offers very explicit guidance on the design of websites to support patrons with disabilities. As is commonly true, improvements that support one group of patrons often improve services for all patrons. Prominently sharing the following information about physical spaces will better support all patrons looking to find their space in the libraries:

  • details about navigating physical spaces (maps, floorplans, photos)
  • sensory information for spaces (noise, privacy, lighting, chemical sensitivity)
  • physical building accessibility
  • parking/transportation information
  • disability services contact (with name, contact form)
  • assistive technologies hardware and equipment
  • any accessibility problems with spaces

Next steps

Throughout our literature review, we saw the same advice over and over again: patrons need variety. There is no one-size-fits-all solution to patron needs. Luckily, at Duke we have several library buildings and many, many study spaces. With some careful planning, we should be able to take an intentional approach to our space design in order to better accommodate the needs of our patrons. The libraries have new groups tasked with acting on these and related recommendations, and while it may take some time, our goal is to create a shared understanding of the best practices for library study space design.

Relevant Literature

Furniture Arrangement

Noise Isolation

Private Study Rooms

Websites about Spaces

Data Sharing and Equity: Sabrina McCutchan, Data Architect

This post is part of the Research Data Curation Team’s ‘Researcher Highlight’ series.

Equity in Collaboration

The landscape of research and data is enterprising, expansive and diverse. This dynamic is notably visible in the work done at Duke Global Health Institute (DGHI). Collaboration with international partners inherently comes with many challenges. In a conversation with the Duke Research Data Curation team, Sabrina McCutchan of the Research Design and Analysis Core (RDAC) at DGHI shares her thoughts on why data sharing and access is critical to global health research.

Questions of equity must be addressed when discussing research data and scholarship on a global scale. For DGHI, data equity is a priority. International research partners deserve equal access to primary data so they can better understand what’s happening in their communities, contribute to policy initiatives that support their populations, and support their own professional advancement by publishing in research and medical journals.

“We work with so many different countries, people groups, and populations around the world that often themselves don’t have access to the same infrastructure, technologies, or training in data. It can be challenging to collect quality primary data on their own, but it becomes a little easier in partnership with a big research institution like Duke.”

Collaborations like the Adolescent Mental Health in Africa Network Initiative (AMANI) demonstrate the significance of data sharing. AMANI is led by Dr. Dorothy Dow of DGHI, Dr. Lukoye Atwoli of Moi University School of Medicine, and Dr. Sylvia Kaaya of Muhimbili University of Health and Allied Sciences (MUHAS) and involves participating researchers from academic and medical institutions in South Africa, Kenya, and Tanzania.

Why Share Data?

As a Data Architect, Sabrina is available to support DGHI in achieving its data sharing goals. She takes a holistic approach to identifying areas where the team needs data support, considering at each stage of the project lifecycle how system design and data architecture will influence how data can be shared. This may entail drafting informed consent documents, developing strategies for de-identification, curating and managing data, or discovering solutions for data storage and publishing. For instance, in collaboration with CDVS Research Data Management Consultants, Sabrina has helped AMANI create a Dataverse to enable sharing restricted-access health data with international junior researchers. Data from one of DGHI’s studies are also available in the Duke Research Data Repository.

“All of these components are interconnected to each other. You really need to think about what are going to be the impacts of a decision made early in the process of gathering data for this study further downstream when we’re analyzing that data and publishing findings from it.”

Reproducibility is another reason that sharing and publishing data is important to Sabrina. DGHI wants to increase data availability in accordance with FAIR principles so other researchers can independently verify, reproduce, and iterate on their work. This supports peers and contributes to the advancement of the field. Publishing data in an open repository can also increase their reach and impact. DGHI is also currently examining how to incorporate the CARE principles and other frameworks for ethical data sharing within their international collaborations.

Global collaborations in research are vital in these times. Sabrina advises that it’s important for researchers, especially Principal Investigators, to think holistically about research projects: for example, thinking about data sharing at the very beginning of a project and writing consent forms that support what they hope to do with the data. Equitable practices paired with data sharing create opportunities for greater discovery and progress in research.


What does it mean to be an actively antiracist developer?

The library has extended its commitment to Diversity, Equity, and Inclusion over the past year, specifically through the work of DivE-In and the Anti-Racist Roadmap. To that end, the Digital Strategies and Technology department, where I work, has also been focusing on these issues. So lately I’ve been thinking a lot about how, as a web developer, I can be actively antiracist in my work.

First, some context. As a cis-gendered white male who is gainfully employed and resides in one of the best places to live in the country, I am soaking in privilege. So take everything I have to say with that large grain of salt. My first job out of college was at a tech startup that was founded and run by a Black person. To my memory, the overall makeup of the staff was something like 40–50% BIPOC, so my introduction to the professional IT world taught me that it was normal to see people who were different from me. However, in subsequent jobs my coworker pool has been much less diverse and more representative of the industry in general, which is to say very white and very male, and I think that is a problem. So how can an industry that lacks diversity actively work on promoting the importance of diversity? How can we push back against systemic racism and oppression when we benefit from those very systems? I don’t think there are any easy answers.

Antiracist Baby by Ibram X. Kendi

I think it’s important to recognize that for organizations driven by top-down decision making, sweeping change needs to come from above. To quote one of my favorite bedtime stories, “Point at policies as the problem, not people. There’s nothing wrong with the people!” But that doesn’t excuse ‘the people’ from doing the hard work that can lead to profound change. I believe an important first step is to acknowledge your own implicit bias (if you are able, attend Duke IT’s Implicit Bias in the Workplace Training). Confronting these issues is an uncomfortable process, but I think ultimately that’s a good thing. And at least for me, I think doing this work is an ongoing process. I don’t think my implicit biases will ever truly go away, so it’s up to me to constantly be on the lookout for them and to broaden my horizons and experiences.

So in addition to working on our internalized biases, I think we can also work on how we communicate with each other as coworkers. In a recent DST-wide meeting concerning racial equity at DUL, the group I was in talked a lot about interpersonal communication. We should recognize that we all have blind spots and patterns we slip into, like being overly jargony or being terse and/or confrontational. We have the power to change these patterns. We also need to be thoughtful about the language we use and the words we speak. We need to appreciate diversity of backgrounds and be mindful of the mental taxation of code switching. We can try to help each other feel more comfortable in our own skin and feel safe expressing our thoughts and ideas. I think it’s profoundly important to meet people from a place of empathy and mutual respect. And we should not pass up opportunities to have difficult conversations with each other. If I say something loaded with a microaggression and make a colleague feel uncomfortable or slighted, I want to be called out. I want to learn from my mistakes, and I would think that’s true for all of my coworkers.

Axe-con is an open and inclusive digital accessibility conference

We can also incorporate anti-racist practices into the things we create. Throughout my career, I’ve tried to promote the benefits of building accessible interfaces that follow the practices of universal design. Building things with accessibility in mind is good for everyone, not just those who make use of assistive technologies. And as an aside, axe-con 2021 was packed full of great presentations, and recordings are available for free. We can take small steps like removing problematic language from our workflows (“master” branches are now “main”; see the sketch below for what that rename involves). But I think and hope we can do more. One area where we could be more proactive is assessing our projects and tools to see to what degree (if at all) we seek out feedback and input from BIPOC staff and patrons. How can we make sure their voices are represented in what we create?
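For anyone making that change, here is a minimal sketch of the Git commands involved, assuming a single remote named “origin”; hosting platforms like GitHub also require switching the repository’s default branch in its settings before the old branch can be deleted:

    # Rename the local branch from "master" to "main"
    git branch -m master main

    # Push the renamed branch and set it as the new upstream
    git push -u origin main

    # After switching the default branch in your hosting platform's settings,
    # delete the old remote branch
    git push origin --delete master

Collaborators with existing clones will also need to rename their local branches and update their upstream tracking references.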

I don’t have many good answers, but I will keep listening, and learning, and growing.

An Intern’s Investigation on Decolonizing Archival Descriptions and Legacy Metadata

This post was written by Laurier Cress. Laurier is a graduate student at the University of Denver studying Library Science with an emphasis on digital collections, rare books and manuscripts, and social justice in librarianship and archives. In addition to LIS topics, she is interested in Medieval and Early Modern European History. Laurier worked as a practicum intern with the Digital Collections and Curation Services Department this winter to investigate auditing practices for decolonizing archival descriptions and metadata. She will complete her master’s degree in the Fall of 2021. In her spare time, she runs a YouTube channel called Old Dirty History, where she discusses historic events, people, and places.

Now that diversity, equity, and inclusion (DEI) are popular concerns for libraries throughout the United States, discussions on DEI are inescapable. These three words have become recurring buzzwords dropped in meetings, classroom lectures, class syllabi, presentations, and workshops across the LIS landscape. While in some contexts topics in DEI are thrown around with no sincere intent or value behind them, some institutions are taking steps to give meaning to DEI in librarianship. As an African American MLIS student at the University of Denver, I can say I have listened to one too many superficial talks on why DEI is important in our field. These conversations customarily exclude any examples of what DEI work actually looks like. When Duke Libraries advertised a practicum opportunity devoted to hands-on experience exploring auditing practices for legacy metadata and harmful archival descriptions, I was immediately sold. I saw this experience as an opportunity to learn what scholars in our field are actually doing to make libraries a more equitable and diverse place.

As a practicum intern in Duke Libraries’ Digital Collections and Curation Services (DCCS) department, I spent three months exploring frameworks for auditing legacy metadata against DEI values and investigating harmful language statements for the department. Part of this work also included applying what I learned to Duke’s collections. Duke’s digital collections boast 131,169 items across 997 collections, spanning 1,000 years of history from all over the world. Many of the collections represent a diverse array of communities that contribute to the preservation of a variety of cultural identities. It is the responsibility of institutions with cultural heritage holdings to present, catalog, and preserve their collections in a manner that accurately and respectfully portrays the communities depicted within them. However, many institutions housing cultural heritage collections use antiquated archival descriptions and legacy metadata that should be revisited to better reflect 21st-century language and ideologies. It is my hope that this brief overview on decolonizing archival collections aids not only Duke but other institutions as well.

Harmful Language Statement Investigation

During the first phase of my investigation, I conducted an analysis of harmful language statements across several educational institutions throughout the United States. This analysis served as a launchpad for investigating how Duke can improve upon its inclusive description statement for its digital collections. During my investigation, I compiled a list of 41 harmful language statements. Some of these institutions include:

  • The Walters Museum of Art
  • Princeton University
  • University of Denver
  • Stanford University
  • Yale University

After gathering a list of institutions with harmful language statements, the next phase of my investigation was to conduct a comparative analysis to uncover what they had in common and how they differed. For this analysis, 12 harmful language statements were selected at random from the total list. From this investigation, I created the Harmful Statement Research Log to record my findings. The research log comprises two tabs. The first tab includes a list of harmful statements from 12 institutions, with supplemental comments and information about each statement. The second tab provides a list of 15 observations deduced from cross-examining the 12 harmful language statements. Some of the observations concern placement, length, historical context, and Library of Congress Subject Heading (LCSH) disclaimers. It is important for me to note that while some of the information provided within the research log is based on pure observation, much of the report also includes conclusions drawn from my own perspective as a user.

Decolonizing Archival Descriptions & Legacy Metadata

The next phase in my research was to investigate frameworks and current sentiments on decolonizing archival description and legacy metadata for Duke’s digital collections. Due to the limited amount of research on this subject, most of the information I came across related to decolonizing collections describing Indigenous peoples in Canada and African American communities. I found that the influence of late 19th- and early 20th-century library classification systems can still be seen within archival descriptions and metadata in contemporary library collections. The use of dated language within library and archival collections perpetuates the inequality of underrepresented groups by promoting the discriminatory infrastructures established by these earlier classification systems. In many cases, offensive archival descriptions are sourced from donors and creators. While it is important for information institutions to preserve the historical context of records within their collections, descriptions written by creators should be contextualized to help users better understand the racial connotations surrounding the record. Issues regarding contextualizing racist ideologies from the past can be found throughout Duke’s digital collections.

During my investigation, I examined Duke’s MARC records at the collection level to locate examples of harmful language used within their descriptions. The first harmful archival description I encountered was from the Alfred Boyd Papers, which describes a girl referenced within the papers as “a free mulatto girl”. This is an example of when archival description should not shy away from the realities of racist language used during the period in which the collection was created; however, context should be applied. “Mulatto” was an offensive term used during the era of slavery in the United States to refer to people of African and white European ancestry. It originates from the Spanish word “mulato”, whose literal meaning is “young mule”. While this word is used to describe the girl within the papers, it should not appear in the archival description without historical context.

Screenshot of metadata from the Alfred Boyd papers

When describing materials concerning marginalized peoples, it is important to preserve creator-sourced descriptions while also contextualizing them. To accomplish this, there should be a clear distinction between descriptions from the creator and those from the institution’s archivists. Some institutions, like The Morgan Library and Museum, use quotation marks as part of their in-house archival description procedure to differentiate language originating from collectors or dealers from that of their archivists. When racism is at the core of the material being described, preserving contextual information helps users better understand the collection’s historic significance. While this type of language can bring about feelings of discomfort, it is important not to allow a desire for comfort to take precedence over conveying histories of oppression and power dynamics. Placing context over personal comfort also takes the form of describing relationships of power and acts of violence just as they are. Acts of racism, colonization, and white supremacy should be labeled as such. For example, Duke’s Stephen Duvall Doar Correspondence collection describes the act of “hiring” enslaved people during the Civil War. Slavery does not involve hired labor, because hiring implies some form of compensation. Slavery can only equate to forced labor and should be described as such.

Several academic institutions have taken steps to decolonize their collections. At the beginning of my investigation, a mentor of mine referred me to the University of Alberta Library’s (UAL) Head of Metadata Strategies, Sharon Farnel. Farnel and her colleagues have done extensive work on decolonizing UAL’s holdings related to Indigenous communities. The university declared a call to action to protect the representation of Indigenous groups and to build relationships with other institutions and Indigenous communities. Although UAL’s call to action encompasses more than decolonizing their collections, for the sake of this article I will focus solely on the framework they established to decolonize their archival descriptions.

Community Engagement is Not Optional

Farnel and her colleagues created a team called the Decolonizing Description Working Group (DDWG). Its purpose was to propose a plan of action for how descriptive metadata practices could more accurately and respectfully represent Indigenous peoples. The DDWG included a Metadata Coordinator, a Cataloguer, a Public Service Librarian, a Coordinator of Indigenous Initiatives, and a self-identified Indigenous MLIS Intern. Much of their work consisted of consulting with the community and collaborating with other institutions. When I reached out to Farnel, she was kind and generous in sharing her experience as part of the DDWG. Farnel told me that the community engagement approach taken depends on the community. Marginalized peoples are not a monolith; therefore, there is no “one size fits all” solution. If you are going to consult community members, recognize the time and expertise the community provides. The relationship has to be mutually beneficial, with the community’s needs and requests at the forefront at all times.

For the DDWG, the best course of action was to start building a relationship with local Indigenous communities. Before engaging with the entire community, the team first met with community elders to learn how to proceed with consulting the community from a place of respect. Because the DDWG’s work took place prior to COVID-19, most meetings with the community happened in person. Farnel refers to these meetings as “knowledge gathering events”: food and beverages were provided, a safe space was created for open conversation, and a community elder would start each session to set the tone.

In addition to the knowledge gathering events, Aboriginal and non-Aboriginal students and alumni were consulted through a short, informal online survey advertised through social media. Once participants confirmed their desire to take part, they received an email with a link to complete the survey. Participants were asked questions about their feelings and reactions to potentially changing the Library of Congress Subject Headings (LCSH) related to Aboriginal content.

Auditing Legacy Metadata and Archival Descriptions

There is more than one approach an institution can take to start auditing legacy metadata and descriptions. In a case study, Dorothy Berry, currently the Digital Collections Program Manager at Harvard’s Houghton Library, describes a digitization project that took place at the University of Minnesota Libraries. The purpose of the project was not only to digitize African American heritage materials within the university’s holdings, but also to explore ways mass digitization projects can help re-aggregate marginalized materials. This case study serves as an example of how collections can be audited for legacy metadata and archival descriptions during mass digitization projects. Granted, this specific project received funding to support such an undertaking, and not all institutions have the money required to take on an initiative of this magnitude. However, this type of work can be done slowly over a longer period of time. Simply running a report to search for offensive terms such as “negro”, or in my case “mulatto”, is a good place to start (a rough sketch of such a report follows below). Be open to having discussions with staff to learn what offensive language they have also come across. Self-reflection and research are equally important. Princeton University Library’s inclusive description working group spent two years researching and gathering data on their collections before implementing any changes. Part of their auditing process included using an XQuery script to locate harmful descriptions and recover histories that were marginalized due to lackluster description.
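To make the idea of a term report concrete, here is a minimal sketch in Python. It assumes collection metadata has been exported to a flat CSV file; the file name, column names, and term list are hypothetical placeholders rather than any institution’s actual schema:

    import csv
    import re

    # Hypothetical term list; extend it with language surfaced by staff
    # discussions and community consultation.
    FLAGGED_TERMS = ["mulatto", "negro"]

    # Word boundaries avoid false hits inside longer, unrelated words.
    PATTERN = re.compile(r"\b(" + "|".join(FLAGGED_TERMS) + r")\b", re.IGNORECASE)

    def audit_metadata(path):
        """Print the identifier of each record whose title or description
        contains a flagged term, along with the terms found."""
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                text = " ".join(row.get(field, "") for field in ("title", "description"))
                hits = sorted({match.lower() for match in PATTERN.findall(text)})
                if hits:
                    print(f"{row.get('identifier', '?')}: {', '.join(hits)}")

    audit_metadata("metadata_export.csv")

A report like this only surfaces candidates for review; as the Princeton and Minnesota examples suggest, every hit still needs human judgment and historical context before any description is changed.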

Creators Over Community = Problematic

While exploring Duke’s digital collections, one problem that stood out to me the most was the perpetual valorization of creators. This is often found in collections whose creators are white men. Adjectives like “renowned”, “genius”, “talented”, and “preeminent” are used to praise the creators and make the collection more about them than about the community depicted within it. An example of this troublesome language can be found in Duke’s Sidney D. Gamble Photographs collection. This collection comprises over 5,000 black-and-white photographs taken by Sidney D. Gamble during his four visits to China from 1908 to 1932. The photographs depict people, architecture, livestock, landscapes, and more. Very little emphasis is placed on the community represented within this collection. Little, if any, historical or cultural context is given to help educate users on the culture behind the collection, and the predominant language used is English. However, there is a full page of information on the life and exploits of Gamble.

Screenshot of a description of the Sidney Gamble digital collection.

Describing Communities

Harmful language used to describe individuals represented within digital collections can be found everywhere. This is not always intentional. Dorothy Berry’s presentation with the Sunshine State Digital Network on conscious editing serves as a great source of knowledge on problematic descriptions that can be easily overlooked. Some of Berry’s examples include:

  • Class: Examples include using descriptions such as “poor family” or “below the poverty line”.
  • Race & Ethnicity: Examples include using dehumanizing vocabulary to describe someone of a specific ethnicity or omitting someone of a specific race from an image’s description.
  • Gender: Example includes referring to a woman using her husband’s full name (Mrs. John Doe) instead of her own.
  • Ability: Example includes using offensive language like “cripple” to describe disabled individuals.

This is only a handful of problematic description examples from Berry’s presentation. I highly recommend watching not only Berry’s presentation, but the entire Introduction to Conscious Editing Series.

Library of Congress Subject Headings (LCSH) Are Unavoidable

I could talk about LCSH in relation to decolonizing archival descriptions for days on end, but for the sake of wrapping up this post I won’t. In a perfect world we would stop using LCSH altogether. Unfortunately, this is impossible. Many institutions use custom-made subject headings to promote their collections respectfully and appropriately. However, the problem with custom-made subject headings that are more culturally relevant and respectful is accessibility: if no one is using your custom-made subject headings when conducting a search, users and aggregators won’t find the information. This defeats the purpose of decolonizing archival collections, which is to make collections that represent marginalized communities more accessible.

What we can do is be as cognizant as possible of the LCSHs we use and avoid harmful subject headings as much as possible. If you are uncertain whether an LCSH is harmful, conduct research or consult with communities who want to be part of your effort to remove harmful language from your collections. Let your users know why you are limited to subject headings that may be harmful and that you recognize the issue this presents to the communities you serve. Also consider collaborating with Cataloginglab.org to help design new LCSH proposals and to stay abreast of new LCSHs that better reflect DEI values. There are also alternative thesauri, like homosaurus.org and the Xwi7xwa Subject Headings, that better describe underrepresented communities.

Resources

In support of Duke Libraries’ intent to decolonize their digital collections, I created a Google Drive folder with all the fantastic resources I drew on in my research on this subject. These include metadata auditing practices from other institutions, recommendations on how to include communities in archival description, and frameworks for decolonizing descriptions.

While this short overview provides a wealth of information gathered from many scholars, associations, and institutions who have worked hard to make libraries a better place for all people, I encourage anyone reading this to continue reading literature on this topic. This overview does not come close to covering half of what invested scholars and institutions have contributed to this work. I do hope it encourages librarians, catalogers, and metadata architects to take a closer look at their collections.
