Category Archives: Behind the Scenes

Sometimes You Feel Like a Nutch: The Un-Googlification of a Library Search Service

Quick—when was the last time you went a full day without using a Google product or service? How many years ago was that day?

We all know Google has permeated so many facets of our personal and professional lives. A lot of times, using a Google something-or-other is your organization’s best option to get a job done, given your available resources. If you ever searched the Duke Libraries website at any point over the past seventeen years, you were using Google.

It’s really no secret that when you have a website with a lot of pages, you need to provide a search box so people can actually find things. Even the earliest version of the library website known to the Wayback Machine–from “way back” in 1997–had a search box. In those days, search was powered by the locally supported Texis Webinator; Google did not yet exist.

July 24, 2004 was an eventful day for the library IT staff. We went live with a shiny new Integrated Library System from Ex Libris called Aleph (that we are still to this day working to replace). On that very same day, we launched a new library website, and in the top-right corner of the masthead on that site was–for the very first time–a Google search box.

2004 version of the library website, with a Google search box in the masthead.

Years went by. We redesigned the website several times. Interface trends came and went. But one thing remained constant: there was a search box on the site, and if you used it, somewhere on the next page you were going to get search results from a Google index.

That all changed in summer 2021, when we implemented Nutch…

Nutch logo

Why Not Google?

Google Programmable Search Engine (recently rebranded from “Google Custom Search Engine”) is easy to use. It’s “free.” It’s fast, familiar, and, being a Google thing, unbeatable at search relevancy. So why ditch it now? Well…

  • Protecting patron privacy has always been a core library value. Recent initiatives at Duke Libraries and beyond have helped us to refocus our efforts around ensuring that we meet our obligations in this area.
  • Google’s service changed recently, and creating a new engine now involves some major hoop-jumping to be able to use it ad-free.
  • It doesn’t work in China, where we actually have a Duke campus, and a library.
  • The results are capped at 100 per query. Google prioritizes speed and page-one relevancy, but it won’t give you a precise hit count or an exhaustive list of results.
  • It’s a black box. You don’t really get to see why pages get ranked higher or lower than others.
  • There’s a search API you could potentially build around, but if you exceed 100 searches/day, you have to start paying to use it.

What’s Nutch?

Apache Nutch is open source web crawler software written in Java. It’s been around for nearly 20 years–almost as long as Google. It supports out-of-the-box integration with Apache Solr for indexing.

Diagram showing how Nutch works.
Slide from Sebastian Nagel’s “Web Crawling With Apache Nutch” presentation at ApacheCon EU 2014.

What’s So Good About Nutch?

  • Solr. Our IT staff have grown quite accustomed to the Solr search platform over the past decade; we already support around ten different applications that use it under the hood.
  • Self-Hosted. You run it yourself, so you’re in complete control of the data being crawled, collected, and indexed. User search data is not being collected by a third party like Google.
  • Configurable. You have a lot of control over how it works. All our configs are in a public code repository, so we have a record of what we’ve changed and why.

What are the Drawbacks to Using Nutch?

  • Maintenance. Using open source software requires a commitment of IT staff resources to build and maintain over time. It’s free, but it’s not really free.
  • Interface. Nutch doesn’t come with a user interface to actually use the indexed data from the crawls; you have to build a web application. Here’s ours.
  • Relevancy. Google considers factors such as page popularity and in-link counts to rank some pages as more relevant than others for a particular query; Nutch can’t. Or, at least, its optional features that attempt to do so are flawed enough that not using them gets us better results. So we rely on other factors for our relevancy algorithm, like the segment of the site where a page resides, URL slugs, page titles, subheading text, inlink text, and more.
  • Documentation. Some open source platforms have clear, easy-to-follow instruction manuals online to help you understand how to use them. Nutch is not one of those platforms.

How Does Nutch Work at Duke?

The main Duke University Libraries website is hosted in Drupal, where we manage around 1,500 webpages. But the full scope of what we crawl for library website searching is more than ten times that size. This includes pages from our blogs, LibGuides, exhibits, staff directory, and more. All told: 16,000 pages of content.

Searching from the website masthead or the default “All” box in the tabbed section on our homepage brings you to a QuickSearch results page.

Two boxes on the library homepage will search QuickSearch.
Use either of these search boxes to search QuickSearch.

You’ll see a search results page rendered by our QuickSearch app. It includes sections of results from various places, like articles, books & media, and more. One of the sections is “Our Website” — it shows the relevant pages that we’ve crawled with Nutch.

A QuickSearch page showing results in various boxes
QuickSearch results page includes a section of results from “Our Website”

You can also search just the website if you’re not interested in all those other resources.

Search results from the library website search box.
An example website-only search.

Three pieces work in concert to enable searching the website: Nutch, Solr, and QuickSearch. Here’s what they do:

Nutch

  • Crawls web pages that we want to include in the website search.
  • Parses HTML content; writes it to Solr fields.
  • Carries the configuration for which pages to include or exclude, crawler settings, and field mappings.
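
The include/exclude rules live in Nutch’s conf/regex-urlfilter.txt, one regular expression per line: a leading + includes matching URLs, a leading - excludes them. A hypothetical sketch (these are not our actual patterns):

```
# Skip image and binary assets
-\.(gif|jpg|png|pdf|zip)$
# Skip URLs containing these characters (often session IDs or duplicate content)
-[?*!@=]
# Crawl the main site and the blogs; anything unmatched falls through
+^https://library\.duke\.edu/
+^https://blogs\.library\.duke\.edu/
```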

Solr

  • Index & document store for crawled website content.

QuickSearch

  • Queries the Solr index and renders matching pages in the “Our Website” section of the search results.

Crawls happen every night to pick up new pages and changes to existing ones. We use an “adaptive fetch schedule,” so by default each page gets re-crawled every 30 days. If a page changes frequently, it’ll automatically get re-crawled sooner.
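
The adaptive schedule is enabled in Nutch’s configuration (nutch-site.xml). A sketch with illustrative values, not our exact settings:

```xml
<!-- nutch-site.xml (illustrative values) -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- default re-crawl interval: 30 days, expressed in seconds -->
  <value>2592000</value>
</property>
```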

Summary

Overall, we’re satisfied with how the switch to Nutch has been working out for us. The initial setup was challenging, but it has been running reliably without needing much in the way of developer intervention.  Here’s hoping that continues!


Many thanks to Derrek Croney and Cory Lown for their help implementing Nutch at Duke, and to Kevin Beswick (NC State University Libraries) for consulting with our team.

The Shortest Year

Featured image – screenshot from the Sunset Tripod2 project charter.

Realizing that my most recent post here went up more than a year ago, I pause to reflect. What even happened over these last twelve months? Pandemic and vaccine, election and insurrection, mandates and mayhem – outside of our work bubble, October 2020 to October 2021 has been a churn of unprecedented and often dark happenings. Bitstreams, however, broadcasts from inside the bubble, where we have modeled cooperation and productivity, met many milestones, and kept our collective cool, despite working nearly 100% remotely as a team, with our stakeholders, and across organizational lines.

Last October, I wrote about Sunsetting Tripod2, a homegrown platform for our digital collections and archival finding aids that was also the final service we had running on a physical server. “Firm plans,” I said we had for the work that remained. Still, in looking toward that setting sun, I worried about “all sorts of comical and embarrassing misestimations by myself on the pages of this very blog over the years.” I was optimistic, but cautiously so, that we would banish the ghosts of Django-based systems past.

Reader, I have returned to Bitstreams to tell you that we did it. Sometime in Q1 of 2021, we said so long, farewell, adieu to Tripod2. It was a good feeling, like when you get your laundry folded, or your teeth cleaned, only better.

However, we did more in the past year than just power down exhausted old servers. What follows are a few highlights from the work of the software developers in Duke University Libraries’ Digital Strategies and Technology division, and our collaborators (whom we cannot thank or praise enough), over the past twelve months.

In November, Digital Projects Developer Sean Aery posted Implementing ArcLight: A Reflection. The work of replacing and improving upon our implementation for the Rubenstein Library’s collection guides was one of the main components that allowed us to turn off Tripod2. We actually completed it in July of 2020, but that team earned its Q4 victory laps, including Sean’s post and a session at Blacklight Summit a few days after my own post last October.

As the new year began, the MorphoSource team rolled out version 2.0 of that platform. MorphoSource Repository Developer Jocelyn Triplett shared A Preview of MorphoSource 2 Beta in these pages on January 20. The launch took place on February 1.

One project we had underway as I was writing last October was the integration of Globus, a transfer service for large datasets, into the Duke Research Data Repository. We completed that work in Q1 of 2021, prompting our colleague, Senior Research Data Management Consultant Sophia Lafferty-Hess, to post Share More Data in the Duke Research Data Repository! in a neighboring location that shares our charming cul-de-sac of library blogs.

The seventeen months since the murder of George Floyd have seen major changes in how we think and talk about race in the Libraries. We committed ourselves to the DUL Racial Justice Roadmap, a pathway for recognizing and attacking the pervasive influence of white supremacy in our society, in higher education, at Duke, in the field of librarianship, in our library, in the field of information technology, and in our own IT practices. During this time, members of our division have also participated broadly in DiversifyIT, a campus-wide group of IT professionals who seek to foster a culture of inclusion “by providing professional development, networking, and outreach opportunities.”

Digital Projects Developer Michael Daul shared his own point of view with great thoughtfulness in his April post, What does it mean to be an actively antiracist developer? He touched on representation in the IT industry, acknowledging bias, being aware of one’s own patterns of communication, and bringing these ideas to the systems we build and maintain. 

One of the ideas that Michael identified for software development is web accessibility; as he wrote, we can “promote the benefits of building accessible interfaces that follow the practices of universal design.” We put that idea into action a few months later, as Sean described in precise technical terms in his July post, Automated Accessibility Testing and Continuous Integration. Currently that process applies to the ArcLight platform, but when we have a chance, we’ll see if we can expand it to other services.

The question of when we’ll have that chance is a big one, as it hinges on the undertaking that now dominates our attention. Over the past year we have ramped up on the migration of our website from Drupal 7 to Drupal 9, to head off the end-of-life for 7. This project has transformed into the raging beast that our colleagues at NC State Libraries warned us it would become at Code4Lib Southeast in May of 2019.

Screenshot of NC State Libraries presentation on Drupal migration
They warned us – Screenshot from “Drupal 7 to Drupal 8: Our Journey,” by Erik Olson and Meredith Wynn of NC State Libraries’ User Experience Department, presented at Code4Lib Southeast in May of 2019.

We are on a path to complete the Drupal migration in March 2022 – we have “firm plans,” you could say – and I’m certain that its various aspects will come to feature in Bitstreams in due time. For now I will mention that it spawned two sub-projects that have challenged our team over the past six months or so, both of which involve refactoring functionality previously implemented as Drupal modules into standalone Rails applications:

  1. Quicksearch, aka unified search, aka “Bento search” – see Michael’s Bento is Coming! from 2014 – is now a standalone app; it also uses the open-source tool Apache Nutch, rather than Google CSE.
  2. The staff directory app that went live in 2019, which Michael wrote about in Building a new Staff Directory, also no longer runs as a Drupal module.

Each of these implementations was necessary to prepare the way for a massive migration of theme and content that will take place over the coming months. 

Screenshot of a Jira issue related to the Decouple Staff Directory project.

When it’s done, maybe we’ll have a chance to catch our breath. Who can really say? I could not have guessed a year ago where we’d be now, and anyway, the period of the last twelve months gets my nod as the shortest year ever. Assuming we’re here, whatever “here” means in the age of remote/hybrid/flexible work arrangements, then I expect we’ll be burning down backlogs, refactoring this or that, deploying some service, and making firm plans for something grand.

Using an M1 Mac for development work

Due to a battery issue with my work laptop (an Intel-based MacBook Pro), I had an opportunity to try using a newer (ARM-based) M1 Mac for development work. Since roughly a year had passed since these new machines were introduced, I assumed the kinks would have been generally worked out, and I was excited to give my speedy new M1 Mac Mini a test run at some serious work. However, upon trying to make some updates to a recent project (by the way, we launched our new staff directory!), I ran into many stumbling blocks.

M1 Mac Mini ensconced beneath multitudes of cables in my home office

My first step with a new machine was to get my development environment set up. On my old laptop I’d typically use Homebrew for managing packages and RVM (and previously rbenv) for Ruby version management in different projects. I tried installing the tools normally and ran into multitudes of weirdness. Some guides suggested setting up a parallel version of Homebrew (ibrew) using Rosetta (a translation layer for running Intel-native code). So I tried that – and then ran into all kinds of issues with managing Ruby versions. Oh, and also, apparently RVM / rbenv are no longer cool and you should be using chruby or asdf. So I tried those too, and ran into more problems. In the end, I stumbled on this amazing script by Moncef Belyamani. It was really simple to run and it just worked, plain and simple. Yay – working dev environment!

We’ve been using Docker extensively in our library projects over the past few years, and the Staff Directory was set up to run inside a container on our local machines. So my next step was to get Docker up and running. The light research I’d done suggested that Docker was more or less working now with M1 Macs, so I dove in thinking things would go smoothly. I installed Docker Desktop (hopefully not a bad idea) and tried to build the project, but bundle install failed. The Staff Directory project is built in Ruby on Rails and in this instance was using the therubyracer gem, which embeds the V8 JS library. However, I learned that the particular version of V8 used by therubyracer is not compiled for ARM, which breaks the build. And as you tend to do when running into questions like these, I went down a rabbit hole of potential workarounds. I tried manually installing a different version of the V8 library and getting the bundle process to use that instead, but never quite got it working. I also explored using a different gem (like mini_racer) that would compile correctly for ARM, or just using Node instead of V8, but neither was a good option for this project. So I was stuck.

Building the Staff Directory app in Docker

My next attempt at a solution was to try setting up a remote Docker host. I’ve got a file server at home running TrueNAS, so I was able to easily spin up an Ubuntu VM on that machine and set up Docker there. You could do something similar using Duke’s VCM service. I followed various guides, set up user accounts and permissions, generated SSH keys, and with some trial and error I was finally able to get things running correctly. You can set up a context for a Docker remote host and switch to it (something like: docker context use ubuntu), and then your subsequent Docker commands point to that remote host, making development work entirely seamless. It’s kind of amazing. And it worked great when testing with a hello-world app like whoami. Running docker run --rm -it -p 80:80 containous/whoami worked flawlessly. But anything more complicated, like running an app that used two containers as was the case with the Staff Directory app, seemed to break. So, stuck again.

After consulting a few of my brilliant colleagues, another option was suggested, and it ended up being the best workaround: take the same Ubuntu VM and, instead of setting it up as a Docker remote host, use it as the development server, with a tunnel connection to it (something like: ssh -N -L localhost:8080:localhost:80 docker@ip.of.VM.machine) so that I could view running webpages at localhost:8080. This approach requires the extra step of pushing code up to the git repository from the Mac and then pulling it back down on the VM, but that only takes a few extra keystrokes. And having a viable dev environment is well worth the hassle IMHO!

As Apple moves away from Intel-based machines – rumors seem to indicate that the new MacBook Pros coming out this fall will be ARM-only – I think these development issues will start to be talked about more widely. And hopefully some smart people will get everything working well with ARM. But in the meantime, running Docker on a Linux VM via a tunnel connection seems like a relatively painless way to ensure that more complicated Docker/Rails projects can be worked on locally with an M1 Mac.

Good News from the DPC: Digitization of Behind the Veil Tapes is Underway

This post was written by Jen Jordan, a graduate student at Simmons University studying Library Science with a concentration in Archives Management. She is the Digital Collections intern with the Digital Collections and Curation Services Department. Jen will complete her master’s degree in December 2021.

The Digital Production Center (DPC) is thrilled to announce that work is underway on a three-year National Endowment for the Humanities (NEH) grant-funded project to digitize the entirety of Behind the Veil: Documenting African-American Life in the Jim Crow South, an oral history project that produced 1,260 interviews spanning more than 1,800 audio cassette tapes. Accompanying the 2,000-plus hours of audio is a sizable collection of visual materials (e.g., photographic prints and slides) that form a connection with the recorded voices.

We are here to summarize the logistical details relating to the digitization of this incredible collection. To learn more about its historical significance and the grant that is funding this project, titled “Documenting African American Life in the Jim Crow South: Digital Access to the Behind the Veil Project Archive,” please take some time to read the July announcement written by John Gartrell, Director of the John Hope Franklin Research Center and Principal Investigator for this project. Co-Principal Investigator of this grant is Giao Luong Baker, Digital Production Services Manager.

Digitizing Behind the Veil (BTV) will require, in part, the services of outside vendors to handle the audio digitization and subsequent captioning of the recordings. While the DPC regularly digitizes audio recordings, we are not equipped to do so at this scale (while balancing other existing priorities). The folks at Rubenstein Library have already been hard at work double-checking the inventory to ensure that each cassette tape and case are labeled with identifiers. The DPC then received the tapes, filling 48 archival boxes, along with a digitization guide (i.e., an Excel spreadsheet) containing detailed metadata for each tape in the collection. Upon receiving the tapes, DPC staff set to boxing them for shipment to the vendor. As of this writing, the boxes are snugly wrapped on a pallet in Perkins Shipping & Receiving, where they will soon begin their journey to a digital format.

The wait has begun! In eight to twelve weeks we anticipate receiving the digital files, at which point we will perform quality control (QC) on each one before sending them off for captioning. As the captions are returned, we will run through a second round of QC. From there, the files will be ingested into the Duke Digital Repository, at which point our job is complete. Of course, we still have the visual materials to contend with, but we’ll save that for another blog! 

As we creep closer to the two-year mark of the COVID-19 pandemic and the varying degrees of restrictions that have come with it, the DPC will continue to focus on fulfilling patron reproduction requests, which have comprised the bulk of our work for some time now. We are proud to support researchers by facilitating digital access to materials, and we are equally excited to have begun work on a project of the scale and cultural impact that is Behind the Veil. When finished, this collection will be accessible for all to learn from and meditate on—and that’s what it’s all about. 

 

Auditing Archival Description for Harmful Language: A Computer and Community Effort

This post was written by Miriam Shams-Rainey, a third-year undergraduate at Duke studying Computer Science and Linguistics with a minor in Arabic. As a student employee in the Rubenstein’s Technical Services Department in the Summer of 2021, Miriam helped build a tool to audit archival description in the Rubenstein for potentially harmful language. In this post, she summarizes her work on that project.

The Rubenstein Library has collections ranging across centuries. Its collections are massive and often contain rare manuscripts or one-of-a-kind materials. However, with this wide-ranging history often comes language that is dated and harmful: often racist, sexist, homophobic, and/or colonialist. As important as it is to find and remediate these instances of potentially harmful language, there is a lot of data that must be searched.

With over 4,000 collection guides (finding aids) and roughly 12,000 catalog records describing archival collections, archivists would need to spend months combing their metadata to find harmful or problematic language before even starting to find ways to handle it. That is, unless there were a way to optimize this workflow.

Working under Noah Huffman’s direction and the imperatives of the Duke Libraries’ Anti-Racist Roadmap, I developed a Python program capable of finding occurrences of potentially harmful language in library metadata and recording them for manual analysis and remediation. What would have taken months of work can now be done in a few button clicks and ten minutes of processing time. Moreover, the tools I have developed are accessible to any interested parties via a GitHub repository to modify or expand upon.

Although these gains in speed push metadata language remediation efforts at the Rubenstein forward significantly, a computer can only take this process so far; once uses of this language have been identified, the responsibility of determining the impact of the term in context falls onto archivists and the communities their work represents. To this end, I have also outlined categories of harmful language occurrences to act as a starting point for archivists to better understand the harmful narratives their data uphold and developed best practices to dismantle them.

Building an automated audit tool

Audit Tool GUI Screenshot
The simple, yet user-friendly interface that allows archivists to customize the search audit to their specific needs.

I created an executable that allows users to interact with the program regardless of their familiarity with Python or with using their computer’s command line. With an executable, all a user must do is click on the program (titled “description_audit.exe”) and the script will load with all of its dependencies in a self-contained environment. There’s nothing a user needs to install, not even Python.

Within this executable, I also created a user interface to allow users to set up the program with their specific audit parameters. To use this program, users should first create a CSV file (spreadsheet) containing each list of words they want to look for in their metadata.

Snippet of Lexicon CSV
Snippet from a sample lexicon CSV file containing harmful terms to search

In this CSV file of “lexicons,” each category of terms should have its own column. For example, RaceTerms could be the header of a column of terms such as “colored” or “negro,” and GenderTerms could be the header of a column of gendered terms such as “homemaker” or “wife.” See these lexicon CSV file examples.
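
To illustrate the expected shape of that file, here is a minimal sketch of how such a lexicon CSV could be read into per-category term lists (the function name and sample data are mine, not necessarily the tool’s):

```python
import csv
import io

def load_lexicons(csv_text, categories=None):
    """Read a lexicon CSV into {category: [terms]}; one category per column,
    with the category name in the first (header) row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = rows[0].keys() if rows else []
    # Use every column by default, or only the requested subset
    selected = [n for n in names if categories is None or n in categories]
    return {name: [row[name] for row in rows if row.get(name)]
            for name in selected}

sample = "RaceTerms,GenderTerms\ncolored,homemaker\nnegro,wife\n"
lexicons = load_lexicons(sample)
# {'RaceTerms': ['colored', 'negro'], 'GenderTerms': ['homemaker', 'wife']}
```

Columns need not be the same length; empty cells are simply skipped.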

Once this CSV has been created, users can select this CSV of lexicons in the program’s user interface and then select which columns of terms they want the program to use when searching across the source metadata. Users can either use all lexicon categories (all columns) by default or specify a subset by typing out those column headers. For the Rubenstein’s purposes, there is also a rather long lexicon called HateBase (from a regional, multilingual database of potential hate speech terms often used in online moderating) that is only enabled when a checkbox is checked; users from other institutions can download the HateBase lexicon for themselves and use it or they can simply ignore it.

In the CSV reports that are output by the program, matches for harmful terms and phrases will be tagged with the specific lexicon category the match came from, allowing users to filter results to certain categories of potentially harmful terms.

Users also need to designate a folder on their desktop where report outputs should be stored, along with the folder containing their source EAD records in .xml format and their source MARCXML file containing all of the MARC records they wish to process as a single XML file. Results from MARC and EAD records are reported separately, so only one type of record is required to use the program; however, both can be provided in the same session.

How archival metadata is parsed and analyzed

Once users submit their input parameters in the GUI, the program begins by accessing the specified lexicons from the given CSV file. For each lexicon, a “rule” is created for a SpaCy rule-based matcher, using the column name (e.g. RaceTerms or GenderTerms) as the name of the specific rule. The same SpaCy matcher object identifies matches to each of the several lexicons or “rules”. Once the matcher has been configured, the program assesses whether valid MARC or EAD records were given and starts reading in their data.

To access important pieces of data from each of these records, I used a Python library called BeautifulSoup to parse the XML files. For each individual record, the program parses the call numbers and collection or entry name so that information can be included in the CSV reports. For EAD records, the collection title and component titles are also parsed and analyzed for matches to the lexicons, along with any data that is in a paragraph (<p>) tag. For MARC records, the program also parses the author or creator of the item, the extent of the collection, and the timestamp of when the description of the item was last updated. In each MARC record, the 520 field (summary) and 545 field (biography/history note) are concatenated together and analyzed as a single entity.

Data from each record is stored in a Python dictionary with the names of fields (as strings) as keys mapping to the collection title, call number, etc. Each of these dictionaries is stored in a list, with a separate structure for EAD and MARC records.
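
The tool itself uses BeautifulSoup, but the same parse-and-store shape can be sketched with the standard library’s ElementTree (the element names below are simplified stand-ins, not real, namespaced EAD tags):

```python
import xml.etree.ElementTree as ET

sample_ead = """<ead>
  <unitid>RL.00001</unitid>
  <unittitle>Sample Family Papers</unittitle>
  <p>Correspondence, diaries, and photographs.</p>
  <p>Also contains clippings.</p>
</ead>"""

def parse_record(xml_text):
    """Pull the fields the audit reports need into one dict per record."""
    root = ET.fromstring(xml_text)
    return {
        "call_number": root.findtext("unitid"),
        "title": root.findtext("unittitle"),
        # All paragraph text is gathered so it can be scanned for matches
        "text": " ".join(p.text for p in root.findall("p") if p.text),
    }

ead_records = [parse_record(sample_ead)]  # one dict per record, kept in a list
```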

Once data has been parsed and stored, each record is checked for matches to the given lexicons using the SpaCy rule-based matcher. For each record, any matches that are found are then stored in the dictionary with the matching term, the context of the term (the entire field or surrounding few sentences, depending on length), and the rule the term matches (such as RaceTerms). These matches are found using simple tokenization from SpaCy that allows matches to be identified quickly and without regard for punctuation, capitalization, etc.
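
In spirit, that matching step looks something like this standard-library sketch: lower-cased, punctuation-insensitive token matching, with each hit tagged by its rule name. (The real tool uses SpaCy’s rule-based Matcher, which also handles multi-word phrases; the names here are illustrative.)

```python
import re

def find_matches(text, lexicons):
    """Return one match record per lexicon term found in the text,
    ignoring capitalization and punctuation."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    matches = []
    for rule_name, terms in lexicons.items():
        for term in terms:
            if term.lower() in tokens:
                matches.append({"rule": rule_name, "term": term,
                                "context": text})  # whole field as context
    return matches

hits = find_matches("Records of a Colored Farmers' Alliance chapter.",
                    {"RaceTerms": ["colored"], "GenderTerms": ["homemaker"]})
```

Each hit carries its rule name, which is what lets the CSV reports be filtered by lexicon category.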

Although this process doesn’t necessarily use the cutting edge of natural language processing that the SpaCy library makes accessible, it is adaptable in ways that matching procedures like regular expressions often aren’t. Moreover, identifying and remedying harmful language is a fundamentally human process which, at the end of the day, needs a significant level of input both from historically marginalized communities and from archivists.

Matches to any of the lexicons, along with all other associated data (the record’s call number, title, etc.), are then written into CSV files for further analysis and categorization by users. You can see sample CSV audit reports here. The second phase of manual categorization is still a lengthy process, yielding roughly 14,600 matches from the Rubenstein Library’s EAD data and 4,600 from its MARC data that must still be read through and analyzed by hand, but the process of identifying these matches has been computerized to take a mere ten minutes, where it could otherwise be a months-long process.

Categorizing matches: an archivist and community effort

An excerpt of initial data returned by the audit program for EAD records. This data should be further categorized manually to ensure a comprehensive and nuanced understanding of these instances of potentially harmful language.

To better understand these matches and create a strategy to remediate the harmful language they represent, it is important to consider each match in several different facets.

Looking at the context provided with each match allows archivists to understand the way in which the term was used. The remediation strategy for the use of a potentially harmful term in a proper noun used as a positive, self-identifying term, such as the National Association for the Advancement of Colored People, for example, is vastly different from that of a white person using the word “colored” as a racist insult.

The three ways in which I proposed we evaluate context are as follows:

  1. Match speaker: who was using the term? Was the sensitive term being used as a form of self-identification or reclaiming by members of a marginalized group, was it being used by an archivist, or was it used by someone with privilege over the marginalized group the term targets (e.g. a white person using an anti-Black term or a cisgender straight person using an anti-LGBTQ+ term)? Within this category, I proposed three potential categories for uses of a term: in-group, out-group, and archivist. If a term is used by a member (or members) of the identity group it references, its use is considered an in-group use. If the term is used by someone who is not a member of the identity group the term references, that usage of the term is considered out-group. Paraphrasing or dated term use by archivists is designated simply as archivist use.
  2. Match context: how was the term in question being used? Modifying the text used in a direct quote or a proper noun constitutes a far greater liberty by the archivist than removing a paraphrased section or completely archivist-written section of text that involved harmful language. Although this category is likely to evolve as more matches are categorized, my initial proposed categories are: proper noun, direct quote, paraphrasing, and archivist narrative.
  3. Match impact: what was the impact of the term? Was this instance a false positive, wherein the use of the term was in a completely unrelated and innocuous context (e.g. the use of the word “colored” to describe the colors used in visual media), or was the use of the term in fact harmful? Was the use of the term derogatory, or was it merely a mention of politicized identities? In many ways, determining the impact of a particular term or use of potentially harmful language is a community effort; if a community member with a marginalized identity says that the use of a term in that particular context is harmful to people with that identity, archivists are in no position to disagree or invalidate those feelings and experiences. The categories that I’ve laid out initially–dated original term, dated Rubenstein term, mention of marginalized issues, mention of marginalized identity, downplaying bias (e.g. calling racism and discrimination an issue with “race relations”), dehumanization of marginalized people, false positive–are meant only to serve as an entry point and a rudimentary categorization of these nuances to begin this process.
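For audits like this one, it can help to make the taxonomy explicit in the categorization tooling itself, so that manual work can't silently introduce misspelled categories. The sketch below encodes the three facets as a small Python record; the Match class, field names, and example are hypothetical illustrations, not the audit program's actual data model:

```python
# A minimal sketch of the three-facet categorization described above.
# Category names mirror the post; everything else is a hypothetical example.
from dataclasses import dataclass

SPEAKER = {"in-group", "out-group", "archivist"}
CONTEXT = {"proper noun", "direct quote", "paraphrasing", "archivist narrative"}
IMPACT = {
    "dated original term", "dated Rubenstein term",
    "mention of marginalized issues", "mention of marginalized identity",
    "downplaying bias", "dehumanization of marginalized people",
    "false positive",
}

@dataclass
class Match:
    term: str      # the flagged term itself
    snippet: str   # surrounding context pulled from the EAD record
    speaker: str   # who used the term
    context: str   # how the term was used
    impact: str    # what effect the usage has

    def __post_init__(self):
        # Guard against typos during manual categorization.
        if self.speaker not in SPEAKER:
            raise ValueError(f"unknown speaker category: {self.speaker}")
        if self.context not in CONTEXT:
            raise ValueError(f"unknown context category: {self.context}")
        if self.impact not in IMPACT:
            raise ValueError(f"unknown impact category: {self.impact}")

# Example: a proper-noun, in-group use is typically a false positive
# for remediation purposes.
m = Match(
    term="colored",
    snippet="National Association for the Advancement of Colored People",
    speaker="in-group",
    context="proper noun",
    impact="false positive",
)
```

Keeping the controlled vocabularies in one place also makes later tallies (e.g. how many out-group, derogatory uses appear per collection) straightforward.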
A short excerpt of categorized EAD metadata

Here you can find more documentation on the manual categorization strategy.

Categorizing each of these instances of potentially harmful language remains a time-consuming, meticulous process. Although much of this work can be computerized, decolonization is a fundamentally human and fundamentally community-centered practice. No computer can dismantle the colonial, white supremacist narratives that archival work often upholds. This work requires our full attention and, for better or for worse, a lot of time, even with the productivity boost technology gives us.

Once categories had been established, at least at a preliminary level, I found that roughly 100-200 instances of potentially harmful language could be manually parsed and categorized per hour.

Conclusion

Decolonization and anti-racist efforts in archival work are an ongoing process. It is bound to take active learning, reflection, and lots of remediation. However, using technology to start this process creates a much less daunting entry point. Anti-racism work is essential in archival spaces.

The ways we talk about history can either work to uphold traditional white supremacist, racist, ableist, etc. narratives, or they can work to dismantle them. In many ways, archival work has often upheld these narratives in the past; this audit, however, represents the sincere beginnings of work to further equitable narratives in the future.

FFV1: The Gains of Lossless

One of the greatest challenges to digitizing analog moving-image sources such as videotape and film reels isn’t the actual digitization. It’s the enormous file sizes that result, and the high costs associated with storing and maintaining those files for long-term preservation. For many years, Duke Libraries has generated 10-bit uncompressed preservation master files when digitizing our vast inventory of analog videotapes.

Unfortunately, one hour of uncompressed video can produce a 100 gigabyte file. That’s at least 50 times larger than an audio preservation file of the same duration, and about 1000 times larger than most still image preservation files. That’s a lot of data, and as we digitize more and more moving-image material over time, the long-term storage costs for these files add up quickly.
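The arithmetic behind that 100-gigabyte figure is easy to check. Here is a back-of-the-envelope sketch, assuming standard-definition NTSC video (720×486 pixels, 4:2:2 chroma subsampling, 10-bit samples, ~29.97 frames per second); actual sizes vary with format and audio:

```python
# Back-of-the-envelope size of one hour of 10-bit uncompressed SD video.
width, height = 720, 486     # NTSC standard-definition frame
bits_per_sample = 10
samples_per_pixel = 2        # 4:2:2: one luma + (on average) one chroma sample
fps = 30000 / 1001           # NTSC frame rate, ~29.97

bits_per_frame = width * height * samples_per_pixel * bits_per_sample
bytes_per_hour = bits_per_frame / 8 * fps * 3600
print(f"{bytes_per_hour / 1e9:.0f} GB per hour")  # roughly 94 GB, before audio
```

Adding audio tracks and container overhead brings the total to around 100 GB per hour.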

To help offset this challenge, Duke Libraries has recently implemented the FFV1 video codec as its primary format for moving image preservation. FFV1 was first created as part of the open-source FFmpeg software project, and has been developed, updated and improved by various contributors in the Association of Moving Image Archivists (AMIA) community.

FFV1 enables lossless compression of moving-image content. Just like uncompressed video, FFV1 delivers the highest possible image resolution, color quality and sharpness, while avoiding the motion compensation and compression artifacts that can occur with “lossy” compression. Yet, FFV1 produces a file that is, on average, 1/3 the size of its uncompressed counterpart.

FFV1 produces a file that is, on average, 1/3 the size of its uncompressed counterpart. Yet, the audio & video content is identical, thanks to lossless compression.

The algorithms used in lossless compression are complex, but if you’ve ever prepared for a fall backpacking trip, and tightly rolled your fluffy goose-down sleeping bag into one of those nifty little stuff-sacks, essentially squeezing all the air out of it, you just employed (a simplified version of) lossless compression. After you set up your tent, and unpack your sleeping bag, it decompresses, and the sleeping bag is now physically identical to the way it was before you packed.

Yet, during the trek to the campsite, it took up a lot less room in your backpack, just like FFV1 files take up a lot less room in our digital repository. Like that sleeping bag, FFV1 lossless compression ensures that the compressed video file is mathematically identical to its pre-compressed state. No data is “lost” or irreversibly altered in the process.

Duke Libraries’ Digital Production Center utilizes a pair of 6-foot-tall video racks, which house a current total of eight videotape decks covering a variety of obsolete formats: U-matic (NTSC), U-matic (PAL), Betacam, DigiBeta, VHS (NTSC) and VHS (PAL, Secam). Each deck’s output is converted from analog to digital (SDI) using Blackmagic Design Mini Converters.

The SDI signals are sent to a Blackmagic Design Smart Videohub, which is the central routing center for the entire system. Audio mixers and video transcoders allow the Digitization Specialist to tweak the analog signals so the waveform, vectorscope and decibel levels meet broadcast standards and the digitized video is faithful to its analog source. The output is then routed to one of two Retina 5K iMacs via Blackmagic UltraStudio devices, which convert the SDI signal to Thunderbolt 3.

FFV1 video digitization in progress in the Digital Production Center.

Because no major company (Apple, Microsoft, Adobe, Blackmagic, etc.) has yet adopted the FFV1 codec, multiple foundational layers of mostly open-source systems software had to be installed, tested and tweaked on our iMacs to make FFV1 work: Apple’s Xcode, Homebrew, AMIA’s vrecord, FFmpeg, Hex Fiend, AMIA’s ffmprovisr, GitHub Desktop, MediaInfo, and QCTools.

FFV1 digitization is driven by terminal commands, so some familiarity with command-line syntax is helpful for entering the correct prompts and deciphering the terminal logs.

The FFV1 files are “wrapped” in the open source Matroska (.mkv) media container. Our FFV1 scripts employ several degrees of quality-control checks, input logs and checksums, which ensure file integrity. The files can then be viewed using VLC media player, for Mac and Windows. Finally, we make an H.264 (.mp4) access derivative from the FFV1 preservation master, which can be sent to patrons, or published via Duke’s Digital Collections Repository.
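For readers curious what those commands look like, the sketch below builds typical invocations in Python. The FFV1 flags follow a commonly shared AMIA-style recipe, and the file names are placeholders; the DPC's actual scripts include additional quality-control steps not shown here:

```python
# Sketch of typical ffmpeg invocations for FFV1 preservation work.
# Flags follow a common AMIA-style recipe; not the DPC's exact scripts.

def ffv1_master_cmd(src: str, dst: str) -> list:
    """Losslessly encode a preservation master: FFV1 video in Matroska."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "ffv1",       # FFV1 lossless video codec
        "-level", "3",        # FFV1 version 3
        "-g", "1",            # every frame is an intra frame
        "-slices", "16",      # parallel slices per frame
        "-slicecrc", "1",     # per-slice CRCs for integrity checking
        "-c:a", "copy",       # pass audio through untouched
        dst,                  # e.g. master.mkv
    ]

def h264_access_cmd(src: str, dst: str) -> list:
    """Make a small H.264 access derivative from the FFV1 master."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "18",
        "-c:a", "aac",
        dst,                  # e.g. access.mp4
    ]

# To actually run one (requires ffmpeg on the PATH):
# import subprocess
# subprocess.run(ffv1_master_cmd("capture.mov", "master.mkv"), check=True)
```

The per-slice CRCs are one of the quality-control checks mentioned above: they let a later integrity scan detect corruption within an individual frame.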

An added bonus: not only can Duke Libraries digitize analog videotapes and film reels in FFV1, we can also use the codec (via scripting) to target a large batch of uncompressed video files digitized from analog sources years ago and make much smaller FFV1 copies that are mathematically lossless. The script runs checksums on both the original uncompressed video file and its new FFV1 counterpart, and verifies that the content inside each container is identical.
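One way such a verification can work is sketched below: ffmpeg's framemd5 muxer writes an MD5 checksum for every decoded frame, so if the per-frame checksums from the original uncompressed file and the new FFV1 file match, the video content is bit-identical. This is a hedged sketch, not the DPC's actual script; audio streams can be checked the same way by dropping the -an flag:

```python
import subprocess

def framemd5_report(path: str) -> str:
    """Decode a video and return ffmpeg's per-frame MD5 report as text."""
    result = subprocess.run(
        ["ffmpeg", "-i", path, "-an", "-f", "framemd5", "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def same_frames(report_a: str, report_b: str) -> bool:
    """True if two framemd5 reports list identical frame checksums.

    Comment lines (starting with '#') hold metadata that may legitimately
    differ between containers, so only the frame lines are compared.
    """
    def frames(report: str) -> list:
        return [ln for ln in report.splitlines()
                if ln and not ln.startswith("#")]
    return frames(report_a) == frames(report_b)

# Example (requires ffmpeg on the PATH):
# same_frames(framemd5_report("master_uncompressed.mov"),
#             framemd5_report("master_ffv1.mkv"))
```

Because the comparison decodes both files, it proves the FFV1 copy reconstructs every frame exactly, regardless of container differences.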

Now, a digital collection of uncompressed masters that took up 9 terabytes can be deleted, and the newly generated batch of FFV1 files, which takes up only 3 terabytes, becomes the preservation master set for that collection. No data has been lost, and the content is identical. Just like that goose-down sleeping bag, this helps Duke University’s budget managers sleep better at night.

2020 Highlights from Digital Collections

Welcome to the 2020 digital collections round up!

In spite of the dumpster fire that was 2020, Duke Digital Collections had a productive and action-packed year (maybe too action-packed at times).

Per usual, we launched new digital collections and added content to existing ones (full list below). We are also wrapping up our mega-migration from our old digital collections platform (Tripod2) to the Duke Digital Repository! This migration has been in process for 5 years–yes, 5 years. We plan to celebrate this exciting milestone more in January, so stay tuned.

A classroom and auditorium blueprint, digitized for a patron and launched this month.

The Digital Production Center, in collaboration with the Rubenstein Library, shifted to a new folder level workflow for patron and instruction requests. This workflow was introduced just in time for the pandemic and the resulting unprecedented number of digitization requests.  As a result of the demand for digital images, all project work has been put aside and the DPC is focusing on patron and instruction requests only. Since late June, the DPC has produced over 40,000 images!  

Another digital collections highlight from 2020 is the development of new features for our preservation and access interface, the Duke Digital Repository. We have wasted no time putting these new features to use, especially “metadata only” records and the DDR-to-CONTENTdm connection.

Looking ahead to 2021, our priority will be the folder-level digitization workflow for researcher and instruction requests. The DPC has received 200+ requests since June, and we need to get all those digitized folders moved into the repository. We are also experimenting with preserving scans created outside of the DPC. For example, Rubenstein Library staff created a huge number of access copies using reading room scanners, and we would like to make them available to others. Lastly, we have a few bigger digital collections to ingest and launch as well.

Thanks to everyone associated with Digital Collections for their incredible work this year!!  Whew, it has been…a year. 

One of our newest digital collections features postcards from Greece: Salonica / Selanik / Thessaloniki
One of the Radio Haiti photographs launched recently.

Laundry list of 2020 Digital Collections

New Collections

Digital Collections Additions

Migrated Collections

Access for One, Access for All: DPC’s Approach towards Folder Level Digitization

Earlier this year and prior to the pandemic, Digital Production Center (DPC) staff piloted an alternative approach to digitize patron requests with the Rubenstein Library’s Research Services (RLRS) team. The previous approach was focused on digitizing specific items that instruction librarians and patrons requested, and these items were delivered directly to that person. The alternative strategy, the Folder Level digitization approach, involves digitizing the contents of the entire folder that the item is contained in, ingesting these materials to the Duke Digital Repository (to enable Duke Library staff to retrieve these items), and when possible, publishing these materials so that they are available to anyone with internet access. This soft launch prepared us for what is now an all-hands-on-deck-but-in-a-socially-distant-manner digitization workflow.

Giao Luong Baker assessing folders in the DPC.

Since returning to campus for onsite digitization in late June, the DPC’s primary focus has been to perfect and ramp up this new workflow. It is important to note that the term “folder” in this case is more of a concept, and its contents and their conditions vary widely. Some folders have 2 pages; others have over 300. Some folders consist of pamphlets, notebooks, maps, papyri, and bound items. All this to say that a “folder” is a relatively loose term.

Like many initiatives at Duke Libraries, Folder Level Digitization is not just a DPC operation; it is a collaborative effort. This effort includes RLRS working with instructors and patrons to identify and retrieve the materials. RLRS also works with Rubenstein Library Technical Services (RLTS) to create starter digitization guides, which are the building blocks for our digitization guide. Lastly, RLRS vets the materials and determines their level of access. When necessary, Duke Library’s Conservation team steps in to prepare materials for digitization. After the materials are digitized, ingest and metadata work by the Digital Collections and Curation Services and RLTS teams ensures that the materials are preserved and available in our systems.

Kristin Phelps captures a color target.

Doing this work in the midst of a pandemic requires that DPC work closely with the Rubenstein Library Access Services Reproduction Team (a section of RLRS) to track our workflow using a Google Doc. We track the point where the materials are identified by RLRS, through multiple quarantine periods, scanning, post processing, file delivery, to ingest. Also, DPC staff are digitizing in a manner that is consistent with COVID-19 guidelines. Materials are quarantined before and after they arrive at the DPC, machines and workspaces are cleaned before and after use, capture is done in separate rooms, and quality control is done off site with specialized calibrated monitors.

Since we started Folder Level digitization, the DPC has received close to 200 unique Instruction and Patron requests from RLRS. As of the publication of this post, 207 individual folders (an individual request may contain several folders) have been digitized. In total, we’ve scanned and quality controlled over 26,000 images since we returned to campus!

By digitizing entire folders, we hope to increase access to the materials without risking damage through their physical handling. So far we anticipate that 80 new digital collections will be ingested into the Duke Digital Repository, and this number will only grow as we receive more requests. Folder Level Digitization is an exciting approach to digital collection development, as it is directly responsive to instruction and researcher needs. With this approach, it is access for one, access for all!

Here’s What Happened Next: The Duke Digital Production Center in the Era of the COVID-19 Pandemic

On March 20, 2020, the Duke University Libraries closed due to the COVID-19 pandemic.  Surrounded by a great deal of uncertainty as to when the Libraries would reopen, most library staff were sent home to work remotely for the coming months.  During this time, the Digital Production Center’s employees followed suit and, as part of that time away from the DPC, completed image post-processing and quality control, participated in project planning, and wrote blog posts on the closing of the Libraries, labor in the time of the coronavirus, and the history of videotelephony.  Following the end of the North Carolina Stay-at-Home order on April 29, discussions began in earnest about what the new reality would be for the Libraries.  It was determined that the DPC’s unique skill set was needed on site sooner rather than later, and so on June 26, we returned to Duke’s campus as “essential workers.”

Upon our return, we needed to make sure that our equipment was sanitized and in good working order.  Along with testing our scanner and cameras, we also recalibrated our monitors to ensure color accuracy and established our new workflow. 

It was determined that our efforts were most needed to prepare for Duke’s fall instruction materials.  With the uncertainty as to whether or not classes would be held in person or virtually, preparing digital materials to work with was prioritized.  So, we shifted from our normal project work to focus solely on digitization of these materials.  Each digitization specialist was asked to be onsite for 3 days a week to maximize use of our capture equipment.  The remaining two days of the week would be spent working from home to do quality control work on the images as well as various administrative tasks.  We had a plan; our remit was clear and we were working towards a goal.

On August 17, classes began at Duke University, and our images began being used as part of instruction materials.  Duke University Libraries’ digitized images helped bridge the gap between Fall 2020 students and the currently inaccessible library collections that Duke faculty and students normally rely on for coursework.

Thinking about the change in use and accessibility for collection materials leads to an interesting question:  with the lockdown that occurred across most of the US, did digital collections receive more visits while people were restricted from leaving home and libraries were closed?  A quick glance at the Google Analytics for the Duke Digital Collections shows a 34% increase in unique page views from April 1-June 30 of this year as compared to the same period in 2019.  While it is impossible to state definitively why the increase occurred, the pandemic is very likely a contributing factor.  Digital collections are arguably valuable assets for any institution that supports them.  They provide easy access to rarely seen or inaccessible materials, and they have the potential to incite curiosity about the larger institutional holdings.  It is indeed interesting to consider what types of innovative scholarship and creative use of digital content may result from the pandemic’s “forced” use of digital collections over the next twelve months.

Of course, the rapid onset of the COVID-19 pandemic illuminated the need for alternative ways of operating.  At least temporarily, it has changed the way in which the Duke University Libraries are conducting business as usual these days.  And, in July of this year, Research Libraries UK published a document entitled “COVID19 and the Digital Shift in Action.”  This document reports on the effect of the pandemic on UK research libraries and suggests strategies for emphasis and support of the digital aspects of libraries as well as the need for change and flexibility within library collections.  Digital collections, e-books, e-textbooks, and digital content had their moment to shine during the pandemic and they have proven their value and importance.

And with the potential increased reliance on digitized material, many cultural heritage digitization specialists are now back on site in libraries, museums and archives, working to provide their expertise to add to existing digital collections.  Naturally, at the Duke Digital Production Center, we have been asked a number of times since our return if we are nervous about being back in our studio space.  Of course, we are, but we also recognize how our skills and contributions continue to create value for Duke University and Duke University Libraries.

Further reading:

Biswas, P., & Marchesoni, J. “Analyzing Digital Collections Entrances: What Gets Used and Why It Matters.” Information Technology and Libraries, v. 35, n. 4, p. 19-34, 30 December 2016.

Greenhall, M. “Covid-19 and the digital shift in action,” RLUK Report. 2020.  Can be accessed at:  https://www.rluk.ac.uk/wp-content/uploads/2020/06/Covid19-and-the-digital-shift-in-action-report-FINAL.pdf

Markin, Pablo.  “Pandemic Restrictions on Library Borrowing Showcase the Importance of Digital Collections and the Advantages of Open Access.” Open Research Community.  11 August 2020.  https://openresearch.community/posts/pandemic-restrictions-on-library-borrowing-showcase-the-importance-of-digital-collections-and-the-advantages-of-open-access

Sharing data and research in a time of global pandemic, Part 2

[Header image from Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., Westman, E. (2020, August 07). Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Science Advances. https://advances.sciencemag.org/content/early/2020/08/07/sciadv.abd3083]

Back in March, just as things were rapidly shutting down across the United States, I wrote a post reflecting on how integral the practice of sharing and preserving research data would be to any solution to the crisis posed by COVID-19. While some of the language in that post seems a bit naive in retrospect (particularly the bit about RDAP’s annual meeting being one of the last in-person conferences of just the spring, as opposed to the entire calendar year!), the emphasis on the importance of rapid and robust data sharing has stood the test of time. In late June, the Research Data Alliance released a set of recommendations and guidelines for sharing research data under circumstances shaped by COVID-19, and a number of organizations, including the National Institutes of Health, have established portals for finding data related to the disease. Access to data has been at the forefront of many researchers’ minds.

Perhaps in response to this general sentiment (or maybe because folks haven’t been able to access their labs?!), we in the Libraries have seen a notable increase in the number of submissions to our Research Data Repository for data publication. These datasets derive from a broad range of disciplines, spanning environmental science to dermatology. I wanted to use this blog post as an opportunity to highlight a few of our accessions from the last several months.

One of our most prolific sources of data deposits has historically been the lab of Dr. Patrick Charbonneau, associate professor of Chemistry and Physics. Dr. Charbonneau’s lab investigates glass and its physical properties and contributes to a project known as The Simons Collaboration on Cracking the Glass Problem, which addresses issues like disorder, nonlinear response, and far-from-equilibrium dynamics. The most recent contribution from the group, published just last week, is fairly characteristic of the materials we receive from them. It contains the raw binary observational data and the scripts used to create the figures that appear in the researchers’ article. Making these research products available helps other scholars repeat or reproduce (and thereby strengthen) the findings elucidated in an associated research publication.

Fig01 / Fig02b, Data from: Finite-dimensional vestige of spinodal criticality above the dynamical glass transition

Another recent data deposit—a first of its kind for the RDR—is a Q-sort concourse for the Human Dimensions of Large Marine Protected Areas project, which investigates the formulation of large marine protected areas (defined by the project as “any ocean area larger than 100,000 km² that has been designated for the purpose of conservation”) as a global movement. Q-methodology is a psychology and social sciences research method used to study viewpoints. In this study, 40 interviewees were asked to evaluate statements related to large-scale marine protected areas. Q-sorts can be particularly helpful when researchers wish to describe subjective viewpoints related to an issue.

Q sort record sheet from: Q-Sort Concourse and Data for the Human Dimensions of Large MPAs project

Finally, perhaps our most timely deposit has come from a group investigating an alternate method to evaluate the efficacy of masks to reduce the transmission of respiratory droplets during regular speech. “Low-cost measurement of facemask efficacy for filtering expelled droplets during speech,” published last week in Science Advances, is a proof-of-concept study that proposes an optical measurement technique that the group asserts is both inexpensive and easy to use. Because the topic of measuring mask efficiency is still both complex and unsettled, the group hopes this work will help improve evaluation in order to guide mask selection and policy decisions.

Screenshot of Speaker1_None_05.mp4, Video data from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech

The dataset consists of a series of movie recordings that capture an operator wearing a face mask and speaking in the direction of an expanded laser beam inside a dark enclosure. Droplets that propagate through the laser beam scatter light, which is then recorded with a cell phone camera. The group tested 12 kinds of masks (see below), and recorded 2 sets of controls with no mask.

Figure 2 from Low-cost measurement of facemask efficacy for filtering expelled droplets during speech

We hope to keep up the momentum our data management, curation, and publication program has gained over the last few months, but we need your help! For more information on using the Duke Research Data Repository to share and preserve your data, please visit our website, or drop us a line at datamangement@duke.edu. A full list of the datasets we’ve published since moving to fully remote operations in March is available below.

  • Zhang, Y. (2020). Data from: Contributions of World Regions to the Global Tropospheric Ozone Burden Change from 1980 to 2010. Duke Research Data Repository. https://doi.org/10.7924/r40p13p11
  • Campbell, L. M., Gray, N., & Gruby, R. (2020). Data from: Q-Sort Concourse and Data for the Human Dimensions of Large MPAs project. Duke Research Data Repository. https://doi.org/10.7924/r4j38sg3b
  • Berthier, L., Charbonneau, P., & Kundu, J. (2020). Data from: Finite-dimensional vestige of spinodal criticality above the dynamical glass transition. Duke Research Data Repository. https://doi.org/10.7924/r4jh3m094
  • Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., Westman, E. (2020). Video data files from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Duke Research Data Repository. V2 https://doi.org/10.7924/r4ww7dx6q
  • Lin, Y., Kouznetsova, T., Chang, C., Craig, S. (2020). Data from: Enhanced polymer mechanical degradation through mechanochemically unveiled lactonization. Duke Research Data Repository. V2 https://doi.org/10.7924/r4fq9x365
  • Chavez, S. P., Silva, Y., & Barros, A. P. (2020). Data from: High-elevation monsoon precipitation processes in the Central Andes of Peru. Duke Research Data Repository. V2 https://doi.org/10.7924/r41n84j94
  • Jeuland, M., Ohlendorf, N., Saparapa, R., & Steckel, J. (2020). Data from: Climate implications of electrification projects in the developing world: a systematic review. Duke Research Data Repository. https://doi.org/10.7924/r42n55g1z
  • Cardones, A. R., Hall, III, R. P., Sullivan, K., Hooten, J., Lee, S. Y., Liu, B. L., Green, C., Chao, N., Rowe Nichols, K., Bañez, L., Shah, A., Leung, N., & Palmeri, M. L. (2020). Data from: Quantifying skin stiffness in graft-versus-host disease, morphea and systemic sclerosis using acoustic radiation force impulse imaging and shear wave elastography. Duke Research Data Repository. https://doi.org/10.7924/r4h995b4q
  • Caves, E., Schweikert, L. E., Green, P. A., Zipple, M. N., Taboada, C., Peters, S., Nowicki, S., & Johnsen, S. (2020). Data and scripts from: Variation in carotenoid-containing retinal oil droplets correlates with variation in perception of carotenoid coloration. Duke Research Data Repository. https://doi.org/10.7924/r4jw8dj9h
  • DiGiacomo, A. E., Bird, C. N., Pan, V. G., Dobroski, K., Atkins-Davis, C., Johnston, D. W., Ridge, J. T. (2020). Data from: Modeling salt marsh vegetation height using Unoccupied Aircraft Systems and Structure from Motion. Duke Research Data Repository. https://doi.org/10.7924/r4w956k1q
  • Hall, III, R. P., Bhatia, S. M., Streilein, R. D. (2020). Data from: Correlation of IgG autoantibodies against acetylcholine receptors and desmogleins in patients with pemphigus treated with steroid sparing agents or rituximab. Duke Research Data Repository. https://doi.org/10.7924/r4rf5r157
  • Jin, Y., Ru, X., Su, N., Beratan, D., Zhang, P., & Yang, W. (2020). Data from: Revisiting the Hole Size in Double Helical DNA with Localized Orbital Scaling Corrections. Duke Research Data Repository. https://doi.org/10.7924/r4k072k9s
  • Kaleem, S. & Swisher, C. B. (2020). Data from: Electrographic Seizure Detection by Neuro ICU Nurses via Bedside Real-Time Quantitative EEG. Duke Research Data Repository. https://doi.org/10.7924/r4mp51700
  • Yi, G. & Grill, W. M. (2020). Data and code from: Waveforms optimized to produce closed-state Na+ inactivation eliminate onset response in nerve conduction block. Duke Research Data Repository. https://doi.org/10.7924/r4z31t79k
  • Flanagan, N., Wang, H., Winton, S., Richardson, C. (2020). Data from: Low-severity fire as a mechanism of organic matter protection in global peatlands: thermal alteration slows decomposition. Duke Research Data Repository. https://doi.org/10.7924/r4s46nm6p
  • Gunsch, C. (2020). Data from: Evaluation of the mycobiome of ballast water and implications for fungal pathogen distribution. Duke Research Data Repository. https://doi.org/10.7924/r4t72cv5v
  • Warnell, K., & Olander, L. (2020). Data from: Opportunity assessment for carbon and resilience benefits on natural and working lands in North Carolina. Duke Research Data Repository. https://doi.org/10.7924/r4ww7cd91