Category Archives: Technology

The Shortest Year

Featured image – screenshot from the Sunset Tripod2 project charter.

Realizing that my most recent post here went up more than a year ago, I pause to reflect. What even happened over these last twelve months? Pandemic and vaccine, election and insurrection, mandates and mayhem – outside of our work bubble, October 2020 to October 2021 has been a churn of unprecedented and often dark happenings. Bitstreams, however, broadcasts from inside the bubble, where we have modeled cooperation and productivity, met many milestones, and kept our collective cool, despite working nearly 100% remotely as a team, with our stakeholders, and across organizational lines.

Last October, I wrote about Sunsetting Tripod2, a homegrown platform for our digital collections and archival finding aids that was also the final service we had running on a physical server. “Firm plans,” I said we had for the work that remained. Still, in looking toward that setting sun, I worried about “all sorts of comical and embarrassing misestimations by myself on the pages of this very blog over the years.” I was optimistic, but cautiously so, that we would banish the ghosts of Django-based systems past.

Reader, I have returned to Bitstreams to tell you that we did it. Sometime in Q1 of 2021, we said so long, farewell, adieu to Tripod2. It was a good feeling, like when you get your laundry folded, or your teeth cleaned, only better.

However, we did more in the past year than just power down exhausted old servers. What follows are a few highlights from the work of the software developers in Duke University Libraries’ Digital Strategies and Technology division, and of our collaborators (whom we cannot thank or praise enough), over the past twelve months.

In November, Digital Projects Developer Sean Aery posted Implementing ArcLight: A Reflection. The work of replacing and improving upon our platform for the Rubenstein Library’s collection guides was one of the main components that allowed us to turn off Tripod2. We actually completed it in July of 2020, but that team earned its Q4 victory laps, including Sean’s post and a session at Blacklight Summit a few days after my own post last October.

As the new year began, the MorphoSource team rolled out version 2.0 of that platform. MorphoSource Repository Developer Jocelyn Triplett shared A Preview of MorphoSource 2 Beta in these pages on January 20. The launch took place on February 1.

One project we had underway as I was writing last October was the integration of Globus, a transfer service for large datasets, into the Duke Research Data Repository. We completed that work in Q1 of 2021, prompting our colleague, Senior Research Data Management Consultant Sophia Lafferty-Hess, to post Share More Data in the Duke Research Data Repository! in a neighboring location that shares our charming cul-de-sac of library blogs.

The seventeen months since the murder of George Floyd have seen major changes in how we think and talk about race in the Libraries. We committed ourselves to the DUL Racial Justice Roadmap, a pathway for recognizing and attacking the pervasive influence of white supremacy in our society, in higher education, at Duke, in the field of librarianship, in our library, in the field of information technology, and in our own IT practices. During this time, members of our division have also participated broadly in DiversifyIT, a campus-wide group of IT professionals who seek to foster a culture of inclusion “by providing professional development, networking, and outreach opportunities.”

Digital Projects Developer Michael Daul shared his own point of view with great thoughtfulness in his April post, What does it mean to be an actively antiracist developer? He touched on representation in the IT industry, acknowledging bias, being aware of one’s own patterns of communication, and bringing these ideas to the systems we build and maintain. 

One of the ideas that Michael identified for software development is web accessibility; as he wrote, we can “promote the benefits of building accessible interfaces that follow the practices of universal design.” We put that idea into action a few months later, as Sean described in precise technical terms in his July post, Automated Accessibility Testing and Continuous Integration. Currently that process applies to the ArcLight platform, but when we have a chance, we’ll see if we can expand it to other services.

The question of when we’ll have that chance is a big one, as it hinges on the undertaking that now dominates our attention. Over the past year we have ramped up on the migration of our website from Drupal 7 to Drupal 9, to head off the end-of-life for 7. This project has transformed into the raging beast that our colleagues at NC State Libraries warned us it would become at Code4Lib Southeast in May of 2019.

Screenshot of NC State Libraries presentation on Drupal migration
They warned us – Screenshot from “Drupal 7 to Drupal 8: Our Journey,” by Erik Olson and Meredith Wynn of NC State Libraries’ User Experience Department, presented at Code4Lib Southeast in May of 2019.

We are on a path to complete the Drupal migration in March 2022 – we have “firm plans,” you could say – and I’m certain that its various aspects will come to feature in Bitstreams in due time. For now I will mention that it spawned two sub-projects that have challenged our team over the past six months or so, both of which involve refactoring functionality previously implemented as Drupal modules into standalone Rails applications:

  1. Quicksearch, aka unified search, aka “Bento search” – see Michael’s Bento is Coming! from 2014 – is now a standalone app; it also uses the open-source tool Apache Nutch, rather than Google CSE.
  2. The staff directory app that went live in 2019, which Michael wrote about in Building a new Staff Directory, also no longer runs as a Drupal module.

Each of these implementations was necessary to prepare the way for a massive migration of theme and content that will take place over the coming months. 

Screenshot of a Jira issue related to the Decouple Staff Directory project.

When it’s done, maybe we’ll have a chance to catch our breath. Who can really say? I could not have guessed a year ago where we’d be now, and anyway, the period of the last twelve months gets my nod as the shortest year ever. Assuming we’re here, whatever “here” means in the age of remote/hybrid/flexible work arrangements, then I expect we’ll be burning down backlogs, refactoring this or that, deploying some service, and making firm plans for something grand.

Using an M1 Mac for development work

Due to a battery issue with my work laptop (an Intel-based MacBook Pro), I had an opportunity to try using a newer (ARM-based) M1 Mac to do development work. Since roughly a year had passed since these new machines had been introduced, I assumed the kinks would have been generally worked out, and I was excited to give my speedy new M1 Mac Mini a test run at some serious work. However, upon trying to make some updates to a recent project (by the way, we launched our new staff directory!) I ran into many stumbling blocks.

M1 Mac Mini ensconced beneath multitudes of cables in my home office

My first step in starting with a new machine was to get my development environment set up. On my old laptop I’d typically use Homebrew for managing packages and RVM (and previously rbenv) for Ruby version management in different projects. I tried installing the tools normally and ran into multitudes of weirdness. Some guides suggested setting up a parallel version of Homebrew (ibrew) using Rosetta (which is a translation layer for running Intel-native code). So I tried that – and then ran into all kinds of issues with managing Ruby versions. Oh, and also apparently RVM / rbenv are no longer cool and you should be using chruby or asdf. So I tried those too, and ran into more problems. In the end, I stumbled on this amazing script by Moncef Belyamani. It was really simple to run and it just worked. Yay – working dev environment!

We’ve been using Docker extensively in our library projects over the past few years, and the Staff Directory was set up to run inside a container on our local machines. So my next step was to get Docker up and running. The light research I’d done suggested that Docker was more or less working now with M1 Macs, so I dived in thinking things would go smoothly. I installed Docker Desktop (hopefully not a bad idea) and tried to build the project, but bundle install failed. The Staff Directory project is built in Ruby on Rails, and in this instance was using the therubyracer gem, which embeds the V8 JavaScript library. However, I learned that the particular version of the V8 library used by therubyracer is not compiled for ARM and breaks the build. And as you tend to do when running into questions like these, I went down a rabbit hole of potential work-arounds. I tried manually installing a different version of the V8 library and getting the bundle process to use that instead, but never quite got it working. I also explored using a different gem (like mini_racer) that would correctly compile for ARM, or just using Node instead of V8, but neither was a good option for this project. So I was stuck.

Building the Staff Directory app in Docker

My next attempt at a solution was to try setting up a remote Docker host. I’ve got a file server at home running TrueNAS, so I was able to easily spin up an Ubuntu VM on that machine and set up Docker there. You could do something similar using Duke’s VCM service. I followed various guides, set up user accounts and permissions, generated SSH keys, and with some trial and error I was finally able to get things running correctly. You can set up a context for a Docker remote host and switch to it (something like: docker context use ubuntu), and then your subsequent Docker commands point to that remote, making development work entirely seamless. It’s kind of amazing. And it worked great when testing with a hello-world app like whoami. Running docker run --rm -it -p 80:80 containous/whoami worked flawlessly. But anything that was more complicated, like running an app that used two containers as was the case with the Staff Directory app, seemed to break. So stuck again.

After consulting with a few of my brilliant colleagues, another option was suggested, and this ended up being the best work-around: take my same Ubuntu VM and, instead of setting it up as a Docker remote host, use it as the development server and set up a tunnel connection to it (something like: ssh -N -L localhost:8080:localhost:80 docker@ip.of.VM.machine) such that I would be able to view running webpages at localhost:8080. This approach requires the extra step of pushing code up to the git repository from the Mac and then pulling it back down on the VM, but that only takes a few extra keystrokes. And having a viable dev environment is well worth the hassle, IMHO!

As Apple moves away from Intel-based machines – rumors seem to indicate that the new MacBook Pros coming out this fall will be ARM-only – I think these development issues will start to be talked about more widely. And hopefully some smart people will be able to get everything working well with ARM. But in the meantime, running Docker on a Linux VM via a tunnel connection seems like a relatively painless way to ensure that more complicated Docker/Rails projects can be worked on locally using an M1 Mac.

Auditing Archival Description for Harmful Language: A Computer and Community Effort

This post was written by Miriam Shams-Rainey, a third-year undergraduate at Duke studying Computer Science and Linguistics with a minor in Arabic. As a student employee in the Rubenstein’s Technical Services Department in the Summer of 2021, Miriam helped build a tool to audit archival description in the Rubenstein for potentially harmful language. In this post, she summarizes her work on that project.

The Rubenstein Library has collections ranging across centuries. Its collections are massive and often contain rare manuscripts or one-of-a-kind data. However, with this wide-ranging history often comes language that is dated, harmful, often racist, sexist, homophobic, and/or colonialist. As important as it is to find and remediate these instances of potentially harmful language, there is a lot of data that must be searched.

With over 4,000 collection guides (finding aids) and roughly 12,000 catalog records describing archival collections, archivists would need to spend months combing through their metadata to find harmful or problematic language before even starting to find ways to handle this language. That is, unless there was a way to optimize this workflow.

Working under Noah Huffman’s direction and the imperatives of the Duke Libraries’ Anti-Racist Roadmap, I developed a Python program capable of finding occurrences of potentially harmful language in library metadata and recording them for manual analysis and remediation. What would have taken months of work can now be done in a few button clicks and ten minutes of processing time. Moreover, the tools I have developed are accessible to any interested parties via a GitHub repository to modify or expand upon.

Although these gains in speed push metadata language remediation efforts at the Rubenstein forward significantly, a computer can only take this process so far; once uses of this language have been identified, the responsibility of determining the impact of the term in context falls onto archivists and the communities their work represents. To this end, I have also outlined categories of harmful language occurrences to act as a starting point for archivists to better understand the harmful narratives their data uphold, and to develop best practices to dismantle them.

Building an automated audit tool

Audit Tool GUI Screenshot
The simple, yet user-friendly interface that allows archivists to customize the search audit to their specific needs.

I created an executable that allows users to interact with the program regardless of their familiarity with Python or with using their computer’s command line. With an executable, all that a user must do is simply click on the program (titled “description_audit.exe”) and the script will load with all of its dependencies in a self-contained environment. There’s nothing that a user needs to install, not even Python.

Within this executable, I also created a user interface to allow users to set up the program with their specific audit parameters. To use this program, users should first create a CSV file (spreadsheet) containing each list of words they want to look for in their metadata.

Snippet of Lexicon CSV
Snippet from a sample lexicon CSV file containing harmful terms to search

In this CSV file of “lexicons,” each category of terms should have its own column; for example, RaceTerms could be the first row in a column of terms such as “colored” or “negro,” and GenderTerms could be the first row in a column of gendered terms such as “homemaker” or “wife.” See these lexicon CSV file examples.

Once this CSV has been created, users can select this CSV of lexicons in the program’s user interface and then select which columns of terms they want the program to use when searching across the source metadata. Users can either use all lexicon categories (all columns) by default or specify a subset by typing out those column headers. For the Rubenstein’s purposes, there is also a rather long lexicon called HateBase (from a regional, multilingual database of potential hate speech terms often used in online moderating) that is only enabled when a checkbox is checked; users from other institutions can download the HateBase lexicon for themselves and use it or they can simply ignore it.

In the CSV reports that are output by the program, matches for harmful terms and phrases will be tagged with the specific lexicon category the match came from, allowing users to filter results to certain categories of potentially harmful terms.

Users also need to designate a folder on their desktop where report outputs should be stored, along with the folder containing their source EAD records in .xml format and their source MARCXML file containing all of the MARC records they wish to process as a single XML file. Results from MARC and EAD records are reported separately, so only one type of record is required to use the program; however, both can be provided in the same session.

How archival metadata is parsed and analyzed

Once users submit their input parameters in the GUI, the program begins by accessing the specified lexicons from the given CSV file. For each lexicon, a “rule” is created for a SpaCy rule-based matcher, using the column name (e.g. RaceTerms or GenderTerms) as the name of the specific rule. The same SpaCy matcher object identifies matches to each of the several lexicons or “rules”. Once the matcher has been configured, the program assesses whether valid MARC or EAD records were given and starts reading in their data.
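
To make this concrete, here is a minimal sketch of how a lexicon-driven matcher like this could be configured using spaCy's PhraseMatcher (one flavor of its rule-based matching). The file name, column names, and helper functions here are illustrative assumptions, not the project's actual code:

import csv
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model is needed for phrase matching

def load_lexicons(csv_path, selected_columns=None):
    # Each column header in the CSV is a category name; each cell below it is a term
    lexicons = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for category, term in row.items():
                if term and (selected_columns is None or category in selected_columns):
                    lexicons[category].append(term.strip())
    return lexicons

def build_matcher(lexicons):
    # One "rule" per lexicon category, matched case-insensitively on lowercased tokens
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    for category, terms in lexicons.items():
        matcher.add(category, [nlp.make_doc(t) for t in terms])
    return matcher

# Example: audit using only the RaceTerms and GenderTerms columns
matcher = build_matcher(load_lexicons("lexicons.csv", {"RaceTerms", "GenderTerms"}))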

To access important pieces of data from each of these records, I used a Python library called BeautifulSoup to parse the XML files. For each individual record, the program parses the call numbers and collection or entry name so that information can be included in the CSV reports. For EAD records, the collection title and component titles  are also parsed to be analyzed for matches to the lexicons, along with any data that is in a paragraph (<p>) tag. For MARC records, the program also parses the author or creator of the item, the extent of the collection, and the timestamp of when the description of the item was last updated. In each MARC record, the 520 field (summary)  and 545 field (biography/history note) are all concatenated together and analyzed as a single entity.

Data from each record is stored in a Python dictionary with the names of fields (as strings) as keys mapping to the collection title, call number, etc. Each of these dictionaries is stored in a list, with a separate structure for EAD and MARC records.
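
As a rough illustration of this parsing-and-storage step, the sketch below (continuing from the one above) pulls a few fields out of a single EAD finding aid with BeautifulSoup and collects one dictionary per record. The element names (unitid, unittitle, p) are typical EAD tags, and the folder path and function name are assumptions:

from pathlib import Path

from bs4 import BeautifulSoup

def parse_ead_file(path):
    # Parse one EAD XML finding aid and keep the fields we want to report on
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "lxml-xml")  # XML-aware parser (requires lxml)
    unitid = soup.find("unitid")
    unittitle = soup.find("unittitle")
    return {
        "call_number": unitid.get_text(strip=True) if unitid else "",
        "collection_title": unittitle.get_text(strip=True) if unittitle else "",
        # Free-text fields that will later be scanned for lexicon matches
        "text_fields": [p.get_text(" ", strip=True) for p in soup.find_all("p")],
    }

# One dictionary per finding aid, collected into a list for the EAD report
ead_records = [parse_ead_file(path) for path in sorted(Path("ead_folder").glob("*.xml"))]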

Once data has been parsed and stored, each record is checked for matches to the given lexicons using the SpaCy rule-based matcher. For each record, any matches that are found are then stored in the dictionary with the matching term, the context of the term (the entire field or surrounding few sentences, depending on length), and the rule the term matches (such as RaceTerms). These matches are found using simple tokenization from SpaCy that allows matches to be identified quickly and without regard for punctuation, capitalization, etc.
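
A simplified version of that matching loop, reusing the matcher, nlp object, and records from the sketches above, might look like this (again, the names are illustrative):

def find_matches(record, matcher, nlp):
    # Scan each free-text field of a parsed record and note any lexicon hits
    hits = []
    for field_text in record["text_fields"]:
        doc = nlp(field_text)
        for match_id, start, end in matcher(doc):
            hits.append({
                "term": doc[start:end].text,
                "category": nlp.vocab.strings[match_id],  # e.g. "RaceTerms"
                "context": field_text,  # or a window of surrounding sentences
            })
    record["matches"] = hits
    return record

ead_records = [find_matches(record, matcher, nlp) for record in ead_records]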

Although this process doesn’t necessarily use the cutting edge of natural language processing that the SpaCy library makes accessible, it is adaptable in ways that matching procedures like regular expressions often aren’t. Moreover, identifying and remedying harmful language is a fundamentally human process which, at the end of the day, needs a significant level of input both from historically marginalized communities and from archivists.

Matches to any of the lexicons, along with all other associated data (the record’s call number, title, etc.), are then written into CSV files for further analysis and categorization by users. You can see sample CSV audit reports here. The second phase of manual categorization is still a lengthy process: roughly 14,600 matches from the Rubenstein Library’s EAD data and 4,600 from its MARC data must still be read through and analyzed by hand. But the process of identifying these matches has been computerized to take a mere ten minutes, where it could otherwise be a months-long process.
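
Writing the report itself is straightforward; a minimal sketch with Python's csv module, reusing the records from the sketches above (the column names are my own choices, not the actual report format), could look like:

import csv

def write_report(records, out_path):
    # Flatten the nested match data into one CSV row per match for manual review
    fieldnames = ["call_number", "collection_title", "category", "term", "context"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:
            for match in record.get("matches", []):
                writer.writerow({
                    "call_number": record["call_number"],
                    "collection_title": record["collection_title"],
                    **match,
                })

write_report(ead_records, "ead_audit_report.csv")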

Categorizing matches: an archivist and community effort

An excerpt of initial data returned by the audit program for EAD records. This data should be further categorized manually to ensure a comprehensive and nuanced understanding of these instances of potentially harmful language.

To better understand these matches and create a strategy to remediate the harmful language they represent, it is important to consider each match in several different facets.

Looking at the context provided with each match allows archivists to understand the way in which the term was used. The remediation strategy for the use of a potentially harmful term in a proper noun used as a positive, self-identifying term, such as the National Association for the Advancement of Colored People, for example, is vastly different from that of a white person using the word “colored” as a racist insult.

The three ways in which I proposed we evaluate context are as follows:

  1. Match speaker: who was using the term? Was the sensitive term being used as a form of self-identification or reclaiming by members of a marginalized group, was it being used by an archivist, or was it used by someone with privilege over the marginalized group the term targets (e.g. a white person using an anti-Black term or a cisgender straight person using an anti-LGBTQ+ term)? Within this category, I proposed three potential categories for uses of a term: in-group, out-group, and archivist. If a term is used by a member (or members) of the identity group it references, its use is considered an in-group use. If the term is used by someone who is not a member of the identity group the term references, that usage of the term is considered out-group. Paraphrasing or dated term use by archivists is designated simply as archivist use.
  2. Match context: how was the term in question being used? Modifying the text used in a direct quote or a proper noun constitutes a far greater liberty by the archivist than removing a paraphrased section or completely archivist-written section of text that involved harmful language. Although this category is likely to evolve as more matches are categorized, my initial proposed categories are: proper noun, direct quote, paraphrasing, and archivist narrative.
  3. Match impact: what was the impact of the term? Was this instance a false positive, wherein the use of the term was in a completely unrelated and innocuous context (e.g. the use of the word “colored” to describe the colors used in visual media), or was the use of the term in fact harmful? Was the use of the term derogatory, or was it merely a mention of politicized identities? In many ways, determining the impact of a particular term or use of potentially harmful language is a community effort; if a community member with a marginalized identity says that the use of a term in that particular context is harmful to people with that identity, archivists are in no position to disagree or invalidate those feelings and experiences. The categories that I’ve laid out initially–dated original term, dated Rubenstein term, mention of marginalized issues, mention of marginalized identity, downplaying bias (e.g. calling racism and discrimination an issue with “race relations”), dehumanization of marginalized people, false positive–only hope to serve as an entry point and rudimentary categorization of these nuances to begin this process.
A short excerpt of categorized EAD metadata

Here you can find more documentation on the manual categorization strategy.

Categorizing each of these instances of potentially harmful language remains a time-consuming, meticulous process. Although much of this work can be computerized, decolonization is a fundamentally human and fundamentally community-centered practice. No computer can dismantle the colonial, white supremacist narratives that archival work often upholds. This work requires our full attention and, for better or for worse, a lot of time, even with the productivity boost technology gives us.

Once categories have been established, at least on a preliminary level, I found that about 100-200 instances of potentially harmful language could be manually parsed and categorized in an hour.

Conclusion

Decolonization and anti-racist efforts in archival work are an ongoing process. It is bound to take active learning, reflection, and lots of remediation. However, using technology to start this process creates a much less daunting entry point. Anti-racism work is essential in archival spaces.

The ways we talk about history can either work to uphold traditional white supremacist, racist, ableist, etc. narratives, or they can work to dismantle them. In many ways, archival work has often upheld these narratives in the past; however, this audit represents the sincere beginnings of work to further equitable narratives in the future.

On Protecting Patron Privacy

First, a bit of history

Back in the summer of 2018, calls for applications for the National Web Privacy Forum started circulating around the library community. I’ll be honest — at that point I knew almost nothing about how libraries protect patron privacy. That summer I’d been conducting a library data inventory, interviewing stakeholders of various data systems across the library, and I had just gotten my first hints of some of the processes we use to protect the data we collect from patrons.

Long story short, Duke Libraries submitted an application to the Forum, we were selected, and I attended. The experience was really meaningful, and it gave me a nice overview of the various issues that affect a library’s ability to protect patron privacy. The following spring (2019), the leaders of the National Forum released an action handbook that recommended conducting a data privacy audit, and DUL undertook such an audit during the Fall of 2019. The results of that audit suggested that we still have a bit of work to do to make sure all of our systems are working together to protect our patrons.

Forming a task force

In response to the audit report, Duke Libraries charged a task force called the Data Privacy and Retention Task Force. Despite the pandemic and lockdown, this task force started meeting in the spring of 2020, and we met biweekly for the rest of the year. Our goals were to develop guiding principles and priorities around data privacy and retention, as well as to recommend specific project work that should be undertaken to improve our systems.

The task force included staff members from across the various divisions of the library. Pretty quickly, we determined that we all come with different experiences around patron privacy. We decided to begin with a sort of book club, identifying and reviewing introductory materials related to different components of patron privacy, from web analytics to the GDPR to privacy in archives and special collections. Once we all felt a bit more knowledgeable, we turned our attention to creating a statement of our priorities and principles.

Defining our values

There are a lot of existing statements of library values, and many make mention of patron privacy. Other documents that cover privacy values include regulatory documents and organizational privacy statements. Some of the statements we reviewed include:

  • Duke University Libraries' Strategic Plan

While these statements are all relevant, the task force found some of them far too general to truly guide action for an organization. We were looking to create a document that outlined more specifics and helped us make decisions about how to organize our work. At Duke Libraries, we already have one document we use to organize our work and make decisions — our strategic plan.

When we reviewed the strategic plan, we noticed that for each section of the plan, a focus on patron privacy resulted in a set of implications for our work. To express these implications, we devised a rough hierarchy of directed action, indicating our ability and obligation to undertake certain actions.  We use the following terms in our final report:

For actions within our sphere of influence:

  • obligation: DUL should devote significant time and resources toward this work
  • responsibility: DUL should make a concerted effort toward this work, but the work may not receive the same attention and resources as that devoted to our obligations

For actions outside our sphere of influence:

  • commitment: DUL will need to partner with other groups to perform this work and thus cannot promise to accomplish all tasks

An example of our principles and priorities

One section from our strategic plan is Strategic Priority #2: Our Libraries Teach and Support Emerging Literacies. Within this priority, the strategic plan identifies the following goals:

  1. Expand the presence of library staff in the student experience in order to understand and support emerging scholarship, information, data, and literacy needs
  2. Mentor first-year students in scholarly research and learning practices, embracing and building upon their diverse backgrounds, prior knowledge, literacies, and expectations as they begin their Duke experience.
  3. Partner with faculty to develop research methods, curricula, and collaborative projects connecting their courses to our collections.
  4. Enhance the library instruction curriculum, focusing on standards and best practices for pedagogy that will prepare users for lifelong learning in a global and ever-changing research environment.

In our final report, Priorities and Guiding Principles for Protecting Patron Privacy, we identify the following actions for this same strategic priority:

  • We have an obligation to communicate in plain language what data we and our partners collect while providing our services.
  • We have a responsibility to provide education, tools, and collection materials to shed light on the general processes of information exchange behind technology systems.
  • We commit to partnering with researchers seeking to understand the effects of information exchange processes and related policy interventions.

We now have the strategic plan, which outlines types of activities we might undertake, and the new report on protecting patron privacy, which adds to that list new activities and methods to achieve patron privacy protections in each area.

Next steps

The final work of the task force was to propose new project work based on our identified priorities and principles. The task force will share a list of recommended projects with library administration, who will start the hard work of evaluating these projects and identifying staff to undertake them. In the meantime, we hope the report will offer immediate guidance to staff for considerations they should be taking in different areas of their work, as well as serving as a model for future documents that guide our efforts.

Automated Accessibility Testing and Continuous Integration

We use several different tools and strategies at the Duke Libraries to ensure that our web interfaces are accessible. Our aim is to comply with WCAG 2.0 AA and Section 508 guidelines.

One of our favorite accessibility checking tools in our toolbox is the axe DevTools Browser Extension by Deque Systems. It’s easy to use, whether on a live site, or in our own local development environments while we’re working on building new features for our applications. Simply open your browser’s Developer Tools (F12), click the Scan button, and get an instant report of any violations, complete with recommendations about how to fix them.

screenshot of axe DevTools browser extension in action
An axe DevTools test result for our archival finding aids homepage.

Now, it’s one thing to make software compliant; it’s another to keep it that way. Duke’s Web Accessibility office astutely notes:

Keeping a website compliant is a continuous effort. Websites are living things. Content changes, features are added. When a site becomes compliant it does not stay compliant.

One of the goals we set out to accomplish in 2021 was to figure out how to add automated, continuous accessibility testing for our ArcLight software, which powers our archival finding aids search and discovery application. We got it implemented successfully a few months ago and we’re pleased with how it has been working so far.

The solution: using the Deque Systems Axe Core RSpec gem, along with a Selenium Standalone Chrome Docker image, in our GitLab CI pipeline.

Say what?

Let me back up a bit and give some background on the concepts and tools upon which this solution depends. Namely: Continuous Integration, RSpec, Capybara, Selenium, and Docker.

Continuous Integration

For any given software project, we may have several developers making multiple changes to a shared codebase in the same day. Continuous Integration is the practice of ensuring that these code changes 1) don’t break any existing functionality, and 2) comply with established guidelines for code quality.

At Duke Libraries, we use GitLab to host code repositories for most of our software (e.g., our ArcLight application). It’s similar to GitHub, with built-in tooling for DevOps. One such tool is GitLab CI/CD, which we use to manage continuous integration. It’s common for us to configure pipelines to run automated tasks for each new code branch or tag, in at least a few stages, e.g.: build > test > deploy.

screenshot of GitLab CI pipeline stages
Evidence of a GitLab CI pipeline that has run — and passed all stages — for a code change.

For accessibility testing, we knew we needed to add something to our existing CI pipeline that would check for compliance and flag any issues.

RSpec Testing Framework

Many of our applications are built using Ruby on Rails. We write tests for our code using RSpec, a popular testing framework for Rails applications. Developers write tests (using the framework’s DSL / domain-specific language) to accompany their code changes. Those tests all execute as part of our CI pipeline (see above). Any failing tests will prevent code from being merged or getting deployed to production.

RSpec logo

There are many different types of tests.  On one end of the spectrum, there are “unit tests,” which verify that one small piece of code (e.g., one method) returns what we expect it to when it is given different inputs. On the other, there are “feature tests,” which typically verify that several pieces of code are working together as intended in different conditions. This often simulates the use of a feature by a person (e.g., when a user clicks this button, test that they get to this page and verify this link gets rendered, etc.). Feature tests might alternatively be called “integration tests,” “acceptance tests,” or even “system tests” — the terminology is both squishy and evolving.

At any rate, accessibility testing is a specific kind of feature test.

Capybara

On its own, RSpec unfortunately doesn’t natively support feature tests. It requires a companion piece of software called Capybara, which can simulate a user interacting with a web interface. Capybara brings with it a DSL to visit pages, fill out forms, or click on elements within RSpec tests, and special matchers to check that the page is behaving as intended.

screenshot of Capybara homepage
Homepage for Capybara, featuring an actual capybara.

When configuring Capybara, you set up the driver you want it to use when running different kinds of tests. Its default driver is RackTest, which is fast, but it can’t execute JavaScript like a real web browser can. Our ArcLight UI, for instance, uses a bunch of JavaScript. So we knew that any accessibility tests would have to be performed using a driver for an actual browser; the default driver alone wouldn’t cut it.

Selenium

Writing code to drive various browsers was probably a nightmare until Selenium came along. Selenium is an “umbrella project for a range of tools and libraries that enable and support the automation of web browsers.” Its WebDriver platform gives you a language-agnostic coding interface that is now compatible with all the major web browsers. That makes it a valuable component in automated browser testing.

screenshot of Selenium WebDriver website
Documentation for Selenium WebDriver for browser automation

The best way we could find to get Capybara to control a real browser in an RSpec test was to add the selenium-webdriver gem to our project’s Gemfile.

Docker

Over the past few years, we have evolved our DevOps practice and embraced containerizing our applications using Docker. Complex applications that have a lot of interwoven infrastructure dependencies used to be quite onerous to build, run, and share. Getting one’s local development environment into shape to successfully run an application used to be a whole-day affair. Worse still, the infrastructure on the production server used to bear little resemblance to what developers were working with on their local machines.

systems diagram for ArcLight infrastructure
A systems diagram depicting the various services that run to support our ArcLight app.

Docker helps a dev team configure in their codebase all of these system dependencies in a way that’s easily reproducible in any environment. It builds a network of “containers” on a single host, each running a service that’s crucial to the application. Now a developer can simply check out the code, run a couple of commands, wait a few minutes, and they’re good to go.

The same basic setup also applies in a production environment (with a few easily-configurable differences). And that simplicity also carries over to the CI environment where the test suite will run.

Systems diagram for ArcLight, highlighting containerized components of the infrastructure
Orange boxes depict each service / container we have defined in our Docker configuration.

So what we needed to do was add another container to our existing Docker configuration that would be dedicated to running any JavaScript-dependent feature tests — including accessibility tests — in a browser controlled by Selenium WebDriver.

Keeping this part containerized would hopefully ensure that when a developer runs the tests in their local environment, the exact same browser version and drivers get used in the CI pipeline. We can steer clear of any “well, it worked on my machine” issues.

Putting it All Together

Phew. OK, with all of that background out of the way, let’s look closer at how we put all of these puzzle pieces together.

Gems

We had to add the following three gems to our Gemfile’s test group:

group :test do
  gem 'axe-core-rspec' # accessibility testing
  gem 'capybara'
  gem 'selenium-webdriver'
end

A Docker Container for Selenium

The folks at Selenium HQ host “standalone” browser Docker images in Docker Hub, complete with the browser software and accompanying drivers. We found a tagged version of their Standalone Chrome image that worked well, and pull that into our newly-defined “selenium” container for our test environments.

In docker-compose.test.yml

services:
  selenium:
    image: selenium/standalone-chrome:3.141.59-xenon
    ports:
      - 4444:4444
    volumes:
      - /dev/shm:/dev/shm
    environment:
      - JAVA_OPTS=-Dwebdriver.chrome.whitelistedIps=
      - START_XVFB=false
[...]

Since this is new territory for us, and already fairly complex, we’re starting with just one browser: Chrome. We may be able to add more in the future.

Capybara Driver Configuration

Next thing we needed to do was tell Capybara that whenever it encounters any javascript-dependent feature tests, it should run them in our standalone Chrome container (using a “remote” driver).  The Selenium WebDriver gem lets us set some options for how we want Chrome to run.

The key setting here is “headless” — that is, run Chrome, but more efficiently, without all the fancy GUI stuff that a real user might see.

In spec_helper.rb

Capybara.javascript_driver = :selenium_remote

Capybara.register_driver :selenium_remote do |app|
  capabilities = Selenium::WebDriver::Remote::Capabilities.chrome(
    chromeOptions: { args: [
      'headless',
      'no-sandbox',
      'disable-gpu',
      'disable-infobars',
      'window-size=1400,1000',
      'enable-features=NetworkService,NetworkServiceInProcess'
    ] }
  )

  Capybara::Selenium::Driver.new(app,
                                 browser: :remote,
                                 desired_capabilities: capabilities,
                                 url: 'http://selenium:4444/wd/hub')
end

That last URL http://selenium:4444/wd/hub is the location of our Chrome driver within our selenium container.

There are a few other important Capybara settings configured in spec_helper.rb that are needed in order to get our app and selenium containers to play nicely together.

Capybara.server = :puma, { Threads: '1:1' }
Capybara.server_port = '3002'
Capybara.server_host = '0.0.0.0'
Capybara.app_host = "http://app:#{Capybara.server_port}"
Capybara.always_include_port = true
Capybara.default_max_wait_time = 30 # our ajax responses are sometimes slow
Capybara.enable_aria_label = true
[...]

The server_port, server_host and app_host variables are the keys here. Basically, we’re saying:

  • Capybara (which runs in our app container) should start up Puma to run the test app, listening on http://0.0.0.0:3002 for requests beyond the current host during a test.
  • The selenium container (where the Chrome browser resides) should access the application under test at http://app:3002 (since it’s in the app container).

Some Actual RSpec Accessibility Tests

Here’s the fun part, where we actually get to write the accessibility tests. The axe-core-rspec gem makes it a breeze. The be_axe_clean matcher ensures that if we have a WCAG 2.0 AA or Section 508 violation, it’ll trip the wire and report a failing test.

In accessibility_spec.rb

require 'spec_helper'
require 'axe-rspec'

RSpec.describe 'Accessibility (WCAG, 508, Best Practices)', type: :feature, js: true, accessibility: true do
  describe 'homepage' do
    it 'is accessible' do
      visit '/'
      expect(page).to be_axe_clean
    end
  end

  [...]
end

With type: :feature and js: true we signal to RSpec that this block of tests should be handled by Capybara and must be run in our headless Chrome Selenium container.

The example above is the simplest case: 1) visit the homepage and 2) do an Axe check. We also make sure to test several different kinds of pages, and with some different variations on UI interactions. E.g., test after clicking to open the Advanced Search modal.

The CI Pipeline

We started out having accessibility tests run along with all our other RSpec tests during the test stage in our GitLab CI pipeline. But we eventually determined it was better to keep accessibility tests isolated in a separate job, one that would not block a code merge or deployment in the event of a failure. We use the accessibility: true tag in our RSpec accessibility test blocks (see the above example) to distinguish them from other feature tests.

No, we don’t condone pushing inaccessible code to production! It’s just that we sometimes get false positives — violations are reported where there are none — particularly in Javascript-heavy pages. There are likely some timing issues there that we’ll work to refine with more configuration.

A Successful Accessibility Test

Here’s a completed job in our CI pipeline logs where the accessibility tests all passed:

Screenshot of passing CI pipeline, including accessibility

Screenshot from successful accessibility test job in GitLab CI
Output from a successful automated accessibility test job run in a GitLab CI pipeline.

Our GitLab CI logs are not publicly available, so here’s a brief snippet from a successful test.

An Accessibility Test With Failures

Screenshot displaying a failed accessibility test job

Here’s a CI pipeline for a code branch that adds two buttons to the homepage with color contrast and aria-label violations. The axe tests flag the issues as FAILED and recommend revisions (see snippet from the logs).

Concluding Thoughts

Automation and accessibility testing are both rapidly evolving areas, and the setup that’s working for our ArcLight app today might look considerably different within the next several months. Still, I thought it’d be useful to pause and reflect on the steps we took to get automated accessibility testing up and running. This strategy would be reasonably reproducible for many other applications we support.

A lot of what I have outlined could also be accomplished with variations in tooling. Don’t use GitLab CI? No problem — just substitute your own CI platform. The five most important takeaways here are:

  1. Accessibility testing is important to do, continually
  2. Use continuous integration to automate testing that used to be manual
  3. Containerizing helps streamline continuous integration, including testing
  4. You can run automated browser-based tests in a ready-made container
  5. Deque’s open source Axe testing tools are easy to use and pluggable into your existing test framework

Many thanks to David Chandek-Stark (Duke) for architecting a large portion of this work. Thanks also to Simon Choy (Duke), Dann Bohn (Penn St.), and Adam Wead (Penn St.) for their assistance helping us troubleshoot and understand how these pieces fit together.


The banner image in this post uses three icons from the FontAwesome Free 5 icon set, unchanged, and licensed under a CC-BY 4.0 license.


REVISION 7/29/21: This post was updated, adding links to snippets from CI logs that demonstrate successful accessibility tests vs. those that reveal violations.

FFV1: The Gains of Lossless

One of the greatest challenges to digitizing analog moving-image sources such as videotape and film reels isn’t the actual digitization. It’s the enormous file sizes that result, and the high costs associated with storing and maintaining those files for long-term preservation. For many years, Duke Libraries has generated 10-bit uncompressed preservation master files when digitizing our vast inventory of analog videotapes.

Unfortunately, one hour of uncompressed video can produce a 100 gigabyte file. That’s at least 50 times larger than an audio preservation file of the same duration, and about 1000 times larger than most still image preservation files. That’s a lot of data, and as we digitize more and more moving-image material over time, the long-term storage costs for these files can grow exponentially.

To help offset this challenge, Duke Libraries has recently implemented the FFV1 video codec as its primary format for moving image preservation. FFV1 was first created as part of the open-source FFmpeg software project, and has been developed, updated and improved by various contributors in the Association of Moving Image Archivists (AMIA) community.

FFV1 enables lossless compression of moving-image content. Just like uncompressed video, FFV1 delivers the highest possible image resolution, color quality and sharpness, while avoiding the motion compensation and compression artifacts that can occur with “lossy” compression. Yet, FFV1 produces a file that is, on average, 1/3 the size of its uncompressed counterpart.

sleeping bag
FFV1 produces a file that is, on average, 1/3 the size of its uncompressed counterpart. Yet, the audio & video content is identical, thanks to lossless compression.

The algorithms used in lossless compression are complex, but if you’ve ever prepared for a fall backpacking trip, and tightly rolled your fluffy goose-down sleeping bag into one of those nifty little stuff-sacks, essentially squeezing all the air out of it, you just employed (a simplified version of) lossless compression. After you set up your tent, and unpack your sleeping bag, it decompresses, and the sleeping bag is now physically identical to the way it was before you packed.

Yet, during the trek to the campsite, it took up a lot less room in your backpack, just like FFV1 files take up a lot less room in our digital repository. Like that sleeping bag, FFV1 lossless compression ensures that the compressed video file is mathematically identical to its pre-compressed state. No data is “lost” or irreversibly altered in the process.

Duke Libraries’ Digital Production Center utilizes a pair of 6-foot-tall video racks, which house a current total of eight videotape decks covering a variety of obsolete formats such as U-matic (NTSC), U-matic (PAL), Betacam, DigiBeta, VHS (NTSC) and VHS (PAL, Secam). Each deck’s output is converted from analog to digital (SDI) using Blackmagic Design Mini Converters.

The SDI signals are sent to a Blackmagic Design Smart Videohub, which is the central routing center for the entire system. Audio mixers and video transcoders allow the Digitization Specialist to tweak the analog signals so the waveform, vectorscope and decibel levels meet broadcast standards and the digitized video is faithful to its analog source. The output is then routed to one of two Retina 5K iMacs via Blackmagic UltraStudio devices, which convert the SDI signal to Thunderbolt 3.

FFV1 video digitization in progress in the Digital Production Center.

Because no major company (Apple, Microsoft, Adobe, Blackmagic, etc.) has yet adopted the FFV1 codec, multiple foundational layers of mostly open-source systems software had to be installed, tested and tweaked on our iMacs to make FFV1 work: Apple’s Xcode, Homebrew, AMIA’s vrecord, FFmpeg, Hex Fiend, AMIA’s ffmprovisr, GitHub Desktop, MediaInfo, and QCTools.

FFV1 operates via terminal command-line prompts, so some understanding of command-line syntax is helpful for entering the correct prompts and deciphering the terminal logs.

The FFV1 files are “wrapped” in the open source Matroska (.mkv) media container. Our FFV1 scripts employ several degrees of quality-control checks, input logs and checksums, which ensure file integrity. The files can then be viewed using VLC media player, for Mac and Windows. Finally, we make an H.264 (.mp4) access derivative from the FFV1 preservation master, which can be sent to patrons, or published via Duke’s Digital Collections Repository.

An added bonus is that, not only can Duke Libraries digitize analog videotapes and film reels in FFV1, we can also utilize the codec (via scripting) to target a large batch of uncompressed video files (that were digitized from analog sources years ago) and make much smaller FFV1 copies that are mathematically lossless. The script runs checksums on both the original uncompressed video file and its new FFV1 counterpart, and verifies that the content inside each container is identical.
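
The post doesn't include the script itself, but the general shape of such a batch job might look like the following sketch. It assumes FFmpeg is installed, uses a set of commonly recommended FFV1 encoding options (not necessarily the DPC's exact settings), and compares FFmpeg's framemd5 output — per-frame checksums of the decoded video — before and after; the folder names are placeholders:

import subprocess
from pathlib import Path

def video_frame_hashes(path):
    # Per-frame MD5s of the decoded video stream, independent of container and timestamps
    result = subprocess.run(
        ["ffmpeg", "-i", str(path), "-an", "-f", "framemd5", "-"],
        capture_output=True, text=True, check=True,
    )
    # framemd5 lines end with the hash; skip the "#" header lines and keep just that column
    return [line.rsplit(",", 1)[-1].strip()
            for line in result.stdout.splitlines()
            if line and not line.startswith("#")]

def to_ffv1(src, dest):
    # Losslessly re-encode an uncompressed master to FFV1 in a Matroska container
    subprocess.run(
        ["ffmpeg", "-i", str(src),
         "-c:v", "ffv1", "-level", "3", "-g", "1", "-slicecrc", "1",
         "-c:a", "copy",
         str(dest)],
        check=True,
    )

Path("ffv1_masters").mkdir(exist_ok=True)
for master in sorted(Path("uncompressed_masters").glob("*.mov")):
    ffv1_copy = Path("ffv1_masters") / (master.stem + ".mkv")
    to_ffv1(master, ffv1_copy)
    # Only treat the FFV1 file as the new preservation master if the decoded video matches;
    # audio streams can be checked the same way (or with FFmpeg's streamhash muxer)
    if video_frame_hashes(master) != video_frame_hashes(ffv1_copy):
        raise RuntimeError(f"Frame checksum mismatch for {master.name}")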

Now, a digital collection of uncompressed masters that took up 9 terabytes can be deleted, and the newly generated batch of FFV1 files, which takes up only 3 terabytes, becomes the new set of preservation masters for that collection. But no data has been lost, and the content is identical. Just like that goose-down sleeping bag, this helps the Duke University budget managers sleep better at night.

Implementing ArcLight: A Reflection

Around this time last year, I wrote about our ambitious plans to implement ArcLight software for archival discovery and access at Duke in 2020. While this year has certainly laid waste to so many good intentions, our team persisted through the cacophony undeterred, and—I’m proud to report—still hit our mark of going live on July 1, 2020 after a six-month work cycle. The site is available at https://archives.lib.duke.edu/.

Now that we have been live for a while, I thought it’d be worthwhile to summarize what we accomplished, and reflect a bit on how it’s going.

Working Among Peers

I had the pleasure of presenting about ArcLight at the Oct 2020 Blacklight Summit alongside Julie Hardesty (Indiana University) and Trey Pendragon (Princeton University). The three of us shared our experiences implementing ArcLight at our institutions. Though we have the same core ArcLight software underpinning our apps, we have each taken different strategies to build on top of it. Nevertheless, we’re all emerging with solutions that look polished and fill in various gaps to meet our unique local needs. It’s exciting to see how well the software holds up in different contexts, and to be able to glean inspiration from our peers’ platforms.

Title slide to ArcLight@Duke presentation at Blacklight Summit
Slides from ArcLight@Duke presentation, 10/7/2020

A lot of content in this post will reiterate what I shared in the presentation.

Custom-Built Features

Back in April, I discussed at length our custom-built features and interface revisions that we had completed by the halfway point for the project. So, now let’s look closer at everything else we added in the second half (and in the post-launch period).

Browsing Collection Contents

This is one of the hardest things to get right in a finding aids UI, so our solution has evolved through many iterations. We created a context sidebar with lightly-animated loading indicators matching the number of items currently loading. The nav sticks with you as you scroll down the page and the Request button stays visible.  We also decided to present a list of direct child components in the main page body for any parent component.

Restrictions

At the collection level, we wanted to ensure that users didn’t miss any restrictions info, so we presented a taste of it at the top-right of the page that jumps you to the full description when clicking “More.”

Collection restrictions snippet

We changed how access and use restrictions are indexed so that components can inherit their restrictions from any ancestor component. Then we made bright yellow banners and icons in the UI to signify that a component has restrictions.

Restrictions presented on a component

Hierarchical Record Group Browse

Using the excellent Blacklight Hierarchy plugin, we developed a way to browse University Archives collections by an existing hierarchical Record Group classification system. We encoded the group numbers, titles, nesting, and description in a YAML config file so they’re easy to change as they evolve.

Browse by Record Group

Digital Repository & Bento Search Integration

ArcLight exists among a wide constellation of other applications supporting and promoting discovery in the library, so integrating with these other pieces was an important part of our implementation. In April, I showed the interaction between ArcLight and our Requests app, as well as rendering digital object viewers/players inline via the Duke Digital Repository (DDR).

Inline digital object viewing via the DDR

Two other locations external to our application now use ArcLight’s APIs to retrieve archival information.  The first is the Duke Digital Repository (DDR). When viewing a digital collection or digital object that has a physical counterpart in the archives, we pull archival information for the item into the DDR interface from ArcLight’s JSON API.

Duke Digital Repository pulls archival info from ArcLight

The other is our “Bento” search application powering the default All search available from the library website. Now when your query finds matches in ArcLight, you’ll see component-level results under a Collection Guides bento box. Components are contextualized with a linked breadcrumb trail.

ArcLight search results presented in Bento search UI

Bookmarks Export CSV

COVID-19 brought about many changes to how staff at Duke Libraries retrieve materials for faculty and student research. You may have heard Duke’s Library Takeout song (819K YouTube views & counting!), and if you have, you probably can’t ever un-hear it.

But with archival materials, we’re talking about items that could never be taken out of the building. Materials may only be accessed in a controlled environment in the Rubenstein Reading Room, which remains highly restricted.  With so much Duke instruction moving online during COVID, we urgently needed to come up with a better workflow to field an explosion of requests for digitizing archival materials for use in remote instruction.

ArcLight’s Bookmarks feature (which comes via Blacklight) proved to be highly valuable here. We extended the feature to add a CSV export. The CSV is constructed in a way that makes it function as a digitization work order that our Digital Collections & Curation Services staff use to shepherd a request through digitization, metadata creation, and repository ingest. Over 26,000 images have now been digitized for patron instruction requests using this new workflow.

Bookmarks export to CSV

More Features

Here’s a list of several other custom features we completed after the April midway point.

  • Relevancy optimization
  • WCAG 2.0 AA accessibility
  • ARKs & permalinks
  • Advanced search modal
  • Catalog record links
  • Dynamic sitemaps (via gem)
  • Creative Commons / RightsStatements.org integration
  • Twitter card metadata
  • Open Graph metadata
  • Google Analytics event tracking with Anonymize IP
  • Debug mode for relevancy tuning

Data Pipeline

Bringing ArcLight online required some major rearchitecting of our pipeline to preview and publish archival data. Our archivists have been using ArchivesSpace for several years to manage the source data, exporting EAD2002 XML files when they are ready to be read by the public UI. Those parts remain the same for now; everything else, however, is new and improved.
Data pipeline diagram for Duke finding aids

Our new process involves two GitLab repositories: one for the EAD data, and another for the ArcLight-based application. The data repo uses GitLab webhooks to send POST requests to the app, automatically queuing reindexing jobs whenever the data changes. We also have a test/preview branch for the data that updates our dev and test servers for the application, so archivists can easily see what any revised or new finding aids will look like before they go live in production.
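
As a sketch of how that hand-off might look inside a Rails app (the controller, job class, and environment variable names here are placeholders, not our production code): GitLab’s push webhook payload lists the added and modified files, so the receiver can queue a reindex job for just the EAD files that changed.

# Sketch of a GitLab webhook receiver; controller, job class, and env var
# names are placeholders. GitLab sends the webhook's secret token in the
# X-Gitlab-Token header and lists changed files in the push payload.
class WebhooksController < ApplicationController
  skip_before_action :verify_authenticity_token

  def gitlab
    unless request.headers["X-Gitlab-Token"] == ENV["GITLAB_WEBHOOK_TOKEN"]
      return head :unauthorized
    end

    payload = JSON.parse(request.body.read)
    changed = payload.fetch("commits", []).flat_map do |commit|
      commit.fetch("added", []) + commit.fetch("modified", [])
    end

    changed.select { |path| path.end_with?(".xml") }.uniq.each do |path|
      Resque.enqueue(IndexFindingAidJob, path) # hypothetical job class
    end

    head :ok
  end
end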

We use GitLab CI/CD to easily and automatically deploy changes to the application code to the various servers. Each code change gets systematically checked for passing unit and feature tests, security, and code style before being integrated. We also aim to add automated accessibility testing to our pipeline within the next couple months.

A lot of data gets crunched while indexing EAD documents through Traject into Solr. Our app uses Resque-based background job processing to handle the transactions. With about 4,000 finding aids, this creates around 900,000 Solr documents; the index is currently a little over 1GB. Changes to data get reindexed and reflected in the UI near-instantaneously. If we ever need to reindex every finding aid, it takes only about one hour to complete.
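
The hypothetical job class from the webhook sketch above might look something like this, wrapping a Traject run for a single EAD file. The queue name and the traject config path are assumptions, not our actual setup.

# Hypothetical Resque job that indexes one EAD file through Traject into Solr.
# The queue name and the traject config path are assumptions.
class IndexFindingAidJob
  @queue = :index

  def self.perform(ead_path)
    # Shell out to traject with an EAD-to-Solr indexing config; raises if
    # the command exits non-zero (Ruby 2.6+).
    system("bundle", "exec", "traject",
           "-c", "lib/traject/ead_to_solr_config.rb",
           ead_path,
           exception: true)
  end
end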

What We Have Learned

We have been live for just over four months, and we’re really ecstatic about how everything is going.

Usability

In September 2020, our Assessment & User Experience staff conducted ten usability tests using our ArcLight UI, with five experienced archival researchers and five novice users. Kudos to Joyce Chapman, Candice Wang, and Anh Nguyen for their excellent work. Their report is available here. The tests were conducted remotely over Zoom due to COVID restrictions. This was our first foray into remote usability testing.

Remote usability testing screenshot

Novice and advanced participants alike navigated the site fairly easily and understood the contextual elements in the UI. We’re quite pleased with how well our custom features performed (especially the context sidebar, contents lists, and redesigned breadcrumb trail). The Advanced Search modal got more use than we had anticipated, and it too was effective. We were also somewhat surprised to find that users were not confused by the All Collections vs. This Collection search scope selector when searching the site.

“The interface design does a pretty good job of funneling me to what I need to see… Most of the things I was looking for were in the first place or two I’d suspect they’d be.” — Representative quote from a test participant

A few improvements were recommended as a result of the testing:

  1. make container information clearer, especially within the requesting workflow
  2. improve visibility of the online access facet
  3. make the Show More links in the sidebar context nav clearer
  4. better delineate between collections and series in the breadcrumb
  5. replace jargon with clearer labels, especially “Indexed Terms”

We recently implemented changes to address 2, 3, and 5; we’re still considering options for 1 and 4. Usability testing has been an invaluable part of our development process. It’s a joy (and often a humbling experience!) to see your design work put through its paces by actual users, and it always helps us understand what we’re doing so we can make things better.

Usage

We want to learn more about how often different parts of the UI are used, so we implemented Google Analytics event tracking to anonymously log interactions. We use the Anonymize IP feature to help protect patron privacy.

Google Analytics Event Tracking
Top Google Analytics event categories & actions, Jul 1 – Nov 20, 2020.

Some observations so far:

  • The context nav sidebar is by far the most interacted-with part of the UI.
  • Browsing the Contents section of a component page (list of direct child components) is the second-most frequent interaction.
  • Subject, Collection, & Names are the most-used facets, in that order. That does not correlate with the order they appear in the sidebar.
  • Links presented in the Online Access banners were clicked 5x more often than the limiter in the Online Access facet (which matches what we found in usability testing).
  • Basic keyword searches happen 32x more frequently than advanced searches.

Search Engine Optimization (SEO)

We want to be sure that when people search Google for terms that appear in our finding aids, they discover our resources. So when several Blacklight community members combined forces to create a Blacklight Dynamic Sitemaps gem this past year, it caught our eye. We found it super easy to set up, and it got the vast majority of our collection records Google-indexed within a month or so. We are interested in exploring ways to get it to include the component records in the sitemap as well.

Google Search Console showing index performance


Launching ArcLight: Retrospective

We’re pretty proud of how this all turned out. We have accomplished a lot in a relatively short amount of time. And the core software will only improve as the community grows.

At Duke, we already use Blacklight to power a bunch of different discovery applications in our portfolio. And given that the responsibility of supporting ArcLight falls to the same staff who support all of those other apps, it has been unquestionably beneficial for us to be able to work with familiar tooling.

We did encounter a few hurdles along the way, mostly because the software is so new and not yet widely adopted. There are still some rough edges that need to be smoothed out in the core software. Documentation is pretty sparse. We found indexing errors and had to adjust some rules. Relevancy ranking needed a lot of work. Not all of the EAD elements and attributes are accounted for; some things aren’t indexed or displayed in an optimal way.

Still, the pros outweigh the cons by far. With ArcLight, you get an extensible Blacklight-based core tailored specifically to archival data. All the things Blacklight shines at (facets, keyword highlighting, autosuggest, bookmarks, APIs, etc.) are right at your fingertips. We have had a very good experience finding and using Blacklight plugins to add desired features.

Finally, while the ArcLight community is currently small, the larger Blacklight community is not. There is so much amazing work happening out in the Blacklight community–so much positive energy! You can bet it will eventually pay dividends toward making ArcLight an even better solution for archival discovery down the road.

Acknowledgments

Many thanks go out to our Duke staff members who contributed to getting this project completed successfully. Especially:

  • Product Owner: Noah Huffman
  • Developers/DevOps: Sean Aery, David Chandek-Stark, Michael Daul, Cory Lown (scrum master)
  • Project Sponsors: Will Sexton & Meghan Lyon
  • Redesign Team: Noah Huffman (chair), Joyce Chapman, Maggie Dickson, Val Gillispie, Brooke Guthrie, Tracy Jackson, Meghan Lyon, Sara Seten Berghausen

And thank you as well to the Stanford University Libraries staff for spearheading the ArcLight project.


This post was updated on 1/7/21, adding the embedded video recording of the Oct 2020 Blacklight Summit ArcLight presentation.

Sunsetting Tripod2

Featured image – Wayback Machine capture of the Tripod2 beta site in February 2011.

We all design and create platforms that work beautifully for us, that fill us with pride as they expand and grow to meet our programmatic needs, and all the while the world changes around us, the programs scale beyond what we envisioned, and what was once perfectly adaptable becomes unsustainable, appearing to us all of a sudden as a voracious, angry beast, threatening to consume us, or else a rickety contraption, teetering on the verge of a disastrous collapse. I mean, everyone has that experience, right?

In March of 2011, a small team consisting primarily of me and fellow developer Sean Aery rolled out a new, homegrown platform, Tripod2. It became the primary point of access for Duke Digital Collections, the Rubenstein Library’s collection guides, and a handful of metadata-only datasets describing special collections materials. Within a few years, we had already begun talking about migrating all the Tripod2 stuff to new platforms. Yet nearly a decade after its rollout, we still have important content that depends on that platform for access.

Nevertheless, we have made significant progress. Sunsetting Tripod2 became a priority for one of the teams in our Digital Preservation and Publishing Program last year, and we vowed to follow through by the end of 2020. We may not make that target, but we do have firm plans for the remaining work. The migration of digital collections to the Duke Digital Repository has been steady, and nears its completion. This past summer, we rolled out a new platform for the Rubenstein collection guides, based on the ArcLight framework. And we now have a plan to handle the remaining instances of metadata-only databases, a plan that itself relies on the new collection guides platform.

We built Tripod2 on the triptych of Python/Django, Solr, and a document base of METS files. There were many areas of functionality that we never completely developed, but it gave us a range of capability that was crucial in our thinking about digital collections a decade ago – the ability to customize, to support new content types, and to highlight what made each digital collection unique. In fact, the earliest public statement that I can find revealing the existence of Tripod2 is Sean’s blog post, “An increasingly diverse range of formats,” from almost exactly ten years ago. As Sean wrote then, “dealing with format complexity is one of our biggest challenges.”

As the years went by, a number of factors made it difficult to keep Tripod2 current with the scope of our programs and the changing of web technology. The single most prevalent factor was the expanding scope of the Duke Digital Collections program, which began to take on more high-volume digitization efforts. We started adding all of our new digital collections to the Duke Digital Repository (DDR) starting in 2015, and the effort to migrate from Tripod2 to the repository picked up soon thereafter. That work was subject to all sorts of comical and embarrassing misestimations by myself on the pages of this very blog over the years, but thanks to the excellent work by Digital Collections and Curation Services, we are down to the final stages.

Collection and item counters from the Duke Digital Repository’s homepage for Duke Digital Collections, taken from the Internet Archive’s Wayback Machine, approximately a year apart in 2018, 2019, and 2020. The volume of digital collections has roughly doubled in that time, due to both the addition of new collections, and the migration of collections from Tripod2.

Moving digital collections to the DDR went hand-in-hand with far less customization, and far less developer intervention to publish a new collection. Where we used to have developers dedicated to individual platforms, we now work together more broadly as a team, and promote redundancy in our development and support models as much as we can. In both our digital collections program and our approach to software development, we are more efficient and more process-driven.

Given my record of predictions about our work on this blog, I don’t want to be too forward in announcing this transition. We all know that 2020 doesn’t suffer fools gladly, or maybe it suffers some of them but not others, and maybe I shouldn’t talk about 2020 just like this, right out in the open, where 2020 can hear me. So I’ll just leave it here – in recent times, we have done a lot of work toward saying goodbye to Tripod2. Perhaps soon we shall.

EDTF-Humanize 2.0 with Improved Internationalization Support

About four years ago we released a small Ruby gem (EDTF-Humanize) to generate human-readable dates from Extended Date Time Format (EDTF) dates. For some background on our use of the EDTF standard, please see our previous blog posts on the topic: EDTF-Humanize, Enjoy your Metadata: Fun with Date Encoding, and It’s Date Night Here at Digital Projects and Production Services.

Some recent community contributions to the gem, along with some extra time as we transition from one work cycle to another, provided an opportunity for maintenance and refinement of EDTF-Humanize. The primary improvement is better support for languages other than English via Ruby I18n locale configuration files and a language-specific module override pattern. Support for French is now included, and support for other languages may be added following the same approach as French.

The primary means of adding additional languages to EDTF-Humanize is to add a translation file to config/locales/. This is the translation file included to support French:

fr:
  date:
    day_names: [Dimanche, Lundi, Mardi, Mercredi, Jeudi, Vendredi, Samedi]
    abbr_day_names: [Dim, Lun, Mar, Mer, Jeu, Ven, Sam]
    # Don't forget the nil at the beginning; there's no such thing as a 0th month
    month_names: [~, Janvier, Février, Mars, Avril, Mai, Juin, Juillet, Août, Septembre, Octobre, Novembre, Decembre]
    abbr_month_names: [~, Jan, Fev, Mar, Avr, Mai, Jun, Jul, Aou, Sep, Oct, Nov, Dec]
    seasons:
      spring: "printemps"
      summer: "été"
      autumn: "automne"
      winter: "hiver"
  edtf:
    terms:
      approximate_date_prefix_day: ""
      approximate_date_prefix_month: ""
      approximate_date_prefix_year: ""
      approximate_date_suffix_day: " environ"
      approximate_date_suffix_month: " environ"
      approximate_date_suffix_year: " environ"
      decade_prefix: "Les années "
      decade_suffix: ""
      century_suffix: ""
      interval_prefix_day: "Du "
      interval_prefix_month: "De "
      interval_prefix_year: "De "
      interval_connector_approximate: " à "
      interval_connector_open: " à "
      interval_connector_day: " au "
      interval_connector_month: " à "
      interval_connector_year: " à "
      interval_unspecified_suffix: "s"
      open_start_interval_with_day: "Jusqu'au %{date}"
      open_start_interval_with_month: "Jusqu'en %{date}"
      open_start_interval_with_year: "Jusqu'en %{date}"
      open_end_interval_with_day: "Depuis le %{date}"
      open_end_interval_with_month: "Depuis %{date}"
      open_end_interval_with_year: "Depuis %{date}"
      set_dates_connector_exclusive: ", "
      set_dates_connector_inclusive: ", "
      set_earlier_prefix_exclusive: 'Le ou avant '
      set_earlier_prefix_inclusive: 'Le et avant '
      set_last_date_connector_exclusive: " ou "
      set_last_date_connector_inclusive: " et "
      set_later_prefix_exclusive: 'Le ou après '
      set_later_prefix_inclusive: 'Le et après '
      set_two_dates_connector_exclusive: " ou "
      set_two_dates_connector_inclusive: " et "
      uncertain_date_suffix: "?"
      unknown: 'Inconnue'
      unspecified_digit_substitute: "x"
    formats:
      day_precision_strftime_format: "%-d %B %Y"
      month_precision_strftime_format: "%B %Y"
      year_precision_strftime_format: "%Y"

In addition to the translation file, the methods used to construct the human-readable string for each EDTF date object type may be completely overridden for a language if needed. For instance, when the date object is an instance of EDTF::Century, the French language uses a different method from the default to construct the humanized form. This override is accomplished by adding a French language module that includes the Default module and also includes a Century module overriding the default behavior. The override is shown here (minus the internals of the humanizer method) as an example:

# lib/edtf/humanize/language/french.rb
module Edtf
  module Humanize
    module Language
      module French
        include Default
        module Century
          extend self

          def humanizer(date)
            # Special French handling for EDTF::Century
          end
        end
      end
    end
  end
end

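With the locale file and override in place, switching languages is just a matter of setting the I18n locale before calling humanize. A quick sketch is below; the exact output strings may differ slightly from what the gem actually produces.

require "edtf-humanize"

date = Date.edtf("1970-06~") # an approximate date: June 1970

I18n.locale = :en
puts date.humanize # => something like "circa June 1970"

I18n.locale = :fr
puts date.humanize # => something like "Juin 1970 environ"
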
EDTF-Humanize version 2.0.0 is available on rubygems.org and on GitHub. Documentation is available on GitHub. Pull requests are welcome; I’m especially interested in contributions to add support for languages in addition to English and French.

Announcing New Features in the Duke Digital Repository

Last week the Duke University Libraries (DUL) development team released a new version of the Duke Digital Repository (DDR), which is the preservation and access platform for digitized, born digital, and purchased library collections. DDR is developed and maintained by DUL staff and it is built using Samvera, Valkyrie and Blacklight components (read all about our migration to Valkyrie which concluded in early 2020).

Look at that beautiful technical metadata!

The primary goal of our new repository features is to provide better support for and access to born digital records. The planning for this work began more than 2 years ago, when the Rubenstein Library’s Digital Records Archivist joined the Digital Collections Implementation Team (DCIT) to help us envision how DDR and our workflows could better support born digital collections. Conversations on this topic began between the Rubenstein Library and Digital Strategies and Technology well before that.

Back in 2018, DCIT developed a list of user stories to address born digital records as well as some other longstanding needs. At the time we evaluated each need based on its difficulty and impact, and then developed a list of high-, medium-, and low-priority features. Fast forward to late 2019, and we designated 3 folks from DCIT to act as product owners during development. Those folks are our Metadata Architect (Maggie Dickson), Digital Records Archivist ([Matthew] farrell), and me (Head of Digital Collections and Curation Services). Development work began in earnest in January/February, and now, after many meetings, user story refinements, more meetings, and actual development work, here we are!

Notable new features include:

  • Metadata only view of objects: restrict the object but allow the public to search and discover its metadata
  • Expose technical metadata for components in the public interface
  • Better access to full text search in CONTENTdm from DDR

As you can see above, we were able to fit in a few features not specific to born digital records. This is because one of our big priorities is finishing the migration from our legacy Tripod2 platform to DDR in 2020. One of the impediments to doing so (in addition to migrating the actual content) is that Tripod2 connects with our CONTENTdm instance, which is where we provide access to digitized primary sources that require full text search (newspapers and publications, primarily). The new DDR features therefore include enhanced links to our collections in CONTENTdm.

We hope these new features provide a better experience for our users as well as a safe and happy home for our born digital records!

Search full text link on a collection landing page.
Example of the search within an item interface