Category Archives: Projects

How we broke up with Basecamp

We recently published a blog post outlining our decision to drop Basecamp as a project management platform. Several people have asked us to share our new solution(s), so we thought we would take the opportunity to explain, at a high level, how we approached migrating away from Basecamp.

When DUL decided we would not be renewing our Basecamp subscription, there were about 88 active projects representing work across the entire organization. The owners of these projects would have just over two months to export their content and, if necessary, migrate it to a new solution.

Exporting Content

Our first step was to form a 3-person migration team to explore different options for exporting content from Basecamp. We identified two main export options: DIY exporting and administrator exporting. For both options, we proactively tested the workflow, made screencasts, and wrote tutorials to ensure staff were well-prepared for either option.

Before explaining the export options, however, we emphasized to staff that they should only export content that is needed for ongoing work or archival purposes. For completed projects that are no longer needed, we encouraged staff to “let it go.”

DIY exporting: We established that any member of a Basecamp project or team could export content from Basecamp. For projects that primarily used Basecamp to create documents or share files, we recommended they use the built-in export functionality within the Docs & Files section of the project. Project members could just go to Docs & Files, click the three dots menu in the upper-right corner, and select “Download this folder.” The download would include all of the same subfolders that were created in Basecamp. Basecamp documents would be saved as HTML files, and additional files would appear in their original format. One note, however, is that the HTML versions of the Basecamp documents would not include the comments added to those documents.

To export documents with their comments, or to export other Basecamp content like Message Boards or To-dos, another option was to save the individual pages as PDFs. While this took more time and didn’t work well for large and complicated projects, it had the added benefit of preserving the look-and-feel of the original Basecamp content. PDF was also preferable to HTML for some situations, like viewing the files in different file storage solutions.

Administrator exporting: If a team really needed a complete export of the current Basecamp content, an administrator for the Basecamp account could generate a full export. This type of export generated an HTML-based set of files that included all components of the project and any comments on documents. Every page available in Basecamp in the browser became a separate HTML file. With this export, however, the HTML files created for Docs & Files were all placed in a single flat directory. End users could still see subfolders when navigating the exported HTML files in a web browser, but the downloaded files themselves were no longer organized into those folders.

We decided not to track DIY exporting, but we did need to coordinate requests for administrator exporting, since only a few accounts were “Basecamp Owners.” To facilitate this process, we created an online request form for staff to complete.

To help folks decide what they needed, we created a visual decision tree representing export options.

A flow diagram highlighting the different export options for Basecamp content.

Alternative Project Management Tools

We realized early on that it was out of scope to complete a formal migration from Basecamp to one specific, alternate tool. We didn’t have staff capacity for that kind of project, and we weren’t inclined to prescribe the same solution for every project or team. To promote reliability and access to additional support, we encouraged teams to take advantage of enterprise solutions already supported by Duke University. The primary alternatives at the time were Box and MS Teams. Box had been in use in the Libraries for several years, while MS Teams was newer and not yet in widespread use across staff.

To help folks decide what tool they wanted to use, we created some documentation comparing the various features of Box and MS Teams with the functionality folks were used to in Basecamp. We also created screencasts that demonstrated how one can use Tasks in Box or the Planner app within Teams for managing projects and assignments.

Basecamp to Teams and Box Features
| Basecamp | Teams | Box | Others |
| --- | --- | --- | --- |
| Campfire | Chat | Comments on Files, Box Notes | |
| Message Board | Chat (can thread chats in Teams) | Comments on Files, Box Notes | |
| To-dos | Tasks by Planner and To Do | Task List | |
| Schedule | Channel Calendar; Tasks by Planner and To Do (timelines and due dates for tasks) | Task List (due date scheduling) | Outlook: Resource or Shared Calendar |
| Automatic Check-ins | | | |
| Docs & Files | Files tab (add Shortcut to OneDrive for desktop access) | All of Box (add Box Drive for desktop access) | |
| Email Forwards | | | |
| Card Table | Tasks by Planner and To Do | | MeisterTask |

Managing Exported HTML Files

To be honest, having all of our Basecamp-created documents export as HTML files was challenging. In Box, the HTML files would just display the code view when opened online, which was frustrating to our colleagues who use Box. While Teams would display the rendered web pages of the HTML files very well, any relative links between HTML files would be broken. For staff who wanted, essentially, a clone of their Basecamp project site, we directed them to use Box Drive or to sync a folder in their Team to OneDrive. When opened that way, the exported files would be navigable as normal, so we saw this as more a matter of guiding user behavior than of manipulating the files themselves. Since we had encountered these issues in our preliminary testing of workflows and exports, we built that guidance into the first explanatory screencasts we made. Foregrounding those issues helped many people avoid them altogether.

Supporting Staff through the Transition

Overall, we had a very smooth transition for staff. In terms of time invested in this process, the Basecamp Migration Team spent a lot of time upfront preparing documentation and helpful how-to videos. We set up a centralized space for staff to find all of our documentation and to ask questions, and we also scheduled several blocks of open office hours to offer more personalized help. That work supported staff who were comfortable managing their own export, and the number of teams requiring administrator support or additional training ended up being manageable. The deadline for submitting requests for an administrator export was about one month before the cancellation date, but many teams submitted their requests earlier than that. All of these strategies worked well, and we felt very comfortable cancelling our account on the appointed day. We migrated out of Basecamp, and you can too!

Web Accessibility at Duke Libraries and Beyond — a Quick Start Guide

Coming on board as the new Web Experience Developer in the Assessment and User Experience Services (AUXS) Department in early 2022, one of my first priorities was to get up to speed on Web Accessibility guidelines and testing. I wanted to learn how these standards had been applied to Library websites to date and establish my own processes and habits for ongoing evaluation and improvement. Flash-forward one year, and I’m looking back at the steps that I took and reflecting on lessons learned and projects completed. I thought it might be helpful to myself and others in a similar situation (e.g. new web developers, designers, or content creators) to organize these experiences and reflections into a sort of manual or “Quick Start Guide”. I hope that these 5 steps will be useful to others who need a crash course in this potentially confusing or intimidating–but ultimately crucial and rewarding–territory.

Learn from your colleagues

Fortunately, I quickly discovered that Duke Libraries already had a well-established culture and practice around web accessibility, including a number of resources I could consult.

Two Bitstreams posts from our longtime web developer/designer Sean Aery gave me a quick snapshot of the current state of things, recent initiatives, and ongoing efforts:

Repositories of the Library’s open source software projects proved valuable in connecting broader concepts with specific examples and seeing how other developers had solved problems. For instance, I was able to look at the code for DUL’s “theme” (basically visual styling, color, typography, and other design elements) to better understand how it builds on the ubiquitous Bootstrap CSS framework and implements specific accessibility standards around semantic markup, color contrast, and ARIA roles/attributes:

Screenshot of Duke Libraries' public website Gitlab repository
The repo is your friend

Equipped with some background and context, I found I was much better prepared to formulate questions for team meetings or the project Slack channel.

Look for guidelines from your institution

Duke’s Web Accessibility Initiative establishes clear web accessibility standards for the University and offers practical information on how to understand and implement them:

Duke Accessibility "How To..." page

The site also offers guides geared towards the needs of different stakeholders (content creators, designers, developers) as well as a step-by-step overview of how to do an accessibility assessment.

Duke Accessibility "Key concepts for web developers" page

Know your standards

Duke University has specified the World Wide Web Consortium’s Web Content Accessibility Guidelines version 2.0, Level AA Conformance (WCAG 2.0 Level AA) as its preferred accessibility standard for websites. While it was initially daunting to digest and parse these technical documents, at least I had a known, widely adopted target that I was aiming for–in other words, an achievable goal. Feeling bolstered by that knowledge, I was able to use the other resources mentioned here to fill in the gaps and get hands-on experience and practice solving web accessibility issues.

Screenshot of WCAG 2.0 standards document
RTFM, ok?

Find a playground

As I settled into the workflow within our Scrum team (based on Agile software development principles), I found a number of projects that gave me opportunities to test and experiment with how different markup and design decisions affect accessibility. I particularly enjoyed updating the Style Guide for our Catalog as part of a Bootstrap 3-to-4 migration, updating our DUL Theme across various applications — Library Catalog, Quicksearch, Staff Directory — built on the Ruby on Rails framework, and getting scrappy and creative trying to improve the branding and accessibility of some of our vendor-hosted web apps with the limited tools available (essentially jQuery scripts and applying CSS to existing markup).

Build your toolkit

A few well-chosen tools can get you far in assessing and correcting web accessibility issues on your websites.

Browser Tools

The built-in developer tools in your browser are essential for viewing and testing changes to markup and understanding how CSS rules are applied to the Document Object Model. The Deque Systems aXe Chrome Extension (also available for Firefox) adds further accessibility testing tools with a slick interface that performs a scan, breaks down accessibility violations ranked by severity, and tells you how to fix them.

Color Contrast Checkers

I frequently turned to these two web-based tools for quick tests of different color combinations. It was educational to see what did and didn’t work in various situations and think more about how aesthetic and design concerns interact with accessibility concerns.

Screenshot of WebAIM color contrast checker
While this color combo technically passes WCAG 2.0 AA and uses colors from the Duke Brand Guide, I probably won’t be using it any time soon
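Under the hood, these checkers implement the WCAG 2.0 contrast-ratio formula, so you can also spot-check a palette in code. Here is a minimal Python sketch of that calculation (the colors below are arbitrary examples, not values from the Duke Brand Guide):

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB color such as '#336699', per WCAG 2.0."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1:1 up to 21:1."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


ratio = contrast_ratio("#666666", "#ffffff")  # arbitrary example pair
print(f"{ratio:.2f}:1 -", "passes" if ratio >= 4.5 else "fails", "WCAG 2.0 AA for normal text")
```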

Style Guides

These style guides provided a handy reference for default and variant typography, color, and page design elements. I found the color palettes particularly helpful as I tried to find creative solutions to color contrast problems while maintaining Duke branding and consistency across various Library pages.

Keyboard-Only Navigation

Attempting to navigate our websites using only the TAB, ENTER, SPACE, UP, and DOWN keys on a standard computer keyboard gave me a better understanding of the significance of semantic markup, skip links, and landmarks. This test is essential for getting another “view” of your pages that isn’t as dependent on visual cues to convey meaning and structure and can help surface issues that automated accessibility scanners might miss.

We are Hiring: 2 Repository Services Analysts

Duke University Libraries (DUL) is recruiting two (2) Repository Services Analysts to ingest and help collaboratively manage content in the Libraries’ digital preservation systems and platforms. These positions will partner with the Research Data Curation, Digital Collections, and Scholarly Communications Programs, as well as with other library and campus partners, to provide digital curation and preservation services. The Repository Services Analyst role is an excellent early career opportunity for anyone who enjoys managing large sets of data and/or files, working with colleagues across an organization, preserving unique data and library collections, and learning new skills.

DUL will hold an open Zoom session where prospective candidates can join anonymously and ask questions. This session will take place on Wednesday, January 12, at 1 p.m. EST; the link to join is posted on the Libraries’ job advertisement.

The Research Data Curation Program has grown significantly in recent years, and DUL is seeking candidates who want to grow their skills in this area. DUL is a member of the Data Curation Network (DCN), which provides opportunities for cross-institutional collaboration, curation training, and hands-on data curation practice. These skills are essential for anyone who wants to pursue a career in research data curation. 

Ideal Repository Services Analyst applicants have been exposed to digital asset management tools and techniques such as command line scripting. They can communicate functional system requirements between groups with varying types of expertise, enjoy working with varied kinds of data and collections, and love solving problems. Applicants should also be comfortable collaboratively managing a shared portfolio of digital curation services and projects, as the two positions work closely together. The successful candidates will join the Digital Collections and Curation Services department (within the Digital Strategies and Technology Division).

Please refer to DUL’s job posting for position requirements and application instructions.

The Shortest Year

Featured image – screenshot from the Sunset Tripod2 project charter.

Realizing that my most recent post here went up more than a year ago, I pause to reflect. What even happened over these last twelve months? Pandemic and vaccine, election and insurrection, mandates and mayhem – outside of our work bubble, October 2020 to October 2021 has been a churn of unprecedented and often dark happenings. Bitstreams, however, broadcasts from inside the bubble, where we have modeled cooperation and productivity, met many milestones, and kept our collective cool, despite working nearly 100% remotely as a team, with our stakeholders, and across organizational lines.

Last October, I wrote about Sunsetting Tripod2, a homegrown platform for our digital collections and archival finding aids that was also the final service we had running on a physical server. “Firm plans,” I said we had for the work that remained. Still, in looking toward that setting sun, I worried about “all sorts of comical and embarrassing misestimations by myself on the pages of this very blog over the years.” I was optimistic, but cautiously so, that we would banish the ghosts of Django-based systems past.

Reader, I have returned to Bitstreams to tell you that we did it. Sometime in Q1 of 2021, we said so long, farewell, adieu to Tripod2. It was a good feeling, like when you get your laundry folded, or your teeth cleaned, only better.

However, we did more in the past year than just power down exhausted old servers. What follows are a few highlights from the past twelve months of work by the software developers in the Digital Strategies and Technology division of Duke University Libraries, and by our collaborators (whom we cannot thank or praise enough).

In November, Digital Projects Developer Sean Aery posted on Implementing ArcLight: A Reflection. The work of replacing and improving upon our implementation for the Rubenstein Library’s collection guides was one of the main components that allowed us to turn off Tripod2. We actually completed it in July of 2020, but that team earned its Q4 victory laps, including Sean’s post and a session at Blacklight Summit a few days after my own post last October.

As the new year began, the MorphoSource team rolled out version 2.0 of that platform. MorphoSource Repository Developer Jocelyn Triplett shared A Preview of MorphoSource 2 Beta in these pages on January 20. The launch took place on February 1.

One project we had underway as I was writing last October was the integration of Globus, a transfer service for large datasets, into the Duke Research Data Repository. We completed that work in Q1 of 2021, prompting our colleague, Senior Research Data Management Consultant Sophia Lafferty-Hess, to post Share More Data in the Duke Research Data Repository! in a neighboring location that shares our charming cul-de-sac of library blogs.

The seventeen months since the murder of George Floyd have seen major changes in how we think and talk about race in the Libraries. We committed ourselves to the DUL Racial Justice Roadmap, a pathway for recognizing and attacking the pervasive influence of white supremacy in our society, in higher education, at Duke, in the field of librarianship, in our library, in the field of information technology, and in our own IT practices. During this time, members of our division have also participated broadly in DiversifyIT, a campus-wide group of IT professionals who seek to foster a culture of inclusion “by providing professional development, networking, and outreach opportunities.”

Digital Projects Developer Michael Daul shared his own point of view with great thoughtfulness in his April post, What does it mean to be an actively antiracist developer? He touched on representation in the IT industry, acknowledging bias, being aware of one’s own patterns of communication, and bringing these ideas to the systems we build and maintain. 

One of the ideas that Michael identified for software development is web accessibility; as he wrote, we can “promote the benefits of building accessible interfaces that follow the practices of universal design.” We put that idea into action a few months later, as Sean described in precise technical terms in his July post, Automated Accessibility Testing and Continuous Integration. Currently that process applies to the ArcLight platform, but when we have a chance, we’ll see if we can expand it to other services.

The question of when we’ll have that chance is a big one, as it hinges on the undertaking that now dominates our attention. Over the past year we have ramped up the migration of our website from Drupal 7 to Drupal 9, to head off the end-of-life for Drupal 7. This project has transformed into the raging beast that our colleagues at NC State Libraries warned us it would become at Code4Lib Southeast in May of 2019.

Screenshot of NC State Libraries presentation on Drupal migration
They warned us – Screenshot from “Drupal 7 to Drupal 8: Our Journey,” by Erik Olson and Meredith Wynn of NC State Libraries’ User Experience Department, presented at Code4Lib Southeast in May of 2019.

We are on a path to complete the Drupal migration in March 2022 – we have “firm plans,” you could say – and I’m certain that its various aspects will come to feature in Bitstreams in due time. For now I will mention that it spawned two sub-projects that have challenged our team over the past six months or so, both of which involve refactoring functionality previously implemented as Drupal modules into standalone Rails applications:

  1. Quicksearch, aka unified search, aka “Bento search” – see Michael’s Bento is Coming! from 2014 – is now a standalone app; it also uses the open-source tool Apache Nutch, rather than Google CSE.
  2. The staff directory app that went live in 2019, which Michael wrote about in Building a new Staff Directory, also no longer runs as a Drupal module.

Each of these implementations was necessary to prepare the way for a massive migration of theme and content that will take place over the coming months. 

Screenshot of a Jira issue related to the Decouple Staff Directory project.

When it’s done, maybe we’ll have a chance to catch our breath. Who can really say? I could not have guessed a year ago where we’d be now, and anyway, the period of the last twelve months gets my nod as the shortest year ever. Assuming we’re here, whatever “here” means in the age of remote/hybrid/flexible work arrangements, then I expect we’ll be burning down backlogs, refactoring this or that, deploying some service, and making firm plans for something grand.

Good News from the DPC: Digitization of Behind the Veil Tapes is Underway

This post was written by Jen Jordan, a graduate student at Simmons University studying Library Science with a concentration in Archives Management. She is the Digital Collections intern with the Digital Collections and Curation Services Department. Jen will complete her master’s degree in December 2021.

The Digital Production Center (DPC) is thrilled to announce that work is underway on a three-year National Endowment for the Humanities (NEH) grant-funded project to digitize the entirety of Behind the Veil: Documenting African-American Life in the Jim Crow South, an oral history project that produced 1,260 interviews spanning more than 1,800 audio cassette tapes. Accompanying the 2,000-plus hours of audio is a sizable collection of visual materials (e.g., photographic prints and slides) that form a connection with the recorded voices.

We are here to summarize the logistical details relating to the digitization of this incredible collection. To learn more about its historical significance and the grant that is funding this project, titled “Documenting African American Life in the Jim Crow South: Digital Access to the Behind the Veil Project Archive,” please take some time to read the July announcement written by John Gartrell, Director of the John Hope Franklin Research Center and Principal Investigator for this project. Co-Principal Investigator of this grant is Giao Luong Baker, Digital Production Services Manager.

Digitizing Behind the Veil (BTV) will require, in part, the services of outside vendors to handle the audio digitization and subsequent captioning of the recordings. While the DPC regularly digitizes audio recordings, we are not equipped to do so at this scale (while balancing other existing priorities). The folks at Rubenstein Library have already been hard at work double-checking the inventory to ensure that each cassette tape and case is labeled with identifiers. The DPC then received the tapes, filling 48 archival boxes, along with a digitization guide (i.e., an Excel spreadsheet) containing detailed metadata for each tape in the collection. Upon receiving the tapes, DPC staff set to boxing them for shipment to the vendor. As of this writing, the boxes are snugly wrapped on a pallet in Perkins Shipping & Receiving, where they will soon begin their journey to a digital format.

The wait has begun! In eight to twelve weeks we anticipate receiving the digital files, at which point we will perform quality control (QC) on each one before sending them off for captioning. As the captions are returned, we will run through a second round of QC. From there, the files will be ingested into the Duke Digital Repository, at which point our job is complete. Of course, we still have the visual materials to contend with, but we’ll save that for another blog! 

As we creep closer to the two-year mark of the COVID-19 pandemic and the varying degrees of restrictions that have come with it, the DPC will continue to focus on fulfilling patron reproduction requests, which have comprised the bulk of our work for some time now. We are proud to support researchers by facilitating digital access to materials, and we are equally excited to have begun work on a project of the scale and cultural impact that is Behind the Veil. When finished, this collection will be accessible for all to learn from and meditate on—and that’s what it’s all about. 

 

Auditing Archival Description for Harmful Language: A Computer and Community Effort

This post was written by Miriam Shams-Rainey, a third-year undergraduate at Duke studying Computer Science and Linguistics with a minor in Arabic. As a student employee in the Rubenstein’s Technical Services Department in the Summer of 2021, Miriam helped build a tool to audit archival description in the Rubenstein for potentially harmful language. In this post, she summarizes her work on that project.

The Rubenstein Library has collections ranging across centuries. Its collections are massive and often contain rare manuscripts or one-of-a-kind data. With this wide-ranging history, however, often comes language that is dated and harmful: frequently racist, sexist, homophobic, and/or colonialist. As important as it is to find and remediate these instances of potentially harmful language, there is a lot of data that must be searched.

With over 4,000 collection guides (finding aids) and roughly 12,000 catalog records describing archival collections, archivists would need to spend months combing through their metadata to find harmful or problematic language before even starting to figure out how to handle it. That is, unless there were a way to optimize this workflow.

Working under Noah Huffman’s direction and the imperatives of the Duke Libraries’ Anti-Racist Roadmap, I developed a Python program capable of finding occurrences of potentially harmful language in library metadata and recording them for manual analysis and remediation. What would have taken months of work can now be done in a few button clicks and ten minutes of processing time. Moreover, the tools I have developed are accessible to any interested parties via a GitHub repository to modify or expand upon.

Although these gains in speed push metadata language remediation efforts at the Rubenstein forward significantly, a computer can only take this process so far; once uses of this language have been identified, the responsibility of determining the impact of the term in context falls onto archivists and the communities their work represents. To this end, I have also outlined categories of harmful language occurrences to act as a starting point for archivists to better understand the harmful narratives their data uphold and developed best practices to dismantle them.

Building an automated audit tool

Audit Tool GUI Screenshot
The simple, yet user-friendly interface that allows archivists to customize the search audit to their specific needs.

I created an executable that allows users to interact with the program regardless of their familiarity with Python or with using their computer’s command line. With an executable, all a user has to do is click on the program (titled “description_audit.exe”) and the script will load with all of its dependencies in a self-contained environment. There’s nothing that a user needs to install, not even Python.

Within this executable, I also created a user interface to allow users to set up the program with their specific audit parameters. To use this program, users should first create a CSV file (spreadsheet) containing each list of words they want to look for in their metadata.

Snippet of Lexicon CSV
Snippet from a sample lexicon CSV file containing harmful terms to search

In this CSV file of “lexicons,” each category of terms should have its own column. For example, RaceTerms could be the header of a column of terms such as “colored” or “negro,” and GenderTerms could be the header of a column of gendered terms such as “homemaker” or “wife.” See these lexicon CSV file examples.

Once this CSV has been created, users can select this CSV of lexicons in the program’s user interface and then select which columns of terms they want the program to use when searching across the source metadata. Users can either use all lexicon categories (all columns) by default or specify a subset by typing out those column headers. For the Rubenstein’s purposes, there is also a rather long lexicon called HateBase (from a regional, multilingual database of potential hate speech terms often used in online moderating) that is only enabled when a checkbox is checked; users from other institutions can download the HateBase lexicon for themselves and use it or they can simply ignore it.
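The full program lives in the GitHub repository mentioned above; as a simplified sketch (not the production code), loading a lexicon CSV into per-category term lists might look something like this, where the filename and the RaceTerms/GenderTerms categories are just the hypothetical examples from earlier:

```python
import csv
from collections import defaultdict


def load_lexicons(csv_path, categories=None):
    """Read a lexicon CSV (one category per column, category names in the
    header row) into a dict mapping category name -> list of terms."""
    lexicons = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for category, term in row.items():
                # skip empty cells and any columns the user did not select
                if term and (categories is None or category in categories):
                    lexicons[category].append(term.strip())
    return dict(lexicons)


# e.g. restrict the audit to two of the lexicon columns
lexicons = load_lexicons("lexicons.csv", categories={"RaceTerms", "GenderTerms"})
```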

In the CSV reports that are output by the program, matches for harmful terms and phrases will be tagged with the specific lexicon category the match came from, allowing users to filter results to certain categories of potentially harmful terms.

Users also need to designate a folder on their desktop where report outputs should be stored, along with the folder containing their source EAD records in .xml format and their source MARCXML file containing all of the MARC records they wish to process as a single XML file. Results from MARC and EAD records are reported separately, so only one type of record is required to use the program; however, both can be provided in the same session.

How archival metadata is parsed and analyzed

Once users submit their input parameters in the GUI, the program begins by accessing the specified lexicons from the given CSV file. For each lexicon, a “rule” is created for a SpaCy rule-based matcher, using the column name (e.g. RaceTerms or GenderTerms) as the name of the specific rule. The same SpaCy matcher object identifies matches to each of the several lexicons or “rules”. Once the matcher has been configured, the program assesses whether valid MARC or EAD records were given and starts reading in their data.
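The matcher code itself isn’t reproduced in this post, so here is one hedged way the rule setup could look, using spaCy’s PhraseMatcher (the actual tool may use a different matcher class) together with the load_lexicons sketch from earlier:

```python
import spacy
from spacy.matcher import PhraseMatcher

# a blank English pipeline provides the tokenizer; no trained model is needed
nlp = spacy.blank("en")


def build_matcher(lexicons):
    """Register one rule per lexicon category (e.g. RaceTerms, GenderTerms)."""
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
    for category, terms in lexicons.items():
        matcher.add(category, [nlp.make_doc(term) for term in terms])
    return matcher


matcher = build_matcher(lexicons)
```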

To access important pieces of data from each of these records, I used a Python library called BeautifulSoup to parse the XML files. For each individual record, the program parses the call numbers and collection or entry name so that information can be included in the CSV reports. For EAD records, the collection title and component titles are also parsed to be analyzed for matches to the lexicons, along with any data that is in a paragraph (<p>) tag. For MARC records, the program also parses the author or creator of the item, the extent of the collection, and the timestamp of when the description of the item was last updated. In each MARC record, the 520 field (summary) and 545 field (biography/history note) are concatenated together and analyzed as a single entity.
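As a rough illustration of that parsing step (the element names below are common EAD tags, and the real program extracts more fields than this), a BeautifulSoup pass over one finding aid might look like:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4 lxml


def parse_ead(path):
    """Pull a few descriptive fields from a single EAD XML finding aid."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "lxml-xml")  # XML-aware parser via lxml
    title = soup.find("titleproper")
    unitid = soup.find("unitid")
    return {
        "collection_title": title.get_text(strip=True) if title else "",
        "call_number": unitid.get_text(strip=True) if unitid else "",
        # text that will be screened against the lexicons
        "paragraphs": [p.get_text(" ", strip=True) for p in soup.find_all("p")],
    }
```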

Data from each record is stored in a Python dictionary with the names of fields (as strings) as keys mapping to the collection title, call number, etc. Each of these dictionaries is stored in a list, with a separate structure for EAD and MARC records.

Once data has been parsed and stored, each record is checked for matches to the given lexicons using the SpaCy rule-based matcher. For each record, any matches that are found are then stored in the dictionary with the matching term, the context of the term (the entire field or surrounding few sentences, depending on length), and the rule the term matches (such as RaceTerms). These matches are found using simple tokenization from SpaCy that allows matches to be identified quickly and without regard for punctuation, capitalization, etc.
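Continuing the sketches above (where record is the dictionary returned by the parse_ead sketch, and simplifying the context rule to “the whole field”), the matching step could be written roughly like this:

```python
def find_matches(matcher, record, text_fields=("collection_title", "paragraphs")):
    """Run the matcher over a parsed record and collect hits for the report."""
    hits = []
    for field in text_fields:
        values = record.get(field, [])
        values = values if isinstance(values, list) else [values]
        for text in values:
            doc = nlp(text)
            for rule_id, start, end in matcher(doc):
                hits.append({
                    "call_number": record["call_number"],
                    "collection_title": record["collection_title"],
                    "lexicon": nlp.vocab.strings[rule_id],  # e.g. "RaceTerms"
                    "term": doc[start:end].text,
                    "context": text,  # simplified: the entire field
                })
    return hits
```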

Although this process doesn’t necessarily use the cutting edge of natural language processing that the SpaCy library makes accessible, it is adaptable in ways that matching procedures like regular expressions often aren’t. Moreover, identifying and remedying harmful language is a fundamentally human process which, at the end of the day, needs a significant level of input both from historically marginalized communities and from archivists.

Matches to any of the lexicons, along with all other associated data (the record’s call number, title, etc.), are then written into CSV files for further analysis and further categorization by users. You can see sample CSV audit reports here. The second phase of manual categorization is still a lengthy process, yielding roughly 14,600 matches from the Rubenstein Library’s EAD data and 4,600 from its MARC data which must still be read through and analyzed by hand, but the process of identifying these matches has been computerized to take a mere ten minutes, where it could otherwise be a months-long process.
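Rounding out the sketches above, the report-writing step can be as simple as csv.DictWriter (the column set here is an assumption based on the fields described in this post):

```python
import csv


def write_report(hits, out_path):
    """Write collected matches to a CSV audit report for manual review."""
    fieldnames = ["call_number", "collection_title", "lexicon", "term", "context"]
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(hits)
```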

Categorizing matches: an archivist and community effort

An excerpt of initial data returned by the audit program for EAD records. This data should be further categorized manually to ensure a comprehensive and nuanced understanding of these instances of potentially harmful language.

To better understand these matches and create a strategy to remediate the harmful language they represent, it is important to consider each match in several different facets.

Looking at the context provided with each match allows archivists to understand the way in which the term was used. The remediation strategy for the use of a potentially harmful term in a proper noun used as a positive, self-identifying term, such as the National Association for the Advancement of Colored People, for example, is vastly different from that of a white person using the word “colored” as a racist insult.

The three ways in which I proposed we evaluate context are as follows:

  1. Match speaker: who was using the term? Was the sensitive term being used as a form of self-identification or reclaiming by members of a marginalized group, was it being used by an archivist, or was it used by someone with privilege over the marginalized group the term targets (e.g. a white person using an anti-Black term or a cisgender straight person using an anti-LGBTQ+ term)? Within this category, I proposed three potential categories for uses of a term: in-group, out-group, and archivist. If a term is used by a member (or members) of the identity group it references, its use is considered an in-group use. If the term is used by someone who is not a member of the identity group the term references, that usage of the term is considered out-group. Paraphrasing or dated term use by archivists is designated simply as archivist use.
  2. Match context: how was the term in question being used? Modifying the text used in a direct quote or a proper noun constitutes a far greater liberty by the archivist than removing a paraphrased section or completely archivist-written section of text that involved harmful language. Although this category is likely to evolve as more matches are categorized, my initial proposed categories are: proper noun, direct quote, paraphrasing, and archivist narrative.
  3. Match impact: what was the impact of the term? Was this instance a false positive, wherein the use of the term was in a completely unrelated and innocuous context (e.g. the use of the word “colored” to describe the colors used in visual media), or was the use of the term in fact harmful? Was the use of the term derogatory, or was it merely a mention of politicized identities? In many ways, determining the impact of a particular term or use of potentially harmful language is a community effort; if a community member with a marginalized identity says that the use of a term in that particular context is harmful to people with that identity, archivists are in no position to disagree or invalidate those feelings and experiences. The categories that I’ve laid out initially–dated original term, dated Rubenstein term, mention of marginalized issues, mention of marginalized identity, downplaying bias (e.g. calling racism and discrimination an issue with “race relations”), dehumanization of marginalized people, false positive–only hope to serve as an entry point and rudimentary categorization of these nuances to begin this process.
A short excerpt of categorized EAD metadata

Here you can find more documentation on the manual categorization strategy.

Categorizing each of these instances of potentially harmful language remains a time-consuming, meticulous process. Although much of this work can be computerized, decolonization is a fundamentally human and fundamentally community-centered practice. No computer can dismantle the colonial, white supremacist narratives that archival work often upholds. This work requires our full attention and, for better or for worse, a lot of time, even with the productivity boost technology gives us.

Once categories have been established, at least on a preliminary level, I found that about 100-200 instances of potentially harmful language could be manually parsed and categorized in an hour.

Conclusion

Decolonization and anti-racist efforts in archival work are an ongoing process. It is bound to take active learning, reflection, and lots of remediation. However, using technology to start this process creates a much less daunting entry point. Anti-racism work is essential in archival spaces.

The ways we talk about history can either work to uphold traditional white supremacist, racist, ableist, etc. narratives, or they can work to dismantle them. In many ways, archival work has often upheld these narratives in the past; however, this audit represents the sincere beginnings of work to further equitable narratives in the future.

ArcLight Migration: A Status Update After Three Months of Work

On January 20, 2020, we kicked off our first development sprint for implementing ArcLight at Duke as our new finding aids / collection guides platform. We thought our project charter was solid: thorough, well-vetted, with a reasonable set of goals. In the plan was a roadmap identifying a July 1, 2020 launch date and a list of nineteen high-level requirements. There was nary a hint of an impending global pandemic that could upend absolutely everything.

The work wasn’t supposed to look like this, carried out by zooming virtually into each other’s living rooms every day. Code sessions and meetings now require navigating around child supervision shifts and schooling-from-home responsibilities. Our new young office-mates occasionally dance into view or within earshot during our calls. Still, we acknowledge and are grateful for the privilege afforded by this profession to continue to do our work remotely from safe distance.

So, a major shoutout is due to my colleagues in the trenches of this work overcoming the new unforeseen constraints around it, especially Noah Huffman, David Chandek-Stark, and Michael Daul. Our progress to date has only been possible through resilience, collaboration, and willingness to keep pushing ahead together.

Three months after we started the project, we remain on track for a summer 2020 launch.

As a reminder, we began with the core open-source ArcLight platform (demo available) and have been building extensions and modifications in our local application in order to accommodate Duke needs and preferences. With the caveat that there’ll be more changes coming over the next couple months before launch, I want to provide a summary of what we have been able to accomplish so far and some issues we have encountered along the way. Duke staff may access our demo app (IP-restricted) for an up-to-date look at our work in progress.

Homepage

Homepage design for Duke’s ArcLight finding aids site.
  • Duke Branding. Aimed to make an inviting front door to the finding aids consistent with other modern Duke interfaces, similar to–yet distinguished enough from–other resources like the catalog, digital collections, or Rubenstein Library website.
  • Featured Items. Built a configurable set of featured items from the collections (with captions), to be displayed randomly (actual selections still in progress).
  • Dynamic Content. Provided a live count of collections; we might add more indicators for types/counts of materials represented.

Layout

A collection homepage with a sidebar for context navigation.
  • Sidebar. Replaced the single-column tabbed layout with a sidebar + main content area.
  • Persistent Collection Info. Made collection & component views more consistent; kept collection links (Summary, Background, etc.) visible/available from component pages.
  • Width. Widened the largest breakpoint. We wanted to make full use of the screen real estate, especially to make room for potentially lengthy sidebar text.

Navigation

Component pages contextualized through a sidebar navigator and breadcrumb above the main title.
  • Hierarchical Navigation. Restyled & moved the hierarchical tree navigation into the sidebar. This worked well functionally in ArcLight core, but we felt it would be more effective as a navigational aid when presented beside rather than below the content.
  • Tooltips & Popovers. Provided some additional context on mouseovers for some navigational elements.

    Mouseover context in navigation.
  • List Child Components. Added a direct-child list in the main content for any series or other component. This makes for a clear navigable table of what’s in the current series / folder / etc. Paginating it helps with performance in cases where we might have 1,000+ sibling components to load.
  • Breadcrumb Refactor. Emphasized the collection title. Kept some indentation, but aimed for page alignment/legibility plus a balance of emphasis between current component title and collection title.

    Breadcrumb trail to show the current component’s nesting.

Search Results

Search results grouped by collection, with keyword highlighting.
  • “Group by Collection” as the default. Our stakeholders were confused by atomized components as search results outside of the context of their collections, so we tried to emphasize that context in the default search.
  • Revised search result display. Added keyword highlighting within result titles in Grouped or All view. Made Grouped results display checkboxes for bookmarking & digitized content indicators.
  • Advanced Search. Kept the global search box simple but added a modal Advanced search option that adds fielded search and some additional filters.

Digital Objects Integration

Digital objects from the Duke Digital Repository are presented inline in the finding aid component page.
  • DAO Roles. Indexed the @role attribute for <dao> elements; we used that to call templates for different kinds of digital content
  • Embedded Object Viewers. Used the Duke Digital Repository’s embed feature, which renders <iframe>s for images and AV.

Indexing

  • Whitespace compression. Added a step to the pipeline to remove extra whitespace before indexing (see the short sketch after this list). This seems to have slightly accelerated our time-to-index rather than slowing it down.
  • More text, fewer strings. We encountered cases where note-like fields indexed as strings by ArcLight core (e.g., <scopecontent>) needed to be converted to text because we had more than 32,766 bytes of data (limit for strings) to put in them. In those cases, finding aids were failing to index.
  • Underscores. For the IDs that end up in a URL for a component, we added an underscore between the finding aid slug and the component ID. We felt these URLs would look cleaner and be better for SEO (our slugs often contain names).
  • Dates. Changed the date normalization rules (some dates were being omitted from indexing/display)
  • Bibliographic ID. We succeeded in indexing our bibliographic IDs from our EADs to power a collection-level Request button that leads a user to our homegrown requests system.
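Our indexing pipeline itself isn’t written in Python, but to illustrate the whitespace-compression idea mentioned in the list above, the normalization boils down to something like this sketch (the sample string is made up):

```python
import re


def compress_whitespace(value):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", value).strip()


# e.g. applied to a note-like field before it is sent to Solr
print(compress_whitespace("Papers  of the\n   example   family,\t 1850-1900"))
# -> "Papers of the example family, 1850-1900"
```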

Formatting

  • EAD -> HTML. We extended the EAD-to-HTML transformation rules for formatted elements to cover more cases (e.g., links like <extptr> & <extref> or other elements like <archref> & <indexentry>)

    Additional formatting and link render rules applied.
  • Formatting in Titles. We preserved bold or italic formatting in component titles.

ArcLight Core Contributions

  • We have been able to contribute some of our code back to the ArcLight core project to help out other adopters.

Setting the Stage

The behind-the-scenes foundational work deserves mention here — it represents some of the most complex and challenging aspects of the project, and it is what makes possible the application development driving the changes I’ve shared above.

  • Built separate code repositories for our Duke ArcLight application and our EAD data
  • Gathered a diverse set of 40 representative sample EADs for testing
  • Dockerized our Duke ArcLight app to simplify developer environment setup
  • Provisioned a development/demo server for sharing progress with stakeholders
  • Automated continuous integration and deployment to servers using GitLabCI
  • Performed targeted data cleanup
  • Successfully got all 4,000 of our finding aids indexed in Solr on our demo server

Our team has accomplished a lot in three months, in large part due to the solid foundation the ArcLight core software provides. We’re benefiting from some amazing work done by many, many developers who have contributed their expertise and their code to the Blacklight and ArcLight codebases over the years. It has been a real pleasure to be able to build upon an open source engine – a notable contrast to our previous practice of developing everything in-house for finding aids discovery and access.

Still, much remains to be addressed before we can launch this summer.

The Road Ahead

Here’s a list of big things we still plan to tackle by July (other minor revisions/bugfixes will continue as well)…

  • ASpace -> ArcLight. We need a smoother publication pipeline to regularly get data from ArchivesSpace indexed into ArcLight.
  • Access & Use Statements. We need to revise the existing inheritance rules and make sure these statements are presented clearly. It’s especially important when materials are indeed restricted.
  • Relevance Ranking. We know we need to improve the ranking algorithm to ensure the most relevant results for a query appear first.
  • Analytics. We’ll set up some anonymized tracking to help monitor usage patterns and guide future design decisions.
  • Sitemap/SEO. It remains important that Google and other crawlers index the finding aids so they are discoverable via the open web.
  • Accessibility Testing / Optimization. We aim to comply with WCAG2.0 AA guidelines.
  • Single-Page View. Many of our stakeholders are accustomed to a single-page view of finding aids. There’s no such functionality baked into ArcLight, as its component-by-component views prioritize performance. We might end up providing a downloadable PDF document to meet this need.
  • More Data Cleanup. ArcLight’s feature set (especially around search/browse) reveals more places where we have suboptimal or inconsistent data lurking in our EADs.
  • More Community Contributions. We plan to submit more of our enhancements and bugfixes for consideration to be merged into the core ArcLight software.

If you’re a member of the Duke community, we encourage you to explore our demo and provide feedback. To our fellow future ArcLight adopters, we would love to hear how your implementations or plans are shaping up, and identify any ways we might work together toward common goals.

Stay safe, everyone!

The New Books & Media Catalog Turns One

It’s been just over a year since we launched our new catalog in January of 2019. Since then we’ve made improvements to features, performance, and reliability, have developed a long term governance and development strategy, and have plans for future features and enhancements.

During the Spring 2019 semester we experienced a number of outages of the Solr index that powers the new catalog. Tracking down the root cause of the outages proved both frustrating and difficult. We took a number of measures to reduce the risk of bot traffic slowing down or crashing the index, including limiting facet paging to 50 pages and results paging to 250 pages, as well as setting limits on OpenSearch queries. We also added service monitoring, so we are automatically alerted when things go awry, and automatic restarts under some known bad system conditions. In addition, we identified a bug in the version of Solr we were running that could cause crashes for queries with particular characteristics, and we have since applied a patch to Solr to address it. Happily, the index has not crashed since we implemented these protective measures and bug fixes.

Over the past year we’ve made a number of other improvements to the catalog including:

  • Caching of the home page and advanced search page has reduced page load times by 75%.
  • Subject searches are now more precise and do not include stemmed terms.
  • CDs and DVDs can be searched by accession number.
  • When digitized copies of Duke material are available at the Internet Archive, links to the digital copy are automatically generated.
  • Records can be saved to a bookmarks list and shared with others via a stable URL.
  • Eligible records now have a “Request digitization” link.
  • Many other small improvements and bug fixes.

We sometimes get requests for features that the catalog already supports:

While development has slowed, the core TRLN team meets monthly to discuss and prioritize new features and fixes, and dedicates time each month to maintenance and new development. We have a number of small improvements and bug fixes in the works. One new feature we’re working on is adding a citation generator that will provide copyable citations in multiple formats (APA, MLA, Chicago, Harvard, and Turabian) for records with OCLC numbers.

We welcome, read, and respond to all feedback and feature requests that come to us through the catalog’s feedback form. Let us know what you think.

Check out “Search Tips” and “Expert Search Tips” for detailed information about how to get the most out of the new catalog.

FOLIO Update November 2019

Here at Duke, the buzz continues around FOLIO. We have continued to contribute to the international project as active participants on the FOLIO Product Council and in special interest groups, and we continue to contribute development resources. You can find links to the various groups on the FOLIO wiki.

We’ve also committed to implementing the electronic resource management (ERM)-focused apps in summer of 2020. Starting with the ERM-focused apps will give us the opportunity to use FOLIO in a production environment, and will benefit our Continuing Resource Acquisitions Department, since they are not currently using software dedicated to electronic resources to keep track of licenses and terms.

 

Our local project planning has come more into focus as well. We have gathered names for team participants and will be kicking off our project teams in January. As we’ve talked about the implementation here, we’ve realized that we have a number of tasks that will need to be addressed, regardless of subject matter. For example, we’re going to need to map data – not just bibliographic, holdings and item data, but users, orders, invoices, etc. We’ll also need to set up configurations and user permissions for each of the apps, and document, train, and develop new workflows. Since our work is not siloed in functional areas, we need to facilitate discussions among the functional areas. To do that, we’re going to create a set of functional area implementation teams, and work groups around the task areas that need to be addressed.

To learn more about the FOLIO project at Duke, fly on over to our WordPress site and read through our past newsletters, look through slides from past presentations, and check out some fun links to bee facts.

Managing impermanence – migration of the Libraries’ digital exhibits

Post contributed by Claire Cahoon, student in the master’s program at the School of Information and Library Science, UNC-Chapel Hill.

This summer I worked as a field experience student in the Software Services department migrating digital exhibits into Omeka 2, Duke’s most current platform. The ultimate goal was to start and document the process of moving exhibits from legacy platforms into Omeka 2.

The reasoning behind the project became clear as we started creating an index of all of the digital exhibits on display on the exhibits website. Across 97 total exhibits, we found varying degrees of functionality, from the most recent and up-to-date exhibits to sites with broken links and pages where only text would display, leaving out crucial images. Centralizing these into a single platform should make it easier to create, support, and maintain all of these exhibits.

Screenshot of the sidebar of an exhibit, showing the link to the previous version of the exhibit in the Internet Archive

I found exhibits in Omeka 1, Cascade, Scriptorium, JAlbum, and even found a few mystery platforms that we never identified. Since it was the largest, we decided to work on the Omeka 1 group over the summer, and this week I finished migrating all 34 exhibits – that means that after a few adjustments to make the new exhibits available, Omeka 1 can be shut off!

We worked with Meg Brown, Exhibits Coordinator for the Libraries, and the exhibits department to figure out how each exhibit needed to be represented. Since we were managing expectations from lots of different stakeholders, we landed on the idea of including a link to the archived version of each exhibit in the Wayback Machine, in case the look and feel of the new exhibits is limiting for anyone used to Omeka 1.

Working with the Internet Archive links and sorting through broken pieces of these exhibits really put into perspective how impermanent the internet is, even for seemingly static information. Without much maintenance, these exhibits lost some of their core content as video links changed, references disappeared, and even the most well-written custom code stopped working. I hope that my work this summer will help keep these exhibit materials in working order while also eliminating the need to continue supporting Omeka 1.

While migrating, I came across a few favorite exhibits and items that combined interesting content and some updated features in Omeka 2:

Cover of “Anxious homes: cursory-cleaning for the imminent arrival of visitors or how to give the impression of a clean house in under 20 minutes” by Jackie Batey. Available in the Rubenstein Library: N7433.4.B38 A59 2006

Book + Art: Artists’ books from the Sallie Bingham Center for Women’s History and Culture (and the old version of Book + Art)

John Hope Franklin: Imprint of an American Scholar (and the old version of the John Hope Franklin exhibit)

Cheap Thrills: The Highs and Lows of Paris’s Cabaret Culture (and the old version of Cheap Thrills)

Medicology, or, Home encyclopedia of health: a complete family guide… Vol. I, by Joseph Gibbons Richardson (1904). Available in the Rubenstein Library: RC81 .R52 1904

Animated Anatomies: The Human Body in Anatomical Texts from the 16th to 21st Centuries (and the old version of Animated Anatomies)

Omeka still has some quirks to work out, and the accessibility of the pages and the metadata display are still in the works. However, migrating these exhibits into Omeka 2 will make them much easier to support and change for improvements. Thanks to the team that worked with me and taught me so much this summer: Will Sexton, Michael Daul, and Meg Brown!