This post was authored by Behind the Veil/Digital Collections intern Kristina Zapfe.
From the outside, viewing digitized items or requesting one yourself is a straightforward activity. Browsing images in the Duke Digital Repository produces instantaneous access to images from the David M. Rubenstein Rare Book and Manuscript Library’s collections, and requesting an item for digitization means that digital copies appear in your email in as little as a few weeks. But what happens between placing that request and receiving your digital copies is a well-mechanized feat, a testament to the hard work and dedication of the many staff members who have a hand in digitizing special collections.
I began at Duke Libraries in July 2022 as the Digital Collections Intern, learning about the workflows, and the pivots made to them, that prioritized public access to Rubenstein’s collections during pandemic-era uncertainty. While learning to apply metadata and scan special collections materials, I sketched out an understanding of how digital library systems function together, crafted by the knowledge and skills of the library staff who maintain and improve them. This gave me an appreciation for the collaborative and adaptive nature of the many departments and systems that account for every detail of digitizing items, providing patrons access to Rubenstein’s special collections from afar.
I have filmed short videos before, but none that required the amount of coordination and planning that this one did. Once the concept was formed, I began researching Duke University Libraries’ digital platforms and how they work together in order to overlay where the patron request process dipped into these platforms and when. After some email coordination, virtual meetings, and hundreds of questions, a storyboard was born and filming could begin. I tried out some camera equipment and determined that shooting on my iPhone 13 Pro with a gimbal attachment was a sufficient balance of quality and dexterity. Multiple trips to Rubenstein, Bostock, and Perkins libraries and Smith Warehouse resulted in about 500 video clips, about 35 of which appear in the final video.
Throughout this process, I learned that not everything goes as the storyboard plans, and I enjoyed leaving space for my own creativity as well as input from staff members whose insights about their daily work made for compelling shots and storytelling opportunities. The intent of this video is to tell the story of this process to everyone who uses and appreciates Duke Libraries’ resources and say “thank you” to the library staff who work together to make digital collections available.
Special thanks to all the staff who appeared in this video and enthusiastically volunteered their time, and to Maggie Dickson, who supervised and helped coordinate this project.
Written by Will Shaw on behalf of the Library Summer Camp organizing committee.
From June to August, most students may be off campus, but summer is still a busy time at Duke Libraries. Between attending conferences, preparing for fall semester, and tackling those projects we couldn’t quite fit into the academic year, Libraries staff have plenty to do. At the same time, summer also means a lull in many regular meetings — as well as remote or hybrid work schedules for many of us. Face-to-face time with our colleagues can be hard to come by. Lucky for all of us, July is when Libraries Summer Camp rolls around.
What is Summer Camp?
Summer Camp began in 2019 with two major goals: to foster peer-to-peer teaching and learning among Libraries staff, and to help build connections across the many units in our organization. Our staff have wide-ranging areas of expertise that deserve a showcase, and we could use a little time together in the summer. Why not try it out?
The first Summer Camp was narrowly focused on digital scholarship and publishing, and we solicited sessions from staff who we knew would already have instructional materials in hand. The response from both instructors and participants was enthusiastic; we ultimately brought staff together for 21 workshops over the course of a week in late summer.
The pandemic scuttled plans for 2020 and 2021 Summer Camps, but we relaunched in 2022 with the theme “Refresh!”—a conscious attempt to help us reconnect (in person, when possible!) after months of physical distance. Across the 2019, 2022, and 2023 iterations, Libraries Summer Camp has brought over 60 workshops to hundreds of attendees.
What did we learn this year?
Professional development workshops are still at the core of Summer Camp. But over the years, Camp has evolved to include a wider range of personal enrichment topics. The evolution has helped us find the right tone: learning together, as always, but having fun and focusing on personal growth, too.
For example, participants in this year’s Summer Camp could learn how to crochet or play the recorder, explore native plants, create memes, or practice Koru meditation. In parallel with those sessions, we had opportunities to discover the essentials of data visualization, try out platforms such as AirTable, discuss ChatGPT in libraries, learn fundraising basics, and improve our group discussions and decision-making, to name just a few.
Like any good Summer Camp, we wrapped things up with a closing circle. We shared our lessons learned, favorite moments, and hopes for future camps over Monuts and coffee.
After its third iteration, Summer Camp is starting to feel like a Duke Libraries tradition. Over 100 Libraries staff came together to teach with and learn from each other in 25 sessions this year. Based on both attendance and participant feedback, that’s a success, and it’s one we’d like to sustain. It’s hard not to feel excited for Summer Camp 2024.
As we look ahead, the organizing committee—Angela Zoss, Arianne Hartsell-Gundy, Kate Collins, Liz Milewicz, and Will Shaw—will be actively seeking new members, ideas for Summer Camp sessions, and volunteers to help out with planning. We encourage all Libraries staff to reach out and let us know what you’d like to see next time around!
simplify the patron request process in the Rubenstein Library;
preserve and make accessible files from patron requests in the Duke Digital Repository (DDR).
Note that in this context, our patrons are generally folks who want to access Rubenstein Library materials without making the trip to Durham. Anyone, regardless of researcher or academic status, can request digital copies of Rubenstein collections.
Moving digitization requests through this workflow continues to be the major focus for the digital collections team and the Digital Production Center (DPC). Given the folder-level nature of the process (whole folders of manuscript material digitized at preservation quality), more requests are digitized by the DPC than under our previous workflow. Additionally, the new request process became an essential tool for serving remote researchers during the pandemic. It continues to be a valuable service, and we have not seen demand lessen significantly since the peak of the pandemic. Below is a chart showing the number of patron requests managed by the DPC since before the pandemic (note that we track our statistics by fiscal year, or FY, which in Duke’s case runs July – June).
Patron requests received and files produced from said requests by the DPC.
As a result of the new patron request workflow, the digital collections team has made portions of hundreds of collections accessible in the digital repository. We also see new materials from the existing collections requested periodically, so individual digital collections grow over time. Our statistics for new digital collections are in the chart below.
Numbers of collections launched in the Duke Digital Repository since 2020, broken down by category: new digital collections from patron requests; additions to existing collections from patron requests; print items digitized for patron requests; new digital collections not based on patron requests; and additions to digital collections not tied to patron requests.
The patron request workflow, like all other digital collections projects, is carried out by the cross-departmental Duke Libraries Digital Collections Implementation Team (DCIT). DCIT members include representatives from Conservation Services, Digital Curation Services, the Digital Production Center, a Digital Projects Developer (from the Assessment and User Experience Strategy department), Rubenstein Library Research Services, and Rubenstein Library Technical Services. The group’s membership shows how varied the needs are to develop and sustain digital collections.
We have also been making slow progress on the “Section A” mass digitization project. The project is named for an old Rubenstein Library shelving location that holds over 3,000 small manuscript collections, many of which document life in the South in the 19th century. Since 2020, we have been able to make 210 Section A collections accessible online. Many of these were scanned before the pandemic began; however, the DPC continues to scan Section A materials when time permits. We have also seen at least 25 Section A collections come all the way through the patron request workflow, and there are more in progress. I’ve included embedded links to 3 Section A collections below.
Here are a few other project highlights from the past 2.5 years.
Metadata created during a Rubenstein Library re-cataloging project has been transformed and applied to the American Slavery Documents digital collection, thus making this collection and the identities of the enslaved persons documented therein more discoverable.
The Memory Project grew to include more oral histories in September 2021.
Digital Collections has a lot to look forward to in 2023-2024. Along with the John Hope Franklin Research Center we expect to wrap up the Behind the Veil grant in 2024 (lots more news to come on that). The digital collections team also plans to continue refining the patron request workflow. We are hoping to find a new balance in our portfolio that allows us to continue serving the needs of remote researchers while also completing more project based digitization. How will we actually do that without significantly changing our staffing? When we figure it out, we will be happy to share.
Behind the Veil Digitization intern Sarah Waugh and Digital Collections intern Kristina Zapfe’s efforts over the past year have focused on quality control of interviews transcribed by Rev.com. This post was authored by Sarah Waugh and Kristina Zapfe.
The Digital Production Center (DPC) is proud to announce that we have reached a milestone in our work on Documenting African American Life in the Jim Crow South: Digital Access to the Behind the Veil Project Archive. We have completed digitization and are over halfway through our quality control of the audio transcripts! The project, funded by the National Endowment for the Humanities, will expand the Behind the Veil (BTV) digital collection, currently 410 audio files, to include the newly digitized copies of the original master recordings, photographic materials, and supplementary project files.
The collection derives from Behind the Veil: Documenting African-American Life in the Jim Crow South, an oral history project headed by Duke University’s Center for Documentary Studies from 1993 to 1995. The resulting archive is housed in the David M. Rubenstein Rare Book and Manuscript Library and curated by the John Hope Franklin Research Center for African and African American History and Culture. The BTV project documented and preserved the memory of African Americans who lived in the South from the 1890s to the 1950s, resulting in a culturally significant and extensive multimedia collection.
As interns, our work focused on ordering transcripts from Rev.com and performing quality control on them for the digitized oral histories. July 2023 marked our arrival at the halfway point of the oral history transcript quality control process. At the time of writing, we’ve checked 1,727 of 2,876 files after a year of initial planning and hard work. With over 1,666 hours’ worth of audio files to complete, 3 interns and 7 student workers in the DPC have contributed 849 combined hours to oral history transcript quality control so far. Because of their scope, transcription and quality control are the last pieces of the digitization puzzle before the collection moves on to be ingested and published in the Duke Digital Repository.
We are approaching the home stretch with the deadline for transcript quality control coming in December 2023, and the collection scheduled to launch in 2024. With that goal approaching, here is what we’ve completed and what remains to be done.
As the graphic above indicates, the BTV digitization project comprises many different media, including audio, video, prints, negatives, slides, and administrative and project-related documents that tell a fuller story of this endeavor. With these formats digitized, we look forward to finishing quality control and preparing the files for handoff to members of the Digital Collections and Curation Services department for ingest, metadata application, and launch for public access in 2024. We plan to send all 2,876 audio files to Rev.com by the end of August and to perform quality control on all of those transcripts by December 2023.
Developing the Transcription Quality Control Process
With 2,876 files to check within 19 months, the cross-departmental BTV team developed a process to perform quality control as efficiently as possible without sacrificing accuracy, accessibility, or our commitment to our stakeholders. We made our decisions based on how we thought BTV interviewers and narrators would want their speech represented as text. Our quality control workflow began with Columbia University’s Oral History Transcription Style Guide; from that resource, we developed a workflow that made sense for our team and the project.
Some voices were difficult to transcribe due to issues with the original recording, such as a microphone being placed too far away from a speaker, the interference of background noise, or mistakes with the tape. Since we did not have the resources to listen to entire interviews and check for every single mistake, we developed what we called the “spot-check” process of checking these interviews. Given the BTV project’s original ethos and the history of marginalized people in archives, the team decided to prioritize making sure race-related language met our standards across every single interview.
A few decisions on standards were quick and unanimous, such as not transcribing speech phonetically. With that, we avoided the pitfalls of older oral histories of African Americans, like the WPA’s famous “Slave Narratives” project, which interviewed formerly enslaved people but often transcribed their words in non-standard phonetic spellings. Some narrators in the BTV project who may have been familiar with the WPA transcripts specifically asked the BTV project team not to use phonetic spelling.
Other choices took more discussion: we agreed on capitalizing “Black” when describing race, but we had to decide whether to capitalize other racial terms, including “White” and antiquated designations like “Colored.” Ultimately, we decided to capitalize all racial terms (with the exception of slurs). The team did not want users to make distinctions between lower and uppercase terms if we did not choose to capitalize them all. Maintaining consistency with capitalization would provide clarity and align with BTV values of equality between all races.
Using a spot-check process, with Rev’s find-and-replace feature to standardize our top priorities, freed up time to improve the transcripts in other ways. For instance, we also tried to find and correct proper nouns, like street names or the names of important people in our narrators’ communities, allowing users to make connections in their research. We corrected mistakes with phrases used mainly in the past or specific to certain regions, such as calling a dance hall a “Piccolo joint,” after an early jukebox brand name. We also listened to passages the transcriptionist could not hear or understand and had marked as “indistinct,” so that we could add in the dialogue (assuming we were able to decipher what was said).
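For illustration only: the team did this standardization by hand in Rev’s editor, but the core idea can be sketched as word-boundary substitutions in Python (the term list and function name here are hypothetical). Context still matters for words like “white,” which is exactly why a human spot-check follows any bulk replacement.

```python
import re

# Hypothetical priority list: lowercase racial terms mapped to their
# standardized capitalized forms (per the team's style decision).
PRIORITY_TERMS = {"black": "Black", "white": "White", "colored": "Colored"}

def standardize_racial_terms(text):
    """Capitalize priority terms at word boundaries; already-capitalized
    occurrences are left untouched. Results still need human review."""
    for lower, capped in PRIORITY_TERMS.items():
        text = re.sub(rf"\b{lower}\b", capped, text)
    return text
```

A pass like this only catches exact lowercase matches; slurs are deliberately excluded from the mapping, and ambiguous uses (“white paint”) are resolved during the spot-check.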
While we developed these methods to increase the pace of our quality control process, one of the biggest improvements came from working with Rev. If we were able to attain more accurate transcripts, our quality control process would be more efficient. Luckily, Rev’s suite of services provided us this option without straying too far from our transcription budget.
Improving Accuracy with Southern Accents Specialists
When deciding on what would be the best speech-to-text option for our project’s needs, we elected to order Transcript Services from Rev, rather than their Caption Services. This decision hinged on the fact that the Transcript Services option is their only service that allows us to request Rev transcriptionists who specialize in Southern accents. Many people who were interviewed for Behind the Veil spoke with Southern accents that varied in strength and dialect. We found that the Southern accent expertise of the specialists had a significant impact on the accuracy of the transcripts we received from Rev.
This improvement in transcript quality has made a substantial difference in the time we spend on quality control for each interview: on average, it only takes us about 48 seconds of work for every 60 seconds of audio we check. We appreciated Rev’s offering of Southern accent specialists enough that we chose that service, even though it meant that we had to then convert their text file format output to the WebVTT file format for enhanced accessibility in the Duke Digital Repository.
Optimizing Accessibility with WebVTT File Format
The WebVTT file format provides visual tracking that coordinates the audio with the written transcript. This improvement in user experience and accessibility justified converting the interview transcripts to WebVTT. Below is a visual of the WebVTT format in our existing BTV collection in the DDR.
We have been collaborating with developer Sean Aery to convert transcript text files to WebVTT files so they will display properly in the Duke Digital Repository. He explained the conversion process that occurs after we hand off the transcripts in text file format.
“The .txt transcripts we received from the vendor are primarily formatted to be easy for people to read. However, they are structured well enough to be machine-readable as well. I created a script to batch-convert the files into standard WebVTT captions with long text cues. In WebVTT form, the caption files play nicely with our existing audiovisual features in the Duke Digital Repository, including an interactive transcript viewer, and PDF exports.” – Sean Aery, Digital Projects Developer, Duke University Libraries
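Sean’s actual script isn’t reproduced here, but the shape of such a batch conversion can be sketched. This hypothetical Python version assumes each speaker turn in the .txt transcript is headed by a line like “Narrator (00:01:23):”; the real vendor format, and the DUL script, may differ.

```python
import re

# Matches a speaker-turn header such as "Interviewer (00:01:23):"
TURN_RE = re.compile(r"^(?P<speaker>.+?) \((?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\):\s*$")

def to_seconds(h, m, s):
    return int(h) * 3600 + int(m) * 60 + int(s)

def format_ts(total):
    # WebVTT timestamps look like HH:MM:SS.mmm
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.000"

def txt_to_vtt(text, tail_seconds=30):
    """Convert a speaker-turn transcript into a WebVTT string with long text cues."""
    turns = []  # (start_seconds, speaker, [text lines])
    for line in text.splitlines():
        m = TURN_RE.match(line.strip())
        if m:
            turns.append((to_seconds(m["h"], m["m"], m["s"]), m["speaker"], []))
        elif turns and line.strip():
            turns[-1][2].append(line.strip())
    out = ["WEBVTT", ""]
    for i, (start, speaker, lines) in enumerate(turns):
        # Each cue ends where the next begins; pad the final cue.
        end = turns[i + 1][0] if i + 1 < len(turns) else start + tail_seconds
        out.append(f"{format_ts(start)} --> {format_ts(end)}")
        out.append(f"{speaker}: " + " ".join(lines))
        out.append("")
    return "\n".join(out)
```

Run over a whole directory, a script like this is what makes the “batch” in batch conversion: each .txt file in, one .vtt file out, ready for the repository’s transcript viewer.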
Before conversion, we complete one more round of quality control using the spot-checking process. We have even referred to other components of the Behind the Veil collection, such as the Administrative and Project Files, to cross-reference any alterations to metadata for accuracy.
We also recently presented at the Triangle Research Libraries Network annual meeting, where our presentation overlapped with some of what you’ve just read in this post. It was exciting to share our work publicly for the first time and answer questions from library staff across the region. We will also be presenting a poster about our BTV experience at the upcoming North Carolina Library Association conference in Winston-Salem in October.
As we’ve hoped to convey, this project heavily relies on collaboration from many library departments and external vendors, and there are more contributors than we can thoroughly include in this post. Behind the Veil is a large-scale and high-profile project that has impacted many people over its 30-year history, and this newest iteration of digital accessibility seeks to expand the reach of this collection. Two years on, we’ve built on the work of the many professionals who have come before us to create and develop Behind the Veil. We are honored to be part of this rewarding process. Look for more BTV stories when we cross the finish line in 2024.
Coming on board as the new Web Experience Developer in the Assessment and User Experience Services (AUXS) Department in early 2022, one of my first priorities was to get up to speed on Web Accessibility guidelines and testing. I wanted to learn how these standards had been applied to Library websites to date and establish my own processes and habits for ongoing evaluation and improvement. Flash-forward one year, and I’m looking back at the steps that I took and reflecting on lessons learned and projects completed. I thought it might be helpful to myself and others in a similar situation (e.g. new web developers, designers, or content creators) to organize these experiences and reflections into a sort of manual or “Quick Start Guide”. I hope that these 5 steps will be useful to others who need a crash course in this potentially confusing or intimidating–but ultimately crucial and rewarding–territory.
Learn from your colleagues
Fortunately, I quickly discovered that Duke Libraries already had a well-established culture and practice around web accessibility, including a number of resources I could consult.
Two Bitstreams posts from our longtime web developer/designer Sean Aery gave me a quick snapshot of the current state of things, recent initiatives, and ongoing efforts:
Repositories of the Library’s open source software projects proved valuable in connecting broader concepts with specific examples and seeing how other developers had solved problems. For instance, I was able to look at the code for DUL’s “theme” (basically visual styling, color, typography, and other design elements) to better understand how it builds on the ubiquitous Bootstrap CSS framework and implements specific accessibility standards around semantic markup, color contrast, and ARIA roles/attributes:
The site also offers guides geared towards the needs of different stakeholders (content creators, designers, developers) as well as a step-by-step overview of how to do an accessibility assessment.
Know your standards
Duke University has specified the World Wide Web Consortium’s Web Content Accessibility Guidelines version 2.0, Level AA Conformance (WCAG 2.0 Level AA) as its preferred accessibility standard for websites. While it was initially daunting to digest and parse these technical documents, at least I had a known, widely adopted target that I was aiming for–in other words, an achievable goal. Feeling bolstered by that knowledge, I was able to use the other resources mentioned here to fill in the gaps and get hands-on experience and practice solving web accessibility issues.
Find a playground
As I settled into the workflow of our Scrum team (based on Agile software development principles), I found a number of projects that let me test and experiment with how different markup and design decisions affect accessibility. I particularly enjoyed updating the Style Guide for our Catalog as part of a Bootstrap 3-to-4 migration; updating our DUL theme across various applications built on the Ruby on Rails framework, including the Library Catalog, Quicksearch, and the Staff Directory; and getting scrappy and creative trying to improve the branding and accessibility of some of our vendor-hosted web apps with the limited tools available (essentially jQuery scripts and CSS applied to existing markup).
Build your toolkit
A few well-chosen tools can get you far in assessing and correcting web accessibility issues on your websites.
The built-in developer tools in your browser are essential for viewing and testing changes to markup and understanding how CSS rules are applied to the Document Object Model. The Deque Systems aXe Chrome Extension (also available for Firefox) adds additional tools for accessibility testing with a slick interface that performs a scan, gives a breakdown of accessibility violations ranked by severity, and tells you how to fix them.
Color Contrast Checkers
I frequently turned to these two web-based tools for quick tests of different color combinations. It was educational to see what did and didn’t work in various situations and think more about how aesthetic and design concerns interact with accessibility concerns.
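The math behind those checkers comes straight from the WCAG 2.0 definitions of relative luminance and contrast ratio, and it is simple enough to compute yourself. A minimal Python sketch (the function names are mine):

```python
def relative_luminance(hex_color):
    """WCAG 2.0 relative luminance of an sRGB hex color like '#0577B1'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Linearize each sRGB channel per the WCAG 2.0 formula.
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2]

def contrast_ratio(fg, bg):
    """Contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), from 1:1 to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def passes_aa(fg, bg, large_text=False):
    """WCAG 2.0 AA: at least 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

Black on white comes out to exactly 21:1, while a mid-gray like #777777 on white lands just under the 4.5:1 AA threshold for normal text, which is why borderline grays come up so often in palette work.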
These style guides provided a handy reference for default and variant typography, color, and page design elements. I found the color palettes particularly helpful as I tried to find creative solutions to color contrast problems while maintaining Duke branding and consistency across various Library pages.
Attempting to navigate our websites using only the TAB, ENTER, SPACE, UP, and DOWN keys on a standard computer keyboard gave me a better understanding of the significance of semantic markup, skip links, and landmarks. This test is essential for getting another “view” of your pages that isn’t as dependent on visual cues to convey meaning and structure and can help surface issues that automated accessibility scanners might miss.
Post authored by Jen Jordan, Digital Collections Intern.
Hello, readers. This marks my third and final blog post as the Digital Collections intern, a position I began in June of last year.* Over the course of this internship, I have been fortunate to gain experience in nearly every step of the digitization and digital collections processes. One of the things I’ve come to appreciate most about the workflows I’ve learned is how well they accommodate the variety of collection materials that pass through them. This means that when unique cases arise, there is space to consider them. I’d like to describe one such case, involving a pretty remarkable collection.
In early October I arrived to work in the Digital Production Center (DPC) and was excited to see that the Booker T. Washington correspondence, 1903-1916, 1933 and undated, was next up in the queue for digitization. The collection is small, containing mostly letters exchanged between Washington, W. E. B. DuBois, and a host of other prominent leaders in the Black community during the early 1900s. A 2003 article published in Duke Magazine shortly after the Washington collection was donated to the John Hope Franklin Research Center provides a summary of the collection and the events it covers.
Arranged chronologically, the papers were stacked neatly in a small box, each letter sealed in a protective sleeve, presumably after undergoing extensive conservation treatments to remediate water and mildew damage. As I scanned the pages, I made a note to learn more about the relationship between Washington and DuBois, as well as the events the collection is centered around—the Carnegie Hall Conference and the formation of the short-lived Committee of Twelve for the Advancement of the Interests of the Negro Race. When I did follow up, I was surprised to find that remarkably little has been written about either.
As I’ve mentioned before, there is little time to actually look at materials when we scan them, but the process can reveal broad themes and tone. Many of the names in the letters were unfamiliar to me, but I observed extensive discussion between DuBois and Washington regarding who would be invited to the conference and included in the Committee of Twelve. I later learned that this collection documents what would be the final attempt at collaboration between DuBois and Washington.
Once scanned, the digital surrogates pass through several stages in the DPC before they are prepared for ingest into the Duke Digital Repository (DDR); you can read a comprehensive overview of the DPC digitization workflow here. Fulfilling patron requests is top priority, so after patrons receive the requested materials, it might be some time before the files are submitted for ingest to the DDR. Because of this, I was fortunate to be on the receiving end of the BTW collection in late January. By then I was gaining experience in the actual creation of digital collections—basically everything that happens with the files once the DPC signals that they are ready to move into long term storage.
There are a few different ways that new digital collections are created. Thus far, most of my experience has been with the files produced through patron requests handled by the DPC. These tend to be smaller in size and have a simple file structure. The files are migrated into the DDR, into either a new or existing collection, after which file counts are checked, and identifiers assigned. The collection is then reviewed by one of a few different folks with RL Technical Services. Noah Huffman conducted the review in this case, after which he asked if we might consider itemizing the collection, given the letter-level descriptive metadata available in the collection guide.
I’d like to pause for a moment to discuss the tricky nature of “itemness,” and how the meaning can shift between RL and DCCS. If you reference the collection guide linked in the second paragraph, you will see that the BTW collection received item-level description during processing—with each letter constituting an item in the collection. The physical arrangement of the papers does not reflect the itemized intellectual arrangement, as the letters are grouped together in the box they are housed in. When fulfilling patron reproduction requests, itemness is generally dictated by physical arrangement, in what is called the folder-level model; materials housed together are treated as a single unit. So in this case, because the letters were grouped together inside of the box, the box was treated as the folder, or item. If, however, each letter in the box was housed within its own folder, then each folder would be considered an item. To be clear, the papers were housed according to best practices; my intent is simply to describe how the processes between the two departments sometimes diverge.
Processing archival collections is labor intensive, so it’s increasingly uncommon to see item-level description. Collections can sit unprocessed in “backlog” for many years, and though the depth of that backlog varies by institution, even well-resourced archives confront the problem of backlog. Enter: More Product, Less Process (MPLP), introduced by Mark Greene and Dennis Meissner in a 2005 article as a means to address the growing problem. They called on archivists to prioritize access over meticulous arrangement and description.
The spirit of folder-level digitization is quite similar to MPLP, as it enables the DPC to provide access to a broader selection of collection materials digitized through patron requests, and it also simplifies the process of putting the materials online for public access. Most of the time, the DPC’s approach to itemness aligns closely with the level of description given during processing of the collection, but the inevitable variance found between archival collections requires a degree of flexibility from those working to provide access to them. Numerous examples of digital collections that received item-level description can be found in the DDR, but those are generally tied to planned efforts to digitize specific collections.
Because the BTW collection was digitized as an item, the digital files were grouped together in a single folder, which translated to a single landing page in the DDR’s public user interface. Itemizing the collection would give each item/letter its own landing page, with the potential to add unique metadata. Similarly, when users navigate the RL collection guide, embedded digital surrogates appear for each item. A moment ago I described the utility of More Product Less Process. There are times, however, when it seems right to do more. Given the research value of this collection, as well as its relatively small size, the decision to proceed with itemization was unanimous.
Itemizing the collection was fairly straightforward. Noah shared a spreadsheet with metadata from the collection guide. There were 108 items, with each item’s title containing the sender and recipient of each letter, as well as the location and date sent. Given the collection’s chronological physical arrangement, it was relatively simple to work through the files and assign them to new folders. Once that was finished, I selected additional descriptive metadata terms to add to the spreadsheet, in accordance with the DDR Metadata Application Profile. Because there was a known sender and recipient for almost every letter, my goal was to identify any additional name authority records not included in the collection guide. This would provide an additional access point by which to navigate the collection. It would also help me to identify death dates for the creators, which helps determine copyright status. I think the added time and effort was well worth it.
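As a rough illustration of the kind of spreadsheet work described above, the snippet below splits a single item title into separate metadata fields. The title format, field names, and regular expression here are all hypothetical, invented for the sake of the sketch; they are not the actual structure of the collection guide.

```python
import re

# Hypothetical title format: "Letter from <sender> to <recipient>, <location>, <date>"
# Real collection-guide titles may be structured differently.
TITLE_PATTERN = re.compile(
    r"Letter from (?P<sender>.+?) to (?P<recipient>.+?), "
    r"(?P<location>.+?), (?P<date>\d{4}(?: \w+ \d{1,2})?)$"
)

def parse_title(title):
    """Split an item title into sender, recipient, location, and date fields."""
    match = TITLE_PATTERN.match(title)
    return match.groupdict() if match else None

fields = parse_title(
    "Letter from Booker T. Washington to W. E. B. DuBois, Tuskegee, Ala., 1904 March 5"
)
print(fields)
```

Once each title is broken into fields like these, adding columns to a working spreadsheet (and spotting names that need authority records) becomes largely mechanical.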
This isn’t the space for analysis, but I do hope you’re inspired to spend some time with this fascinating collection. Primary source materials offer an important path to understanding history, and this particular collection captures the planning and aftermath of an event that hasn’t received much analysis. There is more coverage of what came after; Washington and DuBois parted ways, after which DuBois became a founding member of the Niagara Movement. Though also short-lived, the Niagara Movement is considered a precursor to the NAACP, which many of its members would go on to join. A significant portion of W. E. B. DuBois’s correspondence has been digitized and made available to view through UMass Amherst. It contains many additional letters concerning the Carnegie Conference and Committee of Twelve, offering additional context and perspective, particularly in certain letters that were surely not intended for Washington’s eyes. What I found most fascinating, though, was the evidence of less public (and less adversarial) collaboration between the two men.
The additional review and research required by the itemization and metadata creation was such a fascinating and valuable experience. This is true on a professional level as it offered the opportunity to do something new, but I also felt moved to try to understand more about the cast of characters who appear in this important collection. That endeavor extended far beyond the hours of my internship, and I found myself wondering if this was what the obsessive pursuit of a historian’s work is like. In any case, I am grateful to have learned more, and also reminded that there is so much more work to do.
Click here to view the Booker T. Washington correspondence in the Duke Digital Repository.
*Indeed, this marks my final post in this role, as my internship concludes at the end of April, after which I will move on to a permanent position. Happily, I won’t be going far, as I’ve been selected to remain with DCCS as one of the next Repository Services Analysts!
Cheyne, C.E. “Booker T. Washington sitting and holding books,” 1903. 2 photographs on 1 mount : gelatin silver print ; sheets 14 x 10 cm. In Washington, D.C., Library of Congress Prints and Photographs Division. Accessed April 5, 2022. https://www.loc.gov/pictures/item/2004672766/
Post authored by Jen Jordan, Digital Collections Intern.
As another strange year nears its end, I’m going out on a limb to assume that I’m not the only one around here challenged by a lack of focus. With that in mind, I’m going to keep things relatively light (or relatively unfocused) and take you readers on a short tour of items that have passed through the Digital Production Center (DPC) this year.
Shortly before the arrival of COVID-19, the DPC implemented a folder-level model for digitization. This model was not developed in anticipation of a life-altering pandemic, but it was well-suited to meet the needs of researchers who, for a time, were unable to visit the Rubenstein Library to view materials in person. You can read about the implementation of folder-level digitization and its broader impact here. To summarize, before spring of 2020 it was standard practice to fill patron requests by imaging only the item needed (e.g., a single page within a folder). Now, the default practice is to digitize the entire folder of materials. This has produced a variety of positive outcomes for stakeholders in the Duke University Libraries and broader research community, but for the purpose of this blog, I’d like to describe my experience interacting with materials in this way.
Digitization is time consuming, so the objective is to move as quickly as possible while maintaining a high level of accuracy. There isn’t much time for meaningful engagement with collection items, but context reveals itself in bits and pieces. Themes rise to the surface when working with large folders of material on a single topic, and sometimes the image on the page demands to be noticed.
On more than one occasion I’ve found myself thinking about the similarities between scanning and browsing a social media app like Instagram. Stick with me here! Broadly speaking, both offer an endless stream of visual stimuli with little opportunity for meaningful engagement in the moment. Social media, when used strategically, can be world-expanding. Work in the DPC has been similarly world-expanding, but instead of an algorithm curating my experience, the information that I encounter on any given day is curated by patron requests for digitization. Also similar to social media is the range of internal responses triggered over the course of a work day, and sometimes in the span of a single minute. Amusement, joy, shock, sorrow—it all comes up.
I started keeping notes on collection materials and topics to revisit on my own time. Sometimes I was motivated by a stray fascination with the subject matter. Other times I encountered collections relating to prominent historical figures or events that I realized I should probably know a bit more about.
First-wave feminism was one such topic that revealed itself. It was a movement I knew little about, but the DPC has digitized numerous items relating to women’s suffrage and other feminist issues at the turn of the 20th century. I was particularly intrigued by the radical leanings of the UK’s Women’s Social and Political Union (WSPU), organized by Emmeline Pankhurst to fight for the right to vote. When I started looking at newspaper clippings pasted into a scrapbook documenting WSPU activities, I was initially distracted by the amusing choice of words (“Coronation chair damaged by wild women’s bomb”). Curious to learn more, I went home and read about the WSPU. The following excerpt is from a speech by Pankhurst in which she provides justification for the militant tactics employed by the WSPU:
I want to say here and now that the only justification for violence, the only justification for damage to property, the only justification for risk to the comfort of other human beings is the fact that you have tried all other available means and have failed to secure justice. I tell you that in Great Britain there is no other way…
Pankhurst argued that men had to take the right to vote through war, so why shouldn’t women also resort to violence and destruction? And so they did.
As Rubenstein Library is home to the Sallie Bingham Center, it’s unsurprising that the DPC digitizes a fair amount of material on women’s issues. To share a few more examples, I appreciate the juxtaposition of the following two images, both of which I find funny, and yet sad.
This advertisement for window shades is pasted inside a young woman’s scrapbook dated 1900–1905. It contains information on topics such as etiquette, how to manage a household, and how to be a good wife. Are we to gather that proper shade cloth is necessary to keep a man happy?
In contrast, the image below is from the book L’amour libre by the French feminist Madeleine Vernet, which describes prostitution and marriage as the same kind of prison, with “free love” as the only answer. Some might call that a hyperbolic comparison, but after perusing the young woman’s scrapbook, I’m not so sure. I’m just thankful to have been born a woman near the end of the 20th century and not the start of it.
This may be difficult to believe, but I didn’t set out to write a blog so focused on struggle. The reality, however, is that our special collections are full of struggle. That’s not all there is, of course, but I’m glad this material is preserved. It holds many lessons, some of which we still have yet to learn.
I think we can all agree that 2021 was, well, a challenging year. I’d be remiss not to close with a common foe we might all rally around. As we move into 2022 and beyond, venturing ever deeper into space, we may encounter this enemy sooner than we imagined…
Pankhurst, Emmeline. Why We Are Militant: A Speech Delivered by Mrs. Pankhurst in New York, October 21, 1913. London: Women’s Press, 1914. Print.
“‘Prayers for Prisoners’ and church protests.” Historic England, n.d., https://historicengland.org.uk/research/inclusive-heritage/womens-history/suffrage/church-protests/
Fourteen-hundred pages with 70 different authors, all sharing information about library services, resources, and policies — over the past eight years, any interested library staff member has been able to post and edit content on the Duke University Libraries (DUL) website. Staff have been able to work independently, using their own initiative to share information that they thought would be helpful to the people who use our website.
Unfortunately, DUL has had no structure for coordinating this work or even for providing training to people undertaking this work. This individualistic approach has led to a complex website often containing inconsistent or outdated information. And this is all about to change.
Our new approach
We are implementing a team-based approach to manage our website content by establishing the Web Editorial Board (WEB) comprised of 22 staff from departments throughout DUL. The Editors serving on WEB will be the only people who will have hands-on access to create or edit content on our website. We recognize that our primary website is a core publication of DUL, and having this select group of Editors work together as a team will ensure that our content is cared for, cohesive, and current. Our Editors have already undertaken training on topics such as writing for the web, creating accessible content, editing someone else’s content, and using our content management system.
Our Editors will apply their training to improve the quality and consistency of our website. As they undertake this work, they will collaborate with other Editors within WEB as well as with subject matter experts from across the libraries. All staff at DUL will be able to request changes, contribute ideas, and share feedback with WEB using either a standard form or by contacting Editors directly.
The scope of work undertaken by WEB includes:
Editing, formatting, and maintaining all content on DUL’s Drupal-based website
Writing new content
Retiring deprecated content
Reviewing, editing, and formatting content submitted to WEB by DUL staff, and consulting with subject matter experts within DUL
Deepening their expertise in how to write and format website content through continuing education
While there are times when all 22 Editors will meet together to address common issues or collaborate on site-wide projects, much of the work undertaken by WEB will be organized around sub-teams that we refer to as content neighborhoods. Each neighborhood meets monthly and focuses on maintaining a different section of our website. Our eight sub-teams range in size from two to five people, which ensures that our Editors will be able to mutually support one another in their work.
Initially, Editors on WEB will serve for a two-year term, after which some members will rotate off so that new members can rotate on. Over time it will be helpful to balance continuity in membership with the inclusion of fresh viewpoints.
WEB was created following a recommendation developed by DUL’s Web Experience Team (WebX), the group that provides high-level governance for all of our web platforms. Based on this WebX recommendation, the DUL Executive Group issued a charge for WEB in the spring and WEB began its orientation and training during the summer of 2021. Members of WEB will soon be assisting in our migration from Drupal 7 to Drupal 9 by making key updates to content prior to the migration. Once we complete our migration to Drupal 9 in March 2022, we will then limit hands-on access to create or edit content in Drupal to the members of WEB.
The charge establishing WEB contains additional information about WEB’s work, the names of those serving on WEB, and the content neighborhoods they are focusing on.
Duke is using FOLIO in production! We have eight apps that we’re using in production. For our electronic resources management, we are using Agreements, Licenses, Organizations, Users, and Settings. Those apps went live in July of 2020, even with the pandemic in full force! In July of 2021, we launched Courses and Inventory so that professors and students could store and access electronic reserves material. In Summer 2022, we plan to launch the eUsage app that will allow us to link to vendor sites and bring our eUsage statistics into one place.
In Summer 2023, we plan to launch the rest of FOLIO, moving all of our acquisitions, cataloging, and circulation functions into their respective apps. Currently, the total number of apps included in FOLIO is 20. We’re almost halfway there!
Quick—when was the last time you went a full day without using a Google product or service? How many years ago was that day?
We all know Google has permeated so many facets of our personal and professional lives. A lot of times, using a Google something-or-other is your organization’s best option to get a job done, given your available resources. If you ever searched the Duke Libraries website at any point over the past seventeen years, you were using Google.
It’s really no secret that when you have a website with a lot of pages, you need to provide a search box so people can actually find things. Even the earliest version of the library website known to the Wayback Machine–from “way back” in 1997–had a search box. In those days, search was powered by the in-house supported Texis Webinator. Google did not yet exist.
July 24, 2004 was an eventful day for the library IT staff. We went live with a shiny new Integrated Library System from Ex Libris called Aleph (that we are still to this day working to replace). On that very same day, we launched a new library website, and in the top-right corner of the masthead on that site was–for the very first time–a Google search box.
Years went by. We redesigned the website several times. Interface trends came and went. But one thing remained constant: there was a search box on the site, and if you used it, somewhere on the next page you were going to get search results from a Google index.
That all changed in summer 2021, when we implemented Nutch…
Why Not Google?
Google Programmable Search Engine (recently rebranded from “Google Custom Search Engine”), is easy to use. It’s “free.” It’s fast, familiar, and being a Google thing, it’s unbeatable at search relevancy. So why ditch it now? Well…
The results are capped at 100 per query. Google prioritizes speed and page 1 relevancy, but it won’t give you a precise hit count or an exhaustive list of results.
It’s a black box. You don’t really get to see why pages get ranked higher or lower than others.
There’s a search API you could potentially build around, but if you exceed 100 searches/day, you have to start paying to use it.
Apache Nutch is open source web crawler software written in Java. It’s been around for nearly 20 years–almost as long as Google. It supports out-of-the-box integration with Apache Solr for indexing.
What’s So Good About Nutch?
Solr. Our IT staff have grown quite accustomed to the Solr search platform over the past decade; we already support around ten different applications that use it under the hood.
Self-Hosted. You run it yourself, so you’re in complete control of the data being crawled, collected, and indexed. User search data is not being collected by a third party like Google.
Configurable. You have a lot of control over how it works. All our configs are in a public code repository so we have record of what we have changed and why.
What are the Drawbacks to Using Nutch?
Maintenance. Using open source software requires a commitment of IT staff resources to build and maintain over time. It’s free, but it’s not really free.
Interface. Nutch doesn’t come with a user interface to actually use the indexed data from the crawls; you have to build a web application. Here’s ours.
Relevancy. Though Google considers such factors as page popularity and in-link counts to deem pages as more relevant than others for a particular query, Nutch can’t. Or, at least, its optional features that attempt to do so are flawed enough that not using them gets us better results. So we rely on other factors for our relevancy algorithm, like the segment of the site in which a page resides, URL slugs, page titles, subheading text, inlink text, and more.
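One common way to express field-level weighting like this in Solr is through the edismax query parser’s qf parameter. The snippet below builds such a query string; the field names and boost weights are hypothetical, for illustration only, not the DUL production configuration.

```python
from urllib.parse import urlencode

# Hypothetical fields and boosts: weight title matches most heavily,
# then URL slug, then subheadings, then body text.
params = {
    "defType": "edismax",            # use Solr's extended dismax parser
    "q": "interlibrary loan",        # the user's search terms
    "qf": "title^10 url_slug^5 subheading^3 content^1",
    "rows": 10,
}
query_string = urlencode(params)
print("/solr/website/select?" + query_string)
```

Tuning these weights against real queries is where most of the relevancy work happens once the popularity signals Google relies on are off the table.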
Documentation. Some open source platforms have really clear, easy to understand instruction manuals online to help you understand how to use them. Nutch is not one of those platforms.
Searching from the website masthead or the default “All” box in the tabbed section on our homepage brings you to the QuickSearch results page.
You’ll see a search results page rendered by our QuickSearch app. It includes sections of results from various places, like articles, books & media, and more. One of the sections is “Our Website” — it shows the relevant pages that we’ve crawled with Nutch.
You can just search the website specifically if you’re not interested in all those other resources.
Three pieces work in concert to enable searching the website: Nutch, Solr, and QuickSearch. Here’s what they do:
Nutch: Crawls the web pages we want to include in the website search, parses their HTML content, and writes it to Solr fields. Its configuration controls which pages to include or exclude, crawler settings, and field mappings.
Solr: Serves as the index and document store for the crawled website content.
QuickSearch: Renders the “Our Website” results alongside the other sections on the search results page.
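The include/exclude rules mentioned above typically live in Nutch’s regex-urlfilter.txt, where each line accepts (+) or rejects (-) URLs matching a regular expression. A hypothetical sketch follows; the hostnames and paths are illustrative, not our actual rules:

```
# skip non-http protocols
-^(file|ftp|mailto):
# skip staff-only admin pages (illustrative path)
-^https?://library\.duke\.edu/admin/
# accept everything else on the library site
+^https?://library\.duke\.edu/
# reject anything not matched above
-.
```

Rules are evaluated top to bottom, and the first match wins, which is why the catch-all reject comes last.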
Crawls happen every night to pick up new pages and changes to existing ones. We use an “adaptive fetch schedule” so by default each page gets recrawled every 30 days. If a page changes frequently, it’ll get re-crawled sooner automatically.
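The adaptive schedule works by shortening the recrawl interval for pages that changed since the last fetch and lengthening it for pages that did not. Here is a simplified Python sketch of that behavior; the rate and bound values are illustrative, not our production settings:

```python
def next_fetch_interval(interval_secs, page_changed,
                        inc_rate=0.4, dec_rate=0.2,
                        min_interval=60, max_interval=365 * 24 * 3600):
    """Simplified model of an adaptive fetch schedule: shorten the
    recrawl interval for pages that changed, lengthen it for pages
    that did not, clamped to sane bounds."""
    if page_changed:
        interval_secs *= (1 - dec_rate)   # page changed: check back sooner
    else:
        interval_secs *= (1 + inc_rate)   # unchanged: wait longer next time
    return max(min_interval, min(interval_secs, max_interval))

thirty_days = 30 * 24 * 3600
print(next_fetch_interval(thirty_days, page_changed=True))   # interval shrinks
print(next_fetch_interval(thirty_days, page_changed=False))  # interval grows
```

Over many crawl cycles, stable pages drift toward the maximum interval while frequently updated pages settle at short ones, which is how new content gets picked up quickly without recrawling the whole site nightly.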
Overall, we’re satisfied with how the switch to Nutch has been working out for us. The initial setup was challenging, but it has been running reliably without needing much in the way of developer intervention. Here’s hoping that continues!
Many thanks to Derrek Croney and Cory Lown for their help implementing Nutch at Duke, and to Kevin Beswick (NC State University Libraries) for consulting with our team.
Notes from the Duke University Libraries Digital Projects Team