
Lichens, Bryophytes and Climate Change

As 2015 winds down, the Digital Production Center is wrapping up a four-year collaboration with the Duke Herbarium to digitize their lichen and bryophyte specimens. The project is funded by the National Science Foundation, and the ultimate goal is to digitize over 2 million specimens from more than 60 collections across the nation. Lichens and bryophytes (mosses and their relatives) are important indicators of climate change. After the images from the participating institutions are uploaded to one central portal, called iDigBio, large-scale distribution mapping will be used to identify regions where environmental changes are taking place, allowing scientists to study the patterns and effects of these changes.


The specimens are first transported from the Duke Herbarium to Perkins Library on a scheduled timeline. Then we photograph the specimen labels using our Phase One overhead camera. Some of the specimens are very bulky, but our camera's depth of field is deep enough to keep them in focus. To be clear, the project does not use photos of the plant specimens themselves, but rather images of the typed and handwritten scientific metadata on the envelopes that house the specimens. After we photograph them, the images are uploaded to the national database, where they are available for online research alongside specimen labels uploaded from universities across the United States. Optical character recognition is then used to extract and organize the scientific metadata in the images.


Over the past four years, the Digital Production Center has digitized approximately 100,000 lichen and bryophyte specimens. Many are from the Duke Herbarium, but other institutions, such as UNC-Chapel Hill, SUNY-Binghamton, Towson University, and the University of Richmond, have also asked us to digitize some of their specimens. The Duke Herbarium is the second-largest herbarium among U.S. private universities, after Harvard's. It was started in 1921 and contains more than 800,000 specimens of vascular plants, bryophytes, algae, lichens, and fungi, some collected as far back as the 1800s. Several specimens have unintentionally humorous names, like the following, which wants to be funky but isn't fooling anyone. Ok, maybe only I find that funny.

Specimen label image

The project has been extensive, but enjoyable, thanks to the leadership of Duke Herbarium Data Manager Blanka Shaw. Dr. Shaw has personally collected bryophytes on many continents, and has brought a wealth of knowledge, energy and good humor to the collaboration with the Digital Production Center. The Duke Herbarium is open for visitors, and citizen scientists are also needed to volunteer for transcription and georeferencing of the extensive metadata collected in the national database.

Introducing the Digital Monograph of Haiti

In 2014 the Rubenstein Library acquired the Monograph of Haiti, an aggregation of intelligence information gathered by the U.S. Marine Corps during its occupation of the country from 1915 to 1934. The item has recently been digitized, and this week guest bloggers Holly Ackerman and Sara Seten Berghausen introduce us to the monograph and its provenance.

Interior image from the Monograph of Haiti

The catalog of the U.S. Marine Corps Archives is not publicly available. Marine regulations require researchers who want to explore the Archives' holdings to travel to Quantico, Virginia, and, once there, to rely on expert staff to conduct a search for them. Only then are researchers free to look at the materials.

Like any prohibition, the lack of direct access creates both frustration and allure. As the number of Duke faculty and students studying Haiti increased over the last five years, Holly Ackerman, Duke's Librarian for Latin American and Caribbean Studies, felt the pull of possible treasure and traveled to Quantico. Since the U.S. Marines had occupied Haiti from 1915 to 1934, it seemed likely that there would be significant collections that might interest our scholars.

An image of the Monograph prior to digitization.

The archives did not disappoint. Chief among the treasures was The Monograph of the Republic of Haiti, a book that looks more like an old accountant's ledger than the accumulation of intelligence information from the U.S. occupation era that it really is. On its opening page, the Monograph declares its purpose:

“The object of this book is to provide operative and war information upon the Republic of Haiti. A monograph aims to be so thorough a description of the country upon which it is written that the Commander of any Expedition approaching its coasts will have at his disposal all the information obtainable to commence active operations in case of a hostile invasion or a peaceful occupation, and to facilitate his diplomatic routine mission in time of peace.”

Since the Marine Corps Archive owned two of only six known copies of the Monograph, it offered to donate one to the Rubenstein Library at Duke, which received it in the spring of 2014. The Archive's intent was to share the monograph as widely as possible. To fulfill that intent, the Duke Libraries' Digital Production Center cataloged, conserved, and digitized the Monograph in 2015, making it available worldwide via the Internet Archive. Scholars in Haiti and the U.S. have begun using the resource for research and teaching.

Image of an interior page from the Monograph of Haiti

Post Contributed by Holly Ackerman, Librarian for Latin American, Iberian and Latino/a Studies and Sara Seten Berghausen, Associate Curator of Collections, Rubenstein Library

Baby Steps towards Metadata Synchronization

How We Got Here: A terribly simplistic history of library metadata

Managing the description of library collections (especially “special” collections) is an increasingly complex task.  In the days of yore, we bought books and other things, typed up or purchased catalog cards describing those things (metadata), and filed the cards away.  It was tedious work, but fairly straightforward.  If you wanted to know something about anything in the library’s collection, you went to the card catalog.  Simple.

Sometime in the 1970s or 1980s we migrated all (well, most) of that card catalog description to the ILS (Integrated Library System). If you wanted to describe something in the library, you made a MARC record in the ILS. Patrons searched those MARC records in the OPAC (the public-facing view of the ILS). Still pretty simple. Sure, we maintained other paper-based tools for managing description of manuscript and archival collections (printed finding aids, registers, etc.), but until somewhat recently, the ILS was really the only "system" in use in the library.

Duke Online Catalog, 1980s

From the 1990s on, things got complicated. We started making EAD and MARC records for archival collections. We started digitizing parts of those collections and creating Dublin Core records, and sometimes TEI, for the digital objects. We created and stored library metadata in relational databases (MySQL), METS, MODS, and even flat HTML. As library metadata standards proliferated, so too did the systems we used to create, manage, and store that metadata.

Now, we have an ILS for managing MARC-based catalog records, ArchivesSpace for managing more detailed descriptions of manuscript collections, a Fedora (Hydra) repository for managing digital objects, CONTENTdm for managing some other digital objects, and lots of little intermediary descriptive tools (spreadsheets, databases, etc.).  Each of these systems stores library metadata in a different format and in varying levels of detail.

So what’s the problem and what are we doing about it?

The variety of metadata standards and systems isn't the problem. The problem, and it's a very painful and time-consuming one, is having to maintain and reconcile description of the same thing (a manuscript, a folder of letters, an image, an audio file, etc.) across all these disparate metadata formats and systems. It's a metadata synchronization problem, and it's a big one.

For the past four months or so, a group of archivists and developers here in the library have been meeting regularly to brainstorm ways to solve or at least help alleviate some of our metadata synchronization problems.  We’ve been calling our group “The Synchronizers.”

What have The Synchronizers been up to?  Well, so far we’ve been trying to tackle two pieces of the synchronization conundrum:

Problem 1 (the big one): Keeping metadata for special collections materials in sync across ArchivesSpace, the digitization process, and our Hydra repository.

Ideally, we’d like to re-purpose metadata from ArchivesSpace to facilitate the digitization process and also keep that metadata in sync as items are digitized, described more fully, and ingested into our Hydra repository. Fortunately, we’re not the only library trying to tackle this problem.  For more on AS/Hydra integration, see the work of the Hydra Archivists Interest Group.

Below are a couple of rough sketches we drafted to start thinking about this problem at Duke.

Hydra / ArchivesSpace Integration Sketch, take 1
Hydra / ArchivesSpace Integration Sketch, take 2

 

In addition to these systems integration diagrams, I’ve been working on some basic tools (scripts) that address two small pieces of this larger problem:

  • A script to auto-generate digitization guides by extracting metadata from ArchivesSpace-generated EAD files (digitization guides are simply spreadsheets we use to keep track of what we digitize and to assign identifiers to digital objects and files during the digitization process). A rough sketch of that extraction appears after this list.
  • A script that uses a completed digitization guide to batch-create digital object records in ArchivesSpace and at the same time link those digital objects to the descriptions of the physical items (the archival object records in ArchivesSpace-speak).  Special thanks to Dallas Pillen at the University of Michigan for doing most of the heavy lifting on this script.
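
To make the first of those scripts a little more concrete, here is a minimal sketch of the idea in Python, not our actual script: read an ArchivesSpace-generated EAD file, walk its file-level components, and write a bare-bones digitization guide. The EAD 2002 namespace is standard, but the element paths and the identifier pattern are assumptions you would adjust for real data.

```python
"""Minimal sketch, not the production script: extract component-level
metadata from an ArchivesSpace-generated EAD file and write a bare-bones
digitization guide (a tab-separated spreadsheet).

Assumptions: EAD 2002 with its standard namespace, unnumbered <c> components
at level="file" (swap in c01/c02/... if your exports are numbered), and a
made-up identifier pattern of eadid plus a zero-padded sequence number.
"""
import csv
from lxml import etree

NS = {"ead": "urn:isbn:1-931666-22-9"}  # EAD 2002 namespace

def build_guide(ead_path, guide_path):
    tree = etree.parse(ead_path)
    eadid = tree.findtext(".//ead:eadid", default="", namespaces=NS).strip()
    with open(guide_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["identifier", "collection", "container", "title", "date"])
        components = tree.iterfind(".//ead:c[@level='file']", namespaces=NS)
        for i, c in enumerate(components, start=1):
            writer.writerow([
                f"{eadid}_{i:04d}",  # hypothetical identifier scheme
                eadid,
                c.findtext(".//ead:container", default="", namespaces=NS).strip(),
                c.findtext(".//ead:unittitle", default="", namespaces=NS).strip(),
                c.findtext(".//ead:unitdate", default="", namespaces=NS).strip(),
            ])

if __name__ == "__main__":
    build_guide("collection_ead.xml", "digitization_guide.tsv")
```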

Problem 2 (the smaller one): Using ArchivesSpace to produce MARC records for archival collections (or, stopping all that cutting and pasting).

In the past, we've had two completely separate workflows in special collections for creating archival description in EAD and creating collection-level MARC records for those same collections. Archivists churned out detailed EAD finding aids, and catalogers then cut and pasted relevant sections of those finding aids into collection-level MARC records. It's quite silly, really, and we need a better solution that saves time and keeps metadata consistent across platforms.

While we haven’t done much work in this area yet, we have formed a small working group of archivists/catalogers and developed the following work plan:

  1. Examine default ArchivesSpace MARC exports and compare those exports to current MARC cataloging practices (document differences).
  2. Examine the differences between ArchivesSpace MARC and "native" MARC and decide which current practices are worth maintaining, keeping in mind that we'll need to modify the default ArchivesSpace MARC exports to meet current MARC authoring practices.
  3. Develop cross-walking scripts or modify the ArchivesSpace MARC exporter to generate usable MARC data from ArchivesSpace.
  4. Develop and document an efficient workflow for pushing or harvesting MARC data from ArchivesSpace to both OCLC and our local ILS (a rough sketch of the harvesting piece follows this list).
  5. If possible, develop, test, and document tools and workflows for re-purposing container (instance) information in ArchivesSpace in order to batch-create item records in the ILS for archival containers (boxes, folders, etc).
  6. Develop training for staff on new ArchivesSpace to MARC workflows.
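
One way to start on step 4, sketched in Python below, is to pull collection-level MARCXML straight from the ArchivesSpace backend API. The endpoint paths and login flow are how we understand the ArchivesSpace REST API to work; the host, credentials, and ids are placeholders, so verify the routes against your own ArchivesSpace version before relying on this.

```python
"""Rough sketch for step 4: harvest collection-level MARCXML for one resource
from the ArchivesSpace backend API. Host, credentials, and ids are placeholders;
verify the endpoint paths against your ArchivesSpace version.
"""
import requests

BACKEND = "http://localhost:8089"   # ArchivesSpace backend URL (assumption)
USER, PASSWORD = "admin", "admin"   # placeholder credentials
REPO_ID, RESOURCE_ID = 2, 1234      # hypothetical repository and resource ids

def get_session():
    # Log in and grab a session token for subsequent requests.
    r = requests.post(f"{BACKEND}/users/{USER}/login", params={"password": PASSWORD})
    r.raise_for_status()
    return r.json()["session"]

def export_marcxml(session, repo_id, resource_id):
    # The MARC21 export endpoint returns MARCXML for a single resource record.
    r = requests.get(
        f"{BACKEND}/repositories/{repo_id}/resources/marc21/{resource_id}.xml",
        headers={"X-ArchivesSpace-Session": session},
    )
    r.raise_for_status()
    return r.text

if __name__ == "__main__":
    marcxml = export_marcxml(get_session(), REPO_ID, RESOURCE_ID)
    with open(f"resource_{RESOURCE_ID}_marc.xml", "w", encoding="utf-8") as f:
        f.write(marcxml)
```

The exported record would still need the cleanup described in steps 2 and 3 before being pushed to OCLC or the ILS.
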
Comic courtesy of xkcd.com

Conclusion

So far we’ve only taken baby steps towards our dream of TOTAL METADATA SYNCHRONIZATION, but we’re making progress.  Please let us know if you’re working on similar projects at your institution. We’d love to hear from you.

Recognizing the Garden While Managing the Weeds

Life in Duke University Libraries has been even more energetic than usual these past months. Our neighbors in Rubenstein just opened their newly renovated library, and the semester is off with a bang. As you can read over on Devil's Tale, a lot of effort went on behind the scenes to get that sparkly new building ready for the public. Following that theme, today I am sharing some thoughts on how producing digital collections both blesses and curses my perspective on our finished products.

When I write a Bitstreams post, I look for ideas in my calendar and to-do list to find news and projects to share. This week I considered writing about "Ben", those prints/negs/spreadsheets, and some resurrected proposals I've been fostering (don't worry, these labels shouldn't make sense to you). I also turned to my list of favorite items in our digital collections; these are items I find particularly evocative and inspiring. While reviewing my favorites with my possible topics in mind (Ben, prints/negs/spreadsheets, etc.), I was struck by how differently patrons and researchers must relate to Duke Digital Collections compared to how I do. Where they see a polished finished product, I see the result of a series of complicated tasks I both adore and would sometimes prefer to disregard.

Let me back up and say that my first experience with a Duke digital collections project isn't always about its content or proper name. Someone comes to me with an idea, and of course I want to know about the significance of the content, but from there I need to know: What format? How many items? Is the collection processed? What kind of descriptive data is available? Do you have a student to loan me? My mind starts spinning with logistics logistics logistics. These details take on a life of their own, separate from the significant content at hand. As a project takes off, I come to know a collection by its details, the web of relationships I build to complete the project, and the occasional nickname. Let's look at a few examples.

There are so many Gedney favorites to choose from; here is just one of mine.

William Gedney Photographs and Writings

Parts of this collection are published, but we are expanding and improving the online collection dramatically.

What the public sees:  poignant and powerful images of everyday life in an array of settings (Brooklyn, India, San Francisco, Rural Kentucky, and others).

What I see: 50,000 items in lots of formats; this project could take over DPC photographic digitization resources, all publication resources, all my meetings, all my emails, and all my thoughts (I may be overdramatizing here just a smidge). When it all comes together, it will be amazing.

Benjamin Rush Papers
We have just begun working with this collection, but the Devil’s Tale blog recently shared a sneak preview.

What people will see:  letters to and from fellow founding fathers including Thomas Jefferson (Benjamin Rush signed the Declaration of Independence), as well as important historical medical accounts of a Yellow Fever outbreak in 1793.

What I see: Ben, or when I'm really feeling it, Benny. We are going to use Ben to test out an amazing new workflow between ArchivesSpace and DPC digitization guides.

 

Mangum’s negatives show a diverse range of subjects. I highly recommend his exterior images as well.

Hugh Mangum Photographs

This collection of photographs was published in 2008. Since then we have added more images to it, and enhanced portions of the collection’s metadata. 

What others see:  a striking portfolio of a Southern itinerant photographer’s portraits featuring a diverse range of people.  Mangum also had a studio in Durham at the beginning of his career.

What I see: HMP. HMP is the collection identifier included in every URL, which I always have to remind myself of when I'm checking stats or typing in a URL (my first instinct is that it should be Mangum). HMP is sneaky, because every now and then the popularity of this collection spikes. I really want more people to get to know HMP.

They may not be orphans but they are “cave children”.

The Orphans

The orphans are not literal children, but they come in all sizes and shapes, and they span multiple collections.

What the public sees:  the public doesn’t see these projects.

What I see: orphans, plain and simple. The orphans are projects that started but then, for whatever reason, didn't finish. They have complicated rights, metadata, formats, or other problems that prevent them from making it through our production pipeline. These issues tend to be well beyond my control, and yet I periodically pull out my list of orphans to see if their time has come. I feel an extra-special thrill of victory when we are able to complete an orphan project; the Greek Manuscripts are a good example. I have my sights set on a few others currently, but I do not want to divulge details here for fear of jinxing the situation.

Don’t we all want to be in a digital collections land where the poppies bloom?

I could go on and on about how the logistics of each project shapes and re-shapes my perspective of it.  My point is that it is easy to temporarily lose sight of the digital collections garden given how entrenched (and even lost at times) we are in the weeds.  For my part, when I feel like the logistics of my projects are overwhelming, I go back to my favorites folder and remind myself of the beauty and impact of the digital artifacts we share with the world.  I hope the public enjoys them as much as I do.

 

Request for Proposals – The SNCC Digital Gateway

Promotional postcard for One Person, One Vote site.

Last year we at Duke University Libraries circulated a prospectus for our still-young partnership with the SNCC Legacy Project, seeking bids from web contractors to help develop the web site that we rolled out last March as One Person, One Vote (OPOV). Now, almost 18 months later and a bit wiser, we're back, hoping to do it again, but bigger.

Thanks to a grant from the Mellon Foundation, we’ll be moving to a new phase of our partnership with the SNCC Legacy Project and the Center for Documentary Studies. The SNCC Digital Gateway will build on the success of the OPOV pilot, bringing Visiting Activist Scholars to campus to work with Duke undergraduates and graduates on documenting the historic drive for voting rights, and the work of the Student Nonviolent Coordinating Committee.

As before, we seek an experienced and talented contractor to join with our project team to design and build a compelling site. If you think your outfit might be right for the job, please review the RFP embedded below and get in touch.

 

 

FY15: A Year in Digital Projects

We experience a number of different cycles in the Digital Projects and Production Services Department (DPPS). There is of course the project lifecycle, that mysterious abstraction by which we try to find commonalities in work processes that can seem unique for every case. We follow the academic calendar, learn our fate through the annual budget cycle, and attend weekly, monthly, and quarterly meetings.

The annual reporting cycle at Duke University Libraries usually falls to departments in August, with those reports informing a master library report completed later. Because of the activities and commitments around the opening of the Rubenstein Library, the departments were let off the hook for their individual reports this year. Nevertheless, I thought I would use my turn in the Bitstreams rotation to review some highlights from our 2014-15 cycle.

Loads of accomplishments after the jump …


How Duke Chronicle Goes Digital

Today we will take a detailed look at how the Duke Chronicle, the university's beloved newspaper for over 100 years, is digitized. Since our scope of digitization spans nine decades (1905-1989), it is an ongoing project that the Digital Production Center (DPC), part of Digital Projects and Production Services (DPPS) and Duke University Libraries' Digital Collections Program, has been chipping away at. Scanning and digitizing may seem straightforward (place an item on a scanner and press scan, for goodness' sake!), but much more goes into it than that, and we at the DPC want to shed light on our processes to give you a sense of what we do behind the scenes. Digitizing a large collection of newspapers is not always a fun-filled endeavor, and the physical act of scanning thousands of news pages is done by many dedicated (and patient!) student workers, staff members, and me, the King Intern for Digital Collections.

Pre-Scanning Procedures

Large format 1940s Chronicles in over-sized archival box

Many steps in the digitization process do not actually occur in the DPC, but among other teams or departments within the library. Though I focus mainly on the DPC’s responsibilities, I will briefly explain the steps others perform in this digital projects tango…or maybe it’s a waltz?

Each proposed project must first be approved by the Advisory Council for Digital Collections (ACDC), a team that reviews each project for its strategic value. Then it is passed on to the Digital Collections Implementation Team (DCIT) to perform a feasibility study that examines the project’s strengths and weaknesses (see Thomas Crichlow’s post for an overview of these teams). The DCIT then helps guide the project to fruition. After clearing these hoops back in 2013, the Duke Chronicle project started its journey toward digital glory.

We pull 10 years' worth of newspapers at a time from the University Archives in Rubenstein Library. Processing only one decade at a time makes the 80+ years of Chronicle publications more manageable. The first stop is Conservation. To make sure the materials are stable enough to withstand digitizing, Conservation must inspect the condition of the paper prior to giving the DPC the go-ahead. Because newspapers since the mid-19th century were printed on cheap and very acidic wood pulp paper, the pages can become brittle over time and may warrant extensive repairs. Senior Conservator Erin Hammeke has done great work mending tears and brittle edges of many Chronicle pages since the start of this project. As we embark on digitizing the older decades, from the 1940s and earlier, Erin's expertise will be indispensable. We rely on her not only to repair brittle pages but to guide the DPC's strategy when deciding the best and safest way to digitize such fragile materials. Also, several volumes of the Chronicle have been bound, and to get the best scan these must be removed from their bindings. Erin to the rescue!

Conservation repair on a 1940s Chronicle page
Conservation repair to a torn 1940s Chronicle ad

 

1950s Duke Chronicle digitization guide

Now that Conservation has assessed the condition and given the DPC the green light, preliminary prep work must still be done before the scanner comes into play. A digitization guide is created in Microsoft Excel to list each Chronicle issue along with its descriptive metadata (more information about this process can be found in my metadata blog post). This spreadsheet acts as a guide in the digitization process (hence its name, digitization guide!) to keep track of each analog newspaper issue and, once scanned, its corresponding digital image. In this process, each Chronicle issue is inspected to collect the necessary metadata. At this time, a unique identifier is assigned to every issue based on the DPC’s naming conventions. This identifier stays with each item for the duration of its digital life and allows for easy identification of one among thousands of Chronicle issues. At the completion of the digitization guide, the Chronicle is now ready for the scanner.
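
As a rough illustration of what goes into a digitization guide row, here is a small Python sketch. The column set is trimmed down and the identifier pattern is invented for the example; the DPC's actual naming conventions and spreadsheet fields differ.

```python
"""Illustrative sketch only: build digitization guide rows for Chronicle issues.
The identifier pattern and column names are invented for the example and are
not the DPC's actual naming convention or field list.
"""
import csv
from datetime import date

def make_identifier(issue_date: date) -> str:
    # Hypothetical convention: a collection prefix plus the issue date.
    return f"dukechronicle_{issue_date:%Y%m%d}"

def write_guide(issues, guide_path="chronicle_digitization_guide.tsv"):
    with open(guide_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["identifier", "issue_date", "volume", "issue", "page_count"])
        for issue in issues:
            writer.writerow([
                make_identifier(issue["date"]),
                issue["date"].isoformat(),
                issue.get("volume", ""),
                issue.get("issue", ""),
                issue.get("page_count", ""),  # typically filled in after scanning
            ])

if __name__ == "__main__":
    # Dummy values purely for illustration.
    write_guide([{"date": date(1959, 9, 18), "volume": "55", "issue": "1"}])
```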

 

The DPC’s Zeutschel OS 14000 A2

The Scanning Process

For loose, unbound issues, the Zeutschel is our go-to scanner because it allows large format items to be imaged on a flat surface. This is less invasive and less damaging to the pages, and it is quicker than other scanning methods. The Zeutschel can handle items up to 25 x 18 inches, which accommodates the larger formats of the Chronicle used in the 1940s and 1950s. If bound issues must be digitized, due to the absence of a loose copy or the inability to safely disbind a volume, the Phase One digital camera system is used instead, as it can better capture large bound pages that may not lie flat.

Folders each containing multiple page images of one Chronicle issue

For every scanning session, we need the digitization guide handy, as it tells us what to name the image files using the previously assigned unique identifiers. Each issue of the newspaper is scanned as a separate folder of images, with one image representing one page of the newspaper. This system of organization allows each issue to become its own compound object (multiple files bound together with an XML structure) once published to the website. The Zeutschel's scanning software helps organize these image files into properly named folders. Of course, no digitization session would be complete without the initial target scan that checks for color calibration (see Mike Adamo's post for a color calibration crash course).

The Zeutschel's control panel of buttons
The Zeutschel's optional foot pedals

The scanner's plate glass can now be raised with the push of a button (or the tap of a foot pedal) and the Chronicle issue is placed on the flatbed. Lowering the plate glass flattens the pages for a better scan result. Now comes the excitement… we can finally press SCAN. For each page, the plate glass is raised, lowered, and the scan button is pressed. Chronicle issues can have anywhere from 2 to 30 or more pages, so you can imagine how this process can become monotonous, or even mesmerizing, at times. Luckily, with the smaller format decades, like the 1970s and 1980s, the inner pages can be scanned two at a time and the Zeutschel software separates them into two images, which cuts down on the scan time. As for the larger formats, the pages are so big that only one fits on the flatbed. That means each page is a separate scan, but older years tended to publish fewer issues, so it's a trade-off. To put the volume of this work into perspective, the 1,408 issues of the 1980s Chronicle took 28,089 scans to complete, while the roughly 482 issues of the 1950s Chronicle took around 3,700 scans.

 

A 1940s Chronicle page is placed on the flatbed for scanning

 

Scanning in progress of the 1940s Chronicle page
Target image opened in Adobe Photoshop for color calibration

Every scanned image that pops up on the screen is also checked for alignment and cropping errors that may require a re-scan. Once all the pages in an issue are digitized and checked for errors, clicking the software's Finalize button will compile the images in the designated folder. We now return to our digitization guide to enter metadata pertaining to the scanning of that issue, including capture person, capture date, capture device, and which target image relates to this session (subsequent issues do not need a new target scan, as long as the scanning takes place in the same session).

Now, with the next issue, rinse and repeat: set the software settings and name the folder, scan the issue, finalize, and fill out the digitization guide. You get the gist.

 

Post-Scanning Procedures

Rotating an image in Adobe Photoshop

We now find ourselves with a slew of folders filled with digitized Chronicle images. The next phase of the process is quality control (QC). Once every issue from the decade is scanned, the first round of QC checks all images for excess borders to be cropped, crooked images to be squared, and any other minute discrepancies that may have resulted from the scanning process: missing images, pages out of order, or even images scanned upside down. This stage of QC is often performed by student workers who diligently inspect image after image using Adobe Photoshop. The second round of QC is performed by our Digital Production Specialist, Zeke Graves, who gives every item a final pass.

At this stage, derivatives of the original preservation-quality images are created. The originals are archived in dark storage, while the smaller-sized derivatives are used in the CONTENTdm ingest process. CONTENTdm is the digital collection management software we use that collates the digital images with their appropriate descriptive metadata from our digitization guide, and creates one compound object for each Chronicle issue. It also generates the layer of Optical Character Recognition (OCR) data that makes the Chronicle text searchable, and provides an online interface for users to discover the collection once published on the website. The images and metadata are ingested into CONTENTdm’s Project Client in small batches (1 to 3 years of Chronicle issues) to reduce the chance of upload errors. Once ingested into CONTENTdm, the items are then spot-checked to make sure the metadata paired up with the correct image. During this step, other metadata is added that is specific to CONTENTdm fields, including the ingest person’s initials. Then, another ingest must run to push the files and data from the Project Client to the CONTENTdm server. A third step after this ingest finishes is to approve the items in the CONTENTdm administrative interface. This gives the go-ahead to publish the material online.

Hold on, we aren’t done yet. The project is now passed along to our developers in DPPS who must add this material to our digital collections platform for online discovery and access (they are currently developing Tripod3 to replace the previous Tripod2 platform, which is more eloquently described in Will Sexton’s post back in April). Not only does this improve discoverability, but it makes all of the library’s digital collections look more uniform in their online presentation.

Then, FINALLY, the collection goes live on the web. Now, just repeat the process for every decade of the Duke Chronicle, and you can see how this can become a rather time-heavy and laborious process. A labor of love, that is.

I could have narrowly stuck with describing to you the scanning process and the wonders of the Zeutschel, but I felt that I’d be shortchanging you. Active scanning is only a part of the whole digitization process which warrants a much broader narrative than just “push scan.” Along this journey to digitize the Duke Chronicle, we’ve collectively learned many things. The quirks and trials of each decade inform our process for the next, giving us the chance to improve along the way (to learn how we reflect upon each digital project after completion, go to Molly Bragg’s blog post on post-mortem reports).

If your curiosity is piqued as to how the Duke Chronicle looks online, the Fall 1959-Spring 1970 and January 1980-February 1989 issues are already available to view in our digital collections. The 1970s Chronicle is the next decade slated for publication, followed by the 1950s. Though this isn’t a comprehensive detailed account of the digitization process, I hope it provides you with a clearer picture of how we bring a collection, like the Duke Chronicle, into digital existence.

Who, Why, and What:  the three W’s of the Duke Digital Collections Mini-Survey

My colleague Sean wrote two weeks ago about the efforts a group of us  in the library are making towards understanding the scholarly impacts of Duke Digital Collections.  In this post, I plan to continue the discussion with details about the survey we are conducting as well as share some initial results.

Surveying can be perilous work!

After reviewing the analytics and Google Scholar data Sean wrote about, our working group realized we needed more information. Our goal in this entire assessment process has been to pull together scholarly use data that will inform our digitization decisions, priorities, and technological choices (features on the digital collections platform), and to help us understand whether and how we are meeting the needs of researcher communities. Analytics gave us clues, but we still didn't know some of the fundamental facts about our patrons. After a fervent discussion with many whiteboard notes, the group decided that creating a survey would get us more of the data we were looking for. The resulting survey focuses on the elemental questions we have about our patrons: who are they, why are they visiting Duke Digital Collections, and what are they going to do with what they find here?

 

The Survey

Creating the survey itself was no small task, but after an almost endless process of writing, rewriting, and consultations with our assessment coordinator, we settled on 6 questions (a truly miniature survey). We considered the first three questions (who, why, what) to be the most important, and we intended the last three to provide us with additional information, such as Duke affiliation, and to allow space for general feedback. None of the questions were "required," so respondents could answer or skip whatever they wanted; we also included space for respondents to write in further details, especially when choosing the "other" option.

Our survey in its completed form.

The survey launched on April 30 and remains accessible by hovering over a "feedback" link on every Digital Collections webpage. Event tracking analytics show that 0.29% of the patrons who hover over our feedback link click through to the survey, and an even smaller number actually submit responses. This has worked out to 56 responses, at an average rate of around 1 per day. Despite that low click-through rate, we have been really pleased with the number of responses we have received so far. The response rate remains steady, and we have already learned a lot from even this small sample of visitor data. We are not advertising or promoting the survey, because our target respondents are patrons who find us in the course of their research or general Internet browsing.

Hovering over the help us box reveals expectations and instructions for survey participants.

Initial Results

Before I start discussing our results, please note that what I’m sharing here is based on initial responses and my own observations.  No one in digital collections has thoroughly reviewed or analyzed this data.  Additionally, this information is drawn from responses submitted between April 30 – July 8, 2015. We plan to keep the survey online into the academic year to see if our responses change when classes are in session.

With that disclaimer now behind us, let’s review results by question.

Questions 1 and 4:  Who are you?

Since we are concerned with scholarly-oriented use more than other types in this exercise, the first question is intended to sort respondents primarily by academic status. In question 4, respondents are given the chance to further categorize their academic affiliation.

Question 1 Answers                      # of Responses    %
Student                                 14                25%
Educator                                10                18%
Librarian, Archivist or Museum Staff    5                 9%
Other                                   26                47%
Total                                   55                100%

Of the respondents who categorized themselves as "other" in question 1, 11 clarified their otherness by writing in their identities in the space provided. Of these 11, 4 associated themselves with music-oriented professions or hobbies, and 2 with the fine arts (a photographer and a filmmaker). The remaining 5 could not be grouped easily into categories.

As a follow-up later in the survey, question 4 asks respondents to categorize their academic affiliation (if they have one). The results showed that 3 respondents are affiliated with Duke, 12 with other colleges or universities, and 9 with a K-12 school. Of the write-in responses, 3 listed names of universities abroad, and 1 listed a school whose level has not been identified.

Question 2:  Why are you here?

We can tell from our analytics how people get to us (if they were referred to us via a link or sought us out directly), but this information does not address why visitors come to the site.  Enter question 2.

Question 2 Answers      # of Responses    %
Academic research       15                28%
Casual browsing         15                28%
Followed a link         9                 17%
Personal research       24                44%
Other                   6                 11%
Total respondents       54

The survey asks those who select academic research, personal research, or other to write in their research topic or purpose. Academic research topics submitted so far primarily revolve around various historical subjects. Personal research topics reflect a high interest in music (specific songs or types of music), advertising, and various other personal projects. It is interesting to note that local history topics have been submitted under all three categories (academic, personal, and other). Additionally, non-academic researchers seem to be more willing to share their specific topics; 19 of 24 respondents listed their topics, compared to 7 out of 15 academic researchers.

Question 3:  What will you do with the images and/or resources you find on this site?

To me, this question has the potential to provide some of the most illuminating information from our patrons. Knowing how they use the material helps us determine how to enhance access to the digitized objects and what kinds of technology we should be investing in.  This can also shed light on our digitization process itself.  For example, maybe the full text version of an item will provide more benefit to more researchers than an illustrated or hand-written version of the same item (of course we would prefer to offer both, but I think you see where I am going with this).

In designing this question, the group decided it would be valuable to offer options for those who share items due to their visual or subject appeal (for example, the Pinterest user), for the publication-minded researcher, and for a range of patron types in between.

 

Question 3 Answers                           # of Responses    %
Use for an academic publication              3                 6%
Share on social media                        10                19%
Use them for homework                        8                 15%
Use them as a teaching tool in my classes    5                 9%
Personal use                                 31                58%
Use for my job                               2                 4%
Other                                        10                19%
Total respondents                            53

The 10 "other" respondents all entered further details: they planned to share items with friends and family (in some way other than on social media), wanted to use the items they found as a reference, or were working on an academic pursuit that, in their minds, didn't fit the listed categories.

Observations

As I said above, these survey results are preliminary, as we plan to leave the survey up for several more months. But so far the data reveals that Duke Digital Collections serves a wide audience of academic and non-academic users for a range of purposes. For example, one respondent uses the outdoor advertising collections to get a glimpse of how their community has changed over time. Another is concerned with U.S. history in the 1930s, and another is focused on music from the 1900s.

The next phase of the assessment group's activities is to meet with researchers and instructors in person and talk with them about their experiences using digital collections (not just Duke's) for scholarly research or instruction. We have also been collecting examples of instructors who have used digital collections in their classes, and we plan to create a webpage of these examples to encourage other instructors to do the same. The goal of both efforts is to increase academic use of digital collections, whether at the K-12 or collegiate level.

 

Just like this survey team, we stand at the ready, waiting for our chance to analyze and react to our data!

Of course, the other next step is to keep collecting survey responses and analyzing them further. All in all, it has been truly exciting to see the results thus far. As we study the data in more depth this fall, we plan to work with the Duke University Library Digital Collections Advisory Team to implement any new technical or policy-oriented decisions based on our conclusions. Our minds are already spinning with the possibilities.

The Value of Metadata in Digital Collections Projects

Before you let your eyes glaze over at the thought of metadata, let me familiarize you with the term and its invaluable role in the creation of the library’s online Digital Collections.  Yes, metadata is a rather jargony word librarians and archivists find themselves using frequently in the digital age, but it’s not as complex as you may think.  In the most simplistic terms, the Society of American Archivists defines metadata as “data about data.”  Okay, what does that mean?  According to the good ol’ trusty Oxford English Dictionary, it is “data that describes and gives information about other data.”  In other words, if you have a digitized photographic image (data), you will also have words to describe the image (metadata).

Better yet, think of it this way. If that image were of a large family gathering and grandma lovingly wrote the date and the names of all the people on the back, that is basic metadata. Without that information, those people and the image would suddenly have less meaning, especially if you have no clue who the faces in that family photo are. It is the same with digital projects. Without descriptive metadata, the items we digitize would hold less meaning and prove less valuable for researchers, or at least be less searchable. The better and more thorough the metadata, the more it promotes discovery in search engines. (Check out the metadata from this Cornett family photo from the William Gedney collection.)
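
To make "data about data" a bit more concrete, here is a toy record in Python: the image file is the data, and the fields describing it are the metadata. The field names and values are invented for the example and are not the schema our digital collections actually use.

```python
# Toy example: the image file is the data; the dictionary describing it is the
# metadata. Field names and values are invented, not an actual Duke schema.
photo_metadata = {
    "file": "family_gathering_001.tif",   # hypothetical filename
    "title": "Family gathering on the front porch",
    "date": "1964",
    "people_pictured": ["Grandma", "Uncle Joe", "Cousin Ruth"],
    "description": "Names and date written on the back of the print.",
}
```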

The term metadata was first used in the late 1960s in the field of computer programming. With the advent of computing technology and the overabundance of digital data, metadata became a key element in describing and retrieving information in an automated way. The use of the word metadata in literature over the last 45 years shows a steeper increase from 1995 to 2005, which makes sense: the term was used more and more as the technology grew more widespread. This is reflected in the graph below from Google's Ngram Viewer, which scours over 5 million Google Books to track the usage of words and phrases over time.

Google Ngram Viewer for "metadata"

Because of its link with computer technology, metadata is widely used in a variety of fields that range from computer science to the music industry.  Even your music playlist is full of descriptive metadata that relates to each song, like the artist, album, song title, and length of audio recording.  So, libraries and archives are not alone in their reliance on metadata.  Generating metadata is an invaluable step in the process of preserving and documenting the library’s unique collections.  It is especially important here at the Digital Production Center (DPC) where the digitization of these collections happens.  To better understand exactly how important a role metadata plays in our job, let’s walk through the metadata life cycle of one of our digital projects, the Duke Chapel Recordings.

The Chapel Recordings project consists of digitizing over 1,000 cassette and VHS tapes of sermons and over 1,300 written sermons that were given at the Duke Chapel from the 1950s to 2000s.  These recordings and sermons will be added to the existing Duke Chapel Recordings collection online.  Funded by a grant from the Lilly Foundation, this digital collection will be a great asset to Duke’s Divinity School and those interested in hermeneutics worldwide.

Before the scanners and audio capture devices are even warmed up at the DPC, preliminary metadata is collected from the analog archival material.  Depending on the project, this metadata is created either by an outside collaborator or in-house at the DPC.  For example, the Duke Chronicle metadata is created in-house by pulling data from each issue, like the date, volume, and issue number.  I am currently working on compiling the pre-digitization metadata for the 1950s Chronicle, and the spreadsheet looks like this:

1950s Duke Chronicle preliminary metadata

As for the Chapel Recordings project, the DPC received an inventory from the University Archives in the form of an Excel spreadsheet.  This inventory contained the preliminary metadata already generated for the collection, which is also used in Rubenstein Library‘s online collection guide.

Chapel Recordings inventory metadata

The University Archives also supplied the DPC with an inventory of the sermon transcripts containing basic metadata compiled by a student.

Duke Chapel Records sermon metadata

Here at the DPC, we convert this preliminary metadata into a digitization guide, which is a fancy term for yet another Excel spreadsheet. Each digital project receives its own digitization guide (we like to call them digguides), which keeps all the valuable information for each item in one place. It acts as a central location for data entry, but also as a reference guide for the digitization process. Depending on the format of the material being digitized (image, audio, video, etc.), the digitization guide will need different categories. We add these new categories as columns in the original inventory spreadsheet, and it becomes a working document where we plug in our own metadata generated during the digitization process. For the Chapel Recordings audio and video, the metadata created looks like this:

Chapel Recordings digitization metadata
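
A minimal sketch of that conversion step, assuming the inventory has been exported to CSV: carry the preliminary metadata through and append empty columns for the data captured during digitization. The added column names echo the capture fields mentioned elsewhere on this blog (capture person, date, device, target), but they are not the project's exact field list.

```python
"""Minimal sketch: turn a preliminary inventory (saved as CSV) into a
digitization guide by appending columns to be filled in during digitization.
The added column names are illustrative, not the project's exact field list.
"""
import csv

DIGITIZATION_COLUMNS = ["capture_person", "capture_date", "capture_device", "target_id"]

def inventory_to_digguide(inventory_csv, digguide_csv):
    with open(inventory_csv, newline="") as src, open(digguide_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + DIGITIZATION_COLUMNS)
        writer.writeheader()
        for row in reader:
            # Preliminary metadata passes through untouched; capture fields start empty.
            row.update({col: "" for col in DIGITIZATION_COLUMNS})
            writer.writerow(row)

if __name__ == "__main__":
    inventory_to_digguide("chapel_recordings_inventory.csv", "chapel_recordings_digguide.csv")
```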

Once we have digitized the items, we then run the recordings through several rounds of quality control.  This generates even more metadata which is, again, added to the digitization guide.  As the Chapel Recordings have not gone through quality control yet, here is a look at the quality control data for the 1980s Duke Chronicle:

1980s Duke Chronicle quality control metadata

Once digitization and quality control are completed, the DPC sends the digitization guide, filled with metadata, to the metadata archivist, Noah Huffman. Noah then makes further additions, edits, and deletions to match the spreadsheet metadata fields to the fields accepted by the management software, CONTENTdm. During the process of ingesting all the content into the software, CONTENTdm links the digitized items to their corresponding metadata from the Excel spreadsheet, in preparation for placing the material online. For even more metadata adventures, see Noah's most recent Bitstreams post.

In the final stage of the process, the compiled metadata and digitized items are published online at our Digital Collections website.  You, the researcher, history fanatic, or Sunday browser, see the results of all this work on the page of each digital item online.  This metadata is what makes your search results productive, and if we’ve done our job right, the digitized items will be easily discovered.  The Chapel Recordings metadata looks like this once published online:

Chapel Recordings metadata as viewed online

Further down the road, the Duke Divinity School wishes to enhance the current metadata to provide keyword searches within the Chapel Recordings audio and video.  This will allow researchers to jump to specific sections of the recordings and find the exact content they are looking for.  The additional metadata will greatly improve the user experience by making it easier to search within the content of the recordings, and will add value to the digital collection.

On this journey through the metadata life cycle, I hope you have been convinced that metadata is a key element in the digitization process.  From preliminary inventories, to digitization and quality control, to uploading the material online, metadata has a big job to do.  At each step, it forms the link between a digitized item and how we know what that item is.  The life cycle of metadata in our digital projects at the DPC is sometimes long and tiring.  But, each stage of the process  creates and utilizes the metadata in varied and important ways.  Ultimately, all this arduous work pays off when a researcher in our digital collections hits gold.

Adventures in metadata hygiene: using Open Refine, XSLT, and Excel to dedup and reconcile name and subject headings in EAD

OpenRefine, formerly Google Refine, bills itself as "a free, open source, powerful tool for working with messy data."  As someone who works with messy data almost every day, I can't recommend it enough.  While OpenRefine is a great tool for cleaning up "grid-shaped data" (spreadsheets), it's a bit more challenging to use when your source data is in some other format, particularly XML.

Some corporate name terms from an EAD (XML) collection guide

As part of a recent project to migrate data from EAD (Encoded Archival Description) to ArchivesSpace, I needed to clean up about 27,000 name and subject headings spread across over 2,000 EAD records in XML.  Because the majority of these EAD XML files were encoded by hand using a basic text editor (don’t ask why), I knew there were likely to be variants of the same subject and name terms throughout the corpus–terms with extra white space, different punctuation and capitalization, etc.  I needed a quick way to analyze all these terms, dedup them, normalize them, and update the XML before importing it into ArchivesSpace.  I knew Open Refine was the tool for the job, but the process of getting the terms 1) out of the EAD, 2) into OpenRefine for munging, and 3) back into EAD wasn’t something I’d tackled before.

Below is a basic outline of the workflow I devised, combining XSLT, OpenRefine, and, yes, Excel.  I’ve provided links to some source files when available.  As with any major data cleanup project, I’m sure there are 100 better ways to do this, but hopefully somebody will find something useful here.

1. Use XSLT to extract names and subjects from EAD files into a spreadsheet

I’ve said it before, but sooner or later all metadata is a spreadsheet. Here is some XSLT that will extract all the subjects, names, places and genre terms from the <controlaccess> section in a directory full of EAD files and then dump those terms along with some other information into a tab-separated spreadsheet with four columns: original_term, cleaned_term (empty), term_type, and eadid_term_source.

controlaccess_extractor.xsl
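
If you would rather see the idea without the XSLT, here is a rough Python equivalent (not the linked stylesheet): it walks a directory of EAD files, pulls every child of <controlaccess>, and writes the same four-column tab-separated file. It assumes EAD 2002 with the standard namespace.

```python
"""Rough Python equivalent of the extraction step (not the linked XSLT):
dump every <controlaccess> term from a directory of EAD files into a
four-column TSV. Assumes EAD 2002 with its standard namespace.
"""
import csv
import glob
from lxml import etree

NS = {"ead": "urn:isbn:1-931666-22-9"}

def extract_terms(ead_dir, out_tsv="controlaccess_terms.tsv"):
    with open(out_tsv, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["original_term", "cleaned_term", "term_type", "eadid_term_source"])
        for path in glob.glob(f"{ead_dir}/*.xml"):
            tree = etree.parse(path)
            eadid = tree.findtext(".//ead:eadid", default="", namespaces=NS).strip()
            for term in tree.iterfind(".//ead:controlaccess/*", namespaces=NS):
                if term.text and term.text.strip():
                    term_type = etree.QName(term).localname  # persname, corpname, subject, ...
                    writer.writerow([term.text.strip(), "", term_type, eadid])

if __name__ == "__main__":
    extract_terms("ead_files")
```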

 2. Import the spreadsheet into OpenRefine and clean the messy data!

Once you open the resulting tab-delimited file in OpenRefine, you'll see the four columns of data above, with the "cleaned_term" column empty. Copy the values from the first column (original_term) to the second column (cleaned_term). You'll want to preserve the original terms in the first column and only edit the terms in the second column so that you have a way to match the old values in your EAD with any edited values later on.

OpenRefine offers several amazing tools for viewing and cleaning data.  For my project, I mostly used the "cluster and edit" feature, which applies several different matching algorithms to identify, cluster, and facilitate cleanup of term variants. You can read more about clustering in OpenRefine here: Clustering in Depth.

In my list of about 27,000 terms, I identified around 1200 term variants in about 2 hours using the “cluster and edit” feature, reducing the total number of unique values from about 18,000 to 16,800 (about 7%). Finding and replacing all 1200 of these variants manually in EAD or even in Excel would have taken days and lots of coffee.

Screenshot of "Cluster & Edit" tool in OpenRefine, showing variants that needed to be merged into a single heading.

 

In addition to “cluster and edit,” OpenRefine provides a really powerful way to reconcile your data against known vocabularies.  So, for example, you can configure OpenRefine to query the Library of Congress Subject Heading database and attempt to find LCSH values that match or come close to matching the subject terms in your spreadsheet.  I experimented with this feature a bit, but found the matching a bit unreliable for my needs.  I’d love to explore this feature again with a different data set.  To learn more about vocabulary reconciliation in OpenRefine, check out freeyourmetadata.org

 3. Export the cleaned spreadsheet from OpenRefine as an Excel file

Simple enough.

4. Open the Excel file and use Excel’s “XML Map” feature to export the spreadsheet as XML.

I admit that this is quite a hack, but one I’ve used several times to convert Excel spreadsheets to XML that I can then process with XSLT.  To get Excel to export your spreadsheet as XML, you’ll first need to create a new template XML file that follows the schema you want to output.  Excel refers to this as an “XML Map.”  For my project, I used this one: controlaccess_cleaner_xmlmap.xml

From the Developer tab, choose Source, and then add the sample XML file as the XML Map in the right hand window.  You can read more about using XML Maps in Excel here.

After loading your XML Map, drag the XML elements from the tree view in the right hand window to the top of the matching columns in the spreadsheet.  This will instruct Excel to map data in your columns to the proper XML elements when exporting the spreadsheet as XML.

Once you’ve mapped all your columns, select Export from the developer tab to export all of the spreadsheet data as XML.

Your XML file should look something like this: controlaccess_cleaner_dataset.xml

Sample chunk of exported XML, showing mappings from original terms to cleaned terms, type of term, and originating EAD identifier.

 

5. Use XSLT to batch process your source EAD files and find and replace the original terms with the cleaned terms.

For my project, I bundled the term cleanup as part of a larger XSLT “scrubber” script that fixed several other known issues with our EAD data all at once.  I typically use the Oxygen XML Editor to batch process XML with XSLT, but there are free tools available for this.

Below is a link to the entire XSLT scrubber file, with the templates controlling the <controlaccess> term cleanup on lines 412 to 493.  In order to access the XML file  you saved in step 4 that contains the mappings between old values and cleaned values, you’ll need to call that XML from within your XSLT script (see lines 17-19).

AT-import-fixer.xsl

What this script does, essentially, is process all of your source EAD files at once, finding and replacing all of the old name and subject terms with the ones you normalized and deduped in OpenRefine. To be more specific, for each term in the EAD, the XSLT script will find the matching term in the <original_term> field of the XML file you produced in step 4 above. If it finds a match, it will then replace that original term with the value of the corresponding <cleaned_term>. Below is a sample XSLT template that controls the find and replace of <persname> terms.

XSLT template that finds and replaces old values with cleaned ones.
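
For readers who don't speak XSLT, here is an illustrative Python version of the same lookup-and-replace logic; the project itself used the AT-import-fixer.xsl stylesheet above. It assumes the mapping file from step 4, with <original_term> and <cleaned_term> pairs, and EAD 2002 source files.

```python
"""Illustrative Python version of the lookup-and-replace the XSLT performs;
the actual project used the AT-import-fixer.xsl stylesheet linked above.
Assumes the step 4 mapping file with <original_term>/<cleaned_term> pairs
(element names are assumptions) and EAD 2002 source files.
"""
import glob
from lxml import etree

NS = {"ead": "urn:isbn:1-931666-22-9"}

def load_mapping(mapping_xml):
    mapping = {}
    for orig_el in etree.parse(mapping_xml).iter("original_term"):
        original = (orig_el.text or "").strip()
        cleaned = (orig_el.getparent().findtext("cleaned_term") or "").strip()
        if original and cleaned and original != cleaned:
            mapping[original] = cleaned
    return mapping

def clean_ead_files(ead_dir, mapping):
    for path in glob.glob(f"{ead_dir}/*.xml"):
        tree = etree.parse(path)
        changed = False
        for term in tree.iterfind(".//ead:controlaccess/*", namespaces=NS):
            old = (term.text or "").strip()
            if old in mapping:
                term.text = mapping[old]  # swap in the normalized heading
                changed = True
        if changed:
            tree.write(path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    clean_ead_files("ead_files", load_mapping("controlaccess_cleaner_dataset.xml"))
```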

 

Final Thoughts

Admittedly, cobbling together all these steps was quite an undertaking, but once you have the architecture in place, this workflow can be incredibly useful for normalizing, reconciling, and deduping metadata values in any flavor of XML with just a few tweaks to the files provided.  Give it a try and let me know how it goes, or better yet, tell me a better way…please.

More resources for working with OpenRefine:

“Using Google Refine to Clean Messy Data” (Propublica Blog)

freeyourmetadata.org