Category Archives: Behind the Scenes

Zoomable Hi-Res Images: Hopping Aboard the OpenSeadragon Bandwagon

Our new W. Duke & Sons digital collection (released a month ago) stands as an important milestone for us: our first collection constructed in the (Hydra-based) Duke Digital Repository, which is built on a suite of community-built open source software. Among that software is a remarkable image viewer tool called OpenSeadragon. Its website describes it as:

“an open-source, web-based viewer for high-resolution zoomable images, implemented in pure Javascript, for desktop and mobile.”

OpenSeadragon viewer in action on W. Duke & Sons collection.
OpenSeadragon zoomed in, W. Duke & Sons collection.

In concert with tiled digital images (we use Pyramid TIFFs), an image server (IIPImage), and a standard image data model (IIIF: International Image Interoperability Framework), OpenSeadragon considerably elevates the experience of viewing our image collections online. Its greatest virtues include:

  • smooth, continuous zooming and panning for high-resolution images
  • open source, built on web standards
  • extensible and well-documented

We can’t wait to share more of our image collections on the new platform.
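For readers curious about the wiring, here is a minimal sketch of how a viewer like this gets initialized against a IIIF image server. It is an illustration only, not our production configuration; the element ID, icon path, and info.json URLs are placeholders.

```javascript
// Minimal sketch only: element ID, icon path, and info.json URLs are placeholders.
var viewer = OpenSeadragon({
  id: "image-viewer",                  // DOM element that will host the viewer
  prefixUrl: "/openseadragon/images/", // where the default navigation icons live
  sequenceMode: true,                  // treat the tile sources as pages of one item
  showNavigator: true,                 // small overview map while zoomed in
  tileSources: [
    // Each entry points at a IIIF Image API info.json; the image server (IIPImage
    // in our case) serves tiles from the Pyramid TIFF as the user zooms and pans.
    "https://iiif.example.edu/images/card-front/info.json",
    "https://iiif.example.edu/images/card-back/info.json"
  ]
});
```

Because tiles are cut from the Pyramid TIFF on demand, only the regions and resolutions a user actually looks at ever travel over the network, which is what makes the zooming and panning feel so smooth.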

OpenSeadragon Examples Elsewhere

Arthur C. Clarke’s Third Law states, “Any sufficiently advanced technology is indistinguishable from magic.” And looking at high-res images in OpenSeadragon feels pretty darn magical. Here are some of my favorite implementations from places that inspired us to use it:

  1. The Metropolitan Museum of Art. Zooming in close on this van Gogh self-portrait gives you a means to inspect the intense brushstrokes and texture of the canvas in a way that you couldn’t otherwise experience, even by visiting the museum in person.

    Self-Portrait with a Straw Hat (obverse: The Potato Peeler). Vincent van Gogh, 1887.
  2. Chronicling America: Historic American Newspapers (Library of Congress). For instance, zoom in on the July 21, 1871 issue of “The Sun” (New York City) to read about my great-great-grandfather George Aery being crowned Schuetzen King, the sharpshooting champion, at a popular annual festival of marksmen.
    The sun. (New York [N.Y.]), 21 July 1871. Chronicling America: Historic American Newspapers. Lib. of Congress.
  3. Other GLAMs. See these other nice examples from The National Gallery of Art, the Smithsonian National Museum of American History, NYPL Digital Collections, and the Digital Public Library of America (DPLA).

OpenSeadragon’s Microsoft Origins


The software began with a company called Sand Codex, founded in Princeton, NJ in 2003. By 2005, the company had moved to Seattle and changed its name to Seadragon Software. Microsoft acquired the company in 2006 and positioned Seadragon within Microsoft Live Labs.

In March 2007, Seadragon founder Blaise Agüera y Arcas gave a TED Talk where he showcased the power of continuous multi-resolution deep-zooming for applications built on Seadragon. In the months that followed, we held a well-attended staff event at Duke Libraries to watch the talk. There was a lot of ooh-ing and aah-ing. Indeed, it looked like magic. But while it did foretell a real future for our image collections, at the time it felt unattainable and impractical for our needs. It was a Microsoft thing. It required special software to view. It wasn’t going to happen here, not when we were making a commitment to move away from proprietary platforms and plugins.

Sometime in 2008, Microsoft developed a more open Javascript-based version of Seadragon called Seadragon Ajax, and by 2009 had shared it as open-source software under a New BSD license. That removed many barriers to use; however, it still required a Microsoft server-side framework and the Microsoft AJAX library. In the years since, the software has been re-engineered to be truly open and framework-agnostic, and has been rebranded as OpenSeadragon. Having a technology this advanced, and this useful, be so open has been an incredible boon to cultural heritage institutions and, by extension, to the patrons we serve.

Setup

OpenSeadragon’s documentation is thorough, which helped us get up and running quickly with adding and customizing features. W. Duke & Sons cards were scanned front and back, and the albums are paginated, so we knew we had to support navigation within multi-image items. These are the key features involved:

Customizations

Some aspects of the interface weren’t quite as we needed them to be out of the box, so we added and customized a few features (a rough sketch of this work follows the list).

  • Custom Button Binding. Created our own navigation menu to match our site’s more modern aesthetic.
  • Page Indicator / Jump to Page. Developed a page indicator and direct-input page jump box using the OpenSeadragon API.
  • Styling. Revised the look & feel with additional CSS & Javascript.
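Here is a rough sketch of the page indicator, direct-input jump box, and custom button binding described above, assuming a viewer created with sequenceMode: true as in the earlier sketch; the element IDs and page count are hypothetical, not our production markup.

```javascript
// Rough sketch, not our production code: assumes `viewer` from the earlier
// sketch and hypothetical element IDs in the page markup.
var pageCount = 2; // number of images in this item, known from our metadata

function updateIndicator() {
  // currentPage() is zero-based, so add 1 for display
  document.getElementById("page-indicator").textContent =
    "Page " + (viewer.currentPage() + 1) + " of " + pageCount;
}

// OpenSeadragon raises a "page" event whenever the current image changes
viewer.addHandler("page", updateIndicator);
viewer.addHandler("open", updateIndicator);

// Direct-input page jump box
document.getElementById("page-jump").addEventListener("change", function (e) {
  var page = parseInt(e.target.value, 10) - 1; // display is 1-based, API is 0-based
  if (page >= 0 && page < pageCount) {
    viewer.goToPage(page);
  }
});

// Our own previous/next buttons, bound in place of the stock controls
document.getElementById("prev-page").addEventListener("click", function () {
  if (viewer.currentPage() > 0) { viewer.goToPage(viewer.currentPage() - 1); }
});
document.getElementById("next-page").addEventListener("click", function () {
  if (viewer.currentPage() < pageCount - 1) { viewer.goToPage(viewer.currentPage() + 1); }
});
```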

Future Directions: Page-Turning & IIIF

OpenSeadragon does have some limitations, and we don’t think it alone will meet all our needs for image interfaces. When we have highly structured paginated items with associated transcriptions or annotations, we’ll need to implement something a bit more complex. Mirador (example) and Universal Viewer (example) are two open-source page-viewer tools built on top of OpenSeadragon. Both projects depend on “manifests” that use the IIIF Presentation API to model this additional data.
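To give a sense of what these manifests contain, here is a heavily trimmed sketch of a IIIF Presentation 2.x manifest, written as a JavaScript object literal. The URLs, labels, and dimensions are placeholders, and real manifests carry far more structure (descriptive metadata, ranges for tables of contents, annotation lists for transcriptions, and so on).

```javascript
// Trimmed sketch of a IIIF Presentation 2.x manifest; all values are placeholders.
var manifest = {
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "https://repository.example.edu/iiif/item-1/manifest.json",
  "@type": "sc:Manifest",
  "label": "Sample paginated item",
  "sequences": [{
    "@type": "sc:Sequence",
    "canvases": [{
      // One canvas per page; viewers like Mirador and Universal Viewer page through these
      "@id": "https://repository.example.edu/iiif/item-1/canvas/p1",
      "@type": "sc:Canvas",
      "label": "Page 1",
      "width": 3000,
      "height": 4000,
      "images": [{
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "on": "https://repository.example.edu/iiif/item-1/canvas/p1",
        "resource": {
          "@id": "https://iiif.example.edu/images/p1/full/full/0/default.jpg",
          "@type": "dctypes:Image",
          "service": {
            // The same Image API endpoint OpenSeadragon uses for deep zoom
            "@context": "http://iiif.io/api/image/2/context.json",
            "@id": "https://iiif.example.edu/images/p1",
            "profile": "http://iiif.io/api/image/2/level1.json"
          }
        }
      }]
    }]
  }]
};
```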

The Hydra Page Turner Interest Group recently produced a summary report that compares these page-viewer tools and features, and highlights strategies for creating the multi-image IIIF manifests they rely upon. Several Hydra partners are already off and running; at Duke we still have some additional research and development to do in this area.

We’ll be adding many more image collections in the coming months, including migrating all of our existing ones that predated our new platform. Exciting times lie ahead. Stay tuned.

Animated Demo

[Animated GIF demo of zooming and page navigation in the OpenSeadragon interface]

 

Lichens, Bryophytes and Climate Change

As 2015 winds down, the Digital Production Center is wrapping up a four-year collaboration with the Duke Herbarium to digitize their lichen and bryophyte specimens. The project is funded by the National Science Foundation, and the ultimate goal is to digitize over 2 million specimens from more than 60 collections across the nation. Lichens and bryophytes (mosses and their relatives) are important indicators of climate change. After the images from the participating institutions are uploaded to one central portal, called iDigBio, large-scale distribution mapping will be used to identify regions where environmental changes are taking place, allowing scientists to study the patterns and effects of these changes.


The specimens are first transported from the Duke Herbarium to Perkins Library on a scheduled timeline. Then, we photograph the specimen labels using our Phase One overhead camera. Some of the specimens are very bulky, but our camera’s depth of field is broad enough to keep them in focus. To be clear, the project is not capturing photos of the plant specimens themselves, but rather images of the typed and hand-written scientific metadata adorning the envelopes that house the specimens. After we photograph them, the images are uploaded to the national database, where they are available for online research, along with other specimen labels uploaded from universities across the United States. Optical character recognition is used to digest and organize the scientific metadata in the images.


Over the past four years, the Digital Production Center has digitized approximately 100,000 lichen and bryophyte specimens. Many are from the Duke Herbarium, but other institutions, such as UNC-Chapel Hill, SUNY-Binghamton, Towson University, and the University of Richmond, have also asked us to digitize some of their specimens. The Duke Herbarium is the second-largest herbarium among U.S. private universities, after Harvard’s. It was started in 1921, and it contains more than 800,000 specimens of vascular plants, bryophytes, algae, lichens, and fungi, some of which were collected as far back as the 1800s. Several specimens have unintentionally humorous names, like the following, which wants to be funky, but isn’t fooling anyone. Ok, maybe only I find that funny.


The project has been extensive, but enjoyable, thanks to the leadership of Duke Herbarium Data Manager Blanka Shaw. Dr. Shaw has personally collected bryophytes on many continents, and has brought a wealth of knowledge, energy and good humor to the collaboration with the Digital Production Center. The Duke Herbarium is open for visitors, and citizen scientists are also needed to volunteer for transcription and georeferencing of the extensive metadata collected in the national database.

Baby Steps towards Metadata Synchronization

How We Got Here: A terribly simplistic history of library metadata

Managing the description of library collections (especially “special” collections) is an increasingly complex task.  In the days of yore, we bought books and other things, typed up or purchased catalog cards describing those things (metadata), and filed the cards away.  It was tedious work, but fairly straightforward.  If you wanted to know something about anything in the library’s collection, you went to the card catalog.  Simple.

Some time in the 1970s or 1980s we migrated all (well, most) of that card catalog description to the ILS (Integrated Library System).  If you wanted to describe something in the library, you made a MARC record in the ILS.  Patrons searched those MARC records in the OPAC (the public-facing view of the ILS).  Still pretty simple.  Sure, we maintained other paper-based tools for managing description of manuscript and archival collections (printed finding aids, registers, etc.), but until somewhat recently, the ILS was really the only “system” in use in the library.

Duke Online Catalog, 1980s

From the 1990s on, things got complicated. We started making EAD and MARC records for archival collections. We started digitizing parts of those collections and creating Dublin Core records and sometimes TEI for the digital objects.  We created and stored library metadata in relational databases (MySQL), METS, MODS, and even flat HTML. As library metadata standards proliferated, so too did the systems we used to create, manage, and store that metadata.

Now, we have an ILS for managing MARC-based catalog records, ArchivesSpace for managing more detailed descriptions of manuscript collections, a Fedora (Hydra) repository for managing digital objects, CONTENTdm for managing some other digital objects, and lots of little intermediary descriptive tools (spreadsheets, databases, etc.).  Each of these systems stores library metadata in a different format and in varying levels of detail.

So what’s the problem and what are we doing about it?

The variety of metadata standards and systems isn’t the problem.  The problem, and it is a very painful and time-consuming one, is having to maintain and reconcile description of the same thing (a manuscript, a folder of letters, an image, an audio file, etc.) across all these disparate metadata formats and systems.  It’s a metadata synchronization problem, and it’s a big one.

For the past four months or so, a group of archivists and developers here in the library have been meeting regularly to brainstorm ways to solve or at least help alleviate some of our metadata synchronization problems.  We’ve been calling our group “The Synchronizers.”

What have The Synchronizers been up to?  Well, so far we’ve been trying to tackle two pieces of the synchronization conundrum:

Problem 1 (the big one): Keeping metadata for special collections materials in sync across ArchivesSpace, the digitization process, and our Hydra repository.

Ideally, we’d like to re-purpose metadata from ArchivesSpace to facilitate the digitization process and also keep that metadata in sync as items are digitized, described more fully, and ingested into our Hydra repository. Fortunately, we’re not the only library trying to tackle this problem.  For more on AS/Hydra integration, see the work of the Hydra Archivists Interest Group.

Below are a couple of rough sketches we drafted to start thinking about this problem at Duke.

Hydra / ArchivesSpace Integration Sketch, take 1
Hydra / ArchivesSpace Integration Sketch, take 2

 

In addition to these systems integration diagrams, I’ve been working on some basic tools (scripts) that address two small pieces of this larger problem (a rough sketch of the second script’s approach follows the list):

  • A script to auto-generate digitization guides by extracting metadata from ArchivesSpace-generated EAD files (digitization guides are simply spreadsheets we use to keep track of what we digitize and to assign identifiers to digital objects and files during the digitization process).
  • A script that uses a completed digitization guide to batch-create digital object records in ArchivesSpace and at the same time link those digital objects to the descriptions of the physical items (the archival object records in ArchivesSpace-speak).  Special thanks to Dallas Pillen at the University of Michigan for doing most of the heavy lifting on this script.
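To make the second script’s approach concrete, here is a very rough sketch in JavaScript of the general idea: read a row of the digitization guide, create a digital object record through the ArchivesSpace REST API, and attach it to the corresponding archival object as an instance. The endpoint paths, field names, and guide columns are assumptions for illustration and will differ from the actual scripts; consult the ArchivesSpace API documentation before adapting anything like this.

```javascript
// Rough sketch only; endpoints, fields, and guide columns are assumptions.
const BASE = "https://archivesspace.example.edu/api"; // backend API URL (placeholder)
const REPO = 2;                                       // repository id (placeholder)

// Authenticate and get a session token for subsequent requests
async function login(user, password) {
  const res = await fetch(`${BASE}/users/${user}/login`, {
    method: "POST",
    body: new URLSearchParams({ password: password })
  });
  return (await res.json()).session;
}

// One digitization guide row, e.g. { identifier, title, archivalObjectUri, fileUri }
async function createAndLink(session, row) {
  const headers = {
    "X-ArchivesSpace-Session": session,
    "Content-Type": "application/json"
  };

  // 1. Create the digital object record
  const created = await fetch(`${BASE}/repositories/${REPO}/digital_objects`, {
    method: "POST",
    headers: headers,
    body: JSON.stringify({
      digital_object_id: row.identifier,
      title: row.title,
      file_versions: [{ file_uri: row.fileUri, publish: true }]
    })
  }).then(function (res) { return res.json(); });

  // 2. Fetch the archival object and append an instance pointing at the new digital object
  const ao = await fetch(`${BASE}${row.archivalObjectUri}`, { headers: headers })
    .then(function (res) { return res.json(); });
  ao.instances = (ao.instances || []).concat([{
    instance_type: "digital_object",
    digital_object: { ref: created.uri }
  }]);

  // 3. Save the updated archival object back to ArchivesSpace
  await fetch(`${BASE}${row.archivalObjectUri}`, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(ao)
  });
}
```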

Problem 2 (the smaller one): Using ArchivesSpace to produce MARC records for archival collections (or, stopping all that cutting and pasting).

In the past, we’ve had two completely separate workflows in special collections for creating archival description in EAD and creating collection-level MARC records for those same collections.  Archivists churned out detailed EAD finding aids and catalogers took those finding aids, and cut-and-pasted relevant sections into collection-level MARC records.  It’s quite silly, really, and we need a better solution that saves time and keeps metadata consistent across platforms.

While we haven’t done much work in this area yet, we have formed a small working group of archivists/catalogers and developed the following work plan:

  1. Examine default ArchivesSpace MARC exports and compare those exports to current MARC cataloging practices (document differences).
  2. Examine differences between ArchivesSpace MARC and “native” MARC and decide which current practices are worth maintaining, keeping in mind that we’ll need to modify default ArchivesSpace MARC exports to meet current MARC authoring practices.
  3. Develop cross-walking scripts or modify the ArchivesSpace MARC exporter to generate usable MARC data from ArchivesSpace.
  4. Develop and document an efficient workflow for pushing or harvesting MARC data from ArchivesSpace to both OCLC and our local ILS.
  5. If possible, develop, test, and document tools and workflows for re-purposing container (instance) information in ArchivesSpace in order to batch-create item records in the ILS for archival containers (boxes, folders, etc).
  6. Develop training for staff on new ArchivesSpace to MARC workflows.

Conclusion

So far we’ve only taken baby steps towards our dream of TOTAL METADATA SYNCHRONIZATION, but we’re making progress.  Please let us know if you’re working on similar projects at your institution. We’d love to hear from you.

Recognizing the Garden While Managing the Weeds

Life in Duke University Libraries has been even more energetic than usual these past months.  Our neighbors in Rubenstein just opened their newly renovated library and the semester is off with a bang.  As you can read over on Devil’s Tale, a lot of effort went on behind the scenes to get that sparkly new building ready for the public.  In following that theme, today I am sharing some thoughts on how producing digital collections both blesses and curses my perspective on our finished products.

When I write a Bitstreams post, I look for ideas in my calendar and to-do list to find news and projects to share.  This week I considered writing about “Ben”, those prints/negs/spreadsheets, and some resurrected proposals I’ve been fostering (don’t worry, these labels shouldn’t make sense to you).   I also turned to my list of favorite items in our digital collections; these are items I find particularly evocative and inspiring.  While reviewing my favorites with my possible topics in mind (Ben, prints/negs/spreadsheets, etc), I was struck by how differently patrons and researchers must relate to Duke Digital Collections than I do.  Where they see a polished finished product, I see the result of a series of complicated tasks I both adore and would sometimes prefer to disregard.

Let me back up and say that my first experience with Duke digital collections projects isn’t always about content or proper names.  Someone comes to me with an idea and of course I want to know about the significance of the content, but from there I need to know what format? How many items? Is the collection processed? What kind of descriptive data is available? Do you have a student to loan me? My mind starts spinning with logistics logistics logistics.   These details take on a life of their own separate from the significant content at hand.   As a project takes off, I come to know a collection by its details, the web of relationships I build to complete the project, and the occasional nickname. Let’s look at a few examples.

There are so many Gedney favorites to choose from; here is just one of mine.

William Gedney Photographs and Writings

Parts of this collection are published, but we are expanding and improving the online collection dramatically.

What the public sees:  poignant and powerful images of everyday life in an array of settings (Brooklyn, India, San Francisco, Rural Kentucky, and others).

What I see:  50,000 items in lots of formats; this project could take over DPC photographic digitization resources, all publication resources, all my meetings, all my emails, and all my thoughts (I may be over dramatizing here just a smidge). When it all comes together, it will be amazing.  

Benjamin Rush Papers
We have just begun working with this collection, but the Devil’s Tale blog recently shared a sneak preview.

What people will see:  letters to and from fellow founding fathers including Thomas Jefferson (Benjamin Rush signed the Declaration of Independence), as well as important historical medical accounts of a Yellow Fever outbreak in 1793.

What I see: Ben or when I’m really feeling it, Benny.  We are going to test out an amazing new workflow between ArchivesSpace and DPC digitization guides with Ben.  

 

Mangum’s negatives show a diverse range of subjects. I highly recommend his exterior images as well.

Hugh Mangum Photographs

This collection of photographs was published in 2008. Since then we have added more images to it, and enhanced portions of the collection’s metadata. 

What others see:  a striking portfolio of a Southern itinerant photographer’s portraits featuring a diverse range of people.  Mangum also had a studio in Durham at the beginning of his career.

What I see:  HMP.  HMP is the identifier for the collection included in every URL, which I always have to remind myself when I’m checking stats or typing in the URL (at first I think it should be Mangum).   HMP is sneaky, because every now and then the popularity of this collection spikes.   I really want more people to get to know HMP.

They may not be orphans but they are “cave children”.

The Orphans

The orphans are not literal children, but they come in all sizes and shapes, and span multiple collections.

What the public sees:  the public doesn’t see these projects.

What I see: orphans – plain and simple.  The orphans are projects that started, but then for whatever reason didn’t finish.  They have complicated rights, metadata, formats, or other problems that prevent them from making it through our production pipeline.  These issues tend to be well beyond my control, and yet I periodically pull out my list of orphans to see if their time has come.  I feel an extra special thrill of victory when we are able to complete an orphan project; the Greek Manuscripts are a good example.   I have my sights set on a few others currently, but do not want to divulge details here for fear of jinxing the situation.  

Don’t we all want to be in a digital collections land where the poppies bloom?

I could go on and on about how the logistics of each project shapes and re-shapes my perspective of it.  My point is that it is easy to temporarily lose sight of the digital collections garden given how entrenched (and even lost at times) we are in the weeds.  For my part, when I feel like the logistics of my projects are overwhelming, I go back to my favorites folder and remind myself of the beauty and impact of the digital artifacts we share with the world.  I hope the public enjoys them as much as I do.

 

FY15: A Year in Digital Projects

We experience a number of different cycles in the Digital Projects and Production Services Department (DPPS). There is of course the project lifecycle, that mysterious abstraction by which we try to find commonalities in work processes that can seem unique for every case. We follow the academic calendar, learn our fate through the annual budget cycle, and attend weekly, monthly, and quarterly meetings.

The annual reporting cycle at Duke University Libraries usually falls to departments in August, with those reports informing a master library report completed later. Because of the activities and commitments around the opening of the Rubenstein Library, the departments were let off the hook for their individual reports this year. Nevertheless, I thought I would use my turn in the Bitstreams rotation to review some highlights from our 2014-15 cycle.

Loads of accomplishments after the jump …


How Duke Chronicle Goes Digital

Today we will take a detailed look at how the Duke Chronicle, the university’s beloved newspaper for over 100 years, is digitized. Since the digitization spans nine decades (1905-1989), it is an ongoing project that the Digital Production Center (DPC), part of Digital Projects and Production Services (DPPS) and Duke University Libraries’ Digital Collections Program, has been chipping away at. Scanning and digitizing may seem straightforward to many – place an item on a scanner and press scan, for goodness sake! – but we at the DPC want to shed light on our own processes to give you a sense of what we do behind the scenes. It may seem like an easy-peasy process of scanning and uploading images online, but there is much more that goes into it than that. Digitizing a large collection of newspapers is not always a fun-filled endeavor, and the physical act of scanning thousands of news pages is done by many dedicated (and patient!) student workers, staff members, and me, the King Intern for Digital Collections.

Pre-Scanning Procedures

Large format 1940s Chronicles in over-sized archival box

Many steps in the digitization process do not actually occur in the DPC, but among other teams or departments within the library. Though I focus mainly on the DPC’s responsibilities, I will briefly explain the steps others perform in this digital projects tango…or maybe it’s a waltz?

Each proposed project must first be approved by the Advisory Council for Digital Collections (ACDC), a team that reviews each project for its strategic value. Then it is passed on to the Digital Collections Implementation Team (DCIT) to perform a feasibility study that examines the project’s strengths and weaknesses (see Thomas Crichlow’s post for an overview of these teams). The DCIT then helps guide the project to fruition. After clearing these hoops back in 2013, the Duke Chronicle project started its journey toward digital glory.

We pull 10 years’ worth of newspapers at a time from the University Archives in Rubenstein Library. Only one decade at a time is processed to make the 80+ years of Chronicle publications more manageable. The first stop is Conservation. To make sure the materials are stable enough to withstand digitizing, Conservation must inspect the condition of the paper prior to giving the DPC the go-ahead. Because newspapers since the mid-19th century were printed on cheap and very acidic wood pulp paper, the pages can become brittle over time and may warrant extensive repairs. Senior Conservator Erin Hammeke has done great work mending tears and brittle edges of many Chronicle pages since the start of this project. As we embark on digitizing the older decades, from the 1940s and earlier, Erin’s expertise will be indispensable. We rely on her not only to repair brittle pages but to guide the DPC’s strategy when deciding the best and safest way to digitize such fragile materials. Also, several volumes of the Chronicle have been bound, and to get the best scan these must be removed from their binding. Erin to the rescue!

Conservation repair on a 1940s Chronicle page
Conservation repair to a torn 1940s Chronicle ad

 

1950s Duke Chronicle digitization guide

Now that Conservation has assessed the condition and given the DPC the green light, preliminary prep work must still be done before the scanner comes into play. A digitization guide is created in Microsoft Excel to list each Chronicle issue along with its descriptive metadata (more information about this process can be found in my metadata blog post). This spreadsheet acts as a guide in the digitization process (hence its name, digitization guide!) to keep track of each analog newspaper issue and, once scanned, its corresponding digital image. In this process, each Chronicle issue is inspected to collect the necessary metadata. At this time, a unique identifier is assigned to every issue based on the DPC’s naming conventions. This identifier stays with each item for the duration of its digital life and allows for easy identification of one among thousands of Chronicle issues. At the completion of the digitization guide, the Chronicle is now ready for the scanner.
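As a purely illustrative sketch of what the guide captures, the snippet below builds a few guide rows in code. The identifier pattern, field names, and values are hypothetical, not the DPC’s actual naming conventions, and the real guide is maintained in Excel.

```javascript
// Illustrative sketch only: identifier pattern and fields are hypothetical.
var issues = [
  { date: "1959-09-18", volume: "55", number: "1" },
  { date: "1959-09-25", volume: "55", number: "2" }
];

var guide = issues.map(function (issue) {
  // A unique identifier is assigned once and stays with the issue for its digital life
  var identifier = "dukechronicle_" + issue.date.replace(/-/g, "");
  return {
    identifier: identifier,
    date: issue.date,
    volume: issue.volume,
    number: issue.number,
    capturePerson: "",  // filled in at scan time
    captureDate: "",    // filled in at scan time
    captureDevice: ""   // e.g. Zeutschel or Phase One, filled in at scan time
  };
});

// Emit tab-separated rows that could be pasted into a spreadsheet
var header = Object.keys(guide[0]).join("\t");
var rows = guide.map(function (row) {
  return Object.keys(row).map(function (key) { return row[key]; }).join("\t");
});
console.log([header].concat(rows).join("\n"));
```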

 

The DPC’s Zeutschel OS 14000 A2

The Scanning Process

With all loose unbound issues, the Zeutschel is our go-to scanner because it allows for large format items to be imaged on a flat surface. This is less invasive and less damaging to the pages, and is quicker than other scanning methods. The Zeutschel can handle items up to 25 x 18 inches, which accommodates the larger sized formats of the Chronicle used in the 1940s and 1950s. If bound issues must be digitized, due to the absence of a loose copy or the inability to safely disbind a volume, the Phase One digital camera system is used, as it can better capture large bound pages that may not necessarily lie flat.

Folders each containing multiple page images of one Chronicle issue

For every scanning session, we need the digitization guide handy, as it tells us what to name the image files using the previously assigned unique identifier. Each issue of the newspaper is scanned as a separate folder of images, with one image representing one page of the newspaper. This system of organization allows for each issue to become its own compound object – multiple files bound together with an XML structure – once published to the website. The Zeutschel’s scanning software helps organize these image files into properly named folders. Of course, no digitization session would be complete without the initial target scan that checks for color calibration (see Mike Adamo’s post for a color calibration crash course).

The Zeutschel’s control panel of buttons
The Zeutschel’s optional foot pedals

The scanner’s plate glass can now be raised with the push of a button (or the tap of a foot pedal) and the Chronicle issue is placed on the flatbed.  Lowering the plate glass flattens the pages for a better scan result. Now comes the excitement… we can finally press SCAN. For each page, the plate glass is raised, lowered, and the scan button is pressed. Chronicle issues can have anywhere from 2 to 30 or more pages, so you can imagine this process can become monotonous – or even mesmerizing – at times. Luckily, with the smaller format decades, like the 1970s and 1980s, the inner pages can be scanned two at a time and the Zeutschel software separates them into two images, which cuts down on the scan time. As for the larger formats, the pages are so big you can only fit one on the flatbed. That means each page is a separate scan, but older years tended to publish fewer issues, so it’s a trade-off. To put the volume of this work into perspective, the 1,408 issues of the 1980s Chronicle took 28,089 scans to complete, while the 1950s Chronicle of about 482 issues took around 3,700 scans to complete.

 

A 1940s Chronicle page is placed on the flatbed for scanning

 

Scanning in progress of the 1940s Chronicle page
Target image opened in Adobe Photoshop for color calibration

Every scanned image that pops up on the screen is also checked for alignment and cropping errors that may require a re-scan. Once all the pages in an issue are digitized and checked for errors, clicking the software’s Finalize button will compile the images in the designated folder. We now return to our digitization guide to enter in metadata pertaining to the scanning of that issue, including capture person, capture date, capture device, and what target image relates to this session (subsequent issues do not need a new target scanned, as long as the scanning takes place in the same session).

Now, with the next issue, rinse and repeat: set the software settings and name the folder, scan the issue, finalize, and fill out the digitization guide. You get the gist.

 

Post-Scanning Procedures

Rotating an image in Adobe Photoshop

We now find ourselves with a slew of folders filled with digitized Chronicle images. The next phase of the process is quality control (QC). Once every issue from the decade is scanned, the first round of QC checks all images for excess borders to be cropped, crooked images to be squared, and any other minute discrepancy that may have resulted from the scanning process. This could be missing images, pages out of order, or even images scanned upside down. This stage of QC is often performed by student workers who diligently inspect image after image using Adobe Photoshop. The second round of QC is performed by our Digital Production Specialist Zeke Graves, who gives every item a final pass.

At this stage, derivatives of the original preservation-quality images are created. The originals are archived in dark storage, while the smaller-sized derivatives are used in the CONTENTdm ingest process. CONTENTdm is the digital collection management software we use that collates the digital images with their appropriate descriptive metadata from our digitization guide, and creates one compound object for each Chronicle issue. It also generates the layer of Optical Character Recognition (OCR) data that makes the Chronicle text searchable, and provides an online interface for users to discover the collection once published on the website. The images and metadata are ingested into CONTENTdm’s Project Client in small batches (1 to 3 years of Chronicle issues) to reduce the chance of upload errors. Once ingested into CONTENTdm, the items are then spot-checked to make sure the metadata paired up with the correct image. During this step, other metadata is added that is specific to CONTENTdm fields, including the ingest person’s initials. Then, another ingest must run to push the files and data from the Project Client to the CONTENTdm server. A third step after this ingest finishes is to approve the items in the CONTENTdm administrative interface. This gives the go-ahead to publish the material online.

Hold on, we aren’t done yet. The project is now passed along to our developers in DPPS who must add this material to our digital collections platform for online discovery and access (they are currently developing Tripod3 to replace the previous Tripod2 platform, which is more eloquently described in Will Sexton’s post back in April). Not only does this improve discoverability, but it makes all of the library’s digital collections look more uniform in their online presentation.

Then, FINALLY, the collection goes live on the web. Now, just repeat the process for every decade of the Duke Chronicle, and you can see how this can become a rather time-heavy and laborious process. A labor of love, that is.

I could have narrowly stuck with describing to you the scanning process and the wonders of the Zeutschel, but I felt that I’d be shortchanging you. Active scanning is only a part of the whole digitization process which warrants a much broader narrative than just “push scan.” Along this journey to digitize the Duke Chronicle, we’ve collectively learned many things. The quirks and trials of each decade inform our process for the next, giving us the chance to improve along the way (to learn how we reflect upon each digital project after completion, go to Molly Bragg’s blog post on post-mortem reports).

If your curiosity is piqued as to how the Duke Chronicle looks online, the Fall 1959-Spring 1970 and January 1980-February 1989 issues are already available to view in our digital collections. The 1970s Chronicle is the next decade slated for publication, followed by the 1950s. Though this isn’t a comprehensive detailed account of the digitization process, I hope it provides you with a clearer picture of how we bring a collection, like the Duke Chronicle, into digital existence.

The Beauty of Auto Crop

One of the most tedious and time-consuming tasks we do in the Digital Production Center is cropping and straightening still image files. Hired students spend hours sitting at our computers, meticulously straightening and cropping extraneous background space out of hundreds of thousands of photographed images, using Adobe Photoshop. This process is necessary in order to present a clean, concise image for our digital collections, but it causes delays in the completion of our projects, and requires a lot of student labor. Auto cropping software has long been sought after in digital imaging, but few developers have been able to make it work efficiently for all materials. The Digital Production Center’s Zeutschel overhead scanner utilizes auto cropping software, but the scanner can only be used with completely flat media, due to its limited depth of field. Thicker and more fragile materials must be photographed using our Phase One digital camera system.

Capture One’s Cultural Heritage software includes the auto crop feature.

Recently, Digital Transitions, the supplier of Phase One and its accompanying Capture One software, announced an update to the software that includes an auto crop and straightening feature. The new software is called Capture One Cultural Heritage, and is specifically designed for use in libraries and archival institutions. The auto crop feature, previously unavailable in Capture One, is a real breakthrough, and there are several options for how to use it.

First of all, the user can choose to auto crop “On Capture” or “On Crop.” That is, the software can auto crop instantly, right after a photograph has been taken (On Capture), or it can be applied to the image, or batch of images, at a later time (On Crop). You can also choose between auto cropping at a fixed size, or by the edge of the material. For instance, if you are photographing a collection of posters that are all sized 18 x 24 inches, you would choose “Fixed Size” and set the primary crop to 18 x 24, or slightly larger if you want your images to have an outer border. The software recognizes the rectangular shape, and applies the crop. If you are photographing a collection of materials that are a variety of different sizes, you would choose “Generic,” which tells the software to crop wherever it sees a difference between the edge of the material and the background. “Padding” can be used to give those images a border.

The Digital Production Center’s Phase One camera system.

Because Capture One utilizes raw files, the auto crops are non-destructive edits. One benefit of this is that if your background color is close to the color of your material, you can temporarily adjust the contrast of the photograph in order to darken the edges of the object, thus enhancing the delineation between object and background.  Next, apply the auto crop, which will be more successful due to its ability to recognize the newly-defined edges of the material. After the crops are applied, you can reverse the contrast adjustment, thus returning the images to their original state, while still keeping the newly-generated crops.

Temporarily increasing the contrast of your images can help the auto crop feature find the edges of the object.

Like a lot of technological advances, reliable auto cropping seemed like a fantasy just a few years ago, but is now a reality. It doesn’t work perfectly every time, and quality control is still necessary to uncover errors, but it’s a big step forward. The only thing disconcerting is the larger question facing our society. How long will it be before our work is completely automated, and humans are left behind?

Who, Why, and What:  the three W’s of the Duke Digital Collections Mini-Survey

My colleague Sean wrote two weeks ago about the efforts a group of us  in the library are making towards understanding the scholarly impacts of Duke Digital Collections.  In this post, I plan to continue the discussion with details about the survey we are conducting as well as share some initial results.

Surveying can be perilous work!

After reviewing the analytics and Google Scholar data Sean wrote about, our working group realized we needed more information. Our goal throughout this assessment process has been to pull together scholarly use data that will inform our digitization decisions, priorities, and technological choices (features on the digital collections platform), and to help us understand whether and how we are meeting the needs of researcher communities. Analytics gave us clues, but we still didn’t know some of the fundamental facts about our patrons. After a fervent discussion with many whiteboard notes, the group decided that creating a survey would get us more of the data we were looking for. The resulting survey focuses on the elemental questions we have about our patrons: who are they, why are they visiting Duke Digital Collections, and what are they going to do with what they find here.

 

The Survey

Creating the survey itself was no small task, but after an almost endless process of writing, rewriting, and consultations with our assessment coordinator, we settled on 6 questions (a truly miniature survey).  We considered the first three questions (who, why, what) to be most important, and we intended the last three to provide us with additional information, such as Duke affiliation, and to allow a space for general feedback.  None of the questions were required, so respondents could answer or skip whatever they wanted; we also included space for respondents to write in further details, especially when choosing the “other” option.

Our survey in its completed form.

The survey launched on April 30 and remains accessible by hovering over a “feedback” link on every single Digital Collection webpage.  Event tracking analytics show that 0.29% of the patrons that hover over our feedback link click through to the survey. An even smaller number have actually submitted responses.  This has worked out to 56 responses at an average rate of around 1 per day.  Despite that low click through rate, we have been really pleased with the number of responses we have had so far.  The response rate remains steady, and we have already learned a lot from even this small sample of visitor data.  We are not advertising the survey or promoting it, because our target respondents are patrons who find us in the course of their research or general Internet browsing.
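For the curious, event tracking of this sort is typically just a pair of listeners on the link. The sketch below uses Google Analytics’ analytics.js event syntax with a hypothetical element ID and hypothetical category and action names, which may not match our actual setup.

```javascript
// Hedged sketch: hypothetical element ID, category, and action names.
var feedbackLink = document.getElementById("feedback-link");

// Count hovers over the feedback link...
feedbackLink.addEventListener("mouseover", function () {
  ga("send", "event", "Survey", "hover", "feedback-link");
});

// ...and clicks through to the survey; comparing the two counts in the
// analytics reports yields the hover-to-click rate mentioned above.
feedbackLink.addEventListener("click", function () {
  ga("send", "event", "Survey", "click-through", "feedback-link");
});
```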

Hovering over the help us box reveals expectations and instructions for survey participants.

Initial Results

Before I start discussing our results, please note that what I’m sharing here is based on initial responses and my own observations.  No one in digital collections has thoroughly reviewed or analyzed this data.  Additionally, this information is drawn from responses submitted between April 30 – July 8, 2015. We plan to keep the survey online into the academic year to see if our responses change when classes are in session.

With that disclaimer now behind us, let’s review results by question.

Questions 1 and 4:  Who are you?

Since we are concerned with scholarly-oriented use more than other types in this exercise, the first question is intended to sort respondents primarily by academic status.  In question 4, respondents are given the chance to further categorize their academic affiliation.

Question 1 Answers                       # of Responses   %
Student                                        14         25%
Educator                                       10         18%
Librarian, Archivist or Museum Staff            5          9%
Other                                          26         47%
Total                                          55        100%

Of the respondents who categorized themselves as “other” in question 1, 11 clarified their otherness by writing their identities in the space provided.  Of these 11, 4 associated themselves with music-oriented professions or hobbies, and 2 with the fine arts (photographer and filmmaker).  The remaining 5 could not be grouped easily into categories.

As a follow-up later in the survey, question 4 asks respondents to categorize their academic affiliation (if they have one).  The results showed that 3 respondents are affiliated with Duke, 12 with other colleges or universities, and 9 with a K-12 school.  Of the write-in responses, 3 listed names of universities abroad, and 1 listed a school whose level has not been identified.

Question 2:  Why are you here?

We can tell from our analytics how people get to us (if they were referred to us via a link or sought us out directly), but this information does not address why visitors come to the site.  Enter question 2.

Question 2 Answers     # of Responses   %
Academic research            15         28%
Casual browsing              15         28%
Followed a link               9         17%
Personal research            24         44%
Other                         6         11%
Total respondents            54

(Respondents could choose more than one answer, so the counts and percentages total more than the 54 respondents.)

The survey asks those who select academic research, personal research, or other to write in their research topic or purpose.  Academic research topics submitted so far revolve primarily around various historical subjects.  Personal research topics reflect a high interest in music (specific songs or types of music), advertising, and other personal projects.  It is interesting to note that local history topics have been submitted under all three categories (academic, personal, and other).  Additionally, non-academic researchers seem to be more willing to share their specific topics; 19 of 24 respondents listed their topics, as compared to 7 out of 15 academic researchers.

Question 3:  What will you do with the images and/or resources you find on this site?

To me, this question has the potential to provide some of the most illuminating information from our patrons. Knowing how they use the material helps us determine how to enhance access to the digitized objects and what kinds of technology we should be investing in.  This can also shed light on our digitization process itself.  For example, maybe the full text version of an item will provide more benefit to more researchers than an illustrated or hand-written version of the same item (of course we would prefer to offer both, but I think you see where I am going with this).

In designing this question, the group decided it would be valuable to offer options for those who share items for their visual or subject appeal (for example, the Pinterest user), for the publication-minded researcher, and for a range of patron types in between.

 

Question 3 Answers                          # of Responses   %
Use for an academic publication                    3          6%
Share on social media                             10         19%
Use them for homework                              8         15%
Use them as a teaching tool in my classes          5          9%
Personal use                                      31         58%
Use for my job                                     2          4%
Other                                             10         19%
Total respondents                                 53

(Respondents could choose more than one answer, so the counts and percentages total more than the 53 respondents.)

The 10 “other” respondents all entered further details: they planned to share items with friends and family (in some way other than on social media), to use the items they found as a reference, or they were working on an academic pursuit that in their mind didn’t fit the listed categories.

Observations

As I said above, these survey results are preliminary, as we plan to leave the survey up for several more months.  But so far the data reveals that Duke Digital Collections serves a wide audience of academic and non-academic users for a range of purposes. For example, one respondent uses the outdoor advertising collections to get a glimpse of how their community has changed over time. Another is concerned with US history in the 1930s, and another is focused on music from the 1900s.

The next phase of the assessment group’s activities is to meet with researchers and instructors in person and talk with them about their experiences using digital collections (not just Duke’s) for scholarly research or instruction.  We have also been collecting examples of instructors who have used digital collections in their classes, and we plan to create a webpage of these examples to encourage other instructors to do the same.  The goal of both of these efforts is to increase academic use of digital collections (whether at the K-12 or collegiate level).

 

Just like this survey team, we stand at the ready, waiting for our chance to analyze and react to our data!

Of course, another next step is to keep collecting this survey data and analyze it further.  All in all, it has been truly exciting to see the results thus far.  As we study the data in more depth this Fall, we plan to work with the Duke University Library Digital Collections Advisory Team to implement any new technical or policy oriented decisions based on our conclusions.  Our minds are already spinning with the possibilities.

The Tao of the DAO: Embedding digital objects in finding aids

Over the last few months, we’ve been doing some behind-the-scenes re-engineering of “the way” we publish digital objects in finding aids (aka “collection guides”).  We made these changes in response to two main developments:

  • The transition to ArchivesSpace for managing description of archival collections and the production of finding aids
  • A growing need to handle new types, or classes, of digital objects in our finding aid interface (especially born-digital electronic records)

Background

While the majority of items found in Duke Digital Collections are published and accessible through our primary digital collections interface (codename Tripod), we have a growing number of digital objects that are published (and sometimes embedded) in finding aids.

Finding aids describe the contents of manuscript and archival collections, and in many cases, we’ve digitized all or portions of these collections.  Some collections may contain material that we acquired in digital form.  For a variety of reasons that I won’t describe here, we’ve decided that embedding digital objects directly in finding aids can be a suitable, often low-barrier alternative to publishing them in our primary digital collections platform.  You can read more on that decision here.

Screenshot showing digital objects embedded in the Alexander H. Stephens Papers finding aid

 

EAD, ArchivesSpace, and the <dao>

At Duke, we’ve been creating finding aids in EAD (Encoded Archival Description) since the late 1990s.  Prior to implementing ArchivesSpace (June 2015) and its predecessor Archivists Toolkit (2012), we created EAD through some combination of an XML editor (NoteTab, Oxygen), Excel spreadsheets, custom scripts, templates, and macros.  Not surprisingly, the evolution of EAD authoring tools led to a good deal of inconsistent encoding across our EAD corpus.  These inconsistencies were particularly apparent when it came to information encoded in the <dao> element, the EAD element used to describe “digital archival objects” in a collection.

As part of our ArchivesSpace implementation plan, we decided to get better control over the <dao>–both its content and its structure.  We wrote some local best practice guidelines for formatting the data contained in the <dao> element and we wrote some scripts to normalize our existing data before migrating it to ArchivesSpace.

Classifying digital objects with the “use statement.”

In June 2015, we migrated all of our finding aids and other descriptive data to ArchivesSpace.  In total, we now have about 3400 finding aids (resource records) and over 9,000 associated digital objects described in ArchivesSpace.  Among these 9,000 digital objects, there are high-res master images, low-res use copies, audio files, video files, disk image files, and many other kinds of digital content.  Further, the digital files are stored in several different locations–some accessible to the public and some restricted to staff.

In order for our finding aid interface to display each type of digital object properly, we developed a classification system of sorts that 1) clearly identifies each class of digital object and 2) describes the desired display behavior for that type of object in our finding aid interface.

In ArchivesSpace, we store that information consistently in the ‘Use Statement’ field of each Digital Object record.  We’ve developed a core set of use statement values that we can easily maintain in a controlled value list in the ArchivesSpace application.  In turn, when ArchivesSpace generates or exports an EAD file for any given collection that contains digital objects, these use statement values are output in the DAO role attribute.  Actually, a minor bug in the ArchivesSpace application currently prevents the use statement information from appearing in the <dao>. I fixed this by customizing the ArchivesSpace EAD serializer in a local plugin.

Screenshot from ArchivesSpace showing digital object record, file version, and use statement

 

Snippet of EAD generated from ArchivesSpace showing <dao> encoding

 Every object its viewer/player

The values in the DAO role attribute tell our display interface how to render a digital object in the finding aid.  For example, when the display interface encounters a DAO with role=”video-streaming” it knows to queue up our embedded streaming video player.  We have custom viewers and players for audio, batches of image files, PDFs, and many other content types.
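As a simplified illustration of that dispatch (not our actual interface code), the sketch below maps role values to rendering functions. Apart from “video-streaming,” which is mentioned above, the role values and data attributes are hypothetical.

```javascript
// Simplified sketch: role values other than "video-streaming" and the data
// attributes used here are hypothetical.
function embedLink(href, container) {
  var a = document.createElement("a");
  a.href = href;
  a.textContent = "View item";
  container.appendChild(a);
}

function embedVideo(href, container) {
  var video = document.createElement("video");
  video.src = href;
  video.controls = true;
  container.appendChild(video);
}

function embedImages(href, container) {
  // In practice this is where an OpenSeadragon viewer would be initialized
  embedLink(href, container);
}

// Each DAO role (the use statement exported from ArchivesSpace) picks a renderer
var viewersByRole = {
  "video-streaming": embedVideo,
  "image-service": embedImages,
  "web-resource": embedLink
};

// Render every DAO-derived element on the page with the appropriate viewer/player
Array.prototype.forEach.call(document.querySelectorAll("[data-dao-role]"), function (el) {
  var role = el.getAttribute("data-dao-role");
  var href = el.getAttribute("data-dao-href");
  (viewersByRole[role] || embedLink)(href, el); // unknown roles fall back to a plain link
});
```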

Here are links to some finding aids with different classes of embedded digital objects, each with its own associated use statement and viewer/player.

The curious case of electronic records

The last example above illustrates the curious case of electronic records.  The term “electronic records” can describe a wide range of materials but may include things like email archives, disk images, and other formats that are not immediately accessible on our website, but must be used by patrons in the reading room on a secure machine.  In these cases, we want to store information about these files in ArchivesSpace and provide a convenient way for patrons to request access to them in the finding aid interface.

Within the next few weeks, we plan to implement some improvements to the way we handle the description of and access to electronic records in finding aids.  Eventually, patrons will be able to view detailed information about the electronic records by hovering over a link in the finding aid.  Clicking on the link will automatically generate a request for those records in Aeon, the Rubenstein Library’s request management system.  Staff can then review and process those requests and, if necessary, prepare the electronic records for viewing on the reading room desktop.

Conclusion

While we continue to tweak our finding aid interface and learn our way around ArchivesSpace, we think we’ve developed a fairly sustainable and flexible way to publish digital objects in finding aids that both preserves the archival context of the items and provides an engaging user-experience for interacting with the objects.  As always, we’d love to hear how other libraries may have tackled this same problem.  Please share your comments or experiences with handling digital objects in finding aids!

[Credit to Lynn Holdzkom at UNC-Chapel Hill for coining the phrase “The Tao of the DAO”]