Category Archives: Behind the Scenes

What happens when you click “Search?”

How many times each day do you type something into a search box on the web and click "Search"? Have you ever wondered what happens behind the scenes to make this possible? In this post I'll show how search works on the Duke University Libraries Catalog. I'll trace that journey from the metadata in a MARC record (where our bibliographic data is stored), to transforming that data into something we can index for searching, to how the words you type into the search box are transformed, and finally to how the indexed records and your search interact to produce a relevance-ranked list of search results. Let's get into the weeds!

A MARC record stores bibliographic data that we purchase from vendors or that is created by metadata specialists who work at Duke Libraries. These records look something like this:

In an attempt to keep this simple, let's just focus on the main title of the record. This information is recorded in the MARC record's 245 field in subfields a, b, f, g, h, k, n, p, and s. I'm not going to explain what each of the subfields is for, but the Library of Congress maintains extensive documentation about MARC field specifications (see 245 – Title Statement (NR)). Here is an example of a MARC 245 field with a linked 880 field that contains the equivalent title in an alternate script (just to keep things interesting).

=245 10$6880-02$aUrbilder ;$bBlossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet /$cToshio Hosokawa.
=880 10$6245-02/$1$a原像 ;$b開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための /$c細川俊夫.

The first thing that has to happen is we need to get the data out of the MARC record into a more computer-friendly data format — an array of hashes, which is just a fancy way of saying a list of key-value pairs. The software reads the metadata from the MARC 245 field, joins all the subfields together, and cleans up some punctuation. The software also checks to see if the title field contains Arabic, Chinese, Japanese, Korean, or Cyrillic characters, which have to be handled separately from Roman-character languages. From the MARC 245 field and its linked 880 field we end up with the following data structure.

"title_main": [
{
"value": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"
},
{
"value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",
"lang": "cjk"
}
]
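The real extraction code handles more edge cases, but a minimal sketch of this step might look like the following, written here with the ruby-marc gem. The contains_cjk? helper is a hypothetical stand-in for our script-detection logic, and handling of the linked 880 field is omitted.

require 'marc'

TITLE_SUBFIELDS = %w[a b f g h k n p s].freeze

# Hypothetical stand-in for the script detection; the real code also checks
# Arabic, Cyrillic, and Hangul ranges.
def contains_cjk?(text)
  text.match?(/[\p{Han}\p{Hiragana}\p{Katakana}]/)
end

# Build the title_main entry for one MARC record.
def title_main(record)
  field = record['245']
  return [] unless field

  value = field.subfields
               .select { |sf| TITLE_SUBFIELDS.include?(sf.code) }
               .map(&:value)
               .join(' ')
               .sub(/\s*[\/:;,.]\s*\z/, '') # tidy trailing punctuation

  entry = { 'value' => value }
  entry['lang'] = 'cjk' if contains_cjk?(value)
  [entry]
end

MARC::Reader.new('records.mrc').each do |record|
  puts title_main(record).inspect
end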

We send this data off to an ingest service that prepares the metadata for indexing.

The data is first expanded to multiple fields.

{"title_main_indexed": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_vernacular_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_vernacular_lang": "cjk",

"title_main_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"}

title_main_indexed will be indexed for searching.
title_main_vernacular_value holds the non-Roman version of the title to be indexed for searching.
title_main_vernacular_lang holds information about the character set stored in title_main_vernacular_value.
title_main_value holds the data that will be stored for display purposes in the catalog user interface.

We take this flattened, expanded set of fields and apply a set of rules to prepare the data for the indexer (Solr). These rules append suffixes to each field and combine the two vernacular fields to produce the following field value pairs. The suffixes provide instructions to the indexer about what should be done with each field.

{"title_main_indexed_tsearchtp": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_cjk_v": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_t_stored_single": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet" }

When sent to the indexer the fields are further transformed.

Suffixed source field: title_main_indexed_tsearchtp
Solr field: title_main_indexed_t
Solr field type: text, stemmed
Solr stored/indexed values: urbild blossom kalligraphi o mensch bewein dein sund gross arrang for string quartet

Suffixed source field: title_main_indexed_tsearchtp
Solr field: title_main_indexed_tp
Solr field type: text, unstemmed
Solr stored/indexed values: urbilder blossoming kalligraphie o mensch bewein dein sunde gross arrangement for string quartet

Suffixed source field: title_main_cjk_v
Solr field: title_main_cjk_v
Solr field type: Chinese, Japanese, Korean text
Solr stored/indexed values: 原 像 开花 书 か り く ら ふ ぃ い ほか 弦乐 亖 重奏 の ため の

Suffixed source field: title_main_t_stored_single
Solr field: title_main
Solr field type: stored string
Solr stored/indexed values: 原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein’ dein’ Sünde gross (Arrangement) : for string quartet

These are all index time transformations. They occur when we send records into the index.

The query you enter into the search box also gets transformed in different ways and then compared to the indexed fields above. These are query time transformations. As an example, if I search for the terms “Urbilder Blossom Kalligraphie,” the following transformations and comparisons take place:

The values stored in the records for title_main_indexed_t are evaluated against my search string transformed to urbild blossom kalligraphi.

The values stored in the records for title_main_indexed_tp are evaluated against my search string transformed to urbilder blossom kalligraphie.

The values stored in the records for title_main_cjk_v are evaluated against my search string transformed to urbilder blossom kalligraphie.

Then Solr does some calculations based on relevance rules we configure to determine which documents are matches and how closely they match (signified by the relevance score calculated by Solr). The field value comparisons end up looking like this under the hood in Solr:

+(DisjunctionMaxQuery((
(title_main_cjk_v:urbilder)^50.0 |
(title_main_indexed_tp:urbilder)^500.0 |
(title_main_indexed_t:urbild)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:blossom)^50.0 |
(title_main_indexed_tp:blossom)^500.0 |
(title_main_indexed_t:blossom)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:kalligraphie)^50.0 |
(title_main_indexed_tp:kalligraphie)^500.0 |
(title_main_indexed_t:kalligraphi)^100.0)~1.0))~3
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^150.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^600.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^5000.0)~1.0)
(DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom")^75.0 |
(title_main_indexed_t:"urbild blossom")^200.0 |
(title_main_indexed_tp:"urbilder blossom")^1000.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:"blossom kalligraphie")^75.0 |
(title_main_indexed_t:"blossom kalligraphi")^200.0 |
(title_main_indexed_tp:"blossom kalligraphie")^1000.0)~1.0))
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^100.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^350.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^3000.0)~1.0)

The ^nnnn indicates the relevance weight given to any matches it finds, while the ~n.n indicates the number of matches that are required from each clause to consider the document a match. Matches in fields with higher boosts count more than matches in fields with lower boosts. You might also notice that full phrase matches are boosted the most, two-consecutive-term matches slightly less, and individual term matches the least. Furthermore, unstemmed field matches (those that have been modified the least by the indexer, such as in the field title_main_indexed_tp) get more boost than stemmed field matches. This provides the best of both worlds — you still get a match if you search for "blossom" instead of "blossoming," but if you had searched for "blossoming" the exact term match would boost the document's score in the results. Solr also considers how common the term is among all documents in the index, so that very common words like "the" don't boost the relevance score as much as less common words like "kalligraphie."
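For the curious, boosts like these usually come from edismax-style query parameters. The hash below is reverse-engineered from the parsed query above and is only an approximation of our actual Solr configuration: qf covers individual terms, pf the full phrase, pf2 two-word phrases, pf3 three-word phrases, and mm the minimum number of clauses that must match.

# Approximate parameters only; the authoritative values live in our Solr/Blacklight configuration.
title_search_params = {
  defType: 'edismax',
  q:   'Urbilder Blossom Kalligraphie',
  qf:  'title_main_indexed_tp^500 title_main_indexed_t^100 title_main_cjk_v^50',
  pf:  'title_main_indexed_tp^5000 title_main_indexed_t^600 title_main_cjk_v^150',
  pf2: 'title_main_indexed_tp^1000 title_main_indexed_t^200 title_main_cjk_v^75',
  pf3: 'title_main_indexed_tp^3000 title_main_indexed_t^350 title_main_cjk_v^100',
  mm:  '3' # all three query terms must match, per the "~3" in the parsed query
}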

I hope this provides some insight into what happens when you click search. Happy searching.

Building a new Staff Directory

The staff directory on the Library’s website was last overhauled in late 2014, which is to say that it has gotten a bit long in the tooth! For the past few months I’ve been working along with my colleagues Sean Aery, Tom Crichlow, and Derrek Croney on revamping the staff application to make it more functional, easier to use, and more visually compelling.

View of the legacy staff directory interface

Our work was to be centered around three major components — an admin interface for HR staff, an edit form for staff members, and the public display for browsing people and departments. We spent a considerable amount of time discussing the best ways to approach the infrastructure for the project. In the end we settled on a hybrid approach in which the HR tool would be built as a Ruby-on-Rails application, and we would update our existing custom Drupal module for staff editing and public UI display.

We created a seed file for our Rails app based on the legacy data from the old application and then got to work building the HR interface. We decided to rely on the Rails Admin gem, as it met most of our use cases and had worked well on some other internal projects. As we continued to add features, our database models became more and more complex, but working in Rails makes these kinds of changes very straightforward. We ended up with two main tables (People and Departments) and four auxiliary tables to store extra attributes (External Contacts, Languages, Subject Areas, and Trainings).
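A simplified sketch of those models is below; attribute details, validations, and the Rails Admin configuration are omitted, and the association shapes are an approximation of the real schema.

class Person < ApplicationRecord
  belongs_to :department
  has_many :external_contacts
  has_many :languages
  has_many :subject_areas
  has_many :trainings
end

class Department < ApplicationRecord
  has_ancestry # ancestry gem: stores the department hierarchy
  has_many :people
end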

View of the Rails Admin dashboard

We also made use of the Ancestry gem and the Nestable gem to allow HR staff to visually sort the department hierarchy. This makes it very easy to move departments around quickly using a visual approach, so the next time we have a large department reorganization it will be easy to represent the changes using this tool.
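Roughly, that hierarchy work looks like this with the Ancestry gem (the department names in the example are made up):

# With has_ancestry on the Department model, the whole tree can be arranged
# as nested hashes for the drag-and-drop (Nestable) UI.
tree = Department.arrange(order: :name)

# Re-parenting a department after a reorganization is a small update.
dept = Department.find_by(name: 'Example Department')
dept.parent = Department.find_by(name: 'Example Division')
dept.save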

Nestable gem allows for easy sorting of departments

After the HR interface was working well, we concentrated our efforts on the staff edit form in Drupal. We'd previously augmented the default Drupal profile editor with our extra data fields, but wanted to create a new form to make things cleaner and easier for staff to use. We created a new 'Staff Profile' tab and also included a link on the old 'Edit' tab that points to the new form. We're enabling staff to include their subject areas, preferred personal pronouns, and language expertise, and to tie into external services like ORCID and Libguides.

Edit form for Staff Profile

The public UI in Drupal is where most of our work has gone. We've created four approaches to browsing: Departments, A–Z, Subject Specialists, and Executive Group. There is also a name search that incorporates typeahead for helping users find staff more efficiently.

The Department view displays a nested view of our complicated organizational structure, which helps users understand how a given department relates to the others. You can also drill down through departments when you've landed on a department page.

View of departments

Department pages display all staff members therein and position managers at the top of the display. We also display the contact information for the department and link to the department website if it exists.

Example of a department page

The Staff A–Z list allows users to browse an alphabetized list of all staff in the library. One challenge we're still working through is staff photos. We are lacking photos for many of our staff, and many of the photos we do have are out of date and inconsistently formatted. We've included a default avatar for staff without photos to help with consistency, but the avatars also serve to highlight the number of staff without a photo. Stay tuned for improvements on this front!

A-to-Z browse

The Subject Specialists view helps in finding specific subject librarians. We include links to relevant research guides and appointment scheduling. We also have a text filter at the top of the display that can help quickly narrow the results to whatever area you are looking for.

Subject Specialists view

The Executive Group display is a quick way to view the leadership of the library.

Executive Group display

One last thing to highlight is the staff display view. We spent considerable effort refining this, and I think our work has really paid off. The display is clean and modern and a great improvement from what we had before.

View of staff profile in legacy application
View of the same profile in the new application

In addition to standard information like name, title, contact info, and department, we’re displaying:

  • a large photo of the staff person
  • personal pronouns
  • specialized trainings (like Duke’s P.R.I.D.E. program)
  • links out to ORCID, Libguides, and Libcal scheduling
  • customizable bio (with expandable text display)
  • language expertise
  • subject areas

Our plan is to roll out the new system at the end of the month, so you can look forward to a greatly improved staff directory experience soon!

Looking Ahead to MorphoSource 2.0

For the past year, developers in the Library's Software Services department have been working to rebuild Duke's MorphoSource repository for 3D research data. The current repository, available at www.morphosource.org, provides a place for researchers and curators to make scans of biological specimens available to other researchers and to the general public.

MorphoSource, first launched in 2013, has become the most popular website for virtual fossils in the world.  The site currently contains sixty thousand data sets representing twenty thousand specimens from seven thousand different species. In 2017, led by Doug Boyer in Duke Evolutionary Anthropology, the project received a National Science Foundation grant. Under this grant, the technical infrastructure for the repository will be moved to the Library’s management, and the user interface is being rebuilt using Hyrax, an open-source digital repository application widely implemented by libraries that manage research data.  The scope of the repository is being expanded to include data for cultural heritage objects, such as museum artifacts, architecture, and archaeological sites. Most importantly, MorphoSource is being improved with better performance, a more intuitive user experience, and expanded functionality for users to view and interact with the data within the site.

Viewing and manipulating CT scans and the derived 3D model of a platypus in the MorphoSource viewer

Management of 3D data is in itself complicated. It becomes even more so when striving towards long-term preservation of the digital representation of a unique biological specimen. In many cases, these specimens no longer exist, and the 3D data becomes the only record of their particular morphology. It's necessary to collect not only the actual digital files, but also extensive metadata describing both the data's creation and the specimen that was scanned. This can make the process of contributing data daunting for researchers. To improve the user experience and assist users with entering metadata about their files, MorphoSource 2.0 will guide them through the process. Users will be asked questions about their data: what it represents, when and how it was created, and whether it is a derivative of data already in MorphoSource. As they progress through making their deposit, the answers they provide will direct them through linking their deposit to records already in the repository, or help them with entering new metadata about the specimen that was scanned, the facility and equipment used to scan it, and any automated processes that were run to create the files.

Screenshot of a MorphoSource media page showing an alligator skull.

The new repository will also improve the experience for users exploring metadata about contributed resources and viewing the accompanying 3D files. All of the data describing technical information, acquisition and processing information, ownership and permissions, and related files will be gathered on one page, giving users the option to expand or collapse different metadata sections as their interests dictate. A file viewer will be embedded in the page as well, allowing full-screen viewing and providing several new tools for users analyzing the media. Besides being able to move and spin the model within the viewer, users can adjust lighting and other factors to focus on different areas of the model and take custom measurements of different points on the specimen. Most exciting, for CT image series, users can scroll through the images along three axes, or convert the images to a 3D model. For some data, users will also be able to share models by embedding the file viewer in a webpage.

The MorphoSource team is very excited about these improvements and plans to launch MorphoSource 2.0 in 2020. Stay tuned for the launch date, and in the meantime please visit the current site: www.morphosource.org.

Congratulations and farewell to Mike Adamo

This week, Digitization Specialist Mike Adamo will move on from Duke Libraries after 14 years to assume a new position as Digital Imaging Coordinator at the Libraries of Virginia Tech University. Mike has contributed so much to our Digital Collections program during his tenure, providing years of uncompromising still imaging services, stewardship of the Digital Production Center in times of change, and leadership of, and then years of service on, our Digital Collections Implementation Team. He has also been the lead digitization specialist on some of our best-known digital collections, like the Hugh Mangum photographs, the James Karales photographs, and the William Gedney collection.

In addition, Mike has been a principal figure on our Multispectral Imaging Team and has been invaluable to our development of this service for the library. He established the setup and led all MSI imaging sessions; collaborated cross-departmentally with other members on the MSI Team to vet requests and develop workflows; and worked with vendors and other MSI practitioners to develop best practices, documentation, and a preservation plan and service model for MSI services at Duke Libraries. He’s also provided maintenance for our MSI equipment, researching options for additional equipment as our program grew.

Side by side comparison of a papyri item under natural light and the same item after multispectral imaging and processing.

We are grateful to Mike for his years of dedication to the job and to the field of cultural heritage digitization, as well as for the instrumental role he's played in developing MSI Services at DUL. We offer a huge thank you to Mike for his work and wish him well in his future position!

Post contributed by Giao Luong Baker and Erin Hammeke

Features and Gaps and Bees, Oh My!

Since my last post about our integrated library system (ILS), there have been a few changes. First, my team is now the Library Systems and Integration Support Department. We've also added three business analysts to our team, and we have a developer coming on board this summer. We continue to work on FOLIO as a replacement for our current ILS. So what work are we doing on FOLIO?

FOLIO is a community-sourced product. There are currently more than 30 institutions, over a dozen developer organizations, and vendors such as EBSCO and IndexData involved. The members of the community come together in Special Interest Groups (SIGs). The SIGs discuss what functionality and data is needed, write the user stories, and develop workflows so the library staff will be able to do their tasks. There are ten main SIGs, an Implementation Group, and Product and Technical Councils. Here at Duke, we have staff from all over the libraries involved in the SIGs. They speak up to be sure the product will work for Duke Libraries.

Features

The institutions planning to implement FOLIO in Summer 2020 spent April ranking 468 open features. They needed to choose whether each feature was needed at the time the institution planned to go live, or whether it could wait to be added later (one quarter or one year after go-live). Duke voted for 62% of the features to be available at the time we go live with FOLIO. These features include things like default reports, user experience enhancements, and more detailed permission settings, to name a few.

Gaps

After the feature prioritization was complete, we conducted a gap analysis. The gap analysis required our business analysts to take what they'd learned from conducting interviews with library staff across the University and compare it to what FOLIO can currently do and what is planned. The Duke Libraries staff who have been active on the SIGs were extremely helpful in identifying gaps. Some feature requests that came out of the gap analysis included making sure a user record has an expiration date associated with it. Another was being able to re-print notices to patrons. Others had to do with workflow; for example, making sure that when a holdings record is "empty" (no items attached), an alert is sent so a staff person can decide whether to delete the empty record.

Bees?

So where do the bees come into all of this? Well, the logo for FOLIO ("the future of libraries is open") includes a bee!

The release names and logos are flowers. And we’re working together in a community toward a single goal – a new Library Services Platform that is community-sourced and works for the future of libraries.

Learn more about FOLIO@Duke by visiting our site: https://sites.duke.edu/folioatduke/. We’ve posted newsletters, presentations, and videos from the FOLIO project team.

FOLIO release badges: Aster (January 2019), Bellis (April 2019), Clover (May 2019), and Daisy (October 2019).

A simple tool with a lot of power: Project Estimates

It takes a lot to build and publish digital collections, as you can see from the variety and scope of the blog posts here on Bitstreams.  We all have our internal workflows and tools we use to make our jobs easier and more efficient.  The number and scale of activities going on behind the scenes are mind-boggling, and we would never be able to do as much as we do if we didn't continually refine our workflows and create tools and systems that help manage our data and work.  Some of these tools are big, like the Duke Digital Repository (DDR), with its public, staff, and backend interfaces used to preserve, secure, and provide access to digital resources, while others are small, like scripts built to transform ArchivesSpace output into starter digitization guides.  In the Digital Production Center (DPC) we use a homegrown tool that not only tracks production statistics but is also used to make project projections and to help isolate problems that occur during the digitization process.  This tool is a relational database, affectionately named the Daily Work Report, that has collected over 9 years of data on nearly every project in that time.

A long time ago, in a newly minted DPC, supervisors and other Library staff often asked me, "How long will that take?", "How many students will we need to digitize this collection?", "What will the data footprint of this project be?", "How fast does this scanner go?", "How many scans did we do last year?", "How many items is that?"  While I used to provide general information and anecdotal evidence to answer all of these questions, along with some manual hunting down of numbers, it became more and more difficult to do so as the number of projects multiplied, our services grew, the number of capture devices increased, and the types of projects expanded to include preservation projects, donor requests, patron requests, and exhibits.  Answering these seemingly simple questions became more complicated and time consuming as the department grew.  I thought to myself, I need a simple way to track the work being done on these projects that would help me answer these recurring common questions.

We were already using a FileMaker Pro database with a GUI interface as a checkout system to assign students batches of material to scan, but it was only tracking what student worked on what material.  I decided I could build out this concept to include all of the data points needed to answer the questions above.  I decided to use Microsoft Access because it was a common tool installed on every workstation in the department, I had used it before, and classes and instructional videos abound if I wanted to do anything fancy.

Enter the Daily Work Report (DWR).  I created a number of discrete tables to hold various types of data: project names, digitization tasks, employee names, and so on.  These fields are connected to a datasheet represented as a form, which allows for dropdown lists and auto-filling for rapid and consistent entry of information.

At the end of each shift students and professionals alike fill out the DWR for each task they performed on each project and how long they worked on each task.  These range from the obvious tasks of scanning and quality control to more minute tasks of derivative creation, equipment cleaning, calibration, documentation, material transfer, file movement, file renaming, ingest prep, and ingest.

Some of these tasks may seem minor and possibly too insignificant to record but they add up.  They add up to ~30% of the time it takes to complete a project.   When projecting the time it will take to complete a project we collect Scanning and Quality Control data from a similar project, calculate the time and add 30%.

Common Digitization Tasks

Task | Hours | Overall % of project
Scanning | 406.5 | 57.9
Quality Control 1 | 133 | 19
Running Scripts | 24.5 | 3.5
Collection Analysis | 21 | 3
Derivative Creation | 20.5 | 2.9
File Renaming | 15.5 | 2.2
Material Transfer | 14 | 2
Testing | 12.5 | 1.8
Documentation | 10 | 1.4
File Movement | 9.75 | 1.4
Digitization Guide | 7 | 1
Quality Control 2 | 6.75 | 1
Training | 6 | 0.9
Quality Control 3 | 5.5 | 0.9
Stitching | 3 | 0.4
Rescanning | 1.5 | 0.2
Finalize | 1.5 | 0.2
Troubleshooting | 1.5 | 0.2
Conservation Consultation | 1 | 0.1
Total | 701 | 100

New Project Estimates

Using the Daily Work Report’s Datasheet View, the database can be filtered by project, then by the “Scanning” task to get the total number of scans and the hours worked to complete those scans.  The same can be done for the Quality Control task.  With this information the average number of scans per hour can be calculated for the project and applied to the new project estimate.

Gather information from an existing project that is most similar to the project you are creating the estimate for.  For example, if you need to develop an estimate for a collection of bound volumes that will be captured on the Zeutschel you should find a similar collection in the DWR to run your numbers.

Gather data from an existing project:

Scanning

  • Number of scans = 3,473
  • Number of hours = 78.5
  • 3,473/78.5 = 44.2/hr

Quality Control

  • Number of scans = 3,473
  • Number of hours = 52.75
  • 3,473/52.75 = 65.8/hr

Apply the per-hour rates to the new project (a scripted version of this calculation follows the numbers below):

Estimated number of scans: 7,800

  • Scanning: 7,800 / 44.2/hr = 176.5 hrs
  • QC: 7,800 / 68.8/hr = 113.4 hrs
  • Total: 290 hrs
  • + 30%: 87 hrs
  • Grand Total: 377 hrs
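Wrapped up as code, the calculation looks something like the sketch below. It's a hypothetical helper, not part of the Daily Work Report itself, and it reproduces the numbers above (with small differences due to rounding).

# Rates come from a comparable completed project; ~30% overhead covers the
# smaller supporting tasks listed in the table above.
def project_estimate(scan_count, scan_rate_per_hr, qc_rate_per_hr, overhead: 0.30)
  scanning = scan_count / scan_rate_per_hr.to_f
  qc       = scan_count / qc_rate_per_hr.to_f
  subtotal = scanning + qc
  {
    scanning_hrs: scanning.round(1),
    qc_hrs:       qc.round(1),
    subtotal_hrs: subtotal.round(1),
    total_hrs:    (subtotal * (1 + overhead)).round(1)
  }
end

project_estimate(7_800, 44.2, 68.8)
# => {:scanning_hrs=>176.5, :qc_hrs=>113.4, :subtotal_hrs=>289.8, :total_hrs=>376.8}
# (the hand calculation above rounds these to 290 and 377 hours)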

Rolling Production Rate

When an update is required for an ongoing project, the Daily Work Report can be used to see how much has been done and to calculate how much longer it will take.  The number of images scanned in a collection can be found by filtering by project, then by the "Scanning" task.  That number can then be subtracted from the total number of scans in the project.  Then, using a similar project to the one above, you can calculate the production rate for the project and estimate the number of hours it will take to complete it.

Scanning

  • Number of scans in the project = 7,800
  • Number of scans completed = 4,951
  • Number of scans left to do = 7,800 – 4,951 = 2,849

Scanning time to completion

  • Number of scans left = 2,849
  • 2,849/42.4/hr = 67.2 hrs

Quality Control

  • Number of files to QC in the project = 7,800
  • Number of files completed = 3,712
  • Number of files left to do = 7,800 – 3,712 = 4,088

QC hours to completion

  • Number of files left to QC = 4,088
  • 4,088/68.8 = 59.4 hrs

The amount of time left to complete the project

  • Scanning – 67.2 hrs
  • Quality Control – 59.4 hrs
  • Total = 126.2 hrs
  • + 30% = 38 hrs
  • Grand Total = 164.2 hrs

Isolate an error

Errors inevitably occur during most digitization projects.  The DWR can be used to identify how widespread an error is through a combination of filtering, the digitization guide (an inventory of images captured, along with other metadata about the capture process), and inspection of the images.  As an example, a set of files may be found to have no color profile.  The digitization guide can be used to identify the day the erroneous images were created and who created them. The DWR can then be used to filter by scanner operator and date to see if the error is isolated to a particular person, a particular machine, or a particular day.  This information can then be used to filter by the same variables across collections to see if the error exists elsewhere.  The results of this search can facilitate retraining and recalibration of capture devices, and can also identify groups of images that need to be rescanned without having to comb through an entire collection.
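Expressed as code, that kind of filtering might look like the sketch below; the rows, names, and dates are invented, and in practice this happens through the DWR's own filters rather than a script.

require 'date'

# Invented rows standing in for a DWR export.
rows = [
  { project: 'Example Papers',   task: 'Scanning', operator: 'Student A', date: Date.new(2019, 3, 4) },
  { project: 'Example Papers',   task: 'Scanning', operator: 'Student B', date: Date.new(2019, 3, 4) },
  { project: 'Other Collection', task: 'Scanning', operator: 'Student A', date: Date.new(2019, 3, 6) }
]

# Did the suspect operator scan anything else during the same window?
window  = Date.new(2019, 3, 1)..Date.new(2019, 3, 8)
suspect = rows.select do |r|
  r[:task] == 'Scanning' && r[:operator] == 'Student A' && window.cover?(r[:date])
end

puts suspect.map { |r| r[:project] }.uniq
# => Example Papers
#    Other Collection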

While I’ve only touched on the uses of the Daily Work Report, we have used this database in many different ways over the years.  It has continued to answer those recurring questions that come up year after year.  How many scans did we do last year?  How many students worked on that multiyear project?  How many patron requests did we complete last quarter?  This database has helped us do our estimates, isolate problems and provide accurate updates over the years.  For such a simple tool it sure does come in handy.

Mythical Beasts of Audio

Gear. Kit. Hardware. Rig. Equipment.

In the audio world, we take our tools seriously, sometimes to an unhealthy and obsessive degree. We give them pet names, endow them with human qualities, and imbue them with magical powers. In this context, it’s not really strange that a manufacturer of professional audio interfaces would call themselves “Mark of the Unicorn.”

Here at the Digital Production Center, we recently upgraded our audio interface to a MOTU 896 mk3 from an ancient (in tech years) Edirol UA-101. The audio interface, which converts analog signals to digital and vice-versa, is the heart of any computer-based audio system. It controls all of the routing from the analog sources (mostly cassette and open reel tape decks in our case) to the computer workstation and the audio recording/editing software. If the audio interface isn’t seamlessly performing analog to digital conversion at archival standards, we have no hope of fulfilling our mission of creating high-quality digital surrogates of library A/V materials.

The Edirol enjoying its retirement with some other pieces of kit

While the Edirol served us well from the very beginning of the Library's forays into audio digitization, it had recently begun to cause issues resulting in crashes, restarts, and lost work. Given that the Edirol is over 10 years old and has been discontinued, it's not surprising that it would eventually fail to keep up with continued OS and software updates. After re-assessing our needs and doing a bit of research, we settled on the MOTU 896 mk3 as its replacement. The 896 had the input, output, and sync options we needed, along with plenty of other bells and whistles.

I’ve been using the MOTU for several weeks now, and here are some things that I’m liking about it:

  • Easy installation of drivers
  • Designed to fit into standard audio rack
  • Choice of USB or Firewire connection to PC workstation
  • Good visual feedback on audio levels, sample rate, etc. via LED meters on front panel
  • Clarity and definition of sound
The MOTU sitting atop the audio tower

I haven’t had a chance to explore all of the additional features of the MOTU yet, but so far it has lived up to expectations and improved our digitization workflow. However, in a production environment such as ours, each piece of equipment needs to be a workhorse that can perform its function day in and day out as we work our way through the vaults. Only time can tell if the Mark of the Unicorn will be elevated to the pantheon of gear that its whimsical name suggests!

Bringing 500 Years of Women’s Work Online

Back in 2015, Lisa Unger Baskin placed her extensive collection of more than 11,000 books, manuscripts, photographs, and artifacts in the Sallie Bingham Center for Women's History & Culture in the David M. Rubenstein Rare Book & Manuscript Library. In late February of 2019, the Libraries opened the exhibit "Five Hundred Years of Women's Work: The Lisa Unger Baskin Collection," presenting visitors with a first look at the diversity and depth of the collection, revealing the lives of women both famous and forgotten, and recognizing their accomplishments. I was fortunate to work on the online component of the exhibit, in which we aimed to offer an alternate way to interact with the materials.

Homepage of the exhibit

Most of the online exhibits I have worked on have not had the benefit of a long planning timeframe, which usually means we have to be somewhat conservative in our vision for the end product. However, with this high-profile exhibit, we did have the luxury of a (relatively) generous schedule and as such we were able to put a lot more thought and care into the planning phase. The goal was to present a wide range and number of items in an intuitive and user-friendly manner. We settled on the idea of arranging items by time period (items in the collection span seven centuries!) and highlighting the creators of those items.

We also decided to use Omeka (classic!) for our content management system as we’ve done with most of our other online exhibits. Usually exhibit curators manually enter the item information for their exhibits, which can get somewhat tedious. In this case, we were dealing with more than 250 items, which seemed like a lot of work to enter one at a time. I was familiar with the CSV Import plugin for Omeka, which allows for batch uploading items and mapping metadata fields. It seemed like the perfect solution to our situation. My favorite feature of the plugin is that it also allows for quickly undoing an ingest in case you discover that you’ve made a mistake with mapping fields or the like, which made me less nervous about applying batch uploads to our production Omeka instance that already contained about 1,100 items.
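To give a sense of what the batch-upload spreadsheet looked like, here is a small sketch that writes a CSV for the plugin; the column headings and item values are illustrative stand-ins rather than our actual field mapping.

require 'csv'

# Illustrative item data only.
items = [
  {
    'Dublin Core:Title'   => 'Example item title',
    'Dublin Core:Creator' => 'Example Creator',
    'Dublin Core:Date'    => '1893',
    'Time Period'         => '19th century', # custom item-type field
    'Files'               => 'https://example.org/images/example-item.jpg'
  }
]

CSV.open('exhibit_items.csv', 'w') do |csv|
  csv << items.first.keys
  items.each { |item| csv << item.values }
end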

Metadata used for batch upload

Working with the curators, we came up with a data model that would nest well within Omeka's default Dublin Core-based approach and expanded it with a few extra non-standard fields attached to a new custom item type. We then assembled a small sample set of data in spreadsheet form, and I worked on spinning up a local instance of Omeka to test and make sure our approach was actually going to work! After some frustrating moments with MAMP and tracking down strange paths to things like ImageMagick (thank you eternally, Stack Overflow!), I was able to get things running well and was convinced that batch uploads via spreadsheet were a good approach.

Now that we had a process in place, I began work on a custom theme to use with the exhibit. I’d previously used Omeka Foundation (a grid-based starter theme using the Zurb Foundation CSS framework) and thought it seemed like a good place to start with this project. The curators had a good idea of the site structure that they wanted to use, so I jumped right in on creating some high-fidelity mockups borrowing look-and-feel cues from the beautiful print catalog that was produced for the exhibit. After a few iterations we arrived at a place where everyone was happy and I started to work on functionality. I also worked on incorporating a more recent version of the Foundation framework as the starter theme was out of date.

Print catalog for the exhibit

The core feature of the site would be the ability to browse all of the items we wanted to feature via the Explore menu, which we broke into seven sections — primarily by time period, but also by context. After looking at some other online exhibit examples that I thought were successful, we decided to use a masonry layout approach (popularized by sites like Pinterest) to display the items. Foundation includes a great masonry plugin that was very easy to implement. Another functionality issue had to do with displaying multi-page items. Out of the box, I think Omeka doesn’t do a great job displaying items that contain multiple images. I’ve found combining them into PDFs works much better, so that’s what we did in this case. I also installed the PDF Embed plugin (based on the PDF.js engine) in order to get a consistent experience across browsers and platforms.

Once we got the theme to a point that everyone was happy with it, I batch imported all of the content and proceeded with a great deal of cross-platform testing to make sure things were working as expected. We also spent some time refining the display of metadata fields and making small tweaks to the content. Overall I’m very pleased with how everything turned out. User traffic has been great so far so it’s exciting to know that so many people have been able to experience the wonderful items in the exhibit. Please check out the website and also come visit in person — on display until June 15, 2019.

Examples of 'Explore' and 'Item' pages

It Takes a Village to Curate Your Data: Duke Partners with the Data Curation Network

In early 2017, Duke University Libraries launched a research data curation program designed to help researchers on campus ensure that their data are adequately prepared for both sharing and publication, and long term preservation and re-use. Why the focus on research data? Data generated by scholars in the course of their investigation are increasingly being recognized as outputs similar in importance to the scholarly publications they support. Open data sharing reinforces unfettered intellectual inquiry, fosters transparency, reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. For these reasons, a growing number of publishers and funding agencies like PLoS ONE and the National Science Foundation are requiring researchers to make openly available the data underlying the results of their research.

Data curation steps

But data sharing can only be successful if the data have been properly documented and described. And they are only useful in the long term if steps have been taken to mitigate the risks of file format obsolescence and bit rot. To address these concerns, Duke’s data curation workflow will review a researcher’s data for appropriate documentation (such as README files or codebooks), solicit and refine Dublin Core metadata about the dataset, and make sure files are named and arranged in a way that facilitates secondary use. Additionally, the curation team can make suggestions about preferred file formats for long-term re-use and conduct a brief review for personally identifiable information. Once the data package has been reviewed, the curation team can then help researchers make their data available in Duke’s own Research Data Repository, where the data can be licensed and assigned a Digital Object Identifier, ensuring persistent access.

 

“The Data Curation Network (DCN) serves as the “human layer” in the data repository stack and seamlessly connects local data sets to expert data curators via a cross-institutional shared staffing model.”

 

New to Duke's curation workflow is the ability to rely on the domain expertise of our colleagues at a few other research institutions. While our data curators here at Duke possess a wealth of knowledge about general research data-related best practices, and are especially well-versed in the vagaries of social sciences data, they may not always have all the information they need to sufficiently assess the state of a dataset from a researcher. As an answer to this problem, the Data Curation Network, an Alfred P. Sloan Foundation-funded endeavor, has established a cross-institutional staffing model that distributes the domain expertise of each of its partner institutions. Should a curator at one institution encounter data of a kind with which they are unfamiliar, submission to the DCN opens up the possibility for enhanced curation from a network partner with the requisite knowledge.

DCN Partner Institutions

Duke joins Cornell University, Dryad Digital Repository, Johns Hopkins University, University of Illinois, University of Michigan, University of Minnesota, and Pennsylvania State University in partnering to provide curatorial expertise to the DCN. As of January of this year, the project has moved out of its pilot phase into production and is actively moving data through the network. If a Duke researcher submits a dataset our curation team thinks would benefit from further examination by a curator with domain knowledge, we will now reach out to the potential depositor for clearance to submit the data to the network. We're very excited about this opportunity to provide this enhancement to our service!

Looking forward, the DCN hopes to expand their offerings to include nation-wide training on specialized data curation and to extend the curation services the network offers beyond the partner institutions to individual end users. Duke looks forward to contributing as the project grows and evolves.

Sustainability Planning for a Better Tomorrow

In March of last year I wrote about efforts of the Resource Discovery Systems and Strategies team (RDSS, previously called the Discovery Strategy Team) to map Duke University Libraries’ discovery system environment in a visual way. As part of this project we created supporting documentation for each system that appeared in a visualization, including identifying functional and technical owners as well as links to supporting documentation. Gathering this information wasn’t as straightforward as it ideally should have been, however. When attempting to identify ownership, for example, we were often asked questions like, “what IS a functional owner, anyway?”, or told “I guess I’m the owner… I don’t know who else it would be”. And for many systems, local documentation was outdated, distributed across platforms, or simply nonexistent.

As a quick glance through the Networked Discovery Systems document will show, we work with a LOT of different systems here at DUL, supporting a great breadth of processes and workflows. And we've been steadily adding to the list of systems we support every year, without necessarily articulating how we will manage the ever-growing list. This has led to situations of benign neglect, confusion about roles and responsibilities, and, in a few cases, hanging onto systems for too long because we hadn't defined a plan for responsible decommissioning.

So, to promote the healthier management of our Networked Discovery Systems, the RDSS team developed a set of best practices for sustainability planning. Originally we framed this document as best practices for maintenance planning, but in conversations with other groups in the Libraries, we realized that this didn’t quite capture our intention. While maintenance planning is often considered from a technical standpoint, we wanted to convey that the responsible management of our systems involves stakeholders beyond just those in ITS, to include the perspective and engagement of non-technical staff. So, we landed on the term sustainability, which we hope captures the full lifecycle of a system in our suite of tools, from implementation, through maintenance, to sunsetting, when necessary.

The best practices are fairly short, intended to be a high-level guide rather than overly prescriptive, recognizing that every system has unique needs. Each section of the framework is described, and key terms are defined. Functional and technical ownership are described, including the types of activities that may attend each role, and we acknowledge that ownership responsibilities may be jointly accomplished by groups or teams of stakeholders. We lay out the following suggested framework for developing a sustainability plan, which we define as “a living document that addresses the major components of a system’s life cycle”:

  • Governance:
    • Ownership
    • Stakeholders
    • Users
  • Maintenance:
    • System Updates
    • Training
    • Documentation
  • Review:
    • Assessments
    • Enhancements
    • Sunsetting

Interestingly, and perhaps tellingly, many of the conversations we had about the framework ended up focusing on the last part – sunsetting. How to responsibly decommission or sunset a system in a methodical, process-oriented way is something we haven’t really tackled yet, but we’re not alone in this, and the topic is one that is garnering some attention in project management circles.

So far, the best practices have been used to create a sustainability plan for one of our systems, Dukespace, and the feedback was positive. We hope that these guidelines will facilitate the work we do to sustain our systems, and in so doing lead to better communication and understanding throughout the organization. And we didn't forget to create a sustainability plan for the best practices themselves – the RDSS team has committed to reviewing and updating it at least annually!