This week, my colleague Will Sexton and I (as well as several other Duke folks) are attending the Digital Library Federation conference in beautiful Vancouver, British Columbia. While here, we presented a poster on our work to assess scholarly use of digital collections. Please have a look at our poster below.
If you are interested in learning more about our assessment project, check out these previous blog posts:
Sean Aery, Digital Projects Developer, Duke; Rachel Ingold, Curator for the History of Medicine Collections, Duke
Duke’s Digital Collections program recently published a remarkable set of 16th-17th century anatomical fugitive sheets from The Rubenstein Library’s History of Medicine Collections. These illustrated sheets are similar to broadsides, but feature several layers of delicate flaps that lift to show inside the human body. The presenters will discuss the unique challenges posed by the source material including conservation, digitization, description, data modeling, and UI design. They will also demonstrate the resulting digital collection, which has already earned several accolades for its innovative yet elegant solutions for a project with a high degree of complexity.
On October 27-29, librarians, archivists, developers, project managers, and others met for the Digital Library Federation (DLF) Forum in Atlanta, GA. The program was packed to the gills with outstanding projects and presenters, and several of us from Duke University Libraries were fortunate enough to attend. Below is a roundup of notes summarizing interesting sessions, software tools, projects and collections we learned about at the conference.
Please note that these notes were written by humans listening to presentations and mistakes are inevitable. Click the links to learn more about each tool/project or session straight from the source.
Tools and Technology
Spotlight is an open-source tool for featuring digitized resources and is being developed at Stanford University. It appears to have fairly similar functionality to Omeka, but is integrated into Blacklight, a discovery interface used by a growing number of libraries.
There were two short presentations about media walls; one from our friends in Raleigh at the Hunt Library at N.C. State University, and the second from Georgia State. Click the links to see just how much you can do with an amazing media wall.
The California Digital Library (CDL) is redesigning and reengineering their digital collections interface to create a kind of mini-Digital Public Library of America just for University of California digital collections. They are building the project on a platform called Nuxeo and storing their data with Amazon Web Services. The new interface and platform development is highly informed by user studies done on the existing Calisphere digital collections interface.
Emblematica Online is a collection of digitized emblem books contributed by several global institutions including Duke. The collection is hosted by the University of Illinois at Urbana-Champaign. The project has been conducting user studies and hopes to publish them in the coming year.
The Indiana University Media Digitization and Preservation Initiative started in 2009 with a survey of all the audio and visual materials on campus. In 2011, the initiative proposed digitizing all rare and unique audio and video items within a 15-year period. However, in 2013, the President of the University said that the campus would commit to completing the project in a 7-year period. To accomplish this ambitious goal, the university formed a public-private partnership with Memnon Archiving Services of Brussels. The university estimates that it will create over 9 petabytes of data. The initiative has been in the planning phases and should be ramping up in 2015.
The Project Managers group within DLF organized a session on “Cultivating a Culture of Project Management” followed by a working lunch. Representatives from Johns Hopkins and Brown talked about implementing Agile Methodology for managing and developing technical projects. Both libraries spoke positively about moving towards Agile, and the benefits of clear communication lines and defined development cycles. A speaker from Temple University discussed her methods for tracking and communicating the capacity of her development team; her spreadsheet for doing so took the session by storm (I’m not exaggerating – check out Twitter around the time of this session). Two speakers from the University of Michigan shared their work in creating a project management special interest group within their library to share PM skills, tools and heartaches.
A session entitled “Beyond the Digital Surrogate” highlighted the work of several projects that are using digitized materials as a starting point for text mining and visualizing data. First, many of UNC’s Documenting the American South collections are available as a text download. Second, a tool out of Georgia Tech supports interactive exploration and visualization of text-based archives. Third, a team from the University of Nebraska-Lincoln is developing methods for using visual information to support discovery and analysis of digital collections.
The session “Moving Forward with Digital Library Assessment” was based around the need to strategically focus our assessment efforts in digital libraries and to better understand and measure the value, impact, and associated costs of what we do.
The first phase exposed areas for potential standardization. The community then collectively prioritized those potential projects, and the second phase is now developing those best practices. A working group has been formed, and its recommendations are due in June 2016.
Fifty years ago, hundreds of student volunteers headed south to join the Student Nonviolent Coordinating Committee’s (SNCC) field staff and local people in their fight against white supremacy in Mississippi. This week, veterans of Freedom Summer are gathering at Tougaloo College, just north of Jackson, Mississippi, to commemorate their efforts to remake American democracy.
The 50th anniversary events, however, aren’t only for movement veterans. Students, young organizers, educators, historians, archivists, and local Mississippians make up the nearly one thousand people flocking to Tougaloo’s campus this Wednesday through Saturday. We here at Duke Libraries, as well as members of the SNCC Legacy Project Editorial Board, are in the mix, making connections with both activists and archivists about our forthcoming website, One Person, One Vote: The Legacy of SNCC and the Fight for Voting Rights.
This site will bring together material created in and around SNCC’s struggle for voting rights in the 1960s and pair it with new interpretations of that history by the movement veterans themselves. To pull this off, we’ll be drawing on Duke’s own collection of SNCC-related material and incorporating the wealth of material already digitized by institutions like the University of Southern Mississippi, the Wisconsin Historical Society’s Freedom Summer Collection, the Mississippi Department of Archives and History, and others.
What becomes clear while circling through the panels, films, and hallway conversations at Freedom Summer 50th events is how the fight for voting rights is really a story of thousands of local people. The One Person, One Vote site will feature these everyday people – Mississippians like Peggy Jean Connor, Fannie Lou Hamer, Vernon Dahmer, and SNCC workers like Hollis Watkins, Bob Moses, and Charlie Cobb. And the list goes on. It’s not every day that so many of these people come together under one roof, and we’re doing our share of listening to and connecting with the people whose stories will make up the One Person, One Vote site.
On Tuesday, April 8, I had the honor of presenting at the annual meeting of the Society of North Carolina Archivists with representatives from Wake Forest University and Davidson College. The focus of our panel was to present alternatives to CONTENTdm, a system for displaying digital collections widely used by libraries. At Duke, we have developed our own Tripod interface to digital collections. Wake Forest and Davidson use a variety of tools, most notably DSpace and Islandora (via Lyrasis), respectively. It was great to present with and learn more about the Wake Forest and Davidson programs! I’ve embedded slides from all three speakers below.
Over the past year and a half, among our many other projects, we have been experimenting with a creative new approach to powering searches within digital collections and finding aids using Google’s index of our structured data. My colleague Will Sexton and I have presented this idea in numerous venues, most recently and thoroughly for a recorded ASERL (Association of Southeastern Research Libraries) webinar on June 6, 2013.
We’re eager to share what we’ve learned to date and hope this new blog will make a good outlet. We’ve had some success, but have also encountered some considerable pitfalls along the way.
What We Set Out to Do
I won’t recap all the fine details of the project here, but in a nutshell, here are the problems we’ve been attempting to address:
Maintaining our own Solr index takes a ton of time to do right. We don’t have a ton of time.
Staff have noted poor relevance rank and poor support for search using non-Roman characters.
Our digital collections search box is actually used sparsely (in only 12% of visits).
External discovery (e.g., via Google) is of equal or greater importance vs. our local search for these “inside-out” resources.
Get Google to index all of our embedded structured data
Use Google’s index of our structured data to power our local search for finding aids & digital collections
Where We Are Today
We mapped several of our metadata fields to schema.org terms, then embedded that schema.org data in all 74,000 digital object pages and all 2,100 finding aids. We’re now using Google’s index of that data to power our default search for:
Embedding the Data. We kept it super simple here. We labeled every finding aid page a ‘CollectionPage’ and tagged only a few properties: name, description, creator, and if present, a thumbnailUrl for a collection with digitized content.
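As a rough illustration of what that minimal markup could look like, here is a sketch that renders a finding aid header with RDFa Lite attributes. The function name, the HTML layout, and the sample values are assumptions made up for this example; only the CollectionPage type and the four property names come from the description above.

```python
from html import escape
from typing import Optional


def render_collection_page(name: str, description: str, creator: str,
                           thumbnail_url: Optional[str] = None) -> str:
    """Return an HTML fragment tagged as a schema.org CollectionPage.

    Hypothetical helper: the element choices are illustrative, not the
    markup actually used on the finding aid pages.
    """
    lines = [
        '<div vocab="http://schema.org/" typeof="CollectionPage">',
        f'  <h1 property="name">{escape(name)}</h1>',
        f'  <p property="description">{escape(description)}</p>',
        f'  <span property="creator">{escape(creator)}</span>',
    ]
    # thumbnailUrl is only emitted for collections with digitized content
    if thumbnail_url:
        lines.append(
            f'  <img property="thumbnailUrl" src="{escape(thumbnail_url)}" alt="">'
        )
    lines.append('</div>')
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample values are placeholders, not real collection data
    print(render_collection_page(
        "Example Family Papers",
        "Correspondence & photographs.",
        "Example, Jane",
        thumbnail_url="https://example.org/thumb.jpg",
    ))
```

Keeping the property list this short means one small template change covers every finding aid.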
Rendering Search Results Using Google’s Index.
Digital Collections Search: Sidney D. Gamble Collection
Embedding the Data.
Our digital collections introduce more complexity in the structured data than we see in our finding aids. Naturally, we have a wide range of item types with diverse metadata. We want our markup to represent the relationship of an item to its source collection. The item, the webpage that it’s on, the collection it came from, and the media files associated with it all have properties that can be expressed using schema.org terms. So, we tried it all.
Rendering Search Results Using Google’s Index.
For the Gamble collection, we succeeded in making queries hit Google’s XML API while sustaining the look of our existing search results. Note that the facets on the left side aren’t powered via Google–we haven’t gotten far enough in our experiment to work with filtering the result set based on the structured data, but that’s possible to do.
We’ve been pleased with the ability to make our own rich snippets and highly customize the appearance of search results without having to do a ton of development. Getting our structured data back from Google’s index to work with is an awesome service and developing around the schema.org properties that we were already providing has been a nice way to kill two birds with one stone.
For performance, Google CSE is working well in both the finding aids and the Gamble digital collection search for these purposes:
getting the most relevant content presented early on in the search result
getting results quickly
handling non-Roman characters in search terms
retrieving a needle in a haystack — an item or handful of items that contain some unique text
While Google CSE shows relevant results quickly, we’re finding it’s not a good fit for exploratory searching when either of these aspects is important:
getting a stable and precise count of relevant results
browsing an exhaustive list of results that match a general query
For queries with several pages of hits, you may get an estimated result count that’s close, but unfortunately things occasionally and inexplicably go sour as you navigate from one result page to the next. E.g., the Gamble digital collection query ‘beijing‘ shows about 2,100 results (which is in the ballpark of what Solr returns), yet browse a few pages in and the result set will get truncated severely: you may only be able to actually browse about 200 of the results without issuing more specific query terms.
Impact on External Discovery
Traffic to digital collections via external search engines has mostly climbed steadily every quarter for the past few years, from 26% of all visits in Jul-Sep 2011 up to 44% in Jan-Mar 2014 (to date). We entered schema.org tags in Oct 2012; however, we don’t know whether adding that data has contributed at all to this trend. Does schema.org data impact relevance? It’s hard to tell.
Structured Data Syntax + Google APIs
Rich Snippets in Big Google
We’re seeing Google render rich snippets for our videos, because we’ve marked them as schema.org VideoObjects with properties like thumbnailUrl. That’s encouraging! Perhaps someday Google will render better snippets for things like photographs (of which we have a bunch), or maybe even more library domain-specific materials like digitized oral histories, manuscripts, and newspapers. But at present, none of our other objects seem to trigger nice snippets like this.
We represented item pages as schema.org “ItemPage” types using the “isPartOf” property to relate the item page to its corresponding “CollectionPage”. We made the ItemPage “about” a “CreativeWork”. Then we created mappings for many of our metadata fields to CreativeWork properties, e.g., creator, contentLocation, genre, dateCreated.
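The nesting described here can be sketched as follows. This is a minimal, assumed rendering in RDFa Lite: the function name, the example URL, and the choice of link and span elements are hypothetical, and the collection is referenced only by URL rather than described in full.

```python
from html import escape
from typing import Dict


def render_item_page(collection_url: str, work: Dict[str, str]) -> str:
    """Return an ItemPage fragment that is part of a collection page and
    is about a CreativeWork carrying the mapped metadata fields.

    Hypothetical helper for illustration only.
    """
    # One element per CreativeWork property (creator, genre, and so on)
    props = "\n".join(
        f'    <span property="{escape(prop)}">{escape(value)}</span>'
        for prop, value in work.items()
    )
    return (
        '<div vocab="http://schema.org/" typeof="ItemPage">\n'
        f'  <link property="isPartOf" href="{escape(collection_url)}">\n'
        '  <div property="about" typeof="CreativeWork">\n'
        f'{props}\n'
        '  </div>\n'
        '</div>'
    )


if __name__ == "__main__":
    print(render_item_page(
        "https://example.org/collections/gamble/",  # placeholder URL
        {"creator": "Sidney Gamble", "contentLocation": "Beijing (China)"},
    ))
```

Note that the CreativeWork sits one level below the ItemPage, which matters for how much of it a search API can see.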
Digital Collections External Search Traffic by Quarter

Quarter           Visits via Search   % Visits via Search
Jul – Sep 2011    26,621              25.97%
Oct – Dec 2011    32,191              29.59%
Jan – Mar 2012    41,048              32.16%
Apr – Jun 2012    33,872              34.49%
Jul – Sep 2012    28,250              32.40%
Oct – Dec 2012    38,472              36.52%   (entered schema.org tags Oct 19, 2012)
Jan – Mar 2013    39,948              35.29%
Apr – Jun 2013    36,641              38.30%
Jul – Sep 2013    35,058              41.88%
Oct – Dec 2013    46,082              43.98%
Jan – Mar 2014    47,123              43.93%
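For readers who want to separate growth in search visits from growth in overall traffic, here is a small back-of-the-envelope calculation using only the first and last rows of the table above. The implied totals are derived by dividing visits by share, so they are estimates rather than reported figures, and the function names are just for this sketch.

```python
# (quarter, visits via search, share of all visits) copied from the table above
FIRST = ("Jul-Sep 2011", 26_621, 0.2597)
LAST = ("Jan-Mar 2014", 47_123, 0.4393)


def implied_total_visits(search_visits: int, share: float) -> float:
    """Estimate all visits in a quarter from search visits and their share."""
    return search_visits / share


def growth(old: float, new: float) -> float:
    """Fractional change from old to new."""
    return (new - old) / old


first_total = implied_total_visits(FIRST[1], FIRST[2])
last_total = implied_total_visits(LAST[1], LAST[2])
search_growth = growth(FIRST[1], LAST[1])
total_growth = growth(first_total, last_total)

if __name__ == "__main__":
    print(f"Implied total visits, {FIRST[0]}: {first_total:,.0f}")
    print(f"Implied total visits, {LAST[0]}: {last_total:,.0f}")
    print(f"Growth in visits via search: {search_growth:.1%}")
    print(f"Growth in implied total visits: {total_growth:.1%}")
```

On these two rows, search visits grow by roughly three quarters while implied total visits rise by only a few percent, which suggests the rising share reflects more search traffic rather than shrinking traffic elsewhere. It still says nothing about whether the schema.org tags caused any of it.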
For example, if your RDFa indicates that “an ItemPage is about a CreativeWork whose creator is Sidney Gamble,” the creator of the creative work is not accessible to the API, since the CreativeWork is not a top-level item. To get around that, we had to duplicate all the CreativeWork properties in the HTML <head>, which is unnatural and a bit of a hack.
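One way such duplication might look is sketched below, assuming plain meta elements with name and content attributes. That markup choice and the function name are assumptions for illustration; the note above does not say which elements were actually used in the <head>.

```python
from html import escape
from typing import Dict


def head_meta_tags(work: Dict[str, str]) -> str:
    """Repeat nested CreativeWork properties as flat, top-level meta tags.

    Hypothetical helper: one possible form of the workaround, not a
    record of the real page templates.
    """
    return "\n".join(
        f'<meta name="{escape(prop)}" content="{escape(value)}">'
        for prop, value in work.items()
    )


if __name__ == "__main__":
    print(head_meta_tags({
        "creator": "Sidney Gamble",
        "contentLocation": "Beijing (China)",
    }))
```

The cost is that every property now lives in two places on the page, so both copies have to be generated from the same metadata to stay in sync.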
 Google’s CSE JS APIs also don’t let us retrieve the data when there are multiple values specified for the same field. For a given CreativeWork, we might have six locations that are all important to represent: China; Beijing (China); Huabei xie he nu zi da xue (Beijing, China); 中国; 北京; 华北协和女子大学. The JSON returned by the API only contains the first value: ‘China’. This, plus the result count limit, made the XML API our only viable choice for digital collections.
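The difference between keeping one value and keeping all of them can be illustrated with a small parser. The XML below is a made-up stand-in whose element and attribute names are assumptions, not Google’s actual response format; the six location values are the ones listed above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Hypothetical response fragment: repeated attributes share one name
SAMPLE_XML = """<result>
  <item type="creativework">
    <attribute name="creator" value="Sidney Gamble"/>
    <attribute name="contentLocation" value="China"/>
    <attribute name="contentLocation" value="Beijing (China)"/>
    <attribute name="contentLocation" value="Huabei xie he nu zi da xue (Beijing, China)"/>
    <attribute name="contentLocation" value="中国"/>
    <attribute name="contentLocation" value="北京"/>
    <attribute name="contentLocation" value="华北协和女子大学"/>
  </item>
</result>"""


def all_values(xml_text: str) -> Dict[str, List[str]]:
    """Collect every value for each field, preserving document order."""
    fields: Dict[str, List[str]] = defaultdict(list)
    for attr in ET.fromstring(xml_text).iter("attribute"):
        fields[attr.get("name")].append(attr.get("value"))
    return dict(fields)


def first_values(xml_text: str) -> Dict[str, str]:
    """Mimic a lossy API that keeps only the first value per field."""
    return {name: values[0] for name, values in all_values(xml_text).items()}


if __name__ == "__main__":
    print(all_values(SAMPLE_XML)["contentLocation"])
    print(first_values(SAMPLE_XML)["contentLocation"])
```

With only the first value available, a search result could show ‘China’ but never the more specific or non-Roman forms that researchers may actually be looking for.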
Notes from the Duke University Libraries Digital Projects Team