As our long-term readers of Bitstreams will attest, the Duke Digital Collections program has an established and well-earned reputation as a trailblazer when it comes to introducing new technologies, improved user interfaces, high definition imaging, and other features that deliver digital images with a beauty and verisimilitude true to the originals held by the David M. Rubenstein Rare Book & Manuscript Library. Thus, we are particularly proud to launch today our newest feature, Smell-O-Bit, which adds a whole new dimension to the digital collections experience.
Smell-O-Bit is a cutting-edge technology that utilizes the diffusers built into most recent-model computers to emit predefined scents associated with select digital objects within the Duke Digital Collections site. While still in a test phase, the Digital Collections team has already tagged several images with scents that evoke the mood or content of key images. To experience the smells, simply press Ctrl-Alt-W-Up while viewing these test images:
Made by the Pabst brewing company while beer was off limits due to Prohibition, Pabst-ett cheese was soft, spreadable, and comfort-food delicious. We’ve selected a bold, tangy scent to highlight these comforts. The scent may make you happy enough to slap your own cheeks!
The smell of cigarette smoke, margaritas, and salt from around glass rims and chess players’ brows will make you feel as if you have front row seating at this chess match between composer John Cage and a worthy, but anonymous opponent.
You may find yourself overwhelmed by the wafting scent of char-broiled deliciousness, but don't forget to take a deep inhale to detect the pickles, ketchup, and mustard, which make this a savory image all around.
Perhaps you smell garbage? If so, your Garbex isn’t working! What about flies, cats, or dogs? Or, perhaps you just smell a rat. Alright, you caught us.
Happy April Fool’s Day from Duke Digital Collections!!
Over the past year and a half, among our many other projects, we have been experimenting with a creative new approach to powering searches within digital collections and finding aids using Google’s index of our structured data. My colleague Will Sexton and I have presented this idea in numerous venues, most recently and thoroughly for a recorded ASERL (Association of Southeastern Research Libraries) webinar on June 6, 2013.
We’re eager to share what we’ve learned to date and hope this new blog will make a good outlet. We’ve had some success, but have also encountered some considerable pitfalls along the way.
What We Set Out to Do
I won’t recap all the fine details of the project here, but in a nutshell, here are the problems we’ve been attempting to address:
Maintaining our own Solr index takes a ton of time to do right. We don’t have a ton of time.
Staff have noted poor relevance rank and poor support for search using non-Roman characters.
Our digital collections search box is actually used sparsely (in only 12% of visits).
External discovery (e.g., via Google) is as important as our local search for these "inside-out" resources, if not more so.
Our approach has been twofold:
Get Google to index all of our embedded structured data
Use Google's index of our structured data to power our local search for finding aids & digital collections
Where We Are Today
We mapped several of our metadata fields to schema.org terms, then embedded that schema.org data in all 74,000 digital object pages and all 2,100 finding aids. We're now using Google's index of that data to power our default search for our finding aids and, as a pilot, the Sidney D. Gamble digital collection.
Embedding the Data. We kept it super simple here. We labeled every finding aid page a ‘CollectionPage’ and tagged only a few properties: name, description, creator, and if present, a thumbnailUrl for a collection with digitized content.
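A minimal sketch of what such a finding aid page might look like, using schema.org microdata (all values and URLs here are hypothetical, and RDFa Lite attributes would work equally well):

```html
<!-- Hypothetical finding aid page marked up as a schema.org CollectionPage -->
<body itemscope itemtype="http://schema.org/CollectionPage">
  <h1 itemprop="name">Sample Family Papers, 1900-1950</h1>
  <p itemprop="description">Correspondence, photographs, and ledgers documenting ...</p>
  <span itemprop="creator">Sample, Jane</span>
  <!-- present only when the collection has digitized content -->
  <meta itemprop="thumbnailUrl" content="http://example.org/thumbs/sample.jpg" />
</body>
```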
Rendering Search Results Using Google’s Index.
Digital Collections Search: Sidney D. Gamble Collection
Embedding the Data.
Our digital collections introduce more complexity in the structured data than we see in our finding aids. Naturally, we have a wide range of item types with diverse metadata. We want our markup to represent the relationship of an item to its source collection. The item, the webpage that it’s on, the collection it came from, and the media files associated with it all have properties that can be expressed using schema.org terms. So, we tried it all.
Rendering Search Results Using Google’s Index.
For the Gamble collection, we succeeded in routing queries through Google's XML API while preserving the look of our existing search results. Note that the facets on the left side aren't powered via Google; we haven't gotten far enough in our experiment to filter the result set based on the structured data, but that's possible to do.
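In outline, the plumbing for this is straightforward: build a request URL, fetch the XML, and walk the result elements into whatever template already renders your search page. A sketch of that flow (the CSE id and endpoint parameters are illustrative; check the Custom Search XML API documentation for what your account supports):

```python
# Sketch: building a Google CSE XML request and parsing basic results.
from urllib.parse import urlencode
from xml.etree import ElementTree

def cse_xml_url(query, cse_id, start=0):
    """Build a CSE XML API request URL (parameter names are illustrative)."""
    params = {"cx": cse_id, "q": query, "output": "xml_no_dtd",
              "client": "google-csbe", "start": start}
    return "https://www.google.com/search?" + urlencode(params)

def parse_results(xml_text):
    """Extract title, URL, and snippet from each <R> result element."""
    root = ElementTree.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "title": r.findtext("T", default=""),
            "url": r.findtext("U", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results
```

Each result dict can then pass through the same results template as before, so the page looks the same whether Solr or Google supplied the hits.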
We’ve been pleased with the ability to build our own rich snippets and heavily customize the appearance of search results without having to do a ton of development. Getting our structured data back from Google’s index to work with is an awesome service, and developing around the schema.org properties we were already providing has been a nice way to kill two birds with one stone.
For performance, Google CSE is working well in both the finding aids search and the Gamble digital collection search for these purposes:
getting the most relevant content presented early on in the search result
getting results quickly
handling non-Roman characters in search terms
retrieving a needle in a haystack — an item or handful of items that contain some unique text
While Google CSE shows relevant results quickly, we’re finding it’s not a good fit for exploratory searching when either of these aspects is important:
getting a stable and precise count of relevant results
browsing an exhaustive list of results that match a general query
For queries with several pages of hits, you may get an estimated result count that’s close, but unfortunately things occasionally and inexplicably go sour as you navigate from one result page to the next. E.g., the Gamble digital collection query ‘beijing’ shows about 2,100 results (in the ballpark of what Solr returns), yet browse a few pages in and the result set gets severely truncated: you may only be able to actually browse about 200 of the results without issuing more specific query terms.
Impact on External Discovery
Traffic to digital collections via external search engines has climbed fairly steadily every quarter for the past few years, from 26% of all visits in Jul-Sep 2011 up to 44% in Jan-Mar 2014 (quarter to date). We entered schema.org tags in Oct 2012, but we don’t know whether adding that data has contributed at all to this trend. Does schema.org data impact relevance? It’s hard to tell.
Structured Data Syntax + Google APIs
Rich Snippets in Big Google
We’re seeing Google render rich snippets for our videos, because we’ve marked them as schema.org VideoObjects with properties like thumbnailUrl. That’s encouraging! Perhaps someday Google will render better snippets for things like photographs (of which we have a bunch), or maybe even more library domain-specific materials like digitized oral histories, manuscripts, and newspapers. But at present, none of our other objects seem to trigger nice snippets like this.
We represented item pages as schema.org “ItemPage” types, using the “isPartOf” property to relate each item page to its corresponding “CollectionPage”. We made the ItemPage “about” a “CreativeWork”, then created mappings for many of our metadata fields to CreativeWork properties, e.g., creator, contentLocation, genre, dateCreated.
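In outline, an item page's markup might look something like this (the values are hypothetical, and the exact attribute syntax depends on whether microdata or RDFa is used):

```html
<!-- Hypothetical digital object page: an ItemPage about a CreativeWork -->
<body itemscope itemtype="http://schema.org/ItemPage">
  <link itemprop="isPartOf" href="http://example.org/collections/gamble" />
  <div itemprop="about" itemscope itemtype="http://schema.org/CreativeWork">
    <h1 itemprop="name">Street scene, Beijing</h1>
    <span itemprop="creator">Gamble, Sidney D.</span>
    <span itemprop="contentLocation">Beijing (China)</span>
    <span itemprop="genre">Photographs</span>
    <meta itemprop="dateCreated" content="1919" />
  </div>
</body>
```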
Digital Collections External Search Traffic by Quarter

Quarter         Visits via Search   % Visits via Search
Jul-Sep 2011    26,621              25.97%
Oct-Dec 2011    32,191              29.59%
Jan-Mar 2012    41,048              32.16%
Apr-Jun 2012    33,872              34.49%
Jul-Sep 2012    28,250              32.40%
Oct-Dec 2012    38,472              36.52%   <-- entered schema.org tags Oct 19, 2012
Jan-Mar 2013    39,948              35.29%
Apr-Jun 2013    36,641              38.30%
Jul-Sep 2013    35,058              41.88%
Oct-Dec 2013    46,082              43.98%
Jan-Mar 2014    47,123              43.93%
Google’s APIs only expose structured data from top-level items, which complicates nested markup. For example, if your RDFa indicates that “an ItemPage is about a CreativeWork whose creator is Sidney Gamble,” the creator of the CreativeWork is not accessible to the API, since the CreativeWork is not a top-level item. To get around that, we had to duplicate all the CreativeWork properties in the HTML <head>, which is unnatural and a bit of a hack.
Google’s CSE JavaScript APIs also don’t let us retrieve the data when there are multiple values specified for the same field. For a given CreativeWork, we might have six locations that are all important to represent: China; Beijing (China); Huabei xie he nu zi da xue (Beijing, China); 中国; 北京; 华北协和女子大学. The JSON returned by the API contains only the first value: ‘China’. This, plus the result-count limit, made the XML API our only viable choice for digital collections.
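The XML responses, by contrast, do carry every value, which is part of why we fell back to them. A sketch of collecting multivalued PageMap attributes from a result shaped like CSE's XML output (the sample data is illustrative):

```python
# Sketch: gathering every PageMap attribute value per field name,
# preserving multiple values for the same field (e.g. several locations).
from collections import defaultdict
from xml.etree import ElementTree

def pagemap_attributes(result_xml):
    """Collect all <Attribute> values from a result's <PageMap>, keyed by name."""
    root = ElementTree.fromstring(result_xml)
    attrs = defaultdict(list)
    for attribute in root.iter("Attribute"):
        attrs[attribute.get("name")].append(attribute.get("value"))
    return dict(attrs)
```

Keeping a list per field name is what makes it possible to render all six location strings for an item instead of just the first.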
This digital collection consists of a selection of audio and video recordings from the extensive collection of Duke University Chapel recordings housed in the Duke University Archives, part of the David M. Rubenstein Rare Book & Manuscript Library. The digital collection features 168 audio and video recordings from the Chapel, including sermons from notable African American and female preachers. This project has been a fruitful collaboration among Duke Chapel, the Divinity School, the Rubenstein Library, and of course the digital projects team in the Duke University Libraries. To learn more, visit the Devil’s Tale blog (the blog of the Rubenstein Library).
But wait, there’s more!
Fifteen of the recordings were digitized from VHS tapes and are available as video playable from within the digital collection. These are our first digitized videos delivered via our own infrastructure; our previous efforts have all relied on external platforms like YouTube, iTunes, and the Internet Archive to serve up the videos. While those tools are familiar to users, feature-rich, and built on a strong technological backbone, we have been intending for quite a while to develop support for delivering digital video in-house.
When you view a video from the Duke Chapel Recordings, you’ll see a “poster frame” image of the featured speaker. Click the play button to begin (of course!) and the video will play within the page. Watching the videos is a “pseudo-streaming” or “progressive download” experience akin to YouTube. That is, you can start watching almost immediately, and you can click ahead to arbitrary points in the middle of the video at any time. And while you might occasionally have to wait for things to buffer, videos should play smoothly on desktop, tablet, and smartphone devices, and can be easily enlarged to full-screen. Finally, there’s a Download link right below the video if you’d like to take the files with you.
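At its simplest, progressive-download playback like this needs little more than an HTML5 video element plus a server that honors byte-range requests (which is what makes seeking to arbitrary points work). A simplified sketch, with hypothetical paths and filenames rather than our actual player markup:

```html
<!-- Hypothetical player: poster frame, in-page playback, download link -->
<video controls preload="metadata"
       poster="/media/chapel/sermon-posterframe.jpg">
  <source src="/media/chapel/sermon.mp4" type="video/mp4" />
  Your browser does not support HTML5 video.
</video>
<a href="/media/chapel/sermon.mp4" download>Download</a>
```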
We’re looking forward to hearing from our users and learning from our peers who are working with digital media to keep refining our approach. We hope to make many more videos from our collections available in the near future.
Post authored by Sean Aery and Molly Bragg.