All posts by Sean Aery

Embeds, Math & Beyond

This week, in conjunction with our H. Lee Waters Film Collection unveiling, we rolled out a handy new Embed feature for digital collections items.  The idea is to make it as easy as possible for someone to share their discoveries from our collections, with proper attribution, on other websites or blogs.

How To

It’s simple, really, and mimics the experience you’re likely to encounter getting embed code from other popular sites with videos, images, and the like. We modeled our approach loosely on the Internet Archive’s video embed service (e.g., visit this video and click the Share icon, but only if you are unafraid of clowns).

Embed Link

Click the “Embed” link under an item from Duke Digital Collections, and copy the snippet of code that pops up. Paste it in your website, and you’re done!

Examples

I’ll paste a few examples below using different kinds of items. The embed code is short and nearly identical for all of these:

A Single Image

Paginated Item

A Video

Single-Track Audio

Multi-Track Audio

Document with Document Viewer

Technical Considerations

Building this feature required a little bit of math, some trial & error, and a few tricks. The steps were to:

  • Set up a service to return customized item pages at the path http://library.duke.edu/digitalcollections/embed/<itemid>/
  • Use CSS & JS to make the media as fluid as possible to fill whatever space it ends up in
  • Use a fixed height and overflow: auto on the attribution box so longer content will scroll
  • Use link rel="canonical" to ensure the item’s embed page is associated with the real item page (especially to improve links / ranking signals for search engines).
  • Present the user a copyable HTML <iframe> element in the regular item page that has the correct height & width attributes to accommodate the item(s) to be embedded

This last point is where the math comes in. Take a single image item, for example. With a landscape-orientation image we need to give the user a different <iframe> height to copy than we would for a portrait. It gets even more complicated when we have to account for multiple tracks of audio or video, or combinations of the two.
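
For instance, the copyable snippet for a single image item looks roughly like the sketch below. The item ID and pixel dimensions are made up for illustration; the real snippet comes from the “Embed” popup on each item page.

    <!-- Illustrative only: hypothetical item ID and made-up dimensions.
         The height combines the scaled media height with the fixed-height, scrollable
         attribution box; a portrait image would get a taller value than a landscape one. -->
    <iframe src="http://library.duke.edu/digitalcollections/embed/example-item-id/"
            width="600" height="540" frameborder="0" allowfullscreen></iframe>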

Coming Soon

We’ll refine this feature a bit in the coming weeks, and work out any embed-bugs we discover. We’ll also be developing a similar feature for embedding digitized content found in our archival collection guides.

New Angles & Avenues for Bitstreams

This week, we added a display of our most recent Bitstreams blog posts to our Digital Collections homepage (example), and likewise, a view of posts relevant to a given collection on the respective collection’s homepage (example).

Background

Our Digital Projects & Production team has been writing in Bitstreams at least weekly since February 2014. We’ve had some excellent guest contributors, too. Some posts share updates about new digital collections or additions, while others share insights, lessons learned, and behind-the-scenes looks at the projects we’re currently tackling.

Many of our posts have been featured on our library homepage and library news site. But until now, we haven’t been able to display any of them—not even the ones about new digital collections—alongside the collections themselves. So, if you visited the DukEngineer collection in the past, you likely missed out on Melanie’s excellent overview, which puts the magazine in context and highlights the best of what’s inside.

Past Solutions

Syndicating tagged blog posts for display elsewhere is a pretty common use case, and we’ve used a bunch of different solutions as our platforms have evolved. Each solution has naturally been painstakingly tailored to accommodate the inner workings of both the source and the destination. Seven years ago, we were writing custom XSLT to create and then consume our own RSS feeds in Cascade Server CMS. We have since hopped over to WordPress for managing news and blogs (whew!). An older version of our digital collections app used WordPress’ XML-RPC API to get tagged posts and parsed them with Python.

These days, our library website does blog syndication by using a combo of WordPress RSS, Drupal’s feed aggregator module, and occasionally Yahoo! Pipes for data mashing and munging. It works well in Drupal, but other platforms require other approaches.

Under the Hood: AngularJS and WordPress JSON API

Bret Davidson’s Code4Lib 2014 presentation, Towards Pasta Code Nirvana: Using JavaScript MVC to Fill Your Programming Ravioli  (slides) made me hungry. Hungry for pasta, yes, but also for knowledge. I wanted to:

  1. Experiment with one of the JavaScript MVC frameworks to learn how they work, and in the process…
  2. Build something potentially useful for digital collections that could be ported over to a new application framework in the future (e.g., from our current Django app to a future Ruby on Rails app).

From the many possibilities, I chose AngularJS. It seemed well documented and increasingly popular, and with Google’s backing it seems likely to be around for a while.

WordPress JSON API

Among Angular’s virtues is that it really simplifies the process of getting and using JSON data from an API. I found WordPress’ JSON API plugin, which, interestingly, was developed by staff at MoMA so they could use WordPress as a back end for a site with a Rails front end. We first had to enable that plugin for our Bitstreams blog.

AngularJS

AngularJS definitely helps keep code clean, especially by separating the model (the blog posts and their associated data, plus the page state) from the view (which specifies how to display the data) and from the controller (which gets and refines the data into the model, and updates the model upon interactions with the view). I’ve done several projects in the past using jQuery and DOM manipulation to retrieve and display data. It usually works, but in the process I create a veritable rat’s nest of spaghetti code wherein /* no amount of commenting */ can truly help disentangle what’s happening.

Angular also supercharges HTML with more useful attributes for controlling a display. I’ve only just scratched the surface, but it’s clear that built-in directives like ng-repeat and filters like limitTo spare me from writing a ton of JavaScript, e.g., <li ng-repeat="post in blogposts | limitTo:pageSize">. After the initial learning curve, the markup is visually intuitive. And it’s nice that directives and filters are extensible so you can make your own.

Source code: controller js, HTML (view source)
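
Distilled down, the pattern looks something like the sketch below. The endpoint URL, tag slug, and response fields are illustrative assumptions; the linked source and the JSON API plugin documentation have the real details.

    <div ng-app="bitstreamsApp" ng-controller="BlogPostsCtrl">
      <ul>
        <!-- limitTo caps how many posts render; pageSize lives in the controller -->
        <li ng-repeat="post in blogposts | limitTo:pageSize">
          <a href="{{ post.url }}">{{ post.title }}</a>
        </li>
      </ul>
    </div>

    <script>
    // Minimal Angular 1.x controller: fetch tagged posts from a WordPress JSON API
    // plugin endpoint (hypothetical URL and parameters) and expose them to the view.
    angular.module('bitstreamsApp', [])
      .controller('BlogPostsCtrl', ['$scope', '$http', function($scope, $http) {
        $scope.pageSize = 5;
        $scope.blogposts = [];
        $http.get('http://blogs.example.edu/bitstreams/api/get_tag_posts/', {
          params: { tag_slug: 'digital-collections', count: 20 }
        }).then(function(response) {
          $scope.blogposts = response.data.posts;  // assumes the plugin returns { posts: [...] }
        });
      }]);
    </script>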

Initial Lessons Learned

  • AngularJS has a steeper learning curve than I’d expected; I assumed I could do this mini-project in a few hours, but it took a couple of days to really get a handle on the basic pieces I needed.
  • Writing an Angular app within a Django app is tricky. Both use {{ variable }} template tags, so I had to configure Angular to use [[ variable ]] instead (sketched below).
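
Here’s the gist of that workaround in Angular 1.x (the module name is illustrative):

    // Keep Django's {{ }} template tags from colliding with Angular's interpolation
    // by switching Angular to [[ ]] instead.
    angular.module('bitstreamsApp', []).config(['$interpolateProvider', function($interpolateProvider) {
      $interpolateProvider.startSymbol('[[');
      $interpolateProvider.endSymbol(']]');
    }]);

In the markup, bindings then become [[ post.title ]] instead of {{ post.title }}.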

Looking Ahead

I consider this an encouraging proof of concept. While our own blog posts can be interesting, there are many other sources of valuable data out in the world that are relevant to our collections that would add value for our researchers if we were able to easily get and display them. AngularJS won’t be the answer to all of these needs, but it’s nice to have in the toolset.

Tweets and Metadata Unite!: Meet the Twitter Card

Twitter Cards
Source: https://dev.twitter.com/cards

Everyone knows that Twitter limits each post to 140 characters. Early criticism has since cooled and most people agree it’s a helpful constraint, circumvented through clever (some might say better) writing, hyperlinks, and URL-shorteners.  But as a reader of tweets, how do you know what lies at the other end of a shortened link? What entices you to click? The tweet author can rarely spare the characters to attribute the source site or provide a snippet of content, and can’t be expected to attach a representative image or screenshot.

Our webpages are much more than just mystery destinations for shortened URLs. Twitter agrees: its developers want help understanding what a webpage’s share-worthy content actually is, so they can present it in a compelling way alongside the 140 characters or fewer. Enter two library hallmarks: vocabularies and metadata.

This week, we added Twitter Card metadata in the <head> of all of our digital collections pages and in our library blogs. This data instantly made all tweets and retweets linking to our pages far more interesting. Check it out!

For the blogs, tweets now display the featured image, post title, opening snippet, site attribution, and a link to the original post. Links to items from digital collections now show the image itself (along with some item info), while links to collections, categories, or search results now display a grid of four images with a description underneath. See these examples:

 

A gallery tweet, linking to the homepage for the William Gedney Photographs collection.
Summary Card With Large Image: tweet linking to a post in The Devil’s Tale blog.
Summary Card With Large Image: tweet linking to a digital collections image.

 

Why This Matters

In 2013-14, social media platforms accounted for 10.1% of traffic to our blogs (~28,000 visits in 2013-14, 11,300 via Twitter), and 4.3% of visits to our digital collections (~17,000 visits, 1,000 via Twitter). That seems low, but perhaps it’s because of the mystery link phenomenon. These new media-rich tweets have the potential to increase our traffic through these channels by being more interesting to look at and more compelling to click.  We’re looking forward to finding out whether they do.

And regardless of driving clicks, there are two other benefits of Twitter Cards that we really care about in the library: context and attribution. We love it when our collections and blog posts are shared on Twitter. These tweets now automatically give some additional information and helpfully cite the source.

How to Get Your Own Twitter Cards

The Manual Way

If you’re manually adding tags like we’ve done in our Digital Collections templates, you can “View Source” on any of our pages to see what <meta> tags make the magic happen. Moz also has some useful code snippets to copy, with links to validator tools so you can make sure you’re doing it correctly.
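
As a rough illustration, the relevant tags for a single item look something like this (the handle, title, and URLs are placeholders):

    <meta name="twitter:card" content="summary_large_image">
    <meta name="twitter:site" content="@example">  <!-- your site's Twitter handle -->
    <meta name="twitter:title" content="Title of the item or post">
    <meta name="twitter:description" content="A short snippet describing the content.">
    <meta name="twitter:image" content="http://library.duke.edu/path/to/representative-image.jpg">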

Twitter Card metadata for a Gallery Page (Broadsides & Ephemera Collection)

WordPress

Since our blogs run on WordPress, we were able to use the excellent WordPress SEO plugin by Yoast. It’s helpful for a lot of things related to search engine optimization, and it makes this social media optimization easy, too.

Adding Twitter Card metadata with the WordPress SEO plugin.

Once your tags are in place, you just need to validate an example from your domain using the Twitter Card Validator before Twitter will turn on the media-rich tweets. It doesn’t take long at all: ours began appearing within a couple hours. The cards apply retroactively to previous tweets, too.

Related Work

Our addition of Twitter Card data follows similar semantic markup work in our Digital Collections site with the Open Graph and Schema.org vocabularies. Open Graph is a standard developed by Facebook. Similar to Twitter Card metadata, OG tags inform Facebook what content to highlight from a linked webpage. Schema.org is a vocabulary for describing the contents of web pages in a way that is helpful for retrieval and representation in Google and other search engines.
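
For comparison, Open Graph tags carry much the same information in a slightly different syntax (values again illustrative):

    <meta property="og:type" content="website">
    <meta property="og:title" content="Title of the item or post">
    <meta property="og:description" content="A short snippet describing the content.">
    <meta property="og:image" content="http://library.duke.edu/path/to/representative-image.jpg">
    <meta property="og:url" content="http://library.duke.edu/digitalcollections/example-item/">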

Open Graph and Schema.org markup can both be expressed with RDFa syntax, a key cornerstone of Linked Data on the web that supports the description of resources using whichever vocabularies you choose. Google, Twitter, Facebook, and other major players in our information ecosystem are now actively using this kind of structured data, providing a clear incentive for web authors to supply it. We should keep striving to play along.

Large-Scale Digitization and Lessons from the CCC Project

Back in February 2014, we wrapped up the CCC project, a collaborative, three-year, IMLS-funded digitization initiative with our partners in the Triangle Research Libraries Network (TRLN). The full title of the project is a mouthful, but it captures its essence: “Content, Context, and Capacity: A Collaborative Large-Scale Digitization Project on the Long Civil Rights Movement in North Carolina.”

Together, the four university libraries (Duke, NC State, UNC-Chapel Hill, NC Central) digitized over 360,000 documents from thirty-eight collections of manuscripts relevant to the project theme. About 66,000 were from our David M. Rubenstein Rare Book & Manuscript Library collections.

Large-Scale

So how large is “large-scale”? By comparison, when the project kicked off in summer 2011, we had a grand total of 57,000 digitized objects available online (“published”), collectively accumulated through sixteen years of digitization projects. That number was 69,000 by the time we began publishing CCC manuscripts in June 2012. Putting just as many documents online in three years as we’d been able to do in the previous sixteen naturally requires a much different approach to creating digital collections.

Traditional Digitization                    | Large-Scale Digitization
--------------------------------------------|------------------------------------------------------
Individual items identified during scanning | No item-level identification: entire folders scanned
Descriptive metadata applied to each item   | Archival description only (e.g., at the folder level)
Robust portals for search & browse          | Finding aid / collection guide as access point

There are considerable tradeoffs between document availability on the one hand and discovery and access features on the other, but working at this scale speeds publication dramatically. Large-scale digitization was new for all four partners, so we benefited by working together.

Digitized documents accessed through an archival finding aid / collection guide with folder-level description.

Project Evaluation

CCC staff completed qualitative and quantitative evaluations of this large-scale digitization approach during the course of the project, ranging from conducting user focus groups and surveys to analyzing the impact on materials prep time and image quality control. Researcher assessments targeted three distinct user groups: 1) Faculty & History Scholars; 2) Undergraduate Students (in research courses at UNC & NC State); 3) NC Secondary Educators.

Here are some of the more interesting findings (consult the full reports for details):

  • Ease of Use. Faculty and scholars, for the most part, found it easy to use digitized content presented this way. Undergraduates were more ambivalent, and secondary educators had the most difficulty.
  • To Embed or Not to Embed. In 2012, Duke was the only library presenting the image thumbnails embedded directly within finding aids and a lightbox-style image navigator. Undergrads who used Duke’s interface found it easier to use than UNC’s or NC Central’s, and Duke’s collections had a higher rate of images viewed per folder than the other partners’. UNC’s & NC Central’s interfaces now use a similar convention.
  • Potential for Use. Most users surveyed said they could indeed imagine themselves using digitized collections presented in this way in the course of their research. However, the approach falls short in meeting key needs for secondary educators’ use of primary sources in their classes.
  • Desired Enhancements. The top two features most desired by faculty/scholars and undergrads alike were 1) the ability to search the text of the documents (OCR), and 2) the ability to explore by topic, date, and document type (i.e., things enabled by item-level metadata). PDF download was also a popular pick.

 

Impact on Duke Digitization Projects

Since the moment we began putting our CCC manuscripts online (June 2012), we’ve completed the eight CCC collections using this large-scale strategy, and an additional eight manuscript collections outside of CCC using the same approach. We have now cumulatively put more digital objects online using the large-scale method (96,000) than we have via traditional means (75,000). But in that time, we have also completed eleven digitization projects with traditional item-level identification and description.

We see the large-scale model for digitization as complementary to our existing practices: a technique we can use to meet the publication needs of some projects.

Usage

Do people actually use the collections when presented in this way? Some interesting figures:

  • Views / item in 2013-14 (traditional digital object; item-level description): 13.2
  • Views / item in 2013-14 (digitized image within finding aid; folder-level description): 1.0
  • Views / folder in 2013-14 (digitized folder view in finding aid): 8.5

It’s hard to attribute the usage disparity entirely to the publication method (they’re different collections, for one). But it’s reasonable to deduce (and unsurprising) that bypassing item-level description generally results in less traffic per item.

On the other hand, one of our CCC collections (The Allen Building Takeover Collection) has indeed seen heavy use–so much, in fact, that nearly 90% of TRLN’s CCC items viewed in the final six months of the project were from Duke. Its images averaged over 78 views apiece in the past year; its eighteen folders were opened 363 times apiece. Why? The publication of this collection coincided with an on-campus exhibit, and it was incorporated into multiple Duke courses for assignments built around primary sources.

The takeaway: sometimes having interesting, important, and timely content available online matters more than the features enabled or the process by which it all gets there.

Looking Ahead

We’ll keep pushing ahead with evolving our practices for putting digitized materials online. We’ve introduced many recent enhancements, like fulltext searching, a document viewer, and embedded HTML5 video. Inspired by the CCC project, we’ll continue to enhance our finding aids to provide access to digitized objects inline for context (e.g., The Jazz Loft Project Records). Our TRLN partners have also made excellent upgrades to the interfaces to their CCC collections (e.g., at UNC, at NC State) and we plan, as usual, to learn from them as we go.

Leveling Up Our Document Viewer

This past week, we were excited to be able to publish a rare 1804 manuscript copy of the Haitian Declaration of Independence in our digital collections website. We used the project as a catalyst for improving our document-viewing user experience, since we knew our existing platforms just wouldn’t cut it for this particular treasure from the Rubenstein Library collection. In order to present the declaration online, we decided to implement the open-source Diva.js viewer. We’re happy with the results so far and look forward to making more strides in our ability to represent documents in our site as the year progresses.

Haitian Declaration of Independence as seen in Diva.js document viewer with full text transcription.

Challenges to Address

We have had two glaring limitations in providing access to digitized collections to date: 1) a less-than-stellar zoom & pan feature for images and 2) a suboptimal experience for navigating documents with multiple pages. For zooming and panning (see example), we use software called OpenLayers, which is primarily a mapping application. And for paginated items we’ve used two plugins designed to showcase image galleries, Galleria (example) and Colorbox (example). These tools are all pretty good at what they do, but we’ve been using them more as stopgap solutions for things they weren’t really created to do in the first place. As the old saying goes, when all you have is a hammer, everything looks like a nail.

Big (OR Zoom-Dependent) Things

A selection from our digitized Italian Cultural Posters. The “large” derivative is 11,000 x 8,000 pixels, a 28MB JPG.

Traditionally, as we digitize images, whether freestanding or components of a multi-page object, at the end of the process we generate three JPG derivatives per page: a thumbnail (helpful in search results or other item sets), a medium image (what you see on an item’s webpage), and a large image (same dimensions as the preservation master, viewed via the ‘all sizes’ link). That’s a common approach, but there are several cases where it doesn’t work so well. Some things we’ve digitized are big, as in “shoot them in sections with a camera and stitch the images together” big. And we’ve got several more materials like this waiting in the wings to make available. A medium image doesn’t always do these things justice, but good luck downloading and navigating a giant 28MB JPG when all you want to do is zoom in a little bit.

Likewise, an object doesn’t have to be large to really need easy zooming to be part of the viewing experience. You might want to read the fine print on that newspaper ad, see the surgeon general’s warning on that billboard, or inspect the brushstrokes in that beautiful hand-painted glass lantern slide.

And finally, it’s not easy to anticipate the exact dimensions at which all our images will be useful to a person or program using them. Using our data to power an interactive display for a media wall? A mobile app? A slideshow on the web? You’ll probably want images that are different dimensions than what we’ve stored online. But to date, we haven’t been able to provide ways to specify different parameters (like height, width, and rotation angle) in the image URLs to help people use our images in environments beyond our website.

A page from Mary McCornack Thompson’s 1908 travel diary, limited by its presentation via an image gallery.

Paginated Things

We do love our documentary photography collections, but a lot of our digitized objects are represented by more than just a single image. Take an 11-page piece of sheet music or a 127-page diary, for example. Those aren’t just sequences or collections of images. Their pagination is pretty essential to how they’re represented online, but a lot of what characterizes those materials is unfortunately lost in translation when we use gallery tools to display them.

The Intersection of (Big OR Zoom-Dependent) AND Paginated

Here’s where things get interesting and quite a bit more complicated: when zooming, panning, page navigation, and system performance are all essential to interacting with a digital object. There are several tools out there that support these various aspects, but very few that do them all AND do them well. We knew we needed something that did.

Our Solution: Diva.js

We decided to use the open-source Diva.js (Document Image Viewer with AJAX). Developed at the Distributed Digital Music Archives and Libraries Lab (DDMAL) at McGill University, it’s “a Javascript frontend for viewing documents, designed to work with digital libraries to present multi-page documents as a single, continuous item” (see About page). We liked its combination of zooming, panning, and page navigation, as well as its extensibility. This Code4Lib article nicely summarizes how it works and why it was developed.

Setting up Diva.js required us to add a few new pieces to our infrastructure. The most significant was an image server (in our case, IIPImage) that could 1) deliver parts of a digital image upon request, and 2) deliver complete images at whatever size is requested via URL parameters.
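
To give a flavor of that second capability: IIPImage accepts its parameters in the query string, so a client can ask for a resized JPEG or a single tile by URL. The server and image paths below are made up; FIF, WID, CVT, and JTL are part of the IIP protocol.

    <!-- A complete JPEG scaled to 800 pixels wide (hypothetical paths) -->
    <img src="http://iip.example.edu/fcgi-bin/iipsrv.fcgi?FIF=/images/declaration_001.tif&WID=800&CVT=jpeg">

    <!-- A single 256x256 tile (JTL=resolution,tile), the kind of request Diva.js issues as you pan and zoom -->
    <img src="http://iip.example.edu/fcgi-bin/iipsrv.fcgi?FIF=/images/declaration_001.tif&JTL=4,17&CVT=jpeg">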

Our Interface: How it Works

By default, we present a document in our usual item page template that provides branding, context, and metadata. You can scroll up and down to navigate pages, use Page Up or Page Down keys, or enter a page number to jump to a page directly. There’s a slider to zoom in or out, or alternatively you can double-click to zoom in / Ctrl-double-click to zoom out. You can toggle to a grid view of all pages and adjust how many pages to view at once in the grid. There’s a really handy full-screen option, too.

Fulltext transcription presented in fullscreen mode, thumbnail view.
Page 4, zoom level 4, with link to download.

It’s optimized for performance via AJAX-driven “lazy loading”: only the page of the document that you’re currently viewing has to load in your browser, and likewise only the visible part of that page image in the viewer must load (via square tiles). You can also download a complete JPG for a page at the current resolution by clicking the grey arrow.

We extended Diva.js by building a synchronized fulltext pane that displays the transcript of the current page alongside the image (and beneath it in full-screen view). That doesn’t come out-of-the-box, but Diva.js provides some useful hooks into its various functions to enable developing this sort of thing. We also slightly modified the styles.

A tile delivered by IIPImage server

Behind the scenes, we have pyramid TIFF images (one for each page), served up as JPGs by IIPImage server. These files comprise arrays of 256×256 JPG tiles for each available zoom level for the image. Let’s take page 1 of the declaration for example. At zoom level 0 (all the way zoomed out), there’s only one image tile: it’s under 256×256 pixels; level 1 is 4 tiles, level 2 is 12, level 3 is 48, level 4 is 176. The page image at level 5 (all the way zoomed in) includes 682 tiles (example of one), which sounds like a lot, but then again the server only has to deliver the parts that you’re currently viewing.
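
The arithmetic behind those tile counts is simple enough to sketch. The full-size dimensions below are approximations chosen to reproduce the counts above; each zoom level is half the linear size of the one above it, and tiles are 256×256.

    // Rough tile math for a pyramid image served by IIPImage to Diva.js.
    // Dimensions are illustrative (a portrait page of roughly 5,600 x 7,900 pixels at full size).
    function tilesPerLevel(fullWidth, fullHeight, maxZoom) {
      var counts = [];
      for (var z = 0; z <= maxZoom; z++) {
        var scale = Math.pow(2, maxZoom - z);       // level 0 = fully zoomed out
        var w = Math.ceil(fullWidth / scale);
        var h = Math.ceil(fullHeight / scale);
        counts.push(Math.ceil(w / 256) * Math.ceil(h / 256));
      }
      return counts;
    }

    console.log(tilesPerLevel(5632, 7936, 5));  // [1, 4, 12, 48, 176, 682]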

Every item using Diva.js also needs to load a JSON stream including the dimensions for each page within the document, so we had to generate that data. If there’s a transcript present, we store it as a single HTML file, then use AJAX to dynamically pull in the part of that file that corresponds to the currently-viewed page in the document.
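
The transcript sync is conceptually simple; here’s a hand-wavy sketch of the idea. The element IDs, file name, and page-change hook are illustrative, not Diva.js API, and the transcript file is assumed to mark each page up as <div class="transcript-page" data-page="N">.

    var transcript = null;

    function renderTranscript(pageNumber) {
      var section = transcript.find('.transcript-page[data-page="' + pageNumber + '"]');
      $('#transcript-pane').html(section.length ? section.html() : '');
    }

    function showTranscriptForPage(pageNumber) {
      if (transcript) {
        renderTranscript(pageNumber);
        return;
      }
      // Fetch the single transcript HTML file once, then reuse it on every page change.
      $.get('transcript.html', function(html) {
        transcript = $('<div>').append(html);
        renderTranscript(pageNumber);
      });
    }

    // Wherever the viewer reports that the visible page has changed, call:
    //   showTranscriptForPage(newPageNumber);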

Diva.js & IIPImage Limitations

It’s a good interface, and is the best document representation we’ve been able to provide to date. Yet it’s far from perfect. There are several areas that are limiting or that we want to explore more as we look to make more documents available in the future.

Out of the box, Diva.js doesn’t support page metadata, transcriptions, or search & retrieval within a document. We do display a synchronized transcript, but there’s currently no mapping between the text and the location within each page where each word appears, nor can you perform a search and discover which pages contain a given keyword. Other folks using Diva.js are working on robust applications that handle these kinds of interactions, but the degree to which they must customize the application is high. See, for example, the Salzinnes Antiphonal (a 485-page liturgical manuscript with text and music) or a prototype for the Liber Usualis (a 2,000+ page manuscript using optical music recognition to encode melodic fragments).

Diva.js also has discrete zooming, which can feel a little jarring when you jump between zoom levels. It’s not the smooth, continuous zoom experience that is becoming more commonplace in other viewers.

With the IIPImage server, we’ll likely re-evaluate using Pyramid TIFFs vs. JPEG2000s to see which file format works best for our digitization and publication workflow. In either case, there are several compression and caching variables to tinker with to find an ideal balance between image quality, storage space required, and system performance. We also discovered that the IIP server unfortunately strips out the images’ ICC color profiles when it delivers JPGs, so users may not be getting a true-to-form representation of the image colors we captured during digitization.

Next Steps

Launching our first project using Diva.js gives us a solid jumping-off point for expanding our ability to provide useful, compelling representations of our digitized documents online. We’ll assess how well this same approach would scale to other potential projects and in the meantime keep an eye on the landscape to see how things evolve. We’re better equipped now than ever to investigate alternative approaches and complementary tools for doing this work.

We’ll also engage more closely with our esteemed colleagues in the Duke Collaboratory for Classics Computing (DC3), who are at the forefront of building tools and services in support of digital scholarship. Well beyond supporting discovery and access to documents, their work enables a community of scholars to collaboratively transcribe and annotate items (an incredible–and incredibly useful–feat!). There’s a lot we’re eager to learn as we look ahead.

Schema.org and Google for Local Discovery: Some Key Takeaways

Over the past year and a half, among our many other projects, we have been experimenting with a creative new approach to powering searches within digital collections and finding aids using Google’s index of our structured data. My colleague Will Sexton and I have presented this idea in numerous venues, most recently and thoroughly for a recorded ASERL (Association of Southeastern Research Libraries) webinar on June 6, 2013.

We’re eager to share what we’ve learned to date and hope this new blog will make a good outlet. We’ve had some success, but have also encountered some considerable pitfalls along the way.

What We Set Out to Do

I won’t recap all the fine details of the project here, but in a nutshell, here are the problems we’ve been attempting to address:

  • Maintaining our own Solr index takes a ton of time to do right. We don’t have a ton of time.
  • Staff have noted poor relevance rank and poor support for search using non-Roman characters.
  • Our digital collections search box is actually used in only 12% of visits.
  • External discovery (e.g., via Google) is of equal or greater importance than our local search for these “inside-out” resources.

Here’s our three-step strategy:

  1. Embed schema.org data in our HTML (using RDFa Lite)
  2. Get Google to index all of our embedded structured data
  3. Use Google’s index of our structured data to power our local search for finding aids & digital collections

Where We Are Today

We mapped several of our metadata fields to schema.org terms, then embedded that schema.org data in all 74,000 digital object pages and all 2,100 finding aids. We’re now using Google’s index of that data to power our default search for:

  1. All of our finding aids (a.k.a. collection guides).  [Example search for “photo”]
  2. One digital collection: Sidney Gamble Photographs. [Example search for “beijing”]

Though the strategy is the same, some of the implementation details are different between our finding aids and digital collections applications. Here are the main differences:

Site               | Service                                               | Google CSE API | Max Results per Query
-------------------|-------------------------------------------------------|----------------|----------------------
Finding Aids       | Google Custom Search (free)                           | JS v1.0        | 100
Digital Collection | Google Site Search (premium version of Custom Search) | XML API        | 1,000

 

Finding Aids Search

Embedding the Data. We kept it super simple here. We labeled every finding aid page a ‘CollectionPage’ and tagged only a few properties: name, description, creator, and if present, a thumbnailUrl for a collection with digitized content.

Schema.org tags using RDFa Lite in finding aid HTML
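
In spirit, the markup is no more complicated than this sketch (the collection name and values are invented; the real thing is in the screenshot above):

    <div vocab="http://schema.org/" typeof="CollectionPage">
      <h1 property="name">Guide to the Example Family Papers, 1890-1955</h1>
      <img property="thumbnailUrl" src="http://library.duke.edu/path/to/collection-thumb.jpg" alt="">
      <p property="description">Correspondence, photographs, and business records documenting ...</p>
      <p>Creator: <span property="creator">Example Family</span></p>
    </div>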

Rendering Search Results Using Google’s Index. 

This worked great. We used a Google Custom Search Element (CSE) and created our own “rich snippets” using the CSE JavaScript API (v1.0) and the handy templating options Google provides. You can simply “View Source” to see the underlying code: it’s all there in the HTML. The HTML5 data- attributes set all the content and the display logic.

Google JavaScript objects used in search result snippet presentation.

 

Digital Collections Search: Sidney D. Gamble Collection

Embedding the Data.

Our digital collections introduce more complexity in the structured data than we see in our finding aids. Naturally, we have a wide range of item types with diverse metadata. We want our markup to represent the relationship of an item to its source collection. The item, the webpage that it’s on, the collection it came from, and the media files associated with it all have properties that can be expressed using schema.org terms. So, we tried it all.[1]

Example Schema.org tags used in item pages

Rendering Search Results Using Google’s Index. 

For the Gamble collection, we succeeded in making queries hit Google’s XML API while preserving the look of our existing search results. Note that the facets on the left side aren’t powered by Google; we haven’t gotten far enough in our experiment to filter the result set based on the structured data, but that’s possible to do.

Search result rendering using Google’s XML API

Outcomes 

 

The Good

We’ve been pleased with the ability to make our own rich snippets and highly customize the appearance of search results without having to do a ton of development. Getting our structured data back from Google’s index to work with is an awesome service, and developing around the schema.org properties we were already providing has been a nice way to kill two birds with one stone.

For performance, Google CSE is working well in both the finding aids and the Gamble digital collection search for these purposes:

  • getting the most relevant content presented early on in the search result
  • getting results quickly
  • handling non-Roman characters in search terms
  • retrieving a needle in a haystack — an item or handful of items that contain some unique text

The Gotchas

While Google CSE  shows relevant results quickly, we’re finding it’s not a good fit for exploratory searching when either of these aspects is important:

  • getting a stable and precise count of relevant results
  • browsing an exhaustive list of results that match a general query

Be careful: queries max out at 100 results with the JavaScript APIs or 1,000 results when using the XML API.  Those limits aren’t obvious in the documentation, yet they might be a deal-breaker for some potential uses.

For queries with several pages of hits, you may get an estimated result count that’s close, but unfortunately things occasionally and inexplicably go sour as you navigate from one result page to the next.  E.g., the Gamble digital collection query ‘beijing’ shows about 2,100 results (which is in the ballpark of what Solr returns), yet browse a few pages in and the result set will get truncated severely: you may only be able to actually browse about 200 of the results without issuing more specific query terms.

Other Considerations

Impact on External Discovery

Traffic to digital collections via external search engines has mostly climbed steadily every quarter for the past few years, from 26% of all visits in Jul-Sep 2011 up to 44% from Jan-Mar 2014 (to date) [2]. We entered schema.org tags in Oct 2012; however, we don’t know whether adding that data has contributed at all to this trend. Does schema.org data impact relevance? It’s hard to tell.

Structured Data Syntax + Google APIs

Though RDFa Lite and microdata should be equally acceptable ways to add schema.org tags, Google’s APIs actually work better with microdata if there are nested item types.[3]  And regardless of microdata or RDFa, the Google CSE JavaScript API unfortunately can’t access more than one value for any given property, so that can be problematic [4].

Rich Snippets in Big Google

We’re seeing Google render rich snippets for our videos, because we’ve marked them as schema.org VideoObjects with properties like thumbnailUrl. That’s encouraging! Perhaps someday Google will render better snippets for things like photographs (of which we have a bunch), or maybe even more library domain-specific materials like digitized oral histories, manuscripts, and newspapers.  But at present, none of our other objects seem to trigger nice snippets like this.

A rich snippet triggered by using the schema.org VideoObject type & thumbnailUrl property.

Footnotes

[1] We represented item pages as schema.org “ItemPage” types, using the “isPartOf” property to relate the item page to its corresponding “CollectionPage”. We made the ItemPage “about” a “CreativeWork”. Then we created mappings for many of our metadata fields to CreativeWork properties, e.g., creator, contentLocation, genre, dateCreated.
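
A stripped-down illustration of that structure (the values are invented; the real markup appears in the item-page screenshot earlier in the post):

    <body vocab="http://schema.org/" typeof="ItemPage">
      <link property="isPartOf" href="http://library.duke.edu/digitalcollections/gamble/">
      <div property="about" typeof="CreativeWork">
        <h1 property="name">Temple of Heaven</h1>
        <span property="creator">Gamble, Sidney D.</span>
        <span property="contentLocation">Beijing (China)</span>
        <span property="genre">Black-and-white photographs</span>
        <span property="dateCreated">circa 1917-1919</span>
      </div>
    </body>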

[2] Digital Collections External Search Traffic by Quarter

Quarter    Visits via Search   % Visits via Search

Jul – Sep 2011   26,621   25.97%
Oct – Dec 2011   32,191   29.59%
Jan – Mar 2012   41,048   32.16%
Apr – Jun 2012   33,872   34.49%
Jul – Sep 2012   28,250   32.40%
Oct – Dec 2012   38,472   36.52% <– entered schema.org tags Oct 19, 2012
Jan – Mar 2013   39,948   35.29%
Apr – Jun 2013   36,641   38.30%
Jul – Sep 2013   35,058   41.88%
Oct – Dec 2013   46,082   43.98%
Jan – Mar 2014   47,123   43.93%

[3] For example, if your RDFa indicates that “an ItemPage is about a CreativeWork whose creator is Sidney Gamble,” the creator of the CreativeWork is not accessible to the API, since the CreativeWork is not a top-level item. To get around that, we had to duplicate all the CreativeWork properties in the HTML <head>, which is unnatural and a bit of a hack.

[4]  Google’s CSE JS APIs also don’t let us retrieve the data when there are multiple values specified for the same field. For a given CreativeWork, we might have six locations that are all important to represent: China; Beijing (China); Huabei xie he nu zi da xue (Beijing, China); 中国; 北京;  华北协和女子大学.  The JSON returned by the API only contains the first value: ‘China’. This, plus the result count limit, made the XML API our only viable choice for digital collections.