This week, in conjunction with our H. Lee Waters Film Collection unveiling, we rolled out a handy new Embed feature for digital collections items. The idea is to make it as easy as possible for someone to share their discoveries from our collections, with proper attribution, on other websites or blogs.
It’s simple, really, and mimics the experience you’re likely to encounter getting embed code from other popular sites with videos, images, and the like. We modeled our approach loosely on the Internet Archive’s video embed service (e.g., visit this video and click the Share icon, but only if you are unafraid of clowns).
Click the “Embed” link under an item from Duke Digital Collections, and copy the snippet of code that pops up. Paste it in your website, and you’re done!
I’ll paste a few examples below using different kinds of items. The embed code is short and nearly identical for all of these:
A Single Image
Document with Document Viewer
Building this feature required a little bit of math, some trial & error, and a few tricks. The steps were to:
Set up a service to return customized item pages at the path http://library.duke.edu/digitalcollections/embed/<itemid>/
Use CSS & JS to make the media as fluid as possible to fill whatever space it ends up in
Use a fixed height and overflow: auto on the attribution box so longer content will scroll
Use <link rel="canonical"> to ensure the item’s embed page is associated with the real item page (especially to improve links / ranking signals for search engines).
Present the user a copyable HTML <iframe> element in the regular item page that has the correct height & width attributes to accommodate the item(s) to be embedded
This last point is where the math comes in. Take a single image item, for example. With a landscape-orientation image we need to give the user a different <iframe> height to copy than we would for a portrait. It gets even more complicated when we have to account for multiple tracks of audio or video, or combinations of the two.
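As a rough illustration of the height calculation for a single image, here is a minimal Python sketch. Only the embed path pattern comes from the steps above; the embed width, the attribution box height, and the function names are assumptions for illustration, since the actual values and formula aren’t given here.

```python
import math

EMBED_BASE = "http://library.duke.edu/digitalcollections/embed/"  # path pattern from the steps above
ATTRIBUTION_HEIGHT = 90  # hypothetical fixed height (px) of the attribution box


def iframe_size(image_width, image_height, embed_width=500):
    """Return (width, height) for an iframe holding a scaled image plus attribution."""
    # Scale the image to the embed width, preserving its aspect ratio.
    scaled_height = math.ceil(image_height * embed_width / image_width)
    return embed_width, scaled_height + ATTRIBUTION_HEIGHT


def embed_snippet(item_id, image_width, image_height, embed_width=500):
    """Build a copyable <iframe> element for a single-image item."""
    width, height = iframe_size(image_width, image_height, embed_width)
    return (
        f'<iframe src="{EMBED_BASE}{item_id}/" '
        f'width="{width}" height="{height}" frameborder="0"></iframe>'
    )


# At the same width, a landscape image needs a shorter iframe than a portrait one.
landscape = iframe_size(3000, 2000)  # (500, 424)
portrait = iframe_size(2000, 3000)   # (500, 840)
```

Items with multiple audio or video tracks would need extra terms in the height (one per track, say), which is where the real calculation gets more involved than this sketch.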
We’ll refine this feature a bit in the coming weeks, and work out any embed-bugs we discover. We’ll also be developing a similar feature for embedding digitized content found in our archival collection guides.
This week, we added a display of our most recent Bitstreams blog posts to our Digital Collections homepage (example), and likewise, a view of posts relevant to a given collection on the respective collection’s homepage (example).
Our Digital Projects & Production team has been writing in Bitstreams at least weekly since February 2014. We’ve had some excellent guest contributors, too. Some posts share updates about new digital collections or additions, while others share insights, lessons learned, and behind-the-scenes looks at the projects we’re currently tackling.
Many of our posts have been featured on our library homepage and library news site. But until now, we haven’t been able to display any of them—not even the ones about new digital collections—alongside the collections themselves. So, if you visited the DukEngineer collection in the past, you likely missed out on Melanie’s excellent overview, which puts the magazine in context and highlights the best of what’s inside.
Syndicating tagged blog posts for display elsewhere is a pretty common use case, and we’ve used a bunch of different solutions as our platforms have evolved. Each solution has naturally been painstakingly tailored to accommodate the inner workings of both the source and the destination. Seven years ago, we were writing custom XSLT to create and then consume our own RSS feeds in Cascade Server CMS. We have since hopped over to WordPress for managing news and blogs (whew!). An older version of our digital collections app used WordPress’ XML-RPC API to get tagged posts and parsed them with Python.
These days, our library website does blog syndication by using a combo of WordPress RSS, Drupal’s feed aggregator module, and occasionally Yahoo! Pipes for data mashing and munging. It works well in Drupal, but other platforms require other approaches.
Among Angular’s virtues is that it really simplifies the process of getting and using JSON data from an API. I found WordPress’ JSON API plugin, which, interestingly, was developed by staff at MoMA so they could use WordPress as a back-end to a site with a Rails front-end. So we first had to enable that for our Bitstreams blog.
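To show the general shape of the data handling (picking out posts tagged for a given collection from a JSON response), here is a small sketch. It uses Python rather than AngularJS, so it isn’t the code behind this feature. The response structure (a `posts` list whose entries carry `title`, `url`, and a `tags` list with a `slug`) is an assumption about the plugin’s output, and the sample data is invented.

```python
import json

# Invented sample payload; the field names are assumed, not taken from the plugin docs.
SAMPLE_RESPONSE = """
{
  "status": "ok",
  "posts": [
    {"title": "Post A", "url": "https://example.org/a", "tags": [{"slug": "dukengineer"}]},
    {"title": "Post B", "url": "https://example.org/b", "tags": [{"slug": "video"}]},
    {"title": "Post C", "url": "https://example.org/c", "tags": []}
  ]
}
"""


def posts_for_collection(response_text, collection_slug, limit=3):
    """Return (title, url) pairs for posts tagged with the given collection slug."""
    data = json.loads(response_text)
    matches = [
        (post["title"], post["url"])
        for post in data.get("posts", [])
        if any(tag.get("slug") == collection_slug for tag in post.get("tags", []))
    ]
    return matches[:limit]


recent = posts_for_collection(SAMPLE_RESPONSE, "dukengineer")
```

In an AngularJS version, the equivalent of `posts_for_collection` would live in the controller, with the view only responsible for rendering the resulting list.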
AngularJS definitely helps keep code clean, especially by abstracting the model (the blogposts & associated characteristics, as well as the page state) from the view (indicates how to display the data) from the controller (gets and refines the data into the model, updates the model upon interactions with the view). I’ve done several projects in the past using jQuery and DOM manipulation to retrieve and display data. It usually works, but in the process I create a veritable rat’s nest of spaghetti code wherein /* no amount of commenting */ can truly help disentangle what’s happening.
AngularJS has a steeper learning curve than I’d expected; I assumed I could do this mini-project in a few hours, but it took a couple days to really get a handle on the basic pieces I needed for this project.
I consider this an encouraging proof of concept. While our own blog posts can be interesting, there are many other sources of valuable data out in the world that are relevant to our collections that would add value for our researchers if we were able to easily get and display them. AngularJS won’t be the answer to all of these needs, but it’s nice to have in the toolset.
Everyone knows that Twitter limits each post to 140 characters. Early criticism has since cooled and most people agree it’s a helpful constraint, circumvented through clever (some might say better) writing, hyperlinks, and URL-shorteners. But as a reader of tweets, how do you know what lies at the other end of a shortened link? What entices you to click? The tweet author can rarely spare the characters to attribute the source site or provide a snippet of content, and can’t be expected to attach a representative image or screenshot.
Our webpages are much more than just mystery destinations for shortened URLs. Twitter agrees: its developers want help understanding what the share-worthy content from a webpage actually is in order to present it in a compelling way alongside the 140 characters or less. Enter two library hallmarks: vocabularies and metadata.
This week, we added Twitter Card metadata in the <head> of all of our digital collections pages and in our library blogs. This data instantly made all tweets and retweets linking to our pages far more interesting. Check it out!
For the blogs, tweets now display the featured image, post title, opening snippet, site attribution, and a link to the original post. Links to items from digital collections now show the image itself (along with some item info), while links to collections, categories, or search results now display a grid of four images with a description underneath.
Why This Matters
In 2013-14, social media platforms accounted for 10.1% of traffic to our blogs (~28,000 visits in 2013-14, 11,300 via Twitter), and 4.3% of visits to our digital collections (~17,000 visits, 1,000 via Twitter). That seems low, but perhaps it’s because of the mystery link phenomenon. These new media-rich tweets have the potential to increase our traffic through these channels by being more interesting to look at and more compelling to click. We’re looking forward to finding out whether they do.
And regardless of driving clicks, there are two other benefits of Twitter Cards that we really care about in the library: context and attribution. We love it when our collections and blog posts are shared on Twitter. These tweets now automatically give some additional information and helpfully cite the source.
How to Get Your Own Twitter Cards
The Manual Way
If you’re manually adding tags like we’ve done in our Digital Collections templates, you can “View Source” on any of our pages to see what <meta> tags make the magic happen. Moz also has some useful code snippets to copy, with links to validator tools so you can make sure you’re doing it correctly.
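For a sense of what those tags look like, here is a minimal sketch that generates a basic set of Twitter Card <meta> tags. The `twitter:*` names are standard card properties, but the function name, the choice of card type, and all of the example values are assumptions for illustration, not a copy of what our templates emit.

```python
from html import escape


def twitter_card_tags(title, description, image_url, site_handle,
                      card_type="summary_large_image"):
    """Return Twitter Card <meta> tags as a newline-joined string."""
    fields = {
        "twitter:card": card_type,
        "twitter:site": site_handle,
        "twitter:title": title,
        "twitter:description": description,
        "twitter:image": image_url,
    }
    # Escape values so quotes or ampersands in metadata can't break the markup.
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in fields.items()
    )


# Hypothetical values, for illustration only.
tags = twitter_card_tags(
    title='A "quoted" title',
    description="A short description of the item.",
    image_url="https://example.org/images/item.jpg",
    site_handle="@example",
)
```

The output belongs in the page’s <head>; the validator mentioned below is the place to confirm that Twitter reads it as intended.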
Since our blogs run on WordPress, we were able to use the excellent WordPress SEO plugin by Yoast. It’s helpful for a lot of things related to search engine optimization, and it makes this social media optimization easy, too.
Once your tags are in place, you just need to validate an example from your domain using the Twitter Card Validator before Twitter will turn on the media-rich tweets. It doesn’t take long at all: ours began appearing within a couple hours. The cards apply retroactively to previous tweets, too.
Our addition of Twitter Card data follows similar work we have done using semantic markup in our Digital Collections site using the Open Graph and Schema.org vocabularies. Open Graph is a standard developed by Facebook. Similar to Twitter Card metadata, OG tags inform Facebook what content to highlight from a linked webpage. Schema.org is a vocabulary for describing the contents of web pages in a way that is helpful for retrieval and representation in Google and other search engines.
All of these tools use RDFa syntax, a key cornerstone of Linked Data on the web that supports the description of resources using whichever vocabularies you choose. Google, Twitter, Facebook, and other major players in our information ecosystem are now actively using this data, providing clear incentive for web authors to provide it. We should keep striving to play along.
Back in February 2014, we wrapped up the CCC project, a collaborative three-year IMLS-funded digitization initiative with our partners in the Triangle Research Libraries Network (TRLN). The full title of the project is a mouthful, but it captures its essence: “Content, Context, and Capacity: A Collaborative Large-Scale Digitization Project on the Long Civil Rights Movement in North Carolina.”
So how large is “large-scale”? By comparison, when the project kicked off in summer 2011, we had a grand total of 57,000 digitized objects available online (“published”), collectively accumulated through sixteen years of digitization projects. That number was 69,000 by the time we began publishing CCC manuscripts in June 2012. Putting just as many documents online in three years as we’d been able to do in the previous sixteen naturally requires a much different approach to creating digital collections.
Traditional Digitization | Large-Scale Digitization
Individual items identified during scanning | No item-level identification: entire folders scanned
Descriptive metadata applied to each item | Archival description only (e.g., at the folder level)
CCC staff completed qualitative and quantitative evaluations of this large-scale digitization approach during the course of the project, ranging from conducting user focus groups and surveys to analyzing the impact on materials prep time and image quality control. Researcher assessments targeted three distinct user groups: 1) Faculty & History Scholars; 2) Undergraduate Students (in research courses at UNC & NC State); 3) NC Secondary Educators.
Ease of Use. Faculty and scholars, for the most part, found it easy to use digitized content presented this way. Undergraduates were more ambivalent, and secondary educators had the most difficulty.
To Embed or Not to Embed. In 2012, Duke was the only library presenting the image thumbnails embedded directly within finding aids and a lightbox-style image navigator. Undergrads who used Duke’s interface found it easier to use than UNC or NC Central’s, and Duke’s collections had a higher rate of images viewed per folder than the other partners. UNC & NC Central’s interfaces now use a similar convention.
Potential for Use. Most users surveyed said they could indeed imagine themselves using digitized collections presented in this way in the course of their research. However, the approach falls short in meeting key needs for secondary educators’ use of primary sources in their classes.
Desired Enhancements. The top two most desired features by faculty/scholars and undergrads alike were 1) the ability to search the text of the documents (OCR), and 2) the ability to explore by topic, date, document type (i.e., things enabled by item-level metadata). PDF download was also a popular pick.
Impact on Duke Digitization Projects
Since the moment we began putting our CCC manuscripts online (June 2012), we’ve completed the eight CCC collections using this large-scale strategy, and an additional eight manuscript collections outside of CCC using the same approach. We have now cumulatively put more digital objects online using the large-scale method (96,000) than we have via traditional means (75,000). But in that time, we have also completed eleven digitization projects with traditional item-level identification and description.
We see the large-scale model for digitization as complementary to our existing practices: a technique we can use to meet the publication needs of some projects.
Do people actually use the collections when presented in this way? Some interesting figures:
Views / item in 2013-14 (traditional digital object; item-level description): 13.2
Views / item in 2013-14 (digitized image within finding aid; folder-level description): 1.0
Views / folder in 2013-14 (digitized folder view in finding aid): 8.5
It’s hard to attribute the usage disparity entirely to the publication method (they’re different collections, for one). But it’s reasonable to deduce (and unsurprising) that bypassing item-level description generally results in less traffic per item.
The takeaway is, sometimes having interesting, important, and timely content available for use online is more important than the features enabled or the process by which it all gets there.
We’ll keep pushing ahead with evolving our practices for putting digitized materials online. We’ve introduced many recent enhancements, like fulltext searching, a document viewer, and embedded HTML5 video. Inspired by the CCC project, we’ll continue to enhance our finding aids to provide access to digitized objects inline for context (e.g., The Jazz Loft Project Records). Our TRLN partners have also made excellent upgrades to the interfaces to their CCC collections (e.g., at UNC, at NC State) and we plan, as usual, to learn from them as we go.
This past week, we were excited to be able to publish a rare 1804 manuscript copy of the Haitian Declaration of Independence on our digital collections website. We used the project as a catalyst for improving our document-viewing user experience, since we knew our existing platforms just wouldn’t cut it for this particular treasure from the Rubenstein Library collection. In order to present the declaration online, we decided to implement the open-source Diva.js viewer. We’re happy with the results so far and look forward to making more strides in our ability to represent documents on our site as the year progresses.
Challenges to Address
We have had two glaring limitations in providing access to digitized collections to date: 1) a less-than-stellar zoom & pan feature for images and 2) a suboptimal experience for navigating documents with multiple pages. For zooming and panning (see example), we use software called OpenLayers, which is primarily a mapping application. And for paginated items we’ve used two plugins designed to showcase image galleries, Galleria (example) and Colorbox (example). These tools are all pretty good at what they do, but we’ve been using them more as stopgap solutions for things they weren’t really created to do in the first place. As the old saying goes, when all you have is a hammer, everything looks like a nail.
Big (OR Zoom-Dependent) Things
Traditionally as we digitize images, whether freestanding or components of a multi-page object, at the end of the process we generate three JPG derivatives per page. We make a thumbnail (helpful in search results or other item sets), medium image (what you see on an item’s webpage), and large image (same dimensions as the preservation master, viewed via the ‘all sizes’ link). That’s a common approach, but there are several places where that doesn’t always work so well. Some things we’ve digitized are big, as in “shoot them in sections with a camera and stitch the images together” big. And we’ve got several more materials like this waiting in the wings to make available. A medium image doesn’t always do these things justice, but good luck downloading and navigating a giant 28MB JPG when all you want to do is zoom in a little bit.
Likewise, an object doesn’t have to be large to really need easy zooming to be part of the viewing experience. You might want to read the fine print on that newspaper ad, see the surgeon general’s warning on that billboard, or inspect the brushstrokes in that beautiful hand-painted glass lantern slide.
And finally, it’s not easy to anticipate the exact dimensions at which all our images will be useful to a person or program using them. Using our data to power an interactive display for a media wall? A mobile app? A slideshow on the web? You’ll probably want images that are different dimensions than what we’ve stored online. But to date, we haven’t been able to provide ways to specify different parameters (like height, width, and rotation angle) in the image URLs to help people use our images in environments beyond our website.
We do love our documentary photography collections, but a lot of our digitized objects are represented by more than just a single image. Take an 11-page piece of sheet music or a 127-page diary, for example. Those aren’t just sequences or collections of images. Their paginated orientation is pretty essential to their representation online, but a lot of what characterizes those materials is unfortunately lost in translation when we use gallery tools to display them.
The Intersection of (Big OR Zoom-Dependent) AND Paginated
Here’s where things get interesting and quite a bit more complicated: when zooming, panning, page navigation, and system performance are all essential to interacting with a digital object. There are several tools out there that support these various aspects, but very few that do them all AND do them well. We knew we needed something that did.
Our Solution: Diva.js
Setting up Diva.js required us to add a few new pieces to our infrastructure. The most significant was an image server (in our case, IIPImage) that could 1) deliver parts of a digital image upon request, and 2) deliver complete images at whatever size is requested via URL parameters.
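Those two kinds of requests can be sketched as URL builders. This is a hedged example: `FIF`, `WID`, `CVT`, and `JTL` are parameters of the IIP protocol as I understand it, while the server endpoint, the image path, and the function names are hypothetical.

```python
from urllib.parse import urlencode

IIP_SERVER = "https://images.example.org/iipsrv/iipsrv.fcgi"  # hypothetical endpoint


def full_image_url(image_path, width):
    """URL for a complete JPEG of the image, scaled to the requested width."""
    # FIF names the source file, WID the output width, CVT the export format.
    query = [("FIF", image_path), ("WID", width), ("CVT", "jpeg")]
    return f"{IIP_SERVER}?{urlencode(query, safe='/,')}"


def tile_url(image_path, resolution, tile_index):
    """URL for one JPEG tile (part of the image) at a given resolution level."""
    # JTL takes "resolution,tile index".
    query = [("FIF", image_path), ("JTL", f"{resolution},{tile_index}")]
    return f"{IIP_SERVER}?{urlencode(query, safe='/,')}"


whole_page = full_image_url("/srv/images/page001.tif", 800)
first_tile = tile_url("/srv/images/page001.tif", 5, 0)
```

A viewer requests many `tile_url`-style URLs as you pan and zoom, while a download link or an external application would use something like `full_image_url`.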
Our Interface: How it Works
By default, we present a document in our usual item page template that provides branding, context, and metadata. You can scroll up and down to navigate pages, use Page Up or Page Down keys, or enter a page number to jump to a page directly. There’s a slider to zoom in or out, or alternatively you can double-click to zoom in / Ctrl-double-click to zoom out. You can toggle to a grid view of all pages and adjust how many pages to view at once in the grid. There’s a really handy full-screen option, too.
It’s optimized for performance via AJAX-driven “lazy loading”: only the page of the document that you’re currently viewing has to load in your browser, and likewise only the visible part of that page image in the viewer must load (via square tiles). You can also download a complete JPG for a page at the current resolution by clicking the grey arrow.
We extended Diva.js by building a synchronized fulltext pane that displays the transcript of the current page alongside the image (and beneath it in full-screen view). That doesn’t come out-of-the-box, but Diva.js provides some useful hooks into its various functions to enable developing this sort of thing. We also slightly modified the styles.
Behind the scenes, we have pyramid TIFF images (one for each page), served up as JPGs by IIPImage server. These files comprise arrays of 256×256 JPG tiles for each available zoom level for the image. Let’s take page 1 of the declaration for example. At zoom level 0 (all the way zoomed out), there’s only one image tile: it’s under 256×256 pixels; level 1 is 4 tiles, level 2 is 12, level 3 is 48, level 4 is 176. The page image at level 5 (all the way zoomed in) includes 682 tiles (example of one), which sounds like a lot, but then again the server only has to deliver the parts that you’re currently viewing.
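The tile counts follow from simple arithmetic: each zoom level halves the image dimensions, and the number of tiles is the number of 256-pixel columns times the number of 256-pixel rows. The sketch below works that out. The page’s pixel dimensions aren’t stated here, so 5500×7800 is an assumed size, chosen because it reproduces the counts above; rounding up at each halving is also an assumption.

```python
import math

TILE_SIZE = 256


def tiles_per_level(width, height, levels, tile_size=TILE_SIZE):
    """Return the tile count at each zoom level, from level 0 (zoomed out) upward."""
    counts = []
    for level in range(levels):
        # The top level is full resolution; each level below it halves the dimensions.
        scale = 2 ** (levels - 1 - level)
        level_width = math.ceil(width / scale)
        level_height = math.ceil(height / scale)
        columns = math.ceil(level_width / tile_size)
        rows = math.ceil(level_height / tile_size)
        counts.append(columns * rows)
    return counts


# Assumed dimensions (not the real ones) that yield the counts quoted above.
counts = tiles_per_level(5500, 7800, levels=6)  # [1, 4, 12, 48, 176, 682]
```

At level 5, for example, this is 22 columns by 31 rows, or 682 tiles, of which a browser window only ever needs a handful at a time.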
Every item using Diva.js also needs to load a JSON stream including the dimensions for each page within the document, so we had to generate that data. If there’s a transcript present, we store it as a single HTML file, then use AJAX to dynamically pull in the part of that file that corresponds to the currently-viewed page in the document.
Diva.js & IIPImage Limitations
It’s a good interface, and is the best document representation we’ve been able to provide to date. Yet it’s far from perfect. There are several areas that are limiting or that we want to explore more as we look to make more documents available in the future.
Out of the box, Diva.js doesn’t support page metadata, transcriptions, or search & retrieval within a document. We do display a synchronized transcript, but there’s currently no mapping between the text and the location within each page where each word appears, nor can you perform a search and discover which pages contain a given keyword. Other folks using Diva.js are working on robust applications that handle these kinds of interactions, but the degree to which they must customize the application is high. See for example, the Salzinnes Antiphonal: a 485-page liturgical manuscript w/text and music or a prototype for the Liber Usualis: a 2,000+ page manuscript using optical music recognition to encode melodic fragments.
Diva.js also has discrete zooming, which can feel a little jarring when you jump between zoom levels. It’s not the smooth, continuous zoom experience that is becoming more commonplace in other viewers.
With the IIPImage server, we’ll likely re-evaluate using Pyramid TIFFs vs. JPEG2000s to see which file format works best for our digitization and publication workflow. In either case, there are several compression and caching variables to tinker with to find an ideal balance between image quality, storage space required, and system performance. We also discovered that the IIP server unfortunately strips out the images’ ICC color profiles when it delivers JPGs, so users may not be getting a true-to-form representation of the image colors we captured during digitization.
Launching our first project using Diva.js gives us a solid jumping-off point for expanding our ability to provide useful, compelling representations of our digitized documents online. We’ll assess how well this same approach would scale to other potential projects and in the meantime keep an eye on the landscape to see how things evolve. We’re better equipped now than ever to investigate alternative approaches and complementary tools for doing this work.
We’ll also engage more closely with our esteemed colleagues in the Duke Collaboratory for Classics Computing (DC3), who are at the forefront of building tools and services in support of digital scholarship. Well beyond supporting discovery and access to documents, their work enables a community of scholars to collaboratively transcribe and annotate items (an incredible–and incredibly useful–feat!). There’s a lot we’re eager to learn as we look ahead.
Over the past year and a half, among our many other projects, we have been experimenting with a creative new approach to powering searches within digital collections and finding aids using Google’s index of our structured data. My colleague Will Sexton and I have presented this idea in numerous venues, most recently and thoroughly for a recorded ASERL (Association of Southeastern Research Libraries) webinar on June 6, 2013.
We’re eager to share what we’ve learned to date and hope this new blog will make a good outlet. We’ve had some success, but have also encountered some considerable pitfalls along the way.
What We Set Out to Do
I won’t recap all the fine details of the project here, but in a nutshell, here are the problems we’ve been attempting to address:
Maintaining our own Solr index takes a ton of time to do right. We don’t have a ton of time.
Staff have noted poor relevance rank and poor support for search using non-Roman characters.
Our digital collections search box is actually used sparsely (in only 12% of visits).
External discovery (e.g., via Google) is of equal or greater importance vs. our local search for these “inside-out” resources.
Get Google to index all of our embedded structured data
Use Google’s index of our structured data to power our local search for finding aids & digital collections
Where We Are Today
We mapped several of our metadata fields to schema.org terms, then embedded that schema.org data in all 74,000 digital object pages and all 2,100 finding aids. We’re now using Google’s index of that data to power our default search for:
Embedding the Data. We kept it super simple here. We labeled every finding aid page a ‘CollectionPage’ and tagged only a few properties: name, description, creator, and if present, a thumbnailUrl for a collection with digitized content.
Rendering Search Results Using Google’s Index.
Digital Collections Search: Sidney D. Gamble Collection
Embedding the Data.
Our digital collections introduce more complexity in the structured data than we see in our finding aids. Naturally, we have a wide range of item types with diverse metadata. We want our markup to represent the relationship of an item to its source collection. The item, the webpage that it’s on, the collection it came from, and the media files associated with it all have properties that can be expressed using schema.org terms. So, we tried it all.
Rendering Search Results Using Google’s Index.
For the Gamble collection, we succeeded in making queries hit Google’s XML API while sustaining the look of our existing search results. Note that the facets on the left side aren’t powered via Google. We haven’t gotten far enough in our experiment to work with filtering the result set based on the structured data, but that’s possible to do.
We’ve been pleased with the ability to make our own rich snippets and highly customize the appearance of search results without having to do a ton of development. Getting our structured data back from Google’s index to work with is an awesome service and developing around the schema.org properties that we were already providing has been a nice way to kill two birds with one stone.
For performance, Google CSE is working well in both the finding aids and the Gamble digital collection search for these purposes:
getting the most relevant content presented early on in the search result
getting results quickly
handling non-Roman characters in search terms
retrieving a needle in a haystack — an item or handful of items that contain some unique text
While Google CSE shows relevant results quickly, we’re finding it’s not a good fit for exploratory searching when either of these aspects is important:
getting a stable and precise count of relevant results
browsing an exhaustive list of results that match a general query
For queries with several pages of hits, you may get an estimated result count that’s close, but unfortunately things occasionally and inexplicably go sour as you navigate from one result page to the next. E.g., the Gamble digital collection query ‘beijing’ shows about 2,100 results (which is in the ballpark of what Solr returns), yet browse a few pages in and the result set will get truncated severely: you may only be able to actually browse about 200 of the results without issuing more specific query terms.
Impact on External Discovery
Traffic to digital collections via external search engines has mostly climbed steadily every quarter for the past few years, from 26% of all visits in Jul-Sep 2011 up to 44% in Jan-Mar 2014 (to date). We entered schema.org tags in Oct 2012; however, we don’t know whether adding that data has contributed at all to this trend. Does schema.org data impact relevance? It’s hard to tell.
Structured Data Syntax + Google APIs
Rich Snippets in Big Google
We’re seeing Google render rich snippets for our videos, because we’ve marked them as schema.org VideoObjects with properties like thumbnailUrl. That’s encouraging! Perhaps someday Google will render better snippets for things like photographs (of which we have a bunch), or maybe even more library domain-specific materials like digitized oral histories, manuscripts, and newspapers. But at present, none of our other objects seem to trigger nice snippets like this.
We represented item pages as schema.org “ItemPage” types using the “isPartOf” property to relate the item page to its corresponding “CollectionPage”. We made the ItemPage “about” a “CreativeWork”. Then we created mappings for many of our metadata fields to CreativeWork properties, e.g., creator, contentLocation, genre, dateCreated.
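As an illustration of that nesting, here is a small sketch that emits an RDFa fragment with an ItemPage that is part of a CollectionPage and about a CreativeWork. The schema.org types and properties are the ones named above; the HTML layout, the function name, the item title, and the collection URL are hypothetical and not copied from our templates.

```python
from html import escape


def item_page_rdfa(item_title, creator, collection_name, collection_url):
    """Return an RDFa fragment: ItemPage, isPartOf CollectionPage, about CreativeWork."""
    def esc(text):
        return escape(text, quote=True)

    return "\n".join([
        '<div vocab="http://schema.org/" typeof="ItemPage">',
        '  <div property="isPartOf" typeof="CollectionPage">',
        f'    <a property="url" href="{esc(collection_url)}">'
        f'<span property="name">{esc(collection_name)}</span></a>',
        '  </div>',
        '  <div property="about" typeof="CreativeWork">',
        f'    <h1 property="name">{esc(item_title)}</h1>',
        f'    <span property="creator">{esc(creator)}</span>',
        '  </div>',
        '</div>',
    ])


# Hypothetical item title and URL, for illustration only.
fragment = item_page_rdfa(
    "Street scene", "Sidney Gamble",
    "Sidney D. Gamble Collection", "https://example.org/gamble/",
)
```

Note that `creator` here belongs to the nested CreativeWork rather than to the top-level ItemPage, which is exactly the kind of nesting that matters when an API only exposes top-level items.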
Digital Collections External Search Traffic by Quarter

Quarter | Visits via Search | % Visits via Search
Jul – Sep 2011 | 26,621 | 25.97%
Oct – Dec 2011 | 32,191 | 29.59%
Jan – Mar 2012 | 41,048 | 32.16%
Apr – Jun 2012 | 33,872 | 34.49%
Jul – Sep 2012 | 28,250 | 32.40%
Oct – Dec 2012 | 38,472 | 36.52% (entered schema.org tags Oct 19, 2012)
Jan – Mar 2013 | 39,948 | 35.29%
Apr – Jun 2013 | 36,641 | 38.30%
Jul – Sep 2013 | 35,058 | 41.88%
Oct – Dec 2013 | 46,082 | 43.98%
Jan – Mar 2014 | 47,123 | 43.93%
For example, if your RDFa indicates that “an ItemPage is about a CreativeWork whose creator is Sidney Gamble,” the creator of the creative work is not accessible to the API since the CreativeWork is not a top-level item. To get around that, we had to duplicate all the CreativeWork properties in the HTML <head>, which is unnatural and a bit of a hack.
 Google’s CSE JS APIs also don’t let us retrieve the data when there are multiple values specified for the same field. For a given CreativeWork, we might have six locations that are all important to represent: China; Beijing (China); Huabei xie he nu zi da xue (Beijing, China); 中国; 北京; 华北协和女子大学. The JSON returned by the API only contains the first value: ‘China’. This, plus the result count limit, made the XML API our only viable choice for digital collections.
Notes from the Duke University Libraries Digital Projects Team