Category Archives: Digital Collections

A Summer Day in the Life of Digital Collections

A recent tweet from my colleague in the Rubenstein Library (pictured above) pretty much sums up the last few weeks at work.  Although I rarely work directly with students and classes, I am still affected by the hustle and bustle in the library when classes are in session.  Throughout the busy Spring I found myself saying, "Oh, I'll have time to work on that over the Summer."  Now Summer is here, so it is time to make progress on those delayed projects while keeping others moving forward.  With that in mind, here is your late Spring and early Summer round-up of Digital Collections news and updates.

Radio Haiti

A preview of the soon-to-be-live Radio Haiti Archive digital collection.

The long-anticipated launch of the Radio Haiti Archives is upon us.  After many meetings to review the metadata profile, discuss modeling relationships between recordings, and find a pragmatic approach to representing metadata in 3 languages all in the Duke Digital Repository public interface, we are now in preview mode, and it is thrilling.  Behind the scenes, Radio Haiti represents a huge step forward in the Duke Digital Repository's ability to store and play back audio and video files.

You can already listen to many recordings via the Radio Haiti collection guide, and we will share the digital collection with the world in late June or early July.  In the meantime, check out this teaser image of the homepage.

Section A

My colleague Meghan recently wrote about our ambitious Section A digitization project, which will result in creating finding aids for and digitizing 3000+ small manuscript collections from the Rubenstein Library.  This past week the 12 people involved in the project met to review our workflow.  Although we are trying to take a mass-digitization, streamlined approach to this project, there are still a lot of people and steps.  For example, we spent about 20-30 minutes of our 90-minute meeting reviewing the various status codes we use on our giant Google spreadsheet and when to update them.  I've also created a 6-page project plan that encompasses both a high- and medium-level view of the project.  In addition to that document, each part of the process (appraisal, cataloging review, digitization, etc.) also has its own more detailed documentation.  This project is going to last at least a few years, so taking the time to document every step is essential, as is agreeing on status codes and how to use them.  It is a big process, but with every box the project gets a little easier.

Status codes for tracking our evaluation, remediation, and digitization workflow.
Section A Project Plan Summary

Diversity and Inclusion Digitization Initiative Proposals and Easy Projects

As Bitstreams readers and DUL colleagues know, this year we instituted 2 new processes for proposing digitization projects.  The second deadline for digitization initiative proposals has just passed (it was June 15), and in June and early July the review committee and I will evaluate the new proposals as well as reevaluate 2 proposals held over from the first round.  I'm excited to say that we have already approved one project outright (the Emma Goldman papers), and we plan to announce more approved projects later this Summer.

We also codified "easy project" guidelines and have received several easy project proposals.  It is still too soon to fully assess this new pipeline, but so far it is going well.

Transcription and Closed Captioning

Speaking of A/V developments, another large project planned for this Summer is to begin codifying our captioning and transcription practices.  Duke Libraries has had a mandate to create transcriptions and closed captions for newly digitized A/V for over a year.  In that time we have been working with vendors on selected projects.  Our next steps proceed on two fronts: on the programmatic side, we need to review the time and expense captioning efforts have incurred so far and see how we can scale our efforts to our backlog of publicly accessible A/V.  On the technology side, I've partnered with one of our amazing developers to sketch out a multi-phase plan for storing captions and time-coded transcriptions and making them accessible and searchable in our user interface.  The first phase goes into development this Summer.  All of these efforts will no doubt be the subject of a future blog post.

Testing VTT captions of Duke Chapel Recordings in JWPlayer
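For the curious, the VTT format being tested above is plain text: a WEBVTT header followed by time-coded cues. Here is a minimal example (the cue text is invented for illustration):

```
WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome to Duke Chapel.

00:00:05.000 --> 00:00:09.000
Today's reading comes from the Book of Psalms.
```

Players like JWPlayer and the HTML5 `<track>` element can load a file like this alongside the audio or video stream.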

Summer of Documentation

My aspirational Summer project this year is to update digital collections project tracking documentation; review, consolidate, replace, or trash existing digital collections documentation; and work with the Digital Production Center to create a DPC manual.  Admittedly, writing and reviewing documentation is not the most exciting Summer plan, but with so many projects and collaborators in the air, this documentation is essential to our productivity, communication practices, and my personal sanity.

Late Spring Collection Launches and Migrations

Over the past few months we launched several new digital collections and completed the migration of a number of collections from our old platform into the Duke Digital Repository.

New Collections:

Migrated Collections:

…And so Much More!

In addition to the projects above, we continue to make slow and steady progress on our MSI system, are exploring the FFV1 format for preserving selected moving image collections, are planning the next phase of the Digital Collections migration into the Duke Digital Repository, are thinking deeply about collection-level and structured metadata, are preparing to launch newly digitized Gedney images, are integrating digital objects into finding aids, and more.  No doubt some of these efforts will appear in subsequent Bitstreams posts.  In the meantime, let's all try not to let this Summer fly by too quickly!

Enjoy Summer while you can!

The ABCs of Digitizing Section A

I’m not sure anyone who currently works in the library has any idea when the phrase “Section A” was first coined as a call number for small manuscript collections. Before the library’s renovation, before we barcoded all our books and boxes — back when the Rubenstein was still RBMSCL, and our reading room carpet was a very bright blue — there was a range of boxes holding single-folder manuscript collections, arranged alphabetically by collection creator. And this range was called Section A.

Box 175 of Section A

Presumably there used to be a Section B, Section C, and so on — and it could be that the old shelf ranges were tracked this way, I’m not sure — but the only one that has persisted through all our subsequent stacks moves and barcoding projects has been Section A. Today there are about 3900 small collections held in 175 boxes that make up the Section A call number. We continue to add new single-folder collections to this call number, although thanks to the miracle of barcodes in the catalog, we no longer have to shift files to keep things in perfect alphabetical order. The collections themselves have no relationship to one another except that they are all small. Each collection has a distinct provenance, and the range of topics and time periods is enormous — we have everything from the 17th to the 21st century filed in Section A boxes. Small manuscript collections can also contain a variety of formats: correspondence, writings, receipts, diaries or other volumes, accounts, some photographs, drawings, printed ephemera, and so on. The bang-for-your-buck ratio is pretty high in Section A: though small, the collections tend to be well-described, meaning that there are regular reproduction and reference requests. Section A is used so often that in 2016, Rubenstein Research Services staff approached Digital Collections to propose a mass digitization project, re-purposing the existing catalog description into digital collections within our repository. This will allow remote researchers to browse all the collections easily, and also reduce repetitive reproduction requests.

This project has been met with enthusiasm and trepidation from staff since last summer, when we began to develop a cross-departmental plan to appraise, enhance the description of, and digitize the 3900 small manuscript collections that are housed in Section A.  It took us a bit of time, partially due to the migration and other pressing IT priorities, but this month we are celebrating a major milestone: we have finally launched our first 2 Section A collections, meant to serve as a proof of concept as well as a chance for us to firmly define the project's goals and scope.  Check them out: Abolitionist Speech, approximately 1850, and the A. Brouseau and Co. Records, 1864-1866.  (Appropriately, we started by digitizing the collections that begin with the letter A.)

A. Brouseau & Co. Records carpet receipts, 1865

Why has it been so complicated? First, the sheer number of collections is daunting; while there are plenty of digital collections with huge item counts already in the repository, they tend to come from a single or a few archival collections. Each newly-digitized Section A collection will be a new collection in the repository, which has significant workflow repercussions for the Digital Collections team. There is no unifying thread for Section A collections, so we are not able to apply metadata in batch like we would normally do for outdoor advertising or women’s diaries. Rubenstein Research Services and Library Conservation Department staff have been going box by box through the collections (there are about 25 collections per box) to identify out-of-scope collections (typically reference material, not primary sources), preservation concerns, and copyright concerns. These are excluded from the digitization process. Technical Services staff are also reviewing and editing the Section A collections’ description. This project has led to our enhancing some of our oldest catalog records — updating titles, adding subject or name access, and upgrading the records to RDA, a relatively new standard. Using scripts and batch processes (details on GitHub), the refreshed MARC records are converted to EAD files for each collection, and the digitized folder is linked through ArchivesSpace, our collection management system. We crosswalk the catalog’s name and subject access data to both the finding aid and the repository’s metadata fields, allowing the collection to be discoverable through the Rubenstein finding aid portal, the Duke Libraries catalog, and the Duke Digital Repository.
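The actual conversion scripts are linked on GitHub, as noted above. Purely as an illustration of the crosswalk idea, here is a schematic Python sketch that maps a few catalog fields (a plain dict standing in for a parsed MARC record) into a minimal EAD skeleton; the field names and structure are simplified and are not our production mapping:

```python
import xml.etree.ElementTree as ET

def marc_to_ead(record):
    """Crosswalk a few catalog fields into a minimal EAD skeleton.

    `record` is a plain dict standing in for a parsed MARC record;
    the mapping here is illustrative, not the real crosswalk."""
    ead = ET.Element("ead")
    archdesc = ET.SubElement(ead, "archdesc", level="collection")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unittitle").text = record["title"]
    ET.SubElement(did, "unitdate").text = record["date"]
    # Name/subject access crosswalked into controlled access terms
    controlaccess = ET.SubElement(archdesc, "controlaccess")
    for subject in record.get("subjects", []):
        ET.SubElement(controlaccess, "subject").text = subject
    return ET.tostring(ead, encoding="unicode")

xml = marc_to_ead({
    "title": "A. Brouseau and Co. Records",
    "date": "1864-1866",
    "subjects": ["Carpets -- Trade"],
})
```

The same field values can be crosswalked to the repository's metadata fields, which is what lets one set of catalog edits feed the finding aid portal, the library catalog, and the Duke Digital Repository at once.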

It has been really exciting to see the first two collections go live, and there are many more already digitized and just waiting in the wings for us to automate some of our linking and publishing processes. Another future development that we expect will speed up the project is a batch ingest feature for collections entering the repository. With over 3000 collections to ingest, we are eager to streamline our processes and make things as efficient as possible. Stay tuned here for more updates on the Section A project, and keep an eye on Digital Collections if you’d like to explore some of these newly-digitized collections.

New Digitization Project Proposal Process and Call for Proposals

At Duke University Libraries (DUL), we are embarking on a new way to propose digitization projects.  This isn’t a spur of the moment New Year’s resolution I promise, but has been in the works for months.  Our goal in making a change to our proposal process is twofold: first, we want to focus our resources on specific types of projects, and second, we want to make our efforts as efficient as possible.

Introducing Digitization Initiatives

The new proposal workflow centers on what we are calling "digitization initiatives": groups of digitization projects that relate to a specific theme or characteristic.  DUL's Advisory Council for Digital Collections develops guidelines for an initiative and then issues a call for proposals to the library.  Once the call has been issued, library staff can submit proposals on or before one of two deadlines over a 6-month period.  Following submission, proposals will be vetted, and accepted proposals will move on to implementation.  Our previous system had no deadlines, and proposals were asked to demonstrate broad strategic importance only.

DUL is issuing our first call for proposals now, and if this system proves successful we will develop a second digitization initiative to be announced in 2018.

I’ll say more about why we are embarking on this new system later, but first I would like to tell you about our first digitization initiative.

Call for Proposals

Duke University Libraries’ Advisory Council for Digital Collections has chosen diversity and inclusion as the theme of our first digitization initiative.  This initiative draws on areas of strategic importance both for DUL (as noted in the 2016 strategic plan) and the University.  Prospective champions are invited to think broadly about definitions of diversity and inclusion and how particular collections embody these concepts, which may include, but are not limited to, race, religion, class, ability, socioeconomic status, gender, political beliefs, sexuality, age, and nation of origin.

Full details of the call for proposals here: https://duke.box.com/s/vvftxcqy9qmhtfcxdnrqdm5kqxh1zc6t

Proposals will be due on March 15, 2017 or June 15, 2017.

Proposing non-diversity and inclusion related projects

We have not forgotten about all the important digitization proposals that support faculty, important campus or off-campus partnerships, and special events.  In our experience, these are often small projects that do not require a lot of extra conservation, technical services, or metadata support, so we are creating an "easy" project pipeline.  This will be a more lightweight process that still requires a proposal but less strategic vetting at the outset.  More details on these projects will come out in late January or February, so stay tuned.

Why this change?

I mentioned above that we are moving to this new system to meet two goals.  First, it will allow us to focus proposal and vetting resources on projects that meet a specific strategic goal as articulated by an initiative's guidelines.  Additionally, over the last few years we have received a huge variety of proposals: some are small "no brainer" proposals while others are extremely large and complicated.  We had only one system for proposing and reviewing them all, and sometimes it seemed like too much process and sometimes too little.  In other words, one process size does not fit all.  By dividing our process into strategically focused proposals on the one hand and easy projects on the other, we can spend more of our Advisory Council's time on proposals that need it and get the smaller ones straight into the hands of the implementation team.

Another benefit of this process is that proposal deadlines will allow the implementation team to batch various aspects of our work (batching similar types of work makes it go faster).  The deadlines will also allow us to better coordinate the digitization-related work performed by other departments.  I often find myself asking departments to fit digitization projects in around their already busy schedules, which feels rushed and can create unnecessary stress.  If the implementation team has a queue of projects to address, then we can schedule that work well in advance.

I’m really excited to see this new process get off the ground, and I’m looking forward to seeing all the fantastic proposals that will result from the Diversity and Inclusion initiative!

Cutting Through the Noise

Noise is an inescapable part of our sonic environment.  As I sit at my quiet library desk writing this, I can hear the undercurrent of the building's pipes and HVAC systems, the click-clack of the Scribe overhead book scanner, footsteps from the floor above, doors opening and closing in the hallway, and the various rustlings of my own fidgeting.  In our daily lives, our brains tune out much of this extraneous noise to help us focus on the task at hand and be alert to sounds conveying immediately useful information: a colleague's voice, a cell-phone buzz, a fire alarm.

When sound is recorded electronically, however, this tuned-out noise is often pushed to the foreground.  This may be due to the recording conditions (e.g. a field recording done on budget equipment in someone’s home or outdoors) or inherent in the recording technology itself (electrical interference, mechanical surface noise).  Noise is always present in the audio materials we digitize and archive, many of which are interviews, oral histories, and events recorded to cassette or open reel tape by amateurs in the field.  Our first goal is to make the cleanest and most direct analog-to-digital transfer possible, and then save this as our archival master .wav file with no alterations.  Once this is accomplished, we have some leeway to work with the digital audio and try to create a more easily listenable and intelligible access copy.

I recently started experimenting with Steinberg WaveLab software to clean up digitized recordings from the Larry Rubin Papers.  This collection contains some amazing documentation of Rubin’s work as a civil rights organizer in the 1960s, but the ever-present hum & hiss often threaten to obscure the content.  I worked with two plug-ins in WaveLab to try to mitigate the noise while leaving the bulk of the audio information intact.

Even if you don’t know it by name, anyone who has used electronic audio equipment has probably heard the dreaded 60 Cycle Hum.  This is a fixed low-frequency tone tied to the 60 Hz alternating current of the United States power grid.  Due to improper grounding and electromagnetic interference from nearby wires and appliances, this current can leak into our audio signals and appear as the ubiquitous 60 Hz hum (disclaimer–you may not be able to hear this as well on tiny laptop speakers or earbuds).  Wavelab’s De-Buzzer plug-in allowed me to isolate this troublesome frequency and reduce its volume level drastically in relation to the interview material.  Starting from a recommended preset, I adjusted the sensitivity of the noise reduction by ear to cut unwanted hum without introducing any obvious digital artifacts in the sound.

Similarly omnipresent in analog audio is High-Frequency Hiss.  This wash of noise is native to any electrical system (see Noise Floor) and is especially problematic in tape-based media where the contact of the recording and playback heads against the tape introduces another level of “surface noise.”  I used the De-Noiser plug-in to reduce hiss while being careful not to cut into the high-frequency content too much.  Applying this effect too heavily could make the voices in the recording sound dull and muddy, which would be counterproductive to improving overall intelligibility.
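Broadband hiss can't be notched out at a single frequency; de-noisers instead estimate the noise floor across the whole spectrum and subtract it. Here is a crude sketch of that spectral-subtraction idea in Python (again an illustration, not WaveLab's algorithm):

```python
import numpy as np

def spectral_gate(noisy, noise_sample, reduction=0.9):
    """Subtract an estimated noise-floor magnitude from every FFT bin,
    keeping the noisy signal's phase. `noise_sample` is a hiss-only
    segment of the same length used to estimate the noise floor."""
    spectrum = np.fft.rfft(noisy)
    noise_floor = np.abs(np.fft.rfft(noise_sample))
    magnitude = np.maximum(np.abs(spectrum) - reduction * noise_floor, 0)
    return np.fft.irfft(magnitude * np.exp(1j * np.angle(spectrum)),
                        n=len(noisy))

# Simulated example: a 440 Hz tone buried in broadband hiss, with a
# separate hiss-only recording serving as the noise estimate
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
noisy = tone + 0.3 * rng.standard_normal(fs)
cleaned = spectral_gate(noisy, 0.3 * rng.standard_normal(fs))
```

Raising `reduction` cuts more hiss but, as noted above, risks dulling the voices; a real de-noiser also applies this gating in short overlapping windows rather than over the whole file at once.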

Listen to the before & after audio snippets below.  While the audio is still far from perfect due to the original recording conditions, conservative application of the noise reduction tools has significantly cleaned up the sound.  It’s possible to cut the noise even further with more aggressive use of the effects, but I felt that would do more harm than good to the overall sound quality.

BEFORE:

AFTER:

I was fairly pleased with these results and plan to keep working with these and other software tools in the future to create digital audio files that meet the needs of archivists and researchers.  We can't eliminate all of the noise from our media-saturated lives, but we can keep striving for a manageable and healthy signal-to-noise ratio.

Blacklight Summit 2016

Last week I traveled to lovely Princeton, NJ to attend Blacklight Summit. For the second year in a row, a smallish group of developers who use or work on Project Blacklight met to talk about our work and learn from each other.

Blacklight is an open source project written in Ruby on Rails that serves as a discovery interface over a Lucene Solr search index. It’s commonly used to build library catalogs, but is generally agnostic about the source and type of the data you want to search. It was even used to help reporters explore the leaked Panama Papers.

At Duke we’re using Blacklight as the public interface to our digital repository. Metadata about repository objects is indexed in Solr, and we use Blacklight (with a lot of customizations) to provide access to digital collections, including images, audio, and video. Some of the collections include: Gary Monroe Photographs, J. Walter Thompson Ford Advertisements, and Duke Chapel Recordings, among many others.
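Under the hood, Blacklight translates a user's search into an HTTP request against Solr's select handler, with facet choices expressed as filter queries. A rough sketch of what such a request looks like (the base URL and field names here are illustrative, not our actual index):

```python
from urllib.parse import urlencode

def solr_select_url(base_url, query, rows=10, **filters):
    """Build the kind of Solr 'select' request a discovery layer
    like Blacklight issues under the hood. Field names passed as
    keyword arguments become facet-style filter queries."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    for field, value in filters.items():
        params.append(("fq", f"{field}:{value}"))
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical example: search a local Solr core, filtered to audio
url = solr_select_url("http://localhost:8983/solr/catalog",
                      "civil rights", rows=20, format="Audio")
```

Blacklight builds these requests in Ruby via its Solr client, of course; the point is just that everything the interface displays ultimately comes back from queries like this.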

Blacklight has also been selected to replace the aging Endeca-based catalog that provides search across the TRLN libraries. Expect to hear more information about this project in the future.

Blacklight Summit is more of an unconference meeting than a conference, with a relatively small number of participants. It’s a great chance to learn and talk about common problems and interests with library developers from other institutions.

I’m going to give a brief overview of some of what we talked about and did during the two-and-a-half-day meeting and provide links for you to explore more on your own.

First, a representative from each institution gave a roughly five-minute overview of how they’re using Blacklight:

The group participated in a workshop on customizing Blacklight. The organizers paired people based on experience, so the most experienced and least experienced (self-identified) were paired up, and so on. Link to the GitHub project for the workshop: https://github.com/projectblacklight/blacklight_summit_demo

We got an update on the state of Blacklight 7. Some of the highlights of what’s coming:

  • Move to Bootstrap 4 from Bootstrap 3
  • Use of HTML 5 structural elements
  • Better internationalization support
  • Move from helpers to presenters. (What are presenters: http://nithinbekal.com/posts/rails-presenters/)
  • Improved code quality
  • Partial structure that makes overrides easier

A release of Blacklight 7 won’t be ready until Bootstrap 4 is released.

There were also several conversations and breakout sessions about Solr, the indexing tool used to power Blacklight. I won’t go into great detail here, but some topics discussed included:

  • Developing a common Solr schema for library catalogs.
  • Tuning the performance of Solr when the index is updated frequently. (Items that are checked out or returned need to be reindexed relatively frequently to keep availability information up to date.)
  • Support for multi-lingual indexing and searching in Solr, especially Chinese, Japanese, and Korean languages. Stanford has done a lot of work on this.

I’m sure you’ll be hearing more from me about Blacklight on this blog, especially as we work to build a new TRLN shared catalog with it.

The Other Man in Black

Looking through Duke Libraries’ AdViews collection of television commercials, I recently came across the following commercial for Beech-Nut chewing tobacco, circa 1970:

Obviously this was before tobacco advertising was banned from the television airwaves; the ban on cigarette ads took effect on January 2, 1971, as part of the Public Health Cigarette Smoking Act, signed by President Richard Nixon. At first listen, the commercial’s country-tinged jingle and voice-over narration sound like “The Man in Black,” the legendary Johnny Cash. This would not be unusual, as Cash had previously done radio and television promos sponsored by Beech-Nut, and can be seen in other 1970s television commercials, shilling for such clients as Lionel Trains, Amoco and STP. Evidently, Johnny was low on funds at this point in his career, as his music seemed old-fashioned to the younger generation of record-buyers, and his popularity had waned. Appearing in television commercials may have been necessary to balance his checkbook.

However, the Beech-Nut commercial above is mysterious. It sounds like Johnny Cash, but the pitch is slightly off. It’s also odd that Johnny doesn’t visually appear in the ad, like he does in the Lionel, Amoco, and STP commercials. Showing his face would have likely yielded higher pay. This raises the question of whether it is in fact Johnny Cash in the Beech-Nut commercial, or someone imitating Johnny’s baritone singing voice and folksy speaking style. Who would be capable of such close imitation? Well, it could be Johnny’s brother, Tommy Cash. Most fans know about Johnny’s older brother, Jack, who died in a tragic sawmill accident when Johnny was a child, but Johnny had six siblings, including younger brother Tommy.

Tommy Cash carved out a recording career of his own and had several hit singles in the late ’60s and early ’70s by conveniently co-opting Johnny’s sound and image. One of his biggest hits was “Six White Horses” (1969), a commentary on the deaths of JFK, RFK and MLK. Other hits included “One Song Away” and “Rise and Shine.” Johnny and Tommy can be seen performing together in this performance, singing about their father. It turns out Tommy allowed his voice to be used in television commercials for Pepsi, Burger King, and Beech-Nut. So, it’s likely the Beech-Nut commercial is the work of Tommy Cash rather than his more famous brother. Tommy, now in his 70s, has continued to record as recently as 2008. Tommy is also a real estate agent, and handled the sale of Johnny Cash’s home in Tennessee after the deaths of Johnny and wife June Carter Cash in 2003.

Go Faster, Do More, Integrate Everything, and Make it Good

We’re excited to have released nine digitized collections online this week in the Duke Digital Repository (see the list below). Some are brand new; the others have been migrated from older platforms. This brings our tally up to 27 digitized collections in the DDR, and 11,705 items. That’s still just a few drops in what’ll eventually be a triumphantly sloshing bucket, but the development and outreach we completed for this batch is noteworthy. It changes the game for our ability to put digital materials online faster going forward.

Let’s have a look at the new features, and review briefly how and why we ended up here.

Collection Portals: No Developers Needed

The Hugh Mangum Photographs collection portal, configured to feature selected images.

Before this week, each digital collection in the DDR required a developer to create some configuration files in order to get a nice-looking, made-to-order portal to the collection. These configs set featured items and their layout, a collection thumbnail, custom rules for metadata fields and facets, blog feeds, and more.
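For a sense of what these per-collection configs cover, here is a hypothetical sketch; the format, field names, and values are invented for illustration and are not our actual schema:

```yaml
# Hypothetical per-collection portal configuration (illustrative only)
chapelrecordings:
  thumbnail: images/collection-thumb.jpg
  featured_items:          # items highlighted at the top of the portal
    - item-0042
    - item-0117
  facets:                  # custom facet fields for this collection
    - speaker
    - year
  blog_feed: https://example.org/blog/feed
  display_fields: [title, date, description]
```

Each of these knobs is useful for a distinctive collection, but maintaining a file like this for every one of thousands of small collections is exactly the overhead the new default portal removes.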

The Duke Chapel Recordings collection portal, configured with customized facets, a blog feed, and images external to the DDR.

It’s helpful to have this kind of flexibility. It can enhance the usability of collections that have distinctive characteristics and unique needs. It gives us a way to show off photos and other digitized images that’d otherwise look underwhelming. But on the other hand, it takes time and coordination that isn’t always warranted for a collection.

We now have an optimized default portal display for any digital collection we add, so we don’t need custom configuration files for everything. A collection portal is not as fancy unconfigured, but it’s similar and the essential pieces are present. The upshot is: the digital collections team can now take more items through the full workflow quickly–from start to finish–putting collections online without us developers getting in the way.

A new “unconfigured” collection portal requiring no additional work by developers to launch. Emphasis on archival source collection info in lieu of a digital collection description.

Folder Items

To better accommodate our manuscript collections, we added more distinction in the interface between different kinds of image items. A digitized archival folder of loose manuscript material now includes some visual cues to reinforce that it’s a folder and not, e.g., a bound album, a single photograph, or a two-page letter.

Folder items have a small folder icon superimposed on their thumbnail image.
Above the image viewer is a folder icon with an image count; the item info header below changes to “Folder Info.”

We completed a fair amount of folder-level digitization in recent years, especially between 2011 and 2014 as part of a collaborative TRLN Large-Scale Digitization IMLS grant project.  That initiative allowed us to experiment with shifting gears to get more digitized content online efficiently.  We succeeded in that goal; however, those objects unfortunately never became accessible or discoverable outside of their lengthy, text-heavy archival collection guides (finding aids).  They also lacked useful features such as zooming, downloading, linking, and syndication to other sites like DPLA.  They were digital collections, but you couldn’t find or view them when searching and browsing digital collections.

Many of this week’s newly launched collections are composed of these digitized folders that were previously siloed off in finding aids. Now they’re finally fully integrated for preservation, discovery, and access alongside our other digital collections in the DDR. They remain viewable from within the finding aids and we link between the interfaces to provide proper context.

Keyboard Nav & Rotation

Two things are bound to increase when digitizing manuscripts en masse at the folder level: 1) the number of images present in any given “item” (folder); 2) the chance that something of interest within those pages ends up oriented sideways or upside-down. We’ve improved the UI a bit for these cases by adding full keyboard navigation and rotation options.

Rotation options in the image viewer. Navigate pages by keyboard (Page Up/Page Down on Windows, Fn+Up/Down on Mac).

Conclusion

Duke Libraries’ digitization objectives are ambitious. Especially given both the quality and quantity of distinctive, world-class collections in the David M. Rubenstein Library, there’s a constant push to: 1) Go Faster, 2) Do More, 3) Integrate Everything, and 4) Make Everything Good. These needs are often impossibly paradoxical. But we won’t stop trying our best. Our team’s accomplishments this week feel like a step in the right direction.

Newly Available DDR Collections in Sept 2016

Nobody Wants a Slow Repository

As we’ve been adding features and refining the public interface to Duke’s Digital Repository, the application has become increasingly slow. Don’t worry, the very slowest versions were never deployed beyond our development servers. This blog post is about how I approached addressing the application’s performance problems before they made their way to our production site.

A modern web application, like the public interface to Duke’s Digital Repository, is a complex beast, relying on layers of software and services just to deliver a bunch of HTML, CSS, and JavaScript to your web browser.  A page like this, the front page of the Alex Harris collection, takes a lot to build — code to read configuration files, methods that assemble information needed to build the page, requests to Solr to find the images to display, requests to a separate administrative application service that provides contact information for the collection, another request to fetch related blog posts, and requests to our finding aid application to deliver information about the physical collection.  All of these requests take time, and all of them have to finish before anything gets delivered to your browser.

My main suspects for the slowness: HTTP requests to external services, such as the ones mentioned above; and repeated calls to slow methods in the application. But identifying precisely which HTTP requests are slow and what code needs to be optimized takes a bit of sleuthing.

The first thing I wanted to know was: how slow is this thing, really? Turns out it was getting really slow. Too slow. There’s old research (1960s old) about computer system performance and its impact on user perception and task performance that still applies today, and an almost-as-old (1993) article from the Nielsen Norman Group summarizes the issue nicely.

To determine just how slow things were getting I used Chrome’s developer tools. The “Network” tab in Chrome’s developer tools is where the hard truth comes to light about just how bloated and slow your web application is. Or, as my high school teachers used to say when handing back test results: “read ’em and weep.”

The Network panel in Chrome’s developer tools.

Using the Network tab, I could see that the browser was waiting 15 seconds or more for anything to come back from the server. That is far too slow.

The next thing I wanted to know was how many HTTP requests were being made to external services and which ones were being made repeatedly or were taking a long time. For this dose of reality I used the httplog gem, which logs useful information about every HTTP request, including how long the application has to wait for a response.
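Adding httplog is a small change. A sketch of what that might look like (the development-only grouping is my suggestion, since logging every HTTP request is too noisy for production):

```ruby
# Gemfile -- confine httplog to development so request logging
# never runs against production traffic
group :development do
  gem "httplog"
end
```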

When added to the project’s Gemfile, httplog starts printing out useful information to the log about HTTP requests, such as this set of entries about the request to fetch finding aid information. I can see that the application is waiting over half a second to get a response back from the finding aid service:


D, [2016-08-06T12:51:09.531076 #2529] DEBUG -- : [httplog] Connecting: library.duke.edu:80
D, [2016-08-06T12:51:09.854003 #2529] DEBUG -- : [httplog] Sending: GET http://library.duke.edu:80/rubenstein/findingaids/harrisalex.xml
D, [2016-08-06T12:51:09.855387 #2529] DEBUG -- : [httplog] Data:
D, [2016-08-06T12:51:10.376456 #2529] DEBUG -- : [httplog] Status: 200
D, [2016-08-06T12:51:10.377061 #2529] DEBUG -- : [httplog] Benchmark: 0.520600972 seconds

As I expected, this request and many others were contributing significantly to the application’s slowness.

It was a bit harder to determine which parts of the code and which methods were making the application slow. For this, I mainly used two approaches. The first was to look at the application log, which tracks how long different views take to assemble. This helped narrow down which parts of the code were especially slow (and also confirmed what I was seeing with httplog). For instance, the log shows the different partials that make up the whole page and how long each of them takes to render. From the log:


12:51:09 INFO: Rendered digital_collections/_home_featured_collections.html.erb (0.8ms)
12:51:09 INFO: Rendered digital_collections/_home_highlights.html.erb (1.3ms)
12:51:10 INFO: Rendered catalog/_show_finding_aid_full.html.erb (953.4ms)
12:51:11 INFO: Rendered catalog/_show_blog_post_feature.html.erb (0.9ms)
12:51:11 INFO: Rendered catalog/_show_blog_posts.html.erb (914.5ms)

(The finding aid and blog posts are slow due to the aforementioned HTTP requests.)


One particular area of concern was extremely slow searches. To identify the problem I turned to yet another tool. Rack-mini-profiler is a gem that, when added to your project’s Gemfile, adds an expandable timing tab to every page of the site. As you visit pages of the application in a browser, it displays a detailed report of how long each section of the page takes to build. This made it possible to narrow down the areas of the application that were too slow.
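Like httplog, rack-mini-profiler is a one-line addition (a sketch; by default the gem only shows its timing panel in development mode):

```ruby
# Gemfile -- rack-mini-profiler instruments every page it serves
# and renders its report in the corner of the page in development
gem "rack-mini-profiler"
```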


What I found was that the thumbnail section of the page, which can appear twenty or more times on a search results page, was very slow. And it wasn’t loading the images that was slow; the code that selects the correct thumbnail image took a long time to run. (Thumbnail selection is complicated in the repository because thumbnails come from various types and sources.)

Having identified several contributors to the site’s poor performance (expensive thumbnail selection, and frequent and costly HTTP requests to various services) I could now work to address each of the issues.

I used three different approaches to improving the application’s performance: fragment caching, memoization, and code optimization.

Caching

finding_aid

I decided to use fragment caching to address the slow loading of finding aid information. The benefit of caching is that it’s really fast. Once Rails has the snippet of HTML cached (either in memory or on disk, depending on how it’s configured) it can use that fragment of cached markup, bypassing a lot of code and, in this case, that slow HTTP request. One downside to caching is that if something in the finding aid changes, the application won’t reflect the change until the cache is cleared or expires (after 7 days in this case).


<% cache("finding_aid_brief_#{document.ead_id}", expires_in: 7.days) do %>
<%= source_collection({ :document => document, :placement => 'left' }) %>
<% end %>
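If a finding aid does change before its 7 days are up, the cached fragment can be evicted by key so the next request rebuilds it. A rough console sketch (the key comes from the snippet above; the exact interface depends on your Rails version):

```ruby
# From a Rails console: expire one cached fragment by its key.
# expire_fragment handles the "views/" prefix Rails adds to fragment keys.
ActionController::Base.new.expire_fragment("finding_aid_brief_harrisalex")
```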

Memoization

Memoization is similar to caching in that you’re storing information to be used repeatedly rather than recalculated every time. This can be a useful technique to use with expensive (slow) methods that get called frequently. The parent_collections_count method returns the total number of collections in a portal in the repository (such as the Digital Collections portal). This method is somewhat expensive because it first has to run a query to get information about all of the collections and then count them. Since this gets used more than once, I’m using Ruby’s conditional assignment operator (||=) to tell Ruby not to recalculate the value of @parent_collections_count every time the method is called. With memoization, if the value is already stored Ruby just reuses the previously calculated value. (There are some gotchas with this technique, but it’s very useful in the right circumstances.)


def parent_collections_count
@parent_collections_count ||= response(parent_collections_search).total
end
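A minimal sketch of the main ||= gotcha (class and method names here are illustrative, not the repository’s code): when the expensive computation can legitimately return nil or false, ||= re-runs it on every call, because the memo variable stays falsy. Checking defined? memoizes the nil result too:

```ruby
class ThumbnailLookup
  attr_reader :lookups

  def initialize
    @lookups = 0
  end

  # Re-runs expensive_find on every call when it returns nil.
  def path_with_or_equals
    @path ||= expensive_find
  end

  # Caches the result even when it is nil.
  def path_with_defined
    return @memo if defined?(@memo)
    @memo = expensive_find
  end

  private

  def expensive_find
    @lookups += 1
    nil # imagine a lookup that found no thumbnail
  end
end

a = ThumbnailLookup.new
2.times { a.path_with_or_equals }
a.lookups # => 2 -- the memoization silently failed

b = ThumbnailLookup.new
2.times { b.path_with_defined }
b.lookups # => 1
```

Since parent_collections_count always returns a number, ||= is safe there; the defined? form is only needed when falsy results are possible.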

Code Optimization

One of the reasons thumbnails were slow to load in search results is that some items in the repository have hundreds of images. The method used to find the thumbnail path was loading image path information for all the item’s images rather than just the first one. To address this I wrote a new method that fetches just the item’s first image to use as the item’s thumbnail.
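The shape of that fix might look something like this (a hypothetical sketch: the method, field, and query names are illustrative, not the repository’s actual code). The key idea is asking Solr for a single row instead of materializing path information for hundreds of images:

```ruby
# Hypothetical sketch: fetch only the first child image's path to use
# as an item's thumbnail. `solr` is any client responding to
# get("select", params: ...), e.g. an RSolr connection.
def first_child_thumbnail_path(solr, item_id)
  response = solr.get("select", params: {
    q:    "parent_id:#{item_id}",  # children of this item (field name illustrative)
    sort: "position asc",          # keep the item's page order
    rows: 1,                       # the whole point: one document, not hundreds
    fl:   "thumbnail_path"
  })
  doc = response.dig("response", "docs", 0)
  doc && doc["thumbnail_path"]
end
```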

Combined, these changes made a significant improvement to the site’s performance. Overall application speed and performance will remain one of our priorities as we add features to the Duke Digital Repository.