The Elastic Ruler: Measuring Scholarly Use of Digital Collections

Our Digital Collections program aspires to build “distinctive digital collections that provide access to Duke’s unique library and archival materials for teaching, learning, and research at Duke and worldwide.” Those are our primary stated objectives, though the reach and the value of putting collections online extend far beyond. For instance, these uses might not qualify as scholarly, but we celebrate them all the same:

Digital Collections items pinned on Pinterest

Regardless of how much value we assign to different kinds of uses, determining the impact of our work is a hard problem to solve. There are no simple instruments to measure our outcomes, and the measurements we do take can at times feel uncertain, as if taken of a moving object with a wildly elastic ruler.   Some helpful resources are out there, of both theoretical and practical varieties, but focusing on what matters most remains a challenge.

Back to our mission: how much are our collections actually used for the scholarly purposes we trumpet (teaching, learning, and research), versus other, more casual uses? How do we distinguish these uses within the data we collect? Getting clearer answers could help us in several areas. What should we even digitize? What compelling stories of user engagement could we tell to illustrate the value of the collections? How might we drum up more interest in the collections within scholarly communities?

Some of my Duke colleagues and I began exploring these questions in depth this year. We’ll have much more to report later, but our work has already uncovered some bits of interest to share. And, of course, we’ve unearthed more questions than answers.


Like many places, we use a service called Google Analytics to track how much our collections are accessed. We use analytics to understand what kinds of things we digitize resonate with users online, and to help us make informed improvements to the website. Google doesn’t track any personally identifiable data (thankfully); data is aggregated to a degree where privacy is protected yet site owners can still see generally where their traffic comes from.

For example, we know that on average[1], our site visitors view just over 5 pages/visit, and stay for about 3.5 minutes. 60.3% of visitors bounce (that is, leave after seeing only one page). Mobile devices account for 20.1% of traffic. Over 26% of visits come from outside the U.S.  The most common way a visit originates is via search engine (37.5%), and social media traffic—especially from Facebook—is quite significant (15.7% of visits). The data is vast; the opportunities for slicing and dicing it seem infinite. And we’ll forever grapple with how best to track, interpret, report, and respond to the things that are most meaningful to us.

Scholarly Traffic

There are two bits of Analytics data that can provide us with clues about our collections’ use in scholarly environments:

  1. Traffic on scholarly networks (a filtered view of ISPs)
  2. Referrals from scholarly pages (a filtered view of Referrer paths)

Tracking these figures (however imperfect) could help us get a better sense of trends in the tenor of our audience, and help us set goals for any outreach efforts we undertake.


Traffic on Scholarly Networks

One key clue for scholarly use is the name of visitors’ Internet Service Provider (ISP). For example, a visit from somewhere on Duke’s campus has an ISP “duke university,” a NYC public school “new york city public schools,” and McGill University (in Canada) “mcgill university.” Of course, plenty of scholarly work gets done off-campus (where an ISP is likely Time Warner, Verizon, AT&T, etc.), and not all network traffic that happens on a campus is actually for scholarly purposes. So there are the usual caveats about signal and noise within the data.

With those caveats in mind, we know that over the past year[1], we had:

  • 11.7% of our visits (“sessions”) came from visitors on a scholarly network (as defined in our filters by: ISP name has universit*, college*, or school* in it)[2]
  • 74,724 visits via scholarly networks
  • 4,121 unique scholarly network ISPs
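The scholarly-network filter is essentially a substring match on the ISP name. A minimal sketch in Python (the pattern mirrors the universit*/college*/school* rule above; the sample ISP names are illustrative, and this is not our actual Analytics configuration):

```python
import re

# Approximation of the filter described in the post: an ISP name
# counts as "scholarly" if it contains universit, college, or school.
SCHOLARLY_ISP = re.compile(r"universit|college|school", re.IGNORECASE)

def is_scholarly_isp(isp_name):
    """Return True if an ISP name looks like a scholarly network."""
    return bool(SCHOLARLY_ISP.search(isp_name))

# Hypothetical ISP names, echoing the examples in the post.
visits = [
    "duke university",
    "new york city public schools",
    "mcgill university",
    "time warner cable",
    "verizon internet services",
]

scholarly = [v for v in visits if is_scholarly_isp(v)]
```

In Analytics itself this is done with a filtered view on the ISP dimension rather than exported data, but the matching logic is the same.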

Referrals from Course Websites or Online Syllabi on .Edu Sites

Are our collections used for teaching and learning? How much can we tell simply through web analytics?

A referral happens when someone gets to our site by following a link from another site.  In our data, we can see the full web address of any referring page. But can we infer from a URL whether a site was a course website or an online syllabus, pages that would link to our site for the express purpose of teaching?  We can try.

In the past year, referrals filtered by an expression[3] to isolate course sites and syllabi on .edu sites accounted for:

  • 0.18% of total visits
  • 1,167 visits
  • 68 unique sites (domains)

Or, if we remove the .edu restriction[2]:

  • 1.21% of total visits
  • 7,718 visits
  • 221 unique sites (domains)
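For the curious, footnote 3 below gives the referrer expression verbatim. Here is a sketch of that pattern applied to a few made-up referring URLs in Python (the case-insensitive flag is our assumption about Analytics filter behavior, not something the filter syntax guarantees):

```python
import re

# The referrer-path expression from footnote 3, reproduced verbatim.
COURSE_REFERRER = re.compile(
    r"blackboard|sakai|moodle|webct|schoology|^bb|learn|course|"
    r"isites|syllabus|classroom|^class.|/class/|^classes.|/~CLASS/",
    re.IGNORECASE,  # assumed; Analytics filters are typically case-insensitive
)

# Hypothetical referring URLs, for illustration only.
referrers = [
    "sakai.duke.edu/portal/site/hist101",
    "blackboard.emory.edu/webapps/portal",
    "facebook.com/some-page",
    "example.edu/~jdoe/syllabus.html",
]

course_refs = [r for r in referrers if COURSE_REFERRER.search(r)]
```

The expression is a blunt instrument (a hostname like "learnmorestuff.com" would match), which is part of why the numbers above carry caveats.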

It’s hard to assert confidently that this data is accurate, and indeed many of the pages can’t be verified because they’re only accessible to the students in those classes. But regardless, a look at the data through this lens does occasionally help discover real uses in actual courses, and generate leads for contacting instructors about the ways they’ve used the collections in their curricula.

Other Methods

We know web analytics are just a single tool in a giant toolbox for determining how much our collections are contributing to teaching, learning, and research. One technique we’ve tried is using Google Scholar to track citations of collections, then logging and tagging those citations using Delicious. For instance, here are 70 scholarly citations for our Ad*Access collection. Among the citations are 30 articles, 19 books, and 10 theses; 26 sources cited something from the collection as a primary source.  This technique is powerful and illuminates some interesting uses, but it unfortunately takes a lot of time to do well.

We’ve also recently launched a survey on our website that gathers some basic information from visitors about how they’re using the collections. And we have done some outreach with instructors at Duke and beyond. Stay tuned for much more as we explore the data. In the meantime, we would love to hear from others in the field how you approach answering these very same questions.




  1. Data from July 1, 2014 – June 26, 2015.
  2. We had first looked at isolating scholarly networks by narrowing to ISP network domains ending in “.edu” but upon digging further, there are two reasons why the ISP name provides better data. 1) .EDUs are only granted to accredited postsecondary institutions in the U.S., so visits from international universities or middle/high schools wouldn’t count. 2) A full 24% of all our visits have unknowable ISP network domains: “(not set)” or “unknown.unknown,” whereas only 6.3% of visits have unknown ISP names.
  3. Full referrer path: blackboard|sakai|moodle|webct|schoology|^bb|learn|course|isites|syllabus|classroom|^class.|/class/|^classes.|/~CLASS/

…and We’re Putting it on Wax (The Frank Clyde Brown Collection)

My last several posts have focused on endangered (some would say obsolete) audio formats: open reel tape, compact cassette, DAT, and MiniDisc. In this installment, we travel back to the dawn of recorded sound, at the turn of the 20th century, to investigate some of the earliest commercial recording media. Unlike the formats above, which rely on post-WWII magnetic and optical technology, these systems carved sound waves into stone (or, more accurately, wax) on strictly acousto-mechanical principles.

Thomas Edison is credited with inventing the first phonograph (“soundwriter”) on July 18, 1877. It consisted of tinfoil wrapped around a hand-cranked metal cylinder. Sound waves would be funneled through a horn, causing a stylus to vibrate and indent a groove around the outside of the cylinder. The cylinder could be played back by reversing the procedure: by retracing the groove with the stylus, the sound would be amplified back through the horn and heard as a rough approximation of the original.


Alexander Graham Bell quickly improved on the innovation by introducing wax as a superior material for the cylinders and using a needle to scratch the sound waves into their surface. He called his device the “Graphophone”. By 1888, Edison had also adopted wax as the preferred medium for recorded cylinders, and a patent-sharing agreement was signed. In 1889, the wax cylinder became the first commercially marketed audio medium.


Initially, the cylinders were installed in the ancestors of jukeboxes in public places. Drop a coin into the slot, and the machine would magically dispense a song, monologue, or comedy routine. The technology was soon adapted for home use. Consumers could purchase prerecorded cylinders to play on their machines. Perhaps more amazingly, they could buy a home recording attachment and cut their own content onto the wax.

[PAUSE—shift from PLAY to RECORD mode]


Biographical and Historical Note

Frank Clyde Brown (1870-1943) served as a Professor of English at Trinity College, Duke University, from 1909 until his death. A native of Virginia, he received his Ph.D. at the University of Chicago in 1908. While at Duke University he served in many capacities, including being chairman of his department, University Marshal, and Comptroller of the University during its initial construction. These aspects of his life are chronicled in his papers held by the Duke University Archives.

This collection of materials, however, is concerned with activities to which he devoted equal time and energy: the organization of the North Carolina Folklore Society in 1913, and his personal effort to gather and record the nuances and culture of the “folk” of North Carolina and its near neighbors, which occupied him from 1912 until his death. Under the impetus of a 1912 mailing from John A. Lomax, then President of the American Folklore Society, Brown, as well as other faculty members and other citizens in North Carolina, became interested in folklore and organized the North Carolina Folklore Society in 1913, with Brown as secretary-treasurer. As secretary-treasurer of this organization from its inception until his death, he provided the organizational impetus behind the Society. Through his course in folklore at Duke, he also sent class after class out to gather the folklore of their locales, both during their studies and afterward. And virtually every summer he could be found in the most remote parts of the state, with notebook and recorder: first a dictaphone employing cylinders, and later a machine employing aluminum discs provided for his use by the University. The result, by 1943, was a collection of about 38,000 written notes on lore, 650 musical scores, 1,400 songs vocally recorded, and numerous magazine articles, student theses, books, lists, and other items related to this study. The material originated in at least 84 North Carolina counties, with about 5 percent originating in 20 other states and Canada, and came from the efforts of 650 other contributors besides Brown himself.





Thanks to our Audiovisual Archivist, Craig Breaden, for the excellent photos and unused title suggestion (“The Needle and the Damage Done”). Future posts will include updates on work with the Frank C. Brown Collection, other audio collections at Duke, and the history of sound recording and reproduction.


Mini-memes, many meanings: Smoking dirt boy and the congee line bros

Children are smoking in two of my favorite images from our digital collections.

One of them comes from the eleven days in 1964 that William Gedney spent with the Cornett family in Eastern Kentucky. A boy, crusted in dirt, clutching a bent-up Prince Albert can, draws on a cigarette. It’s a miniature of mawkish masculinity that echoes and lightly mocks the numerous shots Gedney took of the Cornett men, often shirtless and sitting on or standing around cars, smoking.

At some point in the now-distant past, while developing and testing our digital collections platform, I stumbled on “smoking dirt boy” as a phrase to use in testing for cases when a search returns only a single result. We kind of adopted him as an unofficial mascot of the digital collections program. He was a mini-meme, one we used within our team to draw chuckles, and added into conference presentations to get some laughs. Everyone loves smoking dirt boy.

It was probably 3-4 years ago that I stopped using the image to elicit guffaws, and started to interrogate my own attitude toward it. It’s not one of Gedney’s most powerful photographs, but it provokes a response, and I had become wary of that response. There’s a very complicated history of photography and American poverty that informs it.

Screen shot from a genealogy site discussion forum, regarding the Cornett family and William Gedney’s involvement with them.

While preparing this post, I did some research into the Cornett family, and came across an item from a discussion thread on a genealogy site, shown here in a screen capture. “My Mother would not let anyone photograph our family,” it reads. “We were all poor, most of us were clean, the Cornetts were another story.”  It captures the attitudes that intertwine in that complicated history. The resentment toward the camera’s cold eye on Appalachia is apparent, as is the disdain for the family that implicitly wasn’t “clean” and let the photographer shoot. These attitudes came to bear in an incident just this past spring, in which a group in West Virginia confronted traveling photographers whom they claimed had photographed children without permission.

The photographer Roger May has undertaken an effort to change “this visual definition of Appalachia.” His “Looking at Appalachia” project began with a series of three blog posts on William Gedney’s Kentucky photographs (Part 1, Part 2, Part 3). The New York Times’ Lens blog wrote about the project last month. His perspective is one that brings insight and renewal to Gedney’s work.

Gedney’s photographs have taken on a new life as a digital collection since they were published on the Duke University Libraries’ website in 1999. It has become a high-use collection for the Rubenstein Library, and that use has driven a recent project we have undertaken in the library to re-process the collection and digitize the entire corpus of finished prints, proof prints, and contact sheets. We expect the work to take more than a year and produce more than 20,000 images (compared to the roughly 5,000 available now), but when it’s complete, it should add whole new dimensions to the understanding of Gedney’s work.

Another collection given life by its digitization is the Sidney Gamble Photographs. The nitrate negatives are so flammable that the library must store them off site, making access impossible without some form of reproduction. Digitization has made it possible for anyone in the world to experience Gamble’s remarkable documentation of China in the early 20th Century. Since its digitization, this collection has been the subject of a traveling exhibit, and will be featured in the Photography Gallery of the Rubenstein Library’s new space when it opens in August.

The photograph of the two boys in the congee distribution line is another favorite of mine. Again, a child is seen smoking in a context that speaks of poverty. There’s plenty to read in the picture, including the expressions on the faces of the different boys, and the way they press their bowls to their chests. But there are two details that make this image rich with implicit narrative – the cigarette in the taller boy’s mouth, and the protective way he drapes his arm over the shorter one. They have similar, close-cropped haircuts, which are also different from the other boys, suggesting they came from the same place. It’s an immediate assumption that the boys are brothers, and the older one has taken on the care and protection of the younger.

Still, I don’t know the full story, and exploring my assumptions about the congee line boys might lead me to ask probing questions about my own attitudes and “visual definition” of the world. This process is one of the aspects of working with images that makes my work rewarding. Smoking dirt boy and the congee line boys are always there to teach me more.

Sports Information negatives sneak preview

We all probably remember having to pose for an annual class photograph in primary school. If you made the mistake of telling your mother about the looming photograph beforehand, you probably had to wear something “nice” and had your hair plastered to your head while she informed you of the trouble you’d be in if you made a funny face. Everyone looks a little awkward in these photographs, and only a few of us wanted to have the picture taken in the first place. Frankly, I’m amazed that they got us all to sit still long enough to take the photograph. Some of us also had similar photographs taken while playing team sports, which led to equally interesting results.

These are some of the memories that have been popping up this past month as I digitize nitrate negatives from the Sports Information Office: Photographic Negatives collection, circa 1924-1992, 1995 and undated. The collection contains photographic negatives related to sports at Duke. I’ve digitized about half of the negatives and seen images mostly from football, basketball, baseball, and boxing. The majority of these photographs are of individuals, but there are also team shots, group shots, and coaches. While you may have to wait a bit for the publication of these negatives through the Digital Collections website, I had to share some of these gems with you.

Some of the images strike me as funny for the expressions, some for the pose, and others for the totally out-of-context background. It makes me wonder what the photographer’s intention or instruction was.

FlexTight X5

To capture these wonderful images we are using a recently purchased Hasselblad FlexTight X5. The Hasselblad is a dedicated high-end film scanner that uses glassless drum-scanning technology. Glassless drum scanning offers the benefits of a classic drum scanner (high resolution, sharpness, better D-max/D-min) without the disadvantages (messy wet mounting, Newton rings, slow throughput, high cost). This device produces extremely sharp reproductions in which the film grain can be seen in the digital image. Two more important features of this scanner: it handles a wide variety of standard film sizes along with custom sizes, and it captures in a raw file format. This is significant because negatives contain a great deal of tonal information that printed photographs lack. Once this information is captured, we have to adjust each digital image as if we were printing the negative in a traditional darkroom. When image-editing software adjusts an image, an algorithm makes decisions about compressing, expanding, keeping, or discarding tonal information in the digital image. This type of adjustment causes data loss. Because we are following archival imaging standards, retaining the largest amount of data is important. Sometimes the data loss is not visible to the naked eye, but making adjustments renders the image data “thin”: the more adjustments to an image, the less data there is to work with.

A histogram is a visual representation of tonal data in an image. This is a histogram of an image before and after an adjustment.

It reminds me of the scene in The Shawshank Redemption (spoiler alert) where the warden is in Andy Dufresne’s (Tim Robbins) cell after discovering he has escaped. The warden throws a rock at a poster on the wall in anger, only to find there is a hole in the wall behind the poster. An adjusted digital image is similar in that the image looks normal and solid, but there is no depth to it. This becomes a problem if anyone, after digitization, wants to reuse the image in some other context where they will need to make adjustments to suit their purposes. They won’t have a whole lot of latitude to make adjustments before digital artifacts start appearing. By using the Hasselblad raw file format and capturing in 16-bit RGB, we are able to make adjustments to the raw file without data loss. This enables us to create a robust file that will be more useful in the future.
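That thinness can be sketched numerically. In this toy example (plain Python, not our production workflow), a simple levels stretch on an 8-bit gradient leaves empty bins in the histogram, the comb-like gaps a before/after histogram reveals:

```python
# Simulate an 8-bit tonal gradient: one pixel at each of the 256 levels.
original = list(range(256))

# A simple levels adjustment: stretch the 64-191 range to fill 0-255.
# Rounding back to whole 8-bit values discards tonal information.
def stretch(value, lo=64, hi=191):
    scaled = (value - lo) * 255 / (hi - lo)
    return min(255, max(0, int(round(scaled))))

adjusted = [stretch(v) for v in original]

levels_before = len(set(original))   # 256 distinct tones going in
levels_after = len(set(adjusted))    # fewer tones survive the stretch

# The adjusted image's histogram now has empty bins ("comb" gaps):
histogram = [adjusted.count(level) for level in range(256)]
empty_bins = sum(1 for count in histogram if count == 0)
```

A 16-bit capture plays the same game with 65,536 levels instead of 256, so the same adjustment leaves vastly more tonal data intact.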

I’m sure there will be many uses for the negatives in this collection. Who wouldn’t want a picture of a former Duke athlete in an odd pose in an out of context environment with a funny look on their face? Right?

Back to the ’80s – Duke Chronicle Style

Ah, the 1980s…a decade of perms, the Walkman, Jelly shoes, and Ziggy Stardust.  It was a time of fashion statements I personally look back on in wonderment.

Personal Computer Ad, 1980
Personal Computer Ad, 1980

Fashionable leotards, shoulder pads, and stirrup pants were all the rage.  And can we say parachute pants?  Thanks, MC Hammer.  If you’re craving a blast from the past, we’ve got you covered.  The digitized 1980s Duke Chronicle has arrived!  Now you can relive that decade of Hill Street Blues and Magnum P.I. from your own personal computer (hopefully, you’re not still using one of these models!).

As Duke University’s student-run newspaper for over 100 years, the Duke Chronicle is a window into the history of the university, North Carolina, and the world.  It may even be a window into your own past if you had the privilege of living through those totally rad years.  If you didn’t get the chance to live it firsthand, you may find great joy in experiencing it vicariously through the pages of the Chronicle, or at least find irony in the fact that ’80s fashion has made a comeback.

Sony Private Stereo Ad, February 12, 1980
Sony Private Stereo Ad, February 12, 1980


Here at Duke, the 1980s was the decade that welcomed Coach Krzyzewski to the basketball program, whose team made it (almost) all the way to the championship in 1986.  In 1980, the Chronicle celebrated its 75th year of bringing news to campus.  It was also a time of expansion, as Duke Hospital North was constructed in 1980 and the Washington Duke Inn followed in 1988.  President Reagan visited campus, Desmond Tutu spoke at Duke Chapel, and Princess Grace Kelly entertained with poetry at Page Auditorium almost two years to the day before she died.


The 1980s also saw racial unrest in North Carolina, and The Duke Chronicle headlines reflected these tense feelings.  Many articles illustrate a reawakened civil rights movement.  From a call to increase the number of black professors at Duke, to the marching of KKK members down the streets of Greensboro, Durham, and Chapel Hill, North Carolinians found themselves in a continued struggle for equality.  Students and faculty at Duke were no exception.  Unfortunately, these thirty-year-old Chronicle headlines would seem right at home in today’s newspapers.


The 1980s Chronicle issues can inform us of fashion and pop culture, whether we look back at it with distaste or fondness.  But it also enlightens us to the broader social atmosphere that defined the 1980s.   It was a time of change and self-expression, and I invite you to explore the pages of the Duke Chronicle to learn more.

Fashion Ad, May 10, 1984
Fashion Ad, May 10, 1984


The addition of the 1980s issues to the online Duke Chronicle digital collection is part of an ongoing effort to provide digital access to all Chronicle issues from 1905 to 1989.  The next decades to look forward to are the 1970s and 1950s.  Also, stay tuned to Bitstreams for a more in-depth exploration of the newspaper digitization process.  You can learn how we turn the pages of the Duke Chronicle into online digital gold.  At least, that’s what I like to think we do here at the Digital Production Center.  Until then, transport yourself back to the 1980s, Duke Chronicle style (no DeLorean or flux capacitor necessary).

Advertising Culture

When I was a kid, one of my favorite things to do while visiting my grandparents was browsing through their collections of old National Geographic and Smithsonian magazines. I was more interested in the advertisements than the content of the articles. Most of the magazines were dated from the 1950s through the 1980s and they provided me with a glimpse into the world of my parents and grandparents from a time in the twentieth century I had missed.

I also had a fairly obsessive interest in air-cooled Volkswagen Beetles, which had ceased being sold in the US shortly before I was born. They were still a common sight in the 1980s and something about their odd shape and the distinct beat of their air-cooled boxer engine captured my young imagination. I was therefore delighted when an older cousin who had studied graphic design gave to me a collection of several hundred Volkswagen print advertisements that he had clipped from 1960s era Life magazines for a class project. Hinting at my future profession, I placed each sheet in a protective plastic sleeve, gave each one an accession number, and catalogued them in a spreadsheet.

I think that part of the reason I find old advertisements so interesting is what they can reveal about our cultural past. Because advertisements are designed specifically to sell things, they can reveal the collective desires, values, fears, and anxieties of a culture.

All of this to say, I love browsing the advertising collections at Duke Libraries. I’m especially fond of the outdoor advertising collections: the OAAA Archives, the OAAA Slide Library, and the John E. Brennan Outdoor Advertising Survey Reports. Because most of the items in these collections are photographs or slides of billboards they often capture candid street scenes, providing even more of a sense of the time and place where the advertisements were displayed.

I’ve picked out a few to share that I found interesting or funny for one reason or another. Some of the ads I’ve picked use language that sounds dated now, or display ideas or values that are out-moded. Others just show how things have changed. A few happen to have an old VW in them.

The Value of Metadata in Digital Collections Projects

Before you let your eyes glaze over at the thought of metadata, let me familiarize you with the term and its invaluable role in the creation of the library’s online Digital Collections.  Yes, metadata is a rather jargony word librarians and archivists find themselves using frequently in the digital age, but it’s not as complex as you may think.  In the simplest terms, the Society of American Archivists defines metadata as “data about data.”  Okay, what does that mean?  According to the good ol’ trusty Oxford English Dictionary, it is “data that describes and gives information about other data.”  In other words, if you have a digitized photographic image (data), you will also have words to describe the image (metadata).

Better yet, think of it this way.  If that image were of a large family gathering and grandma lovingly wrote the date and names of all the people on the backside, that is basic metadata.  Without that information those people and the image would suddenly have less meaning, especially if you have no clue who those faces are in that family photo.  It is the same with digital projects.  Without descriptive metadata, the items we digitize would hold less meaning and prove less valuable for researchers, or at least be less searchable.  The better and more thorough the metadata, the more it promotes discovery in search engines.  (Check out the metadata from this Cornett family photo from the William Gedney collection.)
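To make the analogy concrete, here is a minimal sketch of a descriptive-metadata record and a toy keyword match (the field names are hypothetical, not an actual DPC schema):

```python
# A minimal descriptive-metadata record, in the spirit of grandma's
# notes on the back of the family photo. Field names are illustrative.
record = {
    "identifier": "photo_0042",   # hypothetical ID
    "title": "Family gathering",
    "date": "1964",
    "subjects": ["family", "portrait"],
    "description": "Large family gathering; names noted on verso.",
}

# Descriptive metadata is what makes an item findable: a keyword
# search simply matches query terms against these text fields.
def matches(record, query):
    text = " ".join(
        v if isinstance(v, str) else " ".join(v) for v in record.values()
    )
    return query.lower() in text.lower()
```

Without the record, a search engine has nothing but pixels to index; with it, a query like "family" or "1964" finds the image.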

The term metadata was first used in the late 1960s in computer programming.  With the advent of computing technology and the overabundance of digital data, metadata became a key element to help describe and retrieve information in an automated way.  The use of the word metadata in literature over the last 45 years shows a sharp increase from 1995 to 2005, which makes sense: the term became more common as the technology grew more widespread.  This is reflected in the graph below from Google’s Ngram Viewer, which scours over 5 million Google Books to track the usage of words and phrases over time.

Google Ngram Viewer for “metadata”

Because of its link with computer technology, metadata is widely used in a variety of fields that range from computer science to the music industry.  Even your music playlist is full of descriptive metadata that relates to each song, like the artist, album, song title, and length of audio recording.  So, libraries and archives are not alone in their reliance on metadata.  Generating metadata is an invaluable step in the process of preserving and documenting the library’s unique collections.  It is especially important here at the Digital Production Center (DPC) where the digitization of these collections happens.  To better understand exactly how important a role metadata plays in our job, let’s walk through the metadata life cycle of one of our digital projects, the Duke Chapel Recordings.

The Chapel Recordings project consists of digitizing over 1,000 cassette and VHS tapes of sermons and over 1,300 written sermons that were given at the Duke Chapel from the 1950s to 2000s.  These recordings and sermons will be added to the existing Duke Chapel Recordings collection online.  Funded by a grant from the Lilly Foundation, this digital collection will be a great asset to Duke’s Divinity School and those interested in hermeneutics worldwide.

Before the scanners and audio capture devices are even warmed up at the DPC, preliminary metadata is collected from the analog archival material.  Depending on the project, this metadata is created either by an outside collaborator or in-house at the DPC.  For example, the Duke Chronicle metadata is created in-house by pulling data from each issue, like the date, volume, and issue number.  I am currently working on compiling the pre-digitization metadata for the 1950s Chronicle, and the spreadsheet looks like this:

1950s Duke Chronicle preliminary metadata

As for the Chapel Recordings project, the DPC received an inventory from the University Archives in the form of an Excel spreadsheet.  This inventory contained the preliminary metadata already generated for the collection, which is also used in Rubenstein Library‘s online collection guide.

Chapel Recordings inventory metadata

The University Archives also supplied the DPC with an inventory of the sermon transcripts containing basic metadata compiled by a student.

Duke Chapel Records sermon metadata

Here at the DPC, we convert this preliminary metadata into a digitization guide, which is a fancy term for yet another Excel spreadsheet.  Each digital project receives its own digitization guide (we like to call them digguides) which keeps all the valuable information for each item in one place.  It acts as a central location for data entry, but also as a reference guide for the digitization process.  Depending on the format of the material being digitized (image, audio, video, etc.), the digitization guide will need different categories.  We then add these new categories as columns in the original inventory spreadsheet and it becomes a working document where we plug in our own metadata generated in the digitization process.   For the Chapel Recordings audio and video, the metadata created looks like this:

Chapel Recordings digitization metadata
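The transformation from inventory to digitization guide amounts to appending new columns to the original spreadsheet. A rough sketch with Python’s csv module (the column names and rows are invented for illustration, not our actual digguide template):

```python
import csv
import io

# Hypothetical inventory rows, standing in for the spreadsheet
# received from the University Archives.
inventory_csv = """identifier,title,date
chapel_001,Sermon on Hope,1956-03-04
chapel_002,Easter Service,1957-04-21
"""

# Read the inventory, append digitization-guide columns, write it back.
rows = list(csv.DictReader(io.StringIO(inventory_csv)))
for row in rows:
    row["filename"] = row["identifier"] + ".wav"  # derived file name
    row["capture_date"] = ""   # filled in during digitization
    row["qc_status"] = ""      # filled in during quality control

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
digguide_csv = out.getvalue()
```

The preliminary metadata stays intact while the guide grows new columns at each stage, which is why one spreadsheet can serve as both data-entry form and process reference.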

Once we have digitized the items, we then run the recordings through several rounds of quality control.  This generates even more metadata which is, again, added to the digitization guide.  As the Chapel Recordings have not gone through quality control yet, here is a look at the quality control data for the 1980s Duke Chronicle:

1980s Duke Chronicle quality control metadata

Once the digitization and quality control are completed, the DPC sends the digitization guide, now filled with metadata, to the metadata archivist, Noah Huffman.  Noah then makes further additions, edits, and deletions to match the spreadsheet metadata fields to the fields accepted by the collection management software, CONTENTdm.  During the process of ingesting all the content into the software, CONTENTdm links the digitized items to their corresponding metadata from the Excel spreadsheet.  This is in preparation for placing the material online. For even more metadata adventures, see Noah's most recent Bitstreams post.
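Conceptually, that field-matching step is a rename-and-filter crosswalk. The field names below are invented purely for illustration; the real mapping between our digitization guides and CONTENTdm fields is maintained by the metadata archivist.

```python
# Hypothetical crosswalk from spreadsheet headers to the field names the
# management software expects. A value of None marks an internal-only
# column that should be dropped on ingest.
FIELD_MAP = {
    "title": "Title",
    "creation_date": "Date",
    "digitization_guide_notes": None,  # internal-only: dropped on ingest
    "format": "Format",
}

def crosswalk(record):
    """Rename a record's keys per FIELD_MAP, dropping unmapped or
    internal-only fields."""
    out = {}
    for key, value in record.items():
        target = FIELD_MAP.get(key)
        if target:
            out[target] = value
    return out
```

For example, `crosswalk({"title": "Sermon, 1957", "digitization_guide_notes": "reel 2"})` keeps only the `Title` field, quietly discarding the internal note.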

In the final stage of the process, the compiled metadata and digitized items are published online at our Digital Collections website.  You, the researcher, history fanatic, or Sunday browser, see the results of all this work on the page of each digital item online.  This metadata is what makes your search results productive, and if we’ve done our job right, the digitized items will be easily discovered.  The Chapel Recordings metadata looks like this once published online:

Chapel Recordings metadata as viewed online

Further down the road, the Duke Divinity School wishes to enhance the current metadata to provide keyword searches within the Chapel Recordings audio and video.  This will allow researchers to jump to specific sections of the recordings and find the exact content they are looking for.  The additional metadata will greatly improve the user experience by making it easier to search within the content of the recordings, and will add value to the digital collection.

On this journey through the metadata life cycle, I hope you have been convinced that metadata is a key element in the digitization process.  From preliminary inventories, to digitization and quality control, to uploading the material online, metadata has a big job to do.  At each step, it forms the link between a digitized item and how we know what that item is.  The life cycle of metadata in our digital projects at the DPC is sometimes long and tiring.  But, each stage of the process  creates and utilizes the metadata in varied and important ways.  Ultimately, all this arduous work pays off when a researcher in our digital collections hits gold.

What’s in my tool chest

I recently, and perhaps inadvisably, updated my workstation to the latest version of OS X (Yosemite), and in doing so ended up needing to rebuild my setup from scratch. As such, I've been taking stock of the applications and tools I use on a daily basis for my work and thought it might be interesting to share them. Keep in mind that most of the tools I use are Mac-centric, but there are almost always alternatives for those that aren't cross-platform compatible.


Our department uses Jabber for Instant Messaging. The client of choice for OS X is Adium. It works great: it's lightweight, the interface is intelligible, custom statuses are easy to set, and the notifications are readily apparent without being annoying.

My email and calendaring client of choice is Microsoft Outlook. I'm using version 15.9, which is functionally much more similar to Outlook Web Access (OWA) than the previous version (Outlook 2011). It seems to start up much more quickly, and its notifications are somehow less annoying, even though they are very similar. Perhaps it's just the change in color scheme. I had some difficulty initially with setting up shared mailboxes, but I eventually got that to work. [Go to Tools > Accounts, add a new account using the shared email address, set the access type to username and password, and then use your normal login info. The account will then show up under your main mailbox, and you can customize how it's displayed, etc.]

Outlook 2015 — now in blue!

Another group that I work with in the library has been testing out Slack, which apparently is quite popular within development teams at lots of cool companies. It seems to me to be a mashup of Google Wave, newsgroups, and Twitter. It seems neat, but I worry it might just be another thing to keep up with. Maybe we can eventually use it to replace something else wholesale.

Project Management

We mostly use Basecamp for shared planning on projects. I think it's a great tool, but the UI is starting to feel dated, especially the skeuomorphic text documents. We've played around a bit with some other tools (Jira, Trello, Asana, etc.), but Basecamp has yet to be displaced.

Basecamp text document (I don’t think Steve Jobs would approve)

We also now have access to enterprise-level Box accounts at Duke. We use Box to store project files and assets that don't make sense to keep in something like Basecamp or send via email. I think their web interface is great, and I also use Box Sync to regularly back up all of my project files. It has built-in versioning, which has helped me on a number of occasions with accessing older versions of things. I'd been a Dropbox user for more than five years, but I really prefer Box now. We also make heavy use of Google Drive. I think everything about it is great.

Another tool we use a lot is Git. We've got a library GitHub account and we also use a Duke-specific private instance of Gitorious. I much prefer GitHub, fwiw. I'm still learning the best way to use Git workflows, but compared to other version control systems I've used (SVN, Mercurial), Git is amazing IMHO.

Design & Production

I almost always start any design work by sketching things out. I tend to grab sheets of 11×17 paper, fold them in half, and make little mini booklets. I guess I'm just too cheap to buy real Moleskines (or even better, Field Notes). But yeah, sketching is really important. After that, I tend to jump right in and do as much design work in the browser as possible. However, Photoshop, Illustrator, and sometimes InDesign are still indispensable. Rarely a day goes by that I don't have at least one of them open.

Photoshop — I still use it a lot!

With regards to media production, I'm a big fan of Sony's software products. I find Vegas to be both the most flexible NLE platform out there and the easiest to use. For smaller, quicker audio-only tasks, I might fire up Audacity. Handbrake is really handy for quickly transcoding things. And I'll also give a shout-out to DaVinci Resolve, which is now free and seems incredibly powerful, though I've not had much time to explore it yet.


My code editor of choice right now is Atom — note that it's Mac only. When I work on a Windows box, I tend to use Notepad++. I've also played around a bit with more robust IDEs, like Eclipse and Aptana, but for most of the work I do, a simple code editor is plenty.

The Atom UI is easy on the eyes

For local development, I'm a big fan of MAMP. It's really easy to set up and works great. I've also started spinning up dedicated local VMs using Oracle's VirtualBox. I like the idea of having a separate dedicated environment for a given project that can be moved around from one machine to another. I'm sure there are better ways to do that, though.

I also want to quickly list some Chrome browser plugins that I use for dev work: ColorPick Eyedropper, Window Resizer, LiveReload (thanks Cory!), WhatFont, and for fun, Google Art Project.


I also make use of Virtual Box for doing browser testing. I’ve got several different versions of Windows setup so I can test for all flavors of Internet Explorer along with older incarnations of Firefox, Chrome, and Opera. I’ve yet to find a good way to test for older versions of Safari, aside from using something static like browsershots.

With regards to mobile devices, I think testing on as many real-world variations as possible is ideal. But for quick-and-dirty tests, I make use of the iOS Simulator and the Android SDK emulator. The iOS Simulator comes set up with several different hardware configs, while you have to configure these manually with the Android suite. In any case, both tools provide a great way to quickly see how a given project will function across many different mobile devices.


Hopefully this list will be helpful to someone out in the world. I’m also interested in learning about what other developers keep in their tool chest.

The Pros and Cons of FFV1

One of the greatest challenges to digitizing moving image content isn't the actual digitization. It's the enormous file sizes that result, and the high costs associated with storing and maintaining those files for long-term preservation. Most cultural heritage institutions consider 10-bit uncompressed to be the preservation standard for moving image content. 10-bit uncompressed uses no file compression, as the name states, and is considered the safest and most reliable format for moving image preservation at this time. It delivers the highest image resolution, color quality, and sharpness, while avoiding motion compensation and compression artifacts.

Unfortunately, one hour of 10-bit uncompressed video can produce a 100 gigabyte file. That’s at least 50 times larger than an audio preservation file of the same duration, and about 1000 times larger than most still image preservation files. In physical media terms, it would take 21 DVDs, or 142 CDs, to store one hour of 10-bit uncompressed video. That’s a lot of data!
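The ~100 GB figure checks out with some back-of-the-envelope arithmetic. The parameters below assume standard-definition NTSC video at 4:2:2 chroma subsampling, a common preservation setup; exact capture parameters vary by project.

```python
# Rough check of the ~100 GB/hour figure for 10-bit uncompressed video.
# Assumes SD NTSC (720x486) at 4:2:2: 10 bits per sample, with luma plus
# alternating chroma samples averaging 20 bits per pixel.
width, height = 720, 486
bits_per_pixel = 20        # 4:2:2 at 10 bits: Y + alternating Cb/Cr
fps = 30000 / 1001         # NTSC frame rate (~29.97)

bytes_per_second = width * height * bits_per_pixel / 8 * fps
gb_per_hour = bytes_per_second * 3600 / 1e9
print(round(gb_per_hour))  # prints 94 -- i.e., ~94 GB/hour before container overhead
```

Container overhead and embedded audio push the real-world total toward the 100 GB cited above.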

Recently, the FFV1 codec has gained in popularity as an alternative to 10-bit uncompressed. FFV1 uses lossless compression to store digitized moving image content at reduced file sizes, without data loss. FFV1 is part of the free, open-source FFmpeg project, and has been in existence since 2003. FFV1 uses entropy encoding to deliver mathematically lossless intra-frame compression, which produces substantially smaller file sizes when compared to uncompressed 10-bit moving image digitization.

Because commercial video digitization hardware and software do not natively support the FFV1 codec, encoding has to be done from the command line, using a tool such as FFmpeg.
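To make that concrete, here is a small Python sketch that assembles a typical FFmpeg invocation for FFV1. The flags shown are a commonly cited preservation recipe, not necessarily the DPC's exact settings, and the filenames are placeholders.

```python
def ffv1_command(src, dst):
    """Assemble an FFmpeg command that losslessly compresses the video
    stream to FFV1 (version 3) in a Matroska container, passing the
    audio stream through unchanged."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "ffv1",    # FFV1 video codec
        "-level", "3",     # FFV1 version 3
        "-g", "1",         # intra-frame only: every frame is a keyframe
        "-slicecrc", "1",  # per-slice CRCs, useful for integrity checking
        "-c:a", "copy",    # copy the audio stream as-is
        dst,
    ]

print(" ".join(ffv1_command("capture.mov", "capture.mkv")))
```

The printed string can be pasted into a terminal, or the list can be handed to `subprocess.run` on a machine with FFmpeg installed.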


Testing in the Digital Production Center showed that files encoded with the FFV1 codec were roughly one-third the size of their 10-bit uncompressed counterparts. Both formats can be carried in a variety of wrappers, or container files, such as AVI (Microsoft), MOV (Apple), or MKV (open source). The encoded video and audio streams are wrapped together in the container with other data streams that include technical metadata. The type and variety of data that a container can hold are specific to that container format.

Within the terminal command line window, incoming video image and waveform readouts are displayed, while the content is compressed to FFV1.


The reduced file sizes produced via FFV1 are exciting, but there are some downsides. Although FFV1 is open-source, the files will not play using standard video software on Mac and Windows, nor can FFV1 be utilized within commercially-available digitization hardware and software (only via terminal command). This is because no major company (Apple, Microsoft, Adobe, Blackmagic, etc.) has adopted the codec or announced plans to do so. Any file format that does not eventually achieve widespread adoption and universal playback capability within the broadcasting and filmmaking communities runs a higher risk of long-term obsolescence and a lack of engineering support.

The concept of “lossless compression” is mysterious, and seemingly a paradox. How can it make a file smaller without eliminating or irreversibly altering any data? In testing, it is difficult to verify that a file compressed to FFV1 and then decompressed is bit-for-bit identical to its original state. Although the specs may be the same, the before and after file sizes are not identical. So “lossless” and “reversible” may not be synonymous, although ideally they should be. In addition to FFV1’s software and hardware compatibility issues, it is challenging to accurately validate the integrity of a file that incorporates lossless compression.

Adventures in metadata hygiene: using Open Refine, XSLT, and Excel to dedup and reconcile name and subject headings in EAD

OpenRefine, formerly Google Refine, bills itself as “a free, open source, powerful tool for working with messy data.”  As someone who works with messy data almost every day, I can’t recommend it enough.  While OpenRefine is a great tool for cleaning up “grid-shaped data” (spreadsheets), it’s a bit more challenging to use when your source data is in some other format, particularly XML.

Some corporate name terms from an EAD (XML) collection guide

As part of a recent project to migrate data from EAD (Encoded Archival Description) to ArchivesSpace, I needed to clean up about 27,000 name and subject headings spread across over 2,000 EAD records in XML.  Because the majority of these EAD XML files were encoded by hand using a basic text editor (don’t ask why), I knew there were likely to be variants of the same subject and name terms throughout the corpus–terms with extra white space, different punctuation and capitalization, etc.  I needed a quick way to analyze all these terms, dedup them, normalize them, and update the XML before importing it into ArchivesSpace.  I knew Open Refine was the tool for the job, but the process of getting the terms 1) out of the EAD, 2) into OpenRefine for munging, and 3) back into EAD wasn’t something I’d tackled before.

Below is a basic outline of the workflow I devised, combining XSLT, OpenRefine, and, yes, Excel.  I’ve provided links to some source files when available.  As with any major data cleanup project, I’m sure there are 100 better ways to do this, but hopefully somebody will find something useful here.

1. Use XSLT to extract names and subjects from EAD files into a spreadsheet

I’ve said it before, but sooner or later all metadata is a spreadsheet. Here is some XSLT that will extract all the subjects, names, places and genre terms from the <controlaccess> section in a directory full of EAD files and then dump those terms along with some other information into a tab-separated spreadsheet with four columns: original_term, cleaned_term (empty), term_type, and eadid_term_source.
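For readers who prefer a scripting language, roughly the same extraction can be sketched in Python with the standard library. This assumes non-namespaced EAD 2002 files; the linked XSLT remains the version actually used.

```python
import csv
import glob
import xml.etree.ElementTree as ET

# controlaccess child elements worth harvesting
TERM_TYPES = ("persname", "corpname", "famname", "geogname", "subject", "genreform")

def extract_terms(ead_dir, out_path):
    """Dump <controlaccess> terms from a directory of EAD files into a
    four-column tab-separated file, mirroring the XSLT described above."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["original_term", "cleaned_term", "term_type", "eadid_term_source"])
        for path in sorted(glob.glob(ead_dir + "/*.xml")):
            root = ET.parse(path).getroot()
            eadid = (root.findtext(".//eadid") or "").strip()
            for ca in root.iter("controlaccess"):
                for term_type in TERM_TYPES:
                    for el in ca.findall(term_type):  # direct children only
                        term = "".join(el.itertext()).strip()
                        if term:
                            writer.writerow([term, "", term_type, eadid])
```

Recording the `eadid` alongside each term is what makes the round trip back into the source files possible later on.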


 2. Import the spreadsheet into OpenRefine and clean the messy data!

Once you open the resulting tab-delimited file in OpenRefine, you’ll see the four columns of data above, with the “cleaned_term” column empty. Copy the values from the first column (original_term) into the second column (cleaned_term).  You’ll want to preserve the original terms in the first column and only edit the terms in the second, so you have a way to match the old values in your EAD with any edited values later on.

OpenRefine offers several amazing tools for viewing and cleaning data.  For my project, I mostly used the “cluster and edit” feature, which applies several different matching algorithms to identify, cluster, and facilitate clean up of term variants. You can read more about clustering in Open Refine here: Clustering in Depth.

In my list of about 27,000 terms, I identified around 1200 term variants in about 2 hours using the “cluster and edit” feature, reducing the total number of unique values from about 18,000 to 16,800 (about 7%). Finding and replacing all 1200 of these variants manually in EAD or even in Excel would have taken days and lots of coffee.

Screenshot of “Cluster & Edit” tool in OpenRefine, showing variants that needed to be merged into a single heading.


In addition to “cluster and edit,” OpenRefine provides a really powerful way to reconcile your data against known vocabularies.  So, for example, you can configure OpenRefine to query the Library of Congress Subject Heading database and attempt to find LCSH values that match or come close to matching the subject terms in your spreadsheet.  I experimented with this feature a bit, but found the matching a bit unreliable for my needs.  I’d love to explore this feature again with a different data set.  To learn more about vocabulary reconciliation in OpenRefine, check out OpenRefine’s own documentation on reconciliation.

 3. Export the cleaned spreadsheet from OpenRefine as an Excel file

Simple enough.

4. Open the Excel file and use Excel’s “XML Map” feature to export the spreadsheet as XML.

I admit that this is quite a hack, but one I’ve used several times to convert Excel spreadsheets to XML that I can then process with XSLT.  To get Excel to export your spreadsheet as XML, you’ll first need to create a new template XML file that follows the schema you want to output.  Excel refers to this as an “XML Map.”  For my project, I used this one: controlaccess_cleaner_xmlmap.xml

From the Developer tab, choose Source, and then add the sample XML file as the XML Map in the right hand window.  You can read more about using XML Maps in Excel here.

After loading your XML Map, drag the XML elements from the tree view in the right hand window to the top of the matching columns in the spreadsheet.  This will instruct Excel to map data in your columns to the proper XML elements when exporting the spreadsheet as XML.

Once you’ve mapped all your columns, select Export from the Developer tab to export all of the spreadsheet data as XML.

Your XML file should look something like this: controlaccess_cleaner_dataset.xml

Sample chunk of exported XML, showing mappings from original terms to cleaned terms, type of term, and originating EAD identifier.


5. Use XSLT to batch process your source EAD files and find and replace the original terms with the cleaned terms.

For my project, I bundled the term cleanup as part of a larger XSLT “scrubber” script that fixed several other known issues with our EAD data all at once.  I typically use the Oxygen XML Editor to batch process XML with XSLT, but there are free tools available for this.

Below is a link to the entire XSLT scrubber file, with the templates controlling the <controlaccess> term cleanup on lines 412 to 493.  In order to access the XML file  you saved in step 4 that contains the mappings between old values and cleaned values, you’ll need to call that XML from within your XSLT script (see lines 17-19).


What this script does, essentially, is process all of your source EAD files at once, finding and replacing all of the old name and subject terms with the ones you normalized and deduped in OpenRefine. To be more specific, for each term in the EAD, the XSLT script will look for a matching term in the <original_term> field of the XML file you produced in step 4 above.  If it finds a match, it will replace that original term with the value of the <cleaned_term>.  Below is a sample XSLT template that controls the find-and-replace of <persname> terms.
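The same find-and-replace pass can be sketched in Python. The element names follow the step 4 mapping file, the EAD is assumed to be non-namespaced, and the published XSLT scrubber remains the authoritative version.

```python
import xml.etree.ElementTree as ET

def load_mapping(mapping_path):
    """Build {original_term: cleaned_term} from the XML produced in step 4."""
    mapping = {}
    for rec in ET.parse(mapping_path).getroot():
        orig = (rec.findtext("original_term") or "").strip()
        cleaned = (rec.findtext("cleaned_term") or "").strip()
        if orig and cleaned and orig != cleaned:
            mapping[orig] = cleaned
    return mapping

def scrub_terms(ead_path, mapping, out_path):
    """Replace <controlaccess> term text with cleaned values, leaving
    everything else in the EAD untouched."""
    tree = ET.parse(ead_path)
    for ca in tree.getroot().iter("controlaccess"):
        for el in ca:
            if el.text and el.text.strip() in mapping:
                el.text = mapping[el.text.strip()]
    tree.write(out_path, encoding="utf-8")
```

Skipping entries where the original and cleaned values already match keeps the rewrite minimal, so only the roughly 1,200 actual variants get touched.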

XSLT template that finds and replaces old values with cleaned ones.


Final Thoughts

Admittedly, cobbling together all these steps was quite an undertaking, but once you have the architecture in place, this workflow can be incredibly useful for normalizing, reconciling, and deduping metadata values in any flavor of XML with just a few tweaks to the files provided.  Give it a try and let me know how it goes, or better yet, tell me a better way…please.

More resources for working with OpenRefine:

“Using Google Refine to Clean Messy Data” (ProPublica blog)

Notes from the Duke University Libraries Digital Projects Team