
Who, Why, and What: the three W's of the Duke Digital Collections Mini-Survey

My colleague Sean wrote two weeks ago about the efforts a group of us in the library are making toward understanding the scholarly impact of Duke Digital Collections.  In this post, I plan to continue the discussion with details about the survey we are conducting, as well as share some initial results.

Surveying can be perilous work!

After reviewing the analytics and Google Scholar data Sean wrote about, our working group realized we needed more information.  Our goal throughout this assessment process has been to pull together scholarly use data that will inform our digitization decisions, priorities, and technological choices (such as features on the digital collections platform), and to help us understand whether and how we are meeting the needs of researcher communities.  Analytics gave us clues, but we still didn't know some of the fundamental facts about our patrons.  After a fervent discussion with many whiteboard notes, the group decided that creating a survey would get us more of the data we were looking for.  The resulting survey focuses on the elemental questions we have about our patrons: who are they, why are they visiting Duke Digital Collections, and what are they going to do with what they find here?


The Survey

Creating the survey itself was no small task, but after an almost endless process of writing, rewriting, and consultations with our assessment coordinator, we settled on six questions (a truly miniature survey).  We considered the first three questions (who, why, what) to be the most important, and we intended the last three to provide additional information, such as Duke affiliation, and to allow space for general feedback.  None of the questions was required, so respondents could answer or skip whatever they wanted; we also included space for respondents to write in further details, especially when choosing the "other" option.

Our survey in its completed form.

The survey launched on April 30 and remains accessible by hovering over a "feedback" link on every Digital Collections webpage.  Event-tracking analytics show that 0.29% of the patrons who hover over our feedback link click through to the survey, and an even smaller number actually submit responses.  This has worked out to 56 responses, at an average rate of around one per day.  Despite that low click-through rate, we have been really pleased with the number of responses so far.  The response rate remains steady, and we have already learned a lot from even this small sample of visitor data.  We are not advertising or promoting the survey, because our target respondents are patrons who find us in the course of their research or general Internet browsing.

Hovering over the help us box reveals expectations and instructions for survey participants.

Initial Results

Before I start discussing our results, please note that what I’m sharing here is based on initial responses and my own observations.  No one in digital collections has thoroughly reviewed or analyzed this data.  Additionally, this information is drawn from responses submitted between April 30 – July 8, 2015. We plan to keep the survey online into the academic year to see if our responses change when classes are in session.

With that disclaimer now behind us, let’s review results by question.

Questions 1 and 4:  Who are you?

Since we are concerned with scholarly use more than other types in this exercise, the first question is intended to sort respondents primarily by academic status.  In question 4, respondents are given the chance to further categorize their academic affiliation.

Question 1 Answers | # of Responses | %
Student | 14 | 25%
Educator | 10 | 18%
Librarian, Archivist or Museum Staff | 5 | 9%
Other | 26 | 47%
Total | 55 | 100%

Of the respondents who categorized themselves as "other" in question 1, 11 clarified their otherness by writing in their identities in the space provided.  Of these 11, 4 associated themselves with music-oriented professions or hobbies, and 2 with the fine arts (a photographer and a filmmaker).  The remaining 5 could not be grouped easily into categories.

As a follow-up later in the survey, question 4 asks respondents to categorize their academic affiliation (if they have one).  The results show that 3 respondents are affiliated with Duke, 12 with other colleges or universities, and 9 with a K-12 school.  Of the write-in responses, 3 listed names of universities abroad, and 1 listed a school whose level has not been identified.

Question 2:  Why are you here?

We can tell from our analytics how people get to us (whether they were referred via a link or sought us out directly), but this information does not address why visitors come to the site.  Enter question 2.

Question 2 Answers | # of Responses | %
Academic research | 15 | 28%
Casual browsing | 15 | 28%
Followed a link | 9 | 17%
Personal research | 24 | 44%
Other | 6 | 11%
Total | 54 |

The survey asks those who select academic research, personal research, or other to write in their research topic or purpose.  Academic research topics submitted so far revolve primarily around various historical subjects.  Personal research topics reflect a high interest in music (specific songs or types of music), advertising, and other personal projects.  It is interesting to note that local history topics have been submitted under all three categories (academic, personal, and other).  Additionally, non-academic researchers seem to be more willing to share their specific topics: 19 of 24 respondents listed their topics, compared to 7 of 15 academic researchers.

Question 3:  What will you do with the images and/or resources you find on this site?

To me, this question has the potential to provide some of the most illuminating information from our patrons. Knowing how they use the material helps us determine how to enhance access to the digitized objects and what kinds of technology we should be investing in.  This can also shed light on our digitization process itself.  For example, maybe the full text version of an item will provide more benefit to more researchers than an illustrated or hand-written version of the same item (of course we would prefer to offer both, but I think you see where I am going with this).

In designing this question, the group decided it would be valuable to offer options for those who share items for their visual or subject appeal (for example, the Pinterest user), the publication-minded researcher, and a range of patron types in between.


Question 3 Answers | # of Responses | %
Use for an academic publication | 3 | 6%
Share on social media | 10 | 19%
Use them for homework | 8 | 15%
Use them as a teaching tool in my classes | 5 | 9%
Personal use | 31 | 58%
Use for my job | 2 | 4%
Other | 10 | 19%
Total | 53 |

The 10 "other" respondents all entered further details: they planned to share items with friends and family (in some way other than on social media), wanted to use the items they found as a reference, or were working on an academic pursuit that, in their minds, didn't fit the listed categories.

Observations

As I said above, these survey results are cursory, and we plan to leave the survey up for several more months.  But so far the data reveals that Duke Digital Collections serves a wide audience of academic and non-academic users for a range of purposes. For example, one respondent uses the outdoor advertising collections to get a glimpse of how their community has changed over time. Another is concerned with US history in the 1930s, and another is focused on music from the 1900s.

The next phase of the assessment group's activities is to meet with researchers and instructors in person and talk with them about their experiences using digital collections (not just Duke's) for scholarly research or instruction.  We have also been collecting examples of instructors who have used digital collections in their classes, and we plan to create a webpage of these examples with the goal of encouraging other instructors to do the same.  The goal of both of these efforts is to increase academic use of the digital collections, whether at the K-12 or collegiate level.


Just like this survey team, we stand at the ready, waiting for our chance to analyze and react to our data!

Of course, another next step is to keep collecting this survey data and analyzing it further.  All in all, it has been truly exciting to see the results thus far.  As we study the data in more depth this fall, we plan to work with the Duke University Library Digital Collections Advisory Team to implement any new technical or policy-oriented decisions based on our conclusions.  Our minds are already spinning with the possibilities.

The Value of Metadata in Digital Collections Projects

Before you let your eyes glaze over at the thought of metadata, let me familiarize you with the term and its invaluable role in the creation of the library's online Digital Collections.  Yes, metadata is a rather jargony word librarians and archivists find themselves using frequently in the digital age, but it's not as complex as you may think.  In the simplest terms, the Society of American Archivists defines metadata as "data about data."  Okay, what does that mean?  According to the good ol' trusty Oxford English Dictionary, it is "data that describes and gives information about other data."  In other words, if you have a digitized photographic image (data), you will also have words to describe the image (metadata).

Better yet, think of it this way.  If that image were of a large family gathering and grandma lovingly wrote the date and names of all the people on the backside, that is basic metadata.  Without that information those people and the image would suddenly have less meaning, especially if you have no clue who those faces are in that family photo.  It is the same with digital projects.  Without descriptive metadata, the items we digitize would hold less meaning and prove less valuable for researchers, or at least be less searchable.  The better and more thorough the metadata, the more it promotes discovery in search engines.  (Check out the metadata from this Cornett family photo from the William Gedney collection.)

The term metadata was first used in the late 1960s in the context of computer programming.  With the advent of computing technology and the overabundance of digital data, metadata became a key element for describing and retrieving information in an automated way.  The use of the word metadata in literature over the last 45 years shows a steep increase from 1995 to 2005, which makes sense: the term came into wider and wider use as the technology spread.  This is reflected in the graph below from Google's Ngram Viewer, which scours over 5 million Google Books to track the usage of words and phrases over time.

Google Ngram Viewer for “metadata”

Because of its link with computer technology, metadata is widely used in a variety of fields that range from computer science to the music industry.  Even your music playlist is full of descriptive metadata that relates to each song, like the artist, album, song title, and length of audio recording.  So, libraries and archives are not alone in their reliance on metadata.  Generating metadata is an invaluable step in the process of preserving and documenting the library’s unique collections.  It is especially important here at the Digital Production Center (DPC) where the digitization of these collections happens.  To better understand exactly how important a role metadata plays in our job, let’s walk through the metadata life cycle of one of our digital projects, the Duke Chapel Recordings.

The Chapel Recordings project consists of digitizing over 1,000 cassette and VHS tapes of sermons and over 1,300 written sermons that were given at the Duke Chapel from the 1950s to 2000s.  These recordings and sermons will be added to the existing Duke Chapel Recordings collection online.  Funded by a grant from the Lilly Foundation, this digital collection will be a great asset to Duke’s Divinity School and those interested in hermeneutics worldwide.

Before the scanners and audio capture devices are even warmed up at the DPC, preliminary metadata is collected from the analog archival material.  Depending on the project, this metadata is created either by an outside collaborator or in-house at the DPC.  For example, the Duke Chronicle metadata is created in-house by pulling data from each issue, like the date, volume, and issue number.  I am currently working on compiling the pre-digitization metadata for the 1950s Chronicle, and the spreadsheet looks like this:

1950s Duke Chronicle preliminary metadata

As for the Chapel Recordings project, the DPC received an inventory from the University Archives in the form of an Excel spreadsheet.  This inventory contained the preliminary metadata already generated for the collection, which is also used in Rubenstein Library‘s online collection guide.

Chapel Recordings inventory metadata

The University Archives also supplied the DPC with an inventory of the sermon transcripts containing basic metadata compiled by a student.

Duke Chapel Records sermon metadata

Here at the DPC, we convert this preliminary metadata into a digitization guide, which is a fancy term for yet another Excel spreadsheet.  Each digital project receives its own digitization guide (we like to call them digguides), which keeps all the valuable information for each item in one place.  It acts as a central location for data entry, but also as a reference guide for the digitization process.  Depending on the format of the material being digitized (image, audio, video, etc.), the digitization guide will need different categories.  We then add these new categories as columns in the original inventory spreadsheet, and it becomes a working document where we plug in our own metadata generated during the digitization process.   For the Chapel Recordings audio and video, the metadata created looks like this:

Chapel Recordings digitization metadata
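To make the idea concrete, here is a minimal sketch of turning an inventory spreadsheet into a digitization guide by appending new columns.  The column names are purely illustrative, not the DPC's actual fields, and the script assumes the inventory has been saved as a CSV file.

```python
import csv

# Columns supplied in the preliminary inventory (illustrative names only)
INVENTORY_COLUMNS = ["identifier", "title", "date", "format"]

# Extra columns filled in during digitization (also illustrative)
DIGITIZATION_COLUMNS = ["filename", "capture_date", "operator", "duration", "qc_notes"]

def make_digitization_guide(inventory_path, guide_path):
    """Copy the inventory rows and append empty digitization columns for data entry."""
    with open(inventory_path, newline="") as src, open(guide_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=INVENTORY_COLUMNS + DIGITIZATION_COLUMNS)
        writer.writeheader()
        for row in reader:
            # Keep the preliminary metadata; leave the new fields blank for later entry
            guide_row = {col: row.get(col, "") for col in INVENTORY_COLUMNS}
            guide_row.update({col: "" for col in DIGITIZATION_COLUMNS})
            writer.writerow(guide_row)

make_digitization_guide("chapel_recordings_inventory.csv", "chapel_recordings_digguide.csv")
```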

Once we have digitized the items, we then run the recordings through several rounds of quality control.  This generates even more metadata which is, again, added to the digitization guide.  As the Chapel Recordings have not gone through quality control yet, here is a look at the quality control data for the 1980s Duke Chronicle:

1980s Duke Chronicle quality control metadata

Once the digitization and quality control are completed, the DPC sends the digitization guide, filled with metadata, to the metadata archivist, Noah Huffman.  Noah then makes further additions, edits, and deletions to match the spreadsheet metadata fields to fields accepted by the management software, CONTENTdm.  During the process of ingesting all the content into the software, CONTENTdm links the digitized items to their corresponding metadata from the Excel spreadsheet.  This is in preparation for placing the material online. For even more metadata adventures, see Noah's most recent Bitstreams post.

In the final stage of the process, the compiled metadata and digitized items are published online at our Digital Collections website.  You, the researcher, history fanatic, or Sunday browser, see the results of all this work on the page of each digital item online.  This metadata is what makes your search results productive, and if we’ve done our job right, the digitized items will be easily discovered.  The Chapel Recordings metadata looks like this once published online:

Chapel Recordings metadata as viewed online

Further down the road, the Duke Divinity School wishes to enhance the current metadata to provide keyword searches within the Chapel Recordings audio and video.  This will allow researchers to jump to specific sections of the recordings and find the exact content they are looking for.  The additional metadata will greatly improve the user experience by making it easier to search within the content of the recordings, and will add value to the digital collection.

On this journey through the metadata life cycle, I hope you have been convinced that metadata is a key element in the digitization process.  From preliminary inventories, to digitization and quality control, to uploading the material online, metadata has a big job to do.  At each step, it forms the link between a digitized item and how we know what that item is.  The life cycle of metadata in our digital projects at the DPC is sometimes long and tiring, but each stage of the process creates and uses the metadata in varied and important ways.  Ultimately, all this arduous work pays off when a researcher in our digital collections hits gold.

Adventures in metadata hygiene: using Open Refine, XSLT, and Excel to dedup and reconcile name and subject headings in EAD

OpenRefine, formerly Google Refine, bills itself as "a free, open source, powerful tool for working with messy data."  As someone who works with messy data almost every day, I can't recommend it enough.  While OpenRefine is a great tool for cleaning up "grid-shaped data" (spreadsheets), it's a bit more challenging to use when your source data is in some other format, particularly XML.

Some corporate name terms from an EAD (XML) collection guide

As part of a recent project to migrate data from EAD (Encoded Archival Description) to ArchivesSpace, I needed to clean up about 27,000 name and subject headings spread across over 2,000 EAD records in XML.  Because the majority of these EAD XML files were encoded by hand using a basic text editor (don't ask why), I knew there were likely to be variants of the same subject and name terms throughout the corpus: terms with extra white space, different punctuation and capitalization, etc.  I needed a quick way to analyze all these terms, dedupe them, normalize them, and update the XML before importing it into ArchivesSpace.  I knew OpenRefine was the tool for the job, but the process of getting the terms 1) out of the EAD, 2) into OpenRefine for munging, and 3) back into EAD wasn't something I'd tackled before.

Below is a basic outline of the workflow I devised, combining XSLT, OpenRefine, and, yes, Excel.  I’ve provided links to some source files when available.  As with any major data cleanup project, I’m sure there are 100 better ways to do this, but hopefully somebody will find something useful here.

1. Use XSLT to extract names and subjects from EAD files into a spreadsheet

I’ve said it before, but sooner or later all metadata is a spreadsheet. Here is some XSLT that will extract all the subjects, names, places and genre terms from the <controlaccess> section in a directory full of EAD files and then dump those terms along with some other information into a tab-separated spreadsheet with four columns: original_term, cleaned_term (empty), term_type, and eadid_term_source.

controlaccess_extractor.xsl
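If XSLT isn't your thing, the same extraction can be sketched in Python.  This is a rough, hypothetical equivalent of the idea, not the linked stylesheet; it assumes un-namespaced EAD 2002 files sitting in an ead/ directory, so paths and element names may need adjusting for your own encoding.

```python
import csv
import glob
from lxml import etree

# EAD <controlaccess> child elements that carry name/subject headings
TERM_TAGS = ["persname", "corpname", "famname", "geogname", "subject", "genreform"]

with open("controlaccess_terms.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["original_term", "cleaned_term", "term_type", "eadid_term_source"])
    for path in glob.glob("ead/*.xml"):
        tree = etree.parse(path)
        eadid = (tree.findtext(".//eadid") or path).strip()
        for tag in TERM_TAGS:
            for el in tree.iterfind(f".//controlaccess/{tag}"):
                term = " ".join("".join(el.itertext()).split())  # normalize whitespace
                if term:
                    # cleaned_term stays empty; it gets filled in later in OpenRefine
                    writer.writerow([term, "", tag, eadid])
```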

 2. Import the spreadsheet into OpenRefine and clean the messy data!

Once you open the resulting tab delimited file in OpenRefine, you’ll see the four columns of data above, with “cleaned_term” column empty. Copy the values from the first column (original_term) to the second column (cleaned_term).  You’ll want to preserve the original terms in the first column and only edit the terms in the second column so you can have a way to match the old values in your EAD with any edited values later on.

OpenRefine offers several amazing tools for viewing and cleaning data.  For my project, I mostly used the “cluster and edit” feature, which applies several different matching algorithms to identify, cluster, and facilitate clean up of term variants. You can read more about clustering in Open Refine here: Clustering in Depth.

In my list of about 27,000 terms, I identified around 1200 term variants in about 2 hours using the “cluster and edit” feature, reducing the total number of unique values from about 18,000 to 16,800 (about 7%). Finding and replacing all 1200 of these variants manually in EAD or even in Excel would have taken days and lots of coffee.

Screenshot of “Cluster & Edit” tool in OpenRefine, showing variants that needed to be merged into a single heading.
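For a sense of what "cluster and edit" is doing under the hood, OpenRefine's default key-collision method uses a fingerprint keyer.  Here is a simplified Python sketch of that idea (not OpenRefine's exact implementation): variants that normalize to the same key get grouped together for review.

```python
import re
from collections import defaultdict

def fingerprint(term: str) -> str:
    """Simplified fingerprint key: lowercase, replace punctuation with spaces,
    then sort and dedupe the remaining tokens."""
    cleaned = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(sorted(set(cleaned.split())))

def cluster(terms):
    groups = defaultdict(set)
    for term in terms:
        groups[fingerprint(term)].add(term)
    # Only keys with more than one distinct spelling need human review
    return [sorted(group) for group in groups.values() if len(group) > 1]

print(cluster([
    "Duke University -- History",
    "Duke University--History",
    "duke university -- history.",
]))
# [['Duke University -- History', 'Duke University--History', 'duke university -- history.']]
```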

 

In addition to “cluster and edit,” OpenRefine provides a really powerful way to reconcile your data against known vocabularies.  So, for example, you can configure OpenRefine to query the Library of Congress Subject Heading database and attempt to find LCSH values that match or come close to matching the subject terms in your spreadsheet.  I experimented with this feature a bit, but found the matching a bit unreliable for my needs.  I’d love to explore this feature again with a different data set.  To learn more about vocabulary reconciliation in OpenRefine, check out freeyourmetadata.org

 3. Export the cleaned spreadsheet from OpenRefine as an Excel file

Simple enough.

4. Open the Excel file and use Excel’s “XML Map” feature to export the spreadsheet as XML.

I admit that this is quite a hack, but one I’ve used several times to convert Excel spreadsheets to XML that I can then process with XSLT.  To get Excel to export your spreadsheet as XML, you’ll first need to create a new template XML file that follows the schema you want to output.  Excel refers to this as an “XML Map.”  For my project, I used this one: controlaccess_cleaner_xmlmap.xml

From the Developer tab, choose Source, and then add the sample XML file as the XML Map in the right hand window.  You can read more about using XML Maps in Excel here.

After loading your XML Map, drag the XML elements from the tree view in the right hand window to the top of the matching columns in the spreadsheet.  This will instruct Excel to map data in your columns to the proper XML elements when exporting the spreadsheet as XML.

Once you’ve mapped all your columns, select Export from the developer tab to export all of the spreadsheet data as XML.

Your XML file should look something like this: controlaccess_cleaner_dataset.xml

Sample chunk of exported XML, showing mappings from original terms to cleaned terms, type of term, and originating EAD identifier.
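If Excel's XML Map feature feels like one hack too many, you could also generate the mapping file directly from the cleaned spreadsheet, assuming you export a tab-separated file from OpenRefine instead of (or alongside) the Excel file.  Here is a hypothetical Python sketch of that shortcut; the wrapper element names are guesses, while original_term and cleaned_term match the fields the scrubber XSLT looks for.

```python
import csv
from lxml import etree

root = etree.Element("terms")  # wrapper element name is a guess
with open("cleaned_terms.tsv", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        term = etree.SubElement(root, "term")
        for field in ("original_term", "cleaned_term", "term_type", "eadid_term_source"):
            etree.SubElement(term, field).text = row[field]

etree.ElementTree(root).write(
    "controlaccess_cleaner_dataset.xml",
    encoding="utf-8", xml_declaration=True, pretty_print=True,
)
```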


5. Use XSLT to batch process your source EAD files and find and replace the original terms with the cleaned terms.

For my project, I bundled the term cleanup as part of a larger XSLT “scrubber” script that fixed several other known issues with our EAD data all at once.  I typically use the Oxygen XML Editor to batch process XML with XSLT, but there are free tools available for this.

Below is a link to the entire XSLT scrubber file, with the templates controlling the <controlaccess> term cleanup on lines 412 to 493.  In order to access the XML file you saved in step 4 that contains the mappings between old values and cleaned values, you'll need to call that XML from within your XSLT script (see lines 17-19).

AT-import-fixer.xsl

What this script does, essentially, is process all of your source EAD files at once, finding and replacing all of the old name and subject terms with the ones you normalized and deduped in OpenRefine. To be more specific, for each term in the EAD, the XSLT script will find the matching term in the <original_term> field of the XML file you produced in step 4 above.  If it finds a match, it will then replace that original term with the value of the <cleaned_term>.  Below is a sample XSLT template that controls the find and replace of <persname> terms.

XSLT template that finds and replaces old values with cleaned ones.
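For a rough sense of what that step accomplishes (the real project uses the linked XSLT scrubber, not this), here is a hypothetical Python sketch that walks the source EAD files and swaps each <controlaccess> heading for its cleaned form from the step 4 mapping file.  It again assumes un-namespaced EAD and a flat term structure.

```python
import glob
from lxml import etree

TERM_TAGS = {"persname", "corpname", "famname", "geogname", "subject", "genreform"}

# Build an original -> cleaned lookup from the mapping file produced in step 4
mapping = {}
for original in etree.parse("controlaccess_cleaner_dataset.xml").iterfind(".//original_term"):
    mapping[original.text] = original.getparent().findtext("cleaned_term")

for path in glob.glob("ead/*.xml"):
    tree = etree.parse(path)
    changed = False
    for el in tree.iterfind(".//controlaccess/*"):
        if el.tag not in TERM_TAGS:
            continue
        term = " ".join("".join(el.itertext()).split())
        cleaned = mapping.get(term)
        if cleaned and cleaned != term:
            # Drop any child markup and replace the heading text outright
            for child in list(el):
                el.remove(child)
            el.text = cleaned
            changed = True
    if changed:
        tree.write(path, encoding="utf-8", xml_declaration=True)
```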


Final Thoughts

Admittedly, cobbling together all these steps was quite an undertaking, but once you have the architecture in place, this workflow can be incredibly useful for normalizing, reconciling, and deduping metadata values in any flavor of XML with just a few tweaks to the files provided.  Give it a try and let me know how it goes, or better yet, tell me a better way…please.

More resources for working with OpenRefine:

“Using Google Refine to Clean Messy Data” (Propublica Blog)

freeyourmetadata.org

Getting to the Finish Line: Wrapping Up Digital Collections Projects

Part of my job as Digital Collections Program Manager is to manage our various projects from idea to proposal to implementation and finally to publication. It can be a long and complicated process, with many different people taking part along the way.  When we (the Digital Collections Implementation Team, or DCIT) launch a project online, there are special blog posts, announcements, and media attention.  Everyone feels great about a successful project implementation; however, as the excitement of the launch subsides, the project team is not quite done. The last step in a digital collections project at Duke is the post-project review.

Project post-mortems keep the team from feeling like the men in this image!

Post-project reviews are part of project management best practices for effectively closing and assessing the outcomes of projects.  There are a lot of resources for project management available online, but as usual Wikipedia provides a good summary of project post-mortems as well as the different types and phases of project management in general.   Also, if you Google "project post-mortem," you will get more links than you know what to do with.

Process

As we finish up projects, we conduct what we call a "post-mortem," which is essentially a post-project review.   The name evokes autopsies, and what we do is not dissimilar, but thankfully there are no bodies involved (except when we closed up the recent Anatomical Fugitive Sheets digital collection – eh? see what I did there? wink wink).  The goals of our post-mortem process are for the project team to do the following:

  • Reflect on the project’s outcomes both positive and negative
  • Document any unique decisions or methods employed during the project
  • Document resources put into the project.

In practice, this means that I ask the project team to send me comments about what they thought went well and what was challenging about the project in question.   Sometimes we meet in person to do this, but often we send comments through email or our project management tool.  I also meet in person with each project champion as a project wraps up.  Project champions are the people who propose and conceive of a project.  I ask everyone the same general questions: what worked about the project, and what was challenging? With champions, this conversation is also an opportunity to discuss any future plans for promotion as well as to think of any related projects that may come up in the future.

DCIT's Post-Mortem Template

Once I have all the comments from the team and champion, I put them into my post-mortem template (see right; click to expand).  I also pull together project stats, such as the number of items published and the hours spent on the project.  Everyone in the core project team is asked to track and submit the hours they spend on projects, which makes pulling stats an easy process.  I designed the template I use as a Word document.  It's structured enough to be organized but unstructured enough for me to add new categories on the fly as needed (for example, we worked with a design contractor on a recent project, so I added a "working with contractor" section).

Seems like a simple enough process, right?  It is, assuming you have two ingredients.  First, you need a high degree of trust in your core team and good relationships with project stakeholders.  The ability to speak honestly (really, really honestly) about a project is a necessity for the information you gather to be useful.  Second, you actually have to conduct the review.  My team gets pulled so quickly from project to project that it's really easy to NOT make time for this process.  What helps my team is that post-mortems are a formal part of our project checklists.  Also, I worked with my team to set up our information-gathering process, so we all own it and it's relevant and easy for them.

DCIT is never too busy for project reviews!

Impacts

The impacts these documents have on our work are very positive. First, there is a short-term benefit simply in having the core team communicate what they thought worked and didn't work. Since we instituted this in the last year, we have used these lessons learned to make small but important changes to our process.

This process also gives the project team direct feedback from our project champions.  That is something I get a lot of through my informal interactions with various stakeholders in my role as project manager; however, the core team doesn't always get exposed to direct feedback, both positive and negative.

The long term benefit is using the data in these reports to make predictions about resources needed for future projects, track project outcomes at a program level, and for other uses we haven’t considered yet.

Further Resources

All in all, I cannot recommend a post-project review process enough to anyone and everyone who manages projects.  If you are not convinced by my template (which is very simple), there are lots of examples out there.  Google "project post-mortem templates" (or similar terminology) to see a huge variety.

There are also a few library and digital collections project related resources you may find useful as well:

Here is a blog post from California Digital Library on project post-mortems that was published in 2010, but remains relevant. 

UCLA’s Library recently published a “Library Special Collections Digital Project Toolkit” that includes an “Assessment and Evaluation” section and a “Closeout Questionnaire”


A Look Under the Hood—and the Flaps—of the Anatomical Fugitive Sheets Collection

We have digitized some fairly complex objects over the years that have challenged our Digital Collections team to push the boundaries of typical digital library solutions for digitization and publication. It happens often: objects we want to digitize are sort of like something we’ve done for a previous project, but not quite, so we can’t simply mimic whatever we did before to get the new project done. We’re frequently flexing our creative muscles.  In many cases, our most successful projects ended up that way because we didn’t concede to the temptation of representing items digitally in an oversimplified manner, or, worse still, as something they are not.

Working with so many rare and unique items from the Rubenstein Library through the years, we’ve become unfazed by these representation challenges and time and again have simply pulled together our team’s brainpower (and willpower) to make something work. Dare I say it, we’ve been unflappable. But this year, we met our match and surely needed some help.

In March, we published ten anatomical fugitive sheets from the 1500s to 1600s. They’re printed illustrations from the Rubenstein Library’s History of Medicine Collections, depicting the human body using layers of paper flaps that can be lifted to reveal internal organs. They’re amazing. They’re distinctive. And they’re really complicated.

Fugitive Sheet example, accessible online at http://library.duke.edu/digitalcollections/rubenstein_fgsms01003/ (Photo Credit: Les Todd)

The complexity of this project necessitated enlisting help from beyond the library’s walls. Early on, Prof. Mark Olson in Duke’s Art, Art History & Visual Studies department was instrumental in helping us identify modern technical approaches for capturing and modeling such objects. We contracted out development work through local web firm Cuberis, who programmed the bulk of the UI. In-house, we handled digitization, metadata, and integration with our discovery & access application with a lot of collaborative creativity between the digital collections team, the collection curator, conservators, and rare materials cataloger.

In a moment, I’ll discuss what modern technologies make the Fugitive Sheets interface hum. But first, here’s a look at what others have done with flap-based items.

Flaps in the Wind, Er… Wild

There are a few examples of anatomical flap objects represented on the Web, both at Duke and beyond. Common approaches include:

  1. A Sequence of Images. Capture one image of the full item for every state of the flaps possible, then let a user navigate them as if viewing a paginated document or photo sequence.
  2. Video. Either film someone lifting the flaps, or make an auto-playing video of the image sequence above.
  3. Flash. Develop a Flash application and put a SWF file on the web.

The third approach is actually what powers Duke’s Four Seasons project, which remains one of the best interactive historical anatomy interfaces available today. Developed way back in 2000 by Educational Media Services, Four Seasons began as a Java program distributed on CD-ROM (gasp!) and in subsequent years found a home as a Flash application embedded on the library website.

Flash-based flap interface for The Four Seasons, available at http://library.duke.edu/rubenstein/history-of-medicine/four-seasons

Flash has fallen out of favor over the last decade for many reasons, most notably: 1) it won’t work on iOS devices, 2) it’s bad for accessibility, 3) it’s invisible to search engines, and most importantly, 4) most of what Flash used to do exclusively can now be done just as well using HTML5.

Anatomy of a Modern Flap Interface

The Web has made giant leaps forward in the past five years due to advances in HTML, CSS, and Javascript and the evolution of web browsers. Key specs for HTML5 and CSS3 have been supported by all major browsers for several years now.  Below are the vital bits (so to speak) in use by the Anatomical Fugitive Sheets. Many of these things would not have worked (or worked well) on the Web five years ago.

HTML5 Parts

1. SVG (scalable vector graphics). An <svg> element in HTML contains shape data for each flap using a coordinates system. The <path> holds a string with line instructions using shorthand (M, L, c, etc.) for tracing the contour: MoveTo, Lineto, Curveto, Arcto. We duplicate the <path> with a transform attribute to render the shape of the back of the flap.

SVG coordinates in a <path> element representing the back of a flap.

2. Cross-window messaging API. Each fugitive sheet is rendered within an <iframe> on a page and the clickable layer navigation lives in its parent page, so they’re essentially two separate web pages presented as if one. Having a click in one page do something in another is possible through the Javascript method postMessage, part of the HTML5 spec.

  • From parent page to iframe: frame.contentWindow.postMessage(message, '*');
  • From iframe to parent page: window.top.postMessage(message, '*');

CSS3 Parts

  1. transition Property. Here’s where the flap animation action happens.  The flap elements all have the style declaration transition:1s ease-in-out. That ensures that when a flap property like height changes, it animates over the course of one second, slower at the start and end and quicker in the middle.  Clicking to open a flap calls a Javascript function that simultaneously switches the height of the flap front to zero and the back to its full size.
  2. transform Property. This scales down the figure and all its interactive components for display in the iframe, e.g., body.framed .flip-up-wrapper { transform:scale(.5) }; This scaling doesn’t apply in the full-size and zoomed-in views and thus enables the flaps to work identically at full- or half-resolution.

Capture & Encoding

Capture

Because the fugitive sheets are large and extremely fragile, our Digital Production Center staff and conservators worked carefully together to untangle and prop open each flap to be photographed separately. It often required two or more people to steady and flatten the flaps while being careful not to cast shadows on the layer being shot. I wasn’t there, but in my mind I imagine a game of library Twister.

Staff captured images using an overhead reproduction camera using white paper below each flap to make it easier to later determine and crop the contours. Unlike most images we digitize, the flaps’ derivative images are stored and delivered in PNG format to preserve transparency.

Encoding

As we do for all digital collections, we encode in an XML document the structural, administrative, and descriptive data about the digital objects using accepted library standards so that 1) the data can be preserved and ported between applications, and 2) we can use it to power our discovery & access interface. We use METS, a flexible Library of Congress standard for describing all kinds of digital objects.

METS worked pretty well for representing the flap data (see example), and we tapped into a few parts of the standard that we’ve never or rarely used for other items. Specifically, we:

  • added the LC MIX namespace for technical image metadata
  • used an amdSec to store flap heights & widths
  • used file/@GROUPID to divide flap images between figure 1, figure 2, etc.
  • used fptr/area/@COORDS to hold the SVG path coordinates for each flap

The descriptive metadata for the fugitive sheets posed its own challenges outside the box for our usual projects. All the information about the sheets existed as MARC catalog records, and crosswalking from MARC to anything else is more of an art than a science.

Looking Ahead

We’ll try to build on the accomplishments from the Fugitive Sheets Collection as we tackle new complex digitization projects. The History of Medicine Collections in particular are brimming with items that will be far more challenging than these sheets to model, like paginated flap books with fold-out pages and flaps that open in different directions. Undaunted, we’ll keep flapping our wings to stay aloft.

Embeds, Math & Beyond

This week, in conjunction with our H. Lee Waters Film Collection unveiling, we rolled out a handy new Embed feature for digital collections items.  The idea is to make it as easy as possible for someone to share their discoveries from our collections, with proper attribution, on other websites or blogs.

How To

It’s simple, really, and mimics the experience you’re likely to encounter getting embed code from other popular sites with videos, images, and the like. We modeled our approach loosely on the Internet Archive‘s video embed service (e.g., visit this video and click the Share icon, but only if you are unafraid of clowns).

Embed Link

Click the “Embed” link under an item from Duke Digital Collections, and copy the snippet of code that pops up. Paste it in your website, and you’re done!

Examples

I’ll paste a few examples below using different kinds of items. The embed code is short and nearly identical for all of these:

A Single Image

Paginated Item

A Video

Single-Track Audio

Multi-Track Audio

Document with Document Viewer

Technical Considerations

Building this feature required a little bit of math, some trial & error, and a few tricks. The steps were to:

  • Set up a service to return customized item pages at the path http://library.duke.edu/digitalcollections/embed/<itemid>/
  • Use CSS & JS to make the media as fluid as possible to fill whatever space it ends up in
  • Use a fixed height and overflow: auto on the attribution box so longer content will scroll
  • Use link rel=”canonical” to ensure the item’s embed page is associated with the real item page (especially to improve links / ranking signals for search engines).
  • Present the user a copyable HTML <iframe> element in the regular item page that has the correct height & width attributes to accommodate the item(s) to be embedded

This last point is where the math comes in. Take a single image item, for example. With a landscape-orientation image we need to give the user a different <iframe> height to copy than we would for a portrait. It gets even more complicated when we have to account for multiple tracks of audio or video, or combinations of the two.
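As a toy illustration of that math (the frame width and attribution-box height below are made-up numbers, not the site's actual values), the height offered in the copyable <iframe> snippet might be computed something like this:

```python
def embed_iframe_height(image_width, image_height, frame_width=600, attribution_height=80):
    """Scale the image to the embed width, then leave room for the attribution box."""
    scaled_image_height = round(image_height * frame_width / image_width)
    return scaled_image_height + attribution_height

# A landscape image needs a much shorter iframe than a portrait one
print(embed_iframe_height(3000, 2000))  # 480
print(embed_iframe_height(2000, 3000))  # 980
```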

Coming Soon

We’ll refine this feature a bit in the coming weeks, and work out any embed-bugs we discover. We’ll also be developing a similar feature for embedding digitized content found in our archival collection guides.

Assembling the Game of Stones

Back in October, Molly detailed DigEx’s work on creating an exhibit for the Link Media Wall. We’ve finally finalized our content and hope to have the new exhibit published to the large display in the next week or two. I’d like to detail how this thing is actually put together.

HTML Code

In our planning meetings, the super group talked about a few different approaches for how to start. We considered using a CMS like WordPress or Drupal, Four Winds (our institutional digital signage software), or potentially rolling our own system. In the end, though, I decided to build using super basic HTML / CSS / Javascript. After the group was happy with the design, I built a simple page framework to match our desired output of 3840 x 1080 pixels. And when I say simple, I mean simple.


I broke the content chunks into five main sections: the masthead (which holds the branding), the navigation (which highlights the current section and construction period), the map (which shows the location of the buildings), the thumbnail (which shows the completed building and adds some descriptive text), and the images (which houses a set of cross-fading historic photos illustrating the progression of construction). Working with a fixed-pixel layout feels strange in the modern world of web development, but it’s quick and satisfying to crank out. I’m using the jQuery Cycle plugin to transition the images, which is lightweight and offers lots of configurable options. I also created a transparent PNG file containing a gradient that fades to the background color which overlays the rotating images.

Another part of the puzzle I wrestled with was how to transition from one section of the exhibit to another. I thought about housing all of the content on a single page and using some JS to move from one to the next, but I was a little worried about performance so I again opted for the super simple solution. Each page has a meta refresh in the header set to the number of seconds that it takes to cycle through the corresponding set of images and with a destination of the next section of the exhibit. It’s a little clunky in execution and I would probably try something more elegant next time, but it’s solid and it works.

Here’s a preview of the exhibit cycling through all of the content. It’s been time compressed – the actual exhibit will take about ten minutes to play through.

In a lot of ways this exhibit is an experiment in both process and form, and I’m looking forward to seeing how our vision translates to the Media Wall space. Using such simple code means that if there are any problems, we can quickly make changes. I’m also looking forward to working on future exhibits and helping to highlight the amazing items in our collections.

New Angles & Avenues for Bitstreams

This week, we added a display of our most recent Bitstreams blog posts to our Digital Collections homepage (example), and likewise, a view of posts relevant to a given collection on the respective collection’s homepage (example).


Background

Our Digital Projects & Production team has been writing in Bitstreams at least weekly since February 2014. We’ve had some excellent guest contributors, too. Some posts share updates about new digital collections or additions, while others share insights, lessons learned, and behind-the-scenes looks at the projects we’re currently tackling.

Many of our posts have been featured on our library homepage and library news site. But until now, we haven’t been able to display any of them—not even the ones about new digital collections—alongside the collections themselves. So, if you visited the DukEngineer collection in the past, you likely missed out on Melanie’s excellent overview, which puts the magazine in context and highlights the best of what’s inside.

Past Solutions

Syndicating tagged blog posts for display elsewhere is a pretty common use case, and we've used a bunch of different solutions as our platforms have evolved. Each solution has naturally been painstakingly tailored to accommodate the inner workings of both the source and the destination. Seven years ago, we were writing custom XSLT to create and then consume our own RSS feeds in Cascade Server CMS. We have since hopped over to WordPress for managing news and blogs (whew!). An older version of our digital collections app used WordPress's XML-RPC API to get tagged posts and parsed them with Python.

These days, our library website does blog syndication by using a combo of WordPress RSS, Drupal’s feed aggregator module, and occasionally Yahoo! Pipes for data mashing and munging. It works well in Drupal, but other platforms require other approaches.

Under the Hood: AngularJS and WordPress JSON API

Bret Davidson’s Code4Lib 2014 presentation, Towards Pasta Code Nirvana: Using JavaScript MVC to Fill Your Programming Ravioli  (slides) made me hungry. Hungry for pasta, yes, but also for knowledge. I wanted to:

  1. Experiment with one of the Javascript MVC frameworks to learn how they work, and in the process…
  2. Build something potentially useful for digital collections that could be ported over to a new application framework in the future (e.g., from our current Django app to a future Ruby on Rails app).

From the many possibilities, I chose AngularJS. It seemed well-documented and increasingly popular, and with Google's backing, it seems like it'll be around for a while.

WordPress JSON API

Among Angular's virtues is that it really simplifies the process of getting and using JSON data from an API.  I found WordPress's JSON API plugin, which was interestingly developed by staff at MoMA so they could use WordPress as a back end to a site with a Rails front end.  So we first had to enable that for our Bitstreams blog.

AngularJS

AngularJS definitely helps keep code clean, especially by abstracting the model (the blog posts and associated characteristics, as well as the page state) from the view (which indicates how to display the data) from the controller (which gets and refines the data into the model, and updates the model upon interactions with the view). I've done several projects in the past using jQuery and DOM manipulation to retrieve and display data. It usually works, but in the process I create a veritable rat's nest of spaghetti code wherein /* no amount of commenting */ can truly help disentangle what's happening.

Angular also supercharges HTML with more useful attributes to control a display. I’ve only just scratched the surface, but it’s clear that built-in directives like ng-repeat and filters like limitTo spare me from writing a ton of Javascript, e.g., <li ng-repeat="post in blogposts | limitTo:pageSize">. After the initial learning curve, the markup is visually intuitive. And it’s nice that directives and filters are extensible so you can make your own.

Source code: controller js, HTML (view source)

Initial Lessons Learned

  • AngularJS has a steeper learning curve than I’d expected; I assumed I could do this mini-project in a few hours, but it took a couple days to really get a handle on the basic pieces I needed for this project.
  • Writing an Angular app within a Django app is tricky. Both use {{ variable }} template tags so I had to change Angular to use [[ variable ]] instead.

Looking Ahead

I consider this an encouraging proof of concept. While our own blog posts can be interesting, there are many other sources of valuable data out in the world that are relevant to our collections that would add value for our researchers if we were able to easily get and display them. AngularJS won’t be the answer to all of these needs, but it’s nice to have in the toolset.

Profiling Movement Activists in 7 Steps

SNCC workers prepare to go to Belzoni in the Fall of 1963 to organize for the Freedom Vote. Courtesy of www.crmvet.org.

On the surface, writing a 500-word profile about a SNCC (Student Nonviolent Coordinating Committee) field secretary or a Mississippi-beautician-turned-grassroots-organizer doesn’t seem like a formidable task. Five hundred words hardly takes ten minutes to type. But the One Person, One Vote project is aiming for more than short biographies; it’s trying to capture why each individual was important to the movement and show that using stories. So this is how we craft a profile in 7 steps:

Step 1: Choose a person

Back in June, the Editorial Board generated a list of people that the One Person, One Vote site needed to profile in order to understand SNCC’s voting rights activism. In less than an hour, we had a list of over a hundred names that included SNCC field secretaries, local people, movement elders, and everyone in between. And those were only the first names that came to mind! We narrowed that list down to 65 people, and that’s what the project team has been working from.

Step 2: Find out everything you can about the person

The first step in profile writing is research. We have a library of twenty books for instant referencing of secondary sources. Next comes surveying available primary sources. Our profiles include documents, photographs, audio clips, news stories, and other items created during the movement to make historical actors come to life. Some of our go-to places to find these sources include the Wisconsin Historical Society's Freedom Summer Digital Collection, the Civil Rights History Project at the Library of Congress, the Joseph Sinsheimer interviews and SNCC 40th Anniversary Conference tapes at Duke University, the Civil Rights in Mississippi Digital Archive at the University of Southern Mississippi, and the University of Georgia's Civil Rights Digital Library. There are more, of course, but that's the start.

Step 3: Figure out why the person you’re profiling was important to the movement

The central question behind every profile the project team writes is: who was _______ to the movement? Once you start filling in the blank, the answers vary to an incredible degree. Movement elders like Ella Baker and Myles Horton contributed to the movement in different, yet equally important, ways than Mississippi-born field secretaries like Charles McLaurin, Sam Block, and Willie Peacock. Trying to figure out how and why is no small undertaking. This is where the guidance of our Visiting Activist Scholar helps focus the One Person, One Vote site on the themes that were at the heart of the movement: grassroots activism, community organizing, and individual empowerment.

Step 4: Use stories to express that in 500 words

Next, spend hours trying to express who the person you’re profiling was to the movement in only five hundred words. The profiles on the One Person, One Vote site aren’t mini academic biographies. Instead, we try to tell stories that illustrate who the people were and how their lives and work influenced SNCC’s voting rights activism in the 1960s. Finding the right story to highlight these central themes is key, and telling it well takes time and (lots of) revision.

The OPOV project does all content production in Google Drive. Here is the OPOV Profile Log, which keeps track of profiles through the steps toward completion.

Step 5: Workshop profile draft with project team

The first draft of every profile is workshopped with the One Person, One Vote project team (made up of 4 undergrads, 2 graduate students, and the project manager). As a group, we suggest how profiles can better convey their central theme, make sure that the person’s story is readable and compelling, line edit for clunky writing, and go through the primary sources.

Step 6: Send to Visiting Activist Scholar for editing

All of the revised profile drafts go to the Visiting Activist Scholar for a final round of editing.  Charlie Cobb, a journalist and former SNCC field secretary, is our first Visiting Activist Scholar.  He helps bring the profiles to life in ways that only someone who was a part of the movement can.  Charlie adds details about events, mannerisms of people, and behind-the-scenes stories that never made it into history books.  While the project team relies on available primary and secondary sources, the Visiting Activist Scholar adds something extra to the profiles on the One Person, One Vote site.

Step 7:  Voila!

Each profile goes through one last round of proofreading and polishing.  Then, come 2015, they will be posted on the One Person, One Vote site, and the primary sources will be embedded in and linked to from the Resources section of the profile pages.  Voila!


Preview of the W. Duke, Sons & Co. Digital Collection

When I almost found the T206 Honus Wagner

It was September 6, 2011 (thanks, Exif metadata!) and I thought I had found one: a T206 Honus Wagner card, the "Holy Grail" of baseball cards.  I was in the bowels of the Rubenstein Library stacks skimming through several boxes of a large collection of trading cards that form part of the W. Duke, Sons & Co. advertising materials collection when I noticed a small envelope labeled "Piedmont."  For some reason, I remembered that the Honus Wagner card was issued as part of a larger set of cards advertising the Piedmont brand of cigarettes in 1909.  Yeah, I got pretty excited.

I carefully opened the envelope, removed a small stack of cards, and laid them out side by side, but, sadly, there was no Honus Wagner to be found.  A bit deflated, I took a quick snapshot of some of the cards with my phone, put them back in the envelope, and went about my day.  A few days later, I noticed the photo again in my camera roll and, after a bit of research, confirmed that these cards were indeed part of the same T206 set as the famed Honus Wagner card but not nearly as rare.

Fast forward three years and we’re now in the midst of a project to digitize, describe, and publish almost the entirety of the W. Duke, Sons & Co. collection including the handful of T206 series cards I found.  The scanning is complete (thanks DPC!) and we’re now in the process of developing guidelines for describing the digitized cards.  Over the last few days, I’ve learned quite a bit about the history of cigarette cards, the Duke family’s role in producing them, and the various resources available for identifying them.

1909 Series T206 Harry Lumley card (front), from the W. Duke, Sons & Co. collection in the Rubenstein Library
1909 Series T206 Harry Lumley card (back)


Brief History of Cigarette Cards

“A Bad Decision by the Umpire,” from series N86 Scenes of Perilous Occupations, W. Duke, Sons & Co. collection, Rubenstein Library.
  • Beginning in the 1870s, cigarette manufacturers like Allen and Ginter and Goodwin & Co. began the practice of inserting a trade card into cigarette packages as a stiffener. These cards were usually issued in sets of between 25 and 100 to encourage repeat purchases and to promote brand loyalty.
  • In the late 1880s, W. Duke, Sons & Co. (founded by Washington Duke in 1881) began inserting cards into Duke brand cigarette packages.  The earliest Duke-issued cards covered a wide array of subject matter, with series titled Actors and Actresses, Fishers and Fish, Jokes, Ocean and River Steamers, and even Scenes of Perilous Occupations.
  • In 1890, W. Duke, Sons & Co., headed by James B. Duke (founder of Duke University), merged with several other cigarette manufacturers to form the American Tobacco Company.
  • In 1909, the American Tobacco Company (ATC) first began inserting baseball cards into their cigarettes packages with the introduction of the now famous T206 “White Border” set, which included a Honus Wagner card that, in 2007, sold for a record $2.8 million.
Title page from library’s copy of The American Card Catalog by Jefferson R. Burdick.

Identifying Cigarette Cards

  • The T206 designation assigned to the ATC’s “white border” set was not assigned by the company itself, but by Jefferson R. Burdick in his 1953 publication The American Card Catalog (ACC), the first comprehensive catalog of trade cards ever published.
  • In the ACC, Burdick devised a numbering scheme for tobacco cards based on manufacturer and time period, with the two primary designations being the N-series (19th century tobacco cards) and the T-series (20th century tobacco cards).  Burdick’s numbering scheme is still used by collectors today.
  • Burdick was also a prolific card collector and his personal collection of roughly 300,000 trade cards now resides at the Metropolitan Museum of Art in New York.


Preview of the W. Duke, Sons & Co. Digital Collection [coming soon]

“Dressed Beef” from Series N81 Jokes, W. Duke, Sons & Co. collection, Rubenstein Library
  •  When published, the W. Duke, Sons & Co. digital collection will feature approximately 2000 individual cigarette cards from the late 19th and early 20th centuries as well as two large scrapbooks that contain several hundred additional cards.
  • The collection will also include images of other tobacco advertising ephemera such as pins, buttons, tobacco tags, and even examples of early cigarette packs.
  • Researchers will be able to search and browse the digitized cards and ephemera by manufacturer, cigarette brand, and the subjects they depict.
  • In the meantime, researchers are welcome to visit the Rubenstein Library in person to view the originals in our reading room.