Six or seven years ago, we discovered a handy new data mashup service from Yahoo! called Yahoo! Pipes. It had a slick drag-n-drop visual programming interface that made it easy to grab data from a bunch of different live sources, then combine, reshape, and conditionally change it into a new dynamic feed modeled however we happened to need it. “Pipes” was a perfect name, a nod to the | (pipe) character used in Unix to chain command-line inputs and outputs, and evocative of the blue pipes you would drag to connect modules in the Pipes UI to funnel data from one to another. It was—quite literally—a series of tubes.
Over the years, we grew to rely on Yahoo! Pipes’ data-mashing wizardry for several features central to the presentation of information on our library website. If you’ve read Bitstreams in the past, you probably have followed a link that was shuttled through Pipes before ultimately being rendered on the website.
Here are some of the things we had done in the library website that Pipes made possible:
Collect data from various sources on the web and transform it
Combine disparate data into a single stream
Emit a new customized feed at a URL for other services to access
Differences from Pipes
No visual editor; instead, you hand-code JSON to configure
Open source rather than hosted; you have to run it yourself
Constantly being improved by developers worldwide
A Ruby on Rails app; can be forked/customized as needed
To recreate each feed we’d built in Pipes, we had to build two kinds of Huginn Agents: one or more “Website Agents” to gather and extract the data we need, then a “Data Output Agent” to publish a new customized feed. Agents are set up by writing some configuration rules structured as JSON.
Huginn description: “The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.”
With a Website Agent, we’re gathering data from a source (for us, typically RSS or raw XML). We specify a URL, then start structuring what elements we want to extract using XPath expressions.
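To make the extraction step concrete, here is a minimal Python sketch of the same idea: name the fields you want, then give each one an XPath expression to pull it out of the source. This is not Huginn's code or its configuration format; the sample feed, the field names, and the `extract_events` helper are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real source would be fetched from a URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.org/posts/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.org/posts/2</link>
    </item>
  </channel>
</rss>"""

# Each key names a field to extract; each value is an XPath relative to one item.
EXTRACT_RULES = {"title": "title", "url": "link"}


def extract_events(xml_text, item_path="./channel/item", rules=EXTRACT_RULES):
    """Return one dict (an "event") per item, with the fields named in rules."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.findall(item_path):
        events.append({field: item.findtext(path) for field, path in rules.items()})
    return events


events = extract_events(SAMPLE_RSS)
for event in events:
    print(event)
```

Each dictionary here plays the role of one extracted event, ready to be handed to whatever publishes the new feed.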
Data Output Agent
Huginn description: “The Data Output Agent outputs received events as either RSS or JSON. Use it to output a public or private stream of Huginn data.”
The Data Output Agent uses one or more Website Agents as data sources. We configure some rules about what to expose and can further refine the data in the output using Liquid Templating. In the case of New Additions to the catalog, it’s here where we make a <media:content> element in our feed and assemble a URL to a cover image from bits of data extracted from the raw XML.
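The cover-image step can be sketched in Python as follows. This is only an illustration of assembling a URL from extracted data and attaching it as a <media:content> element; it is not the Liquid template we use. The cover-image URL pattern, the `isbn` field, and the `build_item` helper are hypothetical, since the real values are not given here.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"  # Media RSS namespace
ET.register_namespace("media", MEDIA_NS)

# Hypothetical cover-image service; the real URL pattern is an assumption.
COVER_URL_TEMPLATE = "https://covers.example.org/isbn/{isbn}/medium.jpg"


def build_item(event):
    """Build one RSS <item> with a <media:content> element for the cover image."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = event["title"]
    ET.SubElement(item, "link").text = event["url"]
    cover_url = COVER_URL_TEMPLATE.format(isbn=event["isbn"])
    ET.SubElement(item, f"{{{MEDIA_NS}}}content", {"url": cover_url, "medium": "image"})
    return item


# Hypothetical event, shaped like the output of an extraction step.
event = {
    "title": "Example Book",
    "url": "https://example.org/catalog/123",
    "isbn": "9780000000002",
}
item_xml = ET.tostring(build_item(event), encoding="unicode")
print(item_xml)
```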
So far, so good. Huginn is now successfully powering most of the feeds that we had previously managed through Yahoo! Pipes. We look forward to seeing what kinds of features are added by the developer community.
Shoutouts to Cory Lown & Michael Daul for all their work in helping make the transition from Pipes to Huginn.
Many of my Bitstreams posts have featured old-school audio formats (wax cylinder, cassette and open reel tape, Minidisc) and discussed how we go about digitizing these obsolete media to bring them to present-day library users at the click of a mouse. In this post, I will take a different tack and show how this sound technology was represented and marketed during its heyday. The images used here are taken from one of our very own digital collections–the Duke Chronicle of the 1960s.
Students of that era would have primarily listened to music on vinyl records purchased directly from a local retailer. The advertisement above boasts of “complete stocks, latest releases, finest variety” with sale albums going for as little as $2.98 apiece. This is a far cry from the current music industry landscape where people consume most of their media via instant download and streaming from iTunes or Spotify and find new artists and songs via blogs, YouTube videos, or social media. The curious listener of the 1960s may have instead discovered a new band through word of mouth, radio, or print advertising. If they were lucky, the local record shop would have the LP in stock and they could bring it home to play on their hi-fi phonograph (like the one shown below). Notice that this small “portable” model takes up nearly the whole tabletop.
Duke students of the 1960s would have also used magnetic tape-based media for recording and playing back sound. The advertisement above uses Space Age imagery and claims that the recorder (“small enough to fit in the palm of your hand”) was used by astronauts on lunar missions. Other advertisements suggest more grounded uses for the technology: recording classroom lectures, practicing public speaking, improving foreign language comprehension and pronunciation, and “adding fun to parties, hayrides, and trips.”
Creative uses of the technology are also suggested. The “Add-A-Track” system allows you to record multiple layers of sound to create your own unique spoken word or musical composition. You can even use your tape machine to record a special message for your Valentine (“the next best thing to you personally”). Amplifier kits are also available for the ambitious electronics do-it-yourselfer to build at home.
These newspaper ads demonstrate just how much audio technology and our relationship to it have changed over the past 50 years. Everything is smaller, faster, and more “connected” now. Despite these seismic shifts, one thing hasn’t changed. As the following ad shows, the banjo never goes out of style.
Last year we at Duke University Libraries circulated a prospectus for our still-young partnership with the SNCC Legacy Project, seeking bids from web contractors to help with developing the web site that we rolled out last March as One Person, One Vote (OPOV). Now, almost 18 months later, we’re back – but wiser – hoping to do it again – but bigger.
Thanks to a grant from the Mellon Foundation, we’ll be moving to a new phase of our partnership with the SNCC Legacy Project and the Center for Documentary Studies. The SNCC Digital Gateway will build on the success of the OPOV pilot, bringing Visiting Activist Scholars to campus to work with Duke undergraduates and graduates on documenting the historic drive for voting rights, and the work of the Student Nonviolent Coordinating Committee.
As before, we seek an experienced and talented contractor to join with our project team to design and build a compelling site. If you think your outfit might be right for the job, please review the RFP embedded below and get in touch.
We experience a number of different cycles in the Digital Projects and Production Services Department (DPPS). There is of course the project lifecycle, that mysterious abstraction by which we try to find commonalities in work processes that can seem unique for every case. We follow the academic calendar, learn our fate through the annual budget cycle, and attend weekly, monthly, and quarterly meetings.
The annual reporting cycle at Duke University Libraries usually falls to departments in August, with those reports informing a master library report completed later. Because of the activities and commitments around the opening of the Rubenstein Library, the departments were let off the hook for their individual reports this year. Nevertheless, I thought I would use my turn in the Bitstreams rotation to review some highlights from our 2014-15 cycle.
This sermon struck me because of its direct reference to specific events related to the Civil Rights Movement (at least more than the others) and how closely it echoes current events across the nation, particularly the story of Emmett Till’s horrific murder and the fact that his mother chose to have an open casket so that everyone could see the brutality of racism.
I am in awe of the strength it must have taken Emmett’s mother, Mamie Till, to make the decision to have an open casket at her son’s funeral.
Duke has many collections related to the history of the Civil Rights Movement. This collection provides a religious context to the events of our relatively recent past, not only of the Civil Rights Movement but of many social, political and spiritual issues of our time.
In a recent feature on their blog, our colleagues at NCSU Libraries posted some photographs of dogs from their collections. Being a person generally interested in dogs and old photographs, I became curious where dogs show up in Duke’s Digital Collections. Using very unsophisticated methods, I searched digital collections for “dogs” and thought I’d share what I found.
Of the 60 or so collections in Digital Collections, 19 contain references to dogs. The table below lists the collections in which dogs or references to dogs appear most frequently.
Today we will take a detailed look at how the Duke Chronicle, the university’s beloved newspaper for over 100 years, is digitized. Since our scope of digitization spans nine decades (1905-1989), it is an ongoing project the Digital Production Center (DPC), part of Digital Projects and Production Services (DPPS) and Duke University Libraries’ Digital Collections Program, has been chipping away at. Scanning and digitizing may seem straightforward to many – place an item on a scanner and press scan, for goodness sake! – but we at the DPC want to shed light on our own processes to give you a sense of what we do behind the scenes. It seems like an easy-peasy process of scanning and uploading images online, but there is much more that goes into it than that. Digitizing a large collection of newspapers is not always a fun-filled endeavor, and the physical act of scanning thousands of news pages is done by many dedicated (and patient!) student workers, staff members, and me, the King Intern for Digital Collections.
Many steps in the digitization process do not actually occur in the DPC, but among other teams or departments within the library. Though I focus mainly on the DPC’s responsibilities, I will briefly explain the steps others perform in this digital projects tango…or maybe it’s a waltz?
Each proposed project must first be approved by the Advisory Council for Digital Collections (ACDC), a team that reviews each project for its strategic value. Then it is passed on to the Digital Collections Implementation Team (DCIT) to perform a feasibility study that examines the project’s strengths and weaknesses (see Thomas Crichlow’s post for an overview of these teams). The DCIT then helps guide the project to fruition. After clearing these hoops back in 2013, the Duke Chronicle project started its journey toward digital glory.
We pull 10 years’ worth of newspapers at a time from the University Archives in Rubenstein Library. Only one decade at a time is processed to make the 80+ years of Chronicle publications more manageable. The first stop is Conservation. To make sure the materials are stable enough to withstand digitizing, Conservation must inspect the condition of the paper prior to giving the DPC the go-ahead. Because newspapers since the mid-19th century were printed on cheap and very acidic wood pulp paper, the pages can become brittle over time and may warrant extensive repairs. Senior Conservator, Erin Hammeke, has done great work mending tears and brittle edges of many Chronicle pages since the start of this project. As we embark on digitizing the older decades, from the 1940s and earlier, Erin’s expertise will be indispensable. We rely on her not only to repair brittle pages but to guide the DPC’s strategy when deciding the best and safest way to digitize such fragile materials. Also, several volumes of the Chronicle have been bound, and to gain the best digital image scan these must be removed from their binding. Erin to the rescue!
Now that Conservation has assessed the condition and given the DPC the green light, preliminary prep work must still be done before the scanner comes into play. A digitization guide is created in Microsoft Excel to list each Chronicle issue along with its descriptive metadata (more information about this process can be found in my metadata blog post). This spreadsheet acts as a guide in the digitization process (hence its name, digitization guide!) to keep track of each analog newspaper issue and, once scanned, its corresponding digital image. In this process, each Chronicle issue is inspected to collect the necessary metadata. At this time, a unique identifier is assigned to every issue based on the DPC’s naming conventions. This identifier stays with each item for the duration of its digital life and allows for easy identification of one among thousands of Chronicle issues. At the completion of the digitization guide, the Chronicle is now ready for the scanner.
The Scanning Process
With all loose unbound issues, the Zeutschel is our go-to scanner because it allows for large format items to be imaged on a flat surface. This is less invasive and less damaging to the pages, and is quicker than other scanning methods. The Zeutschel can handle items up to 25 x 18 inches, which accommodates the larger sized formats of the Chronicle used in the 1940s and 1950s. If bound issues must be digitized, due to the absence of a loose copy or the inability to safely disbind a volume, the Phase One digital camera system is used as it can better capture large bound pages that may not necessarily lie flat.
For every scanning session, we need the digitization guide handy as it tells what to name the image files using the previously assigned unique identifier. Each issue of the newspaper is scanned as a separate folder of images, with one image representing one page of the newspaper. This system of organization allows for each issue to become its own compound object – multiple files bound together with an XML structure – once published to the website. The Zeutschel’s scanning software helps organize these image files into properly named folders. Of course, no digitization session would be complete without the initial target scan that checks for color calibration (See Mike Adamo’s post for a color calibration crash course).
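The folder-per-issue, image-per-page layout can be sketched in a few lines of Python. The identifier format, the `scans` root folder, the four-digit page suffix, and the `.tif` extension are all hypothetical stand-ins; the DPC's actual naming conventions are not spelled out in this post.

```python
from pathlib import PurePosixPath


def page_image_paths(issue_id, page_count, root="scans", ext="tif"):
    """Return one image path per page, grouped in a folder named after the issue."""
    folder = PurePosixPath(root) / issue_id
    return [
        str(folder / f"{issue_id}_{page:04d}.{ext}")
        for page in range(1, page_count + 1)
    ]


# Hypothetical identifier for a three-page issue.
paths = page_image_paths("chronicle_1980-01-15", 3)
for path in paths:
    print(path)
```

Because every file name carries the issue identifier, a stray image can be traced back to its row in the digitization guide.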
The scanner’s plate glass can now be raised with the push of a button (or the tap of a foot pedal) and the Chronicle issue is placed on the flatbed. Lowering the plate glass flattens the pages for a better scan result. Now comes the excitement… we can finally press SCAN. For each page, the plate glass is raised and lowered, and the scan button is pressed. Chronicle issues can have anywhere from 2 to 30 or more pages, so you can imagine this process can become monotonous – or even mesmerizing – at times. Luckily, with the smaller format decades, like the 1970s and 1980s, the inner pages can be scanned two at a time and the Zeutschel software separates them into two images, which cuts down on the scan time. As for the larger formats, the pages are so big you can only fit one on the flatbed. That means each page is a separate scan, but older years tended to publish fewer issues, so it’s a trade-off. To put the volume of this work into perspective, the 1,408 issues of the 1980s Chronicle took 28,089 scans to complete, while the 1950s Chronicle of about 482 issues took around 3,700 scans to complete.
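Those totals are easier to compare as scans per issue. The short Python calculation below uses only the figures quoted above; the 1950s numbers are approximate, and because 1980s inner pages were scanned two at a time, scans per issue understates pages per issue for that decade.

```python
# Figures quoted in the post; the 1950s values are approximate.
decades = {
    "1980s": {"issues": 1408, "scans": 28089},
    "1950s": {"issues": 482, "scans": 3700},
}


def scans_per_issue(issues, scans):
    """Average number of scans needed for one issue."""
    return scans / issues


averages = {name: scans_per_issue(**d) for name, d in decades.items()}
for name, avg in averages.items():
    print(f"{name}: about {avg:.1f} scans per issue")
```

By this rough measure, a 1980s issue took about 20 scans and a 1950s issue about 8.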
Every scanned image that pops up on the screen is also checked for alignment and cropping errors that may require a re-scan. Once all the pages in an issue are digitized and checked for errors, clicking the software’s Finalize button will compile the images in the designated folder. We now return to our digitization guide to enter in metadata pertaining to the scanning of that issue, including capture person, capture date, capture device, and what target image relates to this session (subsequent issues do not need a new target scanned, as long as the scanning takes place in the same session).
Now, with the next issue, rinse and repeat: set the software settings and name the folder, scan the issue, finalize, and fill out the digitization guide. You get the gist.
We now find ourselves with a slew of folders filled with digitized Chronicle images. The next phase of the process is quality control (QC). Once every issue from the decade is scanned, the first round of QC checks all images for excess borders to be cropped, crooked images to be squared, and any other minute discrepancy that may have resulted from the scanning process. This could be missing images, pages out of order, or even images scanned upside down. This stage of QC is often performed by student workers who diligently inspect image after image using Adobe Photoshop. The second round of QC is performed by our Digital Production Specialist, Zeke Graves, who gives every item a final pass.
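Some of these checks, such as spotting a missing or duplicated page number, lend themselves to a simple script, while crooked or upside-down images still need a human eye. The Python sketch below is an illustration only, not part of our actual workflow. It assumes a hypothetical file-name pattern ending in an underscore, a page number, and `.tif`.

```python
import re

# Hypothetical file-name suffix, e.g. "_0003.tif"; the real convention may differ.
PAGE_PATTERN = re.compile(r"_(\d+)\.tif$")


def find_sequence_problems(filenames):
    """Report page numbers that are missing or duplicated in one issue folder."""
    numbers = []
    unparsed = []
    for name in filenames:
        match = PAGE_PATTERN.search(name)
        if match:
            numbers.append(int(match.group(1)))
        else:
            unparsed.append(name)
    # Pages are expected to run from 1 up to the highest number seen.
    expected = set(range(1, max(numbers) + 1)) if numbers else set()
    missing = sorted(expected - set(numbers))
    duplicates = sorted({n for n in numbers if numbers.count(n) > 1})
    return {"missing": missing, "duplicates": duplicates, "unparsed": unparsed}


# Hypothetical folder listing with one gap, one duplicate, and one stray file.
listing = ["a_0001.tif", "a_0002.tif", "a_0004.tif", "a_0004.tif", "notes.txt"]
report = find_sequence_problems(listing)
print(report)
```

A check like this cannot tell whether the image content matches its page number, so it complements visual inspection rather than replacing it.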
At this stage, derivatives of the original preservation-quality images are created. The originals are archived in dark storage, while the smaller-sized derivatives are used in the CONTENTdm ingest process. CONTENTdm is the digital collection management software we use that collates the digital images with their appropriate descriptive metadata from our digitization guide, and creates one compound object for each Chronicle issue. It also generates the layer of Optical Character Recognition (OCR) data that makes the Chronicle text searchable, and provides an online interface for users to discover the collection once published on the website. The images and metadata are ingested into CONTENTdm’s Project Client in small batches (1 to 3 years of Chronicle issues) to reduce the chance of upload errors. Once ingested into CONTENTdm, the items are then spot-checked to make sure the metadata paired up with the correct image. During this step, other metadata is added that is specific to CONTENTdm fields, including the ingest person’s initials. Then, another ingest must run to push the files and data from the Project Client to the CONTENTdm server. A third step after this ingest finishes is to approve the items in the CONTENTdm administrative interface. This gives the go-ahead to publish the material online.
Hold on, we aren’t done yet. The project is now passed along to our developers in DPPS who must add this material to our digital collections platform for online discovery and access (they are currently developing Tripod3 to replace the previous Tripod2 platform, which is more eloquently described in Will Sexton’s post back in April). Not only does this improve discoverability, but it makes all of the library’s digital collections look more uniform in their online presentation.
Then, FINALLY, the collection goes live on the web. Now, just repeat the process for every decade of the Duke Chronicle, and you can see how this can become a rather time-heavy and laborious process. A labor of love, that is.
I could have narrowly stuck with describing to you the scanning process and the wonders of the Zeutschel, but I felt that I’d be shortchanging you. Active scanning is only a part of the whole digitization process which warrants a much broader narrative than just “push scan.” Along this journey to digitize the Duke Chronicle, we’ve collectively learned many things. The quirks and trials of each decade inform our process for the next, giving us the chance to improve along the way (to learn how we reflect upon each digital project after completion, go to Molly Bragg’s blog post on post-mortem reports).
If your curiosity is piqued as to how the Duke Chronicle looks online, the Fall 1959-Spring 1970 and January 1980-February 1989 issues are already available to view in our digital collections. The 1970s Chronicle is the next decade slated for publication, followed by the 1950s. Though this isn’t a comprehensive detailed account of the digitization process, I hope it provides you with a clearer picture of how we bring a collection, like the Duke Chronicle, into digital existence.
One project we’ve been working on recently in the Digital Projects Department is a revamped Library Exhibits website that will launch in concert with the opening of the newly renovated Rubenstein Library in August. The interface is going to focus on highlighting the exhibit spaces, items, and related events. Here’s a mockup of where we hope to be shortly:
On a somewhat related note, I recently traveled to Italy and was able to spend an afternoon at the Venice Biennale, which is an international contemporary art show that takes place every other year. Participating artists install their work across nearly 90 pavilions and there’s also a central gallery space for countries that don’t have their own buildings. It’s really an impressive amount of work to wander through in a single day and I wasn’t able to see everything, but many of the works I did see were amazing. Three exhibits in particular were striking to me.
Garden of Eden
The first I’ll highlight is the work of Joana Vasconcelos, titled Il Giardino dell’Eden, which was housed in a silver tent of a building from one of the event sponsors, Swatch (the watch company). As I entered I was immediately met with a dark and cool space, which was fantastic on this particularly hot and humid day. The room was filled with an installation of glowing fiber optic flowers that pulsated with different patterns of color. It was beautiful and super engaging. I spent a long time wandering through the pathway trying to take it all in.
Another engrossing installation was housed in the French Pavilion: Revolutions by Celeste Boursier-Mougenot. I walked into a large white room where a tree with a large exposed rootball was sitting off to the side. There were deep meditative tones being projected from somewhere close by. I noticed people were lounging in the wings of the space, so I wandered over to check it out for myself. What looked like a wooden bleacher of sorts actually turned out to be made of some sort of painted foam. I stumbled and laughed when I first tried to walk on it, as did many others who came into the space later, and then I plopped down to soak in the exhibit. I noticed the deep tones were subtly rhythmic and they definitely gave off a meditative vibe, so it was nice to relax a bit after a long day of walking. But then I noticed the large tree was not where it had been when I first entered the room. It was moving, but very slowly. Utterly interesting. It almost seemed to levitate. I’d really like to know how it worked (there were also two more trees outside the pavilion that moved in the same way). Overall it was a fantastic experience.
Red Sea of Keys
My favorite installation was in the Japanese Pavilion: The Key in the Hand by Chiharu Shiota. The space was filled with an almost incomprehensible number of keys dangling from entangled red yarn suspended from the ceiling of the room. There were also a few small boats positioned around the space. My first instinct was that I was standing underneath a red sea. It’s really hard to describe just how much ‘red’ there actually is in the space. The intricacy of the threads and the uniqueness of almost every key I looked at were simply mind-blowing. I think my favorite part of the exhibit was nestled in a corner of the room where an iPad sat looping a time-compressed video of the installation of the work. It was uniquely satisfying to watch it play out and come together over and over. I’m not sure how to tap into that experience for exhibits in the library, but it’s something we can certainly aim for!
Sean Aery, Digital Projects Developer, Duke
Rachel Ingold, Curator for the History of Medicine Collections, Duke
Duke’s Digital Collections program recently published a remarkable set of 16th-17th century anatomical fugitive sheets from The Rubenstein Library’s History of Medicine Collections. These illustrated sheets are similar to broadsides, but feature several layers of delicate flaps that lift to show inside the human body. The presenters will discuss the unique challenges posed by the source material including conservation, digitization, description, data modeling, and UI design. They will also demonstrate the resulting digital collection, which has already earned several accolades for its innovative yet elegant solutions for a project with a high degree of complexity.
One of the most tedious and time-consuming tasks we do in the Digital Production Center is cropping and straightening still image files. Hired students spend hours sitting at our computers, meticulously straightening and cropping extraneous background space out of hundreds of thousands of photographed images, using Adobe Photoshop. This process is necessary in order to present a clean, concise image for our digital collections, but it causes delays in the completion of our projects, and requires a lot of student labor. Auto cropping software has long been sought after in digital imaging, but few developers have been able to make it work efficiently, for all materials. The Digital Production Center’s Zeutschel overhead scanner utilizes auto cropping software, but the scanner can only be used with completely flat media, due to its limited depth of field. Thicker and more fragile materials must be photographed using our Phase One digital camera system, shown above.
Recently, Digital Transitions, which is the supplier of Phase One and its accompanying Capture One software, announced an update to the software which includes an auto crop and straightening feature. The new software is called Capture One Cultural Heritage, and is specifically designed for use in libraries and archival institutions. The auto crop feature, previously unavailable in Capture One, is a real breakthrough, and there are several options for how to use it.
First of all, the user can choose to auto crop “On Capture” or “On Crop.” That is, the software can auto crop instantly, right after a photograph has been taken (On Capture), or it can be applied to the image, or batch of images, at a later time (On Crop). You can also choose between auto cropping at a fixed size, or by the edge of the material. For instance, if you are photographing a collection of posters that are all sized 18” x 24,” you would choose “Fixed Size” and set the primary crop to “18 x 24,” or slightly larger if you want your images to have an outer border. The software recognizes the rectangular shape, and applies the crop. If you are photographing a collection of materials that are a variety of different sizes, you would choose “Generic,” which tells the software to crop wherever it sees a difference between the edge of the material and the background. “Padding” can be used to give those images a border.
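For readers curious how edge-based cropping with padding can work in principle, here is a toy Python sketch. It is not Capture One's algorithm, which is not described here. The image is a small grid of grayscale values, and the `auto_crop_box` function, its `tolerance` threshold, and the sample values are assumptions made up for illustration.

```python
def auto_crop_box(pixels, background, tolerance=10, padding=0):
    """Return (top, left, bottom, right), inclusive, around non-background pixels.

    A pixel counts as part of the object when it differs from the background
    value by more than the tolerance. Padding widens the box on every side
    without running past the image edges. Returns None if nothing is found.
    """
    height, width = len(pixels), len(pixels[0])
    rows = [
        r for r, row in enumerate(pixels)
        if any(abs(v - background) > tolerance for v in row)
    ]
    cols = [
        c for c in range(width)
        if any(abs(row[c] - background) > tolerance for row in pixels)
    ]
    if not rows or not cols:
        return None
    top = max(rows[0] - padding, 0)
    left = max(cols[0] - padding, 0)
    bottom = min(rows[-1] + padding, height - 1)
    right = min(cols[-1] + padding, width - 1)
    return top, left, bottom, right


# Toy 5 x 6 image: a bright 2 x 2 object on a black (0) background.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 200, 210, 0, 0],
    [0, 0, 205, 215, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
tight_box = auto_crop_box(image, background=0)
padded_box = auto_crop_box(image, background=0, padding=1)
print(tight_box, padded_box)
```

A fixed-size crop would skip the edge search and simply place a box of known dimensions, which is why it suits collections where every item is the same size.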
Because Capture One utilizes raw files, the auto crops are non-destructive edits. One benefit of this is that if your background color is close to the color of your material, you can temporarily adjust the contrast of the photograph in order to darken the edges of the object, thus enhancing the delineation between object and background. Next apply the auto crop, which will be more successful due to its ability to recognize the newly-defined edges of the material. After the crops are applied, you can reverse the contrast adjustment, thus returning the images to their original state, while still keeping the newly-generated crops.
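The contrast trick can be illustrated with a toy Python sketch: find the crop on a contrast-boosted copy, then apply that crop to the untouched original. This is a simplified model under stated assumptions, not how Capture One works internally. The helper names, the grayscale grid, and the threshold are hypothetical. In this toy model the boost is mathematically equivalent to lowering the threshold. The point is only that the original pixel values are never altered.

```python
def boost_contrast(pixels, pivot, factor):
    """Return a new grid with values pushed away from the pivot; input is untouched."""
    return [[pivot + (v - pivot) * factor for v in row] for row in pixels]


def find_box(pixels, background, tolerance):
    """Return (top, left, bottom, right), inclusive, or None if no edge is detected."""
    rows = [
        r for r, row in enumerate(pixels)
        if any(abs(v - background) > tolerance for v in row)
    ]
    cols = [
        c for c in range(len(pixels[0]))
        if any(abs(row[c] - background) > tolerance for row in pixels)
    ]
    if not rows or not cols:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def crop(pixels, box):
    """Cut the box out of the grid, returning a new grid."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in pixels[top:bottom + 1]]


# Toy image: the object (values 106-108) is barely lighter than the background (100).
original = [
    [100, 100, 100, 100],
    [100, 108, 106, 100],
    [100, 107, 108, 100],
    [100, 100, 100, 100],
]
direct_box = find_box(original, background=100, tolerance=10)   # too faint to detect
boosted = boost_contrast(original, pivot=100, factor=3)         # temporary adjustment
box = find_box(boosted, background=100, tolerance=10)           # edges now stand out
cropped = crop(original, box)                                   # original values kept
print(direct_box, box, cropped)
```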
Like a lot of technological advances, reliable auto cropping seemed like a fantasy just a few years ago, but is now a reality. It doesn’t work perfectly every time, and quality control is still necessary to uncover errors, but it’s a big step forward. The only thing disconcerting is the larger question facing our society. How long will it be before our work is completely automated, and humans are left behind?
Notes from the Duke University Libraries Digital Projects Team