Last fall, I wrote about how we were embarking on a large-scale remediation of our digital collections metadata in preparation for migrating those collections to the Duke Digital Repository. I started remediation at the beginning of this calendar year, and it's been at times slow-going but ultimately a pretty satisfying experience. I'm always interested in hearing how other people working with metadata actually do their work, so I thought I would share some of the tools and techniques I have been using on this project.
First things first: documentation
The metadata task group charged with this work undertook a broad review and analysis of our data, and created a giant Google spreadsheet containing all of the fields currently in use in our collections to document the following:
- Which collections used the field?
- Was usage consistent within/across collections?
- What kinds of values were used? Controlled vocabularies or free text?
- How many unique values?
Based on this analysis, the group made recommendations for remediation of fields and values, which were documented in the spreadsheet as well.
I often compose little love notes in my head to OpenRefine, because my life would be much harder without it: I would have to harass my lovely developer colleagues for a lot more help. OpenRefine is a powerful tool for analyzing and remediating messy data, and it lets me get a whole lot done without having to do much scripting. It was really useful for the initial review and analysis: here's a screenshot of how I could use facets to, for example, limit by the Dublin Core contributor property, see which fields are mapped to contributor, which collections use contributor fields, and what the array of values looks like.
The facet feature is also great for performing batch-level operations like changing field names and editing values. The cluster and edit feature has been really useful for normalizing values so that they will function well as facetable fields in the user interface. OpenRefine also allows for bulk transformations using regular expressions, so I can make changes based on patterns rather than specific strings. Eventually I want to take advantage of its ability to reconcile data against external sources, too, but that will be more effective once we get through the cleaning process.
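For the curious, the cluster and edit feature's default method can be roughly approximated in a few lines of Ruby. This is a sketch of OpenRefine's "fingerprint" keying idea, not its exact implementation, and the sample place-name values are invented for illustration:

```ruby
# A rough approximation of OpenRefine's "fingerprint" clustering key:
# values that reduce to the same key are candidates for merging.
def fingerprint(value)
  value.strip
       .downcase
       .gsub(/[[:punct:]]/, "") # drop punctuation
       .split                   # tokenize on whitespace
       .uniq                    # remove duplicate tokens
       .sort                    # make token order irrelevant
       .join(" ")
end

# Invented sample values of the kind Cluster & Edit would group together
values = ["Durham, N.C.", "durham nc", "N.C. Durham"]
clusters = values.group_by { |v| fingerprint(v) }
# All three values share the key "durham nc", so they form one cluster
```

Because the key ignores case, punctuation, and word order, common data-entry variants collapse into a single cluster that can then be batch-edited to one preferred form.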
For some of the more complex transformations, I've found it is easier and faster to use regular expressions in a text editor. I've been using TextWrangler, which is free and very handy. Here's an example of one of the regular expressions I used when converting date values to the Extended Date Time Format, which we've covered on this blog here and here, and, um, here (what can I say? We love EDTF):
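The project's actual patterns aren't reproduced here, but a Ruby sketch of the same regex-to-EDTF idea might look like the following. The input formats (circa dates, year ranges, decades) are assumptions about the kind of legacy values involved, not the collection's real data:

```ruby
# Illustrative regex-based cleanup of legacy date strings into EDTF.
# The patterns and sample values are assumptions, not the project's actual rules.
def to_edtf(value)
  case value.strip
  when /\A(?:circa|ca\.?)\s+(\d{4})\z/i
    "#{$1}~"       # EDTF approximate date, e.g. "circa 1950" -> "1950~"
  when /\A(\d{4})\s*-\s*(\d{4})\z/
    "#{$1}/#{$2}"  # EDTF interval, e.g. "1920-1925" -> "1920/1925"
  when /\A(\d{3})0s\z/
    "#{$1}X"       # EDTF decade, e.g. "1950s" -> "195X"
  else
    value          # leave anything unrecognized for manual review
  end
end
```

Falling through to the original value on no match is deliberate: unrecognized dates stay visible for a human to handle rather than being silently mangled.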
And in a few cases, like where I've needed to swap out a lot of individual values for other ones, I've had to dip my toes into Ruby scripting. I've done some self-educating on Ruby via Duke's access to Lynda.com courses, but I have mostly benefited from the very kind and patient developer on our task group. I also want to give a shout-out to RailsBridge, an organization dedicated to making tech more diverse by teaching free workshops on Rails and Ruby for underrepresented groups in tech. I attended a full-day Ruby on Rails workshop organized by RailsBridge Triangle here in Raleigh, N.C., found it to be really approachable and informative, and would encourage anyone interested in building out her/his tech skills to look for local opportunities.
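The value-swapping scripts themselves aren't shown in this post, but the general shape is a lookup table applied across a batch of records. Here's a minimal sketch, assuming a CSV export of metadata; the field name, substitution list, and sample rows are all invented for illustration:

```ruby
require "csv"

# Hypothetical mapping of legacy values to preferred terms
SUBSTITUTIONS = {
  "Photos"    => "Photographs",
  "B&W photo" => "Black-and-white photographs"
}.freeze

# Replace a row's subject if it appears in the lookup; otherwise leave it alone
def remediate(row)
  row["subject"] = SUBSTITUTIONS.fetch(row["subject"], row["subject"])
  row
end

# Invented sample data standing in for an exported metadata CSV
rows = CSV.parse("id,subject\n1,Photos\n2,Maps\n", headers: true)
cleaned = rows.map { |r| remediate(r) }
# "Photos" becomes "Photographs"; "Maps" passes through unchanged
```

Keeping the substitutions in a plain hash (or a separate spreadsheet loaded at runtime) means the mapping can be reviewed and maintained by metadata staff, not just whoever wrote the script.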