All posts by Jim Tuttle

And Then There’s The Other Stuff… Meet FileTracker

The Duke Digital Repository is a pretty nice place if you’re a file in need of preservation and perhaps some access.  Provided you’re well-described and your organizational relationship to other files and collections is well understood, you could hardly hope for a better home.  But what if you’re not?  What if you’re an important digitized file with only collection-level description?  Or what if you’re a digital reproduction of an 18th-century encyclopedia created by a conservator to supplement traditional conservation methods?  It takes time to prepare materials for the repository.  We try our best to preserve the materials in the repository, but we also have to think about the other stuff.

We may apply different levels of preservation to materials depending on their source, uniqueness, cost to reproduce or reacquire, and other factors, but the baseline is knowing that the objects we’re maintaining are the same objects we were given.  For that, we rely on fixity and checksums.  Unfortunately, it’s not easy to keep track of a couple of hundred terabytes of files from different collections, with different organizational schemes, different owners, and sometimes active, intentional change.  The hard part isn’t only knowing what has changed, but providing that information to the owners and curators of the data so they can determine whether those changes are intentional and desirable.  Seems like a lot, right?
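
To make that baseline concrete, here’s a minimal sketch of a fixity audit in Python.  The manifest format and paths are illustrative only, not our production tooling; the idea is simply to recompute digests and compare them against what was recorded when we received the files.

    # Minimal fixity-audit sketch: recompute SHA-256 digests and compare
    # them to a stored manifest.  Manifest format and paths are hypothetical.
    import hashlib
    import json
    from pathlib import Path

    def sha256(path, chunk_size=1024 * 1024):
        """Compute a file's SHA-256 digest, reading in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit(root, manifest_path):
        """Report files that are missing or whose digests have drifted."""
        manifest = json.loads(Path(manifest_path).read_text())
        for rel_path, expected in manifest.items():
            target = Path(root) / rel_path
            if not target.exists():
                print(f"MISSING  {rel_path}")
            elif sha256(target) != expected:
                print(f"CHANGED  {rel_path}")

    audit("/data/collections", "manifest.json")

Computing the digests is the easy part; doing it over hundreds of terabytes, and routing the results to the right people, is where it gets interesting.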

We’ve used some great tools from our colleagues, notably ACE (Audit Control Environment), for scheduled fixity reporting.  We really wanted, though, to provide reporting to data owners that was tailored to the way they think about their data, to help reduce noise (with hundreds of terabytes there can be a lot of it!) and make it easier for them to identify unintentional changes.  So, we got to work.

That work is named FileTracker.  FileTracker is a Rails application for tracking files and their fixity information.  It’s got a nice dashboard, too.

What we really needed, though, was a way to disentangle the work of the monitoring application from the work of stakeholder reporting.  The database that FileTracker generates makes it much easier to produce reports containing the information that stakeholders want.  For instance, one stakeholder may want to know the number of files in each directory and the difference between the current count and the count at the last audit.  We can also determine when files have been moved or renamed and avoid reporting those as missing files.
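
As an illustration of the kind of report this enables, here’s a sketch against a hypothetical file-tracking database using Python’s built-in sqlite3 module.  The schema and audit ids are invented for the example; this is not FileTracker’s actual schema.

    # Per-directory file counts for the two most recent audits, reported
    # with the change since the last audit.  Schema and ids are invented.
    import sqlite3

    conn = sqlite3.connect("file_tracker.db")
    rows = conn.execute("""
        SELECT directory,
               SUM(CASE WHEN audit_id = :current  THEN 1 ELSE 0 END) AS now_count,
               SUM(CASE WHEN audit_id = :previous THEN 1 ELSE 0 END) AS then_count
        FROM tracked_files
        GROUP BY directory
    """, {"current": 42, "previous": 41})

    for directory, now_count, then_count in rows:
        delta = now_count - then_count
        print(f"{directory}: {now_count} files ({delta:+d} since last audit)")

And because digests are stored alongside paths, a checksum that reappears under a new path can be reported as a move or rename rather than a missing file.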

If you’d like to know more, see https://github.com/duke-libraries/file-tracker.

On TRAC: Assessment Tools and Trustworthiness

Duke Digital Repository is, among other things, a digital preservation platform and the locus of much of our work in that area.  As such, we often ponder the big questions:

  1. What is the repository?
  2. What is digital preservation?
  3. How are we doing?

What is the repository?

Fortunately, Ginny gave us a good start on defining the repository in Revisiting: What is the Repository?  It’s software, hardware, and collaboration.  It’s processes, policies, attention, and intention.  While digital preservation is one of the focuses of the repository, digital preservation extends beyond the repository and should far outlive the repository.

What is digital preservation?

There are scores of definitions, but this Medium Definition from ALCTS is representative:

Digital preservation combines policies, strategies and actions to ensure access to reformatted and born digital content regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

This is the short answer to the question: Accurate rendering of authenticated digital content over time.  This is the motivation behind the work described in Preservation Architecture: Phase 2 – Moving Forward with Duke Digital Repository.

How are we doing?

There are 2 basic methodologies for assessing this work: reactive and proactive.  A reactive approach to digital preservation might be characterized by “Hey!  We haven’t lost anything yet!”, which is why we like the proactive approach.

Digital preservation can be a pretty deep rabbit hole, and it can be an expensive proposition to attempt to mitigate the long tail of risk.  Fortunately, the community of practice has developed tools to assist in the planning and execution of trustworthy repositories.  At Duke, we’ve got several years’ experience working in the framework of the Center for Research Libraries’ Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) as the primary assessment tool by which we measure our efforts.  Much of the work to document our preservation environment and the supporting institutional commitment was focused on our DSpace repository, DukeSpace.  A great deal has changed in the past 3 years, including significant growth in our team and scope.  So, once again we’re working to measure ourselves against the standards of our profession and to use that process to inform our work.

There are 3 areas of focus in TRAC: Organizational Infrastructure, Digital Object Management, and Technologies, Technical Infrastructure, & Security.  These cover a very wide and deep field and include things like:

  • Securing Service Level Agreements for all service providers
  • Documenting the organizational commitments of both Duke University and Duke University Libraries and sustainability plans relating to the repository
  • Creating and implementing routine testing of backup, remote replication, and restoration of data and relevant infrastructure
  • Creating and approving documentation on a wide variety of subjects for internal and external audiences

[Illustration: boxes representing our assessment work]

Back to the question: How are we doing?

Well, we’re making progress!  Naturally, we’re starting by ensuring the basic needs are met first: successfully preserving the bits, maximizing transparency and external validation that we’re not losing the bits, and working on a sustainable, scalable architecture.  We have a lot of work ahead of us, of course.  The boxes in the illustration are all the same size, but the work they represent is not.  For example, the Disaster Recovery Plan at HathiTrust is 61 pages of highly detailed thoughtfulness.  These works build on each other, though, so we’re confident that the work we’re doing on the supporting bodies of policy, procedure, and documentation will ease the way to a complete Disaster Recovery Plan.

Nuts, Bolts, and Bits: Further Down the Preservation Path

It’s been a while since we last wrote about the preservation architecture underlying the repository in Preservation Architecture: Phase 2 – Moving Forward with Duke Digital Repository.  We’ve made some terrific progress in the interim, but most of it is invisible to our users, not unlike our chilly friends, icebergs.

Iceberg.  Flickr user: pere.

Let’s take a brief tour to surface some of these changes!

Policy and Procedure Development

The recently formed Digital Preservation Advisory Group has been working on policy and procedure to bring DDR into compliance with the ISO 16363 Audit and Certification of Trustworthy Digital Repositories Minimum Criteria.  We’ve been working on diverse policy areas like defining how embargoes may be set; how often fixity must be checked and reported to stakeholders; in what situations content may be removed and who must be involved in that decision; and what conditions necessitate a ‘tombstone’ to explain the removal of an object.  Some of these policies are internal and some have already been made publicly available.  For example, see our Deaccession Policy and our Preservation Policy.  We’ve made great progress thanks to the fantastic example set by our friends at the Purdue University Research Repository and others.

Preservation Infrastructure

Duke, DuraCloud, and Glacier

Durham, North Carolina, is a lovely city: close to the mountains and the beach, and full of fantastic restaurants!  Sometimes, though, your digital assets just need to get away from it all.  Digital preservation demands some geographic diversity.  No repository wants all of its data to be subject to a hurricane, of course!  That’s why we’ve partnered with DuraCloud, a preservation-focused cloud provider, to store copies of our digital assets in geographically diverse locations.  Our data now enjoys homes at Duke, at DuraCloud, and in Amazon Glacier!

To bring transparency to remote replication and to the validation of both local and remote copies, we’ve recently implemented a process that externalizes these tasks from Fedora and delivers scheduled reports to stakeholders enumerating and detailing the health of their assets.
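
In outline, the validation step looks something like the sketch below: each storage location yields a manifest of identifiers and checksums, and the comparison runs outside the repository.  The manifests here are illustrative placeholders, not our actual tooling.

    # Compare a local manifest against each remote location and collect
    # discrepancies for a scheduled stakeholder report.  Values are
    # placeholders for illustration.
    def compare_manifests(local, remote, location):
        lines = []
        for object_id, checksum in local.items():
            if object_id not in remote:
                lines.append(f"{location}: {object_id} not replicated")
            elif remote[object_id] != checksum:
                lines.append(f"{location}: {object_id} checksum mismatch")
        return lines

    local     = {"obj-1": "abc123", "obj-2": "def456"}
    duracloud = {"obj-1": "abc123"}
    glacier   = {"obj-1": "abc123", "obj-2": "def456"}

    report = (compare_manifests(local, duracloud, "DuraCloud")
              + compare_manifests(local, glacier, "Glacier"))
    print("\n".join(report) if report else "All replicas healthy")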

Research and Development

The DDR has grown tremendously in the last year, and with it has grown the need to standardize and scale to demand.  Writing Python to arrange files to conform to our Standard Ingest Format was a perfectly reasonable solution in early 2016.  Likewise, programmatic reformatting of endangered file formats wasn’t feasible with the resources available at the time.  We also didn’t need to worry about traffic scaling back then.  Times have changed!

DDR staff are exploring tools to let non-developers easily ingest large amounts of material, investigating methods to identify and migrate files to better-supported formats, and planning for more sustainable and durable architecture, like increased inter-application messaging that lets us move processes out of the repository and onto external servers.
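
To give a flavor of the messaging idea, here’s a minimal sketch using RabbitMQ via the pika library.  The queue name and message shape are hypothetical; the point is the pattern of handing work to an external worker instead of doing it inside the repository application.

    # Publish a fixity-check job to a queue for an external worker to
    # process.  Queue name and message shape are hypothetical.
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="ddr.fixity", durable=True)

    job = {"object_id": "ddr:example-1", "action": "verify_checksum"}
    channel.basic_publish(
        exchange="",
        routing_key="ddr.fixity",
        body=json.dumps(job),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )
    connection.close()

A worker subscribed to the queue performs the actual verification, so a slow or failed check never blocks the repository itself.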

Repository Mega-Migration Update

We are shouting it from the rooftops: The migration from Fedora 3 to Fedora 4 is complete!  And Digital Repository Services are not the only ones relieved.  We appreciate the understanding that our colleagues and users have shown as they’ve been inconvenienced while we’ve built a more resilient, more durable, more sustainable preservation platform in which to store and share our digital assets.

We began the migration of data from Fedora 3 on Monday, May 23rd.  Since then, we’ve migrated roughly 337,000 objects into the Duke Digital Repository.  The data migration was split into several phases.  In case you’re interested, here are the details:

  1. Collections were identified for migration, beginning with unpublished collections, which comprise about 70% of the materials in the repository
  2. Collections to be migrated were locked for editing in the Fedora 3 repository to prevent changes that would inadvertently not be migrated to the new repository
  3. Collections to be migrated were passed to 10 migration processors for actual ingest into Fedora 4 (sketched below)
    • Objects were migrated first.  This includes the collection object, content objects, item objects, color targets for digital imaging, and attachments (objects related to, but not part of, a collection, like deposit agreements)
    • Then relationships between objects were migrated
    • Last, metadata was migrated
  4. Collections were then validated in Fedora 4
  5. When validation is complete, collections will be unlocked for editing in Fedora 4
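
For the curious, step 3 amounts to fanning collections out to a fixed pool of workers, something like the Python sketch below.  The helper functions are placeholder stubs standing in for our customized migration code; only the shape of the dispatch is real.

    # Fan collections out to 10 migration processors.  The migrate_*
    # functions are placeholder stubs, not the real migration code.
    from concurrent.futures import ProcessPoolExecutor

    def lock_for_editing(cid):      print(f"locking {cid}")        # step 2
    def migrate_objects(cid):       print(f"objects {cid}")        # step 3, first
    def migrate_relationships(cid): print(f"relationships {cid}")  # step 3, second
    def migrate_metadata(cid):      print(f"metadata {cid}")       # step 3, last

    def migrate_collection(cid):
        migrate_objects(cid)
        migrate_relationships(cid)
        migrate_metadata(cid)
        return cid

    collections_to_migrate = ["coll-1", "coll-2", "coll-3"]

    if __name__ == "__main__":
        for cid in collections_to_migrate:
            lock_for_editing(cid)   # freeze sources before dispatch
        with ProcessPoolExecutor(max_workers=10) as pool:
            for done in pool.map(migrate_collection, collections_to_migrate):
                print(f"migrated {done}")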

Presto!  Voila!  That’s it!

While our customized version of the Fedora migrate gem does some validation of migrated content, we’ve elected to build an independent process to provide validation.  Some of the validation is straightforward, such as comparing checksums of Fedora 3 files against those in Fedora 4.  In other cases, being confident that we’ve migrated everything accurately can be much more difficult.  In Fedora 3 we could compare checksums of metadata files, while in Fedora 4 object metadata is stored opaquely in a database, with no checksums to compare.  The short of it is that we’re working hard to prove successful migration of all of our content, and it’s harder than it looks.  It’s kind of like insurance: protecting us from the risk of lost or improperly migrated data.
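
The straightforward file-level case reduces to recomputing a digest and comparing it with what Fedora 3 recorded.  Here’s a simplified sketch; the values are placeholders where the real process pulls from both repositories.

    # Compare the checksum Fedora 3 recorded for a datastream against a
    # digest recomputed from the migrated Fedora 4 content.  Values are
    # placeholders for illustration.
    import hashlib

    def sha1_hex(content: bytes) -> str:
        return hashlib.sha1(content).hexdigest()

    recorded_checksum = sha1_hex(b"example datastream bytes")  # from Fedora 3
    migrated_content  = b"example datastream bytes"            # from Fedora 4

    if sha1_hex(migrated_content) == recorded_checksum:
        print("valid")
    else:
        print("MISMATCH: flag for re-migration")

It’s the metadata case, with nothing precomputed to compare against, that makes proving a successful migration harder than it looks.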

We’re in the final phases of spiffing up the Fedora 4 Digital Repository user interface, which is scheduled to be deployed the week of July 11th.  That release will not include any significant design changes; it simply makes the interface compatible with the new Fedora 4 code base.  We are planning to release enhancements to our Data & Visualizations collection, and are prioritizing work on the homepage of the Duke Digital Repository… you will likely see an update on that coming up in a subsequent blog post!

Preservation Architecture: Phase 2 – Moving Forward with Duke Digital Repository

DukeSpace circa 2013

In 2013, the average price for a gallon of gas was $3.80, President Obama was inaugurated for a second term, and Duke University Libraries offered DukeSpace as an institutional repository.  Some things haven’t changed much, but the preservation architecture protecting the digital materials curated by the Libraries has changed a lot!

We still provide DukeSpace, but are laying the foundation to migrate collections and processes to the Duke Digital Repository (DDR).  The DDR was conceived of and developed as a digital preservation repository: an environment intended to preserve and sustain the rich digital collections, university scholarship and research data, purchased collections, and history of Duke far into the future.  Only through the grace of our partnership with Digital Projects and Production Services has the DDR recently also become a site that no longer hurts the eyes of our visitors.

The Duke Digital Repository endeavors to protect our assets against a large and diverse set of threats.  Of course, there are threats that are not addressed in the systems model presented here, such as those identified in the SPOT Model for Risk Assessment.  We formally consider our baseline threats to include:

  • Natural disasters including accidents at our local nuclear power station, fire, and hurricanes
  • Data degradation, also known as bit rot or bit decay
  • External actors, meaning threats posed by people external to the DDR team, including those who manage our infrastructure
  • Internal actors, including intentional or unintentional security risks and exploits by privileged staff in the libraries and supporting IT organizations

Phase 1 of our move into digital preservation established that DSpace, the software powering DukeSpace, was not sufficient for our needs, which led to an environmental scan and a pilot project with Fedora, and then with Fedora and Hydra.  This provided us with some of the infrastructure to mitigate the threats we had identified, but not all of it.  In Phase 1 we took on some important preservation tasks, including:

  • Prove authenticity by offering checksum fixity validation on ingest and periodically
  • Identify and report on data degradation
  • Capture context in the form of descriptive, administrative, and technical metadata
  • Identify files in need of remediation using file characterization tools (see the sketch after this list)
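
As a toy illustration of that last task, here’s a sketch that flags files whose detected type is unknown or on a watch list.  Production characterization uses real tools like FITS or DROID; the watch list and paths here are invented for the example.

    # Flag files whose guessed type is unknown or on a watch list of
    # formats needing remediation.  Watch list and paths are invented.
    import mimetypes
    from pathlib import Path

    AT_RISK = {"application/x-wordperfect", None}  # None: type not identified

    def characterize(root):
        for path in Path(root).rglob("*"):
            if path.is_file():
                mime, _ = mimetypes.guess_type(path.name)
                if mime in AT_RISK:
                    print(f"remediate: {path} ({mime})")

    characterize("/data/collections")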

Phase 2 allows us to address a greater range of threats and therefore offer a higher level of security to our collections.  In Phase 2 we’re undertaking several concurrent migrations:

  • Migrating our archival storage to infrastructure that allows for dynamic resizing, de-duplication, and block-level integrity checking
  • Moving to a horizontally scaled server architecture that lets the repository grow to meet increasing demands of size (individual file size and size of collection) and traffic
  • Adopting a cloud replication disaster recovery process using DuraCloud to replace our local-only disk/tape infrastructure

These changes provide significant protection against our baseline threat model by adding geographic diversity to our replicas, allowing us to constantly monitor the health of our 3 cloud replicas, and providing administrative diversity in the management of our replicas, ensuring no single threat may corrupt all 4 copies of our data.

More detail about the repository architecture to come.