The Duke Digital Repository is a pretty nice place if you’re a file in need of preservation and perhaps some access. Provided you’re well-described and your organizational relationship to other files and collections is well understood, you could hardly hope for a better home. But what if you’re not? What if you’re an important digitized file with only collection-level description? Or what if you’re a digital reproduction of an 18th-century encyclopedia created by a conservator to supplement traditional conservation methods? It takes time to prepare materials for the repository. We try our best to preserve the materials in the repository, but we also have to think about the other stuff.
We may apply different levels of preservation to materials depending on their source, uniqueness, cost to reproduce or reacquire, and other factors, but the baseline is knowing the objects we’re maintaining are the same objects we were given. For that, we rely on fixity and checksums. Unfortunately, it’s not easy to keep track of a couple of hundred terabytes of files from different collections, with different organizational schemes, different owners, and sometimes active intentional change. The hard part isn’t only knowing what has changed, but providing that information to the owners and curators of the data so they can determine if those changes are intentional and desirable. Seems like a lot, right?
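To make that baseline concrete: a fixity check boils down to recomputing a file's checksum and comparing it to the value recorded earlier. Here is a minimal Ruby sketch of the idea. The choice of SHA-256 and the `fixity_ok?` helper are assumptions for illustration, not a description of how any particular tool does it.

```ruby
require "digest"

# Hypothetical helper (not from any existing tool): recompute a file's
# SHA-256 checksum and compare it to the value recorded at an earlier audit.
# Returns true when the file's contents are unchanged.
def fixity_ok?(path, recorded_sha256)
  Digest::SHA256.file(path).hexdigest == recorded_sha256
end
```

A mismatch only tells you that something changed. Deciding whether the change was intentional is still up to the people who know the data.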
We’ve used some great tools from our colleagues, notably ACE (Audit Control Environment), for scheduled fixity reporting. We really wanted, though, to provide reporting to data owners that was tailored to the way they thought of their data to help reduce noise (with hundreds of terabytes there can be a lot of it!) and make it easier for them to identify unintentional changes. So, we got to work.
That work is named FileTracker. FileTracker is a Rails application for tracking files and their fixity information. It’s got a nice dashboard, too.
What we really needed, though, was a way to disentangle the work of the monitoring application from the work of stakeholder reporting. The database that FileTracker generates makes it much easier to generate reports that contain the information that stakeholders want. For instance, one stakeholder may want to know the number of files in each directory and the difference between the present number of files and the number of files at last audit. We can also determine when files have been moved or renamed and not report those as missing files.
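As a rough illustration of those two reports, here is a sketch in Ruby. It assumes each audit can be reduced to a hash of path to checksum. That shape, and the `directory_deltas` and `classify_changes` helpers, are hypothetical and are not FileTracker's actual schema or code.

```ruby
# Each audit is assumed to be a Hash of { path => checksum }.

# Count files per directory and report the change since the previous audit.
def directory_deltas(previous, current)
  prev_counts = previous.keys.map { |p| File.dirname(p) }.tally
  curr_counts = current.keys.map { |p| File.dirname(p) }.tally
  (prev_counts.keys | curr_counts.keys).sort.to_h do |dir|
    now = curr_counts.fetch(dir, 0)
    [dir, { count: now, delta: now - prev_counts.fetch(dir, 0) }]
  end
end

# Split paths that disappeared into "moved" (the same checksum shows up at a
# new path) and "missing" (the checksum is gone as well).
def classify_changes(previous, current)
  gone  = previous.keys - current.keys
  added = current.keys - previous.keys
  added_by_sum = added.group_by { |p| current[p] }

  moved = {}
  missing = []
  gone.each do |old_path|
    candidates = added_by_sum[previous[old_path]]
    if candidates && !candidates.empty?
      # Pair the old path with one new path carrying the same checksum.
      moved[old_path] = candidates.shift
    else
      missing << old_path
    end
  end
  { moved: moved, missing: missing }
end

previous = { "a/1.tif" => "aaa", "a/2.tif" => "bbb", "b/3.tif" => "ccc" }
current  = { "a/1.tif" => "aaa", "c/2-renamed.tif" => "bbb" }

deltas  = directory_deltas(previous, current)
changes = classify_changes(previous, current)
```

In this example, `a/2.tif` is reported as moved to `c/2-renamed.tif` and only `b/3.tif` is reported as missing. Matching on checksum alone is ambiguous when several files have identical contents, so a real report would probably want extra signals such as file size or timestamps.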
If you’d like to know more, see https://github.com/duke-libraries/file-tracker.