All posts by Moira Downey

What we talk about when we talk about digital preservation

(Header image: Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark)

Here at Duke University Libraries, we often talk about digital preservation as though everyone is familiar with the various corners and implications of the phrase, but “digital preservation” is, in fact, a large and occasionally mystifying topic. What does it mean to “preserve” a digital resource for the long term? What does “the long term” even mean with regard to digital objects? How are libraries engaging in preserving our digital resources? And what are some of the best ways to ensure that your personal documents will be reusable in the future? While the answers to some of these questions are still emerging, the library can help you begin to think about good strategies for keeping your content available to other users over time by highlighting agreed-upon best practices, as well as some of the services we are able to provide to the Duke community.

File formats

Not all file formats have proven to be equally robust over time! Have you ever tried to open a document created using a Microsoft Office product from several years ago, only to be greeted with a page full of strangely encoded gibberish? Proprietary software like the products in the Office suite can be convenient and produce polished contemporary documents. But software changes, and there is often no guarantee that the beautifully formatted paper you’ve written using Word will be legible without the appropriate software 5 years down the line. One solution to this problem is to always have a version of that software available to you to use. Libraries are beginning to investigate this strategy (often using a technique called emulation) as an important piece of the digital preservation puzzle. The Emulation as a Service (EaaS) architecture is an emerging tool designed to simplify access to preserved digital assets by allowing end users to interact with the original environments running on different emulators.

An alternative to emulation as a solution is to save your files in a format that can be consumed by different, changing versions of software. Experts at cultural heritage institutions like the Library of Congress and the US National Archives and Records Administration have identified an array of file formats about which they feel some degree of confidence that the software of the future will be able to consume. Formats like plain text or PDFs for textual data, value separated files (like comma-separated values, or CSVs), MP3s and MP4s for audio and video data respectively, and JPEGs for still images have all proven to have some measure of durability as formats. What’s more, they will help to make your content or your data more easily accessible to folks who do not have access to particular kinds of software. It can be helpful to keep these format recommendations in mind when working with your own materials.

File format migration

The formats recommended by the LIbrary of Congress and others have been selected not only because they are interoperable with a wide variety of software applications, but also because they have proven to be relatively stable over time, resisting format obsolescence. The process of moving data from an obsolete format to one that is usable in the present day is known as file format migration or format conversion. Libraries generally have yet to establish scalable strategies for extensive migration of obsolete file formats, though it is generally a subject of some concern.

Here at DUL, we encourage the use of one of these recommended formats for content that is submitted to us for preservation, and will even go so far as to convert your files prior to preservation in one of our repository platforms where possible and when appropriate to do so. This helps us ensure that your data will be usable in the future. What we can’t necessarily promise is that, should you give us content in a file format that isn’t one we recommend, a user who is interested in your materials will be able to read or otherwise use your files ten years from now. For some widely used formats, like MP3 and MP4, staff at the Libraries anticipate developing a strategy for migrating our data from this format, in the event that the format becomes superseded. However, the Libraries do not currently have the staff to monitor and convert rarer, and especially proprietary formats to one that is immediately consumable by contemporary software. The best we can promise is that we are able to deliver to the end users of the future the same digital bits you initially gave to us.

Bit-level preservation

Which brings me to a final component of digital preservation: bit-level preservation. At DUL, we calculate a checksum for each of the files we ingest into any of our preservation repositories. Briefly, a checksum is an algorithmically derived alphanumeric hash that is intended to surface errors that may have been introduced to the file during its transmission or storage. A checksum acts somewhat like a digital fingerprint, and is periodically recalculated for each file in the repository environment by the repository software to ensure that nothing has disrupted the bits that compose each individual file. In the event that the re-calculated checksum does not match the one supplied when the file has been ingested into the repository, we can conclude with some level of certainty that something has gone wrong with the file, and it may be necessary to revert to an earlier version of the data. THe process of generating, regenerating, and cross-checking these checksums is a way to ensure the file fixity, or file integrity, of the digital assets that DUL stewards.

It Takes a Village to Curate Your Data: Duke Partners with the Data Curation Network

In early 2017, Duke University Libraries launched a research data curation program designed to help researchers on campus ensure that their data are adequately prepared for both sharing and publication, and long term preservation and re-use. Why the focus on research data? Data generated by scholars in the course of their investigation are increasingly being recognized as outputs similar in importance to the scholarly publications they support. Open data sharing reinforces unfettered intellectual inquiry, fosters transparency, reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. For these reasons, a growing number of publishers and funding agencies like PLoS ONE and the National Science Foundation are requiring researchers to make openly available the data underlying the results of their research.

Data curation steps

But data sharing can only be successful if the data have been properly documented and described. And they are only useful in the long term if steps have been taken to mitigate the risks of file format obsolescence and bit rot. To address these concerns, Duke’s data curation workflow will review a researcher’s data for appropriate documentation (such as README files or codebooks), solicit and refine Dublin Core metadata about the dataset, and make sure files are named and arranged in a way that facilitates secondary use. Additionally, the curation team can make suggestions about preferred file formats for long-term re-use and conduct a brief review for personally identifiable information. Once the data package has been reviewed, the curation team can then help researchers make their data available in Duke’s own Research Data Repository, where the data can be licensed and assigned a Digital Object Identifier, ensuring persistent access.

 

“The Data Curation Network (DCN) serves as the “human layer” in the data repository stack and seamlessly connects local data sets to expert data curators via a cross-institutional shared staffing model.”

 

New to Duke’s curation workflow is the ability to rely on the domain expertise of our colleagues at a few other research institutions. While our data curators here at Duke possess a wealth of knowledge about general research data-related best practices, and are especially well-versed in the vagaries of social sciences data, they may not always have the all the information they need to sufficiently assess the state of a dataset from a researcher. As an answer to this problem, the Data Curation Network, an Alfred P. Sloan Foundation-funded endeavor, has established a cross-institutional staffing model that distributes the domain expertise of each of its partner institutions. Should a curator at one institution encounter data of a kind with which they are unfamiliar, submission to the DCN opens up the possibility for enhanced curation from a network partner with the requisite knowledge.

DCN Partner Institutions
DCN Partner Institutions

Duke joins Cornell University, Dryad Digital Repository, Johns Hopkins University, University of Illinois, University of Michigan, University of Minnesota, and Pennsylvania State University in partnering to provide curatorial expertise to the DCN. As of January of this year, the project has moved out of pilot phase into production, and is actively moving data through the network. If a Duke researcher were to submit a dataset our curation team thought would benefit from further examination by a curator with domain knowledge, we will now reach out to the potential depositor to receive clearance to submit the data to the network. We’re very excited about this opportunity to provide this enhancement to our service!

Looking forward, the DCN hopes to expand their offerings to include nation-wide training on specialized data curation and to extend the curation services the network offers beyond the partner institutions to individual end users. Duke looks forward to contributing as the project grows and evolves.

Sustaining Open

On learning that this year’s conference on Open Repositories would be held in Bozeman, Montana, I was initially perplexed. What an odd, out-of-the-way corner of the world in which to hold an international conference on the work of institutional digital repositories. After touching down in Montana, however, it quickly became apparent how appropriate the setting would be to this year’s conference—a geographic metaphor for the conference theme of openness and sustainability. I grew up out west, but coastal California has nothing on the incomprehensibly vast and panoramic expanse of western Montana. I was fortunate enough to pass a few days driving around the state before the conference began, culminating in a long afternoon spent at Yellowstone National Park. As we wrapped up our hike that afternoon by navigating the crowds and the boardwalks hovering over the terraces of the Mammoth Hot Springs, I wondered about the toll our presence took on the park, what responsible consumption of the landscape looks like, and how we might best preserve the park’s beauty for the future.

Beaver Pond Loop Trail, Yellowstone National Park

Tuesday’s opening remarks from Kenning Arlitsch, conference host Montana State University’s Dean of Libraries, reflected these concerns, pivoting from a few words on what “open” means for library and information professionals to a lengthier consideration of the impact of “openness” on the uniqueness and precarity of the greater Yellowstone eco-system. Dr. Arlitsch noted that “[w]e can always create more digital space, but we cannot create more of these wild spaces.” While I agree unreservedly with the latter part of his statement, as the conference progressed, I found myself re-evaluating the whole of that assertion. Although it’s true that we may be able to create more digital space with some ease (particularly as the strict monetary cost of digital storage becomes more manageable), it’s what we do with this space that is meaningful for the future. One of my chief takeaways from my time in Montana was that responsibly stewarding our digital commons and sustaining open knowledge for the long term is hard, complicated work. As the volume of ever more complex digital assets accelerates, finding ways responsibly ensure access now and for the future is increasingly difficult.


“Research and Cultural Heritage communities have embraced the idea of Open; open communities, open source software, open data, scholarly communications, and open access publications and collections. These projects and communities require different modes of thinking and resourcing than purchasing vended products. While open may be the way forward, mitigating fatigue, finding sustainable funding, and building flexible digital repository platforms is something most of us are striving for.”


Many of the sessions I attended took the curation of research data in institutional repositories as their focus; in particular, a Monday workshop on “Engaging Liaison Librarians in the Data Deposit Workflow: Starting the Conversation” highlighted that research data curation is taking place through a wide array of variously resourced and staffed workflows across institutions. A good number of institutions do not have their own local repository for data, and even those larger organizations with broad data curation expertise and robust curatorial workflows (like Carnegie Mellon University, representatives from which led the workshop) may outsource their data publishing infrastructure to applications like Figshare, rather than build a local solution. Curatorial tasks tended to mean different things in different organizational contexts, and workflows varied according to staffing capacity. Our workshop breakout group spent some time debating the question of whether institutional repositories should even be in the business of research data curation, given the demanding nature of the work and the disparity in available resources among research organizations. It’s a tough question without any easy answers; while there are some good reasons for institutions to engage in this kind of work where they are able (maintaining local ownership of open data, institutional branding for researchers), it’s hard to escape the conclusion that many IRs are under-equipped from the standpoint of staff or infrastructure to sustainably process the on-coming wave of large-scale research data.

Mammoth Hot Springs, Yellowstone National Park

Elsewhere, from a technical perspective, presentations chiefly seemed to emphasize modularity, microservices, and avoiding reinventing the wheel. Going forward, it seems as though community development and shared solutions to problems held in common will be integral strategies to sustainably preserving our institutional research output and digital cultural heritage. The challenge resides in equitably distributing this work and in providing appropriate infrastructure to support maintenance and governance of the systems preserving and providing access to our data.

The Art of Revolution

 

Model for Monument to the Third International, Vladimir Tatlin

Russia has been back in the news of late for a variety of reasons, some, perhaps, more interesting than others. Last year marked the centennial of the 1917 Russian Revolution, arguably one of the foundational events of the 20th century. The 1917 Revolution was the beginning of enormous upheaval that touched all parts of Russian life. While much of this tumult was undeniably and grotesquely violent, some real beauty and lasting works of art emerged from the maelstrom. New forms of visual art and architecture, rooted in a utopian vision for the new, modern society, briefly flourished. One of the most visible of these movements, begun in the years immediately preceding the onset of revolution, was Constructivism.

 

As first articulated by Vladimir Tatlin, Constructivism as a philosophy held that art should be ‘constructed’; that is to say, art shouldn’t be created as an expression of beauty, but rather to represent the world and should be used for a social purpose. Artists like Tatlin, El Lissitzky, Naum Gabo, and Alexander Rodchenko worked in conversation with the output of the Cubists and Futurists (along with their Russian Suprematist compatriots, like Kazimir Malevich), distilling everyday objects to their most basic forms and materials.

Beat the Whites with the Red Wedge, El Lissitzky, 1919

As the Revolution proceeded, artists of all kinds were rapidly brought on board to help create art that would propagate the Bolshevik cause. Perhaps one of El Lissitzky’s most well-known works, “Beat the Whites with the Red Wedge”, is illustrative of this phenomenon. It uses the new, abstract, constructed forms to convey the image of the Red Army (the Bolsheviks) penetrating and defeating the White Army (the anti-Bolsheviks). Alexander Rodchenko’s similarly well-known “Books” poster, an advertisement for the Lengiz Publishing House, is another informative example, blending the use of geometric forms and bright colors with advertising for a  publishing house that produced materials important to the Soviet cause.

 

Lengiz, Alexander Rodchenko, 1924

Constructivism (and its close kin, Suprematism) would go on to have an enormous impact on Russian and Soviet propaganda and other political materials throughout the existence of the Soviet Union. The Duke Digital Repository has an impressive collection of Russian political posters, spanning almost the entire history of the Soviet Union, from the 1917 Revolution on through to the Perestroika of the 1980s. The collection contains posters and placards emphasizing the benefits of Communism, the achievements of the Soviet Union under Communism, and finally the potential dangers inherent in the reconstruction and openness that characterized the period under Mikhail Gorbachev.  I wanted to use this blog post to highlight a few of my favorites below, some of which bear evidence of this broader art historical legacy.

 

Literacy, the road to Communism, 1920
Easter. Contrast of joyous Easter of Long Ago with Serious Workers of Com.[mmunist] Russia, 1930
Member of a Religious Sect Is Fooling the People, 1925
Young Leninists are the children of Il’ich, 1924
Female workers and peasants, make your way to the voting booth! Under the red banner, in the same ranks as the men, we inspire fear in the bourgeoisie!, 1925

 

Moving the mountain (of data)

It’s a new year! And a new year means new priorities. One of the many projects DUL staff have on deck for the Duke Digital Repository in the coming calendar year is an upgrade to DSpace, the software application we use to manage and maintain our collections of scholarly publications and electronic theses and dissertations. As part of that upgrade, the existing DSpace content will need to be migrated to the new software. Until very recently, that existing content has included a few research datasets deposited by Duke community members. But with the advent of our new research data curation program, research datasets have been published in the Fedora 3 part of the repository. Naturally, we wanted all of our research data content to be found in one place, so that meant migrating the few existing outliers. And given the ongoing upgrade project, we wanted to be sure to have it done and out of the way before the rest of the DSpace content needed to be moved.

The Integrated Precipitation and Hydrology Experiment

Most of the datasets that required moving were relatively small–a handful of files, all of manageable size (under a gigabyte) that could be exported using DSpace’s web interface. However, a limited series of data associated with a project called The Integrated Precipitation and Hydrology Experiment (IPHEx) posed a notable exception. There’s a lot of data associated with the IPHEx project (recorded daily for 7 years, along with some supplementary data files, and iterated over 3 different areas of coverage, the total footprint came to just under a terabyte, spread over more than 7,000 files), so this project needed some advance planning.

First, the size of the project meant that the data were too large to export through the DSpace web client, so we needed the developers to wrangle a behind the scenes dump of what was in DSpace to a local file system. Once we had everything we needed to work with (which included some previously unpublished updates to the data we received last year from the researchers), we had to make some decisions on how to model it. The data model used in DSpace was a bit limiting, which resulted in the data being made available as a long list of files for each part of the project. In moving the data to our Fedora repository, we gained a little more flexibility with how we could arrange the files. We determined that we wanted to deviate slightly from the arrangement in DSpace, grouping the files by month and year.

This meant we would have group all the files into subdirectories containing the data for each month–for over 7,000 files, that would have been extremely tedious to do by hand, so we wrote a script to do the sorting for us. That completed, we were able to carry out the ingest process as normal. The final wrinkle associated with the IPHEx project was making sure that the persistent identifiers each part of the project data had been assigned in DSpace still resolved to the correct content. One of our developers was able to set up a server redirect to ensure that each URL would still take a user to the right place. As of the new year, the IPHEx project data (along with our other migrated DSpace datasets) are available in their new home!

At least (of course) until the next migration.

September scale-up: promoting the DDR and associated services to faculty and students

It’s September, and Duke students aren’t the only folks on campus in back-to-school mode. On the contrary, we here at the Duke Digital Repository are gearing up to begin promoting our research data curation services in real earnest. Over the last eight months, our four new research data staff have been busy getting to know the campus and the libraries, getting to know the repository itself and the tools we’re working with, and establishing a workflow. Now we’re ready to begin actively recruiting research data depositors!

As our colleagues in Data and Visualization Services noted in a presentation just last week, we’re aiming to scale up our data services in a big way by engaging researchers at all stages of the research lifecycle, not just at the very end of a research project. We hope to make this effort a two-front one. Through a series of ongoing workshops and consultations, the Research Data Management Consultants aspire to help researchers develop better data management habits and take the longterm preservation and re-use of their data into account when designing a project or applying for grants. On the back-end of things, the Content Analysts will be able to carry out many of the manual tasks that facilitate that longterm preservation and re-use, and are beginning to think about ways in which to tweak our existing software to better accommodate the needs of capital-D Data.

This past spring, the Data Management Consultants carried out a series of workshops intending to help researchers navigate the often muddy waters of data management and data sharing; topics ranged from available and useful tools to the occasionally thorny process of obtaining consent for–and the re-use of–data from human subjects.

Looking forward to the fall, the RDM consultants are planning another series of workshops to expand on the sessions given in the spring, covering new tools and strategies for managing research output. One of the tools we’re most excited to share is the Open Science Framework (OSF) for Institutions, which Duke joined just this spring. OSF is a powerful project management tool that helps promote transparency in research and allows scholars to associate their work and projects with Duke.

On the back-end of things, much work has been done to shore up our existing workflows, and a number of policies–both internal and external–have been met with approval by the Repository Program Committee. The Content Analysts continue to become more familiar with the available repository tools, while weighing in on ways in which we can make the software work better. The better part of the summer was devoted to collecting and analyzing requirements from research data stakeholders (among others), and we hope to put those needs in the development spotlight later this fall.

All of this is to say: we’re ready for it, so bring us your data!

Going with the Flow: building a research data curation workflow

Why research data? Data generated by scholars in the course of investigation are increasingly being recognized as outputs nearly equal in importance to the scholarly publications they support. Among other benefits, the open sharing of research data reinforces unfettered intellectual inquiry, fosters reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. Data sharing, though, starts with data curation.

In January of this year, Duke University Libraries brought on four new staff members–two Research Data Management Consultants and two Digital Content Analysts–to engage in this curatorial effort, and we have spent the last few months mapping out and refining a research data curation workflow to ensure best practices are applied to managing data before, during, and after ingest into the Duke Digital Repository.

What does this workflow entail? A high level overview of the process looks something like the following:

After collecting their data, the researcher will take what steps they are able to prepare it for deposit. This generally means tasks like cleaning and de-identifying the data, arranging files in a structure expected by the system, and compiling documentation to ensure that the data is comprehensible to future researchers. The Research Data Management Consultants will be on hand to help guide these efforts and provide researchers with feedback about data management best practices as they prepare their materials.

Our form for metadata capture

Depositors will then be asked to complete a metadata form and electronically sign a deposit agreement defining the terms of deposit. After we receive this information, someone from our team will invite the depositor to transfer their files to us, usually through Box.

Consultant tasks

As this stage, the Research Data Management Consultants will begin a preliminary review of the researcher’s data by performing a cursory examination for personally identifying or protected health information, inspecting the researcher’s documentation for comprehension and completeness, analyzing the submitted metadata for compliance with the research data application profile, and evaluating file formats for preservation suitability. If they have any concerns, they will contact the researcher to make some suggestions about ways to better align the deposit with best practices.

Analyst tasks

When the deposit is in good shape, the Research Data Management Consultants will notify the Digital Content Analysts, who will finalize the file arrangement and migrate some file formats, generate and normalize any necessary or missing metadata, ingest the files into the repository, and assign the deposit a DOI. After the ingest is complete, the Digital Content Analysts will carry out some quality assurance on the data to verify that the deposit was appropriately and coherently structured and that metadata has been correctly assigned. When this is confirmed, they will publish the data in the repository and notify the depositor.

Of course, this workflow isn’t a finished piece–we hope to continue to clarify and optimize the process as we develop relationships with researchers at Duke and receive more data. The Research Data Management Consultants in particular are enthusiastic about the opportunity to engage with scholars earlier in the research life cycle in order to help them better incorporate data curation standards in the beginning phases of their projects. All of us are looking forward to growing into our new roles, while helping to preserve Duke’s research output for some time to come.