All posts by Moira Downey

Curating for a community: joining the DCN

At DUL, we talk quite a lot about the value of research data curation. The Libraries provide a curatorial review of all data packages submitted to the Research Data Repository for publication. This review can enhance a researcher’s dataset by allowing a second or third pair of eyes to look over the data, ensure that the documentation is as complete as possible, and confirm that the dataset as a whole has been optimized for long-term reuse. Although expertise in the domain of the data under review isn’t strictly necessary, it can give the curator a fuller picture of what is needed to make those data FAIR. While data curators working in the Libraries possess a wealth of knowledge about general research data best practices, and are especially well versed in the vagaries of social sciences data, they may not always have all the information they need to sufficiently assess the state of a researcher’s dataset.

As I discussed in a blog post back in 2019, Duke has for the last few years been part of a project designed to address the gaps in domain proficiency that are a natural part of a curation program of our size. The Data Curation Network has functioned as a grant-supported consortium of data curation professionals at research institutions who have pooled their knowledge to provide enhanced review for data that fall outside the expertise of local curators. Partner institutions can submit datasets to the Network, where they are matched with a DCN curator who has the relevant domain experience. Beyond providing curation services, the DCN generates a variety of community resources pertaining to data curation, including a standardized set of curation steps and workflows, a list of essential data curation activities, and a growing roster of instructional primers to support the curation of various kinds of data.

The DCN has grown since my last post, and now includes curators from 11 institutions and the Dryad research data repository. DCN curators work with data from disciplines ranging from aerospace engineering to urban and regional planning and tackle data types from qualitative survey responses to machine learning model training datasets.

Updated for 2021!

Although two members have worked with the DCN for a few years, the rest of the DUL research data curation team is now getting in on the action. Last week, the two Repository Services Analysts embedded with the curation team began the process of onboarding as DCN curators. While we have been able to contribute to the local curation of datasets for the RDR, this new opportunity gives us a chance not only to gain valuable experience working with some practiced curators, but also to contribute back to the community that has helped to support our work. We are very excited to expand and deepen our DCN participation!

Sharing data and research in a time of global pandemic, Part 2

[Header image from Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., Westman, E. (2020, August 07). Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Science Advances. https://advances.sciencemag.org/content/early/2020/08/07/sciadv.abd3083]

Back in March, just as things were rapidly shutting down across the United States, I wrote a post reflecting on how integral the practice of sharing and preserving research data would be to any solution to the crisis posed by COVID-19. While some of the language in that post seems a bit naive in retrospect (particularly the bit about RDAP’s annual meeting being one of the last in-person conferences of the spring, when it now looks to have been one of the last of the entire calendar year!), the emphasis on the importance of rapid and robust data sharing has stood the test of time. In late June, the Research Data Alliance released a set of recommendations and guidelines for sharing research data under circumstances shaped by COVID-19, and a number of organizations, including the National Institutes of Health, have established portals for finding data related to the disease. Access to data has been at the forefront of many researchers’ minds.

Perhaps in response to this general sentiment (or maybe because folks haven’t been able to access their labs?!), we in the Libraries have seen a notable increase in the number of submissions to our Research Data Repository for data publication. These datasets derive from a broad range of disciplines, from environmental science to dermatology. I wanted to use this blog post as an opportunity to highlight a few of our accessions from the last several months.

One of our most prolific sources of data deposits has historically been the lab of Dr. Patrick Charbonneau, associate professor of Chemistry and Physics. Dr. Charbonneau’s lab investigates the physical properties of glass and contributes to a project known as The Simons Collaboration on Cracking the Glass Problem, which addresses issues like disorder, nonlinear response, and far-from-equilibrium dynamics. The group’s most recent contribution, published just last week, is fairly characteristic of the materials we receive from them: it contains the raw binary observational data and the scripts used to create the figures that appear in the researchers’ article. Making these research products available helps other scholars repeat or reproduce (and thereby strengthen) the findings elucidated in the associated research publication.

Fig01 / Fig02b, Data from: Finite-dimensional vestige of spinodal criticality above the dynamical glass transition


Another recent data deposit—a first of its kind for the RDR—is a Q-sort concourse for the Human Dimensions of Large Marine Protected Areas project, which investigates the formulation of large marine protected areas (defined by the project as “any ocean area larger than 100,000 km² that has been designated for the purpose of conservation”) as a global movement. Q-methodology is a psychology and social sciences research method used to study viewpoints. In this study, 40 interviewees were asked to evaluate statements related to large-scale marine protected areas. Q-sorts can be particularly helpful when researchers wish to describe subjective viewpoints related to an issue.

Q sort record sheet from: Q-Sort Concourse and Data for the Human Dimensions of Large MPAs project

Finally, perhaps our most timely deposit has come from a group investigating an alternative method for evaluating the efficacy of masks in reducing the transmission of respiratory droplets during regular speech. “Low-cost measurement of facemask efficacy for filtering expelled droplets during speech,” published last week in Science Advances, is a proof-of-concept study that proposes an optical measurement technique that the group asserts is both inexpensive and easy to use. Because the topic of measuring mask efficacy is still both complex and unsettled, the group hopes this work will help improve evaluation in order to guide mask selection and policy decisions.

Screenshot of Speaker1_None_05.mp4, Video data from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech

The dataset consists of a series of video recordings that capture an operator wearing a face mask and speaking in the direction of an expanded laser beam inside a dark enclosure. Droplets that propagate through the laser beam scatter light, which is then recorded with a cell phone camera. The group tested 12 kinds of masks (see below) and recorded two sets of controls with no mask.

Figure 2 from Low-cost measurement of facemask efficacy for filtering expelled droplets during speech

We hope to keep up the momentum our data management, curation, and publication program has gained over the last few months, but we need your help! For more information on using the Duke Research Data Repository to share and preserve your data, please visit our website, or drop us a line at datamanagement@duke.edu. A full list of the datasets we’ve published since moving to fully remote operations in March is available below.

  • Zhang, Y. (2020). Data from: Contributions of World Regions to the Global Tropospheric Ozone Burden Change from 1980 to 2010. Duke Research Data Repository. https://doi.org/10.7924/r40p13p11
  • Campbell, L. M., Gray, N., & Gruby, R. (2020). Data from: Q-Sort Concourse and Data for the Human Dimensions of Large MPAs project. Duke Research Data Repository. https://doi.org/10.7924/r4j38sg3b
  • Berthier, L., Charbonneau, P., & Kundu, J. (2020). Data from: Finite-dimensional vestige of spinodal criticality above the dynamical glass transition. Duke Research Data Repository. https://doi.org/10.7924/r4jh3m094
  • Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., & Westman, E. (2020). Video data files from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Duke Research Data Repository. V2 https://doi.org/10.7924/r4ww7dx6q
  • Lin, Y., Kouznetsova, T., Chang, C., & Craig, S. (2020). Data from: Enhanced polymer mechanical degradation through mechanochemically unveiled lactonization. Duke Research Data Repository. V2 https://doi.org/10.7924/r4fq9x365
  • Chavez, S. P., Silva, Y., & Barros, A. P. (2020). Data from: High-elevation monsoon precipitation processes in the Central Andes of Peru. Duke Research Data Repository. V2 https://doi.org/10.7924/r41n84j94
  • Jeuland, M., Ohlendorf, N., Saparapa, R., & Steckel, J. (2020). Data from: Climate implications of electrification projects in the developing world: a systematic review. Duke Research Data Repository. https://doi.org/10.7924/r42n55g1z
  • Cardones, A. R., Hall, III, R. P., Sullivan, K., Hooten, J., Lee, S. Y., Liu, B. L., Green, C., Chao, N., Rowe Nichols, K., Bañez, L., Shah, A., Leung, N., & Palmeri, M. L. (2020). Data from: Quantifying skin stiffness in graft-versus-host disease, morphea and systemic sclerosis using acoustic radiation force impulse imaging and shear wave elastography. Duke Research Data Repository. https://doi.org/10.7924/r4h995b4q
  • Caves, E., Schweikert, L. E., Green, P. A., Zipple, M. N., Taboada, C., Peters, S., Nowicki, S., & Johnsen, S. (2020). Data and scripts from: Variation in carotenoid-containing retinal oil droplets correlates with variation in perception of carotenoid coloration. Duke Research Data Repository. https://doi.org/10.7924/r4jw8dj9h
  • DiGiacomo, A. E., Bird, C. N., Pan, V. G., Dobroski, K., Atkins-Davis, C., Johnston, D. W., & Ridge, J. T. (2020). Data from: Modeling salt marsh vegetation height using Unoccupied Aircraft Systems and Structure from Motion. Duke Research Data Repository. https://doi.org/10.7924/r4w956k1q
  • Hall, III, R. P., Bhatia, S. M., & Streilein, R. D. (2020). Data from: Correlation of IgG autoantibodies against acetylcholine receptors and desmogleins in patients with pemphigus treated with steroid sparing agents or rituximab. Duke Research Data Repository. https://doi.org/10.7924/r4rf5r157
  • Jin, Y., Ru, X., Su, N., Beratan, D., Zhang, P., & Yang, W. (2020). Data from: Revisiting the Hole Size in Double Helical DNA with Localized Orbital Scaling Corrections. Duke Research Data Repository. https://doi.org/10.7924/r4k072k9s
  • Kaleem, S. & Swisher, C. B. (2020). Data from: Electrographic Seizure Detection by Neuro ICU Nurses via Bedside Real-Time Quantitative EEG. Duke Research Data Repository. https://doi.org/10.7924/r4mp51700
  • Yi, G. & Grill, W. M. (2020). Data and code from: Waveforms optimized to produce closed-state Na+ inactivation eliminate onset response in nerve conduction block. Duke Research Data Repository. https://doi.org/10.7924/r4z31t79k
  • Flanagan, N., Wang, H., Winton, S., & Richardson, C. (2020). Data from: Low-severity fire as a mechanism of organic matter protection in global peatlands: thermal alteration slows decomposition. Duke Research Data Repository. https://doi.org/10.7924/r4s46nm6p
  • Gunsch, C. (2020). Data from: Evaluation of the mycobiome of ballast water and implications for fungal pathogen distribution. Duke Research Data Repository. https://doi.org/10.7924/r4t72cv5v
  • Warnell, K., & Olander, L. (2020). Data from: Opportunity assessment for carbon and resilience benefits on natural and working lands in North Carolina. Duke Research Data Repository. https://doi.org/10.7924/r4ww7cd91

Sharing data and research in a time of global pandemic

[Header image from the New York Times Coronavirus Map, March 17th, 2020]

Just before Duke halted travel for all faculty and staff last week, I was able to attend what will probably turn out to have been one of the last in-person conferences of the spring: the Research Data Access and Preservation Association’s (RDAP) annual summit in Santa Fe, New Mexico. RDAP is a community of “data managers and curators, librarians, archivists, researchers, educators, students, technologists, and data scientists from academic institutions, data centers, funding agencies, and industry who represent a wide range of STEM disciplines, social sciences, and humanities,” committed to creating, maintaining, and teaching best practices for the access and preservation of research data. While there were many interesting presentations and posters about the work being done in this area at institutions around the country, the conference, and RDAP’s work more broadly, resonated with me in a very general and timely way that did not necessarily stem from anything I heard during the week.

In a situation like the global pandemic we are now facing, open and unfettered access to research data is vital for treating patients, stemming the spread of the disease, and potentially developing life-saving vaccines.

A recent editorial in Science Translational Medicine argues that data-driven models and centralized data sharing are the best way to approach this kind of outbreak, stating that “[w]e believe that scientific efforts need to include determining the values (and ranges) of the above key variables and identifying any other important ones. In addition, information on these variables should be shared freely among the scientific and the response and resilience communities, such as the Red Cross, other nongovernmental organizations, and emergency responders” [1]. As another article points out, sharing viral samples from around the world has allowed scientists to get a better picture of the disease’s genetic makeup: “[c]omparing those genomes allowed Bedford and colleagues to piece together a viral family tree. ‘We can chart this out on the map, then, because we know that this genome is connected to this genome by these mutations,’ he said. ‘And we can learn about these transmission links’” [2].


Scientists are also accelerating the research lifecycle by using preprint servers like arXiv, bioRxiv, and medRxiv to share their preliminary conclusions without waiting on the often glacial process of peer review. This isn’t an unalloyed positive, and many preprints warrant the increased scrutiny that peer review represents. Moreover, scientific research often benefits from the kind of contextualization and unpacking that peer review and science journalism can provide. But in the acute crisis that the current outbreak presents, the rapid spread of information among scientific peer networks can undoubtedly save lives.

Continuing to develop and build the infrastructure—in terms of both technology and policy frameworks—needed to conduct the kind of data sharing we are seeing now remains a goal for the scientific community moving forward.

The Libraries, along with communities like RDAP, the Research Data Alliance, and the Data Curation Network, endorse and support this mission, and we will continue to play our role in preserving and providing persistent access to research data as best we can as we all move forward through this together. In the meantime, we hope everyone in the Duke community stays safe and healthy!

[1] Layne, S. P., Hyman, J. M., Morens, D. M., & Taubenberger, J. K. (2020, March 11). New coronavirus outbreak: Framing questions for pandemic prevention. Science Translational Medicine 12(534). https://doi.org/10.1126/scitranslmed.abb1469

[2] Sanders, L. (2020, February 13). Coronavirus’s genetic fingerprints are used to rapidly map its spread. Science News. https://www.sciencenews.org/article/coronavirus-genetic-fingerprints-are-used-to-rapidly-map-spread

What we talk about when we talk about digital preservation

(Header image: Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark)

Here at Duke University Libraries, we often talk about digital preservation as though everyone is familiar with the various corners and implications of the phrase, but “digital preservation” is, in fact, a large and occasionally mystifying topic. What does it mean to “preserve” a digital resource for the long term? What does “the long term” even mean with regard to digital objects? How are libraries engaging in preserving our digital resources? And what are some of the best ways to ensure that your personal documents will be reusable in the future? While the answers to some of these questions are still emerging, the library can help you begin to think about good strategies for keeping your content available to other users over time by highlighting agreed-upon best practices, as well as some of the services we are able to provide to the Duke community.

File formats

Not all file formats have proven to be equally robust over time! Have you ever tried to open a document created using a Microsoft Office product from several years ago, only to be greeted with a page full of strangely encoded gibberish? Proprietary software like the products in the Office suite can be convenient and produce polished contemporary documents. But software changes, and there is often no guarantee that the beautifully formatted paper you’ve written using Word will be legible without the appropriate software five years down the line. One solution to this problem is to keep a version of that software available to use. Libraries are beginning to investigate this strategy (often using a technique called emulation) as an important piece of the digital preservation puzzle. Emulation as a Service (EaaS) is an emerging architecture designed to simplify access to preserved digital assets by allowing end users to interact with the original environments running on different emulators.

An alternative to emulation is to save your files in a format that can be consumed by different, changing versions of software. Experts at cultural heritage institutions like the Library of Congress and the US National Archives and Records Administration have identified an array of file formats that they are reasonably confident the software of the future will be able to consume. Formats like plain text or PDF for textual data, delimited text files (like comma-separated values, or CSV) for tabular data, MP3 and MP4 for audio and video respectively, and JPEG for still images have all proven to have some measure of durability. What’s more, they will help make your content or your data more easily accessible to folks who do not have access to particular kinds of software. It can be helpful to keep these format recommendations in mind when working with your own materials.
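To make that advice concrete, here is a minimal sketch of rescuing tabular data from a proprietary spreadsheet into plain CSV, using Python with the pandas library. The filenames are hypothetical placeholders, and pandas needs the openpyxl package installed to read .xlsx files.

```python
import pandas as pd

# Read a proprietary Excel workbook (pandas uses the openpyxl engine)...
df = pd.read_excel("survey_results.xlsx")

# ...and write it out as plain CSV, which nearly any software
# (or a patient human with a text editor) can read.
df.to_csv("survey_results.csv", index=False)
```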

File format migration

The formats recommended by the Library of Congress and others have been selected not only because they are interoperable with a wide variety of software applications, but also because they have proven to be relatively stable over time, resisting format obsolescence. The process of moving data from an obsolete format to one that is usable in the present day is known as file format migration or format conversion. Libraries have generally yet to establish scalable strategies for extensive migration of obsolete file formats, though the problem is a subject of ongoing concern.

Here at DUL, we encourage the use of one of these recommended formats for content that is submitted to us for preservation, and will even go so far as to convert your files prior to preservation in one of our repository platforms where possible and appropriate. This helps us ensure that your data will be usable in the future. What we can’t necessarily promise is that, should you give us content in a file format we don’t recommend, a user who is interested in your materials will be able to read or otherwise use your files ten years from now. For some widely used formats, like MP3 and MP4, staff at the Libraries anticipate developing a strategy for migrating our data in the event that those formats become superseded. However, the Libraries do not currently have the staff to monitor rarer, and especially proprietary, formats and convert them to ones immediately consumable by contemporary software. The best we can promise is that we will deliver to the end users of the future the same digital bits you initially gave to us.

Bit-level preservation

Which brings me to a final component of digital preservation: bit-level preservation. At DUL, we calculate a checksum for each of the files we ingest into any of our preservation repositories. Briefly, a checksum is an algorithmically derived alphanumeric hash that is intended to surface errors that may have been introduced to the file during its transmission or storage. A checksum acts somewhat like a digital fingerprint, and is periodically recalculated for each file by the repository software to ensure that nothing has disrupted the bits that compose each individual file. If the recalculated checksum does not match the one recorded when the file was ingested into the repository, we can conclude with some level of certainty that something has gone wrong with the file, and it may be necessary to revert to an earlier version of the data. The process of generating, regenerating, and cross-checking these checksums is a way to ensure the file fixity, or file integrity, of the digital assets that DUL stewards.
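For the curious, the core of the idea fits in a few lines of Python using the standard library’s hashlib module. This is a simplified illustration of the concept, not the code our repositories actually run, and the file path is a made-up example.

```python
import hashlib

def sha256_checksum(path, chunk_size=65536):
    """Compute the SHA-256 hash of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At ingest, record the file's checksum...
recorded = sha256_checksum("dataset/readme.txt")

# ...then periodically recompute and compare. A mismatch means the
# bits have changed, and an earlier copy may need to be restored.
if sha256_checksum("dataset/readme.txt") != recorded:
    print("Fixity check failed: file may be corrupted.")
```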

It Takes a Village to Curate Your Data: Duke Partners with the Data Curation Network

In early 2017, Duke University Libraries launched a research data curation program designed to help researchers on campus ensure that their data are adequately prepared for sharing and publication as well as for long-term preservation and reuse. Why the focus on research data? Data generated by scholars in the course of their investigations are increasingly being recognized as outputs similar in importance to the scholarly publications they support. Open data sharing reinforces unfettered intellectual inquiry; fosters transparency, reproducibility, and broader analysis; and permits the creation of new data sets when data from multiple sources are combined. For these reasons, a growing number of publishers and funding agencies, like PLoS ONE and the National Science Foundation, are requiring researchers to make openly available the data underlying the results of their research.

Data curation steps

But data sharing can only be successful if the data have been properly documented and described. And they are only useful in the long term if steps have been taken to mitigate the risks of file format obsolescence and bit rot. To address these concerns, Duke’s data curation workflow will review a researcher’s data for appropriate documentation (such as README files or codebooks), solicit and refine Dublin Core metadata about the dataset, and make sure files are named and arranged in a way that facilitates secondary use. Additionally, the curation team can make suggestions about preferred file formats for long-term re-use and conduct a brief review for personally identifiable information. Once the data package has been reviewed, the curation team can then help researchers make their data available in Duke’s own Research Data Repository, where the data can be licensed and assigned a Digital Object Identifier, ensuring persistent access.
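For readers unfamiliar with Dublin Core, a bare-bones record for a dataset might look something like the sketch below, shown here as a simple Python dictionary. The element names come from the Dublin Core standard; all of the values are invented placeholders rather than a real deposit.

```python
# An illustrative Dublin Core description of a hypothetical dataset.
# Element names are standard Dublin Core; every value is a placeholder.
dataset_metadata = {
    "dc:title": "Data from: An example study of an interesting phenomenon",
    "dc:creator": ["Researcher, A.", "Researcher, B."],
    "dc:description": "Raw data and scripts underlying the figures in the article.",
    "dc:subject": ["example keyword", "another keyword"],
    "dc:publisher": "Duke Research Data Repository",
    "dc:date": "2017",
    "dc:type": "Dataset",
    "dc:format": ["text/csv", "text/plain"],
    "dc:rights": "CC0 1.0 Universal",
    # "dc:identifier" would hold the DOI assigned at publication
}
```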


“The Data Curation Network (DCN) serves as the ‘human layer’ in the data repository stack and seamlessly connects local data sets to expert data curators via a cross-institutional shared staffing model.”


New to Duke’s curation workflow is the ability to rely on the domain expertise of our colleagues at a few other research institutions. While our data curators here at Duke possess a wealth of knowledge about general research data-related best practices, and are especially well-versed in the vagaries of social sciences data, they may not always have all the information they need to sufficiently assess the state of a dataset from a researcher. As an answer to this problem, the Data Curation Network, an Alfred P. Sloan Foundation-funded endeavor, has established a cross-institutional staffing model that distributes the domain expertise of each of its partner institutions. Should a curator at one institution encounter data of a kind with which they are unfamiliar, submission to the DCN opens up the possibility of enhanced curation by a network partner with the requisite knowledge.

DCN Partner Institutions

Duke joins Cornell University, Dryad Digital Repository, Johns Hopkins University, the University of Illinois, the University of Michigan, the University of Minnesota, and Pennsylvania State University in partnering to provide curatorial expertise to the DCN. As of January of this year, the project has moved out of its pilot phase into production and is actively moving data through the network. If a Duke researcher submits a dataset that our curation team thinks would benefit from further examination by a curator with domain knowledge, we will now reach out to the depositor for clearance to submit the data to the network. We’re very excited about this opportunity to enhance our service!

Looking forward, the DCN hopes to expand its offerings to include nationwide training on specialized data curation and to extend the network’s curation services beyond the partner institutions to individual end users. Duke looks forward to contributing as the project grows and evolves.

Sustaining Open

On learning that this year’s conference on Open Repositories would be held in Bozeman, Montana, I was initially perplexed. What an odd, out-of-the-way corner of the world in which to hold an international conference on the work of institutional digital repositories. After touching down in Montana, however, it quickly became apparent how appropriate the setting would be to this year’s conference—a geographic metaphor for the conference theme of openness and sustainability. I grew up out west, but coastal California has nothing on the incomprehensibly vast and panoramic expanse of western Montana. I was fortunate enough to pass a few days driving around the state before the conference began, culminating in a long afternoon spent at Yellowstone National Park. As we wrapped up our hike that afternoon by navigating the crowds and the boardwalks hovering over the terraces of the Mammoth Hot Springs, I wondered about the toll our presence took on the park, what responsible consumption of the landscape looks like, and how we might best preserve the park’s beauty for the future.

Beaver Pond Loop Trail, Yellowstone National Park

Tuesday’s opening remarks from Kenning Arlitsch, conference host Montana State University’s Dean of Libraries, reflected these concerns, pivoting from a few words on what “open” means for library and information professionals to a lengthier consideration of the impact of “openness” on the uniqueness and precarity of the greater Yellowstone ecosystem. Dr. Arlitsch noted that “[w]e can always create more digital space, but we cannot create more of these wild spaces.” While I agree unreservedly with the latter part of his statement, as the conference progressed, I found myself re-evaluating the whole of that assertion. Although it’s true that we may be able to create more digital space with some ease (particularly as the strict monetary cost of digital storage becomes more manageable), it’s what we do with this space that is meaningful for the future. One of my chief takeaways from my time in Montana was that responsibly stewarding our digital commons and sustaining open knowledge for the long term is hard, complicated work. As the volume of ever more complex digital assets grows, finding ways to responsibly ensure access now and for the future is increasingly difficult.


“Research and Cultural Heritage communities have embraced the idea of Open; open communities, open source software, open data, scholarly communications, and open access publications and collections. These projects and communities require different modes of thinking and resourcing than purchasing vended products. While open may be the way forward, mitigating fatigue, finding sustainable funding, and building flexible digital repository platforms is something most of us are striving for.”


Many of the sessions I attended took the curation of research data in institutional repositories as their focus; in particular, a Monday workshop on “Engaging Liaison Librarians in the Data Deposit Workflow: Starting the Conversation” highlighted that research data curation is taking place through a wide array of variously resourced and staffed workflows across institutions. A good number of institutions do not have their own local repository for data, and even larger organizations with broad data curation expertise and robust curatorial workflows (like Carnegie Mellon University, representatives from which led the workshop) may outsource their data publishing infrastructure to applications like Figshare rather than build a local solution. Curatorial tasks tend to mean different things in different organizational contexts, and workflows vary according to staffing capacity. Our workshop breakout group spent some time debating whether institutional repositories should even be in the business of research data curation, given the demanding nature of the work and the disparity in available resources among research organizations. It’s a tough question without any easy answers; while there are some good reasons for institutions to engage in this kind of work where they are able (maintaining local ownership of open data, institutional branding for researchers), it’s hard to escape the conclusion that many IRs are under-equipped, in terms of staff and infrastructure, to sustainably process the oncoming wave of large-scale research data.

Mammoth Hot Springs, Yellowstone National Park

Elsewhere, from a technical perspective, presentations chiefly seemed to emphasize modularity, microservices, and avoiding reinventing the wheel. Going forward, it seems as though community development and shared solutions to problems held in common will be integral strategies to sustainably preserving our institutional research output and digital cultural heritage. The challenge resides in equitably distributing this work and in providing appropriate infrastructure to support maintenance and governance of the systems preserving and providing access to our data.

The Art of Revolution


Model for Monument to the Third International, Vladimir Tatlin

Russia has been back in the news of late for a variety of reasons, some, perhaps, more interesting than others. Last year marked the centennial of the 1917 Russian Revolution, arguably one of the foundational events of the 20th century. The 1917 Revolution was the beginning of enormous upheaval that touched all parts of Russian life. While much of this tumult was undeniably and grotesquely violent, some real beauty and lasting works of art emerged from the maelstrom. New forms of visual art and architecture, rooted in a utopian vision for the new, modern society, briefly flourished. One of the most visible of these movements, begun in the years immediately preceding the onset of revolution, was Constructivism.


As first articulated by Vladimir Tatlin, Constructivism held that art should be ‘constructed’: that is, rather than being created as an expression of beauty, art should represent the world and serve a social purpose. Artists like Tatlin, El Lissitzky, Naum Gabo, and Alexander Rodchenko worked in conversation with the output of the Cubists and Futurists (along with their Russian Suprematist compatriots, like Kazimir Malevich), distilling everyday objects to their most basic forms and materials.

Beat the Whites with the Red Wedge, El Lissitzky, 1919

As the Revolution proceeded, artists of all kinds were rapidly brought on board to create art that would propagate the Bolshevik cause. Perhaps one of El Lissitzky’s best-known works, “Beat the Whites with the Red Wedge”, is illustrative of this phenomenon: it uses the new, abstract, constructed forms to convey the image of the Red Army (the Bolsheviks) penetrating and defeating the White Army (the anti-Bolsheviks). Alexander Rodchenko’s similarly well-known “Books” poster is another instructive example, blending geometric forms and bright colors in an advertisement for the Lengiz Publishing House, a publisher of materials important to the Soviet cause.


Lengiz, Alexander Rodchenko, 1924

Constructivism (and its close kin, Suprematism) would go on to have an enormous impact on Russian and Soviet propaganda and other political materials throughout the existence of the Soviet Union. The Duke Digital Repository has an impressive collection of Russian political posters, spanning almost the entire history of the Soviet Union, from the 1917 Revolution on through to the Perestroika of the 1980s. The collection contains posters and placards emphasizing the benefits of Communism, the achievements of the Soviet Union under Communism, and finally the potential dangers inherent in the reconstruction and openness that characterized the period under Mikhail Gorbachev.  I wanted to use this blog post to highlight a few of my favorites below, some of which bear evidence of this broader art historical legacy.


Literacy, the road to Communism, 1920
Easter. Contrast of joyous Easter of Long Ago with Serious Workers of Com.[mmunist] Russia, 1930
Member of a Religious Sect Is Fooling the People, 1925
Young Leninists are the children of Il’ich, 1924
Female workers and peasants, make your way to the voting booth! Under the red banner, in the same ranks as the men, we inspire fear in the bourgeoisie!, 1925


Moving the mountain (of data)

It’s a new year! And a new year means new priorities. One of the many projects DUL staff have on deck for the Duke Digital Repository in the coming calendar year is an upgrade to DSpace, the software application we use to manage and maintain our collections of scholarly publications and electronic theses and dissertations. As part of that upgrade, the existing DSpace content will need to be migrated to the new software. Until very recently, that content included a few research datasets deposited by Duke community members. But with the advent of our new research data curation program, research datasets are now published in the Fedora 3 part of the repository. Naturally, we wanted all of our research data content to be found in one place, so that meant migrating the few existing outliers. And given the ongoing upgrade project, we wanted to have it done and out of the way before the rest of the DSpace content needed to be moved.

The Integrated Precipitation and Hydrology Experiment

Most of the datasets that required moving were relatively small: a handful of files, all of manageable size (under a gigabyte), that could be exported using DSpace’s web interface. However, a series of data associated with a project called The Integrated Precipitation and Hydrology Experiment (IPHEx) proved a notable exception. There’s a lot of data associated with the IPHEx project: recorded daily for 7 years, iterated over 3 different areas of coverage, and accompanied by some supplementary data files, the total footprint came to just under a terabyte spread over more than 7,000 files. This project needed some advance planning.

First, the size of the project meant that the data were too large to export through the DSpace web client, so we needed the developers to wrangle a behind-the-scenes dump of what was in DSpace to a local file system. Once we had everything we needed to work with (which included some previously unpublished updates to the data that we had received from the researchers last year), we had to make some decisions about how to model it. The data model used in DSpace was a bit limiting, which resulted in the data being made available as a long list of files for each part of the project. In moving the data to our Fedora repository, we gained a little more flexibility in how we could arrange the files, and we determined that we wanted to deviate slightly from the arrangement in DSpace by grouping the files by month and year.

This meant we would have to group all the files into subdirectories containing the data for each month. For over 7,000 files, that would have been extremely tedious to do by hand, so we wrote a script to do the sorting for us (a simplified sketch follows below). That completed, we were able to carry out the ingest process as normal. The final wrinkle associated with the IPHEx project was making sure that the persistent identifiers that each part of the project data had been assigned in DSpace still resolved to the correct content. One of our developers was able to set up a server redirect to ensure that each URL would still take a user to the right place. As of the new year, the IPHEx project data (along with our other migrated DSpace datasets) are available in their new home!
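The script itself was a one-off, but the general approach is simple enough to sketch in a few lines of Python. This is an illustration that assumes, hypothetically, that each filename embeds a YYYYMMDD date; the actual IPHEx naming scheme and script may have differed.

```python
import re
import shutil
from pathlib import Path

source = Path("iphex_data")  # hypothetical directory of exported files
date_pattern = re.compile(r"(\d{4})(\d{2})\d{2}")  # YYYYMMDD in the name

for f in sorted(source.iterdir()):
    if not f.is_file():
        continue
    match = date_pattern.search(f.name)
    if match is None:
        continue  # leave supplementary files in place for manual handling
    year, month = match.group(1), match.group(2)
    dest = source / year / month          # e.g. iphex_data/2014/05/
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(f), str(dest / f.name))
```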

At least (of course) until the next migration.

September scale-up: promoting the DDR and associated services to faculty and students

It’s September, and Duke students aren’t the only folks on campus in back-to-school mode. On the contrary, we here at the Duke Digital Repository are gearing up to begin promoting our research data curation services in real earnest. Over the last eight months, our four new research data staff have been busy getting to know the campus and the libraries, getting to know the repository itself and the tools we’re working with, and establishing a workflow. Now we’re ready to begin actively recruiting research data depositors!

As our colleagues in Data and Visualization Services noted in a presentation just last week, we’re aiming to scale up our data services in a big way by engaging researchers at all stages of the research lifecycle, not just at the very end of a project. We hope to make this effort a two-front one. Through a series of ongoing workshops and consultations, the Research Data Management Consultants aspire to help researchers develop better data management habits and take the long-term preservation and re-use of their data into account when designing a project or applying for grants. On the back-end of things, the Content Analysts will be able to carry out many of the manual tasks that facilitate that long-term preservation and re-use, and are beginning to think about ways to tweak our existing software to better accommodate the needs of capital-D Data.

This past spring, the Data Management Consultants carried out a series of workshops intended to help researchers navigate the often muddy waters of data management and data sharing; topics ranged from available and useful tools to the occasionally thorny process of obtaining consent for the use and re-use of data from human subjects.

Looking forward to the fall, the RDM consultants are planning another series of workshops to expand on the sessions given in the spring, covering new tools and strategies for managing research output. One of the tools we’re most excited to share is the Open Science Framework (OSF) for Institutions, which Duke joined just this spring. OSF is a powerful project management tool that helps promote transparency in research and allows scholars to associate their work and projects with Duke.

On the back-end of things, much work has been done to shore up our existing workflows, and a number of policies–both internal and external–have been met with approval by the Repository Program Committee. The Content Analysts continue to become more familiar with the available repository tools, while weighing in on ways in which we can make the software work better. The better part of the summer was devoted to collecting and analyzing requirements from research data stakeholders (among others), and we hope to put those needs in the development spotlight later this fall.

All of this is to say: we’re ready for it, so bring us your data!

Going with the Flow: building a research data curation workflow

Why research data? Data generated by scholars in the course of investigation are increasingly being recognized as outputs nearly equal in importance to the scholarly publications they support. Among other benefits, the open sharing of research data reinforces unfettered intellectual inquiry, fosters reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. Data sharing, though, starts with data curation.

In January of this year, Duke University Libraries brought on four new staff members (two Research Data Management Consultants and two Digital Content Analysts) to engage in this curatorial effort, and we have spent the last few months mapping out and refining a research data curation workflow to ensure best practices are applied to managing data before, during, and after ingest into the Duke Digital Repository.

What does this workflow entail? A high level overview of the process looks something like the following:

After collecting their data, the researcher will take what steps they can to prepare it for deposit. This generally means tasks like cleaning and de-identifying the data, arranging files in a structure expected by the system, and compiling documentation to ensure that the data are comprehensible to future researchers. The Research Data Management Consultants will be on hand to help guide these efforts and provide researchers with feedback about data management best practices as they prepare their materials.

Our form for metadata capture

Depositors will then be asked to complete a metadata form and electronically sign a deposit agreement defining the terms of deposit. After we receive this information, someone from our team will invite the depositor to transfer their files to us, usually through Box.

Consultant tasks

At this stage, the Research Data Management Consultants will begin a preliminary review of the researcher’s data by performing a cursory examination for personally identifying or protected health information, inspecting the researcher’s documentation for comprehensibility and completeness, analyzing the submitted metadata for compliance with the research data application profile, and evaluating file formats for preservation suitability. If they have any concerns, they will contact the researcher to make some suggestions about ways to better align the deposit with best practices.
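Parts of a review like this lend themselves to lightweight tooling. As a toy example, a few lines of Python can flag files whose extensions aren’t on a list of preferred, preservation-friendly formats. The list and directory name below are purely illustrative, not DUL’s actual policy or workflow.

```python
from pathlib import Path

# Illustrative set of preservation-friendly extensions (not DUL policy).
PREFERRED = {".txt", ".csv", ".pdf", ".jpg", ".mp3", ".mp4"}

def flag_risky_formats(deposit_dir):
    """Return files whose extensions are not on the preferred list."""
    return [
        path for path in Path(deposit_dir).rglob("*")
        if path.is_file() and path.suffix.lower() not in PREFERRED
    ]

# Hypothetical usage against an incoming deposit directory.
for path in flag_risky_formats("incoming_deposit"):
    print(f"Consider converting or documenting: {path}")
```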

Analyst tasks

When the deposit is in good shape, the Research Data Management Consultants will notify the Digital Content Analysts, who will finalize the file arrangement and migrate some file formats, generate and normalize any necessary or missing metadata, ingest the files into the repository, and assign the deposit a DOI. After the ingest is complete, the Digital Content Analysts will carry out some quality assurance on the data to verify that the deposit was appropriately and coherently structured and that metadata has been correctly assigned. When this is confirmed, they will publish the data in the repository and notify the depositor.

Of course, this workflow isn’t a finished piece; we hope to continue to clarify and optimize the process as we develop relationships with researchers at Duke and receive more data. The Research Data Management Consultants in particular are enthusiastic about the opportunity to engage with scholars earlier in the research life cycle in order to help them better incorporate data curation standards into the beginning phases of their projects. All of us are looking forward to growing into our new roles, while helping to preserve Duke’s research output for some time to come.