
Coming Soon: Duke Research Data Repository 2.0

We are happy to announce that the Duke Research Data Repository (RDR) is updating its platform to provide enhancements for data depositors. The platform will be implemented in partnership with Duke University Libraries and TIND, a spinoff of CERN (see the press release). Below we explain what you need to know about this change.

Timeline

  • Starting on February 15, we will stop accepting new submissions to the repository. Deposit services will resume in early March.
  • The migration is slated to occur during the week of February 23 – February 27.
  • Updates to the migration timeline will be available on the RDR website.

Essential Things to Know

  • Datasets will still be accessible for download during this system update and we don’t anticipate significant “down-time” for the site.
  • After the migration is complete, you will have a new user experience, but the core services of the RDR will stay the same (see more below).
  • If you are interested in what the new system will look like, check out the WashU Data Repository.

What is changing?

  • Homepage: There will be a new landing page with slightly different navigation. All key documentation can be found under the “About” or “Resources” pages. You can still reach our site at research.repository.duke.edu.
  • Links: While URLs will change for datasets, all DOIs assigned to datasets will continue to work. We always encourage you to use DOIs (vs. URLs from your browser window) as those are the most stable and persistent!
  • Persistent Identifiers (PIDs): All datasets will continue to receive DOIs, but we will also provide more robust support for ORCIDs (for unique identification of people) and RORs (for unique identification of organizations/institutions) for better tracking of research outputs and to comply with upcoming PID funder requirements.
  • Metadata: We will now be integrating more DataCite metadata. Key metadata (descriptive information) will remain primarily the same; however, the form for describing your data will now have more built-in features for metadata standardization and compliance with best practices (see PIDs above).
  • Dataset structure: All datasets will have a single landing page (no sub-pages, as with the current platform). This may result in certain datasets being slightly remodeled (e.g., folders zipped) to accommodate the new structure. No files have been changed, and data integrity, reuse, and reproducibility were prioritized in this process.

What do depositors need to know?

  • File upload: For smaller datasets, you will now upload your files directly within the web form prior to hitting the “Submit” button (vs. via a Box link in the current workflow). For larger datasets, upload will continue to be facilitated via Globus.
  • Organization: Datasets will need to be flat (no folders), or any folders (if important to retain for access/reproducibility) will need to be packaged (e.g., zipped or tarred) prior to upload.
  • ORCIDs: To allow linking ORCIDs with datasets, we strongly encourage all dataset authors to have an ORCID prior to beginning a data submission. Go to the ORCID website to get your ORCID today!
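For depositors unsure what "packaging a folder" looks like, here is a minimal shell sketch. The dataset and file names are illustrative only, not part of the RDR workflow:

```shell
# Hypothetical dataset with a nested folder; all names are illustrative.
dataset=$(mktemp -d)/my_dataset
mkdir -p "$dataset/raw_data"
echo "sample values" > "$dataset/raw_data/readings.csv"

# Package the folder so the dataset root stays flat while the internal
# structure is preserved inside a single archive file:
cd "$dataset"
tar -czf raw_data.tar.gz raw_data

# List the archive's contents to verify before uploading:
tar -tzf raw_data.tar.gz
```

A `zip -r raw_data.zip raw_data` would work similarly on systems where `zip` is installed; either form gives the repository one file per folder while keeping the folder hierarchy recoverable on download.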

What is new and improved?

  • Submission dashboard: A new submission dashboard will allow depositors to track all current and past deposits in one easy location.
  • Versioning: A new versioning module will allow depositors to more easily request the creation of a new version of their dataset. All previous versions will still be retained and the new version will receive a new DOI.
  • File Previews: Previewing common file formats (e.g., tabular files, images, PDFs, zips) will now be available on the dataset landing page.
  • Data Citations: You will now be able to copy data citations in a wide variety of citation styles.

What is staying the same?

  • Curation by RDR staff: Datasets will continue to be reviewed by RDR staff prior to publication to help researchers make their data as FAIR (Findable, Accessible, Interoperable, and Reusable) as possible.
  • Preservation and retention: Data will continue to comply with the DUL preservation policy, including redundant copies, fixity checks, and overall information security as required by Duke. Our stated retention policy will also not change.
  • Globus download: For datasets over a certain size threshold, users will continue to have the option to download those files via Globus (in addition to the option to download over the web browser). If the “Download from Globus” button appears at the top of a dataset, we encourage using Globus, as downloading large-scale data over a web browser has some challenges (e.g., timeouts).
  • Embargoes: Depositors can continue to request embargoes for up to one year and we can continue to facilitate access to embargoed files for journal reviewers. Embargoed file names will now be viewable on the dataset page but cannot be downloaded until the embargo is lifted.
  • Collections: Project-specific collections will still be supported and can be requested by emailing datamanagement@duke.edu.

The RDR curation team is excited to bring you this new and improved system and looks forward to continuing to support data sharing, curation, and reproducibility for Duke-generated data.

Please don’t hesitate to reach out with any questions at datamanagement@duke.edu.

Code Repository vs Archival Repository. You need both.

Years ago, I heard the following quote, attributed to Seamus Ross in 2007:

Digital objects do not, in contrast to many of their analog counterparts, respond well to benign neglect. 

National Wildlife Property Repository. USFWS Mountain-Prairie. https://flic.kr/p/SYVPBB

Meaning, you cannot simply leave digital files to their bit-rot tendencies while expecting them to be usable in the future.  Digital repositories are part of a solution to this problem.  But to review, there are many types of repositories, both digital and analog:  repositories of bones, insects, plants, books, digital data, etc.  Even among the subset of digital repositories there are many types.  Some digital repositories keep your data safe for posterity and replication.  Some help you manage the distribution of analysis and code.  Knowing about these differences will affect not only the ease of your computational workflow, but also the legacy of your published works.  

Version-control repositories and their hubs

The most widely known social coding hubs include GitHub, Bitbucket, and GitLab. These hubs leverage Git version-control software to track the evolution of project repositories – typically a software or computational analysis project. Importantly, Git and GitHub are not the same thing, but they work well together.

GIT Repository. Treviño. https://flic.kr/p/SSras

Version control works by monitoring any designated folder or project directory, making that directory a local repository, or repo. Among other benefits, using version control enables “time travel”: interactions with earlier versions of a project are commonplace, and it’s simple to retrieve a deleted paragraph from a report written six months ago. There are many advanced features as well. For example, unlike common file-syncing tools, it’s easy to recreate an earlier state of an entire project directory and every file from a particular point in time. This feature, among others, makes Git version control a handy tool in support of many research workflows and their respective outputs: documents, visualizations, dashboards, slides, analyses, code, software, etc.
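As a minimal sketch of that “time travel” (assuming Git is installed; the file name and commit messages are invented for illustration):

```shell
# Create a toy repository with two committed versions of a report.
proj=$(mktemp -d)
cd "$proj"
git init -q
git config user.email "demo@example.com" && git config user.name "Demo"

echo "first draft, with a key paragraph" > report.txt
git add report.txt && git commit -q -m "Initial draft"

echo "revised draft, paragraph deleted" > report.txt
git commit -q -am "Delete the paragraph"

# "Time travel": read the file exactly as it was one commit ago,
# without disturbing the current working copy.
git show HEAD~1:report.txt
```

The `HEAD~1:report.txt` syntax names a specific file at a specific point in history; the same idea scales to recovering an entire project directory as of any recorded commit.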

Binary. Michael Coghlan. https://flic.kr/p/aYEytM

Git is one of the most popular open-source version-control applications, originally developed in 2005 to facilitate the evolution of the world’s most far-reaching and successful open-source coding project: Linux. Linux is a worldwide collaborative project that spans multiple developers, project managers, natural languages, geographies, and time zones. While Git can handle large projects, it is extensible and can easily scale up or down to support a wide range of workflows. Additionally, Git is not just for software and code files. Essentially any file on a file system can be monitored with Git: MS Word documents, PDF files, images, datasets, etc.

 

There are many ways to share a Git repository and profile your work.  The term push refers to a convenient process of synchronizing a repo up to a remote social coding hub.  Additional features of a hub include issue tracking, collaboration, hosting documentation, and Kanban Method planning.  Conveniently, pushing a repo to GitHub means maintaining a seamless, two-location backup – a push will simultaneously and efficiently synchronize the timeline and file versions. Meanwhile, at a repo editor’s discretion, any collaborator or interested party can be granted access to their GitHub repository.
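A sketch of that push workflow, kept self-contained by letting a local bare repository stand in for a remote hub like GitHub (all paths and names are placeholders):

```shell
# A local bare repository plays the role of the remote hub (a stand-in,
# not an actual GitHub URL).
hub_dir=$(mktemp -d)/hub.git
git init -q --bare "$hub_dir"

# Create a small project repo and commit one file.
work_dir=$(mktemp -d)/analysis-project
mkdir -p "$work_dir" && cd "$work_dir"
git init -q
git config user.email "demo@example.com" && git config user.name "Demo"
echo "session notes" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

# Register the stand-in hub as a remote and push: the commit timeline and
# file versions synchronize to the remote in one step.
git remote add origin "$hub_dir"
git push -q origin HEAD:main
```

With a real hub, the only change is the remote URL (e.g., an HTTPS or SSH address from GitHub or Duke’s GitLab) in the `git remote add` line; every subsequent `git push` then maintains the two-location backup described above.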

Many public instances of social-coding hubs operate on a freemium model. At GitHub most users pay nothing. It’s also possible to run a local instance of a coding hub. For example, OIT offers a local instance of GitLab, delivering many of the same features while enabling permissions, authorization, and access via Duke’s NetID.

While social coding hubs are great tools for distributing files and managing project life-cycles, in and of themselves they do not sufficiently ensure long-term reproducible access to research data. To do that, simply synchronize version-control repositories with archival research data repositories.

Research Data Repositories


Preserving the computational artifacts of formal academic works requires a repository focus that is complementary to version-control repositories and social-coding hubs. Notably, version control is not a requirement of a data repository, where the goal is long-term preservation. Fortunately, many special-purpose data repositories exist. Discipline-specific research repositories are sometimes associated with academic societies. There are also more generalized archival research repositories, such as Zenodo.org. Additionally, many research universities host institutional research data repositories. Not surprisingly, such a repository exists at Duke, where the Duke University Libraries promote and cooperatively shepherd Duke’s Research Data Repository (RDR).

Colossus. Chris Monk. https://flic.kr/p/fJssqg

Data repositories operate under different funding models than social coding hubs and are motivated by different horizons. Coding hubs like GitHub do not promise long-term retention; instead, they focus on the immediate distribution of version-controlled repos and offer project-management features. Research data repositories take a long view, centered on the artifacts of formal research and publication.

By archiving the data milestones of publication, a deposit in the RDR links a formal publication – book edition, chapter, or serial article, etc. – with the data and code (i.e., a compendium) used to produce a single tangible instance of publication.  In turn, the building blocks of computational thinking and research processes are preserved for posterity because the RDR maintains an assurance of long term sustainability.  

Bill Atkinson, creator of MacPaint, painted in MacPaint. Photo by Kyra Rehn. https://flic.kr/p/e9urBF

In the Duke RDR, particular effort is focused on preserving the unique versions of data associated with each formal publication. In this way, authors can associate a digital object identifier, or DOI, with the precise code and data used to draft an accepted paper or research project. Once a dataset is deposited in the RDR, researchers across the globe can examine these archives to verify, to learn, to refute, to cite, or to be inspired toward new avenues of investigation.

By preserving workflow artifacts endemic to publication milestones, research data repositories preserve the record of academic progress.  Importantly, the preservation of these digital outcomes or artifacts is strongly encouraged by funding agencies.  Increasingly, these archival access points are a requirement for funding, especially among publicly funded research.  As such, the Duke RDR exists with aims to preserve and make the academic record accessible, and to create a library of reproducible academic research.  

Conclusion

The imperatives for preserving research data derive from expressly different motives than those driving version-control repositories. Minimally, version-control repositories do not promise academic posterity. However, among the drivers of scholarship is intentional engagement with the preserved academic record. In reality, while unlikely, your GitHub repository could vanish in the blink of the next Wall Street acquisition. Conversely, research data repositories exist with different affordances. These two types of repositories complement each other; moreover, they can be synchronized to enable and preserve the digital processes that comprise many forms of data-driven research. Using both types of repositories implies workflows that positively contribute to a scholarly legacy. It is this promise of academic transmission that drives Duke’s RDR and benefits scholars by enabling access to persistent copies of research.