Category Archives: Data Management

OSF@Duke: By the Numbers and Beyond

The Open Science Framework (OSF) is a data and project management platform developed by the Center for Open Science that is designed to support the entire research lifecycle. OSF has a variety of features including file management and versioning, integration with third-party tools, granular permissions and sharing capabilities, and communication functionalities. It also supports growing scholarly communication formats including preprints and preregistrations, which enable more open and reproducible research practices.

In early 2017, Duke University became a partner institution with the OSF. Because Duke is a partner institution, Duke researchers can sign into the OSF using their NetID and affiliate a project with Duke, which allows it to be displayed on the Duke OSF page. After two years of supporting OSF for Institutions here at Duke, the Research Data Management (RDM) team wanted a better understanding of how our community uses the tool and what they think of it.

As of March 10, 2019, 202 users have signed into the system using their Duke credentials (and there are possibly more users who authenticate using personal email accounts). These users have created 177 projects affiliated with Duke. Forty-six of these projects are public and 132 remain private. Duke users have also registered 80 Duke-affiliated projects, 62 of which are public and 18 of which are embargoed. A registration is a time-stamped read-only copy of an OSF project that can be used to preregister a research design, to create registered reports for journals, or at the conclusion of a project to formally record the authoritative copy of materials.

But what do OSF users think of the tool and how are they using it within their workflows? A few power users shared their thoughts:

Optimizing research workflows: A number of researchers noted how the OSF has helped streamline their workflows through creating a “central place that everyone has access to.” OSF has helped “keeping track of the ‘right’ version of things” and “bypassing the situation of having different versioned documents in different places.” Additionally, the OSF has supported “documenting workflow pipelines.”

Facilitating collaboration: One of the key features of the OSF is that researchers, regardless of institutional affiliation, can contribute to a project and integrate the tools they already use. Matt Makel, Director of Research at TIP, explains how OSF supports his research – “I collaborate with many colleagues at other institutions. OSF solves the problem of negotiating which tools to use to share documents. Rather than switching platforms across (or worse, within) projects, OSF is a great hub for our productivity.”

Offering an end-to-end data management solution: Some research groups are also using OSF in multiple stages of their projects and for multiple purposes. As one researcher expressed – “My research group uses OSF for every project. That includes preregistration and archiving research materials, data, data management and analysis syntax, and supplemental materials associated with publications. We also use it to post preprints to PsyArXiv.”

The responses also suggested that use of the OSF reflects a broader shift in the norms of scholarly communication. As Elika Bergelson, Crandall Family Assistant Professor in Psychology and Neuroscience, aptly put it, “Open science is the way of the future.” Here within Duke University Libraries, we aim to continue to support these shifting norms and the growing benefits of openness through services, platforms, and training.

To learn more about how the OSF might support your research, join us on April 3 from 10-11 am for a hands-on OSF workshop. Register here: https://duke.libcal.com/event/4803444

If you have other questions about using the OSF in a project, the RDM team is available for consultations or targeted demonstrations or trainings for research teams. We also have an OSF project that can help you understand the basic features of the tool.

Contact askdata@duke.edu to learn more or request an OSF demonstration.

Computational Reproducibility Pilot – Code Ocean Trial

A goal of Duke University Libraries (DUL) is to support the growing and changing needs of the Duke research community. This can take many forms. Within Data and Visualization Services, we provide learning opportunities, consulting services, and computational resources to help Duke researchers implement their data-driven research projects. Monitoring and assessing new tools and platforms also helps DUL stay in tune with changing research norms and practices. Today the increasing focus on the importance of transparency and reproducibility has resulted in the development of new tools and resources to help researchers produce and share more reproducible results. One such tool is Code Ocean.

Code Ocean is a computational reproducibility platform that employs Docker technology to execute code in the cloud. The platform does two key things—it integrates the metadata, code, data and dependencies into a single ‘compute capsule’, ensuring that the code will run—and it does this in a single web interface that displays all inputs and results. Within the platform, it is possible to develop, edit or download the code, run routines, and visualize, save or download output, all from a personal computer. Users or reviewers can upload their own data and test the effects of changing parameters or modification of the code. Users can also share their data and code through the platform. Code Ocean provides a DOI for all capsules facilitating attribution and a permanent connection to any published work.

In order to help us understand and evaluate the usefulness of the Code Ocean platform to the Duke research community, DUL will be offering trial access to the Code Ocean cloud-based computational reproducibility platform starting on October 1, 2018. To learn more about what is included in the trial access and to sign up to participate, visit the Code Ocean pilot portal page.

If you have any questions, contact askdata@duke.edu.

Highlights from Expanding our Research Data Management Program

Since the launch of our expanded research data management (RDM) program in January, the Research Data Management Team in DVS has been busy defining and implementing our suite of services. Our “Lifecycle Services” are designed to assist scholars at all stages of their research project from the planning phase to the final curation and disposition of their data in an archive or repository. Our service model centers on four key areas: data management planning, data workflow design, data and documentation review, and data repository support. Over the past nine months, we have worked with Duke researchers across disciplines to provide these services, allowing us to see their value in action. Below we present some examples of how we have supported researchers within our four support areas.

Data Management Planning

With increasing data management plan requirements as well as growing expectations that funding agencies will more strictly enforce and evaluate these plans, researchers are seeking assistance ensuring their plans comply with funder requirements. Through in-person consultations and online review through the DMPTool, we have helped researchers enhance their DMPs for a variety of funding agencies including the NSF Sociology Directorate, the Department of Energy, and the NSF Computer & Information Science & Engineering (CISE) Program.

Data Workflow Design

As research teams begin a project, there are a variety of organizational and workflow decisions that need to be made, from selecting appropriate tools to implementing storage and backup strategies (to name a few). Over the past 6 months, we have had the opportunity to help a multi-institutional Duke Marine Lab Behavioral Response Study (BRS) implement their project workflow using the Open Science Framework (OSF). We have worked with project staff to think through the organization of materials, provided training on the use of the tool, and strategized on storage and backup options.

Data and Documentation Review

During a project, researchers make decisions about how to format, describe, and structure their data for sharing and preservation. Questions may also arise surrounding how to ethically share human subjects data and navigate intellectual property or copyright issues. In conversations with researchers, we have provided suggestions for what formats are best for portability and preservation, discussed their documentation and metadata plans, and helped resolve intellectual property questions for secondary data.

Data Repository Support

At the end of a project, researchers may be required or choose to deposit their data in an archive or repository. We have advised faculty and students on repository options based on their discipline, data type, and repository features. One option available to the Duke community is the Duke Digital Repository (DDR). Over the past nine months, we have assisted with the curation of a variety of datasets deposited within the DDR, many of which underlie journal publications.

This year Duke news articles have featured two research studies with datasets archived within the DDR, one describing a new cervical cancer screening device and another presenting cutting-edge research on a potential new state of matter. The accessibility of both Asiedu et al.’s screening device data and Charbonneau and Yaida’s glass study data enhances the overall transparency and reproducibility of these studies.

Our experiences thus far have enabled us to better understand the diversity of researchers’ needs and allowed us to continue to hone and expand our knowledge base of data management best practices, tools, and resources. We are excited to continue to work with and learn from researchers here at Duke!

Open Science Framework @ Duke

The Open Science Framework (OSF) is a free, open source project management tool developed and maintained by the Center for Open Science (COS). OSF offers many features that can help scholars manage their workflow and outputs throughout the research lifecycle. From collaborating effectively, to managing data, code, and protocols in a centralized location, to sharing project materials with the broader research community, the OSF provides tools that support openness, research integrity, and reproducibility. Some of the key functionalities of the OSF include:

  • Integrations with third-party tools that researchers already use (e.g., Box, Google Drive, GitHub, and Mendeley)
  • Hierarchical organizational structures
  • Unlimited native OSF storage*
  • Built-in version control
  • Granular privacy and permission controls
  • Activity log that tracks all project changes
  • Built-in collaborative wiki and commenting pane
  • Analytics for public projects
  • Persistent, citable identifiers for projects, components, and files along with Digital Object Identifiers (DOIs) and Archival Resource Keys (ARKs) available for public OSF projects
  • And more!

Duke University is a partner institution with OSF, meaning you can sign into the OSF using your NetID and affiliate your projects with Duke. Visit the Duke OSF page to see some Duke research projects and outputs from our community.

Duke University Libraries has also partnered with COS to host a workshop this fall entitled “Increasing Openness and Reproducibility in Quantitative Research.” This workshop will teach participants how they can increase the reproducibility of their work and will include hands-on exercises using the OSF.

Workshop Details
Date: October 3, 2017
Time: 9 am to 12 pm
Register:
http://duke.libcal.com/event/3433537

If you are interested in affiliating an existing OSF project, want to learn more about how the OSF can support your workflow, or would like a demonstration of the OSF, please contact askdata@duke.edu.

*Individual file size limit of 5 GB. Users can upload larger files by connecting third party add-ons to their OSF projects.

Fall Data and Visualization Workshops

Visualize, manage, and map your data in our Fall 2017 Workshop Series. Our workshops are designed for researchers who are new to data-driven research as well as those looking to expand their skills with new methods and tools. With workshops exploring data visualization, digital mapping, data management, R, and Stata, the series covers a wide range of data tools and techniques. This fall, we are extending our partnership with the Graduate School and offering several workshops in our data management series for RCR credit (please see course descriptions for further details).

Everyone is welcome at Duke Libraries workshops.  We hope to see you this fall!

Workshop Series by Theme

Data Management

09-13-2017 – Data Management Fundamentals
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-26-2017 – Writing a Data Management Plan
10-03-2017 – Increasing Openness and Reproducibility in Quantitative Research
10-18-2017 – Finding a Home for Your Data: An Introduction to Archives & Repositories
10-24-2017 – Consent, Data Sharing, and Data Reuse 
11-07-2017 – Research Collaboration Strategies & Tools 
11-09-2017 – Tidy Data Visualization with Python

Data Visualization

09-12-2017 – Introduction to Effective Data Visualization 
09-14-2017 – Easy Interactive Charts and Maps with Tableau 
09-20-2017 – Data Visualization with Excel
09-25-2017 – Visualization in R using ggplot2 
09-29-2017 – Adobe Illustrator to Enhance Charts and Graphs
10-13-2017 – Visualizing Qualitative Data
10-17-2017 – Designing Infographics in PowerPoint
11-09-2017 – Tidy Data Visualization with Python

Digital Mapping

09-12-2017 – Intro to ArcGIS Desktop
09-27-2017 – Intro to QGIS 
10-02-2017 – Mapping with R 
10-16-2017 – Cloud Mapping Applications 
10-24-2017 – Intro to ArcGIS Pro

Python

11-09-2017 – Tidy Data Visualization with Python

R Workshops

09-11-2017 – Intro to R: Data Transformations, Analysis, and Data Structures  
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-25-2017 – Visualization in R using ggplot2 
10-02-2017 – Mapping with R 
10-17-2017 – Intro to R: Data Transformations, Analysis, and Data Structures
10-19-2017 – Developing Interactive Websites with R and Shiny 

Stata

09-20-2017 – Introduction to Stata
10-19-2017 – Introduction to Stata

Love Your Data Week (Feb. 13-17)

In cooperation with the Triangle Research Library Network, Duke Libraries will be participating in Love Your Data Week on February 13-17, 2017. Love Your Data Week is an international event to help researchers take better care of their data. The campaign focuses on raising awareness and building community around data management, sharing, preservation, and reuse.

The theme for Love Your Data Week 2017 is data quality, with a related message for each day.

  • Monday: Defining Data Quality
  • Tuesday: Documenting, Describing, and Defining
  • Wednesday: Good Data Examples
  • Thursday: Finding the Right Data
  • Friday: Rescuing Unloved Data

Throughout the week, Data and Visualization Services will be contributing to the conversation on Twitter (@duke_data). We will also host the following local programming related to the daily themes:

In honor of Love Your Data Week, chocolates will be provided at these workshops!

The new Research Data Management staff at the Duke Libraries are available to help researchers care for their data through consultations, support services, and instruction. We can assist with writing data management plans that comply with funder policies, advise on data management best practices, and facilitate the ingest of data into repositories. To learn more about general data management best practices, see our newly updated RDM guide.

Contact us at askdata@duke.edu to find out how we can help you love your data! 

Get involved in Love Your Data Week by following the conversation at #LYD17, #loveyourdata, and #trlndata.

All promotional Love Your Data 2017 materials used under a Creative Commons Attribution 4.0 International License.

Citation: Bass, M., Neeser, A., Atwood, T., and Coates, H. (2017). Love Your Data Week Promotional Materials. [image files]. Retrieved from https://osf.io/r8tht/files/

New Data Management Services @ Duke

Duke Libraries are happy to announce a new set of research data management services designed to help researchers secure grant funding, increase research impact, and preserve valuable data. Building on the recommendations of the Digital Research Faculty Working Group and the Duke Digital Research Data Services and Support report, Data and Visualization Services have added two new research data management consultants who are available to work with researchers across the university and medical center on a broad range of data management concerns from data creation to data curation.

Interested in learning more about data management?

Our New Data Management Consultants

Sophia Lafferty-Hess attended the University of North Carolina at Chapel Hill where she received a Master of Science in Information Science and Master of Public Administration. Prior to coming to Duke, Sophia worked at the Odum Institute for Research in Social Science at UNC-Chapel Hill within the Data Archive as a Research Data Manager. In this position, Sophia provided consultations to researchers on data management best practices, curated research data to support long-term preservation and reuse, and provided training and instruction on data management policies, strategies, and tools.

While at Odum, Sophia also helped lead the development of a data curation and verification service for journals to help enforce data sharing and replication policies, which included verifying that data meet quality standards for reuse and that the data and code can properly reproduce the analytic results presented in the article. Sophia’s current research interests include the impact of journal data sharing policies on data availability and the development of data curation workflows.

Jen Darragh comes to us from Johns Hopkins University where she served for the past seven years as the Data Services and Sociology Librarian, and Hopkins Population Center Restricted Projects Coordinator. In this position, Jen developed the libraries’ Restricted Data Room and designed the secure data enclave spaces and staff support for the Johns Hopkins Population Center.

Jen received her Bachelor of Arts Degree in Psychology from Westminster College (PA) and her Master of Library and Information Sciences degree from the University of Pittsburgh.  She has been involved with socio-behavioral research data throughout her career.  Jen is particularly interested in the development of centralized, controlled data access for sensitive human subjects’ data (subject to HIPAA or FERPA requirements) to facilitate broader, yet more secure sharing of existing research data as a means to produce new, cutting-edge research.

 

Duke Libraries and SSRI welcome Mara Sedlins!

On behalf of Duke Libraries and the Social Science Research Institute, I am happy to welcome Mara Sedlins to Duke.  As the library and SSRI work to develop a rich set of data management, analysis, and archiving strategies for Duke researchers, Mara’s postdoctoral position provides a unique opportunity to work closely with researchers across campus to improve both training and workflows for data curation at Duke.  – Joel Herndon, Head of Data and Visualization Services, Duke Libraries  

I am excited to join the Data and Visualization Services team this fall as a postdoctoral fellow in data curation for the social sciences (sponsored by CLIR and funded by the Alfred P. Sloan Foundation). For the next two years, I will be working with Duke Libraries and the Social Science Research Institute to develop best practices for managing a variety of research data in the social sciences.

My research background is in social and personality psychology. I received my PhD at the University of Washington, where I worked to develop and validate a new measure of automatic social categorization – to what extent do people, automatically and without conscious awareness, sort faces into socially constructed categories like gender and race? The measure has been used in studies examining beliefs about human genetic variation and the racial labels people assign to multiracial celebrities like President Barack Obama.

While in Seattle, I was also involved in several projects at Microsoft Research assessing computer-supported cooperative work technologies, focusing on people’s preferences for different types of avatar representations, compared to video or audio-only conferencing. I also have experience working with data from a study of risk factors for intimate partner violence, managing a database of donors and volunteers for a historical archive, and organizing thousands of high-resolution images for a large-scale digital comic art restoration project.

I look forward to applying the insights gained from working on a diverse array of data-intensive projects to the problem of developing and promoting best practices for data management throughout the research lifecycle.  I am particularly interested in questions such as:

  • How can researchers write actionable data management plans that improve the quality of their research?
  • What strategies can be used to organize and document data files during a project so that it’s easy to find and understand them later?
  • What steps need to be taken so that data can be discovered and re-used effectively by other researchers?

These are just a few of the questions that are central to the rapidly evolving field of data curation for the sciences and beyond.

 

Data and Visualization Spring 2016 Workshops

SPRING 2016: Data and Visualization Workshops

Interested in getting started in data-driven research or exploring a new approach to working with research data? Data and Visualization Services’ spring workshop series features a range of courses designed to showcase the latest data tools and methods. Begin working with data in our Basic Data Cleaning/Analysis or the new Structuring Humanities Data workshop. Explore data visualization in the Making Data Visual class. Our wide range of workshops offers a variety of approaches for meeting the challenges of 21st century data-driven research. Please join us!

Workshop by Theme

DATA SOURCES

DATA CLEANING AND ANALYSIS

DATA ANALYSIS

MAPPING AND GIS

DATA VISUALIZATION

* – For these workshops, no prior experience with data projects is necessary!  These workshops are great introductions to basic data practices.

Shapefiles vs. Geodatabases

Ever wonder what the difference between a shapefile and a geodatabase is in GIS and why each storage format is used for different purposes?  It is important to decide which format to use before beginning your project so you do not have to convert many files midway through your project.

Basics About Shapefiles:

Shapefiles are simple storage formats that have been used in ArcMap since the 1990s when Esri created ArcView (the early version of ArcMap 10.3). Because the format is that old, shapefiles have many limitations, such as:

  • Take up more storage space on your computer than a geodatabase
  • Do not support names in fields longer than 10 characters
  • Cannot store date and time in the same field
  • Do not support raster files
  • Do not store NULL values in a field; when a value is NULL, a shapefile will use 0 instead
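The 10-character limit on field names is easy to check before you export anything. The snippet below is only a minimal sketch in plain Python and does not use ArcGIS; the helper name `find_long_field_names` and the sample field names are made up for illustration.

```python
# Hypothetical helper for illustration; not part of ArcGIS or any GIS library.
MAX_FIELD_NAME_LENGTH = 10  # the shapefile limit described in the list above


def find_long_field_names(field_names, limit=MAX_FIELD_NAME_LENGTH):
    """Return the field names that are longer than the shapefile limit."""
    return [name for name in field_names if len(name) > limit]


if __name__ == "__main__":
    # Made-up field names for a planned attribute table.
    planned_fields = ["parcel_id", "owner_name", "assessed_value", "last_sale_date"]
    for name in find_long_field_names(planned_fields):
        print(f"{name!r} has {len(name)} characters; shorten it before exporting")
```

Running a check like this on a planned attribute table shows which names would need to be shortened if the data ends up in a shapefile rather than a geodatabase.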

Users are allowed to create points, lines, and polygons with a shapefile.  One shapefile must have at least 3 files but most shapefiles have around 6 files.  A shapefile must have:

  • .shp – this file stores the geometry of the feature
  • .shx – this file stores the index of the geometry
  • .dbf – this file stores the attribute information for the feature

All files for the shapefile must be stored in the same location with the same name or else the shapefile will not load.  When a shapefile is opened in Windows Explorer it will look different than when opened in ArcCatalog.
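If a shapefile will not load, a quick first check is whether the three required files sit next to each other under the same name. The following is a rough sketch using only the Python standard library; the function name `missing_components` is an assumption made for this example, and the check is case-sensitive on file systems that distinguish upper and lower case.

```python
import tempfile
from pathlib import Path

# The three required components named in the text above.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_components(shp_path):
    """Return the required extensions that have no file next to shp_path.

    Hypothetical helper: it only looks for files with the same base name
    in the same folder and does not validate their contents.
    """
    shp = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS
            if not shp.with_suffix(ext).exists()]


if __name__ == "__main__":
    # Build a deliberately incomplete shapefile in a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        for name in ("roads.shp", "roads.dbf"):
            (Path(folder) / name).touch()
        print(missing_components(Path(folder) / "roads.shp"))  # prints ['.shx']
```

An empty list means all three required files are present; it says nothing about optional files such as a projection file.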

[Image: the files of a shapefile as they appear in Windows Explorer]

Basics About Geodatabases:

Geodatabases allow users to thematically organize their data and store spatial databases, tables, and raster datasets.  There are two types of single user geodatabases: File Geodatabase and Personal Geodatabase.  File geodatabases have many benefits including:

  • A storage limit of 1 TB for each dataset
  • Better performance than a Personal Geodatabase
  • Many users can view data inside the File Geodatabase while the geodatabase is being edited by another user
  • The geodatabase can be compressed, which helps reduce the geodatabase’s size on disk

On the other hand, Personal Geodatabases were originally designed to be used in conjunction with Microsoft Access and the Geodatabase is stored as an Access file (.mdb).  Therefore Personal Geodatabases can be opened directly in Microsoft Access, but the entire geodatabase can only have 2 GB of storage.

To organize your data into themes you can create Feature Datasets within a geodatabase. Feature datasets store Feature Classes (which are the equivalent of shapefiles) with the same coordinate system. As with shapefiles, users can create points, lines, and polygons with feature classes; feature classes can also store annotation and dimension features.

[Image: the Durham_County geodatabase]

In order to create advanced datasets in ArcGIS (such as a network dataset, a geometric network, a terrain dataset, or a parcel fabric, or to run topology on an existing layer), you will need to create a Feature Dataset.

You will not be able to work with the individual files of a File Geodatabase in Windows Explorer. If you open it there, the Durham_County geodatabase shown above will look like this:

[Image: the Durham_County geodatabase as it appears in Windows Explorer]

Tips:

  • Whenever you copy shapefiles, use ArcCatalog. If you use Windows Explorer and do not select all the files for a shapefile, the shapefile will be corrupt and will not load.
  • When using a geodatabase, use a File Geodatabase. There is more storage capacity, multiple users can view/read the database at the same time, and the file geodatabase runs tools and queries faster than a Personal Geodatabase.
  • Use a shapefile when you want to read the attribute table or when you only need to run one or two tools/processes. Long-term projects should be organized into a File Geodatabase and Feature Datasets.
  • Many files downloaded from the internet are shapefiles. To convert them into your geodatabase, right click the shapefile, click “Export,” and select “To Geodatabase (single).”

[Image: exporting a shapefile to a geodatabase]
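The first tip recommends ArcCatalog for copying shapefiles. If you ever need to copy one outside ArcGIS, the same idea can be sketched in Python: copy every file that shares the shapefile's base name, not just the .shp. This is an illustrative alternative under stated assumptions, not the ArcCatalog method; the function name `copy_shapefile` is hypothetical, and the simple prefix match would also pick up an unrelated file named like `roads.v2.shp` when copying `roads.shp`.

```python
import shutil
import tempfile
from pathlib import Path


def copy_shapefile(shp_path, destination_dir):
    """Copy every file sharing the shapefile's base name into destination_dir.

    Hypothetical helper: matches on "<base name>." so that multi-part
    names such as roads.shp.xml are included. Returns the copied file names.
    """
    shp = Path(shp_path)
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    prefix = shp.stem + "."
    copied = []
    for component in sorted(shp.parent.iterdir()):
        if component.is_file() and component.name.startswith(prefix):
            shutil.copy2(component, destination / component.name)
            copied.append(component.name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        source = Path(folder) / "source"
        source.mkdir()
        # Empty stand-in files; a real shapefile would contain data.
        for name in ("roads.shp", "roads.shx", "roads.dbf", "rivers.shp"):
            (source / name).touch()
        print(copy_shapefile(source / "roads.shp", Path(folder) / "backup"))
```

Because the function returns the names it copied, you can confirm that the .shp, .shx, and .dbf files all made it to the new location.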