Computational Reproducibility Pilot – Code Ocean Trial

A goal of Duke University Libraries (DUL) is to support the growing and changing needs of the Duke research community. This can take many forms. Within Data and Visualization Services, we provide learning opportunities, consulting services, and computational resources to help Duke researchers implement their data-driven research projects. Monitoring and assessing new tools and platforms also helps DUL stay in tune with changing research norms and practices. Today, the increasing focus on transparency and reproducibility has spurred the development of new tools and resources to help researchers produce and share more reproducible results. One such tool is Code Ocean.

Code Ocean is a computational reproducibility platform that employs Docker technology to execute code in the cloud. The platform does two key things: it integrates the metadata, code, data, and dependencies into a single ‘compute capsule’, ensuring that the code will run, and it presents everything in a single web interface that displays all inputs and results. Within the platform, it is possible to develop, edit, or download the code, run routines, and visualize, save, or download output, all from a personal computer. Users or reviewers can upload their own data and test the effects of changing parameters or modifying the code. Users can also share their data and code through the platform. Code Ocean provides a DOI for every capsule, facilitating attribution and a permanent connection to any published work.

To help us understand and evaluate the usefulness of the Code Ocean platform for the Duke research community, DUL will offer trial access to the platform starting on October 1, 2018. To learn more about what is included in the trial access and to sign up to participate, visit the Code Ocean pilot portal page.

If you have any questions, contact askdata@duke.edu.

Expanding Support for Data Visualization in Duke Libraries

Over the last six years, Data and Visualization Services (DVS) has expanded support for data visualization in the Duke community under the expert guidance of Angela Zoss. In this period, Angela developed Duke University Libraries’ visualization program through a combination of thoughtful consultations, training, and events that expanded the community of data visualization practice at Duke while simultaneously increasing the impact of Duke research.

As of May 1st, Duke Libraries is happy to announce that Angela will expand her role in promoting data visualization in the Duke community by transitioning to a new position in the library’s Assessment and User Experience department. In her new role, Angela will support a larger effort in Duke Libraries to increase data-driven decision making. In Data and Visualization Services, Eric Monson will take the lead on research consultation and training for data visualization in the Duke community. Eric, who has been a data visualization analyst with DVS since 2015 and has a long history of supporting data visualization at Duke, will serve as DVS’ primary contact for data visualization.

DVS wishes Angela success in her new position. We look forward to continuing to work with the Duke community to expand data visualization research on campus.

Using Tableau with Qualtrics data at Duke

The end of the spring semester always brings presentations of final projects, some of which may have been in the works since the fall or even the summer. Tableau, a software application designed specifically for visualization, is a great option for projects that would benefit from interactive charts and maps.

Visualizing survey data, however, can be a bit of a pain. If your project uses Qualtrics, for example, you may be having trouble getting the data ready for visualization and analysis. Qualtrics is an extremely powerful survey tool, but the data it creates can be very complicated, and typical data analysis tools aren’t designed to handle that complexity.

Luckily, here at Duke, Tableau users can use Tableau’s Web Data Connector to pull Qualtrics data directly into Tableau! It’s so easy, you may never analyze your Qualtrics data another way again.

Process

Here are the basics. There are also instructions from Qualtrics.

In Qualtrics: Copy your survey URL

  • Go to your Duke Qualtrics account
  • Click on the survey of interest
  • Click on the Data & Analysis tab at the top
  • Click on the Export & Import button
  • Select Export Data
  • Click on Tableau
  • Copy the URL

In Tableau (Public or Desktop): Paste your survey URL


  • Under Connect, click on Web Data Connector (may be under “More…” for Tableau Public or “To a server… More…” for Tableau Desktop)
  • Paste the survey URL into the web data connector URL box and hit enter/return
  • When a login screen appears, click the tiny “Api Token Login” link, which should be below the green Log in button

In Qualtrics: Create and copy your API token


  • Go to your Duke Qualtrics account
  • Click on your account icon in the upper-right corner
  • Select Account Settings…
  • On the Account Settings page, click on the Qualtrics IDs tab
  • Under API, check for a token. If you don’t have one yet, click on Generate Token
  • Copy your token

In Tableau (Public or Desktop): Paste your API token

  • Paste in your API token and click the Login button
  • Select the data fields you would like to import

Note: there is an option to “transpose” some of the fields on import. This is useful for many of the types of visualizations you might want to create from survey data. Typically, you want to transpose fields that represent the questions asked in the survey, but you may not want to transpose demographics data or identifiers. See also the Qualtrics tips on transposing data.
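
Prefer a scripted route? The same API token works from code. Below is a minimal sketch in R using the ropensci qualtRics package together with tidyr; the token, data center URL, survey ID, and question-column prefix are placeholders, so adjust them to match your own account.

```r
# A scripted alternative to the steps above, using the ropensci qualtRics
# package; the token, data center URL, and survey ID are placeholders.
library(qualtRics)
library(tidyr)
library(dplyr)

# Register the API token copied from the Qualtrics IDs tab
qualtrics_api_credentials(
  api_key  = "YOUR_API_TOKEN",                # placeholder
  base_url = "yourdatacenter.qualtrics.com"   # placeholder
)

# Download all responses for one survey as a data frame
responses <- fetch_survey(surveyID = "SV_xxxxxxxxxxx")  # placeholder ID

# The rough equivalent of Tableau's "transpose" option: reshape question
# columns from wide to long, leaving demographics and identifiers wide
long <- responses %>%
  pivot_longer(
    cols             = starts_with("Q"),  # assumes questions are named Q1, Q2, ...
    names_to         = "question",
    values_to        = "response",
    values_transform = list(response = as.character)  # questions may mix types
  )
```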

Resources

For more tips on how to use Tableau with Qualtrics data, check out the resources below:

Can’t we just make a Venn diagram?

When I’m teaching effective visualization principles, one of the most instructive processes is critiquing published visualizations and reviewing reworks done by professionals. I’ll often show examples from Cole Nussbaumer Knaflic’s blog, Storytelling with Data, and Jon Schwabish’s blog, The Why Axis. Both brilliant! (Also, check out my new favorite blog Uncharted, by Lisa Charlotte Rost, for wonderful visualization discussions and advice!)

What we don’t usually get to see is the progression of an individual visualization throughout the design process, from data through rough drafts to final product. I thought it might be instructive to walk through an example from one of my recent consults. Some of the details have been changed because the work is unpublished and the jargon doesn’t help the story.

Data full of hits and misses

Five tests data, hits and misses per patient

A researcher came to me for help with an academic paper figure. He and his collaborator were comparing five literature-accepted methods for identifying which patients might have a certain common disease from their medical records. Out of 60,000 patients, about 10% showed a match with at least one of the tests. The resulting data was a spreadsheet with a column of patient IDs and five columns of tests, with a one if a patient was identified as having the disease by that particular test, and a zero if their records didn’t match for that test. As you can see in the figure, there were many inconsistencies in who seemed to have the disease across the five tests!
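
To make that structure concrete, here is a small simulated stand-in in R (the real data are unpublished, so these values are random):

```r
# A simulated stand-in for the spreadsheet described above: one row per
# patient, one 0/1 column per test. Values are random; the per-test hit
# probability is tuned so roughly 10% of patients match at least one test.
# (The real tests were correlated; independence here is a simplification.)
set.seed(1)
n     <- 60000
tests <- c("a", "b", "c", "d", "e")

hits <- as.data.frame(
  matrix(rbinom(n * 5, size = 1, prob = 0.02),
         nrow = n, dimnames = list(NULL, tests))
)
hits$patient_id <- sprintf("P%05d", seq_len(n))

mean(rowSums(hits[tests]) > 0)   # ~0.10, matching the described hit rate
```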

So you want to build a snowman

Five tests overlap, original Venn diagram

The researchers wanted a visualization to represent the similarities and differences between the test results. Specifically, they wanted to make a Venn diagram, which consists of overlapping ellipses representing sets. They had an example they’d drawn by hand, but wanted help making it into an accurate depiction of their data. I resisted, explaining that I didn’t know of a program that would accomplish what they wanted, and that it is likely mathematically impossible to take their five-dimensional data set and represent it quantitatively as a Venn diagram in 2D. Basically, you can’t get the areas of all of the overlapping regions to be properly proportional to the number of patients that had hits on all of the combinations of the five tests. The Venn diagram works fine schematically, as a way to get across an idea of set overlap, but it would never be a data visualization that would reveal quantitative patterns from their findings. At worst, it would be a misleading distortion of their results.

Count me in

Five tests data pairwise table with colored cells

His other idea was to show the results as a table of numbers in the form of a matrix. Each of the five tests was listed across the top and the side, and the cell contents showed the quantity of patients who matched on that pair of tests. The number matching on a single test was listed on the diagonal. Those patterns can be made more visual by coloring the cells with “conditional formatting” in Excel, but the main problem with the table is that it hides a bunch of interesting data! We don’t see any of the numbers for people who hit on the various combinations of three tests, or four, or the ones that hit on all five.
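
That pairwise table is easy to reproduce from the data itself; a sketch, again with simulated stand-in data:

```r
# Reproducing the pairwise table from a 0/1 patient-by-test matrix:
# crossprod(M) (i.e., t(M) %*% M) counts the patients who hit each pair
# of tests, with the single-test totals on the diagonal.
set.seed(1)
tests <- c("a", "b", "c", "d", "e")
M <- matrix(rbinom(60000 * 5, size = 1, prob = 0.02),
            ncol = 5, dimnames = list(NULL, tests))

pairwise <- crossprod(M)
pairwise   # 5 x 5 table of pairwise hit counts
```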

Five test data heatmap and number of tests hit per patient

I suggested we start exploring the hit combinations by creating a heatmap of the original data, sorting the patients (along the horizontal axis) by how many tests they hit (and listing the tests up the vertical axis). Black circles are overlaid showing the number of tests hit for any given patient.

There are too many patients (lines) here to show the combinations of tests clearly, but this visualization already illuminated two things that made sense to the researchers. First, there is a lot of overlap between the ALPHA (a) and BETA-based (b) tests, and between GAMMA Method (c) and Modified GAMMA (d), because these test pairs are variations of each other. Second, the overlaps suggest that the test definitions are logically nested: (a) is a subset of (b), and (b) is for the most part a subset of (c).
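
For the curious, here is a sketch of this kind of sorted heatmap in base R, with simulated stand-in data:

```r
# Sketch of the sorted heatmap in base R, using simulated stand-in data:
# patients with at least one hit along x, sorted by number of tests hit.
set.seed(1)
tests <- c("a", "b", "c", "d", "e")
M <- matrix(rbinom(60000 * 5, size = 1, prob = 0.02),
            ncol = 5, dimnames = list(NULL, tests))

M_hit <- M[rowSums(M) > 0, ]               # drop all-miss patients
M_hit <- M_hit[order(rowSums(M_hit)), ]    # sort by number of tests hit

image(x = seq_len(nrow(M_hit)), y = 1:5, z = M_hit,
      col = c("white", "steelblue"),
      xlab = "patients (sorted by tests hit)", ylab = "", axes = FALSE)
axis(2, at = 1:5, labels = colnames(M_hit), las = 1)
```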

Five tests combinations Tableau bubble plot

My other initial idea was to show the numbers of patients identified in each of the test overlaps as bubbles in Tableau. Here I continue the shorthand of labeling each test by the letters [a,b,c,d,e], ordered from ALPHA to OMEGA. The number of tests hit is both separated in space and encoded in the color (low to high = light to dark).
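
Tallying patients per combination, and building the underscore shorthand, takes only a few lines in R; a sketch with simulated stand-in data:

```r
# Counting patients per test combination, with "_b___"-style labels:
# a test's letter if hit, an underscore otherwise.
set.seed(1)
tests <- c("a", "b", "c", "d", "e")
M <- matrix(rbinom(60000 * 5, size = 1, prob = 0.02),
            ncol = 5, dimnames = list(NULL, tests))

pattern <- apply(M, 1, function(row)
  paste0(ifelse(row == 1, tests, "_"), collapse = ""))

combos <- as.data.frame(table(pattern), stringsAsFactors = FALSE)
combos$n_tests <- nchar(gsub("_", "", combos$pattern))
combos <- combos[combos$n_tests > 0, ]     # drop the all-miss "_____" row
combos[order(-combos$Freq), ]              # bubble sizes, largest first
```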

Add some (effective?) dimensions

I felt the weakness of this representation was that the bubbles were not spatially associated with their corresponding tests. Inspired by multi-dimensional radial layouts such as those used in the Stanford dissertation browser, I created a chart (in Adobe Illustrator) with five axes for the tests. I wanted each bubble to “feel a pull” from each of the passed tests, so it made sense to place the five-hit “abcde” bubble at the center, and each individual “_b___”, “__c__”, “____e” bubble right by its letter – making the radius naturally correspond to the number of tests hit. Other bubbles were placed (manually) in between their combination of axes / tests.
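
For anyone who wants to script rather than hand-place, here is a rough, hypothetical automation of that layout rule in R (not what I actually did, which was manual):

```r
# A rough automation of the manual rule: each test gets an axis angle;
# a bubble sits at the circular mean of its member tests' angles, with
# the radius shrinking as more tests are hit (five hits = dead center).
# Combinations of near-opposite axes stay ambiguous, as in the manual layout.
tests  <- c("a", "b", "c", "d", "e")
angles <- seq(90, -198, length.out = 5) * pi / 180   # five evenly spaced axes

place_bubble <- function(pattern) {   # pattern like "_bc__"; needs >= 1 hit
  hit <- strsplit(pattern, "")[[1]] != "_"
  k   <- sum(hit)
  ang <- atan2(mean(sin(angles[hit])), mean(cos(angles[hit])))  # circular mean
  r   <- (5 - k) / 4                  # k = 5 -> 0 (center), k = 1 -> 1 (edge)
  c(x = r * cos(ang), y = r * sin(ang))
}

place_bubble("abcde")   # the five-hit bubble lands at the origin
place_bubble("_b___")   # a single-hit bubble lands on its own axis
```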

Five test combinations hits polar bubble plot

The researchers liked this version. It was eye-catching, and the gravitation of bubbles in the b/c quadrant vaguely illustrated the pattern of hits and known test subsets. One criticism, though, was that it was a bit confusing – it wasn’t obvious how the bubbles were placed around the circles, and it might take people too long to figure out how to read the plot. It also took up a lot of page space.

Give these sets a hand

One of the collaborators, after seeing this representation, suggested trying to use it as the basis for an Euler diagram. Like a Venn diagram, it’s a visualization used to show set inclusion and overlap, but unlike in a Venn, an Euler is drawn using arbitrary shapes surrounding existing labels or images representing the set members. I thought it was an interesting idea, but I initially dismissed it as too difficult. I had already put more time than I typically spend on a consult into this visualization (our service model is to help people learn how to make their own visualizations, not produce visualizations for them). Also, I had never made an Euler diagram. While I had seen some good talks about them, I didn’t have any software on hand that would automate the process. So, I responded that the researchers should feel free to try creating curves around the sets themselves, but that I wasn’t interested in pursuing it further.

Five tests polar bubbles with hand-drawn set boundaries

About two minutes after I sent the email, I began looking at the diagram and wondering if I could draw the sets! I printed out a black and white copy and started drawing lines with colored pencils, making one enclosing shape for each test [a-e]. It turned out that my manual layout resulted in fairly compact curves, except for “_bc_e”, which had ambiguous positioning, anyway.

Five tests first draft Euler diagram

The curve drawing was so easy that I started an Illustrator version. I kept the circles’ areas the same (corresponding to their quantitative values) but pushed them around to make the set shapes more compact.

Ironically, I had come back almost exactly to the researchers’ original idea! The important distinction is that the bubbles keep it quantitative, with the regions only representing set overlap.

We’ve come full ellipsoid

Angela Zoss constructively pointed out that there were now too many colors, and that the shades encoding the number of hits were no longer necessary. She also felt the region labels weren’t clear. Those fixes, plus some curve smoothing (Path -> Simplify in Illustrator), led me to a final version we were all very happy with!

It’s still not a super simple visualization, but both the quantitative and set overlap patterns are reasonably clear. This result was only possible, though, through trying multiple representations and getting feedback on each!

Five tests final quantitative bubble Euler diagram

If you’re interested in learning how to create visualizations like this yourself, sign up for the DVS announcements listserv, or keep an eye on our upcoming workshops list. We also have videos of many past workshops, including Angela’s Intro to Effective Data Visualization, and my Intro to Tableau, Illustrator for Charts, and Illustrator for Diagrams.

Love Data Week (Feb. 12-16)

Love Data Week is here again! Love Data Week is an international social media campaign to raise awareness and build community around research data management, sharing, preservation, reuse, and library-based research data services.

This year the theme for Love Data Week is Data Stories, with a focus on four topics:

  • Stories about data
  • Telling stories with data
  • Connected conversations across different data communities
  • We are data: seeing the people behind the data

Since last year, we have some new data stories at Duke Libraries. The Duke Digital Repository now contains nearly 30 data sets that Duke researchers have shared for preservation and reuse. There are 23 Duke-affiliated projects in the Open Science Framework, a free web app developed by the Center for Open Science that facilitates good project and data management practices. And our Research Data Management team has continued to offer consultation and instruction services to a variety of researchers on campus.

We invite you to join the data story by attending data related events coming up at Duke during Love Data Week:

  • Tuesday, February 13:

    Civil and Environmental Engineering Seminar: Data, data everywhere … Making sense of observations and models across scales 
  • Wednesday, February 14:

    Introduction to the Open Science Framework*: Learn how this free, open source tool can help you manage and share your research data.

    Story Maps with ArcGIS Online: Learn how to tell your data story with an interactive map that integrates other media (photos, text, videos) and shows changes over space and time.

    Qualitative Data Analysis Workshop at the Social Science Research Institute: Learn how to transform interview scripts into analyzable data and other foundational skills in qualitative data analysis. 
  • Thursday, February 15:

    Data Dialogue at the Information Initiative at Duke: Design intuition, ethnography, and data science 
  • Friday, February 16:

    Visualization Friday Forum: Invisible Visualization: Making data visualizations accessible to the blind and other people with disabilities

*In honor of Love Data Week, chocolate will be offered at this event.

Keep an eye on additional workshops coming up for the rest of the spring semester!

Contact us at askdata@duke.edu for help with your data story, and follow the conversation at #lovedata18.

All promotional Love Data 2018 materials used under a Creative Commons Attribution 4.0 International License.

Citation: Nurnberger, A., Coates, H. L., Condon, P., Koshoffer, A. E., Doty, J., Zilinski, L., … Foster, E. D. (2017). Love Data Week [image files]. Retrieved from https://osf.io/r8tht

Announcing Tidyverse workshops for Winter 2018

Coming this winter, the Data & Visualization Services Department will once again host a workshop series on the R programming language. Our spring offering is modeled on our well-received R we having fun yet‽ (Rfun) fall workshop series. The six-part series will introduce R as a language for modern data manipulation by highlighting a set of tidyverse packages that enable functional data science. We will approach R using the free RStudio IDE, with an emphasis on reproducible, literate code and a bias toward the tidyverse. We believe this open tool-set provides a context that enables and reinforces reproducible workflows, analysis, and reporting.
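
For a taste of the style the series teaches, here is a small dplyr pipeline; the nycflights13 data set is a standard tidyverse teaching example, not necessarily what we will use in the workshops.

```r
# A small taste of the tidyverse style the series covers: filter, group,
# and summarize in one pipeline. nycflights13 is a standard teaching data
# set, not necessarily the data the workshops will use.
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(
    n_flights  = n(),
    mean_delay = mean(dep_delay)
  ) %>%
  arrange(desc(mean_delay))
```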

January Line-up

| Title | Date & Time | Registration | Past Workshop |
| --- | --- | --- | --- |
| Intro to R | Jan 19, 1-3pm | register | Resources |
| R Markdown with Dr. Çetinkaya-Rundel | Jan 23, 9am | register | |
| Shiny with Dr. Çetinkaya-Rundel | Jan 25, 9am | register | |
| Mapping with R | Jan 25, 1-3pm | register | Resources |
| Reproducibility & Git | Jan 29, 1-3pm | register | Resources |
| Visualization with ggplot2 | Feb 1, 9:30-11:30am | register | Resources |

An official announcement with links to registration is forthcoming. Feel free to subscribe to the Rfun or DVS-Announce lists, or watch the DVS Workshop page for official registration links as soon as they are available.

Workshop Arrangement

This workshop series is intended to be iterative and recursive. We recommend starting with the Introduction to R, then proceeding through the remaining workshops in any order of interest.

Recordings and Past Workshops

We presented a similar version of this workshop series last fall and recorded each session whenever possible. You can stream past workshops and engage with the shareable data sets at your own pace (see the Past Workshop resources links above). Alternatively, all the past workshop resource links are collected in one listicle: Rfun recap.

Highlights from Expanding our Research Data Management Program

Since the launch of our expanded research data management (RDM) program in January, the Research Data Management Team in DVS has been busy defining and implementing our suite of services. Our “Lifecycle Services” are designed to assist scholars at all stages of their research project from the planning phase to the final curation and disposition of their data in an archive or repository. Our service model centers on four key areas: data management planning, data workflow design, data and documentation review, and data repository support. Over the past nine months, we have worked with Duke researchers across disciplines to provide these services, allowing us to see their value in action. Below we present some examples of how we have supported researchers within our four support areas.

Data Management Planning

With increasing data management plan requirements, as well as growing expectations that funding agencies will more strictly enforce and evaluate these plans, researchers are seeking assistance in ensuring their plans comply with funder requirements. Through in-person consultations and online review through the DMPTool, we have helped researchers enhance their DMPs for a variety of funding agencies, including the NSF Sociology Directorate, the Department of Energy, and the NSF Computer & Information Science & Engineering (CISE) Program.

Data Workflow Design

As research teams begin a project, there are a variety of organizational and workflow decisions that need to be made, from selecting appropriate tools to implementing storage and backup strategies (to name a few). Over the past six months, we have had the opportunity to help a multi-institutional Duke Marine Lab Behavioral Response Study (BRS) implement their project workflow using the Open Science Framework (OSF). We have worked with project staff to think through the organization of materials, provided training on the use of the tool, and strategized on storage and backup options.

Data and Documentation Review

During a project, researchers make decisions about how to format, describe, and structure their data for sharing and preservation. Questions may also arise surrounding how to ethically share human subjects data and navigate intellectual property or copyright issues. In conversations with researchers, we have provided suggestions for what formats are best for portability and preservation, discussed their documentation and metadata plans, and helped resolve intellectual property questions for secondary data.

Data Repository Support

At the end of a project, researchers may be required, or may choose, to deposit their data in an archive or repository. We have advised faculty and students on repository options based on their discipline, data type, and repository features. One option available to the Duke community is the Duke Digital Repository. Over the past nine months, we have assisted with the curation of a variety of datasets deposited within the DDR, many of which underlie journal publications.

This year Duke news articles have featured two research studies with datasets archived within the DDR, one describing a new cervical cancer screening device and another presenting cutting-edge research on a potential new state of matter. The accessibility of both Asiedu et al.’s screening device data and Charbonneau and Yaida’s glass study data enhances the overall transparency and reproducibility of these studies.

Our experiences thus far have enabled us to better understand the diversity of researchers’ needs and allowed us to continue to hone and expand our knowledge base of data management best practices, tools, and resources. We are excited to continue to work with and learn from researchers here at Duke!

Open Science Framework @ Duke

The Open Science Framework (OSF) is a free, open source project management tool developed and maintained by the Center for Open Science (COS). OSF offers many features that can help scholars manage their workflow and outputs throughout the research lifecycle. From collaborating effectively, to managing data, code, and protocols in a centralized location, to sharing project materials with the broader research community, the OSF provides tools that support openness, research integrity, and reproducibility. Some of the key functionalities of the OSF include:

  • Integrations with third-party tools that researchers already use (e.g., Box, Google Drive, GitHub, Mendeley)
  • Hierarchical organizational structures
  • Unlimited native OSF storage*
  • Built-in version control
  • Granular privacy and permission controls
  • Activity log that tracks all project changes
  • Built-in collaborative wiki and commenting pane
  • Analytics for public projects
  • Persistent, citable identifiers for projects, components, and files along with Digital Object Identifiers (DOIs) and Archival Resource Keys (ARKs) available for public OSF projects
  • And more!

Duke University is a partner institution with OSF, meaning you can sign in to the OSF using your NetID and affiliate your projects with Duke. Visit the Duke OSF page to see some Duke research projects and outputs from our community.
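
If you prefer to script your interactions with the OSF, the ropensci osfr package wraps the OSF API from R; below is a minimal, hypothetical sketch (the token, project title, file path, and GUID are placeholders).

```r
# A minimal sketch of scripting OSF tasks from R with the ropensci osfr
# package; the token, project title, file path, and GUID are placeholders.
library(osfr)

# Authenticate with a personal access token generated in your OSF settings
osf_auth(token = "YOUR_OSF_TOKEN")          # placeholder token

# Create a private project and upload a file to it
project <- osf_create_project(title = "My reproducible study")
osf_upload(project, path = "analysis/clean_data.R")   # placeholder path

# Retrieve an existing project by its GUID and list its files
node <- osf_retrieve_node("abc12")          # placeholder GUID
osf_ls_files(node)
```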

Duke University Libraries has also partnered with COS to host a workshop this fall entitled “Increasing Openness and Reproducibility in Quantitative Research.” This workshop will teach participants how they can increase the reproducibility of their work and will include hands-on exercises using the OSF.

Workshop Details
Date: October 3, 2017
Time: 9 am to 12 pm
Register: http://duke.libcal.com/event/3433537

If you are interested in affiliating an existing OSF project, want to learn more about how the OSF can support your workflow, or would like a demonstration of the OSF, please contact askdata@duke.edu.

*Individual file size limit of 5 GB. Users can upload larger files by connecting third party add-ons to their OSF projects.

Fall Data and Visualization Workshops


Visualize, manage, and map your data in our Fall 2017 Workshop Series. Our workshops are designed for researchers who are new to data-driven research as well as those looking to expand their skills with new methods and tools. With workshops exploring data visualization, digital mapping, data management, R, and Stata, the series offers a wide range of data tools and techniques. This fall, we are extending our partnership with the Graduate School and offering several workshops in our data management series for RCR credit (please see course descriptions for further details).

Everyone is welcome at Duke Libraries workshops. We hope to see you this fall!

Workshop Series by Theme

Data Management

09-13-2017 – Data Management Fundamentals
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-26-2017 – Writing a Data Management Plan
10-03-2017 – Increasing Openness and Reproducibility in Quantitative Research
10-18-2017 – Finding a Home for Your Data: An Introduction to Archives & Repositories
10-24-2017 – Consent, Data Sharing, and Data Reuse 
11-07-2017 – Research Collaboration Strategies & Tools 
11-09-2017 – Tidy Data Visualization with Python

Data Visualization

09-12-2017 – Introduction to Effective Data Visualization 
09-14-2017 – Easy Interactive Charts and Maps with Tableau 
09-20-2017 – Data Visualization with Excel
09-25-2017 – Visualization in R using ggplot2 
09-29-2017 – Adobe Illustrator to Enhance Charts and Graphs
10-13-2017 – Visualizing Qualitative Data
10-17-2017 – Designing Infographics in PowerPoint
11-09-2017 – Tidy Data Visualization with Python

Digital Mapping

09-12-2017 – Intro to ArcGIS Desktop
09-27-2017 – Intro to QGIS 
10-02-2017 – Mapping with R 
10-16-2017 – Cloud Mapping Applications 
10-24-2017 – Intro to ArcGIS Pro

Python

11-09-2017 – Tidy Data Visualization with Python

R Workshops

09-11-2017 – Intro to R: Data Transformations, Analysis, and Data Structures  
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-25-2017 – Visualization in R using ggplot2 
10-02-2017 – Mapping with R 
10-17-2017 – Intro to R: Data Transformations, Analysis, and Data Structures
10-19-2017 – Developing Interactive Websites with R and Shiny 

Stata

09-20-2017 – Introduction to Stata
10-19-2017 – Introduction to Stata

Love Your Data Week (Feb. 13-17)

In cooperation with the Triangle Research Library Network, Duke Libraries will be participating in Love Your Data Week on February 13-17, 2017. Love Your Data Week is an international event to help researchers take better care of their data. The campaign focuses on raising awareness and building community around data management, sharing, preservation, and reuse.

The theme for Love Your Data Week 2017 is data quality, with a related message for each day.

  • Monday: Defining Data Quality
  • Tuesday: Documenting, Describing, and Defining
  • Wednesday: Good Data Examples
  • Thursday: Finding the Right Data
  • Friday: Rescuing Unloved Data

Throughout the week, Data and Visualization Services will be contributing to the conversation on Twitter (@duke_data). We will also host the following local programming related to the daily themes:

In honor of Love Your Data Week, chocolates will be provided at these workshops!

The new Research Data Management staff at the Duke Libraries are available to help researchers care for their data through consultations, support services, and instruction. We can assist with writing data management plans that comply with funder policies, advise on data management best practices, and facilitate the ingest of data into repositories. To learn more about general data management best practices, see our newly updated RDM guide.

Contact us at askdata@duke.edu to find out how we can help you love your data! 

Get involved in Love Your Data Week by following the conversation at #LYD17, #loveyourdata, and #trlndata.

All promotional Love Your Data 2017 materials used under a Creative Commons Attribution 4.0 International License.

Citation: Bass, M., Neeser, A., Atwood, T., & Coates, H. (2017). Love Your Data Week Promotional Materials [image files]. Retrieved from https://osf.io/r8tht/files/