Can’t we just make a Venn diagram?

2018-02-19 Eric Monson, Ph.D.

When I’m teaching effective visualization principles, one of the most instructive processes is critiquing published visualizations and reviewing reworks done by professionals. I’ll often show examples from Cole Nussbaumer Knaflic’s blog, Storytelling with Data, and Jon Schwabish’s blog, PolicyViz. Both brilliant! (Also, check out my new favorite blog Chartable, by Lisa Charlotte Rost, for wonderful visualization discussions and advice!)

What we don’t usually get to see is the progression of an individual visualization throughout the design process, from data through rough drafts to final product. I thought it might be instructive to walk through an example from one of my recent consults. Some of the details have been changed because the work is unpublished and the jargon doesn’t help the story.

Data full of hits and misses

A researcher came to me for help with an academic paper figure. He and his collaborator were comparing five literature-accepted methods for identifying which patients might have a certain common disease from their medical records. Out of 60,000 patients, about 10% showed a match with at least one of the tests. The resulting data was a spreadsheet with a column of patient IDs, and five columns of tests, with a one if a patient was identified as having the disease by that particular test, and a zero if their records didn’t match for that test. As you can see in the figure, there were many inconsistencies between who seemed to have the disease across the five tests!
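For readers who want to picture the spreadsheet, here is a minimal Python sketch of its layout. The patient IDs and values are invented for illustration (the real data are unpublished), and `hit_count` is a hypothetical helper name, not something from the researchers' workflow.

```python
# Toy stand-in for the spreadsheet: (patient_id, a, b, c, d, e),
# where 1 = the test flagged the patient and 0 = no match.
# All values here are invented for illustration.
rows = [
    ("p001", 1, 1, 1, 0, 0),
    ("p002", 0, 0, 0, 0, 0),
    ("p003", 0, 1, 1, 1, 0),
    ("p004", 0, 0, 0, 0, 1),
    ("p005", 0, 0, 0, 0, 0),
]


def hit_count(row):
    """Number of tests (out of five) that flagged this patient."""
    return sum(row[1:])


# Keep only patients flagged by at least one test.
flagged = [row for row in rows if hit_count(row) >= 1]
share_flagged = len(flagged) / len(rows)

print(f"{len(flagged)} of {len(rows)} patients matched at least one test "
      f"({share_flagged:.0%})")
```

In the real data set the same calculation would give roughly 10% of 60,000 patients.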

So you want to build a snowman

The researchers wanted a visualization to represent the similarities and differences between the test results. Specifically, they wanted to make a Venn diagram, which consists of ellipsoids representing overlapping sets. They had an example they’d drawn by hand, but wanted help making it into an accurate depiction of their data. I resisted, explaining that I didn’t know of a program that would accomplish what they wanted, and that it was likely mathematically impossible to take their five-dimensional data set and represent it quantitatively as a Venn diagram in 2D. Basically, you can’t get the areas of all of the overlapping regions to be properly proportional to the number of patients that had hits on each combination of the five tests. The Venn diagram works fine schematically, as a way to get across an idea of set overlap, but it would never be a data visualization that would reveal quantitative patterns from their findings. At worst, it would be a misleading distortion of their results.

Count me in

His other idea was to show the results as a table of numbers in the form of a matrix. Each of the five tests was listed across the top and the side, and the cell contents showed the quantity of patients who matched on that pair of tests. The number matching on a single test was listed on the diagonal. Those patterns can be made more visual by coloring the cells with “conditional formatting” in Excel, but the main problem with the table is that it hides a bunch of interesting data! We don’t see any of the numbers for people who hit on the various combinations of three tests, or four, or the ones that hit on all five.
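To make the limitation concrete, here is a small Python sketch of such a pairwise table. The hit patterns are invented, `pair_matrix` is a hypothetical helper, and I am assuming the diagonal holds each test's total number of hits.

```python
from math import comb

TESTS = "abcde"

# Invented hit patterns, one tuple per patient, ordered (a, b, c, d, e).
rows = [
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (0, 1, 1, 1, 0),
    (0, 0, 0, 0, 1),
    (1, 1, 1, 1, 1),
]


def pair_matrix(rows):
    """Count patients flagged by both test i and test j.

    Assumption: the diagonal (i == j) is the total number of
    patients flagged by that single test.
    """
    n = len(TESTS)
    matrix = [[0] * n for _ in range(n)]
    for hits in rows:
        for i in range(n):
            for j in range(n):
                if hits[i] and hits[j]:
                    matrix[i][j] += 1
    return matrix


matrix = pair_matrix(rows)
for name, line in zip(TESTS, matrix):
    print(name, line)

# Combinations of three, four, or five tests never get a cell of their own.
hidden = sum(comb(len(TESTS), k) for k in (3, 4, 5))
print(hidden, "combinations of three or more tests are not shown separately")
```

With five tests there are 31 possible non-empty combinations, but the table only has room for the 5 single tests and the 10 pairs; the other 16 are folded into those cells.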

I suggested we start exploring the hit combinations by creating a heatmap of the original data, sorting the patients (along the horizontal axis) by how many tests they hit and listing the tests up the vertical axis. Black circles are overlaid showing the number of tests hit for any given patient.
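The sorting step is the heart of this view. A rough Python sketch, using invented patients and a plain-text grid in place of the real heatmap, might look like this. Breaking ties by the hit pattern itself is my own assumption, added so identical combinations sit next to each other.

```python
TESTS = "abcde"

# Invented patients: ID -> hits ordered (a, b, c, d, e).
patients = {
    "p001": (1, 1, 1, 0, 0),
    "p002": (0, 0, 0, 0, 1),
    "p003": (0, 1, 1, 0, 0),
    "p004": (1, 1, 1, 1, 1),
    "p005": (0, 1, 0, 0, 0),
}

# Sort by number of tests hit; ties are broken by the pattern itself
# (an assumption, so that identical combinations end up adjacent).
ordered = sorted(patients, key=lambda pid: (sum(patients[pid]), patients[pid]))

# One text row per test, one column per patient in sorted order.
grid = {
    test: "".join("#" if patients[pid][i] else "." for pid in ordered)
    for i, test in enumerate(TESTS)
}

# Print the tests bottom-to-top so "a" ends up on the lowest row.
for test in reversed(TESTS):
    print(test, grid[test])
```

Each `#` marks a hit, so blocks of `#` that line up across rows hint at tests that tend to agree.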

There are too many patients / lines here to show clearly the combinations of tests, but this visualization already illuminated two things that made sense to the researchers. First, there is a lot of overlap between ALPHA (a) and BETA-based (b) tests, and between GAMMA Method (c) and Modified GAMMA (d), because these test pairs are variations of each other. Second, the overlaps indicate a way the definitions are logically embedded in each other; (a) is a subset of (b), and (b) is for the most part a subset of (c).

My other initial idea was to show the numbers of patients identified in each of the test overlaps as bubbles in Tableau. Here I continue the shorthand of labeling each test by the letters [a,b,c,d,e], ordered from ALPHA to OMEGA. The number of tests hit is both separated in space and encoded in the color (low to high = light to dark).
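The bubble sizes come from counting how many patients share each exact combination of hits. Here is a hedged Python sketch of that tally, with invented rows; the `label` helper is hypothetical and writes an underscore for each missed test, in the style of labels like “_b___”.

```python
from collections import Counter

TESTS = "abcde"

# Invented hit patterns, one tuple per patient, ordered (a, b, c, d, e).
rows = [
    (1, 1, 1, 0, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (0, 0, 0, 0, 1),
    (0, 0, 0, 0, 0),
    (1, 1, 1, 1, 1),
]


def label(hits):
    """Letter for each test hit, underscore for each test missed."""
    return "".join(t if h else "_" for t, h in zip(TESTS, hits))


# One bubble per combination; patients with no hits at all are skipped.
combo_counts = Counter(label(hits) for hits in rows if any(hits))

# Group the bubbles by how many tests were hit (the color/position encoding).
by_hits = {}
for combo, count in combo_counts.items():
    k = sum(ch != "_" for ch in combo)
    by_hits.setdefault(k, []).append((combo, count))

for k in sorted(by_hits):
    print(k, "test(s):", sorted(by_hits[k]))
```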

Add some (effective?) dimensions

I felt the weakness of this representation was that the bubbles were not spatially associated with their corresponding tests. Inspired by multi-dimensional radial layouts such as those used in the Stanford dissertation browser, I created a chart (in Adobe Illustrator) with five axes for the tests. I wanted each bubble to “feel a pull” from each of the passed tests, so it made sense to place the five-hit “abcde” bubble at the center, and each individual “_b___”, “__c__”, “____e” bubble right by its letter – making the radius naturally correspond to the number of tests hit. Other bubbles were placed (manually) in between their combination of axes / tests.
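I placed the bubbles by hand, but the idea can be approximated with a simple rule. The Python sketch below is one hypothetical version, not what I actually did: the direction is the average pull of the axes for the tests hit, and the distance from the center shrinks as more tests are hit. The axis angles and the linear radius formula are assumptions chosen for illustration.

```python
import math


def axis_angle(i, n):
    """Angle of axis i, starting at the top and going clockwise."""
    return math.pi / 2 - 2 * math.pi * i / n


def bubble_position(label):
    """Hypothetical placement rule for a label such as 'ab___'.

    Direction: sum of the unit vectors of the axes that were hit.
    Radius: 1 for a single test, shrinking linearly to 0 for all tests.
    """
    n = len(label)
    hit_axes = [i for i, ch in enumerate(label) if ch != "_"]
    k = len(hit_axes)
    if k == 0:
        raise ValueError("a bubble needs at least one test hit")
    if k == n:
        return (0.0, 0.0)  # hit on every test: dead center
    sx = sum(math.cos(axis_angle(i, n)) for i in hit_axes)
    sy = sum(math.sin(axis_angle(i, n)) for i in hit_axes)
    direction = math.atan2(sy, sx)
    radius = (n - k) / (n - 1)
    return (radius * math.cos(direction), radius * math.sin(direction))


for combo in ["a____", "_b___", "ab___", "_bc_e", "abcde"]:
    x, y = bubble_position(combo)
    print(f"{combo}: ({x:+.2f}, {y:+.2f})")
```

A rule like this gives a starting layout; bubbles that overlap or land in ambiguous spots would still need manual nudging.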

The researchers liked this version. It was eye-catching, and the gravitation of bubbles in the b/c quadrant vaguely illustrated the pattern of hits and known test subsets. One criticism, though, was that it was a bit confusing – it wasn’t obvious how the bubbles were placed around the circles, and it might take people too long to figure out how to read the plot. It also took up a lot of page space.

Give these sets a hand

One of the collaborators, after seeing this representation, suggested trying to use it as the basis for an Euler diagram. Like a Venn diagram, it’s a visualization used to show set inclusion and overlap, but unlike in a Venn, an Euler is drawn using arbitrary shapes surrounding existing labels or images representing the set members. I thought it was an interesting idea, but I initially dismissed it as too difficult. I had already put more time than I typically spend on a consult into this visualization (our service model is to help people learn how to make their own visualizations, not produce visualizations for them). Also, I had never made an Euler diagram. While I had seen some good talks about them, I didn’t have any software on hand which would automate the process. So, I responded that the researchers should feel free to try creating curves around the sets themselves, but I wasn’t interested in pursuing it further.

About two minutes after I sent the email, I began looking at the diagram and wondering if I could draw the sets! I printed out a black and white copy and started drawing lines with colored pencils, making one enclosing shape for each test [a-e]. It turned out that my manual layout resulted in fairly compact curves, except for “_bc_e”, which had ambiguous positioning, anyway. The curve drawing was so easy that I started an Illustrator version. I kept the circles’ area the same (corresponding to their quantitative value) but pushed them around to make the set shapes more compact.
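Keeping the circles quantitative means the area, not the radius, scales with the patient count. A minimal Python sketch of that conversion follows; the function name and the `area_per_patient` scale factor are hypothetical.

```python
import math


def bubble_radius(count, area_per_patient=1.0):
    """Radius of a circle whose AREA is proportional to the count.

    area = count * area_per_patient = pi * r**2, so r = sqrt(area / pi).
    """
    return math.sqrt(count * area_per_patient / math.pi)


small = bubble_radius(100)
large = bubble_radius(400)
print(f"100 patients -> r = {small:.2f}; 400 patients -> r = {large:.2f}")
print(f"radius ratio = {large / small:.1f}")
```

Four times as many patients doubles the radius rather than quadrupling it, which is what keeps the visual comparison of areas honest.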

Ironically, I had come back almost exactly to the researchers’ original idea! The important distinction is that the bubbles keep it quantitative, with the regions only representing set overlap.

We’ve come full ellipsoid

Angela Zoss constructively pointed out that there were now too many colors, and the shades encoding number of hits weren’t necessary. She also felt the region labels weren’t clear. Those fixes, plus some curve smoothing (Path -> Simplify in Illustrator) led me to a final version we were all very happy with!

It’s still not a super simple visualization, but both the quantitative and set overlap patterns are reasonably clear. This result was only possible, though, through trying multiple representations and getting feedback on each!

If you’re interested in learning how to create visualizations like this yourself, sign up for the DVS announcements listserv, or keep an eye on our upcoming workshops list. We also have videos of many past workshops, including Angela’s Intro to Effective Data Visualization, and my Intro to Tableau, Illustrator for Charts, and Illustrator for Diagrams.

Uncategorized

Love Data Week (Feb. 12-16)

2018-02-09 Mara Sedlins, Ph.D.

Love Data Week is here again! Love Data Week is an international social media campaign to raise awareness and build community to engage on topics related to research data management, sharing, preservation, reuse, and library-based research data services.

This year the theme for Love Data Week is Data Stories, with a focus on four topics:

Stories about data
Telling stories with data
Connected conversations across different data communities
We are data: seeing the people behind the data

Since last year, we have some new data stories at Duke Libraries. The Duke Digital Repository now contains nearly 30 data sets that Duke researchers have shared for preservation and reuse. There are 23 Duke-affiliated projects in the Open Science Framework, a free web app developed by the Center for Open Science that facilitates good project and data management practices. And our Research Data Management team has continued to offer consultation and instruction services to a variety of researchers on campus.

We invite you to join the data story by attending data-related events coming up at Duke during Love Data Week:

Tuesday, February 13:

Civil and Environmental Engineering Seminar: Data, data everywhere … Making sense of observations and models across scales

Wednesday, February 14:

Introduction to the Open Science Framework*: Learn how this free, open source tool can help you manage and share your research data.

Story Maps with ArcGIS Online: Learn how to tell your data story with an interactive map that integrates other media (photos, text, videos) and shows changes over space and time.

Qualitative Data Analysis Workshop at the Social Science Research Institute: Learn how to transform interview scripts into analyzable data and other foundational skills in qualitative data analysis.

Thursday, February 15:

Data Dialogue at the Information Initiative at Duke: Design intuition, ethnography, and data science

Friday, February 16:

Visualization Friday Forum: Invisible Visualization: Making data visualizations accessible to the blind and other people with disabilities

*In honor of Love Data Week, chocolate will be offered at this event.

Keep an eye on additional workshops coming up for the rest of the spring semester!

All promotional Love Data 2018 materials used under a Creative Commons Attribution 4.0 International License.

Citation: Nurnberger, A., Coates, H. L., Condon, P., Koshoffer, A. E., Doty, J., Zilinski, L., … Foster, E. D. (2017). Love Data Week [image files]. Retrieved from https://osf.io/r8tht

workshops

Announcing Tidyverse workshops for Winter 2018

2017-12-05 John Little

Coming this winter, the Data & Visualization Services Department will once again host a workshop series on the R programming language. Our spring offering is modeled on our well-received R we having fun yet‽ (Rfun) fall workshop series. The six-part series will introduce R as a language for modern data manipulation by highlighting a set of tidyverse packages that enable functional data science. We will approach R using the free RStudio IDE, an intent to make reproducible literate code, and a bias towards the tidyverse. We believe this open tool-set provides a context that enables and reinforces reproducible workflows, analysis, and reporting.

January Line-up

Title	Date	Registration	Past Workshop
Intro to R	Jan 19 1 – 3pm	register	Resources
R Markdown with Dr. Çetinkaya-Rundel	Jan 23 9am	register
Shiny with Dr. Çetinkaya-Rundel	Jan 25 9am	register
Mapping with R	Jan 25 1-3pm	register	Resources
Reproducibility & Git	Jan 29 1-3pm	register	Resources
Visualization with ggplot2	Feb 1 9:30-11:30am	register	Resources

An official announcement with links to registration is forthcoming. Feel free to subscribe to the Rfun or DVS-Announce lists. Or look to the DVS Workshop page for official registration links as soon as they are available.

Workshop Arrangement

This workshop series is intended to be iterative and recursive. We recommend starting with the Introduction to R. Proceed through the remaining workshops in any order of interest.

Recordings and Past Workshops

We presented a similar version of this workshop series last fall and recorded each session whenever possible. You can stream past workshops and engage with the shareable data sets at your own pace (see the Past Workshop resource links above). Alternatively, all the past workshop resource links are presented in one listicle: Rfun recap.

Data Curation, Data Management

Highlights from Expanding our Research Data Management Program

2017-10-05 Sophia Lafferty-Hess

Since the launch of our expanded research data management (RDM) program in January, the Research Data Management Team in DVS has been busy defining and implementing our suite of services. Our “Lifecycle Services” are designed to assist scholars at all stages of their research project from the planning phase to the final curation and disposition of their data in an archive or repository. Our service model centers on four key areas: data management planning, data workflow design, data and documentation review, and data repository support. Over the past nine months, we have worked with Duke researchers across disciplines to provide these services, allowing us to see their value in action. Below we present some examples of how we have supported researchers within our four support areas.

Data Management Planning

With increasing data management plan requirements as well as growing expectations that funding agencies will more strictly enforce and evaluate these plans, researchers are seeking assistance ensuring their plans comply with funder requirements. Through in-person consultations and online review through the DMPTool, we have helped researchers enhance their DMPs for a variety of funding agencies including the NSF Sociology Directorate, the Department of Energy, and the NSF Computer & Information Science & Engineering (CISE) Program.

Data Workflow Design

As research teams begin a project, there are a variety of organizational and workflow decisions that need to be made, from selecting appropriate tools to implementing storage and backup strategies (to name a few). Over the past six months, we have had the opportunity to help a multi-institutional Duke Marine Lab Behavioral Response Study (BRS) implement their project workflow using the Open Science Framework (OSF). We have worked with project staff to think through the organization of materials, provided training on the use of the tool, and strategized on storage and backup options.

Data and Documentation Review

During a project, researchers make decisions about how to format, describe, and structure their data for sharing and preservation. Questions may also arise surrounding how to ethically share human subjects data and navigate intellectual property or copyright issues. In conversations with researchers, we have provided suggestions for what formats are best for portability and preservation, discussed their documentation and metadata plans, and helped resolve intellectual property questions for secondary data.

Data Repository Support

At the end of a project, researchers may be required or choose to deposit their data in an archive or repository. We have advised faculty and students on repository options based on their discipline, data type, and repository features. One option available to the Duke community is the Duke Digital Repository. Over the past nine months, we have assisted with the curation of a variety of datasets deposited within the DDR, many of which underlie journal publications.

This year Duke news articles have featured two research studies with datasets archived within the DDR, one describing a new cervical cancer screening device and another presenting cutting-edge research on a potential new state of matter. The accessibility of both Asiedu et al.’s screening device data and Charbonneau and Yaida’s glass study data enhances the overall transparency and reproducibility of these studies.

Our experiences thus far have enabled us to better understand the diversity of researchers’ needs and allowed us to continue to hone and expand our knowledge base of data management best practices, tools, and resources. We are excited to continue to work with and learn from researchers here at Duke!

Data Curation, Data Management

Open Science Framework @ Duke

2017-09-05 Sophia Lafferty-Hess

The Open Science Framework (OSF) is a free, open source project management tool developed and maintained by the Center for Open Science (COS). OSF offers many features that can help scholars manage their workflow and outputs throughout the research lifecycle. From collaborating effectively, to managing data, code, and protocols in a centralized location, to sharing project materials with the broader research community, the OSF provides tools that support openness, research integrity, and reproducibility. Some of the key functionalities of the OSF include:

Integrations with third-party tools that researchers already use (e.g., Box, Google Drive, GitHub, Mendeley)
Hierarchical organizational structures
Unlimited native OSF storage*
Built-in version control
Granular privacy and permission controls
Activity log that tracks all project changes
Built-in collaborative wiki and commenting pane
Analytics for public projects
Persistent, citable identifiers for projects, components, and files along with Digital Object Identifiers (DOIs) and Archival Resource Keys (ARKs) available for public OSF projects
And more!

Duke University is a partner institution with OSF, meaning you can sign into the OSF using your NetID and affiliate your projects with Duke. Visit the Duke OSF page to see some Duke research projects and outputs from our community.

Duke University Libraries has also partnered with COS to host a workshop this fall entitled “Increasing Openness and Reproducibility in Quantitative Research.” This workshop will teach participants how they can increase the reproducibility of their work and will include hands-on exercises using the OSF.

Workshop Details
Date: October 3, 2017
Time: 9 am to 12 pm
Register: http://duke.libcal.com/event/3433537

If you are interested in affiliating an existing OSF project, want to learn more about how the OSF can support your workflow, or would like a demonstration of the OSF, please contact askdata@duke.edu.

*Individual file size limit of 5 GB. Users can upload larger files by connecting third party add-ons to their OSF projects.

Data Curation, Data Management, data science, Data Visualization, GIS, rstats, spatial humanities, stata, tutorial, workshops

Fall Data and Visualization Workshops

2017-08-21 Joel Herndon, Ph.D.

Visualize, manage, and map your data in our Fall 2017 Workshop Series. Our workshops are designed for researchers who are new to data-driven research as well as those looking to expand skills with new methods and tools. With workshops exploring data visualization, digital mapping, data management, R, and Stata, the series offers a wide range of data tools and techniques. This fall, we are extending our partnership with the Graduate School and offering several workshops in our data management series for RCR credit (please see course descriptions for further details).

Everyone is welcome at Duke Libraries workshops. We hope to see you this fall!

Workshop Series by Theme

Love Your Data Week (Feb. 13-17)

2017-02-09 Sophia Lafferty-Hess

In cooperation with the Triangle Research Libraries Network, Duke Libraries will be participating in Love Your Data Week on February 13-17, 2017. Love Your Data Week is an international event to help researchers take better care of their data. The campaign focuses on raising awareness and building community around data management, sharing, preservation, and reuse.

The theme for Love Your Data Week 2017 is data quality, with a related message for each day.

Monday: Defining Data Quality
Tuesday: Documenting, Describing, and Defining
Wednesday: Good Data Examples
Thursday: Finding the Right Data
Friday: Rescuing Unloved Data

Throughout the week, Data and Visualization Services will be contributing to the conversation on Twitter (@duke_data). We will also host the following local programming related to the daily themes:

Tuesday February 14: Data Management Tools: Colectica for Excel: Learn about the importance of documentation and how to document your data using Colectica.
Thursday February 16: Web Scraping: Gathering webpage data, parsing, and APIs: Learn how to build a corpus of data through scraping, crawling, and parsing web content.
Spring 2017 Data Management Workshops: Check out other upcoming data management workshops on tools and strategies that can help you love your data!

In honor of Love Your Data Week, chocolates will be provided at these workshops!

The new Research Data Management staff at the Duke Libraries are available to help researchers care for their data through consultations, support services, and instruction. We can assist with writing data management plans that comply with funder policies, advise on data management best practices, and facilitate the ingest of data into repositories. To learn more about general data management best practices, see our newly updated RDM guide.

Get involved in Love Your Data Week by following the conversation at #LYD17, #loveyourdata, and #trlndata.

All promotional Love Your Data 2017 materials used under a Creative Commons Attribution 4.0 International License.

Citation: Bass, M., Neeser, A., Atwood, T., and Coates, H. (2017). Love Your Data Week Promotional Materials. [image files]. Retrieved from https://osf.io/r8tht/files/

Data Curation, Data Management

New Data Management Services @ Duke

2017-01-17 Joel Herndon, Ph.D.

Duke Libraries are happy to announce a new set of research data management services designed to help researchers secure grant funding, increase research impact, and preserve valuable data. Building on the recommendations of the Digital Research Faculty Working Group and the Duke Digital Research Data Services and Support report, Data and Visualization Services has added two new research data management consultants who are available to work with researchers across the university and medical center on a broad range of data management concerns from data creation to data curation.

Interested in learning more about data management?

Join us at the Research Computing Symposium on January 18th to learn more about new services and staff
Attend a workshop on data management:
- Data Management Fundamentals (Feb 6)
- Data Management and Reproducibility (Feb 20)
- Consent, Data Sharing and Data Reuse (Mar 21)
- Data Management Tools: The Dataverse Project (Mar 29)
Ask a question or schedule a consultation at askdata@duke.edu.

Our New Data Management Consultants

Sophia Lafferty-Hess attended the University of North Carolina at Chapel Hill where she received a Master of Science in Information Science and Master of Public Administration. Prior to coming to Duke, Sophia worked at the Odum Institute for Research in Social Science at UNC-Chapel Hill within the Data Archive as a Research Data Manager. In this position, Sophia provided consultations to researchers on data management best practices, curated research data to support long-term preservation and reuse, and provided training and instruction on data management policies, strategies, and tools.

While at Odum, Sophia also helped lead the development of a data curation and verification service for journals to help enforce data sharing and replication policies, which included verifying that data meet quality standards for reuse and that the data and code can properly reproduce the analytic results presented in the article. Sophia’s current research interests include the impact of journal data sharing policies on data availability and the development of data curation workflows.

Jen Darragh comes to us from Johns Hopkins University where she served for the past seven years as the Data Services and Sociology Librarian, and Hopkins Population Center Restricted Projects Coordinator. In this position, Jen developed the libraries’ Restricted Data Room and designed the secure data enclave spaces and staff support for the Johns Hopkins Population Center.

Jen received her Bachelor of Arts Degree in Psychology from Westminster College (PA) and her Master of Library and Information Sciences degree from the University of Pittsburgh. She has been involved with socio-behavioral research data throughout her career. Jen is particularly interested in the development of centralized, controlled data access for sensitive human subjects’ data (subject to HIPAA or FERPA requirements) to facilitate broader, yet more secure sharing of existing research data as a means to produce new, cutting-edge research.

Data Curation, Data Management, Uncategorized

Duke Libraries and SSRI welcome Mara Sedlins!

2016-09-08 Joel Herndon, Ph.D.

On behalf of Duke Libraries and the Social Science Research Institute, I am happy to welcome Mara Sedlins to Duke. As the library and SSRI work to develop a rich set of data management, analysis, and archiving strategies for Duke researchers, Mara’s postdoctoral position provides a unique opportunity to work closely with researchers across campus to improve both training and workflows for data curation at Duke. – Joel Herndon, Head of Data and Visualization Services, Duke Libraries

I am excited to join the Data and Visualization Services team this fall as a postdoctoral fellow in data curation for the social sciences (sponsored by CLIR and funded by the Alfred P. Sloan Foundation). For the next two years, I will be working with Duke Libraries and the Social Science Research Institute to develop best practices for managing a variety of research data in the social sciences.

My research background is in social and personality psychology. I received my PhD at the University of Washington, where I worked to develop and validate a new measure of automatic social categorization – to what extent do people, automatically and without conscious awareness, sort faces into socially constructed categories like gender and race? The measure has been used in studies examining beliefs about human genetic variation and the racial labels people assign to multiracial celebrities like President Barack Obama.

While in Seattle, I was also involved in several projects at Microsoft Research assessing computer-supported cooperative work technologies, focusing on people’s preferences for different types of avatar representations, compared to video or audio-only conferencing. I also have experience working with data from a study of risk factors for intimate partner violence, managing a database of donors and volunteers for a historical archive, and organizing thousands of high-resolution images for a large-scale digital comic art restoration project.

I look forward to applying the insights gained from working on a diverse array of data-intensive projects to the problem of developing and promoting best practices for data management throughout the research lifecycle. I am particularly interested in questions such as:

How can researchers write actionable data management plans that improve the quality of their research?
What strategies can be used to organize and document data files during a project so that it’s easy to find and understand them later?
What steps need to be taken so that data can be discovered and re-used effectively by other researchers?

These are just a few of the questions that are central to the rapidly evolving field of data curation for the sciences and beyond.

Data Analysis, Data Sources, Data Visualization, GIS, Uncategorized, workshops

Fall 2016 DVS Workshop Series

2016-08-24 Joel Herndon, Ph.D.

Data and Visualization Services is happy to announce its Fall 2016 Workshop Series. Learn new ways of enhancing your research with a wide range of data-driven research methods, data tools, and data sources.

Can’t attend a session? We record and share most of our workshops online. We are also happy to consult in person on any of these topics. We look forward to seeing you in the workshops, in the library, or online!

Data Sources

Web Scraping and Gathering Data from Websites (Sep 27)

Data Cleaning and Analysis

OpenRefine: Data/Text Cleaning, Mining and Transformations (Sep 9)

Regular Expressions (Sep 26)

Data Analysis

Introduction to Stata (Two sessions: Sep 21, Oct 18)

Introduction to R: Data Transformations, Analysis, and Data Structures (Sep 13)

Mapping and GIS

Introduction to ArcGIS (Two sessions: Sep 14, Oct 13)

Introduction to QGIS (Sep 29)

ArcGIS Online (Oct 17)