Jon Schwabish – Excel Data Visualization Hero!

Two data visualization topics that people occasionally request are presentation design and Excel skills. We have a couple of older videos, Basic Data Cleaning and Analysis for Data Tables and Advanced Excel for Data Projects, on our CDVS Online Learning page, and the storytelling and graphic design principles I cover in my Effective Academic Posters presentation apply equally well to presentations. But in case you haven’t heard of him before, I want to tell you about a master of both topics, one of my data visualization heroes: Jon Schwabish, founder of PolicyViz.

Besides his emphasis on clear communication of results, one of the things I admire most about Schwabish is his focus on Microsoft Excel as a legitimate tool for crafting that communication. While not free and open-source, it’s a piece of software that many people have access to, and despite some of its limitations (e.g. reproducibility issues), it is a very capable tool for data processing and visualization. If you want to make lots of people better communicators, teach them how to use the tools they already have!

Of course, visit policyviz.com and the PolicyViz YouTube channel to access the plethora of resources Jon is constantly generating, but to get you started I want to point out a few of my favorites.

I get frustrated that Excel doesn’t have a built-in, easy way to make horizontal dot plots with error bars. (On the med side they tend to call these forest plots, although they are useful whenever you have categories and a quantity with confidence intervals. Don’t just create a table – make it visual with a plot!) Jon’s Labeling Dot Plots blog post and accompanying YouTube video were super useful – they taught me a general approach for using scatterplots in Excel to create a variety of chart types that Excel doesn’t support natively! The method is a pain the first time you do it, and I get a bit belligerent because I hate that you have to employ this workaround, but it’s so brilliant and flexible that I’m tempted to teach a CDVS workshop on just this one chart type. More broadly, he also has an Excel Tutorials section of his YouTube channel, and he sells a PDF on his website called A Step-by-Step Guide to Advanced Data Visualization in Excel 2016.

One of the best ways to become a better visualizer and communicator is to get feedback on your work and iterate through multiple drafts. To complement that, it’s wonderful to get an expert’s take on a published visualization, along with proposed alternatives. For years Jon has been publishing brilliant visualization redesigns on his blog. He doesn’t just criticize – he shows you alternatives and talks about their strengths and weaknesses. There is also a DataViz Critiques section on his YouTube channel.
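As an aside for folks who also use R: the chart type itself (not Jon’s Excel technique) takes only a few lines of ggplot2. This is just a sketch with made-up numbers to show the idea.

```r
library(ggplot2)

# Hypothetical summary data: category, point estimate, and 95% CI bounds
dat <- data.frame(
  group    = c("A", "B", "C"),
  estimate = c(2.1, 3.4, 1.7),
  lower    = c(1.5, 2.8, 1.0),
  upper    = c(2.7, 4.0, 2.4)
)

# Horizontal dot plot with error bars (a simple "forest plot")
ggplot(dat, aes(x = estimate, y = group)) +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.15) +
  geom_point(size = 3) +
  labs(x = "Estimate (95% CI)", y = NULL)
```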

In early 2021 he released over 50 daily videos in a series called One Chart at a Time where visualization experts “expand your graphic literacy” and “help you learn about more than just the standard bar, line, and pie chart.”

In 2021, Schwabish and Alice Feng published the “Do No Harm Guide: Applying Equity Awareness in Data Visualization”. You can download the report at urban.org and listen to a talk they gave about it on YouTube to get their reflections on “how data practitioners can approach their work through a lens of diversity, equity, and inclusion … to encourage thoughtfulness in how analysts work with and present their data.”

Finally, people are always asking me what books they should read to get better at visualization. Take a look at Schwabish’s books, along with his lists of recommended DataViz books and Presentation books!

Code Repository vs Archival Repository. You need both.

Years ago I heard the following quote attributed to Seamus Ross from 2007:

Digital objects do not, in contrast to many of their analog counterparts, respond well to benign neglect. 

National Wildlife Property Repository. USFWS Mountain-Prairie. https://flic.kr/p/SYVPBB

Meaning, you cannot simply leave digital files to their bit-rot tendencies while expecting them to be usable in the future.  Digital repositories are part of a solution to this problem.  But to review, there are many types of repositories, both digital and analog:  repositories of bones, insects, plants, books, digital data, etc.  Even among the subset of digital repositories there are many types.  Some digital repositories keep your data safe for posterity and replication.  Some help you manage the distribution of analysis and code.  Knowing about these differences will affect not only the ease of your computational workflow, but also the legacy of your published works.  

Version-control repositories and their hubs

The most widely known social coding hubs include GitHub, Bitbucket and GitLab.  These hubs leverage Git version-control software to track the evolution of project repositories – typically a software or computational analysis project.  Importantly, Git and GitHub are not the same thing but they work well together.

GIT Repository. Treviño. https://flic.kr/p/SSras

Version control works by monitoring any designated folder or project directory, making that directory a local repository, or repo.  Among other benefits, using version control enables “time travel”: interacting with earlier versions of a project becomes commonplace, and it’s simple to retrieve a deleted paragraph from a report written six months ago.  There are many advanced features as well. For example, unlike common file-syncing tools, version control makes it easy to recreate an earlier state of an entire project directory – every file as it existed at a particular point in time.  This feature, among others, makes Git version control a handy tool in support of many research workflows and their outputs:  documents, visualizations, dashboards, slides, analyses, code, software, etc.
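As a small illustration, here is roughly what that looks like from R using the gert package (one of several interfaces to Git). This is a minimal sketch; the folder name and commit message are placeholders.

```r
library(gert)

repo <- "~/projects/my-analysis"   # hypothetical project folder

# Turn the folder into a local repository and record a first snapshot
git_init(repo)
git_add(list.files(repo, recursive = TRUE), repo = repo)  # stage every file in the folder
git_commit("First snapshot of the project", repo = repo)

# The timeline comes back as a data frame: one row per commit,
# with author, time, and message
git_log(repo = repo)

# An earlier snapshot can then be restored with standard Git tools,
# e.g. `git checkout <commit>` on the command line
```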

Binary. Michael Coghlan. https://flic.kr/p/aYEytM

Git is one of the most popular open-source version-control applications.  It was originally developed in 2005 to facilitate the evolution of the world’s most far-reaching and successful open-source coding project: Linux, a worldwide collaboration that spans multiple developers, project managers, natural languages, geographies, and time zones.  While Git can handle large projects, it is extensible and scales easily up or down to support a wide range of workflows.  Nor is Git just for software and code files – essentially any file on a file system can be tracked with Git:  Word documents, PDF files, images, datasets, etc.

 

There are many ways to share a Git repository and profile your work.  The term push refers to the process of synchronizing a repo up to a remote social coding hub.  Hubs add features such as issue tracking, collaboration, documentation hosting, and Kanban-style project planning.  Conveniently, pushing a repo to GitHub also gives you a seamless two-location backup – each push efficiently synchronizes the timeline and file versions. And at the repository owner’s discretion, any collaborator or interested party can be granted access to the GitHub repository.
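Continuing the sketch above (again with a placeholder folder and URL), linking a local repo to a hub and keeping the two copies in sync takes only a couple of calls:

```r
library(gert)

repo <- "~/projects/my-analysis"   # same hypothetical folder as above

# Link the local repo to a (hypothetical) GitHub repository and push
git_remote_add("https://github.com/your-username/my-analysis.git",
               name = "origin", repo = repo)
git_push(remote = "origin", repo = repo)

# After further edits, the same two steps keep both copies in sync
git_commit_all("Revise figures for draft two", repo = repo)
git_push(repo = repo)
```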

Many public instances of social-coding hubs operate on a freemium model.  At GitHub most users pay nothing.  It’s also possible to run a local instance of a coding hub.  For example, OIT offers a local instance of GitLab, delivering many of the same features while enabling permissions, authorization, and access via Duke’s NetID.

While social coding hubs are great tools for distributing files and managing project life cycles, they do not, in and of themselves, ensure long-term reproducible access to research data.  For that, synchronize your version-control repositories with archival research data repositories.

Research Data Repositories


Preserving the computational artifacts of formal academic works requires a repository focus that is complementary to version-control repositories and social-coding hubs.  Version control, after all, is not a requirement of a data repository whose goal is long-term preservation. Fortunately, many special-purpose data repositories exist.  Discipline-specific research repositories are sometimes associated with academic societies, and more generalized archival research repositories, such as Zenodo.org, also exist.  Additionally, many research universities host institutional research data repositories.  Not surprisingly, such a repository exists at Duke, where the Duke University Libraries promote and cooperatively shepherd Duke’s Research Data Repository (RDR).

Colossus. Chris Monk. https://flic.kr/p/fJssqg

Data repositories operate under different funding models than social coding hubs and are motivated by different horizons.  Coding hubs like GitHub do not promise long-term retention; instead, they focus on immediate distribution of version-control repos and offer project management features. Research data repositories take the long view, centered on the artifacts of formal research and publication.

By archiving the data milestones of publication, a deposit in the RDR links a formal publication – a book edition, chapter, serial article, etc. – with the data and code (i.e., a compendium) used to produce that single tangible instance of publication.  In turn, the building blocks of computational thinking and research processes are preserved for posterity, because the RDR maintains an assurance of long-term sustainability.

Bill Atkinson, creator of MacPaint, painted in MacPaint. Photo by Kyra Rehn. https://flic.kr/p/e9urBF

In the Duke RDR, particular effort is focused on preserving the unique versions of data associated with each formal publication.  In this way, authors can associate a digital object identifier, or DOI, with the precise code and data used to draft an accepted paper or research project.  Once a deposit is in the RDR, researchers across the globe can look to these archives to verify, to learn, to refute, to cite, or to be inspired toward new avenues of investigation.

By preserving the workflow artifacts tied to publication milestones, research data repositories preserve the record of academic progress.  Importantly, the preservation of these digital outcomes is strongly encouraged by funding agencies, and such archival access points are increasingly a requirement for funding, especially for publicly funded research.  The Duke RDR exists to preserve the academic record, make it accessible, and build a library of reproducible academic research.

Conclusion

The imperatives for preserving research data derive from different motives than those driving version-control repositories.  At a minimum, version-control repositories do not promise academic posterity, yet intentional engagement with the preserved academic record is one of the drivers of scholarship.  However unlikely, your GitHub repository could vanish in the blink of the next Wall Street acquisition. Research data repositories offer different affordances.  The two types of repositories complement each other, and, better still, they can be synchronized to enable and preserve the digital processes that comprise many forms of data-driven research.  Using both implies workflows that contribute positively to a scholarly legacy. It is this promise of academic transmission that drives Duke’s RDR and benefits scholars by providing access to persistent copies of research.

 

CDVS Data Workshops: Spring 2022

As we begin the new year, the Center for Data and Visualization Sciences is happy to announce a series of twenty-one data workshops designed to empower you to reach your goals in 2022. With a focus on data management, data visualization, and data science, we hope to provide a diverse set of approaches that can save time, increase the impact of your research, and further your career goals.

While the pandemic has shifted most of our data workshops online, we remain keenly interested in offering workshops that reflect the needs and preferences of the Duke research community. In November, we surveyed our 2021 workshop participants to understand how we can better serve our attendees this spring. We’d like to thank those who participated in our brief email survey and share a few of our observations based on the responses we received.

Workshop Formats

While some of our workshop participants (11%) prefer in-person workshops and others (24%) expressed a preference for hybrid workshops, a little over half of the survey respondents (52%) prefer live Zoom workshops. Our goal for the spring is to continue offering “live” Zoom sessions while exploring possibilities for increasing the number of hybrid and in-person options. We plan to reevaluate our workshop community’s preferences later this year and will continue to adjust formats as appropriate.

Workshop format preferences
52% of respondents prefer online instruction, while another 24% would like hybrid options

Participant Expectations

With the rapid shift to online content in the last two years coupled with a growing body of online training materials, we are particularly interested in how our workshop attendees evaluate online courses and their expectations for these courses.  More specifically, we were curious about whether registering for an online session includes more than simply the expectation of attending the online workshop.

We are delighted to learn that the majority of our respondents (87%) intend to attend the workshops they register for (our turnout rate has traditionally been about 50%), but we also learned that a growing number of participants have other expectations (note: for this question, participants could choose more than one response). Roughly sixty-seven percent of the sample indicated they expected a recording of the session to be available, and another sixty-six percent expected a copy of the workshop materials (slides, data, code) even if they were unable to attend.

As a result of our survey, CDVS will make an increasing amount of our content available online this spring. In 2021, we launched a new webpage designed to showcase our learning materials. In addition to our online learning site, CDVS maintains a GitHub site (CDVS) as well as a site focused on R learning materials (Rfun).

We appreciate your feedback on the data workshops and look forward to working with you in the upcoming year!

5 CDVS Online Learning Things

Within the Center for Data and Visualization Sciences (CDVS) we pride ourselves on providing numerous educational opportunities for the Duke community. Like many others during the COVID-19 pandemic, we have spent a large amount of time considering how to translate our in-person workshops to online learning experiences, explored the use of flipped classroom models, and learned together about the wonderful (and sometimes not so wonderful) features of common technology platforms (we are talking about you, Zoom).

We also wanted to more easily surface on the web the various online learning resources we have developed over the years. Recognizing that learning takes place both synchronously and asynchronously, we have made available numerous guides, slide decks, example datasets, and both short-form and full-length workshops on our Online Learning Page. Below we highlight 5 online learning resources that we thought others interested in data-driven research may wish to explore:

  • Mapping & GIS: R has become a popular and reproducible option for mapping and spatial analysis. Our Geospatial Data in R guide and workshop video introduce the use of the R language for producing maps. We cover the advantages of a code-driven approach such as R for visualizing geospatial data and demonstrate how to quickly and efficiently create a variety of map types for a website, presentation, or publication. 
  • Data Visualization: Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. But, there are many steps in that journey from exploration to information to influence, and many choices to make when putting it all together to tell your story. In our Effective Data Visualization workshop, we cover some basic guidelines for effective visualization, point out a few common pitfalls to avoid, and run through a critique and iterations of an existing visualization to help you start seeing better choices beyond the program defaults.
  • Data Science: QuickStart with R is our beginning data science module focusing on the Tidyverse — a data-first approach to data wrangling, analysis, and visualization.  Beyond introducing the Tidyverse approach to reproducible data workflows, we offer a rich allotment of other R learning resources at our Rfun site: workshop videos, case studies, shareable data, and code. Links to all our data science materials can also be found collated on our Online Learning page (above).
  • Data Management: Various stakeholders are stressing the importance of practices that make research more open, transparent, and reproducible, including the NIH, which has released a new data management & sharing policy. In collaboration with the Office of Scientific Integrity, our Meeting Data Management Plan Requirements workshop presents details on the new NIH policy, describes what makes a strong plan, and points to guidance, tools, resources, and assistance for building funder-based plans.
  • Data Sources: The U.S. Census has been collecting information on persons and businesses since the late 18th century, and tackling this huge volume of data can be daunting. Our guide to U.S. Census data highlights many useful places to view or download this data, with the Product Comparisons tab providing a quick chart-style overview of product contents and features. Other tabs provide more details about these dissemination products, as well as about sources for Economic Census data.

In the areas of data science, mapping & GIS, data visualization, and data management, we cover many other topics and tools including ArcGIS, QGIS, Tableau, Python for tabular data and visualization, Adobe Illustrator, MS PowerPoint, effective academic posters, reproducibility, ethics of data management and sharing, and publishing research data. Access more resources and past recordings on our online learning page or go to our upcoming workshops list to register for a synchronous learning opportunity.

Change is coming – are you open to it?

This blog post is a collaboration between Paolo Mangiafico from ScholarWorks and Sophia Lafferty-Hess from the Center for Data and Visualization Sciences and the Duke Research Data Repository.

Open access journals have been around for several decades, and almost all researchers have read them or published in them by now. Perhaps less well known are trends toward more openness in sharing of data, methods, code, and other aspects of research – broadly called open scholarship. There are lots of good reasons to make your research outputs as open as possible, and increasing support at Duke for doing it.

There are many different variants of “open” – including goals of making research accessible to all, making data and methods transparent to increase reproducibility and trust, licensing research to enable broad re-use, and engaging with a variety of stakeholders, among other things. All of these provide benefits to the public as well as to Duke researchers. There’s growing evidence that openly available publications and data result in more citations and greater impact (Colavizza 2020), and showing one’s work and making it available for replication helps build greater trust. There’s greater potential economic impact when others can build on research more quickly, and more avenues for collaboration and interdisciplinary engagement.

Recognizing the importance of making research outputs quickly and openly available to other researchers and the public, and supporting greater transparency in research, many funding agencies are now encouraging or requiring it. NIH has had a public access policy for over a decade, and NSF and other agencies have followed with similar policies. NIH has also released a new Data Management and Sharing policy that goes into effect in 2023 with more robust and clearer expectations for how to effectively share data. In Europe, government research funders back a program called Plan S, and in the United States, the recently passed U.S. Innovation and Competition Act (S. 1260) includes provisions that instruct federal agencies to provide free online public access to federally-funded research “not later than 12 months after publication in peer-reviewed journals, preferably sooner.”

The USICA bill aims to maximize the impact of federally-funded research by ensuring that final author manuscripts reporting on taxpayer-funded research are:

  • Deposited into federally designated or maintained repositories;
  • Made available in open and machine-readable formats; 
  • Made available under licenses that enable productive reuse and computational analysis; and
  • Housed in repositories that ensure interoperability and long-term preservation.

Duke got a head start on supporting researchers in making their publications open access in 2010, when Academic Council adopted an open access policy, which since then has been part of the Faculty Handbook (Appendix P). The policy provides the legal basis for Duke faculty to make their own research articles openly available on a personal or institutional website via a non-exclusive license, while also making it possible to comply with any requirements imposed by their journal or funder. Shortly after the policy was adopted, Duke Libraries worked with the Provost’s office to implement a service making open access easy for Duke researchers. DukeSpace, a repository integrated with the Scholars@Duke profile system, allows you to add a publication to your profile and deposit it to Duke’s open access archive in a single step, and have the open access link included in your citations alongside the link to the published version.

Duke Libraries also support a research data repository and services to help the Duke community organize, describe, and archive their research data for open access. This service, with support from the Provost’s office, provides both the infrastructure and curation staff to help Duke researchers make their data FAIR (Findable, Accessible, Interoperable, and Reusable). By publishing datasets with digital object identifiers (DOIs) and data citations, we create a value chain where making data available increases their impact and positions them as standalone research objects. The importance of data sharing specifically is also being formalized at Duke through the current Research Data Policy Initiative, which has a stated mission to “facilitate efficient and quality research, ensure data quality, and foster a culture of data sharing.” Together the Duke community is working to develop services, processes, procedures, and policies that broaden our contributions to society through public access to the outputs of our research.

Are you ready to make your work open? You can find more information about how to deposit your publications and data for open access at Duke on the ScholarWorks website, and consultants from Duke Libraries’ ScholarWorks Center for Scholarly Publishing and Center for Data and Visualization Sciences are available to help you find the best place to make your work open access, choose an appropriate license, and track how it’s being used.

Relational Thinking: Database Re-Modeling for Humanists

Author: Dr. Kaylee P. Alexander
Website: www.kayleealexander.com
Twitter: @kpalex91

Since the summer of 2018 I have been working with a set of nineteenth-century commercial almanacs for the city of Paris. As my dissertation focused heavily on the production of stone funerary markers during this period, I wanted to consult these almanacs to get a sense of how many workers were active in this field of production. Classified as marbriers (stonecutters), the makers of funerary monuments were often one and the same as those who executed various other stone goods and constructions. These almanacs represented a tremendous source of industry information, consistently recording enterprise names and addresses as well as, at times, the types of products an enterprise specialized in, any awards it had won for its work, and the new technologies it employed. And so I decided to make a database.

As a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences during the summer of 2020, I had the opportunity to explore some of the issues related to database construction and management faced by humanists working on data-based research projects. To work through these issues, I set up a MySQL database using my commercial almanacs data as a test case, in order to determine which platforms and methods for creating queryable databases would be best suited to those working primarily in the humanities. In the process, two needs became increasingly clear: to demonstrate the usefulness of this approach for other humanists undertaking data-driven projects, and to identify ways of transforming single spreadsheets of data into relational data models without needing to know how to code. Thus, in this blog post I offer a few key points about relational database models that may be useful for scholars in the humanities and share my experience constructing a MySQL database from a single Excel spreadsheet.

First of all, some key terms and concepts. MySQL is an open-source relational database management system that uses SQL (Structured Query Language) to create, modify, manage and extract data from a relational database. A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its corresponding attributes. When working with a relational data model you want to make sure that your tables are normalized, or organized in such a way that reduces redundancy in the data set, increases consistency, and facilitates querying.

Although for the purposes of efficient data gathering I had initially collected all of the information from the commercial almanacs in a single Excel spreadsheet, I knew that I ultimately wanted to reconfigure my data using a relational model that could be shared with, and queried efficiently by, others. The main benefits of a relational model are consistency and the ability to combine queries to surface relationships among the tables that would be difficult to detect in a single spreadsheet. An additional benefit is the ability to add records and edit information without the risk of compromising other information contained in the database.

The first question I needed to ask myself was which entities from my original spreadsheet would become the basis for my relational database. In other words, how would my relational model be organized? What different relationships existed within the dataset, and which variables functioned as entities rather than attributes? One of the key factors in determining which variables would become the entities of my relational model was whether a given variable contained repeated values throughout the master sheet. Ultimately, determining the entities wasn’t the trickiest part. It became clear early on that it would be best to first create separate tables for businesses and business locations, which would be related via a table for annual activity (in the process of splitting my tables I would end up making more tables, but these were the key starting points).
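To make that target structure concrete, here is a minimal sketch of what those three tables and a query across them might look like. The column names are hypothetical, and SQLite stands in for MySQL purely for illustration:

```r
library(DBI)

# In-memory SQLite stand-in for the MySQL database described above
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One table per entity; hypothetical column names for illustration
dbExecute(con, "CREATE TABLE businesses (
  business_id INTEGER PRIMARY KEY,
  name        TEXT
)")
dbExecute(con, "CREATE TABLE locations (
  location_id INTEGER PRIMARY KEY,
  street      TEXT,
  number      TEXT
)")
dbExecute(con, "CREATE TABLE annual_activity (
  business_id INTEGER REFERENCES businesses(business_id),
  location_id INTEGER REFERENCES locations(location_id),
  year        INTEGER
)")

# A query joining the three tables: where was each business listed in 1860?
dbGetQuery(con, "
  SELECT b.name, l.street, l.number
  FROM annual_activity a
  JOIN businesses b ON a.business_id = b.business_id
  JOIN locations  l ON a.location_id = l.location_id
  WHERE a.year = 1860
")

dbDisconnect(con)
```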

The most difficult question I encountered was how I would go about splitting all this information as someone with very limited coding experience. How could I identify unique values and populate tables with relevant attributes without having to teach myself Python in a pinch, but also without having to retype over one hundred years’ worth of business records? Ultimately, I came up with a rather convoluted system that had me going back and forth between OpenRefine and Excel. Once I got the hang of my system it became almost second nature to me, but explaining it to others was another story. This made it abundantly clear that there was a lack of resources demonstrating how one could create what were essentially a series of normalized tables from a flat data model. So, to make a very long story short, I broke down my convoluted process into a series of simple steps that require nothing more than Excel to transform a flat data model into a relational data model using the UNIQUE() and VLOOKUP() functions. This process is detailed in a library tutorial I developed geared toward humanists, consisting of both a video demonstrating the process and a PDF containing written instructions.
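(For readers who do have a little R, the same flat-to-relational split can be sketched with dplyr. This is not the Excel workflow from the tutorial, just an equivalent approach, and the column names below are hypothetical.)

```r
library(dplyr)

# 'flat' is the single master spreadsheet read into R,
# e.g. flat <- readxl::read_excel("almanac_master.xlsx")

# One row per unique business, with a surrogate key
businesses <- flat %>%
  distinct(name) %>%
  mutate(business_id = row_number())

# One row per unique address
locations <- flat %>%
  distinct(street, number) %>%
  mutate(location_id = row_number())

# The linking table: one row per business-location-year,
# keyed to the two tables above
annual_activity <- flat %>%
  left_join(businesses, by = "name") %>%
  left_join(locations, by = c("street", "number")) %>%
  select(business_id, location_id, year)
```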

In the end, all I needed to do was construct the database itself. To do this I worked with phpMyAdmin, a free web-based user interface for constructing and querying MySQL databases. Using phpMyAdmin, I was able to upload my normalized data tables, manage and query my database, and, through its user management capabilities, connect to Tableau for data visualization.


Kaylee Alexander portrait

Dr. Kaylee P. Alexander is a graduate of the Department of Art, Art History & Visual Studies, where she was also a research assistant with the Duke Art, Law & Markets Initiative (DALMI). Her dissertation research focuses on the visual culture of the cemetery and the market for funerary monuments in nineteenth-century Paris. In the summer of 2020, she served as a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences at Duke University Libraries.

Standardizing the U.S. Census

Census Tract Boundary Changes
(https://datasparkri.org/maps/)

The questions asked in the U.S. Census have changed over time to reflect both the data collecting needs of federal agencies and evolving societal norms. Census geographies have also evolved in this time period to reflect population change and shifting administrative boundaries in the United States.

 

Attempts to Provide Standardized Data

For the researcher who needs to compare demographic and socioeconomic data over time, this variability in data and geography can be problematic. Various data providers have attempted to harmonize questions and to generate standard geographies using algorithms that allow for comparisons over time. Some of the projects mentioned in this post have used sophisticated weighting techniques to make more accurate estimates. See, for instance, some of the NHGIS documentation on standardizing data from 1990 and from 2000 to 2010 geography.

NHGIS

The NHGIS Time Series Tables link census summary statistics across time and may require two types of integration: attribute integration, ensuring that the measured characteristics in a time series are comparable across time, and geographic integration, ensuring that the areas summarized by time series are comparable across time.

For attribute integration, NHGIS often uses “nominally integrated tables,” where the aggregated data is presented as it was compiled – for instance, “Durham County” data from 1960 and 2000 are matched simply on the common name of the county.

For geographically standardized tables,  when data from one year is aggregated to geographic areas from another year, NHGIS provides documentation with details on the weighting algorithms they use:

1990 to 2010 Tract changes in Cincinnati
(https://www.nhgis.org/documentation/time-series/1990-blocks-to-2010-geog)

NHGIS has resolved discrepancies in the electronic boundary files, as they illustrate here (an area of Cincinnati).

Social Explorer

The Social Explorer Comparability Data is similar to the NHGIS Time Series Tables, but with more of a drill-down consumer interface. (Go to Tables and scroll down to the Comparability Data.) Only 2000 to 2010 data are available at the state, county, and census tract level.  It provides data reallocated from the 2000 U.S. decennial census to the 2010 geographies, so you can get the earlier data in 2010 geographies for better comparison with 2010 data.

LTDB

The Longitudinal Tract Database (LTDB) developed at Brown University provides normalized boundaries at the census tract level for 1970-2010.  Question coverage over time varies. The documentation for the project is available online.

NC State has translated this data into ArcGIS geodatabase format.  They provide a README file, a codebook, and the geodatabase available for download.

Do-It-Yourself

If you need to normalize data that isn’t yet available this way, GIS software may be able to help. Using intersection and re-combination techniques, it can generate estimates of older data in more recent geographies.  In ArcGIS, this involves setting the ratio policy when creating a feature layer, which allows numeric attribute values to be apportioned among the various overlapping geographies. This assumes an even geographic distribution of the variable across the entire area (which is not as sophisticated as some of the algorithms used by groups such as NHGIS).
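For those working in R rather than ArcGIS, the sf package offers a comparable area-weighted operation. A minimal sketch, assuming you have tract polygons for both years with a population count to re-apportion:

```r
library(sf)

# tracts_2000: polygons with a numeric 'population' attribute to re-apportion
# tracts_2010: the newer boundaries to re-aggregate to
# Both layers should share the same projected CRS.

# Area-weighted interpolation: each count is split in proportion to
# overlapping area -- the same even-distribution assumption noted above
est_2010 <- st_interpolate_aw(
  tracts_2000["population"],
  tracts_2010,
  extensive = TRUE   # TRUE for counts; FALSE for rates and densities
)
```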

Another research strategy employs crosswalks to harmonize census data over time. Crosswalks are tables that let you proportionally assign data from one year to another or to re-aggregate from one type of geography to another.  Some of these are provided by the NHGIS geographic crosswalk files, the Census Bureau’s geographic relationship files, and the Geocorr utility from the Missouri Census Data Center.
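Applying a crosswalk is essentially a join, a multiplication, and a sum. Here is a minimal sketch in R, assuming a crosswalk table with hypothetical column names (a 2000 tract ID, a 2010 tract ID, and a weight giving the share of each 2000 tract assigned to a 2010 tract):

```r
library(dplyr)

# data_2000: tract-level counts on 2000 boundaries (geoid_2000, population)
# cw:        crosswalk with one row per 2000/2010 tract pair
#            (geoid_2000, geoid_2010, weight)

data_2010_est <- data_2000 %>%
  inner_join(cw, by = "geoid_2000") %>%
  mutate(allocated = population * weight) %>%   # apportion each source value
  group_by(geoid_2010) %>%
  summarise(population_est = sum(allocated))
```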

You can contact CDVS at askdata@duke.edu to inquire about the options for your project.

 

 

Share More Data in the Duke Research Data Repository!

We are happy to announce expanded features for the public sharing of large-scale data in the Duke Research Data Repository! Open science for the public good is more relevant than ever, and scientific research is increasingly happening at scale. Relatedly, journals and funding agencies are requiring researchers to share the data produced during the course of their research (see, for instance, the newly released NIH Data Management and Sharing Policy). In response to this growing and evolving data sharing landscape, the Duke Research Data Repository team has partnered with Research Computing and OIT to integrate the Globus file transfer system and streamline the public sharing of large-scale data generated at Duke. The new RDR features include:

  • A streamlined workflow for depositing large scale data to the repository
  • An integrated process for downloading large scale data (datasets over 2GB) from the repository
  • New options for exporting smaller datasets directly through your browser
  • New support for describing and using collections to highlight groups of datasets generated by a project or group (see this example)
  • Additional free storage (up to 100 GB per deposit) to the Duke community during 2021!

While using Globus for both upload and download requires a few configuration steps by end users, we have strived to simplify this process with new user documentation and video walk-throughs. This is the perfect time to share those large(r) datasets (although smaller datasets are also welcome!).

Contact us today with questions or get started with a deposit!

Publish Your Data: Researcher Highlight

This post was authored by Shadae Gatlin, DUL Repository Services Analyst and member of the Research Data Curation Team.

Collaborating for openness

The Duke University Libraries’ Research Data Curation team has the privilege of collaborating with exceptional researchers and scholars who are advancing their fields through open data sharing in the Duke Research Data Repository (RDR). One such researcher, Martin Fischer, Ph.D., Associate Research Professor in the Departments of Chemistry and Physics, recently discussed his thoughts on open data sharing with us. A trained physicist, Dr. Fischer describes himself as an “optics person”; his work ranges from developing microscopes that can examine melanin in tissues to looking at pigment distribution in artwork. He has published data in the RDR on more than one occasion and says of the data deposit process, “I can only say, it was a breeze.”

“I can only say, it was a breeze.”

Dr. Fischer recalls his first time working with the team as being “much easier than I thought it was going to be.” When Dr. Fischer and colleagues experienced obstacles trying to set up OMERO, a server to host their project data, they turned to the Duke Research Data Repository as a possible solution for storing the data. This was Dr. Fischer’s first foray into open data publishing, and he characterizes the team as responsive and easy to work with. Due to the large size of the data, the team even offered to pick up the hard drive from Fischer’s office. After they acquired the data, the team curated, archived, and then published it, resulting in Fischer’s first dataset in the RDR.

Why share data?

When asked why he believes open data sharing is important, Dr. Fischer says that “sharing data creates an opportunity for others to help develop things with you.” For example, after sharing his latest dataset, which evaluates the efficacy of masks in reducing the transmission of respiratory droplets, Fischer received requests for a non-proprietary option for data analysis, as an alternative to the team’s analysis scripts written for the commercial program Mathematica. Peers offered to help develop a Python script, which is now openly available and for which the developers used the RDR data as a reference. As of January 2021, the dataset has had 991 page views.

Dr. Fischer appreciates the opportunity for research development that open data sharing creates, saying, “Maybe somebody else will develop a routine, or develop something that is better, easier than what we have”. Datasets deposited in the RDR are made publicly available for download and receive a permanent DOI link, which makes the data even more accessible.

“Maybe somebody else will develop a routine, or develop something that is better, easier than what we have.”

In addition to the benefits of long-term preservation and access that publishing data in the RDR provides, Dr. Fischer finds that sharing his data openly encourages a sense of accountability. “I don’t have a problem with other people going in and trying, and making sure it’s actually right. I welcome the opportunity for feedback”. With many research funding agencies introducing policies for research data management and data sharing practices, the RDR is a great option for Duke researchers. Every dataset that is accepted into the RDR is carefully curated to meet FAIR guidelines and optimized for future reuse.

Collaborating with researchers like Dr. Martin Fischer is one of the highlights of working on the Research Data Curation team. We look forward to seeing what fascinating data 2021 will bring to the RDR and working with more Duke researchers to share their data with the world.

Dr. Fischer’s Work in the Duke Research Data Repository:

  • Wilson, J. W., Degan, S., Gainey, C. S., Mitropoulos, T., Simpson, M. J., Zhang, J. Y., & Warren, W. S. (2019). Data from: In vivo pump-probe and multiphoton fluorescence microscopy of melanoma and pigmented lesions in a mouse model. Duke Digital Repository. https://doi.org/10.7924/r4cc0zp95
  • Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., Westman, E. (2020). Video data files from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Duke Research Data Repository. V2 https://doi.org/10.7924/r4ww7dx6q

Celebrating GIS Day 2020

About GIS Day

GIS Day is an international celebration of geographic information systems (GIS) technology. The event provides an opportunity for users of geospatial data and tools to build knowledge, share their work, and explore the benefits of GIS in their communities. Since its establishment in 1999, GIS Day events have been organized by nonprofit organizations, universities, schools, public libraries, and government agencies at all levels.

GIS Day is held annually on the third Wednesday of November – which is today. Happy GIS Day! CDVS has participated in Duke GIS Day activities on campus in past years, but with COVID-19, we had to find other ways to celebrate.

A (Virtual) Map Showcase
The English Civil Wars - Story Map

To mark GIS Day this year, CDVS is launching an ArcGIS StoryMaps showcase! We invite any students, faculty, and staff to submit a story map to highlight their mapping and GIS work. Send us an email at askdata@duke.edu if you would like to add yours to the collection. We are keen to showcase the variety of GIS projects happening across Duke, and we will add contributions to the collection as we receive them. Our first entry is a story map created by Kerry Rork as part of a project for undergraduate students that used digital mapping to study the English Civil Wars.

Why Story Maps?

If you aren’t familiar with ArcGIS StoryMaps, this easy-to-use web application integrates maps with narrative text, images, and video. The platform’s compelling, interactive format can be an effective communication tool for any project with a geographic component. We have seen a surge of interest in story maps at Duke, with groups using them to present research, give tours, and provide instruction. Check out the learning resources to get started, or contact us at askdata@duke.edu to schedule a consultation with one of our GIS specialists.