All posts by John Little

Data Management, repository, research data, version control

Code Repository vs Archival Repository. You need both.

2022-02-17 John Little 2 Comments

Years ago I heard the following quote attributed to Seamus Ross from 2007:

Digital objects do not, in contrast to many of their analog counterparts, respond well to benign neglect.

National Wildlife Property Repository. USFWS Mountain-Prairie. https://flic.kr/p/SYVPBB

Meaning, you cannot simply leave digital files to their bit-rot tendencies while expecting them to be usable in the future. Digital repositories are part of a solution to this problem. But to review, there are many types of repositories, both digital and analog: repositories of bones, insects, plants, books, digital data, etc. Even among the subset of digital repositories there are many types. Some digital repositories keep your data safe for posterity and replication. Some help you manage the distribution of analysis and code. Knowing about these differences will affect not only the ease of your computational workflow, but also the legacy of your published works.

Version-control repositories and their hubs

The most widely known social coding hubs include GitHub, Bitbucket and GitLab. These hubs leverage Git version-control software to track the evolution of project repositories – typically a software or computational analysis project. Importantly, Git and GitHub are not the same thing but they work well together.

Git repository — GIT Repository. Treviño. https://flic.kr/p/SSras

Version control works by monitoring any designated folder or project directory, making that directory a local repository or repo. Among other benefits, using version control enables “time travel.” Interactions with earlier versions of a project are commonplace. It’s simple to retrieve a deleted paragraph from a report written six months ago. However, there are many advanced features as well. For example, unlike common file-syncing tools, it’s easy to recreate an earlier state of an entire project directory, with every file as it was at a particular point in time. This feature, among others, makes Git version control a handy tool in support of many research workflows and their respective outputs: documents, visualizations, dashboards, slides, analysis, code, software, etc.
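To make the “time travel” idea concrete, here is a minimal Python sketch. It is a toy model and not how Git is implemented: the class name `ToySnapshots` and its `commit` and `checkout` methods are hypothetical names chosen for illustration, and every snapshot simply records the contents of each file in a folder. Git does the same job with commits and far more efficient storage.

```python
import hashlib
import tempfile
from pathlib import Path


class ToySnapshots:
    """Toy stand-in for a version-control history (NOT how Git works internally)."""

    def __init__(self):
        self.blobs = {}      # content hash -> file bytes
        self.snapshots = []  # each snapshot maps relative path -> content hash

    def commit(self, project_dir):
        """Record the current state of every file under project_dir."""
        root = Path(project_dir)
        snapshot = {}
        for path in sorted(root.rglob("*")):
            if path.is_file():
                data = path.read_bytes()
                digest = hashlib.sha256(data).hexdigest()
                self.blobs[digest] = data
                snapshot[path.relative_to(root).as_posix()] = digest
        self.snapshots.append(snapshot)
        return len(self.snapshots) - 1  # the snapshot id

    def checkout(self, snapshot_id, target_dir):
        """Recreate the whole project, as it was at snapshot_id, in target_dir."""
        target = Path(target_dir)
        for rel_path, digest in self.snapshots[snapshot_id].items():
            out = target / rel_path
            out.parent.mkdir(parents=True, exist_ok=True)
            out.write_bytes(self.blobs[digest])


def time_travel_demo():
    """Delete a paragraph, then recover it from the first snapshot."""
    history = ToySnapshots()
    with tempfile.TemporaryDirectory() as work, tempfile.TemporaryDirectory() as old:
        report = Path(work) / "report.txt"
        report.write_text("Intro\nA paragraph I will later delete.\n")
        first = history.commit(work)

        report.write_text("Intro\n")  # months later, the paragraph is gone
        history.commit(work)

        history.checkout(first, old)  # "time travel" back to the first snapshot
        return (Path(old) / "report.txt").read_text()
```

Calling `time_travel_demo()` returns the text of the report as it stood at the first snapshot, deleted paragraph included. The point of the sketch is that a snapshot covers the whole directory, not a single file, which is what separates version control from simple file syncing.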

Binary. Michael Coghlan. https://flic.kr/p/aYEytM

Git is one of the most popular open-source version-control applications. It was originally developed in 2005 to facilitate the evolution of the world’s most far-reaching and successful open-source coding project: Linux. Linux is a worldwide collaborative project that spans multiple developers, project managers, natural languages, geographies, and time zones. While Git can handle large projects, it is extensible and can easily scale up or down to support a wide range of workflows. Additionally, Git is not just for software and code files. Essentially any file on a file system can be tracked with Git: MS Word documents, PDF files, images, datasets, etc.

There are many ways to share a Git repository and profile your work. The term push refers to a convenient process of synchronizing a repo up to a remote social coding hub. Additional features of a hub include issue tracking, collaboration, hosting documentation, and Kanban Method planning. Conveniently, pushing a repo to GitHub means maintaining a seamless, two-location backup – a push will simultaneously and efficiently synchronize the timeline and file versions. Meanwhile, at a repo editor’s discretion, any collaborator or interested party can be granted access to their GitHub repository.

Many public instances of social-coding hubs operate on a freemium model. At GitHub most users pay nothing. It’s also possible to run a local instance of a coding hub. For example, OIT offers a local instance of GitLab, delivering many of the same features while enabling permissions, authorization, and access via Duke’s NetID.

While social coding hubs are great tools for distributing files and managing project life-cycles, in and of themselves they do not sufficiently ensure long-term reproducible access to research data. To do that, simply synchronize version-control repositories with archival research data repositories.

Research Data Repositories

Preserving the computational artifacts of formal academic works requires a repository focus that is complementary to version-control repositories and social-coding hubs. Nonetheless, version control is not a requirement of a data repository where the goal is long-term preservation. Fortunately, many special-purpose data repositories exist. Discipline-specific research repositories are sometimes associated with academic societies. There also exist more generalized archival research repositories such as Zenodo.org. Additionally, many research universities host institutional research data repositories. Not surprisingly, such a research data repository exists at Duke where the Duke University Libraries promotes and cooperatively shepherds Duke’s Research Data Repository (RDR).

Colossus computer — Colossus. Chris Monk. https://flic.kr/p/fJssqg

Unlike social coding hubs, data repositories operate under different funding models and are motivated by different horizons. Coding hubs like GitHub do not promise long-term retention; instead, they focus on immediate distribution of version-control repos and offer project management features. Research data repositories take a long view centered closer to the artifacts of formal research and publication.

By archiving the data milestones of publication, a deposit in the RDR links a formal publication (book edition, chapter, serial article, etc.) with the data and code (i.e., a compendium) used to produce a single tangible instance of publication. In turn, the building blocks of computational thinking and research processes are preserved for posterity because the RDR maintains an assurance of long-term sustainability.
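One way to picture what “preserving a compendium” involves is a checksum manifest: a list of every file in the deposit alongside a fingerprint of its contents, so that any later change or bit-rot can be detected. The Python sketch below is an illustration only and makes no claim about how the RDR handles deposits. The function names `build_manifest` and `changed_files`, and the example file names, are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(compendium_dir):
    """Map each file's relative path to its checksum."""
    root = Path(compendium_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(compendium_dir, manifest):
    """List files that were added, removed, or altered since the manifest."""
    current = build_manifest(compendium_dir)
    return sorted(
        name
        for name in set(manifest) | set(current)
        if manifest.get(name) != current.get(name)
    )


def fixity_demo():
    """Build a manifest for a tiny compendium, then detect a silent change."""
    with tempfile.TemporaryDirectory() as folder:
        data = Path(folder) / "data.csv"
        data.write_text("id,value\n1,42\n")
        (Path(folder) / "analysis.R").write_text("summary(read.csv('data.csv'))\n")

        manifest = build_manifest(folder)       # checksums at "deposit" time
        before = changed_files(folder, manifest)

        data.write_text("id,value\n1,43\n")     # simulate a later alteration
        after = changed_files(folder, manifest)
        return before, after
```

In this sketch `fixity_demo()` reports no changes immediately after the manifest is built and flags `data.csv` once its contents differ. A manifest like this could travel with the data and code, letting a future reader confirm that the files match what was originally archived.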

Bill Atkinson, creator of MacPaint, painted in MacPaint. Photo by Kyra Rehn. https://flic.kr/p/e9urBF

In the Duke RDR, particular effort is focused on preserving unique versions of data associated with each formal publication. In this way, authors can associate a digital object identifier, or DOI, with the precise code and data used to draft an accepted paper or research project. Once the materials are deposited in the RDR, researchers across the globe can look at these archives to verify, to learn, to refute, to cite, or to be inspired toward new avenues of investigation.

By preserving workflow artifacts endemic to publication milestones, research data repositories preserve the record of academic progress. Importantly, the preservation of these digital outcomes or artifacts is strongly encouraged by funding agencies. Increasingly, these archival access points are a requirement for funding, especially among publicly funded research. As such, the Duke RDR exists with aims to preserve and make the academic record accessible, and to create a library of reproducible academic research.

Conclusion

The imperatives for preserving research data are derived from expressly different motives than those driving version-control repositories. Minimally, version-control repositories do not promise academic posterity. However, among the drivers of scholarship is the intentional engagement with the preserved academic record. In reality, while unlikely, your GitHub repository could vanish in the blink of the next Wall Street acquisition. Conversely, research data repositories exist with different affordances. These two types of repositories complement each other. What’s more, they can be synchronized to enable and preserve the digital processes that comprise many forms of data-driven research. Using both types of repositories implies workflows that positively contribute to a scholarly legacy. It is this promise of academic transmission that drives Duke’s RDR, and benefits scholars by enabling access to persistent copies of research.

data science, workshops

Flipping Data Workshops

2020-09-15 John Little 1 Comment

John Little is the Data Science Librarian in Duke Libraries’ Center for Data and Visualization Sciences. Contact him at askdata@duke.edu.

The Center for Data and Visualization Sciences is and has been open since March! We never closed. We’re answering questions, teaching workshops, and offering remote virtual machines, and business is booming.

What’s changed? Due to COVID-19, the CDVS staff are working remotely. While we love meeting with people face-to-face in our lab, that is not currently possible. Meanwhile, digital data wants to be analyzed and our patrons still want to learn. By late spring I began planning to flip my workshops for fall 2020. My main goal was to transform a workshop into something more rewarding than watching the video of a lecture, something that lets the learner engage at their pace, on their terms.

How to flip

Flipping the workshop is a strategy to merge student engagement and active learning. In traditional instruction, a teacher presents a topic and assigns work aimed at reinforcing the lesson.

Background: I offer discrete two-hour workshops that are open to the entire university. There are very few prerequisites and people come with their own level of experience. Since the workshops attract a broad audience, I focus on skills and techniques using general examples that reliably convey information to all learners. In this environment, discipline-specific examples risk losing large portions of the audience. As an instructor I must try to leave my expectations of students’ skills and background knowledge at the door.

In a flipped classroom, materials are assigned and made available in advance. In this way, group Zoom-time can be used for questions and examples. This instruction model allows students to learn at their own pace, pause and rewind videos, practice exercises, or speed up lectures. During the workshop, students can bring questions relevant to their particular point of confusion.

The main instructor goal is to facilitate a topic for student engagement that puts the students in control. This approach has a democratizing effect that allows students to become more active and familiar with the materials. With flipped workshops, student questions appear to be more thoughtful and relevant. When the student is invited to take charge of their learning, the process of investigation becomes their self-driven passion.

For my flipped workshop materials, I offer basic videos to introduce and reinforce particular techniques. I try to keep each video short, less than 25 minutes. At the same time I offer plenty of additional videos on different topical details. More in-depth videos can cover important details that may feel ancillary or even demotivating, even if those details improve task efficiency. Sometimes the details are easier to digest when the student is engaged. This means students start at their own level and gain background when they’re ready. Students may not return to the background material for weeks, but the materials will be ready when they are.

Flipping a consultation?

The Center for Data & Visualization Sciences provides open workshops and Zoom-based consulting. The flipped workshop model aligns perfectly with our consulting services since students can engage with the flipped workshop materials (recordings, code, exercises) at any time. When the student is ready for more information, whether a general question or a specific research question, I can refer to targeted background materials during my consultations. With the background resources, I can keep my consultations relevant and brief while also reducing the risk of under-informing.

For my flipped workshop on R, or other CDVS workshops, please see our workshop page.

data science, rstats, tutorial, workshops

R Open Labs – open hours to learn more R

2019-08-23 John Little 1 Comment

New this fall…

R fun: An R Learning Series — An R workshop series by the Center for Data and Visualization Sciences.

You are invited to stop by the Edge Workshop Room on Mondays for a new Rfun program, the R Open Labs, 6-7pm, Sept. 16 through Oct. 28. No need to register although you are encouraged to double-check the R Open Labs schedule/hours. Bring your laptop!

This is your chance to polish R skills in a comfortable and supportive setting. If you’re a bit more advanced, come and help by demonstrating the supportive learning community that R is known for.

No prerequisites, but please bring your laptop with R/RStudio installed. No skill level expected. Beginners, intermediate, and advanced users are all welcome. One of the great characteristics of the R community is the supportive culture. While we hope you have attended our Intro to R workshop (or watched the video, or equivalent), this is an opportunity to learn more about R and to demystify some part of R that you find confusing.

FAQ

What are Open Labs?

Open labs are semi-structured workshops designed to help you learn R. Each week brief instruction will be provided, followed by time to practice, work together, ask questions and get help. Participants can join the lab any time during the session, and are welcome to work on unrelated projects.

The Open Labs model was established by our colleagues at Columbia and adopted by UNC Chapel Hill. We’re giving this a try as well. Come help us define our direction and structure. Our goal is to connect researchers and foster a community for R users on campus.

How do I Get Started?

Attend an R Open Lab. Labs occur on Mondays, 6pm-7pm in the Edge Workshop Room in the Bostock Library. In our first meeting we will decide, as a group, which resource will guide us. We will pick one of the following resources…

R for Data Science by Hadley Wickham & Garrett Grolemund (select chapters, workbook problems, and solutions)
The RStudio interactive R Primers
Advanced R by Hadley Wickham (select chapters and workbook problems)
Or, the interactive dataquest.io learning series on R

Check our upcoming Monday schedule and feel free to RSVP. We will meet for 6 nearly consecutive Mondays during the fall semester.

Please bring a laptop with R and RStudio installed. If you have problems installing the software, we can assist you with installation as time allows. Since we’re just beginning with R Open Labs, we think there will also be time for one-on-one attention alongside learning and community building.

How to install R and RStudio

If you are getting started with R and haven’t already installed anything, consider using these installation instructions. Or simply skip the installation and use one of these free cloud environments:

Duke’s virtual RStudio — requires your NetID login.
RStudio Cloud

Begin Working in R

We’ll start at the beginning; however, R Open Labs recommends that you attend our Intro to R workshop or watch the recorded video. Being a beginner makes you part of our target audience, so come ready to learn and ask questions. We also suggest working through materials from our other workshops, or any of the resource materials listed in the Attend an R Open Lab section (above). But don’t let lack of experience stop you from attending. The resources mentioned above will be the target of our learning and exploration.

Is R help available outside of Open Labs?

If you require one-on-one help with R outside of the Open Labs, in-person assistance is available from the Library’s Center for Data & Visualization Sciences, our Center’s Rfun workshops, or our walk-in consulting in the Brandaleone Data and Visualization Lab (floormap. 1st Floor Bostock Library).

workshops

Announcing Tidyverse workshops for Winter 2018

2017-12-05 John Little

Coming this winter, the Data & Visualization Services Department will once again host a workshop series on the R programming language. Our spring offering is modeled on our well-received R we having fun yet‽ (Rfun) fall workshop series. The six-part series will introduce R as a language for modern data manipulation by highlighting a set of tidyverse packages that enable functional data science. We will approach R using the free RStudio IDE, an intent to make reproducible literate code, and a bias towards the tidyverse. We believe this open tool-set provides a context that enables and reinforces reproducible workflows, analysis, and reporting.

January Line-up

Title	Date	Registration	Past Workshop
Intro to R	Jan 19 1-3pm	register	Resources
R Markdown with Dr. Çetinkaya-Rundel	Jan 23 9am	register
Shiny with Dr. Çetinkaya-Rundel	Jan 25 9am	register
Mapping with R	Jan 25 1-3pm	register	Resources
Reproducibility & Git	Jan 29 1-3pm	register	Resources
Visualization with ggplot2	Feb 1 9:30-11:30am	register	Resources

An official announcement with links to registration is forthcoming. Feel free to subscribe to the Rfun or DVS-Announce lists. Or look to the DVS Workshop page for official registration links as soon as they are available.

Workshop Arrangement

This workshop series is intended to be iterative and recursive. We recommend starting with the Introduction to R. Proceed through the remaining workshops in any order of interest.

Recordings and Past Workshops

We presented a similar version of this workshop series last fall and recorded each session whenever possible. You can stream past workshops and engage with the shareable data sets at your own pace (see the Past Workshop resources links, above). Alternatively, all the past workshop resource links are presented in one listicle: Rfun recap.

CIFS, Data Management, Data Storage

Sharing Files: Your Duke Box.com

2015-01-29 John Little 2 Comments

Last fall Duke University released its newest file sharing service, known as Duke’s Box. By partnering with Box.com, Duke offers a cloud-storage service which is intuitive, secure, and easy to use. Log in with your NetID, share files with colleagues, and have confidence this cloud storage is compliant with all laws and regulations regarding data privacy and security.

Simple to Use

Duke’s Box is similar to other cloud-based file storage services which support collaboration, productivity, and synchronization. You can drag and drop files, identify collaborators, and set permissions (read, edit, comment, etc.). But unlike some services, such as Dropbox or Google Drive, Duke’s Box enables you to comply with data privacy and security requirements. Additionally, you can synchronize data across your devices, at your discretion and subject to Duke’s Security & Usage Practice restrictions.

While you may have previously used OIT’s NAS (Network Attached Storage) file storage service known as CIFS for data storage, Duke’s Box is easier to use, although it serves slightly different use-cases. For example, CIFS might be more useful if accessing large files (e.g. video files that are larger than 5 GB). However, CIFS doesn’t enable collaboration or sharing. Depending on your needs you may still want to use your departmental or OIT NAS. Either way, you can use both file storage services and each service is free.

Check out this quick-start video:

50 GB of Space by Default

You are automatically provisioned 50 GB of space, but you can request more if you need more. See the Comparison of Document Management & Collaboration Tools at Duke for details.

Individual files are limited to less than 5 GB. This means Duke’s Box may be less than ideal for sharing very large files. NAS services may be more appropriate for large files, as the time to download or synchronize large files can become inconvenient. But for many common file sharing cases, Duke’s Box is ideal, fast, and convenient.
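If you want to check a folder before uploading, a few lines of Python can list the files that would exceed the per-file limit. This is a minimal sketch and not an official Box or OIT tool. The function name `files_too_large` is hypothetical, and the limit is assumed to mean 5 decimal gigabytes, since the exact definition is not stated here.

```python
import tempfile
from pathlib import Path

# Assumption: "5 GB" is read as 5 * 10**9 bytes; the service may define it differently.
BOX_FILE_LIMIT_BYTES = 5 * 10**9


def files_too_large(folder, limit_bytes=BOX_FILE_LIMIT_BYTES):
    """Return relative paths of files whose size is not below the limit."""
    root = Path(folder)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size >= limit_bytes
    )


def size_check_demo():
    """Demonstrate the check with tiny files and a tiny 10-byte limit."""
    with tempfile.TemporaryDirectory() as folder:
        (Path(folder) / "small.csv").write_bytes(b"x" * 5)
        (Path(folder) / "big.mov").write_bytes(b"x" * 50)
        return files_too_large(folder, limit_bytes=10)
```

With the demo’s 10-byte limit, only `big.mov` is reported. Run against a real folder with the default limit, any file the function lists is a candidate for NAS storage instead.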

Documentation, Restrictions & Use

While you can store many types of files, there are best practices and restrictions you will want to review. For example, Duke Medicine users are required to complete an online training module prior to account activation.

Security and Use, including more detail on Terms of Service, and example Data Types — including military and space data, FERPA, HIPAA, etc.
Duke’s Box Usage Practices
Comparison of Document Management & Collaboration Tools at Duke
Duke Box – OIT launch page
OIT’s FAQ
Your Duke’s Box “Read Me” folder. OIT has done a great job of providing quick and convenient documentation located right where you need it. See the READ ME folder after you log on to Duke’s Box.

Sharing Your Data With Us

One of the many use-cases for Duke’s Box is a more convenient way for you to share your data with us. As you know we welcome questions about data analysis and visualization. We know describing data can be difficult while sharing your dataset can clarify your question. But sharing your data via email consumes a lot of resources — both yours and ours. Now there’s a better way; please share your data with us via Duke’s Box.

Steps for Sharing Your Data with DVS Consultants

How to Share your files - 5 second animated loop

Log into Duke’s Box (Use the blue “continue” button)
Open your “home” folder
Put your data in the “sharing” folder
Use the “invite people” button (right-hand sidebar)
- Using a consultant email address, invite the DVS Consultant to see your data. (Don’t worry if you don’t have our email yet. When you start your question at askData@duke.edu, an individual consultant will be back in touch.)

CIFS, Data Curation, Data Management

Access your Duke-Cloud from ANYWHERE

2013-11-05 John Little 2 Comments

Say you’ve been making hella maps or data stories all day. Now you need to move to your comfy work spot and you need your data to come with you. If you use Duke’s CIFS, moving around is easy, and all of your files are already backed up.

In this example we follow the researcher, Ms. Stu Fac-Staff. Stu is part student, part faculty, and part staff at Duke University. She needs a portable place for her data and wants easy access from her home, lab, and devices. Stu also needs to easily share data with colleagues. No problem! Stu uses CIFS.

Here’s the scenario. Ms. Stu Fac-Staff walks into the Data & GIS Lab in the Duke University Libraries with a flash drive full of data tables. She gathers more supporting data and some advice about crunching the numbers. Stu finishes her day with a visualization and map. (Proudly, Stu imagines this is going to get the A. “Is this grant worthy?” Stu asks herself. “You bet your NSF Application it is!”) Meanwhile, her flash drive is now full and all she wants is to SAVE THE DATA, CONVENIENTLY for later retrieval back home. So Stu stores the data on the Duke Cloud (CIFS).

How do I get the free CIFS Space and how much can I use/access?

Duke University provides 5 GB (at least!) of easily accessible Cloud-storage space to all faculty, students, and staff
If you need more space, larger quantities are available upon request
The space is called CIFS and is an OIT supported personal home directory of portable file space; CIFS is a mappable drive on your device and the files are backed up
Students are provisioned CIFS space automatically. Faculty & Staff must request the space through the OIT Service Desk

How do I access the data from my device?

In the Data & GIS Lab, after using your NetID to login, open the Windows File Explorer and your CIFS space will be mapped as drive Z.
After you leave our Data & GIS Lab, all you have to do is “map the drive” on your own machine
- If off-campus, use the VPN, then …
- Windows Directions
```
\\homedir.oit.duke.edu\users\<first letter of your NetID>\<NetID>
```
- Mac Directions
```
cifs://homedir.oit.duke.edu/users/<first letter of your NetID>/<NetID>
```
- Mobile. Access CIFS from an app on your mobile device.
Web – For easy distribution to colleagues, you might want to access or distribute your files through the web. To do this, store the files in your ‘public_html’ directory inside of your CIFS space. Now the files can be downloaded via a web browser. This method is, by default, open to the world; you may want to take additional steps to secure this public_html directory (see below).
```
http://people.duke.edu/~NetID
```
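The three address patterns above all follow from a single NetID, which a short Python sketch can make explicit. The function name `cifs_paths` and the sample NetID `abc123` are hypothetical and used only for illustration. The host names and path layout are copied from the directions above.

```python
def cifs_paths(netid):
    """Build the Windows, Mac, and web addresses for a NetID's CIFS space."""
    netid = netid.strip()
    if not netid:
        raise ValueError("NetID must not be empty")
    first = netid[0]  # the directions file each user under the NetID's first letter

    # Windows UNC path: two leading backslashes, then backslash-separated parts.
    windows = "\\\\" + "\\".join(["homedir.oit.duke.edu", "users", first, netid])
    mac = f"cifs://homedir.oit.duke.edu/users/{first}/{netid}"
    web = f"http://people.duke.edu/~{netid}"  # serves files placed in public_html
    return {"windows": windows, "mac": mac, "web": web}


if __name__ == "__main__":
    # "abc123" is a made-up NetID for illustration.
    for label, address in cifs_paths("abc123").items():
        print(f"{label}: {address}")
```

Substitute your own NetID for the sample value, then paste the matching address into the drive-mapping dialog on your machine.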

Can I Secure the Data?

Are you trying to access your mapped drive from off campus?
- Use the VPN directions
- The CIFS protocol encrypts your NetID and password, but it does not encrypt your data stream over the Internet. If you’re connecting from an unencrypted or untrusted network (e.g. wireless in the coffee shop), the VPN allows for a secure connection.
Did you put files in your public_html folder?
- Unlike the default CIFS space, placing files in the ‘public_html’ directory means they become accessible to the world
- You can control and limit access by following OIT’s “htaccess” instructions