
Flipping Data Workshops

John Little is the Data Science Librarian in Duke Libraries’ Center for Data and Visualization Sciences. Contact him at askdata@duke.edu.

The Center for Data and Visualization Sciences is open and has been since March! We never closed. We’re answering questions, teaching workshops, and providing remote virtual machines, and business is booming.

What’s changed? Due to COVID-19, the CDVS staff are working remotely. While we love meeting with people face-to-face in our lab, that is not currently possible. Meanwhile, digital data wants to be analyzed and our patrons still want to learn. By late spring I began planning to flip my workshops for fall 2020. My main goal was to transform a workshop into something more rewarding than watching the video of a lecture, something that lets the learner engage at their pace, on their terms.  

How to flip

Flipping the workshop is a strategy to merge student engagement and active learning.  In traditional instruction, a teacher presents a topic and assigns work aimed at reinforcing the lesson. 

Background: I offer discrete two-hour workshops that are open to the entire university. There are very few prerequisites and people come with their own level of experience. Since the workshops attract a broad audience, I focus on skills and techniques using general examples that reliably convey information to all learners. In this environment, discipline-specific examples risk losing large portions of the audience. As an instructor I must try to leave my expectations of students’ skills and background knowledge at the door.

In a flipped classroom, materials are assigned and made available in advance. In this way, group Zoom-time can be used for questions and examples. This instruction model allows students to learn at their own pace, pause and rewind videos, practice exercises, or speed up lectures. During the workshop, students can bring questions relevant to their particular point of confusion.  

The main instructor goal is to facilitate a topic for student engagement that puts the students in control. This approach has a democratizing effect that allows students to become more active and familiar with the materials.  With flipped workshops, student questions appear to be more thoughtful and relevant. When the student is invited to take charge of their learning, the process of investigation becomes their self-driven passion.  

For my flipped workshop materials, I offer basic videos to introduce and reinforce particular techniques. I try to keep each video short, less than 25 minutes. At the same time I offer plenty of additional videos on different topical details. More in-depth videos can cover important details that may feel ancillary or even demotivating, even if those details improve task efficiency. Sometimes the details are easier to digest when the student is engaged. This means students start at their own level and gain background when they’re ready. Students may not return to the background material for weeks, but the materials will be ready when they are.

Flipping a consultation?

The Center for Data & Visualization Sciences provides open workshops and Zoom-based consulting. The flipped workshop model aligns perfectly with our consulting services since students can engage with the flipped workshop materials (recordings, code, exercises) at any time. When the student is ready for more information, whether a general question or a specific research question, I can refer to targeted background materials during my consultations. With the background resources, I can keep my consultations relevant and brief while also reducing the risk of under-informing.  

For my flipped workshop on R, or other CDVS workshops, please see our workshop page.

Automated Tagging of Historical, Non-English Sources with Named Entity Recognition (NER): A Resource

Felipe Álvarez de Toledo López-Herrera is a Ph.D. candidate in the Art, Art History, and Visual Studies Department at Duke University and a Digital Humanities Graduate Assistant for Humanities Unbounded, 2019-2020. Contact him at askdata@duke.edu.

[This blogpost introduces a GitHub Repository that provides resources for developing NER projects in historical languages. Please do not hesitate to use the code and ideas made available there, or contact me if there are any issues we could discuss.]

Understanding Historical Art Markets: an Automated Approach

When the Sevillian painters’ guild archive was lost in the 19th century, with it vanished lists of master painters, journeymen, apprentices and possibly dealers recorded in the guild’s registration books. Nevertheless, researchers working for over a century in other Sevillian archives have published almost twenty volumes of archival documents. These transcriptions, excerpts and summaries reflect the activities of local painters, sculptors, and architects, among other artisans. I use this evidence as a source of data on early modern Seville’s art market in my dissertation research. For this, I have to extract information from many documents in order to query and discern larger patterns.

Image of books and extracted text examples.
Left. Some of the volumes used in this research, in my home library. I have managed to acquire these second-hand; others I have borrowed from libraries. Right. A scan of one of the pages of these books, showing some of the documents from which we extracted data.

Instead of manually keying this information into a spreadsheet or other form of data storage, I chose to scan my sources and test an automated approach using Natural Language Processing. Last semester, within the context of the Humanities Unbounded Digital Humanities Graduate Assistantship, I worked with Named-Entity Recognition (NER), a technique in which computers can be taught to identify named real-world objects in texts. NER models underperform on historical texts because they are trained on modern documents such as news or Wikipedia articles. Furthermore, NLP developers have focused most of their efforts on English language models, resulting in underdeveloped models for other languages. For these reasons, I had to retrain a model to be useful for my purposes. In this blogpost, I give an overview of the process of adapting NER tools for use on non-English historical sources.

Defining Named-Entity Recognition

Named-Entity Recognition (NER) is a set of processes in which a computer program is trained to identify and categorize real-world objects with proper names in a corpus of texts. It can be used to tag names in documents without a standardized structure and label them as people, locations or organizations, among other categories.

Named entity recognition example tags in text
Named-entity recognition is the automated tagging of real-world objects with names, such as people, locations, organizations, or monetary amounts, within texts.
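To make the idea concrete, here is a minimal sketch of how tagged entities are commonly represented: as character offsets into the text plus a label. The sentence, the names in it, and the label names (PER, LOC, MONEY) are invented for illustration and are not taken from the archive volumes or from the repository.

```python
# Illustrative sentence; not a real archival document.
text = "Juan Pérez de Guzmán, pintor, vecino de Sevilla, recibió 200 ducados."


def tag(text, phrase, label):
    """Return a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Hand-written "tags" standing in for what an NER model would produce.
entities = [
    tag(text, "Juan Pérez de Guzmán", "PER"),
    tag(text, "Sevilla", "LOC"),
    tag(text, "200 ducados", "MONEY"),
]

for start, end, label in entities:
    # Slicing the text with the offsets recovers the tagged phrase.
    print(f"{label:6} {text[start:end]}")
```

Storing offsets rather than copies of the words keeps each tag tied to an exact position in the document, which matters when the same name appears more than once.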

Code libraries such as spaCy, NLTK or Stanford CoreNLP provide widely-tested toolkits for NER. I decided that spaCy would be the best choice for my purposes. Though its Spanish model included fewer label categories, they performed better out-of-the-box. Importantly, the model worked better for certain basic language structures such as recognizing compound names (last names with several components, such as my own). The spaCy library also proved user-friendly for those of us with little coding knowledge. Its pre-programmed data processing pipeline is easy to modify, provided that you have a basic understanding of Python. In my case, I had the time and motivation to acquire this literacy.

I sought to improve the model’s performance in two ways. First, I retrained it on a subset of my own data. This improved performance and allowed me to add new label categories such as dates, monetary amounts and objects. Additionally, I added a component that modernized my texts’ spelling to make them more conducive to proper tagging.
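A spelling-modernization step of this kind can be sketched as a simple word-level lookup. The function name, the mapping, and the example spellings below are hypothetical and chosen only for illustration; the actual rules used in the project, and how they were attached to the pipeline, are in the repository rather than here.

```python
import re

# Hypothetical old-to-modern spelling pairs, for illustration only.
MODERN_SPELLINGS = {
    "vezino": "vecino",
    "çiudad": "ciudad",
    "dho": "dicho",
}


def modernize(text, spellings=MODERN_SPELLINGS):
    """Replace known historical spellings, keeping an initial capital."""

    def swap(match):
        word = match.group(0)
        modern = spellings.get(word.lower())
        if modern is None:
            return word  # unknown words pass through unchanged
        return modern.capitalize() if word[0].isupper() else modern

    # \w+ matches Unicode letters in Python 3, including "ç".
    return re.sub(r"\w+", swap, text)


print(modernize("Vezino de la çiudad de Sevilla"))
```

Running the text through a step like this before tagging means the model sees spellings closer to the modern Spanish it was originally trained on.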

Training NER on Historical Spanish Text: Process and Results

To improve the model, I needed training data – a “gold standard” of perfectly tagged text. First, I ran the model on a set of 400 documents, which resulted in a set of preliminary tags. Then, I corrected these tags with a tool called Dataturks and reformatted the output to work with spaCy. Once this data was ready, I split it 80-20, which means running a training loop on 80% of correctly tagged texts to adjust the performance of the model, and reserving 20% for testing or evaluating the model on data it had not yet seen.
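The 80-20 split described above can be sketched in a few lines. This is an assumption-laden illustration: the function name, the fixed seed, and the placeholder document IDs are mine, and the project's own code may split the data differently.

```python
import random


def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into train and test lists."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder IDs standing in for the 400 corrected documents.
documents = [f"doc-{i}" for i in range(400)]
train, test = split_train_test(documents)
print(len(train), len(test))  # 320 80
```

Shuffling before cutting matters when documents are stored in source order, since otherwise the test set would come entirely from the last volumes scanned.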

Named Entity Recognition output
Final output as stored in my database for one particular document with ID=5.

Finally, I evaluated whether all these changes actually improved the model’s performance, saved the updated model, and exported the output in a format that worked for my own database. For my texts, the model initially worked at around 36% recall (the percentage of true entities that were identified by the model), compared to 89% recall with modern texts as evaluated by spaCy. After training, recall increased to 64%. Some tags, such as person or location, perform especially well (85% and 81%, respectively). Though the numbers are not perfect, they show a marked improvement, generated with little training data.
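Recall as defined above can be computed by comparing the gold-standard tags with the model's tags. The sketch below assumes exact matching (same start, end, and label) and uses invented spans; the evaluation in the project may count matches differently.

```python
def recall(gold, predicted):
    """Share of gold entities that the model also produced (exact match)."""
    gold, predicted = set(gold), set(predicted)
    if not gold:
        return 0.0
    return len(gold & predicted) / len(gold)


def recall_by_label(gold, predicted):
    """Recall computed separately for each label present in the gold set."""
    labels = {label for _, _, label in gold}
    return {
        label: recall(
            {g for g in gold if g[2] == label},
            {p for p in predicted if p[2] == label},
        )
        for label in labels
    }


# Invented (start, end, label) spans for one document.
gold = {(0, 20, "PER"), (34, 41, "LOC"), (51, 62, "MONEY"), (70, 80, "DATE")}
predicted = {(0, 20, "PER"), (34, 41, "LOC"), (51, 58, "MONEY")}

print(recall(gold, predicted))  # 0.5
print(recall_by_label(gold, predicted))
```

Here the MONEY span is counted as a miss because its boundaries differ from the gold span, which shows how strict matching can lower recall even when the model is close.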

For the 8,607 documents processed, the process has resulted in 59,191 tags referring to people, locations, organizations, dates, objects and money. Next steps include finding descriptors of entities within the text, and modeling relationships between entities appearing in the same document. For now, a look at the detected tags underscores the potential of NER for automating data collection in data-driven humanities research.

Fall 2020 – CDVS Research and Education During COVID-19

The Center for Data and Visualization Sciences is glad to welcome you back to a new academic year! We’re excited to have friends and colleagues returning to the Triangle and happy to connect with Duke community members who will not be on campus this fall.

This fall, CDVS will expand its existing online consultations with a new chat service and new online workshops for all members of the Duke community. Since mid-March, CDVS staff have redesigned instructional sessions, constructed new workflows for accessing research data, and built new platforms for accessing data tools virtually. We look forward to connecting with you online and working with you to achieve your research goals.

In addition to our expanded online tools and instruction, we have redesigned our CDVS-Announce data newsletter to provide a monthly update of data news, events, and workshops at Duke. We hope you will consider subscribing.

Upcoming Virtual CDVS Workshops

CDVS continues to offer a full workshop series on the latest strategies and tools for data-focused research. Upcoming workshops for early September include:

R for data science: getting started, EDA, data wrangling
Thursday, Sep 1, 2020 10am – 12pm
This workshop is part of the Rfun series. R and the Tidyverse form a data-first coding language that enables reproducible workflows. In this two-part workshop, you’ll learn the fundamentals of R, everything you need to know to quickly get started. You’ll learn how to access and install RStudio, wrangle data for analysis, get a brief introduction to visualization, practice Exploratory Data Analysis (EDA), and generate reports.
Register: https://duke.libcal.com/event/6867861

Research Data Management 101
Wednesday, Sep 9, 2020 10am – 12pm
This workshop will introduce data management practices for researchers to consider and apply throughout the research lifecycle. Good data management practices pertaining to planning, organization, documentation, storage and backup, sharing, citation, and preservation will be presented using examples that span disciplines. During the workshop, participants will also engage in discussions with their peers on data management concepts as well as learn about how to assess data management tools.
Register: https://duke.libcal.com/event/6874814

R for Data Science: Visualization, Pivot, Join, Regression
Wednesday, Sep 9, 2020 1pm – 3pm
Register: https://duke.libcal.com/event/6867914

ArcGIS StoryMaps
Thursday, September 10, 2020 1pm – 2:30pm
This workshop will help you get started telling stories with maps on the ArcGIS StoryMaps platform. This easy-to-use web application integrates maps with narrative text, images, and videos to provide a powerful communication tool for any project with a geographic component. We will explore the capabilities of StoryMaps, share best practices for designing effective stories, and guide participants step-by-step through the process of creating their own application.
Register: https://duke.libcal.com/event/6878545

Assignment Tableau: Intro to Tableau work-together
Friday, September 11, 2020 10am – 11:30am
Work together over Zoom on an Intro to Tableau assignment. Tableau Public (available for both Windows and Mac) is incredibly useful free software that allows individuals to quickly and easily explore their data with a wide variety of visual representations, as well as create interactive web-based visualization dashboards. Attendees are expected to watch Intro to Tableau Fall 2019 online first, or have some experience with Tableau. This will be an opportunity to work together on the assignment from the end of that workshop, plus have questions answered live.
Register: https://duke.libcal.com/event/6878629

2020 RStudio Conference Livestream Coming to Duke Libraries

Interested in attending the 2020 RStudio Conference, but unable to travel to San Francisco? With the generous support of RStudio and the Department of Statistical Science, Duke Libraries will host a livestream of the annual RStudio conference starting on Wednesday, January 29th at 11AM. See the latest in machine learning, data science, data visualization, and R. Registration is required for the first session and keynote presentations. Please see the registration links and session information in the agenda that follows.

Wednesday, January 29th

Location: Rubenstein Library 249 – Carpenter Conference Room

11:00 – 12:00 RStudio Welcome – Special Live Opening Interactive Event for Watch Party Groups
12:00 – 1:00 Welcome from Hadley Wickham and Opening Keynote – Open Source Software for Data Science (JJ Allaire)
1:00 – 2:00 Data, visualization, and designing with AI (Fernanda Viegas and Martin Wattenberg, Google)
2:30 – 4:00 Education Track (registration is not required)
Meet you where you R (Lauren Chadwick, RStudio)
Data Science Education in 2022 (Carl Howe and Greg Wilson, RStudio)
Data science education as an economic and public health intervention in East Baltimore (Jeff Leek, Johns Hopkins)
Of Teacups, Giraffes, & R Markdown (Desiree Deleon, Emory)

Location: Edge Workshop Room – Bostock 127

5:15 – 6:45 All About Shiny  (registration is not required)
Production-grade Shiny Apps with golem (Colin Fay, ThinkR)
Making the Shiny Contest (Duke’s own Mine Cetinkaya-Rundel)
Styling Shiny Apps with Sass and Bootstrap 4 (Joe Cheng, RStudio)
Reproducible Shiny Apps with shinymeta (Carson Sievert, RStudio)
7:00 – 8:30 Learning and Using R (registration is not required)
Learning and using R: Flipbooks (Evangeline Reynolds, U Denver)
Learning R with Humorous Side Projects (Ryan Timpe, Lego Group)
Toward a grammar of psychological experiments (Danielle Navarro, University of New South Wales)
R for Graphical Clinical Trial Reporting (Frank Harrell, Vanderbilt)

Thursday, January 30th

Location: Edge Workshop Room – Bostock 127

12:00 – 1:00 Keynote: Object of type closure is not subsettable (Jenny Bryan, RStudio)
1:23 – 3:00 Data Visualization Track (registration is not required)
The Glamour of Graphics (William Chase, University of Pennsylvania)
3D ggplots with rayshader (Dr. Tyler Morgan-Wall, Institute for Defense Analyses)
Designing Effective Visualizations (Miriah Meyer, University of Utah)
Tidyverse 2019-2020 (Hadley Wickham, RStudio)
3:00 – 4:00 Livestream of RStudio Conference Sessions (registration is not required)
4:00 – 5:30 Data Visualization Track 2 (registration is not required)
Spruce up your ggplot2 visualizations with formatted text (Claus Wilke, UT Austin)
The little package that could: taking visualizations to the next level with the scales package (Dana Seidel, Plenty Unlimited)
Extending your ability to extend ggplot2 (Thomas Lin Pedersen, RStudio)
5:45 – 6:30 Career Advice for Data Scientists Panel Discussion (registration is not required)
7:00 – 8:00 Keynote: NSSD Episode 100 (Hilary Parker, Stitch Fix and Roger Peng, JHU)

R Open Labs – open hours to learn more R

New this fall…

R fun: An R Learning Series
An R workshop series by the Center for Data and Visualization Sciences.

You are invited to stop by the Edge Workshop Room on Mondays for a new Rfun program, the R Open Labs, 6-7pm, Sept. 16 through Oct. 28. No need to register, although you are encouraged to double-check the R Open Labs schedule/hours. Bring your laptop!

This is your chance to polish R skills in a comfortable and supportive setting.  If you’re a bit more advanced, come and help by demonstrating the supportive learning community that R is known for.

No prerequisites, but please bring your laptop with R/RStudio installed. No skill level expected. Beginners, intermediate, and advanced are all welcome. One of the great characteristics of the R community is the supportive culture. We hope you have attended our Intro to R workshop (or watched the video, or equivalent), but it is not required. This is an opportunity to learn more about R and to demystify some part of R that you find confusing.

FAQ

What are Open Labs?

Open labs are semi-structured workshops designed to help you learn R. Each week brief instruction will be provided, followed by time to practice, work together, ask questions and get help. Participants can join the lab any time during the session, and are welcome to work on unrelated projects.

The Open Labs model was established by our colleagues at Columbia and adopted by UNC Chapel Hill. We’re giving this a try as well. Come help us define our direction and structure. Our goal is to connect researchers and foster a community for R users on campus.

How do I Get Started?

Attend an R Open Lab. Labs occur on Mondays, 6pm-7pm in the Edge Workshop Room in the Bostock Library. In our first meeting we will decide, as a group, which resource will guide us. We will pick one of the following resources…

  1. R for Data Science by Hadley Wickham & Garrett Grolemund (select chapters, workbook problems, and solutions)
  2. The RStudio interactive R Primers
  3. Advanced R by Hadley Wickham (select chapters and workbook problems)
  4. Or, the interactive dataquest.io learning series on R

Check our upcoming Monday schedule and feel free to RSVP.  We will meet for 6 nearly consecutive Mondays during the fall semester.

Please bring a laptop with R and RStudio installed. If you have problems installing the software, we can assist you with installation as time allows. Since we’re just beginning with R Open Labs, we think there will be time for one-on-one attention as well as learning and community building.

How to install R and RStudio

If you are getting started with R and haven’t already installed anything, consider using these installation instructions. Or simply skip the installation and use one of these free cloud environments:

Begin Working in R

We’ll start at the beginning; however, R Open Labs recommends that you attend our Intro to R workshop or watch the recorded video. Being a beginner makes you part of our target audience, so come ready to learn and ask questions. We also suggest working through materials from our other workshops, or any of the resource materials listed in the Attend an R Open Lab section (above). But don’t let lack of experience stop you from attending. The resources mentioned above will be the target of our learning and exploration.

Is R help available outside of Open Labs?

If you require one-on-one help with R outside of the Open Labs, in-person assistance is available from the Library’s Center for Data & Visualization Sciences, our Center’s Rfun workshops, or our walk-in consulting in the Brandaleone Data and Visualization Lab (see the floor map; 1st Floor, Bostock Library).

 

Introducing Duke Libraries Center for Data and Visualization Sciences

As data-driven research has grown at Duke, Data and Visualization Services has received an increasing number of requests for partnerships, instruction, and consultations. These requests have deepened our relationships with researchers across campus such that we now regularly interact with researchers in all of Duke’s schools, disciplines, and interdepartmental initiatives.

In order to expand the Libraries’ commitment to partnering with researchers on data-driven research at Duke, Duke University Libraries is elevating the Data and Visualization Services department to the Center for Data and Visualization Sciences (CDVS). The change is designed to enable the new Center to:

  • Expand partnerships for research and teaching
  • Augment the ability of the department to partner on grant, development, and funding opportunities
  • Develop new opportunities for research, teaching, and collections – especially in the areas of data science, data visualization, and GIS/mapping research
  • Recognize the breadth and demand for the Libraries’ expertise in data-driven research support
  • Enhance the role of CDVS activities within Bostock Library’s Edge Research Commons

We believe that the new Center for Data and Visualization Sciences will enable us to partner with an increasingly large and diverse range of data research interests at Duke and beyond through funded projects and co-curricular initiatives. We look forward to working with you on your next data-driven project!

Computational Reproducibility Pilot – Code Ocean Trial

A goal of Duke University Libraries (DUL) is to support the growing and changing needs of the Duke research community. This can take many forms. Within Data and Visualization Services, we provide learning opportunities, consulting services, and computational resources to help Duke researchers implement their data-driven research projects. Monitoring and assessing new tools and platforms also helps DUL stay in tune with changing research norms and practices. Today the increasing focus on the importance of transparency and reproducibility has resulted in the development of new tools and resources to help researchers produce and share more reproducible results. One such tool is Code Ocean.

Code Ocean is a computational reproducibility platform that employs Docker technology to execute code in the cloud. The platform does two key things—it integrates the metadata, code, data and dependencies into a single ‘compute capsule’, ensuring that the code will run—and it does this in a single web interface that displays all inputs and results. Within the platform, it is possible to develop, edit or download the code, run routines, and visualize, save or download output, all from a personal computer. Users or reviewers can upload their own data and test the effects of changing parameters or modification of the code. Users can also share their data and code through the platform. Code Ocean provides a DOI for all capsules facilitating attribution and a permanent connection to any published work.

In order to help us understand and evaluate the usefulness of the Code Ocean platform to the Duke research community, DUL will be offering trial access to the Code Ocean cloud-based computational reproducibility platform starting on October 1, 2018. To learn more about what is included in the trial access and to sign up to participate, visit the Code Ocean pilot portal page.

If you have any questions, contact askdata@duke.edu.

Fall Data and Visualization Workshops

2017 Data and Visualization Workshops

Visualize, manage, and map your data in our Fall 2017 Workshop Series.  Our workshops are designed for researchers who are new to data driven research as well as those looking to expand skills with new methods and tools. With workshops exploring data visualization, digital mapping, data management, R, and Stata, the series offers a wide range of different data tools and techniques. This fall, we are extending our partnership with the Graduate School and offering several workshops in our data management series for RCR credit (please see course descriptions for further details).

Everyone is welcome at Duke Libraries workshops.  We hope to see you this fall!

Workshop Series by Theme

Data Management

09-13-2017 – Data Management Fundamentals
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-26-2017 – Writing a Data Management Plan
10-03-2017 – Increasing Openness and Reproducibility in Quantitative Research
10-18-2017 – Finding a Home for Your Data: An Introduction to Archives & Repositories
10-24-2017 – Consent, Data Sharing, and Data Reuse 
11-07-2017 – Research Collaboration Strategies & Tools 
11-09-2017 – Tidy Data Visualization with Python

Data Visualization

09-12-2017 – Introduction to Effective Data Visualization 
09-14-2017 – Easy Interactive Charts and Maps with Tableau 
09-20-2017 – Data Visualization with Excel
09-25-2017 – Visualization in R using ggplot2 
09-29-2017 – Adobe Illustrator to Enhance Charts and Graphs
10-13-2017 – Visualizing Qualitative Data
10-17-2017 – Designing Infographics in PowerPoint
11-09-2017 – Tidy Data Visualization with Python

Digital Mapping

09-12-2017 – Intro to ArcGIS Desktop
09-27-2017 – Intro to QGIS 
10-02-2017 – Mapping with R 
10-16-2017 – Cloud Mapping Applications 
10-24-2017 – Intro to ArcGIS Pro

Python

11-09-2017 – Tidy Data Visualization with Python

R Workshops

09-11-2017 – Intro to R: Data Transformations, Analysis, and Data Structures  
09-18-2017 – Reproducibility: Data Management, Git, & RStudio 
09-25-2017 – Visualization in R using ggplot2 
10-02-2017 – Mapping with R 
10-17-2017 – Intro to R: Data Transformations, Analysis, and Data Structures
10-19-2017 – Developing Interactive Websites with R and Shiny 

Stata

09-20-2017 – Introduction to Stata
10-19-2017 – Introduction to Stata