5 CDVS Online Learning Things

Within the Center for Data and Visualization Sciences (CDVS) we pride ourselves on providing numerous educational opportunities for the Duke community. Like many others during the COVID-19 pandemic, we have spent a large amount of time considering how to translate our in-person workshops to online learning experiences, explored the use of flipped classroom models, and learned together about the wonderful (and sometimes not so wonderful) features of common technology platforms (we are talking about you, Zoom).

We also wanted to more easily surface the various online learning resources we have developed over the years via the web. Recognizing that learning takes place both synchronously and asynchronously, we have made available numerous guides, slide decks, example datasets, and both short-form and full-length workshops on our Online Learning Page. Below we highlight 5 online learning resources that we thought others interested in data-driven research may wish to explore:

  • Mapping & GIS: R has become a popular and reproducible option for mapping and spatial analysis. Our Geospatial Data in R guide and workshop video introduce the use of the R language for producing maps. We cover the advantages of a code-driven approach such as R for visualizing geospatial data and demonstrate how to quickly and efficiently create a variety of map types for a website, presentation, or publication. 
  • Data Visualization: Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. But, there are many steps in that journey from exploration to information to influence, and many choices to make when putting it all together to tell your story. In our Effective Data Visualization workshop, we cover some basic guidelines for effective visualization, point out a few common pitfalls to avoid, and run through a critique and iterations of an existing visualization to help you start seeing better choices beyond the program defaults.
  • Data Science: QuickStart with R is our beginning data science module focusing on the Tidyverse, a data-first approach to data wrangling, analysis, and visualization. Beyond introducing the Tidyverse approach to reproducible data workflows, we offer a rich allotment of other R learning resources at our Rfun site: workshop videos, case studies, shareable data, and code. Links to all our data science materials can also be found collated on our Online Learning page (above).
  • Data Management: Various stakeholders are stressing the importance of practices that make research more open, transparent, and reproducible, including NIH, which has released a new data management & sharing policy. In collaboration with the Office of Scientific Integrity, our Meeting Data Management Plan Requirements workshop presents details on the new NIH policy, describes what makes a strong plan, and shows where to find guidance, tools, resources, and assistance for building funder-based plans.
  • Data Sources: The U.S. Census has been collecting information on persons and businesses since the late 18th century, and tackling this huge volume of data can be daunting. Our guide to U.S. Census data highlights many useful places to view or download this data, with the Product Comparisons tab providing in chart form a quick overview of product contents and features. Other tabs provide more details about these dissemination products, as well as about sources for Economic Census data.

In the areas of data science, mapping & GIS, data visualization, and data management, we cover many other topics and tools including ArcGIS, QGIS, Tableau, Python for tabular data and visualization, Adobe Illustrator, MS PowerPoint, effective academic posters, reproducibility, ethics of data management and sharing, and publishing research data. Access more resources and past recordings on our online learning page or go to our upcoming workshops list to register for a synchronous learning opportunity.

Change is coming – are you open to it?

This blog post is a collaboration between Paolo Mangiafico from ScholarWorks and Sophia Lafferty-Hess from the Center for Data and Visualization Sciences and the Duke Research Data Repository.

Open access journals have been around for several decades, and almost all researchers have read them or published in them by now. Perhaps less well known are trends toward more openness in sharing of data, methods, code, and other aspects of research – broadly called open scholarship. There are lots of good reasons to make your research outputs as open as possible, and increasing support at Duke for doing it.

There are many different variants of “open” – including goals of making research accessible to all, making data and methods transparent to increase reproducibility and trust, licensing research to enable broad re-use, and engagement with a variety of stakeholders, among other things. All of these provide benefits to the public and they also provide benefits to Duke researchers. There’s growing evidence that openly available publications and data result in more citations and greater impact (Colavizza 2020), and showing one’s work and making it available for replication helps build greater trust. There’s greater potential economic impact when others can build on research more quickly, and more avenues for collaboration and interdisciplinary engagement.

Recognizing the importance of making research outputs quickly and openly available to other researchers and the public, and supporting greater transparency in research, many funding agencies are now encouraging or requiring it. NIH has had a public access policy for over a decade, and NSF and other agencies have followed with similar policies. NIH has also released a new Data Management and Sharing policy that goes into effect in 2023 with more robust and clearer expectations for how to effectively share data. In Europe, government research funders back a program called Plan S, and in the United States, the recently passed U.S. Innovation and Competition Act (S. 1260) includes provisions that instruct federal agencies to provide free online public access to federally-funded research “not later than 12 months after publication in peer-reviewed journals, preferably sooner.”

The USICA bill aims to maximize the impact of federally-funded research by ensuring that final author manuscripts reporting on taxpayer-funded research are:

  • Deposited into federally designated or maintained repositories;
  • Made available in open and machine-readable formats; 
  • Made available under licenses that enable productive reuse and computational analysis; and
  • Housed in repositories that ensure interoperability and long-term preservation.

Duke got a head start on supporting researchers in making their publications open access in 2010, when Academic Council adopted an open access policy, which since then has been part of the Faculty Handbook (Appendix P). The policy provides the legal basis for Duke faculty to make their own research articles openly available on a personal or institutional website via a non-exclusive license, while also making it possible to comply with any requirements imposed by their journal or funder. Shortly after the policy was adopted, Duke Libraries worked with the Provost’s office to implement a service making open access easy for Duke researchers. DukeSpace, a repository integrated with the Scholars@Duke profile system, allows you to add a publication to your profile and deposit it to Duke’s open access archive in a single step, and have the open access link included in your citations alongside the link to the published version.

Duke Libraries also support a research data repository and services to help the Duke community organize, describe, and archive their research data for open access. This service, with support from the Provost’s office, provides both the infrastructure and curation staff to help Duke researchers make their data FAIR (Findable, Accessible, Interoperable, and Reusable). By publishing datasets with digital object identifiers (DOIs) and data citations, we create a value chain where making data available increases their impact and positions them as standalone research objects. The importance of data sharing specifically is also being formalized at Duke through the current Research Data Policy Initiative, which has a stated mission to “facilitate efficient and quality research, ensure data quality, and foster a culture of data sharing.” Together the Duke community is working to develop services, processes, procedures, and policies that broaden our contributions to society through public access to the outputs of our research.

Are you ready to make your work open? You can find more information about how to deposit your publications and data for open access at Duke on the ScholarWorks website, and consultants from Duke Libraries’ ScholarWorks Center for Scholarly Publishing and Center for Data and Visualization Sciences are available to help you find the best place to make your work open access, choose an appropriate license, and track how it’s being used.

Relational Thinking: Database Re-Modeling for Humanists

Author: Dr. Kaylee P. Alexander
Website: www.kayleealexander.com
Twitter: @kpalex91

Since the summer of 2018 I have been working with a set of nineteenth-century commercial almanacs for the city of Paris. As my dissertation focused heavily on the production of stone funerary markers during this period, I wanted to consult these almanacs to get a sense of how many workers were active in this field of production. Classified as marbriers (stonecutters), the makers of funerary monuments were often one and the same as those who executed various other stone goods and constructions. These almanacs represented a tremendous source of industry information, consistently recording enterprise names and addresses and, at times, specific information about the types of products an enterprise specialized in, any awards it had won for its work, and the new technologies it employed. And so I decided to make a database.

As a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences during the summer of 2020, I had the opportunity to explore some of the issues related to database construction and management faced by humanists working on data-based research projects. To work out some of these issues, I set up a MySQL database using my commercial almanacs data as a test case to determine which platforms and methods for creating queryable databases would be best suited for those working primarily in the humanities. In the process of setting up this database, it became increasingly clear that I needed to demonstrate the usefulness of this process for other humanists undertaking data-driven projects, as well as to identify ways of transforming single spreadsheets of data into relational data models without needing to know how to code. Thus, in this blog post I offer a few key points about relational database models that may be useful for scholars in the humanities and share my experiences in constructing a MySQL database from a single Excel spreadsheet.

First of all, some key terms and concepts. MySQL is an open-source relational database management system that uses SQL (Structured Query Language) to create, modify, manage and extract data from a relational database. A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its corresponding attributes. When working with a relational data model you want to make sure that your tables are normalized, or organized in such a way that reduces redundancy in the data set, increases consistency, and facilitates querying.

Although I had initially collected all of the information from the commercial almanacs in a single Excel spreadsheet for the sake of efficient data gathering, I knew that I ultimately wanted to reconfigure my data using a relational model that could be shared with and queried efficiently by others. The main benefits of a relational model are that it ensures consistency and that it allows combinations of queries that reveal relationships among the information in the various tables, relationships that would otherwise be difficult to determine from a single spreadsheet. An additional benefit of this system is the ability to add records and edit information without the risk of compromising other information contained in the database.

The first question I needed to ask myself was which entities from my original spreadsheet would become the basis for my relational database. In other words, how would my relational model be organized? What different relationships existed within the dataset, and which variables functioned as entities rather than attributes? One of the key factors in determining which variables would become the entities of my relational model was whether or not a given variable contained repeated values throughout the master sheet. Ultimately, determining the entities wasn’t the trickiest part. It became rather clear early on that it would be best to first create separate tables for businesses and business locations, which would be related via a table for annual activity (in the process of splitting my tables I would end up making more tables, but these were the key starting points).

The most difficult question I encountered was how I would go about splitting all this information as someone with very limited coding experience. How could I identify unique values and populate tables with relevant attributes without having to teach myself Python in a pinch, but also without having to retype over one hundred years’ worth of business records? Ultimately, I came up with a rather convoluted system that had me going back and forth between OpenRefine and Excel. Once I got the hang of my system it became almost second nature to me, but explaining it to others was another story. This made it abundantly clear that there was a lack of resources for demonstrating how one could create what were essentially a series of normalized tables from a flat data model. So, to make a very long story short, I broke down my convoluted process into a series of simple steps that required nothing more than Excel to transform a flat data model into a relational data model using the UNIQUE() and VLOOKUP() functions. This process is detailed in a library tutorial I developed geared towards humanists, consisting of both a video demonstrating the process and a PDF containing written instructions.
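For readers who do end up working in a scripting language, the same idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the Excel tutorial itself: the sample rows, the column names, and the helper functions unique and make_lookup are all invented for this example. The first helper plays the role of UNIQUE() (listing each distinct value once), and the lookup dictionary plays the role of VLOOKUP() (replacing repeated text with a key).

```python
# Hypothetical flat sheet: one row per almanac listing (all values invented).
flat_rows = [
    {"year": 1850, "business": "Dupont", "address": "12 rue de la Roquette"},
    {"year": 1851, "business": "Dupont", "address": "12 rue de la Roquette"},
    {"year": 1851, "business": "Martin", "address": "3 boulevard de Menilmontant"},
    {"year": 1852, "business": "Martin", "address": "12 rue de la Roquette"},
]


def unique(values):
    """Return distinct values in first-seen order, much like Excel's UNIQUE()."""
    return list(dict.fromkeys(values))


def make_lookup(values):
    """Map each distinct value to an integer key starting at 1."""
    return {value: key for key, value in enumerate(unique(values), start=1)}


# Step 1: build one table per entity, each with its own key column.
business_ids = make_lookup(row["business"] for row in flat_rows)
location_ids = make_lookup(row["address"] for row in flat_rows)

businesses = [{"business_id": k, "name": n} for n, k in business_ids.items()]
locations = [{"location_id": k, "address": a} for a, k in location_ids.items()]

# Step 2: the linking table stores keys instead of repeated text,
# which is the job VLOOKUP() does in the spreadsheet version.
activity = [
    {
        "year": row["year"],
        "business_id": business_ids[row["business"]],
        "location_id": location_ids[row["address"]],
    }
    for row in flat_rows
]

for row in activity:
    print(row)
```

Each business name and each address now appears exactly once, so a correction to a name only has to be made in one place.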

In the end, all I needed to do was construct the database itself. In order to do this I worked with phpMyAdmin, a free web-based user interface for constructing and querying MySQL databases. Using phpMyAdmin, I was able to easily upload my normalized data tables, manage and query my database, and easily connect to Tableau for data visualization purposes using phpMyAdmin’s user management capabilities.
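To give a sense of what querying a database like this looks like, here is a minimal sketch. It makes several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL and phpMyAdmin (so it runs without a server), and the table names, column names, and sample records are invented rather than taken from the actual almanac database. The SQL itself is the kind of join a relational model makes possible.

```python
import sqlite3

# An in-memory SQLite database stands in for the MySQL server here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE businesses (business_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE locations (location_id INTEGER PRIMARY KEY, address TEXT NOT NULL);
    CREATE TABLE activity (
        year INTEGER NOT NULL,
        business_id INTEGER NOT NULL REFERENCES businesses (business_id),
        location_id INTEGER NOT NULL REFERENCES locations (location_id)
    );
    """
)

# Hypothetical normalized tables (all values invented for illustration).
conn.executemany(
    "INSERT INTO businesses VALUES (?, ?)",
    [(1, "Dupont"), (2, "Martin")],
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?)",
    [(1, "12 rue de la Roquette"), (2, "3 boulevard de Menilmontant")],
)
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(1850, 1, 1), (1851, 1, 1), (1851, 2, 2), (1852, 2, 1)],
)

# Which businesses were listed in a given year, and where?
listing_query = """
    SELECT a.year, b.name, l.address
    FROM activity AS a
    JOIN businesses AS b ON b.business_id = a.business_id
    JOIN locations AS l ON l.location_id = a.location_id
    WHERE a.year = ?
    ORDER BY b.name
"""
rows_1851 = conn.execute(listing_query, (1851,)).fetchall()

# How many distinct businesses were active each year?
counts_by_year = conn.execute(
    "SELECT year, COUNT(DISTINCT business_id) FROM activity "
    "GROUP BY year ORDER BY year"
).fetchall()

print(rows_1851)
print(counts_by_year)
```

The second query is the sort of question that motivated the project: a count of active enterprises per year, which is awkward to get from a flat sheet where names repeat.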


Dr. Kaylee P. Alexander is a graduate of the Department of Art, Art History & Visual Studies, where she was also a research assistant with the Duke Art, Law & Markets Initiative (DALMI). Her dissertation research focuses on the visual culture of the cemetery and the market for funerary monuments in nineteenth-century Paris. In the summer of 2020, she served as a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences at Duke University Libraries.

Standardizing the U.S. Census

Census Tract Boundary Changes
(https://datasparkri.org/maps/)

The questions asked in the U.S. Census have changed over time to reflect both the data collecting needs of federal agencies and evolving societal norms. Census geographies have also evolved in this time period to reflect population change and shifting administrative boundaries in the United States.

Attempts to Provide Standardized Data

For the researcher who needs to compare demographic and socioeconomic data over time, this variability in data and geography can be problematic. Various data providers have attempted to harmonize questions and to generate standard geographies using algorithms that allow for comparisons over time. Some of the projects mentioned in this post have used sophisticated weighting techniques to make more accurate estimates. See, for instance, some of the NHGIS documentation on standardizing data from 1990 and from 2000 to 2010 geography.

NHGIS

The NHGIS Time Series Tables link census summary statistics across time and may require two types of integration: attribute integration, ensuring that the measured characteristics in a time series are comparable across time, and geographic integration, ensuring that the areas summarized by time series are comparable across time.

For attribute integration, NHGIS often uses “nominally integrated tables,” where the aggregated data is presented as it was compiled. For instance, one can compare “Durham County” data from 1960 and 2000 based on the common name of the county.

For geographically standardized tables,  when data from one year is aggregated to geographic areas from another year, NHGIS provides documentation with details on the weighting algorithms they use:

1990 to 2010 Tract changes in Cincinnati
(https://www.nhgis.org/documentation/time-series/1990-blocks-to-2010-geog)

NHGIS has resolved discrepancies in the electronic boundary files, as they illustrate here (an area of Cincinnati).

Social Explorer

The Social Explorer Comparability Data is similar to the NHGIS Time Series Tables, but with more of a drill-down consumer interface. (Go to Tables and scroll down to the Comparability Data.) Only 2000 to 2010 data are available at the state, county, and census tract level.  It provides data reallocated from the 2000 U.S. decennial census to the 2010 geographies, so you can get the earlier data in 2010 geographies for better comparison with 2010 data.

LTDB

The Longitudinal Tract Database (LTDB) developed at Brown University provides normalized boundaries at the census tract level for 1970-2010. Question coverage over time varies. The documentation for the project is available online.

NC State has translated this data into ArcGIS geodatabase format.  They provide a README file, a codebook, and the geodatabase available for download.

Do-It-Yourself

If you need to normalize data that isn’t yet available this way, GIS software may be able to help. Using intersection and re-combining techniques, this software may be able to generate estimates of older data in more recent geographies.  In ArcGIS, this involves setting the ratio policy when creating a feature layer, to allow apportioning numeric values in attributes among the various overlapping geographies. This involves an assumption of an even geographic distribution of the variable across the entire area (which is not as sophisticated as some of the algorithms used by groups such as NHGIS).
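The arithmetic behind this kind of area-based apportioning can be sketched without any GIS software. The example below is a simplified illustration, not ArcGIS code: it assumes every tract is an axis-aligned rectangle, and the tract names, coordinates, and counts are invented. Each older tract's count is split among the newer tracts in proportion to the share of the older tract's area that falls inside each one, which is exactly the even-distribution assumption described above.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """An axis-aligned rectangle standing in for a tract boundary."""

    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(width, 0.0) * max(height, 0.0)


# Invented data: two older tracts with population counts, two newer tracts.
old_tracts = {
    "old_A": (Rect(0, 0, 2, 2), 1000),
    "old_B": (Rect(2, 0, 4, 2), 600),
}
new_tracts = {
    "new_1": Rect(0, 0, 3, 2),
    "new_2": Rect(3, 0, 4, 2),
}


def apportion(old, new):
    """Estimate counts for the new tracts by area-weighted apportioning."""
    estimates = {}
    for new_id, new_shape in new.items():
        total = 0.0
        for old_shape, count in old.values():
            # Share of the old tract's area that lies inside the new tract.
            share = overlap_area(old_shape, new_shape) / old_shape.area
            total += count * share
        estimates[new_id] = total
    return estimates


estimates = apportion(old_tracts, new_tracts)
print(estimates)
```

Here half of old_B falls in each new tract, so its 600 people are split 300 and 300, while all of old_A lands in new_1. The total is preserved, but the split is only as good as the assumption that people are spread evenly across old_B.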

Another research strategy employs crosswalks to harmonize census data over time. Crosswalks are tables that let you proportionally assign data from one year to another or to re-aggregate from one type of geography to another.  Some of these are provided by the NHGIS geographic crosswalk files, the Census Bureau’s geographic relationship files, and the Geocorr utility from the Missouri Census Data Center.
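As a rough illustration of how a crosswalk is applied, the sketch below treats it as a list of (source area, target area, weight) rows. The tract identifiers, weights, and counts are invented, and the layout is a simplification; the actual NHGIS, Census Bureau, and Geocorr files each have their own column names and formats, so check their documentation before adapting this.

```python
from collections import defaultdict

# Hypothetical crosswalk: each row says what fraction of a 2000 tract's
# count should be assigned to a given 2010 tract. Weights for each source
# tract sum to 1.
crosswalk = [
    ("tract_2000_A", "tract_2010_X", 1.0),
    ("tract_2000_B", "tract_2010_X", 0.25),
    ("tract_2000_B", "tract_2010_Y", 0.75),
]

# Invented counts reported in 2000 geography.
counts_2000 = {"tract_2000_A": 400, "tract_2000_B": 800}


def apply_crosswalk(counts, crosswalk_rows):
    """Reallocate counts from source areas to target areas using weights."""
    result = defaultdict(float)
    for source_id, target_id, weight in crosswalk_rows:
        result[target_id] += counts[source_id] * weight
    return dict(result)


counts_in_2010_geography = apply_crosswalk(counts_2000, crosswalk)
print(counts_in_2010_geography)
```

Because the weights for each source tract sum to 1, the overall total (1,200 here) is the same before and after reallocation, which is a useful sanity check on any crosswalk you apply.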

You can contact CDVS at askdata@duke.edu to inquire about the options for your project.

Share More Data in the Duke Research Data Repository!

We are happy to announce expanded features for the public sharing of large scale data in the Duke Research Data Repository! The importance of open science for the public good is more relevant than ever and scientific research is increasingly happening at scale. Relatedly, journals and funding agencies are requiring researchers to share the data produced during the course of their research (for instance see the newly released NIH Data Management and Sharing Policy). In response to this growing and evolving data sharing landscape, the Duke Research Data Repository team has partnered with Research Computing and OIT to integrate the Globus file transfer system to streamline the public sharing of large scale data generated at Duke. The new RDR features include:

  • A streamlined workflow for depositing large scale data to the repository
  • An integrated process for downloading large scale data (datasets over 2GB) from the repository
  • New options for exporting smaller datasets directly through your browser
  • New support for describing and using collections to highlight groups of datasets generated by a project or group (see this example)
  • Additional free storage (up to 100 GB per deposit) to the Duke community during 2021!

While using Globus for both upload and download requires a few configuration steps by end users, we have strived to simplify this process with new user documentation and video walk-throughs. This is the perfect time to share those large(r) datasets (although smaller datasets are also welcome!).

Contact us today with questions or get started with a deposit!

Publish Your Data: Researcher Highlight

This post was authored by Shadae Gatlin, DUL Repository Services Analyst and member of the Research Data Curation Team.

Collaborating for openness

The Duke University Libraries’ Research Data Curation team has the privilege to collaborate with exceptional researchers and scholars who are advancing their fields through open data sharing in the Duke Research Data Repository (RDR). One such researcher, Martin Fischer, Ph.D., Associate Research Professor in the Departments of Chemistry and Physics, recently discussed his thoughts on open data sharing with us. A trained physicist, Dr. Fischer describes himself as an “optics person”; his work ranges from developing microscopes that can examine melanin in tissues to looking at pigment distribution in artwork. He has published data in the RDR on more than one occasion and says of the data deposit process, “I can only say, it was a breeze.”

Dr. Fischer recalls his first time working with the team as being “much easier than I thought it was going to be.” When Dr. Fischer and colleagues experienced obstacles trying to set up OMERO, a server to host their project data, they turned to the Duke Research Data Repository as a possible solution for storing the data. This was Dr. Fischer’s first foray into open data publishing, and he characterizes the team as being responsive and easy to work with. Due to the large size of the data, the team even offered to pick up the hard drive from Fischer’s office. After they acquired the data, the team curated, archived, and then published it, resulting in Fischer’s first dataset in the RDR.

Why share data?

When asked why he believes open data sharing is important, Dr. Fischer says that “sharing data creates an opportunity for others to help develop things with you.” For example, after sharing his latest dataset, which evaluates the efficacy of masks to reduce the transmission of respiratory droplets, Fischer received requests for a non-proprietary option for data analysis instead of using the team’s data analysis scripts written for the commercial program Mathematica. Peers offered to help develop a Python script, which is now openly available, and for which the developers used the RDR data as a reference. As of January 2021, the dataset has had 991 page views.

Dr. Fischer appreciates the opportunity for research development that open data sharing creates, saying, “Maybe somebody else will develop a routine, or develop something that is better, easier than what we have”. Datasets deposited in the RDR are made publicly available for download and receive a permanent DOI link, which makes the data even more accessible.

In addition to the benefits of long-term preservation and access that publishing data in the RDR provides, Dr. Fischer finds that sharing his data openly encourages a sense of accountability. “I don’t have a problem with other people going in and trying, and making sure it’s actually right. I welcome the opportunity for feedback”. With many research funding agencies introducing policies for research data management and data sharing practices, the RDR is a great option for Duke researchers. Every dataset that is accepted into the RDR is carefully curated to meet FAIR guidelines and optimized for future reuse.

Collaborating with researchers like Dr. Martin Fischer is one of the highlights of working on the Research Data Curation team. We look forward to seeing what fascinating data 2021 will bring to the RDR and working with more Duke researchers to share their data with the world.

Dr. Fischer’s Work in the Duke Research Data Repository:

  • Wilson, J. W., Degan, S., Gainey, C. S., Mitropoulos, T., Simpson, M. J., Zhang, J. Y., & Warren, W. S. (2019). Data from: In vivo pump-probe and multiphoton fluorescence microscopy of melanoma and pigmented lesions in a mouse model. Duke Digital Repository. https://doi.org/10.7924/r4cc0zp95
  • Fischer, E., Fischer, M., Grass, D., Henrion, I., Warren, W., & Westman, E. (2020). Video data files from: Low-cost measurement of facemask efficacy for filtering expelled droplets during speech. Duke Research Data Repository. V2 https://doi.org/10.7924/r4ww7dx6q

Celebrating GIS Day 2020

About GIS Day

GIS Day is an international celebration of geographic information systems (GIS) technology. The event provides an opportunity for users of geospatial data and tools to build knowledge, share their work, and explore the benefits of GIS in their communities. Since its establishment in 1999, GIS Day events have been organized by nonprofit organizations, universities, schools, public libraries, and government agencies at all levels.

Held annually on the third Wednesday of November, this year GIS Day is officially today. Happy GIS Day! CDVS has participated in Duke GIS Day activities on campus in past years, but with COVID-19, we had to find other ways to celebrate.

A (Virtual) Map Showcase

To mark GIS Day this year, CDVS is launching an ArcGIS StoryMaps showcase! We invite any students, faculty, and staff to submit a story map to highlight their mapping and GIS work. Send us an email at askdata@duke.edu if you would like to add yours to the collection. We are keen to showcase the variety of GIS projects happening across Duke, and we will add contributions to the collection as we receive them. Our first entry is a story map created by Kerry Rork as part of a project for undergraduate students that used digital mapping to study the English Civil Wars.

Why Story Maps?

If you aren’t familiar with ArcGIS StoryMaps, this easy-to-use web application integrates maps with narrative text, images, and video. The platform’s compelling, interactive format can be an effective communication tool for any project with a geographic component. We have seen a surge of interest in story maps at Duke, with groups using them to present research, give tours, and provide instruction. Check out the learning resources to get started, or contact us at askdata@duke.edu to schedule a consultation with one of our GIS specialists.

CDVS Chat or Zoom for Online Data Advice

As students and classes moved online in the spring of 2020, the Center for Data and Visualization Sciences realized that it was time to expand our existing email (askdata@duke.edu) and lab based consultation services to meet the data demands of online learning and remote projects. Six months and hundreds of online consultations later, we have developed a new appreciation for the online tools that allow us to partner with Duke researchers around the world. Whether you prefer to chat, zoom, or email, we hope to work with you on your next data question!

Chat

Ever had a quick question about how to visualize or manage your data, but weren’t sure where to get help? Having trouble figuring out how to get the data software to do what you need for class/research? CDVS offers roughly thirty hours of chat support each week.  Data questions on chat cover our full range of data support. If we cannot resolve a question in the chat session, we will make a referral for a more extended consultation.

Zoom

We’re going to be honest… we miss meeting Duke students and faculty in the Brandaleone Lab in the Edge and consulting on data problems! However, virtual data consultations over Zoom have some advantages over in-person data consultations at the library. With Zoom features such as screen sharing, multiple participants, and chat, we can reach both individuals and project teams in a format where everyone can see the screen and sharing resource links is simple. As of October 1st, we have used Zoom to consult on questions ranging from creating figures in the R programming language to advising Bass Connections teams on the best way to visualize their research. We are happy to schedule Zoom consultations via email at: askdata@duke.edu.

Just askdata@duke.edu

Even with our new data chat service and video chat services, we are still delighted to advise on questions over email at askdata@duke.edu. As the days grow shorter this fall and project deadlines loom, we look forward to working with you to resolve your data challenges!

Flipping Data Workshops

John Little is the Data Science Librarian in Duke Libraries’ Center for Data and Visualization Sciences. Contact him at askdata@duke.edu.

The Center for Data and Visualization Sciences is open, and has been since March! We never closed. We’re answering questions, teaching workshops, and making remote virtual machines available, and business is booming.

What’s changed? Due to COVID-19, the CDVS staff are working remotely. While we love meeting with people face-to-face in our lab, that is not currently possible. Meanwhile, digital data wants to be analyzed and our patrons still want to learn. By late spring I began planning to flip my workshops for fall 2020. My main goal was to transform a workshop into something more rewarding than watching the video of a lecture, something that lets the learner engage at their pace, on their terms.  

How to flip

Flipping the workshop is a strategy to merge student engagement and active learning.  In traditional instruction, a teacher presents a topic and assigns work aimed at reinforcing the lesson. 

Background: I offer discrete two-hour workshops that are open to the entire university. There are very few prerequisites and people come with their own level of experience. Since the workshops attract a broad audience, I focus on skills and techniques using general examples that reliably convey information to all learners. In this environment, discipline-specific examples risk losing large portions of the audience. As an instructor I must try to leave my expectations of students’ skills and background knowledge at the door.

In a flipped classroom, materials are assigned and made available in advance. In this way, group Zoom-time can be used for questions and examples. This instruction model allows students to learn at their own pace, pause and rewind videos, practice exercises, or speed up lectures. During the workshop, students can bring questions relevant to their particular point of confusion.  

The main instructor goal is to facilitate a topic for student engagement that puts the students in control. This approach has a democratizing effect that allows students to become more active and familiar with the materials.  With flipped workshops, student questions appear to be more thoughtful and relevant. When the student is invited to take charge of their learning, the process of investigation becomes their self-driven passion.  

For my flipped workshop materials, I offer basic videos to introduce and reinforce particular techniques. I try to keep each video short, less than 25 minutes. At the same time, I offer plenty of additional videos on different topical details. More in-depth videos can cover important details that may feel ancillary or even demotivating, even if those details improve task efficiency. Sometimes the details are easier to digest when the student is engaged. This means students start at their own level and gain background when they’re ready. Students may not return to the background material for weeks, but the materials will be ready when they are.

Flipping a consultation?

The Center for Data & Visualization Sciences provides open workshops and Zoom-based consulting. The flipped workshop model aligns perfectly with our consulting services since students can engage with the flipped workshop materials (recordings, code, exercises) at any time. When the student is ready for more information, whether a general question or a specific research question, I can refer to targeted background materials during my consultations. With the background resources, I can keep my consultations relevant and brief while also reducing the risk of under-informing.  

For my flipped workshop on R, or other CDVS workshops, please see our workshop page.

Automated Tagging of Historical, Non-English Sources with Named Entity Recognition (NER): A Resource

Felipe Álvarez de Toledo López-Herrera is a Ph.D. candidate in the Art, Art History, and Visual Studies Department at Duke University and a Digital Humanities Graduate Assistant for Humanities Unbounded, 2019-2020. Contact him at askdata@duke.edu.

[This blogpost introduces a GitHub Repository that provides resources for developing NER projects in historical languages. Please do not hesitate to use the code and ideas made available there, or contact me if there are any issues we could discuss.]

Understanding Historical Art Markets: an Automated Approach

When the Sevillian painters’ guild archive was lost in the 19th century, with it vanished lists of master painters, journeymen, apprentices and possibly dealers recorded in the guild’s registration books. Nevertheless, researchers working for over a century in other Sevillian archives have published almost twenty volumes of archival documents. These transcriptions, excerpts and summaries reflect the activities of local painters, sculptors, and architects, among other artisans. I use this evidence as a source of data on early modern Seville’s art market in my dissertation research. For this, I have to extract information from many documents in order to query and discern larger patterns.

Image of books and extracted text examples.
Left. Some of the volumes used in this research, in my home library. I have managed to acquire these second-hand; others I have borrowed from libraries. Right. A scan of one of the pages of these books, showing some of the documents from which we extracted data.

Instead of manually keying this information into a spreadsheet or other form of data storage, I chose to scan my sources and test an automated approach using Natural Language Processing. Last semester, within the context of the Humanities Unbounded Digital Humanities Graduate Assistantship, I worked with Named-Entity Recognition (NER), a technique in which computers can be taught to identify named real-world objects in texts. NER models underperform on historical texts because they are trained on modern documents such as news or Wikipedia articles. Furthermore, NLP developers have focused most of their efforts on English language models, resulting in underdeveloped models for other languages. For these reasons, I had to retrain a model to be useful for my purposes. In this blogpost, I give an overview of the process of adapting NER tools for use on non-English historical sources.

Defining Named-Entity Recognition

Named-Entity Recognition (NER) is a set of processes in which a computer program is trained to identify and categorize real-world objects with proper names in a corpus of texts. It can be used to tag names in documents without a standardized structure and label them as people, locations or organizations, among other categories.

Named entity recognition example tags in text
Named-entity recognition is the automated tagging of real-world objects with names, such as people, locations, organizations, or monetary amounts, within texts.
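As a rough illustration of what a "tag" looks like as data, the sketch below stores each entity as a start offset, an end offset and a label. The sentence, the name in it and the label names (PER, LOC, MONEY) are invented for this example and are not taken from the corpus described here; the helper function tag is likewise hypothetical.

```python
# Hypothetical sentence, invented for illustration only.
text = "Pedro de Campos, pintor, vecino de Sevilla, recibe 50 ducados."


def tag(source, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase.

    The end offset is exclusive, so source[start:end] == phrase.
    """
    start = source.index(phrase)
    return (start, start + len(phrase), label)


# Hand-written tags standing in for what an NER model would produce.
entities = [
    tag(text, "Pedro de Campos", "PER"),
    tag(text, "Sevilla", "LOC"),
    tag(text, "50 ducados", "MONEY"),
]

for start, end, label in entities:
    print(label, "->", text[start:end])
```

Storing offsets rather than copies of the words keeps each tag tied to its exact position in the document, which matters when the same name appears more than once.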

Code libraries such as Spacy, NLTK or Stanford CoreNLP provide widely-tested toolkits for NER. I decided that Spacy would be the best choice for my purposes. Though its Spanish model included fewer label categories, it performed better out-of-the-box. Importantly, the model worked better for certain basic language structures such as recognizing compound names (last names with several components, such as my own). The Spacy library also proved user-friendly for those of us with little coding knowledge. Its pre-programmed data processing pipeline is easy to modify, provided that you have a basic understanding of Python. In my case, I had the time and motivation to acquire this literacy.

I sought to improve the model’s performance in two ways. First, I retrained it on a subset of my own data. This improved performance and allowed me to add new label categories such as dates, monetary amounts and objects. Additionally, I added a component that modernized my texts’ spelling to make them more conducive to proper tagging.
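To make the spelling-modernization idea concrete, here is a minimal sketch of a rule-based normalizer. The rules shown (a few word substitutions and a single character substitution) are illustrative assumptions, not the rules used in this project, and the function name modernize is hypothetical. A real component would need a much larger rule set and would be registered as a step in the Spacy pipeline, which is not shown here.

```python
import re

# Illustrative rules only: assumptions for this sketch, not the project's rules.
WORD_RULES = {
    "dixo": "dijo",
    "vezino": "vecino",
    "ymagen": "imagen",
}
CHAR_RULES = [
    (re.compile("ç"), "z"),
]


def modernize(text):
    """Apply character-level rules first, then whole-word substitutions."""
    for pattern, replacement in CHAR_RULES:
        text = pattern.sub(replacement, text)

    def swap(match):
        word = match.group(0)
        modern = WORD_RULES.get(word.lower())
        if modern is None:
            return word
        # Preserve an initial capital letter from the original word.
        return modern.capitalize() if word[0].isupper() else modern

    return re.sub(r"\w+", swap, text)


print(modernize("Dixo que es vezino de la plaça"))
```

Matching whole words rather than substrings is a deliberately cautious choice: it avoids rewriting letters inside proper names, which are exactly the strings the tagger needs to see unchanged.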

Training NER on Historical Spanish Text: Process and Results

To improve the model, I needed training data – a “gold standard” of perfectly-tagged text. First, I ran the model on a set of 400 documents, which resulted in a set of preliminary tags. Then, I corrected these tags with a tool called Dataturks and reformatted the output to work with Spacy. Once this data was ready, I split it 80-20, which means running a training loop on 80% of correctly-tagged texts to adjust the performance of the model, and reserving 20% for testing or evaluating the model on data it had not yet seen.
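The 80-20 split itself can be sketched in a few lines. This is a generic illustration, not the code from the repository: the function name split_train_test, the fixed random seed and the placeholder document IDs are all assumptions made for the example.

```python
import random


def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder IDs standing in for the 400 corrected documents.
documents = [f"doc_{i}" for i in range(400)]
train_docs, test_docs = split_train_test(documents)

print(len(train_docs), len(test_docs))
```

Shuffling before cutting matters when the documents are stored in a meaningful order, such as by volume or by date, because otherwise the test set would contain only the last part of the collection.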

Named Entity Recognition output
Final output as stored in my database for one particular document with ID=5.

Finally, I evaluated whether all these changes actually improved the model’s performance, saved the updated model, and exported the output in a format that worked for my own database. For my texts, the model initially worked at around 36% recall (the percentage of true entities that were identified by the model), compared to 89% recall on modern texts as evaluated by Spacy. After training, recall increased to 64%. Some tags, such as person or location, perform especially well (85% and 81%, respectively). Though the numbers are not perfect, they show a marked improvement, generated with little training data.
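Recall, as defined above, is the number of true entities the model found divided by the total number of true entities. The sketch below computes it for one document. It assumes that a prediction only counts when its offsets and label match the gold standard exactly; how partial matches were scored in this project is not stated, so that criterion is an assumption, and the spans are invented.

```python
def recall(gold, predicted):
    """Share of gold-standard entities that also appear in the predictions."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    found = gold_set & set(predicted)
    return len(found) / len(gold_set)


# Invented (start, end, label) spans for a single document.
gold = [(0, 15, "PER"), (35, 42, "LOC"), (51, 61, "MONEY"), (70, 74, "DATE")]
predicted = [(0, 15, "PER"), (35, 42, "LOC"), (51, 61, "MONEY"), (80, 85, "ORG")]

print(f"recall = {recall(gold, predicted):.2f}")
```

Here the model finds three of the four gold entities, so recall is 0.75. The extra ORG prediction does not lower recall; it would lower precision, which is a separate measure.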

For the 8,607 documents processed, the process has resulted in 59,191 tags referring to people, locations, organizations, dates, objects and money. Next steps include finding descriptors of entities within the text, and modeling relationships between entities appearing in the same document. For now, a look at the detected tags underscores the potential of NER for automating data collection in data-driven humanities research.