All posts by Eric Monson, Ph.D.

Relational Thinking: Database Re-Modeling for Humanists

Author: Dr. Kaylee P. Alexander
Website: www.kayleealexander.com
Twitter: @kpalex91

Since the summer of 2018 I have been working with a set of nineteenth-century commercial almanacs for the city of Paris. As my dissertation focused heavily on the production of stone funerary markers during this period, I wanted to consult these almanacs to get a sense of how many workers were active in this field of production. Classified as marbriers (stonecutters), the makers of funerary monuments were often one in the same as those who executed various other stone goods and constructions. These almanacs represented a tremendous source of industry information, consistently recording enterprise names and addresses, as well as, at times, specific information about the types of products the enterprise specialized in and any awards they might have won for their work, and what types of new technologies they employed. An so I decided to make a database.

As a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences during the summer of 2020, I had the opportunity to explore some of the issues related to database construction and management faced by humanists working on data-based research projects. In order to work out some of these issues, I worked to set up a MySQL database using my commercial almanacs data as a test case to determine which platforms and methods for creating queryable databases would be best suited for those working primarily in the humanities. In the process of setting up this database what became increasingly clear was the need to make clear the usefulness of this process for other humanists undertaking data-driven projects, as well as identify ways of transforming single spreadsheets of data into relational data models without needing to know how to code. Thus, in this blog post I offer a few key points about relational database models that may be useful for scholars in the humanities and share my experiences in constructing a MySQL database from a single Excel spreadsheet.

First of all, some key terms and concepts. MySQL is an open-source relational database management system that uses SQL (Structured Query Language) to create, modify, manage and extract data from a relational database. A relational data model organizes data into a series of tables containing columns (‘attributes’) and rows (‘records’) with unique keys identifying each record. Each table (or, ‘relation’) represents a single entity type and its corresponding attributes. When working with a relational data model you want to make sure that your tables are normalized, or organized in such a way that reduces redundancy in the data set, increases consistency, and facilitates querying.

Although for the purposes of efficient data gathering, I had initially collected all of the information from the commercial almanacs in a single Excel spreadsheet, I knew that I ultimately wanted to reconfigure my data using a relational model that could be shared with and queried efficiently by others. The main benefits of a relational model include ensuring consistency as well as performing combinations of queries to understand various relationships that exist among the information contained in the various tables that would be otherwise difficult to determine from a single spreadsheet. An additional benefit to this system is the ability to add records and edit information without the risk of compromising other information contained in the database.

The first question I needed to ask myself was which entities from my original spreadsheet would become the basis for my relational database. In other words, how would my relational model be organized? What different relationships existed within the dataset, and which variables functioned as entities rather than attributes? One of the key factors in determining which variables would become the entities of my relational model, was the question of whether or not a given variable contained repeated values throughout the master sheet. Ultimately determining the entities wasn’t the trickiest part. It because rather clear early on that it would be best to first create separate tables for businesses and business locations, which would be related via a table for annual activity (in the process of splitting my tables I would end up making more tables, but these were the key starting points).

The most difficult question I encountered was how I would go about splitting all this information as someone with very limited coding experience. How could identify unique values and populate tables with relevant attributes without having to teach myself Python in a pinch, but also without having to retype over one hundred years’ worth of business records? Ultimately, I came up with a rather convoluted system that had me going back and forth between OpenRefine and Excel. Once I got the hang of my system it became almost second nature to me but explaining it to others was another story. This made it abundantly clear that there was a lack of resources for demonstrating how one could create what were essentially a series of normalized tables from a flat data model. So, to make a very long story short, I broke down my convoluted process into a series of simple steps that required nothing more that Excel to transform a flat data model into a relational data model using the UNIQUE() and VLOOKUP() functions. This processes is detailed in a library tutorial I developed geared towards humanists, consisting of both a video demonstrating the process and a PDF containing written instructions.

In the end, all I needed to do was construct the database itself. In order to do this I worked with phpMyAdmin, a free web-based user interface for constructing and querying MySQL databases. Using phpMyAdmin, I was able to easily upload my normalized data tables, manage and query my database, and easily connect to Tableau for data visualization purposes using phpMyAdmin’s user management capabilities.


Kaylee Alexander portrait

Dr. Kaylee P. Alexander is a graduate of the Department of Art, Art History & Visual Studies, where she was also a research assistant with the Duke Art, Law & Markets Initiative (DALMI). Her dissertation research focuses on the visual culture of the cemetery and the market for funerary monuments in nineteenth-century Paris. In the summer of 2020, she served as a Humanities Unbounded graduate assistant with the Center for Data and Visualization Sciences at Duke University Libraries.

Can’t we just make a Venn diagram?

When I’m teaching effective visualization principles, one of the most instructive processes is critiquing published visualizations and reviewing reworks done by professionals. I’ll often show examples from Cole Nussbaumer Knaflic’s blog, Storytelling with Data, and Jon Schwabish’s blog, The Why Axis. Both brilliant! (Also, check out my new favorite blog Uncharted, by Lisa Charlotte Rost, for wonderful visualization discussions and advice!)

What we don’t usually get to see is the progression of an individual visualization throughout the design process, from data through rough drafts to final product. I thought it might be instructive to walk through an example from one of my recent consults. Some of the details have been changed because the work is unpublished and the jargon doesn’t help the story.

Data full of hits and misses

Five tests data, hits and misses per patient A researcher came to me for help with an academic paper figure. He and his collaborator were comparing five literature-accepted methods for identifying which patients might have a certain common disease from their medical records. Out of 60,000 patients, about 10% showed a match with at least one of the tests. The resulting data was a spreadsheet with a column of patient IDs, and five columns of tests, with a one if a patient was identified as having the disease by that particular test, and a zero if their records didn’t match for that test. As you can see in the figure, there were many inconsistencies between who seemed to have the disease across the five tests!

So you want to build a snowman

Five tests overlap, original Venn diagram The researchers wanted a visualization to represent the similarities and differences between the test results. Specifically, they wanted to make a Venn diagram, which consists of ellipsoids representing overlapping sets. They had an example they’d drawn by hand, but wanted help making it into an accurate depiction of their data. I resisted, explaining that I didn’t know of a program that would accomplish what he wanted, and that it is likely to be mathematically impossible to take their five-dimensional data set and represent it quantitatively as a Venn diagram in 2D. Basically, you can’t get the areas of all of the overlapping regions to be properly proportional to the number of patients that had hits on all of the combinations of the five tests. The Venn diagram works fine schematically, as a way to get across an idea of set overlap, but it would never be a data visualization that would reveal quantitative patterns from their findings. At worst, it would be a misleading distortion of their results.

Count me in

Five tests data pairwise table with colored cells His other idea was to show the results as a table of numbers in the form of a matrix. Each of the five tests were listed across the top and the side, and the cell contents showed the quantity of patients who matched on that pair of tests. The number matching on a single test was listed on the diagonal. Those patterns can be made more visual by coloring the cells with “conditional formatting” in Excel, but the main problem with the table is that it hides a bunch of interesting data! We don’t see any of the numbers for people who hit on the various combinations of three tests, or four, or the ones that hit on all five.

Five test data heatmap and number of tests hit per patient

I suggested we start exploring the hit combinations by creating a heatmap of the original data, but sort the patients (along the horizontal axis) by how many tests tests they hit (listing the tests up the vertical axis). Black circles are overlaid showing the number of tests hit for any given patient.

There are too many patients / lines here to show clearly the combinations of tests, but this visualization already illuminated two things that made sense to the researchers. First, there is a lot of overlap between ALPHA (a) and BETA-based (b) tests, and between GAMMA Method (c) and Modified GAMMA (d), because these test pairs are variations of each other. Second, the overlaps indicate a way the definitions are logically embedded in each other; (a) is a subset of (b), and (b) is for the most part a subset of (c).

Five tests combinations Tableau bubble plot

My other initial idea was to show the numbers of patients identified in each of the test overlaps as bubbles in Tableau. Here I continue the shorthand of labeling each test by the letters [a,b,c,d,e], ordered from ALPHA to OMEGA. The number of tests hit are both separated in space and encoded in the color (low to high = light to dark).

Add some (effective?) dimensions

I felt the weakness of this representation was that the bubbles were not spatially associated with their corresponding tests. Inspired by multi-dimensional radial layouts such as those used in the Stanford dissertation browser, I created a chart (in Adobe Illustrator) with five axes for the tests. I wanted each bubble to “feel a pull” from each of the passed tests, so it made sense to place the five-hit “abcde” bubble at the center, and each individual, “_b___”, “__c__”, ____e” bubble right by its letter – making the radius naturally correspond to the number of test hit. Other bubbles were placed (manually) in between their combination of axes / tests.

Five test combinations hits polar bubble plot

The researchers liked this version. It was eye-catching, and the gravitation of bubbles in the b/c quadrant vaguely illustrated the pattern of hits and known test subsets. One criticism, though, was that it was a bit confusing – it wasn’t obvious how the bubbles were placed around the circles, and it might take people too long to figure out how to read the plot. It also, took up a lot of page space.

Give these sets a hand

One of the collaborators, after seeing this representation, suggested trying to use it as the basis for an Euler diagram. Like a Venn diagram, it’s a visualization used to show set inclusion and overlap, but unlike in a Venn, an Euler is drawn using arbitrary shapes surrounding existing labels or images representing the set members. I thought it was an interesting idea, but I initially dismissed the idea as too difficult. I had already put more time than I typically spend on a consult into this visualization (our service model is to help people learn how to make their own visualizations, not produce visualizations for them). Also, I had never made an Euler diagram. While I had seen some good talks about them, I didn’t have any software on hand which would automate the process. So, I responded that the researchers should feel free to try creating curves around the sets themselves, but I wasn’t interested in pursuing it further.

Five tests polar bubbles with hand-drawn set boundaries About two minutes after I sent the email, I began looking at the diagram and wondering if I could draw the sets! I printed out a black and white copy and started drawing lines with colored pencils, making one enclosing shape for each test [a-e]. It turned out that my manual layout resulted in fairly compact curves, except for “_bc_e”, which had ambiguous positioning, anyway. Five tests first draft Euler diagram The curve drawing was so easy that I started an Illustrator version. I kept the circles’ area the same (corresponding to their quantitative value) but pushed them around to make the set shapes more compact.

Ironically, I had come back almost exactly to the researchers’ original idea! The important distinction is that the bubbles keep it quantitative, with the regions only representing set overlap.

We’ve come full ellipsoid

Angela Zoss constructively pointed out that there were now too many colors, and the shades encoding number of hits wasn’t necessary. She also felt the region labels weren’t clear. Those fixes, plus some curve smoothing (Path -> Simplify in Illustrator) led me to a final version we were all very happy with!

It’s still not a super simple visualization, but both the quantitative and set overlap patterns are reasonably clear. This results was only possible, though, through trying multiple representations and getting feedback on each!

Five tests final quantitative bubble Euler diagram

If you’re interested in learning how to create visualizations like this yourself, sign up for the DVS announcements listserve, or keep an eye on our upcoming workshops list. We also have videos of many past workshops, including Angela’s Intro to Effective Data Visualization, and my Intro to Tableau, Illustrator for Charts, and Illustrator for Diagrams.