And thirdly, the code is more what you’d call “guidelines” than actual rules.
— Captain Barbossa, Pirates of the Caribbean
I’ve spent the last few hours of work on our Integrating Digital Epigraphies (IDEs) project writing some code to parse epigraphic citations into their component parts. Citations are the connective tissue of humanities scholarship, so by turning them into machine-actionable links, we can start to align different projects, and to bring together in one place the sum of knowledge about, say, a given inscription, thereby making research more efficient, and also potentially allowing macroscopic views of the knowledge domain. Parsing these citations is a step toward that goal.
Epigraphists normally cite editions, especially those published in the big corpora, like Inscriptiones Graecae, in a shorthand fashion. “IG I³ 8” is inscription number 8, in the third edition of the first volume of the Inscriptiones Graecae series. Typically, these short-form citations follow the same pattern: an abbreviated title, volume and fascicle information (if any), followed by an item number. Different bibliographies may abbreviate titles differently, treat volume info differently (e.g. Arabic vs. Roman numerals for volume numbers or II² vs. II[2] or II 2 for volume + edition), and there is even variation in how item numbers are treated.
If we want to be able to align citations from different projects, we have to be able to reconcile these differences, and the first step towards doing that is to be able to deal with the components of a citation separately, because, for example, the only difference between two bibliographies might be that they abbreviate the title of a series differently, so alignment would just entail matching the series titles.
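To make the alignment idea concrete, here is a minimal Python sketch of comparing citations component by component. It is an illustration only, not code from the IDEs project: the alias table, the choice of “II[2]” as the canonical volume form, and the names TITLE_ALIASES, normalize_volume and citation_key are all assumptions invented for the example.

```python
import re

# Illustrative alias table (an assumption): a real one would be built per bibliography.
TITLE_ALIASES = {
    "ig": "IG",
    "i.g.": "IG",
    "inscriptiones graecae": "IG",
}

SUPERSCRIPTS = {"²": "2", "³": "3"}

# Volume number (Roman or Arabic) optionally followed by an edition
# written as II[2], II² or II 2.
VOLUME = re.compile(r"^([IVXLCDM]+|\d+)(?:\s*\[(\d)\]|\s*([²³])|\s+(\d))?$")


def normalize_volume(volume):
    """Rewrite the volume + edition variants into one arbitrary canonical form."""
    text = volume.strip()
    match = VOLUME.match(text)
    if not match:
        return text  # unknown shape: leave it alone rather than guess
    number, bracketed, superscript, spaced = match.groups()
    edition = bracketed or SUPERSCRIPTS.get(superscript) or spaced
    return f"{number}[{edition}]" if edition else number


def citation_key(title, volume, item):
    """Build a comparable key from the three components of a citation."""
    series = TITLE_ALIASES.get(title.strip().lower(), title.strip())
    return (series, normalize_volume(volume), item.strip())


print(citation_key("IG", "II²", "1234") == citation_key("Inscriptiones Graecae", "II 2", "1234"))
```

Note that treating “II 2” as volume plus edition is itself a guess: in another bibliography the same string could mean volume II, fascicle 2, which is exactly the kind of convention that has to be checked per source.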
Citation parsing is one of those things that crop up in DH programming that look superficially simple, but are actually quite hard to get right 99% of the time, and probably impossible (short of creating your own artificial intelligence and teaching it epigraphy—and hoping it doesn’t reciprocate by murdering you) to get 100% right. The method I’ve settled on is to employ a state machine that first splits the citation into tokens and then categorizes each token as title, volume, or item, finishing when everything is parceled out. My algorithm starts by assuming the first token is, or is part of, the title and the last is, or is part of, the item, and then works backwards from there. We can’t really know what the title looks like, but the item number ought to follow some rules (e.g., it ought to be numberlike, unless it’s the not-very-helpful “passim”). The thing before the item might be the volume info, and that ought to be quite regular too, so once we’ve found a pattern that looks item-like, we can work backwards until we’ve captured everything that looks volume-ish; then everything left must be part of the title.
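The backwards-working approach described above might be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the actual IDEs parser: the ITEM and VOLUME patterns and the function name parse_citation are invented for the example, and the sketch treats only the final token as the item.

```python
import re

# Assumed patterns: an item is number-like (or "passim"); a volume is a
# Roman or Arabic numeral, optionally with an edition marker such as ² or [2].
ITEM = re.compile(r"^(?:\d+[a-z]?(?:[-–]\d+[a-z]?)?|passim)[.,;]?$")
VOLUME = re.compile(r"^(?:[IVXLCDM]+|\d+)(?:[²³]|\[\d\])?,?$")


def parse_citation(citation):
    """Split a short-form citation into title, volume and item."""
    tokens = citation.split()
    title, volume, item = [], [], []
    state = "item"
    # Walk the tokens from right to left.
    for index in range(len(tokens) - 1, -1, -1):
        token = tokens[index]
        if index == 0:
            # The first token is assumed to be (part of) the title.
            state = "title"
        elif state == "item":
            if not item and ITEM.match(token):
                item.insert(0, token)
                continue
            state = "volume"
        if state == "volume":
            if VOLUME.match(token):
                volume.insert(0, token)
                continue
            # Nothing volume-like any more: the rest is title.
            state = "title"
        title.insert(0, token)
    return {
        "title": " ".join(title),
        "volume": " ".join(volume),
        "item": " ".join(item),
    }


print(parse_citation("IG I³ 8"))
print(parse_citation("IG II 2 1234"))
```

On these inputs the sketch gives title “IG”, volume “I³”, item “8”, and title “IG”, volume “II 2”, item “1234”. Anything whose item spans several tokens needs extra states, which is where the trouble starts.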
This works reasonably well, up to a point. That point is where you have to start building in all the exceptions to these “rules” you’ve derived. If the item number you’ve isolated starts with “col.” or “fr(ag).” (for column or fragment), for example, then you’ve only got a sub-reference, not the full item identifier, and you need to keep looking. If there’s an “n.” or a “pp.” then everything after that is probably an item (or part of one, if the “n.” means “note”). Sometimes the item reference contains line numbers, and you have to be able to tell them from item numbers. Potentially, any number of these exceptions exist, because the bibliography doesn’t actually have rules. It’s trying to summarize a heterogeneous domain in a fairly regular way for (expert) human consumption. The only real rule (if there are any) is consistency in naming scheme, and even that may be subject to error. Or perhaps a better way of putting it is that there are layers of overlapping and sometimes contradictory rules, not a simple, single set. The difficulty of DH programming is that we are often applying an entirely rule-driven process to a merely guideline-driven dataset. Some level of failure is to be expected. Moreover, some level of tweaking will be required to accommodate the slightly differing conventions of different bibliographies. And, of course, the parser will not necessarily cope with errors of various types. It may be that my parser code will work perfectly on the citations I’ve collected thus far, but will be broken by a new variant. It is also very likely that it will only work in the narrow domain of epigraphy, and will not generalize to other types of citation.
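Two of the exceptions mentioned above can be illustrated with a short Python sketch. It is a hypothetical example, not the project’s code: the marker lists and the function name split_item are assumptions, and telling line numbers from item numbers is deliberately left out.

```python
import re

# Assumed marker lists; real bibliographies will use more variants.
SUBREF = re.compile(r"^(?:col|fr|frag)\.$", re.IGNORECASE)  # column, fragment
LOCATORS = {"n.", "pp."}


def split_item(citation):
    """Return (everything before the item, the item reference)."""
    tokens = citation.split()
    if len(tokens) < 2:
        return (citation.strip(), "")
    # An "n." or "pp." marker: treat it and everything after it as the item.
    for index, token in enumerate(tokens):
        if index > 0 and token in LOCATORS:
            return (" ".join(tokens[:index]), " ".join(tokens[index:]))
    # Otherwise start from the last token and keep looking leftwards for as
    # long as it is only a sub-reference such as "col. 2".
    start = len(tokens) - 1
    while start >= 2 and SUBREF.match(tokens[start - 1]):
        # Step over the marker and the token before it, but never
        # swallow the first token, which is assumed to be title.
        start = max(start - 2, 1)
    return (" ".join(tokens[:start]), " ".join(tokens[start:]))


print(split_item("IG XII 5 12 col. 2"))
print(split_item("SEG 45 pp. 12-14"))
```

Here the first citation yields the item “12 col. 2” and the second “pp. 12-14”. Each new convention means another branch like these, which is the point: the list of special cases has no natural end.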
The profound heterogeneity and complexity of humanities data is one of the things that makes DH programming such a challenge (and a joy). It can also be a source of much frustration for those of us who know the nasty details, because portability of code is a real problem. I wince a bit whenever I hear some variation on “Our solution X works on our problem Y, so it should work on your problem Z” (and I will confess to having uttered those words myself on occasion), because there’s simply no guarantee I won’t have to completely rewrite X to make it fit for Z. X might indeed turn out not to work even for the general case of Y, let alone for Z. You might just have gotten lucky.
This also makes me very skeptical of turnkey solutions and of the notion of outsourcing or “just-in-time-” sourcing DH development. The push-button solutions I’ve seen for humanities data are pretty much all toys (think word clouds, for example). There’s definite value in toys, but they’re not going to be replacing anybody’s job. At their best, they provoke questions and provide different ways to look at your sources. By the same token, if you don’t yourself understand the nuts and bolts of what you’re doing, you’d better have a core member of your team who does when you exceed the operational parameters of your solution.
All this leads me to think that recent angst about Digital Humanities “techno-solutionism” is not merely an overreaction, it’s actually deeply mistaken—as indeed would be DH triumphalism (though my impression is that’s more of a strawman than a real thing). Barring the creation of the AI who’s also a humanities scholar, you’re simply not going to be able to do push-button Digital Humanities. Actual DH (however you choose to define it) isn’t the Humanities with digital pixie dust sprinkled on it. It’s hard, often unglamorous, sometimes mechanical work that has a lot in common with the unglamorous sides of traditional humanities work, like working your way painstakingly through archives to find historical evidence, or digging in the dirt for fragments left behind by civilizations past. To be sure, there are high-level insights and interpretation that should come out of this work (and cool visualizations), but there are no reliable shortcuts, because there aren’t really any rules.
Hi Hugh,
Really enjoyed the panel yesterday.
Is IDEs up and running already? or still in the making?
If I may be of some encouragement, this is a brilliant tool!
I cannot wait to use it.
All the best,
Julien
Hi Julien,
It’s still under development, but we plan to have a release soon. I hope in November.
Hi,
I was searching “parsing bibliographic reference list digital humanities” and got here. However, I cannot find a link to the tool that is mentioned in the comments… Or do you recommend any other specific tool/library? I have a mass of thousands of references that I have to parse into metadata (author, title, publisher, pages, etc.).
I will add that I loved the post as it “touches my developer’s heart”.
Thanks!