Graphs, Trees, and Streams: The TEI Data Model

I’ve been involved for a while now in an effort to get the Pointers section of the TEI Guidelines into a workable state. This is, admittedly, an obscure corner of a world that people outside the (itself small) text encoding/markup community would find baffling. TEI Pointers are probably not of interest to you, but the theory, practice, and theory-in-practice of text encoding with TEI might be.

Early on, TEI was quite interested in the theory of text, and several of its developers famously published on the “OHCO” theory: text as an Ordered Hierarchy of Content Objects. That is, texts are composed of logical units, such as books, chapters, paragraphs, and sentences. These are hierarchical (a book has chapters, chapters have paragraphs, paragraphs are made up of sentences, and so on) and ordered. And there’s a good deal of truth to OHCO. Many texts we’re familiar with really are structured this way, or at least can be modeled in this fashion. OHCO corresponds precisely to the structure of XML, which is itself a tree: a hierarchical data structure with ordered nodes.

OHCO isn’t completely satisfactory, to be sure. Exceptions and conflicts are easy to come by. Books, for example, have multiple structures: they have chapters and paragraphs, but they also have pages, which are themselves ordered containers of text and cut across sentences and paragraphs, though usually not chapters. So OHCO sort of works, but isn’t 100% correct.

TEI works though, even if it doesn’t have a Grand Unified Theory of Text behind it. And that’s a strength—it’s responsive to the needs of its community without having a Theory that modifications to the guidelines have to be tested against. It’s just not that opinionated. There are most definitely principles and best practices that the TEI Council follows in applying changes, but there isn’t a Theory (or ideology) of TEI that it follows. The project I’ve been working on involves methods for addressing pieces of a TEI document that aren’t simply nodes (elements, attributes, or text nodes), which has made me think about what theories of text the current version of TEI actually instantiates (whether those theories are articulated or not). One of the things TEI Pointers do is address the underlying text stream of a document. So, for example, the match() pointer might point to a piece of text that spans element boundaries.

<seg xml:id="a1">The quick <unclear>br</unclear>oun fox...</seg>

A match pointer #match(a1,'broun') would address the (irregularly spelled) word “broun” even though it is broken up by the <unclear> tag. You might use it if you wanted to annotate the word somehow, say to provide a normalized spelling, but didn’t want to do it inline. You could do something like this:

<choice>
    <orig><ref target="#match(a1,'broun')"/></orig>
    <reg>brown</reg>
</choice>

(out of line, somewhere else in the document).
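To make the idea of pointing across element boundaries concrete, here is a minimal Python sketch of how such a pointer might be resolved. It is not the TEI reference behaviour or any existing implementation: the function names text_slots and resolve_match are hypothetical, the pattern is treated as a literal string rather than anything richer, and the elements are assumed to carry no namespace, as in the examples here.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def text_slots(elem):
    """Yield (owner, slot, string) for each text node under elem, in document order."""
    if elem.text:
        yield elem, "text", elem.text
    for child in elem:
        yield from text_slots(child)
        if child.tail:
            yield child, "tail", child.tail

def resolve_match(root, target_id, needle):
    """Return the slices of text nodes covered by the first occurrence of needle.

    Hypothetical helper: treats `needle` as a literal string (an assumption).
    """
    target = next(e for e in root.iter() if e.get(XML_ID) == target_id)
    slots = list(text_slots(target))
    stream = "".join(s for _, _, s in slots)  # the flattened text stream
    start = stream.find(needle)
    if start < 0:
        return []
    end = start + len(needle)
    pieces, pos = [], 0
    for owner, slot, s in slots:
        # Intersect the match range with this text node's range in the stream.
        lo, hi = max(start, pos), min(end, pos + len(s))
        if lo < hi:
            pieces.append((owner.tag, slot, s[lo - pos:hi - pos]))
        pos += len(s)
    return pieces

doc = ET.fromstring('<seg xml:id="a1">The quick <unclear>br</unclear>oun fox...</seg>')
print(resolve_match(doc, "a1", "broun"))
# [('unclear', 'text', 'br'), ('unclear', 'tail', 'oun')]
```

The match is found in the flattened stream and only afterwards mapped back onto the tree, which is why the result comes in two pieces: the "br" inside <unclear> and the "oun" that follows it.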

There’s a lot of practical text theory going on here: first, we have a stream of text, presumably representing the digital transcription of an analog source.

The quick broun fox...

This text stream is embedded in a tree structure made up of the markup, which wraps it in a <seg> element and uses an <unclear> tag to mark letters that are illegible but can be understood from their context. You could do the regularization inline if you wanted to:

<seg xml:id="a1">The quick <choice>
    <orig><unclear>br</unclear>oun</orig>
    <reg>brown</reg></choice> fox...</seg>

You can see, I think, that it would be all too easy to run into cases where the overlapping concerns (here, regularizing spelling and noting physical features of the text) might necessitate overlapping markup. It’s also interesting to think about what this does to the text stream, which is now:

The quick \n    broun\n    brown fox...

I’ve printed line breaks as “\n” and kept the indentation to show exactly what text, “whitespace” included, is present in the stream. I’ve reformatted the example here purely so that it prints sensibly in this blog post, but it represents the kind of thing you see all the time in TEI documents. People have argued that this sort of thing is a harmful feature of TEI, because the real text stream no longer matches the notional text stream, which should now be thought of either as duplex:

The quick broun fox...

and

The quick brown fox...

at the same time, or as forking at the word “broun|brown” and rejoining after. TEI texts that employ constructs like this can really no longer be said completely to follow OHCO, incidentally, because these structures, while certainly hierarchical, are not internally ordered (there’s no rule about which of the orig/reg pair comes first—they’re parallel). There is a useful theoretical and practical question here: should your text stream align with a particular reading of the text? Or are you happy to have it split and rejoin when you deploy structures like <choice> or <app> inline in your text? I think the TEI’s stance on this would be “it depends”.
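The difference between the real stream and the two notional readings can be shown in a few lines of Python. This is only a sketch under some assumptions: the function name reading is hypothetical, the elements are un-namespaced as in the examples here, and a <choice> is assumed to contribute nothing to a reading except the selected child.

```python
import xml.etree.ElementTree as ET

INLINE = '''<seg xml:id="a1">The quick <choice>
    <orig><unclear>br</unclear>oun</orig>
    <reg>brown</reg></choice> fox...</seg>'''

def reading(elem, prefer):
    """Flatten elem to text, keeping only the `prefer` child of each <choice>.

    Hypothetical helper: whitespace directly inside <choice> is dropped.
    """
    if elem.tag == "choice":
        return "".join(reading(c, prefer) for c in elem if c.tag == prefer)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(reading(child, prefer))
        parts.append(child.tail or "")
    return "".join(parts)

seg = ET.fromstring(INLINE)

# The real stream: every text node, in document order, whitespace included.
raw_stream = "".join(seg.itertext())
print(repr(raw_stream))      # 'The quick \n    broun\n    brown fox...'

# The two notional streams: one per branch of the fork.
print(reading(seg, "orig"))  # The quick broun fox...
print(reading(seg, "reg"))   # The quick brown fox...
```

Whether a processor works with raw_stream or with one of the readings is exactly the “it depends” in question: the first is what the file contains, the others are what a reader would see.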

Regardless of your feelings in this matter (the arguments over this often become rather religious), we’ve identified two distinct but related structures here: the text stream, and the XML tree in which it is embedded and which imposes its semantics upon the text. And I’ve hinted at a third structure in my standoff markup example above, with the <ref target="#match(a1,'broun')"/>. TEI documents also have linking mechanisms, whereby part of a document can be linked to another part, or to part of another document entirely. This really comprises another data structure, a graph, where the nodes are element, attribute, or text nodes in the document and the arcs are the links. The graph joins together arbitrary portions of the document and adds further layers of meaning to it.
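As a rough illustration of the graph layer, the arcs can be read straight off the target attributes. The sketch below makes several assumptions that are not in the examples above: the <body> wrapper and the xml:id="c1" on <choice> are added purely for illustration, link_arcs is a hypothetical name, and the source of each arc is taken to be the nearest enclosing element with an xml:id.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <body> wrapper and xml:id="c1" are illustrative assumptions.
DOC = """<body>
  <seg xml:id="a1">The quick <unclear>br</unclear>oun fox...</seg>
  <choice xml:id="c1">
    <orig><ref target="#match(a1,'broun')"/></orig>
    <reg>brown</reg>
  </choice>
</body>"""

def link_arcs(root):
    """Return (source id, pointer string) pairs, one per target attribute."""
    arcs = []

    def walk(elem, anchor):
        # The nearest ancestor-or-self with an xml:id names the arc's source.
        anchor = elem.get(XML_ID, anchor)
        if "target" in elem.attrib:
            arcs.append((anchor, elem.get("target")))
        for child in elem:
            walk(child, anchor)

    walk(root, None)
    return arcs

print(link_arcs(ET.fromstring(DOC)))
# [('c1', "#match(a1,'broun')")]
```

The arc here runs from a node in the tree to a span of the text stream, not to another node, which is the case the plain node-to-node picture of the graph does not cover.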

The TEI data model, then, is a hybrid of stream, tree, and graph, where each “layer” annotates and adds meaning to the layer(s) below. Seen in this light, my work on Pointers is an effort to give the graph layer full access to the text stream, not necessarily mediated by the tree layer. So maybe it’s not quite as thoroughly obscure as I thought.

Further Reading

If you’re interested at all in finding out more about the ongoing work on TEI Pointers, and what they might be useful for, you can take a look at https://github.com/hcayless/TEI_Pointers_Draft for the stable version of the draft proposal (there’s also a Google Doc linked from there, which you can comment on, and I’d be happy if you did). I’ve also been working on a browser-based implementation, which you can see at https://github.com/hcayless/tei-xpointer.js, with a demo implementation at http://tei.philomousos.com/. [Update 2017-07-17: this work was completed, incorporated into the TEI Guidelines, and published as an article in the Journal of the TEI, “Rebooting TEI Pointers”.]

On OHCO the foundational article is DeRose, S.-J., D. Durand, E. Mylonas, and A.-H. Renear (1990). “What Is Text, Really?” Journal of Computing in Higher Education 1: 3–26. Reprinted in the ACM/SIGDOC Journal of Computer Documentation 21,3: 1–24. See Renear’s article in A Companion to Digital Humanities for a summary and commentary.

On arguments against adulterating the text stream, see Ted Nelson’s 1997 article and more recently Desmond Schmidt, “The inadequacy of embedded markup for cultural heritage texts”, Literary and Linguistic Computing 25.2 (2010). Adam Soroka’s and my Balisage paper was a response to the discussion following that article (Cayless, Hugh A., and Adam Soroka. “On Implementing string-range() for TEI.” Presented at Balisage: The Markup Conference 2010, Montréal, Canada, August 3 – 6, 2010. In Proceedings of Balisage: The Markup Conference 2010. Balisage Series on Markup Technologies, vol. 5 (2010). doi:10.4242/BalisageVol5.Cayless01.)