The Trouble with Triples

In this week’s post, I’m going to stick my neck out and criticize RDF and its suitability for encoding data in the Humanities. I should start by stipulating that we have a big investment in RDF: our system is driven by an RDF triplestore, and we’re working on an RDF vocabulary to help extend our ability to represent our data. That said, let’s start with the basics: an RDF triple is made up of either three URIs, or two URIs and a “Literal” (a bit of text, perhaps). So you get things like:

<;9> <> <;9;1026/source>

Or, to put it another way, the volume identified by the subject URI contains the text identified by the object URI. Triples in a triple store are members of a “Graph”, and each graph is identified by a URI of its own. So in fact, the triples we talk about are really quads, with a subject, predicate, object, and context:

<;9> <> <;9;1026/source> <>

So far, so good. The great problem with this design is that we don’t really have a good way of talking about triples. As we start to move beyond a self-contained and internally controlled system to one where data are shared between partner projects, we’re starting to care more about triple provenance. Humanities data are full of guesses, estimates, contradictions, and arguments, so it’s not only likely that we’ll acquire triples that contradict each other, it’s desirable. But we have to be able to source our statements. This may be a basic incompatibility between Humanities data and the RDF data model. Our statements are not in the form “X Y Z”, but “W asserts/implies X Y Z”.

There is the method called “reification”, in which a single RDF triple is exploded into a Statement, with subject, predicate, and object properties, so a single triple becomes four triples. Eric Hellman wrote about reification a few years back:

Unfortunately, RDF, the data model underlying Linked Data and the Semantic Web, has no built-in mechanism to attach data to its source. To some extent, this is a deliberate choice in the design of the model, and also a deep one. True facts can’t really have sources, so a knowledge representation system that includes connections of facts to their sources is, in a way, polluted. Instead, RDF takes the point of view that statements are asserted, and if you want to deal with assertions and how they are asserted in a clean logic system, the assertions should be reified.

Hellman notes other problems with reification in a later post. But let’s reify my statement anyway:

<> <> <>
<> <> <;9>
<> <> <>
<> <> <;9;1026/source>
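To make the expansion concrete, here is a minimal Python sketch of reification using plain tuples rather than any RDF library. The example.org URIs, the choice of dcterms:hasPart as the predicate, and the dcterms:source provenance triple are all assumptions made for illustration; they stand in for the real URIs in the listing above. Only the rdf: namespace and its Statement, subject, predicate, and object terms come from the RDF vocabulary itself.

```python
from typing import List, NamedTuple

# The standard RDF namespace, which defines the reification vocabulary.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


def reify(triple: Triple, statement_uri: str) -> List[Triple]:
    """Expand one triple into the four triples of an rdf:Statement."""
    return [
        Triple(statement_uri, RDF + "type", RDF + "Statement"),
        Triple(statement_uri, RDF + "subject", triple.subject),
        Triple(statement_uri, RDF + "predicate", triple.predicate),
        Triple(statement_uri, RDF + "object", triple.object),
    ]


# Hypothetical URIs, invented for this example.
original = Triple(
    "http://example.org/volume/9",
    "http://purl.org/dc/terms/hasPart",
    "http://example.org/volume/9/text/1026/source",
)
statement_uri = "http://example.org/statement/1"
reified = reify(original, statement_uri)

# Only now is there a subject to hang provenance on (source URI is hypothetical).
provenance = Triple(
    statement_uri,
    "http://purl.org/dc/terms/source",
    "http://example.org/partner-project",
)
```

Note that the reified Statement does not assert the original triple, so a store that wants both the fact and the commentary on it has to keep both.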

So, merely by adding another four triples for every original (greatly increasing the size of my dataset), I can now say things about my triple. Unfortunately, graphs are not part of the RDF specification, so I can’t actually say that my Statement belongs to the <> context. You may ask what happens if I want to say something about an assertion about my original triple (when it was made, for example, and by whom). I will respond by gibbering at you and going to hide under the desk.

No. This is lunacy. The one workaround I can see using the existing setup is to coopt the graph (or context) element of the quad, and use that as a handle for making statements about our statements. There are a few downsides to that approach though: one is that it makes querying with SPARQL messy, another is that now you have to manage a multiplicity of graphs, which is something you might not want to do. You also risk losing provenance data if you merge graphs together. And it somewhat mangles the concept and usefulness of the graph as a collection of statements (you may want to group stuff for reasons other than provenance). Portability is a problem too, because most of the serialization formats out there don’t deal with quads.
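A rough Python sketch of the graph-as-handle workaround, and of the merge problem, might look like the following. Everything here is hypothetical: the example.org URIs, the dateOfOrigin predicate, the two partner graphs, and the helper names asserted_in and merge_graphs are invented for illustration and do not describe any particular triplestore.

```python
# Hypothetical URIs: two partner projects disagree about the date of a text.
TEXT = "http://example.org/text/1026"
DATE = "http://example.org/vocab/dateOfOrigin"
GRAPH_A = "http://example.org/graph/partner-a"
GRAPH_B = "http://example.org/graph/partner-b"

# Each quad is (subject, predicate, object, graph).
quads = {
    (TEXT, DATE, "0150", GRAPH_A),
    (TEXT, DATE, "0175", GRAPH_B),
}

# Statements about the graphs double as statements about their triples.
graph_provenance = {
    GRAPH_A: {"assertedBy": "Partner Project A"},
    GRAPH_B: {"assertedBy": "Partner Project B"},
}


def asserted_in(quads, s, p, o):
    """Return the graphs in which the triple (s, p, o) appears."""
    return sorted(g for (qs, qp, qo, g) in quads if (qs, qp, qo) == (s, p, o))


def merge_graphs(quads, target):
    """Move every triple into one graph, discarding the original graph names."""
    return {(s, p, o, target) for (s, p, o, _) in quads}


merged = merge_graphs(quads, "http://example.org/graph/all")
```

Before the merge, asserted_in leads from a contested date back to the partner who asserted it. After the merge, both dates sit in the same graph and nothing in the data says who claimed which.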

A better approach might be that taken by Datomic, which adds to every triple an automatic id (maybe based on a hash of the triple itself to help manage duplication and provide integrity checks). Given that extra handle, it would be simple to tack on statements about the provenance of a triple, to note when multiple sources agree (or disagree) about assertions, and so on. A number of triplestores already do something like this internally, but it really needs to be folded into the data model either of RDF or its successor. For SPARQL, this might be as easy as adding an id(subject, predicate, object) function that would return the URI identifying the given triple. Then you could find out what had been said about a triple with a simple query:

SELECT ?p ?o
FROM <graph>
WHERE { id(<subject_uri>, <predicate_uri>, <object_uri>) ?p ?o }

Compare that to the reification version:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?p ?o
FROM <graph>
WHERE { ?statement ?p ?o .
        ?statement rdf:subject <subject_uri> .
        ?statement rdf:predicate <predicate_uri> .
        ?statement rdf:object <object_uri> }

Actually, we could get a good deal of traction just from having a standard URI format for triple IDs and an algorithm for producing them. Maybe a prefix like <urn:3ID/> and then a SHA-1 hash of the triple, serialized as N3? So our original triple would have the ID <urn:3ID/eb9dfed3b6304ca955baa554c836306d8c61736b>. We still wouldn’t be in a world where you could look up a triple given its ID, but having a standard way to produce IDs might help push things in that direction. Since we aren’t in a world where triple provenance has much priority, I’m still pondering the best ways to deal with the problem. What do you think?
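For what it’s worth, the ID scheme is only a few lines of Python. This is a sketch under assumptions, not a proposal for the canonical form: it serializes the triple as a single N-Triples-style line (which is also valid N3), handles URI objects only, and uses made-up example.org URIs, so it will not reproduce the hash quoted above. The function names and the provenance dictionary are hypothetical.

```python
import hashlib


def serialize(s: str, p: str, o: str) -> str:
    """One N-Triples-style line; assumes all three terms are URIs."""
    return f"<{s}> <{p}> <{o}> ."


def triple_id(s: str, p: str, o: str) -> str:
    """Build an ID from the urn:3ID/ prefix plus the SHA-1 hex digest of the line."""
    digest = hashlib.sha1(serialize(s, p, o).encode("utf-8")).hexdigest()
    return f"urn:3ID/{digest}"


# Hypothetical URIs, invented for this example.
VOLUME = "http://example.org/volume/9"
HAS_PART = "http://purl.org/dc/terms/hasPart"
TEXT = "http://example.org/volume/9/text/1026/source"

tid = triple_id(VOLUME, HAS_PART, TEXT)

# With a stable handle, provenance is just more data keyed on the ID.
provenance = {
    tid: {"assertedBy": ["http://example.org/partner-a", "http://example.org/partner-b"]},
}
```

Because the ID is derived from the content, two projects that assert the same triple independently arrive at the same ID, which is what makes it possible to notice agreement. The catch is that any difference in serialization (whitespace, escaping, how literals are written) changes the hash, so the serialization would have to be pinned down as carefully as the prefix.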