In this week’s post, I’m going to stick my neck out and criticize RDF and its suitability for encoding data in the Humanities. I should start by stipulating that we have a big investment in RDF: Papyri.info is driven by an RDF triplestore and we’re working on an RDF vocabulary to help extend our ability to represent our data. That said, let’s start with the basics: an RDF triple is made up of three URIs or two URIs and a “Literal” (a bit of text, perhaps). So you get things like:
<http://papyri.info/ddbdp/psi;9> <http://purl.org/dc/terms/hasPart> <http://papyri.info/ddbdp/psi;9;1026/source>
Or, to put it another way, the volume identified by the URI http://papyri.info/ddbdp/psi;9 contains the text identified by the URI http://papyri.info/ddbdp/psi;9;1026/source. Triples in a triple store are members of a “Graph”. In this case the graph is identified by the URI http://papyri.info/graph. So in fact, the triples we talk about are really quads, with a subject, predicate, object, and context:
<http://papyri.info/ddbdp/psi;9> <http://purl.org/dc/terms/hasPart> <http://papyri.info/ddbdp/psi;9;1026/source> <http://papyri.info/graph>
So far, so good. The great problem with this design is that we don’t really have a good way of talking about triples. As we start to move beyond a self-contained and internally controlled system to one where data are shared between partner projects, we’re starting to care more about triple provenance. Humanities data are full of guesses, estimates, contradictions, and arguments, so it’s not only likely that we’ll acquire triples that contradict each other, it’s desirable. But we have to be able to source our statements. This may be a basic incompatibility between Humanities data and the RDF data model. Our statements are not in the form “X Y Z”, but “W asserts/implies X Y Z”.
There is the method called “reification”, in which a single RDF triple is exploded into a Statement, with subject, predicate, and object properties, so a single triple becomes four triples. Eric Hellman wrote about reification a few years back:
Unfortunately, RDF, the data model underlying Linked Data and the Semantic Web, has no built-in mechanism to attach data to its source. To some extent, this is a deliberate choice in the design of the model, and also a deep one. True facts can’t really have sources, so a knowledge representation system that includes connections of facts to their sources is, in a way, polluted. Instead, RDF takes the point of view that statements are asserted, and if you want to deal with assertions and how they are asserted in a clean logic system, the assertions should be reified.
Hellman notes other problems with reification in a later post. But let’s reify my statement anyway:
<http://example.com/myTriple> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement>
<http://example.com/myTriple> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://papyri.info/ddbdp/psi;9>
<http://example.com/myTriple> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://purl.org/dc/terms/hasPart>
<http://example.com/myTriple> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://papyri.info/ddbdp/psi;9;1026/source>
So, merely by adding another four triples for every original (greatly increasing the size of my dataset), I can now say things about my triple. Unfortunately, graphs are not part of the RDF specification, so I can’t actually say that my Statement belongs to the <http://papyri.info/graph> context. You may ask what happens if I want to say something about an assertion about my original triple (when it was made, for example, and by whom). I will respond by gibbering at you and going to hide under the desk.
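To make the bookkeeping concrete, here is a minimal Python sketch of what reifying one triple involves. It only builds N-Triples strings and does not use a real RDF library. The `reify` and `annotate` helpers are names made up for this illustration, and the use of dcterms:source with a placeholder note is an assumption, not something Papyri.info actually does.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(statement_uri, subject, predicate, obj):
    """Return the four N-Triples lines that reify one triple of URIs."""
    pairs = [
        ("type", RDF + "Statement"),
        ("subject", subject),
        ("predicate", predicate),
        ("object", obj),
    ]
    return [f"<{statement_uri}> <{RDF}{prop}> <{value}> ." for prop, value in pairs]


def annotate(statement_uri, predicate, literal):
    """Return one extra triple that says something about the reified statement.

    Hypothetical helper: the literal is not escaped, so keep it simple.
    """
    return f'<{statement_uri}> <{predicate}> "{literal}" .'


lines = reify(
    "http://example.com/myTriple",
    "http://papyri.info/ddbdp/psi;9",
    "http://purl.org/dc/terms/hasPart",
    "http://papyri.info/ddbdp/psi;9;1026/source",
)
# Assumed provenance property and placeholder value, for illustration only.
lines.append(
    annotate(
        "http://example.com/myTriple",
        "http://purl.org/dc/terms/source",
        "placeholder provenance note",
    )
)
print("\n".join(lines))
```

Every statement you want to talk about costs four triples before the first piece of provenance is attached, and nothing in the output ties the Statement back to the original triple except that the three URIs happen to match.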
No. This is lunacy. The one workaround I can see using the existing setup is to co-opt the graph (or context) element of the quad, and use that as a handle for making statements about our statements. There are a few downsides to that approach though: one is that it makes querying with SPARQL messy, another is that now you have to manage a multiplicity of graphs, which is something you might not want to do. You also risk losing provenance data if you merge graphs together. And it somewhat mangles the concept and usefulness of the graph as a collection of statements (you may want to group stuff for reasons other than provenance). Portability is a problem too, because most of the serialization formats out there don’t deal with quads.
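A toy Python model of that workaround may help show both how it works and where it hurts. This is only a sketch using plain tuples rather than a triplestore. The per-source graph URIs, the `assertedBy` key, and the `who_asserts` and `merge_graphs` helpers are all invented for the example.

```python
TRIPLE = (
    "http://papyri.info/ddbdp/psi;9",
    "http://purl.org/dc/terms/hasPart",
    "http://papyri.info/ddbdp/psi;9;1026/source",
)

# Hypothetical per-source graph URIs, invented for this example.
GRAPH_A = "http://example.com/graph/sourceA"
GRAPH_B = "http://example.com/graph/sourceB"

# The same triple asserted by two sources becomes two distinct quads.
quads = {TRIPLE + (GRAPH_A,), TRIPLE + (GRAPH_B,)}

# Provenance is stated about the graph URI, not about the triple itself.
provenance = {
    GRAPH_A: {"assertedBy": "Source A"},
    GRAPH_B: {"assertedBy": "Source B"},
}


def who_asserts(triple, quads, provenance):
    """List the sources whose graphs contain the given triple."""
    return sorted(
        provenance[g]["assertedBy"]
        for (s, p, o, g) in quads
        if (s, p, o) == triple and g in provenance
    )


def merge_graphs(quads, target_graph):
    """Move every quad into one graph, as a naive merge would."""
    return {(s, p, o, target_graph) for (s, p, o, _) in quads}


print(who_asserts(TRIPLE, quads, provenance))

merged = merge_graphs(quads, "http://papyri.info/graph")
# After merging, the two quads collapse into one and nothing links the
# triple to either source any more.
print(len(merged), who_asserts(TRIPLE, merged, provenance))
```

While the graphs are kept apart, the lookup finds both sources. Once everything is folded into a single graph, the two assertions collapse into one quad and the provenance handle is gone.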
A better approach might be that taken by Datomic, which adds to every triple an automatic id (maybe based on a hash of the triple itself to help manage duplication and provide integrity checks). Given that extra handle, it would be simple to tack on statements about the provenance of a triple, to note when multiple sources agree (or disagree) about assertions, and so on. A number of triplestores already do something like this internally, but it really needs to be folded into the data model either of RDF or its successor. For SPARQL, this might be as easy as adding an id(subject, predicate, object) function that would return the URI identifying the given triple. Then you could find out what had been said about a triple with a simple query:
SELECT ?p ?o
FROM <graph>
WHERE { id(<subject_uri>, <predicate_uri>, <object_uri>) ?p ?o }
Compare that to the reification version:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?p ?o
FROM <graph>
WHERE { ?statement ?p ?o .
?statement rdf:subject <subject_uri> .
?statement rdf:predicate <predicate_uri> .
?statement rdf:object <object_uri> }
Actually, we could get a good deal of traction just from having a standard URI format for triple IDs and an algorithm for producing them. Maybe a prefix like <urn:3ID/> and then a SHA1 hash of the triple, serialized as N3? So our original triple would have the ID <urn:3ID/eb9dfed3b6304ca955baa554c836306d8c61736b>. We still wouldn’t be in a world where you could look up a triple given its ID, but having a standard way to produce IDs might help push things in that direction. Since we aren’t in a world where triple provenance has much priority, I’m still pondering about the best ways to deal with the problem. What do you think?
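As a rough sketch of the triple ID idea, the following Python derives an ID from a SHA-1 hash of the triple and uses it as a key for provenance notes. The exact serialization is an assumption: each triple is written as a single N-Triples-style line (which is also valid N3) with single spaces and a trailing " .", encoded as UTF-8. Because the post does not pin down the byte-level serialization, this is not guaranteed to reproduce the hash quoted above. The `triple_id` function and the `annotations` dictionary are hypothetical names for illustration.

```python
import hashlib


def triple_id(subject, predicate, obj):
    """Derive a URN for a triple of URIs from a SHA-1 hash of its serialization.

    Assumed canonical form: "<s> <p> <o> ." with single spaces, UTF-8 encoded.
    """
    line = f"<{subject}> <{predicate}> <{obj}> ."
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
    return f"urn:3ID/{digest}"


original = (
    "http://papyri.info/ddbdp/psi;9",
    "http://purl.org/dc/terms/hasPart",
    "http://papyri.info/ddbdp/psi;9;1026/source",
)

# Statements about the triple hang off its ID, loosely mimicking what the
# proposed id(subject, predicate, object) function would allow in SPARQL.
annotations = {
    triple_id(*original): [
        # Assumed property and placeholder value, for illustration only.
        ("http://purl.org/dc/terms/source", "placeholder provenance note"),
    ]
}

for predicate, value in annotations.get(triple_id(*original), []):
    print(predicate, value)
```

The same triple always yields the same ID, so two projects that agree on the serialization rules could attach provenance to each other's statements without exchanging any reified Statements first. Agreeing on those rules (whitespace, literal escaping, blank nodes) is the hard part that this sketch skips.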
I like the idea of assigning automatic ids to triples. It’s standard practice in relational databases, is it not? And it’s not computationally expensive. So why not adopt it for RDF, especially if it will help push things in the direction of looking up triples by id? That would lead to much more succinct statements and queries, would it not?
It is common practice. It’s not just the verbosity of reification that bothers me; it’s that it makes certain operations I might really like to perform impossible. For example, I might well want to have a trigger that inserts provenance information every time a new triple (or set of triples) is inserted. This would be dead easy with quads, but I’d have to somehow avoid doing this for reified statement triples. Reified triples are just new statements, so they don’t really link back to the original triple in any way. What if I wanted to mark a triple as deprecated (replaced by better information)?
I’m not saying there’s absolutely no way to do this sort of thing under the current regime, but it’s harder than it needs to be.
Hugh, I think you’re right that RDF by itself doesn’t supply all of the answers, especially when it comes to dealing with questions of provenance and ownership. Models like OAC and SAM provide data structures and vocabulary we can standardize on to enable us to talk about that. Of course they too can be expressed in RDF and produce even more triples, so it’s not really responsive to your question of how to make queries efficient, but maybe it just proves the point that triples are just a part of the solution and not the silver bullet.
Yeah, I agree. But I keep wishing that RDF was *just a little* bit more useful. We’d really like to be able to model chains of scholarly reasoning and RDF seems like it would be perfect for that, but it isn’t.
I’ve recently started a project on knowledge representation, and I find myself agreeing with what you said about RDF. I found one possible solution with MarkLogic, where you can specify arbitrary attributes on triples expressed in XML. I would be interested to know if there have been any other advances in this area over the past three years.