“What form does the data take?” is a question that developers ask early in the life-cycle of any information technology project.
Last year, Doris Duke Archivist Mary Samouelian approached some of us in the IT department with an idea for a project that involved a specific kind of data. She wanted to produce an interactive timeline of Doris Duke’s life for a presentation she would give at a Friends of the Duke University Libraries meeting in May. We took it on, and resolved to do something innovative with it. The final result of our work is available here; for more on the project, see Mary’s post on the Devil’s Tale blog, “The Doris Duke Collection Reimagined.”
To me, an innovation means opening the way to a new service or a new capacity. A one-off project wouldn’t have done that.
When we took up the project in earnest in mid-February, the data was in the form of an extensive and detailed Microsoft Word document that Mary had written. One of the first questions we needed to resolve was how to represent the information in the Word document as data.
We needed a way for Mary to read and edit the data on an ongoing basis. At the same time, the data had to be available in a structured format that computers can manipulate. This tension between the reading methods of intuitive, interpretive human beings and fussy, unforgiving computers is the central challenge of representing data.
As it happens, archivists already represent timelines in a way that computers can process. Encoded Archival Description (EAD) is an XML standard for archival finding aids. Among its many features, it specifies a way for archivists to build timelines related to the creators of a collection’s material. As a practiced author of finding aids, Mary is familiar with the use of EAD. Since the development team for the project is the same group that recently built our finding aids site, EAD seemed like a natural fit for the project.
However, an emerging standard related to EAD also caught our attention. Encoded Archival Context for Corporate Bodies, Persons and Families (EAC-CPF or just EAC) “provides a grammar for encoding names of creators of archival materials and related information.” I first became familiar with it at the 2010 code4lib conference, where I saw a presentation on the Social Networks and Archival Context (SNAC) project. The presenter called their prototype implementation “Facebook for dead people.”
That site uses EAC records from a variety of institutions to accomplish several ends. First, it shows the array of collections from the participating institutions associated with an individual, such as Walt Whitman. Second, it builds a social network among individuals, linking a creator like Whitman to other parties with whom he corresponded, to whom he was related, or with whom he was otherwise associated.
Another aim of EAC is to establish an infrastructure of name authority for the corporate bodies and people who create archival collections. To that end, the EAC community – including our former Duke colleague Kathy Wisser – has received an IMLS grant, Building a National Archival Authorities Infrastructure. The grant will fund a series of workshops through the Society of American Archivists, and the development of “a set of recommendations addressing business, governance, and technological requirements.”
As the development team discussed Mary’s project, we liked the idea of using EAC-CPF markup to represent information about Doris Duke. For one thing, we admire the SNAC web site, and have discussed in the past using it as a model for a series of “person portals” into our collections. We wanted to familiarize ourselves with EAC, and the Doris Duke project seemed like an appropriate entry point.
There was only one problem. EAC defines a “chronlist” tag for representing timelines, but its specification was not robust enough for our purposes. It did not support two of our important needs: 1) linking media files (i.e., images) to events; and 2) linking individual events to the finding aids for collections that provide source materials about the events. Faced with this limitation, we decided to take liberties.
In contrast to EAC, our reading of the EAD tag library confirmed that the specification for its “chronlist” tag is robust enough to support our requirements. We decided to mix the parts of EAD that we liked into our EAC document. The basic technique for mixing and matching XML standards is to use namespace declarations. A namespace is a kind of domain identifier for XML elements, declared as a URI and usually bound to a short prefix. It says, to computers (and people) reading a document, “This tag belongs to that schema.”
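For readers who would like to see the mechanics, here is a minimal sketch in Python of what such a mixed document can look like and how a program might read it. It is not our actual file or our production code. The namespace URIs are the ones published for EAC-CPF and EAD 2002, but the placement of the timeline inside “biogHist”, the use of “extptr” for an image and “extref” for a finding aid, the “xlink:href” attributes, the date, the event text, and the example.org URLs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs published for EAC-CPF and EAD 2002.
EAC_NS = "urn:isbn:1-931666-33-4"
EAD_NS = "urn:isbn:1-931666-22-9"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical record: structure, date, text, and URLs are illustrative only.
SAMPLE = f"""<eac-cpf xmlns="{EAC_NS}" xmlns:ead="{EAD_NS}" xmlns:xlink="{XLINK_NS}">
  <cpfDescription>
    <description>
      <biogHist>
        <ead:chronlist>
          <ead:chronitem>
            <ead:date>1950</ead:date>
            <ead:event>Example event.
              <ead:extptr xlink:href="http://example.org/images/event.jpg"/>
              <ead:extref xlink:href="http://example.org/findingaids/example/">Finding aid</ead:extref>
            </ead:event>
          </ead:chronitem>
        </ead:chronlist>
      </biogHist>
    </description>
  </cpfDescription>
</eac-cpf>"""

# Prefix-to-URI map used only by the search expressions below.
NS = {"eac": EAC_NS, "ead": EAD_NS}
HREF = f"{{{XLINK_NS}}}href"  # ElementTree spells qualified names as {uri}local


def timeline_events(xml_text):
    """Collect the borrowed EAD timeline entries from an EAC-CPF record."""
    root = ET.fromstring(xml_text)
    events = []
    for item in root.iterfind(".//ead:chronitem", NS):
        event = item.find("ead:event", NS)
        if event is None:
            continue  # skip entries that have a date but no event
        image = event.find("ead:extptr", NS)
        source = event.find("ead:extref", NS)
        events.append({
            "date": item.findtext("ead:date", default="", namespaces=NS).strip(),
            "text": (event.text or "").strip(),
            "image": image.get(HREF) if image is not None else None,
            "finding_aid": source.get(HREF) if source is not None else None,
        })
    return events


if __name__ == "__main__":
    for entry in timeline_events(SAMPLE):
        print(entry)
```

The root element and its unprefixed children belong to EAC because of the default namespace declaration, while everything prefixed “ead:” belongs to EAD. A program can ask for the EAD timeline entries by namespace without caring which standard the enclosing elements come from.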
If my explanation is overly technical, here are some fitting analogies for what we did: we invented a new fusion cuisine dish; we installed a whammy bar on a Les Paul; we used cobra genes to engineer a killer rabbit.
The resulting EAC file for the Doris Duke project is available here. The tags in that document beginning with the prefix “ead:” are the elements we borrowed from the EAD namespace.
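One detail about that prefix is worth a small illustration: “ead:” is only a local nickname, and what actually identifies an element is the namespace URI the prefix is bound to. The sketch below uses two made-up fragments (the “wrapper” element and the “x” prefix are hypothetical) to show that a parser resolves both to the same qualified name.

```python
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"  # EAD 2002 namespace URI

# Two hypothetical fragments that differ only in the prefix chosen for EAD.
with_ead_prefix = f'<wrapper xmlns:ead="{EAD_NS}"><ead:chronlist/></wrapper>'
with_other_prefix = f'<wrapper xmlns:x="{EAD_NS}"><x:chronlist/></wrapper>'


def child_tags(xml_text):
    """Return the fully qualified tag names of the root element's children."""
    return [child.tag for child in ET.fromstring(xml_text)]


if __name__ == "__main__":
    print(child_tags(with_ead_prefix))
    print(child_tags(with_other_prefix))
```

We chose “ead” as the prefix simply because it makes the borrowed elements easy for a human reader to spot.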
The solution that we devised represented a kind of contract between the content creator, Mary, and the development team. It allowed the two parties to work in parallel, Mary encoding and revising the timeline, and the developers building its display.
Duke is participating in the National Archival Authorities Infrastructure project, which will ultimately integrate our collections into that “Facebook for dead people” social network. We’re also developing our expertise by working on more “person portals”; University Archives will be assigning additional Duke family EAC documents to its interns as a low-priority background project. It probably took double the effort for the development team to produce a new service rather than a one-off project, but it helped us take our first steps toward this promising approach to describing and exposing the contents of our archival collections.