Category Archives: Trident

Duke University Libraries’ metadata tool project.

You Know What We Did This Summer

I’ve been working in academic libraries for fourteen years now, and I still haven’t been able to convince my grandmother that working for a university doesn’t mean you get the summers off.  We certainly haven’t been taking the summer off in the Digital Collections Program here at the Duke University Libraries, even though you haven’t seen most of the results of our summer work yet.

We premiered the Duke Digital Collections iPhone app back in June, which has been getting positive and enthusiastic feedback (thanks!), but otherwise most of our work has been behind-the-scenes stuff that will pay off in the future.  Among our projects:

  • The metadata phase of the Broadsides & Ephemera digital collection has begun in earnest, with a team of eight catalogers and archivists using our new metadata editor to describe these rare and valuable resources.
  • Work continues on Trident, our digital collections system.  With a new repository, a new metadata editor, and all sorts of other new developments, we’ll be able to create and manage digital collections better, faster, and more seamlessly than ever before, and deliver content in new and exciting ways.
  • Our Digital Production Center continues digitizing materials for future collections at a furious rate.  As usual, they’re very speedy and the rest of us sometimes feel like we’re trying to play catch-up with them….
  • We’ve introduced new ways to keep up with the Digital Collections Program, including a Facebook page (come be our friend!) and more frequent Twitter updates, where we’ve been tweeting highlights from the Duke Digital Collections since the spring.  We’ve also been posting alongside our digital collections colleagues from across the state on the North Carolina Digital Collections Collaboratory blog.
  • Last but certainly not least, we’re about to launch a huge, fantastic, exciting, FUN new digital collection — hopefully next week — that we’re going to have to keep secret a bit longer.  We hate to tease you … well, maybe we want to tease you a little bit.  It’s completely different from anything we’ve done before in several ways that will become clear when it’s published.  We’ve been working like fiends on this one, but we think it’s totally going to be worth it, and we hope you’ll think so, too, when you see it.  Stay tuned.

As always, thanks for reading, and for your support and interest.  We hope you’re having as good a summer as we are.  Don’t forget the sunscreen and the frosty beverage of your choice….

Open Repositories 2009

I attended the Open Repositories 2009 conference this past week.  Overall it was a very informative conference on the open source repository platforms (Fedora, dSpace, ePrints, Zentity), current projects and developments using these platforms, and future directions of repositories.  Below are some relevant notes from the conference.

Repository Workflow

There were a few presentations that discussed how institutions were managing their repositories, in particular, repositories built with Fedora.  Two of these, eSciDoc and Hydra, had some very useful nuggets.

Hydra is a grant-funded collaboration between Hull University, the University of Virginia, and Stanford University to build a repository management toolkit that can manage their three very different workflows and be extensible enough to handle heterogeneous workflows across the Fedora community.  There are a few practices or ideas that we might want to adopt from this project, as well as some possible points of convergence with Trident.

  • The idea of treating workflow processes discretely.  Hull is using BPEL (Business Process Execution Language) to define and implement the processes.  They are using Active Endpoints (it was open source, though it may not be any longer), which provides a really nice GUI for defining and connecting workflow processes.  I’m not sure whether this tool is worth investigating, but I have seen it before and have heard good things.
  • Stanford has a good design for representing the state of multiple workflows for an item.  Items have workflow datastreams, which include a number of processes, each with an indicator of state.  They then represent these workflow processes as a checklist in their management interface (see the sketch after this list).
  • UVA, like us, is thinking RESTfully.  A RESTful approach to workflow steps allows processes to be encapsulated nicely and reused in a variety of ways.
  • Repository API – This is a possible point of eventual convergence: Hydra will be creating a RESTful API layer on top of Fedora, similar in architecture to the one we have developed for Trident.
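
To make Stanford’s checklist idea a little more concrete, here is a minimal sketch of how an item’s workflow processes and their states might be modeled and then rendered as a checklist.  The class names, process names, states, and identifier are illustrative assumptions on my part, not Hydra’s or Stanford’s actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowProcess:
    name: str
    state: str = "waiting"   # e.g. waiting, in-progress, completed


@dataclass
class ItemWorkflow:
    item_pid: str
    processes: list = field(default_factory=list)

    def set_state(self, process_name, state):
        for process in self.processes:
            if process.name == process_name:
                process.state = state

    def as_checklist(self):
        # Render the workflow as a plain-text checklist for a management UI.
        return "\n".join(
            f"[{'x' if p.state == 'completed' else ' '}] {p.name} ({p.state})"
            for p in self.processes
        )


workflow = ItemWorkflow(
    item_pid="duke:broadside-0001",          # hypothetical identifier
    processes=[
        WorkflowProcess("ingest", "completed"),
        WorkflowProcess("describe"),
        WorkflowProcess("publish"),
    ],
)
workflow.set_state("describe", "in-progress")
print(workflow.as_checklist())
```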

eSciDoc is an eResearch environment built on top of Fedora.

  • They have a well-established object life cycle.  An item’s stage in its life cycle determines who is allowed to do what to the item (see the sketch after this list).  For instance, pending (only the creator can access and modify, collaborators may be invited, the item may be deleted), submitted (QC/editorial process, the creator can no longer modify, metadata may still be enriched),…
  • They have a very tight versioning design in their Fedora repository.  They use an atomistic approach to Fedora, with items and components as separate Fedora objects.  With this approach, they can represent multiple versions of an item in the repository: each new version creates a copy of the item’s Fedora object and a copy of only the changed components.  The item Fedora objects contain all of the pointers to the components.  A handle gets assigned to the published version.
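
As a rough illustration of the life-cycle idea, here is a small sketch that maps stages to the actions each role may perform.  Only the pending and submitted stages come from the presentation; the remaining stages, the roles, and the permission rules are assumptions made for the sake of the example.

```python
# Stages and permission rules are illustrative assumptions; only "pending" and
# "submitted" come from the eSciDoc presentation itself.
PERMISSIONS = {
    "pending":   {"modify": {"creator"}, "delete": {"creator"}},
    "submitted": {"modify": {"editor"},  "delete": set()},
    "released":  {"modify": set(),       "delete": set()},
    "withdrawn": {"modify": set(),       "delete": set()},
}


def can(role, action, stage):
    """Return True if a user with `role` may perform `action` on an item at `stage`."""
    return role in PERMISSIONS[stage].get(action, set())


assert can("creator", "modify", "pending")        # creator may edit a pending item
assert not can("creator", "modify", "submitted")  # creator is locked out after submission
```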

Cloud Storage

Sandy Payette and Michele Kimpton gave an update on the emerging DuraCloud services.  They are currently in development and will be tested with a few beta sites before general release.  The DuraCloud services will definitely be worth Duke looking into; however, we will probably need to wait for more Akubra development before these services can be properly integrated into Fedora.  For Duke’s repository, cloud storage should be evaluated for storage of preservation masters.  Also on the topic of cloud storage, David Tarrant gave an update from ePrints, as well as a reminder: “Clouds do blow away.”

Smart storage underpinning repositories

  • ePrints has exactly what is needed.  Their storage controller allows for rule-based storage configuration.  This is now in their current release.
  • Fedora is still developing Akubra.  Some of the beginnings of this code are in version 3.2, but it is not implemented.  From what I gather, if we have a use case, we need to implement it ourselves.
  • dSpace will be looking at incorporating Akubra into version 2 of dSpace.
  • Reagan Moore (UNC) and Bing Zhu (UCSD) gave a very detailed discussion of iRods, which has a rich architecture for rule-based storage.  It defines many micro-services to be performed on objects, and these micro-services can be chained together.  iRods has a clean rule-based configuration for defining chains of micro-services and the conditions under which these workflow chains should be executed on an object (see the sketch after this list).  iRods allows for a good separation between the remote storage layer and the “metadata repository.”  Bing discussed how iRods is integrated with Fedora.  From what I understood, Fedora does not directly manage iRods; rather, datastreams are created in Fedora as external references to iRods, and iRods must be managed separately.
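
Here is a rough sketch of the rule-based chaining idea: small micro-services composed into a chain that runs only when a condition on the object holds.  The services, conditions, and object fields are made up for illustration; real iRods rules are written in its own rule language rather than Python.

```python
# Illustrative micro-services; stand-ins for real preservation actions.
def checksum(obj):
    obj["checksum"] = "md5:placeholder"   # stand-in for computing a real digest
    return obj


def replicate_to_remote(obj):
    obj.setdefault("replicas", []).append("remote-vault")
    return obj


# Each rule pairs a condition on the object with an ordered chain of micro-services.
RULES = [
    (lambda o: o.get("type") == "preservation-master", [checksum, replicate_to_remote]),
    (lambda o: True,                                    [checksum]),
]


def ingest(obj):
    # Run the first chain whose condition matches the incoming object.
    for condition, chain in RULES:
        if condition(obj):
            for microservice in chain:
                obj = microservice(obj)
            break
    return obj


print(ingest({"id": "duke:img-0001", "type": "preservation-master"}))
```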

JPEG2000

djatoka continues to impress me.  It takes the math out of JPEG2000.  Ryan Chute discussed how it can be integrated into Fedora, and the service definitions involved in doing so.  He also showed some of the image viewers that have been built using djatoka.  With djatoka, the primary use of JPEG2000 is as a presentation format.  The integration with Fedora relies on a separate JPEG2000 “caching” server for serving up JPEG2000 services, which would live outside of Fedora.  In this model, it may be that Fedora never even needs to hold a JPEG2000 file.  I need a little more understanding of how the caching server gets populated, but I will be investigating this in the coming months.
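
As a hedged sketch of what this looks like from a front end’s point of view, the snippet below builds a request asking a djatoka-style dissemination server (living outside Fedora) for a scaled JPEG derivative of a JPEG2000 master.  The host name is hypothetical, and the OpenURL parameter names reflect my reading of the djatoka resolver, so verify them against the djatoka documentation before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical host for the dissemination/"caching" server that lives outside Fedora.
DJATOKA = "http://images.example.edu/adore-djatoka/resolver"


def derivative_url(image_uri, scale=0.5, out_format="image/jpeg"):
    """Build a URL asking the image server for a scaled JPEG of a JPEG2000 master.

    Parameter names are my reading of the djatoka OpenURL resolver and should be
    checked against the djatoka documentation.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_id": image_uri,                      # identifier of the JP2 source image
        "svc_id": "info:lanl-repo/svc/getRegion",
        "svc.format": out_format,
        "svc.scale": scale,
    }
    return f"{DJATOKA}?{urlencode(params)}"


print(derivative_url("info:fedora/duke:broadside-0001/JP2"))
```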

Islandora

UPEI has packaged an integration of Drupal and Fedora.  It is a mixed bag as to which content is stored in Fedora and which is stored in Drupal, and as new types of content are added in Drupal, new content models need to be created in Fedora to support them.  The presenter indicated that work still needs to be done on having updates in Fedora reflected in Drupal and vice versa.  Without more than a presentation to base my opinions on, this seems like an extensible model, but one that also requires continued hand-tuning and management.

Complex object packaging

METS and OAI-ORE, or should it be METS vs. OAI-ORE?  There has been a lot more discussion and work around OAI-ORE in the last year.  It is a much more flexible packaging model for complex objects than METS, and it is the model on which SWORD and similar efforts are based.  With that flexibility, though, comes programmatic complexity.  Our repository model is based on a METS-centric view of digital repositories.  We did, however, generalize item structure in such a way that we could conceivably change the underlying structure from METS to something like ORE (see the sketch below).  More to come on this.
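
As a sketch of what that generalization could look like, the item structure is modeled abstractly and serialization is deferred to a pluggable packager, so METS could later be swapped for something ORE-like.  The class names and the heavily simplified METS and ORE output are illustrative only, not Trident’s actual code or complete serializations.

```python
class Item:
    def __init__(self, pid, components):
        self.pid = pid
        self.components = components      # ordered list of component identifiers


class MetsPackager:
    def serialize(self, item):
        # A toy structMap: the item's div points at each component file.
        fptrs = "".join(f'<mets:fptr FILEID="{c}"/>' for c in item.components)
        return f'<mets:structMap><mets:div ID="{item.pid}">{fptrs}</mets:div></mets:structMap>'


class OrePackager:
    def serialize(self, item):
        # A rough Turtle-ish resource map: the item aggregates its components.
        return "\n".join(f"<{item.pid}> ore:aggregates <{c}> ." for c in item.components)


def package(item, packager):
    # The repository depends only on this seam, not on a specific package format.
    return packager.serialize(item)


item = Item("duke:broadside-0001", ["duke:broadside-0001/image-1", "duke:broadside-0001/image-2"])
print(package(item, MetsPackager()))
print(package(item, OrePackager()))
```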

Cool stuff

@mire showed off some authoring tools integrated into Microsoft Office as add-ins.  I’m told these won’t be released for at least six months, but they showed some real possibilities and the value that repositories can add for authors.  The authoring tools decompose PowerPoint presentations and Word documents and store them in the repository, and then allow searching of the repository (from within PowerPoint and Word) to pull slides, images, text, etc. from the repository into the working document.

Peter Sefton showed off his Fascinator.  It features click-to-create portals that can then be customized fairly easily.  He also talked about work he is currently doing on a “desktop sucker upper,” which extracts data from a laptop to store in a repository.

Programming notes

  • eSciDoc is using the same terminology as we are, in terms of items and components.  This is good, although I have not heard our terminology used much in other contexts.  Also, dSpace seems to be moving away from this terminology.
  • Enhanced content modeling – this development allows for more precise description of datastreams and relationships.  It is not incorporated into Fedora proper, although it should be, because it adds a lot of value to the core.
  • There are others taking a RESTful approach to repositories, at least in representing the R in CRUD.
  • Others confirmed my belief that web services (RESTful ones) should be programmer-friendly as well as computer-friendly.  In other words, the responses should display in web pages and give a programmer at least a rudimentary but helpful view of the data (see the sketch after this list).
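
A small sketch of that last idea: the same RESTful resource returns JSON to clients that ask for it and a rudimentary HTML view to anyone poking at the URL in a browser.  Flask, the route, and the sample record are illustrative choices on my part, not Trident’s actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# A single hypothetical record, standing in for whatever the repository holds.
ITEMS = {"duke:broadside-0001": {"title": "Life-Preserving Coffin", "components": 2}}


@app.route("/items/<path:pid>")
def get_item(pid):
    item = ITEMS.get(pid)
    if item is None:
        return "Not found", 404
    if request.accept_mimetypes.best == "application/json":
        return jsonify(item)                                   # machine-friendly response
    # Browser-friendly fallback: a rudimentary but readable view of the same data.
    rows = "".join(f"<li>{key}: {value}</li>" for key, value in item.items())
    return f"<h1>{pid}</h1><ul>{rows}</ul>"


if __name__ == "__main__":
    app.run()
```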

Fedora

FIZ Karlsruhe has done extensive performance testing and tuning of Fedora.  They tested with data sets of up to 40 million objects.  In terms of scaling, performance was not affected by the size of the repository.  They were also able to increase performance by tuning the database, as well as by separating the database from the repository.  They found that I/O was the limiting factor in all cases.

Fedora 3.2 highlights – the beginnings of Akubra, SWORD integration, and a switch to a new development environment (Maven, OSGi/Spring DM)

dSpace

SWORD support, Shibboleth support out of the box, and a new content model in dSpace 2.0 (based on entities and relationships)

Building the Broadsides Collection: Conservation

What happens when an entire collection goes through the Conservation Department to be processed so that it can be digitized?  What do these collections look like through the eyes of a conservator?  What level of conservation work should a collection get?  How long does it take to process a collection?  These are some of the common questions asked of the Conservation Staff.  In our second installment of Digital Collections “Behind the Scenes,” we will explore these questions and more.  Below is an overview of the process, which is explained in detail in the embedded video.

Overview:
1. Sort
2. Remove Mylar
3. Assess collection for repair
4. Repair
5. Flag problem items for the Digital Production Center
6. Re-house
7. Repeat

The next stage of the process is digitization — coming soon!

Building the Broadsides Collection: A Large-Scale Digitization Approach

I’m happy to report that work on the Broadsides and Ephemera Collection has begun! The source content for this project is an artificial collection in Duke’s Special Collections Library, dated 1790-1940. Truly an interdisciplinary collection, it includes materials related to political campaigns, politics, theater, dance, museum exhibitions, advertising, travel, expositions, and military campaigns, and it presents historical perspectives on race relations, gender, and religion. On many items, you can still see holes in the upper corners from the original posting of the signs and flyers.

Aside from past processing decisions that brought this artificial collection together in the first place, we will do no selection before digitization. Our goal is to digitize ALL of the content (roughly 5,000 items) and to use it as an example of an “open-ended” digital collection. If we acquire additional broadsides and posters, they can be digitized and added to this collection on an ongoing basis.

We also consider this project the digitization of a hidden collection: the early broadsides and posters are a significant but underutilized resource.

On the Trident Project: Part 1 – Architecture

The library’s search for software to support metadata creation served as the topic of two posts of mine from late last year, A Metadata Tool that Scales and Grand Metadata Tool Ideas. Those posts discussed our internal process and analysis, and engaged in some “guilt-free big thinking.”  This post will report on our progress since we broke for the winter holidays.  While much conjecture remains in what follows, we have real progress to report, which I plan to do in three parts over the next week or so.

Last fall, we completed a successful job search for two programmers to support the project.  We were thrilled to bring on two talented and experienced individuals.  Lead Programmer Dave Kennedy comes to us from the University of Maryland, where he managed the Office of Digital Collections and Research.  User Experience Developer TJ Ward made the move from the on-demand self-publishing outfit Lulu.  Dave, TJ, and I serve as principal developers for the project, with an extended team that includes other members of the library’s IT staff.

As the team formed, we took the critical step of fixing a name to the project — Trident, which we chose for a number of reasons that sounded good at the time.  First, we call our home-cooked front-end platform for digital collections Tripod, for its three-legged architecture.  Use of the “Tri-” formulation evokes Duke’s history, and the trident imagery evokes its school mascot. Additionally, I am known to use a water metaphor to talk about metadata, which goes as follows:  “Metadata flows from librarians to patrons like water to the sea.  It is inevitable and inexorable.  You don’t want to stop it, and you couldn’t if you did.  What you do is engineer the landscape so that it meanders instead of floods, and serves as a nourishing resource, not a destructive force.”  The trident, of course, is the tool with which Poseidon controls the seas.  Finally, Wikipedia informs us that Bill Brasky used a trident to kill Wolfman Jack.  Thanks, Wikipedia!


Building the Broadsides Collection: Part 1

Life-Preserving Coffin: In doubtful cases of death

Over the next few months, we’ll be writing a series of posts that offer a behind-the-scenes look at all of the work and decision-making that goes into building one digital collection, from selection, conservation, and physical processing to scanning, metadata, and publication.  We’ve chosen to blog about our work on the Broadsides collection in particular for several reasons:

  • It’s a relatively large-scale project that will test our ability to ramp up our digitization efforts (5,500 items from the U.S. and abroad, dated 1790-1940)
  • It will serve as a test case for the development and use of our new metadata tool, code-named “Trident.”
  • It will be a pilot project to get more library staff involved in generating metadata for digital collections.

So check in periodically to see how the project is moving along!