We’re experimenting with changing our approach to projects in Software Development and Integration Services (SDIS). There’s been much talk of Agile (see the Agile Manifesto) over the past few years within our department, but we’ve faced challenges implementing this as an approach to our work given our broad portfolio, relatively small team, and large number of internal stakeholders.
After some productive conversations among staff and managers in SDIS, where we reflected on our work over the past few years, we decided to commit to applying the Scrum framework to one or more projects.
There are many resources available for learning about Agile and Scrum. The resources I’ve found most useful so far in learning about the framework include:
Scrum seems best suited to developing new products or software and defines the roles, workflow, and artifacts that help a team make the most of its capacity to build the highest value features first and deliver usable software on a regular and frequent schedule.
To start, we’ll be applying this process to a new project to build a prototype of a research data repository based on Hyrax. We’ve formed a small team, including a product owner, scrum master, and development team, to build the repository. So far, we’ve developed an initial backlog of requirements in the form of user stories in Jira, the software we use to manage projects. We’ve done some backlog refinement to prioritize the most important and highest value features, and defined acceptance criteria for the ones that we’ll consider first. The development team has estimated the story points (a relative estimate of effort and complexity) for some of the user stories to help us with sprint planning and release projection. Our first two-week sprint will begin the week after Thanksgiving. By the end of January we expect to have completed four two-week sprints and have a pilot ready with a basic set of features implemented for evaluation by internal stakeholders.
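To make the link between story points and release projection concrete, here is a minimal sketch of the arithmetic. The story names, point values, and velocity are all hypothetical; they are not taken from our backlog.

```python
import math


def sprints_needed(backlog_points, velocity):
    """Project how many sprints it takes to finish a backlog.

    velocity is the number of story points the team completes per sprint.
    """
    if velocity <= 0:
        raise ValueError("velocity must be positive")
    return math.ceil(backlog_points / velocity)


# Hypothetical user stories and story-point estimates.
backlog = {
    "Deposit a dataset": 8,
    "Assign a DOI": 5,
    "Search by keyword": 3,
    "Download files": 5,
    "Display rights": 3,
}

total_points = sum(backlog.values())
projected_sprints = sprints_needed(total_points, velocity=7)
print(total_points, projected_sprints)
```

In practice the velocity is measured over the first few sprints rather than assumed up front, so a projection like this gets revised as the team learns its pace.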
One of the important aspects of Scrum is that group reflection on the process itself is built into the workflow through retrospective meetings after each sprint. Done right, routine retrospectives serve to reinforce what is working well and allow for adjustments to address things that aren’t. In the future we hope to adapt what we learn from applying the Scrum framework to the research data repository pilot to improve our approach to other aspects of our work in SDIS.
This past year brought renewed focus on AV development, as we worked to bring the NEH grant-funded Radio Haiti Archive online (launched in June). At the same time, our digital collections legacy platform migration efforts shifted toward moving our existing high-profile digital AV material into the repository.
At Duke University Libraries, we take accessibility seriously. We aim to include captions or transcripts for the audiovisual objects made available via the Duke Digital Repository, especially to ensure that the materials can be perceived and navigated by people with disabilities. For instance, work is well underway to create closed captions for all 1,400 items in the Duke Chapel Recordings project.
The DDR now accommodates modeling and ingest for caption files, and our AV player interface (powered by JW Player) presents a CC button whenever a caption file is available. Caption files are encoded using WebVTT, the modern W3C standard for associating timed text with HTML audio and video. WebVTT is structured so as to be machine-processable, while remaining lightweight enough to be reasonably read, created, or edited by a person. It’s a format that transcription vendors can provide. And given its endorsement by W3C, it should be a viable captioning format for a wide range of applications and devices for the foreseeable future.
Displaying captions within the player UI is helpful, but it only gets us so far. For one, that doesn’t give a user a way to just read the caption text without requiring them to play the media. We also need to support captions for audio files, but unlike with video, the audio player doesn’t include enough real estate within itself to render the captions. There’s no room for them to appear.
We also do some extra formatting when the WebVTT cues include voice tags (<v> tags), which can optionally indicate the name of the speaker (e.g., <v Jane Smith>). The in-page transcript is indexed by Google for search retrieval.
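As an illustration of how voice tags can drive transcript formatting, here is a minimal sketch that turns WebVTT cues into (start time, speaker, text) rows. It is not the DDR's implementation: the sample captions and the function name are made up, and the parsing covers only simple cues, not the full WebVTT grammar.

```python
import re

# Hypothetical sample caption file; the speakers and text are invented.
SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.000
<v Jane Smith>Good morning, and welcome.

00:00:04.500 --> 00:00:07.250
<v John Doe>Thank you for having me.
"""

# A cue timing line, with an optional hours field.
CUE_TIMING = re.compile(
    r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3} --> (\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
# A leading voice tag such as <v Jane Smith>, optionally with classes.
VOICE_TAG = re.compile(r"^<v(?:\.[^\s>]+)?\s+([^>]+)>(.*)$")


def vtt_to_transcript(vtt_text):
    """Return a list of (start_time, speaker_or_None, text) tuples."""
    rows = []
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        block_lines = block.splitlines()
        for i, line in enumerate(block_lines):
            if not CUE_TIMING.match(line):
                continue
            start = line.split(" --> ")[0]
            payload = " ".join(block_lines[i + 1:]).strip()
            match = VOICE_TAG.match(payload)
            if match:
                speaker, text = match.group(1).strip(), match.group(2)
            else:
                speaker, text = None, payload
            # Drop any remaining inline tags (e.g. </v>, <i>).
            text = re.sub(r"</?[^>]+>", "", text).strip()
            rows.append((start, speaker, text))
            break
    return rows


for start, speaker, text in vtt_to_transcript(SAMPLE_VTT):
    print(f"[{start}] {speaker or 'Unknown'}: {text}")
```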
In many cases, especially for audio items, we may have only a PDF or other type of document with a transcript of a recording that isn’t structured or time-coded. Like captions, these documents are important for accessibility. We have developed support for displaying links to these documents near the media player. Look for some new collections using this feature to become available in early 2018.
The DDR web interface provides an optimal viewing or listening experience for AV, but we also want to make it easy to present objects from the DDR on other websites, too. When used on other sites, we’d like the objects to include some metadata, a link to the DDR page, and proper attribution. To that end, we now have copyable <iframe> embed code available from the Share menu for AV items.
This embed code is also what we now use within the Rubenstein Library collection guides (finding aids) interface: it lets us present digital objects from the DDR directly from within a corresponding collection guide. So as a researcher browses the inventory of a physical archival collection, they can play the media inline without having to leave.
If your website or blog is one of the thousands of WordPress sites hosted and supported by Sites@Duke — a service of Duke’s Office of Information Technology (OIT) — we have good news for you. You can now embed objects from the DDR using WordPress shortcode. Sites@Duke, like many content management systems, doesn’t allow authors to enter <iframe> tags, so shortcode is the only way to get embeddable media to render.
Here are the other AV-related features we have been able to develop in 2017:
Access control: master files & derivatives alike can be protected so access is limited to only authorized users/groups
Video thumbnail images: model, manage, and display
Video poster frames: model, manage, and display
Intermediate/mezzanine files: model and manage
Rights display: display icons and info from RightsStatements.org and Creative Commons, so it’s clear what users are permitted to do with media.
We look forward to sharing our recent AV development with our peers at the upcoming Samvera Connect conference (Nov 6-9, 2017 in Evanston, IL). Here’s our poster summarizing the work to date:
Looking ahead to the next couple months, we aim to round out the year by completing a few more AV-related features, most notably:
Export WebVTT captions as PDF or .txt
Advance the player via linked timecodes in the description field in an item’s metadata
Improve workflows for uploading caption files and transcript documents
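For the linked-timecodes item in the list above, one plausible first step is to find timecodes in a description and convert them to seconds so a player can seek to them. This is only a sketch under that assumption; the function name and sample description are hypothetical, and the planned feature may work differently.

```python
import re

# Matches M:SS, MM:SS, or H:MM:SS style timecodes.
TIMECODE = re.compile(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b")


def timecodes_to_seconds(description):
    """Return (timecode_text, offset_in_seconds) pairs found in the text."""
    results = []
    for match in TIMECODE.finditer(description):
        hours = int(match.group(1) or 0)
        minutes = int(match.group(2))
        seconds = int(match.group(3))
        results.append((match.group(0), hours * 3600 + minutes * 60 + seconds))
    return results


# Hypothetical description field.
description = "Introduction at 0:45; interview begins 12:30; closing remarks 1:02:03."
print(timecodes_to_seconds(description))
```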
Now that these features are in place, we’ll be sharing a bunch of great new AV collections soon!
A while back, I wrote a blog post about my enjoyment in digitizing the William Gedney Photograph collection and how it was inspiring me to build a darkroom in my garage. I wish I could say that the darkroom is up and running, but so far all I’ve installed is the sink. However, as Molly announced in her last Bitstreams post, we have launched the Gedney collection, which includes two series that are complete (Finished Prints and Contact Sheets), with more to come.
The newly launched site brings together this amazing body of work in a seamless way. The site allows you to browse the collection, use the search box to find something specific, or use the facets to filter by series, location, subject, year, and format. If that isn’t enough, we have linked not only related prints from the same contact sheet but also related prints of the same image. For example, you can browse the collection and click on an image of Virgil Thomson, an American composer, smoothly zoom in and out of the image, then scroll to the bottom of the page to find a thumbnail of the contact sheet from which the negative comes. When you click through the thumbnail you can zoom into the contact sheet and see additional shots that Gedney took. You can even see which frames he highlighted for closer inspection. If you scroll to the bottom of this contact sheet page you will find that 2 of those highlighted frames have corresponding finished prints. Wow! I am telling you, check out the site, it is super cool!
What you do not see [yet], because I am in the middle of digitizing this series, is all of the proof prints Gedney produced of Virgil Thomson, 36 in all. Here are a few below.
Once the proof prints are digitized and ingested into the Repository you will be able to experience Gedney’s photographs from many different angles, vantage points and perspectives.
International Broadsides (added to migrated Broadsides and Ephemera collection): https://repository.duke.edu/dc/broadsides
Orange County Tax List Ledger, 1875: https://repository.duke.edu/dc/orangecountytaxlist
Radio Haiti Archive, second batch of recordings: https://repository.duke.edu/dc/radiohaiti
William Gedney Finished Prints and Contact Sheets (newly re-digitized with new and improved metadata): https://repository.duke.edu/dc/gedney
In addition to the brand new items, the digital collections team is constantly chipping away at the digital collections migration. Here are the latest collections to move from Tripod 2 to the Duke Digital Repository (these are either available now or will be very soon):
What we hoped would be a speedy transition is still a work in progress 2 years later. This is due to a variety of factors, one of which is that the work itself is very complex. Before we can move a collection into the digital repository it has to be reviewed, all digital objects fully accounted for, and all metadata remediated and crosswalked into the DDR metadata profile. Sometimes this process requires little effort. Other times, however, especially with older collections, we have items with no metadata, or metadata with no items, or the numbers in our various systems simply do not match. Tracking down the answers can require some major detective work on the part of my amazing colleagues.
Despite these challenges, we eagerly press on. As each collection moves we get a little closer to having all of our digital collections under preservation control and providing access to all of them from a single platform. Onward!
It’s September, and Duke students aren’t the only folks on campus in back-to-school mode. On the contrary, we here at the Duke Digital Repository are gearing up to begin promoting our research data curation services in real earnest. Over the last eight months, our four new research data staff have been busy getting to know the campus and the libraries, getting to know the repository itself and the tools we’re working with, and establishing a workflow. Now we’re ready to begin actively recruiting research data depositors!
As our colleagues in Data and Visualization Services noted in a presentation just last week, we’re aiming to scale up our data services in a big way by engaging researchers at all stages of the research lifecycle, not just at the very end of a research project. We hope to make this effort a two-front one. Through a series of ongoing workshops and consultations, the Research Data Management Consultants aspire to help researchers develop better data management habits and take the long-term preservation and re-use of their data into account when designing a project or applying for grants. On the back-end of things, the Content Analysts will be able to carry out many of the manual tasks that facilitate that long-term preservation and re-use, and are beginning to think about ways to tweak our existing software to better accommodate the needs of capital-D Data.
This past spring, the Data Management Consultants carried out a series of workshops intending to help researchers navigate the often muddy waters of data management and data sharing; topics ranged from available and useful tools to the occasionally thorny process of obtaining consent for–and the re-use of–data from human subjects.
Looking forward to the fall, the RDM consultants are planning another series of workshops to expand on the sessions given in the spring, covering new tools and strategies for managing research output. One of the tools we’re most excited to share is the Open Science Framework (OSF) for Institutions, which Duke joined just this spring. OSF is a powerful project management tool that helps promote transparency in research and allows scholars to associate their work and projects with Duke.
On the back-end of things, much work has been done to shore up our existing workflows, and a number of policies–both internal and external–have been met with approval by the Repository Program Committee. The Content Analysts continue to become more familiar with the available repository tools, while weighing in on ways in which we can make the software work better. The better part of the summer was devoted to collecting and analyzing requirements from research data stakeholders (among others), and we hope to put those needs in the development spotlight later this fall.
All of this is to say: we’re ready for it, so bring us your data!
Born-digital archival materials present unique challenges to representation, access, and discovery in the DDR. A hard drive arrives at the archives and we want to preserve and provide access to the files. In addition to the content of the files, it’s often important to preserve to some degree the organization of the material on the hard drive in nested directories.
One challenge to representing complex inter-object relationships in the repository is the repository’s relatively simple object model. A collection contains one or more items. An item contains one or more components. And a component has one or more data streams. There’s no accommodation in this model for complex groups and hierarchies of items. We tend to talk about this as a limitation, but it also makes it possible to provide search and discovery of a wide range of kinds and arrangements of materials in a single repository and forces us to make decisions about how to model collections in sustainable and consistent ways. But we still need to preserve and provide access to the original structure of the material.
One approach is to ingest the disk image or a zip archive of the directories and files and store the content as a single file in the repository. This approach is straightforward, but makes it impossible to search for individual files in the repository or to understand much about the content without first downloading and unarchiving it.
As a first pass at solving this problem of how to preserve and represent files in nested directories in the DDR we’ve taken a two-pronged approach. We will use a simple approach to modeling disk image and directory content in the repository. Every file is modeled in the repository as an item with a single component that contains the data stream of the file. This provides convenient discovery and access to each individual file from the collection in the DDR, but does not represent any folder hierarchies. The files are just a flat list of objects contained by a collection.
To preserve and store information about the structure of the files we add an XML METS structMap as metadata on the collection. In addition we store on each item a metadata field that stores the complete original file path of the file.
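The flat modeling described above can be sketched in a few lines. This is an illustration only: the class names, the identifier scheme, and the field names are assumptions, not the DDR's actual data model.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List


@dataclass
class Component:
    filename: str  # stands in for the file's data stream


@dataclass
class Item:
    identifier: str     # hypothetical local identifier
    original_path: str  # complete original file path, kept as metadata
    components: List[Component] = field(default_factory=list)


def model_files_as_items(paths):
    """Turn each file path into one item holding exactly one component."""
    items = []
    for number, path in enumerate(sorted(paths), start=1):
        filename = PurePosixPath(path).name
        items.append(Item(f"item-{number:04d}", path, [Component(filename)]))
    return items


items = model_files_as_items([
    "kitchen/fruits/bananas/this-banana.txt",
    "kitchen/fruits/apple.txt",
])
for item in items:
    print(item.identifier, item.original_path)
```

The result is a flat list: nothing in the items themselves nests, and the folder hierarchy survives only in the original_path value and in the collection-level structural metadata.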
Below is a small sample of the kind of structural metadata that encodes the nested folder information on the collection. It encodes the structure and nesting, directory names (in the LABEL attribute), the order of files and directories, as well as the identifiers for each of the files/items in the collection.
Combining the 1:1 (item:component) object model with structural metadata that preserves the original directory structure of the files on the file system enables us to display a user interface that reflects the original structure of the content even though the structure of the items in the repository is flat.
There’s more to it of course. We had to develop a new ingest process that could take as its starting point a file path and then crawl it and its subdirectories to ingest files and construct the necessary structural metadata.
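A rough sketch of that kind of crawl, using only the Python standard library, might look like the following. It is not our ingest code: the function name, the sequential item identifiers, and the choice of TYPE values are assumptions made for illustration. The METS element and attribute names (structMap, div, fptr, LABEL, ORDER, CONTENTIDS) come from the METS schema.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_struct_map(root_dir):
    """Crawl root_dir and return a METS structMap mirroring its layout."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "default"})
    counter = [0]  # running count used to mint hypothetical item identifiers

    def add_directory(parent_el, dir_path):
        for order, name in enumerate(sorted(os.listdir(dir_path)), start=1):
            full_path = os.path.join(dir_path, name)
            div = ET.SubElement(
                parent_el, f"{{{METS_NS}}}div", {"ORDER": str(order)}
            )
            if os.path.isdir(full_path):
                # Directories carry their name in LABEL and recurse.
                div.set("LABEL", name)
                div.set("TYPE", "Directory")
                add_directory(div, full_path)
            else:
                # Files point at the repository item that holds them.
                counter[0] += 1
                ET.SubElement(
                    div,
                    f"{{{METS_NS}}}fptr",
                    {"CONTENTIDS": f"item-{counter[0]:04d}"},
                )

    add_directory(struct_map, root_dir)
    return struct_map


# Build a tiny throwaway directory tree and crawl it.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "kitchen", "fruits", "bananas"))
    for rel in ("kitchen/fruits/apple.txt",
                "kitchen/fruits/bananas/this-banana.txt"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as handle:
            handle.write("sample")
    struct_map = build_struct_map(tmp)
    xml_text = ET.tostring(struct_map, encoding="unicode")

print(xml_text)
```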
Because some of the collections are very large and loading a directory tree structure of 100,000 or more items would be very slow, we implemented a small web service in the application that loads the jsTree data only when someone clicks to open a directory in the interface.
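The lazy-loading idea can be sketched as a function that returns only the immediate children of one directory, which is what a web service would send back each time a folder is opened. The function name and sample paths are hypothetical. The node shape (id, text, and a boolean children flag meaning "load on demand") follows jsTree's JSON format as we understand it, so treat that as an assumption to check against the jsTree documentation.

```python
from pathlib import PurePosixPath


def children_for(paths, directory=""):
    """Return jsTree-style nodes for the immediate children of directory."""
    parent_parts = PurePosixPath(directory).parts
    depth = len(parent_parts)
    nodes = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        # Skip paths that are not underneath the requested directory.
        if parts[:depth] != parent_parts or len(parts) <= depth:
            continue
        name = parts[depth]
        is_dir = len(parts) > depth + 1
        if name not in nodes or is_dir:
            nodes[name] = {
                "id": "/".join(parts[:depth + 1]),
                "text": name,
                # True tells the tree to fetch this node's children later.
                "children": is_dir,
            }
    return [nodes[name] for name in sorted(nodes)]


# Hypothetical original file paths stored on the items.
PATHS = [
    "kitchen/fruits/bananas/this-banana.txt",
    "kitchen/fruits/apple.txt",
    "kitchen/readme.txt",
]

print(children_for(PATHS))             # top level only
print(children_for(PATHS, "kitchen"))  # loaded when "kitchen" is opened
```

Each response stays small no matter how large the collection is, because it never includes more than one directory's worth of entries.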
The file paths are also keyword searchable from within the public interface. So if a file’s original path is “kitchen/fruits/bananas/this-banana.txt” you would be able to find the file this-banana.txt by searching for “kitchen” or “fruit” or “banana.”
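To show why a search for “fruit” can match a directory named “fruits”, here is a toy sketch of path tokenizing with very naive plural stripping. It is a stand-in for whatever text analysis the search index actually performs, which this post does not describe; the function names are hypothetical.

```python
import re


def stem(token):
    """Very naive stemming: drop a trailing 's' from longer words."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def path_terms(path):
    """Split a file path into lowercase, stemmed search terms."""
    tokens = re.split(r"[^a-z0-9]+", path.lower())
    return {stem(token) for token in tokens if token}


def matches(query, path):
    """True when the stemmed query equals one of the path's terms."""
    return stem(query.lower()) in path_terms(path)


path = "kitchen/fruits/bananas/this-banana.txt"
for query in ("kitchen", "fruit", "banana", "mango"):
    print(query, matches(query, path))
```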
This new functionality to ingest, preserve, and represent files in nested folder structures in the Duke Digital Repository will be included in the September release of the Duke Digital Repository.
As 2017 reaches its halfway point, we have concluded another busy quarter of development on the Duke Digital Repository (DDR). We have several new features to share, and one we’re particularly delighted to introduce is Rights display.
Back in March, my colleague Maggie Dickson shared our plans for rights management in the DDR, a strategy built upon using rights status URIs from RightsStatements.org, and in a similar fashion, licenses from Creative Commons. In some cases, we supplement the status with free text in a local Rights Note property. Our implementation goals here were two-fold: 1) use standard statuses that are machine-readable; 2) display them in an easily understood manner to users.
What to Display
Getting and assigning machine-readable URIs for Rights is a significant milestone in its own right. Using that value to power a display that makes sense to users is the next logical step. So, how do we make it clear to a user what they can or can’t do with a resource they have discovered? While we could simply display the URI and link to its webpage (e.g., http://rightsstatements.org/vocab/InC-EDU/1.0/ ) the key info still remains a click away. Alternatively, we could display the rights statement or license title with the link, but some of them aren’t exactly intuitive or easy on the eyes. “Attribution-NonCommercial-NoDerivatives 4.0 International,” anyone?
Looking around to see how other cultural heritage institutions have solved this problem led us to very few examples. RightsStatements.org is still fairly new and it takes time for good design patterns to emerge. However, Europeana — co-champion of the RightsStatements.org initiative along with DPLA — has a stellar collections site, and, as it turns out, a wonderfully effective design for displaying rights statuses to users. Our solution ended up very much inspired by theirs; hats off to the Europeana team.
Both Creative Commons and RightsStatements.org provide downloadable icons at their sites (here and here). We opted to store a local copy of the circular SVG versions for both to render in our UI. They’re easily styled, they don’t take up a lot of space, and used together, they have some nice visual unity.
Labels & Titles
We have a lightweight Rails app with an easy-to-use administrative UI for managing auxiliary content for the DDR, so that made a good home for our rights statuses and associated text. Statements are modeled to have a URI and Title, but can also have three additional optional fields: short title, re-use text, and an array of icon classes.
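The shape of a statement record can be sketched as follows. The real records live in a Rails app, so this Python version is only an illustration of the fields named above; the re-use text and icon class names are invented placeholders, while the URIs and titles are the published ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RightsStatement:
    uri: str
    title: str
    short_title: Optional[str] = None
    reuse_text: Optional[str] = None
    icon_classes: List[str] = field(default_factory=list)

    def display_label(self):
        """Prefer the short title when one has been entered."""
        return self.short_title or self.title


INC_EDU = "http://rightsstatements.org/vocab/InC-EDU/1.0/"
CC_BY_NC_ND = "https://creativecommons.org/licenses/by-nc-nd/4.0/"

# Look statements up by the URI stored on each digital object.
STATEMENTS = {
    INC_EDU: RightsStatement(
        uri=INC_EDU,
        title="In Copyright - Educational Use Permitted",
        reuse_text="Placeholder re-use guidance for educational use.",
        icon_classes=["rs-incopyright"],  # hypothetical class name
    ),
    CC_BY_NC_ND: RightsStatement(
        uri=CC_BY_NC_ND,
        title="Attribution-NonCommercial-NoDerivatives 4.0 International",
        short_title="CC BY-NC-ND 4.0",
        reuse_text="Placeholder re-use guidance for this license.",
        icon_classes=["cc-by", "cc-nc", "cc-nd"],  # hypothetical class names
    ),
}

for statement in STATEMENTS.values():
    print(statement.display_label(), statement.icon_classes)
```

Keeping the short title optional means a statement with an already readable title needs no extra data entry, while the unwieldy ones can be shortened for display.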
Displaying the Info
We wanted to be sure to show the rights status in the flow of the rest of an object’s metadata. We also wanted to emphasize this information for anyone looking to download a digital object. So we decided to render the rights status prominently in the download menu, too.
Our focus in this area now shifts toward applying these newly available rights statuses to our existing digital objects in the repository, while ensuring that new ingests/deposits get assessed and assigned appropriate values. We’ll also have opportunities to refine where and how the statuses get displayed. We stand to learn a lot from our peer organizations implementing their own rights management strategies, and from our visitors as they use this new feature on our site. There’s a lot of work ahead, but we’re thrilled to have reached this noteworthy milestone.
Duke Digital Repository is, among other things, a digital preservation platform and the locus of much of our work in that area. As such, we often ponder the big questions:
What is the repository?
What is digital preservation?
How are we doing?
What is the repository?
Fortunately, Ginny gave us a good start on defining the repository in Revisiting: What is the Repository? It’s software, hardware, and collaboration. It’s processes, policies, attention, and intention. While digital preservation is one of the focuses of the repository, digital preservation extends beyond the repository and should far outlive the repository.
What is digital preservation?
There are scores of definitions, but this Medium Definition from ALCTS is representative:
Digital preservation combines policies, strategies and actions to ensure access to reformatted and born digital content regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.
There are 2 basic methodologies for assessing this work: reactive and proactive. A reactive approach to digital preservation might be characterized by “Hey! We haven’t lost anything yet!”, which is why we like the proactive approach.
Digital preservation can be a pretty deep rabbit hole, and it can be an expensive proposition to attempt to mitigate the long tail of risk. Fortunately, the community of practice has developed tools to assist in the planning and execution of trustworthy repositories. At Duke, we’ve got several years’ experience working within the framework of the Center for Research Libraries’ Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), the primary assessment tool by which we measure our efforts. Much of the work to document our preservation environment and the supporting institutional commitment was focused on our DSpace repository, DukeSpace. A great deal has changed in the last 3 years, including significant growth in our team and scope. So, once again we’re working to measure ourselves against the standards of our profession and to use that process to inform our work.
There are 3 areas of focus in TRAC: Organizational Infrastructure, Digital Object Management, and Technologies, Technical Infrastructure, & Security. These cover a very wide and deep field and include things like:
Securing Service Level Agreements for all service providers
Documenting the organizational commitments of both Duke University and Duke University Libraries and sustainability plans relating to the repository
Creating and implementing routine testing of backup, remote replication, and restoration of data and relevant infrastructure
Creating and approving documentation on a wide variety of subjects for internal and external audiences
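One small piece of the routine testing listed above, checking that restored files match what was backed up, can be sketched with checksums. This is a generic illustration under stated assumptions (a manifest mapping relative paths to SHA-256 digests, and hypothetical function names); it does not describe our actual backup tooling.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest without reading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(manifest, restored_dir):
    """Compare restored files against a {relative_path: sha256} manifest.

    Returns a list of (relative_path, problem) tuples; empty means success.
    """
    failures = []
    for rel_path, expected in sorted(manifest.items()):
        candidate = os.path.join(restored_dir, rel_path)
        if not os.path.isfile(candidate):
            failures.append((rel_path, "missing"))
        elif sha256_of(candidate) != expected:
            failures.append((rel_path, "checksum mismatch"))
    return failures


# Simulate a restore in which one file came back and one did not.
with tempfile.TemporaryDirectory() as restored:
    with open(os.path.join(restored, "a.txt"), "wb") as handle:
        handle.write(b"bits")
    manifest = {
        "a.txt": hashlib.sha256(b"bits").hexdigest(),
        "b.txt": "0" * 64,  # placeholder digest for a file that is absent
    }
    failures = verify_restore(manifest, restored)

print(failures)
```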
Back to the question: How are we doing?
Well, we’re making progress! Naturally we’re starting by ensuring the basic needs are met first: successfully preserving the bits, maximizing transparency and external validation that we’re not losing the bits, and working on a sustainable, scalable architecture. We have a lot of work ahead of us, of course. The boxes in the illustration are all the same size, but the work they represent is not. For example, the Disaster Recovery Plan at HathiTrust is 61 pages of highly detailed thoughtfulness. However, these works build on each other, so we’re confident that the work we’re doing on the supporting bodies of policy, procedure, and documentation will ease the work toward a complete Disaster Recovery Plan.
Why research data? Data generated by scholars in the course of investigation are increasingly being recognized as outputs nearly equal in importance to the scholarly publications they support. Among other benefits, the open sharing of research data reinforces unfettered intellectual inquiry, fosters reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. Data sharing, though, starts with data curation.
In January of this year, Duke University Libraries brought on four new staff members–two Research Data Management Consultants and two Digital Content Analysts–to engage in this curatorial effort, and we have spent the last few months mapping out and refining a research data curation workflow to ensure best practices are applied to managing data before, during, and after ingest into the Duke Digital Repository.
What does this workflow entail? A high level overview of the process looks something like the following:
After collecting their data, the researcher will take whatever steps they can to prepare it for deposit. This generally means tasks like cleaning and de-identifying the data, arranging files in a structure expected by the system, and compiling documentation to ensure that the data is comprehensible to future researchers. The Research Data Management Consultants will be on hand to help guide these efforts and provide researchers with feedback about data management best practices as they prepare their materials.
Depositors will then be asked to complete a metadata form and electronically sign a deposit agreement defining the terms of deposit. After we receive this information, someone from our team will invite the depositor to transfer their files to us, usually through Box.
At this stage, the Research Data Management Consultants will begin a preliminary review of the researcher’s data by performing a cursory examination for personally identifying or protected health information, inspecting the researcher’s documentation for comprehension and completeness, analyzing the submitted metadata for compliance with the research data application profile, and evaluating file formats for preservation suitability. If they have any concerns, they will contact the researcher to make some suggestions about ways to better align the deposit with best practices.
When the deposit is in good shape, the Research Data Management Consultants will notify the Digital Content Analysts, who will finalize the file arrangement and migrate some file formats, generate and normalize any necessary or missing metadata, ingest the files into the repository, and assign the deposit a DOI. After the ingest is complete, the Digital Content Analysts will carry out some quality assurance on the data to verify that the deposit was appropriately and coherently structured and that metadata has been correctly assigned. When this is confirmed, they will publish the data in the repository and notify the depositor.
Of course, this workflow isn’t a finished piece–we hope to continue to clarify and optimize the process as we develop relationships with researchers at Duke and receive more data. The Research Data Management Consultants in particular are enthusiastic about the opportunity to engage with scholars earlier in the research life cycle in order to help them better incorporate data curation standards in the beginning phases of their projects. All of us are looking forward to growing into our new roles, while helping to preserve Duke’s research output for some time to come.