Snow is a major event here in North Carolina, and the University and Library were operating accordingly under a “severe weather policy” last week due to 6-12 inches of frozen precipitation. While essential services continued undeterred, most of the Library’s staff and patrons were asked to stay home until conditions had improved enough to safely commute to and navigate the campus. In celebration of last week’s storm, here are some handy tips for surviving and enjoying the winter weather–illustrated entirely with images from Duke Digital Collections!
Stock up on your favorite vices and indulgences before the storm hits.
2. Be sure to bundle and layer up your clothing to stay warm in the frigid outdoor temperatures.
3. Plan some fun outdoor activities to keep malaise and torpor from settling in.
4. Never underestimate the importance of a good winter hat.
5. While snowed in, don’t let your personal hygiene slip too far.
6. Despite the inconveniences brought on by the weather, don’t forget to see the beauty and uniquity around you.
While the landscape of open access has changed much over the intervening period, we can’t really say the same about the underlying platform of DukeSpace.
At Duke, faculty approved an open access policy in March of 2010; it was a few weeks previous that DSpace 1.6 was released. By the end of the year it had moved ahead a dot release to 1.7. Along the way, we did some customization to integrate with Symplectic Elements – the Research Information Management System (RIMS) that powers the Scholars@Duke site. That work essentially locked us into that version of DSpace, which remains in operation despite its final release in July 2013, and having reached its end of life four years ago.
Beginning last November, we committed to a full upgrade of the DukeSpace platform to the current version (6.2 as of this writing). We had considered alternatives, including replacing the platform with Hyrax, but concluded that that approach would be too complex.
So we are currently coordinating work across a technology team and the Libraries’ open access group. Some of the concerns that we have encountered include:
Integrating with updated versions of Symplectic Elements. That same integration that locked us into a version years ago lies at the center of this upgrade. We have basically been handling this process as a separate thread of the larger project. It will be critical for us to maintain the currency of this dependency with subsequent upgrades to both products.
Rethinking metadata architecture. The conceptual basis of the institutional repository is greatly informed by the definition and use of metadata. Our Metadata Architect, Maggie Dickson, mentioned this area in her “Metadata Year-in-Review” post back in December. She highlighted the need to make “real headway tackling the problem of identity management – leveraging unique identifiers for people (ORCIDs, for example), rather than relying on name strings, which is inherently error prone.” Many other questions have arisen this area, requiring extensive and ongoing discussion and coordination between the tech team and the stakeholders.
Migration of legacy stats data. How do we migrate usage stats between two versions of a platform so remote from each other in time? It has taken some trial-and-error to solve this one.
Replicating or enhancing existing workflows. Again, when two versions of a system are so different that an upgrade seems more like a platform migration, and our infrastructure and staffing have changed over the years, how do we reproduce existing workflows without disrupting them? What opportunities can we take to improve on them without destabilizing the project? Aside from the integration with Elements, we also have the important workflow related to the ingest of electronic theses & dissertations, which employs both self-deposit and file transfer from ProQuest. Re-envisioning and re-implementing workflows such as these takes careful analysis and planning.
While we have run into a few complicating issues during the process so far, we feel confident that we remain on track to roll out the upgraded version during the first quarter of 2018. Pluto remains a dwarf planet, Zidane manages Real Madrid (for now), and to Mark Cuban’s apparent distress, Google still owns Youtube. Soon our own story from 2006 should reach a kind of resolution.
It’s a new year! And a new year means new priorities. One of the many projects DUL staff have on deck for the Duke Digital Repository in the coming calendar year is an upgrade to DSpace, the software application we use to manage and maintain our collections of scholarly publications and electronic theses and dissertations. As part of that upgrade, the existing DSpace content will need to be migrated to the new software. Until very recently, that existing content has included a few research datasets deposited by Duke community members. But with the advent of our new research data curation program, research datasets have been published in the Fedora 3 part of the repository. Naturally, we wanted all of our research data content to be found in one place, so that meant migrating the few existing outliers. And given the ongoing upgrade project, we wanted to be sure to have it done and out of the way before the rest of the DSpace content needed to be moved.
Most of the datasets that required moving were relatively small–a handful of files, all of manageable size (under a gigabyte) that could be exported using DSpace’s web interface. However, a limited series of data associated with a project called The Integrated Precipitation and Hydrology Experiment (IPHEx) posed a notable exception. There’s a lot of data associated with the IPHEx project (recorded daily for 7 years, along with some supplementary data files, and iterated over 3 different areas of coverage, the total footprint came to just under a terabyte, spread over more than 7,000 files), so this project needed some advance planning.
First, the size of the project meant that the data were too large to export through the DSpace web client, so we needed the developers to wrangle a behind the scenes dump of what was in DSpace to a local file system. Once we had everything we needed to work with (which included some previously unpublished updates to the data we received last year from the researchers), we had to make some decisions on how to model it. The data model used in DSpace was a bit limiting, which resulted in the data being made available as a long list of files for each part of the project. In moving the data to our Fedora repository, we gained a little more flexibility with how we could arrange the files. We determined that we wanted to deviate slightly from the arrangement in DSpace, grouping the files by month and year.
This meant we would have group all the files into subdirectories containing the data for each month–for over 7,000 files, that would have been extremely tedious to do by hand, so we wrote a script to do the sorting for us. That completed, we were able to carry out the ingest process as normal. The final wrinkle associated with the IPHEx project was making sure that the persistent identifiers each part of the project data had been assigned in DSpace still resolved to the correct content. One of our developers was able to set up a server redirect to ensure that each URL would still take a user to the right place. As of the new year, the IPHEx project data (along with our other migrated DSpace datasets) are available in their new home!
We’re experimenting with changing our approach to projects in Software Development and Integration Services (SDIS). There’s been much talk of Agile (see the Agile Manifesto) over the past few years within our department, but we’ve faced challenges implementing this as an approach to our work given our broad portfolio, relatively small team, and large number of internal stakeholders.
After some productive conversations among staff and managers in SDIS where we reflected on our work over the past few years we decided to commit to applying the Scrum framework to one or more projects.
There are many resources available for learning about Agile and Scrum. The resources I’ve found most useful so far in learning about the framework include:
Scrum seems best suited to developing new products or software and defines the roles, workflow, and artifacts that help a team make the most of its capacity to build the highest value features first and deliver usable software on a regular and frequent schedule.
To start, we’ll be applying this process to a new project to build a prototype of a research data repository based on Hyrax. We’ve formed a small team, including a product owner, scrum master, and development team to build the repository. So far, we’ve developed an initial backlog of requirements in the form of user stories in Jira, the software we use to manage projects. We’ve done some backlog refinement to prioritize the most important and highest value features, and defined acceptance criteria for the ones that we’ll consider first. The development team has estimated the story points (relative estimate of effort and complexity) for some of the user stories to help us with sprint planning and release projection. Our first two-week sprint will begin the week after Thanksgiving. By the end of January we expect to have completed four, two-week sprints and have a pilot ready with a basic set of features implemented for evaluation by internal stakeholders.
One of the important aspects of Scrum is that group reflection on the process itself is built into the workflow through retrospective meetings after each sprint. Done right, routine retrospectives serve to reinforce what is working well and allows for adjustments to address things that aren’t. In the future we hope to adapt what we learn from applying the Scrum framework to the research data repository pilot to improve our approach to other aspects of our work in SDIS.
This past year brought renewed focus on AV development, as we worked to bring the NEH grant-funded Radio Haiti Archive online (launched in June). At the same time, our digital collections legacy platform migration efforts shifted toward moving our existing high-profile digital AV material into the repository.
At Duke University Libraries, we take accessibility seriously. We aim to include captions or transcripts for the audiovisual objects made available via the Duke Digital Repository, especially to ensure that the materials can be perceived and navigated by people with disabilities. For instance, work is well underway to create closed captions for all 1,400 items in the Duke Chapel Recordings project.
The DDR now accommodates modeling and ingest for caption files, and our AV player interface (powered by JW Player) presents a CC button whenever a caption file is available. Caption files are encoded using WebVTT, the modern W3C standard for associating timed text with HTML audio and video. WebVTT is structured so as to be machine-processable, while remaining lightweight enough to be reasonably read, created, or edited by a person. It’s a format that transcription vendors can provide. And given its endorsement by W3C, it should be a viable captioning format for a wide range of applications and devices for the foreseeable future.
Displaying captions within the player UI is helpful, but it only gets us so far. For one, that doesn’t give a user a way to just read the caption text without requiring them to play the media. We also need to support captions for audio files, but unlike with video, the audio player doesn’t include enough real estate within itself to render the captions. There’s no room for them to appear.
We also do some extra formatting when the WebVTT cues include voice tags (<v> tags), which can optionally indicate the name of the speaker (e.g., <v Jane Smith>). The in-page transcript is indexed by Google for search retrieval.
In many cases, especially for audio items, we may have only a PDF or other type of document with a transcript of a recording that isn’t structured or time-coded. Like captions, these documents are important for accessibility. We have developed support for displaying links to these documents near the media player. Look for some new collections using this feature to become available in early 2018.
The DDR web interface provides an optimal viewing or listening experience for AV, but we also want to make it easy to present objects from the DDR on other websites, too. When used on other sites, we’d like the objects to include some metadata, a link to the DDR page, and proper attribution. To that end, we now have copyable <iframe> embed code available from the Share menu for AV items.
This embed code is also what we now use within the Rubenstein Library collection guides (finding aids) interface: it lets us present digital objects from the DDR directly from within a corresponding collection guide. So as a researcher browses the inventory of a physical archival collection, they can play the media inline without having to leave.
If your website or blog is one of the thousands of WordPress sites hosted and supported by Sites@Duke — a service of Duke’s Office of Information Technology (OIT) — we have good news for you. You can now embed objects from the DDR using WordPress shortcode. Sites@Duke, like many content management systems, doesn’t allow authors to enter <iframe> tags, so shortcode is the only way to get embeddable media to render.
Here are the other AV-related features we have been able to develop in 2017:
Access control: master files & derivatives alike can be protected so access is limited to only authorized users/groups
Video thumbnail images: model, manage, and display
Video poster frames: model, manage, and display
Intermediate/mezzanine files: model and manage
Rights display: display icons and info from RightsStatements.org and Creative Commons, so it’s clear what users are permitted to do with media.
We look forward to sharing our recent AV development with our peers at the upcoming Samvera Connect conference (Nov 6-9, 2017 in Evanston, IL). Here’s our poster summarizing the work to date:
Looking ahead to the next couple months, we aim to round out the year by completing a few more AV-related features, most notably:
Export WebVTT captions as PDF or .txt
Advance the player via linked timecodes in the description field in an item’s metadata
Improve workflows for uploading caption files and transcript documents
Now that these features are in place, we’ll be sharing a bunch of great new AV collections soon!
A while back, I wrote a blog post about my enjoyment in digitizing the William Gedney Photograph collection and how it was inspiring me to build a darkroom in my garage. I wish I could say that the darkroom is up and running but so far all I’ve installed is the sink. However, as Molly announced in her last Bitstreams post, we have launched the Gedney collection which includes series two series that are complete (Finished Prints and Contact Sheets) and more to come.
The newly launched site brings together this amazing body of work in a seamless way. The site allows you to browse the collection, use the search box to find something specific or use the facets to filter by series, location, subject, year and format. If that isn’t enough, we have not only related prints from the same contact sheet but also related prints of the same image. For example, you can browse the collection and click on an image of Virgil Thomson, an American composer, smoothly zoom in and out of the image, then scroll to the bottom of the page to find a thumbnail of the contact sheet from which the negative comes. When you click through the thumbnial you can zoom into the contact sheet and see additional shots that Gedney took. You even can see which frames he highlighted for closer inspection. If you scroll to the bottom of this contact sheet page you will find that 2 of those highlighted frames have corresponding finished prints. Wow! I am telling you, checkout the site, it is super cool!
What you do not see [yet], because I am in the middle of digitizing this series, is all of the proof prints Gedney produced of Virgil Thomson, 36 in all. Here are a few below.
Once the proof prints are digitized and ingested into the Repository you will be able to experience Gedney’s photographs from many different angles, vantage points and perspectives.
International Broadsides (added to migrated Broadsides and Ephemera collection): https://repository.duke.edu/dc/broadsides
Orange County Tax List Ledger, 1875: https://repository.duke.edu/dc/orangecountytaxlist
Radio Haiti Archive, second batch of recordings: https://repository.duke.edu/dc/radiohaiti
William Gedney Finished Prints and Contact Sheets (newly re-digitized with new and improved metadata): https://repository.duke.edu/dc/gedney
In addition to the brand new items, the digital collections team is constantly chipping away at the digital collections migration. Here are the latest collections to move from Tripod 2 to the Duke Digital Repository (these are either available now or will be very soon):
What we hoped would be a speedy transition is still a work in progress 2 years later. This is due to a variety of factors one of which is that the work itself is very complex. Before we can move a collection into the digital repository it has to be reviewed, all digital objects fully accounted for, and all metadata remediated and crosswalked into the DDR metadata profile. Sometimes this process requires little effort. However other times, especially with older collection, we have items with no metadata, or metadata with no items, or the numbers in our various systems simply do not match. Tracking down the answers can require some major detective work on the part of my amazing colleagues.
Despite these challenges, we eagerly press on. As each collection moves we get a little closer to having all of our digital collections under preservation control and providing access to all of them from a single platform. Onward!
It’s September, and Duke students aren’t the only folks on campus in back-to-school mode. On the contrary, we here at the Duke Digital Repository are gearing up to begin promoting our research data curation services in real earnest. Over the last eight months, our four new research data staff have been busy getting to know the campus and the libraries, getting to know the repository itself and the tools we’re working with, and establishing a workflow. Now we’re ready to begin actively recruiting research data depositors!
As our colleagues in Data and Visualization Services noted in a presentation just last week, we’re aiming to scale up our data services in a big way by engaging researchers at all stages of the research lifecycle, not just at the very end of a research project. We hope to make this effort a two-front one. Through a series of ongoing workshops and consultations, the Research Data Management Consultants aspire to help researchers develop better data management habits and take the longterm preservation and re-use of their data into account when designing a project or applying for grants. On the back-end of things, the Content Analysts will be able to carry out many of the manual tasks that facilitate that longterm preservation and re-use, and are beginning to think about ways in which to tweak our existing software to better accommodate the needs of capital-D Data.
This past spring, the Data Management Consultants carried out a series of workshops intending to help researchers navigate the often muddy waters of data management and data sharing; topics ranged from available and useful tools to the occasionally thorny process of obtaining consent for–and the re-use of–data from human subjects.
Looking forward to the fall, the RDM consultants are planning another series of workshops to expand on the sessions given in the spring, covering new tools and strategies for managing research output. One of the tools we’re most excited to share is the Open Science Framework (OSF) for Institutions, which Duke joined just this spring. OSF is a powerful project management tool that helps promote transparency in research and allows scholars to associate their work and projects with Duke.
On the back-end of things, much work has been done to shore up our existing workflows, and a number of policies–both internal and external–have been met with approval by the Repository Program Committee. The Content Analysts continue to become more familiar with the available repository tools, while weighing in on ways in which we can make the software work better. The better part of the summer was devoted to collecting and analyzing requirements from research data stakeholders (among others), and we hope to put those needs in the development spotlight later this fall.
All of this is to say: we’re ready for it, so bring us your data!
Born digital archival material present unique challenges to representation, access, and discovery in the DDR. A hard drive arrives at the archives and we want to preserve and provide access to the files. In addition to the content of the files, it’s often important to preserve to some degree the organization of the material on the hard drive in nested directories.
One challenge to representing complex inter-object relationships in the repository is the repository’s relatively simple object model. A collection contains one or more items. An item contains one or more components. And a component has one or more data streams. There’s no accommodation in this model for complex groups and hierarchies of items. We tend to talk about this as a limitation, but it also makes it possible to provide search and discovery of a wide range of kinds and arrangements of materials in a single repository and forces us to make decisions about how to model collections in sustainable and consistent ways. But we still need to preserve and provide access to the original structure of the material.
One approach is to ingest the disk image or a zip archive of the directories and files and store the content as a single file in the repository. This approach is straightforward, but makes it impossible to search for individual files in the repository or to understand much about the content without first downloading and unarchiving it.
As a first pass at solving this problem of how to preserve and represent files in nested directories in the DDR we’ve taken a two-pronged approach. We will use a simple approach to modeling disk image and directory content in the repository. Every file is modeled in the repository as an item with a single component that contains the data stream of the file. This provides convenient discovery and access to each individual file from the collection in the DDR, but does not represent any folder hierarchies. The files are just a flat list of objects contained by a collection.
To preserve and store information about the structure of the files we add an XML METS structMap as metadata on the collection. In addition we store on each item a metadata field that stores the complete original file path of the file.
Below is a small sample of the kind of structural metadata that encodes the nested folder information on the collection. It encodes the structure and nesting, directory names (in the LABEL attribute), the order of files and directories, as well as the identifiers for each of the files/items in the collection.
Combining the 1:1 (item:component) object model with structural metadata that preserves the original directory structure of the files on the file system enables us to display a user interface that reflects the original structure of the content even though the structure of the items in the repository is flat.
There’s more to it of course. We had to develop a new ingest process that could take as its starting point a file path and then crawl it and its subdirectories to ingest files and construct the necessary structural metadata.
Because some of the collections are very large and loading a directory tree structure of 100,000 or more items would be very slow, we implemented a small web service in the application that loads the jsTree data only when someone clicks to open a directory in the interface.
The file paths are also keyword searchable from within the public interface. So if a file is contained in a directory named “kitchen/fruits/bananas/this-banana.txt” you would be able to find the file this-banana.txt by searching for “kitchen” or “fruit” or “banana.”
This new functionality to ingest, preserve, and represent files in nested folder structures in the Duke Digital Repository will be included in the September release of the Duke Digital Repository.
As 2017 reaches its halfway point, we have concluded another busy quarter of development on the Duke Digital Repository (DDR). We have several new features to share, and one we’re particularly delighted to introduce is Rights display.
Back in March, my colleague Maggie Dickson shared our plans for rights management in the DDR, a strategy built upon using rights status URIs from RightsStatements.org, and in a similar fashion, licenses from Creative Commons. In some cases, we supplement the status with free text in a local Rights Note property. Our implementation goals here were two-fold: 1) use standard statuses that are machine-readable; 2) display them in an easily understood manner to users.
What to Display
Getting and assigning machine-readable URIs for Rights is a significant milestone in its own right. Using that value to power a display that makes sense to users is the next logical step. So, how do we make it clear to a user what they can or can’t do with a resource they have discovered? While we could simply display the URI and link to its webpage (e.g., http://rightsstatements.org/vocab/InC-EDU/1.0/ ) the key info still remains a click away. Alternatively, we could display the rights statement or license title with the link, but some of them aren’t exactly intuitive or easy on the eyes. “Attribution-NonCommercial-NoDerivatives 4.0 International,” anyone?
Looking around to see how other cultural heritage institutions have solved this problem led us to very few examples. RightsStatements.org is still fairly new and it takes time for good design patterns to emerge. However, Europeana — co-champion of the RightsStatements.org initiative along with DPLA — has a stellar collections site, and, as it turns out, a wonderfully effective design for displaying rights statuses to users. Our solution ended up very much inspired by theirs; hats off to the Europeana team.
Both Creative Commons and RightsStatements.org provide downloadable icons at their sites (here and here). We opted to store a local copy of the circular SVG versions for both to render in our UI. They’re easily styled, they don’t take up a lot of space, and used together, they have some nice visual unity.
Labels & Titles
We have a lightweight Rails app with an easy-to-use administrative UI for managing auxiliary content for the DDR, so that made a good home for our rights statuses and associated text. Statements are modeled to have a URI and Title, but can also have three additional optional fields: short title, re-use text, and an array of icon classes.
Displaying the Info
We wanted to be sure to show the rights status in the flow of the rest of an object’s metadata. We also wanted to emphasize this information for anyone looking to download a digital object. So we decided to render the rights status prominently in the download menu, too.
Our focus in this area now shifts toward applying these newly available rights statuses to our existing digital objects in the repository, while ensuring that new ingests/deposits get assessed and assigned appropriate values. We’ll also have opportunities to refine where and how the statuses get displayed. We stand to learn a lot from our peer organizations implementing their own rights management strategies, and from our visitors as they use this new feature on our site. There’s a lot of work ahead, but we’re thrilled to have reached this noteworthy milestone.
Notes from the Duke University Libraries Digital Projects Team