In early 2017, Duke University Libraries launched a research data curation program designed to help researchers on campus ensure that their data are adequately prepared both for sharing and publication and for long-term preservation and re-use. Why the focus on research data? Data generated by scholars in the course of their investigation are increasingly being recognized as outputs similar in importance to the scholarly publications they support. Open data sharing reinforces unfettered intellectual inquiry, fosters transparency, reproducibility, and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. For these reasons, a growing number of publishers and funding agencies, like PLoS ONE and the National Science Foundation, are requiring researchers to make openly available the data underlying the results of their research.
But data sharing can only be successful if the data have been properly documented and described. And they are only useful in the long term if steps have been taken to mitigate the risks of file format obsolescence and bit rot. To address these concerns, Duke’s data curation workflow will review a researcher’s data for appropriate documentation (such as README files or codebooks), solicit and refine Dublin Core metadata about the dataset, and make sure files are named and arranged in a way that facilitates secondary use. Additionally, the curation team can make suggestions about preferred file formats for long-term re-use and conduct a brief review for personally identifiable information. Once the data package has been reviewed, the curation team can then help researchers make their data available in Duke’s own Research Data Repository, where the data can be licensed and assigned a Digital Object Identifier, ensuring persistent access.
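To make the Dublin Core piece of that workflow concrete, here is a minimal sketch of what a dataset's descriptive record might look like, expressed as OAI-DC XML. The title, creator, and identifier values are invented placeholders, not an actual Duke record:

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- All values below are hypothetical examples -->
  <dc:title>Example Survey Dataset, 2015-2017</dc:title>
  <dc:creator>Researcher, Jane</dc:creator>
  <dc:description>Deidentified survey responses; see README.txt for codebook.</dc:description>
  <dc:type>Dataset</dc:type>
  <dc:format>text/csv</dc:format>
  <dc:rights>CC0 1.0</dc:rights>
  <dc:identifier>https://doi.org/10.0000/example</dc:identifier>
</oai_dc:dc>
```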
“The Data Curation Network (DCN) serves as the ‘human layer’ in the data repository stack and seamlessly connects local data sets to expert data curators via a cross-institutional shared staffing model.”
New to Duke’s curation workflow is the ability to rely on the domain expertise of our colleagues at a few other research institutions. While our data curators here at Duke possess a wealth of knowledge about general research data-related best practices, and are especially well-versed in the vagaries of social sciences data, they may not always have all the information they need to sufficiently assess the state of a dataset from a researcher. As an answer to this problem, the Data Curation Network, an Alfred P. Sloan Foundation-funded endeavor, has established a cross-institutional staffing model that distributes the domain expertise of each of its partner institutions. Should a curator at one institution encounter data of a kind with which they are unfamiliar, submission to the DCN opens up the possibility for enhanced curation from a network partner with the requisite knowledge.
Duke joins Cornell University, Dryad Digital Repository, Johns Hopkins University, University of Illinois, University of Michigan, University of Minnesota, and Pennsylvania State University in partnering to provide curatorial expertise to the DCN. As of January of this year, the project has moved out of its pilot phase into production and is actively moving data through the network. If a Duke researcher submits a dataset our curation team thinks would benefit from further examination by a curator with domain knowledge, we will now reach out to the potential depositor for clearance to submit the data to the network. We’re very excited to provide this enhancement to our service!
It’s that time of year where we like to reflect on all we have done in 2018, and your favorite digital collections and curation services team is no exception. This year, Digital Collections and Curation Services has been really focused on getting collections and data into the Digital Repository and making them accessible to the world!
As you will see from the list below, we launched 320 new digital collections, managed substantial additions to 2, and migrated 8. However, these publicly accessible digital collections are just the tip of the iceberg in terms of our work in Digital Collections and Curation Services.
So much more digitization happens behind the scenes than is reflected in the list of new digital collections. Many of our larger projects are years in the making. For example, we continue to digitize Gedney photographs and we hope to make them publicly accessible next year. There is also preservation digitization happening that we cannot make publicly accessible online. This work is essential to our preservation mission, though we cannot share the collections widely in the short term.
We strongly believe in keeping our metadata healthy, so in addition to managing new metadata, we often revisit existing metadata across our repositories in order to ensure its overall quality and functionality.
In 2016, after we launched the first iteration of the Duke Chapel Recordings Digital Collection in the Duke Digital Repository (DDR), we began a collaborative project between Digital Collections and Curation Services, University Archives, and the Duke Divinity School to enhance the metadata. The original metadata was fairly basic and allowed users to identify individual written, audio, and video sermons based on speaker, date, title, and format. All good stuff, but it didn’t allow for discovery based on the intellectual content of the sermons themselves. So, it was decided that, at the same time Divinity School staff listened to and corrected machine-generated transcripts for each sermon, they would also capture information that is useful from a homiletic perspective.
At the very beginning of the project, the Divinity School convened two focus groups of preachers from a variety of denominations and backgrounds to ask them how they would like to be able to discover and use a digital collection of sermons. These groups developed a set of terms/categories based on which they would like to be able to identify sermons. From there I worked with the project team to begin thinking about what kinds of fields they would want to capture, and determine whether or how those fields could map to the existing metadata application profile that we use in the DDR.
It quickly became clear that this project was going to require the creation of new metadata fields in the DDR application. I try to be really judicious about creating new fields (because otherwise, you end up doing this), but in this case, I felt that the need was justified: homiletic metadata is fairly specialized, and given Duke’s commitment to this collecting area, making adjustments to accommodate it seemed more than reasonable. Since I always like to work with best practices, I attempted to identify any extant metadata schemas for working with biblical metadata. I felt pretty confident that I would find one, considering that the Bible is actually one of the oldest books out there. While I did find some resources, they were pretty old (think last-updated-in-2006), and all of them were oriented towards marking up actual Biblical texts, rather than encoding metadata about those texts.
With no established standards to work with, we set about determining what the fields should be, using the practice of homiletics itself as a guide. We also developed a workflow for capturing this metadata, using a Google spreadsheet with conditional formatting and pre-developed drop-down lists to control and facilitate data entry. And starting from the set of terms/categories developed during the focus groups, we came up with a normalized set of Library of Congress Subject Headings (LCSH) for staff to choose from or add to, as needs arose.
Working with LCSH was in itself a challenge, as it required us to navigate the tension between the need to use a standardized set of headings and the desire to include concepts that weren’t themselves well represented in the vocabulary. In some cases we diverged from LCSH in the interest of using terms that would be familiar, expected, and recognizable to practitioners of homiletics. One example of this is the term ‘Community’, which has a particular meaning in a Biblical context – a meaning that would have been lost had we used the LCSH term ‘Communities’.
We rolled out the new metadata properties and values in early August so they could be available for use by attendees at the international homiletics conference, Societas Homiletica, which was held at Duke University August 3-8, 2018. Now, users of the digital collection can facet and browse by: Liturgical Calendar, Biblical Book, Chapter and Verse, and Subject. We’ve also added curated abstracts, and key quotations from the sermons, which are free-text searchable.
The enhanced metadata makes for a much more meaningful experience using the Duke Chapel Recordings, and future plans involve the inclusion of sermon transcripts, as well as the development of a complementary website, maintained by the Duke Divinity School, to provide even more information about the speakers and their sermons. With these enrichments, we are well on our way to having an unparalleled free and open resource for the study of homiletics, and hopefully, in so doing, we will facilitate the discovery and study of preachers whose voices have traditionally been underheard.
While we sometimes talk about “the repository” as if it were a monolith at Duke University Libraries, we have in fact developed and maintained two core platforms that function as repository applications. I’ll describe them briefly, then preview a third that is in development, as well as the rationale behind expanding in this way.
That same team is now back for a sequel, collaborating to tackle additional issues around system integrations, statistics/reporting, citations, and platform maintenance. Phase II of the project will wrap up this summer.
I’d like to share a bit more about the DSpace upgrade project, beginning with some background on why it’s important and where the platform fits into the larger picture at Duke. Then I’ll share more about the areas to which we have devoted the most developer time and attention over the past several months. Some of the development efforts were required to make DSpace 6 viable at all for Duke’s ongoing needs. Other efforts have been to strengthen connections between DukeSpace and other platforms. We have also been enhancing several parts of the user interface to optimize its usability and visual appeal.
DSpace at Duke: What’s in It?
Duke began using DSpace around 2006 as a solution for Duke University Archives to collect and preserve electronic theses and dissertations (ETDs). In 2010, the university adopted an Open Access policy for articles authored by Duke faculty, and DukeSpace became the host platform to make these articles accessible under the policy. These two groups of materials represent the vast majority of the 15,000+ items currently in the platform. Ensuring long-term preservation, discovery, and access to these items is central to the library’s mission.
Integrations With Other Systems
DukeSpace is one of three key technology platforms working in concert to support scholarly communications at Duke. The other two are the proprietary Research Information Management System Symplectic Elements, and the open-source research networking tool VIVO (branded as Scholars@Duke). Here’s a diagram illustrating how the platforms work together, created by my colleague Paolo Mangiafico:
In a nutshell, DSpace plays a critical role in Duke University scholars’ ability to have their research easily discovered, accessed, and used.
Faculty use Elements to manage information about their scholarly publications. That information is pulled neatly into Scholars@Duke which presents for each scholar an authoritative profile that also includes contact info, courses taught, news stories in which they’re mentioned, and more.
The Scholars@Duke profile has an SEO-friendly URL, and the data from it is portable: it can be dynamically displayed anywhere else on the web (e.g., departmental websites).
Elements is also the place where faculty submit the open access copies of their articles; Elements in turn deposits those files and their metadata to DSpace. Faculty don’t encounter DSpace at all in the process of submitting their work.
Publications listed in a Scholars@Duke profile automatically include a link to the published version (which is often behind a paywall), and a link to the open access copy in DSpace (which is globally accessible).
Upgrading DSpace: Ripple Effects
The following diagram expands upon the previous one. It adds boxes to the right to account for ETDs and other materials deposited to DSpace either by batch import mechanisms or directly via the application’s web input forms. In a vacuum, a DSpace upgrade–complex as that is in its own right–would be just the green box. But as part of an array of systems working together, the upgrade meant ripping out and replacing so much more. Each white star on the diagram represents a component that had to be thoroughly investigated and completely re-done for this upgrade to succeed.
One of the most complicated factors in the upgrade effort was the bidirectional arrow marked “RT2”: Symplectic’s new Repository Tools 2 connector. Like its predecessor RT1, it facilitates the deposit of files and metadata from Elements into DSpace (but now via different mechanisms). Unlike RT1, RT2 also permits harvesting files and metadata from DSpace back into Elements, even for items that weren’t originally deposited via Elements. The biggest challenges there:
Divergent metadata architecture. DukeSpace and Elements employ over 60 metadata fields apiece (and they are not the same).
Crosswalks. The syntax for munging/mapping data elements from Elements to DSpace (and vice versa) is esoteric, new, and a moving target.
Legacy/inconsistent data. DukeSpace metadata had not previously been analyzed or curated in the 12 years it had been collected.
Newness. Duke is likely the first institution to integrate DSpace 6.x & Elements via RT2, so a lot had to be figured out through trial & error.
Kudos to superhero metadata architect Maggie Dickson for tackling all of these challenges head-on.
User Interface Enhancements in Action
There are over 2,000 DSpace instances in the world. Most implementors haven’t done much to customize the out-of-the-box templates, which look something like this for an item page:
The UI framework itself is outdated (driven via XSLT 1.0 through Cocoon XML pipelines), which makes it hard for anyone to revise substantially. It’s a bit like trying to whittle a block of wood into something ornate using a really blunt instrument. The DSpace community is indeed working on addressing that for DSpace 7.0, but we didn’t have the luxury to wait. So we started with the vanilla template and chipped away at it, one piece at a time. These screenshots highlight the main areas we have been able to address so far.
We configured DSpace to generate and display thumbnail images for all items. Then we added icons corresponding to MIME types to help distinguish different kinds of files. We added really prominent indicators for when an item was embargoed (and when it would become available), and also revised the filesize display to be clearer and more concise.
Usage & Attention Stats
Out of the box, DSpace item statistics are only available by clicking a link on the item page to go to a separate stats page. We figured out how to tap into the Solr statistics core and transform that data to display item views and file downloads directly in the item sidebar for easier access. We were also successful showing an Altmetric donut badge for any article with a DOI. These features together help provide a clear indication on the item page how much of an impact a work has made.
We added a lookup from the item page to retrieve the parent collection’s rights statement, which may contain a statement about Open Access, a Creative Commons license, or other explanatory text. This will hopefully assert rights information in a more natural spot for a user to see it, while at the same time drawing more attention to Duke’s Open Access policy.
Scholars@Duke Profiles & ORCID Links
For any DukeSpace item author with a Scholars@Duke profile, we now display a clickable icon next to their name. This leads to their Scholars@Duke profile, where a visitor can learn much more about the scholar’s background, affiliations, and other research. Making this connection relies on some complicated parts: 1) getting Duke IDs automatically from Elements or manually via direct entry; 2) storing the ID in a DSpace field; and 3) using the ID to query a VIVO API to retrieve the Scholars@Duke profile URL. We are able to treat a scholar’s ORCID in a similar fashion.
Other Development Areas
Beyond the public-facing UI, these areas in DSpace 6.2 also needed significant development for the upgrade project to succeed:
Fixed several bugs related to batch metadata import/export
Developed a mechanism to create user accounts via batch operations
Modified features related to authority control for metadata values
By summer 2018, we aim to have the following in place:
Add collapsible / expandable facet and browse options to reduce the number of menu links visible at any given time.
Present a copyable citation on the item page.
Upgrade the XSLT processor from Xalan to Saxon, using XSLT 3.0; this will enable us to accomplish more with less code going forward
Revise the Scholars@Duke profile lookup by using a different VIVO API
Create additional browse/facet options
Display aggregated stats in more places
We’re excited to get all of these changes in place soon. And we look forward to learning more from our users, our collaborators, and our peers in the DSpace community about what we can do next to improve upon the solid foundation we established during the project’s initial phases.
Last week, an indefatigable team at Duke University Libraries released an upgraded version of the DukeSpace platform, completing the first phase of the critical project that I wrote about in this space in January. One member of the team remarked that we now surely have “one of the best DSpaces in the world,” and I dare anyone to prove otherwise.
DukeSpace serves as the Libraries’ open-access institutional repository, which makes it a key aspect of our mission to “partner in research,” as outlined in our strategic plan. As I wrote in January, the version of the DSpace platform that underlies the service had been stuck at 1.7, which was released during 2010 – the year the iPad came out, and Lady Gaga wore a meat dress. We upgraded to version 6.2, though the differences between the two versions are so great that it would be more accurate to call the project a migration.
That migration turned out to be one of the more complex technology projects we’ve undertaken over the years. The main complicating factor was the integration with Symplectic Elements, the Research Information Management System (RIMS) that powers the Scholars at Duke site. As far as we know, we are the first institution to integrate Elements with DSpace 6.2. It was a beast to do, and we are happy to share our knowledge gained if it will help any of our peers out there trying to do the same thing.
Meanwhile, feel free to click on over and enjoy one of the best DSpaces in the world. And congratulations to one of the mightiest teams assembled since Spain won the World Cup!
Apache Solr is behind many of our systems that provide a way to search and browse via a web application (such as the Duke Digital Repository, parts of our Bento search application, and the not yet public next generation TRLN Discovery catalog). It’s a tool for indexing data and provides a powerful query API. In this post I will document a few Solr querying techniques that might be interesting or useful. In some cases I won’t be able to provide live links to queries because we restrict direct access to Solr. However, many of these Solr querying techniques can be used directly in an application’s search box. In those cases, I will include live links to example queries in the Duke Digital Repository.
Find a list of items from their identifiers.
With this query you can specify exactly what items you want to appear in a search result from a list of identifiers.
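For example, using a few identifiers that appear later in this post, a query of this shape would return exactly those items. The field name id is an assumption about how the index stores identifiers; the values are quoted so the colons aren't parsed as field separators:

```
q=id:("duke:194006" OR "duke:194002" OR "duke:194009")
```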
Find items using a begins-with (left-anchored) query.
I want to see all items that have a subject term that begins with “Soviet Union.” The example is a left-anchored query and will exactly match fields that begin with “Soviet Union.” (Note, the field must not be tokenized for this to work as expected.)
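In Solr's standard (lucene) query syntax, that looks something like the following. The field name subject_facet_sim is a stand-in for whatever untokenized subject field the index actually defines; the space in the phrase is escaped so the whole prefix is matched and the trailing wildcard anchors the match at the left:

```
q=subject_facet_sim:Soviet\ Union*
```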
The following examples require direct access to Solr, which is restricted to authorized users and applications. Instead of providing live links, I’ll show the basic syntax, a complete example query using http://localhost:8983/solr/core/select as the sample URL for a Solr index, and a sample response from Solr.
Count instances of values in a field.
I want to know how many items in the repository have a workflow state of published and how many are unpublished. To do that I can write a facet query that will count instances of each value in the specified field. (This is another query that will only work as expected with an untokenized field.)
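A query along these lines does the counting. The field name workflow_state_ssi is a guess at the schema, and the counts in the response excerpt are invented for illustration; rows=0 suppresses the documents themselves so only the facet counts come back:

```
http://localhost:8983/solr/core/select?q=*:*&rows=0&wt=json&facet=true&facet.field=workflow_state_ssi
```

A trimmed JSON response might look like:

```
"facet_counts": {
  "facet_fields": {
    "workflow_state_ssi": ["published", 48512, "unpublished", 3271]
  }
}
```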
Collapse multiple records into one result based on a shared field value.
This one is somewhat advanced and likely only useful in particular circumstance. But if you had multiple records that were slight variants of each other, and wanted to collapse each variant down to a single result you can do that with a collapse query — as long as the records you want to collapse share a value.
!collapse instructs Solr to use the Collapsing Query Parser.
field=oclc_number instructs Solr to collapse records that share the same value in the oclc_number field.
nullPolicy=expand instructs Solr to return any document without a matching OCLC as part of the result set. If this is excluded then records that don’t share an oclc_number with another record will be excluded from the results.
max=termfreq(institution,duke) instructs Solr to select as the representative record when collapsing multiple records the one that has the value “duke” in institution field.
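Putting those parameters together, the whole thing is passed as a filter query using Solr's local-params syntax:

```
http://localhost:8983/solr/core/select?q=*:*&fq={!collapse field=oclc_number nullPolicy=expand max=termfreq(institution,duke)}
```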
CSV response writer (or JSON, Ruby, etc.)
Solr has a number of tricks up its sleeve when it comes to returning results. By default it will return results as XML. You can also specify JSON, or Ruby. You specify a response writer by adding the wt parameter to the URL (wt=json or wt=ruby, etc.).
Solr will also return results as a CSV file, which can then be opened in an Excel spreadsheet — a useful feature for working with metadata.
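A request of roughly this shape produces output like the sample below. The fl parameter, which limits the response to an identifier and title column, is an assumption about the field names in this index:

```
http://localhost:8983/solr/core/select?q=sun+city&wt=csv&fl=id,title
```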
duke:194006,Sun Bowl...Sun City...
duke:194002,Sun Bowl...Sun City...
duke:194009,Sun Bowl...Sun City.
duke:194019,Sun Bowl...Sun City.
duke:194037,"Sun City\, Sun Bowl"
duke:194036,"Sun City\, Sun Bowl"
duke:355105,Proved! Fast starts at 30° below zero!
This is just a small sample of useful ways you can query Solr.
Snow is a major event here in North Carolina, and the University and Library were operating accordingly under a “severe weather policy” last week due to 6-12 inches of frozen precipitation. While essential services continued undeterred, most of the Library’s staff and patrons were asked to stay home until conditions had improved enough to safely commute to and navigate the campus. In celebration of last week’s storm, here are some handy tips for surviving and enjoying the winter weather–illustrated entirely with images from Duke Digital Collections!
1. Stock up on your favorite vices and indulgences before the storm hits.
2. Be sure to bundle and layer up your clothing to stay warm in the frigid outdoor temperatures.
3. Plan some fun outdoor activities to keep malaise and torpor from settling in.
4. Never underestimate the importance of a good winter hat.
5. While snowed in, don’t let your personal hygiene slip too far.
6. Despite the inconveniences brought on by the weather, don’t forget to see the beauty and uniquity around you.
While the landscape of open access has changed much over the intervening period, we can’t really say the same about the underlying platform of DukeSpace.
At Duke, faculty approved an open access policy in March of 2010; DSpace 1.6 had been released just a few weeks earlier. By the end of the year the platform had moved ahead a dot release to 1.7. Along the way, we did some customization to integrate with Symplectic Elements – the Research Information Management System (RIMS) that powers the Scholars@Duke site. That work essentially locked us into that version of DSpace, which remains in operation even though its final release came in July 2013 and it reached its end of life four years ago.
Beginning last November, we committed to a full upgrade of the DukeSpace platform to the current version (6.2 as of this writing). We had considered alternatives, including replacing the platform with Hyrax, but concluded that that approach would be too complex.
So we are currently coordinating work across a technology team and the Libraries’ open access group. Some of the concerns that we have encountered include:
Integrating with updated versions of Symplectic Elements. That same integration that locked us into a version years ago lies at the center of this upgrade. We have basically been handling this process as a separate thread of the larger project. It will be critical for us to maintain the currency of this dependency with subsequent upgrades to both products.
Rethinking metadata architecture. The conceptual basis of the institutional repository is greatly informed by the definition and use of metadata. Our Metadata Architect, Maggie Dickson, mentioned this area in her “Metadata Year-in-Review” post back in December. She highlighted the need to make “real headway tackling the problem of identity management – leveraging unique identifiers for people (ORCIDs, for example), rather than relying on name strings, which is inherently error prone.” Many other questions have arisen in this area, requiring extensive and ongoing discussion and coordination between the tech team and the stakeholders.
Migration of legacy stats data. How do we migrate usage stats between two versions of a platform so remote from each other in time? It has taken some trial-and-error to solve this one.
Replicating or enhancing existing workflows. Again, when two versions of a system are so different that an upgrade seems more like a platform migration, and our infrastructure and staffing have changed over the years, how do we reproduce existing workflows without disrupting them? What opportunities can we take to improve on them without destabilizing the project? Aside from the integration with Elements, we also have the important workflow related to the ingest of electronic theses & dissertations, which employs both self-deposit and file transfer from ProQuest. Re-envisioning and re-implementing workflows such as these takes careful analysis and planning.
While we have run into a few complicating issues during the process so far, we feel confident that we remain on track to roll out the upgraded version during the first quarter of 2018. Pluto remains a dwarf planet, Zidane manages Real Madrid (for now), and to Mark Cuban’s apparent distress, Google still owns YouTube. Soon our own story from 2006 should reach a kind of resolution.
It’s a new year! And a new year means new priorities. One of the many projects DUL staff have on deck for the Duke Digital Repository in the coming calendar year is an upgrade to DSpace, the software application we use to manage and maintain our collections of scholarly publications and electronic theses and dissertations. As part of that upgrade, the existing DSpace content will need to be migrated to the new software. Until very recently, that existing content has included a few research datasets deposited by Duke community members. But with the advent of our new research data curation program, research datasets have been published in the Fedora 3 part of the repository. Naturally, we wanted all of our research data content to be found in one place, so that meant migrating the few existing outliers. And given the ongoing upgrade project, we wanted to be sure to have it done and out of the way before the rest of the DSpace content needed to be moved.
Most of the datasets that required moving were relatively small–a handful of files, all of manageable size (under a gigabyte), that could be exported using DSpace’s web interface. However, a limited series of data associated with a project called The Integrated Precipitation and Hydrology Experiment (IPHEx) was a notable exception. There’s a lot of data associated with the IPHEx project–recorded daily for 7 years, iterated over 3 different areas of coverage, and accompanied by some supplementary data files, the total footprint came to just under a terabyte spread over more than 7,000 files–so this project needed some advance planning.
First, the size of the project meant that the data were too large to export through the DSpace web client, so we needed the developers to wrangle a behind-the-scenes dump of what was in DSpace to a local file system. Once we had everything we needed to work with (which included some previously unpublished updates to the data we received last year from the researchers), we had to make some decisions on how to model it. The data model used in DSpace was a bit limiting, which resulted in the data being made available as a long list of files for each part of the project. In moving the data to our Fedora repository, we gained a little more flexibility with how we could arrange the files. We determined that we wanted to deviate slightly from the arrangement in DSpace, grouping the files by month and year.
This meant we would have to group all the files into subdirectories containing the data for each month–for over 7,000 files, that would have been extremely tedious to do by hand, so we wrote a script to do the sorting for us. That completed, we were able to carry out the ingest process as normal. The final wrinkle associated with the IPHEx project was making sure that the persistent identifiers each part of the project data had been assigned in DSpace still resolved to the correct content. One of our developers was able to set up a server redirect to ensure that each URL would still take a user to the right place. As of the new year, the IPHEx project data (along with our other migrated DSpace datasets) are available in their new home!
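Our actual script isn't shown here, but a minimal sketch of the idea in Python conveys it. It assumes (hypothetically) that each data file carries a YYYYMMDD date somewhere in its name, and moves each file into a YYYY-MM subdirectory:

```python
import re
import shutil
from pathlib import Path

# Assumed naming convention: a YYYYMMDD date embedded in each filename
DATE_RE = re.compile(r"(\d{4})(\d{2})\d{2}")

def sort_into_months(src: Path, dest: Path) -> None:
    """Move each data file under src into a dest/YYYY-MM/ subdirectory
    based on the date embedded in its filename."""
    for path in src.iterdir():
        if not path.is_file():
            continue
        match = DATE_RE.search(path.name)
        if match is None:
            continue  # e.g. supplementary files, handled separately
        year, month = match.groups()
        target = dest / f"{year}-{month}"
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
```

Running this once over each of the three coverage areas would produce the month-and-year grouping described above without any hand-sorting.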
At least (of course) until the next migration.
Notes from the Duke University Libraries Digital Projects Team