Tag Archives: Duke Digital Repository

September scale-up: promoting the DDR and associated services to faculty and students

It’s September, and Duke students aren’t the only folks on campus in back-to-school mode. On the contrary, we here at the Duke Digital Repository are gearing up to begin promoting our research data curation services in real earnest. Over the last eight months, our four new research data staff have been busy getting to know the campus and the libraries, getting to know the repository itself and the tools we’re working with, and establishing a workflow. Now we’re ready to begin actively recruiting research data depositors!

As our colleagues in Data and Visualization Services noted in a presentation just last week, we’re aiming to scale up our data services in a big way by engaging researchers at all stages of the research lifecycle, not just at the very end of a research project. We hope to make this effort a two-front one. Through a series of ongoing workshops and consultations, the Research Data Management Consultants aspire to help researchers develop better data management habits and take the long-term preservation and re-use of their data into account when designing a project or applying for grants. On the back end, the Content Analysts will be able to carry out many of the manual tasks that facilitate that long-term preservation and re-use, and are beginning to think about ways to tweak our existing software to better accommodate the needs of capital-D Data.

This past spring, the Data Management Consultants carried out a series of workshops intended to help researchers navigate the often muddy waters of data management and data sharing; topics ranged from available and useful tools to the occasionally thorny process of obtaining consent for–and the re-use of–data from human subjects.

Looking forward to the fall, the RDM consultants are planning another series of workshops to expand on the sessions given in the spring, covering new tools and strategies for managing research output. One of the tools we’re most excited to share is the Open Science Framework (OSF) for Institutions, which Duke joined just this spring. OSF is a powerful project management tool that helps promote transparency in research and allows scholars to associate their work and projects with Duke.

On the back-end of things, much work has been done to shore up our existing workflows, and a number of policies–both internal and external–have been met with approval by the Repository Program Committee. The Content Analysts continue to become more familiar with the available repository tools, while weighing in on ways in which we can make the software work better. The better part of the summer was devoted to collecting and analyzing requirements from research data stakeholders (among others), and we hope to put those needs in the development spotlight later this fall.

All of this is to say: we’re ready for it, so bring us your data!

The ABCs of Digitizing Section A

I’m not sure anyone who currently works in the library has any idea when the phrase “Section A” was first coined as a call number for small manuscript collections. Before the library’s renovation, before we barcoded all our books and boxes — back when the Rubenstein was still RBMSCL, and our reading room carpet was a very bright blue — there was a range of boxes holding single-folder manuscript collections, arranged alphabetically by collection creator. And this range was called Section A.

Box 175 of Section A

Presumably there used to be a Section B, Section C, and so on — and it could be that the old shelf ranges were tracked this way, I’m not sure — but the only one that has persisted through all our subsequent stacks moves and barcoding projects has been Section A. Today there are about 3900 small collections held in 175 boxes that make up the Section A call number. We continue to add new single-folder collections to this call number, although thanks to the miracle of barcodes in the catalog, we no longer have to shift files to keep things in perfect alphabetical order. The collections themselves have no relationship to one another except that they are all small. Each collection has a distinct provenance, and the range of topics and time periods is enormous — we have everything from the 17th to the 21st century filed in Section A boxes. Small manuscript collections can also contain a variety of formats: correspondence, writings, receipts, diaries or other volumes, accounts, some photographs, drawings, printed ephemera, and so on. The bang-for-your-buck ratio is pretty high in Section A: though small, the collections tend to be well-described, meaning that there are regular reproduction and reference requests. Section A is used so often that in 2016, Rubenstein Research Services staff approached Digital Collections to propose a mass digitization project, re-purposing the existing catalog description into digital collections within our repository. This will allow remote researchers to browse all the collections easily, and also reduce repetitive reproduction requests.

This project has been met with enthusiasm and trepidation from staff since last summer, when we began to develop a cross-departmental plan to appraise, enhance description, and digitize the 3900 small manuscript collections that are housed in Section A. It took us a bit of time, partially due to the migration and other pressing IT priorities, but this month we are celebrating a major milestone: we have finally launched our first 2 Section A collections, meant to serve as a proof of concept, as well as a chance for us to firmly define the project’s goals and scope. Check them out: Abolitionist Speech, approximately 1850, and the A. Brouseau and Co. Records, 1864-1866. (Appropriately, we started by digitizing the collections that began with the letter A.)

A. Brouseau & Co. Records carpet receipts, 1865

Why has it been so complicated? First, the sheer number of collections is daunting; while there are plenty of digital collections with huge item counts already in the repository, they tend to come from a single or a few archival collections. Each newly-digitized Section A collection will be a new collection in the repository, which has significant workflow repercussions for the Digital Collections team. There is no unifying thread for Section A collections, so we are not able to apply metadata in batch like we would normally do for outdoor advertising or women’s diaries. Rubenstein Research Services and Library Conservation Department staff have been going box by box through the collections (there are about 25 collections per box) to identify out-of-scope collections (typically reference material, not primary sources), preservation concerns, and copyright concerns. These are excluded from the digitization process. Technical Services staff are also reviewing and editing the Section A collections’ description. This project has led to our enhancing some of our oldest catalog records — updating titles, adding subject or name access, and upgrading the records to RDA, a relatively new standard. Using scripts and batch processes (details on GitHub), the refreshed MARC records are converted to EAD files for each collection, and the digitized folder is linked through ArchivesSpace, our collection management system. We crosswalk the catalog’s name and subject access data to both the finding aid and the repository’s metadata fields, allowing the collection to be discoverable through the Rubenstein finding aid portal, the Duke Libraries catalog, and the Duke Digital Repository.
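The crosswalk step described above can be sketched roughly as follows. This is only an illustration, not the project's actual scripts (those are on GitHub, as noted): the record is a hypothetical, simplified dictionary of MARC tags to values, the repository field names (`creator`, `subject`, `spatial`) are assumptions, and the sample values are placeholders. The MARC tags and EAD element names are standard ones, but the particular mapping chosen here is an assumption.

```python
# Assumed mapping: MARC tag -> (EAD access element, repository metadata field).
# The repository field names are hypothetical, chosen for illustration only.
CROSSWALK = {
    "100": ("persname", "creator"),   # personal name, main entry
    "110": ("corpname", "creator"),   # corporate name, main entry
    "600": ("persname", "subject"),   # personal name as subject
    "610": ("corpname", "subject"),   # corporate name as subject
    "650": ("subject", "subject"),    # topical subject
    "651": ("geogname", "spatial"),   # geographic subject
}


def crosswalk(record):
    """Copy name and subject access data to both targets.

    `record` is a simplified stand-in for a MARC record: a dict that maps
    a tag to a list of string values. Tags without a mapping are ignored.
    """
    finding_aid = {}
    repository = {}
    for tag, values in record.items():
        if tag not in CROSSWALK:
            continue
        ead_element, repo_field = CROSSWALK[tag]
        finding_aid.setdefault(ead_element, []).extend(values)
        repository.setdefault(repo_field, []).extend(values)
    return finding_aid, repository


# Placeholder record; the values are invented for the example.
sample = {
    "245": ["Example collection title"],
    "110": ["Example Company"],
    "650": ["Example topic A", "Example topic B"],
}
finding_aid_access, repository_metadata = crosswalk(sample)
```

Keeping the mapping in a single table means the same source data feeds both the finding aid and the repository, which is what lets a collection surface consistently in each discovery system.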

It has been really exciting to see the first two collections go live, and there are many more already digitized and just waiting in the wings for us to automate some of our linking and publishing processes. Another future development that we expect will speed up the project is a batch ingest feature for collections entering the repository. With over 3000 collections to ingest, we are eager to streamline our processes and make things as efficient as possible. Stay tuned here for more updates on the Section A project, and keep an eye on Digital Collections if you’d like to explore some of these newly-digitized collections.

The Research Data Team: Hitting the Ground Running

There has been a lot of blogging over the last year about the Duke Digital Repository’s development and implementation, about its growth as a platform and a program, and about the creation of new positions to support research data management and curation. My fellow digital content analyst also recently posted about how we four new hires have been creating and refining our research data curation workflow since beginning our positions at Duke this past January. It’s obviously been (and continues to be) a very busy time here for the repository team at Duke Libraries, including both seasoned and new staff alike.

Besides the research data workflows between our two departments, what other things have the data management consultants and the digital content analysts been doing? In short, we’ve been busy!

In addition to envisioning stakeholder needs (which is an exercise we continuously do), we’ve received and ingested several data collections this year, which has given us an opportunity to also learn from experience. We have been tracking and documenting the types of data we’re receiving, the various needs that these types of data and depositors have, how we approach these needs (including investigating and implementing any additional tools that may help us better address these), how our repository displays the data and associated metadata, and the time spent on our management and curation tasks. Some of these are in the form of spreadsheets, others as draft policies that will first be reviewed by the library’s research data working group and then by a program committee, and others simply as brain dumps for things that require a further, more structured investigation by developers, the metadata architect, subject librarians, and other stakeholders. These documents live in either our shared online folder or our shared Box account, and, if a wider Duke library and/or public audience is required, are moved to our departments’ content collaboration software platforms (currently Confluence/Jira and Basecamp). The collaborative environments of these platforms support the dynamic nature of our work, particularly as our program takes form.

We also value the importance of face-to-face discussions, so we hold weekly meetings to talk through all of this work (we prefer outside when the weather is nice, and because squirrels are awesome).

One of the most exciting, and at times challenging, aspects of where we are is that we are essentially starting from the ground up and therefore able to develop procedures and features (and re-develop, and on and on again) until we find fits that best accommodate our users and their data. We rely heavily on each other’s knowledge about the research data field, and we also engage in periodic environmental scans of other institutions that offer data management and curation services.

When we began in January, we all considered the first 6-9 months as a “pilot phase”, though this description may not be accurate. In the minds of the data management consultants and the digital content analysts, we’re here and ready. Will we run into situations that require an adjustment to our procedures? Absolutely. It’s the nature of our work. Do we want feedback from the Duke community about how our services are (or are not) meeting their needs? Without a doubt. And will the DDR team continue to identify and implement features to better meet end-user needs? Certainly. We fully expect to adjust and readjust our tools and services, with the overall goal of fulfilling future needs before they’re even evident to our users. So, as always, keep watching to see how we grow!

Rethinking Repositories at CNI Spring ’17

One of the main areas of emphasis for the CNI Spring 2017 meeting was “new strategies and approaches for institutional repositories (IR).” A few of us at UNC and Duke decided to plug into the zeitgeist by proposing a panel to reflect on some of the ways that we have been rethinking – or even just thinking about – our repositories.

Revisiting: What is the Repository?

Here at the Duke University Libraries we recently hosted a series of workshops that were part of a larger Research Symposium on campus.  It was an opportunity for various campus agencies to talk about all of the evolving and innovative ways that they are planning for and accommodating research data.  A few of my colleagues and I were asked to present on the new Research Data program that we’re rolling out in collaboration with the Duke Digital Repository, and we were happy to oblige!

I was asked to speak directly about the various software development initiatives that we have underway with the Duke Digital Repository.  Since we’re in the midst of rolling out a brand new program area, we’ve got a lot of things cooking!

When I started planning for the conversation I initially thought I would talk a lot about our Fedora/Hydra stack, and the various inter-related systems that we’re planning to integrate into our repository eco-system.  But what resulted from that was a lot of technical terms and open-source software project names that didn’t mean a whole lot to anyone, especially those not embedded in the work.  As a result, I took a step back and decided to focus at a higher level.  I wanted to present to our faculty that we were implementing a series of software solutions that would meet their needs for accommodation of their data.  This had me revisiting the age-old question: What is our Repository?  And for the purposes of this conversation, it boiled down to this:

And this:

It is a highly complex, often mind-boggling set of software components, that are wrangled and tamed by a highly talented team with a diversity of skills and experience, all for the purposes of supporting Preservation, Curation, and Access of digital materials.

Those are our tenets or objectives.  They are the principles that guide our work.  Let’s dig in a bit on each.

Our first objective is Preservation.  We want our researchers to feel 100% confident that when they give us their data, we are preserving its integrity, longevity, and persistence.

Our second objective is to support Curation.  We aim to do that by providing software solutions that facilitate management and description of file sets, and logical arrangement of complex data sets.  This piece is critically important because the data cannot be optimized without solid description and modeling that conveys its purpose and intended use and facilitates discovery of the materials.

Finally, our work, our software, aims to facilitate discovery & access.  We do this by architecting thoughtful solutions that optimize metadata and modeling; we build out features that enhance the consumption and usability of different format types; and we tweak, refine, and optimize our code to enhance performance and user experience.

The repository is a complex beast.  It’s a software stack, and an eco-system of components.  It’s Fedora.  It’s Hydra.  It’s a whole lot of other project names that are equally attractive and mystifying.  At its core, though, it’s a software initiative, one that seeks to serve up an eco-system of components with optimal functionality that meets the needs and desires of our programmatic stakeholders: our University.

Preservation, Curation, & Access are the heart of it.

A New Home Page for the Duke Digital Repository

Today is an eventful day for the Duke Digital Repository (DDR). Later today, I and several of my colleagues will present on the DDR at Day 1 of the Duke Research Computing Symposium. We’ll be introducing new staff who’ll focus on managing, curating, and preserving research data, as well as the role that the DDR will play as both a service and a platform. This event serves as a soft launch of our plans – which I wrote about last September – to support the work of researchers at Duke.

Out-of-the-box DDR home page of the past

At the same time, the DDR gets a new look, at least on its home page. For years, we’ve used a rather drab and uninformative page that was essentially the out-of-the-box rendering by Blacklight, our discovery and access layer in the repository stack. Last fall, our DDR Program Committee took up the task of revamping that page to reflect how we conceptualize the repository and its major program areas.

New DDR home page with aerial hero image and three program areas.

The page design will evolve with the DDR itself, but it went live earlier today. More information about the DDR initiative and our plans will follow in the coming months.

Good Stuff on the Horizon: a Duke Digital Repository Teaser…

Folks,

We have been hard at work architecting a robust Repository program for our Duke University community.  And while doing this, we’re in the midst of shoring things up architecturally on the back end.  You may be asking yourself:  Why all the fuss?  What’s the big deal?

Well, part of the fuss is that it’s high time to move beyond the idea that our repository is a platform.  We’d much prefer that our repository be known as a program.  A suite of valuable services that serve the needs of our campus community.  The repository will always be a platform.  In fact, it will be a rock-solid preservation platform: a space to park your valuable digital assets and feel 100% confident that the Libraries will steward those materials for the long haul.  But the repository is much more than a platform; it’s a suite of service goodness that we hope to market and promote!

Secondly, it’s because we’ve got some new and exciting developments happening in Repository-land, specifically in the realm of data management.  To start with, the Provost graciously appointed four new positions to serve the data needs of the University, and those new positions will sit in the Libraries.  We have two Senior Research Specialists and two Content Analysts joining our ranks in early January.  These positions will be solely dedicated to the refinement of data curation processes, liaising with faculty on data management best practice, assisting researchers with the curation and deposit of research data, and acquiring persistent access to said data.  Pretty cool stuff!

So in preparation for this, we’ve had a few things cooking.  To begin with, we are re-designing our Duke Digital Repository homepage.  We will highlight three service areas:

  • Duke Scholarship: This area will feature the research, scholarship and activities of Duke faculty members and academic staff.  It will also highlight services in support of open access, copyright support, digital publishing, and more.
  • Research Data:  This area will be dedicated to the fruits of Duke Scholarship, and will be an area that features research data and data sets.  It will highlight services in support of data curation, data management, data deposit, data citation, and more.
  • Library Collections: This area will focus on digital collections that are owned or stewarded specifically by the Duke University Libraries.  This includes digitized special collections, University Archives material, born digital materials, and more.

For each of these areas we’ve focused on defining a base collections policy, and we are in the process of refining our service models and shoring up policy that will drive preservation and digital asset management of these materials.

So now that I’ve got you all worked up about these new developments, you may be asking, ‘When can I know more?!’  You can expect to see and hear more about these developments (and our newly redesigned website) just after the New Year.  In fact, you can likely expect another Bitstreams Repository post around that time with more updates on our progress, a preview of our site, and perhaps a profile or two of the new staff joining our efforts!

Until then, stay tuned, press ‘Save’, and call us if you’re looking for a better, more persistent, more authoritative approach to saving the fruits of your digital labor!  (Or contact us)

Blacklight Summit 2016

Last week I traveled to lovely Princeton, NJ to attend Blacklight Summit. For the second year in a row a smallish group of developers who use or work on Project Blacklight met to talk about our work and learn from each other.

Blacklight is an open source project written in Ruby on Rails that serves as a discovery interface over a Lucene Solr search index. It’s commonly used to build library catalogs, but is generally agnostic about the source and type of the data you want to search. It was even used to help reporters explore the leaked Panama Papers.

At Duke we’re using Blacklight as the public interface to our digital repository. Metadata about repository objects are indexed in Solr and we use Blacklight (with a lot of customizations) to provide access to digital collections, including images, audio, and video. Some of the collections include: Gary Monroe Photographs, J. Walter Thompson Ford Advertisements, and Duke Chapel Recordings, among many others.
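To give a feel for what a discovery layer does with a Solr index, here is a minimal sketch of assembling a search request. It is not Blacklight's own code (Blacklight is a Rails application) and it does not reflect Duke's actual index. The base URL, core name, and `media_type` field are hypothetical. Only the request parameters `q`, `fq`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, media_type=None, rows=10):
    """Build the URL for a keyword search against a Solr core.

    `base_url` points at a core (hypothetical in this example) and
    `media_type` is an assumed field name used as a filter query.
    """
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if media_type is not None:
        # A filter query narrows the result set without affecting relevance scoring.
        params.append(("fq", f'media_type:"{media_type}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core; nothing is sent over the network here.
url = build_solr_query(
    "http://localhost:8983/solr/example",
    "chapel recordings",
    media_type="audio",
)
```

An application like Blacklight wraps this kind of request behind its search box and facets, then renders the documents that Solr returns.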

Blacklight has also been selected to replace the aging Endeca-based catalog that provides search across the TRLN libraries. Expect to hear more information about this project in the future.

Blacklight Summit is more of an unconference meeting than a conference, with a relatively small number of participants. It’s a great chance to learn and talk about common problems and interests with library developers from other institutions.

I’m going to give a brief overview of some of what we talked about and did during the two-and-a-half-day meeting and provide links for you to explore more on your own.

First, a representative from each institution gave a roughly five-minute overview of how they’re using Blacklight.

The group participated in a workshop on customizing Blacklight. The organizers paired people based on experience, so the most experienced and least experienced (self-identified) were paired up, and so on. Link to the GitHub project for the workshop: https://github.com/projectblacklight/blacklight_summit_demo

We got an update on the state of Blacklight 7. Some of the highlights of what’s coming:

  • Move to Bootstrap 4 from Bootstrap 3
  • Use of HTML 5 structural elements
  • Better internationalization support
  • Move from helpers to presenters. (What are presenters: http://nithinbekal.com/posts/rails-presenters/)
  • Improved code quality
  • Partial structure that makes overrides easier

A release of Blacklight 7 won’t be ready until Bootstrap 4 is released.

There were also several conversations and breakout sessions about Solr, the indexing tool used to power Blacklight. I won’t go into great detail here, but some topics discussed included:

  • Developing a common Solr schema for library catalogs.
  • Tuning the performance of Solr when the index is updated frequently. (Items that are checked out or returned need to be indexed relatively frequently to keep availability information up to date.)
  • Support for multi-lingual indexing and searching in Solr, especially Chinese, Japanese, and Korean languages. Stanford has done a lot of work on this.

I’m sure you’ll be hearing more from me about Blacklight on this blog, especially as we work to build a new TRLN shared catalog with it.

Open Source Software and Repository land

The Duke University Libraries software development team just recently returned from a week in Boston, MA at a conference called Hydra Connect.  We ate good seafood, admired beautiful cobblestones, strolled along the Charles River, and learned a ton about what’s going on in the Hydra-sphere.

At this point you may be scratching your head, exclaiming- huh?!  Hydra?  Hydrasphere?  Have no fear, I shall explain!

Our repository, the Duke Digital Repository, is a Hydra/Fedora Repository.  Hydra and Fedora are names for two prominent open-source communities in repository land.  Fedora concerns itself with architecting the back-end of a repository- the storage layer.  Hydra, on the other hand, refers to a multitude of end-user applications that one can architect on top of a Fedora repository to perform digital asset management.  Pretty cool and pretty handy.  Especially for someone that has no interest in architecting a repository from scratch.

And for a little context re: open source… the idea is that a community of like-minded individuals that care about a particular thing, will band together to develop a massively cool software product that meets a defined need, is supported and extended by the community, and is offered for free for someone to inspect, modify and/or enhance the source code.

I italicized ‘free’ to emphasize that while the software itself is free, and while the source code is available for download and modification, it does take a certain suite of skills to architect a Hydra/Fedora Repository.  It’s not currently an out-of-the-box solution, but is moving in that direction with Hydra-in-a-Box.  But I digress…

So.  Why might someone be interested in joining an open-source community such as these?  Well, for many reasons, some of which might ring true for you:

  • Resources are thin.  Talented developers are hard to find and harder to recruit.  Working with an open source community means that 1) you have the source code to get started, 2) you have a community of people who are available to be (and generally enthusiastic about being) a resource, and 3) working collaboratively makes everything better.  No one wants to go it alone.
  • Governance.  If one gets truly involved at the community level there are often opportunities for contributing thoughts and opinions that can help to shape and guide the software product.  That’s super important when you want to get invested in a project and ensure that it fully meets your needs.  Going it alone is never a good option, and the whole idea of open-source is that it’s participatory, collaborative, and engaged.
  • Give back.  Perhaps you have a great idea.  A fantastic use case.  Perhaps one that could benefit a whole lot of other people and/or institutions.  Well then share the love by participating in open-source.  Instead of developing a behemoth locally that is not maintainable, contribute ideas or features or a new product back to the community.  It benefits others, and it benefits you, by investing the community in the effort of folding features and enhancements back into the core.

Hydra Connect was a fantastic opportunity to mingle with like-minded professionals doing very similar work, and all really enthusiastic to share their efforts.  They want you to get excited about their work.  To see how they are participating in the community.  How they are using this variety of open-source software solutions in new and innovative ways.

It’s easy to get bogged down at a local level with the micro details, and to lose the big picture.  It was refreshing to step out of the office and get back into the frame of mind that recognizes and empowers the notion that there is a lot of power in participating in healthy communities of practice.  There is also a lot of economy in it.

The team came back to Durham full of great ideas and a lot of enthusiasm.  It has fueled a lot of fantastic discussion about the future of our repository software eco-system and how that complements our desire to focus on integration, community-developed goodness, and sustainable practices for software development.

More to come as we turn that thought process into practice!

Team Hydra Connect 2016

Project Hydra

Hydra Connect 2016

Research at Duke and the future of the DDR

The Duke Digital Repository (DDR) is a growing service, and the Libraries are growing to support it. As I post this entry, our jobs page shows three new positions comprising five separate openings that will support the DDR. One is a DevOps position which we have re-envisioned from a salary line that opened with a staff member’s departure. The other four consist of two new positions, with two openings for each, created to meet specific, emerging needs for supporting research data at Duke.

Last fall at Duke, the Vice Provosts for Research and the Vice President for Information Technology convened a Digital Research Faculty Working Group. It included a number of faculty members from around campus, as well as several IT administrators, the latter of whom served in an ex-officio capacity. The Libraries were represented by our Associate University Librarian for Information Technology, Tim McGeary (who happens to be my supervisor).

Membership of the Digital Research Faculty Group. This image and others in the post are slides taken from a presentation I gave to the Libraries’ all-staff meeting in August.
