Going with the Flow: building a research data curation workflow

Why research data? Data generated by scholars in the course of investigation are increasingly being recognized as outputs nearly equal in importance to the scholarly publications they support. Among other benefits, the open sharing of research data reinforces unfettered intellectual inquiry, fosters reproducibility and broader analysis, and permits the creation of new data sets when data from multiple sources are combined. Data sharing, though, starts with data curation.

In January of this year, Duke University Libraries brought on four new staff members–two Research Data Management Consultants and two Digital Content Analysts–to engage in this curatorial effort, and we have spent the last few months mapping out and refining a research data curation workflow to ensure best practices are applied to managing data before, during, and after ingest into the Duke Digital Repository.

What does this workflow entail? A high level overview of the process looks something like the following:

After collecting their data, the researcher will take what steps they are able to prepare it for deposit. This generally means tasks like cleaning and de-identifying the data, arranging files in a structure expected by the system, and compiling documentation to ensure that the data is comprehensible to future researchers. The Research Data Management Consultants will be on hand to help guide these efforts and provide researchers with feedback about data management best practices as they prepare their materials.

Our form for metadata capture

Depositors will then be asked to complete a metadata form and electronically sign a deposit agreement defining the terms of deposit. After we receive this information, someone from our team will invite the depositor to transfer their files to us, usually through Box.

Consultant tasks

As this stage, the Research Data Management Consultants will begin a preliminary review of the researcher’s data by performing a cursory examination for personally identifying or protected health information, inspecting the researcher’s documentation for comprehension and completeness, analyzing the submitted metadata for compliance with the research data application profile, and evaluating file formats for preservation suitability. If they have any concerns, they will contact the researcher to make some suggestions about ways to better align the deposit with best practices.

Analyst tasks

When the deposit is in good shape, the Research Data Management Consultants will notify the Digital Content Analysts, who will finalize the file arrangement and migrate some file formats, generate and normalize any necessary or missing metadata, ingest the files into the repository, and assign the deposit a DOI. After the ingest is complete, the Digital Content Analysts will carry out some quality assurance on the data to verify that the deposit was appropriately and coherently structured and that metadata has been correctly assigned. When this is confirmed, they will publish the data in the repository and notify the depositor.

Of course, this workflow isn’t a finished piece–we hope to continue to clarify and optimize the process as we develop relationships with researchers at Duke and receive more data. The Research Data Management Consultants in particular are enthusiastic about the opportunity to engage with scholars earlier in the research life cycle in order to help them better incorporate data curation standards in the beginning phases of their projects. All of us are looking forward to growing into our new roles, while helping to preserve Duke’s research output for some time to come.

Rethinking Repositories at CNI Spring ’17

One of the main areas of emphasis for the CNI Spring 2017 meeting was “new strategies and approaches for institutional repositories (IR).” A few of us at UNC and Duke decided to plug into the zeitgeist by proposing a panel to reflect on some of the ways that we have been rethinking – or even just thinking about – our repositories.

Continue reading Rethinking Repositories at CNI Spring ’17

Photographing the Movement

Ella Baker, 1964, Danny Lyon, Memories of the Southern Civil Rights Movement 12, dektol.wordpress.com

You never know what to expect with the SNCC Digital Gateway Project project. This three-and-a-half year collaboration with veterans of the Civil Rights Movement, scholars, and archivists has brought constant surprises, one of which came this past January.

The SNCC Digital Gateway website went live on December 13, 2016. One of the 20th century’s most influential activists, Ella Baker, brought the Student Nonviolent Coordinating Committee (SNCC) into being out of the student sit-in movement in 1960, and it would have been her 113th birthday. The SNCC Digital Gateway tells the story of how young activists in SNCC united with local people in the Deep South to build a grassroots movement for change that empowered the Black community and transformed the nation.

Bob Dylan, Courtland Cox, Pete Seeger, and James Forman sitting outside the SNCC office in Greenwood, Mississippi, July 1963, Danny Lyon, dektol.wordpress.com

Over 175 SNCC staff, local activists, mentors, and allies are profiled on the site. The entire project is built on open communication and collaboration between movement veterans and scholars, so after the website went live, SNCC Legacy Project president Courtland Cox sent all living SNCC veterans a link to their profile and requested their feedback. And this is where the unexpected happened. Danny Lyon, SNCC’s first staff photographer, wrote back, offering the use of his photographs in the SNCC Digital Gateway.

Photograph of Danny Lyon with his Nikon F Reflex, Chicago 1960, Danny Lyon Photography, Magnum Photos
Photograph of Danny Lyon with his Nikon F Reflex in Chicago, 1960, Danny Lyon, dektol.wordpress.com

In many ways, Danny Lyon’s photos are the iconic photos of SNCC. In the summer of 1962, Lyon, then a student at the University of Chicago, hitchhiked to Cairo, Illinois with his camera to photograph the desegregation movement that SNCC was helping to organize. After SNCC’s executive secretary James Forman brought Lyon onto staff, he spent next two years documenting SNCC organizing work and demonstrations across the South. Many SNCC posters and pamphlets featured Lyon’s photographs, helping SNCC develop its public image and garner sympathy for the Movement. Julian Bond described Lyon’s photos  as helping “to make the movement move.”

For Danny Lyon to offer these images to the SNCC Digital Gateway was something like winning the lottery. The ties between SNCC veterans run deep, and Lyon explained that he wanted to help. Due to the website’s sustainability standards, our sources for photographs have been limited. While the number of movement-related digital collections are growing, the vast majority are made up of documents, not images. Lyon agreed to give us digital copies of any his movement photos for use in the site. These included rarely photographed people, like Prathia Hall, Worth Long, Euvester Simpson, and many others.

We spent two months working with Danny Lyon and the people at Magnum Photos to identify images, hammer out terms of use, and embed the photos in the site. Today, the SNCC Digital Gateway website proudly features over 70 of Danny Lyon’s photographs. Spend some time, and check them out. And thank you, Mr. Danny Lyon!

James Forman leads singing with staffers in the SNCC office on Raymond Street, 1963, Danny Lyon, Memories of the Southern Civil Rights Movement 123, dektol.wordpress.com

Nuts, Bolts, and Bits: Further Down the Preservation Path

It’s been awhile since we last wrote about the preservation architecture underlying the repository in Preservation Architecture: Phase 2 – Moving Forward with Duke Digital Repository.   Iceberg.  Fickr user: pere.We’ve made some terrific progress in the interim, but most of that is invisible to our users not unlike our chilly friends, icebergs.

Let’s take a brief tour to surface some these changes!

 

Policy and Procedure Development

The recently formed Digital Preservation Advisory Group has been working on policy and procedure to bring DDR into compliance with the ISO 16363 Audit and Certification of Trustworthy Digital Repositories Minimum Criteria. We’ve been working on diverse policy areas like defining how embargoes may be set; how often fixity must be checked and reported to stakeholders; in what situations may content be removed and who must be involved in that decision; and what conditions necessitate a ‘tombstone’ to explain the removal of an object.   Some of these policies are internal and some have already been made publicly available.  For example, see our Deaccession Policy and our Preservation Policy.   We’ve made great progress due to the fantastic example set by our friends at Purdue University Research Repository and others.

Preservation Infrastructure

Duke, DuraCloud, and GlacierDurham, North Carolina, is a lovely city– close to mountains, the beach, and full of fantastic restaurants!  Sometimes, though, your digital assets just need to get away from it all.  Digital preservation demands some geographic diversity.  No repository wants all of its data to be subject to a hurricane, of course!  That’s why we’ve partnered with DuraCloud, a preservation-focused cloud provider, to store copies of our digital assets in geographically diverse locations.  Our data now enjoys homes at Duke, at DuraCloud, and in Amazon Glacier!

To bring transparency to the process of remotely replicating our assets and validating the local and remote assets, we’ve recently implemented a process that externalizes these tasks from Fedora and delivers scheduled reports to stakeholders enumerating and detailing the health of their assets.

 

Research and Development

The DDR has grown tremendously in the last year and with it has grown the need to standardize and scale to demand.  Writing Python to arrange files to conform to our Standard Ingest Format was a perfectly reasonable solution in early 2016.  Likewise, programmatic reformatting of endangered file formats wasn’t feasible with the resources available at the time.  We also did need to worry about traffic scaling back then.  Times have changed!

DDR staff are exploring tools to allow non-developers to easily ingest large amounts of material, methods to identify and migrate files to better supported formats, and are planning for more sustainable and durable architecture like increased inter-application messaging to allow us to externalize processes that have been handled within the repository to external servers.

Let’s Get Small: a tribute to the mighty microcassette

In past posts, I’ve paid homage to the audio ancestors with riffs on such endangered–some might say extinct–formats as DAT and Minidisc.  This week we turn our attention to the smallest (and perhaps the cutest) tape format of them all:  the Microcassette.

Introduced by the Olympus Corporation in 1969, the Microcassette used the same width tape (3.81 mm) as the more common Philips Compact Cassette but housed it in a much smaller and less robust plastic shell.  The Microcassette also spooled from right to left (opposite from the compact cassette) as well as using slower recording speeds of 2.4 and 1.2 cm/s.  The speed adjustment, allowing for longer uninterrupted recording times, could be toggled on the recorder itself.  For instance, the original MC60 Microcassette allowed for 30 minutes of recorded content per “side” at standard speed and 60 minutes per side at low speed.

The microcassette was mostly used for recording voice–e.g. lectures, interviews, and memos.  The thin tape (prone to stretching) and slow recording speeds made for a low-fidelity result that was perfectly adequate for the aforementioned applications, but not up to the task of capturing the wide dynamic and frequency range of music.  As a result, the microcassette was the go-to format for cheap, portable, hand-held recording in the days before the smartphone and digital recording.  It was standard to see a cluster of these around the lectern in a college classroom as late as the mid-1990s.  Many of the recorders featured voice-activated recording (to prevent capturing “dead air”) and continuously variable playback speed to make transcription easier.

The tiny tapes were also commonly used in telephone answering machines and dictation machines.

As you may have guessed, the rise of digital recording, handheld devices, and cheap data storage quickly relegated the microcassette to a museum piece by the early 21st century.  While the compact cassette has enjoyed a resurgence as a hip medium for underground music, the poor audio quality and durability of the microcassette have largely doomed it to oblivion except among the most willful obscurantists.  Still, many Rubenstein Library collections contain these little guys as carriers of valuable primary source material.  That means we’re holding onto our Microcassette player for the long haul in all of its atavistic glory.

image by the author. other images in this post taken from Wikimedia Commons (https://commons.wikimedia.org/wiki/Category:Microcassette)

 

An Update on TRLN Discovery: A New Catalog for the Triangle Research Libraries Network.

We are still many months (well, OK — a year or more) away from unplugging the servers that keep the Endeca powered Search TRLN catalog and our local catalogs alive. But in the meantime work is well underway on its replacement. For those who don’t know, the libraries of Duke, UNC Chapel Hill, NC State, and NCCU have for many years maintained a shared catalog to make resources for all Triangle research libraries easily available to our communities. We all push our records to a centralized data pipeline that indexes our data in Endeca and even share much of the code that runs our local catalog user interfaces. While the browsing and searching capabilities of Endeca were innovative for library catalogs at the time, the rest of the library world has caught up and there are numerous open source solutions to providing convenient and powerful search and browse access to our holdings.

Some goals for the replacement for the Endeca based shared catalog system include:

  • Maintain functionality of the current catalog, including search & browse features.
  • Architect the new system so that it’s easier for staff to troubleshoot problems and test changes to the data pipeline.
  • Use open source tools already in common use by peer libraries to take advantage of community effort and knowledge.
  • Develop a platform that makes it easy for each institution to host their own copy of the catalog UI, that takes advantage of shared features and code, while allowing for local customization where needed.
  • Provide a centralized data service and index that can process and host all 12 million or so records from each institution that will provide the back end index for all local catalogs. (NCCU will use a separate vended solution for their catalog.)

The large team of librarians and staff from across the Research Triangle libraries have been hard at work planning for the new system and many characteristics of the new system are becoming clear:

  • We will use Solr as the replacement for Endeca. Solr will store and index our collective holdings and provide search and browse access to our holdings for our catalog applications.
  • Blacklight, a UI built to search and browse a Solr Index will provide the basis for our local catalogs. Blacklight is used by Stanford, Cornell, and many other libraries to provide the user interface to their catalogs.
  • We will build a Rails Engine that will add to Blacklight any customizations needed to support the consortial catalog and any additional features that the institutions want to share. This engine will make it easy to build a new Blacklight based catalog that will work with the TRLN Solr Index.
  • The data pipeline, Solr index and Catalog UI will be packaged in such a way that staff can run a near copy of the production system locally to develop and test changes.

What we have so far:

  • Scripts built with Traject that will take our MARC XML or MARC Binary records and transform them into an intermediate JSON format.
  • Scripts that take the JSON data and index the data in Solr.
  • The beginnings of a Rails engine that will form the basis of our Blacklight catalog user interfaces.

The technical implementation team is using the TRLN GitHub account to collaborate on development, and much work-in-progress code is posted there. There is still plenty of work left to do, a long list of requirements to accommodate and many unanswered questions, but the project is well underway to build the next shared catalog for the Triangle Research Libraries.

The Outer Limits of Aspect Ratios

“There is nothing wrong with your television set. Do not attempt to adjust the picture. We are controlling transmission. We will control the horizontal. We will control the vertical. We repeat: there is nothing wrong with your television set.”

That was part of the cold open of one of the best science fiction shows of the 1960’s, “The Outer Limits.” The implication being that by controlling everything you see and hear in the next hour, the show’s producers were about to blow your mind and take you to the outer limits of human thought and fantasy, which the show often did.

In regards to controlling the horizontal and the vertical, one of the more mysterious parts of my job is dealing with aspect ratios when it comes to digitizing videotape. The aspect ratio of any shape is the proportion of it’s dimensions. For example, the aspect ratio of a square is always 1 : 1 (width : height). That means, in any square, the width is always equal to the height, regardless of whether a square is 1-inch wide or 10-feet wide. Traditionally, television sets displayed images in a 4 : 3 ratio. So, if you owned a 20” CRT (cathode ray tube) TV back in the olden days, like say 1980, the broadcast image on the screen was 16” wide by 12” high. So, the height was 3/4 the size of the width, or 4 : 3. The 20” dimension was determined by measuring the rectangle diagonally, and was mainly used to categorize and advertise the TV.

 

 

Almost all standard-definition analog videotapes, like U-matic, Beta and VHS, have a 4 : 3 aspect ratio. But when digitizing the content, things get more complicated. Analog video monitors display pixels that are tall and thin in shape. The height of these pixels is greater than their width, whereas modern computer displays use pixels that are square in shape. On an analog video monitor, NTSC video displays at roughly 720 (tall and skinny) pixels per horizontal line, and there are 486 visible horizontal lines. If you do the math on that, 720 x 486 is not 4 : 3. But because the analog pixels display tall and thin, you need more of them aligned vertically to fill up a 4 : 3 video monitor frame.


When Duke Libraries digitizes analog video, we create a master file that is 720 x 486 pixels, so that if someone from the broadcast television world later wants to use the file, it will be native to that traditional standard-definition broadcast specification. However, in order to display the digitized video on Duke’s website, we make a new file, called a derivative, with the dimensions changed to 640 x 480 pixels, because it will ultimately be viewed on computer monitors, laptops and smart phones, which use square pixels. Because the pixels are square, 640 x 480 is mathematically a 4 : 3 aspect ratio, and the video will display properly. The derivative video file is also compressed, so that it will stream smoothly regardless of internet bandwidth limits.

“We now return control of your television set to you. Until next week at the same time, when the control voice will take you to – The Outer Limits.”

508 Update, Update

A little more than a year ago, I wrote about the proposed update to the 508 accessibility standards. And about three weeks ago, the US Access Board published the final rule that contains updates to the 508 accessibility requirements for Information and Communication Technology (ICT). The rules had not previously been updated since 2001 and as such had greatly lagged behind modern web conventions.

It’s important to note that the 508 guidelines are intended to serve as a vehicle for guiding procurement, while at the same time applying to content created by a given group/agency. As such, the language isn’t always straightforward.

What’s new?

As I outlined in my previous post, a major purpose of the new rule is to move away from regulating types of devices and instead focus on functionality:


… one of the primary purposes of the final rule is to replace the current product-based approach with requirements based on functionality, and, thereby, ensure that accessibility for people with disabilities keeps pace with advances in ICT.


To that effect, one of the biggest change over the old standard is the adoption of WCAG 2.0 as the compliance level. The fundamental premise of WCAG compliance is that content is ‘perceivable, operable, and understandable’ — bottom line is that as developers, we should strive to make sure all of our content is usable for everyone across all devices. The adoption of WCAG allows the board to offload responsibility of making incremental changes as technology advances (so we don’t have to wait another 15 years for updates) and also aligns our standards in the United States with those used around the world.


Harmonization with international standards and guidelines creates a larger marketplace for accessibility solutions, thereby attracting more offerings and increasing the likelihood of commercial availability of accessible ICT options.


Another change has to do with making a wider variety of electronic content accessible, including internal documents. It will be interesting to see to what degree this part of the rule is followed by non-federal agencies.


The Revised 508 Standards specify that all types of public-facing content, as well as nine categories of non-public-facing content that communicate agency official business, have to be accessible, with “content” encompassing all forms of electronic information and data. The existing standards require Federal agencies to make electronic information and data accessible, but do not delineate clearly the scope of covered information and data. As a result, document accessibility has been inconsistent across Federal agencies. By focusing on public-facing content and certain types of agency official communications that are not public facing, the revised requirements bring needed clarity to the scope of electronic content covered by the 508 Standards and, thereby, help Federal agencies make electronic content accessible more consistently.


The new rules do not go into effect until January 2018. There’s also a ‘safe harbor’ clause that protects content that was created before this enforcement date, assuming it was in compliance with the old rules. However, if you update that content after January, you’ll need to make sure it complies with the new final rule.


Existing ICT, including content, that meets the original 508 Standards does not have to be upgraded to meet the refreshed standards unless it is altered. This “safe harbor” clause (E202.2) applies to any component or portion of ICT that complies with the existing 508 Standards and is not altered. Any component or portion of existing, compliant ICT that is altered after the compliance date (January 18, 2018) must conform to the updated 508 Standards.


So long story short, a year from now you should make sure all the content you’re creating meets the new compliance level.

A Refreshing New Look for Our Library Website

If you’ve visited the Duke University Libraries website in the past month, you may have noticed that it looks a bit more polished than it used to. Over the course of the fall 2016 semester, my talented colleague Michael Daul and I co-led a project to develop and implement a new theme for the site. We flipped the switch to launch the theme on January 6, 2017, the week before spring classes began. In this post, I’ll share some background on the project and its process, and highlight some noteworthy features of the new theme we put in place.

Newly refreshed Duke University Libraries website homepage.

Goals

We kicked off the project in Aug 2016 using the title “Website Refresh” (hat-tip to our friends at NC State Libraries for coining that term). The best way to frame it was not as a “redesign,” but more like a 50,000-mile maintenance tuneup for the site.  We had four main goals:

  • Extend the Life of our current site (in Drupal 7) without a major redesign or redevelopment effort
  • Refresh the Look of the site to be modern but not drastically different
  • Better Code by streamlining HTML markup & CSS style code for easier management & flexibility
  • Enhance Accessibility via improved compliance with WCAG accessibility guidelines

Our site is fairly large and complex (1,200+ pages, for starters). So to keep the scope lean, we included no changes in content, information architecture, or platform (i.e., stayed on Drupal 7). We also worked with a lean stakeholder team to make decisions related to aesthetics.

Extending the Life of the Site

Our old website theme was aging; the project leading to its development began five years ago in Sep 2012, was announced in Jan 2013, and then eventually launched about three years ago in Jan 2014. Five years–and even three–is a long time in web years. Sites accumulate a lot of code cruft over time, the tools for managing and writing code become deprecated quickly. We wanted to invest a little time now to replace some pieces of the site’s front-end architecture with newer and better replacements, in order to buy us more time before we’d have to do an expensive full-scale overhaul from the ground up.

Refreshing the Look

Our 2014 site derived a lot its aesthetic from the main Duke.edu website at the time. Duke’s site has changed significantly since then, and meanwhile, web design trends have changed dramatically: flat design is in, skeuomorphism out.  Google Web Fonts are in, Times, Arial, Verdana and company are out.  Even a three year old site on the web can look quite dated.

Old site theme, dated aesthetics.
New “refreshed” theme, with flatter, more modern aesthetic

Closeup on skeuomorphic embellishments vs. flat elements.

Better Code

Beyond evolving aesthetics, the various behind-the-scenes web frameworks and code workflows are in constant, rapid flux; it can really keep a developer’s head on a swivel. Better code means easier maintenance, and to that end our code got a lot better after implementing these solutions:

  • Bootstrap Upgrade. For our site’s HTML/CSS/JS framework, we moved from Bootstrap version 2 (2.3.1) to version 3 (3.3.7). This took weeks of work: it meant thousands of pages of markup revisions, only some of which could be done with a global Search & Replace.
  •  Sass for CSS. We trashed all of our old theme’s CSS files and started over using Sass, a far more efficient way to express and maintain style rules than vanilla CSS.
  • Gulp for Automation. Our new theme uses Gulp to automate code tasks like processing Sass into CSS, auto-prefixing style declarations to work on older browsers, and crunching 30+ css files down into one.
  • Font Awesome. We ditched most of our older image-based icons in favor of Font Awesome ones, which are far easier to reference and style, and faster to load.
  • Radix.  This was an incredibly useful base theme for Drupal that encapsulates/integrates Sass, Gulp, Bootstrap, and FontAwesome. It also helped us get a Bootswatch starter theme in the mix to minimize the local styling we had to do on top of Bootstrap.

We named our new theme Dulcet and put it up on GitHub.

Sass for style management, e.g., expressing colors as reusable variables.
Gulp for task automation, e.g., auto-prefixing styles to account for older browser workarounds.

 Accessibility

Some of the code and typography revisions we’ve made in the “refresh” improve our site’s compliance with WCAG2.0 accessibility guidelines. We’re actively working on further assessment and development in this area. Our new theme is better suited to integrate with existing tools, e.g., to automatically add ARIA attributes to interactive page elements.

Feedback or Questions?

We would love to hear from you if you have any feedback on our new site, if you spot any oddities, or if you’re considering doing a similar project and have any questions. We encourage you to explore the site, and hope you find it a refreshing experience.

Notes from the Duke University Libraries Digital Projects Team