Category Archives: Behind the Scenes

EDTF-Humanize 2.0 with Improved Internationalization Support

About four years ago we released a small Ruby gem (EDTF-Humanize) to generate human-readable dates from Extended Date Time Format (EDTF) dates. For some background on our use of the EDTF standard, please see our previous blog posts on the topic: EDTF-Humanize, Enjoy your Metadata: Fun with Date Encoding, and It’s Date Night Here at Digital Projects and Production Services.

Some recent community contributions to the gem, along with some extra time as we transition from one work cycle to another, provided an opportunity for maintenance and refinement of EDTF-Humanize. The primary improvement is better support for languages other than English via Ruby I18n locale configuration files and a language-specific module override pattern. Support for French is now included, and support for other languages may be added following the same approach.

The primary means of adding another language to EDTF-Humanize is to add a translation file to config/locales/. This is the translation file included to support French:

fr:
  date:
    day_names: [Dimanche, Lundi, Mardi, Mercredi, Jeudi, Vendredi, Samedi]
    abbr_day_names: [Dim, Lun, Mar, Mer, Jeu, Ven, Sam]
    # Don't forget the nil at the beginning; there's no such thing as a 0th month
    month_names: [~, Janvier, Février, Mars, Avril, Mai, Juin, Juillet, Août, Septembre, Octobre, Novembre, Decembre]
    abbr_month_names: [~, Jan, Fev, Mar, Avr, Mai, Jun, Jul, Aou, Sep, Oct, Nov, Dec]
    seasons:
      spring: "printemps"
      summer: "été"
      autumn: "automne"
      winter: "hiver"
  edtf:
    terms:
      approximate_date_prefix_day: ""
      approximate_date_prefix_month: ""
      approximate_date_prefix_year: ""
      approximate_date_suffix_day: " environ"
      approximate_date_suffix_month: " environ"
      approximate_date_suffix_year: " environ"
      decade_prefix: "Les années "
      decade_suffix: ""
      century_suffix: ""
      interval_prefix_day: "Du "
      interval_prefix_month: "De "
      interval_prefix_year: "De "
      interval_connector_approximate: " à "
      interval_connector_open: " à "
      interval_connector_day: " au "
      interval_connector_month: " à "
      interval_connector_year: " à "
      interval_unspecified_suffix: "s"
      open_start_interval_with_day: "Jusqu'au %{date}"
      open_start_interval_with_month: "Jusqu'en %{date}"
      open_start_interval_with_year: "Jusqu'en %{date}"
      open_end_interval_with_day: "Depuis le %{date}"
      open_end_interval_with_month: "Depuis %{date}"
      open_end_interval_with_year: "Depuis %{date}"
      set_dates_connector_exclusive: ", "
      set_dates_connector_inclusive: ", "
      set_earlier_prefix_exclusive: 'Le ou avant '
      set_earlier_prefix_inclusive: 'Le et avant '
      set_last_date_connector_exclusive: " ou "
      set_last_date_connector_inclusive: " et "
      set_later_prefix_exclusive: 'Le ou après '
      set_later_prefix_inclusive: 'Le et après '
      set_two_dates_connector_exclusive: " ou "
      set_two_dates_connector_inclusive: " et "
      uncertain_date_suffix: "?"
      unknown: 'Inconnue'
      unspecified_digit_substitute: "x"
    formats:
      day_precision_strftime_format: "%-d %B %Y"
      month_precision_strftime_format: "%B %Y"
      year_precision_strftime_format: "%Y"
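
With a locale file like the one above in place, switching the humanized output to French is mostly a matter of setting the I18n locale. Here is a rough sketch; the exact output strings depend on your gem version and locale configuration:

require "edtf-humanize"

# English is the default locale.
I18n.locale = :en
Date.edtf("1984~").humanize # => something like "circa 1984"

# Switch to the French locale defined in config/locales/fr.yml.
I18n.locale = :fr
Date.edtf("1984~").humanize # => "1984 environ", per the suffix configured above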

In addition to the translation file, the methods used to construct the human-readable string for each EDTF date object type may be completely overridden for a language if needed. For instance, when the date object is an instance of EDTF::Century, French uses a different method from the default to construct the humanized form. This override is accomplished by adding a language module for French that includes the Default module and also includes a Century module that overrides the default behavior. The override is shown here (minus the internals of the humanizer method) as an example:

# lib/edtf/humanize/language/french.rb
module Edtf
  module Humanize
    module Language
      module French
        include Default
        module Century
          extend self

          def humanizer(date)
            # Special French handling for EDTF::Century
          end
        end
      end
    end
  end
end

EDTF-Humanize version 2.0.0 is available on rubygems.org and on GitHub. Documentation is available on GitHub. Pull requests are welcome; I’m especially interested in contributions to add support for languages in addition to English and French.

In a (Temporary) Time of Remote Work, Duke’s FOLIO Implementation Continues

Duke University is an early adopter for FOLIO, an open source library services platform that will give us tools to better support the information needs of our students, faculty, and staff. A core team in Library Systems and Integration Support began forming in January 2019 to help Duke move to FOLIO. I joined that team in January 2019 and began work as an IT Business Analyst.

In preparation for going live with FOLIO, we formally kicked off our local implementation effort in January 2020. More than 40 local subject experts have joined small group teams to work on different parts of the FOLIO project. These experts are invaluable to Library IT staff: they know how the library’s work is done and which features need to be prioritized, and they are committed to figuring out how to transition their work into the FOLIO environment.

If you’re reading this in April 2020 and thinking “wasn’t January ten years ago?” you’re not alone. Because the FOLIO Project is international, with partners all over the world, many of us are used to working via remote tools like Slack, Microsoft Teams, and Zoom. But that is a far cry from doing ALL of our work that way, while also taking care of our families and ourselves. It’s a huge credit to all library staff that while the University was swiftly pivoting to remote work, we were able to keep our implementation work going.

One of the first big, messy areas that we knew we needed to work on was using locations.

Locations are essential to how patrons know where an item is at the Duke Libraries. When you look up a book in our catalog and the system tells you Where to Find It, it’s using location information from our systems. Library staff also use locations to understand how often items are borrowed, decide when to move items to our off-campus storage, and decide when to buy new items to keep our collections up to date.

A group of FOLIO team members came together from different working areas, including public services, cataloging, acquisitions, digital resources and assessment. I convened those discussions as a lead for our Configurations team. Over the course of late February and March 2020, we met three times as a group using Zoom and delved deep into learning about locations in our current system and how they will work in FOLIO. Staff members shared their knowledge with each other about their functional areas, allowing us to identify potential gaps in FOLIO functionality, as well as things we could improve now, without waiting for FOLIO to deploy.

This team identified two potential paths forward: one that was straightforward, and one that was more creative and would adapt FOLIO’s four-level locations in a new way. In our final meeting, where we had hoped to decide between the two options, our subject experts grappled with the challenges, risks, and rewards of the two choices and were able to recommend a path forward together. Ultimately, the team agreed that the creative option was the best choice but that both options would work, and that guidance helped us decide how to make a first pass at configuring locations and move the project forward.

The most important part of these meetings was valuing the expertise of our library staff and working to support them as they decided what would work best for the library’s needs. I am deeply appreciative of the staff who committed the time to these discussions while also figuring out how to move their regular jobs to remote work. Our FOLIO implementation is all the better because of their collaborative spirit.

The New Books & Media Catalog Turns One

It’s been just over a year since we launched our new catalog in January 2019. Since then we’ve made improvements to features, performance, and reliability, developed a long-term governance and development strategy, and made plans for future features and enhancements.

During the Spring 2019 semester we experienced a number of outages of the Solr index that powers the new catalog. The root cause proved both frustrating and difficult to track down. We took a number of measures to reduce the risk of bot traffic slowing down or crashing the index, including limiting facet paging to 50 pages and results paging to 250 pages, and setting limits on OpenSearch queries. We added service monitoring so we are automatically alerted when things go awry, along with automatic restarts under some known bad system conditions. We also identified a bug in the version of Solr we were running that could cause crashes for queries with particular characteristics, and we have since applied a patch to Solr to address it. Happily, the index has not crashed since we implemented these protective measures and bug fixes.
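
To give a sense of what one of these guardrails looks like in practice, here is a simplified sketch of a deep-paging cap as it might appear in a Rails/Blacklight catalog controller. This is illustrative only, not the actual TRLN Discovery code, and the names and thresholds are placeholders:

# app/controllers/concerns/deep_paging_guard.rb (hypothetical)
module DeepPagingGuard
  extend ActiveSupport::Concern

  MAX_RESULT_PAGES = 250

  included do
    before_action :limit_deep_paging, only: :index
  end

  private

  def limit_deep_paging
    return if params[:page].to_i <= MAX_RESULT_PAGES

    # Stop very deep paging requests (usually bots) before they reach Solr.
    redirect_to url_for(params.to_unsafe_h.merge(page: MAX_RESULT_PAGES)),
                alert: "Results are limited to the first #{MAX_RESULT_PAGES} pages."
  end
end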

Over the past year we’ve made a number of other improvements to the catalog including:

  • Caching of the home page and advanced search page has reduced page load times by 75%.
  • Subject searches are now more precise and do not include stemmed terms.
  • CDs and DVDs can be searched by accession number.
  • When digitized copies of Duke material are available at the Internet Archive, links to the digital copy are automatically generated.
  • Records can be saved to a bookmarks list and shared with others via a stable URL.
  • Eligible records now have a “Request digitization” link.
  • Many other small improvements and bug fixes.

We sometimes get requests for features that the catalog already supports.

While development has slowed, the core TRLN team meets monthly to discuss and prioritize new features and fixes, and dedicates time each month to maintenance and new development. We have a number of small improvements and bug fixes in the works. One new feature we’re working on is adding a citation generator that will provide copyable citations in multiple formats (APA, MLA, Chicago, Harvard, and Turabian) for records with OCLC numbers.

We welcome, read, and respond to all feedback and feature requests that come to us through the catalog’s feedback form. Let us know what you think.

Check out “Search Tips” and “Expert Search Tips” for detailed information about how to get the most out of the new catalog.

Duke Digital Repository Evolution and a new home page

After nearly a year of work, the Libraries recently launched an updated version of the software stack that powers parts of the Duke Digital Repository. This work primarily centered on migrating the underlying software in our Samvera implementation, which we use to power the DDR, from ActiveFedora to Valkyrie. Moving to Valkyrie gives us improved stability along with the flexibility to use different storage solutions, which in turn provides us with options and some degree of future-proofing. Considerable effort was also spent on updating the public and administrative interfaces to use more recent versions of Blacklight and supporting software.
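
For those curious about what the Valkyrie data-mapper pattern looks like, here is a minimal sketch of defining and persisting a resource. The resource class and attributes are illustrative, not the DDR’s actual models, but they show how Valkyrie keeps the model separate from the storage backend: swapping the in-memory adapter for Postgres or Fedora is a configuration change rather than a rewrite.

require "valkyrie"

# A bare-bones resource; real DDR models have many more attributes.
class GenericWork < Valkyrie::Resource
  attribute :title, Valkyrie::Types::Set
end

# Register a storage backend; the in-memory adapter is handy for demos and tests.
Valkyrie::MetadataAdapter.register(
  Valkyrie::Persistence::Memory::MetadataAdapter.new, :memory
)
adapter = Valkyrie::MetadataAdapter.find(:memory)

# Save and retrieve the resource through the adapter's persister and query service.
work = adapter.persister.save(resource: GenericWork.new(title: ["An example work"]))
adapter.query_service.find_by(id: work.id)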

Administrative interface for the DDR

We also used this opportunity to revise the repository landing page at repository.duke.edu and I was involved in building a new version of the home page. Our main goals were to make use of a header implementation that mirrored our design work in other recent library projects and that integrated our ‘unified’ navigation, while also maintaining the functionality required by the Samvera software.

DDR home page before the redesign

We also spent a lot of time thinking about how best to illustrate the components of the Duke Digital Repository while trying to keep the content simple and streamlined. In the end we went with a design that emphasizes the two branches of the repository: Library Collections and Duke Scholarship. Each branch in turn links to two destinations: Digitized Collections / Acquired Materials and the Research Data Repository / DukeSpace. The overall design is more compact than before and hopefully an aesthetic improvement as well.

Redesigned DDR home page

We also incorporated a feedback form that is persistent across the interface so that users can more readily report any difficulties they encounter while using the platform. And finally, we updated the content in the footer to help direct users to the content they are most likely looking for.

Future plans include incorporating our header and footer content more consistently across the repository platforms along with bringing a more unified look and feel to interface components.

Check out the new design and let us know what you think!

All About that Time Base

The video digitization system in Duke Libraries’ Digital Production Center utilizes many different pieces of equipment: power distributors, waveform and vectorscope monitors, analog & digital routers, audio splitters & decibel meters, proc-amps, analog (BNC, XLR and RCA) to digital (SDI) converters, CRT & LCD video monitors, and of course an array of analog video playback decks of varying flavors (U-matic-NTSC, U-matic-PAL, Betacam SP, DigiBeta, VHS-NTSC and VHS-PAL/SECAM). We also transfer content directly from born-digital DV and MiniDV tapes.

A grandfather clock is a time base.

One additional component that is crucial to videotape digitization is the Time Base Corrector (TBC). Each of our analog video playback decks must have either an internal or an external TBC in order to generate an image of acceptable quality. At the recent Association of Moving Image Archivists conference in Baltimore, George Blood (of George Blood Audio/Video/Film/Data) gave a great presentation on exactly what a Time Base Corrector is, appropriately entitled “WTF is a TBC?” Thanks to George for letting me relay some of his presentation points here.

A time base is a consistent reference point that one can use to stay in sync. For example, the Earth rotating on its axis is a time base that the entire human race relies on to stay on schedule. A grandfather clock is also a time base. And so is a metronome, which a musical ensemble might use to all stay “in time.”

Frequency is defined as the number of occurrences of a repeating event per unit of time. So, the frequency of the Earth’s rotation on its axis is once per 24 hours. The frequency of a grandfather clock is one pendulum swing per second. The clock example can also be described as one “cycle per second,” or one hertz (Hz), named after Heinrich Hertz, who first conclusively proved the existence of electromagnetic waves in the late 1800s.

One of the DPC’s external Time Base Correctors

But anything mechanical, like grandfather clocks and videotape decks, can be inconsistent. The age and condition of gears and rods and springs, as well as temperature and humidity, can significantly affect a grandfather clock’s ability to display the time correctly.

Videotape decks are similar, full of numerous mechanical and electrical parts that produce infinite variables in performance, affecting the deck’s ability to play the videotape’s frames-per-second (frequency) in correct time.

NTSC video is supposed to play at 29.97 frames-per-second, but due to mechanical and electro-magnetic variables, some frames may be delayed, or some may come too fast. One second of video might not have enough frames, another second may have too many. Even the videotape itself can stretch, expand and contract during playback, throwing off the timing, and making the image wobbly, jittery, too bright or dark, too blue, red or green.

A Time Base Corrector does something awesome. As the videotape plays, the TBC stores the unstable video content briefly, fixes the timing errors, and then outputs the corrected analog video signal to the DPC’s analog-to-digital converters. Some of our videotape decks have internal TBCs, which look like a computer circuit board (shown below). Others need an external TBC, which is a smaller box that attaches to the output cables coming from the videotape deck (shown above, right). Either way, the TBC can delay or advance the video frames to lock them into correct time, which fixes all the errors.

An internal Time Base Corrector card from a Sony U-matic BVU-950 deck
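
For the programmers in the room, here is a loose software analogy for the buffering and re-clocking described above. Real TBCs work on the sync pulses inside the analog signal itself, so treat this as a conceptual sketch, not a model of the hardware: frames arrive with irregular timing, wait briefly in a buffer, and are released again on a rock-steady 29.97 frames-per-second clock.

FRAME_INTERVAL = 1.0 / 29.97 # seconds between frames in NTSC video

# Takes a list of jittery frame arrival times (in seconds) and returns
# the re-clocked output times, locked to the stable frame interval.
def time_base_correct(arrival_times)
  buffer = arrival_times.sort # hold the unstable input briefly
  start  = buffer.first
  buffer.each_index.map { |i| start + (i * FRAME_INTERVAL) }
end

# Frames that arrived early, late, and bunched together...
jittery = [0.000, 0.031, 0.070, 0.099, 0.138]
# ...leave the "corrector" exactly one frame interval apart.
corrected = time_base_correct(jittery)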

An internal TBC is actually able to “talk” to the videotape deck, and give it instructions, like this…

“Could you slow down a little? You’re starting to catch up with me.”

“Hey, the frames are arriving at a strange time. Please adjust the timing between the capstan and the head drum.”

“There’s a wobble in the rate the frames are arriving. Can you counter-wobble the capstan speed to smooth that out?”

“Looks like this tape was recorded with bad heads. Please increase gain on the horizontal sync pulse so I can get a clearer lock.”

Without the mighty TBC, video digitization would not be possible, because all those errors would be permanently embedded in the digitized file. Thanks to the TBC, we can capture a nice, clean, stable image to share with generations to come, long after the magnetic videotape, and playback decks, have reached the end of their shelf life.

FOLIO Update November 2019

Here at Duke, the buzz continues around FOLIO. We have continued to contribute to the international project as active participants on the FOLIO Product Council and in special interest groups, and by contributing development resources. You can find links to the various groups on the FOLIO wiki.

We’ve also committed to implementing the electronic resource management (ERM)-focused apps in the summer of 2020. Starting with the ERM-focused apps will give us the opportunity to use FOLIO in a production environment, and will benefit our Continuing Resource Acquisitions Department, since they are not currently using dedicated electronic resources software to keep track of licenses and terms.


Our local project planning has come more into focus as well. We have gathered names for team participants and will be kicking off our project teams in January. As we’ve talked about the implementation here, we’ve realized that we have a number of tasks that will need to be addressed, regardless of subject matter. For example, we’re going to need to map data: not just bibliographic, holdings, and item data, but users, orders, invoices, etc. We’ll also need to set up configurations and user permissions for each of the apps, and develop, document, and train staff on new workflows. Because these tasks cut across functional areas, we need to facilitate discussions among them. To do that, we’re going to create a set of functional-area implementation teams, along with work groups for the task areas that need to be addressed.

To learn more about the FOLIO project at Duke, fly on over to our WordPress site and read through our past newsletters, look through slides from past presentations, and check out some fun links to bee facts.

A Statement of Commitment

The featured image is from a mockup of a new repositories home page that we’re working on in the Libraries, planned for rollout in January of 2020.

Working at the Libraries, it can be dizzying to think about all of our commitments.

There’s what we owe our patrons, a body of so many distinct and overlapping communities, all seeking to learn and discover, that we could split the library along an infinite number of lines to meet them where they work and think.

There’s what we owe the future, in our efforts to preserve and share the artifacts of knowledge that we acquire on the market, that scholars create on our own campus, or that seem to form from history and find us somehow.

There’s what we owe the field, and the network of peer libraries that serve their own communities, each of them linked in a web of scholarship with our own. Within our professional network, we seek to support and complement one another, to compete sometimes in ways that move our field forward, and to share what we learn from our experiences.

The needs of information technology underlie nearly all of these activities, and to meet those needs, we have an IT staff that’s modest in size, but prodigious in its skill and its dedication to the mission of the Libraries. Within that group, the responsibility for creating new software, and maintaining what we have, falls to a small team of developers and devops engineers. We depend on them to enhance and support a wide range of platforms, including our web services, our discovery platforms, and our digital repositories.

This fall, we did some reflection on how we want to approach support for our repository platforms. The result of that reflection was a Statement of Commitment to Repositories Support and Development, a document of roughly a page that expresses what we consider to be our values in this area, and the context of priorities in which we do that work.

The committee that created the statement was our Digital Preservation and Publishing Program, or DP3 as we call it in house. We summarized our values as “openness, community and peer engagement, and independence from vended platforms,” which have “guided us to build our repositories on open source software platforms.” We place that work within the context of very large, looming priorities like our transition to FOLIO as our Library Services Platform, and the project to renovate Lilly Library. There are others, not mentioned in the statement, that fill the pages of this blog.

The statement is explicit that we will not seek to find alternative platforms for our repository services in the next several years, and in particular while the FOLIO transition is underway. This decision is informed by our recognition that migration of content and services across platforms is complex and expensive. It’s also a recognition that we have invested a lot into these existing platforms, and we want to carve out as much space as we can for our talented staff to focus on maintaining and improving them, rather than locking ourselves into all-consuming cycles of content migration.

From a practical perspective, and speaking as the manager who oversees software development in the Libraries, I see this statement as part of an overall strategy to bring focus to our work. It’s a small but important symbolic measure that recognizes the drag that we create for our software team when we give in to our urge to prioritize everything.

The phrase “context switching” is one that we have borrowed from the parlance of operating systems to describe the effects on a developer of working on multiple projects at once. There are real costs to moving between development environments, code bases, and architectures on the same day, in the same week, during the same sprint, or even within an extended work cycle. We also call this problem “multitasking,” and the penalty it imposes on performance is well documented.

Even more than performance, I think of it as a quality of life concern. People are generally happier and more invested when they’re able to do quality work. As a manager, I can work with scheduling and planning to try to mitigate those effects of multitasking on our team. But the responsibility really lies with the organization. We have our commitments, and they are vast in size and scope. We owe it to ourselves to do some introspection now and again, and ask what we can realistically do with what we have, or more accurately, who we are.

Lighting and the PhaseOne: It’s More Than Point and Shoot

Last week, I went to go see the movie IT: Chapter 2. One thing I really appreciated about the movie was how it used a scene’s lighting to full effect. Some scenes are brightly lit to signify the friendship among the main characters. Conversely, there are dark scenes that signify the evil Pennywise the Clown. For the movie crew, no doubt it took a lot of time and manpower to light an individual scene – especially when the movie is nearly 3 hours long.

We do the same type of light setup and management inside the Digital Production Center (DPC) when we take photos of objects like books, letters, or manuscripts. Today, I will talk specifically about how we light the bound material that comes our way, like books or booklets. This type of material is generally shot on our PhaseOne camera, so I will focus on that lighting setup today.

Before We Begin

Just turning on the lights in our camera room won’t do the trick. In order to properly light all the things that need to be shot on the PhaseOne, we use specific tools and products that you can see in the photo below.

We have 4 high-powered lights (two sets of two Buhl SoftCube SC-150 models) pointed directly in the camera’s field of view. There are two on the right and two on the left. These are stationed approximately 3.5 feet off the ground and approximately 2.5 feet away from the objects themselves. These lights are supported by Avenger A630B light stands. They allow for a wide range of movement, extension, and support if we need them.

But if bright, hot lights were pointed directly at sensitive documents for hours, they would damage them. So light diffusers are necessary. For both sets of lights, we have 3 layers of material to diffuse the light and prevent material from warping or text from fading. The first layer, attached directly to the light box itself, is a sheet of diffusion fabric; this type of material is often made from nylon or silk and is usually inexpensive.

The second diffusion layer is an FJ Westcott Scrim Jim, a similar thin fabric that is attached to a lightweight stand-up frame, the Manfrotto 156BLB. This frame can also be moved or extended if need be. The last layer is another sheet of diffusion fabric, attached to a makeshift “cube” held up by lightweight wooden rods. This cube can be picked up or carried, making it very convenient if we need to eventually move our lights.

So in total, we have 4 lights, 4 layers of diffusion fabric attached to the light boxes, two Scrim Jims, and the cube featuring 2 sides of additional diffusion fabric. After having all these items stationed, surely we can start taking pictures, right? Not yet.

Around the Room

There are still more things to be aware of, this time in the camera room itself. We gently place the materials on a cradle lined with black felt, similar to velvet. This cradle is visible in the bottom right part of the photo above. It is placed on top of a table, also covered in black felt. This is done so no background colors bounce back or reflect onto the object and change what it looks like in the final image. The walls of the camera room are also painted a neutral grey for the same reason, as you can see in the background of the above photo. Finally, any tiny reflective segments between the ceiling tiles have been blacked out with gaffer tape. Having the room this muted and intentionally dark also helps us when we have to shoot multi-spectral images. No expense has been spared to make sure our colors and photos are correct.

Camera Settings

With all these precautions in place, can we finally take photos of our materials? Almost. Before we can start photographing, we have to run some tests to make sure everything looks correct to our computers. After making sure our objects are sharp and in focus, we use a program called DTDCH (see the photo to the right) to adjust the aperture and exposure of the PhaseOne so that nothing appears too dim or too bright. In our camera room, we use a PhaseOne IQ180 with a Schneider Kreuznach Apo-Digitar lens (visible in the top-right corner of the photo above). We also use the program CaptureOne to capture, save, and export our photos.

Once the shot is in focus and appropriately bright, we check our colors against an X-Rite ColorChecker Classic card (see the photo on the left) to verify that our camera has a correct white balance. When we take a photo of the ColorChecker, CaptureOne displays a series of numbers, known as RGB values, for the colors in the photo. We check these numbers against what they should be, so we know that our photo looks accurate. If these numbers match up, we can continue. You could check our work by saving the photo on the left and opening it in a program like Adobe Photoshop.
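
As a toy illustration of that comparison (not the DPC’s actual workflow software, and with placeholder numbers rather than real ColorChecker targets), the check boils down to confirming that each measured RGB channel sits within a small tolerance of its reference value:

TOLERANCE = 3 # maximum acceptable difference per channel

def patch_matches?(measured, reference)
  measured.zip(reference).all? { |m, r| (m - r).abs <= TOLERANCE }
end

measured_white  = [242, 241, 239] # values read from the captured image
reference_white = [243, 243, 242] # published target values (placeholder numbers)

puts patch_matches?(measured_white, reference_white) ? "White balance OK" : "Adjust and reshoot"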

Finally, we have specific color profiles that the DPC uses to ensure that all our colors appear accurate as well. For more information on how we consistently calibrate the color in our images, please check out this previous blog post.

After all this setup, now we can finally shoot photos! Lighting our materials for the PhaseOne is a lot of hard work and preparation. But it is well worth it to fulfill our mission of digitizing images for preservation.

What we talk about when we talk about digital preservation

(Header image: Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark)

Here at Duke University Libraries, we often talk about digital preservation as though everyone is familiar with the various corners and implications of the phrase, but “digital preservation” is, in fact, a large and occasionally mystifying topic. What does it mean to “preserve” a digital resource for the long term? What does “the long term” even mean with regard to digital objects? How are libraries engaging in preserving our digital resources? And what are some of the best ways to ensure that your personal documents will be reusable in the future? While the answers to some of these questions are still emerging, the library can help you begin to think about good strategies for keeping your content available to other users over time by highlighting agreed-upon best practices, as well as some of the services we are able to provide to the Duke community.

File formats

Not all file formats have proven to be equally robust over time! Have you ever tried to open a document created using a Microsoft Office product from several years ago, only to be greeted with a page full of strangely encoded gibberish? Proprietary software like the products in the Office suite can be convenient and produce polished contemporary documents. But software changes, and there is often no guarantee that the beautifully formatted paper you’ve written using Word will be legible without the appropriate software 5 years down the line. One solution to this problem is to always have a version of that software available to you to use. Libraries are beginning to investigate this strategy (often using a technique called emulation) as an important piece of the digital preservation puzzle. The Emulation as a Service (EaaS) architecture is an emerging tool designed to simplify access to preserved digital assets by allowing end users to interact with the original environments running on different emulators.

An alternative to emulation is to save your files in a format that can be consumed by different, changing versions of software. Experts at cultural heritage institutions like the Library of Congress and the US National Archives and Records Administration have identified an array of file formats that they are reasonably confident the software of the future will be able to consume. Formats like plain text or PDF for textual data, delimited text files (like comma-separated values, or CSV), MP3 and MP4 for audio and video respectively, and JPEG for still images have all proven to have some measure of durability. What’s more, they will help make your content or your data more easily accessible to folks who do not have access to particular kinds of software. It can be helpful to keep these format recommendations in mind when working with your own materials.

File format migration

The formats recommended by the Library of Congress and others have been selected not only because they are interoperable with a wide variety of software applications, but also because they have proven to be relatively stable over time, resisting format obsolescence. The process of moving data from an obsolete format to one that is usable in the present day is known as file format migration or format conversion. Libraries generally have yet to establish scalable strategies for extensive migration of obsolete file formats, though it is a subject of ongoing concern.
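
As a simplified example of what a single conversion step can look like in practice, the sketch below batch-converts a folder of TIFF images to JPEG, one of the recommended still-image formats mentioned above. It assumes ImageMagick’s convert command is installed (ImageMagick 7 renames it to magick), and the folder names are placeholders:

require "open3"
require "fileutils"

FileUtils.mkdir_p("converted")

Dir.glob("originals/*.tif").each do |tiff|
  jpeg = File.join("converted", File.basename(tiff, ".tif") + ".jpg")

  # Shell out to ImageMagick to perform the actual format conversion.
  _out, err, status = Open3.capture3("convert", tiff, jpeg)
  warn "Failed to convert #{tiff}: #{err}" unless status.success?
end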

Here at DUL, we encourage the use of one of these recommended formats for content that is submitted to us for preservation, and will even convert your files prior to preservation in one of our repository platforms where possible and appropriate. This helps us ensure that your data will be usable in the future. What we can’t necessarily promise is that, should you give us content in a file format that isn’t one we recommend, a user who is interested in your materials will be able to read or otherwise use your files ten years from now. For some widely used formats, like MP3 and MP4, staff at the Libraries anticipate developing a strategy for migrating our data out of those formats in the event that they become superseded. However, the Libraries do not currently have the staff to monitor rarer, and especially proprietary, formats and convert them into ones that are immediately consumable by contemporary software. The best we can promise is that we are able to deliver to the end users of the future the same digital bits you initially gave to us.

Bit-level preservation

Which brings me to a final component of digital preservation: bit-level preservation. At DUL, we calculate a checksum for each of the files we ingest into any of our preservation repositories. Briefly, a checksum is an algorithmically derived alphanumeric hash that is intended to surface errors that may have been introduced to the file during its transmission or storage. A checksum acts somewhat like a digital fingerprint, and is periodically recalculated for each file in the repository environment by the repository software to ensure that nothing has disrupted the bits that compose each individual file. In the event that the recalculated checksum does not match the one supplied when the file was ingested into the repository, we can conclude with some level of certainty that something has gone wrong with the file, and it may be necessary to revert to an earlier version of the data. The process of generating, regenerating, and cross-checking these checksums is a way to ensure the file fixity, or file integrity, of the digital assets that DUL stewards.
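
As a minimal sketch of what a fixity check involves (file names here are placeholders, and our production repositories use their own tooling), the idea is simply to record a checksum at ingest and recompute it later:

require "digest"

def checksum(path)
  Digest::SHA256.file(path).hexdigest
end

recorded = checksum("ingested_file.tif") # stored alongside the file at ingest time

# ...later, during a scheduled fixity audit...
if checksum("ingested_file.tif") == recorded
  puts "Fixity verified: the file is bit-for-bit identical."
else
  puts "Checksum mismatch: restore this file from a preserved copy."
end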