All posts by Cory Lown

Indexing variant names from the Library of Congress Name Authority File (LCNAF) in TRLN Discovery

You might or might not have noticed a TRLN Discovery feature announcement in the February TRLN News Roundup. It mentioned that we are now indexing variant names from the Library of Congress Name Authority File in TRLN Discovery. I thought in this post I would expand on what this change means for Duke’s Books & Media catalog, add some details about the technical implementation, and discuss some related features we might add in the future based on this work.

What is it?

First, the practical matter: what does this feature mean for people who search the catalog? Our catalog records contain authoritative forms of creator names, that is, the specific form of a person's name chosen as authoritative by the Library of Congress. For example, the authoritative form of Emily Dickinson's name is "Dickinson, Emily, 1830-1886." If you search the Books & Media catalog using this form of the poet's name you will find all records associated with her name (example search with the authoritative name). Previously, if you had searched the catalog and added the poet's middle name, "Elizabeth," you would likely have missed many relevant results because "Elizabeth" is not included in the authoritative form of the name. It is, however, included in one of the variant names in the LC Name Authority File. The full list of variant names for Emily Dickinson is:

  • Dickinson, Emilia, 1830-1886
  • Dickinson, Emily Elizabeth, 1830-1886
  • Dickinson, Emily (Emily Elizabeth), 1830-1886
  • Dikinson, Ėmili, 1830-1886
  • D̲ikinson, Emily, 1830-1886
  • Ti-chin-sen, Ai-mi-li, 1830-1886
  • דיקינסון, אמילי, 1830־1886
  • דיקינסון, אמילי, 1886־1830
  • Dykinsan, Ėmili, 1830-1886

Emily Dickinson

Since we now index these forms in TRLN Discovery, you get much better results if you happen to add Emily Dickinson's middle name to your search (example search including a variant form of the name). Additionally, various romanizations and vernacular forms are indexed (example search for "דיקינסון, אמילי").

If you clicked through to the example searches you may have noticed that the result counts and result order are slightly different when searching the authoritative form vs. the variant forms.

The variant forms are indexed only on records that include a URI referencing the LC Name Authority File; if this URI reference is missing, the variant names are not indexed for that record. Additionally, some records may not have been updated since we implemented this feature. In time, all records that include URIs for names will have variant names indexed.

The difference in result order is due to how the variant names are indexed. For the authoritative form of the name we distinguish between creators, editors, contributors, etc., and give matches in these categories different boosts in the relevance ranking. At the moment, the variant names from the LCNAF are indexed in a single field, so we lose the nuance needed for more granular relevance ranking. This is something that could be revised in the future if needed.

How does it work?

This feature relies on the fact that our MARC records include URI references to the LC Name Authority File. As an example, here’s a MARC XML 100 Main Entry-Personal Name field for Emily Dickinson with a URI reference to the authority file.

<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Dickinson, Emily,</subfield>
<subfield code="d">1830-1886.</subfield>
<subfield code="0">http://id.loc.gov/authorities/names/n79054166</subfield>
</datafield>

We store this URI reference in the TRLN Discovery name field and then use it at ingest time to look up and index the variant names from a local cache. Here's the stored name for Emily Dickinson in the TRLN Discovery index.

names_a: ["{\"name\":\"Dickinson, Emily, 1830-1886\",\"rel\":\"author\",\"type\":\"creator\",\"id\":\"http://id.loc.gov/authorities/names/n79054166\"}"]

The TRLN Discovery ingest service keeps its own cache of the name identifiers and variant names for efficient lookup at ingest time. We use Redis, an open-source, in-memory (very fast) data store, to make the variant names available when records are ingested. This local cache is built from the LC Name Authority File. Since the name authority file changes over time, we will refresh our local cache of the data every three months to keep it up to date. We've written a script (a Rails Rake task) that automates this update process.
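
To make this more concrete, here is a minimal sketch of what the lookup might look like at ingest time. This is not the actual ingest service code; the key naming scheme and the variant_names_for helper are assumptions for illustration.

require 'redis'
require 'json'

# Connect to the local Redis cache of LCNAF variant names.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

# Return the cached variant names for a name authority URI, or an empty
# list if the URI is not in the cache. (The key scheme is hypothetical.)
def variant_names_for(redis, lcnaf_uri)
  cached = redis.get("lcnaf:variants:#{lcnaf_uri}")
  cached ? JSON.parse(cached) : []
end

# The URI stored with the name (see the names_a example above) is the key.
variant_names_for(redis, 'http://id.loc.gov/authorities/names/n79054166')
# => e.g. ["Dickinson, Emily Elizabeth, 1830-1886", "Dikinson, Ėmili, 1830-1886", ...]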

What’s next?

The addition of stored name authority URIs in the TRLN Discovery index opens up opportunities to add more features in the future. I’m especially interested in displaying more contextual information about creators in our catalog. We could also expose “See also” references from the authority files to make it easier to find works by the same person published under different names (“Twain, Mark, 1835-1910” being a good example):

  • Clemens, Samuel Langhorne, 1835-1910
  • Conte, Louis de, 1835-1910
  • Snodgrass, Quintus Curtius, 1835-1910

Mark Twain

As always, we continue to add features and make incremental improvements to TRLN Discovery, and your feedback is critical. Please let us know how things are working for you using the feedback form available on every page of the Books & Media Catalog.

EDTF-Humanize 2.0 with Improved Internationalization Support

About four years ago we released a small Ruby gem (EDTF-Humanize) to generate human-readable dates from Extended Date Time Format (EDTF) dates. For some background on our use of the EDTF standard, please see our previous blog posts on the topic: EDTF-Humanize, Enjoy your Metadata: Fun with Date Encoding, and It's Date Night Here at Digital Projects and Production Services.

Some recent community contributions to the gem, as well as some extra time as we transition from one work cycle to another, provided an opportunity for maintenance and refinement of EDTF-Humanize. The primary improvement is better support for languages other than English via Ruby I18n locale configuration files and a language-specific module override pattern. Support for French is now included, and support for other languages may be added following the same approach as French.
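
As a quick illustration, here is a minimal usage sketch. It assumes the humanize method the gem adds to EDTF date objects and that the gem's bundled locale files are on the I18n load path (as they are in a Rails app); the English output shown is approximate.

require 'edtf'
require 'edtf-humanize'

date = Date.edtf('1984~') # an approximate year in EDTF

I18n.locale = :en
date.humanize # => something like "circa 1984"

I18n.locale = :fr
date.humanize # => "1984 environ" (per the French locale file below)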

The primary means of adding additional languages to EDTF-Humanize is to add a translation file to config/locales/. This is the translation file included to support French:

fr:
  date:
    day_names: [Dimanche, Lundi, Mardi, Mercredi, Jeudi, Vendredi, Samedi]
    abbr_day_names: [Dim, Lun, Mar, Mer, Jeu, Ven, Sam]
    # Don't forget the nil at the beginning; there's no such thing as a 0th month
    month_names: [~, Janvier, Février, Mars, Avril, Mai, Juin, Juillet, Août, Septembre, Octobre, Novembre, Decembre]
    abbr_month_names: [~, Jan, Fev, Mar, Avr, Mai, Jun, Jul, Aou, Sep, Oct, Nov, Dec]
    seasons:
      spring: "printemps"
      summer: "été"
      autumn: "automne"
      winter: "hiver"
  edtf:
    terms:
      approximate_date_prefix_day: ""
      approximate_date_prefix_month: ""
      approximate_date_prefix_year: ""
      approximate_date_suffix_day: " environ"
      approximate_date_suffix_month: " environ"
      approximate_date_suffix_year: " environ"
      decade_prefix: "Les années "
      decade_suffix: ""
      century_suffix: ""
      interval_prefix_day: "Du "
      interval_prefix_month: "De "
      interval_prefix_year: "De "
      interval_connector_approximate: " à "
      interval_connector_open: " à "
      interval_connector_day: " au "
      interval_connector_month: " à "
      interval_connector_year: " à "
      interval_unspecified_suffix: "s"
      open_start_interval_with_day: "Jusqu'au %{date}"
      open_start_interval_with_month: "Jusqu'en %{date}"
      open_start_interval_with_year: "Jusqu'en %{date}"
      open_end_interval_with_day: "Depuis le %{date}"
      open_end_interval_with_month: "Depuis %{date}"
      open_end_interval_with_year: "Depuis %{date}"
      set_dates_connector_exclusive: ", "
      set_dates_connector_inclusive: ", "
      set_earlier_prefix_exclusive: 'Le ou avant '
      set_earlier_prefix_inclusive: 'Le et avant '
      set_last_date_connector_exclusive: " ou "
      set_last_date_connector_inclusive: " et "
      set_later_prefix_exclusive: 'Le ou après '
      set_later_prefix_inclusive: 'Le et après '
      set_two_dates_connector_exclusive: " ou "
      set_two_dates_connector_inclusive: " et "
      uncertain_date_suffix: "?"
      unknown: 'Inconnue'
      unspecified_digit_substitute: "x"
    formats:
      day_precision_strftime_format: "%-d %B %Y"
      month_precision_strftime_format: "%B %Y"
      year_precision_strftime_format: "%Y"

In addition to the translation file, the methods used to construct the human-readable string for each EDTF date object type may be completely overridden for a language if needed. For instance, when the date object is an instance of EDTF::Century, French uses a different method from the default to construct the humanized form. This override is accomplished by adding a language module for French that includes the Default module and also includes a Century module that overrides the default behavior. The override is shown here (minus the internals of the humanizer method) as an example:

# lib/edtf/humanize/language/french.rb
module Edtf
  module Humanize
    module Language
      module French
        include Default
        module Century
          extend self

          def humanizer(date)
            # Special French handling for EDTF::Century
          end
        end
      end
    end
  end
end

EDTF-Humanize version 2.0.0 is available on rubygems.org and on GitHub. Documentation is available on GitHub. Pull requests are welcome; I’m especially interested in contributions to add support for languages in addition to English and French.

The New Books & Media Catalog Turns One

It's been just over a year since we launched our new catalog in January of 2019. Since then we've made improvements to features, performance, and reliability, have developed a long-term governance and development strategy, and have plans for future features and enhancements.

During the Spring 2019 semester we experienced a number of outages of the Solr index that powers the new catalog. Tracking down the root cause of the outages proved to be both frustrating and difficult. We took a number of measures to reduce the risk of bot traffic slowing down or crashing the index, including limiting facet paging to 50 pages and results paging to 250 pages, as well as setting limits on OpenSearch queries. We also added service monitoring, so we are automatically alerted when things go awry, and automatic restarts under some known bad system conditions. In addition, we identified a bug in the version of Solr we were running that made it vulnerable to crashes from queries with particular characteristics, and we have since applied a patch to Solr to address it. Happily, the index has not crashed since we implemented these protective measures and bug fixes.

Over the past year we’ve made a number of other improvements to the catalog including:

  • Caching of the home page and advanced search page has reduced page load times by 75%.
  • Subject searches are now more precise and do not include stemmed terms.
  • CDs and DVDs can be searched by accession number.
  • When digitized copies of Duke material are available at the Internet Archive, links to the digital copy are automatically generated.
  • Records can be saved to a bookmarks list and shared with others via a stable URL.
  • Eligible records now have a “Request digitization” link.
  • Many other small improvements and bug fixes.

We sometimes get requests for features that the catalog already supports.

While development has slowed, the core TRLN team meets monthly to discuss and prioritize new features and fixes, and dedicates time each month to maintenance and new development. We have a number of small improvements and bug fixes in the works. One new feature we’re working on is adding a citation generator that will provide copyable citations in multiple formats (APA, MLA, Chicago, Harvard, and Turabian) for records with OCLC numbers.

We welcome, read, and respond to all feedback and feature requests that come to us through the catalog’s feedback form. Let us know what you think.

Check out “Search Tips” and “Expert Search Tips” for detailed information about how to get the most out of the new catalog.

What happens when you click “Search?”

How many times each day do you type something into a search box on the web and click "Search"? Have you ever wondered what happens behind the scenes to make this possible? In this post I'll show how search works on the Duke University Libraries Catalog. I'll trace the journey from metadata in a MARC record (where our bibliographic data is stored), to transforming that data into something we can index for searching, to how the words you type into the search box are transformed, and finally to how the indexed records and your search interact to produce a relevance-ranked list of search results. Let's get into the weeds!

A MARC record stores bibliographic data that we purchase from vendors or that is created by metadata specialists who work at Duke Libraries.

In an attempt to keep this simple, let's focus on just the main title of the record. This information is recorded in the MARC record's 245 field in subfields a, b, f, g, h, k, n, p, and s. I'm not going to explain what each of the subfields is for, but the Library of Congress maintains extensive documentation about MARC field specifications (see 245 – Title Statement (NR)). Here is an example of a MARC 245 field with a linked 880 field that contains the equivalent title in an alternate script (just to keep things interesting).

=245 10$6880-02$aUrbilder ;$bBlossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet /$cToshio Hosokawa.
=880 10$6245-02/{dollar}1$a原像 ;$b開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための /$c細川俊夫.

The first thing that has to happen is getting the data out of the MARC record and into a more computer-friendly format: an array of hashes, which is just a fancy way of saying a list of key-value pairs. The software reads the metadata from the MARC 245 field, joins all the subfields together, and cleans up some punctuation. The software also checks to see whether the title field contains Arabic, Chinese, Japanese, Korean, or Cyrillic characters, which have to be handled separately from Roman-character languages. From the MARC 245 field and its linked 880 field we end up with the following data structure.

"title_main": [
{
"value": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"
},
{
"value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",
"lang": "cjk"
}
]
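
To make the extraction step a bit more concrete, here is a greatly simplified sketch using the ruby-marc gem. The helper names are made up for illustration, and the real transformation code handles subfield selection, punctuation cleanup, and script detection much more carefully.

require 'marc'

TITLE_SUBFIELDS = %w[a b f g h k n p s].freeze

# Join a field's title subfields and trim trailing punctuation.
def title_value(field)
  field.subfields
       .select { |sf| TITLE_SUBFIELDS.include?(sf.code) }
       .map(&:value)
       .join(' ')
       .sub(%r{\s*[/:;,.]\s*\z}, '')
end

# Build the title_main structure from a MARC::Record.
def main_title(record)
  field = record['245']
  return [] unless field

  titles = [{ 'value' => title_value(field) }]

  # A linked 880 field carries the same title in its original script.
  record.fields('880').each do |f880|
    next unless f880['6'].to_s.start_with?('245')
    titles << { 'value' => title_value(f880), 'lang' => 'cjk' } # script check simplified
  end

  titles
end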

We send this data off to an ingest service that prepares the metadata for indexing.

The data is first expanded to multiple fields.

{"title_main_indexed": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_vernacular_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_vernacular_lang": "cjk",

"title_main_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"}

  • title_main_indexed will be indexed for searching.
  • title_main_vernacular_value holds the non-Roman version of the title to be indexed for searching.
  • title_main_vernacular_lang holds information about the character set stored in title_main_vernacular_value.
  • title_main_value holds the data that will be stored for display purposes in the catalog user interface.

We take this flattened, expanded set of fields and apply a set of rules to prepare the data for the indexer (Solr). These rules append suffixes to each field and combine the two vernacular fields to produce the following field value pairs. The suffixes provide instructions to the indexer about what should be done with each field.

{"title_main_indexed_tsearchtp": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_cjk_v": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_t_stored_single": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet" }

When sent to the indexer the fields are further transformed.

| Suffixed Source Field | Solr Field | Solr Field Type | Solr Stored/Indexed Values |
|---|---|---|---|
| title_main_indexed_tsearchtp | title_main_indexed_t | text, stemmed | urbild blossom kalligraphi o mensch bewein dein sund gross arrang for string quartet |
| title_main_indexed_tsearchtp | title_main_indexed_tp | text, unstemmed | urbilder blossoming kalligraphie o mensch bewein dein sunde gross arrangement for string quartet |
| title_main_cjk_v | title_main_cjk_v | Chinese, Japanese, Korean text | 原 像 开花 书 か り く ら ふ ぃ い ほか 弦乐 亖 重奏 の ため の |
| title_main_t_stored_single | title_main | stored string | 原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein’ dein’ Sünde gross (Arrangement) : for string quartet |

These are all index time transformations. They occur when we send records into the index.
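
As an aside, the stemmed and folded values in the table come from the analysis chain configured for each Solr field type. A simplified sketch of the kind of field type that could produce values like "sund" and "blossom" might look like the following; this is illustrative only, not our actual schema, and the name text_stemmed is made up.

<fieldType name="text_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the title into tokens on whitespace and punctuation -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- fold case and diacritics, e.g. "Sünde" becomes "sunde" -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- stem English words, e.g. "blossoming" becomes "blossom" -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>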

The query you enter into the search box also gets transformed in different ways and then compared to the indexed fields above. These are query time transformations. As an example, if I search for the terms “Urbilder Blossom Kalligraphie,” the following transformations and comparisons take place:

The values stored in the records for title_main_indexed_t are evaluated against my search string transformed to urbild blossom kalligraphi.

The values stored in the records for title_main_indexed_tp are evaluated against my search string transformed to urbilder blossom kalligraphie.

The values stored in the records for title_main_cjk_v are evaluated against my search string transformed to urbilder blossom kalligraphie.

Then Solr does some calculations based on relevance rules we configure to determine which documents are matches and how closely they match (signified by the relevance score calculated by Solr). The field value comparisons end up looking like this under the hood in Solr:

+(DisjunctionMaxQuery((
(title_main_cjk_v:urbilder)^50.0 |
(title_main_indexed_tp:urbilder)^500.0 |
(title_main_indexed_t:urbild)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:blossom)^50.0 |
(title_main_indexed_tp:blossom)^500.0 |
(title_main_indexed_t:blossom)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:kalligraphie)^50.0 |
(title_main_indexed_tp:kalligraphie)^500.0 |
(title_main_indexed_t:kalligraphi)^100.0)~1.0))~3
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^150.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^600.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^5000.0)~1.0)
(DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom")^75.0 |
(title_main_indexed_t:"urbild blossom")^200.0 |
(title_main_indexed_tp:"urbilder blossom")^1000.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:"blossom kalligraphie")^75.0 |
(title_main_indexed_t:"blossom kalligraphi")^200.0 |
(title_main_indexed_tp:"blossom kalligraphie")^1000.0)~1.0))
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^100.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^350.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^3000.0)~1.0)

The ^nnnn indicates the relevance weight given to any matches in that field. The ~1.0 inside each DisjunctionMaxQuery is a tie-breaker that controls how much matches in lower-scoring fields add to the score, and the ~3 after the first group indicates how many of those clauses must match for the document to be considered a hit. Matches in fields with higher boosts count more than matches in fields with lower boosts. You might notice another thing: full phrase matches are boosted the most, two consecutive term matches are boosted slightly less, and individual term matches are given the least boost. Furthermore, unstemmed field matches (those that have been modified the least by the indexer, such as the field title_main_indexed_tp) get more boost than stemmed field matches. This provides the best of both worlds: you still get a match if you search for "blossom" instead of "blossoming," but if you had searched for "blossoming" the exact term match would boost the score of the document in the results. Solr also considers how common a term is among all documents in the index, so that very common words like "the" don't boost the relevance score as much as less common words like "kalligraphie."
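
For reference, boosts like these are typically configured through eDisMax query parameters. Here is a simplified sketch of the kind of settings that could produce the clauses above; it is not our exact configuration, and mm is shown as 3 only because the example query has three terms.

defType=edismax
mm=3
tie=1.0
qf=title_main_indexed_tp^500 title_main_indexed_t^100 title_main_cjk_v^50
pf=title_main_indexed_tp^5000 title_main_indexed_t^600 title_main_cjk_v^150
pf2=title_main_indexed_tp^1000 title_main_indexed_t^200 title_main_cjk_v^75
pf3=title_main_indexed_tp^3000 title_main_indexed_t^350 title_main_cjk_v^100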

I hope this provides some insight into what happens when you click search. Happy searching.

A collaborative approach to developing a new Duke Libraries catalog

Post contributed by: Emily Daly, Thomas Crichlow, and Cory Lown

If you’re a frequent or even casual user of the Duke Libraries catalog, you’ve probably noticed that it’s remained remarkably consistent over the last decade. Consistency can be a good thing, but there is certainly room for improvement in the Duke Libraries catalog, and staff from the libraries at Duke, UNC, and NCSU are excited to replace the current catalog’s aging infrastructure and outdated user interface with an entirely new collaboratively developed open-source discovery layer. While many things are changing, one key feature will remain the same: The catalog will continue to allow users to locate and access materials not only here at Duke but also across the other Triangle Research Libraries member libraries (NCSU, NCCU, UNC).

Users will be able to search for items in the Duke Libraries catalog and then expand to see books and items from NCSU, NCCU, and UNC if they wish.

Commitment to collaboration

In addition to an entirely new central index that supports institutional and consortial searching, the new catalog benefits from a shared, centrally developed codebase as well as locally hosted, customizable catalog interfaces. Perhaps most notably, the new catalog has been built with the needs of libraries and complex bibliographic data in mind. While the software used for the current library catalog has evolved and grown in complexity to support e-commerce and business needs (not higher ed or library needs), the library software development community has been hard at work building specialized discovery layers using the open-source Blacklight framework. Peer institutions including Stanford, Cornell, and Princeton are already using Blacklight for their library catalogs, and there is an active Blacklight development community that Duke is excited to be a part of. Being part of this community enables us to build on the good work already in place in other library catalogs, including more intuitive facets, adaptive linking for subjects and other fields, a more responsive user interface for access via tablets and phones, and the ability to preserve the order of MARC fields when it’s useful to researchers (MARC is an international standard for representing bibliographic and related data).

We’re upping our collaboration game locally, too: This project has given us the opportunity to develop a new model for collaborative software development. Rather than reinvent the wheel at each Triangle Research Library, we’re combining effort and expertise to develop a feature-rich yet highly customizable discovery layer that will serve the needs of researchers across the triangle. To do this, we have adopted an agile project management process with talented developers and dedicated product owners from NCSU, UNC, and Duke. The agile approach has helped us be more productive and efficient during the development phase and increased collaboration across the four Triangle Research Libraries, positioning us well for maintaining and governing the catalog after we go live.

This image depicts the structure of the development team that was formed in May 2017 to collaboratively build the new library catalog.

What’s next?

The development team has already conducted multiple rounds of user testing and made changes to the user interface based on findings. We’re now ready to hear feedback from library staff. To facilitate this, we’ll be launching the Duke instance of the catalog to all library staff next Wednesday, August 1. We encourage staff to explore catalog features and records and then report feedback, providing screenshots, URLs, and other details as needed. We’ll continue user testing this fall and solicit extensive feedback from faculty, students, staff, and general researchers.

Our plan (fingers crossed!) is to officially launch the new Duke Libraries catalog to all users in early 2019, perhaps as soon as the start of the spring semester. A local implementation team is already at work to be sure we’re ready to replace Duke’s old catalog with the new and improved version early next year. Meanwhile, development and interface enhancement of the catalog will continue this fall. While we are pleased with what we’ve accomplished over the last 18 months, there is still significant work to be done before we’ll be ready to go live. Here are a few items on the lengthy TO DO list:

  • finish loading the 16 million records from all four Triangle Research libraries
  • integrate Duke’s request workflows so users can request items they discover in the new catalog
  • develop a robust Advanced Search interface in response to user demand
  • tune relevance ranking
  • ensure that non-Roman scripts are searchable and display correctly
  • map non-MARC metadata so items such as digital collections records are discoverable

Effective search and display of non-Roman scripts is just one of the many items left on our list before we launch the library catalog to the public.

There is a lot of work ahead to be sure, but what we will launch to staff next week is a functional catalog with nearly 10 million records, and that number is increasing by the day. We invite you to take the new catalog for a spin and tell us what you think so we can make improvements and be ready for all researchers in just a few short months.

Fun with Solr Queries

Apache Solr is behind many of our systems that provide a way to search and browse via a web application (such as the Duke Digital Repository, parts of our Bento search application, and the not yet public next generation TRLN Discovery catalog). It’s a tool for indexing data and provides a powerful query API. In this post I will document a few Solr querying techniques that might be interesting or useful. In some cases I won’t be able to provide live links to queries because we restrict direct access to Solr. However, many of these Solr querying techniques can be used directly in an application’s search box. In those cases, I will include live links to example queries in the Duke Digital Repository.

Find a list of items from their identifiers.

With this query you can specify exactly what items you want to appear in a search result from a list of identifiers.

Query

id:"duke:448098" OR id:"duke:282429" OR id:"duke:142581"
Try it in the Duke Digital Repository

Find all records that have a value (any value) in a specific field.

This query will find all the items in the repository that have a value in the product field. (As with most of these queries, you must know the field name in Solr.)

Query

product_tesim:*
Try it in the Duke Digital Repository

Find all the items in the repository that are missing a field value.

You can find all items in the repository that don’t have any date metadata. Inquiring minds want to know.

Query

-date_tesim:[* TO *]
Try it in the Duke Digital Repository

Find items using a begins-with (left-anchored) query.

I want to see all items that have a subject term that begins with “Soviet Union.” The example is a left-anchored query and will exactly match fields that begin with “Soviet Union.” (Note, the field must not be tokenized for this to work as expected.)

Query

subject_facet_sim:/Soviet Union.*/
Try it in the Duke Digital Repository

Find items with an ends-with (right-anchored) query.

Again, this will only work as expected with an untokenized field.

Query

subject_facet_sim:/.*20th century/
Try it in the Duke Digital Repository

Some of you might have noticed that these queries look a lot like regular expressions. And you’re right! Read more about Solr’s support for regular expression queries.

The following examples require direct access to Solr, which is restricted to authorized users and applications. Instead of providing live links, I’ll show the basic syntax, a complete example query using http://localhost:8983/solr/core/* as the sample URL for a Solr index, and a sample response from Solr.

Count instances of values in a field.

I want to know how many items in the repository have a workflow state of published and how many are unpublished. To do that I can write a facet query that will count instances of each value in the specified field. (This is another query that will only work as expected with an untokenized field.)

Query

http://localhost:8983/solr/core/select?q=*:*&facet=true&facet.field=workflow_state_ssi&facet.mincount=1&fl=id

Solr Response (truncated)


...
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="workflow_state_ssi">
<int name="published">484075</int>
<int name="unpublished">2228</int>
</lst>
</lst>
</lst>
...

Collapse multiple records into one result based on a shared field value.

This one is somewhat advanced and likely only useful in particular circumstances. But if you have multiple records that are slight variants of each other and want to collapse the variants down to a single result, you can do that with a collapse query, as long as the records you want to collapse share a value.

Query

http://localhost:8983/solr/core/select?q=*:*&fq={!collapse%20field=oclc_number%20nullPolicy=expand%20max=termfreq(institution_f,duke)}

  • !collapse instructs Solr to use the Collapsing Query Parser.
  • field=oclc_number instructs Solr to collapse records that share the same value in the oclc_number field.
  • nullPolicy=expand instructs Solr to include in the result set any document that has no value in the oclc_number field. If this is omitted, records without an oclc_number will be excluded from the results.
  • max=termfreq(institution_f,duke) instructs Solr to select as the representative record, when collapsing multiple records, the one with the highest frequency of the value "duke" in the institution_f field.

CSV response writer (or JSON, Ruby, etc.)

Solr has a number of tricks up its sleeve when it comes to returning results. By default it will return results as XML. You can also specify JSON, or Ruby. You specify a response writer by adding the wt parameter to the URL (wt=json or wt=ruby, etc.).

Solr will also return results as a CSV file, which can then be opened in an Excel spreadsheet — a useful feature for working with metadata.

Query

http://localhost:8983/solr/core/select?q=sun&wt=csv&fl=id,title_tesim

Solr Response

id,title_tesim
duke:194006,Sun Bowl...Sun City...
duke:194002,Sun Bowl...Sun City...
duke:194009,Sun Bowl...Sun City.
duke:194019,Sun Bowl...Sun City.
duke:194037,"Sun City\, Sun Bowl"
duke:194036,"Sun City\, Sun Bowl"
duke:194030,Sun City
duke:194073,Sun City
duke:335601,Sun Control
duke:355105,Proved! Fast starts at 30° below zero!

This is just a small sample of useful ways you can query Solr.

Change is afoot in Software Development and Integration Services

We’re experimenting with changing our approach to projects in Software Development and Integration Services (SDIS). There’s been much talk of Agile (see the Agile Manifesto) over the past few years within our department, but we’ve faced challenges implementing this as an approach to our work given our broad portfolio, relatively small team, and large number of internal stakeholders.

After some productive conversations among staff and managers in SDIS, where we reflected on our work over the past few years, we decided to commit to applying the Scrum framework to one or more projects.

Scrum Framework
Source: https://commons.wikimedia.org/wiki/File:Scrum_Framework.png

There are many resources available for learning about Agile and Scrum; several of them have been useful to me so far in learning about the framework.

Scrum seems best suited to developing new products or software and defines the roles, workflow, and artifacts that help a team make the most of its capacity to build the highest value features first and deliver usable software on a regular and frequent schedule.

To start, we'll be applying this process to a new project to build a prototype of a research data repository based on Hyrax. We've formed a small team, including a product owner, scrum master, and development team, to build the repository. So far, we've developed an initial backlog of requirements in the form of user stories in Jira, the software we use to manage projects. We've done some backlog refinement to prioritize the most important and highest-value features, and defined acceptance criteria for the ones that we'll consider first. The development team has estimated the story points (a relative estimate of effort and complexity) for some of the user stories to help us with sprint planning and release projection. Our first two-week sprint will begin the week after Thanksgiving. By the end of January we expect to have completed four two-week sprints and have a pilot ready with a basic set of features implemented for evaluation by internal stakeholders.

One of the important aspects of Scrum is that group reflection on the process itself is built into the workflow through retrospective meetings after each sprint. Done right, routine retrospectives serve to reinforce what is working well and allow for adjustments to address things that aren't. In the future we hope to adapt what we learn from applying the Scrum framework to the research data repository pilot to improve our approach to other aspects of our work in SDIS.

Nested Folders of Files in the Duke Digital Repository

Born-digital archival materials present unique challenges to representation, access, and discovery in the DDR. A hard drive arrives at the archives, and we want to preserve and provide access to the files. In addition to the content of the files, it's often important to preserve, to some degree, the organization of the material on the hard drive in nested directories.

One challenge to representing complex inter-object relationships in the repository is the repository’s relatively simple object model. A collection contains one or more items. An item contains one or more components. And a component has one or more data streams. There’s no accommodation in this model for complex groups and hierarchies of items. We tend to talk about this as a limitation, but it also makes it possible to provide search and discovery of a wide range of kinds and arrangements of materials in a single repository and forces us to make decisions about how to model collections in sustainable and consistent ways. But we still need to preserve and provide access to the original structure of the material.

One approach is to ingest the disk image or a zip archive of the directories and files and store the content as a single file in the repository. This approach is straightforward, but makes it impossible to search for individual files in the repository or to understand much about the content without first downloading and unarchiving it.

As a first pass at solving this problem of how to preserve and represent files in nested directories in the DDR, we've taken a two-pronged approach. We will use a simple approach to modeling disk image and directory content in the repository: every file is modeled in the repository as an item with a single component that contains the data stream of the file. This provides convenient discovery and access to each individual file from the collection in the DDR, but does not represent any folder hierarchies. The files are just a flat list of objects contained by a collection.

To preserve information about the structure of the files, we add an XML METS structMap as metadata on the collection. In addition, we store on each item a metadata field that records the complete original file path of the file.

Below is a small sample of the kind of structural metadata that encodes the nested folder information on the collection. It encodes the structure and nesting, directory names (in the LABEL attribute), the order of files and directories, as well as the identifiers for each of the files/items in the collection.

<?xml version="1.0"?>
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>REPOSITORY DEFAULT</name>
    </agent>
  </metsHdr>
  <structMap TYPE="default">
    <div LABEL="2017-0040" ORDER="1" TYPE="Directory">
      <div ORDER="1">
        <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk42j6qc37"/>
      </div>
      <div LABEL="RL11405-LFF-0001_Programs" ORDER="2" TYPE="Directory">
        <div ORDER="1">
          <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk4j67r45s"/>
        </div>
        <div ORDER="2">
          <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk4d50x529"/>
        </div>
        <div ORDER="3">
          <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk4086jd3r"/>
        </div>
      </div>
      <div LABEL="RL11405-LFF-0002_H1_Early-Records-of-Decentralization-Conference" ORDER="3" TYPE="Directory">
        <div ORDER="1">
          <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk4697f56f"/>
        </div>
        <div ORDER="2">
          <mptr LOCTYPE="ARK" xlink:href="ark:/99999/fk45h7t22s"/>
        </div>
      </div>
    </div>
  </structMap>
</mets>

Combining the 1:1 (item:component) object model with structural metadata that preserves the original directory structure of the files on the file system enables us to display a user interface that reflects the original structure of the content even though the structure of the items in the repository is flat.

There's more to it, of course. We had to develop a new ingest process that could take as its starting point a file path and then crawl it and its subdirectories to ingest the files and construct the necessary structural metadata.
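
As a rough sketch, and not the actual DDR ingest code, the crawling step might look something like this; the crawl helper and the ingest path are made up for illustration, and the real process also creates repository objects and the structMap along the way.

require 'find'

# Walk a directory tree and collect every file and subdirectory along
# with its path relative to the ingest root.
def crawl(root)
  entries = []
  Find.find(root) do |path|
    next if path == root
    entries << {
      path: path,
      relative_path: path.delete_prefix("#{root}/"),
      directory: File.directory?(path)
    }
  end
  entries
end

# Files become items in the repository; the relative paths and the
# directory entries provide the nesting encoded in the structMap above.
crawl('/ingest/2017-0040')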

On the UI end of things a nifty Javascript plugin called jsTree powers the interactive directory structure display on the collection page.

Because some of the collections are very large and loading a directory tree structure of 100,000 or more items would be very slow, we implemented a small web service in the application that loads the jsTree data only when someone clicks to open a directory in the interface.
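
jsTree supports this kind of lazy loading through the shape of the JSON returned for each node: a node whose children property is simply true is displayed as expandable, and its real children are requested only when it is opened. Here is a toy sketch of building that structure with made-up data; it is not the DDR's actual web service code.

require 'json'

# A few entries of the kind the crawl step produces (illustrative only).
entries = [
  { relative_path: 'RL11405-LFF-0001_Programs', directory: true },
  { relative_path: 'RL11405-LFF-0001_Programs/program-01.pdf', directory: false },
  { relative_path: 'readme.txt', directory: false }
]

# Return only the immediate children of one directory in jsTree's format.
def jstree_children(entries, parent)
  entries
    .select { |e| File.dirname(e[:relative_path]) == parent }
    .map do |e|
      {
        id: e[:relative_path],
        text: File.basename(e[:relative_path]),
        children: e[:directory] # true tells jsTree to fetch children on open
      }
    end
end

# '.' is the top level; opening a directory would request its own children.
puts JSON.pretty_generate(jstree_children(entries, '.'))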

The file paths are also keyword searchable from within the public interface. So if a file is stored at the path "kitchen/fruits/bananas/this-banana.txt" you would be able to find the file this-banana.txt by searching for "kitchen," "fruit," or "banana."

This new functionality to ingest, preserve, and represent files in nested folder structures in the Duke Digital Repository will be included in the September release of the Duke Digital Repository.

An Update on TRLN Discovery: A New Catalog for the Triangle Research Libraries Network.

We are still many months (well, OK, a year or more) away from unplugging the servers that keep the Endeca-powered Search TRLN catalog and our local catalogs alive. But in the meantime work is well underway on its replacement. For those who don't know, the libraries of Duke, UNC Chapel Hill, NC State, and NCCU have for many years maintained a shared catalog to make resources from all Triangle research libraries easily available to our communities. We all push our records to a centralized data pipeline that indexes our data in Endeca, and we even share much of the code that runs our local catalog user interfaces. While the browsing and searching capabilities of Endeca were innovative for library catalogs at the time, the rest of the library world has caught up, and there are now numerous open source solutions for providing convenient and powerful search and browse access to our holdings.

Some goals for the replacement for the Endeca based shared catalog system include:

  • Maintain functionality of the current catalog, including search & browse features.
  • Architect the new system so that it’s easier for staff to troubleshoot problems and test changes to the data pipeline.
  • Use open source tools already in common use by peer libraries to take advantage of community effort and knowledge.
  • Develop a platform that makes it easy for each institution to host their own copy of the catalog UI, that takes advantage of shared features and code, while allowing for local customization where needed.
  • Provide a centralized data service and index that can process and host the 12 million or so records from the member institutions and that will provide the back-end index for all local catalogs. (NCCU will use a separate vended solution for their catalog.)

The large team of librarians and staff from across the Research Triangle libraries has been hard at work planning for the new system, and many of its characteristics are becoming clear:

  • We will use Solr as the replacement for Endeca. Solr will store and index our collective holdings and provide search and browse access to them for our catalog applications.
  • Blacklight, a UI built to search and browse a Solr index, will provide the basis for our local catalogs. Blacklight is used by Stanford, Cornell, and many other libraries to provide the user interface to their catalogs.
  • We will build a Rails Engine that will add to Blacklight any customizations needed to support the consortial catalog and any additional features that the institutions want to share. This engine will make it easy to build a new Blacklight based catalog that will work with the TRLN Solr Index.
  • The data pipeline, Solr index and Catalog UI will be packaged in such a way that staff can run a near copy of the production system locally to develop and test changes.

What we have so far:

  • Scripts built with Traject that will take our MARC XML or MARC binary records and transform them into an intermediate JSON format (a rough sketch of this approach appears after this list).
  • Scripts that take the JSON data and index the data in Solr.
  • The beginnings of a Rails engine that will form the basis of our Blacklight catalog user interfaces.
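
To give a flavor of the Traject-based approach mentioned above, here is a minimal sketch of a configuration in that style. It is illustrative only, not the actual TRLN Discovery transformation code, and the output field names are assumptions.

# traject_config.rb -- an illustrative sketch, not the real TRLN config.
settings do
  provide 'marc_source.type', 'xml'                  # read MARC XML input
  provide 'writer_class_name', 'Traject::JsonWriter' # emit one JSON object per record
end

to_field 'id', extract_marc('001', first: true)
to_field 'title_main', extract_marc('245abfghknps', trim_punctuation: true)

# Run with: traject -c traject_config.rb records.xml > records.json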

The technical implementation team is using the TRLN GitHub account to collaborate on development, and much work-in-progress code is posted there. There is still plenty of work left to do, a long list of requirements to accommodate and many unanswered questions, but the project is well underway to build the next shared catalog for the Triangle Research Libraries.

Blacklight Summit 2016

Last week I traveled to lovely Princeton, NJ to attend Blacklight Summit. For the second year in a row a smallish group of developers who use or work on Project Blacklight met to talk about our work and learn from each other.

Blacklight is an open source project written in Ruby on Rails that serves as a discovery interface over a Lucene Solr search index. It’s commonly used to build library catalogs, but is generally agnostic about the source and type of the data you want to search. It was even used to help reporters explore the leaked Panama Papers.
At Duke we’re using Blacklight as the public interface to our digital repository. Metadata about repository objects are indexed in Solr and we use Blacklight (with a lot of customizations) to provide access to digital collections, including images, audio, and video. Some of the collections include: Gary Monroe Photographs, J. Walter Thompson Ford Advertisements, and Duke Chapel Recordings, among many others.

Blacklight has also been selected to replace the aging Endeca based catalog that provides search across the TRLN libraries. Expect to hear more information about this project in the future.
Blacklight Summit is more of an unconference meeting than a conference, with a relatively small number of participants. It’s a great chance to learn and talk about common problems and interests with library developers from other institutions.

I'm going to give a brief overview of some of what we talked about and did during the two-and-a-half-day meeting and provide links for you to explore more on your own.

First, a representative from each institution gave a roughly five-minute overview of how they're using Blacklight.

The group participated in a workshop on customizing Blacklight. The organizers paired people based on experience, so the most experienced and least experienced (self-identified) were paired up, and so on. The GitHub project for the workshop is here: https://github.com/projectblacklight/blacklight_summit_demo

We got an update on the state of Blacklight 7. Some of the highlights of what’s coming:

  • Move to Bootstrap 4 from Bootstrap 3
  • Use of HTML 5 structural elements
  • Better internationalization support
  • Move from helpers to presenters. (What are presenters: http://nithinbekal.com/posts/rails-presenters/)
  • Improved code quality
  • Partial structure that makes overrides easier

A release of Blacklight 7 won’t be ready until Bootstrap 4 is released.

There were also several conversations and breakout sessions about Solr, the indexing tool used to power Blacklight. I won't go into great detail here, but some topics discussed included:

  • Developing a common Solr schema for library catalogs.
  • Tuning the performance of Solr when the index is updated frequently. (Items that are checked out or returned need to be reindexed relatively frequently to keep availability information up to date.)
  • Support for multi-lingual indexing and searching in Solr, especially Chinese, Japanese, and Korean languages. Stanford has done a lot of work on this.

I’m sure you’ll be hearing more from me about Blacklight on this blog, especially as we work to build a new TRLN shared catalog with it.