Category Archives: Behind the Scenes

In a (Temporary) Time of Remote Work, Duke’s FOLIO Implementation Continues

Duke University is an early adopter for FOLIO, an open source library services platform that will give us tools to better support the information needs of our students, faculty, and staff. A core team in Library Systems and Integration Support began forming in January 2019 to help Duke move to FOLIO. I joined that team in January 2019 and began work as an IT Business Analyst.

In preparation for going live with FOLIO, we formally kicked off our local implementation effort in January 2020. More than 40 local subject experts have joined small group teams to work on different parts of the FOLIO project. These experts are invaluable to Library IT staff: they know how the library’s work is done and which features need to be prioritized over others, and they are committed to figuring out how to transition their work into the FOLIO environment.

If you’re reading this in April 2020 and thinking “wasn’t January ten years ago?” you’re not alone. Because the FOLIO Project is international, with partners all over the world, many of us are used to working via remote tools like Slack, Microsoft Teams, and Zoom. But that is a far cry from doing ALL of our work that way, while also taking care of our families and ourselves. It’s a huge credit to all library staff that while the University was swiftly pivoting to remote work, we were able to keep our implementation work going.

One of the first big, messy areas that we knew we needed to work on was using locations.

Locations are essential to how patrons know where an item is at the Duke Libraries. When you look up a book in our catalog and the system tells you Where to Find It, it’s using location information from our systems. Library staff also use locations to understand how often items are borrowed, decide when to move items to our off-campus storage, and decide when to buy new items to keep our collections up to date.

A group of FOLIO team members came together from different working areas, including public services, cataloging, acquisitions, digital resources and assessment. I convened those discussions as a lead for our Configurations team. Over the course of late February and March 2020, we met three times as a group using Zoom and delved deep into learning about locations in our current system and how they will work in FOLIO. Staff members shared their knowledge with each other about their functional areas, allowing us to identify potential gaps in FOLIO functionality, as well as things we could improve now, without waiting for FOLIO to deploy.

This team identified two potential paths forward – one that was straightforward, and one that was more creative and would adapt the FOLIO four-level locations in a new way. In our final meeting, where we had hoped to decide between the two options, our subject experts grappled with the challenges, risks and rewards of the two choices and were able to recommend a path forward together. Ultimately, the team agreed that the creative option was the best choice, but that both options would work – and that guidance helped us decide how to make a first pass on configuring locations and move the project forward.

The most important part of these meetings was valuing the expertise of our library staff and working to support them as they decided what would work the best for the library’s needs.  I am deeply appreciative of the staff who committed the time to these discussions while also figuring out how to move their regular jobs to remote work. Our FOLIO implementation is all the better because of their collaborative spirit.

The New Books & Media Catalog Turns One

It’s been just over a year since we launched our new catalog in January of 2019. Since then we’ve made improvements to features, performance, and reliability, have developed a long term governance and development strategy, and have plans for future features and enhancements.

During the Spring 2019 semester we experienced a number of outages of the Solr index that powers the new catalog. It proved to be both frustrating and difficult to track down the root cause of the outages. We took a number of measures to reduce the risk of bot traffic slowing down or crashing the index. A few of these measures include limiting facet paging to 50 pages and results paging to 250 pages, as well as setting limits on OpenSearch queries. We also added service monitoring so we are automatically alerted when things go awry, along with automatic restarts under some known bad system conditions. We also identified a bug in the version of Solr we were running that could cause crashes for queries with particular characteristics. We have since applied a patch to Solr to address this bug. Happily, the index has not crashed since we implemented these protective measures and bug fixes.
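To make the paging limits concrete, here is a minimal sketch of the idea in Python. It is not the catalog’s actual code; the function name and the "facet"/"results" labels are made up for illustration, and only the two limits (50 and 250) come from the description above.

```python
# Limits taken from the post; all names below are hypothetical.
MAX_FACET_PAGE = 50
MAX_RESULTS_PAGE = 250


def page_allowed(page: int, kind: str = "results") -> bool:
    """Return True if the requested page number is within the limit for its kind."""
    limit = MAX_FACET_PAGE if kind == "facet" else MAX_RESULTS_PAGE
    return 1 <= page <= limit


if __name__ == "__main__":
    # A request past the limit would be refused before it ever reaches the index.
    for page, kind in [(12, "results"), (251, "results"), (51, "facet")]:
        print(kind, page, "allowed" if page_allowed(page, kind) else "refused")
```

Refusing deep pages up front is a cheap way to keep crawlers from asking the index for very expensive result sets.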

Over the past year we’ve made a number of other improvements to the catalog including:

  • Caching of the home page and advanced search page has reduced page load times by 75%.
  • Subject searches are now more precise and do not include stemmed terms.
  • CDs and DVDs can be searched by accession number.
  • When digitized copies of Duke material are available at the Internet Archive, links to the digital copy are automatically generated.
  • Records can be saved to a bookmarks list and shared with others via a stable URL.
  • Eligible records now have a “Request digitization” link.
  • Many other small improvements and bug fixes.

We sometimes get requests for features that the catalog already supports:

While development has slowed, the core TRLN team meets monthly to discuss and prioritize new features and fixes, and dedicates time each month to maintenance and new development. We have a number of small improvements and bug fixes in the works. One new feature we’re working on is adding a citation generator that will provide copyable citations in multiple formats (APA, MLA, Chicago, Harvard, and Turabian) for records with OCLC numbers.

We welcome, read, and respond to all feedback and feature requests that come to us through the catalog’s feedback form. Let us know what you think.

Check out “Search Tips” and “Expert Search Tips” for detailed information about how to get the most out of the new catalog.

Duke Digital Repository Evolution and a new home page

After nearly a year of work, the libraries recently launched an updated version of the software stack that powers parts of the Duke Digital Repository. This work primarily centered around migrating the underlying software in our Samvera implementation (which we use to power the DDR) from ActiveFedora to Valkyrie. Moving to Valkyrie gives us the benefits of improved stability along with the flexibility to use different storage solutions, which in turn provides us with options and some degree of future-proofing. Considerable effort was also spent on updating the public and administrative interfaces to use more recent versions of Blacklight and supporting software.

Administrative interface for the DDR

We also used this opportunity to revise the repository landing page at repository.duke.edu and I was involved in building a new version of the home page. Our main goals were to make use of a header implementation that mirrored our design work in other recent library projects and that integrated our ‘unified’ navigation, while also maintaining the functionality required by the Samvera software.

DDR home page before the redesign

We also spent a lot of time thinking about how best to illustrate the components of the Duke Digital Repository while trying to keep the content simple and streamlined. In the end we went with a design that emphasizes the two branches of the repository: Library Collections and Duke Scholarship. Each branch in turn links to two destinations: Digitized Collections / Acquired Materials and the Research Data Repository / DukeSpace. The overall design is more compact than before and hopefully an improvement aesthetically as well.

Redesigned DDR home page

We also incorporated a feedback form that is persistent across the interface so that users can more readily report any difficulties they encounter while using the platform. And finally, we updated the content in the footer to help direct users to the content they are more than likely looking for.

Future plans include incorporating our header and footer content more consistently across the repository platforms along with bringing a more unified look and feel to interface components.

Check out the new design and let us know what you think!

All About that Time Base

The video digitization system in Duke Libraries’ Digital Production Center utilizes many different pieces of equipment: power distributors, waveform and vectorscope monitors, analog & digital routers, audio splitters & decibel meters, proc-amps, analog (BNC, XLR and RCA) to digital (SDI) converters, CRT & LCD video monitors, and of course an array of analog video playback decks of varying flavors (U-matic-NTSC, U-matic-PAL, Betacam SP, DigiBeta, VHS-NTSC and VHS-PAL/SECAM). We also transfer content directly from born-digital DV and MiniDV tapes.

A grandfather clock is a time base.

One additional component that is crucial to videotape digitization is the Time Base Corrector (TBC). Each of our analog video playback decks must have either an internal or external TBC in order to generate an image of acceptable quality. At the recent Association of Moving Image Archivists Conference in Baltimore, George Blood (of George Blood Audio/Video/Film/Data) gave a great presentation on exactly what a Time Base Corrector is, appropriately entitled “WTF is a TBC?” Thanks to George for letting me relay some of his presentation points here.

A time base is a consistent reference point that one can utilize to stay in sync. For example, the Earth rotating on its axis is a time base that the entire human race relies on to stay on schedule. A grandfather clock is also a time base. And so is a metronome, which a musical ensemble might use to all stay “in time.”

Frequency is defined as the number of occurrences of a repeating event per unit of time. So, the frequency of the Earth rotating on its axis is once per 24 hrs. The frequency of a grandfather clock is one pendulum swing per second. The clock example can also be defined as one “cycle per second” or one hertz (Hz), named after Heinrich Hertz, who first conclusively proved the existence of electromagnetic waves in the late 1800s.

One of the DPC’s external Time Base Correctors

But anything mechanical, like grandfather clocks and videotape decks, can be inconsistent. The age and condition of gears and rods and springs, as well as temperature and humidity, can significantly affect a grandfather clock’s ability to display the time correctly.

Videotape decks are similar, full of numerous mechanical and electrical parts that produce infinite variables in performance, affecting the deck’s ability to play the videotape’s frames-per-second (frequency) in correct time.

NTSC video is supposed to play at 29.97 frames-per-second, but due to mechanical and electro-magnetic variables, some frames may be delayed, or some may come too fast. One second of video might not have enough frames, another second may have too many. Even the videotape itself can stretch, expand and contract during playback, throwing off the timing, and making the image wobbly, jittery, too bright or dark, too blue, red or green.

A Time Base Corrector does something awesome. As the videotape plays, the TBC stores the unstable video content briefly, fixes the timing errors, and then outputs the corrected analog video signal to the DPC’s analog-to-digital converters. Some of our videotape decks have internal TBCs, which look like a computer circuit board (shown below). Others need an external TBC, which is a smaller box that attaches to the output cables coming from the videotape deck (shown above, right). Either way, the TBC can delay or advance the video frames to lock them into correct time, which corrects these timing errors.

An internal Time Base Corrector card from a Sony U-matic BVU-950 deck
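For readers who think in code, the store-then-retime idea can be sketched in a few lines of Python. This is only a conceptual analogy, not how TBC hardware is built: it treats each frame as a single arrival timestamp, and the function names and sample arrival times are made up for illustration. The one hard number is the NTSC frame rate, 30000/1001 (about 29.97) frames per second.

```python
NTSC_FPS = 30000 / 1001  # about 29.97 frames per second


def ideal_times(frame_count: int, fps: float = NTSC_FPS) -> list[float]:
    """Evenly spaced output times: frame k is released at k / fps seconds."""
    return [k / fps for k in range(frame_count)]


def timing_errors(arrival_times: list[float], fps: float = NTSC_FPS) -> list[float]:
    """How early (negative) or late (positive) each frame arrived from the deck."""
    ideal = ideal_times(len(arrival_times), fps)
    return [arrived - expected for arrived, expected in zip(arrival_times, ideal)]


if __name__ == "__main__":
    # Hypothetical arrival times (in seconds) from a slightly unstable deck.
    arrivals = [0.0, 0.035, 0.066, 0.101]
    for frame, error in enumerate(timing_errors(arrivals)):
        print(f"frame {frame}: arrived {error * 1000:+.2f} ms off schedule")
    # The corrected stream is simply the same frames released at ideal_times().
```

The point of the sketch is that the correction does not change the frames themselves, only when they are released.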

An internal TBC is actually able to “talk” to the videotape deck, and give it instructions, like this…

“Could you slow down a little? You’re starting to catch up with me.”

“Hey, the frames are arriving at a strange time. Please adjust the timing between the capstan and the head drum.”

“There’s a wobble in the rate the frames are arriving. Can you counter-wobble the capstan speed to smooth that out?”

“Looks like this tape was recorded with bad heads. Please increase gain on the horizontal sync pulse so I can get a clearer lock.”

Without the mighty TBC, video digitization would not be possible, because all those errors would be permanently embedded in the digitized file. Thanks to the TBC, we can capture a nice, clean, stable image to share with generations to come, long after the magnetic videotape, and playback decks, have reached the end of their shelf life.

FOLIO Update November 2019

Here at Duke, the buzz continues around FOLIO. We have continued to contribute to the international project as active participants on the FOLIO Product Council and special interest groups, and by contributing development resources. You can find links to the various groups on the FOLIO wiki.

We’ve also committed to implementing the electronic resource management (ERM)-focused apps in summer of 2020. Starting with the ERM-focused apps will give us the opportunity to use FOLIO in a production environment, and will be a benefit to our Continuing Resource Acquisitions Department since they are not currently using software dedicated to electronic resources to keep track of licenses and terms.

Our local project planning has come more into focus as well. We have gathered names for team participants and will be kicking off our project teams in January. As we’ve talked about the implementation here, we’ve realized that we have a number of tasks that will need to be addressed, regardless of subject matter. For example, we’re going to need to map data – not just bibliographic, holdings and item data, but users, orders, invoices, etc. We’ll also need to set up configurations and user permissions for each of the apps, and document, train, and develop new workflows. Since our work is not siloed in functional areas, we need to facilitate discussions among the functional areas. To do that, we’re going to create a set of functional area implementation teams, and work groups around the task areas that need to be addressed.

To learn more about the FOLIO project at Duke, fly on over to our WordPress site and read through our past newsletters, look through slides from past presentations, and check out some fun links to bee facts.

A Statement of Commitment

The featured image is from a mockup of a new repositories home page that we’re working on in the Libraries, planned for rollout in January of 2020.

Working at the Libraries, it can be dizzying to think about all of our commitments.

There’s what we owe our patrons, a body of so many distinct and overlapping communities, all seeking to learn and discover, that we could split the library along an infinite number of lines to meet them where they work and think.

There’s what we owe the future, in our efforts to preserve and share the artifacts of knowledge that we acquire on the market, that scholars create on our own campus, or that seem to form from history and find us somehow.

There’s what we owe the field, and the network of peer libraries that serve their own communities, each of them linked in a web of scholarship with our own. Within our professional network, we seek to support and complement one another, to compete sometimes in ways that move our field forward, and to share what we learn from our experiences.

The needs of information technology underlie nearly all of these activities, and to meet those needs, we have an IT staff that’s modest in size, but prodigious in its skill and its dedication to the mission of the Libraries. Within that group, the responsibility for creating new software, and maintaining what we have, falls to a small team of developers and devops engineers. We depend on them to enhance and support a wide range of platforms, including our web services, our discovery platforms, and our digital repositories.

This fall, we did some reflection on how we want to approach support for our repository platforms. The result of that reflection was a Statement of Commitment to Repositories Support and Development, a document of roughly a page that expresses what we consider to be our values in this area, and the context of priorities in which we do that work.

The committee that created the statement was our Digital Preservation and Publishing Program, or DP3 as we call it in house. We summarized our values as “openness, community and peer engagement, and independence from vended platforms,” which have “guided us to build our repositories on open source software platforms.” We place that work within the context of very large, looming priorities like our transition to FOLIO as our Library Services Platform, and the project to renovate Lilly Library. There are others, not mentioned in the statement, that fill the pages of this blog.

The statement is explicit that we will not seek to find alternative platforms for our repository services in the next several years, and in particular while the FOLIO transition is underway. This decision is informed by our recognition that migration of content and services across platforms is complex and expensive. It’s also a recognition that we have invested a lot into these existing platforms, and we want to carve out as much space as we can for our talented staff to focus on maintaining and improving them, rather than locking ourselves into all-consuming cycles of content migration.

From a practical perspective, and speaking as the manager who oversees software development in the Libraries, I see this statement as part of an overall strategy to bring focus to our work. It’s a small but important symbolic measure that recognizes the drag that we create for our software team when we give in to our urge to prioritize everything.

The phrase “context switching” is one that we have borrowed from the parlance of operating systems to describe the effects on a developer of working on multiple projects at once. There are real costs to moving between development environments, code bases, and architectures on the same day, in the same week, during the same sprint, or within even an extended work cycle. We also call this problem “multi-tasking,” and the penalty it imposes on performance is well documented.

Even more than performance, I think of it as a quality of life concern. People are generally happier and more invested when they’re able to do quality work. As a manager, I can work with scheduling and planning to try to mitigate those effects of multitasking on our team. But the responsibility really lies with the organization. We have our commitments, and they are vast in size and scope. We owe it to ourselves to do some introspection now and again, and ask what we can realistically do with what we have, or more accurately, who we are.

Lighting and the PhaseOne: It’s More Than Point and Shoot

Last week, I went to go see the movie IT: Chapter 2. One thing I really appreciated about the movie was how it used a scene’s lighting to full effect. Some scenes are brightly lit to signify the friendship among the main characters. Conversely, there are dark scenes that signify the evil Pennywise the Clown. For the movie crew, no doubt it took a lot of time and manpower to light an individual scene – especially when the movie is nearly 3 hours long.

We do the same type of light setup and management inside the Digital Production Center (DPC) when we take photos of objects like books, letters, or manuscripts. Today, I will talk specifically about how we light the bound material that comes our way, like books or booklets. This type of material is almost always shot on our PhaseOne camera, so that is the lighting setup I will highlight.

Before We Begin

It’s not enough to just turn the lights on in our camera room to do the trick. In order to properly light all the things that need to be shot on the PhaseOne, we have specific tools and products we use that you can see in the photo below.

We have 4 high-powered lights (two sets of two Buhl SoftCube SC-150 models) pointed directly in the camera’s field of view. There are two on the right and two on the left. These are stationed approximately 3.5 feet off the ground and approximately 2.5 feet away from the objects themselves. These lights are supported by Avenger A630B light stands. They allow for a wide range of movement, extension, and support if we need them.

But if bright, hot lights were pointed directly at sensitive documents for hours, it would damage them. So light diffusers are necessary. For both sets of lights, we have 3 layers of material to diffuse the light and prevent material from warping or text from fading. The first layer, directly attached to the light box itself, is an inexpensive sheet of diffusion fabric. This type of material is often made from nylon or silk.

The second diffusion layer is an FJ Westcott Scrim Jim, a similar thin fabric that is attached to a lightweight stand-up frame, the Manfrotto 156BLB. This frame can also be moved or extended if need be. The last layer is another sheet of diffusion fabric, attached to a makeshift “cube” held up by lightweight wooden rods. This cube can be picked up or carried, making it very convenient if we need to eventually move our lights.

So in total, we have 4 lights, 4 layers of diffusion fabric attached to the light boxes, two Scrim Jims, and the cube featuring 2 sides of additional diffusion fabric. After having all these items stationed, surely we can start taking pictures, right? Not yet.

Around the Room

There are still more things to be aware of – this time in the camera room itself. We gently place the materials on a cradle lined with black felt, similar to velvet. This cradle is visible in the bottom right part of the photo above. It is placed on top of a table, also coated in black felt. This is done so no background colors bounce back or reflect onto the object and change what it looks like in the final image. The walls of the camera room are also painted a neutral grey color for the same reason, as you can see in the background of the above photo. Finally, any tiny reflective segments between the ceiling tiles have been blacked out with gaffer tape. Having the room this muted and intentionally dark also helps us when we have to shoot multi-spectral images. No expense has been spared to make sure our colors and photos are correct.

Camera Settings

With all these precautions in place, can we finally take photos of our materials? Almost. Before we can start photographing, we have to run some tests to make sure everything looks correct to our computers. After making sure our objects are sharp and in focus, we use a program called DTDCH (see the photo to the right) to adjust the aperture and exposure of the PhaseOne so that nothing appears either way too dim or too bright. In our camera room, we use a PhaseOne IQ180 with a Schneider Kreuznach Apo-Digitar lens (visible in the top-right corner of the photo above). We also use the program CaptureOne to capture, save, and export our photos.

Once the shot is in focus and appropriately bright, we will check our colors against an X-Rite ColorChecker Classic card (see the photo on the left) to verify that our camera has a correct white balance. When we take a photo of the ColorChecker, CaptureOne displays a series of numbers, known as RGB values, found in the photo’s colors. We will check these numbers against what they should be, so we know that our photo looks accurate. If these numbers match up, we can continue. You could check our work by saving the photo on the left and opening it in a program like Adobe Photoshop.
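The comparison step can be pictured as a small check like the Python sketch below. This is not how CaptureOne works internally, and every number here is a placeholder: the patch names, target values, measured values, and tolerance are all made up for illustration and are not the DPC’s actual targets.

```python
def patch_within_tolerance(measured, target, tolerance=3):
    """True if each of the R, G, and B values is within `tolerance` of its target."""
    return all(abs(m - t) <= tolerance for m, t in zip(measured, target))


def failing_patches(measured_patches, target_patches, tolerance=3):
    """Names of the color patches whose measured RGB values are too far off."""
    return [
        name
        for name, target in target_patches.items()
        if not patch_within_tolerance(measured_patches[name], target, tolerance)
    ]


if __name__ == "__main__":
    # Placeholder values only; real targets come from the card's reference data.
    targets = {"white": (243, 243, 242), "black": (52, 52, 52)}
    measured = {"white": (243, 242, 241), "black": (60, 52, 52)}
    print(failing_patches(measured, targets))  # an empty list means we can continue
```

If any patch shows up in the list, the exposure or white balance needs another adjustment before shooting begins.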

Finally, we have specific color profiles that the DPC uses to ensure that all our colors appear accurate as well. For more information on how we consistently calibrate the color in our images, please check out this previous blog post.

After all this setup, now we can finally shoot photos! Lighting our materials for the PhaseOne is a lot of hard work and preparation. But it is well worth it to fulfill our mission of digitizing images for preservation.

What we talk about when we talk about digital preservation

(Header image: Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark)

Here at Duke University Libraries, we often talk about digital preservation as though everyone is familiar with the various corners and implications of the phrase, but “digital preservation” is, in fact, a large and occasionally mystifying topic. What does it mean to “preserve” a digital resource for the long term? What does “the long term” even mean with regard to digital objects? How are libraries engaging in preserving our digital resources? And what are some of the best ways to ensure that your personal documents will be reusable in the future? While the answers to some of these questions are still emerging, the library can help you begin to think about good strategies for keeping your content available to other users over time by highlighting agreed-upon best practices, as well as some of the services we are able to provide to the Duke community.

File formats

Not all file formats have proven to be equally robust over time! Have you ever tried to open a document created using a Microsoft Office product from several years ago, only to be greeted with a page full of strangely encoded gibberish? Proprietary software like the products in the Office suite can be convenient and produce polished contemporary documents. But software changes, and there is often no guarantee that the beautifully formatted paper you’ve written using Word will be legible without the appropriate software 5 years down the line. One solution to this problem is to always have a version of that software available to you to use. Libraries are beginning to investigate this strategy (often using a technique called emulation) as an important piece of the digital preservation puzzle. The Emulation as a Service (EaaS) architecture is an emerging tool designed to simplify access to preserved digital assets by allowing end users to interact with the original environments running on different emulators.

An alternative to emulation as a solution is to save your files in a format that can be consumed by different, changing versions of software. Experts at cultural heritage institutions like the Library of Congress and the US National Archives and Records Administration have identified an array of file formats about which they feel some degree of confidence that the software of the future will be able to consume. Formats like plain text or PDFs for textual data, delimiter-separated value files (like comma-separated values, or CSVs) for tabular data, MP3s and MP4s for audio and video data respectively, and JPEGs for still images have all proven to have some measure of durability as formats. What’s more, they will help to make your content or your data more easily accessible to folks who do not have access to particular kinds of software. It can be helpful to keep these format recommendations in mind when working with your own materials.

File format migration

The formats recommended by the Library of Congress and others have been selected not only because they are interoperable with a wide variety of software applications, but also because they have proven to be relatively stable over time, resisting format obsolescence. The process of moving data from an obsolete format to one that is usable in the present day is known as file format migration or format conversion. Libraries generally have yet to establish scalable strategies for extensive migration of obsolete file formats, though it is a subject of some concern.

Here at DUL, we encourage the use of one of these recommended formats for content that is submitted to us for preservation, and will even go so far as to convert your files prior to preservation in one of our repository platforms where possible and when appropriate to do so. This helps us ensure that your data will be usable in the future. What we can’t necessarily promise is that, should you give us content in a file format that isn’t one we recommend, a user who is interested in your materials will be able to read or otherwise use your files ten years from now. For some widely used formats, like MP3 and MP4, staff at the Libraries anticipate developing a strategy for migrating our data from these formats in the event that they become superseded. However, the Libraries do not currently have the staff to monitor rarer, and especially proprietary, formats and convert them to ones that are immediately consumable by contemporary software. The best we can promise is that we are able to deliver to the end users of the future the same digital bits you initially gave to us.

Bit-level preservation

Which brings me to a final component of digital preservation: bit-level preservation. At DUL, we calculate a checksum for each of the files we ingest into any of our preservation repositories. Briefly, a checksum is an algorithmically derived alphanumeric hash that is intended to surface errors that may have been introduced to the file during its transmission or storage. A checksum acts somewhat like a digital fingerprint, and is periodically recalculated for each file in the repository environment by the repository software to ensure that nothing has disrupted the bits that compose each individual file. In the event that the recalculated checksum does not match the one recorded when the file was ingested into the repository, we can conclude with some level of certainty that something has gone wrong with the file, and it may be necessary to revert to an earlier version of the data. The process of generating, regenerating, and cross-checking these checksums is a way to ensure the file fixity, or file integrity, of the digital assets that DUL stewards.
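If you would like to try a fixity check on your own files, the idea can be sketched with Python’s standard hashlib module. This is a minimal illustration, not the repository software’s own routine; the function names are made up, and SHA-256 is an assumption here, since the algorithm in use is not specified above.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute a checksum by reading the file in chunks, so large files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum, algorithm="sha256"):
    """Recalculate the checksum and compare it to the one recorded at ingest."""
    return file_checksum(path, algorithm) == recorded_checksum
```

You would call file_checksum once when you first store a file, save the result somewhere safe, and later pass that saved value to fixity_ok. A single changed bit produces a different checksum, so a False result is a signal to go back to another copy.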

What happens when you click “Search?”

How many times each day do you type something into a search box on the web and click “Search?” Have you ever wondered what happens behind the scenes to make this possible? In this post I’ll show how search works on the Duke University Libraries Catalog. I’ll trace the journey from metadata in a MARC record (where our bibliographic data is stored), to transforming that data into something we can index for searching, to how the words you type into the search box are transformed, and then finally how the indexed records and your search interact to produce a relevance ranked list of search results. Let’s get into the weeds!

A MARC record stores bibliographic data that we purchase from vendors or that is created by metadata specialists who work at Duke Libraries. These records look something like this:

In an attempt to keep this simple, let’s just focus on the main title of the record. This is information recorded in the MARC record’s 245 field in subfields a, b, f, g, h, k, n, p, and s. I’m not going to explain what each of the subfields is for but the Library of Congress maintains extensive documentation about MARC field specifications (see 245 – Title Statement (NR)). Here is an example of a MARC 245 field with a linked 880 field that contains the equivalent title in an alternate script (just to keep things interesting).

=245 10$6880-02$aUrbilder ;$bBlossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet /$cToshio Hosokawa.
=880 10$6245-02/{dollar}1$a原像 ;$b開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための /$c細川俊夫.

The first thing that has to happen is we need to get the data out of the MARC record and into a more computer-friendly data format: an array of hashes, which is just a fancy way of saying a list of sets of key-value pairs. The software reads the metadata from the MARC 245 field, joins all the subfields together, and cleans up some punctuation. The software also checks to see if the title field contains Arabic, Chinese, Japanese, Korean, or Cyrillic characters, which have to be handled separately from Roman character languages. From the MARC 245 field and its linked 880 field we end up with the following data structure.

"title_main": [
{
"value": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"
},
{
"value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",
"lang": "cjk"
}
]
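The extraction step can be sketched roughly as follows. This is not the catalog's actual code: it assumes each MARC field arrives as a list of (subfield code, value) tuples, the function names are hypothetical, the punctuation cleanup only trims one trailing mark, and only CJK detection is shown even though the real software also checks for Arabic and Cyrillic.

```python
import re
import unicodedata

# Subfield codes that make up the main title, as listed in the post.
TITLE_SUBFIELDS = set("abfghknps")
# Assumption: a character counts as CJK if its Unicode name starts with one of these.
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(text):
    """True if any character in text looks Chinese, Japanese, or Korean."""
    return any(unicodedata.name(ch, "").startswith(CJK_NAME_PREFIXES) for ch in text)


def title_entry(subfields):
    """Join the title subfields and trim one trailing punctuation mark."""
    parts = [value for code, value in subfields if code in TITLE_SUBFIELDS]
    value = re.sub(r"\s*[/:;,]\s*$", "", " ".join(parts))
    entry = {"value": value}
    if is_cjk(value):
        entry["lang"] = "cjk"
    return entry


def extract_title_main(field_245, linked_880=None):
    """Build the title_main list from a 245 field and its optional linked 880."""
    entries = [title_entry(field_245)]
    if linked_880:
        entries.append(title_entry(linked_880))
    return {"title_main": entries}
```

Called with the 245 and 880 fields from the example record, extract_title_main returns the same two-entry structure shown above, with "lang": "cjk" set only on the second entry.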

We send this data off to an ingest service that prepares the metadata for indexing.

The data is first expanded to multiple fields.

{"title_main_indexed": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_vernacular_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_vernacular_lang": "cjk",

"title_main_value": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet"}

title_main_indexed will be indexed for searching.
title_main_vernacular_value holds the non-Roman version of the title to be indexed for searching.
title_main_vernacular_lang holds information about the character set stored in title_main_vernacular_value.
title_main_value holds the data that will be stored for display purposes in the catalog user interface.
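A minimal sketch of this expansion step is below. The function name expand_title_main is hypothetical, and the rule for building the display value (vernacular title, then " / ", then the Roman-script title) is inferred from the single example above rather than taken from the ingest service itself.

```python
def expand_title_main(entries):
    """Flatten the title_main list of hashes into separate named fields."""
    # Assumption: the entry without a "lang" key is the Roman-script title.
    roman = next((e for e in entries if "lang" not in e), None)
    vernacular = next((e for e in entries if "lang" in e), None)

    fields = {}
    if roman:
        fields["title_main_indexed"] = roman["value"]
    if vernacular:
        fields["title_main_vernacular_value"] = vernacular["value"]
        fields["title_main_vernacular_lang"] = vernacular["lang"]

    # Display value: vernacular first, then the Roman-script title.
    display = [e["value"] for e in (vernacular, roman) if e]
    if display:
        fields["title_main_value"] = " / ".join(display)
    return fields
```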

We take this flattened, expanded set of fields and apply a set of rules to prepare the data for the indexer (Solr). These rules append suffixes to each field and combine the two vernacular fields to produce the following field value pairs. The suffixes provide instructions to the indexer about what should be done with each field.

{"title_main_indexed_tsearchtp": "Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet",

"title_main_cjk_v": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための",

"title_main_t_stored_single": "原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet" }

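The suffixing rules for this one title can be sketched as a small mapping. The function name apply_suffix_rules is hypothetical and the rules are reverse-engineered from the example values above; the real ingest service presumably drives this from configuration that covers many more fields.

```python
def apply_suffix_rules(fields):
    """Rename expanded title fields to the suffixed names sent to the indexer."""
    out = {}
    if "title_main_indexed" in fields:
        # One source field that the indexer later copies into stemmed (_t)
        # and unstemmed (_tp) search fields.
        out["title_main_indexed_tsearchtp"] = fields["title_main_indexed"]

    # The two vernacular fields collapse into one field named after the script.
    value = fields.get("title_main_vernacular_value")
    lang = fields.get("title_main_vernacular_lang")
    if value and lang:
        out[f"title_main_{lang}_v"] = value

    if "title_main_value" in fields:
        # Stored for display only, as a single string.
        out["title_main_t_stored_single"] = fields["title_main_value"]
    return out
```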
When sent to the indexer the fields are further transformed.

| Suffixed Source Field | Solr Field | Solr Field Type | Solr Stored/Indexed Values |
| --- | --- | --- | --- |
| title_main_indexed_tsearchtp | title_main_indexed_t | text stemmed | urbild blossom kalligraphi o mensch bewein dein sund gross arrang for string quartet |
| title_main_indexed_tsearchtp | title_main_indexed_tp | text unstemmed | urbilder blossoming kalligraphie o mensch bewein dein sunde gross arrangement for string quartet |
| title_main_cjk_v | title_main_cjk_v | chinese, japanese, korean text | 原 像 开花 书 か り く ら ふ ぃ い ほか 弦乐 亖 重奏 の ため の |
| title_main_t_stored_single | title_main | stored string | 原像 ; 開花 ; 書 (カリグラフィー) ほか : 弦楽四重奏のための / Urbilder ; Blossoming ; Kalligraphie ; O Mensch, bewein' dein' Sünde gross (Arrangement) : for string quartet |

These are all index time transformations. They occur when we send records into the index.
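To give a feel for what such an analysis step does, here is a rough approximation of the unstemmed transformation: lowercase the text, fold accented characters to their base letters, and split on anything that is not a letter or digit. It is only a sketch; Solr's real analyzer chain is defined in the index schema, and stemming is left out because the post does not name the stemmer in use. The function name analyze_unstemmed is hypothetical.

```python
import re
import unicodedata


def analyze_unstemmed(text):
    """Approximate an unstemmed text analysis: lowercase, fold accents, tokenize."""
    # NFKD splits "ü" into "u" plus a combining diaeresis, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep runs of ASCII letters and digits; punctuation acts as a separator.
    return re.findall(r"[a-z0-9]+", folded)
```

Applied to the example title, this yields the same tokens listed for title_main_indexed_tp, for instance "sunde" for "Sünde" and "bewein" for "bewein'".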

The query you enter into the search box also gets transformed in different ways and then compared to the indexed fields above. These are query time transformations. As an example, if I search for the terms “Urbilder Blossom Kalligraphie,” the following transformations and comparisons take place:

The values stored in the records for title_main_indexed_t are evaluated against my search string transformed to urbild blossom kalligraphi.

The values stored in the records for title_main_indexed_tp are evaluated against my search string transformed to urbilder blossom kalligraphie.

The values stored in the records for title_main_cjk_v are evaluated against my search string transformed to urbilder blossom kalligraphie.

Then Solr does some calculations based on relevance rules we configure to determine which documents are matches and how closely they match (signified by the relevance score calculated by Solr). The field value comparisons end up looking like this under the hood in Solr:

+(DisjunctionMaxQuery((
(title_main_cjk_v:urbilder)^50.0 |
(title_main_indexed_tp:urbilder)^500.0 |
(title_main_indexed_t:urbild)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:blossom)^50.0 |
(title_main_indexed_tp:blossom)^500.0 |
(title_main_indexed_t:blossom)^100.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:kalligraphie)^50.0 |
(title_main_indexed_tp:kalligraphie)^500.0 |
(title_main_indexed_t:kalligraphi)^100.0)~1.0))~3
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^150.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^600.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^5000.0)~1.0)
(DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom")^75.0 |
(title_main_indexed_t:"urbild blossom")^200.0 |
(title_main_indexed_tp:"urbilder blossom")^1000.0)~1.0)
DisjunctionMaxQuery((
(title_main_cjk_v:"blossom kalligraphie")^75.0 |
(title_main_indexed_t:"blossom kalligraphi")^200.0 |
(title_main_indexed_tp:"blossom kalligraphie")^1000.0)~1.0))
DisjunctionMaxQuery((
(title_main_cjk_v:"urbilder blossom kalligraphie")^100.0 |
(title_main_indexed_t:"urbild blossom kalligraphi")^350.0 |
(title_main_indexed_tp:"urbilder blossom kalligraphie")^3000.0)~1.0)

The ^nnnn after each clause is the boost, or relevance weight, given to a match in that field; matches in fields with higher boosts count more than matches in fields with lower boosts. The ~3 that closes the first group of clauses is the minimum number of those clauses that must match for a document to count as a match (here, all three search terms), while the ~1.0 after each DisjunctionMaxQuery is a tie-breaker setting that controls how much matches in the lower-scoring fields add to the score of the best-matching field. You might also notice that full phrase matches are boosted the most, matches on two consecutive terms are boosted less, and individual term matches are boosted the least. Furthermore, unstemmed field matches (those that have been modified the least by the indexer, such as in the field title_main_indexed_tp) get more boost than stemmed field matches. This provides the best of both worlds: you still get a match if you search for “blossom” instead of “blossoming,” but if you had searched for “blossoming” the exact term match would boost the score of the document in results. Solr also considers how common a term is among all documents in the index, so that very common words like “the” don’t boost the relevance score as much as less common words like “kalligraphie.”
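The effect of the stemmed and unstemmed boosts can be seen in a deliberately simplified toy model. It only adds up the boosts of the fields in which a single term matches, using the boost values from the query above and the indexed tokens from the example record. Real Solr scoring also factors in term frequency, how rare the term is, and the phrase boosts, so the numbers here are illustrative rather than real relevance scores. The names FIELD_BOOSTS, DOC, and term_score are hypothetical.

```python
# Per-field boosts for single-term matches, copied from the parsed query above.
FIELD_BOOSTS = {
    "title_main_indexed_tp": 500.0,  # unstemmed
    "title_main_indexed_t": 100.0,   # stemmed
}

# Indexed tokens for the example record, copied from the values shown earlier.
DOC = {
    "title_main_indexed_tp": (
        "urbilder blossoming kalligraphie o mensch bewein dein sunde "
        "gross arrangement for string quartet"
    ).split(),
    "title_main_indexed_t": (
        "urbild blossom kalligraphi o mensch bewein dein sund "
        "gross arrang for string quartet"
    ).split(),
}


def term_score(analyzed_term, doc=DOC, boosts=FIELD_BOOSTS):
    """Sum the boosts of every field whose indexed tokens contain the term.

    analyzed_term maps each field name to the query term as analyzed for
    that field (stemmed for _t, unstemmed for _tp).
    """
    return sum(
        boost
        for field, boost in boosts.items()
        if analyzed_term.get(field) in doc[field]
    )


# "blossom" matches only the stemmed field; "blossoming" matches both fields.
blossom = {"title_main_indexed_tp": "blossom", "title_main_indexed_t": "blossom"}
blossoming = {"title_main_indexed_tp": "blossoming", "title_main_indexed_t": "blossom"}
```

In this toy model, term_score(blossom) comes out lower than term_score(blossoming), which mirrors the behavior described above: both searches find the record, but the exact, unstemmed match ranks it higher.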

I hope this provides some insight into what happens when you click “Search.” Happy searching.