Now, I know last week’s Bitstreams post about metadata and date encoding left you wanting to know more about Duke Digital Collections date metadata, the Extended Date Time Format, and how we are planning to apply it. Well, don’t you worry – I’m going to talk about it now in scintillating detail.
As Cory talked about last week, we have a lot of inconsistent, “squishy” date metadata in use in our digital collections, squishy both in terms of what those dates mean and how they are represented. This is a problem when you want to do fun stuff with your dates, like create facets and visualizations, or, um, retrieve reliably comprehensive search results when looking for everything from a time period. So we’re beginning the process of normalizing all of that data. But because we’re talking about special collections materials, date squishiness is not an uncommon occurrence; it is inherent to the materials, and we need to be able to represent it programmatically.
There are 12,146 unique date values present in our digital collections metadata. These values range from very machine-processable (“January 1, 1936” or “1971-1972”) to a lot less so (“[ca. late 1880s]” or “[1950s] Nov. 22]”), along with a plethora of values all meaning the same thing: “n.d.”, “None”, “undated”, “Unknown”, etc. (and one inscrutable instance of the word “Philadelphia”). To begin normalizing the data, we identified the main patterns those dates took and came up with a list of 38 rough patterns into which all but 178 values fell. Next we took a stab at converting those patterns to EDTF. The following represents the great bulk of our data:
|Patterns in our metadata|EDTF encoding|Example display|
|---|---|---|
|mm/dd/yy; yyyy Mon. dd; yyyy-mm-dd; dd-Mon-yy|yyyy-mm-dd|January 1, 1910|
|yyyy Mon. dd?|yyyy-mm-dd?|January 1, 1910?|
|yyyy Mon dd-dd|yyyy-mm-dd/yyyy-mm-dd|January 1, 1910 to January 3, 1910|
|yyyy-mm; yyyy Mon.; yyyy/mm|yyyy-mm|January 1910|
|yyyy Mon.?|yyyy-mm?|January 1910?|
|ca. yyyy; circa yyyy|yyyy~|Circa 1910|
|[yyyy/yyyy?]|yyyy?/yyyy?|1910 to 1913?|
|yyyy or yyyy|[yyyy,yyyy]|1910 or 1913|
|yyyx; yyy?; yyy_?; yyy?; [yyy-]|yyyu|191x|
|circa yyyy-yyyy; yyyy-yyyy and n.d.|yyyy~/yyyy~|Circa 1910 to 1913|
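To make the mapping concrete, here is a minimal Python sketch of how a few of these patterns could be converted with regular expressions. It is an illustration only, not the script we actually use: the names `RULES` and `to_edtf` are made up for this example, and it covers just a handful of the 38 patterns.

```python
import re

# Month abbreviations mapped to their numbers (1-12).
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Matches "Jan", "Jan." or "January"; captures only the first three letters.
MON = r"(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.?"


def _mm(token):
    """Turn a captured month abbreviation into a two-digit month."""
    return f"{MONTHS[token.lower()]:02d}"


# Hypothetical rule list: (source pattern, function building the EDTF string).
RULES = [
    # yyyy Mon dd-dd  ->  yyyy-mm-dd/yyyy-mm-dd
    (rf"(\d{{4}}) {MON} (\d{{1,2}})-(\d{{1,2}})",
     lambda m: f"{m[1]}-{_mm(m[2])}-{int(m[3]):02d}"
               f"/{m[1]}-{_mm(m[2])}-{int(m[4]):02d}"),
    # yyyy Mon. dd[?]  ->  yyyy-mm-dd[?]
    (rf"(\d{{4}}) {MON} (\d{{1,2}})(\?)?",
     lambda m: f"{m[1]}-{_mm(m[2])}-{int(m[3]):02d}{m[4] or ''}"),
    # yyyy Mon.[?]  ->  yyyy-mm[?]
    (rf"(\d{{4}}) {MON}(\?)?",
     lambda m: f"{m[1]}-{_mm(m[2])}{m[3] or ''}"),
    # circa yyyy-yyyy  ->  yyyy~/yyyy~
    (r"(?:ca\.|circa) (\d{4})-(\d{4})", lambda m: f"{m[1]}~/{m[2]}~"),
    # ca. yyyy; circa yyyy  ->  yyyy~
    (r"(?:ca\.|circa) (\d{4})", lambda m: f"{m[1]}~"),
    # yyyy or yyyy  ->  [yyyy,yyyy]
    (r"(\d{4}) or (\d{4})", lambda m: f"[{m[1]},{m[2]}]"),
    # yyyx; yyy?; [yyy-]  ->  yyyu
    (r"\[?(\d{3})[x?\-]\]?", lambda m: f"{m[1]}u"),
]


def to_edtf(raw):
    """Return an EDTF string for a recognized pattern, or None for outliers."""
    value = raw.strip()
    for pattern, build in RULES:
        match = re.fullmatch(pattern, value, flags=re.IGNORECASE)
        if match:
            return build(match)
    return None  # left for manual review, like our 178 outliers
```

With these rules, `to_edtf("circa 1910-1913")` gives `1910~/1913~`, while a value like “Philadelphia” falls through and returns `None`, which is a reasonable way to flag the values that need a human to look at them.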
As you can see above, the specification accommodated most of the patterns we identified, but when we tried to encode more nuanced dates, we discovered that we couldn’t quite take the encoding as far as we wanted.
For example, one pattern that shows up in our metadata frequently is an approximate decade, such as “circa 1940s”.
A decade encoded in EDTF looks like this:
194x (for 1940s)
and we can encode a circa date like this:
1940~ (for circa 1940)
But we can’t combine the two formats – the following is not a valid EDTF date:
194x~ (for circa 1940s)
Ideally, the specification would allow us to create an encoded date that looked like the above date, as well as this:
194x~/195x~ (for circa 1940s to 1950s)
We can work around this deficiency by stripping ‘circa’ from the date ranges and using the ‘unspecified’ encoding:
194u/195u (which we can translate to display as: 1940s to 1950s)
But this approach isn’t ideal, as it is inconsistent with our other usage of the format and isn’t technically ‘correct’, either. Happily, the EDTF specification is open for modification and proposals for modifications are still being taken. A quick glance at the listserv archives indicates that we’re not the only people trying to encode this kind of squishiness.
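For the curious, the workaround described above can be sketched in a few lines of Python. This is a rough illustration under some assumptions: the function names `encode_decades` and `display_decades` are invented for the example, and returning a flag that records whether ‘circa’ was dropped is just one possible way to avoid losing that information, not something we have settled on.

```python
import re

# A decade as written in our metadata, e.g. "1940s"; captures "194".
DECADE = r"(\d{3})0s"


def encode_decades(raw):
    """Encode '[circa] 1940s' or '[circa] 1940s to 1950s' with the 'u' digit.

    Returns (edtf_string, circa_was_dropped), or None if the value
    does not look like a decade or a decade range.
    """
    match = re.fullmatch(
        rf"(ca\.|circa)?\s*{DECADE}(?:\s*(?:to|-)\s*{DECADE})?",
        raw.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    qualifier, start, end = match.groups()
    edtf = f"{start}u" if end is None else f"{start}u/{end}u"
    # The 'circa' qualifier cannot be encoded here, so report that it was lost.
    return edtf, qualifier is not None


def display_decades(edtf):
    """Translate '194u/195u' back into '1940s to 1950s' for display."""
    return " to ".join(f"{part[:3]}0s" for part in edtf.split("/"))
```

Here `encode_decades("circa 1940s to 1950s")` returns `("194u/195u", True)`, and `display_decades("194u/195u")` gives back `1940s to 1950s`. Keeping the flag around would let us restore the ‘circa’ later if the specification ever grows a way to express it.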
In the meantime, we can keep ourselves busy with cleaning up, normalizing, and converting the great bulk of our date metadata, as well as dealing with those 178 outliers individually. We still feel good about using EDTF – it’s a LOT better than our current date situation, and has some good room for improvement, as well. Pretty solid for a first date, I’d say.