Searching the DDbDP (Or, How Fine are a Balrog’s Teeth?)

A DDbDP user recently wrote with the following good question.

“Several years ago, when I searched for the phrase “ενταγιον εμου”, the PN returned P.Oxy. X 1326, PSI I 36 & SB XXII 15268 among the hits. When I perform the same search now, these papyri no longer appear. They all have in common a misspelling of ενταγιον: εντ<α>γιον (P.Oxy. X 1326), εντακιων (PSI I 36) & ενταγιων (SB XXII 15268).”

We not only like but need questions like this. This is how we improve the PN. So, keep them coming. The answer to this one is both simple and complicated. For the simple version, skip to the end. For the complicated one, read on. [Warning: if you are expert in text-searching matters, the following will seem dull and simplistic. But if you are a papyrologist it might help to explain why the PN works the way it does.]

***

First, the user was not hallucinating; these texts exist:

  • http://papyri.info/ddbdp/p.oxy;10;1326
    • Line 1 has: “ἐντγιον(*) ἐμο̣ῦ”  in the text and “l. ἐντ<ά>γιον” in the app, which is rather crummy; “ἐντ<ά>γιον” alone in the text, without app entry, might have been sufficient, even better.
  • http://papyri.info/ddbdp/psi;1;36
    • Line 1 has: “ἐντάκιων(*) ἐμοῦ” in the text and “l. ἐντάγιον” in the app, which is pretty clear
  • http://papyri.info/ddbdp/sb;22;15268
    • Line 1 has: “ἐντάγιων(*) ἐμοῦ” in the text and “l. ἐντάγιον” in the app.

Now, several years ago, when our user first ran this search, the underlying encoding, at PSI I 36.1, looked like this:

<choice>
<reg>ἐντάγιον</reg>
<orig>εντακιων</orig>
</choice> ἐμοῦ

The DDbDP used to display the ‘reg(ularized)’ form up in the text, and the ‘orig(inal)’ text down in the app: “ἐντάγιον” up above and “εντακιων pap.” down below. Papyrologists, who revel in Greek as written, did not like this. So, a couple years ago, we changed the styling in order to render the original reading up in the text and the regularized reading down in the app.

This took some work. The DDbDP’s practice had been to include diacriticals only on the regularized reading and not on what the scribe wrote, regardless of what the editor printed. This meant two things (1) We could not (and still cannot) know from the DDbDP data alone what the original editors printed in text and/or apparatus. This is a shame (although it is correctable). (2) Some 90,000 origs lacked diacriticals! So, we added them, programmatically (Faith Lawrence and Gabby Bodard, both of KCL DDH, did a fantastic job with this).

In this particular case, both we and the original editors treat ἐντάκιων as a phonetic representation of ἐντάγιον. So, we produced:

<choice>
<reg>ἐντάγιον</reg>
<orig>ἐντάκιων</orig>
</choice> ἐμοῦ

Today, the DDbDP has “ἐντάκιων” and the app indicates “l. ἐντάγιον”.

Understanding this backstory is essential to understanding why our user’s experience a few years ago was different.

***

Remember, before we changed editorial practice as regards reg/orig, the encoding was

<choice>
<reg>ἐντάγιον</reg>
<orig>εντακιων</orig>
</choice> ἐμοῦ

The phrase that appeared in the text was ἐντάγιον ἐμοῦ. The search index:

  • knew that this text contained the word ἐντάγιον and the PN could find it
  • knew that this text contained the word εντακιων and the PN could find it
  • knew that this text contained the phrase ἐντάγιον ἐμοῦ and the PN could find it (as our user correctly recalls)
  • DID NOT know that this text contained the phrase εντακιων ἐμοῦ

Remember, the current encoding is

<choice>
<reg>ἐντάγιον</reg>
<orig>ἐντάκιων</orig>
</choice> ἐμοῦ

The phrase that appears in the text is ἐντάκιων ἐμοῦ. The search index:

  • knows that this text contains the word ἐντάγιον and the PN can find it
  • knows that this text contains the word ἐντάκιων and the PN can find it
  • knows that this text contains the phrase ἐντάκιων ἐμοῦ and the PN can find it
  • DOES NOT know that this text contains the phrase ἐντάγιον ἐμοῦ (the PN cannot find it, as our user was surprised to discover)

Thus, the PN search is now better in one way and worse in another! It knows about all of the same discrete words, but where phrases are concerned it now does a better job with what the scribe wrote, and a poorer job with the modern normalized representation. This may be a good trade, from a papyrological point of view, but it is still a trade.
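To make the trade concrete, here is a toy positional index in Python. It is only a sketch under a stated assumption (the form shown in the text gets a position in the line, while the form shown in the app is recorded as a bare word); the function names are invented for illustration and this is not how the PN index is actually implemented.

```python
from collections import defaultdict

# Illustrative tokens only; the real PN index is far more elaborate.
REG, ORIG, NEXT = "ἐντάγιον", "ἐντάκιων", "ἐμοῦ"

def build_index(tokens):
    """tokens: list of (form shown in the text, form shown in the app or None)."""
    positions = defaultdict(set)  # word -> positions in the displayed text
    words = set()                 # every word known for this text
    for pos, (shown, in_app) in enumerate(tokens):
        positions[shown].add(pos)
        words.add(shown)
        if in_app is not None:
            words.add(in_app)     # known as a word, but it has no position
    return words, positions

def has_word(index, word):
    return word in index[0]

def has_phrase(index, phrase):
    """True if the words of the phrase occupy consecutive positions."""
    positions = index[1]
    first, *rest = phrase.split()
    return any(
        all(start + i in positions.get(w, ()) for i, w in enumerate(rest, 1))
        for start in positions.get(first, ())
    )

old = build_index([(REG, ORIG), (NEXT, None)])  # reg displayed in the text
new = build_index([(ORIG, REG), (NEXT, None)])  # orig displayed in the text

print(has_phrase(old, f"{REG} {NEXT}"), has_phrase(new, f"{REG} {NEXT}"))
print(has_word(new, REG), has_phrase(new, f"{ORIG} {NEXT}"))
```

Both toy indexes know both words; they differ only in which form has a position, and that alone decides which phrase can be found.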

***

Ok, but why isn’t the PN search as smart as we are? Two answers: (1) because you are just smarter. (2) Actually, maybe you’re not.

In this particular case, the index ‘knows’ that ἐμοῦ immediately follows ἐντάκιων. We humans know that ἐμοῦ immediately follows ἐντάκιων on the papyrus, but that it also follows ἐντάγιον in another, constructed sense. Can we ask the indexer to ‘know’ as much as we do? Yes, sort of. Can we ask it to treat all reg/orig pairs as simultaneously occupying the same position in the line? Yes, but only to a point.

Suppose…

  • a scribe wrote: Abe’s dog has fire teeth.
  • but meant to write: Abe’s dog has fine teeth.
  • the editor prints: Abe’s dog has fire (l. fine) teeth.
  • we encode: Abe’s dog has <choice><reg>fine</reg><orig>fire</orig></choice> teeth.

If we want to be able to support proximity searches against all possible words in this sentence, we must in effect index both possible sentences, 10 words instead of 5.

  1. Abe’s dog has fire teeth.
  2. Abe’s dog has fine teeth.

The more reg/orig pairs a text has, the greater the number of possible sentences, and the larger the number of index versions that we must maintain. The increase is exponential: each additional independent pair doubles the count. If “Abe’s” was itself regularized from “Ave’s” we would have to index this single sentence four times.

  1. Abe’s dog has fire teeth.
  2. Abe’s dog has fine teeth.
  3. Ave’s dog has fire teeth.
  4. Ave’s dog has fine teeth.
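The doubling can be sketched in a few lines of Python. This is a toy model, not PN code: each position in the sentence is a "slot" holding one or more alternative forms, and the names `sentence` and `all_versions` are made up for the illustration.

```python
from itertools import product

# Each slot is a tuple of alternatives; a plain word has just one.
sentence = [("Abe's", "Ave's"), ("dog",), ("has",), ("fire", "fine"), ("teeth",)]

def all_versions(slots):
    """Every sentence obtainable by picking one alternative per slot."""
    return [" ".join(choice) for choice in product(*slots)]

versions = all_versions(sentence)
for v in versions:
    print(v)

# With n independent two-way pairs the count is 2**n.
for n in (1, 2, 3):
    print(n, "pairs ->", 2 ** n, "versions")
```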

Remember also that reg/orig expressions can address strings that can be complicated. Say a scribe writes κεγο for κεγω, which is regularized to καὶ ἐγώ. For such a regularization three possible strings occupy the same position in a line, but one of them is two words and two of them are one!

Now imagine that

  • a scribe wrote: Abe’s dog has fire tooth.
  • but meant to write either: Abe’s dog has a fine tooth.
  • or: Abe’s dog has fine teeth.
  • the editor prints something like: Abe’s dog has fire tooth (l. <a> fine tooth, or fine teeth).
  • we encode: Abe’s dog <choice><reg><app type="alternative"><lem>has <supplied reason="omitted">a</supplied> fine tooth</lem><rdg>has fine teeth</rdg></app></reg><orig>has fire tooth</orig></choice>.
  • The Leiden+ expression of this bit of EpiDoc is much easier to take in: Abe’s dog <:<:has <a> fine tooth|alt|has fine teeth:>|reg|has fire tooth:>.

Now imagine that a subsequent editor, M. Smith, revisits the manuscript and reads: A Balrog has firey (l. fiery) tooth (l. teeth). The encoding for this correction will be:

  • <app type="editorial"><lem resp="M. Smith">A Balrog has <choice><reg>fiery</reg><orig>firey</orig></choice> <choice><reg>teeth</reg><orig>tooth</orig></choice></lem><rdg resp="Original editor">Abe’s dog <choice><reg><app type="alternative"><lem>has <supplied reason="omitted">a</supplied> fine tooth</lem><rdg>has fine teeth</rdg></app></reg><orig>has fire tooth</orig></choice></rdg></app>
  • And in Leiden+: <:A Balrog has <:fiery|reg|firey:> <:teeth|reg|tooth:>=M. Smith|ed|Abe’s dog <:<:has <a> fine tooth|alt|has fine teeth:>|reg|has fire tooth:>=Original editor:>

If firey/fiery is a common regularization and tooth/teeth is as well, and if we want our users to be able to search for all combinations of this phrase, then we must index the correction alone four times (2 pairs, each with a reg and an orig: 2 x 2 = 4 possible combinations):

  1. A Balrog has firey tooth. [indexing orig | orig ]
  2. A Balrog has firey teeth. [indexing orig | reg ]
  3. A Balrog has fiery tooth. [indexing reg | orig ]
  4. A Balrog has fiery teeth. [indexing reg | reg ]

If we generate only two versions of the text in the index–one that includes origs but not regs and another that includes regs but not origs–then when someone searches for “fiery tooth” (a plausible phrase among students of Balrogs) s/he will not find this text. And perhaps Smith was right to read “Balrog” but wrong about their “fiery” teeth. Maybe this Balrog has “fine” teeth. If another user wants to know how fine a Balrog’s teeth are, and so wants to search for “fine” in proximity to “Balrog”, the index must include one version for every possible combination of strings not only in Smith’s corrected text but also in that of the original edition.
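A rough Python illustration of the “fiery tooth” problem follows. It assumes, purely for the sake of the example, that each index version is a flat string and that a phrase search is a substring test; a real index works on word positions, not substrings.

```python
from itertools import product

# Each slot holds (orig, reg); plain words simply repeat the same form.
slots = [("A", "A"), ("Balrog", "Balrog"), ("has", "has"),
         ("firey", "fiery"), ("tooth", "teeth")]

# Two-version approach: one all-orig text and one all-reg text.
all_orig = " ".join(orig for orig, _ in slots)
all_reg = " ".join(reg for _, reg in slots)
two_versions = {all_orig, all_reg}

# Exhaustive approach: every mix of orig and reg forms.
every_version = {" ".join(words) for words in product(*(set(s) for s in slots))}

def finds(versions, phrase):
    """Crude phrase search: substring test against each stored version."""
    return any(phrase in version for version in versions)

print(finds(two_versions, "fiery tooth"))   # mixed reg + orig phrase
print(finds(every_version, "fiery tooth"))
print(len(two_versions), len(every_version))
```

The mixed phrase is only reachable when the index holds all four combinations, which is exactly the cost described above.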

Do the math. How many versions of the index do we need to create in order to accommodate all possible combinations of words presented by the two competing constructions of this five-word (or is it six-word?) sentence? How many words separate “A” from “fine”? Still wonder why the search engine isn’t as smart as you? There are only contingent answers. This is not easy. The PN search works as well as it does thanks to the industry and genius of Tim Hill and Hugh Cayless. But what is intrinsically complicated will likely stay that way.

In order to be able to deliver searches that specify distance between any two possible words that appear in any editorial construction of a given DDbDP text, we would need to have as many parallel indexes of that text as there are possible combinations of reg/orig pairs (and also alternate readings, and also BL corrections and their deprecated readings, and also with abbreviations expanded and unexpanded, and so on). Even if we were to create parallel indexes only to accommodate reg/orig pairs, the burden might still be more than we could serve in an ordinary production environment: to put it simplistically, a text with a dozen simple reg/orig pairs would require 2^12 = 4,096 parallel indexes.

***

But in the meantime, what’s a papyrologist to do?

First, search the DDbDP for ενταγιον εμου (without quotes); this will find all texts that contain both words, in any position. In other words, this finds A+B not “A B”. It will catch all three of the examples that our user asked about. Then, walk down the list and strike those ‘hits’ that you do not want; it takes a few extra seconds, but probably no more than it would take to craft one single, perfect query (if such were even possible).

Also, if there are common variants, run multiple queries at once. Enter

  • “εντακιων εμου” [in the first search box]
  • OR “ενταγιον εμου” [in the second search box]

…and so on. Or, using wildcards, search for εντα?ι?ν εμου, which will find ενταγιον, εντακιων, ενταχιον, vel sim.
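For the curious, the wildcard idea can be imitated in a few lines of Python. This is a sketch only: it assumes that “?” stands for exactly one character (as the examples above suggest) and that accents and breathings are ignored when matching. The helper names are invented, and nothing here claims to reproduce the PN’s own matching.

```python
import unicodedata
from fnmatch import fnmatchcase

def strip_diacritics(text):
    """Decompose, then drop combining marks (accents, breathings)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(pattern, word):
    """Shell-style match: '?' is any single character, '*' any run."""
    return fnmatchcase(strip_diacritics(word).lower(), pattern)

pattern = "εντα?ι?ν"
for word in ["ἐντάγιον", "ἐντάκιων", "ενταχιον", "ἐμοῦ"]:
    print(word, matches(pattern, word))
```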

Bottom line: something as simple as searching for a couple contiguous words is not simple. And in the inherently complex, unstable, and variant-rich world of papyrological documents, very little is simple. No search engine can erase that inherent complexity. No matter how much the PN improves, it will almost always be best to attack questions with multiple searches and a variety of strategies.

***

And what comes next? In the short term, we aim to generate a few concurrent indexes to the DDbDP, perhaps:

  1. text including (1) original readings, but not their regularized forms (reg/orig), (2) original erroneous readings, but not their corrected forms (corr/sic), (3) corrections to texts (from BL or PE), (4) expanded abbreviations
  2. text including (1) original readings, but not their regularized forms (reg/orig), (2) original erroneous readings, but not their corrected forms (corr/sic), (3) corrections to texts (from BL or PE), (4) unexpanded abbreviations
  3. text including (1) regularized readings (reg/orig), (2) corrected readings (corr/sic), (3) deprecated readings, (4) expanded abbreviations

Roughly speaking, you can think of the first two as indexes of what we put in the text, and the third as a slightly less complete index of what we put in the app.

We have also started thinking about an altogether different approach (which serves different goals as well). If we were to generate a comprehensive, curated index of unambiguous phonetic variants attested in the papyri (e.g. ἐντάγιον=ἐντάκιον), then we could automatically run concurrent searches (e.g. “ἐντάγιον ἐμοῦ” OR “ἐντάκιον ἐμοῦ”), whenever users query phrases containing one of the indexed pairs. So, a user searches for “ἐντάγιον ἐμοῦ” and we return texts with “ἐντάκιον ἐμοῦ” as well. One day.
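In code, the idea might look something like the following Python sketch. The variant table here is hypothetical and holds only the one pair mentioned above; `expand_query` is an invented name, and a real table would need curation and probably lookups in both directions.

```python
import unicodedata
from itertools import product

def nfc(text):
    """Normalize so that visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)

REG, VARIANT, NEXT = nfc("ἐντάγιον"), nfc("ἐντάκιον"), nfc("ἐμοῦ")

# Hypothetical curated table: normalized form -> attested phonetic variants.
VARIANTS = {REG: [VARIANT]}

def expand_query(phrase):
    """Return the phrase plus every phrase obtained by swapping in variants."""
    slots = [[w] + VARIANTS.get(w, []) for w in map(nfc, phrase.split())]
    return [" ".join(choice) for choice in product(*slots)]

# The expanded phrases would then be run as one OR query.
for query in expand_query(f"{REG} {NEXT}"):
    print(query)
```

Note that the expansion happens at query time, so the index itself would not need to grow.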

Neither is a comprehensive ‘fix.’ But either one would have let our user find ἐντάγιον ἐμοῦ at PSI I 36.