What’s in a placename?

New York. Paris. London. Saying these names likely evokes a sense of place in many people. Maybe what’s evoked is the knowledge of a place on a map, a sense of the culture there, or the memory of a trip. But what do these toponyms we can so casually reference actually mean? Do I mean “London” or “London”? “New York” or “New York”? If I were in my home state of Kentucky, I might well mean London and Paris. This gets even trickier when you consider the shifting cultural contexts (and even geography) introduced by thinking about historical placenames over time. In what sense is ancient Lutetia modern Paris?

The purpose here is not to retread well-trod philosophical ground, but rather to highlight the sorts of very real problems that can confront us when trying to align multiple datasets containing placenames. This is important for us at DC3 because we want to align multiple epigraphic databases containing a variety of forms of placenames; moreover, aligning these placenames to other databases which include actual geospatial information will allow querying and visualization of the data in ways that are not easily possible now. One can imagine looking for inscriptions found within some radius of an ancient or modern city, or creating a map showing the geographical distribution of all inscriptions in the database, or a visualization which illustrates the relationships between the findspot of an inscription and the placenames mentioned in its text, and so on.

One component of this involves aligning names in Pleiades and GeoNames, allowing us to get a “free” mapping to the other resource wherever we have a relationship to only one, and greatly expanding our graph of knowledge. The machine-automated process for this, known as “Pleiades+”, simply uses a combination of string-matching and geospatial filtering to try to find likely matches between the two resources. But many of these matches may be erroneous under various criteria: multiple similarly-named places within a certain radius of one another may all be matched to one another, for example, or a city in one resource may be matched both to a city and to an administrative region in the other.
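To make the flavor of that automated process concrete, here is a minimal sketch of name-plus-distance candidate matching. The record layout, identifiers, and 10 km radius are illustrative assumptions, not the actual Pleiades+ parameters or data.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_matches(pleiades_places, geonames_places, radius_km=10.0):
    """Pair places whose names match exactly (case-insensitive) and whose
    coordinates fall within radius_km of each other."""
    matches = []
    for p in pleiades_places:
        for g in geonames_places:
            if p["name"].lower() != g["name"].lower():
                continue
            if haversine_km(p["lat"], p["lon"], g["lat"], g["lon"]) <= radius_km:
                matches.append((p["id"], g["id"]))
    return matches

# Hypothetical records, not real Pleiades/GeoNames data:
pleiades = [{"id": "pleiades:123", "name": "Lutetia", "lat": 48.85, "lon": 2.35}]
geonames = [
    {"id": "geonames:1", "name": "Lutetia", "lat": 48.86, "lon": 2.34},
    {"id": "geonames:2", "name": "Lutetia", "lat": 50.00, "lon": 8.00},  # same name, too far away
]
print(candidate_matches(pleiades, geonames))
# → [('pleiades:123', 'geonames:1')]
```

Note that nothing in this sketch can tell the two nearby same-named places apart, which is exactly the kind of ambiguity described above.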

As with the problem Hugh discussed in the previous post, certain rules can handle some of these cases, but others require a human to make the decision. As a result, we’ve adapted the excellent gazComp tool developed at Perseus to work through the list of Pleiades+ candidate matches and allow quick visualization and voting for each match. The process of developing and using the tool on real data has also turned up various kinds of ambiguities like those discussed before: what, exactly, do we mean by a “match”? For example, publications may occasionally use the name of the nearest modern city interchangeably with the actual archaeological site name, and GeoNames may have records for both the city and the site, or for only one of them. Any solution causes a certain amount of anxiety, as what’s “right” may depend on a variety of contexts: the context of the place, of the placename mention, of how these “matches” will be used, and so on. There’s not one perfect answer for all cases. What we hope to accomplish is not perfection, but to move pragmatically toward improvement.
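One way to picture the split between rule-decidable cases and human-decidable ones is a simple triage step. This is a hedged sketch, not the actual Pleiades+ or gazComp logic; the `triage` function and the `pleiades_type` values are hypothetical, though the GeoNames feature classes it leans on are real (`P` marks populated places, `A` marks administrative divisions).

```python
def triage(pleiades_type, geonames_candidates):
    """Return ('auto', match) when a simple rule can decide, or
    ('human', candidates) when a person must vote.

    geonames_candidates: list of (geonames_id, feature_class) pairs, where
    GeoNames feature class 'P' = populated place, 'A' = administrative division.
    """
    if pleiades_type == "settlement":
        populated = [c for c in geonames_candidates if c[1] == "P"]
        # A settlement matched against one city and some admin regions:
        # a rule can safely prefer the lone populated place.
        if len(populated) == 1:
            return ("auto", populated[0])
    # Multiple similarly-named towns, or anything else: defer to a human.
    return ("human", geonames_candidates)

# Rule-decidable: a settlement matched to a city and an admin region.
print(triage("settlement", [("geonames:1", "P"), ("geonames:2", "A")]))
# → ('auto', ('geonames:1', 'P'))

# Not rule-decidable: two nearby populated places with similar names.
print(triage("settlement", [("geonames:1", "P"), ("geonames:3", "P")]))
# → ('human', [('geonames:1', 'P'), ('geonames:3', 'P')])
```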

In that spirit, we’ve placed a publicly accessible instance of the Pleiades+ gazComp voting tool online. Currently, it requires sign-in with a Google account for vote attribution. Eventually, we will incorporate the results of these votes into the Pleiades+ output, so that anyone can use them. Additionally, if you add places to GeoNames when you come across an erroneous match in the voting tool (ancient ruins clearly visible on the satellite imagery with no marker in GeoNames, for example), those new records will eventually get picked up by the automated Pleiades+ process and be fed as candidate matches into the voting pool. The hope is that this process will also allow us to broaden the pool of Pleiades+ match candidates without making the data meaningless; once we have good vote coverage for this initial set, we can start to add in matches such as those from substring rather than exact string matching, which doubles the number of candidates.
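The effect of loosening exact matching to substring matching can be seen even on a toy example. The names below are illustrative, not drawn from either gazetteer, and real substring matching would still be combined with the geospatial filter described earlier.

```python
def exact_match(a, b):
    """Case-insensitive exact name comparison."""
    return a.lower() == b.lower()

def substring_match(a, b):
    """Case-insensitive match if either name contains the other."""
    a, b = a.lower(), b.lower()
    return a in b or b in a

# Toy name lists, for illustration only:
ancient = ["Roma", "Lutetia"]
modern = ["Roma", "Romagnano", "Paris"]

exact_pairs = [(a, m) for a in ancient for m in modern if exact_match(a, m)]
sub_pairs = [(a, m) for a in ancient for m in modern if substring_match(a, m)]

print(exact_pairs)  # → [('Roma', 'Roma')]
print(sub_pairs)    # → [('Roma', 'Roma'), ('Roma', 'Romagnano')]
```

The substring pass keeps every exact match and adds new (and often spurious, as with “Romagnano” here) candidates on top, which is why good vote coverage on the initial set matters before widening the net.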

We need your votes! If you have any questions or run into any problems, feel free to leave a comment on this post, drop us a line, or use the gazComp issue tracker on GitHub.