Issue 399: Appellations that ARE URIs

Starting Date: 2018-11-10
Working Group: 3
Status: Open
Background: 

Posted by Martin on 10/11/2018

Dear All,

After completely rewriting the text about implementing the CRM in RDF, I have temporarily abandoned Google Docs. It is more efficient to split the topic and then recombine.

Here is my reformulation of the "punning" topic, the duality of Appellation, as a discussion item. Please check the open questions I pose at the end!

"In the CRM names are modelled as instances of  E41 Appellation. This class comprises any symbolic object used or created to name something without requiring further meaning. The CIDOC CRM version 6.2 defines  E41 Appellation, subclass of E90 Symbolic Object, as:

“This class comprises signs, either meaningful or not, or arrangements of signs following a specific syntax, that are used or can be used to refer to and identify a specific instance of some class or category within a certain context.

Instances of E41 Appellation do not identify things by their meaning, even if they happen to have one, but instead by convention, tradition, or agreement. Instances of E41 Appellation are cultural constructs; as such, they have a context, a history, and a use in time and space by some group of users. A given instance of E41 Appellation can have alternative forms, i.e., other instances of E41 Appellation that are always regarded as equivalent independent from the thing it denotes.”

The CRM is an ontology in the proper sense. Therefore, instances of physical things and phenomena of the physical world are regarded as the things themselves, not their machine representations, and any identifier or name used for something from the material world is different from the thing itself. For instance, I, Martin Doerr, am an instance of E21 Person, and not any of the URIs or records that may represent me in an information system. I am unique in this world, as is any particular thing, in contrast to representations of me.

In the CRM, the property “P1 is identified by”, from “E1 CRM Entity” to “E41 Appellation”, relates things to their names or identifiers.

In any knowledge representation schema, any item that cannot “reside” in the machine itself due to its nature, must be represented by one selected primary identifier, in the case of RDF by a URI. For an information system to be consistent with the described reality, these selected identifiers should map one-to-one to the ontological instances they stand for. Therefore, any instance of a class represented by a URI in RDF plays a dual role: it stands for the ontological instance and is an identifier for it (see also Meghini et al. 2014).

For practical reasons, we do not represent this duality by a recursive use of “P1 is identified by” from an instance to itself in its second capacity as an identifier. However, all other names and identifiers are related to the selected primary identifier via “P1 is identified by”. This implies that the choice of which of multiple identifiers is the primary one may be changed without changing the meaning. In contrast, owl:sameAs relates two primary URIs of things as different representations of the same real-world thing, aggregating the properties of both representations as valid for the real-world thing.
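The aggregating behaviour of owl:sameAs can be illustrated with a minimal sketch in plain Python, assuming triples are modelled as simple tuples. All ex: and other: URIs are hypothetical, and the helper functions are only an illustration, not a reasoner.

```python
# Triples as (subject, predicate, object) tuples.
# All ex:/other: URIs below are hypothetical illustration data.
P1 = "crm:P1_is_identified_by"
SAME_AS = "owl:sameAs"

triples = [
    ("ex:person/1", "rdf:type", "crm:E21_Person"),
    ("ex:person/1", P1, "ex:appellation/7"),
    ("other:md", "crm:P98i_was_born", "other:birth/1"),
    ("ex:person/1", SAME_AS, "other:md"),
]


def same_as_closure(node, triples):
    """Return every URI linked to `node` through owl:sameAs, in either direction."""
    seen, todo = {node}, [node]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != SAME_AS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    todo.append(b)
    return seen


def properties_of_real_world_thing(node, triples):
    """Aggregate the properties of all primary URIs that stand for the same thing."""
    aliases = same_as_closure(node, triples)
    return {(p, o) for s, p, o in triples if s in aliases and p != SAME_AS}
```

Note that the appellation reached via “P1 is identified by” is not part of the owl:sameAs closure: it names the person, it does not stand for the person.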

In practice, only the URIs, literals and datatypes “reside” themselves directly in a machine and need no additional identification because they are completely identified by their content.

We may distinguish four different kinds of Appellations: URIs, identifiers from local application contexts, literally defined names used in human written communication and names from oral communication and tradition. Typically, URIs and local identifiers have a unique representation as strings. However, the situation for names is more complex.

For instance, 北京 is a literally defined name for the capital of China. “Bei Jing” is meant to be a representation of the same name in Latin characters (underspecified without accent marks), and not another name for the same city. “Doerr is a respelling of Dörr, a German surname[1]”. The most elaborate and effective good practice for registering proper names comes from the library community (Doerr, Riva and Zumer 2012). The FRBR Review Group of IFLA decided for practical reasons to identify a name (“Nomen” in their terminology) by the identical sequence of characters in a given script, not by the binary encoding.
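The difference between "same character sequence" and "same binary encoding" can be sketched as follows. The use of Unicode normalisation (NFC) is my assumption of one way to compare character sequences; the FRBR Review Group decision as cited here does not prescribe a particular technique, and the function name is hypothetical.

```python
import unicodedata


def same_character_sequence(a: str, b: str) -> bool:
    """Compare two names as character sequences, ignoring how they are encoded.

    Assumption: NFC normalisation is taken as the notion of "identical
    sequence of characters"; this is an illustrative choice only.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "D\u00f6rr"     # "ö" stored as one code point
decomposed = "Do\u0308rr"  # "o" followed by a combining diaeresis

# The two encodings differ byte by byte ...
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")
# ... but they denote the same sequence of characters.
same_name = same_character_sequence(composed, decomposed)
# A respelling in other characters remains a different sequence.
respelling_is_same = same_character_sequence(composed, "Doerr")
```

Under this reading, the two encodings of “Dörr” count as one name, while “Doerr” stays a distinct character sequence that has to be related to it explicitly.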

For historical research, however, and in particular for capturing oral tradition, this definition is too narrow, and in relevant CRM applications we are confronted with names that have spelling variants and even spoken variants. All names that cannot be uniquely identified with a character sequence must be represented with a URI, and further descriptive properties must be added, preferably the newly proposed property “E90 Symbolic Object: has symbolic content”. Also, if someone wants to document facts about a name other than its spelling, a URI must first be assigned, because a character string itself cannot be referred to in RDF. This case must not be confused with documenting facts about the relation between a name and a particular carrier of that name, because that would be a reification of this relation, and not a statement about the name.

Summarizing, there are two cases:

a) A name or identifier is completely defined and identified by a character sequence or any digitally, unambiguously encoded symbol.

b) A name or identifier is identified but not defined by a URI.
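The two cases can be sketched with a toy triple store in plain Python. This is only an illustration under assumptions: the URI and Literal classes, the add_triple helper, all ex: URIs and the property name crm:has_symbolic_content (a placeholder for the proposed property) are hypothetical, and blank nodes are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


def add_triple(graph, s, p, o):
    """Append a triple, enforcing that a literal can never be a subject."""
    if not isinstance(s, URI):
        raise TypeError("the subject of a triple must be a URI, not a literal")
    graph.append((s, p, o))


graph = []
place = URI("ex:place/beijing")

# Case a): the name is completely defined by its character sequence.
add_triple(graph, place, URI("rdfs:label"), Literal("北京"))

# Case b): the name is identified by a URI, so facts about the name
# itself can be attached to that URI.
name = URI("ex:appellation/42")
add_triple(graph, place, URI("crm:P1_is_identified_by"), name)
add_triple(graph, name, URI("crm:has_symbolic_content"), Literal("北京"))
add_triple(graph, name, URI("ex:note"), Literal("hypothetical fact about the name"))
```

Trying to state something about the bare string, for example add_triple(graph, Literal("北京"), URI("ex:note"), Literal("...")), raises a TypeError in this sketch, which mirrors why a URI must first be assigned in case b).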

As a matter of fact, RDFS provides the property rdfs:label, which implements exactly the case a) above, without the possibility to add descriptions of the name itself. SKOS specializes rdfs:label into properties such as skos:prefLabel and skos:altLabel, which define indeed the names by which things are called by people. We take therefore the use of rdfs:label as existing good practice. Consequently, we have to regard rdfs:label as a special case of “P1 is identified by”, and all literals used as range instances of rdfs:label implicitly as instances of E41 Appellation (see section “RDF implementation tests” item 1.).
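What "a special case of" means for retrieval can be sketched as a walk up a subproperty chain. The mapping below encodes the SKOS subproperty statements together with the statement proposed here for rdfs:label; the data triples and ex: URIs are hypothetical illustration values.

```python
P1 = "crm:P1_is_identified_by"

# Assumed schema statements: child property -> parent property.
# The last entry is the interpretation proposed in this text.
SUBPROPERTY_OF = {
    "skos:prefLabel": "rdfs:label",
    "skos:altLabel": "rdfs:label",
    "rdfs:label": P1,
}

# Hypothetical instance data.
data = [
    ("ex:place/beijing", "rdfs:label", "北京"),
    ("ex:place/beijing", "skos:prefLabel", "Beijing"),
    ("ex:place/beijing", P1, "ex:appellation/42"),
    ("ex:place/beijing", "rdf:type", "crm:E53_Place"),
]


def is_subproperty(prop, ancestor):
    """True if `prop` equals `ancestor` or reaches it via the subproperty chain."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = SUBPROPERTY_OF.get(prop)
    return False


def identifiers_of(subject):
    """All objects reached from `subject` via P1 or any of its subproperties."""
    return [o for s, p, o in data if s == subject and is_subproperty(p, P1)]
```

With these assumptions, identifiers_of("ex:place/beijing") returns both label strings and the appellation URI, but not the class of the place.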

Unfortunately, our KR languages have not foreseen the case that an instance of a datatype is also an instance of a user-defined class. This causes a range conflict, which can be overcome by “punning” the range of “P1 is identified by” to be both rdfs:Literal and E41 Appellation (see section “RDF implementation tests” item 2.).

This recommended implementation allows for using both models for Appellations, via an additional URI or directly as literal, and returning with one query all range instances of “P1 is identified by” following this interpretation. The SPARQL query result separates URIs from literals automatically. So, there is no ambiguity about the nature of the result.

Only if the same name is described both directly via rdfs:label and indirectly via a URI would matching the two require another query.
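The two-step retrieval can be sketched in plain Python as follows. This is an illustration under assumptions, not the SPARQL implementation itself: the URI and Literal classes, the ex: URIs and the property name crm:has_symbolic_content (a placeholder for the proposed property) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str


P1 = URI("crm:P1_is_identified_by")
LABEL = URI("rdfs:label")
CONTENT = URI("crm:has_symbolic_content")  # placeholder property name

# Hypothetical data: the same name given once as a label and once via a URI.
graph = [
    (URI("ex:person/1"), LABEL, Literal("Doerr")),
    (URI("ex:person/1"), P1, URI("ex:appellation/doerr")),
    (URI("ex:appellation/doerr"), CONTENT, Literal("Doerr")),
    (URI("ex:appellation/doerr"), CONTENT, Literal("Dörr")),
]


def appellations_of(subject):
    """First query: all names of `subject`, split into URIs and literals."""
    uris, literals = [], []
    for s, p, o in graph:
        if s == subject and p in (P1, LABEL):
            (uris if isinstance(o, URI) else literals).append(o)
    return uris, literals


def match_literals_to_uris(uris, literals):
    """Second query: pair each literal name with a URI-described
    appellation that has the same symbolic content."""
    wanted = set(literals)
    return {
        (o.value, s.value)
        for s, p, o in graph
        if p == CONTENT and s in uris and o in wanted
    }
```

The first function returns the two kinds of result already separated by their type; only the pairing of a literal with its URI-described counterpart needs the second pass.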

So the frequently asked question remains: why not avoid this double definition and describe every instance of E41 Appellation via another URI? The answer is that the cases that require explicit representation of E41 Appellation are relevant but rare. On the other hand, good practice requires all nodes in a semantic graph represented by a URI to carry a human-readable label in addition. This means that such a “pure-logic-driven” decision would greatly increase storage volume and heavily hamper query performance.

The only ambiguity that remains is the case in which the instance of Appellation is literally the URI itself, and not a URI representing an Appellation of a different form. There are two solutions to this problem: either we classify this URI by the class of things it identifies and use owl:sameAs, or we define a specific subclass of E41 Appellation, “URI”.

Another question is whether labels used for the readability of the semantic graph should be distinguished from names used in the world referred to.

Tests:

asking for the subproperties of rdfs:label as follows: You may see the answer here