Issue 389: Label-free RDF classes

Starting Date: 
2018-07-19
Working Group: 
3
Status: 
Open
Background: 

Posted by Martin on 19/7/2018

Dear All,

The current text "Expressing the CIDOC Conceptual Reference Model in RDF"
(https://docs.google.com/document/d/1zCGZ4iBzekcEYo4Dy0hI8CrZ7dTkMD2rJaxa...)

contains the phrase:
"In addition, for convenience of implementation we have defined number-only classes and properties e.g. “E63” or “P2”, and declared each of them to be equivalent to the corresponding full form"

In the past, this option was provided and widely rejected by users. I do not know of any installation using it.

It was proposed again because CRM-SIG reserves the right to change labels without changing the code ("E63", "P2" etc.), in cases when the meaning is preserved but the existing label causes confusion and can be replaced by a more fitting or at least less confusing one. These changes are very rare, and explicit in the amendment of the respective version.

Those of you who support:
"In addition, for convenience of implementation we have defined number-only classes and properties e.g. “E63” or “P2”, and declared each of them to be equivalent to the corresponding full form"
please vote YES.

Those of you who support:
"The English label is part of the definition of the RDF classes and properties. Number-only classes and properties e.g. “E63” or “P2”, are not provided". Other means of supporting migration between label versions to be discussed.

please vote NO.

(Those who believe the issue is not sufficiently formulated please vote "VETO". One or more "VETO" will stop the e-mail vote as a whole and postpone it to the next physical meeting)

Current Proposal: 

Posted by Robert Sanderson on 19/7/2018

Vote: NO

Having both creates incompatibility between systems that use one or the other. Every query, in a system that does not do inferencing, would need to test for each class. The representations of the data would also be wildly inconsistent, making the simple consumption of a representation of the data significantly harder.
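
The point about systems that do no inferencing can be illustrated with a minimal sketch. It assumes two hypothetical records, one typed with the full form and one with the number-only form, held as plain (subject, predicate, object) tuples. The ex: names, the crm: prefix and the helper instances_of are invented for illustration and are not part of any CRM tooling.

```python
RDF_TYPE = "rdf:type"

# Hypothetical data from two systems, each using a different form of the same class.
triples = {
    ("ex:vase", RDF_TYPE, "crm:E22_Man_Made_Object"),
    ("ex:coin", RDF_TYPE, "crm:E22"),
}


def instances_of(triples, *classes):
    """Return the subjects typed with any of the given classes (no inferencing)."""
    wanted = set(classes)
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o in wanted)


# Asking for one form only misses the record that uses the other form.
print(instances_of(triples, "crm:E22_Man_Made_Object"))
# Without inference, the query has to name every equivalent form explicitly.
print(instances_of(triples, "crm:E22_Man_Made_Object", "crm:E22"))
```

A store that applies the declared equivalences would return both records for either form; the sketch only shows what happens when it does not.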

Posted by Melanie Roche on 20/7/2018

Dear Martin, dear all,

On behalf of BnF, I vote YES.

The DOREMUS project demonstrated that label-free classes and properties are very important, as renaming them is very costly, as well as damaging for dissemination. For instance, the renaming of "M14_Medium_Of_Performance" into "M14_Medium_of_Performance" (no capital "o") was a nightmare for our IT people as well as for those users who had already begun disseminating our data.
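
One possible way to soften such a rename can be sketched as follows, under the assumption that the leading code (here "M14") stays stable while the label part changes. The regular expression, the helper names code_of and migrate, and the mapping table are illustrative assumptions, not something DOREMUS or CRM-SIG provides.

```python
import re

# A leading code such as "E63", "P2" or "M14", followed by "_" or the end of the name.
CODE_RE = re.compile(r"^([A-Z]+\d+[a-z]*)(?:_|$)")


def code_of(local_name):
    """Extract the stable code from a local name, or None if there is none."""
    match = CODE_RE.match(local_name)
    return match.group(1) if match else None


def migrate(local_name, current_names):
    """Map a local name from any label version to the current one, keyed by its code."""
    return current_names.get(code_of(local_name), local_name)


# Hypothetical table of current names, keyed by code.
current = {"M14": "M14_Medium_of_Performance"}

print(migrate("M14_Medium_Of_Performance", current))
```

Names whose code is not in the table, or that carry no code at all, are returned unchanged.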

Plus, opaque names encourage users to go and read the documentation so as to really understand the scope of what they are using, instead of just assuming the meaning of a class or property based on the label only.
That is especially true for non-native English speakers: I believe supporting multilingualism was precisely the reason why WikiData opted for opaque codes, and I think this was a sound decision.

Posted by Richard Light on 21/7/2018

I vote YES.

Richard

Posted by Thanasis on 23/7/2018

This is a difficult one. I agree with Melanie. I also agree with Rob's
point about implementation but I think rewriting a query from full
code-label to code-only is not too much of an overhead. Plus we should
encourage inference.

So I vote YES. 

Posted by Jutta Lindenthal on 23/7/2018

I vote YES.

Posted by Robert Sanderson on 23/7/2018

Given that I seem to be in the vast minority, I propose (assuming that the vote is carried as yes) that we introduce further conveniences, following from the same rationale of it being easy to rewrite queries per server, and wanting to require inferencing:

·         All of the terms /without/ the numbers (e.g. Material as a more convenient, easier to remember form of E57)

·         All of the previous names for terms (e.g. both E78_Collection and E78_Curated_Holding) to solve the cross-version issue

·         Property names without the inconsistent and difficult to remember has_ / had_ / is_ / was_ prefixes

Posted by Robert Sanderson on 23/7/2018

I agree that the opaque terms equally disadvantage everyone. No one has any benefit at all.

Following this line of thinking, we could be even more opaque by using UUIDs instead of almost memorable numbers, requiring that everyone use software tools to work with the CRM and no one could look at the data directly with any way of understanding it without those internationalized and accessible tools.  This would enforce the best practice of multilingualism and not unfairly advantage people who have memorized the numbers or can speak English.

Posted by Detlev Balzer on 23/7/2018

I also vote YES.

Where natural-language names are desired for readability, why not allow for any number of non-normative label sets? This would put Chinese or Armenian class and property names on a par with the English ones, without compromising interoperability as long as the distinction between URI and label is kept in mind.

Posted by Richard Light on 23/7/2018

Rob,

I don't think that's particularly fair. The goal is not, as I am sure you realise, to make CRM-encoded data as obscure as possible.  It is to ensure that the Linked Data identifiers which we publish are as persistent as they need to be, in the context of cultural heritage documentation which may be around long after we're all dead.

Posted by Richard Light on 23/7/2018

On 23/07/2018 17:02, Detlev Balzer wrote:
> I also vote YES.
>
> Where natural-language names are desired for readability, why not allow for any number of non-normative label sets? This would put Chinese or Armenian class and property names on a par with the English ones, without compromising interoperability as long as the distinction between URI and label is kept in mind.

We already have multilingual labels (seven languages) appearing in the January 2015 version of the RDF, which could be picked up and used in a language-sensitive user interface for CRM instances.  At present the detailed definition of each class and property is an English-only <rdfs:comment>, but there is no reason why the RDF could not include multilingual definitions as well.
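
How a language-sensitive interface might pick up such labels can be sketched as below. The label table is a hand-written placeholder rather than an extract from the January 2015 RDF, and the function display_label and its fallback to the bare code are assumptions made for illustration.

```python
# Placeholder labels keyed by code and language tag; not copied from the published RDF.
LABELS = {
    "E22": {
        "en": "Man-Made Object",
        "de": "Künstlicher Gegenstand",
        "fr": "Objet fabriqué",
    },
}


def display_label(code, preferred_langs, labels=LABELS):
    """Return the first available label in the user's language order, else the code."""
    by_lang = labels.get(code, {})
    for lang in preferred_langs:
        if lang in by_lang:
            return by_lang[lang]
    return code  # no label in any requested language: fall back to the identifier


print(display_label("E22", ["el", "de", "en"]))
```

The identifier stays the same whichever label is shown, which is the distinction between URI and label that Detlev mentions.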

Posted by Robert Sanderson on 23/7/2018

Hi Richard,

Thanks for bringing up persistence, as I think it’s an important discussion with direct relevance on this topic.  I don’t think that URI persistence is a function of form, obscurity or language, but of the will of the maintainers to keep the URI available.  http://cidoc-crm.org/ns/E22_Man_Made_Object can be just as persistent as http://cidoc-crm.org/ns/E22.  Or http://cidoc-crm.org/ns/A3F59536-4C30-4AD1-AEC7-E9EEC93411A5.

As is frequently demonstrated, URI persistence is made more likely by use. Use is more likely when the URI is usable:

Usability is the degree to which a software can be used by specified consumers to achieve quantified objectives with effectiveness, efficiency, and satisfaction in a quantified context of use. [Wikipedia definition of Usability]

In other words, usability is a metric based on how well the audience can effectively, efficiently and happily achieve their objectives.  You can either make many people happy by including the labels, or you can make everyone equally unhappy by not including them.

The arguments in favor of having the codes as separate terms in fact *double the cost of that persistence*, and open the door to multiplying it many times by requiring the persistence of both E78_Collection and E78_Curated_Holding, of M14_Medium_Of_Performance and M14_Medium_of_Performance.

If you believe that the aim is to ensure persistence, then you should be in favor of as few terms to maintain as possible, thereby maximizing the likelihood that they will survive.  And given the above, you should pick the term with the label included over the term without the label. That is, in fact, exactly the logic that led to my vote of no :)

Secondly, if you think that the terms should be understandable for as long as possible, then “E22” is much less likely to be understood in the future than “E22_Man_Made_Object”. The code alone *requires* the documentation in order to be understood. The form with the label can be understood by approximately 20% of the world’s population and is unlikely to die out any time soon. 80% of linguistic content on the web is in English. English is typically treated as the lingua franca of computing, and especially on the web given its origins and ASCII-centric early nature. Yes, that’s not fair, diverse or equitable, but it’s also impossible to dispute. 

Posted by Francesco Beretta on 23/7/2018

Dear all,

I also vote YES.

Furthermore, I'd also like to stress the importance of distinguishing between the identifier, which must be stable during the whole life of a class or property, and the label(s), which can be multiple, multilingual and evolve, as everyone knows.

The meaning of the class or property, as it was already stressed on this list, is provided by the scope note and, in fact, by the scope note AND the version (or namespace) of CRM. Strictly speaking it's always about a class in a specific CRM version: "This is the scope note of E59 Primitive Value of the CIDOC CRM version 6." (cf. note 7 in the document under discussion).

Labels are often confusing. Therefore, it is not in my opinion just for "convenience of implementation" (as the new document states) that the RDF serialisation should define "number-only classes and properties" but it is something fundamental. Therefore, in my opinion, the alphanumeric form E7 should be the preferred one in the URI, and of course the URI with labels, insofar as used in earlier versions, maintained as a condition sine qua non of interoperability.

At the same time, the human should always be provided with an easy way of retrieving the label(s) for his/her convenience. This is not to be provided, in my opinion, by the RDF serialisation as a static file (which will of course contain the labels) but by a dereferencing service implemented as a web service where you can send the URI of the class, or property, and receive a web page for the human to read, like this http://ontologies.dataforhistory.org/class/7 but devoted to the whole CIDOC CRM community and dereferencing the specified identifiers, like the Agent class in the DBPedia ontology.

The dereferencing page by DBPedia of Agent shows, in my opinion, the limits of an identification for the class provided by the label: the label in the URI will remain forever even if a better one is found for the class, problems could be raised with disambiguation (at least in the human mind, not by the machine), etc. On this same page, the property owl:equivalentClass shows the solution by Wikidata mentioned by Melanie, which is evidently more robust: https://www.wikidata.org/wiki/Q24229398. Of course this solution, like the "http://www.cidoc-crm.org/cidoc-crm/E7" URI form, needs double dereferencing, for the human and for the machine in the form of a data stream.

Therefore, in my opinion, in the context of the semantic web the issue of the ongoing discussion is much more about having a URI dereferencing service than adding labels to URI specifications in static documents. REF documents are useful for collective memory and experts, but in everyday life web services are more effective and useful: just write the URI, and you'll get the answer in tenths of a second.

In this same context, the CRM version number should also always be provided in the URI, e.g. http://www.cidoc-crm.org/cidoc-crm/6.2/E7, because the scope note and labels depend on the version; they are not absolute over the whole class (or property) history. A URL redirection could easily lead to the page "http://www.cidoc-crm.org/Entity/E7-Activity/Version-6.2.1", providing at the same time HTML for me to read and RDF data (in XML, JSON or whatever else) for consumption by the machine.

The same principle should be applied to CRM extensions, e.g. http://www.cidoc-crm.org/crm-geo/1.2/SP2.
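
The versioned URI form proposed here can be sketched as a pair of helper functions. The pattern base/model/version/code is read off the two example URIs above; the function names and the returned dictionary are assumptions for illustration, and no such service is claimed to exist at these addresses.

```python
from urllib.parse import urlparse

BASE = "http://www.cidoc-crm.org"


def versioned_uri(model, version, code):
    """Build a URI of the form <base>/<model>/<version>/<code>."""
    return f"{BASE}/{model}/{version}/{code}"


def parse_versioned_uri(uri):
    """Split a versioned URI into model, version and code."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"not a versioned URI: {uri}")
    model, version, code = parts
    return {"model": model, "version": version, "code": code}


print(versioned_uri("cidoc-crm", "6.2", "E7"))
print(parse_versioned_uri("http://www.cidoc-crm.org/crm-geo/1.2/SP2"))
```

A dereferencing service could use the parsed parts to select the scope note and labels of the requested version before deciding whether to return HTML or RDF.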

In my opinion, this point should be treated as a part of the discussion we started in the last SIG in Lyon about improving the management of CRM versions and extensions, and we should find in future a more dynamic, web-based way of managing versions and dereferencing. And discussions...

Posted by George Bruseker on 24/7/2018

Dear all,

I do not want to be obstructionist to progress on a pragmatic issue but I feel that we should pause the vote process.

It seems to me that both sides have very good points and we need to find a means to reconcile these as much as possible. Let me try to abbreviate the main aspects of the points made so far:

As Rob points out, with the labels embedded in the class and property names, we have a readable RDF and a single RDF of reference. These are fundamental attributes we should be looking to support.

With a label-less entity/property version, we would have greater neutrality from labels which is a strength with regards to robustness against label update and creates linguistic neutrality. On the other hand it means having two versions of CRM around leading to potential interoperability problems and extra overhead. It also makes the plain RDF unreadable in any sense except to the versed few.

Melanie’s point of the cost of change resulting from updates to the standard is a very fundamental argument and I think a big aspect we have to bear in mind, with which I believe Rob would concur. If changes to labels cost user communities significant time and money, this is a big problem to CRM sustainability. That being said, if the SIG has historically been conservative about label changes, perhaps it is not as big an issue as we think.

Before we move forward with creating a version like this, I suggest that we check how many times we have changed class or property names in the past to see how big an issue this is. From the point of view of update robustness and cost to users and the community (though not linguistic flexibility), this is the major issue.

If we decide to make such a version, I would think we would want to ensure that we have the correct mechanisms in place for ensuring the management of the versions and the resolution and persistence of the URIs as per Richard and Francesco’s suggestion. Along those lines, I believe Francesco’s comments on the URI service and group ontology development are fruitful. Indeed, joining them to Rob’s extended questions about previous version names etc., much could be addressed through a robust service for URI resolution of CRM entities and properties.

Feel free to disagree with this summary if I have missed or misinterpreted your points. I think we all share the same aim of robust interoperability on the data level and just have to find the right balance. I would suggest that we check the label changes and discuss how to robustly support CRM URI resolution before proceeding to create new RDF versions.

Posted by Christian Emil on 24/7/2018

Dear all,

The debate following Martin's proposal demonstrates the problem connected to PIDs and URIs. When the idea of LOD/Semantic Web was introduced, there seems to have been an idea that there would be one single universal identifier for an item. In the last 18 years it has clearly been demonstrated that this is not possible in general. One will always need synonym mechanisms. One of the more successful identifier schemes is the ISBN, which is a purely numeric standard. Another small standard is that of paper sheet sizes.

In systematic biology (at least in botany) one uses Latin names supplied with a complex but precise way to formulate name changes, including the name of the person responsible. For most non-native Latin speakers these Latin/Greek names function as labels with no internal meaning. What is the meaning of “Rubus chamaemorus” and, if you decipher it, will you know what it actually denotes? Still the system works well.

For non-native English speakers, English names such as the labels in the CRM, the element names in TEI or the terms used in IIIF function as similar labels. We do not bother very much about the English nuances. They could as well have been in Latin. With all respect, this observation seems to be surprising for many native English speakers (or for native speakers of the language of any given standard).

In my opinion the numeric labels should be the authoritative ones, and the more verbose ones in some native language (English, French, Norwegian, Chinese) are synonyms. A language-neutral standard is the best. So if the voting is still open I will vote yes.

Posted by Martin Doerr on 27/7/2018

Dear All,

I take George's message for a "VETO".

The vote is hereby canceled.

Under these circumstances, the issue will be discussed in the next meeting. Please provide more objective evidence about costs and methods of migrating between label versions and other changes, as George argues.

I thank you all for your contributions.
I am particularly pleased that members that tend to be more silent have expressed detailed opinions, which we will respect further in the process.

Posted by Richard Light on 25/7/2018

Martin,

As we close this discussion (which I agree has been very productive) I would like to make the following points:

·         widely-used systems such as WikiData and Geonames manage quite happily with numeric identifiers. We should look at how they manage to disclose the meaning of each URI to their users

·         in my experience, real-life instances of CRM-encoded data can involve chains of class-property-class-property-... which can be quite long. If each individual class and property URI includes its label, these chains will become unwieldy to read, leading to criticisms of the CRM as 'too verbose' (as against the potential 'too cryptic' criticisms which Rob points out)

·         I strongly support Francesco's suggestion that we use this discussion as a spur to move beyond a static expression of the CRM in RDFS to a web-based framework which actively supports its users

Reference to Issues: