All over the world, investment in making cultural and scientific information publicly available on the Web is increasing. After a phase in which each information provider from the archival, library, museum and scientific disciplinary communities was encouraged to create its own “Web presence”, a new phase has begun: integrating rich metadata in very large aggregation services, such as the National Digital Library of Taiwan, the European Digital Library “Europeana”, the Mellon Foundation funded ResearchSpace Project, the British CLAROS Project, and many more. These recent efforts go well beyond OAI-PMH harvesting of minimal metadata. Ultimately, these services should bring the dream of a global network of knowledge [1] closer to us.
Behind this development stands an architecture in which “information providers” actively deliver content and metadata in various heterogeneous formats. A mapping toolset, often called a “Submission Information Package Creator” after the OAIS digital preservation model, transforms these into a normalized data model such as the CIDOC CRM, and the result is subsequently ingested into an aggregation service.
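The provider-to-aggregator flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the flat record layout, the `http://example.org/` base URI, and the function names `map_record` and `ingest` are hypothetical, and the CRM class and property labels should be checked against the CIDOC CRM version actually in use.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical base URI for generated identifiers (an assumption, not from the source).
BASE = "http://example.org/"


def map_record(record: Dict[str, str]) -> List[Triple]:
    """Transform one flat provider record into CRM-style triples."""
    obj = f"{BASE}object/{record['id']}"
    title = f"{obj}/title"
    return [
        (obj, "rdf:type", "crm:E22_Man-Made_Object"),
        (obj, "crm:P102_has_title", title),
        (title, "rdf:type", "crm:E35_Title"),
        (title, "rdfs:label", record["title"]),
    ]


def ingest(store: List[Triple], records: List[Dict[str, str]]) -> None:
    """Append the mapped triples of every record to the aggregator's store."""
    for record in records:
        store.extend(map_record(record))


if __name__ == "__main__":
    store: List[Triple] = []
    ingest(store, [{"id": "42", "title": "Vase"}])
    for triple in store:
        print(triple)
```

A real mapping service would read XML, RDF or relational sources rather than Python dictionaries; the list-based store merely stands in for the aggregation service.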
Numerous projects, often publicly funded, have created such mapping tools again and again, each time claiming to have solved the problem. Yet no comprehensive toolset exists so far that would support a mapping service of industrial quality, one that takes into account the quality criteria, the scale, and the social constellations and roles under which current aggregation services aim to run. As a consequence, the integrated data are much poorer than the providers’ internal data, contain much less than could be provided, and are rarely or never updated. The transformed data contain numerous mapping errors and suffer from other quality shortcomings. The mapping process itself causes immense costs.
The main reason for this global failure of tool development is a complete underestimation of the complexity of such a toolset, together with the lack of a reference model that would make generic requirements and a suitable architecture widely known. A complete mapping service consists of quite a number of necessary and optional subservices, each of which can be implemented at widely varying levels of sophistication. Current solutions suffer from the following:
- They are monolithic. Each implementation has developed a different subset of the possible functionality, with no chance of integrating them.
- They do not represent the schema matching information in a way a domain expert could verify.
- They do not allow for switching between XML, RDF and RDBMS support on the source side, and XML and RDF on the target side.
- They do not support incremental changes of source schema, target schema and URI generation policies.
- They do not maintain automated communication with the provider for data cleaning.
- They do not provide for collaborative work on the mapping process by experts with different roles.
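Two of the shortcomings listed above, schema matching that a domain expert cannot verify and URI generation that cannot change independently, suggest keeping the matching rules as plain data, separate from the URI policy. The following is a minimal sketch of that separation, not a description of any existing tool: `Rule`, `describe`, `apply_rules` and both policy functions are hypothetical names, and the CRM labels are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
# A URI policy turns (record id, part name) into a URI; an empty part names the record itself.
UriPolicy = Callable[[str, str], str]


@dataclass(frozen=True)
class Rule:
    """One schema-matching statement: source field -> CRM property -> CRM class."""
    source_field: str
    crm_property: str
    crm_range: str


RULES = [
    Rule("title", "crm:P102_has_title", "crm:E35_Title"),
    Rule("maker", "crm:P108i_was_produced_by", "crm:E12_Production"),
]


def describe(rules: List[Rule]) -> List[str]:
    """Render the rules as readable lines that a domain expert could review."""
    return [f"{r.source_field} -> {r.crm_property} -> {r.crm_range}" for r in rules]


def slash_policy(record_id: str, part: str) -> str:
    base = f"http://example.org/object/{record_id}"  # hypothetical base URI
    return f"{base}/{part}" if part else base


def hash_policy(record_id: str, part: str) -> str:
    base = f"http://example.org/object/{record_id}"  # hypothetical base URI
    return f"{base}#{part}" if part else base


def apply_rules(record: Dict[str, str], rules: List[Rule], policy: UriPolicy) -> List[Triple]:
    """Apply every rule whose source field is present, using the given URI policy."""
    subject = policy(record["id"], "")
    triples: List[Triple] = []
    for rule in rules:
        if rule.source_field not in record:
            continue
        node = policy(record["id"], rule.source_field)
        triples.append((subject, rule.crm_property, node))
        triples.append((node, "rdf:type", rule.crm_range))
        triples.append((node, "rdfs:label", record[rule.source_field]))
    return triples
```

Because the rules never mention URIs, swapping `slash_policy` for `hash_policy` changes the generated identifiers without touching the schema matching, and `describe(RULES)` gives reviewers a view of the matching that is free of implementation detail.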
How to define mappings
The available documents contain information on how to define mappings between relevant data structures and the CIDOC CRM.