GUIDs in the CDM


Purpose of this document

The EDIT Common Data Model Java Library is a generic API for building applications for revisionary taxonomy and taxonomic field work. Such applications are envisaged to be integrated into the wider biodiversity informatics landscape. Standard vocabularies for describing data, the use of globally unique identifiers to distinguish between data items, and standard protocols and web services for exchanging metadata are understood to be the means to achieve this integration.

GUIDs in this context are taken to be a synonym for the whole system of Globally Unique Identifiers plus the associated technology for resolving metadata about the objects those identifiers identify, the ontology for describing the relationships and properties of those objects and their semantics, the format of any representations of those objects, plus software applications that can use and understand those objects. A useful reference is the definition of GUIDs provided by the TDWG GUID Wiki http://wiki.tdwg.org/twiki/bin/view/GUID/WebHome#A_Definition_of_Globally_Unique.

This document attempts to:

  • Describe some things (use cases) that a user would want to achieve, and how the use of a GUID would help the user achieve that goal.

  • Identify any gaps in the current proposed GUID technology, and perhaps propose solutions to those gaps.

  • Identify functional and non-functional requirements of such a system in terms of lower level operations, and map these requirements onto the CDM Java Library

  • Specify, at a high level, missing components of the CDM Java Library with regards to GUIDs.

Scenario

Taxonomists and other biodiversity scientists collect and create information about biological entities. The amount of information in total is very large, so large that it is impossible to collect more than a fraction of the total during the course of any one project or to hold more than a fraction of the total in one database or software application. Instead, databases of limited scope (e.g. taxonomic scope, or geographical scope, or being restricted to a certain subset of the total number of categories of information) are created, usually for a particular purpose by a particular organization or group of individuals. There are many databases each containing a subset of the total information.

The following use cases are set in the context of information shared between cate-araceae.org and IPNI. CATE Araceae is a database created by Simon Mayo & collaborators (cate-araceae.org). It is a taxonomic revision of the Araceae, a group of about 3,000 species of plants. The primary purpose of the database is to provide a classification and diagnostic description of the accepted species of Aroids. Initially the taxonomic concepts used in CATE Araceae were those accepted by the Monocot Checklist, but it is believed that more than 1,000 species of Aroids are yet to be described, so it is likely that the cate-araceae.org checklist will diverge from the Monocot Checklist in the future unless particular effort is spent maintaining them in synchrony. The core data within this database are Taxonomic Concepts, and Descriptions of those Taxonomic Concepts. However, cate-araceae.org also contains lists of Taxonomic Names, authors, specimens, references, controlled terms and many other types of data. Ideally it would like to use global authority files for these entities as there is more than enough work purely maintaining the classification and descriptive data. An added complication is that these "global authority files" are themselves not static but change as new publications, specimens etc. are created or as existing data is improved.

The International Plant Names Index has been created by a consortium of RBG Kew, the Australian National Herbarium, and the Harvard University Herbaria. It was assembled from Index Kewensis, the Gray Card Index & APNI, and aims to compile and maintain a comprehensive literature based record of the scientific names of all vascular plants and to make it freely available on the Internet. It is updated on a regular basis by the IPNI Editors. The core data in IPNI are Taxonomic Names, Publications, and Authors.

Some general features of such databases are that:

  • They do not cover all information, but specialize in a particular subset of information.

  • They are created or compiled in order to meet immediate business needs of specific users working for, or with, the organization that supports the database. In the case of cate-araceae.org, Simon Mayo and his collaborators are active Araceae taxonomists and use the database in their day-to-day work, and as a means of publishing the results of their research. Likewise the components of IPNI were created primarily to serve botanical research globally.

  • Once assembled, the data in such databases need to be updated if they are to remain useful. In the case of cate-araceae.org, it is estimated that approximately 1,000 aroid species are currently undescribed. In the case of IPNI, new, validly published names are added to the index, in addition to continuous efforts to improve the quality of the data.

  • They are (usually) publicly funded and publicly available. It is important to the organizations that support such databases that these resources are used, and are useful beyond the organization that created them, although it can be difficult to demonstrate this usefulness or use.

The CDM & GUIDs

The CDM is a data model implemented in Java. The metadata returned by an LSID resolution service, for example, is an RDF document typed according to the TDWG ontology. There is (in most cases) a one-to-one mapping between classes and properties in the CDM and the current TDWG ontology (found at http://rs.tdwg.org/ontology). There is not complete coverage between the two data models in either direction. In some cases properties in one model are composites of properties in the other. A second problem is that cardinality constraints have not been placed on the properties of RDF objects in the TDWG ontology. As a consequence conversion between the CDM and the TDWG ontology is expected to be lossy (i.e. CDM objects cannot be converted into RDF and back again without loss of data in some instances). The following table gives some of the main objects or properties used in the use cases below as either CDM objects or their equivalents from the TDWG ontology.

| CDM | TDWG Ontology |
| TaxonBase (abstract class, can be Taxon or Synonym) | TaxonConcept |
| TaxonNameBase (abstract class, can be BotanicalName, ZoologicalName etc)| TaxonName |
| TaxonBase.name | TaxonName.hasName |
| TaxonBase.descriptions | TaxonBase.hasDescription |
| NonViralName.specificEpithet | TaxonName.specificEpithet |
| NonViralName.genusOrUninomial | TaxonName.uninomial |
| NonViralName.rank | TaxonName.rank |
| no direct equivalent | TaxonName.authorship |
| TaxonNameBase.descriptions | no direct equivalent |
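The table can be read as a lookup from CDM property paths to TDWG terms, and the rows marked "no direct equivalent" are where a conversion loses data. The following is a minimal sketch in Java that holds a subset of the rows above and reports which CDM properties would be dropped on export. The class CdmTdwgMapping and its methods are hypothetical and are not part of the CDM Java Library.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the CDM Java Library.
final class CdmTdwgMapping {

    // A subset of the rows in the table above that have a direct equivalent.
    private static final Map<String, String> CDM_TO_TDWG = Map.of(
            "NonViralName.specificEpithet", "TaxonName.specificEpithet",
            "NonViralName.genusOrUninomial", "TaxonName.uninomial",
            "NonViralName.rank", "TaxonName.rank");

    // Returns the TDWG term for a CDM property, or empty when none is listed.
    static Optional<String> toTdwg(String cdmProperty) {
        return Optional.ofNullable(CDM_TO_TDWG.get(cdmProperty));
    }

    // Lists the CDM properties that would be dropped when exporting to RDF.
    static List<String> lostOnExport(List<String> cdmProperties) {
        return cdmProperties.stream()
                .filter(p -> toTdwg(p).isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> props = List.of("NonViralName.rank", "TaxonNameBase.descriptions");
        System.out.println(lostOnExport(props)); // [TaxonNameBase.descriptions]
    }
}
```

A real converter would also have to handle composite properties and cardinality, which this lookup ignores.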

The current TDWG recommendation is that LSIDs and an LSID Resolution Service are used to publish data about objects. The CDM Server implements the LSID Resolution Service specification (partially: it does not have working Foreign Authority Notification). There is an LSID Assigning Service specification that the CDM Server does not implement.
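The LSIDs used in the use cases below follow the pattern urn:lsid:authority:namespace:object, optionally followed by a revision. As an illustration, the following sketch splits an LSID into those parts. The record LsidParts is a hypothetical name introduced here, the validation is deliberately simple, and the sketch does not use any class from the CDM Java Library.

```java
import java.util.Optional;

// Hypothetical value type for illustration; it does not use any CDM class.
record LsidParts(String authority, String namespace, String object, Optional<String> revision) {

    // Splits "urn:lsid:authority:namespace:object[:revision]" into its components.
    static LsidParts parse(String lsid) {
        String[] parts = lsid.split(":", -1);
        boolean wellFormed = (parts.length == 5 || parts.length == 6)
                && parts[0].equalsIgnoreCase("urn")
                && parts[1].equalsIgnoreCase("lsid");
        // Simplification: every component, including the revision, must be non-empty.
        for (int i = 2; wellFormed && i < parts.length; i++) {
            wellFormed = !parts[i].isEmpty();
        }
        if (!wellFormed) {
            throw new IllegalArgumentException("Not an LSID: " + lsid);
        }
        Optional<String> revision = parts.length == 6 ? Optional.of(parts[5]) : Optional.empty();
        return new LsidParts(parts[2], parts[3], parts[4], revision);
    }

    // The identifier of the object regardless of its version.
    String withoutRevision() {
        return "urn:lsid:" + authority + ":" + namespace + ":" + object;
    }

    public static void main(String[] args) {
        LsidParts id = parse("urn:lsid:ipni.org:names:927070-1:1.1.2.1.1.2");
        System.out.println(id.authority() + " " + id.revision().orElse("(none)"));
    }
}
```

Separating the revision from the rest of the identifier is what lets a client ask whether two LSIDs refer to different versions of the same object.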

Assumptions

  1. GUID in this context is a globally unique identifier for an object (and associated technology)

  2. The classes or categories of object, and their properties are defined by the TDWG ontology and can be mapped onto the CDM objects

  3. GUIDs are resolvable using any software client that uses the standard protocol defined as part of the GUID technology

  4. The representation formats for the different classes are also defined by TDWG

  5. In addition to having properties that are "part of" or "core" to the object, globally identifiable objects can also be related to other globally identifiable objects

  6. The properties or attributes of an object are not necessarily immutable, i.e. it is possible to change properties of an object or its relationships with other objects.

  7. If two representations have the same GUID then they unambiguously represent the same thing

  8. If two representations have different GUIDs then they may represent the same thing, but this may be a value judgement (based on comparison of their properties).

  9. Objects with GUIDs are intended to be permanently resolvable (in the same sense as anything man-made i.e. when it is published the intention is for it to always be resolvable).

Objects "belong" to only one authority. The authority that "owns" the object is entitled to change the properties of the object.

Objects can be associated with other objects (within the constraints of the TDWG ontology). An authority cannot restrict other users from using its GUIDs in associations once they are published.

Use Cases

User finds GUID in publication & uses it to discover more information

Use Case: A user is interested in learning about Philodendron venustifoliatum, and obtains a PDF document of Philodendron venustifoliatum (Araceae): a new species from Brazil. Kew Bull. 53: 483–486. The GUID "urn:lsid:cate-araceae.org:taxonconcepts:152024" is embedded in the PDF as a hyperlink. The user retrieves information using their client from a variety of data providers. Some of the data was not necessarily available at the time the document was published, or does not reside in the database of cate-araceae.org.

  1. The user clicks on the link, hoping to find more information about the species

  2. Using a GUID client, the user obtains a document in a standard format that is typed according to the TDWG ontology. The user's client can understand that this document describes a taxon concept. It also understands the meaning (semantics) of the properties of a taxon object as the semantics of these properties are also defined by the TDWG ontology.

  3. The taxon object contains data including the description of the species, links to images, a coded distribution according to a controlled vocabulary and another embedded GUID - "urn:lsid:ipni.org:names:320552-2" associated with the name property of the taxon object. The GUID client knows that this GUID is a pointer to a name object (this is also defined by the TDWG ontology).

  4. The user indicates that they want to learn more about the name of the taxon using the user interface (e.g. by clicking on the name property) and their client resolves this identifier and retrieves the name object from IPNI. The name object contains more data, again typed according to the TDWG ontology. The user discovers the location that the type was collected from, and the current location of the holotype and syntypes.

  5. The user's client can discover and use other services that also understand the TDWG ontology e.g. specimen databases, to retrieve further information about the specimens that typify this name.

The use of GUIDs is beneficial for cate-araceae.org because it is able to supply data to a larger number of users and clients by adopting a single, generic, standard protocol and ontology (rather than developing a service specifically for each client). cate-araceae.org adds value to its data by linking it to other data that, in turn, can be linked to more data, all accessible through the same standard route. For IPNI, the benefits come from increased numbers of users discovering or using their data through links from external data providers.

In choosing to use an IPNI GUID, cate-araceae.org is deferring to the expertise of the IPNI Editors, increasing their status as being authoritative for that particular class of data. The provenance of the names data in cate-araceae.org is established as originating from a particular record in IPNI, even if a user downloads that data from cate-araceae.org. Provided that cate-araceae.org adds some value (i.e. by providing extra data or extra services) beyond the data and services offered by IPNI, the relationship can be symbiotic.

Because GUIDs are permanently resolvable, the metadata associated with the objects in CATE Araceae and IPNI can be retrieved even if the data moves location (e.g. between servers, databases, or even institutions hosting the data).
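Step 3 of this use case depends on the client recognising that a property value is itself a GUID that can be resolved in turn. A minimal sketch of that check follows. It models the resolved metadata as a flat map of property names to values rather than as RDF, which is a simplifying assumption, and the class EmbeddedGuids is hypothetical.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: metadata is modelled as a flat property map rather than RDF.
final class EmbeddedGuids {

    // Returns the properties whose values are themselves LSIDs, keyed by property name.
    static Map<String, String> links(Map<String, String> metadata) {
        return metadata.entrySet().stream()
                .filter(e -> e.getValue().startsWith("urn:lsid:"))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // Illustrative metadata for the taxon concept in the use case above.
        Map<String, String> taxon = Map.of(
                "type", "TaxonConcept",
                "hasName", "urn:lsid:ipni.org:names:320552-2");
        System.out.println(links(taxon)); // {hasName=urn:lsid:ipni.org:names:320552-2}
    }
}
```

A client would pass each returned value back to its resolver, which is how the user in step 4 moves from the taxon concept at cate-araceae.org to the name object at IPNI.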

User contributes new taxon

Use Case: Simon Mayo has added the (fossil) Aroid Genus Albertarum Bogner, G.L. Hoffman & Aulenback to cate-araceae.org. Joseph Bogner, who described this genus, collaborates with Simon and is in regular communication with him about his research. The name Albertarum is submitted to the IPNI Editors for consideration on Simon's behalf, much sooner than it would be otherwise if they had been forced to discover the publication of the name by scanning the paper literature. IPNI includes the name in its database, assigning a new GUID for it. It notifies CATE Araceae that it has published a GUID once it is assigned, and CATE Araceae attaches the GUID to the name object in its database.

  1. Simon uses the cate-araceae.org interface to create a new species page for the taxon Albertarum.

  2. He uses a web-form to fill in data about the protologue (Bogner, J., Hoffman, G.L., Aulenback, K.R. 2005. A fossilized aroid infructescence, Albertarum pueri gen.nov. et sp.nov., of Late Cretaceous (Late Campanian) age from the Horseshoe Canyon Formation of southern Alberta, Canada. Canadian Journal of Botany), and the authorship.

  3. He submits this data to CATE Araceae, and this is published on the web.

  4. CATE Araceae would like to apply the IPNI ids for all of its names. The software checks IPNI to discover if the name Albertarum already exists.

  5. In this case, IPNI does not (currently) hold this information.

  6. IPNI processes the metadata passed to it by cate-araceae.org and (after intervention by the IPNI Editors, plus some offline checking of the Can. J. Bot. article), the name Albertarum is added to IPNI.

  7. IPNI records that the request for a GUID for the name Albertarum originated from cate-araceae.org and the new GUID is sent to cate-araceae.org, which associates it with the name object in its database.

From cate-araceae.org's point of view, it is behaving as a "good citizen" by notifying IPNI about an event (a new Genus) that it might be interested in. Overall, it is in cate-araceae.org's best interests that IPNI is comprehensive, especially if IPNI offers services that cate-araceae.org uses that improve with the completeness and accuracy of the nomenclator e.g. validation of name strings.

Alternative end points might be:

  1. IPNI rejects the request for an id for Albertarum because the Editors decide that the name is not validly published, or that, being a fossil genus, it is out of scope of IPNI.

  2. IPNI assigns an identifier, but corrects some of the metadata as supplied by cate-araceae (perhaps the authority was misspelled).

  3. IPNI returns an identifier that has already been created, confirming that the metadata supplied in the request made by cate-araceae.org is correct.

In these cases, having IPNI check the metadata associated with Albertarum prior to issuing a GUID is useful for cate-araceae.org because the users of CATE might not necessarily be experts in nomenclature, or might have made an error in entering the data.
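The main flow and the three alternative end points give four outcomes that the requesting software has to handle. The sketch below shows one way a client could apply each outcome to its local name record. Every type here (Outcome, Response, LocalName) is hypothetical, as is the placeholder identifier used in the example; neither IPNI nor the CDM is known to expose such an interface.

```java
import java.util.Optional;

// All types here are hypothetical; they only illustrate the four end points above.
final class NameRequestHandler {

    enum Outcome { ASSIGNED, ASSIGNED_WITH_CORRECTIONS, ALREADY_EXISTS, REJECTED }

    record Response(Outcome outcome, Optional<String> guid, Optional<String> correctedAuthorship) {}

    record LocalName(String name, String authorship, Optional<String> guid) {}

    // Applies the authority's response to the locally held name.
    static LocalName apply(LocalName local, Response response) {
        switch (response.outcome()) {
            case ASSIGNED:
            case ALREADY_EXISTS:
                // Attach the new or pre-existing GUID; local metadata was confirmed.
                return new LocalName(local.name(), local.authorship(), response.guid());
            case ASSIGNED_WITH_CORRECTIONS:
                // Attach the GUID and adopt the authority's corrected authorship.
                return new LocalName(local.name(),
                        response.correctedAuthorship().orElse(local.authorship()),
                        response.guid());
            case REJECTED:
            default:
                // The name stays in the local database without an external GUID.
                return local;
        }
    }

    public static void main(String[] args) {
        LocalName local = new LocalName("Albertarum", "Bogner, G.L. Hofman & Aulenback", Optional.empty());
        Response corrected = new Response(Outcome.ASSIGNED_WITH_CORRECTIONS,
                Optional.of("urn:lsid:ipni.org:names:EXAMPLE-1"), // placeholder, not a real LSID
                Optional.of("Bogner, G.L. Hoffman & Aulenback"));
        System.out.println(apply(local, corrected));
    }
}
```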

From IPNI's point of view, they are providing a route for feedback and new information in a standard way. They are also potentially reducing the workload for their Editors by accepting data that is already parsed and does not need to be entered by hand twice; for trustworthy clients it may be possible to verify such information quickly and incorporate it into IPNI. In addition it increases the rate of discovery of new names by allowing other users to submit new names as part of their work: there is no need for a user of cate-araceae.org to log in to a specific IPNI client in order to contribute names to the nomenclator. This follows the principle that "given many eyes, all bugs are shallow".

It may increase the workload of the IPNI editors if a service like this results in a large number of requests for GUIDs, especially if each request requires significant checking, or some sort of interaction with the user that submitted the request. It will also make the IPNI application more complex to develop and maintain. For CATE Araceae, relying on IPNI for identifiers means that the IPNI Editors have the final say in issuing an identifier and associating metadata with a name. If it were important that CATE Araceae has total control over the names it publishes, then it should be its own authority for names.

Existing taxon is split

Use Case: A user has downloaded data from cate-araceae.org and used those taxon concepts as the basis of a dataset of measurements taken from specimens. One of the editors of cate-araceae.org decides to include the split of an existing genus Philodendron Schott. into the CATE Araceae web revision and the change in status of subgen. Pteromischum (Schott) Mayo to the rank of Genus. The dataset of the user is out of date. The user is able to discover that the classification of Philodendron has changed and they are able to update their dataset automatically to use the correct accepted names according to CATE Araceae.

  1. A user has downloaded a checklist of Philodendron species, including the GUIDs of those species.

  2. They use the checklist to create a dataset of leaf morphometric measurements for many species in the genus, including Philodendron acreanum K.Krause.

  3. An editor of cate-araceae.org increases the rank of Pteromischum and it appears in cate-araceae.org as a new genus. 66 of the species of Philodendron are recombined in this new genus.

  4. The software recognises that the name Pteromischum originated from IPNI (urn:lsid:ipni.org:names:927070-1:1.1.2.1.1.2), and that this name has been changed by a user of cate-araceae.org.

  5. The software transmits this information to IPNI.

  6. Following the same kind of process as the one outlined in the previous use case, IPNI editors decide that Pteromischum (Schott) Mayo is a new name, and return a new identifier for the name, which cate-araceae.org applies.

  7. cate-araceae.org increments the version of the taxon concept Philodendron Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:151375) and saves the new version of Philodendron under this new version number.

  8. cate-araceae.org creates 66 new names (and attempts to acquire 66 IPNI identifiers for them) in the genus Pteromischum (Schott) Mayo, and 66 new taxon concepts that are related to the original taxon concepts before they were recombined (e.g. through synonym relationships).

  9. At a later date, the original user uses a GUID client to discover that, for example, Philodendron acreanum has been changed (because the metadata returned indicates that the version identified isReplacedBy another, later version). The semantics of this relationship are defined by the TDWG ontology.

The original user decides that they trust cate-araceae.org is correct and would like to update their data so that it is labelled according to the current accepted names.

The original user uses a GUID client to resolve the most up-to-date versions of the taxon concepts in their checklist and discovers that they now have metadata that indicate that they are synonyms of other taxon concepts.

By retrieving metadata about those new taxon concepts, the user is able to update their checklist automatically to contain the currently accepted names e.g. Philodendron acreanum K.Krause is a synonym of Pteromischum acreanum (K.Krause) Mayo, and so is replaced automatically. The data that the user created remains associated with the correct taxon concept.

In some cases the mapping between older objects and new objects might be ambiguous (e.g. when a taxon is thought to be a pro parte synonym of another taxon). In this case, data cannot be transformed without human intervention.
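The automatic update and its ambiguous case can be sketched as follows. The sketch assumes that synonymy has already been resolved into a plain map from each old name to the accepted names it is now a synonym of; more than one candidate stands in for a pro parte synonym. The class ChecklistUpdater and the names "Taxon A", "Taxon B" and "Taxon C" are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; synonymy is passed in as a plain map, not resolved metadata.
final class ChecklistUpdater {

    // Maps each checklist name to its accepted name. Names with more than one
    // candidate (e.g. pro parte synonyms) are left unchanged and added to needsReview.
    static Map<String, String> update(List<String> checklist,
                                      Map<String, List<String>> acceptedNamesFor,
                                      List<String> needsReview) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : checklist) {
            List<String> accepted = acceptedNamesFor.getOrDefault(name, List.of());
            if (accepted.size() == 1) {
                result.put(name, accepted.get(0));
            } else {
                result.put(name, name);
                if (accepted.size() > 1) {
                    needsReview.add(name);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> needsReview = new ArrayList<>();
        Map<String, String> updated = update(
                List.of("Philodendron acreanum K.Krause", "Taxon A"),
                Map.of("Philodendron acreanum K.Krause", List.of("Pteromischum acreanum (K.Krause) Mayo"),
                       "Taxon A", List.of("Taxon B", "Taxon C")),
                needsReview);
        System.out.println(updated);
        System.out.println("Needs review: " + needsReview);
    }
}
```

Leaving ambiguous names untouched, rather than picking one candidate, keeps the user's measurements attached to a concept they chose themselves until a person has decided.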

From the point of view of a user of cate-araceae.org, GUIDs represent a standard way to identify precisely the provenance of data they have obtained from cate-araceae.org. Because cate-araceae.org saves a different version each time the data is changed in the database, and links these versions via replaces and isReplacedBy links in the metadata that it returns, users of that data can find out whether they are using the most up-to-date version of the data (the most recent version does not have an isReplacedBy link, because it has not been replaced).

If users want to use the most up to date data possible and trust the authority, GUIDs provide a method for obtaining the most current version (by following links between versions). Although not covered by current GUID protocols, it would also be possible for a client to pull or harvest changes in bulk from an authority, rather than querying the authority on an object-by-object basis.
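Following links between versions amounts to walking the isReplacedBy relationship until a version is reached that has not been replaced. The sketch below shows this with the links held in a map, which stands in for metadata that a real client would resolve one object at a time. The class VersionChain and the revision suffixes ":1", ":2" and ":3" are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: isReplacedBy links are supplied as a map from old to new GUID.
final class VersionChain {

    // Follows isReplacedBy links until reaching a version that has no replacement.
    static String latest(String guid, Map<String, String> isReplacedBy) {
        Set<String> seen = new HashSet<>();
        String current = guid;
        while (isReplacedBy.containsKey(current)) {
            // Guard against malformed metadata that links versions in a loop.
            if (!seen.add(current)) {
                throw new IllegalStateException("Cycle in isReplacedBy links at " + current);
            }
            current = isReplacedBy.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String base = "urn:lsid:cate-araceae.org:taxonconcepts:151375";
        Map<String, String> links = Map.of(base + ":1", base + ":2", base + ":2", base + ":3");
        System.out.println(latest(base + ":1", links)); // urn:lsid:cate-araceae.org:taxonconcepts:151375:3
    }
}
```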

If users do not want to update their data or do not trust the authority that the changes are correct, then they can continue to use the old GUIDs provided the authority versions their objects. If the authority does not version their objects then the metadata about the object provided by the authority and the metadata held by the client could be different or even contradictory. As an example, in step 7 of the use case described above:

  1. cate-araceae.org does not change the version number for Philodendron (Schott). Philodendron (Schott) as served by cate-araceae.org has 371 species, and Philodendron (Schott) [urn:lsid:cate-araceae.org:taxonconcepts:151375] in the dataset of the client has 437 species.

In this case, the user's GUID client can only detect changes by comparing properties (e.g. dc:modified) of the object that it has cached with the object currently being resolved by the authority.

New data added to existing taxon is repurposed by aggregator

Use Case: One of the editors of cate-araceae.org adds some extra data to an existing taxon concept, a new distribution record for Homalomena Schott (urn:lsid:cate-araceae.org:taxonconcepts:99893), stating that it occurs in Brazil. This data is harvested by GBIF and the taxon Homalomena Schott appears in search results for taxa found in Brazil in the GBIF portal, linking users through to other data about this taxon (for example, textual diagnoses or images).

  1. One of the users of CATE Araceae enters data into cate-araceae.org using a web form. This data is entered into the CDM database used by CATE Araceae.

  2. CATE Araceae exposes this data in a web page (e.g. as a map showing the regions colour coded by presence / absence).

  3. The Distribution Record can be expressed as a Species Profile Model InfoItem (http://rs.tdwg.org/ontology/voc/SpeciesProfileModel#InfoItem), of class Distribution (http://rs.tdwg.org/ontology/voc/SPMInfoItems#Distribution) which has a value http://rs.tdwg.org/ontology/voc/GeographicRegion.rdf#84.

  4. The CDM Server underlying CATE Araceae exposes its data for harvesting by aggregators. An aggregator from GBIF uses a standard protocol to discover any objects that are new or changed since it last harvested data from CATE Araceae.

  5. CATE Araceae responds with a list of GUIDs for objects that are new or have changed since GBIF last harvested it. This includes the GUID for the InfoItem that states that Homalomena Schott is found in the TDWG Region of Brazil.

  6. GBIF harvests the new and updated objects by resolving these identifiers and requesting the metadata. CATE Araceae responds with the metadata in the standard format (i.e. RDF)

  7. GBIF adds this data to its own aggregated database. It can use the fact that the TDWG Vocabulary specifies the semantics of the returned data to build queries across data harvested from a variety of sources e.g. "images of taxa that occur in Brazil"

  8. Because GBIF has indexed data from other sources that associate images with the taxon concept Homalomena Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:99893, via the hasDigitalImage property), these images are now returned in searches for images of Brazilian taxa.
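Steps 4 and 5 describe selective harvesting: the aggregator asks for everything that is new or changed since its last visit and receives a list of GUIDs. A minimal sketch of the provider side of that exchange follows, assuming each object carries a modification timestamp. The class IncrementalHarvest, the infoitems identifier and the dates are hypothetical, and the standard protocol itself is not modelled.

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the provider side; the harvesting protocol is not modelled.
final class IncrementalHarvest {

    record Entry(String guid, Instant modified) {}

    // Returns the GUIDs of objects created or changed after the last harvest.
    static List<String> changedSince(List<Entry> entries, Instant lastHarvest) {
        return entries.stream()
                .filter(e -> e.modified().isAfter(lastHarvest))
                .map(Entry::guid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("urn:lsid:cate-araceae.org:taxonconcepts:99893", Instant.parse("2009-01-10T00:00:00Z")),
                // Placeholder identifier for the new distribution InfoItem.
                new Entry("urn:lsid:cate-araceae.org:infoitems:EXAMPLE", Instant.parse("2009-03-01T00:00:00Z")));
        System.out.println(changedSince(entries, Instant.parse("2009-02-01T00:00:00Z")));
    }
}
```

The aggregator would then resolve each returned GUID to obtain the metadata, as in step 6.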

Here GUIDs and the associated technology are advantageous for the data provider because they can be used to build additional services that re-use data in ways not originally envisaged by the data provider. Because the use of GUIDs, together with the ontology provided by TDWG, unambiguously associates objects in particular ways (e.g. an image of a taxon or an image of a publication about a taxon), the data provider's data can be connected with other data in a way that is easier for a computer to understand. This results in higher quality data returned by queries across aggregated datasets, and more correct hits on the data provided by the data provider.

In addition, the use of the data provider's GUIDs in external objects (provided such objects can be discovered, for example through some sort of harvesting by an aggregator), is a useful metric of use and usefulness of data belonging to the data provider. The number of web pages with the words Homalomena Schott cannot convincingly be used as a metric of the usefulness of CATE Araceae, whereas the number of data objects (not published by CATE Araceae) that use urn:lsid:cate-araceae.org:taxonconcepts:99893 is a useful metric not only of all data from CATE Araceae, but also of that particular item of data (i.e. maybe for some taxon concepts CATE Araceae is preferred, whereas ubio is preferred for others).

The use of aggregation could be less effective if there were lots of other data providers that link contradictory information to objects published by the data provider (e.g. if lots of people publish images of non-Homalomena species that are "tagged" as the CATE Araceae concept of Homalomena Schott). This problem is not unique to GUIDs, but might be expected to be less problematic provided objects with GUIDs are created with care.

For aggregators, the GUIDs are useful because they can be harvested, and the data associated with these objects is returned in a well defined format common to all data providers that use GUIDs. Because GUIDs should be present in an object or an associated object, they can be used to detect if data about the same thing have been harvested from different sources, or if the same metadata has arrived via two or more routes (e.g. directly from IPNI and also from the Catalogue of Life). The use of GUIDs for a data provider is proportional to the number of objects with GUIDs, and (if the aggregator is able to understand the meaning of the TDWG Vocabulary to make connections between objects it has harvested) the number of links between objects.

For the end user, the advantage of GUIDs in this scenario is that they are able to query across a large number of data providers in a single query, and that their queries are more powerful because the semantics of the associations between objects are well specified in the TDWG Ontology used to describe the objects.

Data exported to a flat file for use in external tool

Use Case: A user of CATE Araceae exports data into SDD (Structured Descriptive Data) and this dataset is imported into the Lucid Builder. The user adds data (e.g. character state data) to the dataset and generates a new SDD document that they import back into the same database. The new measurements are added to the database (e.g. new description elements to existing descriptions), and elements that existed in the original dataset are updated if they have been edited in Lucid. Likewise new characters are imported into the CDM Database, but characters that existed at the time the data was originally exported are not duplicated (although they may be updated if the user has e.g. associated new images with them).

  1. A user uses an export tool provided by the CDM to export data as Structured Descriptive Data (or they could download data from the cate-araceae.org website). The GUID of an object (where it exists) is included in the SDD Element that represents that object.

  2. The user imports the data into Lucid Builder, preserving the GUIDs.

  3. The user edits the dataset in Lucid, updating some objects and adding new ones (these objects do not have GUIDs).

  4. They export the dataset as a new SDD document and import this document into the CATE Araceae database.

  5. The CDM Java Library unmarshalls the document and recognises some objects as having GUIDs

  6. The software checks the CDM Database to discover if these objects already exist. Objects without GUIDs are assumed to be unknown and therefore new.

  7. If an object does exist, the existing (persisted) object is updated, unless it is newer than the object being imported.

  8. If both the persisted and imported object have changes, the software alerts the user and requires them to manually resolve the issue.
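Steps 5 to 8 can be summarised as a decision made for each imported object. The sketch below is one possible reading of those steps. It assumes that "has changes" means "modified after the time of the original export", and that an object whose GUID is not found in the database is treated as new; both are assumptions, and the class ImportMerge is hypothetical rather than part of the CDM Java Library.

```java
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch of the per-object import decision described in steps 5 to 8.
final class ImportMerge {

    enum Action { CREATE, UPDATE, SKIP, CONFLICT }

    // guid: GUID carried by the imported object, if any.
    // persistedModified: last change of the matching database object, empty if none exists.
    // importedModified: last change of the imported object.
    // exportedAt: when the dataset was originally exported (assumed to be known).
    static Action decide(Optional<String> guid,
                         Optional<Instant> persistedModified,
                         Instant importedModified,
                         Instant exportedAt) {
        if (guid.isEmpty() || persistedModified.isEmpty()) {
            return Action.CREATE; // unknown object, therefore new
        }
        boolean persistedChanged = persistedModified.get().isAfter(exportedAt);
        boolean importedChanged = importedModified.isAfter(exportedAt);
        if (persistedChanged && importedChanged) {
            return Action.CONFLICT; // the user must resolve this manually
        }
        return importedChanged ? Action.UPDATE : Action.SKIP;
    }

    public static void main(String[] args) {
        Instant exported = Instant.parse("2009-01-01T00:00:00Z");
        Instant later = Instant.parse("2009-02-01T00:00:00Z");
        System.out.println(decide(Optional.of("urn:lsid:example.org:objects:1"),
                Optional.of(exported), later, exported)); // UPDATE
    }
}
```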

For software developers, GUIDs provide a standard way to assert identity of an object. Because the protocol for handling identifiable objects is defined by the GUID system, different software tools can import and export data safely provided they behave in the correct way (e.g. preserving GUIDs). For users, GUIDs provide a way to exchange partial datasets between applications (e.g. exporting part of a dataset from the CDM and using it in Lucid to make a Multi-Access Key, or exporting data into NEXUS format and using the data in R). By including version parts of a GUID in the exported data, and by versioning objects every time the object is changed, it is possible to say exactly which object was exported.

GUIDs make importing data back into existing datasets easier, but it is unlikely that they could remove the need for manual intervention if both persisted and imported objects have changes.

New information added to plant name

Use Case: One of the IPNI Editors makes a correction to the authority of Bognera recondita (Madison) Mayo & Nicolsen, correcting Nicolsen to Nicolson. This information is discovered by CATE Araceae, which has already associated the taxon concept Bognera recondita (Madison) Mayo & Nicolsen sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:22805) with the name published by IPNI (urn:lsid:ipni.org:names:942108-1).

  1. One of the IPNI Editors corrects the authority of the taxonomic name Bognera recondita (Madison) Mayo & Nicolsen.

  2. Either IPNI has stored information associating the taxon concept urn:lsid:cate-araceae.org:taxonconcepts:22805 with urn:lsid:ipni.org:names:942108-1 and notifies CATE Araceae that this object has been changed (push) or:

  3. CATE Araceae has stored that the name urn:lsid:ipni.org:names:942108-1 is the name of urn:lsid:cate-araceae.org:taxonconcepts:22805 and periodically polls IPNI to discover any relevant names that have changed (pull)

  4. Either way, CATE Araceae discovers that the authority of Bognera recondita should be (Madison) Mayo & Nicolson and updates its cached data accordingly.

  5. This update has knock-on effects, changing the title of the Taxon and TaxonDescription pages, altering search results, taxon tree etc within CATE Araceae.
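The pull variant in steps 3 and 4 can be sketched as a periodic comparison of cached values against the authority. In the sketch below the authority is represented by a function from name GUID to current authorship, which stands in for resolving the name at IPNI. The class NamePoller is hypothetical, and the list it returns is the set of taxon concepts whose pages, search results and so on would need refreshing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the pull variant; the authority lookup is passed in as a function.
final class NamePoller {

    // Refreshes cachedAuthorship (keyed by name GUID, must be mutable) and returns
    // the taxon concept GUIDs whose linked name changed at the authority.
    static List<String> refresh(Map<String, String> nameOfConcept,
                                Map<String, String> cachedAuthorship,
                                Function<String, String> authority) {
        // Ask the authority once per distinct name.
        Map<String, String> fresh = new HashMap<>();
        for (String nameGuid : new HashSet<>(nameOfConcept.values())) {
            fresh.put(nameGuid, authority.apply(nameGuid));
        }
        List<String> affectedConcepts = new ArrayList<>();
        for (Map.Entry<String, String> link : nameOfConcept.entrySet()) {
            String nameGuid = link.getValue();
            if (!fresh.get(nameGuid).equals(cachedAuthorship.get(nameGuid))) {
                affectedConcepts.add(link.getKey());
            }
        }
        cachedAuthorship.putAll(fresh);
        return affectedConcepts;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("urn:lsid:ipni.org:names:942108-1", "(Madison) Mayo & Nicolsen");
        List<String> affected = refresh(
                Map.of("urn:lsid:cate-araceae.org:taxonconcepts:22805", "urn:lsid:ipni.org:names:942108-1"),
                cache,
                nameGuid -> "(Madison) Mayo & Nicolson"); // stand-in for resolving at IPNI
        System.out.println(affected + " -> " + cache);
    }
}
```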

For IPNI, notifying CATE Araceae of changes to names is a burden, either through having to push changes to interested clients or suffering the additional burden of clients pulling data from it at regular intervals. This may be seen as the converse to "User contributes new taxon", i.e. if data providers want feedback about their data objects, they should reciprocate and inform clients of (or at least allow clients to discover) changes in data that they are using.

Being notified of or discovering changes in external data is a real advantage to clients as it allows the quality of secondary (to the client) data to be maintained without the need for manual checking. Automatic updating of names does have implications for clients if the authority makes changes that a client disagrees with. Because changes to foreign objects could happen automatically, it is expected that authorities should be explicit about the kind of data that might change in their objects.

Problems & Gaps:

  • GUID clients are not widely available (e.g. LSID clients).

  • Few tools import data formatted according to the TDWG standards (i.e. the TDWG RDF vocabularies). The CDM can import a subset of this data (TCS-RDF), but this is a manual step - the CDM Java Library does not provide an LSID client.

  • Conversion between RDF and the CDM is expected to be lossy. This is not necessarily a problem for publishing data that is read-only, but is problematic if data is imported or merged from multiple sources into the CDM.

  • The LSID specification does not provide a service that allows:

      o Objects to be harvested regardless of their properties (for use by aggregators). There are existing open standards for metadata harvesting, notably OAI-PMH.

      o Objects to be discovered based upon their properties (to allow the discovery of an object that already exists, provided by the authority). This service might be conceptually similar to OpenUrl.

      o The metadata associated with objects to be updated (for use by clients). This could be conceptually similar to the LSID Assigning service, except that the metadata provided is associated with an existing object, not a new object.

  • The CDM Server does not support LSID Assignment.

  • Currently the LSID Assigning Service does not support long-running (asynchronous) processes. This means that LSID assignment requires an immediate response, with no option for an authority (i.e. a human being) to take time to decide whether they want to assign an identifier or not. This makes LSID assignment of identifiers for abstract objects provided by non-trusted clients difficult or impossible in reality.

  • The CDM Server does not support Foreign Authority Notification, which is required if multiple data providers are to be able to resolve data about the same object. I don't know of any LSID authorities that do support this part of the protocol, and indeed this part of the protocol has not been developed to the point where I believe it could be implemented without making assumptions about the way it should work.

  • GUIDs are not preserved by magic. Users might not understand the importance of preserving GUIDs, or they might understand what a GUID is and not wish to preserve them anyway. If users obtain data but throw away the GUIDs, then the benefit of using GUIDs is naturally lost.

  • It is not always possible to recognise that a string of characters is a GUID, so even though a GUID might be preserved, subsequent users might not understand its significance. In this case, the benefit of using a GUID is lost (for that user). Some technologies have been created for the sole purpose of providing GUIDs, so provided a user recognises the technology, they should understand that the string is a GUID. In the case of technologies like HTTP URIs, these strings can be used as GUIDs but a user may not be able to tell if a URI is a GUID without attempting to resolve it.

  • Users may recognise a string as a link to further information or an identifier, but unless they are familiar with the technology or have a client that enables them to resolve the data associated with the object, they may not know what to do with it. If the identifier is of a form familiar to most users (e.g. HTTP URIs), users will naturally attempt to use a web client to obtain more information (but they may think that the GUID is just a URL). In the case of more esoteric GUIDs like DOIs and LSIDs, users may not know how to use them. To cope with this problem, GUIDs in documents intended for human consumption tend to be "clickable", i.e. they use an HTTP proxy or other mechanism to allow web browsers to resolve them.
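The last two problems (recognising that a string is a GUID, and making it "clickable") can be illustrated with a small sketch. The class name `GuidRecognizer`, the regular expressions and the two proxy base URLs are assumptions made for this example; a deployment would substitute whichever proxies it actually uses, and the patterns are deliberately loose rather than full validators.

```java
import java.util.regex.Pattern;

/** Identifier schemes mentioned in this document. */
enum GuidScheme { LSID, DOI, HTTP_URI, UNKNOWN }

final class GuidRecognizer {
    // urn:lsid:<authority>:<namespace>:<object>[:<revision>]
    private static final Pattern LSID_PATTERN =
        Pattern.compile("(?i)urn:lsid:[^:\\s]+:[^:\\s]+:[^:\\s]+(:[^:\\s]+)?");
    // A DOI starts with "10.", a registrant code, "/" and a suffix.
    private static final Pattern DOI_PATTERN =
        Pattern.compile("(?i)(doi:)?10\\.\\d{4,9}/\\S+");
    private static final Pattern HTTP_PATTERN =
        Pattern.compile("(?i)https?://\\S+");

    // Assumed proxy base URLs, used only for illustration.
    static final String LSID_PROXY = "http://lsid.tdwg.org/";
    static final String DOI_PROXY = "https://doi.org/";

    static GuidScheme classify(String s) {
        String t = s.trim();
        if (LSID_PATTERN.matcher(t).matches()) return GuidScheme.LSID;
        if (DOI_PATTERN.matcher(t).matches()) return GuidScheme.DOI;
        if (HTTP_PATTERN.matcher(t).matches()) return GuidScheme.HTTP_URI;
        return GuidScheme.UNKNOWN;
    }

    /** Returns a URL that a web browser can follow, or null if the scheme is not recognised. */
    static String toClickable(String s) {
        String t = s.trim();
        switch (classify(t)) {
            case LSID:
                return LSID_PROXY + t;
            case DOI:
                return DOI_PROXY + t.replaceFirst("(?i)^doi:", "");
            case HTTP_URI:
                return t;
            default:
                return null;
        }
    }
}
```

Note that an HTTP URI is classified by its form alone; as stated above, whether it really is a GUID cannot be determined without attempting to resolve it.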

Functional requirements

This section attempts to break down the use cases described above into much smaller operations that could be implemented by components of the CDM Java Library.

<code class="rst">
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
| Requirement                   | Native Version                      | Foreign Version                                              |
+===============================+=====================================+==============================================================+
| Creation of an object or data | 1.1 Create a new object with a GUID | 1.3 Ask a foreign authority to assign a GUID to a new object |
|                               | 1.2 Assign a GUID to an object that |                                                              |
|                               | didn’t previously have a GUID       |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
| Updating metadata about an    | 2.1 Update an existing object       | 2.5 Ask a foreign authority to update an existing object     |
| object                        | 2.2 Update an existing object       | 2.6 Find out if a foreign object has changed                 |
|                               | (and assign it a new GUID)          | 2.7 Notify a foreign authority that another authority holds  |
|                               | 2.3 Update an existing object (and  | metadata about an object (FAN)                               |
|                               | state that it replaces an existing  | 2.8 Notify a foreign authority that another authority no     |
|                               | object)                             | longer holds information about an object                     |
|                               | 2.4 Notify foreign authorities that |                                                              |
|                               | an object has changed               |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
| Deleting an object            | 3.1 Delete an existing object (that | 3.3 Ask a foreign authority to delete an existing object     |
|                               | had a GUID)                         |                                                              |      
|                               | 3.2 Delete an existing object (that |                                                              |
|                               | had a GUID, and state that another  |                                                              |
|                               | object replaces it).                |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
| Resolving metadata about an   | 4.1 Find an object based upon its   | 4.2 Resolve a foreign object based upon its GUID             |
| object                        | GUID                                |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Discovery of objects based on | 5.1 Find objects with GUIDs based   | 5.2 Find foreign objects based upon their metadata (from a   |
| their metadata                | upon their metadata                 | specific foreign authority)                                  |
|                               |                                     | 5.3 Find foreign objects based upon their metadata           |
|                               |                                     | (globally, from an aggregator)                               |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
| Object navigation             | 6.1 Find different versions of an   | 6.5 Resolve different versions of an object (given an object |
|                               | object                              | that has versions)                                           |
|                               | 6.2 Find the current version of an  | 6.6 Resolve the canonical version of an object (given a      |
|                               | object                              | version of the object)                                       |
|                               | 6.3 Find an object that is replaced | 6.7 Resolve the object that a given object is replaced by    |
|                               | by a given object                   | 6.8 Resolve the original object (s) replaced by an object    |
|                               | 6.4 Find an object that replaces a  |                                                              |
|                               | given object                        |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+ 
</code>

In addition, there are a couple of special stories related to importing / exporting data into static files & integrating existing data from foreign authorities into CDM applications:

7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)

7.2 Integrate external sources of data (from a foreign authority)

8.1 Export Data (into a static file) that contains GUIDs

8.2 Serve Metadata (to a foreign authority or client application)

1.1 Create a new object with a GUID

A user creates a new object in the database and the application assigns a new GUID to it.

1.2 Assign a GUID to an object that didn’t previously have a GUID

A user decides that an object in the database should be globally resolvable and identifiable and the application assigns a new GUID to it.

1.3 Ask a foreign authority to assign a GUID to a new object

A user creates a new object in the application and persists it locally, but would like a foreign authority to assign an identifier to it (e.g. FOA creates a taxon concept object that represents a taxon concept that is not found in taxonConcepts.org). The application makes a request on the user's behalf that taxonConcepts.org assigns a new GUID to the object.

2.1 Update an existing object

A user updates an existing object. The application saves the updated object as the current version of the object.

2.2 Update an existing object (and assign it a new GUID)

A user updates an existing object, and decides that the change is so significant that the object should be published under a new GUID. The application assigns a new GUID to the object, and updates the previous version of the object to record the fact that the old object is “replaced by” the new object.

2.3 Update an existing object (and state that it replaces an existing object)

A user updates an existing object, and wishes to “merge” another object. The application updates all of the references within the database so that they refer to the remaining object and updates the merged object to state that it has been replaced by the remaining object.
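A minimal sketch of the merge operation, assuming a toy in-memory store: `MergeSketch` and its two maps are hypothetical and stand in for the CDM persistence layer, where references would be foreign keys rather than map values.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory store illustrating "merge and mark as replaced". */
final class MergeSketch {
    /** Taxon GUID -> GUID of the name it refers to. */
    final Map<String, String> nameOfTaxon = new HashMap<>();
    /** Merged (retired) GUID -> GUID of the object that replaces it. */
    final Map<String, String> replacedBy = new HashMap<>();

    /** Re-points every reference from 'merged' to 'remaining' and records the replacement. */
    void merge(String remaining, String merged) {
        if (remaining.equals(merged)) {
            throw new IllegalArgumentException("cannot merge an object into itself");
        }
        for (Map.Entry<String, String> ref : nameOfTaxon.entrySet()) {
            if (ref.getValue().equals(merged)) {
                ref.setValue(remaining);
            }
        }
        // The merged GUID stays known to the store so that it can still be resolved.
        replacedBy.put(merged, remaining);
    }
}
```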

2.4 Notify foreign authorities that an object has changed

A user updates an existing object. Zero or more foreign authorities hold copies of this object and have explicitly registered that they would like to be notified when the object changes. The application notifies the foreign authorities on behalf of the user.
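The registration and notification described here is essentially an observer arrangement. The following sketch assumes a hypothetical `ChangeListener` callback; how the notification would actually travel to a foreign authority (e.g. over HTTP) is outside its scope.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback through which a foreign authority is told about a change. */
interface ChangeListener {
    void objectChanged(String guid, int newVersion);
}

final class ChangeNotifier {
    private final Map<String, List<ChangeListener>> listeners = new HashMap<>();

    /** A foreign authority explicitly registers its interest in one object. */
    void register(String guid, ChangeListener listener) {
        listeners.computeIfAbsent(guid, k -> new ArrayList<>()).add(listener);
    }

    /** Notifies the zero or more listeners for this GUID; returns how many were notified. */
    int notifyChanged(String guid, int newVersion) {
        List<ChangeListener> registered = listeners.getOrDefault(guid, new ArrayList<>());
        for (ChangeListener listener : registered) {
            listener.objectChanged(guid, newVersion);
        }
        return registered.size();
    }
}
```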

2.5 Ask a foreign authority to update an existing object

A user updates an existing object that belongs to a foreign authority. The application asks the foreign authority on behalf of the user to update the object. The foreign authority may decline to update the object. At this point, the application should offer the user the choice of (a) revert the object back to the original state prior to the update, (b) update the object to the new state of the object in the foreign authority, (c) replace the foreign object with a native object that does not share the same GUID. The application should prevent users from changing foreign objects if the authority does not permit it.

2.6 Find out if a foreign object has changed (and accept those changes)

A foreign object has changed in the foreign authority. The application should be capable of being notified (i.e. acting as a client in 2.4) or actively trying to pull updates from foreign authorities. In the first instance, we will assume that the application will accept changes from the owning authority without giving users the option to decide whether they accept those changes or not.

2.7 Notify a foreign authority that another authority holds metadata about an object (FAN)

A foreign authority holds metadata about an object and wishes the authority of that object to include a reference to the foreign authority in any metadata response for that object, so that GUID clients can discover the metadata that the foreign authority has. The foreign authority uses the Foreign Authority Notification part of the GUID protocol to inform the authority of its existence. If the authority implements this part of the specification, then it will return a reference to the foreign authority in any metadata response for that object as per the specification.

2.8 Notify a foreign authority that another authority no longer holds information about an object

A foreign authority no longer wishes to resolve information about a foreign object. It uses Foreign Authority Notification to inform the authority that it no longer has any metadata about the object that it wishes to resolve. If the authority supports this part of the specification it no longer returns a reference to the foreign authority in any metadata response for that object as per the specification.

3.1 Delete an existing object (that had a GUID)

A user deletes an object. Because the object has had a GUID, the application must still resolve the object (although it may only return some metadata stating that the object has been deleted).

3.2 Delete an existing object (that had a GUID, and state that another object replaces it).

The reverse of 2.3 – the user wishes to replace an object with another, better / more correct object. The application stores a reference to the object that replaces the deleted object.

3.3 Ask a foreign authority to delete an existing object

The user deletes a foreign object from the CDM store. Provided that the user thinks this should be a global delete / replace (as an example, a FOA user finds two duplicate synonyms in taxonConcepts.org with identical name and sec fields), the object is removed from the local CDM store and the request is propagated by the application to the foreign authority. In this case, it is possible for the CDM store to remove the object even if the foreign authority doesn’t delete it (if the deleted object is not referenced by any other object in the CDM store).

4.1 Find an object based upon its GUID

A user has a (possibly made-up) GUID that should belong to this CDM store. The application either (a) returns an object, (b) throws an exception stating that the object did exist, but was deleted (and maybe provides a way of retrieving that deleted object), or (c) throws an exception stating that no object with that GUID has ever existed.
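The three outcomes could be modelled as below. This is a sketch under the assumption that deleted GUIDs are remembered as "tombstones"; the exception and class names are invented for the example and are not part of the CDM Java Library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Outcome (b): the GUID was once assigned but the object has since been deleted. */
class ObjectDeletedException extends RuntimeException {
    ObjectDeletedException(String guid) {
        super("object was deleted: " + guid);
    }
}

/** Outcome (c): no object with this GUID has ever existed in the store. */
class UnknownGuidException extends RuntimeException {
    UnknownGuidException(String guid) {
        super("no such object has ever existed: " + guid);
    }
}

final class GuidLookup {
    private final Map<String, Object> live = new HashMap<>();
    private final Set<String> tombstones = new HashSet<>();

    void save(String guid, Object object) {
        live.put(guid, object);
    }

    /** Removes the object but remembers that its GUID has been used. */
    void delete(String guid) {
        if (live.remove(guid) != null) {
            tombstones.add(guid);
        }
    }

    /** Outcome (a): returns the object, otherwise throws one of the two exceptions. */
    Object find(String guid) {
        Object object = live.get(guid);
        if (object != null) {
            return object;
        }
        if (tombstones.contains(guid)) {
            throw new ObjectDeletedException(guid);
        }
        throw new UnknownGuidException(guid);
    }
}
```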

4.2 Resolve a foreign object based upon its GUID

In several other use-cases, the application will encounter foreign GUIDs. The application should transparently retrieve those objects. If those objects are used (e.g. as part of a checklist) within the CDM Store, then those foreign objects should be cached (persisted). This may introduce the requirement for polling / notification of the local CDM store if the foreign object changes.
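As a rough illustration of transparent retrieval with caching, the sketch below wraps a remote lookup in a local cache. The `Function<String, String>` stands in for a real GUID client (which the CDM Java Library does not currently provide), metadata is reduced to a plain string, and the cache is an in-memory map rather than a persisted copy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Resolves foreign GUIDs via a remote call and keeps a local copy of each result. */
final class CachingResolver {
    private final Function<String, String> remote; // stand-in for a real GUID client
    private final Map<String, String> cache = new HashMap<>();

    CachingResolver(Function<String, String> remote) {
        this.remote = remote;
    }

    /** Returns the cached metadata if present; otherwise resolves remotely and caches it. */
    String resolve(String guid) {
        String cached = cache.get(guid);
        if (cached != null) {
            return cached;
        }
        String fetched = remote.apply(guid);
        if (fetched != null) {
            cache.put(guid, fetched);
        }
        return fetched;
    }

    /** Drops a cached copy, e.g. after polling or a notification reports a change. */
    void invalidate(String guid) {
        cache.remove(guid);
    }
}
```

The `invalidate` method is where the polling / notification requirement mentioned above would hook in.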

5.1 Find objects with GUIDs based upon their metadata

A user wishes to discover existing objects based on their metadata (properties or relations). The application should return a list of 0 or more objects that match the query.

5.2 Find foreign objects based upon their metadata (from a specific foreign authority)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM Store understands that a particular authority should be used for this type of data. The application should query the foreign authority and return a list of 0 or more objects that match the query. This is similar to 1.3, but no objects are created if the foreign authority does not have any matches.

5.3 Find foreign objects based upon their metadata (globally, from an aggregator)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM Store uses an aggregator to discover matching objects in a collection of objects belonging to a number of authorities (including perhaps this authority). The application should query the aggregator and return a list of 0 or more objects that match the query.

6.1 Find different versions of an object

A user wishes to find out how many times an object has changed and what changes were made to an object. The application returns a list of different versions of the object.

6.2 Find the current version of an object

A user wishes to use the most up-to-date version of an object. The application returns the most up to date version of the object.

6.3 Find objects that are replaced by a given object

A user wants to find the object(s) that were replaced by a given object. The application returns these objects.

6.4 Find objects that replace a given object

A user wants to find the object(s) that replace a given object. The application returns these objects.
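The two navigation operations 6.3 and 6.4 are inverse lookups over the same "replaced by" relation. A minimal sketch, assuming for simplicity that each object has at most one replacement (the text above allows for several); `ReplacementIndex` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ReplacementIndex {
    /** Old GUID -> GUID of the object that replaces it. */
    private final Map<String, String> replacedBy = new HashMap<>();

    void recordReplacement(String oldGuid, String newGuid) {
        replacedBy.put(oldGuid, newGuid);
    }

    /** 6.3: the objects that were replaced by the given object, in sorted order. */
    List<String> findReplaced(String guid) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : replacedBy.entrySet()) {
            if (entry.getValue().equals(guid)) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    /** 6.4: the object that replaces the given object, or null if it has not been replaced. */
    String findReplacement(String guid) {
        return replacedBy.get(guid);
    }

    /** Follows the chain of replacements to the object that has not itself been replaced. */
    String findLatest(String guid) {
        String current = guid;
        Set<String> visited = new HashSet<>(); // guards against accidental cycles
        while (replacedBy.containsKey(current) && visited.add(current)) {
            current = replacedBy.get(current);
        }
        return current;
    }
}
```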

6.5 Resolve different versions of an object (given an object that has versions)

A user has a foreign object that is versioned. The application seamlessly retrieves and presents the different versions of the object (so that users can check to see how the object has changed over time). The application should prevent users from persisting old versions of the object, although retrieving the most recent version might be useful if an export is intended to be static (so that it is possible to determine the exact state of the document at the time of creation).

6.6 Resolve the canonical version of an object (given a version of the object)

A user has a foreign object that is a specific version of that object. The application seamlessly retrieves and presents the current version of the object (this would be required upon import of some data with versions). The application should check to see if an object has been replaced by other objects and if so, should warn the user.

6.7 Resolve a replacement object given the object it replaces

A user has a foreign object that is replaced by another object. The application seamlessly retrieves the replacement and presents it to the user.

6.8 Resolve the original object(s) replaced by an object

A user has a foreign object that replaces one or more other objects. The application seamlessly retrieves these objects (to allow the user to check the original objects). The application should prevent users importing objects that have been replaced into the CDM store.

7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)

A user has a static resource that they wish to import into the CDM store. The application identifies that some objects within the resource have GUIDs. It handles issues such as: some objects already exist within the CDM store, some objects might be earlier versions of an object that has since been updated. There is business logic (and perhaps a workflow) for checking objects to find out if they have been updated. It may be a value judgement on the part of the user whether an object has changed significantly or not.
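The mechanical part of this business logic (is the object new, newer, unchanged or older than what is stored?) might look like the sketch below. It assumes that versions can be compared as integers, which is a simplification, and it leaves out the value judgement about whether a change is significant. `ImportAction` and `ImportPlanner` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** What the importer does with one incoming object. */
enum ImportAction { INSERT, UPDATE, SKIP_UNCHANGED, SKIP_OLDER }

final class ImportPlanner {
    /** GUID -> version currently held in the store. */
    private final Map<String, Integer> storedVersions = new HashMap<>();

    /** Decides how an incoming object relates to what the store already holds. */
    ImportAction decide(String guid, int incomingVersion) {
        Integer stored = storedVersions.get(guid);
        if (stored == null) {
            return ImportAction.INSERT;
        }
        if (incomingVersion > stored) {
            return ImportAction.UPDATE;
        }
        if (incomingVersion == stored) {
            return ImportAction.SKIP_UNCHANGED;
        }
        return ImportAction.SKIP_OLDER;
    }

    /** Applies the decision to the store and reports it, e.g. for an import log. */
    ImportAction importObject(String guid, int incomingVersion) {
        ImportAction action = decide(guid, incomingVersion);
        if (action == ImportAction.INSERT || action == ImportAction.UPDATE) {
            storedVersions.put(guid, incomingVersion);
        }
        return action;
    }

    Integer storedVersion(String guid) {
        return storedVersions.get(guid);
    }
}
```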

7.2 Integrate external sources of data (from a foreign authority)

A user wishes to search an external authority for data. The application acts as a proxy and searches the authority on behalf of the user. The external authority may return objects that the user can inspect (some may already exist in the CDM store). The user may then wish to use these foreign objects (i.e. attach them to new or existing native objects).

8.1 Export Data (into a static file) that contains GUIDs

A user wishes to export a subset of the data into a static file or resource (i.e. a read-only database). The application provides the data in a format that will allow clients to find and use the GUIDs in that resource (i.e. it might ensure that the version part of the GUID is included in those objects that are versioned, to ensure that the precise object is referenced).
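For LSIDs, including the version part means appending the optional revision component, giving `urn:lsid:authority:namespace:object:revision`. A small helper along these lines could be used at export time; `LsidRevision` is a hypothetical name, and using the object's version number as the revision is an assumption of this sketch.

```java
final class LsidRevision {
    /** Returns the LSID with the given revision appended, replacing any existing revision. */
    static String withRevision(String lsid, String revision) {
        String[] parts = lsid.split(":");
        boolean isLsid = (parts.length == 5 || parts.length == 6)
            && parts[0].equalsIgnoreCase("urn")
            && parts[1].equalsIgnoreCase("lsid");
        if (!isLsid) {
            throw new IllegalArgumentException("not an LSID: " + lsid);
        }
        // Keep urn:lsid:authority:namespace:object and set the revision component.
        return String.join(":", parts[0], parts[1], parts[2], parts[3], parts[4], revision);
    }
}
```

For example, `LsidRevision.withRevision("urn:lsid:cate-araceae.org:taxonconcepts:22805", "3")` yields `urn:lsid:cate-araceae.org:taxonconcepts:22805:3`.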

8.2 Serve Metadata (to a foreign authority or client application)

A foreign authority or client application makes a request for a particular representation of an object. The application handles this request according to the specification.

Non Functional Requirements

In addition to the functional requirements outlined above, the non-functional requirements of Authentication and Identity of principals across GUID authorities need to be met.

As with business rules for accepting, validating, and handling data, the implementation of rules for authorization of particular operations (e.g. for updating), and audit of operations should be determined by the authority. However, methods for authentication should be specified in any GUID protocol. It is also desirable to be able to provide a globally unique identifier for the user account or the person who holds that account, so that credit can be assigned (e.g. for improving data, or creating new data).

Technical Issues / Decisions

  1. GUID identity and Java Object identity and Hibernate / Database identity should mean the same thing

Hibernate enables us to treat Java object identity (using object.equals()) as equivalent to database (row) identity (using the primary key, for example). Given that two objects carrying the same GUID (identifier) in rdf:about, or asserted to be the same using owl:sameAs, are the same thing, they should be the same Java object (and database row) too.

Relationships between two distinct (independently resolvable) but somehow related objects are handled using a different mechanism in RDF, and likewise, within the CDM (e.g. taxon.synonyms.synonym or taxon.relationsFromThisTaxon[type=CONGRUENT_TO] or term.generalizationOf).

The consequence of this rule is that a given CDM store cannot have more than one object with the same GUID, regardless of version. This makes programming the CDM possible (feasible), but also has consequences when importing representations of objects with GUIDs, or when querying foreign authorities or aggregators for objects. If an object is already present within the CDM store, the application should either (a) return the already persisted version, if that version is the most up to date or (b) update the persisted version and return that, but it should not create a new object with a different primary key.

  2. GUID Assignment, discovery and harvesting are all services that should be implemented in the service layer and exposed by the controller layer in the CDM Server, because the low-level implementation of these services might need to be re-used in various contexts, e.g. in the Taxonomic Editor.

  3. The version identifier of an object should be incremented every time an object is updated.

  4. Objects can be deleted within the CDM, because the CDM implements versioning, meaning that objects can be removed but are still resolvable.

  5. The application should prevent users from assigning a GUID that has already been used. This should be part of the validation component of the CDM.

  6. Given that there are other identifier schemes that are used, the CDM should support these. The consequence of this is that the current identity implementation, which is based on LSIDs only, should be changed to support other identifier schemes (provided that they fit into the general design outlined here). It is particularly important to support HTTP URIs, as these are already being used for some terms in the TDWG LSID Vocabularies, and DOIs, as the current best practice is that references with DOIs are not assigned new GUIDs because they are considered to have one already.

  7. Given the correct user permissions and access rights, users of the CDM can do more-or-less what they want to native objects. If there are general business rules that exist for handling certain data types, the CDM should implement those rules as part of the data validation functionality.

  8. The CDM should distinguish between foreign and native objects with identifiers and behave appropriately with these objects.
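The first decision in this list (GUID identity, Java object identity and database identity mean the same thing) can be illustrated with a toy example. It is a sketch only: `IdentifiableSketch` and `OneObjectPerGuidStore` are invented names, and a map keyed by GUID stands in for Hibernate and the database.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical identifiable object whose equality is defined by its GUID alone. */
class IdentifiableSketch {
    final String guid;
    int version;
    String title;

    IdentifiableSketch(String guid, int version, String title) {
        this.guid = guid;
        this.version = version;
        this.title = title;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof IdentifiableSketch
            && guid.equals(((IdentifiableSketch) other).guid);
    }

    @Override
    public int hashCode() {
        return guid.hashCode();
    }
}

/** Holds at most one object per GUID, regardless of version. */
final class OneObjectPerGuidStore {
    private final Map<String, IdentifiableSketch> byGuid = new HashMap<>();

    /** Returns the single persisted instance, updating it in place if 'incoming' is newer. */
    IdentifiableSketch saveOrUpdate(IdentifiableSketch incoming) {
        IdentifiableSketch persisted = byGuid.get(incoming.guid);
        if (persisted == null) {
            byGuid.put(incoming.guid, incoming);
            return incoming;
        }
        if (incoming.version > persisted.version) {
            persisted.version = incoming.version;
            persisted.title = incoming.title;
        }
        return persisted;
    }

    int size() {
        return byGuid.size();
    }
}
```

Saving a newer representation of an object that is already present updates the persisted instance and returns it, so callers never end up holding a second object with the same GUID.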

CDM Application Architecture & Design

The following diagram shows the kind of environment that the CDM might be operating in:

The following components are required in general. The components in bold already exist to some extent

  1. GUID data model. Already implemented, but for LSIDs only. Could be extended / refactored to incorporate other GUID types (e.g. DOI, HTTP URI).

  2. GUID Registry. This component maps authority:namespace pairs to CDM Objects (and thus to services in the CDM service layer).

  3. GUID Assigning Service. This component assigns new GUIDs to local CDM objects, or acts as a proxy for an external assigning service.

  4. GUID Resolution (controller layer). This component handles GUID requests. The CDM has a functional LSID Resolution component.

  5. GUID Harvesting (controller layer). This component allows aggregators to poll the service and retrieve new or updated objects for indexing. There is no standard for GUID harvesting, although something like OAI-PMH seems like a good choice.

  6. GUID Discovery (controller layer). This component would allow clients to discover if an object with a given set of properties already exists. There is no standard for GUID Discovery, although in principle something like the assigning service but with GET semantics would work (i.e. don't create something if it doesn't exist). Alternatively, something like OpenUrl with a custom metadata format could suffice.

  7. GUID Assignment. This component is functionally similar to the discovery component, but has POST semantics, i.e. a resource is created if the request is successful. This component could be extended beyond the original LSID assignment spec to encompass object updating, i.e. a client transmits an object that belongs to the authority and already has a GUID, and the authority may update its representation as provided by the client. GUID assignment is essentially a complex operation, and if a human being needs to be involved in deciding whether a request from a foreign authority should be accepted or not, then the request must be asynchronous and long-running. If such asynchronous, long-running operations are supported, then notification (of the PUSH or PULL variety) should be supported.

  8. Foreign Authority Delegating DAO. This component wraps a DAO and allows the application to resolve foreign objects, persist local cached copies of foreign objects, and if the foreign authority allows it, assign foreign identifiers for new objects that it creates.
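To make the GUID Registry component more concrete, here is a minimal sketch of routing an LSID to a service by its authority:namespace pair. `GuidRegistry` is a hypothetical class, and a `Function` from object identifier to object stands in for a service in the CDM service layer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class GuidRegistry {
    /** Key "authority:namespace" -> lookup function standing in for a CDM service. */
    private final Map<String, Function<String, Object>> services = new HashMap<>();

    void register(String authority, String namespace, Function<String, Object> service) {
        services.put(key(authority, namespace), service);
    }

    /** Routes an LSID to the service registered for its authority:namespace pair. */
    Object resolve(String lsid) {
        String[] parts = lsid.split(":");
        if (parts.length < 5) {
            throw new IllegalArgumentException("not an LSID: " + lsid);
        }
        Function<String, Object> service = services.get(key(parts[2], parts[3]));
        if (service == null) {
            throw new IllegalArgumentException("no service for " + parts[2] + ":" + parts[3]);
        }
        return service.apply(parts[4]); // parts[4] is the object identifier
    }

    // In this sketch the authority is compared case-insensitively and the namespace exactly.
    private static String key(String authority, String namespace) {
        return authority.toLowerCase(Locale.ROOT) + ":" + namespace;
    }
}
```

A registry of this kind would also be a natural place to record whether an authority:namespace pair is native or foreign, so that requests for foreign objects can be handed to the delegating DAO.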

For a CDM application that imports or exports (foreign or local) identifiable objects, the i/o components will need special business logic to handle objects with GUIDs.

Conclusion: where next?

This document presents a set of use cases focussed on the use of Globally Unique Identifiers to help users of a single CDM Community Store achieve their objectives. A number of Use-Cases have been presented, showing how GUIDs can be used by Data Providers, aggregators and end users to discover, connect, and manage data, and to distribute data across a number of data providers. For GUIDs to be very useful, additional services are needed such as services that allow GUIDs and their metadata to be harvested (discovered en-masse, regardless of their properties) or discovered, based upon their properties. In addition, existing GUID specifications should be extended to allow for long-running, asynchronous processes.

Given the nascent state of GUID resolution services, it is unlikely that such complex services will be developed in the near future. Consequently, the CDM should not attempt to develop services based upon GUIDs until the community as a whole has a shared understanding of the problem, and has refined its specifications further.

It would be more productive to bear in mind the overall architectural design presented here whilst developing related areas of the CDM Java Library further e.g. de-duplication / merging, validation, web services, data-import / export.

Updated by Andreas Müller almost 2 years ago · 10 revisions