GuidReport » History » Version 2

Ben Clark, 08/12/2009 04:35 PM

# GUIDs in the CDM

## Purpose of this document

The EDIT Common Data Model (CDM) Java Library is a generic API for building applications for revisionary taxonomy and taxonomic field work. Such applications are envisaged to be integrated into the wider biodiversity informatics landscape. Standard vocabularies for describing data, the use of globally unique identifiers (GUIDs) to distinguish between data items, and standard protocols and web services for exchanging metadata are understood to be the means to achieve this integration.

In this context, "GUIDs" is taken to mean the whole system: the globally unique identifiers themselves, plus the associated technology for resolving metadata about the objects those identifiers identify, the ontology for describing the relationships, properties and semantics of those objects, the format of any representations of those objects, and the software applications that can use and understand those objects. A useful reference is the definition of GUIDs provided by the TDWG GUID Wiki: http://wiki.tdwg.org/twiki/bin/view/GUID/WebHome#A_Definition_of_Globally_Unique.

This document attempts to:

* Describe some things (use cases) that a user would want to achieve, and how the use of a GUID would help the user achieve each goal.
* Identify any gaps in the currently proposed GUID technology, and perhaps propose solutions to those gaps.
* Identify functional and non-functional requirements of such a system in terms of lower-level operations, and map these requirements onto the CDM Java Library.
* Specify, at a high level, the components that the CDM Java Library is missing with regard to GUIDs.

## Scenario

Taxonomists and other biodiversity scientists collect and create information about biological entities. The total amount of information is very large: so large that it is impossible to collect more than a fraction of it during the course of any one project, or to hold more than a fraction of it in one database or software application. Instead, databases of limited scope (e.g. limited in taxonomic scope, in geographical scope, or to a subset of the categories of information) are created, usually for a particular purpose by a particular organization or group of individuals. There are many databases, each containing a subset of the total information.

The following use cases are set in the context of information shared between cate-araceae.org and IPNI. CATE Araceae is a database created by Simon Mayo & collaborators (cate-araceae.org). It is a taxonomic revision of the Araceae, a group of about 3,000 species of plants. The primary purpose of the database is to provide a classification and diagnostic description of the accepted species of Aroids. Initially the taxonomic concepts used in CATE Araceae were those accepted by the Monocot Checklist, but it is believed that a large number (> 1,000) of Aroid species are yet to be described, so it is likely that the cate-araceae.org checklist will diverge from the Monocot Checklist in the future unless particular effort is spent keeping them synchronized. The core data within this database are Taxonomic Concepts and Descriptions of those Taxonomic Concepts. However, cate-araceae.org also contains lists of Taxonomic Names, authors, specimens, references, controlled terms and many other types of data. Ideally, CATE Araceae would use global authority files for these entities, as maintaining the classification and descriptive data is already more than enough work. An added complication is that these "global authority files" are themselves not static, but change as new publications, specimens etc. are created or as existing data is improved.

The International Plant Names Index (IPNI) has been created by a consortium of RBG Kew, the Australian National Herbarium, and the Harvard University Herbaria. It was assembled from Index Kewensis, the Gray Card Index and APNI, and aims to compile and maintain a comprehensive literature-based record of the scientific names of all vascular plants and to make it freely available on the Internet. It is updated on a regular basis by the IPNI Editors. The core data in IPNI are Taxonomic Names, Publications, and Authors.

Some general features of such databases are that:

* They do not cover all information, but specialize in a particular subset of information.
* They are created or compiled in order to meet the immediate business needs of specific users working for, or with, the organization that supports the database. In the case of cate-araceae.org, Simon Mayo and his collaborators are active Araceae taxonomists who use the database in their day-to-day work, and as a means of publishing the results of their research. Likewise, the components of IPNI were created primarily to serve botanical research globally.
* Once assembled, the data in such databases need to be updated if they are to remain useful. In the case of cate-araceae.org, it is estimated that approximately 1,000 aroid species are currently undescribed. In the case of IPNI, new, validly published names are added to the index, in addition to continuous efforts to improve the quality of the data.
* They are (usually) publicly funded and publicly available. It is important to the organizations that support such databases that these resources are used, and are useful beyond the organization that created them, although it can be difficult to demonstrate this use or usefulness.

## The CDM & GUIDs

The CDM is a data model implemented in Java. The metadata returned by an LSID resolution service, for example, is an RDF document typed according to the TDWG ontology. There is (in most cases) a one-to-one mapping between classes and properties in the CDM and the current TDWG ontology (found at http://rs.tdwg.org/ontology). However, neither data model completely covers the other. In some cases properties in one model are composites of properties in the other. A second problem is that cardinality constraints have not been placed on the properties of RDF objects in the TDWG ontology. As a consequence, conversion between the CDM and the TDWG ontology is expected to be lossy (i.e. in some instances CDM objects cannot be converted into RDF and back again without loss of data). The following table lists some of the main objects or properties used in the use cases below, as CDM objects and as their equivalents in the TDWG ontology.

| CDM | TDWG Ontology |
| --- | --- |
| TaxonBase (abstract class, can be Taxon or Synonym) | TaxonConcept |
| TaxonNameBase (abstract class, can be BotanicalName, ZoologicalName etc.) | TaxonName |
| TaxonBase.name | TaxonName.hasName |
| TaxonBase.descriptions | TaxonBase.hasDescription |
| NonViralName.specificEpithet | TaxonName.specificEpithet |
| NonViralName.genusOrUninomial | TaxonName.uninomial |
| NonViralName.rank | TaxonName.rank |
| no direct equivalent | TaxonName.authorship |
| TaxonNameBase.descriptions | no direct equivalent |

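The mapping in the table can be pictured as a simple lookup. The following is only a sketch: the class `CdmTdwgMapping` and its method `tdwgEquivalent` are hypothetical names invented for illustration and are not part of the CDM Java Library. It covers a subset of the rows above, and terms with no listed equivalent resolve to an empty result, which is where a lossy conversion would show up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the CDM Java Library.
class CdmTdwgMapping {
    private static final Map<String, String> CDM_TO_TDWG;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        // A subset of the rows from the table above.
        m.put("TaxonBase", "TaxonConcept");
        m.put("TaxonNameBase", "TaxonName");
        m.put("NonViralName.specificEpithet", "TaxonName.specificEpithet");
        m.put("NonViralName.genusOrUninomial", "TaxonName.uninomial");
        m.put("NonViralName.rank", "TaxonName.rank");
        CDM_TO_TDWG = Collections.unmodifiableMap(m);
    }

    /** Returns the TDWG term for a CDM class or property, or empty if none is listed. */
    static Optional<String> tdwgEquivalent(String cdmTerm) {
        return Optional.ofNullable(CDM_TO_TDWG.get(cdmTerm));
    }
}
```

A caller that receives an empty result (for example for `TaxonNameBase.descriptions`) knows that the value cannot be carried across the conversion and has to decide whether to drop it or report it.
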
The current TDWG recommendation is that LSIDs and an LSID Resolution Service are used to publish data about objects. The CDM Server partially implements the LSID Resolution Service specification (it does not have working Foreign Authority Notification). There is also an LSID Assigning Service specification, which the CDM Server does not implement.

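The LSIDs used throughout this document follow the pattern `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing `:<revision>`. As a minimal sketch of how a client might take such an identifier apart, the hypothetical class below (the name `Lsid` and its fields are illustrative assumptions, not CDM Java Library API) splits the string and rejects anything that does not fit that pattern.

```java
// Hypothetical value class for illustration; not part of the CDM Java Library.
class Lsid {
    final String authority;
    final String namespace;
    final String objectId;
    final String revision; // null when the LSID carries no revision part

    private Lsid(String authority, String namespace, String objectId, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.objectId = objectId;
        this.revision = revision;
    }

    /** Parses urn:lsid:authority:namespace:object[:revision]. */
    static Lsid parse(String text) {
        String[] parts = text.split(":", -1);
        boolean wellFormed = (parts.length == 5 || parts.length == 6)
                && parts[0].equalsIgnoreCase("urn")
                && parts[1].equalsIgnoreCase("lsid");
        if (wellFormed) {
            // Every component after the prefix must be non-empty.
            for (int i = 2; i < parts.length; i++) {
                wellFormed = wellFormed && !parts[i].isEmpty();
            }
        }
        if (!wellFormed) {
            throw new IllegalArgumentException("Not an LSID: " + text);
        }
        return new Lsid(parts[2], parts[3], parts[4], parts.length == 6 ? parts[5] : null);
    }

    /** The identifier of the same object with any revision part removed. */
    String withoutRevision() {
        return "urn:lsid:" + authority + ":" + namespace + ":" + objectId;
    }
}
```

For example, parsing `urn:lsid:ipni.org:names:927070-1:1.1.2.1.1.2` gives the authority `ipni.org`, the namespace `names`, the object `927070-1` and the revision `1.1.2.1.1.2`.
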
## Assumptions

1. A GUID in this context is a globally unique identifier for an object (plus the associated technology).
1. The classes or categories of object, and their properties, are defined by the TDWG ontology and can be mapped onto CDM objects.
1. GUIDs are resolvable using any software client that uses the standard protocol defined as part of the GUID technology.
1. The representation formats for the different classes are also defined by TDWG.
1. In addition to having properties that are "part of" or "core" to the object, globally identifiable objects can also be related to other globally identifiable objects.
1. The properties or attributes of an object are not necessarily immutable, i.e. it is possible to change the properties of an object or its relationships with other objects.
1. If two representations have the same GUID then they unambiguously represent the same thing.
1. If two representations have different GUIDs then they may still represent the same thing, but this may be a value judgement (based on a comparison of their properties).
1. Objects with GUIDs are intended to be permanently resolvable (in the same sense as anything man-made, i.e. when an object is published the intention is for it always to be resolvable).
1. Objects "belong" to only one authority. The authority that "owns" the object is entitled to change the properties of the object.
1. Objects can be associated with other objects (within the constraints of the TDWG ontology). Once a GUID is published, its authority cannot prevent other users from using that GUID in associations.

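The two identity assumptions above (same GUID means the same thing; different GUIDs leave the question open) can be stated as a small comparison. This is a sketch only: `GuidIdentity` and its `Identity` enum are hypothetical names, and plain string equality is assumed to be an adequate test for "the same GUID".

```java
// Hypothetical illustration of the identity assumptions; not CDM Java Library API.
class GuidIdentity {
    enum Identity {
        SAME,         // identical GUIDs: unambiguously the same thing
        UNDETERMINED  // different GUIDs: may or may not be the same thing
    }

    static Identity compare(String guidA, String guidB) {
        // Assumption: two GUIDs are "the same" when their strings are equal.
        return guidA.equals(guidB) ? Identity.SAME : Identity.UNDETERMINED;
    }
}
```

Note that there is deliberately no `DIFFERENT` result: deciding that two objects with different GUIDs are distinct requires comparing their properties, which is outside what the identifiers alone can tell a client.
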
## Use Cases

### User finds GUID in publication & uses it to discover more information

*Use Case*: A user is interested in learning about Philodendron venustifoliatum, and obtains a PDF of the article "Philodendron venustifoliatum (Araceae): a new species from Brazil", Kew Bull. 53: 483–486. The GUID "urn:lsid:cate-araceae.org:taxonconcepts:152024" is embedded in the PDF as a hyperlink. Using their client, the user retrieves information from a variety of data providers. Some of the data was not necessarily available at the time the document was published, or does not reside in the database of cate-araceae.org.

1. The user clicks on the link, hoping to find more information about the species.
1. Using a GUID client, the user obtains a document in a standard format that is typed according to the TDWG ontology. The user's client can understand that this document describes a taxon concept. It also understands the meaning (semantics) of the properties of a taxon object, as the semantics of these properties are also defined by the TDWG ontology.
1. The taxon object contains data including the description of the species, links to images, a distribution coded according to a controlled vocabulary, and another embedded GUID, "urn:lsid:ipni.org:names:320552-2", associated with the name property of the taxon object. The GUID client knows that this GUID is a pointer to a name object (this is also defined by the TDWG ontology).
1. The user indicates through the user interface that they want to learn more about the name of the taxon (e.g. by clicking on the name property), and their client resolves this identifier and retrieves the name object from IPNI. The name object contains more data, again typed according to the TDWG ontology. The user discovers the location that the type was collected from, and the current location of the holotype and syntypes.
1. The user's client can discover and use other services that also understand the TDWG ontology, e.g. specimen databases, to retrieve further information about the specimens that typify this name.

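The steps above amount to resolving one GUID, reading a typed property from the result, and resolving the GUID found there. The sketch below illustrates this with an in-memory stand-in for the network: the `Resolver` interface, the `Metadata` class and the `label` property are assumptions made for the example (no LSID client exists in the CDM Java Library), and `hasName` is used as the property name only because it appears in the mapping table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of link following; the types here are not CDM Java Library API.
class LinkFollowingSketch {
    /** Metadata for one resolved object: its type plus simple string properties. */
    static class Metadata {
        final String type;
        final Map<String, String> properties = new HashMap<>();

        Metadata(String type) {
            this.type = type;
        }
    }

    /** Stand-in for a GUID client; a real one would contact the authority. */
    interface Resolver {
        Metadata resolve(String guid);
    }

    /** Resolves a taxon concept, then follows its name property to the name object. */
    static Metadata resolveName(Resolver resolver, String taxonConceptGuid) {
        Metadata taxon = resolver.resolve(taxonConceptGuid);
        if (taxon == null || !"TaxonConcept".equals(taxon.type)) {
            throw new IllegalArgumentException("Not a taxon concept: " + taxonConceptGuid);
        }
        String nameGuid = taxon.properties.get("hasName");
        return nameGuid == null ? null : resolver.resolve(nameGuid);
    }

    /** In-memory data mirroring the identifiers used in the use case. */
    static Resolver demoResolver() {
        Map<String, Metadata> store = new HashMap<>();
        Metadata taxon = new Metadata("TaxonConcept");
        taxon.properties.put("hasName", "urn:lsid:ipni.org:names:320552-2");
        store.put("urn:lsid:cate-araceae.org:taxonconcepts:152024", taxon);
        Metadata name = new Metadata("TaxonName");
        name.properties.put("label", "Philodendron venustifoliatum"); // illustrative property
        store.put("urn:lsid:ipni.org:names:320552-2", name);
        return store::get;
    }
}
```

The client never needs to know in advance which authority holds the name: the type of the first object tells it that `hasName` points at a name object, and the GUID itself says where to resolve it.
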
The use of GUIDs is beneficial for cate-araceae.org because it is able to supply data to a larger number of users and clients by adopting a single, generic, standard protocol and ontology (rather than developing a service specifically for each client). cate-araceae.org adds value to its data by linking it to other data that, in turn, can be linked to more data, all accessible through the same standard route. For IPNI, the benefits come from increased numbers of users discovering or using its data through links from external data providers.

In choosing to use an IPNI GUID, cate-araceae.org is deferring to the expertise of the IPNI Editors, increasing their status as the authority for that particular class of data. The provenance of the names data in cate-araceae.org is established as originating from a particular record in IPNI, even if a user downloads that data from cate-araceae.org. Provided that cate-araceae.org adds some value (i.e. by providing extra data or extra services) beyond the data and services offered by IPNI, the relationship can be symbiotic.

Because GUIDs are permanently resolvable, the metadata associated with the objects in CATE Araceae and IPNI can be retrieved even if the data moves location (e.g. between servers, databases, or even institutions hosting the data).

### User contributes new taxon

*Use Case*: Simon Mayo has added the (fossil) Aroid genus Albertarum Bogner, G.L. Hoffman & Aulenback to cate-araceae.org. Joseph Bogner, who described this genus, collaborates with Simon and is in regular communication with him about his research. The name Albertarum is submitted to the IPNI Editors for consideration on Simon's behalf, much sooner than it would be if they had been forced to discover the publication of the name by scanning the paper literature. IPNI includes the name in its database, assigning a new GUID to it. Once the GUID is assigned, IPNI notifies CATE Araceae that it has been published, and CATE Araceae attaches the GUID to the name object in its database.

1. Simon uses the cate-araceae.org interface to create a new species page for the taxon Albertarum.
1. He uses a web form to fill in data about the protologue (Bogner, J., Hoffman, G.L., Aulenback, K.R. 2005. A fossilized aroid infructescence, Albertarum pueri gen.nov. et sp.nov., of Late Cretaceous (Late Campanian) age from the Horseshoe Canyon Formation of southern Alberta, Canada. Canadian Journal of Botany), and the authorship.
1. He submits this data to CATE Araceae, and it is published on the web.
1. CATE Araceae would like to apply IPNI ids to all of its names. The software checks IPNI to discover whether the name Albertarum already exists.
1. In this case, IPNI does not (currently) hold this information.
1. IPNI processes the metadata passed to it by cate-araceae.org and (after intervention by the IPNI Editors, plus some offline checking of the Can. J. Bot. article) the name Albertarum is added to IPNI.
1. IPNI has recorded that the request for a GUID for the name Albertarum originated from cate-araceae.org, so the new GUID is sent to cate-araceae.org, which associates it with the name object in its database.

From cate-araceae.org's point of view, it is behaving as a "good citizen" by notifying IPNI about an event (a new genus) that IPNI might be interested in. Overall, it is in cate-araceae.org's best interests that IPNI is comprehensive, especially if IPNI offers services used by cate-araceae.org that improve with the completeness and accuracy of the nomenclator, e.g. validation of name strings.

Alternative end points might be:

1. IPNI rejects the request for an id for Albertarum because the Editors decide that the name is not validly published, or that, being a fossil genus, it is out of scope for IPNI.
1. IPNI assigns an identifier, but corrects some of the metadata supplied by cate-araceae.org (perhaps the authority was misspelled).
1. IPNI returns an identifier that has already been created, confirming that the metadata supplied in the request made by cate-araceae.org is correct.

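The main flow and the alternative end points above suggest that a request for an identifier has a small set of possible outcomes. The following sketch models them with a toy authority. Everything in it is an assumption made for illustration: the class and enum names, the `example.org` identifiers, and in particular the `editorsVerdict` parameter, which stands in for the human decision of the Editors (the name as they accept it, or `null` if they reject it).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an identifier request; not an IPNI or CDM API.
class NameRegistrationSketch {
    enum Outcome { ASSIGNED, ASSIGNED_WITH_CORRECTIONS, EXISTING, REJECTED }

    static class Response {
        final Outcome outcome;
        final String guid;         // null when the request is rejected
        final String acceptedName; // the name as held by the authority, or null

        Response(Outcome outcome, String guid, String acceptedName) {
            this.outcome = outcome;
            this.guid = guid;
            this.acceptedName = acceptedName;
        }
    }

    static class Authority {
        private final Map<String, String> guidByName = new HashMap<>();
        private int nextId = 1;

        /**
         * @param submitted      the name string sent by the client
         * @param editorsVerdict the name as accepted by the editors, or null to reject
         */
        Response request(String submitted, String editorsVerdict) {
            if (editorsVerdict == null) {
                return new Response(Outcome.REJECTED, null, null);
            }
            String existing = guidByName.get(editorsVerdict);
            if (existing != null) {
                return new Response(Outcome.EXISTING, existing, editorsVerdict);
            }
            String guid = "urn:lsid:example.org:names:" + (nextId++); // made-up authority
            guidByName.put(editorsVerdict, guid);
            Outcome outcome = editorsVerdict.equals(submitted)
                    ? Outcome.ASSIGNED
                    : Outcome.ASSIGNED_WITH_CORRECTIONS;
            return new Response(outcome, guid, editorsVerdict);
        }
    }
}
```

A client such as cate-araceae.org would attach the returned GUID to its name object in every case except `REJECTED`, and would also overwrite its own metadata with `acceptedName` when the outcome is `ASSIGNED_WITH_CORRECTIONS`.
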
In these cases, having IPNI check the metadata associated with Albertarum before issuing a GUID is useful for cate-araceae.org, because the users of CATE might not necessarily be experts in nomenclature, or might have made an error in entering the data.

From IPNI's point of view, it is providing a route for feedback and new information in a standard way. It is also potentially reducing the workload of its Editors by accepting data that is already parsed and does not need to be entered by hand a second time: for trustworthy clients it may be possible to verify such information quickly and incorporate it into IPNI. In addition, it increases the rate of discovery of new names by allowing other users to submit new names as part of their work. There is no need for a user of cate-araceae.org to log in to a specific IPNI client in order to contribute names to the nomenclator. This follows the principle that "given many eyes, all bugs are shallow".

On the other hand, a service like this may increase the workload of the IPNI Editors if it results in a large number of requests for GUIDs, especially if each request requires significant checking, or some sort of interaction with the user that submitted the request. It will also make the IPNI application more complex to develop and maintain. For CATE Araceae, relying on IPNI for identifiers means that the IPNI Editors have the final say in issuing an identifier and associating metadata with a name. If it were important that CATE Araceae has total control over the names it publishes, then it should be its own authority for names.

### Existing taxon is split

*Use Case*: A user has downloaded data from cate-araceae.org and used those taxon concepts as the basis of a dataset of measurements taken from specimens. One of the editors of cate-araceae.org decides to include in the CATE Araceae web revision the split of an existing genus, Philodendron Schott, and the change in status of subgen. Pteromischum (Schott) Mayo to the rank of genus. The user's dataset is now out of date. The user is able to discover that the classification of Philodendron has changed, and is able to update their dataset automatically to use the correct accepted names according to CATE Araceae.

1. A user has downloaded a checklist of Philodendron species, including the GUIDs of those species.
1. They use the checklist to create a dataset of leaf morphometric measurements for many species in the genus, including Philodendron acreanum K.Krause.
1. An editor of cate-araceae.org increases the rank of Pteromischum and it appears in cate-araceae.org as a new genus. 66 of the species of Philodendron are recombined in this new genus.
1. The software recognises that the name Pteromischum originated from IPNI (urn:lsid:ipni.org:names:927070-1:1.1.2.1.1.2), and that this name has been changed by a user of cate-araceae.org.
1. The software transmits this information to IPNI.
1. Following the same kind of process as the one outlined in the previous use case, the IPNI Editors decide that Pteromischum (Schott) Mayo is a new name, and return a new identifier for the name, which cate-araceae.org applies.
1. cate-araceae.org increments the version of the taxon concept Philodendron Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:151375) and saves the new version of Philodendron under this new version number.
1. cate-araceae.org creates 66 new names in the genus Pteromischum (Schott) Mayo (and attempts to acquire 66 IPNI identifiers for them), and 66 new taxon concepts that are related to the original taxon concepts as they were before being recombined (e.g. through synonym relationships).
1. At a later date, the original user uses a GUID client to discover that, for example, Philodendron acreanum has been changed (because the metadata returned indicates that the version identified isReplacedBy another, later version). The semantics of this relationship are defined by the TDWG ontology.
1. The original user decides that they trust cate-araceae.org to be correct and would like to update their data so that it is labelled according to the currently accepted names.
1. The original user uses a GUID client to resolve the most up-to-date versions of the taxon concepts in their checklist and discovers that their metadata now indicate that they are synonyms of other taxon concepts.
1. By retrieving metadata about those new taxon concepts, the user is able to update their checklist automatically to contain the currently accepted names, e.g. Philodendron acreanum K.Krause is a synonym of Pteromischum acreanum (K.Krause) Mayo, and so is replaced automatically. The data that the user created remains associated with the correct taxon concept.
1. In some cases the mapping between older objects and new objects might be ambiguous (e.g. when a taxon is thought to be a pro parte synonym of another taxon). In such cases, data cannot be transformed without human intervention.

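Finding the current version of an object, as the original user does in the later steps above, means following isReplacedBy links until an object has none. A minimal sketch, assuming the links have already been retrieved into a map from each versioned GUID to the GUID that replaces it (the class name and the `example.org` identifiers in the usage note are hypothetical):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the CDM Java Library.
class VersionChainSketch {
    /**
     * Follows isReplacedBy links from the given GUID until reaching a version
     * that has not been replaced, and returns that version's GUID.
     */
    static String latestVersion(Map<String, String> isReplacedBy, String guid) {
        Set<String> seen = new HashSet<>();
        String current = guid;
        while (isReplacedBy.containsKey(current)) {
            if (!seen.add(current)) {
                // Defensive check: badly formed metadata could contain a loop.
                throw new IllegalStateException("Cycle in isReplacedBy links at " + current);
            }
            current = isReplacedBy.get(current);
        }
        return current;
    }
}
```

With links `...:1:1` to `...:1:2` and `...:1:2` to `...:1:3`, a client holding revision 1 is led to revision 3, while a client already holding revision 3 gets the same GUID back and so knows its copy is current. A real client would fetch each link by resolving the GUID rather than reading from a prepared map.
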
From the point of view of a user of cate-araceae.org, GUIDs represent a standard way to identify precisely the provenance of data obtained from cate-araceae.org. Because cate-araceae.org saves a different version each time the data is changed in the database, and links these versions via replaces and isReplacedBy links in the metadata that it returns, users of that data can find out whether they are using the most up-to-date version (the most recent version has no isReplacedBy link, because it has not been replaced).

If users want to use the most up-to-date data possible and trust the authority, GUIDs provide a method for obtaining the most current version (by following links between versions). Although not covered by current GUID protocols, it would also be possible for a client to pull or harvest changes in bulk from an authority, rather than querying the authority on an object-by-object basis.

If users do not want to update their data, or do not trust that the authority's changes are correct, then they can continue to use the old GUIDs, provided the authority versions its objects. If the authority does not version its objects, then the metadata about an object provided by the authority and the metadata held by the client could be different or even contradictory. As an example, suppose that in step 7 of the use case described above:

7. cate-araceae.org does not change the version number for Philodendron Schott. Philodendron Schott as served by cate-araceae.org [urn:lsid:cate-araceae.org:taxonconcepts:151375] has 371 species, and Philodendron Schott [urn:lsid:cate-araceae.org:taxonconcepts:151375] in the dataset of the client has 437 species.

In this case, the user's GUID client can only detect changes by comparing properties (e.g. dc:modified) of the object that it has cached with those of the object currently being resolved by the authority.

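When the authority does not version its objects, the comparison described above reduces to checking a modification timestamp. The sketch below assumes that dc:modified is available as an ISO-8601 instant on both the cached and the freshly resolved copy; the class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical helper for illustration; not part of the CDM Java Library.
class ChangeDetectionSketch {
    /**
     * Returns true when the authority's copy was modified after the cached copy.
     * Both arguments are dc:modified values in ISO-8601 form, e.g. 2009-08-12T16:35:00Z.
     */
    static boolean hasChanged(String cachedModified, String currentModified) {
        Instant cached = Instant.parse(cachedModified);
        Instant current = Instant.parse(currentModified);
        return current.isAfter(cached);
    }
}
```

This only tells the client that something changed, not what changed or whether the old state can still be retrieved, which is why versioned GUIDs are preferable where the authority supports them.
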
### New data added to existing taxon is repurposed by aggregator

*Use Case*: One of the editors of cate-araceae.org adds some extra data to an existing taxon concept: a new distribution record for Homalomena Schott (urn:lsid:cate-araceae.org:taxonconcepts:99893), stating that it occurs in Brazil. This data is harvested by GBIF, and the taxon Homalomena Schott appears in search results for taxa found in Brazil in the GBIF portal, linking users through to other data about this taxon (for example, textual diagnoses or images).

1. One of the users of CATE Araceae enters data into cate-araceae.org using a web form. This data is entered into the CDM database used by CATE Araceae.
1. CATE Araceae exposes this data in a web page (e.g. as a map showing the regions colour-coded by presence / absence).
1. The distribution record can be expressed as a Species Profile Model InfoItem (http://rs.tdwg.org/ontology/voc/SpeciesProfileModel#InfoItem) of class Distribution (http://rs.tdwg.org/ontology/voc/SPMInfoItems#Distribution), which has the value http://rs.tdwg.org/ontology/voc/GeographicRegion.rdf#84.
1. The CDM Server underlying CATE Araceae exposes its data for harvesting by aggregators. An aggregator from GBIF uses a standard protocol to discover any objects that are new or changed since it last harvested data from CATE Araceae.
1. CATE Araceae responds with a list of GUIDs for objects that are new or have changed since GBIF last harvested it. This includes the GUID for the InfoItem that states that Homalomena Schott is found in the TDWG Region of Brazil.
1. GBIF harvests the new and updated objects by resolving these identifiers and requesting the metadata. CATE Araceae responds with the metadata in the standard format (i.e. RDF).
1. GBIF adds this data to its own aggregated database. It can use the fact that the TDWG Vocabulary specifies the semantics of the returned data to build queries across data harvested from a variety of sources, e.g. "images of taxa that occur in Brazil".
1. Because GBIF has indexed data from other sources that associate images with the taxon concept Homalomena Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:99893, via the hasDigitalImage property), these images are now returned in searches for images of Brazilian taxa.

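The harvesting exchange in the steps above, where the provider answers "what has changed since my last harvest?" with a list of GUIDs, can be sketched as a filter over modification times. The names below are hypothetical and the records are held in memory; the request is loosely analogous to an OAI-PMH `ListIdentifiers` request with a `from` date, but this is not an implementation of that protocol.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical provider-side sketch; not part of the CDM Server.
class HarvestSketch {
    /** One published object: its GUID and when it was last modified. */
    static class Record {
        final String guid;
        final Instant modified;

        Record(String guid, Instant modified) {
            this.guid = guid;
            this.modified = modified;
        }
    }

    /** Returns the GUIDs of all records that are new or changed since the last harvest. */
    static List<String> changedSince(List<Record> records, Instant lastHarvest) {
        List<String> guids = new ArrayList<>();
        for (Record record : records) {
            if (record.modified.isAfter(lastHarvest)) {
                guids.add(record.guid);
            }
        }
        return guids;
    }
}
```

The aggregator would then resolve each returned GUID to obtain the RDF metadata, and record the time of this harvest for use as `lastHarvest` on the next run.
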
Here GUIDs and the associated technology are advantageous for the data provider because they can be used to build additional services that re-use data in ways not originally envisaged by the data provider. Because GUIDs, together with the ontology provided by TDWG, unambiguously associate objects in particular ways (e.g. an image of a taxon, or an image of a publication about a taxon), the data provider's data can be connected with other data in a way that is easier for a computer to understand. This results in higher quality data returned by queries across aggregated datasets, and more correct hits on the data provided by the data provider.

In addition, the use of the data provider's GUIDs in external objects (provided such objects can be discovered, for example through some sort of harvesting by an aggregator) is a useful metric of the use and usefulness of data belonging to the data provider. The number of web pages containing the words Homalomena Schott cannot convincingly be used as a metric of the usefulness of CATE Araceae, whereas the number of data objects (not published by CATE Araceae) that use urn:lsid:cate-araceae.org:taxonconcepts:99893 is a useful metric not only of all data from CATE Araceae, but also of that particular item of data (i.e. maybe CATE Araceae is preferred for some taxon concepts, whereas uBio is preferred for others).

Aggregation could be less effective if many other data providers link contradictory information to objects published by the data provider (e.g. if many people publish images of non-Homalomena species that are "tagged" with the CATE Araceae concept of Homalomena Schott). This problem is not unique to GUIDs, and might be expected to be less severe provided objects with GUIDs are created with care.

For aggregators, GUIDs are useful because they can be harvested, and the data associated with these objects is returned in a well-defined format common to all data providers that use GUIDs. Because GUIDs should be present in an object or an associated object, they can be used to detect whether data about the same thing have been harvested from different sources, or whether the same metadata has arrived via two or more routes (e.g. directly from IPNI and also from the Catalogue of Life). The usefulness of GUIDs for a data provider is proportional to the number of objects with GUIDs and (if the aggregator is able to understand the meaning of the TDWG Vocabulary to make connections between objects it has harvested) the number of links between objects.

For the end user, the advantage of GUIDs in this scenario is that they are able to query across a large number of data providers in a single query, and that their queries are more powerful because the semantics of the associations between objects are well specified in the TDWG Ontology used to describe the objects.

### Data exported to a flat file for use in external tool

*Use Case*: A user of CATE Araceae exports data as SDD (Structured Descriptive Data) and this dataset is imported into the Lucid Builder. The user adds data (e.g. character state data) to the dataset and generates a new SDD document that they import back into the same database. The new measurements are added to the database (e.g. as new description elements in existing descriptions), and elements that existed in the original dataset are updated if they have been edited in Lucid. Likewise, new characters are imported into the CDM Database, but characters that existed at the time the data was originally exported are not duplicated (although they may be updated if the user has, for example, associated new images with them).

1. A user uses an export tool provided by the CDM to export data as Structured Descriptive Data (or they could download data from the cate-araceae.org website). The GUID of an object (where it exists) is included in the SDD element that represents that object.
1. The user imports the data into Lucid Builder, preserving the GUIDs.
1. The user edits the dataset in Lucid, updating some objects and adding new ones (these new objects do not have GUIDs).
1. They export the dataset as a new SDD document and import this document into the CATE Araceae database.
1. The CDM Java Library unmarshals the document and recognises some objects as having GUIDs.
1. The software checks the CDM Database to discover whether these objects already exist. Objects without GUIDs are assumed to be unknown and therefore new.
1. If an object does exist, the existing (persisted) object is updated, unless it is newer than the object being imported.
1. If both the persisted and the imported object have changes, the software alerts the user and requires them to resolve the issue manually.

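The import rules in the last three steps can be summarised as a decision on each incoming object. The sketch below is one possible reading of those rules, not the CDM Java Library's import logic: it assumes that every object carries an integer version, that the exported document records the version it was exported at, and that the importer can tell whether the object was edited in the external tool. All names are hypothetical.

```java
// Hypothetical decision table for re-import; not the CDM Java Library's implementation.
class ImportMergeSketch {
    enum Action { CREATE, UPDATE, KEEP_PERSISTED, CONFLICT }

    /**
     * @param guid             GUID carried by the imported object, or null if it has none
     * @param persistedVersion current version in the database, or null if no object with this GUID exists
     * @param exportedVersion  version of the object at the time it was exported
     * @param editedExternally whether the object was changed in the external tool
     */
    static Action decide(String guid, Integer persistedVersion,
                         int exportedVersion, boolean editedExternally) {
        if (guid == null || persistedVersion == null) {
            return Action.CREATE; // unknown objects are assumed to be new
        }
        boolean persistedChanged = persistedVersion > exportedVersion;
        if (persistedChanged && editedExternally) {
            return Action.CONFLICT; // both sides changed: the user must resolve it
        }
        if (editedExternally) {
            return Action.UPDATE; // only the imported copy changed
        }
        return Action.KEEP_PERSISTED; // nothing to import, or the database copy is newer
    }
}
```

Recording the exported version is what makes the `CONFLICT` case detectable; without it the importer could only compare the two copies and could not tell which side had changed.
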
For software developers, GUIDs provide a standard way to assert the identity of an object. Because the protocol for handling identifiable objects is defined by the GUID system, different software tools can import and export data safely provided they behave in the correct way (e.g. by preserving GUIDs). For users, GUIDs provide a way to exchange partial datasets between applications (e.g. exporting part of a dataset from the CDM and using it in Lucid to make a Multi-Access Key, or exporting data in NEXUS format and using it in R). By including the version part of a GUID in the exported data, and by versioning objects every time they are changed, it is possible to say exactly which object was exported.

GUIDs make importing data back into existing datasets easier, but it is unlikely that they could remove the need for manual intervention if both persisted and imported objects have changes.

### New information added to plant name

*Use Case*: One of the IPNI Editors makes a correction to the authority of Bognera recondita (Madison) Mayo & Nicolsen, correcting Nicolsen to Nicolson. This information is discovered by CATE Araceae, which has already associated the taxon concept Bognera recondita (Madison) Mayo & Nicolsen sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:22805) with the name published by IPNI (urn:lsid:ipni.org:names:942108-1).

1. One of the IPNI Editors corrects the authority of the taxonomic name Bognera recondita (Madison) Mayo & Nicolsen.
1. Either IPNI has stored information associating the taxon concept urn:lsid:cate-araceae.org:taxonconcepts:22805 with urn:lsid:ipni.org:names:942108-1 and notifies CATE Araceae that this object has been changed (push), or:
1. CATE Araceae has stored that the name urn:lsid:ipni.org:names:942108-1 is the name of urn:lsid:cate-araceae.org:taxonconcepts:22805 and periodically polls IPNI to discover any relevant names that have changed (pull).
1. Either way, CATE Araceae discovers that the authority of Bognera recondita should be (Madison) Mayo & Nicolson and updates its cached data accordingly.
1. This update has knock-on effects within CATE Araceae, changing the title of the Taxon and TaxonDescription pages, altering search results, the taxon tree etc.

For IPNI notifying CATE Araceae of changes to names is a burden, either through having to push changes to interested clients or suffering the additional burden of clients pulling data from it at regular intervals. This may be seen as the converse to "User contributes new taxon", i.e. if data providers want feedback about their data objects, they should reciprocate and inform clients (or a least allow clients to discover) changes in data that they are using.
316
317
318
Being notified or discovering changes in external data is a real advantage to clients as is allows the quality of secondary (to the client) data to be maintained without the need for manual checking. Automatic updating of names does have implications for clients if the authority makes changes that a client disagrees with. Because changes to foreign objects could happen automatically, it is expected that authorities should be explict about the kind of data that might change in their objects.
319
320
321
322
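The pull variant of this use case can be sketched as a small client-side cache that re-resolves the GUIDs it tracks and refreshes any copy that is out of date. This is only an illustration under stated assumptions: `NameRecord`, `NameCache` and the integer version counter are hypothetical, and the `Function` passed in stands in for whatever GUID client actually contacts the authority.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical snapshot of a foreign name; the field names are illustrative only. */
final class NameRecord {
    final String guid;
    final int version;
    final String authorship;

    NameRecord(String guid, int version, String authorship) {
        this.guid = guid;
        this.version = version;
        this.authorship = authorship;
    }
}

/** Client-side cache that pulls changes from the owning authority. */
final class NameCache {
    private final Map<String, NameRecord> cached = new HashMap<>();
    private final Function<String, NameRecord> resolver; // stand-in for a GUID client

    NameCache(Function<String, NameRecord> resolver) {
        this.resolver = resolver;
    }

    void track(NameRecord record) {
        cached.put(record.guid, record);
    }

    NameRecord get(String guid) {
        return cached.get(guid);
    }

    /** Re-resolves every tracked GUID; returns the GUIDs whose cached copy was refreshed. */
    List<String> poll() {
        List<String> changed = new ArrayList<>();
        for (String guid : new ArrayList<>(cached.keySet())) {
            NameRecord latest = resolver.apply(guid);
            if (latest != null && latest.version > cached.get(guid).version) {
                cached.put(guid, latest);
                changed.add(guid);
            }
        }
        return changed;
    }
}
```

The list returned by `poll()` is where knock-on effects (page titles, search indexes, the taxon tree) would be triggered. How often to poll is left open, since it trades freshness against the load placed on the authority.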
### Problems & Gaps:

* GUID clients are not widely available (e.g. LSID clients).
* Few tools import data formatted according to the TDWG standards (i.e. the TDWG RDF vocabularies). The CDM can import a subset of this data (TCS-RDF), but this is a manual step: the CDM Java Library does not provide an LSID client.
* Conversion between RDF and the CDM is expected to be lossy. This is not necessarily a problem for publishing data that is read-only, but is problematic if data is imported or merged from multiple sources into the CDM.
* The LSID specification does not provide a service that allows:
    * Objects to be harvested regardless of their properties (for use by aggregators). There are existing open standards for metadata harvesting, notably OAI-PMH.
    * Objects to be discovered based upon their properties (to allow the discovery of an object that already exists, provided by the authority). This service might be conceptually similar to OpenURL.
    * The metadata associated with objects to be updated (for use by clients). This could be conceptually similar to the LSID Assigning Service, except that the metadata provided is associated with an existing object, not a new object.
* The CDM Server does not support LSID Assignment.
* Currently the LSID Assigning Service does not support long-running (asynchronous) processes. This means that LSID assignment requires an immediate response, with no option for an authority (i.e. a human being) to take time to decide whether they want to assign an identifier or not. This makes LSID Assignment of identifiers for abstract objects provided by non-trusted clients difficult or impossible in practice.
* The CDM Server does not support Foreign Authority Notification, which is required if multiple data providers are to be able to resolve data about the same object. I don't know of any LSID authorities that do support this part of the protocol, and indeed this part of the protocol has not been developed to the point where I believe it could be implemented without making assumptions about the way it should work.
* GUIDs are not preserved by magic. Users might not understand the importance of preserving GUIDs, or they might understand what a GUID is and still not wish to preserve them. If users obtain data but throw away the GUIDs, then the benefit of using GUIDs is naturally lost.
* It is not always possible to recognise that a string of characters is a GUID, so even though a GUID might be preserved, subsequent users might not understand its significance. In this case, the benefit of using a GUID is lost (for that user). Some technologies have been created for the sole purpose of providing GUIDs, so a user who recognises the technology should understand that the string is a GUID. In the case of technologies like HTTP URIs, these strings can be used as GUIDs, but a user may not be able to tell whether a URI is a GUID without attempting to resolve it.
* Users may recognise a string as a link to further information or an identifier, but unless they are familiar with the technology or have a client that enables them to resolve the data associated with the object, they may not know what to do with it. If the identifier is of a form familiar to most users (e.g. HTTP URIs), users will naturally attempt to use a web client to obtain more information (but they may think that the GUID is just a URL). In the case of more esoteric GUIDs like DOIs and LSIDs, users may not know how to use them. To cope with this problem, GUIDs in documents intended for human consumption tend to be "clickable", i.e. they use an HTTP proxy or other mechanism to allow web browsers to resolve them.
## Functional requirements

This section attempts to break down the use cases described above into much smaller operations that could be implemented by components of the CDM Java Library.

~~~
<code class="rst">
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Requirement                   | Native Version                      | Foreign Version                                              |
+===============================+=====================================+==============================================================+
| Creation of an object or data | 1.1 Create a new object with a GUID | 1.3 Ask a foreign authority to assign a GUID to a new object |
|                               | 1.2 Assign a GUID to an object that |                                                              |
|                               | didn’t previously have a GUID       |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Updating metadata about an    | 2.1 Update an existing object       | 2.5 Ask a foreign authority to update an existing object     |
| object                        | 2.2 Update an existing object (and  | 2.6 Find out if a foreign object has changed                 |
|                               | assign it a new GUID)               | 2.7 Notify a foreign authority that another authority holds  |
|                               | 2.3 Update an existing object (and  | metadata about an object (FAN)                               |
|                               | state that it replaces an existing  | 2.8 Notify a foreign authority that another authority no     |
|                               | object)                             | longer holds information about an object                     |
|                               | 2.4 Notify foreign authorities that |                                                              |
|                               | an object has changed               |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Deleting an object            | 3.1 Delete an existing object (that | 3.3 Ask a foreign authority to delete an existing object     |
|                               | had a GUID)                         |                                                              |
|                               | 3.2 Delete an existing object (that |                                                              |
|                               | had a GUID, and state that another  |                                                              |
|                               | object replaces it)                 |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Resolving metadata about an   | 4.1 Find an object based upon its   | 4.2 Resolve a foreign object based upon its GUID             |
| object                        | GUID                                |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Discovery of objects based on | 5.1 Find objects with GUIDs based   | 5.2 Find foreign objects based upon their metadata (from a   |
| their metadata                | upon their metadata                 | specific foreign authority)                                  |
|                               |                                     | 5.3 Find foreign objects based upon their metadata           |
|                               |                                     | (globally, from an aggregator)                               |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Object navigation             | 6.1 Find different versions of an   | 6.5 Resolve different versions of an object (given an object |
|                               | object                              | that has versions)                                           |
|                               | 6.2 Find the current version of an  | 6.6 Resolve the canonical version of an object (given a      |
|                               | object                              | version of the object)                                       |
|                               | 6.3 Find an object that is replaced | 6.7 Resolve the object that a given object is replaced by    |
|                               | by a given object                   | 6.8 Resolve the original object(s) replaced by an object     |
|                               | 6.4 Find an object that replaces a  |                                                              |
|                               | given object                        |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
</code>
~~~

In addition, there are a couple of special stories related to importing / exporting data into static files and integrating existing data from foreign authorities into CDM applications:

* 7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)
* 7.2 Integrate external sources of data (from a foreign authority)
* 8.1 Export Data (into a static file) that contains GUIDs
* 8.2 Serve Metadata (to a foreign authority or client application)
### 1.1 Create a new object with a GUID

A user creates a new object in the database and the application assigns a new GUID to it.

### 1.2 Assign a GUID to an object that didn’t previously have a GUID

A user decides that an object in the database should be globally resolvable and identifiable, and the application assigns a new GUID to it.

### 1.3 Ask a foreign authority to assign a GUID to a new object

A user creates a new object in the application and persists it locally, but would like a foreign authority to assign an identifier to it (e.g. FOA creates a taxon concept object that represents a taxon concept that is not found in taxonConcepts.org). The application makes a request on the user's behalf that taxonConcepts.org assign a new GUID to the object.
### 2.1 Update an existing object

A user updates an existing object. The application saves the updated object as the current version of the object.

### 2.2 Update an existing object (and assign it a new GUID)

A user updates an existing object, and decides that the change is so significant that the object should be published under a new GUID. The application assigns a new GUID to the object, and updates the previous version of the object to record the fact that the old object is “replaced by” the new object.

### 2.3 Update an existing object (and state that it replaces an existing object)

A user updates an existing object, and wishes to “merge” another object into it. The application updates all of the references within the database so that they refer to the remaining object, and updates the merged object to state that it has been replaced by the remaining object.

### 2.4 Notify foreign authorities that an object has changed

A user updates an existing object. Zero or more foreign authorities hold copies of this object and have explicitly registered that they would like to be notified when the object changes. The application notifies the foreign authorities on behalf of the user.
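The registration and notification described in 2.4 can be sketched as a simple in-memory registry. This is an assumption-laden illustration: `ChangeNotifier` is a hypothetical name, and each `Consumer` stands in for whatever transport would actually reach a foreign authority, which the GUID specifications discussed here do not define.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical registry of authorities that asked to hear about changes to an object. */
final class ChangeNotifier {
    // guid -> callbacks, one per registered foreign authority
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    /** Records that an authority wants to be told when the object with this GUID changes. */
    void register(String guid, Consumer<String> authorityCallback) {
        subscribers.computeIfAbsent(guid, g -> new ArrayList<>()).add(authorityCallback);
    }

    /** Notifies every registered authority and returns how many were notified (possibly zero). */
    int objectChanged(String guid) {
        List<Consumer<String>> interested =
                subscribers.getOrDefault(guid, Collections.emptyList());
        for (Consumer<String> callback : interested) {
            callback.accept(guid);
        }
        return interested.size();
    }
}
```

Only the GUID is pushed in this sketch; a recipient would then resolve the object itself (as in 2.6) to obtain the new state.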
### 2.5 Ask a foreign authority to update an existing object

A user updates an existing object that belongs to a foreign authority. The application asks the foreign authority, on behalf of the user, to update the object. The foreign authority may decline to update the object. At this point, the application should offer the user the choice of (a) reverting the object to its original state prior to the update, (b) updating the object to the new state of the object in the foreign authority, or (c) replacing the foreign object with a native object that does not share the same GUID. The application should prevent users from changing foreign objects if the authority does not permit it.

### 2.6 Find out if a foreign object has changed (and accept those changes)

A foreign object has changed in the foreign authority. The application should be capable of being notified (i.e. acting as a client in 2.4) or of actively trying to pull updates from foreign authorities. In the first instance, we will assume that the application will accept changes from the owning authority without giving users the option to decide whether they accept those changes or not.

### 2.7 Notify a foreign authority that another authority holds metadata about an object (FAN)

A foreign authority holds metadata about an object and wishes the authority of that object to include a reference to the foreign authority in any metadata response for that object, so that GUID clients can discover the metadata that the foreign authority has. The foreign authority uses the Foreign Authority Notification part of the GUID protocol to inform the authority of its existence. If the authority implements this part of the specification, then it will return a reference to the foreign authority in any metadata response for that object, as per the specification.

### 2.8 Notify a foreign authority that another authority no longer holds information about an object

A foreign authority no longer wishes to resolve information about a foreign object. It uses Foreign Authority Notification to inform the authority that it no longer has any metadata about the object that it wishes to resolve. If the authority supports this part of the specification, it no longer returns a reference to the foreign authority in any metadata response for that object, as per the specification.
### 3.1 Delete an existing object (that had a GUID)

A user deletes an object. Because the object has had a GUID, the application must still resolve the object (although it may only return some metadata stating that the object has been deleted).

### 3.2 Delete an existing object (that had a GUID, and state that another object replaces it).

The reverse of 2.3: the user wishes to replace an object with another, better / more correct object. The application stores a reference to the object that replaces the deleted object.

### 3.3 Ask a foreign authority to delete an existing object

The user deletes a foreign object from the CDM store. Provided that the user thinks this should be a global delete / replace (as an example, a FOA user finds two duplicate synonyms in taxonConcepts.org with identical name and sec fields), the object is removed from the local CDM store and the request is propagated by the application to the foreign authority. In this case, it is possible for the CDM store to remove the object even if the foreign authority doesn’t delete it (if the deleted object is not referenced by any other object in the CDM store).
### 4.1 Find an object based upon its GUID

A user has a (possibly made-up) GUID that should belong to this CDM store. The application either (a) returns an object, (b) throws an exception stating that the object did exist but was deleted (and maybe provides a way of retrieving that deleted object), or (c) throws an exception stating that no object with that GUID has ever existed.
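The three outcomes listed in 4.1 can be sketched as a lookup that distinguishes "deleted" from "never existed". All names here (`GuidStore`, `ObjectDeletedException`, `UnknownGuidException`) are hypothetical and chosen for illustration; the in-memory maps stand in for the real persistence layer.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical: the GUID once identified an object, but that object was deleted. */
class ObjectDeletedException extends RuntimeException {
    ObjectDeletedException(String guid) {
        super("Object was deleted: " + guid);
    }
}

/** Hypothetical: no object with this GUID has ever existed in this store. */
class UnknownGuidException extends RuntimeException {
    UnknownGuidException(String guid) {
        super("No such object: " + guid);
    }
}

/** Hypothetical in-memory store; the maps stand in for the database. */
final class GuidStore<T> {
    private final Map<String, T> live = new HashMap<>();
    private final Set<String> deleted = new HashSet<>(); // tombstones for deleted GUIDs

    void save(String guid, T object) {
        if (deleted.contains(guid)) {
            // A GUID that has been used must not be assigned again.
            throw new IllegalStateException("GUID already used: " + guid);
        }
        live.put(guid, object);
    }

    void delete(String guid) {
        if (live.remove(guid) != null) {
            deleted.add(guid);
        }
    }

    /** Outcome (a) returns the object; (b) and (c) are reported as distinct exceptions. */
    T find(String guid) {
        T object = live.get(guid);
        if (object != null) {
            return object;
        }
        if (deleted.contains(guid)) {
            throw new ObjectDeletedException(guid);
        }
        throw new UnknownGuidException(guid);
    }
}
```

Keeping a tombstone for each deleted GUID is what lets a resolver answer "deleted" rather than "unknown", in line with 3.1.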
### 4.2 Resolve a foreign object based upon its GUID

In several other use-cases, the application will encounter foreign GUIDs. The application should transparently retrieve those objects. If those objects are used (e.g. as part of a checklist) within the CDM Store, then those foreign objects should be cached (persisted). This may introduce the requirement for polling / notification of the local CDM store if the foreign object changes.
### 5.1 Find objects with GUIDs based upon their metadata

A user wishes to discover existing objects based on their metadata (properties or relations). The application should return a list of 0 or more objects that match the query.

### 5.2 Find foreign objects based upon their metadata (from a specific foreign authority)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM Store understands that a particular authority should be used for this type of data. The application should query the foreign authority and return a list of 0 or more objects that match the query. This is similar to 1.3, but no objects are created if the foreign authority does not have any matches.

### 5.3 Find foreign objects based upon their metadata (globally, from an aggregator)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM Store uses an aggregator to discover matching objects in a collection of objects belonging to a number of authorities (including perhaps this authority). The application should query the aggregator and return a list of 0 or more objects that match the query.
### 6.1 Find different versions of an object

A user wishes to find out how many times an object has changed and what changes were made to it. The application returns a list of the different versions of the object.

### 6.2 Find the current version of an object

A user wishes to use the most up-to-date version of an object. The application returns the most up-to-date version of the object.

### 6.3 Find objects that are replaced by a given object

A user wants to find the object(s) that were replaced by a given object. The application returns these objects.

### 6.4 Find objects that replace a given object

A user wants to find the object(s) that replace a given object. The application returns these objects.
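Navigation along "replaced by" links (6.3 and 6.4) can be sketched as an index from old GUID to new GUID. For brevity this sketch assumes each object has at most one direct replacement, which is narrower than the "object(s)" wording above. `ReplacementIndex` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical index of "replaced by" links, keyed by the GUID of the replaced object. */
final class ReplacementIndex {
    private final Map<String, String> replacedBy = new HashMap<>();

    void recordReplacement(String oldGuid, String newGuid) {
        replacedBy.put(oldGuid, newGuid);
    }

    /** 6.4 applied repeatedly: follow the links until an object that has not been replaced. */
    String current(String guid) {
        Set<String> seen = new HashSet<>();
        String cursor = guid;
        while (replacedBy.containsKey(cursor)) {
            if (!seen.add(cursor)) {
                throw new IllegalStateException("Replacement cycle at " + cursor);
            }
            cursor = replacedBy.get(cursor);
        }
        return cursor;
    }

    /** 6.3: the objects directly replaced by the given object, in sorted order. */
    List<String> replacedObjects(String guid) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> link : replacedBy.entrySet()) {
            if (link.getValue().equals(guid)) {
                result.add(link.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

The cycle check guards against inconsistent data, for example two merges that were recorded in opposite directions.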
### 6.5 Resolve different versions of an object (given an object that has versions)

A user has a foreign object that is versioned. The application seamlessly retrieves and presents the different versions of the object (so that users can check to see how the object has changed over time). The application should prevent users from persisting old versions of the object, although retrieving the most recent version might be useful if an export is intended to be static (so that it is possible to determine the exact state of the document at the time of creation).

### 6.6 Resolve the canonical version of an object (given a version of the object)

A user has a foreign object that is a specific version of that object. The application seamlessly retrieves and presents the current version of the object (this would be required upon import of some data with versions). The application should check to see if an object has been replaced by other objects and, if so, should warn the user.

### 6.7 Resolve a replacement object given the object it replaces

A user has a foreign object that is replaced by another object. The application seamlessly retrieves the replacement and presents it to the user.

### 6.8 Resolve the original object(s) replaced by an object

A user has a foreign object that replaces one or more other objects. The application seamlessly retrieves these objects (to allow the user to check the original objects). The application should prevent users from importing objects that have been replaced into the CDM store.
### 7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)

A user has a static resource that they wish to import into the CDM store. The application identifies that some objects within the resource have GUIDs. It handles issues such as objects that already exist within the CDM store, and objects that are earlier versions of an object that has since been updated. There is business logic (and perhaps a workflow) for checking objects to find out if they have been updated. It may be a value judgement on the part of the user whether an object has changed significantly or not.

### 7.2 Integrate external sources of data (from a foreign authority)

A user wishes to search an external authority for data. The application acts as a proxy and searches the authority on behalf of the user. The external authority may return objects that the user can inspect (some may already exist in the CDM store). The user may then wish to use these foreign objects (i.e. attach them to new or existing native objects).

### 8.1 Export Data (into a static file) that contains GUIDs

A user wishes to export a subset of the data into a static file or resource (i.e. a read-only database). The application provides the data in a format that will allow clients to find and use the GUIDs in that resource (i.e. it might ensure that the version part of the GUID is included in those objects that are versioned, to ensure that the precise object is referenced).

### 8.2 Serve Metadata (to a foreign authority or client application)

A foreign authority or client application makes a request for a particular representation of an object. The application handles this request according to the specification.
## Non-Functional Requirements

In addition to the functional requirements outlined above, the non-functional requirements of authentication and identity of principals across GUID authorities need to be met.

As with business rules for accepting, validating, and handling data, the implementation of rules for authorization of particular operations (e.g. for updating) and for audit of operations should be determined by the authority. However, methods for authentication should be specified in any GUID protocol. It is also desirable to be able to provide a globally unique identifier for the user account or the person who holds that account, so that credit can be assigned (e.g. for improving data, or creating new data).
## Technical Issues / Decisions

1. GUID identity, Java object identity and Hibernate / database identity should mean the same thing.

    Hibernate enables us to equate Java object identity (using object.equals()) with database (row) identity (using the primary key, for example). Given that two representations with the same GUID (identifier) in the rdf:about, or asserted to be the same using owl:sameAs, are the same thing, they should be the same Java object (and database row) too.

    Relationships between two distinct (independently resolvable) but somehow related objects are handled using a different mechanism in RDF and, likewise, within the CDM (e.g. taxon.synonyms.synonym or taxon.relationsFromThisTaxon[type=CONGRUENT_TO] or term.generalizationOf).

    The consequence of this rule is that a given CDM store cannot have more than one object with the same GUID, regardless of version. This makes programming the CDM feasible, but also has consequences when importing representations of objects with GUIDs, or when querying foreign authorities or aggregators for objects. If an object is already present within the CDM store, the application should either (a) return the already persisted version, if that version is the most up to date, or (b) update the persisted version and return that, but it should not create a new object with a different primary key.

1. GUID Assignment, discovery and harvesting are all services that should be implemented in the service layer and exposed by the controller layer in the CDM Server, because the low-level implementation of these services might need to be re-used in various contexts, e.g. in the Taxonomic Editor.

1. The version identifier of an object should be incremented every time the object is updated.

1. Objects can be deleted within the CDM: because the CDM implements versioning, objects can be removed but are still resolvable.

1. The application should prevent users from assigning a GUID that has already been used. This should be part of the validation component of the CDM.

1. Given that there are other identifier schemes in use, the CDM should support these. The consequence of this is that the current identity implementation, which is based on LSIDs only, should be changed to support other identifier schemes (provided that they fit into the general design outlined here). It is particularly important to support HTTP URIs, as these are already being used for some terms in the TDWG LSID Vocabularies, and DOIs, as the current best practice is that references with DOIs are not assigned new GUIDs because they are considered to have one already.

1. Given the correct user permissions and access rights, users of the CDM can do more or less what they want to native objects. If there are general business rules that exist for handling certain data types, the CDM should implement those rules as part of the data validation functionality.

1. The CDM should distinguish between foreign and native objects with identifiers and behave appropriately with these objects.
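The first decision above (one object per GUID, regardless of version) can be illustrated with a sketch in which equality is defined by GUID and an import either returns the persisted instance or updates it in place. `Identifiable` and `SingleCopyStore` are hypothetical names, the integer version is an assumption, and the map stands in for Hibernate and the database. The real CDM base classes will differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical identifiable object; equality is defined by the unversioned GUID alone. */
final class Identifiable {
    final String guid;
    int version;
    String title;

    Identifiable(String guid, int version, String title) {
        this.guid = guid;
        this.version = version;
        this.title = title;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof Identifiable && guid.equals(((Identifiable) other).guid);
    }

    @Override
    public int hashCode() {
        return guid.hashCode();
    }
}

/** Hypothetical store that never holds two objects with the same GUID. */
final class SingleCopyStore {
    private final Map<String, Identifiable> byGuid = new HashMap<>();

    /** Returns the one persisted instance for this GUID, updated in place if the import is newer. */
    Identifiable importObject(Identifiable incoming) {
        Identifiable persisted = byGuid.get(incoming.guid);
        if (persisted == null) {
            byGuid.put(incoming.guid, incoming);
            return incoming;
        }
        if (incoming.version > persisted.version) {
            // Case (b): update the persisted object rather than creating a second one.
            persisted.version = incoming.version;
            persisted.title = incoming.title;
        }
        // Case (a) falls through: the persisted copy is already the most recent.
        return persisted;
    }

    int size() {
        return byGuid.size();
    }
}
```

Because the persisted instance is updated rather than replaced, references held elsewhere in the store remain valid after an import.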
## CDM Application Architecture & Design

The following diagram shows the kind of environment that the CDM might be operating in:

The following components are required in general. The components in bold already exist to some extent.

1. **GUID data model.** Already implemented, but for LSIDs only. Could be extended / refactored to incorporate other GUID types (e.g. DOI, HTTP URI).
1. GUID Registry. This component maps authority:namespace pairs to CDM objects (and thus to services in the CDM service layer).
1. GUID Assigning Service. This component assigns new GUIDs to local CDM objects, or acts as a proxy for an external assigning service.
1. **GUID Resolution (controller layer).** This component handles GUID requests. The CDM has a functional LSID Resolution component.
1. GUID Harvesting (controller layer). This component allows aggregators to poll the service and retrieve new or updated objects for indexing. There is no standard for GUID harvesting, although something like OAI-PMH seems like a good choice.
1. GUID Discovery (controller layer). This component would allow clients to discover whether an object with a given set of properties already exists. There is no standard for GUID Discovery, although in principle something like the assigning service but with GET semantics would work (i.e. don't create something if it doesn't exist). Alternatively, something like OpenURL with a custom metadata format could suffice.
1. GUID Assignment. This component is functionally similar to the discovery component, but has POST semantics, i.e. a resource is created if the request is successful. This component could be extended beyond the original LSID assignment spec to encompass object updating, i.e. a client transmits an object that belongs to the authority and already has a GUID, and the authority may update its representation as provided by the client. GUID assignment is essentially a complex operation, and if a human being needs to be involved in deciding whether a request from a foreign authority should be accepted or not, then the request must be asynchronous and long-running. If such asynchronous, long-running operations are supported, then notification (of the push or pull variety) should be supported.
1. Foreign Authority Delegating DAO. This component wraps a DAO and allows the application to resolve foreign objects, persist local cached copies of foreign objects, and, if the foreign authority allows it, assign foreign identifiers for new objects that it creates.

For a CDM application that imports or exports (foreign or local) identifiable objects, the I/O components will need special business logic to handle objects with GUIDs.
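As an illustration of the GUID Registry component, the following sketch maps authority:namespace pairs to a lookup function and dispatches an LSID to it. Everything here is hypothetical: `GuidRegistry` is not an existing CDM class, and each `Function` merely stands in for a service in the CDM service layer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical registry; each Function stands in for a service in the service layer. */
final class GuidRegistry {
    // "authority:namespace" -> service that looks up an object by its local id
    private final Map<String, Function<String, String>> services = new HashMap<>();

    void register(String authority, String namespace, Function<String, String> service) {
        services.put(authority + ":" + namespace, service);
    }

    /** Dispatches urn:lsid:authority:namespace:object[:revision] to the matching service. */
    Optional<String> resolve(String lsid) {
        String[] parts = lsid.split(":");
        if (parts.length < 5) {
            return Optional.empty(); // not in LSID form, so nothing can be dispatched
        }
        Function<String, String> service = services.get(parts[2] + ":" + parts[3]);
        if (service == null) {
            return Optional.empty(); // no service registered for this authority:namespace
        }
        return Optional.ofNullable(service.apply(parts[4]));
    }
}
```

A resolution controller could consult such a registry first and fall back to a foreign-authority client when no local service is registered for the pair.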
## Conclusion: where next?

This document presents a set of use cases focussed on the use of Globally Unique Identifiers to help users of a single CDM Community Store achieve their objectives. These use cases show how GUIDs can be used by data providers, aggregators and end users to discover, connect, and manage data, and to distribute data across a number of data providers. For GUIDs to be really useful, additional services are needed, such as services that allow GUIDs and their metadata to be harvested (discovered en masse, regardless of their properties) or discovered based upon their properties. In addition, existing GUID specifications should be extended to allow for long-running, asynchronous processes.

Given the nascent state of GUID resolution services, it is unlikely that such complex services will be developed in the near future. Consequently, the CDM should not attempt to develop services based upon GUIDs until the community as a whole has a shared understanding of the problem and has refined its specifications further.

It would be more productive to bear in mind the overall architectural design presented here whilst further developing related areas of the CDM Java Library, e.g. de-duplication / merging, validation, web services, and data import / export.