Version 2 - History - GuidReport - EDIT - EDIT Project Management

Ben Clark

# GUIDs in the CDM

## Purpose of this document

The EDIT Common Data Model Java Library is a generic API for building applications for revisionary taxonomy and taxonomic field work. Such applications are envisaged to be integrated into the wider biodiversity informatics landscape. Standard vocabularies for describing data, the use of globally unique identifiers (GUIDs) to distinguish between those data items, and standard protocols and web services for exchanging metadata are understood to be the means to achieve this integration.

GUIDs in this context are taken to be a synonym for the whole system of Globally Unique Identifiers: the identifiers themselves, the associated technology for resolving metadata about the objects those identifiers identify, the ontology for describing the relationships and properties of those objects and their semantics, the format of any representations of those objects, and the software applications that can use and understand those objects. A useful reference is the definition of GUIDs provided by the TDWG GUID Wiki http://wiki.tdwg.org/twiki/bin/view/GUID/WebHome#A_Definition_of_Globally_Unique.

This document attempts to:

* Describe some things (use-cases) that a user would want to achieve, and how the use of a GUID would help the user achieve that goal

* Identify any gaps in the currently proposed GUID technology, and perhaps propose solutions to those gaps.

* Identify functional and non-functional requirements of such a system in terms of lower level operations, and map these requirements onto the CDM Java Library

* Specify, at a high level, missing components of the CDM Java Library with regard to GUIDs.

## Scenario

Taxonomists and other biodiversity scientists collect and create information about biological entities. The amount of information in total is very large, so large that it is impossible to collect more than a fraction of the total during the course of any one project or to hold more than a fraction of the total in one database or software application. Instead, databases of limited scope (e.g. taxonomic scope, or geographical scope, or being restricted to a certain subset of the total number of categories of information) are created, usually for a particular purpose by a particular organization or group of individuals. There are many databases each containing a subset of the total information.

The following use cases are set in the context of information shared between cate-araceae.org and IPNI. CATE Araceae is a database created by Simon Mayo & collaborators (cate-araceae.org). It is a taxonomic revision of the Araceae, a group of about 3,000 species of plants. The primary purpose of the database is to provide a classification and diagnostic description of the accepted species of Aroids. Initially the taxonomic concepts used in CATE Araceae were those accepted by the Monocot Checklist, but it is believed that more than 1,000 species of Aroids are yet to be described, so it is likely that the cate-araceae.org checklist will diverge from the Monocot Checklist in the future unless particular effort is spent maintaining them in synchrony. The core data within this database are Taxonomic Concepts, and Descriptions of those Taxonomic Concepts. However, cate-araceae.org also contains lists of Taxonomic Names, authors, specimens, references, controlled terms and many other types of data. Ideally CATE Araceae would like to use global authority files for these entities, as there is more than enough work purely in maintaining the classification and descriptive data. An added complication is that these "global authority files" are themselves not static but change as new publications, specimens etc. are created or as existing data is improved.

The International Plant Names Index (IPNI) has been created by a consortium of RBG Kew, the Australian National Herbarium, and the Harvard University Herbaria. It was assembled from Index Kewensis, the Gray Card Index & APNI, and aims to compile and maintain a comprehensive literature-based record of the scientific names of all vascular plants and to make it freely available on the Internet. It is updated on a regular basis by the IPNI Editors. The core data in IPNI are Taxonomic Names, Publications, and Authors.

Some general features of such databases are that:

* They do not cover all information, but specialize in a particular subset of information.

* They are created or compiled in order to meet the immediate business needs of specific users working for, or with, the organization that supports the database. In the case of cate-araceae.org, Simon Mayo and his collaborators are active Araceae taxonomists and use the database in their day-to-day work, and as a means of publishing the results of their research. Likewise the components of IPNI were created primarily to serve botanical research globally.

* Once assembled, the data in such databases need to be updated if they are to remain useful. In the case of cate-araceae.org, it is estimated that approximately 1,000 aroid species are currently undescribed. In the case of IPNI, new, validly published names are added to the index, in addition to continuous efforts to improve the quality of the data.

* They are (usually) publicly funded and publicly available. It is important to the organizations that support such databases that these resources are used, and are useful beyond the organization that created them, although it can be difficult to demonstrate this usefulness or use.

## The CDM & GUIDs

The CDM is a data model implemented in Java. The metadata returned by an LSID resolution service, for example, is an RDF document typed according to the TDWG ontology. There is (in most cases) a one-to-one mapping between classes and properties in the CDM and the current TDWG ontology (found at http://rs.tdwg.org/ontology). However, coverage between the two data models is not complete in either direction, and in some cases properties in one model are composites of properties in the other. A second problem is that cardinality constraints have not been placed on the properties of RDF objects in the TDWG ontology. As a consequence, conversion between the CDM and the TDWG ontology is expected to be lossy (i.e. CDM objects cannot be converted into RDF and back again without loss of data in some instances). The following table gives some of the main objects or properties used in the use cases below as either CDM objects or their equivalents from the TDWG ontology.

| CDM | TDWG Ontology |
| --- | --- |
| TaxonBase (abstract class, can be Taxon or Synonym) | TaxonConcept |
| TaxonNameBase (abstract class, can be BotanicalName, ZoologicalName etc) | TaxonName |
| TaxonBase.name | TaxonName.hasName |
| TaxonBase.descriptions | TaxonBase.hasDescription |
| NonViralName.specificEpithet | TaxonName.specificEpithet |
| NonViralName.genusOrUninomial | TaxonName.uninomial |
| NonViralName.rank | TaxonName.rank |
| no direct equivalent | TaxonName.authorship |
| TaxonNameBase.descriptions | no direct equivalent |
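
The lossy round trip described above can be illustrated with a minimal Java sketch. The class `PropertyMappingSketch` and its methods are hypothetical and are not part of the CDM Java Library; property names are treated as plain strings and only a subset of the table rows is used. It is meant only to show why a property with no direct equivalent cannot survive conversion in both directions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not part of the CDM Java Library.
final class PropertyMappingSketch {

    // CDM property -> TDWG ontology property, taken from rows of the table above.
    static final Map<String, String> CDM_TO_TDWG = new LinkedHashMap<>();
    static {
        CDM_TO_TDWG.put("TaxonBase.name", "TaxonName.hasName");
        CDM_TO_TDWG.put("NonViralName.specificEpithet", "TaxonName.specificEpithet");
        CDM_TO_TDWG.put("NonViralName.genusOrUninomial", "TaxonName.uninomial");
        CDM_TO_TDWG.put("NonViralName.rank", "TaxonName.rank");
        // "TaxonNameBase.descriptions" has no direct equivalent, so it is absent here.
    }

    // Convert CDM property values to TDWG property values, dropping unmapped properties.
    static Map<String, String> toTdwg(Map<String, String> cdmValues) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : cdmValues.entrySet()) {
            String target = CDM_TO_TDWG.get(e.getKey());
            if (target != null) {
                out.put(target, e.getValue());
            }
        }
        return out;
    }

    // Convert back; only properties that survived the first conversion can return.
    static Map<String, String> toCdm(Map<String, String> tdwgValues) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> m : CDM_TO_TDWG.entrySet()) {
            String value = tdwgValues.get(m.getValue());
            if (value != null) {
                out.put(m.getKey(), value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> name = new LinkedHashMap<>();
        name.put("NonViralName.genusOrUninomial", "Philodendron");
        name.put("NonViralName.specificEpithet", "acreanum");
        name.put("TaxonNameBase.descriptions", "a description attached to the name");
        // The descriptions entry is missing after the round trip.
        System.out.println(toCdm(toTdwg(name)).keySet());
    }
}
```

A real converter would also have to deal with composite properties and cardinality, which this sketch ignores.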

The current TDWG recommendation is that LSIDs and an LSID Resolution Service are used to publish data about objects. The CDM Server implements the LSID Resolution Service specification only partially: it does not have working Foreign Authority Notification. There is also an LSID Assigning Service specification, which the CDM Server does not implement.

## Assumptions

1. GUID in this context is a globally unique identifier for an object (and associated technology)

1. The classes or categories of object, and their properties are defined by the TDWG ontology and can be mapped onto the CDM objects

1. GUIDs are resolvable using any software client that uses the standard protocol defined as part of the GUID technology

1. The representation formats for the different classes are also defined by TDWG

1. In addition to having properties that are "part of" or "core" to the object, globally identifiable objects can also be related to other globally identifiable objects

1. The properties or attributes of an object are not necessarily immutable, i.e. it is possible to change properties of an object or its relationships with other objects.

1. If two representations have the same GUID then they unambiguously represent the same thing

1. If two representations have different GUIDs then they may represent the same thing, but this may be a value judgement (based on comparison of their properties).

1. Objects with GUIDs are intended to be permanently resolvable (in the same sense as anything man-made i.e. when it is published the intention is for it to always be resolvable).

1. Objects "belong" to only one authority. The authority that "owns" the object is entitled to change the properties of the object.

1. Objects can be associated with other objects (within the constraints of the TDWG ontology). An authority cannot restrict other users from using its GUIDs in associations once published.

## Use Cases

### User finds GUID in publication & uses it to discover more information

*Use Case*: A user is interested in learning about Philodendron venustifoliatum, and obtains a PDF document of "Philodendron venustifoliatum (Araceae): a new species from Brazil", Kew Bull. 53: 483–486. The GUID "urn:lsid:cate-araceae.org:taxonconcepts:152024" is embedded in the PDF as a hyperlink. The user retrieves information using their client from a variety of data providers. Some of the data was not necessarily available at the time the document was published, or does not reside in the database of cate-araceae.org.

1. The user clicks on the link, hoping to find more information about the species

1. Using a GUID client, the user obtains a document in a standard format that is typed according to the TDWG ontology. The user's client can understand that this document describes a taxon concept. It also understands the meaning (semantics) of the properties of a taxon object as the semantics of these properties are also defined by the TDWG ontology.

1. The taxon object contains data including the description of the species, links to images, a coded distribution according to a controlled vocabulary and another embedded GUID - "urn:lsid:ipni.org:names:320552-2" associated with the name property of the taxon object. The GUID client knows that this GUID is a pointer to a name object (this is also defined by the TDWG ontology).

1. The user indicates that they want to learn more about the name of the taxon using the user interface (e.g. by clicking on the name property) and their client resolves this identifier and retrieves the name object from IPNI. The name object contains more data, again typed according to the TDWG ontology. The user discovers the location that the type was collected from, and the current location of the holotype and syntypes.

1. The user's client can discover and use other services that also understand the TDWG ontology e.g. specimen databases, to retrieve further information about the specimens that typify this name.
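
Before a client can resolve the identifiers in the steps above, it has to split them into their parts. The following is a minimal Java sketch of that step, assuming the usual LSID layout `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`. The class `Lsid` is a hypothetical helper, not a class from the CDM Java Library, and it rejects empty components as a simplification.

```java
// Hypothetical helper for illustration; not a class from the CDM Java Library.
final class Lsid {
    final String authority;
    final String namespace;
    final String objectId;
    final String revision; // null when the LSID has no revision part

    private Lsid(String authority, String namespace, String objectId, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.objectId = objectId;
        this.revision = revision;
    }

    // Parse "urn:lsid:<authority>:<namespace>:<object>[:<revision>]".
    static Lsid parse(String text) {
        String[] parts = text.split(":", -1);
        boolean shapeOk = (parts.length == 5 || parts.length == 6)
                && parts[0].equalsIgnoreCase("urn")
                && parts[1].equalsIgnoreCase("lsid");
        if (!shapeOk) {
            throw new IllegalArgumentException("Not an LSID: " + text);
        }
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                throw new IllegalArgumentException("Empty LSID component in: " + text);
            }
        }
        String revision = parts.length == 6 ? parts[5] : null;
        return new Lsid(parts[2], parts[3], parts[4], revision);
    }

    public static void main(String[] args) {
        // The two identifiers from the use case: a taxon concept and its name.
        Lsid concept = parse("urn:lsid:cate-araceae.org:taxonconcepts:152024");
        Lsid name = parse("urn:lsid:ipni.org:names:320552-2");
        System.out.println(concept.namespace + " issued by " + concept.authority);
        System.out.println(name.namespace + " issued by " + name.authority);
    }
}
```

The authority part is what tells the client that the taxon concept was issued by cate-araceae.org while the name was issued by ipni.org, even though both identifiers were found in the same document.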

The use of GUIDs is beneficial for cate-araceae.org because it is able to supply data to a larger number of users and clients by adopting a single, generic, standard protocol and ontology (rather than developing a service specifically for each client). cate-araceae.org adds value to its data by linking it to other data that, in turn, can be linked to more data, all accessible through the same standard route. For IPNI, the benefits come from increased numbers of users discovering or using its data through links from external data providers.

In choosing to use an IPNI GUID, cate-araceae.org is deferring to the expertise of the IPNI Editors, increasing their status as being authoritative for that particular class of data. The provenance of the names data in cate-araceae.org is established as originating from a particular record in IPNI, even if a user downloads that data from cate-araceae.org. Provided that cate-araceae.org adds some value (i.e. by providing extra data or extra services) beyond the data and services offered by IPNI, the relationship can be symbiotic.

Because GUIDs are permanently resolvable, the metadata associated with the objects in CATE Araceae and IPNI can be retrieved even if the data moves location (e.g. between servers, databases, or even institutions hosting the data).

### User contributes new taxon

*Use Case*: Simon Mayo has added the (fossil) Aroid Genus Albertarum Bogner, G.L. Hoffman & Aulenback to cate-araceae.org. Joseph Bogner, who described this genus, collaborates with Simon and is in regular communication with him about his research. The name Albertarum is submitted to the IPNI Editors for consideration on Simon's behalf, much sooner than it would have been if the Editors had been forced to discover the publication of the name by scanning the paper literature. IPNI includes the name in its database, assigning a new GUID for it. It notifies CATE Araceae that it has published a GUID once it is assigned, and CATE Araceae attaches the GUID to the name object in its database.

1. Simon uses the cate-araceae.org interface to create a new species page for the taxon Albertarum.

1. He uses a web-form to fill in data about the protologue (Bogner, J., Hoffman, G.L., Aulenback, K.R. 2005. A fossilized aroid infructescence, Albertarum pueri gen.nov. et sp.nov., of Late Cretaceous (Late Campanian) age from the Horseshoe Canyon Formation of southern Alberta, Canada. Canadian Journal of Botany), and the authorship.

1. He submits this data to CATE Araceae, and this is published on the web.

1. CATE Araceae would like to apply the IPNI ids for all of their names. The software checks IPNI to discover if the name Albertarum already exists

1. In this case, IPNI does not (currently) hold this information.

1. IPNI processes the metadata passed to it by cate-araceae.org and (after intervention by the IPNI Editors, plus some offline checking of the Can. J. Bot. article), the name Albertarum is added to IPNI

1. IPNI records that the request for a GUID for the name Albertarum originated from cate-araceae.org, and the new GUID is sent to cate-araceae.org, which associates it with the name object in its database.

From cate-araceae.org's point of view, it is behaving as a "good citizen" by notifying IPNI about an event (a new Genus) that it might be interested in. Overall, it is in cate-araceae.org's best interests that IPNI is comprehensive, especially if IPNI offers services that cate-araceae.org uses that improve with the completeness and accuracy of the nomenclator e.g. validation of name strings.

Alternative end points might be:

1. IPNI rejects the request for an id for Albertarum because the Editors decide that the name is not validly published, or that, being a Fossil genus, it is out of scope of IPNI

1. IPNI assigns an identifier, but corrects some of the metadata as supplied by cate-araceae (perhaps the authority was misspelled).

1. IPNI returns an identifier that has already been created, confirming that the metadata supplied in the request made by cate-araceae.org is correct.

In these cases, having IPNI check the metadata associated with Albertarum prior to issuing a GUID is useful for cate-araceae.org because the users of CATE might not necessarily be experts in nomenclature, or might have made an error in entering the data.
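
The main flow and the alternative end points can be sketched as the set of answers a client has to handle when it requests a GUID from an authority. This is a minimal Java sketch under stated assumptions: the types `Outcome`, `Response` and `NameRecord` are hypothetical and exist neither in the CDM Java Library nor in IPNI, and the identifier used in `main` is a placeholder rather than a real IPNI identifier.

```java
// Hypothetical sketch; none of these types exist in the CDM Java Library or in IPNI.
final class GuidRequestSketch {

    // The four end points described in the use case.
    enum Outcome { ASSIGNED, ASSIGNED_WITH_CORRECTIONS, ALREADY_EXISTS, REJECTED }

    // What the authority sends back in answer to a request for a GUID.
    static final class Response {
        final Outcome outcome;
        final String guid;       // null when the request is rejected
        final String authorship; // authorship as held by the authority, null when rejected

        Response(Outcome outcome, String guid, String authorship) {
            this.outcome = outcome;
            this.guid = guid;
            this.authorship = authorship;
        }
    }

    // The client's local copy of the name.
    static final class NameRecord {
        final String nameString;
        String authorship;
        String guid; // null until an identifier has been attached

        NameRecord(String nameString, String authorship) {
            this.nameString = nameString;
            this.authorship = authorship;
        }
    }

    // Apply the authority's answer to the local record.
    static void apply(Response response, NameRecord local) {
        switch (response.outcome) {
            case REJECTED:
                // The name stays in the local database without an identifier.
                break;
            case ASSIGNED_WITH_CORRECTIONS:
                // Accept the authority's corrected metadata together with the GUID.
                local.authorship = response.authorship;
                local.guid = response.guid;
                break;
            case ASSIGNED:
            case ALREADY_EXISTS:
                local.guid = response.guid;
                break;
        }
    }

    public static void main(String[] args) {
        // A misspelled authority is corrected by the authority when it assigns the GUID.
        NameRecord local = new NameRecord("Albertarum", "Bogner, G.L. Hofman & Aulenback");
        Response response = new Response(Outcome.ASSIGNED_WITH_CORRECTIONS,
                "urn:lsid:ipni.org:names:EXAMPLE-1", // placeholder, not a real identifier
                "Bogner, G.L. Hoffman & Aulenback");
        apply(response, local);
        System.out.println(local.guid + " " + local.authorship);
    }
}
```

Whether a client should accept corrections automatically, as this sketch does, is a policy decision for the client.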

From IPNI's point of view, they are providing a route for feedback and new information in a standard way. They are also potentially reducing the workload for their Editors by accepting data that is already parsed and does not need to be entered by hand twice - for trustworthy clients it may be possible to verify such information quickly and incorporate it into IPNI. In addition it increases the rate of discovery of new names by allowing other users to submit new names as part of their work - there is no need for a user of cate-araceae.org to log in to a specific IPNI client in order to contribute names to the nomenclator. This follows the principle that "given many eyes, all bugs are shallow".

It may increase the workload of the IPNI Editors if a service like this results in a large number of requests for GUIDs, especially if each request requires significant checking, or some sort of interaction with the user that submitted the request. It will also make the IPNI application more complex to develop and maintain. For CATE Araceae, relying on IPNI for identifiers means that the IPNI Editors have the final say in issuing an identifier and associating metadata with a name. If it were important that CATE Araceae has total control over the names it publishes, then it should be its own authority for names.

### Existing taxon is split

*Use Case*: A user has downloaded data from cate-araceae.org and used those taxon concepts as the basis of a dataset of measurements taken from specimens. One of the editors of cate-araceae.org decides to include in the CATE Araceae web revision the split of an existing genus, Philodendron Schott, and the change in status of subgen. Pteromischum (Schott) Mayo to the rank of Genus. The dataset of the user is now out of date. The user is able to discover that the classification of Philodendron has changed, and they are able to update their dataset automatically to use the correct accepted names according to CATE Araceae.

1. A user has downloaded a checklist of Philodendron species, including the GUIDs of those species.

1.  They use the checklist to create a dataset of leaf morphometric measurements for many species in the genus, including Philodendron acreanum K.Krause.

1. An editor of cate-araceae.org increases the rank of Pteromischum and it appears in cate-araceae.org as a new genus. 66 of the species of Philodendron are recombined in this new genus.

1. The software recognises that the name Pteromischum originated from IPNI (urn:lsid:ipni.org:names:927070-1:1.1.2.1.1.2), and that this name has been changed by a user of cate-araceae.org.

1. The software transmits this information to IPNI.

1. Following the same kind of process as the one outlined in the previous use case, the IPNI Editors decide that Pteromischum (Schott) Mayo is a new name, and return a new identifier for the name, which cate-araceae.org applies.

1. cate-araceae.org increments the version of the taxon concept Philodendron Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:151375) and saves the new version of Philodendron under this new version number.

1. cate-araceae.org creates 66 new names (and attempts to acquire 66 IPNI identifiers for them) in the genus Pteromischum (Schott) Mayo, and 66 new taxon concepts that are related to the original taxon concepts before they were recombined (e.g. through synonym relationships).

1. At a later date, the original user uses a GUID client to discover that, for example, Philodendron acreanum has been changed (because the metadata returned indicates that the version identified isReplacedBy another, later version). The semantics of this relationship are defined by the TDWG ontology.

1. The original user decides that they trust that cate-araceae.org is correct and would like to update their data so that it is labelled according to the current accepted names.

1. The original user uses a GUID client to resolve the most up-to-date versions of the taxon concepts in their checklist and discovers that they now have metadata indicating that they are synonyms of other taxon concepts.

1. By retrieving metadata about those new taxon concepts, the user is able to update their checklist automatically to contain the currently accepted names e.g. Philodendron acreanum K.Krause is a synonym of Pteromischum acreanum (K.Krause) Mayo, and so is replaced automatically. The data that the user created remains associated with the correct taxon concept.

1. In some cases the mapping between older objects and new objects might be ambiguous (e.g. when a taxon is thought to be a pro parte synonym of another taxon). In this case, the data cannot be transformed without human intervention.
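
The last steps of the list above amount to following isReplacedBy links to the latest version of a taxon concept and then following a synonym relationship to the accepted concept. A minimal Java sketch of that logic follows, assuming the links have already been retrieved into in-memory maps. The class `VersionChainSketch`, its method names and the short keys used in `main` are hypothetical; a real client would obtain the links by resolving GUIDs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; the maps stand in for metadata obtained by resolving GUIDs.
final class VersionChainSketch {

    // Follow isReplacedBy links until a version with no replacement is reached.
    static String latestVersion(String guid, Map<String, String> isReplacedBy) {
        Set<String> seen = new HashSet<>();
        String current = guid;
        while (isReplacedBy.containsKey(current)) {
            if (!seen.add(current)) {
                throw new IllegalStateException("Cycle in isReplacedBy links at " + current);
            }
            current = isReplacedBy.get(current);
        }
        return current;
    }

    // Accepted concept for the latest version of a concept. Empty when the latest
    // version is a synonym of more than one concept (e.g. pro parte), because that
    // case needs human intervention.
    static Optional<String> currentAccepted(String guid,
                                            Map<String, String> isReplacedBy,
                                            Map<String, List<String>> synonymOf) {
        String latest = latestVersion(guid, isReplacedBy);
        List<String> accepted = synonymOf.get(latest);
        if (accepted == null || accepted.isEmpty()) {
            return Optional.of(latest); // still an accepted concept itself
        }
        if (accepted.size() > 1) {
            return Optional.empty();
        }
        return Optional.of(accepted.get(0));
    }

    public static void main(String[] args) {
        // Short placeholder keys, not real LSIDs.
        Map<String, String> isReplacedBy = new HashMap<>();
        isReplacedBy.put("philodendron-acreanum:1", "philodendron-acreanum:2");
        Map<String, List<String>> synonymOf = new HashMap<>();
        synonymOf.put("philodendron-acreanum:2", Arrays.asList("pteromischum-acreanum:1"));
        System.out.println(currentAccepted("philodendron-acreanum:1", isReplacedBy, synonymOf));
    }
}
```

Returning an empty result for the ambiguous case keeps the automatic update from silently choosing one of several possible targets.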

From the point of view of a user of cate-araceae.org, GUIDs represent a standard way to identify precisely the provenance of data they have obtained from cate-araceae.org. Because cate-araceae.org saves a different version each time the data is changed in the database, and links these versions via replaces and isReplacedBy links in the metadata that it returns, users of that data can find out whether they are using the most up-to-date version of the data: the most recent version does not have an isReplacedBy link, because it has not been replaced.

If users want to use the most up to date data possible and trust the authority, GUIDs provide a method for obtaining the most current version (by following links between versions). Although not covered by current GUID protocols, it would also be possible for a client to pull or harvest changes in bulk from an authority, rather than querying the authority on an object-by-object basis.

If users do not want to update their data or do not trust the authority that the changes are correct, then they can continue to use the old GUIDs provided the authority versions their objects. If the authority does not version their objects then the metadata about the object provided by the authority and the metadata held by the client could be different or even contradictory. As an example, suppose that in step 7 of the use case described above:

1. cate-araceae.org does not change the version number for Philodendron (Schott). Philodendron (Schott) as served by cate-araceae.org (urn:lsid:cate-araceae.org:taxonconcepts:151375) has 371 species, and Philodendron (Schott) (urn:lsid:cate-araceae.org:taxonconcepts:151375) in the dataset of the client has 437 species.

In this case, the user's GUID client can only detect changes by comparing properties (e.g. dc:modified) of the object that it has cached with those of the object currently being resolved by the authority.
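
A minimal Java sketch of this fallback check is given below, assuming that both the cached copy and the freshly resolved copy carry a dc:modified timestamp. The class and method names are hypothetical and the dates in `main` are invented for illustration.

```java
import java.time.Instant;

// Hypothetical sketch of change detection for an unversioned object.
final class ModifiedCheckSketch {

    // True when the authority's copy carries a later dc:modified value than the cached copy.
    static boolean changedAtAuthority(Instant cachedModified, Instant resolvedModified) {
        return resolvedModified.isAfter(cachedModified);
    }

    public static void main(String[] args) {
        Instant cached = Instant.parse("2009-01-15T00:00:00Z");   // invented date
        Instant resolved = Instant.parse("2009-06-01T00:00:00Z"); // invented date
        System.out.println(changedAtAuthority(cached, resolved));
    }
}
```

Unlike the isReplacedBy link, this check only says that something changed. It does not give the client access to the older state of the object.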

### New data added to existing taxon is repurposed by aggregator

*Use Case*: One of the editors of cate-araceae.org adds some extra data to an existing taxon concept, a new distribution record for Homalomena Schott (urn:lsid:cate-araceae.org:taxonconcepts:99893), stating that it occurs in Brazil. This data is harvested by GBIF and the taxon Homalomena Schott appears in search results for taxa found in Brazil in the GBIF portal, linking users through to other data about this taxon (for example, textual diagnoses or images).

1. One of the users of CATE Araceae enters data into cate-araceae.org using a web form. This data is entered into the CDM database used by CATE Araceae.

1. CATE Araceae exposes this data in a web page (e.g. as a map showing the regions colour coded by presence / absence).

1. The Distribution Record can be expressed as a Species Profile Model InfoItem (http://rs.tdwg.org/ontology/voc/SpeciesProfileModel#InfoItem), of class Distribution (http://rs.tdwg.org/ontology/voc/SPMInfoItems#Distribution) which has a value http://rs.tdwg.org/ontology/voc/GeographicRegion.rdf#84.

1. The CDM Server underlying CATE Araceae exposes its data for harvesting by aggregators. An aggregator from GBIF uses a standard protocol to discover any objects that are new or changed since it last harvested data from CATE Araceae.

1. CATE Araceae responds with a list of GUIDs for objects that are new or have changed since GBIF last harvested it. This includes the GUID for the InfoItem that states that Homalomena Schott is found in the TDWG Region of Brazil.

1. GBIF harvests the new and updated objects by resolving these identifiers and requesting the metadata. CATE Araceae responds with the metadata in the standard format (i.e. RDF)

1. GBIF adds this data to its own aggregated database. It can use the fact that the TDWG Vocabulary specifies the semantics of the returned data to build queries across data harvested from a variety of sources e.g. "images of taxa that occur in Brazil"

1. Because GBIF has indexed data from other sources that associate images with the taxon concept Homalomena Schott sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:99893, via the hasDigitalImage property), these images are now returned in searches for images of Brazilian taxa.
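
The harvesting steps above can be sketched as two operations: the provider lists the GUIDs of objects changed since the last harvest, and the aggregator resolves each of them and stores the metadata keyed by GUID. The Java below is a minimal sketch with in-memory maps standing in for the provider's database and for GUID resolution. The class `HarvestSketch`, its methods, the InfoItem identifier and the dates are all hypothetical, and no particular harvesting protocol is implied.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; maps stand in for the provider's database and for GUID resolution.
final class HarvestSketch {

    // Provider side: GUIDs of objects modified after the aggregator's last harvest.
    static List<String> changedSince(Map<String, Instant> modifiedByGuid, Instant lastHarvest) {
        List<String> guids = new ArrayList<>();
        for (Map.Entry<String, Instant> e : modifiedByGuid.entrySet()) {
            if (e.getValue().isAfter(lastHarvest)) {
                guids.add(e.getKey());
            }
        }
        Collections.sort(guids); // stable order for the response
        return guids;
    }

    // Aggregator side: fetch metadata for each changed GUID and index it by GUID, so an
    // object that arrives more than once is stored once. Returns the number harvested.
    static int harvest(List<String> changedGuids,
                       Map<String, String> providerMetadata,
                       Map<String, String> aggregatorIndex) {
        int harvested = 0;
        for (String guid : changedGuids) {
            String metadata = providerMetadata.get(guid); // stands in for resolving the GUID
            if (metadata != null) {
                aggregatorIndex.put(guid, metadata);
                harvested++;
            }
        }
        return harvested;
    }

    public static void main(String[] args) {
        String concept = "urn:lsid:cate-araceae.org:taxonconcepts:99893";
        String infoItem = "urn:lsid:cate-araceae.org:infoitems:EXAMPLE"; // placeholder identifier
        Map<String, Instant> modified = new HashMap<>();
        modified.put(concept, Instant.parse("2009-01-01T00:00:00Z"));  // invented date
        modified.put(infoItem, Instant.parse("2009-03-01T00:00:00Z")); // invented date
        Map<String, String> metadata = new HashMap<>();
        metadata.put(infoItem, "<rdf:RDF>...</rdf:RDF>");
        Map<String, String> index = new HashMap<>();
        List<String> changed = changedSince(modified, Instant.parse("2009-02-01T00:00:00Z"));
        System.out.println(harvest(changed, metadata, index) + " object(s) harvested");
    }
}
```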

Here GUIDs and the associated technology are advantageous for the data provider because they can be used to build additional services that re-use data in ways not originally envisaged by the data provider. Because GUIDs, together with the ontology provided by TDWG, unambiguously associate objects in particular ways (e.g. an image of a taxon or an image of a publication about a taxon), the data provider's data can be connected with other data in a way that is easier for a computer to understand - resulting in higher quality data returned by queries across aggregated datasets, and more correct hits on the data provided by the data provider.

In addition, the use of the data provider's GUIDs in external objects (provided such objects can be discovered, for example through some sort of harvesting by an aggregator) is a useful metric of the use and usefulness of data belonging to the data provider. The number of web pages containing the words Homalomena Schott cannot convincingly be used as a metric of the usefulness of CATE Araceae, whereas the number of data objects (not published by CATE Araceae) that use urn:lsid:cate-araceae.org:taxonconcepts:99893 is a useful metric not only of all data from CATE Araceae, but also of that particular item of data (i.e. maybe for some taxon concepts CATE Araceae is preferred, whereas ubio is preferred for others).

The use of aggregation could be less effective if there were lots of other data providers that link contradictory information to objects published by the data provider (e.g. if lots of people publish images of non Homalomena species that are "tagged" as the CATE Araceae concept of Homalomena Schott). This problem is not unique to GUIDs, but might be expected to be less problematic provided objects with GUIDs are created with care.

For aggregators, the GUIDs are useful because they can be harvested, and the data associated with these objects is returned in a well-defined format common to all data providers that use GUIDs. Because GUIDs should be present in an object or an associated object, they can be used to detect if data about the same thing have been harvested from different sources, or if the same metadata has arrived via two or more routes (e.g. directly from IPNI and also from the Catalogue of Life). The usefulness of GUIDs from a data provider is proportional to the number of objects with GUIDs, and (if the aggregator is able to understand the meaning of the TDWG Vocabulary to make connections between objects it has harvested) the number of links between objects.

For the end user, the advantage of GUIDs in this scenario is that they are able to query across a large number of data providers in a single query, and that their queries are more powerful because the semantics of the associations between objects are well specified in the TDWG Ontology used to describe the objects.

### Data exported to a flat file for use in external tool

*Use Case*: A user of CATE Araceae exports data into SDD (Structured Descriptive Data) and this dataset is imported into the Lucid Builder. The user adds data (e.g. character state data) to the dataset and generates a new SDD document that they import back into the same database. The new measurements are added to the database (e.g. new description elements to existing descriptions), and elements that existed in the original dataset are updated if they have been edited in Lucid. Likewise new characters are imported into the CDM Database, but characters that existed at the time the data was originally exported are not duplicated (although they may be updated if the user has e.g. associated new images with them).

1. A user uses an export tool provided by the CDM to export data as Structured Descriptive Data (or they could download data from the cate-araceae.org website). The GUID of an object (where it exists) is included in the SDD Element that represents that object.

1. The user imports the data into Lucid Builder, preserving the GUIDs.

1. The user edits the dataset in Lucid, updating some objects and adding new ones (these objects do not have GUIDs).

1. They export the dataset as a new SDD document and import this document into the CATE Araceae database.

1. The CDM Java Library unmarshalls the document and recognises some objects as having GUIDs

1. The software checks the CDM Database to discover if these objects already exist. Objects without GUIDs are assumed to be unknown and therefore new.

1. If an object does exist, the existing (persisted) object is updated, unless it is newer than the object being imported.

1. If both the persisted and imported object have changes, the software alerts the user and requires them to manually resolve the issue.
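
The import rules in the steps above can be written down as a small decision function. This is a minimal Java sketch, not the behaviour of the CDM Java Library: the class `ImportMergeSketch` and the `Action` values are hypothetical, and it assumes the software can tell whether each copy has changed since the export (for example from the version part of the GUID). Treating an object whose GUID is not found in the database as new is also an assumption, since the steps only cover objects without GUIDs.

```java
// Hypothetical sketch of the import decision; not the CDM Java Library's behaviour.
final class ImportMergeSketch {

    enum Action { CREATE, UPDATE, SKIP, CONFLICT }

    // guid:             GUID carried by the imported object, or null if it has none
    // existsInDatabase: whether an object with that GUID is already persisted
    // importedChanged:  the imported copy was edited after the export
    // persistedChanged: the persisted copy was edited after the export
    static Action decide(String guid, boolean existsInDatabase,
                         boolean importedChanged, boolean persistedChanged) {
        if (guid == null || !existsInDatabase) {
            return Action.CREATE; // unknown object, therefore new (see assumption above)
        }
        if (importedChanged && persistedChanged) {
            return Action.CONFLICT; // the user has to resolve this manually
        }
        if (importedChanged) {
            return Action.UPDATE; // only the imported copy moved on
        }
        return Action.SKIP; // persisted copy is newer, or nothing changed
    }

    public static void main(String[] args) {
        System.out.println(decide(null, false, true, false));
        System.out.println(decide("urn:lsid:example.org:characters:1", true, true, true)); // placeholder GUID
    }
}
```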

For software developers, GUIDs provide a standard way to assert identity of an object. Because the protocol for handling identifiable objects is defined by the GUID system, different software tools can import and export data safely provided they behave in the correct way (e.g. preserving GUIDs). For users, GUIDs provide a way to exchange partial datasets between applications (e.g. exporting part of a dataset from the CDM and using it in Lucid to make a Multi-Access Key, or exporting data into NEXUS format and using the data in R). By including version parts of a GUID in the exported data, and by versioning objects every time the object is changed, it is possible to say exactly which object was exported.

GUIDs make importing data back into existing datasets easier, but it is unlikely that they could remove the need for manual intervention if both persisted and imported objects have changes.

### New information added to plant name

*Use Case*: One of the IPNI Editors makes a correction to the authority of Bognera recondita (Madison) Mayo & Nicolsen, correcting Nicolsen to Nicolson. This information is discovered by CATE Araceae, which has already associated the taxon concept Bognera recondita (Madison) Mayo & Nicolsen sec CATE Araceae, 2009 (urn:lsid:cate-araceae.org:taxonconcepts:22805) with the name published by IPNI (urn:lsid:ipni.org:names:942108-1).

1. One of the IPNI Editors corrects the authority of the taxonomic name Bognera recondita (Madison) Mayo & Nicolsen.

1. Either IPNI has stored information associating the taxon concept urn:lsid:cate-araceae.org:taxonconcepts:22805 with urn:lsid:ipni.org:names:942108-1 and notifies CATE Araceae that this object has been changed (push) or:

1. CATE Araceae has stored that the name urn:lsid:ipni.org:names:942108-1 is the name of urn:lsid:cate-araceae.org:taxonconcepts:22805 and periodically polls IPNI to discover any relevant names that have changed (pull).


1. Either way, CATE Araceae discovers that the authority of Bognera recondita should be (Madison) Mayo & Nicolson and updates its cached data accordingly.

1. This update has knock-on effects within CATE Araceae: it changes the title of the Taxon and TaxonDescription pages and alters search results, the taxon tree, etc.
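The pull variant of this use case can be sketched as a cache that is periodically re-checked against the authority. This is a minimal sketch under stated assumptions: `ForeignNameCache` is a hypothetical class, the cached metadata is reduced to a single authorship string, and the `resolver` function stands in for a real GUID client, which is not implemented here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical pull-style client: caches foreign metadata and re-checks it on demand. */
final class ForeignNameCache {

    // foreign GUID -> last metadata value seen (here just the authorship string)
    private final Map<String, String> cached = new LinkedHashMap<>();

    void remember(String guid, String authorship) {
        cached.put(guid, authorship);
    }

    String get(String guid) {
        return cached.get(guid);
    }

    /**
     * Asks the authority for the current value of every cached GUID and
     * updates the cache where it differs.
     *
     * @param resolver stand-in for a GUID client; maps a GUID to its current metadata
     * @return the GUIDs whose cached value was updated
     */
    List<String> poll(Function<String, String> resolver) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> entry : cached.entrySet()) {
            String current = resolver.apply(entry.getKey());
            if (current != null && !current.equals(entry.getValue())) {
                entry.setValue(current);     // update the cached copy in place
                changed.add(entry.getKey());
            }
        }
        return changed;
    }
}
```

The returned list is where the knock-on effects (page titles, search index, taxon tree) would be triggered. A push implementation would call the same update logic from a notification handler instead of a polling loop.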

For IPNI, notifying CATE Araceae of changes to names is a burden, either through having to push changes to interested clients or through the additional load of clients pulling data from it at regular intervals. This may be seen as the converse of "User contributes new taxon", i.e. if data providers want feedback about their data objects, they should reciprocate and inform clients of (or at least allow clients to discover) changes in the data that they are using.

Being notified of, or discovering, changes in external data is a real advantage to clients, as it allows the quality of secondary (to the client) data to be maintained without the need for manual checking. Automatic updating of names does have implications for clients if the authority makes changes that a client disagrees with. Because changes to foreign objects could happen automatically, authorities are expected to be explicit about the kind of data that might change in their objects.


### Problems & Gaps:

* GUID clients are not widely available (e.g. LSID clients).

* Few tools import data formatted according to the TDWG standards (i.e. the TDWG RDF vocabularies). The CDM can import a subset of this data (TCS-RDF), but this is a manual step: the CDM Java Library does not provide an LSID client.

* Conversion between RDF and the CDM is expected to be lossy. This is not necessarily a problem for publishing data that is read-only, but is problematic if data is imported or merged from multiple sources into the CDM.

* The LSID specification does not provide a service that allows:

  * Objects to be harvested regardless of their properties (for use by aggregators). There are existing open standards for metadata harvesting, notably OAI-PMH.

  * Objects to be discovered based upon their properties (to allow the discovery of an object that already exists, provided by the authority). This service might be conceptually similar to OpenURL.

  * The metadata associated with objects to be updated (for use by clients). This could be conceptually similar to the LSID Assigning service, except that the metadata provided is associated with an existing object, not a new object.

* The CDM Server does not support LSID Assignment.

* Currently the LSID Assigning Service does not support long-running (asynchronous) processes. This means that LSID assignment requires an immediate response, with no option for an authority (i.e. a human being) to take time to decide whether they want to assign an identifier or not. This makes LSID assignment of identifiers for abstract objects provided by non-trusted clients difficult or impossible in reality.

* The CDM Server does not support Foreign Authority Notification, which is required if multiple data providers are to be able to resolve data about the same object. I don't know of any LSID authorities that do support this part of the protocol, and indeed this part of the protocol has not been developed to the point where I believe it could be implemented without making assumptions about the way it should work.

* GUIDs are not preserved by magic. Users might not understand the importance of preserving GUIDs, or they might understand what a GUID is and not wish to preserve them anyway. If users obtain data but throw away the GUIDs, then the benefit of using GUIDs is naturally lost.

* It is not always possible to recognise that a string of characters is a GUID, so even though a GUID might be preserved, subsequent users might not understand its significance. In this case, the benefit of using a GUID is lost (for that user). Some technologies have been created for the sole purpose of providing GUIDs, so a user who recognises the technology should understand that the string is a GUID. In the case of technologies like HTTP URIs, these strings can be used as GUIDs, but a user may not be able to tell whether a URI is a GUID without attempting to resolve it.

* Users may recognise a string as a link to further information or as an identifier, but unless they are familiar with the technology or have a client that enables them to resolve the data associated with the object, they may not know what to do with it. If the identifier is of a form familiar to most users (e.g. HTTP URIs), users will naturally attempt to use a web client to obtain more information (but they may think that the GUID is just a URL). In the case of more esoteric GUIDs like DOIs and LSIDs, users may not know how to use them. To cope with this problem, GUIDs in documents intended for human consumption tend to be "clickable", i.e. they use an HTTP proxy or other mechanism to allow web browsers to resolve them.


## Functional requirements


 This section attempts to break down the use cases described above into much smaller operations that could be implemented by components of the CDM Java Library.


~~~
<code class="rst">
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Requirement                   | Native Version                      | Foreign Version                                              |
+===============================+=====================================+==============================================================+
| Creation of an object or data | 1.1 Create a new object with a GUID | 1.3 Ask a foreign authority to assign a GUID to a new object |
|                               | 1.2 Assign a GUID to an object that |                                                              |
|                               | didn't previously have a GUID       |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Updating metadata about an    | 2.1 Update an existing object       | 2.5 Ask a foreign authority to update an existing object     |
| object                        | 2.2 Update an existing object (and  | 2.6 Find out if a foreign object has changed                 |
|                               | assign it a new GUID)               | 2.7 Notify a foreign authority that another authority holds  |
|                               | 2.3 Update an existing object (and  | metadata about an object (FAN)                               |
|                               | state that it replaces an existing  | 2.8 Notify a foreign authority that another authority no     |
|                               | object)                             | longer holds information about an object                     |
|                               | 2.4 Notify foreign authorities that |                                                              |
|                               | an object has changed               |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Deleting an object            | 3.1 Delete an existing object (that | 3.3 Ask a foreign authority to delete an existing object     |
|                               | had a GUID)                         |                                                              |
|                               | 3.2 Delete an existing object (that |                                                              |
|                               | had a GUID, and state that another  |                                                              |
|                               | object replaces it)                 |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Resolving metadata about an   | 4.1 Find an object based upon its   | 4.2 Resolve a foreign object based upon its GUID             |
| object                        | GUID                                |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Discovery of objects based on | 5.1 Find objects with GUIDs based   | 5.2 Find foreign objects based upon their metadata (from a   |
| their metadata                | upon their metadata                 | specific foreign authority)                                  |
|                               |                                     | 5.3 Find foreign objects based upon their metadata           |
|                               |                                     | (globally, from an aggregator)                               |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
| Object navigation             | 6.1 Find different versions of an   | 6.5 Resolve different versions of an object (given an object |
|                               | object                              | that has versions)                                           |
|                               | 6.2 Find the current version of an  | 6.6 Resolve the canonical version of an object (given a      |
|                               | object                              | version of the object)                                       |
|                               | 6.3 Find an object that is replaced | 6.7 Resolve the object that a given object is replaced by    |
|                               | by a given object                   | 6.8 Resolve the original object(s) replaced by an object     |
|                               | 6.4 Find an object that replaces a  |                                                              |
|                               | given object                        |                                                              |
+-------------------------------+-------------------------------------+--------------------------------------------------------------+
~~~

In addition, there are a couple of special stories related to importing / exporting data into static files & integrating existing data from foreign authorities into CDM applications:

* 7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)
* 7.2 Integrate external sources of data (from a foreign authority)
* 8.1 Export Data (into a static file) that contains GUIDs
* 8.2 Serve Metadata (to a foreign authority or client application)

### 1.1 Create a new object with a GUID

A user creates a new object in the database and the application assigns a new GUID to it.

### 1.2 Assign a GUID to an object that didn't previously have a GUID

A user decides that an object in the database should be globally resolvable and identifiable, and the application assigns a new GUID to it.

### 1.3 Ask a foreign authority to assign a GUID to a new object

A user creates a new object in the application and persists it locally, but would like a foreign authority to assign an identifier to it (e.g. FOA creates a taxon concept object that represents a taxon concept that is not found in taxonConcepts.org). The application makes a request on the user's behalf that taxonConcepts.org assign a new GUID to the object.

### 2.1 Update an existing object

A user updates an existing object. The application saves the updated object as the current version of the object.

### 2.2 Update an existing object (and assign it a new GUID)

A user updates an existing object, and decides that the change is so significant that the object should be published under a new GUID. The application assigns a new GUID to the object, and updates the previous version of the object to record the fact that the old object is "replaced by" the new object.

### 2.3 Update an existing object (and state that it replaces an existing object)

A user updates an existing object, and wishes to "merge" another object into it. The application updates all of the references within the database so that they refer to the remaining object, and updates the merged object to state that it has been replaced by the remaining object.

### 2.4 Notify foreign authorities that an object has changed

A user updates an existing object. Zero or more foreign authorities hold copies of this object and have explicitly registered that they would like to be notified when the object changes. The application notifies the foreign authorities on behalf of the user.

### 2.5 Ask a foreign authority to update an existing object

A user updates an existing object that belongs to a foreign authority. The application asks the foreign authority, on behalf of the user, to update the object. The foreign authority may decline to update the object. At this point, the application should offer the user the choice of (a) reverting the object to its original state prior to the update, (b) updating the object to the new state of the object in the foreign authority, or (c) replacing the foreign object with a native object that does not share the same GUID. The application should prevent users from changing foreign objects if the authority does not permit it.

### 2.6 Find out if a foreign object has changed (and accept those changes)

A foreign object has changed in the foreign authority. The application should be capable of being notified (i.e. acting as a client in 2.4) or of actively trying to pull updates from foreign authorities. In the first instance, we will assume that the application accepts changes from the owning authority without giving users the option to decide whether they accept those changes or not.

### 2.7 Notify a foreign authority that another authority holds metadata about an object (FAN)

A foreign authority holds metadata about an object and wishes the authority of that object to include a reference to the foreign authority in any metadata response for that object, so that GUID clients can discover the metadata that the foreign authority has. The foreign authority uses the Foreign Authority Notification part of the GUID protocol to inform the authority of its existence. If the authority implements this part of the specification, it will return a reference to the foreign authority in any metadata response for that object, as per the specification.

### 2.8 Notify a foreign authority that another authority no longer holds information about an object

A foreign authority no longer wishes to resolve information about a foreign object. It uses Foreign Authority Notification to inform the authority that it no longer has any metadata about the object that it wishes to resolve. If the authority supports this part of the specification, it no longer returns a reference to the foreign authority in any metadata response for that object, as per the specification.

### 3.1 Delete an existing object (that had a GUID)

A user deletes an object. Because the object has had a GUID, the application must still resolve the object (although it may only return some metadata stating that the object has been deleted).

### 3.2 Delete an existing object (that had a GUID, and state that another object replaces it)

The reverse of 2.3: the user wishes to replace an object with another, better / more correct object. The application stores a reference to the object that replaces the deleted object.

### 3.3 Ask a foreign authority to delete an existing object

The user deletes a foreign object from the CDM store. Provided that the user thinks this should be a global delete / replace (as an example, a FOA user finds two duplicate synonyms in taxonConcepts.org with identical name and sec fields), the object is removed from the local CDM store and the request is propagated by the application to the foreign authority. In this case, it is possible for the CDM store to remove the object even if the foreign authority doesn't delete it (if the deleted object is not referenced by any other object in the CDM store).

### 4.1 Find an object based upon its GUID

A user has a (possibly made-up) GUID that should belong to this CDM store. The application either (a) returns an object, (b) throws an exception stating that the object did exist but was deleted (and maybe provides a way of retrieving that deleted object), or (c) throws an exception stating that no object with that GUID has ever existed.
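The three outcomes of 4.1, together with the rule in 3.1 that a deleted object's GUID must remain resolvable, could look roughly like the sketch below. All names (`GuidLookup`, `DeletedObjectException`, `UnknownGuidException`) are hypothetical, and an in-memory map stands in for the real CDM store.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Outcome (b): the GUID was assigned here, but the object has since been deleted. */
class DeletedObjectException extends RuntimeException {
    DeletedObjectException(String guid) {
        super("Object was deleted: " + guid);
    }
}

/** Outcome (c): no object with this GUID has ever existed in this store. */
class UnknownGuidException extends RuntimeException {
    UnknownGuidException(String guid) {
        super("No such object: " + guid);
    }
}

/** Hypothetical in-memory store; a real CDM store would sit on top of a database. */
final class GuidLookup {
    private final Map<String, Object> live = new HashMap<>();
    private final Set<String> deleted = new HashSet<>();

    void save(String guid, Object object) {
        live.put(guid, object);
    }

    /** Removes the object but remembers that its GUID was once assigned. */
    void delete(String guid) {
        if (live.remove(guid) != null) {
            deleted.add(guid);
        }
    }

    Object find(String guid) {
        Object found = live.get(guid);
        if (found != null) {
            return found;                           // (a) the object exists
        }
        if (deleted.contains(guid)) {
            throw new DeletedObjectException(guid); // (b) existed, but was deleted
        }
        throw new UnknownGuidException(guid);       // (c) never existed
    }
}
```

Keeping a record of deleted GUIDs is what lets the store distinguish (b) from (c). In a versioned store the same information could come from the audit history rather than a separate set.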

### 4.2 Resolve a foreign object based upon its GUID

In several other use cases, the application will encounter foreign GUIDs. The application should transparently retrieve those objects. If those objects are used (e.g. as part of a checklist) within the CDM store, then they should be cached (persisted). This may introduce the requirement for polling / notification of the local CDM store if the foreign object changes.

### 5.1 Find objects with GUIDs based upon their metadata

A user wishes to discover existing objects based on their metadata (properties or relations). The application should return a list of zero or more objects that match the query.

### 5.2 Find foreign objects based upon their metadata (from a specific foreign authority)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM store understands that a particular authority should be used for this type of data. The application should query the foreign authority and return a list of zero or more objects that match the query. This is similar to 1.3, but no objects are created if the foreign authority does not have any matches.

### 5.3 Find foreign objects based upon their metadata (globally, from an aggregator)

A user wishes to discover existing objects based on their metadata (properties or relations). The CDM store uses an aggregator to discover matching objects in a collection of objects belonging to a number of authorities (including perhaps this authority). The application should query the aggregator and return a list of zero or more objects that match the query.

### 6.1 Find different versions of an object

A user wishes to find out how many times an object has changed and what changes were made to it. The application returns a list of the different versions of the object.

### 6.2 Find the current version of an object

A user wishes to use the most up-to-date version of an object. The application returns the most up-to-date version of the object.

### 6.3 Find objects that are replaced by a given object

A user wants to find the object(s) that were replaced by a given object. The application returns these objects.

### 6.4 Find objects that replace a given object

A user wants to find the object(s) that replace a given object. The application returns these objects.

### 6.5 Resolve different versions of an object (given an object that has versions)

A user has a foreign object that is versioned. The application seamlessly retrieves and presents the different versions of the object (so that users can check to see how the object has changed over time). The application should prevent users from persisting old versions of the object, although retrieving the most recent version might be useful if an export is intended to be static (so that it is possible to determine the exact state of the document at the time of creation).

### 6.6 Resolve the canonical version of an object (given a version of the object)

A user has a foreign object that is a specific version of that object. The application seamlessly retrieves and presents the current version of the object (this would be required upon import of some data with versions). The application should check whether the object has been replaced by other objects and, if so, should warn the user.

### 6.7 Resolve a replacement object given the object it replaces

A user has a foreign object that is replaced by another object. The application seamlessly retrieves the replacement and presents it to the user.
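Following "replaced by" links (as in 6.4 and 6.7) amounts to walking a chain until an object with no replacement is reached. The sketch below is illustrative only: `ReplacementChain` is a hypothetical class, GUIDs are plain strings, and it assumes each object has at most one replacement, whereas the requirements above allow for several.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: resolves the current object by following "replaced by" links. */
final class ReplacementChain {

    // GUID of a replaced object -> GUID of the object that replaces it
    private final Map<String, String> replacedBy = new HashMap<>();

    void recordReplacement(String oldGuid, String newGuid) {
        replacedBy.put(oldGuid, newGuid);
    }

    /** Returns the GUID at the end of the chain; a GUID with no replacement is returned as-is. */
    String resolveCurrent(String guid) {
        Set<String> seen = new LinkedHashSet<>();
        String current = guid;
        while (replacedBy.containsKey(current)) {
            if (!seen.add(current)) {
                // Guard against inconsistent data, e.g. A replaced by B and B replaced by A.
                throw new IllegalStateException("Cycle in replacement chain: " + seen);
            }
            current = replacedBy.get(current);
        }
        return current;
    }
}
```

For foreign objects each step of the walk would be a remote resolution rather than a map lookup, so a real implementation would probably want to cache the result.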

### 6.8 Resolve the original object(s) replaced by an object

A user has a foreign object that replaces one or more other objects. The application seamlessly retrieves these objects (to allow the user to check the original objects). The application should prevent users from importing objects that have been replaced into the CDM store.

### 7.1 Import Data (from a static file / database) that contains GUIDs (some may already exist in the CDM store)

A user has a static resource that they wish to import into the CDM store. The application identifies that some objects within the resource have GUIDs. It handles issues such as: some objects already exist within the CDM store; some objects might be earlier versions of an object that has since been updated. There is business logic (and perhaps a workflow) for checking objects to find out if they have been updated. It may be a value judgement on the part of the user whether an object has changed significantly or not.

### 7.2 Integrate external sources of data (from a foreign authority)

A user wishes to search an external authority for data. The application acts as a proxy and searches the authority on behalf of the user. The external authority may return objects that the user can inspect (some may already exist in the CDM store). The user may then wish to use these foreign objects (i.e. attach them to new or existing native objects).

### 8.1 Export Data (into a static file) that contains GUIDs

A user wishes to export a subset of the data into a static file or resource (e.g. a read-only database). The application provides the data in a format that will allow clients to find and use the GUIDs in that resource (e.g. it might ensure that the version part of the GUID is included for those objects that are versioned, so that the precise object is referenced).

### 8.2 Serve Metadata (to a foreign authority or client application)

A foreign authority or client application makes a request for a particular representation of an object. The application handles this request according to the specification.

## Non-functional requirements

In addition to the functional requirements outlined above, the non-functional requirements of authentication and identity of principals across GUID authorities need to be met.

As with business rules for accepting, validating, and handling data, the implementation of rules for authorization of particular operations (e.g. for updating) and for the audit of operations should be determined by the authority. However, methods for authentication should be specified in any GUID protocol. It is also desirable to be able to provide a globally unique identifier for the user account, or for the person who holds that account, so that credit can be assigned (e.g. for improving data, or creating new data).

## Technical Issues / Decisions

1. GUID identity, Java object identity and Hibernate / database identity should mean the same thing.

Hibernate enables us to equate Java object identity (using object.equals()) with database (row) identity (using the primary key, for example). Given that two representations with the same GUID (the identifier in rdf:about), or asserted to be the same using owl:sameAs, describe the same thing, they should be the same Java object (and database row) too.

Relationships between two distinct (independently resolvable) but somehow related objects are handled using a different mechanism in RDF and, likewise, within the CDM (e.g. taxon.synonyms.synonym or taxon.relationsFromThisTaxon[type=CONGRUENT_TO] or term.generalizationOf).

The consequence of this rule is that a given CDM store cannot have more than one object with the same GUID, regardless of version. This makes programming against the CDM feasible, but also has consequences when importing representations of objects with GUIDs, or when querying foreign authorities or aggregators for objects. If an object is already present within the CDM store, the application should either (a) return the already persisted version, if that version is the most up to date, or (b) update the persisted version and return that, but it should not create a new object with a different primary key.
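The "one object per GUID" rule can be illustrated with a store keyed by GUID. This is a sketch under assumptions, not the CDM implementation: `GuidKeyedStore` and `StoredObject` are hypothetical, an in-memory map replaces Hibernate, and an object is reduced to a version number and a title.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical persisted object: one primary key and one GUID for its whole lifetime. */
final class StoredObject {
    final long primaryKey;
    final String guid;
    int version;
    String title;

    StoredObject(long primaryKey, String guid, int version, String title) {
        this.primaryKey = primaryKey;
        this.guid = guid;
        this.version = version;
        this.title = title;
    }
}

/** Hypothetical store that never holds two objects with the same GUID. */
final class GuidKeyedStore {
    private final Map<String, StoredObject> byGuid = new HashMap<>();
    private long nextKey = 1;

    /**
     * Returns the single persisted object for this GUID. A new row is created
     * only if the GUID is unknown; otherwise the existing row is returned,
     * updated in place when the incoming representation is newer.
     */
    StoredObject saveOrUpdate(String guid, int version, String title) {
        StoredObject existing = byGuid.get(guid);
        if (existing == null) {
            existing = new StoredObject(nextKey++, guid, version, title);
            byGuid.put(guid, existing);
        } else if (version > existing.version) {
            existing.version = version;   // case (b): update the persisted version
            existing.title = title;
        }                                 // case (a): persisted version is already up to date
        return existing;
    }

    int size() {
        return byGuid.size();
    }
}
```

Whatever arrives from an import or a foreign authority, the caller always gets back the same instance (and primary key) for a given GUID, which is the property that lets GUID identity, Java identity and row identity coincide.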

1. GUID assignment, discovery and harvesting are all services that should be implemented in the service layer and exposed by the controller layer in the CDM Server, because the low-level implementation of these services might need to be re-used in various contexts, e.g. in the Taxonomic Editor.

1. The version identifier of an object should be incremented every time the object is updated.

1. Objects can be deleted within the CDM: because the CDM implements versioning, objects can be removed but are still resolvable.

1. The application should prevent users from assigning a GUID that has already been used. This should be part of the validation component of the CDM.

1. Given that other identifier schemes are in use, the CDM should support these. The consequence is that the current identity implementation, which is based on LSIDs only, should be changed to support other identifier schemes (provided that they fit into the general design outlined here). It is particularly important to support HTTP URIs, as these are already being used for some terms in the TDWG LSID Vocabularies, and DOIs, as the current best practice is that references with DOIs are not assigned new GUIDs because they are considered to have one already.

1. Given the correct user permissions and access rights, users of the CDM can do more or less what they want to native objects. If there are general business rules for handling certain data types, the CDM should implement those rules as part of the data validation functionality.

1. The CDM should distinguish between foreign and native objects with identifiers and behave appropriately with these objects.
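Supporting several identifier schemes implies that the identity code must first decide which scheme a given string belongs to. The sketch below shows one way this might be done. It is an assumption-laden illustration: the `Identifier` class is hypothetical, the checks are deliberately shallow (prefix and part counts only, no full validation), and a bare DOI is recognised simply by its `10.` prefix and a `/`.

```java
import java.util.Locale;

/** Hypothetical helper that classifies identifier strings by scheme. */
final class Identifier {

    enum Scheme { LSID, HTTP_URI, DOI, UNKNOWN }

    static Scheme schemeOf(String identifier) {
        String lower = identifier.trim().toLowerCase(Locale.ROOT);
        if (lower.startsWith("urn:lsid:")) {
            // LSID syntax: urn:lsid:authority:namespace:object[:revision]
            int parts = lower.split(":", -1).length;
            return (parts == 5 || parts == 6) ? Scheme.LSID : Scheme.UNKNOWN;
        }
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return Scheme.HTTP_URI;
        }
        if (lower.startsWith("doi:") || (lower.startsWith("10.") && lower.contains("/"))) {
            return Scheme.DOI;
        }
        return Scheme.UNKNOWN;
    }

    /** Returns the revision (version) part of an LSID, or null if there is none. */
    static String lsidRevision(String identifier) {
        if (schemeOf(identifier) != Scheme.LSID) {
            return null;
        }
        String[] parts = identifier.trim().split(":", -1);
        return parts.length == 6 ? parts[5] : null;
    }
}
```

With the LSID from the use case above, `Identifier.schemeOf("urn:lsid:ipni.org:names:942108-1")` yields `LSID` and `lsidRevision` yields `null`, because the `-1` suffix belongs to the object part rather than to a revision.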

## CDM Application Architecture & Design

The following diagram shows the kind of environment that the CDM might be operating in:

The following components are required in general. The components in bold already exist to some extent.

1. **GUID data model.** Already implemented, but for LSIDs only. Could be extended / refactored to incorporate other GUID types (e.g. DOI, HTTP URI).

1. GUID Registry. This component maps authority:namespace pairs to CDM objects (and thus to services in the CDM service layer).

1. GUID Assigning Service. This component assigns new GUIDs to local CDM objects, or acts as a proxy for an external assigning service.

1. **GUID Resolution** (controller layer). This component handles GUID requests. The CDM has a functional LSID Resolution component.

1. GUID Harvesting (controller layer). This component allows aggregators to poll the service and retrieve new or updated objects for indexing. There is no standard for GUID harvesting, although something like OAI-PMH seems like a good choice.

1. GUID Discovery (controller layer). This component would allow clients to discover whether an object with a given set of properties already exists. There is no standard for GUID discovery, although in principle something like the assigning service but with GET semantics would work (i.e. don't create something if it doesn't exist). Alternatively, something like OpenURL with a custom metadata format could suffice.

1. GUID Assignment. This component is functionally similar to the discovery component, but has POST semantics, i.e. a resource is created if the request is successful. This component could be extended beyond the original LSID assignment specification to encompass object updating, i.e. a client transmits an object that belongs to the authority and already has a GUID, and the authority may update its representation as provided by the client. GUID assignment is essentially a complex operation: if a human being needs to be involved in deciding whether a request from a foreign authority should be accepted or not, then the request must be asynchronous and long-running. If such asynchronous, long-running operations are supported, then notification (of the push or pull variety) should be supported.

1. Foreign Authority Delegating DAO. This component wraps a DAO and allows the application to resolve foreign objects, persist local cached copies of foreign objects and, if the foreign authority allows it, assign foreign identifiers to new objects that it creates.
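The GUID Registry described in the list above could be as simple as a map from authority:namespace pairs to lookup functions. The following sketch is purely illustrative: `GuidRegistry` is a hypothetical class, it handles LSIDs only, the `Function` values stand in for CDM service-layer calls, and case normalisation of the authority is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical registry mapping authority:namespace pairs to lookup services. */
final class GuidRegistry {

    // "authority:namespace" -> function from the LSID object part to the resolved object
    private final Map<String, Function<String, Optional<Object>>> services = new HashMap<>();

    void register(String authority, String namespace, Function<String, Optional<Object>> service) {
        services.put(authority + ":" + namespace, service);
    }

    /** Resolves an LSID via the registered service, or returns empty if none applies. */
    Optional<Object> resolve(String lsid) {
        // LSID syntax: urn:lsid:authority:namespace:object[:revision]
        String[] parts = lsid.split(":", -1);
        if (parts.length < 5
                || !"urn".equalsIgnoreCase(parts[0])
                || !"lsid".equalsIgnoreCase(parts[1])) {
            return Optional.empty();
        }
        Function<String, Optional<Object>> service = services.get(parts[2] + ":" + parts[3]);
        return service == null ? Optional.<Object>empty() : service.apply(parts[4]);
    }
}
```

A resolution controller would then only need to parse the incoming identifier and delegate to the registry, which keeps the mapping between namespaces and services in one place.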

For a CDM application that imports or exports (foreign or local) identifiable objects, the I/O components will need special business logic to handle objects with GUIDs.

## Conclusion: where next?

This document presents a set of use cases focussed on the use of Globally Unique Identifiers to help users of a single CDM Community Store achieve their objectives. The use cases show how GUIDs can be used by data providers, aggregators and end users to discover, connect, and manage data, and to distribute data across a number of data providers. For GUIDs to be really useful, additional services are needed, such as services that allow GUIDs and their metadata to be harvested (discovered en masse, regardless of their properties) or discovered based upon their properties. In addition, existing GUID specifications should be extended to allow for long-running, asynchronous processes.

Given the nascent state of GUID resolution services, it is unlikely that such complex services will be developed in the near future. Consequently, the CDM should not attempt to develop services based upon GUIDs until the community as a whole has a shared understanding of the problem and has refined its specifications further.


It would be more productive to bear in mind the overall architectural design presented here whilst developing related areas of the CDM Java Library further e.g. de-duplication / merging, validation, web services, data-import / export.
