Data Validation & Data Integrity in the CDM

This document proposes a framework to provide a measure of data integrity and validation within the CDM. This framework is intended to consolidate data constraints currently implemented throughout the CDM API (i.e. in the data model, the persistence, service, and io layer methods) as far as possible into a single set of components that can be used by all applications based upon the CDM. It follows the DRY (Don't Repeat Yourself) principle that has already been applied to the CDM (in terms of the database & xml mapping) - that code for a given function should not be repeated in several places, leading to an increased risk of mistakes, alternative implementations and bugs. This is particularly important in the context of the CDM in that the data model is explicitly intended to be used by multiple applications - thus the applications must apply the same constraints to the data in the same way (so that the data can be shared).

The issue of validation is complicated by the fact that applications based upon the CDM may be used in different ways. The CDM is intended to be able to persist and manipulate legacy data (which might include "errors"). In addition, different applications might have different requirements - applications might be intended to handle a single checklist only, or they might be intended to handle multiple taxonomic views. Depending upon the requirements of the given application, some constraints might or might not be needed.

Thus, the approach taken here is to specify three different levels of validation. The first "Basic" level consists of simple constraints that all applications based upon the CDM must follow - if "Basic" constraints are not followed, the objects will not be persisted properly (for example, fields might be truncated) or runtime errors might be thrown by the persistence layer. Basic validation allows such errors to be detected and caught in the controller or view layer of the application and corrected without the need to hit the database.

The second and third levels of validation are taxonomic business logic and are split into context independent and context dependent validation. Taxonomic business logic consists of a set of rules, implemented in code, that depend upon multiple properties of a given object or related objects. Typically they would take the form of an if-then-else statement - for example if a name is family-group then its specific epithet must be null. Context dependent and context independent validation are distinguished because some constraints only depend upon internal properties of an object and do not depend upon the existence of other unrelated objects within a given CDM store. Consequently context-independent validation can be performed without querying the database, and is expected to be more performant. Context dependent validation cannot be performed without querying the database to check for other objects, and is thus expected to be much slower.

Because of the requirement that the CDM should be able to persist non-taxonomically-correct legacy data (e.g. misspellings, multiple basionyms etc), both context dependent and context independent validation should be optional i.e. it should be possible to persist such "invalid" objects. Depending upon the specific application, users might be "warned" that an object appears to have errors, and be forced to confirm that they are sure that they would like to persist the object with errors. The interpretation of errors and presentation of errors to the user should be implemented in controller and view-layer code and is a matter for the specific application using the validation routines. The common validation framework presented here is intended to ensure that "errors" are detected in a consistent way, not to enforce particular behaviour within an application (except in the case of basic errors which should be enforced across all CDM applications).
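
The distinction between enforced and optional validation can be illustrated with a minimal sketch. All type and method names here (@ValidationLevel@, @Violation@, @PersistencePolicy@) are hypothetical and are not part of the CDM API; the sketch only assumes that basic violations always block persistence, while the two business-logic levels block only until the user confirms.

```java
import java.util.List;

// Hypothetical names throughout: none of these types exist in the CDM API.
enum ValidationLevel { BASIC, CONTEXT_INDEPENDENT, CONTEXT_DEPENDENT }

final class Violation {
    final ValidationLevel level;
    final String message;

    Violation(ValidationLevel level, String message) {
        this.level = level;
        this.message = message;
    }
}

final class PersistencePolicy {
    private PersistencePolicy() {}

    // Basic violations always block persistence. Violations of the optional
    // levels only block until the user has confirmed the save.
    static boolean mayPersist(List<Violation> violations, boolean userConfirmed) {
        boolean hasOptionalViolation = false;
        for (Violation v : violations) {
            if (v.level == ValidationLevel.BASIC) {
                return false;
            }
            hasOptionalViolation = true;
        }
        return !hasOptionalViolation || userConfirmed;
    }
}
```

How the confirmation is obtained (a dialog, a flag on an import job, and so on) remains a matter for the individual application.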

Character Encoding

The first step in presenting data consistently across applications is to specify the character encoding that string properties use. The suggestion here is to use UTF-8 as the default character encoding and stick to it. Although it is possible to select an encoding dynamically on the output side (e.g. using the Accept-Charset request header in a web application), almost all APIs and specifications stick to UTF-8 because it is difficult to cope with other character sets on the incoming side. Many APIs (e.g. OAI-PMH) specify that documents must be in UTF-8.

This should be set by

  • Adding a CharacterEncodingFilter to the CDMServer to filter incoming strings in POST or GET requests

  • Adding (database-specific) parameters to the JDBC connection string (for MySQL you can append @useUnicode=true&characterEncoding=UTF-8@)

  • Adding (database-specific) parameters to the database config itself - for MySQL you can add the following (please check MySqlCharactersetAndCollation for more up to date information on this)

[client]

default-character-set=utf8

[mysqld]

default-character-set=utf8

character-set-server=utf8

  • Adding -Dfile.encoding=UTF-8 to tomcat init parameters

  • Adding accept-charset="UTF-8" to forms posted to the CDMServer

  • Adding charset=UTF-8 to headers of responses of the CDMServer

  • Adding encoding="UTF-8" to the first part of xml documents produced by CDM-I/O routines

  • Setting the encoding on output streams used throughout the CDM (e.g. @new OutputStreamWriter(outputStream, "UTF8")@)
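
As an illustration of the last point, the following sketch writes and reads text with an explicitly named charset, so that the result does not depend on the platform default. The class name @Utf8RoundTrip@ is hypothetical; only standard JDK classes are used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the CDM API.
final class Utf8RoundTrip {
    private Utf8RoundTrip() {}

    // Encode text through a writer that names its charset explicitly.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode bytes using the same explicit charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Using the @StandardCharsets.UTF_8@ constant rather than a charset name string also avoids the checked @UnsupportedEncodingException@ declared by the string-based constructor.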

Basic Validation

Basic validation consists of internal properties of the object that can be validated without knowing anything about the other objects within the CDM store and without knowing anything about the specific constraints being applied to the CDM store. As a consequence, this means that validation can take place with a single object and does not depend upon having access to, for example, a live database to be able to determine whether an object is valid or not. Most of the basic constraints are on individual fields, determined by limits on the size of fields or other constraints.

Where possible, these constraints should be implemented directly on the Java domain objects using annotations, preferably through an implementation of JSR 303 (Bean Validation). An obvious choice seems to be hibernate-validator, which is the reference implementation of this specification and integrates with other Hibernate-based APIs already in use within the CDM. In some cases, constraints should be implemented using @Cascade annotations to ensure related child objects are removed. Implementing validation logic using JSR 303 compliant annotations and methods will also allow integration with the Spring Framework's new validation infrastructure (http://jira.springframework.org/browse/SPR-69). In addition, some constraints can be added to the XML schema files for CDM XML, allowing documents to be validated (there is also a JSON schema definition, http://jsonschema.org, but not many Java validators).

Common constraint patterns include:

  • Objects that are “owned” by enclosing entities being deleted when the enclosing entities are deleted

  • Collection properties being represented by empty collections when there are no elements within them, not a null property.

  • String properties being restricted to a certain length (based on database storage restrictions)

  • String properties being identified as being allowed to hold html or not: if they are allowed to store html then this should be filtered to prevent cross-site scripting attacks (XSS), if they are not allowed to store html then any html should be filtered prior to persistence. CATE has an implementation of a service that wraps the antisamy tool for sanitizing html.
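
The null, emptiness and length patterns above can be sketched in plain Java as follows. With a Bean Validation implementation these checks would normally be declared as annotations on the domain objects rather than written by hand; the hand-written form is used here only so that the sketch has no dependencies. The class @BasicChecks@ and its methods are hypothetical, and the reading of "NotEmpty" without "NotNull" as "may be null, but not an empty string" is an assumption.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of the CDM API.
final class BasicChecks {
    private BasicChecks() {}

    // Checks a string property. If 'required' is false, null is accepted
    // (assumed meaning of "NotEmpty" without "NotNull" in the tables below).
    static List<String> checkString(String property, String value,
                                    boolean required, int maxLength) {
        List<String> violations = new ArrayList<>();
        if (value == null) {
            if (required) {
                violations.add(property + " must not be null");
            }
            return violations;
        }
        if (value.isEmpty()) {
            violations.add(property + " must not be empty");
        }
        if (value.length() > maxLength) {
            violations.add(property + " must be at most " + maxLength + " characters");
        }
        return violations;
    }

    // Collection properties should be empty collections, never null.
    static List<String> checkCollection(String property, Collection<?> value) {
        List<String> violations = new ArrayList<>();
        if (value == null) {
            violations.add(property + " must be an empty collection rather than null");
        }
        return violations;
    }
}
```

Returning a list of messages instead of throwing lets the calling application decide how to present several violations at once.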

CdmBase

|Property|Type|Description|Constraints|
|id|int|The primary key of the object|?should this be an Integer? then could be NotNull Positive Unsaved-Value = 0|
|uuid|UUID|The surrogate key of the object|NotNull Unique|
|created|DateTime|datetime that this object was created (persisted)|Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence|
|createdBy|User|user (principal) that created this object|Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence|

VersionableEntity

|Property|Type|Description|Constraints|
|updated|DateTime|datetime that this object was updated (persisted)|Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence|
|updatedBy|User|user (principal) that updated this object|Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence|

AnnotatableEntity

|Property|Type|Description|Constraints|
|markers|Set|markers belonging to this object|NotNull, CascadeType.DELETE|
|annotations|Set|annotations belonging to this object|NotNull,CascadeType.DELETE|

IdentifiableEntity

|Property|Type|Description|Constraints|
|lsid|LSID|Lifescience Identifier identifying this object||
|titleCache|String|synthetic label of the object|NotNull, NotEmpty Length(max = 255),HTML|
|protectedTitleCache|boolean|do not dynamically generate the titleCache according to the cache strategy for this type of object||
|rights|Set|rights assigned to this object|NotNull, CascadeType.DELETE|
|credits|Set|credits assigned to this object|NotNull, CascadeType.DELETE|
|extensions|Set|extensions added to this object|NotNull, CascadeType.DELETE|
|sources|Set|metadata about the original source (database, document (s)) used to construct this object|NotNull, CascadeType.DELETE|

eu.etaxonomy.cdm.model.agent package

AgentBase & subclasses

|Property|Type|Description|Constraints|
|contact|Contact|Contact details for this agent|NotNull|
|code|String|code for this institution|NotEmpty Length(max = 255)|
|name|String|name of this institution|NotEmpty Length(max = 255)|
|types|Set|type of institution|NotNull|
|isPartOf|Institution|institution that this institution is part of||
|nomenclaturalTitle|String|nomenclatural title of this team or person|NotEmpty Length(max = 255)|
|prefix|String|prefix of a person’s name|NotEmpty Length(max = 255)|
|firstname|String|person’s first name|NotEmpty Length(max = 255)|
|lastname|String|person’s last name|NotEmpty Length(max = 255)|
|suffix|String|suffix of a person’s name|NotEmpty Length(max = 255)|
|lifespan|TimePeriod|lifespan|NotNull|
|institutionalMemberships|Set|institutions that a person is or has been a member of|NotNull CascadeType.DELETE|
|keywords|Set|keywords categorizing a person|NotNull|
|protectedNomenclaturalTitleCache|boolean|do not dynamically generate the nomenclatural title from the strategy for this type of object||
|teamMembers|Set|members of this team|NotNull|

eu.etaxonomy.cdm.model.media package

Media & subclasses

|Property|Type|Description|Constraints|
|title|Map|title of this object|NotNull, NotEmpty CascadeType.DELETE|
|mediaCreated|DateTime|date the media was created||
|description|Map|description of this object|NotNull CascadeType.DELETE|
|representations|Set|representations of this object|NotNull, NotEmpty CascadeType.DELETE|
|artist|AgentBase|Agent who created this object||
|coveredTaxa|Set|taxa that are endpoints of this key|NotNull|
|geographicScope|Set|geographical scope of this key|NotNull|
|scopes|Set|scope restrictions of this key|NotNull|
|taxonomicScope|Set|taxonomic scope of this key|NotNull|
|usedSequences|Set|sequences used to create this phylogenetic tree|NotNull|

eu.etaxonomy.cdm.model.name package

TaxonNameBase & subclasses

|Property|Type|Description|Constraints|
|fullTitleCache|String||NotNull, NotEmpty, Length(max = 330)|
|protectedTitleCache|boolean|do not dynamically generate the fullTitleCache according to the cache strategy for this type of object||
|descriptions|Set|descriptions of this taxon name|NotNull|
|appendedPhrase|String||NotEmpty, Length(max = 255)|
|nomenclaturalMicroReference|String|microreference part of the nomenclatural reference|NotEmpty, Length(max = 255)|
|hasProblem|boolean|does this name have a problem?||
|problemStarts|int|the beginning of the problematic part of the name|default value = -1|
|problemEnds|int|the end of the problematic part of the name|default value = -1|
|typeDesignations|Set|the type designations of this name|NotNull, CascadeType.DELETE_ORPHAN|
|homotypicGroup|HomotypicalGroup|the homotypic group that this name belongs to|?NotNull?|
|relationsFromThisName|Set|the set of relationships from this name to other names|NotNull, CascadeType.DELETE_ORPHAN|
|relationsToThisName|Set|the set of relationships from other names to this name|NotNull CascadeType.DELETE_ORPHAN|
|status|Set|the nomenclatural status of this name|NotNull, CascadeType.DELETE|
|rank|Rank|the rank of this name|?NotNull?|
|nomenclaturalReference|ReferenceBase|the reference in which this name was described (see note r1)|?NotNull?|
|nameCache|String|the name part of a non viral name (excluding the authority)|NotNull, NotEmpty, Length(max = 255)|
|protectedNameCache|boolean|do not dynamically generate the nameCache according to the cache strategy for this type of object||
|authorshipCache|String|the authority part of a non viral name|NotNull, NotEmpty, Length(max = 255)|
|protectedAuthorshipCache|boolean|do not dynamically generate the authorshipCache according to the cache strategy for this type of object||
|genusOrUninomial|String|the uninomial for taxa of genus rank or above, or the generic part of a bi- or trinomial name|NotNull, NotEmpty, Length(max = 255) Pattern("[A-Z][a-z]+")|
|infraGenericEpithet|String|the infrageneric epithet for infrageneric taxa|NotEmpty, Length(max = 255) Pattern("[a-z]+")|
|specificEpithet|String|the specific part of a bi- or trinomial|NotEmpty, Length(max = 255) Pattern("[a-z]+")|
|infraspecificEpithet|String|the infraspecific part of a trinomial name|NotEmpty, Length(max = 255) Pattern("[a-z]+")|
|combinationAuthorTeam|TeamOrPersonBase|the combination author||
|exCombinationAuthorTeam|TeamOrPersonBase|the "ex" combination author||
|basionymAuthorTeam|TeamOrPersonBase|the original or basionym author|?NotNull?|
|exBasionymAuthorTeam|TeamOrPersonBase|the "ex" author of the original name or basionym||
|hybridRelationships|Set|the hybrid relationships of this name|NotNull, CascadeType.DELETE_ORPHAN|
|subGenusAuthorship|String|authorship cache for subgenus name|NotEmpty, Length(max = 255)|
|nameApprobation|String|approbation of name according to approved list|NotEmpty, Length(max = 255)|
|hybridFormula|boolean|is the name a hybrid formula||
|monomHybrid|boolean|is the name a monomial hybrid||
|binomHybrid|boolean|is the name a binomial hybrid||
|trinomHybrid|boolean|is the name a trinomial hybrid||
|anamorphic|boolean|is the name anamorphic||
|cultivarName|String|the cultivar name|NotEmpty, Length(max = 255)|
|acronym|String|the acronym|NotEmpty, Length(max = 255)|
|breed|String|the breed|NotEmpty, Length(max = 255)|
|publicationYear|Integer|the publication year of the name if a new combination|NotNegative|
|originalPublicationYear|Integer|the publication year of the original combination, or the year of publication of the name if not recombined|NotNull, NotNegative|
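
The pattern constraints on the name parts can be exercised with @java.util.regex@. The sketch below assumes the patterns are read as @[A-Z][a-z]+@ for genusOrUninomial and @[a-z]+@ for the epithets, and that a missing (null) epithet is acceptable at this level; the class @EpithetPatterns@ is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the CDM API.
final class EpithetPatterns {
    private EpithetPatterns() {}

    // Assumed reading of the patterns given in the table above.
    static final Pattern GENUS_OR_UNINOMIAL = Pattern.compile("[A-Z][a-z]+");
    static final Pattern EPITHET = Pattern.compile("[a-z]+");

    // genusOrUninomial is NotNull, so null is rejected here.
    static boolean isValidGenusOrUninomial(String value) {
        return value != null && GENUS_OR_UNINOMIAL.matcher(value).matches();
    }

    // Epithets are optional at the basic level; only a present value is checked.
    static boolean isValidEpithet(String value) {
        return value == null || EPITHET.matcher(value).matches();
    }
}
```

Patterns this strict would reject hyphenated epithets such as "novae-angliae", so legacy data may call for a more permissive expression.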

eu.etaxonomy.cdm.model.occurrence package

SpecimenOrObservationBase & subclasses

|Property|Type|Description|Constraints|
|sex|Sex|sex of this occurrence||
|individualCount|int|number of individuals in this occurrence|Positive|
|lifeStage|LifeStage|life stage of this occurrence||
|description|Map|verbatim description of this occurrence|NotNull, CascadeType.DELETE|
|descriptions|Set|descriptions of this specimen|NotNull, ?CascadeType.DELETE_ORPHAN?|
|determinations|Set|determinations of this specimen as belonging to a given species|NotNull, CascadeType.DELETE|
|derivationEvents|Set|events in which this occurrence was used to produce derived objects|NotNull, CascadeType.DELETE_ORPHAN|
|collection|Collection|the collection this derived unit belongs to||
|catalogNumber|String|the catalog number of this occurrence|NotEmpty,Length(max = 255)|
|storedUnder|TaxonNameBase|the taxonomic name this object is stored under||
|derivationEvent|DerivationEvent|the derivation event that created this object||
|accessionNumber|String|the accession number of this occurrence|NotEmpty,Length(max = 255)|
|collectorsNumber|String|the collectors number of this occurrence|NotEmpty,Length(max = 255)|
|preservation|PreservationMethod|the method used to preserve this specimen||
|fieldNumber|String|the field number assigned to this occurrence|NotEmpty,Length(max = 255)|
|fieldNotes|String|the verbatim field notes of this occurrence|NotEmpty,Length(max = 255)|
|gatheringEvent|GatheringEvent|object representing the gathering in which this field observation was observed||

eu.etaxonomy.cdm.model.reference package

ReferenceBase & subclasses

|Property|Type|Description|Constraints|
|uri|String ?URI?|uri such as LSID, DOI or handle|NotEmpty Length(max = 255), Pattern("([^:/?#]+:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")|
|nomenclaturallyRelevant|boolean|is the publication nomenclaturally relevant||
|authorTeam|TeamOrPersonBase|author of the publication||
|hasProblem|int|does this reference have a problem?| default value = ?|
|problemStarts|int|the start of the problematic part of the reference|default value = -1|
|problemEnds|int|the end of the problematic part of the reference|default value = -1|
|datePublished|TimePeriod|the period of time during which this object was published|NotNull|
|title|String|title of this publication|NotEmpty, Length(max = 4096)|
|publisher|String|publisher of this publication|NotEmpty, Length(max = 255)|
|placePublished|String|place this publication was published|NotEmpty, Length(max = 255)|
|editor|String|editor of this publication|NotEmpty, Length(max = 255)|
|volume|String|volume of this publication|NotEmpty, Length(max = 255)|
|pages|String|pages of this publication|NotEmpty, Length(max = 255)|
|inSeries|PrintSeries|series this publication belongs to||
|seriesPart|String|part of the series represented by this publication|NotEmpty, Length(max = 255)|
|inJournal|Journal|Journal that published this article||
|isbn|String|isbn of this publication|NotEmpty, Pattern("ISBN\x20(?=.{13}$)\d{1,5}([- ])\d{17}\d{16}(\dX)$") Valid check digit|
|issn|String|the issn of this serial|NotEmpty, Pattern("ISSN\x20(?=.{9}$)\d{4}([- ])\d{4} (\d|X)$") Valid check digit|
|inBook|Book|book that this section is published in||
|organization|Institution|the conference sponsor||
|institution|Institution|the institution that published this report||
|school|Institution|the institution that published this thesis||
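
The "Valid check digit" constraint on isbn and issn cannot be expressed as a regular expression alone. A minimal sketch of the weighted modulo-11 check used by ISBN-10 and ISSN follows. The class @CheckDigits@ is hypothetical, the input is assumed to be the bare number without an "ISBN " or "ISSN " prefix, and ISBN-13 (which uses a different, modulo-10 scheme) is not covered.

```java
// Hypothetical helper, not part of the CDM API.
final class CheckDigits {
    private CheckDigits() {}

    // Weighted mod-11 check: the leftmost character has weight 'length',
    // the rightmost has weight 1, and a trailing 'X' stands for the value 10.
    private static boolean mod11(String digits, int length) {
        if (digits.length() != length) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < length; i++) {
            char c = digits.charAt(i);
            int value;
            if (c >= '0' && c <= '9') {
                value = c - '0';
            } else if ((c == 'X' || c == 'x') && i == length - 1) {
                value = 10;
            } else {
                return false;
            }
            sum += value * (length - i);
        }
        return sum % 11 == 0;
    }

    // Remove the hyphens or spaces that commonly separate the groups.
    private static String strip(String s) {
        return s.replace("-", "").replace(" ", "");
    }

    static boolean isValidIsbn10(String isbn) {
        return isbn != null && mod11(strip(isbn), 10);
    }

    static boolean isValidIssn(String issn) {
        return issn != null && mod11(strip(issn), 8);
    }
}
```

For example, @CheckDigits.isValidIssn("0378-5955")@ holds because the weighted sum of the eight digits is 165, which is divisible by 11.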

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

|Property|Type|Description|Constraints|
|name|TaxonNameBase|name of this taxonConcept|NotNull|
|sec|ReferenceBase|reference circumscribing this concept|NotNull|
|synonymRelations|Set|synonym relations belonging to this synonym or taxon|NotNull, CascadeType.DELETE_ORPHAN|
|taxonomicParentCache|Taxon|taxonomic parent of this taxon||
|taxonNodes|Set|taxonomic nodes that this taxon belongs to|NotNull|
|relationsToThisTaxon|Set|relationships pointing to this taxon|NotNull, CascadeType.DELETE|
|relationsFromThisTaxon|Set|relationships pointing from this taxon|NotNull, CascadeType.DELETE|
|descriptions|Set|descriptions of this taxon|NotNull ?CascadeType.DELETE?|
|taxonStatusUnknown|boolean|is the status of this taxon unknown||
|taxonomicChildrenCount|int|the number of taxonomic children of this taxon|NotNegative|

Taxonomic Business-Logic Validation (Context-Independent)

The following constraints are also context independent, but depend upon groups of fields within an object, or depend upon related entities.

eu.etaxonomy.cdm.model.name package

NonViralName & subclasses

If the name is genus group or suprageneric

  • infragenericEpithet == null

  • specificEpithet == null

  • infraspecificEpithet == null

If the name is infrageneric

  • specificEpithet == null

  • infraspecificEpithet == null

If the name is of species rank

  • infragenericEpithet == null

  • specificEpithet != null

  • infraspecificEpithet == null

If the name is infraspecific

  • infragenericEpithet == null

  • specificEpithet != null

  • infraspecificEpithet != null

For all names, regardless of rank

  • Bidirectional relationship with the homotypic group (homotypicGroup.typifiedName == this)

  • Bidirectional relationship with name relationships (relationsFromThisName.fromName == this, relationsFromThisName.toName != null,relationsToThisName.toName == this, relationsToThisName.fromName != null)

  • If the name has not been recombined, then it should not have a combinationAuthorTeam or exCombinationAuthorTeam (if this.isOriginalCombination() then combinationAuthorTeam == null && exCombinationAuthorTeam == null)

  • If the name has been recombined, then it should have a combinationAuthorTeam (if !this.isOriginalCombination() then combinationAuthorTeam != null)

  • If a name has a basionym then they should belong to the same homotypic group

  • Bidirectional relationship with type designations (typeDesignations.typifiedName == this)

  • Bidirectional relationship with taxa (this.taxonBases.name == this)

  • Bidirectional relationship with descriptions (this.descriptions.taxonName == this)
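
The rank-dependent epithet rules listed above can be sketched as a single method that inspects only the name itself, with no database access. The types @RankGroup@, @NameParts@ and @EpithetRules@ are hypothetical simplifications, not CDM classes; in particular, the four rank groups stand in for whatever rank test the real @Rank@ class offers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplification of the rank tests used in the rules above.
enum RankGroup { GENUS_OR_ABOVE, INFRAGENERIC, SPECIES, INFRASPECIFIC }

// Hypothetical value holder for the parts of a name that the rules inspect.
final class NameParts {
    final RankGroup rankGroup;
    final String infraGenericEpithet;
    final String specificEpithet;
    final String infraSpecificEpithet;

    NameParts(RankGroup rankGroup, String infraGenericEpithet,
              String specificEpithet, String infraSpecificEpithet) {
        this.rankGroup = rankGroup;
        this.infraGenericEpithet = infraGenericEpithet;
        this.specificEpithet = specificEpithet;
        this.infraSpecificEpithet = infraSpecificEpithet;
    }
}

final class EpithetRules {
    private EpithetRules() {}

    static List<String> validate(NameParts n) {
        List<String> errors = new ArrayList<>();
        switch (n.rankGroup) {
            case GENUS_OR_ABOVE:
                mustBeNull(errors, "infraGenericEpithet", n.infraGenericEpithet);
                mustBeNull(errors, "specificEpithet", n.specificEpithet);
                mustBeNull(errors, "infraSpecificEpithet", n.infraSpecificEpithet);
                break;
            case INFRAGENERIC:
                mustBeNull(errors, "specificEpithet", n.specificEpithet);
                mustBeNull(errors, "infraSpecificEpithet", n.infraSpecificEpithet);
                break;
            case SPECIES:
                mustBeNull(errors, "infraGenericEpithet", n.infraGenericEpithet);
                mustBeSet(errors, "specificEpithet", n.specificEpithet);
                mustBeNull(errors, "infraSpecificEpithet", n.infraSpecificEpithet);
                break;
            case INFRASPECIFIC:
                mustBeNull(errors, "infraGenericEpithet", n.infraGenericEpithet);
                mustBeSet(errors, "specificEpithet", n.specificEpithet);
                mustBeSet(errors, "infraSpecificEpithet", n.infraSpecificEpithet);
                break;
        }
        return errors;
    }

    private static void mustBeNull(List<String> errors, String field, String value) {
        if (value != null) {
            errors.add(field + " must be null for this rank");
        }
    }

    private static void mustBeSet(List<String> errors, String field, String value) {
        if (value == null) {
            errors.add(field + " must not be null for this rank");
        }
    }
}
```

Because the check returns messages rather than throwing, an application can still persist a legacy name that breaks these rules after warning the user.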

eu.etaxonomy.cdm.model.occurrence package

SpecimenOrObservationBase & subclasses

  • Bidirectional relationship with descriptions (descriptions.describedSpecimenOrObservation == this)

  • Bidirectional relationship with derivation events (derivationEvents.originals.contains(this))

  • Bidirectional relationship with determination events (determinationEvents.identifiedUnit == this)

  • Bidirectional relationship with derived from derivation event (derivedFrom.derivedUnits.contains(this))

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

  • Bidirectional relationship with name (this.name.taxonBases.contains(this))

  • Bidirectional relationship with descriptions (descriptions.taxon == this)

  • Bidirectional relationship with synonyms (this.synonymRelations.taxon == this)

  • Bidirectional relationship with taxon relations (this.relationsFromThisTaxon.fromTaxon == this, this.relationsFromThisTaxon.toTaxon != null,this.relationsToThisTaxon.toTaxon == this, this.relationsToThisTaxon.fromTaxon != null)

  • If the taxon has a taxonomic parent, then the parent will be cached (this.taxonomicParentCache != null and this.taxonomicParentCache will be part of a relationship of type TAXONOMICALLY_INCLUDED_IN with the taxon)

  • If the taxon is a taxonomic parent, then the number of taxonomic children will be equal to the taxonomicChildrenCount property (the number of relationships of type TAXONOMICALLY_INCLUDED_IN will equal this.taxonomicChildrenCount)

  • If a taxon has a relationship with a synonym that is a HOMOTYPIC_SYNONYM_OF then the taxon.name and synonym.name should have a nameRelationship and be part of the same homotypic group.
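
The bidirectional constraints above all follow the same shape: each side of a relationship must be reachable from the other. A minimal sketch for the taxon and name case is shown below, using two hypothetical stand-in classes (@SketchTaxon@, @SketchName@) rather than the real CDM entities.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for TaxonNameBase and TaxonBase.
final class SketchName {
    final Set<SketchTaxon> taxonBases = new HashSet<>();
}

final class SketchTaxon {
    SketchName name;
}

final class BidirectionalCheck {
    private BidirectionalCheck() {}

    // The taxon must point at a name, and that name must list the taxon
    // among its taxonBases; otherwise the two sides are out of sync.
    static boolean nameLinkIsConsistent(SketchTaxon taxon) {
        return taxon.name != null && taxon.name.taxonBases.contains(taxon);
    }
}
```

The check needs only the object graph already in memory, which is why it belongs to the context-independent level.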

Taxonomic Business-Logic Validation (Context-Dependent)

Context dependent validation is validation that requires access to the current database in order to determine if an object is valid or not. At this stage, it is assumed that the two previous layers of validation have been performed and that the object is internally consistent.

A common aim of context dependent validation is to prevent or reduce the number of redundant (duplicate) objects in the database by warning the user that a similar or identical object already exists. The logic used in such validation (i.e. matching using whole fields, parts of fields, or values calculated from fields, e.g. Levenshtein distance) is similar to the logic used in batch deduplication routines, although of course the cost of checking a single object is much smaller than that of checking a whole database for duplicates. Detection of duplicates is complex and subjective. Any number of different algorithms for calculating similarity between objects can be used; it is important to recognise that, short of comparing fields for exact matches, there is no objective method.

As a consequence, it is suggested that any context-sensitive validation routines be designed so that other algorithms can be "plugged in".
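
One way to make the similarity algorithm pluggable is a small strategy interface. The sketch below is only an illustration: the interface @SimilarityStrategy@ and both implementations are hypothetical, and the Levenshtein distance is the standard dynamic-programming formulation.

```java
// Hypothetical strategy interface; not part of the CDM API.
interface SimilarityStrategy {
    // Returns true if the two field values should be reported as likely duplicates.
    boolean isSimilar(String a, String b);
}

// Exact comparison: the only objective criterion.
final class ExactMatch implements SimilarityStrategy {
    @Override
    public boolean isSimilar(String a, String b) {
        return a.equals(b);
    }
}

// Fuzzy comparison: values within a maximum edit distance count as similar.
final class LevenshteinMatch implements SimilarityStrategy {
    private final int maxDistance;

    LevenshteinMatch(int maxDistance) {
        this.maxDistance = maxDistance;
    }

    @Override
    public boolean isSimilar(String a, String b) {
        return distance(a, b) <= maxDistance;
    }

    // Levenshtein distance computed with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A validation routine that accepts a @SimilarityStrategy@ can then be configured per application, for example with exact matching for automated imports and a fuzzy strategy for interactive editing.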

eu.etaxonomy.cdm.model.name package

NonViralName & subclasses

  • If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet and basionymAuthorTeam, exBasionymAuthorTeam, combinationAuthorTeam, exCombinationAuthorTeam, reject, because the name already exists

  • If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet but differs in authorship, then the names should be related using a LATER_HOMONYM

  • If another name has identical authorship and a similar epithet / uninomial, then it may be a misapplied name, or it could be a spelling mistake (this could be detected using the search functionality, e.g. using a fuzzy search "specificEpithet:{term}~", or using the dictionary support)
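
The first two rules above can be sketched as a classification of a candidate name against the names already stored. In this sketch an in-memory list stands in for the database query, the four author team properties are collapsed into a single authorship string, and all type names (@NameKey@, @NameMatch@, @NameMatcher@) are hypothetical.

```java
import java.util.List;
import java.util.Objects;

enum NameMatch { NONE, DUPLICATE, POSSIBLE_HOMONYM }

// Hypothetical matching key; 'authorship' is a simplification standing in for
// basionymAuthorTeam, exBasionymAuthorTeam, combinationAuthorTeam and
// exCombinationAuthorTeam.
final class NameKey {
    final String genusOrUninomial;
    final String infraGenericEpithet;
    final String specificEpithet;
    final String infraSpecificEpithet;
    final String authorship;

    NameKey(String genusOrUninomial, String infraGenericEpithet,
            String specificEpithet, String infraSpecificEpithet, String authorship) {
        this.genusOrUninomial = genusOrUninomial;
        this.infraGenericEpithet = infraGenericEpithet;
        this.specificEpithet = specificEpithet;
        this.infraSpecificEpithet = infraSpecificEpithet;
        this.authorship = authorship;
    }

    boolean sameEpithets(NameKey other) {
        return Objects.equals(genusOrUninomial, other.genusOrUninomial)
            && Objects.equals(infraGenericEpithet, other.infraGenericEpithet)
            && Objects.equals(specificEpithet, other.specificEpithet)
            && Objects.equals(infraSpecificEpithet, other.infraSpecificEpithet);
    }
}

final class NameMatcher {
    private NameMatcher() {}

    // 'existing' stands in for the result of a database query.
    static NameMatch classify(NameKey candidate, List<NameKey> existing) {
        NameMatch result = NameMatch.NONE;
        for (NameKey other : existing) {
            if (!candidate.sameEpithets(other)) {
                continue;
            }
            if (Objects.equals(candidate.authorship, other.authorship)) {
                return NameMatch.DUPLICATE;      // same name parts and authorship
            }
            result = NameMatch.POSSIBLE_HOMONYM; // same name parts, different authorship
        }
        return result;
    }
}
```

A result of @POSSIBLE_HOMONYM@ would prompt the application to suggest a LATER_HOMONYM relationship rather than reject the name.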

eu.etaxonomy.cdm.model.reference package

ReferenceBase & subclasses

  • Reject if matches on authorTeam, datePublished, title +

    • Article: journal, volume, series, pages
    • Book: volume, pages, publisher, placePublished
    • BookSection: book
    • Generic: volume, series, pages
    • Check issn / isbn against authority file
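
The type-specific matching rules above can be sketched as a table of field names per reference type. References are represented here as simple maps from field name to value, which is an assumption made to keep the example self-contained; the class @ReferenceMatcher@ is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper, not part of the CDM API.
final class ReferenceMatcher {
    private ReferenceMatcher() {}

    // Fields compared for every reference type.
    private static final List<String> COMMON =
        Arrays.asList("authorTeam", "datePublished", "title");

    // Additional fields compared per reference type.
    private static final Map<String, List<String>> BY_TYPE = new HashMap<>();
    static {
        BY_TYPE.put("Article", Arrays.asList("journal", "volume", "series", "pages"));
        BY_TYPE.put("Book", Arrays.asList("volume", "pages", "publisher", "placePublished"));
        BY_TYPE.put("BookSection", Arrays.asList("book"));
        BY_TYPE.put("Generic", Arrays.asList("volume", "series", "pages"));
    }

    // Two references of the given type are duplicates if all compared fields
    // are equal. Two missing (null) values are treated as equal here.
    static boolean isDuplicate(String type, Map<String, String> a, Map<String, String> b) {
        List<String> extra = BY_TYPE.get(type);
        if (extra == null) {
            return false; // no rule defined for this type
        }
        for (String field : COMMON) {
            if (!Objects.equals(a.get(field), b.get(field))) {
                return false;
            }
        }
        for (String field : extra) {
            if (!Objects.equals(a.get(field), b.get(field))) {
                return false;
            }
        }
        return true;
    }
}
```

Treating two missing values as equal is a design choice; a stricter variant could require the compared fields to be present before reporting a duplicate.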

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

Within a given CDM store, some context-dependent validation can take place on Taxon concepts if the CDM store is being used to persist a single checklist as a checklist is intended to be internally coherent. If a CDM store is intended to store multiple checklists, it is more difficult to validate the data.

  • Reject a taxon that matches on name, sec

  • If the CDM store is a checklist, it should not be possible to persist two taxa (synonym & taxon) with the same name, unless the synonym is a misapplied name or is a pro-parte synonym

  • If the CDM store is a checklist, it should not be possible to persist two synonyms with the same name unless they are pro-parte synonyms

r1 If the name is a Zoological Name, then this reference is the original citation of the basionym, not the nomenclatural reference of the recombined name. If this name is a botanical name, this reference is the place where the name (basionym or recombination) has been published.
