Data Validation & Data Integrity in the CDM

Report on Data Validation in CDM by B. Clark

This document proposes a framework to provide a measure of data integrity and validation within the CDM. The framework is intended to consolidate the data constraints currently implemented throughout the CDM API (i.e. in the data model and in the persistence, service, and I/O layer methods), as far as possible, into a single set of components that can be used by all applications based upon the CDM. It follows the DRY (Don't Repeat Yourself) principle that has already been applied to the CDM (in terms of the database & XML mapping): code for a given function should not be repeated in several places, because repetition increases the risk of mistakes, divergent implementations and bugs. This is particularly important in the context of the CDM because the data model is explicitly intended to be used by multiple applications, so the applications must apply the same constraints to the data in the same way (so that the data can be shared).

The issue of validation is complicated by the fact that applications based upon the CDM may be used in different ways. The CDM is intended to be able to persist and manipulate legacy data (which might include "errors"). In addition, different applications might have different requirements: an application might be intended to handle a single checklist only, or it might be intended to handle multiple taxonomic views. Depending upon the requirements of the given application, some constraints might or might not be needed.

Thus, the approach taken here is to specify three different levels of validation. The first "Basic" level consists of simple constraints that all applications based upon the CDM must follow - if "Basic" constraints are not followed, the objects will not be persisted properly (for example, fields might be truncated) or runtime errors might be thrown by the persistence layer. Basic validation allows such errors to be detected and caught in the controller or view layer of the application and corrected without the need to hit the database.

The second and third levels of validation are taxonomic business logic and are split into context-independent and context-dependent validation. Taxonomic business logic consists of a set of rules, implemented in code, that depend upon multiple properties of a given object or of related objects. Typically they take the form of an if-then-else statement: for example, if a name is family-group then its specific epithet must be null. Context-dependent and context-independent validation are distinguished because some constraints depend only upon internal properties of an object and not upon the existence of other, unrelated objects within a given CDM store. Consequently, context-independent validation can be performed without querying the database and is expected to be more performant. Context-dependent validation cannot be performed without querying the database to check for other objects, and is thus expected to be much slower.

Because of the requirement that the CDM should be able to persist non-taxonomically-correct legacy data (e.g. misspellings, multiple basionyms etc), both context dependent and context independent validation should be optional i.e. it should be possible to persist such "invalid" objects. Depending upon the specific application, users might be "warned" that an object appears to have errors, and be forced to confirm that they are sure that they would like to persist the object with errors. The interpretation of errors and presentation of errors to the user should be implemented in controller and view-layer code and is a matter for the specific application using the validation routines. The common validation framework presented here is intended to ensure that "errors" are detected in a consistent way, not to enforce particular behaviour within an application (except in the case of basic errors which should be enforced across all CDM applications).

Character Encoding

The first step in presenting data consistently across applications is to specify the character encoding that string properties use. The suggestion here is to use UTF-8 as the default character encoding and stick to it. Although it is possible to dynamically select an encoding on the output side of things (e.g. using the Accept-Charset header in a web application), almost all APIs and specifications stick to UTF-8 because it is difficult to cope with other character sets on the incoming side. Many APIs (e.g. OAI-PMH) specify that documents must be in UTF-8.

This should be set by:

  • Adding a CharacterEncodingFilter to the CDMServer to filter incoming strings in POST or GET requests

  • Adding (database-specific) parameters to the JDBC connection string (for MySQL you can append @useUnicode=true&characterEncoding=UTF-8@)

  • Adding (database-specific) parameters to the database config itself. For MySQL you can add the following (please check MySqlCharactersetAndCollation for more up-to-date information on this):

[client]

default-character-set=utf8

[mysqld]

default-character-set=utf8

character-set-server=utf8

  • Adding -Dfile.encoding=UTF-8 to the JVM options used to start Tomcat

  • Adding accept-charset="UTF-8" to forms posted to the CDMServer

  • Adding charset=UTF-8 to headers of responses of the CDMServer

  • Adding encoding="UTF-8" to the first part of xml documents produced by CDM-I/O routines

  • Setting the encoding on output streams used throughout the CDM (e.g. @new OutputStreamWriter(outputStream, "UTF8")@)
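
As an illustration of the last point, the following is a minimal sketch of writing text with an explicitly named charset rather than the platform default. The class and method names (Utf8Output, write, encode) are hypothetical and not part of the CDM API; only the standard JDK classes are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the CDM API: it only illustrates naming
// the charset explicitly instead of relying on the platform default.
final class Utf8Output {

    private Utf8Output() {
    }

    // Writes the text to the stream as UTF-8, regardless of file.encoding.
    static void write(OutputStream out, String text) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(text);
        writer.flush(); // flush, but leave closing the stream to the caller
    }

    // Convenience wrapper that returns the encoded bytes.
    static byte[] encode(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            write(buffer, text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Passing a Charset constant instead of a charset name string avoids the checked UnsupportedEncodingException and any risk of misspelling the name.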

Basic Validation

Basic validation consists of internal properties of the object that can be validated without knowing anything about the other objects within the CDM store and without knowing anything about the specific constraints being applied to the CDM store. As a consequence, this means that validation can take place with a single object and does not depend upon having access to, for example, a live database to be able to determine whether an object is valid or not. Most of the basic constraints are on individual fields, determined by limits on the size of fields or other constraints.

Where possible, these constraints should be implemented directly on the Java domain objects using annotations, preferably an implementation of JSR 303 (Bean Validation). An obvious choice seems to be hibernate-validator, which is the reference implementation for this technology and integrates with other hibernate-based APIs already in use within the CDM. In some cases, constraints should be implemented using @Cascade annotations to ensure that related child objects are removed. Implementing validation logic using JSR 303 compliant annotations and methods will also allow integration with the springframework's new validation infrastructure (http://jira.springframework.org/browse/SPR-69). In addition, some constraints can be added to the XML schema files for CDM XML, allowing documents to be validated. (There is also a JSON schema definition, http://jsonschema.org, but not many Java validators.)
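
To make the idea of declarative, annotation-based constraints concrete, here is a minimal sketch that runs without any library on the classpath. The NotNull and Length annotations are local stand-ins declared in the example itself (with hibernate-validator they would come from javax.validation.constraints and org.hibernate.validator.constraints), and SimpleIdentifiable and BasicValidator are hypothetical names, not CDM classes. A real JSR 303 provider does considerably more than this.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Local stand-ins for the JSR 303 style annotations of the same name,
// declared here only so that the sketch compiles on its own.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NotNull {
}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Length {
    int max();
}

// Hypothetical, heavily simplified entity; not the real CDM class.
class SimpleIdentifiable {
    @NotNull
    @Length(max = 255)
    String titleCache;

    SimpleIdentifiable(String titleCache) {
        this.titleCache = titleCache;
    }
}

// Minimal reflective checker: reads the annotations and reports violations
// as messages, without touching a database.
final class BasicValidator {

    private BasicValidator() {
    }

    static List<String> validate(Object bean) {
        List<String> violations = new ArrayList<String>();
        for (Field field : bean.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            Object value;
            try {
                value = field.get(bean);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
            if (field.isAnnotationPresent(NotNull.class) && value == null) {
                violations.add(field.getName() + " must not be null");
            }
            Length length = field.getAnnotation(Length.class);
            if (length != null && value instanceof String
                    && ((String) value).length() > length.max()) {
                violations.add(field.getName() + " must be at most "
                        + length.max() + " characters");
            }
        }
        return violations;
    }
}
```

Because the checker returns a list of messages rather than throwing, the calling controller or view layer can decide how to present the violations before anything is sent to the persistence layer.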

Common constraint patterns include:

  • Objects that are “owned” by enclosing entities being deleted when the enclosing entities are deleted

  • Collection properties being represented by empty collections when there are no elements within them, not a null property.

  • String properties being restricted to a certain length (based on database storage restrictions)

  • String properties being identified as being allowed to hold HTML or not: if they are allowed to store HTML then it should be sanitized to prevent cross-site scripting (XSS) attacks; if they are not allowed to store HTML then any HTML should be stripped prior to persistence. CATE has an implementation of a service that wraps the AntiSamy tool for sanitizing HTML.
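
The "empty collection, never null" pattern from the list above could look roughly like the following sketch. SimpleAnnotatable is a hypothetical class, and the String element type merely stands in for a marker class.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified entity illustrating the "empty collection, never
// null" pattern; the String element type stands in for a marker class.
class SimpleAnnotatable {
    private Set<String> markers = new HashSet<String>();

    // Returns a read-only view, so callers can iterate without null checks.
    Set<String> getMarkers() {
        return Collections.unmodifiableSet(markers);
    }

    // A null argument is normalised to an empty set rather than stored.
    void setMarkers(Set<String> newMarkers) {
        this.markers = (newMarkers == null)
                ? new HashSet<String>()
                : new HashSet<String>(newMarkers);
    }

    void addMarker(String marker) {
        markers.add(marker);
    }
}
```

With the field initialised at declaration and the setter normalising null, a NotNull constraint on the property should never fail for objects created through this class.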

CdmBase

| Property | Type | Description | Constraints |
| id | int | The primary key of the object | ?Should this be an Integer? It could then be NotNull. Positive, Unsaved-Value = 0 |
| uuid | UUID | The surrogate key of the object | NotNull, Unique |
| created | DateTime | datetime that this object was created (persisted) | Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence |
| createdBy | User | user (principal) that created this object | Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence |

VersionableEntity

| Property | Type | Description | Constraints |
| updated | DateTime | datetime that this object was updated (persisted) | Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence |
| updatedBy | User | user (principal) that updated this object | Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence |

AnnotatableEntity

| Property | Type | Description | Constraints |
| markers | Set | markers belonging to this object | NotNull, CascadeType.DELETE |
| annotations | Set | annotations belonging to this object | NotNull, CascadeType.DELETE |

IdentifiableEntity

| Property | Type | Description | Constraints |
| lsid | LSID | Lifescience Identifier identifying this object | |
| titleCache | String | synthetic label of the object | NotNull, NotEmpty, Length(max = 255), HTML |
| protectedTitleCache | boolean | do not dynamically generate the titleCache according to the cache strategy for this type of object | |
| rights | Set | rights assigned to this object | NotNull, CascadeType.DELETE |
| credits | Set | credits assigned to this object | NotNull, CascadeType.DELETE |
| extensions | Set | extensions added to this object | NotNull, CascadeType.DELETE |
| sources | Set | metadata about the original source (database, document(s)) used to construct this object | NotNull, CascadeType.DELETE |

eu.etaxonomy.cdm.model.agent package

AgentBase & subclasses

| Property | Type | Description | Constraints |
| contact | Contact | Contact details for this agent | NotNull |
| code | String | code for this institution | NotEmpty, Length(max = 255) |
| name | String | name of this institution | NotEmpty, Length(max = 255) |
| types | Set | type of institution | NotNull |
| isPartOf | Institution | institution that this institution is part of | |
| nomenclaturalTitle | String | nomenclatural title of this team or person | NotEmpty, Length(max = 255) |
| prefix | String | prefix of a person's name | NotEmpty, Length(max = 255) |
| firstname | String | person's first name | NotEmpty, Length(max = 255) |
| lastname | String | person's last name | NotEmpty, Length(max = 255) |
| suffix | String | suffix of a person's name | NotEmpty, Length(max = 255) |
| lifespan | TimePeriod | lifespan | NotNull |
| institutionalMemberships | Set | institutions that a person is or has been a member of | NotNull, CascadeType.DELETE |
| keywords | Set | keywords categorizing a person | NotNull |
| protectedNomenclaturalTitleCache | boolean | do not dynamically generate the nomenclatural title from the strategy for this type of object | |
| teamMembers | Set | members of this team | NotNull |

eu.etaxonomy.cdm.model.media package

Media & subclasses

| Property | Type | Description | Constraints |
| title | Map | title of this object | NotNull, NotEmpty, CascadeType.DELETE |
| mediaCreated | DateTime | date the media was created | |
| description | Map | description of this object | NotNull, CascadeType.DELETE |
| representations | Set | representations of this object | NotNull, NotEmpty, CascadeType.DELETE |
| artist | AgentBase | Agent who created this object | |
| coveredTaxa | Set | taxa that are endpoints of this key | NotNull |
| geographicScope | Set | geographical scope of this key | NotNull |
| scopes | Set | scope restrictions of this key | NotNull |
| taxonomicScope | Set | taxonomic scope of this key | NotNull |
| usedSequences | Set | sequences used to create this phylogenetic tree | NotNull |

eu.etaxonomy.cdm.model.name package

TaxonNameBase & subclasses

| Property | Type | Description | Constraints |
| fullTitleCache | String | | NotNull, NotEmpty, Length(max = 330) |
| protectedFullTitleCache | boolean | do not dynamically generate the fullTitleCache according to the cache strategy for this type of object | |
| descriptions | Set | descriptions of this taxon name | NotNull |
| appendedPhrase | String | | NotEmpty, Length(max = 255) |
| nomenclaturalMicroReference | String | microreference part of the nomenclatural reference | NotEmpty, Length(max = 255) |
| hasProblem | boolean | does this name have a problem? | |
| problemStarts | int | the beginning of the problematic part of the name | default value = -1 |
| problemEnds | int | the end of the problematic part of the name | default value = -1 |
| typeDesignations | Set | the type designations of this name | NotNull, CascadeType.DELETE_ORPHAN |
| homotypicGroup | HomotypicalGroup | the homotypic group that this name belongs to | ?NotNull? |
| relationsFromThisName | Set | the set of relationships from this name to other names | NotNull, CascadeType.DELETE_ORPHAN |
| relationsToThisName | Set | the set of relationships from other names to this name | NotNull, CascadeType.DELETE_ORPHAN |
| status | Set | the nomenclatural status of this name | NotNull, CascadeType.DELETE |
| rank | Rank | the rank of this name | ?NotNull? |
| nomenclaturalReference | ReferenceBase | the reference in which this name was described (see note r1) | ?NotNull? |
| nameCache | String | the name part of a non viral name (excluding the authority) | NotNull, NotEmpty, Length(max = 255) |
| protectedNameCache | boolean | do not dynamically generate the nameCache according to the cache strategy for this type of object | |
| authorshipCache | String | the authority part of a non viral name | NotNull, NotEmpty, Length(max = 255) |
| protectedAuthorshipCache | boolean | do not dynamically generate the authorshipCache according to the cache strategy for this type of object | |
| genusOrUninomial | String | the uninomial for taxa of genus rank or above, or the generic part of a bi- or trinomial name | NotNull, NotEmpty, Length(max = 255), Pattern("[A-Z][a-z]+") |
| specificEpithet | String | the specific part of a bi- or trinomial | NotEmpty, Length(max = 255), Pattern("[a-z]+") |
| combinationAuthorTeam | TeamOrPersonBase | the combination author | |
| exCombinationAuthorTeam | TeamOrPersonBase | the "ex" combination author | |
| basionymAuthorTeam | TeamOrPersonBase | the original or basionym author | ?NotNull? |
| exBasionymAuthorTeam | TeamOrPersonBase | the "ex" author of the original name or basionym | |
| hybridRelationships | Set | the hybrid relationships of this name | NotNull, CascadeType.DELETE_ORPHAN |
| subGenusAuthorship | String | authorship cache for subgenus name | NotEmpty, Length(max = 255) |
| nameApprobation | String | approbation of name according to approved list | NotEmpty, Length(max = 255) |
| hybridFormula | boolean | is the name a hybrid formula | |
| monomHybrid | boolean | is the name a monomial hybrid | |
| binomHybrid | boolean | is the name a binomial hybrid | |
| trinomHybrid | boolean | is the name a trinomial hybrid | |
| anamorphic | boolean | is the name anamorphic | |
| cultivarName | String | the cultivar name | NotEmpty, Length(max = 255) |
| acronym | String | the acronym | NotEmpty, Length(max = 255) |
| breed | String | the breed | NotEmpty, Length(max = 255) |
| publicationYear | Integer | the publication year of the name if a new combination | NotNegative |
| originalPublicationYear | Integer | the publication year of the original combination, or the year of publication of the name if not recombined | NotNull, NotNegative |

eu.etaxonomy.cdm.model.occurrence package

SpecimenOrObservationBase & subclasses

| Property | Type | Description | Constraints |
| sex | Sex | sex of this occurrence | |
| individualCount | int | number of individuals in this occurrence | Positive |
| lifeStage | LifeStage | life stage of this occurrence | |
| description | Map | verbatim description of this occurrence | NotNull, CascadeType.DELETE |
| descriptions | Set | descriptions of this specimen | NotNull, ?CascadeType.DELETE_ORPHAN? |
| determinations | Set | determinations of this specimen as belonging to a given species | NotNull, CascadeType.DELETE |
| derivationEvents | Set | events in which this occurrence was used to produce derived objects | NotNull, CascadeType.DELETE_ORPHAN |
| collection | Collection | the collection this derived unit belongs to | |
| catalogNumber | String | the catalog number of this occurrence | NotEmpty, Length(max = 255) |
| storedUnder | TaxonNameBase | the taxonomic name this object is stored under | |
| derivationEvent | DerivationEvent | the derivation event that created this object | |
| accessionNumber | String | the accession number of this occurrence | NotEmpty, Length(max = 255) |
| collectorsNumber | String | the collectors number of this occurrence | NotEmpty, Length(max = 255) |
| preservation | PreservationMethod | the method used to preserve this specimen | |
| fieldNumber | String | the field number assigned to this occurrence | NotEmpty, Length(max = 255) |
| fieldNotes | String | the verbatim field notes of this occurrence | NotEmpty, Length(max = 255) |
| gatheringEvent | GatheringEvent | object representing the gathering in which this field observation was observed | |

eu.etaxonomy.cdm.model.reference package

ReferenceBase & subclasses

| Property | Type | Description | Constraints |
| uri | String ?URI? | uri such as LSID, DOI or handle | NotEmpty, Length(max = 255), Pattern("([^:/?#]+:)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?") |
| nomenclaturallyRelevant | boolean | is the publication nomenclaturally relevant | |
| authorTeam | TeamOrPersonBase | author of the publication | |
| hasProblem | int | does this reference have a problem? | default value = ? |
| problemStarts | int | the start of the problematic part of the reference | default value = -1 |
| problemEnds | int | the end of the problematic part of the reference | default value = -1 |
| datePublished | TimePeriod | the period of time during which this object was published | NotNull |
| title | String | title of this publication | NotEmpty, Length(max = 4096) |
| publisher | String | publisher of this publication | NotEmpty, Length(max = 255) |
| placePublished | String | place this publication was published | NotEmpty, Length(max = 255) |
| editor | String | editor of this publication | NotEmpty, Length(max = 255) |
| volume | String | volume of this publication | NotEmpty, Length(max = 255) |
| pages | String | pages of this publication | NotEmpty, Length(max = 255) |
| inSeries | PrintSeries | series this publication belongs to | |
| seriesPart | String | part of the series represented by this publication | NotEmpty, Length(max = 255) |
| inJournal | Journal | Journal that published this article | |
| isbn | String | isbn of this publication | NotEmpty, Pattern("ISBN\x20(?=.{13}$)\d{1,5}([- ])\d{1,7}\1\d{1,6}\1[\dX]$"), Valid check digit |
| inBook | Book | book that this section is published in | |
| organization | Institution | the conference sponsor | |
| institution | Institution | the institution that published this report | |
| school | Institution | the institution that published this thesis | |

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

| Property | Type | Description | Constraints |
| name | TaxonNameBase | name of this taxon concept | NotNull |
| sec | ReferenceBase | reference circumscribing this concept | NotNull |
| synonymRelations | Set | synonym relations belonging to this synonym or taxon | NotNull, CascadeType.DELETE_ORPHAN |
| taxonomicParentCache | Taxon | taxonomic parent of this taxon | |
| taxonNodes | Set | taxonomic nodes that this taxon belongs to | NotNull |
| relationsToThisTaxon | Set | relationships pointing to this taxon | NotNull, CascadeType.DELETE |
| relationsFromThisTaxon | Set | relationships pointing from this taxon | NotNull, CascadeType.DELETE |
| descriptions | Set | descriptions of this taxon | NotNull, ?CascadeType.DELETE? |
| taxonStatusUnknown | boolean | is the status of this taxon unknown | |
| taxonomicChildrenCount | int | the number of taxonomic children of this taxon | NotNegative |

Taxonomic Business-Logic Validation (Context-Independent)

The following constraints are also context independent, but depend upon groups of fields within an object, or depend upon related entities.

eu.etaxonomy.cdm.model.name package

NonViralName & subclasses

If the name is genus group or suprageneric

  • infragenericEpithet == null

  • specificEpithet == null

  • infraspecificEpithet == null

If the name is infrageneric

  • specificEpithet == null

  • infraspecificEpithet == null

If the name is of species rank

  • infragenericEpithet == null

  • specificEpithet != null

  • infraspecificEpithet == null

If the name is infraspecific

  • infragenericEpithet == null

  • specificEpithet != null

  • infraspecificEpithet != null

For all names, regardless of rank:

  • Bidirectional relationship with the homotypic group (homotypicGroup.typifiedName == this)

  • Bidirectional relationship with name relationships (relationsFromThisName.fromName == this, relationsFromThisName.toName != null,relationsToThisName.toName == this, relationsToThisName.fromName != null)

  • If the name has not been recombined, then it should not have a combinationAuthorTeam or exCombinationAuthorTeam (if this.isOriginalCombination() then combinationAuthorTeam == null && exCombinationAuthorTeam == null)

  • If the name has been recombined, then it should have a combinationAuthorTeam (if !this.isOriginalCombination() then combinationAuthorTeam != null)

  • If a name has a basionym then they should belong to the same homotypic group

  • Bidirectional relationship with type designations (typeDesignations.typifiedName == this)

  • Bidirectional relationship with taxa (this.taxonBases.name == this)

  • Bidirectional relationship with descriptions (this.descriptions.taxonName == this)
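
The rank-dependent epithet rules listed above can be sketched as a single context-independent check. This is only an illustration under simplifying assumptions: RankGroup, SimpleName and EpithetRules are hypothetical names, and the four rank groups are a coarse stand-in for the CDM's rank model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rank grouping; the real CDM rank model is richer than this.
enum RankGroup {
    GENUS_OR_ABOVE, INFRAGENERIC, SPECIES, INFRASPECIFIC
}

// Hypothetical, simplified name carrying only the fields the rules need.
class SimpleName {
    final RankGroup rankGroup;
    final String infragenericEpithet;
    final String specificEpithet;
    final String infraspecificEpithet;

    SimpleName(RankGroup rankGroup, String infragenericEpithet,
            String specificEpithet, String infraspecificEpithet) {
        this.rankGroup = rankGroup;
        this.infragenericEpithet = infragenericEpithet;
        this.specificEpithet = specificEpithet;
        this.infraspecificEpithet = infraspecificEpithet;
    }
}

final class EpithetRules {

    private EpithetRules() {
    }

    // Returns one message per violated rule; an empty list means the
    // epithets are consistent with the rank group.
    static List<String> validate(SimpleName name) {
        List<String> problems = new ArrayList<String>();
        switch (name.rankGroup) {
        case GENUS_OR_ABOVE:
            mustBeNull(problems, "infragenericEpithet", name.infragenericEpithet);
            mustBeNull(problems, "specificEpithet", name.specificEpithet);
            mustBeNull(problems, "infraspecificEpithet", name.infraspecificEpithet);
            break;
        case INFRAGENERIC:
            mustBeNull(problems, "specificEpithet", name.specificEpithet);
            mustBeNull(problems, "infraspecificEpithet", name.infraspecificEpithet);
            break;
        case SPECIES:
            mustBeNull(problems, "infragenericEpithet", name.infragenericEpithet);
            mustBeSet(problems, "specificEpithet", name.specificEpithet);
            mustBeNull(problems, "infraspecificEpithet", name.infraspecificEpithet);
            break;
        case INFRASPECIFIC:
            mustBeNull(problems, "infragenericEpithet", name.infragenericEpithet);
            mustBeSet(problems, "specificEpithet", name.specificEpithet);
            mustBeSet(problems, "infraspecificEpithet", name.infraspecificEpithet);
            break;
        default:
            break;
        }
        return problems;
    }

    private static void mustBeNull(List<String> problems, String field, String value) {
        if (value != null) {
            problems.add(field + " must be null");
        }
    }

    private static void mustBeSet(List<String> problems, String field, String value) {
        if (value == null) {
            problems.add(field + " must not be null");
        }
    }
}
```

Returning messages instead of throwing fits the requirement that this level of validation is optional: an application can show the list as warnings and still persist the object.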

eu.etaxonomy.cdm.model.occurrence package

SpecimenOrObservationBase & subclasses

  • Bidirectional relationship with descriptions (descriptions.describedSpecimenOrObservation == this)

  • Bidirectional relationship with derivation events (derivationEvents.originals.contains(this))

  • Bidirectional relationship with determination events (determinationEvents.identifiedUnit == this)

  • Bidirectional relationship with derived from derivation event (derivedFrom.derivedUnits.contains(this))

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

  • Bidirectional relationship with name (this.name.taxonBases.contains(this))

  • Bidirectional relationship with descriptions (descriptions.taxon == this)

  • Bidirectional relationship with synonyms (this.synonymRelations.taxon == this)

  • Bidirectional relationship with taxon relations (this.relationsFromThisTaxon.fromTaxon == this, this.relationsFromThisTaxon.toTaxon != null,this.relationsToThisTaxon.toTaxon == this, this.relationsToThisTaxon.fromTaxon != null)

  • If the taxon has a taxonomic parent, then the parent will be cached (this.taxonomicParentCache != null and this.taxonomicParentCache will be part of a relationship of type TAXONOMICALLY_INCLUDED_IN with the taxon)

  • If the taxon is a taxonomic parent, then the number of taxonomic children will be equal to the taxonomicChildrenCount property (the number of relationships of type TAXONOMICALLY_INCLUDED_IN will equal this.taxonomicChildrenCount)

  • If a taxon has a relationship with a synonym that is a HOMOTYPIC_SYNONYM_OF then the taxon.name and synonym.name should have a nameRelationship and be part of the same homotypic group.
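
Two of the taxon constraints above (the bidirectional link to the name, and the cached child count) can be checked on the in-memory object graph alone. The sketch below uses hypothetical stub classes (NameStub, TaxonStub, TaxonConsistency); in particular, the taxonomicChildren set is an assumed simplification standing in for the relationships of type TAXONOMICALLY_INCLUDED_IN.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stub for a name; only the back-reference to taxa is modelled.
class NameStub {
    final Set<TaxonStub> taxonBases = new HashSet<TaxonStub>();
}

// Hypothetical stub for a taxon. The taxonomicChildren set is a
// simplification standing in for TAXONOMICALLY_INCLUDED_IN relationships.
class TaxonStub {
    NameStub name;
    final Set<TaxonStub> taxonomicChildren = new HashSet<TaxonStub>();
    int taxonomicChildrenCount;
}

final class TaxonConsistency {

    private TaxonConsistency() {
    }

    // Checks two of the context-independent rules; no database is needed
    // because only the object and its directly related objects are read.
    static List<String> validate(TaxonStub taxon) {
        List<String> problems = new ArrayList<String>();
        if (taxon.name != null && !taxon.name.taxonBases.contains(taxon)) {
            problems.add("name.taxonBases does not contain this taxon");
        }
        if (taxon.taxonomicChildren.size() != taxon.taxonomicChildrenCount) {
            problems.add("taxonomicChildrenCount is " + taxon.taxonomicChildrenCount
                    + " but there are " + taxon.taxonomicChildren.size() + " children");
        }
        return problems;
    }
}
```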

Taxonomic Business-Logic Validation (Context-Dependent)

Context dependent validation is validation that requires access to the current database in order to determine if an object is valid or not. At this stage, it is assumed that the two previous layers of validation have been performed and that the object is internally consistent.

A common aim of context dependent validation is to prevent or reduce the number of redundant (duplicate) objects in the database by warning the user that a similar or identical object already exists. The logic used in such validation (i.e. matching using whole fields, parts of fields, or values calculated from fields, e.g. Levenshtein distance) is similar to the logic used in batch deduplication routines, although of course the cost of checking a single object is much smaller than that of checking a whole database for duplicates. Detection of duplicates is complex and subjective. Any number of different algorithms for calculating similarity between objects can be used; it is important to recognise that, short of comparing fields for exact matches, there is no objective method.

As a consequence, it is suggested that any context-sensitive validation routines be designed so that other algorithms can be "plugged in".
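
One possible shape for such a plug-in point is a small strategy interface, sketched below. All names (SimilarityStrategy, ExactMatch, LevenshteinSimilarity, DuplicateDetector) and the idea of a numeric threshold are assumptions made for illustration; the document does not prescribe an interface.

```java
// Hypothetical plug-in point: any similarity measure can implement it.
interface SimilarityStrategy {
    // Returns a score between 0.0 (no similarity) and 1.0 (identical).
    double similarity(String a, String b);
}

// The only "objective" strategy: exact comparison of the two values.
final class ExactMatch implements SimilarityStrategy {
    public double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }
}

// Edit-distance based strategy, normalised by the longer string's length.
final class LevenshteinSimilarity implements SimilarityStrategy {
    public double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }
}

// The detector depends only on the interface, so strategies can be swapped.
final class DuplicateDetector {
    private final SimilarityStrategy strategy;
    private final double threshold;

    DuplicateDetector(SimilarityStrategy strategy, double threshold) {
        this.strategy = strategy;
        this.threshold = threshold;
    }

    boolean looksLikeDuplicate(String candidate, String existing) {
        return strategy.similarity(candidate, existing) >= threshold;
    }
}
```

The threshold is where the subjectivity mentioned above ends up: each application would have to choose a value that suits its own data.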

eu.etaxonomy.cdm.model.name package

NonViralName & subclasses

  • If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet and basionymAuthorTeam, exBasionymAuthorTeam, combinationAuthorTeam, exCombinationAuthorTeam, reject, because the name already exists

  • If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet but differs in authorship, then the names should be related using a LATER_HOMONYM

  • If another name has identical authorship and a similar epithet / uninomial, then it may be a misapplied name, or it could be a spelling mistake (this could be detected using the search functionality, e.g. using a fuzzy search "specificEpithet:{term}~", or using the dictionary support)
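
The first two rules above amount to a simple classification of a candidate name against an existing one, sketched here. NameKey, NameMatch and NameMatcher are hypothetical names, and the single authorship string is an assumed simplification of the four author team fields. The third rule (similar but not identical epithets) is deliberately left out, since it needs a fuzzy comparison.

```java
import java.util.Objects;

// Possible outcomes of comparing a candidate name with an existing one.
enum NameMatch {
    DUPLICATE, // same name parts and same authorship: reject
    HOMONYM,   // same name parts, different authorship: relate the names
    NO_MATCH
}

// Hypothetical value object. The single authorship string is a
// simplification standing in for the four author team fields.
class NameKey {
    final String genusOrUninomial;
    final String infragenericEpithet;
    final String specificEpithet;
    final String infraspecificEpithet;
    final String authorship;

    NameKey(String genusOrUninomial, String infragenericEpithet,
            String specificEpithet, String infraspecificEpithet, String authorship) {
        this.genusOrUninomial = genusOrUninomial;
        this.infragenericEpithet = infragenericEpithet;
        this.specificEpithet = specificEpithet;
        this.infraspecificEpithet = infraspecificEpithet;
        this.authorship = authorship;
    }

    // Null-safe comparison of the four name parts.
    boolean sameNameParts(NameKey other) {
        return Objects.equals(genusOrUninomial, other.genusOrUninomial)
                && Objects.equals(infragenericEpithet, other.infragenericEpithet)
                && Objects.equals(specificEpithet, other.specificEpithet)
                && Objects.equals(infraspecificEpithet, other.infraspecificEpithet);
    }
}

final class NameMatcher {

    private NameMatcher() {
    }

    static NameMatch classify(NameKey candidate, NameKey existing) {
        if (!candidate.sameNameParts(existing)) {
            return NameMatch.NO_MATCH;
        }
        return Objects.equals(candidate.authorship, existing.authorship)
                ? NameMatch.DUPLICATE
                : NameMatch.HOMONYM;
    }
}
```

In a real application the existing names would come from a database query, which is what makes this level of validation context dependent.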

eu.etaxonomy.cdm.model.reference package

ReferenceBase & subclasses

  • Reject if matches on authorTeam, datePublished, title, plus the following type-specific fields:

    • Article: journal, volume, series, pages
    • Book: volume, pages, publisher, placePublished
    • BookSection: book
    • Generic: volume, series, pages

  • Check ISSN / ISBN against authority file

eu.etaxonomy.cdm.model.taxon package

TaxonBase & subclasses

Within a given CDM store, some context-dependent validation can take place on Taxon concepts if the CDM store is being used to persist a single checklist as a checklist is intended to be internally coherent. If a CDM store is intended to store multiple checklists, it is more difficult to validate the data.

  • Reject a taxon that matches on name, sec

  • If the CDM store is a checklist, it should not be possible to persist two taxa (synonym & taxon) with the same name, unless the synonym is a misapplied name or is a pro-parte synonym

  • If the CDM store is a checklist, it should not be possible to persist two synonyms with the same name unless they are pro-parte synonyms

Note r1: If the name is a zoological name, then this reference is the original citation of the basionym, not the nomenclatural reference of the recombined name. If the name is a botanical name, this reference is the place where the name (basionym or recombination) has been published.

Updated by Katja Luther almost 2 years ago · 19 revisions