Data Validation & Data Integrity in the CDM¶

Report on Data Validation in CDM by B. Clark

This document proposes a framework for data integrity and validation within the CDM. The framework is intended to consolidate, as far as possible, the data constraints currently implemented throughout the CDM API (i.e. in the data model and in the persistence, service and I/O layer methods) into a single set of components that can be used by all applications based upon the CDM. It follows the DRY (Don't Repeat Yourself) principle that has already been applied to the CDM (in terms of the database and XML mapping): code for a given function should not be repeated in several places, because repetition increases the risk of mistakes, divergent implementations and bugs. This is particularly important for the CDM because the data model is explicitly intended to be used by multiple applications, which must apply the same constraints to the data in the same way so that the data can be shared.

The issue of validation is complicated by the fact that applications based upon the CDM may be used in different ways. The CDM is intended to be able to persist and manipulate legacy data (which might include "errors"). In addition, different applications might have different requirements: some might be intended to handle a single checklist only, while others might be intended to handle multiple taxonomic views. Depending upon the requirements of the given application, some constraints might or might not be needed.

Thus, the approach taken here is to specify three different levels of validation. The first "Basic" level consists of simple constraints that all applications based upon the CDM must follow - if "Basic" constraints are not followed, the objects will not be persisted properly (for example, fields might be truncated) or runtime errors might be thrown by the persistence layer. Basic validation allows such errors to be detected and caught in the controller or view layer of the application and corrected without the need to hit the database.

The second and third levels of validation are taxonomic business logic, split into context-independent and context-dependent validation. Taxonomic business logic consists of rules implemented in code that depend upon multiple properties of a given object or of related objects. Typically they take the form of an if-then-else statement: for example, if a name is family-group then its specific epithet must be null. The two levels are distinguished because some constraints depend only upon the internal properties of an object and not upon the existence of other, unrelated objects within a given CDM store. Consequently, context-independent validation can be performed without querying the database and is expected to perform better. Context-dependent validation cannot be performed without querying the database to check for other objects, and is thus expected to be much slower.

Because of the requirement that the CDM should be able to persist non-taxonomically-correct legacy data (e.g. misspellings, multiple basionyms etc), both context dependent and context independent validation should be optional i.e. it should be possible to persist such "invalid" objects. Depending upon the specific application, users might be "warned" that an object appears to have errors, and be forced to confirm that they are sure that they would like to persist the object with errors. The interpretation of errors and presentation of errors to the user should be implemented in controller and view-layer code and is a matter for the specific application using the validation routines. The common validation framework presented here is intended to ensure that "errors" are detected in a consistent way, not to enforce particular behaviour within an application (except in the case of basic errors which should be enforced across all CDM applications).
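The policy described above (basic violations always block persistence, taxonomic violations are warnings that the application may let a user override) could be sketched roughly as follows. This is only an illustration: the names ValidationLevelsSketch, Level and Violation are hypothetical and are not part of the CDM API.

```java
import java.util.ArrayList;
import java.util.List;

final class ValidationLevelsSketch {

    /** The three levels of validation described in this document. */
    enum Level { BASIC, CONTEXT_INDEPENDENT, CONTEXT_DEPENDENT }

    /** One detected problem; how it is shown to the user is up to the application. */
    static final class Violation {
        final Level level;
        final String message;

        Violation(Level level, String message) {
            this.level = level;
            this.message = message;
        }
    }

    /** Basic violations must be enforced by every CDM application. */
    static boolean mustReject(List<Violation> violations) {
        for (Violation v : violations) {
            if (v.level == Level.BASIC) {
                return true;
            }
        }
        return false;
    }

    /** Taxonomic violations are returned as warnings that a user may confirm. */
    static List<Violation> warnings(List<Violation> violations) {
        List<Violation> result = new ArrayList<>();
        for (Violation v : violations) {
            if (v.level != Level.BASIC) {
                result.add(v);
            }
        }
        return result;
    }
}
```

An application would call mustReject before persisting, and present the result of warnings in its own view layer.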

Character Encoding¶

The first step in presenting data consistently across applications is to specify the character encoding that string properties use. The suggestion here is to use UTF-8 as the default character encoding and stick to it. Although it is possible to select an encoding dynamically on the output side (e.g. using the Accept-Charset header in a web application), almost all APIs and specifications stick to UTF-8 because it is difficult to cope with other character sets on the incoming side. Many APIs (e.g. OAI-PMH) specify that documents must be in UTF-8.

This should be set by

Adding a CharacterEncodingFilter to the CDMServer to filter incoming strings in POST or GET requests
Adding (database-specific) parameters to the JDBC connection string (for MySQL you can append @useUnicode=true&characterEncoding=UTF-8@)
Adding (database-specific) parameters to the database configuration itself. For MySQL you can add the following (please check MySqlCharactersetAndCollation for more up-to-date information on this):

[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
character-set-server=utf8

Adding -Dfile.encoding=UTF-8 to tomcat init parameters
Adding accept-charset="UTF-8" to forms posted to the CDMServer
Adding charset=UTF-8 to headers of responses of the CDMServer
Adding encoding="UTF-8" to the first part of xml documents produced by CDM-I/O routines
Setting the encoding on output streams used throughout the CDM (e.g. @new OutputStreamWriter(outputStream, "UTF8")@)
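
As an illustration of the last point, the following minimal sketch writes and reads text with an explicit UTF-8 charset. The class name Utf8OutputSketch is hypothetical, and using the StandardCharsets constant instead of the "UTF8" name is an assumption that Java 7 or later is available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8OutputSketch {

    /** Writes text to an in-memory stream using an explicit UTF-8 encoder. */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Passing the Charset constant avoids a misspelt charset name and does
        // not depend on the JVM default (the file.encoding system property).
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decodes bytes that are known to be UTF-8. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The same principle applies to readers: the charset should always be passed explicitly rather than left to the platform default.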

Basic Validation¶

Basic validation consists of internal properties of the object that can be validated without knowing anything about the other objects within the CDM store and without knowing anything about the specific constraints being applied to the CDM store. As a consequence, this means that validation can take place with a single object and does not depend upon having access to, for example, a live database to be able to determine whether an object is valid or not. Most of the basic constraints are on individual fields, determined by limits on the size of fields or other constraints.

Where possible, these constraints should be implemented directly on the Java domain objects using annotations, preferably an implementation of JSR 303 (Bean Validation). An obvious choice is hibernate-validator, which is the reference implementation of JSR 303 and integrates with other Hibernate-based APIs already in use within the CDM. In some cases, constraints should be implemented using @Cascade annotations to ensure that related child objects are removed. Implementing validation logic using JSR 303-compliant annotations and methods will also allow integration with the Spring Framework's new validation infrastructure (http://jira.springframework.org/browse/SPR-69). In addition, some constraints can be added to the XML schema files for CDM XML, allowing documents to be validated. (There is also a JSON Schema definition, http://jsonschema.org, but not many Java validators.)

Common constraint patterns include:

Objects that are "owned" by enclosing entities being deleted when the enclosing entities are deleted
Collection properties being represented by empty collections when there are no elements within them, not a null property.
String properties being restricted to a certain length (based on database storage restrictions)
String properties being identified as being allowed to hold HTML or not: if they are allowed to store HTML then it should be sanitized to prevent cross-site scripting (XSS) attacks; if they are not allowed to store HTML then any HTML should be stripped prior to persistence. CATE has an implementation of a service that wraps the antisamy tool for sanitizing HTML.
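
To make two of these patterns concrete, here is a small dependency-free sketch of the checks that NotNull, NotEmpty and Length(max = 255) imply for a field such as titleCache, together with the "empty collection, not null" rule. In practice JSR 303 annotations would express the same constraints declaratively; the hand-written checks, the class name BasicConstraintSketch and the Entity stand-in are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BasicConstraintSketch {

    /** Column width taken from the Length(max = 255) constraints in this document. */
    static final int MAX_LENGTH = 255;

    /** Hypothetical stand-in for an identifiable CDM entity. */
    static final class Entity {
        String titleCache;
        // Initialised eagerly so that "no elements" is an empty set, never null.
        Set<String> rights = new HashSet<>();
    }

    /** Returns human-readable messages; an empty list means the entity passed. */
    static List<String> validate(Entity e) {
        List<String> errors = new ArrayList<>();
        if (e.titleCache == null) {
            errors.add("titleCache must not be null");
        } else if (e.titleCache.isEmpty()) {
            errors.add("titleCache must not be empty");
        } else if (e.titleCache.length() > MAX_LENGTH) {
            errors.add("titleCache must not exceed " + MAX_LENGTH + " characters");
        }
        if (e.rights == null) {
            errors.add("rights must be an empty set rather than null");
        }
        return errors;
    }
}
```

Because none of these checks touches the database, they can run in the controller or view layer before any attempt to persist the object.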

CdmBase¶

Property	Type	Description	Constraints
id	int	The primary key of the object	?should this be an Integer? then could be NotNull Positive Unsaved-Value = 0
uuid	UUID	The surrogate key of the object	NotNull Unique
created	DateTime	datetime that this object was created (persisted)	Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence
createdBy	User	user (principal) that created this object	Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence

VersionableEntity¶

Property	Type	Description	Constraints
updated	DateTime	datetime that this object was updated (persisted)	Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence
updatedBy	User	user (principal) that updated this object	Should be set by hibernate listeners on persistence, thus cannot validate pre-persistence

AnnotatableEntity¶

Property	Type	Description	Constraints
markers	Set	markers belonging to this object	NotNull, CascadeType.DELETE
annotations	Set	annotations belonging to this object	NotNull,CascadeType.DELETE

IdentifiableEntity¶

Property	Type	Description	Constraints
lsid	LSID	Life Science Identifier identifying this object
titleCache	String	synthetic label of the object	NotNull, NotEmpty, Length(max = 255), HTML
protectedTitleCache	boolean	do not dynamically generate the titleCache according to the cache strategy for this type of object
rights	Set	rights assigned to this object	NotNull, CascadeType.DELETE
credits	Set	credits assigned to this object	NotNull, CascadeType.DELETE
extensions	Set	extensions added to this object	NotNull, CascadeType.DELETE
sources	Set	metadata about the original source (database, document (s)) used to construct this object	NotNull, CascadeType.DELETE

eu.etaxonomy.cdm.model.agent package¶

AgentBase & subclasses¶

Property	Type	Description	Constraints
contact	Contact	Contact details for this agent	NotNull
code	String	code for this institution	NotEmpty Length(max = 255)
name	String	name of this institution	NotEmpty Length(max = 255)
types	Set	type of institution	NotNull
isPartOf	Institution	institution that this institution is part of
nomenclaturalTitle	String	nomenclatural title of this team or person	NotEmpty Length(max = 255)
prefix	String	prefix of a person's name	NotEmpty Length(max = 255)
firstname	String	person's first name	NotEmpty Length(max = 255)
lastname	String	person's last name	NotEmpty Length(max = 255)
suffix	String	suffix of a person's name	NotEmpty Length(max = 255)
lifespan	TimePeriod	lifespan	NotNull
institutionalMemberships	Set	institutions that a person is or has been a member of	NotNull, CascadeType.DELETE
keywords	Set	keywords categorizing a person	NotNull
protectedNomenclaturalTitleCache	boolean	do not dynamically generate the nomenclatural title from the strategy for this type of object
teamMembers	Set	members of this team	NotNull

eu.etaxonomy.cdm.model.media package¶

Media & subclasses¶

Property	Type	Description	Constraints
title	Map	title of this object	NotNull, NotEmpty CascadeType.DELETE
mediaCreated	DateTime	date the media was created
description	Map	description of this object	NotNull CascadeType.DELETE
representations	Set	representations of this object	NotNull, NotEmpty CascadeType.DELETE
artist	AgentBase	Agent who created this object
coveredTaxa	Set	taxa that are endpoints of this key	NotNull
geographicScope	Set	geographical scope of this key	NotNull
scopes	Set	scope restrictions of this key	NotNull
taxonomicScope	Set	taxonomic scope of this key	NotNull
usedSequences	Set	sequences used to create this phylogenetic tree	NotNull

eu.etaxonomy.cdm.model.name package¶

TaxonNameBase & subclasses¶

Property	Type	Description	Constraints
fullTitleCache	String		NotNull, NotEmpty, Length(max = 330)
protectedTitleCache	boolean	do not dynamically generate the fullTitleCache according to the cache strategy for this type of object
descriptions	Set	descriptions of this taxon name	NotNull
appendedPhrase	String		NotEmpty, Length(max = 255)
nomenclaturalMicroReference	String	microreference part of the nomenclatural reference	NotEmpty, Length(max = 255)
hasProblem	boolean	does this name have a problem?
problemStarts	int	the beginning of the problematic part of the name	default value = -1
problemEnds	int	the end of the problematic part of the name	default value = -1
typeDesignations	Set	the type designations of this name	NotNull, CascadeType.DELETE_ORPHAN
homotypicGroup	HomotypicalGroup	the homotypic group that this name belongs to	?NotNull?
relationsFromThisName	Set	the set of relationships from this name to other names	NotNull, CascadeType.DELETE_ORPHAN
relationsToThisName	Set	the set of relationships from other names to this name	NotNull CascadeType.DELETE_ORPHAN
status	Set	the nomenclatural status of this name	NotNull, CascadeType.DELETE
rank	Rank	the rank of this name	?NotNull?
nomenclaturalReference	ReferenceBase	the reference in which this name was described (see note r1)	?NotNull?
nameCache	String	the name part of a non viral name (excluding the authority)	NotNull, NotEmpty, Length(max = 255)
protectedNameCache	boolean	do not dynamically generate the nameCache according to the cache strategy for this type of object
authorshipCache	String	the authority part of a non viral name	NotNull, NotEmpty, Length(max = 255)
protectedAuthorshipCache	boolean	do not dynamically generate the authorshipCache according to the cache strategy for this type of object
genusOrUninomial	String	the uninomial for taxa of genus rank or above, or the generic part of a bi- or trinomial name	NotNull, NotEmpty, Length(max = 255) Pattern("[A-Z][a-z]+")
specificEpithet	String	the specific part of a bi- or trinomial	NotEmpty, Length(max = 255) Pattern("[a-z]+")
combinationAuthorTeam	TeamOrPersonBase	the combination author
exCombinationAuthorTeam	TeamOrPersonBase	the "ex" combination author
basionymAuthorTeam	TeamOrPersonBase	the original or basionym author	?NotNull?
exBasionymAuthorTeam	TeamOrPersonBase	the "ex" author of the original name or basionym
hybridRelationships	Set	the hybrid relationships of this name	NotNull, CascadeType.DELETE_ORPHAN
subGenusAuthorship	String	authorship cache for subgenus name	NotEmpty, Length(max = 255)
nameApprobation	String	approbation of name according to approved list	NotEmpty, Length(max = 255)
hybridFormula	boolean	is the name a hybrid formula
monomHybrid	boolean	is the name a monomial hybrid
binomHybrid	boolean	is the name a binomial hybrid
trinomHybrid	boolean	is the name a trinomial hybrid
anamorphic	boolean	is the name anamorphic
cultivarName	String	the cultivar name	NotEmpty, Length(max = 255)
acronym	String	the acronym	NotEmpty, Length(max = 255)
breed	String	the breed	NotEmpty, Length(max = 255)
publicationYear	Integer	the publication year of the name if a new combination	NotNegative
originalPublicationYear	Integer	the publication year of the original combination, or the year of publication of the name if not recombined	NotNull, NotNegative

eu.etaxonomy.cdm.model.occurrence package¶

SpecimenOrObservationBase & subclasses¶

Property	Type	Description	Constraints
sex	Sex	sex of this occurrence
individualCount	int	number of individuals in this occurrence	Positive
lifeStage	LifeStage	life stage of this occurrence
description	Map	verbatim description of this occurrence	NotNull, CascadeType.DELETE
descriptions	Set	descriptions of this specimen	NotNull, ?CascadeType.DELETE_ORPHAN?
determinations	Set	determinations of this specimen as belonging to a given species	NotNull, CascadeType.DELETE
derivationEvents	Set	events in which this occurrence was used to produce derived objects	NotNull, CascadeType.DELETE_ORPHAN
collection	Collection	the collection this derived unit belongs to
catalogNumber	String	the catalog number of this occurrence	NotEmpty,Length(max = 255)
storedUnder	TaxonNameBase	the taxonomic name this object is stored under
derivationEvent	DerivationEvent	the derivation event that created this object
accessionNumber	String	the accession number of this occurrence	NotEmpty,Length(max = 255)
collectorsNumber	String	the collectors number of this occurrence	NotEmpty,Length(max = 255)
preservation	PreservationMethod	the method used to preserve this specimen
fieldNumber	String	the field number assigned to this occurrence	NotEmpty,Length(max = 255)
fieldNotes	String	the verbatim field notes of this occurrence	NotEmpty,Length(max = 255)
gatheringEvent	GatheringEvent	object representing the gathering in which this field observation was observed

eu.etaxonomy.cdm.model.reference package¶

ReferenceBase & subclasses¶

Property	Type	Description	Constraints
uri	String ?URI?	uri such as LSID, DOI or handle	NotEmpty Length(max = 255), Pattern("^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
nomenclaturallyRelevant	boolean	is the publication nomenclaturally relevant
authorTeam	TeamOrPersonBase	author of the publication
hasProblem	int	does this reference have a problem?	default value = ?
problemStarts	int	the start of the problematic part of the reference	default value = -1
problemEnds	int	the end of the problematic part of the reference	default value = -1
datePublished	TimePeriod	the period of time during which this object was published	NotNull
title	String	title of this publication	NotEmpty, Length(max = 4096)
publisher	String	publisher of this publication	NotEmpty, Length(max = 255)
placePublished	String	place this publication was published	NotEmpty, Length(max = 255)
editor	String	editor of this publication	NotEmpty, Length(max = 255)
volume	String	volume of this publication	NotEmpty, Length(max = 255)
pages	String	pages of this publication	NotEmpty, Length(max = 255)
inSeries	PrintSeries	series this publication belongs to
seriesPart	String	part of the series represented by this publication	NotEmpty, Length(max = 255)
inJournal	Journal	Journal that published this article
isbn	String	isbn of this publication	NotEmpty, Pattern("^ISBN\x20(?=.{13}$)\d{1,5}([- ])\d{1,7}\1\d{1,6}\1(\d|X)$") Valid check digit
inBook	Book	book that this section is published in
organization	Institution	the conference sponsor
institution	Institution	the institution that published this report
school	Institution	the institution that published this thesis
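
The "Valid check digit" constraint on isbn cannot be expressed as a pattern alone. A minimal sketch of the ISBN-10 check-digit rule is given below; it assumes that any "ISBN " prefix has already been removed, it does not cover 13-digit ISBNs, and the class name Isbn10CheckDigitSketch is hypothetical.

```java
final class Isbn10CheckDigitSketch {

    /**
     * Returns true if the ISBN-10 check digit is valid. Hyphens and spaces
     * are ignored; the "ISBN " prefix is assumed to have been stripped.
     */
    static boolean hasValidCheckDigit(String isbn) {
        String digits = isbn.replace("-", "").replace(" ", "");
        if (digits.length() != 10) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = digits.charAt(i);
            int value;
            if (c >= '0' && c <= '9') {
                value = c - '0';
            } else if (c == 'X' && i == 9) {
                value = 10; // 'X' stands for ten and is only legal in the last position
            } else {
                return false;
            }
            sum += (10 - i) * value; // weights run from 10 down to 1
        }
        // The weighted sum of a valid ISBN-10 is divisible by 11.
        return sum % 11 == 0;
    }
}
```

In a Bean Validation setting this logic would typically live in a custom constraint validator attached to the isbn property.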

eu.etaxonomy.cdm.model.taxon package¶

TaxonBase & subclasses¶

Property	Type	Description	Constraints
name	TaxonNameBase	name of this taxonConcept	NotNull
sec	ReferenceBase	reference circumscribing this concept	NotNull
synonymRelations	Set	synonym relations belonging to this synonym or taxon	NotNull, CascadeType.DELETE_ORPHAN
taxonomicParentCache	Taxon	taxonomic parent of this taxon
taxonNodes	Set	taxonomic nodes that this taxon belongs to	NotNull
relationsToThisTaxon	Set	relationships pointing to this taxon	NotNull, CascadeType.DELETE
relationsFromThisTaxon	Set	relationships pointing from this taxon	NotNull, CascadeType.DELETE
descriptions	Set	descriptions of this taxon	NotNull ?CascadeType.DELETE?
taxonStatusUnknown	boolean	is the status of this taxon unknown
taxonomicChildrenCount	int	the number of taxonomic children of this taxon	NotNegative

Taxonomic Business-Logic Validation (Context-Independent)¶

The following constraints are also context independent, but depend upon groups of fields within an object, or depend upon related entities.

eu.etaxonomy.cdm.model.name package¶

NonViralName & subclasses¶

If the name is genus group or suprageneric

infragenericEpithet == null
specificEpithet == null
infraspecificEpithet == null

If the name is infrageneric

specificEpithet == null
infraspecificEpithet == null

If the name is of species rank

infragenericEpithet == null
specificEpithet != null
infraspecificEpithet == null

If the name is infraspecific

infragenericEpithet == null
specificEpithet != null
infraspecificEpithet != null

In all cases

Bidirectional relationship with the homotypic group (homotypicGroup.typifiedName == this)
Bidirectional relationship with name relationships (relationsFromThisName.fromName == this, relationsFromThisName.toName != null, relationsToThisName.toName == this, relationsToThisName.fromName != null)
If the name has not been recombined, then it should not have a combinationAuthorTeam or exCombinationAuthorTeam (if this.isOriginalCombination() then combinationAuthorTeam == null && exCombinationAuthorTeam == null)
If the name has been recombined, then it should have a combinationAuthorTeam (if !this.isOriginalCombination() then combinationAuthorTeam != null)
If a name has a basionym then they should belong to the same homotypic group
Bidirectional relationship with type designations (typeDesignations.typifiedName == this)
Bidirectional relationship with taxa (this.taxonBases.name == this)
Bidirectional relationship with descriptions (this.descriptions.taxonName == this)
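
The rank-dependent epithet rules listed above could be sketched as follows. The RankGroup enum and the Name class are hypothetical simplifications introduced for this example (the CDM Rank class is richer), and a null rank is simply skipped because the rank property is not necessarily mandatory.

```java
import java.util.ArrayList;
import java.util.List;

final class EpithetRuleSketch {

    /** Hypothetical coarse grouping of ranks. */
    enum RankGroup { GENUS_OR_ABOVE, INFRAGENERIC, SPECIES, INFRASPECIFIC }

    /** Hypothetical stand-in for the relevant NonViralName fields. */
    static final class Name {
        RankGroup rankGroup;
        String infragenericEpithet;
        String specificEpithet;
        String infraspecificEpithet;
    }

    static List<String> validate(Name n) {
        List<String> errors = new ArrayList<>();
        if (n.rankGroup == null) {
            return errors; // nothing can be checked without a rank
        }
        switch (n.rankGroup) {
            case GENUS_OR_ABOVE:
                requireNull(errors, "infragenericEpithet", n.infragenericEpithet);
                requireNull(errors, "specificEpithet", n.specificEpithet);
                requireNull(errors, "infraspecificEpithet", n.infraspecificEpithet);
                break;
            case INFRAGENERIC:
                requireNull(errors, "specificEpithet", n.specificEpithet);
                requireNull(errors, "infraspecificEpithet", n.infraspecificEpithet);
                break;
            case SPECIES:
                requireNull(errors, "infragenericEpithet", n.infragenericEpithet);
                requireNotNull(errors, "specificEpithet", n.specificEpithet);
                requireNull(errors, "infraspecificEpithet", n.infraspecificEpithet);
                break;
            case INFRASPECIFIC:
                requireNull(errors, "infragenericEpithet", n.infragenericEpithet);
                requireNotNull(errors, "specificEpithet", n.specificEpithet);
                requireNotNull(errors, "infraspecificEpithet", n.infraspecificEpithet);
                break;
        }
        return errors;
    }

    private static void requireNull(List<String> errors, String field, String value) {
        if (value != null) {
            errors.add(field + " must be null for this rank");
        }
    }

    private static void requireNotNull(List<String> errors, String field, String value) {
        if (value == null) {
            errors.add(field + " must not be null for this rank");
        }
    }
}
```

Returning a list of messages rather than throwing keeps these rules optional, so legacy data that breaks them can still be persisted if the application allows it.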

eu.etaxonomy.cdm.model.occurrence package¶

SpecimenOrObservationBase & subclasses¶

Bidirectional relationship with descriptions (descriptions.describedSpecimenOrObservation == this)
Bidirectional relationship with derivation events (derivationEvents.originals.contains(this))
Bidirectional relationship with determination events (determinationEvents.identifiedUnit == this)
Bidirectional relationship with derived from derivation event (derivedFrom.derivedUnits.contains(this))

eu.etaxonomy.cdm.model.taxon package¶

TaxonBase & subclasses¶

Bidirectional relationship with name (this.name.taxonBases == this)
Bidirectional relationship with descriptions (descriptions.taxon == this)
Bidirectional relationship with synonyms (this.synonymRelations.taxon == this)
Bidirectional relationship with taxon relations (this.relationsFromThisTaxon.fromTaxon == this, this.relationsFromThisTaxon.toTaxon != null,this.relationsToThisTaxon.toTaxon == this, this.relationsToThisTaxon.fromTaxon != null)
If the taxon has a taxonomic parent, then the parent will be cached (this.taxonomicParentCache != null and this.taxonomicParentCache will be part of a relationship of type TAXONOMICALLY_INCLUDED_IN with the taxon)
If the taxon is a taxonomic parent, then the number of taxonomic children will be equal to the taxonomicChildrenCount property (the number of relationships of type TAXONOMICALLY_INCLUDED_IN will equal this.taxonomicChildrenCount)
If a taxon has a relationship with a synonym that is a HOMOTYPIC_SYNONYM_OF then the taxon.name and synonym.name should have a nameRelationship and be part of the same homotypic group.

Taxonomic Business-Logic Validation (Context-Dependent)¶

Context dependent validation is validation that requires access to the current database in order to determine if an object is valid or not. At this stage, it is assumed that the two previous layers of validation have been performed and that the object is internally consistent.

A common aim of context-dependent validation is to prevent or reduce the number of redundant (duplicate) objects in the database by warning the user that a similar or identical object already exists. The logic used in such validation (i.e. matching using whole fields, parts of fields, or values calculated from fields, e.g. Levenshtein distance) is similar to the logic used in batch deduplication routines, although of course the cost of checking a single object is much smaller than that of checking a whole database for duplicates. Detection of duplicates is complex and subjective. Any number of different algorithms for calculating similarity between objects can be used; it is important to recognise that, short of comparing fields for exact matches, there is no objective method.

As a consequence, it is suggested that any context-sensitive validation routines be designed so that other algorithms can be "plugged in".
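
One way such a plug-in point might look is sketched below, with Levenshtein distance as one possible strategy. The Distance interface, the class names and the threshold parameter are hypothetical and only illustrate the idea; they are not an existing CDM API.

```java
final class SimilaritySketch {

    /** Hypothetical plug-in point: any string distance can be supplied. */
    interface Distance {
        int between(String a, String b);
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static final class Levenshtein implements Distance {
        @Override
        public int between(String a, String b) {
            int[] previous = new int[b.length() + 1];
            int[] current = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                previous[j] = j; // distance from the empty prefix of a
            }
            for (int i = 1; i <= a.length(); i++) {
                current[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    current[j] = Math.min(
                            Math.min(current[j - 1] + 1, previous[j] + 1),
                            previous[j - 1] + cost);
                }
                int[] swap = previous;
                previous = current;
                current = swap;
            }
            return previous[b.length()];
        }
    }

    /** Flags a candidate as a possible duplicate when it is close but not identical. */
    static boolean isPossibleDuplicate(String a, String b, Distance distance, int threshold) {
        int d = distance.between(a, b);
        return d > 0 && d <= threshold;
    }
}
```

Because the caller passes the Distance implementation, an application could substitute a phonetic or token-based measure without changing the validation routine itself.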

eu.etaxonomy.cdm.model.name package¶

NonViralName & subclasses¶

If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet and basionymAuthorTeam, exBasionymAuthorTeam, combinationAuthorTeam, exCombinationAuthorTeam, reject, because the name already exists
If another name matches on genusOrUninomial, infragenericEpithet, specificEpithet, infraspecificEpithet but differs in authorship, then the names should be related using a LATER_HOMONYM
If another name has identical authorship and a similar epithet / uninomial, then it may be a misapplied name, or it could be a spelling mistake (this could be detected using the search functionality, e.g. using a fuzzy search "specificEpithet:{term}~", or using the dictionary support)

eu.etaxonomy.cdm.model.reference package¶

ReferenceBase & subclasses¶

Reject if another reference matches on authorTeam, datePublished and title, plus the following type-specific fields:
- Article: journal, volume, series, pages
- Book: volume, pages, publisher, placePublished
- BookSection: book
- Generic: volume, series, pages

Check ISSN / ISBN against an authority file

eu.etaxonomy.cdm.model.taxon package¶

TaxonBase & subclasses¶

Within a given CDM store, some context-dependent validation can take place on Taxon concepts if the CDM store is being used to persist a single checklist as a checklist is intended to be internally coherent. If a CDM store is intended to store multiple checklists, it is more difficult to validate the data.

Reject a taxon that matches on name, sec
If the CDM store is a checklist, it should not be possible to persist two taxa (synonym & taxon) with the same name, unless the synonym is a misapplied name or is a pro-parte synonym
If the CDM store is a checklist, it should not be possible to persist two synonyms with the same name unless they are pro-parte synonyms
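
The first rule (reject a taxon that matches on name and sec) could be sketched as below. The in-memory set is an assumption that stands in for a database query, the identifiers are treated as plain strings purely for illustration, and the class name TaxonDuplicateSketch is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class TaxonDuplicateSketch {

    /** Keys of taxa already in the store; a real check would query the database. */
    private final Set<String> existing = new HashSet<>();

    private static String key(String nameId, String secId) {
        return nameId + "|" + secId;
    }

    /**
     * Registers a taxon identified by its name and sec reference.
     * Returns false, leaving the store unchanged, if the pair already exists.
     */
    boolean register(String nameId, String secId) {
        return existing.add(key(nameId, secId));
    }
}
```

The checklist-specific rules would need additional information (relationship type, pro-parte flag) and are deliberately left out of this sketch.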

r1 If the name is a Zoological Name, then this reference is the original citation of the basionym, not the nomenclatural reference of the recombined name. If this name is a botanical name, this reference is the place where the name (basionym or recombination) has been published.

Updated by Katja Luther about 2 years ago · 19 revisions
