(Dynamic) Taxon Concepts

Introduction

A taxon concept describes a set of organisms. It is used to divide organisms into those that belong to the concept and those that do not, and it describes, in one way or another, certain characters of the organisms that belong to it.

Types of Concepts

Concepts may be described in different ways, but ultimately they always relate to characters of organisms or potential organisms.
The following types of description can be distinguished:

Description Based

Description-based concepts explicitly define which values/states of a given character an organism may or may not have in order to belong to the concept. The description may be morphological, molecular or of any other kind, as long as it helps to decide whether an organism belongs to the concept.
Often the description of the concept itself does not contain all characters but only those needed to distinguish it from its direct neighbors. In this case the descriptions of the parent concepts are also essential for a complete concept description: everything stated in the description of a parent concept must also be true for the given concept.
In some cases the additive descriptions of all child concepts are also included in the definition of the given concept.

Name Based

Name-based concepts define the taxon names or groups of taxon names (homotypic groups) that belong to the concept. This works because a taxon name is always based on a type specimen (or a group of type specimens), these specimens have characters, and the protologue of the name (or of its basionym) contains a description.
So, indirectly, name-based concepts are also based on descriptions.
Because name-based concepts are based on single specimens/specimen descriptions, they inherently include all child concepts, while parent concepts are part of the concept only insofar as the parent defines external/neighboring concepts.
For this concept type the completeness of the homotypic groups in the synonymy is essential (although it may be a (regionally) filtered completeness).

Externally defined

Some concepts only refer to concepts defined elsewhere. These concepts are links to, or exact copies of, the original concept rather than concepts in their own right.

Concept Borders

The border between two concepts can be described by explicitly defining which (combinations of) character states belong to the given concept (internal definition). Additionally, it can be described by stating that certain character states do NOT belong to the given concept or, even better, that they belong to another (neighboring?) concept (external definition). In the latter case the internal definitions of the neighboring concepts are explicitly part of the complete definition of the given concept.

Dynamic Concepts

Persistence

Traditionally, taxon concepts are published in print publications. They are therefore defined by the unchangeable data explicitly mentioned in the publication or in related or referenced publications.
More recently, concepts are also published online. As online publications are usually not persistent, the concept definition may change over time. This makes it difficult to verify a statement like "organism A belongs to concept B" at a later time, as concept B may have changed in the meantime.
Therefore concepts must either guarantee that they stay persistent for a long time (or forever), or they must clearly declare that they are not persistent.

Provenance / Concept Relation Types

If concepts change over time, it is essential for many use cases that relationships between old and more recent concepts are defined.
These relationships may be defined by the well-known set-based relationships:

  • congruent (does not make much sense for concept provenance)
  • included in/includes
  • overlapping
  • distinct
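As a minimal sketch of the set-based relationships listed above, the following assumes that each concept can be reduced to a set of member identifiers (for example homotypic group or type specimen identifiers). The names `Relation` and `relate` are hypothetical and only serve the illustration.

```python
from enum import Enum
from typing import AbstractSet


class Relation(Enum):
    CONGRUENT = "congruent"
    INCLUDES = "includes"
    INCLUDED_IN = "included in"
    OVERLAPPING = "overlapping"
    DISTINCT = "distinct"


def relate(a: AbstractSet[str], b: AbstractSet[str]) -> Relation:
    """Classify how concept a relates to concept b by comparing member sets."""
    if a == b:
        return Relation.CONGRUENT
    if a > b:  # a is a proper superset of b
        return Relation.INCLUDES
    if a < b:  # a is a proper subset of b
        return Relation.INCLUDED_IN
    if a & b:  # some, but not all, members are shared
        return Relation.OVERLAPPING
    return Relation.DISTINCT
```

For example, `relate({"HG1", "HG2"}, {"HG1"})` yields `Relation.INCLUDES`, which corresponds to the case where an old concept was split.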

Furthermore, it may be important to know the

  • (time based) direction of the relationship

in order to know which concept is the successor and which is the precursor. This is especially useful for use cases where someone refers to an old concept and wants a list of current concepts that are potential successors of the old concept.
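The lookup of potential current successors could be sketched as follows. The sketch assumes that direct successor links are available as a plain mapping from a concept identifier to its successor identifiers, and that a concept without successors is current unless it was explicitly deleted. The function name, the mapping and the `deleted` parameter are assumptions made for the illustration.

```python
from typing import Dict, Iterable, Optional, Set


def current_successors(
    concept_id: str,
    successors: Dict[str, Iterable[str]],
    deleted: Optional[Set[str]] = None,
) -> Set[str]:
    """Follow successor links transitively and collect the current concepts."""
    deleted = deleted or set()
    current: Set[str] = set()
    seen: Set[str] = set()
    stack = [concept_id]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue  # already visited via another path
        seen.add(cid)
        direct = list(successors.get(cid, ()))
        if direct:
            stack.extend(direct)
        elif cid not in deleted:
            current.add(cid)  # no successor and not deleted: still current
    return current
```

If concept "A" was split into "B" and "C", and "B" was later replaced by "D", the call `current_successors("A", {"A": ["B", "C"], "B": ["D"]})` returns `{"C", "D"}`.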

Also it may be important to know if a concept is a

  • direct successor/precursor

as this way we retrieve information on the type of transformation that took place when reorganizing a set of concepts (the set of all direct successor relations and the set of all precursors of these successors define the type of transformation that took place when a concept became outdated).

Further interesting relationships might be

  • deleted (a concept may be completely deleted and not transformed into another concept if it turned out to be not a valid concept, e.g. because the type specimen does not exist anymore and no similar specimen was found or because the concept has a regional filter and it turned out that the concept never existed in the given region and therefore it was fully removed from the list of current concepts)
  • neighboring concepts (an explicit list of external concepts that are part of the full definition of the given concept)

Rules for Data Changes

Not every change in a database should define a new concept. For example, correcting a typo in a name or in a description should not create a new concept.
But for users relying on the definition of a concept it is essential to know how far the data related to a concept may change over time without it being marked as outdated and getting a successor concept.
So exact rules are needed that state what the defining part of a concept is and which changes in the data are considered definition-changing. For example, a rule could be that each type specimen which is directly or indirectly added to or removed from a concept results in a new concept. This is a mostly objective definition. A more subjective definition could be that a concept changes only if a type specimen is added to or removed from the concept and an editor decides that this type specimen changed the characteristics of the concept. In this case it might also be important to include information on the deciding editor in the concept.
It may also be helpful to log all changes to a concept that are not definition-changing, to make transparent what has changed within the concept over time without being considered a concept change.
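The objective rule mentioned above (any change in the set of type specimens creates a new concept, everything else is only logged) might look like this in a minimal sketch. `ConceptData`, `is_definition_change` and `record_change` are hypothetical names, and treating the description as non-defining is an assumption of this example, not a decision taken here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ConceptData:
    type_specimens: FrozenSet[str]  # identifiers of all directly or indirectly included types
    description: str                # assumed to be non-defining in this sketch


def is_definition_change(old: ConceptData, new: ConceptData) -> bool:
    """Objective rule: the concept changes iff its set of type specimens changes."""
    return old.type_specimens != new.type_specimens


def record_change(old: ConceptData, new: ConceptData, change_log: List[str]) -> bool:
    """Return True if a successor concept is required; otherwise log the change."""
    if is_definition_change(old, new):
        return True
    if old != new:
        change_log.append(
            f"description changed: {old.description!r} -> {new.description!r}"
        )
    return False
```

The more subjective, editor-based rule would replace `is_definition_change` by a decision that is stored together with the editor who made it.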

Dynamic Concept Attributes

From the above we may derive the following attributes a dynamic/online taxon concept should have:

  • how it is defined, by secundum, by description and/or by names/homotypic groups
  • how far external definitions (concepts) belong to the given concept definition and therefore also need to be available in the database as long as the given concept is available
  • how far the concept definition is persistent or not
    • this may include information on which data is persistent (only the concept identifier and the provenance or all concept defining information or parts of concept defining information such as the internal definitions)
    • in detail, an expiration date (for online availability) and/or a service level agreement (24/7 availability) could also be included
  • how far provenance is supported for persistent concepts
    • this may include information on the type of provenance
  • time period in which the concept was considered a current concept in its given context
    • this includes information on whether a concept is a current concept at the given time (might be computed by the number of successors being 0)
  • data changing rules
  • editor (important if data changing rules depend on editorial decisions)
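The attribute list above could be captured in a simple record, sketched here under the assumption that each attribute maps to one field. All class, field and enum names are hypothetical, and the current period is modelled as a half-open date interval purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional, Set


class DefinedBy(Enum):
    SECUNDUM = "secundum"
    DESCRIPTION = "description"
    NAMES = "names/homotypic groups"


class Persistence(Enum):
    NONE = "not persistent"
    IDENTIFIER_AND_PROVENANCE = "identifier and provenance only"
    INTERNAL_DEFINITION = "internal definitions"
    FULL_DEFINITION = "all concept-defining information"


@dataclass
class DynamicConcept:
    concept_id: str
    defined_by: Set[DefinedBy]
    persistence: Persistence
    external_concepts: List[str] = field(default_factory=list)  # neighboring concepts
    expiration_date: Optional[date] = None                      # online availability
    provenance_types: List[str] = field(default_factory=list)
    current_from: Optional[date] = None
    current_until: Optional[date] = None                        # None: still current
    change_rules: str = ""
    editor: Optional[str] = None

    def is_current(self, on: date) -> bool:
        """True if the concept was considered current on the given date."""
        if self.current_from is None or on < self.current_from:
            return False
        return self.current_until is None or on < self.current_until
```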

Use-Cases

Taxonomic backbone for other checklists/taxonomic lists

E.g.

  • INSPIRE
  • GermanSL
  • Red Lists
  • NFDI
  • Links from other taxonomic portals (Italian checklist, Jacqu, ...)
  • tbc

Taxonomic lists like the German Red Lists or the GermanSL (for vegetation data) may want to link their taxa to (current concepts of) other taxonomies such as Euro+Med or EU-NOMEN.
They may do this by matching their names with names existing in the backbone. As long as the match is successful they can easily link to the current concept in Euro+Med.
If the name does not match because it is not (yet) included in the backbone, the matching needs to be done manually, if possible.
There is also the possibility to link to a homotypic group (HG) in the backbone. This makes links even more stable than links to concepts, as concepts change more often than homotypic groups (which in theory should never change). The link to the homotypic group automatically includes the link to the current concept, as each HG should belong to at most 1 concept (and usually to exactly 1).

  • name matches
    • if continuous (on-the-fly) matching/linking is possible => persistence of concepts is not necessary => the result is a single current concept.
    • if discontinuous linkage to concept => persistence and provenance are necessary => the result is a single concept (not necessarily current) and a list of possible current concepts.
    • if discontinuous linkage to homotypic group => persistence and provenance are not necessary => the result is a single current concept (and a list of former concepts if wanted)
  • name manually matched
    • to concept => persistence and provenance are necessary => the result is a single concept (not necessarily current) and a list of possible current concepts.
    • to homotypic group => persistence and provenance are not necessary => the result is a single current concept
  • concept (set of homotypic groups) matching

    • input: set of homotypic groups (the set defines the input concept; if the input has no homotypic information each name has its own HG)
    • (possible) output:
      • a list of concepts: with a list of input HGs attached that match this concept
      • a list of HGs: with a list of input HGs (or names) matching this HG
      • list of non matching input HGs: input HGs which did not match for ANY name in the group
      • list of non matching input Names: input names that did not match, but another name in its own HG did match (so we consider the HG itself as matching)
      • name-HG mapping: a mapping for each input name to the backbone HG
      • HG-HG mapping: a mapping for each input HG to the backbone HG
      • Conflicting HGs: input HGs that map to >1 output HG
  • match updating (for discontinuous linkage)

    • input: all backbone concepts currently linked to in the list; output: all concepts from the input list that are no longer "current" concepts, each with a list of possible current concepts and the relationship to them
    • input: timestamp; output: all concepts outdated since then
    • input: timestamp; output: all HGs that changed the current concept affiliation since then
  • for the rare case of HGs being split in the backbone:

    • some services indicating these splits ... (tbc)

For this use case, the completeness of names and homotypic groups as well as a good taxon name matching algorithm are essential.
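The concept (set of homotypic groups) matching described in the list above can be sketched as follows. The sketch assumes that the backbone is available as two plain lookups (name to backbone HG, backbone HG to current concept) and uses exact string equality as a stand-in for a real taxon name matching algorithm. The function name, parameter names and result keys are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def match_hg_sets(
    input_hgs: Dict[str, List[str]],         # input HG id -> names in that HG
    backbone_name_to_hg: Dict[str, str],     # backbone name -> backbone HG id
    backbone_hg_to_concept: Dict[str, str],  # backbone HG id -> current concept id
) -> dict:
    concepts = defaultdict(set)      # concept -> matching input HGs
    backbone_hgs = defaultdict(set)  # backbone HG -> matching input HGs
    name_to_hg: Dict[str, str] = {}
    hg_to_hg: Dict[str, List[str]] = {}
    unmatched_hgs: List[str] = []
    unmatched_names: List[str] = []
    conflicting_hgs: List[str] = []

    for input_id, names in input_hgs.items():
        hits = set()
        misses = []
        for name in names:
            target = backbone_name_to_hg.get(name)  # exact match only (assumption)
            if target is None:
                misses.append(name)
            else:
                name_to_hg[name] = target
                hits.add(target)
        if not hits:
            unmatched_hgs.append(input_id)  # no name of the group matched
            continue
        unmatched_names.extend(misses)      # the HG matched, these names did not
        hg_to_hg[input_id] = sorted(hits)
        if len(hits) > 1:
            conflicting_hgs.append(input_id)  # maps to more than 1 backbone HG
        for target in hits:
            backbone_hgs[target].add(input_id)
            concept = backbone_hg_to_concept.get(target)
            if concept is not None:
                concepts[concept].add(input_id)

    return {
        "concepts": {c: sorted(v) for c, v in concepts.items()},
        "backbone_hgs": {h: sorted(v) for h, v in backbone_hgs.items()},
        "unmatched_hgs": unmatched_hgs,
        "unmatched_names": unmatched_names,
        "name_to_hg": name_to_hg,
        "hg_to_hg": hg_to_hg,
        "conflicting_hgs": conflicting_hgs,
    }
```

If the input carries no homotypic information, each name would be passed as its own single-name HG, as noted in the input description above.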

Taxonomic backbone for specimen data

  • NFDI
  • (BGBM) Herbar

Citation

  • Allow a concept to be cited in the same way as is possible for printed publications; similar to PhycoBank nameIDs
  • tbc

Expected results

For all use-cases we need to define what the expected result is.
Is a clickable link that shows only the concept data (list of homotypic groups, link to included child concepts, link to excluded neighboring concepts) enough?
The use case "Italian Flora linking to E+M" will only be interested in links to real and current taxon pages, not in concept pages. If necessary, an intermediate page that shows all possible current taxon pages could be provided.
For the use case "Make a taxon concept citable" it might be enough to have only the list of homotypic groups, as this is the only persistent information that can be cited. But is this really something we want to cite?

Data Model implications

With Current Model (TaxonNode, TaxonBase (Taxon, Synonym), Name, HG, TaxonRelation)

  • Changing the accepted name of a taxon must not change the taxon ID => adaptation in operations
  • Taxon must know the single TaxonNode it belongs to (additional column "conceptNode"(?) in Taxon)
  • HGs should know which Taxon they belong to (for a given classification/context); currently it is possible, though possibly not wanted, that an HG includes names from multiple Taxa
  • Concept Persistence

    • Use the TaxIncludedIn relation (where the TaxonNode is the same, but only for current taxa) to link to included taxa. This relation is bidirectional and means that the concept of the child is included in the concept of the parent AND that all children related to the parent in this way are distinct (AND that the sum of all such relations of a parent covers the full concept of the parent taxon, so there is NO taxon which can be added to the taxon or its children in a TaxIncludedIn way)
    • for concepts which do not have a TaxIncludedIn relation to the parent (as the parent uses a more current set of children) use
      • "excluded" relation to directly exclude neighboring concepts
      • "transitive excluded" to the parent, indicating that all concepts that are excluded from the parent are also excluded from the child; the parent is still the parent of the child but the child is no longer a child of the parent
  • Taxon needs an indicator of what is part of the concept it represents

Split NameUsage and TaxonConcept

Splitting Taxon into NameUsage and TaxonConcept may make sense.

Taxon Concept

  • all concept-relevant data: sec, HGs, descriptions (morph. + molecular), concept children, concept parent or (for non-current concepts) directly excluded taxa + step-parent

NameUsage

  • placement sec, accepted name, synonyms
  • ? non-concept facts, ? images
  • NU children, NU parent
  • link to concept
  • not persistent/versioned (?)

Conclusion:

  • Concepts and NameUsage potentially share a lot of attributes
  • most critical is the hierarchical information which both may have or may not have
  • TaxonNode structures should be used only for current views on the data. As NameUsages are not versioned, TaxonNodes are mostly used to relate NameUsages. However, for concepts of the current view they can also be used if concepts and NameUsages are not split; if they are split, 2 separate relations are needed which must be kept in sync :-(
  • options, if versioning NameUsage is required:
    • Using AUD tables
    • if the related concept uses hierarchical information, use this; otherwise use explicit relationships (or a second TaxonNode). This should be enough to generate the most important information like higher taxa and children
    • allow linking a given Taxon (combined NameUsage-Concept class) to another concept, where all NameUsage data is stored in the given instance and all concept data is stored in the linked concept

=> keep the class together but clearly separate concept data from non-concept data

=> we may even remove the TaxonNode class completely and add parent-child information to the hybrid class; this works because

  • there is usually only 1 current NameUsage for 1 current concept (NameUsage == Concept) => generally simple
  • complex queries against the database usually run against the current version => no case distinction is needed as to whether queries should assume that certain data is in the given hybrid instance or in the linked concept instance
  • in the rare case that NameUsages are also versioned and therefore link to another concept (NameUsage != Concept; n NameUsage versions may link to 1 concept), only limited computation is needed; mostly we need to be able to represent all data of the old NameUsage; computations like "give me all NameUsages, including versioned NameUsages, available in a distribution area" are usually not needed
  • if a NameUsage even links to another instance that is also a NameUsage (to allow alternative classifications with the same NameUsages), more complex computations are likewise not possible or more difficult; but this is a very rare case and the limitation is acceptable => for this, besides the link to the concept, we need a second link to the NameUsage (or we keep 1 link and allow transitive linking NameUsage2->NameUsage1->Concept)

In other words, if NameUsage != Concept we have 2 parents, the NameUsage parent and the concept parent, and we also have 2 instances of the hybrid class (or 3 instances in rare cases), but complex computations on these instances are usually not needed.

The structure is similar to before, but we merge TaxonNode and Taxon, and instead of linking a Taxon to multiple TaxonNodes with no prioritization we have a clear prioritization: the current concept parent is the primary parent. All other parents need to be computed but are simply not needed for most queries.
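To make the hybrid-class idea more tangible, here is a rough sketch of how the transitive link (NameUsage2->NameUsage1->Concept) and the primary parent could interact. It is only an illustration of the proposal, not the existing model: the class `HybridTaxon`, its fields and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HybridTaxon:
    taxon_id: str
    accepted_name: str
    parent: Optional["HybridTaxon"] = None  # primary parent stored on the instance
    linked: Optional["HybridTaxon"] = None  # set only if NameUsage != Concept


def resolve_concept(taxon: HybridTaxon) -> HybridTaxon:
    """Follow links transitively until the instance holding the concept data is reached."""
    seen = set()
    while taxon.linked is not None:
        if taxon.taxon_id in seen:
            raise ValueError("cyclic link between hybrid instances")
        seen.add(taxon.taxon_id)
        taxon = taxon.linked
    return taxon


def concept_ancestors(taxon: HybridTaxon) -> List[str]:
    """Higher taxa along the primary (concept) parent chain, nearest first."""
    ancestors = []
    node = resolve_concept(taxon).parent
    while node is not None:
        ancestors.append(node.taxon_id)
        node = node.parent
    return ancestors
```

In the common case (NameUsage == Concept) `linked` is empty and `resolve_concept` returns the instance itself, so queries against the current version need no case distinction.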

Taxon Match

For all use cases it may be extremely important to have a good taxon name matching algorithm to identify whether names are equal. This matters because names are the main input for the use cases.
