Controlled vocabularies in the Common Data Model
Introduction to Problem
Lists of "predefined, authorised terms" (the definition given at WikiPedia:Controlled_vocabulary) are used throughout the taxonomic domain. I think there are common properties of controlled vocabularies that we should discuss and tackle in a general form; perhaps this is an old discussion that I just haven't found yet.
I take a rather wide definition of controlled vocabularies (CVs) here, including any list of two or more terms (string or numeric) that represents the exclusive list of values for a defined attribute/element/property. Note that this includes value/null and yes/no values, if they are not covered by a boolean data type.
To make my points, I use two examples from the LSID Ontology and one example drawing on the ABCD.RecordBasis type restriction:
Class: Taxon Rank Term http://wiki.tdwg.org/twiki/bin/view/TAG/TaxonRankLsidVoc
Class: Nomenclatural Code Term http://rs.tdwg.org/ontology/voc/TaxonName
ABCD.RecordBasis type restriction: http://www.bgbm.org/TDWG/CODATA/Schema/ABCD_2.06/HTML/ABCD_2.06.html
I assume that in designing the model we will strive to make it as simple as possible while keeping it as open as possible to future extensions, including unforeseen ones.
With controlled vocabularies, extension can simply mean added terms, so the model must cover that possibility.
However, it should also make it possible to add, as part of the CV itself, further information that programs can use or act on. This may include:
further restrictions, e.g. for type specimens, the RecordBasis must be "PreservedSpecimen" or "DrawingOrPhotograph".
functional attributes, which are exclusive to one of the terms in the list, e.g. default value
attributes that classify the terms (e.g. ranks not recommended by the code of nomenclature, deprecated terms)
alternative labels, e.g. for language representations, abbreviated / not abbreviated
language representations of descriptions that can be used as help text
language representations of short descriptions that can be used as prompts in forms etc.
references to other controlled vocabularies, that define various subsets of the term list itself (e.g. rank term used only according to Zoo
I am not saying that we need to implement any of this at the outset, only that our modelling method should allow the model to be extended in this way. For example, I don't think that XML Schema restrictions can cover any of the above directly.
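To make the kinds of extension listed above more concrete, here is a minimal sketch in Python of a vocabulary whose terms carry more than a bare string. It is only an illustration of the idea, not a proposal for the actual model: the class and field names (Term, ControlledVocabulary, flags, subsets), the subset name "type_specimen", the choice of default term, and the German label are all assumptions made up for the example. Only the RecordBasis values "PreservedSpecimen" and "DrawingOrPhotograph" and their use for type specimens come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Term:
    """One entry of a controlled vocabulary, with optional extra information."""
    key: str                                                   # the stored value
    labels: Dict[str, str] = field(default_factory=dict)       # language code -> label
    help_texts: Dict[str, str] = field(default_factory=dict)   # language code -> description
    prompts: Dict[str, str] = field(default_factory=dict)      # language code -> short prompt
    flags: Set[str] = field(default_factory=set)               # classifying attributes, e.g. {"deprecated"}
    subsets: Set[str] = field(default_factory=set)             # named subsets the term belongs to


@dataclass
class ControlledVocabulary:
    name: str
    terms: Dict[str, Term] = field(default_factory=dict)
    default: Optional[str] = None  # functional attribute held by at most one term

    def add(self, term: Term, is_default: bool = False) -> None:
        """Extend the vocabulary by one term; existing terms are untouched."""
        if term.key in self.terms:
            raise ValueError(f"duplicate term: {term.key}")
        self.terms[term.key] = term
        if is_default:
            self.default = term.key

    def allowed(self, subset: Optional[str] = None,
                include_deprecated: bool = False) -> Set[str]:
        """Keys valid in a given context: a named subset, or the whole list."""
        return {
            key for key, term in self.terms.items()
            if (subset is None or subset in term.subsets)
            and (include_deprecated or "deprecated" not in term.flags)
        }

    def label(self, key: str, lang: str = "en") -> str:
        """Label in the requested language, falling back to the key itself."""
        return self.terms[key].labels.get(lang, key)


# Illustrative data only; the subset name, default and labels are assumptions.
record_basis = ControlledVocabulary("RecordBasis")
record_basis.add(
    Term("PreservedSpecimen",
         labels={"en": "Preserved specimen", "de": "Konserviertes Exemplar"},
         subsets={"type_specimen"}),
    is_default=True,
)
record_basis.add(
    Term("DrawingOrPhotograph",
         labels={"en": "Drawing or photograph"},
         subsets={"type_specimen"})
)
record_basis.add(Term("HumanObservation", labels={"en": "Human observation"}))

# The "further restriction" for type specimens becomes a lookup of a named subset.
type_specimen_values = record_basis.allowed("type_specimen")
```

The point of the sketch is that adding a term, a language, a flag or a subset never changes the structure, only the data. A plain enumeration of permitted strings, by contrast, has nowhere to put any of this.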
Another area to be discussed is versioning of the CVs.