Darwin Core Archive

Helpful information about the mapping and import/export CDM-DwCA-CDM

GBIF Assistant


dwca assistant


meta.xml schema

spreadsheet processor

best practices




Core Taxon

  • Accepted taxa are retrieved from all nodes from all classifications (if not filtered).

  • Synonyms are retrieved from all node.taxon.synonymRelationships.synonym of the above nodes (deduplication necessary ??). In the mappings below, replace every leading node.taxon with synonym; where a path does not start with node.taxon, use null.

  • Misapplied names are retrieved from all node.taxon.taxonRelationship.fromTaxon. In the mappings below, replace every leading node.taxon with fromTaxon; where a path does not start with node.taxon, use null.

  • (Concept relationships not yet checked)
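The retrieval order described above can be sketched as follows. This is only an illustration, not the CDM implementation: Node, Taxon and core_rows are hypothetical stand-ins for the CDM classes, and deduplicating on the uuid is just one possible answer to the open deduplication question.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Taxon:
    # Hypothetical stand-in for a CDM taxon, synonym or misapplied name.
    uuid: str
    title: str
    synonyms: List["Taxon"] = field(default_factory=list)    # ~ synonymRelationships.synonym
    misapplied: List["Taxon"] = field(default_factory=list)  # ~ taxonRelationship.fromTaxon


@dataclass
class Node:
    # Hypothetical stand-in for a classification node.
    taxon: Taxon


def core_rows(nodes: List[Node]) -> Iterator[Tuple[str, str, str]]:
    """Yield (uuid, title, status) once per record: taxa, then synonyms, then misapplied names."""
    seen = set()

    def emit(record: Taxon, status: str):
        # A record reachable from several nodes is exported only the first time.
        if record.uuid not in seen:
            seen.add(record.uuid)
            yield (record.uuid, record.title, status)

    for node in nodes:
        yield from emit(node.taxon, "accepted")
    for node in nodes:
        for synonym in node.taxon.synonyms:
            yield from emit(synonym, "synonym")
    for node in nodes:
        for name in node.taxon.misapplied:
            yield from emit(name, "misapplied")
```

A synonym attached to two accepted taxa is then written once, under whichever taxon is visited first; whether that is the desired behaviour is exactly the open question noted above.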

DwcA CDM Problems Documentation
id -- --
scientificNameId -- not clear if only resolvable ids should be used here. The Dwc-A says so whereas the referenced does not
acceptedNameUsageId needed for taxa ??, taxon or name?? --
parentNameUsageId needed for synonyms and misapplied names?? --
originalNameUsageId multiple basionyms possible; should replaced synonyms be included? DwC-A and DwC documentation is unclear about whether this id should link to a nameUsage/concept or whether linking to a name (which includes the nom. ref.) is also sufficient: it mentions the basionym, which is a nomenclatural object and not a name usage, but in other places it refers to name usages
nameAccordingToId -- --
namePublishedInId -- --
taxonConceptId -- no documentation given, clarification needed about the difference between id, acceptedNameUsageId and taxonConceptId
scientificName needed?? --
acceptedNameUsage node.taxon.titleCache needed in general? needed for taxa ??, taxon or name?? --
parentNameUsage node.parent.taxon.titleCache needed in general? needed for synonyms and misapplied names?? needs improvement: "most proximate higher-rank parent taxon (in a classification) of the most specific element of the scientificName." looks like taxa and names are mixed here. What is a higher-rank parent taxon of a name? A name does not have a parent (if it is not a hybrid)
originalNameUsage ?? needed in general? multiple basionyms possible, replaced synonyms see originalNameUsageId
nameAccordingTo node.taxon.sec.titleCache needed in general? how can sec. be an institution or an individual? One always needs a timestamp, as the opinions of institutions or individuals change over time
namePublishedIn ?? --
higherClassification ?? needs to be computed via classification --
kingdom node.parent(*)[] uninomial instead?? Ranks in between. This implementation does not follow the documentation, which requires kingdom "Plantae" for ALL botanical taxa, not only for taxa of rank "kingdom". What is meant by full scientific name? With author or not? The examples don't have an author!!
phylum see kingdom see kingdom see kingdom; a remark is missing how this applies to phylum and division
class see kingdom see kingdom see kingdom
order see kingdom see kingdom see kingdom
family see kingdom see kingdom see kingdom
genus see kingdom see kingdom see kingdom
subgenus see kingdom see kingdom; how to create the Genus (Subgenus) syntax, we have the infrageneric marker in the namecache see kingdom
specificEpithet -- --
infraspecificEpithet -- --
taxonRank -> transform to gbif rank vocabulary -- --
verbatimTaxonRank -- --
scientificNameAuthorship --
vernacularName -- (see vernacular names extensions) -- --
nomenclaturalCode vocabulary recommended vocabulary is missing
taxonomicStatus for taxa:; for misapplied names: "misapplied"; for synonyms "((homo|hetero)?typic )?synonym" depending on the relationship type; "invalid" for zoological "synonyms" --
nomenclaturalStatus -> transform to vocabulary multiple status possible --
taxonRemarks ?? annotations and markers? - multiple possible for each object (taxon, name, ...) --
modified ?? node.taxon.updated multiple possible for each related object unclear about what record means, if this record is an aggregation of multiple records, which date should be used. E.g. should an update to a vernacular name also change the 'modified' value?
language ?? which language?? absolutely unclear which language is meant here. No documentation given. Language for scientific name, common name, rights term, references, ... are possible
rights ?? node.taxon.rights?? missing
rightsHolder ?? ?? missing
accessRights ?? ?? missing
bibliographicCitation ?? ?? --
informationWithheld -- currently not yet available, needs implementation for roles&rights --
datasetId ?? ?? more self-explanatory examples should be given. Can it be used to separate multiple classifications within a given DwC-A file?
datasetName ?? ?? see datasetId
source defaultValue like{id} how to provide the static part, as it is not part of the data but configured by the CDM Server? Must be given by the configuration. What is preferred: a human-readable website or a RESTful web service?
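The taxonomicStatus rule from the table above could look roughly like this. It is a sketch under assumptions: the relationship type labels "homotypic" and "heterotypic" are placeholders for the CDM synonym relationship types, and the function name is made up for illustration. The value for accepted taxa is not stated in the table, so the sketch does not guess it.

```python
from typing import Optional

# Assumed labels for the CDM synonym relationship types; the real term names may differ.
HOMOTYPIC = "homotypic"
HETEROTYPIC = "heterotypic"


def taxonomic_status(kind: str, relationship_type: Optional[str] = None,
                     zoological: bool = False) -> str:
    """Return the taxonomicStatus string for a misapplied name or a synonym."""
    if kind == "misapplied":
        return "misapplied"
    if kind == "synonym":
        if zoological:
            # Zoological "synonyms" are exported as invalid names.
            return "invalid"
        if relationship_type == HOMOTYPIC:
            return "homotypic synonym"
        if relationship_type == HETEROTYPIC:
            return "heterotypic synonym"
        return "synonym"
    # The status for accepted taxa is left open in the table above.
    raise ValueError(f"no taxonomicStatus rule for kind {kind!r}")
```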


Not yet here

  • Identification History (not yet implemented by GBIF)

  • Measurement or Facts

  • Alternative Identifiers

  • Species Profile (not so important for CDM)

Resource Relationship

There are multiple resource relationships, some of which are handled differently:

  • TaxonInteractions (CDM-Interactions)

  • NameRelationships (CDM-Relations)

  • TaxonConceptRelationships (except for misapplied names as they are handled via DwC:Taxon.taxonomicStatus)

DwcA CDM-Relations CDM-Interactions Problems/Remarks
coreid --
resourceRelationshipId --
relatedResourceId to be used for name relationships ?
relationshipOfResource relationship.type.titleCache interaction.description.text which vocabulary to use, list of preferred languages
relationshipAccordingTo relationship.citation relationship.sources.citation how to handle multiple sources ??
relationshipEstablishedDate -- -- missing
relationshipRemarks relationship.annotations interaction.annotations how to handle multiple annotations
scientificName -- to be used for taxa ?

NOTE: As relations can be accessed from both sides, we need to take special care to remove duplicates.

NOTE-2: As relations originate from different classes, we need to use UUIDs rather than ids as identifiers. Otherwise the link is ambiguous.
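Both notes can be illustrated with a small sketch. The Relation record and the relationship_rows function are hypothetical; only the column names (coreid, resourceRelationshipId, relatedResourceId, relationshipOfResource) are taken from the table above. Each relation is written once, keyed by its UUID, even though it is reachable from both of its ends.

```python
from dataclasses import dataclass
from typing import Dict, List
from uuid import UUID


@dataclass(frozen=True)
class Relation:
    # Hypothetical record; in the CDM, relations come from several different classes,
    # so database ids may collide across classes while UUIDs do not.
    uuid: UUID
    from_uuid: UUID
    to_uuid: UUID
    type_label: str


def relationship_rows(relations_by_taxon: Dict[UUID, List[Relation]]) -> List[dict]:
    """Build one extension row per relation, however many taxa it is reachable from."""
    seen = set()
    rows = []
    for relations in relations_by_taxon.values():
        for relation in relations:
            if relation.uuid in seen:
                continue  # already exported when visiting the other end
            seen.add(relation.uuid)
            rows.append({
                "coreid": str(relation.from_uuid),
                "resourceRelationshipId": str(relation.uuid),
                "relatedResourceId": str(relation.to_uuid),
                "relationshipOfResource": relation.type_label,
            })
    return rows
```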

Types and Specimen

There are multiple kinds of specimens associated with a taxon.

  • IndividualsAssociations

  • TypeSpecimen for the taxon name and all synonym names (the specimens are associated with the synonym, not with the taxon)

  • Determinations (via DeterminationEvent)

DwcA CDM Problems Documentation
coreid -- --
bibliographicCitation specimen.titleCache -- --
typeStatus typeDesignation.status.titleCache only for TypeSpecimen --
typeDesignatedBy typeDesignation.citation.titleCache only for TypeSpecimen --
scientificName semantics not necessarily exact --
taxonRank specimen.determinationEvents.taxon.rank.titleCache see above --
occurrenceId specimen.lsid or specimen.uuid we do not have a common uri field for specimen --
institutionCode -- --
collectionCode specimen.collection.code -- --
catalogNumber specimen.catalogNumber -- --
locality specimen...fieldObservation.gatheringEvent.locality -- --
sex or specimen(.derivedFrom.original)*.sex use derived unit facade implementation --
recordedBy -- --
source specimen.source how to handle multiple sources; filter data provenance sources not defined
eventDate -- --
verbatimLabel -- missing --
verbatimLongitude specimen...fieldObservation.gatheringEvent.exactLocation.longitude -- why verbatim and not decimal
verbatimLatitude specimen...fieldObservation.gatheringEvent.exactLocation.latitude -- why verbatim and not decimal

Vernacular Names

For each accepted taxon get all description elements with feature Common_Name.

Open issues: how to handle common names of type TextData.

DwcA CDM Problems Documentation
coreid -- --
vernacularName -- --
source specimen.source filter data provenance sources how to handle multiple sources
language commonTaxonName.language -- --
temporal ?? missing --
locationId -- Example is missing; unclear if handling should be in accordance with distribution.locationId, if yes the documentation should be the same
locality commonTaxonName.area.label -- --
countryCode commonTaxonName.area.iso3166_A2 -- --
sex ?? missing --
lifeStage ?? missing --
isPlural ?? missing --
isPreferredName ?? missing --
organismPart ?? missing --
taxonRemarks ?? how to filter taxon specific annotations/marker --

Literature References

The main problem here is to define which literature needs to be mapped.

Currently we map the taxon sec reference and the name nomenclaturalReference.

So references cited by description elements are exported only via the source attribute of the corresponding extensions, as an anySeparator-separated value that is neither atomized nor enriched.

See further comments in the general remarks part.

Note: The sec reference urgently needs to be deduplicated.

DwcA CDM Problems Documentation
coreid -- --
identifier(isbn/issn) reference.isbn or reference.issn (depending on which is available) current implementation checks if isbn is available, if not issn is returned --
identifier(uri) reference.uri -- --
identifier(doi) reference.extensionsType:DOI implementation needed, how about multiple DOIs --
identifier(lsid) reference.lsid.toString -- --
bibliographicCitation reference.titleCache -- --
title reference.title -- --
creator reference.authorTeam.titleCache -- --
date reference.datePublished check return type, if freetext is available, the type is not according to the suggested type YYYY-MM-DD --
source reference.inReference.titleCache -- --
description reference.referenceAbstract what about annotations and markers, how to concatenate, what is the difference to taxonRemarks --
subject ?? keywords or relationship type to the core taxon, which vocabulary to use?? --
language -- missing --
rights reference.rights -- --
taxonRemarks ?? ?? --
type ?? strange field, which vocabulary to use, how to handle lists, ... --

Taxon Description

For each accepted taxon get all description elements of class TextData.

Open issues: how to handle other description element classes (TaxonInteraction, QuantitativeData, CategoricalData).

DwcA CDM Problems Documentation
coreid -- --
description textData.getPreferredLanguageString.text -- --
type textData.feature.titleCache -- --
source textData.sources filter data provenance sources filter data provenance sources
language textData.getPreferredLanguageString.language one record for each language ?? copy&paste error - doc for vernacular names, not descriptions. Or is this on purpose?
creator -- textData.credits.agent ?? But we have no role for credits, textData.createdBy ?? --
contributor -- textData.credits.agent ?? What if the creator is missing ?? --
audience -- missing --
license textData.inDescription.rights only available for the whole description; should we use taxon.rights if first is null?? --
rightsHolder -- ?? --

Species Distributions

For each accepted taxon get all description elements with feature Distribution.

Open issues: how to handle distributions of type TextData.

DwcA CDM Problems Documentation
coreid -- --
locationId -- --
locality distribution.area.label -- --
countryCode distribution.area.iso3166_A2 -- --
lifeStage -- missing --
occurrenceStatus distribution.status extracted from PresenceAbsenceTerm ?? --
threatStatus -- missing --
establishmentMeans -- extracted from PresenceAbsenceTerm ?? --
appendixCITES -- missing --
eventDate -- missing --
seasonalDate -- missing --
source distribution.sources filter data provenance sources how to handle multiple sources
occurrenceRemarks distribution.annotations and distribution.marker implementation ?? --

Simple Images

For each accepted taxon (taxon) get all description elements with media attached.

For each media (media) get the representation parts.

For each part map as follows.

Open issues: are there other media? Do we want to filter on image galleries?

DwcA CDM Problems Documentation
coreid -- --
identifier part.uri -- --
title media.titleCache or media.title?? --
description media.description.text handle default language not available --
spatial -- missing --
coordinates -- missing --
format part.representation.mimeType -- --
license media.rights implementation for collections, which field to take (abbreviated text, uri, ...) --
created media.mediaCreated NOTE: not media.created as this is for the CDM record, not the media --
creator media.artist -- --
contributor -- missing --
publisher -- missing --
audience -- missing --
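The nested traversal described for this extension (taxon, description elements with media, representations, parts) can be sketched as below. All class and attribute names are simplified, hypothetical stand-ins for the CDM model; only the DwC-A column names and the mapping of identifier, title and format follow the table above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Part:
    uri: str


@dataclass
class Representation:
    mime_type: str
    parts: List[Part] = field(default_factory=list)


@dataclass
class Media:
    title_cache: str
    representations: List[Representation] = field(default_factory=list)


@dataclass
class DescriptionElement:
    media: List[Media] = field(default_factory=list)


@dataclass
class Taxon:
    uuid: str
    description_elements: List[DescriptionElement] = field(default_factory=list)


def image_rows(taxon: Taxon) -> Iterator[dict]:
    """Yield one Simple Images row per representation part of an accepted taxon."""
    for element in taxon.description_elements:
        for media in element.media:
            for representation in media.representations:
                for part in representation.parts:
                    yield {
                        "coreid": taxon.uuid,
                        "identifier": part.uri,               # part.uri
                        "title": media.title_cache,           # media.titleCache
                        "format": representation.mime_type,   # part.representation.mimeType
                    }
```

A media object with several representations (for example a thumbnail and a full-size image) therefore produces several rows that share the same title.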

General Mapping Remarks

CDM -> DwcA(Tax)

  • taxonRemarks: the CDM does not mark annotations as being taxon specific so it is difficult to create the taxonRemarks subset of the annotations

  • handling of literature is crucial for scientific data. Literature handling in DwC-A is very unclear. As far as we can see, literature can be added to data via the source attribute as an anySeparator-separated value that is neither atomized nor enriched. This creates a lot of redundant data and leads to loss of atomization.

Maybe the literature extension can be used for this purpose but it seems to be unclear how e.g. literature can be attached to distribution or descriptive data via the existing star schema.

The general problem here is that literature is usually referenced but does not itself reference other data. As an extension in a star schema, however, it is mainly supposed to reference core data. Yet most links in scientific data are not between taxon/name data and literature data but between extension data (descriptions, distributions, etc.) and literature data. And even the former case can only be handled via the type or the subject attribute in the given schema.

  • in general the correct use of fields that may hold sets is not clear. Don't we need a clearly defined separator character? Otherwise it is hard to digest the data.

  • some terms are not used consistently throughout the schema (e.g. 'source', 'bibliographicCitation' and maybe 'locationID'). This easily leads to misunderstandings and should be avoided.

  • though using ids as identifiers rather than uuids or uris may reduce the size and is allowed in general, it looks like this is not possible, because resource relations refer to different classes and therefore identifiers are not unique (or we need a prefix for each class)

  • use of vocabularies: many fields recommend using a controlled vocabulary, and the descriptor file allows this vocabulary to be defined. This leads to problems when the vocabulary used is not a "controlled and publicly available" vocabulary. It also leads to problems when we have terms from multiple vocabularies. Much more advice should also be given on which vocabularies are recommended. Some of the given vocabularies are rather small and don't cover the full range of possible values. The largest known available vocabulary should always be given as well.
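To make the separator question concrete, here is a minimal sketch of how a multi-valued field such as source could be written and read back if one separator were agreed on. It assumes " | " (space, vertical bar, space), which newer Darwin Core term descriptions recommend for lists; whether every consumer of an archive honours that is an assumption, and the function names are made up for illustration.

```python
from typing import Iterable, List

# Assumed list separator; not guaranteed to be understood by every DwC-A consumer.
SEPARATOR = " | "


def join_values(values: Iterable[str], separator: str = SEPARATOR) -> str:
    """Join several values into one field, refusing values that contain the separator character."""
    cleaned = [value.strip() for value in values if value and value.strip()]
    for value in cleaned:
        if separator.strip() in value:
            raise ValueError(f"value contains the separator character: {value!r}")
    return separator.join(cleaned)


def split_values(text: str, separator: str = SEPARATOR) -> List[str]:
    """Split a joined field back into its values, tolerating missing spaces around the bar."""
    return [part.strip() for part in text.split(separator.strip()) if part.strip()]
```

Even with a fixed separator each value stays an unatomized string, so this addresses only the parsing problem, not the loss of atomization described above.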

Assistant Tool

  • Documentation notes are sometimes hard to read as they disappear after a while. It would be nice to be able to display them for an unlimited time by clicking on the annotated object.


Meta Data Processor

  • Taxonomic Keywords are not processed

  • Second primary contact is not processed

  • Project description is not processed

  • single keywords are not trimmed

  • First name and last name are required though organizations are allowed as associated parties (they don't have first and last names)

  • Metadata language is not processed

  • Resource Creator should allow teams (in our case we have a team of 3 creators)

Term Mapping

GBIF vocabularies

GermanSL Example given by GBIF

see link to GermanSL at

This example has a couple of errors/problems, such as the following:

  • Eml.xml file is missing completely

  • Coretable

    • index 1 is missing
    • taxonRank does not use the recommended vocabulary but a three-letter code based partly on German taxon ranks (e.g. GAT)
    • there is a 9th and a 10th column holding values A,F,G,M,P,S and 0,x which are not referred to in the metaData.xml at all
  • SpeciesInfo.txt

  • Distribution.txt

    • locationId does not use the correct codes. It uses e.g. BW for Baden-Württemberg, but the correct code is DE-BW; BW on its own stands for Botswana (ISO 3166-1 alpha-2). So the DE- prefix is missing for all values and the metadata.xml doesn't address this issue.
  • Vernacular.txt

    • index 3 refers to which is usually a string defining a concrete area. Here it uses a country code (DE) which should go into the explicit field "countryCode"
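The locationId problem noted for Distribution.txt could be repaired on import roughly as follows. This is a sketch under the assumption that every bare code in the file is a subdivision of one known country (here DE); the function name is hypothetical.

```python
def qualify_location_id(code: str, country: str = "DE") -> str:
    """Turn a bare subdivision code such as 'BW' into an ISO 3166-2 style code such as 'DE-BW'."""
    code = code.strip().upper()
    if not code:
        raise ValueError("empty location code")
    if "-" in code:
        # Already qualified with a country prefix; leave it untouched.
        return code
    return f"{country.upper()}-{code}"
```

For example, qualify_location_id("BW") gives "DE-BW", so the value can no longer be confused with the country code for Botswana.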

As this is an example, the data should be improved or removed.

Updated by Andreas Müller over 1 year ago · 69 revisions