Darwin Core Archive

Helpful information about the mapping and import/export CDM-DwCA-CDM

GBIF Assistant

overview

dwca assistant

DarwinCoreTerms

meta.xml schema

spreadsheet processor

best practices

Mapping

Tickets

#2342

Core Taxon

  • Accepted taxa are retrieved from all nodes from all classifications (if not filtered).

  • Synonyms are retrieved from all node.taxon.synonymRelationships.synonym of the above nodes (deduplication necessary ??). In the mapping below, replace every leading node.taxon by synonym; where a path does not start with node.taxon, use null

  • Misapplied names are retrieved from all node.taxon.taxonRelationship.fromTaxon. In the mapping below, replace every leading node.taxon by fromTaxon; where a path does not start with node.taxon, use null

  • (Concept relationships not yet checked)
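
The retrieval rules above can be sketched as follows. This is only an illustration, not the CDM implementation: `Taxon`, `Node` and `core_records` are hypothetical stand-ins for the CDM classes, and the deduplication key is an assumption, since the question of whether deduplication is needed is still open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Taxon:
    id: int
    name: str
    # stands in for node.taxon.synonymRelationships.synonym
    synonyms: List["Taxon"] = field(default_factory=list)
    # stands in for node.taxon.taxonRelationship.fromTaxon
    misapplied: List["Taxon"] = field(default_factory=list)


@dataclass
class Node:
    taxon: Optional[Taxon]


def core_records(nodes):
    """Yield (id, name, status, accepted_id) tuples for the core table."""
    seen = set()
    for node in nodes:
        taxon = node.taxon
        if taxon is None:
            continue
        candidates = [(taxon, "accepted")]
        candidates += [(s, "synonym") for s in taxon.synonyms]
        candidates += [(m, "misapplied") for m in taxon.misapplied]
        for record, status in candidates:
            # Assumed key: the same taxon may sit in several classifications,
            # which would otherwise repeat all of its rows.
            key = (record.id, status, taxon.id)
            if key in seen:
                continue
            seen.add(key)
            yield (record.id, record.name, status, taxon.id)
```

With this key, a synonym attached to two different accepted taxa still produces two rows; whether that is wanted depends on the answer to the deduplication question.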

|DwcA|CDM|Problems|Documentation|
|id|node.taxon.id|--|--|
|scientificNameId|node.taxon.name.id|--|not clear if only resolvable ids should be used here. The Dwc-A says so whereas the referenced http://rs.tdwg.org/dwc/terms/index.htm#scientificNameID does not|
|acceptedNameUsageId|node.taxon.id|needed for taxa ??, taxon or name??|--|
|parentNameUsageId|node.parent.taxon.id|needed for synonyms and misapplied names??|--|
|originalNameUsageId|node.taxon.name.basionym.id|multiple basionyms possible; should replaced synonyms be included?|DwC-A and DwC documentation are unclear whether this id must link to a name usage/concept or whether a link to a name (which includes the nom. ref.) is sufficient. The documentation mentions the basionym, which is a nomenclatural object and not a name usage, but elsewhere it refers to name usages|
|nameAccordingToId|node.taxon.sec.id|--|--|
|namePublishedInId|node.taxon.name.nomenclaturalReference.id|--|--|
|taxonConceptId|node.taxon.id|--|no documentation given, clarification needed about the difference between id, acceptedNameUsageId and taxonConceptId|
|scientificName|node.name.titleCache|needed??|--|
|acceptedNameUsage|node.taxon.titleCache|needed in general? needed for taxa ??, taxon or name??|--|
|parentNameUsage|node.parent.taxon.titleCache|needed in general? needed for synonyms and misapplied names??|needs improvement: "most proximate higher-rank parent taxon (in a classification) of the most specific element of the scientificName." looks like taxa and names are mixed here. What is a higher-rank parent taxon of a name? A name does not have a parent (if it is not a hybrid)|
|originalNameUsage|node.taxon.name.basionym.titleCache ??|needed in general? multiple basionyms possible, replaced synonyms|see originalNameUsageId|
|nameAccordingTo|node.taxon.sec.titleCache|needed in general?|how can sec. be an institution or an individual. One always needs a timestamp as opinions of institutions or individuals change over time|
|namePublishedIn|node.taxon.name.nomenclaturalReference.titleCache|??|--|
|higherClassification|??|needs to be computed via classification|--|
|kingdom|node.parent()[.taxon.name.rank=kingdom].taxon.name.titleCache|uninomial instead?? Ranks in between. This implementation is not according to the documentation as the documentation requires kingdom "Plantae" for ALL botanical taxa not only taxa of rank "kingdom"|what is meant by "full" scientific name? with author or not? The examples don't have an author!!|
|phylum|see kingdom|see kingdom|see kingdom; a remark is missing how this applies to phylum and division|
|clazz|see kingdom|see kingdom|see kingdom|
|order|see kingdom|see kingdom|see kingdom|
|family|see kingdom|see kingdom|see kingdom|
|genus|see kingdom|see kingdom|see kingdom|
|subgenus|see kingdom|see kingdom; how to create the Genus (Subgenus) syntax, we have the infrageneric marker in the namecache|see kingdom|
|specificEpithet|node.taxon.name.specificEpithet|--|--|
|infraspecificEpithet|node.taxon.name.infraSpecificEpithet|--|--|
|taxonRank|node.taxon.name.rank -> transform to gbif rank vocabulary|--|--|
|verbatimTaxonRank|node.taxon.name.rank.getAbbreviation|--|--|
|scientificNameAuthorship|node.taxon.name.authorshipCache|--|--|
|vernacularName|-- (see vernacular names extensions)|--|--|
|nomenclaturalCode|node.taxon.name.nomenclaturalCode|vocabulary|recommended vocabulary is missing|
|taxonomicStatus|for taxa: node.taxon.name.nomenclaturalCode.acceptedTaxonStatusLabel; for misapplied names: "misapplied"; for synonyms "((homo|hetero)?typic )?synonym" depending on the relationship type; "invalid" for zoological "synonyms"|--|--|
|nomenclaturalStatus|node.taxon.name.status -> transform to vocabulary|multiple status possible|--|
|taxonRemarks|??|annotations and markers? - multiple possible for each object (taxon, name, ...)|--|
|modified|??|node.taxon.updated multiple possible for each related object|unclear about what record means, if this record is an aggregation of multiple records, which date should be used. E.g. should an update to a vernacular name also change the 'modified' value?|
|language|??|which language??|absolutely unclear which language is meant here. No documentation given. Language for scientific name, common name, rights term, references, ... are possible|
|rights|??|node.taxon.rights??|missing|
|rightsHolder|??|??|missing|
|accessRights|??|??|missing|
|bibliographicCitation|??|??|--|
|informationWithheld|--|currently not yet available, needs implementation for roles&rights|--|
|datasetId|??|node.classification.id ??|more self-explanatory examples should be given. Can it be used to separate multiple classifications within a given DwC-A file?|
|datasetName|??|node.classification.name ??|see datasetId|
|source|defaultValue like http://wp6-cichorieae.e-taxonomy.eu/portal/?q=cdm_dataportal/taxon/{id}|how to provide the static part as it is not part of the data but configured by the CDM Server. Must be given by the configuration|what is preferred, a human readable website or a RESTful webservice?|
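
For the kingdom to subgenus rows, one possible reading of the documentation is to walk up the classification from every node, so that e.g. kingdom is filled for all taxa and not only for taxa of rank kingdom. A minimal sketch, assuming a simplified hypothetical `Node` with `name`, `rank` and `parent` (not the CDM API); whether the name should be the titleCache or the uninomial remains open:

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Ranks that have their own column in the core table (clazz is written as "class" here).
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "subgenus")


@dataclass
class Node:
    name: str
    rank: str
    parent: Optional["Node"] = None


def higher_ranks(node: Node) -> Dict[str, str]:
    """Collect the name found at each higher rank, starting at the node itself."""
    result: Dict[str, str] = {}
    current: Optional[Node] = node
    while current is not None:
        # Keep the first (nearest) hit per rank while walking towards the root.
        if current.rank in HIGHER_RANKS and current.rank not in result:
            result[current.rank] = current.name
        current = current.parent
    return result
```

Ranks that do not occur between the node and the root are simply absent from the result and would be exported as empty columns.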

Extensions

Not yet here

  • Identification History (not yet implemented by GBIF)

  • Measurement or Facts

  • Alternative Identifiers

  • Species Profile (not so important for CDM)

Resource Relationship

There are multiple kinds of resource relationships, which are partly handled differently:

  • TaxonInteractions (CDM-Interactions)

  • NameRelationships (CDM-Relations)

  • TaxonConceptRelationships (except for misapplied names as they are handled via DwC:Taxon.taxonomicStatus)

|DwcA|CDM-Relations|CDM-Interactions|Problems/Remarks|
|coreid|relationship.relatedFrom.id|interaction.indescription.taxon.id|--|
|resourceRelationshipId|relationship.id|interaction.id|--|
|relatedResourceId|relationship.relatedTo.id|interaction.taxon2.id|to be used for name relationships ?|
|relationshipOfResource|relationship.type.titleCache|interaction.description.text|which vocabulary to use, list of preferred languages|
|relationshipAccordingTo|relationship.citation|relationship.sources.citation|how to handle multiple sources ??|
|relationshipEstablishedDate|--|--|missing|
|relationshipRemarks|relationship.annotations|interaction.annotations|how to handle multiple annotations|
|scientificName|relationship.relatedTo.id|--|to be used for taxa ?|

NOTE: As relations can be accessed from both sides, we need to take special care to remove duplicates.

NOTE-2: As relations originate from different classes, we need to use uuids rather than ids as identifiers. Otherwise the link is ambiguous.
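
Both notes can be illustrated with a small sketch. The `Relation` class and `export_rows` function are hypothetical and only mirror the table columns above; the idea is to key the duplicate check on the relation's uuid, which stays unique across classes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List
from uuid import UUID, uuid4


@dataclass(frozen=True)
class Relation:
    uuid: UUID
    related_from: UUID
    related_to: UUID
    type_label: str


def export_rows(relations: Iterable[Relation]) -> List[Dict[str, str]]:
    """Build one resource relationship row per relation, skipping repeats."""
    seen = set()
    rows = []
    for rel in relations:
        # The same relation is reachable from both related objects.
        if rel.uuid in seen:
            continue
        seen.add(rel.uuid)
        rows.append({
            "coreid": str(rel.related_from),
            "resourceRelationshipId": str(rel.uuid),
            "relatedResourceId": str(rel.related_to),
            "relationshipOfResource": rel.type_label,
        })
    return rows
```

Passing the relations collected from both the "from" and the "to" object then yields each relation only once.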

Types and Specimen

Specimens can be associated with a taxon in several ways:

  • IndividualsAssociations

  • TypeSpecimen for the taxon name and all synonym names (the specimens are associated with the synonym, not with the taxon)

  • Determinations (via DeterminationEvent)

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id/ synonym.id|--|--|
|bibliographicCitation|specimen.titleCache|--|--|
|typeStatus|typeDesignation.status.titleCache|only for TypeSpecimen|--|
|typeDesignatedBy|typeDesignation.citation.titleCache|only for TypeSpecimen|--|
|scientificName|specimen.determinationEvents.taxon.name.titleCache|semantics not necessarily exact|--|
|taxonRank|specimen.determinationEvents.taxon.rank.titleCache|see above|--|
|occurrenceId|specimen.lsid or specimen.uuid|we do not have a common uri field for specimen|--|
|institutionCode|specimen.collection.institute.code|--|--|
|collectionCode|specimen.collection.code|--|--|
|catalogNumber|specimen.catalogNumber|--|--|
|locality|specimen...fieldObservation.gatheringEvent.locality|--|--|
|sex|specimen.sex or specimen(.derivedFrom.original)*.sex|use derived unit facade implementation|--|
|recordedBy|specimen...fieldObservation.gatheringEvent.actor|--|--|
|source|specimen.source|how to handle multiple sources; filter data provenance sources|not defined|
|eventDate|specimen...fieldObservation.gatheringEvent.date|--|--|
|verbatimLabel|--|missing|--|
|verbatimLongitude|specimen...fieldObservation.gatheringEvent.exactLocation.longitude|--|why verbatim and not decimal|
|verbatimLatitude|specimen...fieldObservation.gatheringEvent.exactLocation.latitude|--|why verbatim and not decimal|

Vernacular Names

For each accepted taxon get all description elements with feature Common_Name.

Open issues: how to handle common names of type TextData.

|DwcA|CDM|Problems|Documentation|
|coreid |commonTaxonName.inDescription.taxon.id|--|--|
|vernacularName|commonTaxonName.name|--|--|
|source|specimen.source|filter data provenance sources|how to handle multiple sources|
|language|commonTaxonName.language|--|--|
|temporal|??|missing|--|
|locationId|commonTaxonName.area.id|--|Example is missing; unclear if handling should be in accordance with distribution.locationId, if yes the documentation should be the same|
|locality|commonTaxonName.area.label|--|--|
|countryCode|commonTaxonName.area.iso3166_A2|--|--|
|sex|??|missing|--|
|lifeStage|??|missing|--|
|isPlural|??|missing|--|
|isPreferredName|??|missing|--|
|organismPart|??|missing|--|
|taxonRemarks|??|how to filter taxon specific annotations/marker|--|

Literature References

The main problem here is to define which literature needs to be mapped.

Currently we map the taxon sec reference and the name nomenclaturalReference.

References that are referenced only by description elements are therefore exported solely via the source attribute of the corresponding extension. That attribute holds an anySeparator-separated string, not an atomized or enriched value.

See further comments in the general remarks part.

Note: The sec reference urgently needs to be deduplicated.

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|identifier(isbn/issn)|reference.isbn or reference.issn (depending on which is available)|current implementation checks if isbn is available, if not issn is returned|--|
|identifier(uri)|reference.uri|--|--|
|identifier(doi)|reference.extensionsType:DOI|implementation needed, how about multiple DOIs|--|
|identifier(lsid)|reference.lsid.toString|--|--|
|bibliographicCitation|reference.titleCache|--|--|
|title|reference.title|--|--|
|creator|reference.authorTeam.titleCache|--|--|
|date|reference.datePublished|check return type, if freetext is available, the type is not according to the suggested type YYYY-MM-DD|--|
|source|reference.inReference.titleCache|--|--|
|description|reference.referenceAbstract|what about annotations and markers, how to concatenate, what is the difference to taxonRemarks|--|
|subject|??|keywords or relationship type to the core taxon, which vocabulary to use??|--|
|language|--|missing|--|
|rights|reference.rights|--|--|
|taxonRemarks|??|??|--|
|type|??|strange field, which vocabulary to use, how to handle lists, ...|--|

Taxon Description

For each accepted taxon get all description elements of class TextData.

Open issues: how to handle other description element classes (TaxonInteraction, QuantitativeData, CategoricalData).

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|description|textData.getPreferredLanguageString.text|--|--|
|type|textData.feature.titleCache |--|--|
|source|textData.sources|filter data provenance sources|how to handle multiple sources|
|language|textData.getPreferredLanguageString.language|one record for each language ??|copy&paste error - doc for vernacular names, not descriptions. Or is this on purpose?|
|creator|--|textData.credits.agent ?? But we have no role for credits, textData.createdBy ??|--|
|contributor|--|textData.credits.agent ?? What if the creator is missing ??|--|
|audience|--|missing|--|
|license|textData.inDescription.rights|only available for the whole description; should we use taxon.rights if first is null??|--|
|rightsHolder|--|??|--|

Species Distributions

For each accepted taxon get all description elements with feature Distribution.

Open issues: how to handle distributions of type TextData.

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|locationId|distribution.area.id|--|--|
|locality|distribution.area.label|--|--|
|countryCode|distribution.area.iso3166_A2|--|--|
|lifeStage|--|missing|--|
|occurrenceStatus|distribution.status|extracted from PresenceAbsenceTerm ??|--|
|threatStatus|--|missing|--|
|establishmentMeans|--|extracted from PresenceAbsenceTerm ??|--|
|appendixCITES|--|missing|--|
|eventDate|--|missing|--|
|seasonalDate|--|missing|--|
|source|distribution.sources|filter data provenance sources|how to handle multiple sources|
|occurrenceRemarks|distribution.annotations and distribution.marker|implementation ??|--|

Simple Images

For each accepted taxon (taxon) get all description elements with media attached.

For each media (media) get the representation parts.

For each part map as follows.

Open issues: are there other media? Do we want to filter on image galleries?

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|identifier|part.uri|--|--|
|title|media.titleCache|or media.title??|--|
|description|media.description.text|handle default language not available|--|
|spatial|--|missing|--|
|coordinates|--|missing|--|
|format|part.representation.mimeType|--|--|
|license|media.rights|implementation for collections, which field to take (abbreviated text, uri, ...)|--|
|created|media.mediaCreated|NOTE: not media.created as this is for the CDM record, not the media|--|
|creator|media.artist|--|--|
|contributor|--|missing|--|
|publisher|--|missing|--|
|audience|--|missing|--|

General Mapping Remarks

CDM -> DwcA(Tax)

  • taxonRemarks: the CDM does not mark annotations as being taxon specific so it is difficult to create the taxonRemarks subset of the annotations

  • handling of literature is crucial for scientific data. Literature handling in DwC-A is very unclear. As far as we can see, literature can be added to data via the source attribute as an anySeparator-separated value that is neither atomized nor enriched. This creates a lot of redundant data and leads to loss of atomization.

Maybe the literature extension can be used for this purpose, but it is unclear how e.g. literature can be attached to distribution or descriptive data via the existing star schema.

The general problem here is that literature is usually referenced by other data but does not itself reference other data. As an extension in a star schema, however, it is mainly supposed to reference core data. Yet most links in scientific data are not between taxon/name data and literature data but between extension data (descriptions, distributions, etc.) and literature data. And even the former case can only be handled via the type or the subject attribute in the given schema.

  • in general the correct use of fields that may hold sets is not clear. Don't we need a clearly defined separator character? Otherwise the data is hard to digest

  • some terms are not used consistently throughout the schema (e.g. 'source', 'bibliographicCitation' and maybe 'locationID'). This easily leads to misunderstandings and should be avoided

  • although using ids rather than uuids or uris as identifiers may reduce the file size and is allowed in general, it does not seem possible here: resource relations refer to different classes, so ids are not unique across them (unless we add a prefix for each class)

  • use of vocabularies: many fields recommend using a controlled vocabulary, and the descriptor file allows defining this vocabulary. This leads to problems when the vocabulary used is not a "controlled and publicly available" vocabulary, and also when we have terms from multiple vocabularies. Much more advice should be given on which vocabularies are recommended. Some of the given vocabularies are rather small and don't cover the full range of possible values. The largest known vocabulary available should always be given as well.
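
To illustrate the remark on set-valued fields: with an agreed separator and an escape character, a consumer can split a field back into its values without guessing. The sketch below is an assumption for discussion only; the separator `|`, the escape character and both function names are not taken from the DwC-A documentation.

```python
SEPARATOR = "|"   # assumed separator, not defined by the DwC-A documentation
ESCAPE = "\\"     # assumed escape character


def join_values(values):
    """Join several values into one field, escaping separator and escape characters."""
    escaped = [
        v.replace(ESCAPE, ESCAPE * 2).replace(SEPARATOR, ESCAPE + SEPARATOR)
        for v in values
    ]
    return SEPARATOR.join(escaped)


def split_values(text):
    """Split a field written by join_values back into its values."""
    values, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == ESCAPE and i + 1 < len(text):
            # Escaped character: take the next character literally.
            current.append(text[i + 1])
            i += 2
        elif ch == SEPARATOR:
            values.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    values.append("".join(current))
    return values
```

Without such a convention, a source string that itself contains the separator cannot be told apart from two separate sources.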

Assistant Tool

  • Documentation notes are sometimes hard to read as they disappear after a while. It would be nice to keep them visible for an unlimited time by clicking on the annotated object.

Documentation

Meta Data Processor

  • Taxonomic Keywords are not processed

  • Second primary contact is not processed

  • Project description is not processed

  • single keywords are not trimmed

  • First name and last name are required although organizations are allowed as associated parties (they don't have first and last names)

  • Metadata language is not processed

  • Resource Creator should allow teams (in our case we have a team of 3 creators)

Term Mapping

GBIF vocabularies

GermanSL Example given by GBIF

see link to GermanSL at http://code.google.com/p/gbif-ecat/wiki/DwCArchive#Static_mappings

This example has a couple of errors/problems like the following

  • Eml.xml file is missing completely

  • Coretable

    • index 1 is missing
    • taxonRank does not use the recommended vocabulary but a 3-letter code based partly on German taxon ranks (e.g. GAT)
    • there are a 9th and a 10th column holding the values A,F,G,M,P,S and 0,x, which are not referred to in the meta.xml at all
  • SpeciesInfo.txt

  • Distribution.txt

    • locationId does not use the correct codes. It uses e.g. BW for Baden-Württemberg, but the correct code is DE-BW. On its own, BW is the ISO 3166 country code for Botswana (http://en.wikipedia.org/wiki/ISO_3166-2:BW). So the DE- prefix is missing for all values and the meta.xml doesn't address this issue.
  • Vernacular.txt

    • index 3 refers to http://rs.tdwg.org/dwc/terms/locality, which is usually a string describing a concrete area. Here it holds a country code (DE), which should go into the explicit field "countryCode"

As this is meant to be an example, the data should be improved or removed.
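
The locationId problem could be repaired on import with a small normalization step. This is only a sketch: the function name and the default country "DE" are assumptions for the GermanSL data, and the pattern follows the general ISO 3166-2 shape (two-letter country code, hyphen, up to three alphanumeric characters).

```python
import re

# General shape of an ISO 3166-2 code, e.g. DE-BW
SUBDIVISION = re.compile(r"^[A-Z]{2}-[A-Z0-9]{1,3}$")


def normalize_location_id(code: str, country: str = "DE") -> str:
    """Return a full subdivision code, adding the country prefix if it is missing."""
    code = code.strip().upper()
    if SUBDIVISION.match(code):
        return code  # already a full code such as DE-BW
    return f"{country}-{code}"
```

For example, `normalize_location_id("BW")` gives `DE-BW`, while an already complete `DE-BW` is returned unchanged.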
