Andreas Müller, 05/09/2022 05:55 PM

Darwin Core Archive

Helpful information about the mapping and import/export CDM-DwCA-CDM

DarwinCoreArchiveScratchpads

GBIF Assistant

spreadsheet processor

best practices

Mapping

Tickets

#2342

Core Taxon

Accepted taxa are retrieved from all nodes from all classifications (if not filtered).
Synonyms are retrieved from node.taxon.synonymRelationships.synonym of all the above nodes (deduplication necessary??). In the mapping below, replace every leading node.taxon by synonym; where a path does not start with node.taxon, use null.
Misapplied names are retrieved from node.taxon.taxonRelationship.fromTaxon of all the above nodes. In the mapping below, replace every leading node.taxon by fromTaxon; where a path does not start with node.taxon, use null.
(Concept relationships not yet checked)
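
The retrieval rules above can be sketched as follows. This is only an illustration over plain dictionaries, not the CDM API: the keys `taxon`, `synonyms` and `misapplied` and the function `collect_core_rows` are hypothetical stand-ins for node.taxon, node.taxon.synonymRelationships.synonym and node.taxon.taxonRelationship.fromTaxon. It also assumes that deduplication by id is wanted, which the text above leaves open.

```python
def collect_core_rows(nodes):
    """Return (id, status, accepted_id) tuples, one per distinct id."""
    seen = set()
    rows = []

    def add(record_id, status, accepted_id):
        # Skip records already emitted through another node or relationship.
        if record_id in seen:
            return
        seen.add(record_id)
        rows.append((record_id, status, accepted_id))

    for node in nodes:
        taxon = node["taxon"]
        add(taxon["id"], "accepted", taxon["id"])
        for synonym in taxon.get("synonyms", []):
            add(synonym["id"], "synonym", taxon["id"])
        for misapplied in taxon.get("misapplied", []):
            add(misapplied["id"], "misapplied", taxon["id"])
    return rows


# Hypothetical sample data: synonym 10 is attached to both accepted taxa.
nodes = [
    {"taxon": {"id": 1, "synonyms": [{"id": 10}], "misapplied": [{"id": 20}]}},
    {"taxon": {"id": 2, "synonyms": [{"id": 10}]}},
]
rows = collect_core_rows(nodes)
```

With this approach the first accepted taxon encountered wins for a shared synonym. Whether that is acceptable (e.g. for pro parte synonyms) is exactly the open deduplication question.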

DwcA	CDM	Problems	Documentation
id	node.taxon.id	--	--
scientificNameId	node.taxon.name.id	--	unclear whether only resolvable ids should be used here. The DwC-A documentation says so, whereas the referenced http://rs.tdwg.org/dwc/terms/index.htm#scientificNameID does not
acceptedNameUsageId	node.taxon.id	needed for taxa ??, taxon or name??	--
parentNameUsageId	node.parent.taxon.id	needed for synonyms and misapplied names??	--
originalNameUsageId	node.taxon.name.basionym.id	multiple basionyms possible; should replaced synonyms be included?	The DwC-A and DwC documentation is unclear about whether this id must link to a name usage/concept or whether linking to a name (which includes the nom. ref.) is sufficient. It mentions the basionym, which is a nomenclatural object and not a name usage, but elsewhere it refers to name usages
nameAccordingToId	node.taxon.sec.id	--	--
namePublishedInId	node.taxon.name.nomenclaturalReference.id	--	--
taxonConceptId	node.taxon.id	--	no documentation given, clarification needed about the difference between id, acceptedNameUsageId and taxonConceptId
scientificName	node.name.titleCache	needed??	--
acceptedNameUsage	node.taxon.titleCache	needed in general? needed for taxa ??, taxon or name??	--
parentNameUsage	node.parent.taxon.titleCache	needed in general? needed for synonyms and misapplied names??	needs improvement: "most proximate higher-rank parent taxon (in a classification) of the most specific element of the scientificName." looks like taxa and names are mixed here. What is a higher-rank parent taxon of a name? A name does not have a parent (unless it is a hybrid)
originalNameUsage	node.taxon.name.basionym.titleCache ??	needed in general? multiple basionyms possible, replaced synonyms	see originalNameUsageId
nameAccordingTo	node.taxon.sec.titleCache	needed in general?	how can sec. be an institution or an individual? One always needs a timestamp, as opinions of institutions or individuals change over time
namePublishedIn	node.taxon.name.nomenclaturalReference.titleCache	??	--
higherClassification	??	needs to be computed via classification	--
kingdom	node.parent(*)[.taxon.name.rank=kingdom].taxon.name.titleCache	uninomial instead?? Ranks in between. This implementation is not according to the documentation as the documentation requires kingdom "Plantae" for ALL botanical taxa not only taxa of rank "kingdom"	what is meant by full scientific name? with author or not? The examples don't have an author!!
phylum	see kingdom	see kingdom	see kingdom; a remark is missing how this applies to phylum and division
class	see kingdom	see kingdom	see kingdom
order	see kingdom	see kingdom	see kingdom
family	see kingdom	see kingdom	see kingdom
genus	see kingdom	see kingdom	see kingdom
subgenus	see kingdom	see kingdom; how to create the Genus (Subgenus) syntax, we have the infrageneric marker in the namecache	see kingdom
specificEpithet	node.taxon.name.specificEpithet	--	--
infraspecificEpithet	node.taxon.name.infraSpecificEpithet	--	--
taxonRank	node.taxon.name.rank -> transform to gbif rank vocabulary	--	--
verbatimTaxonRank	node.taxon.name.rank.getAbbreviation	--	--
scientificNameAuthorship	node.taxon.name.authorshipCache	--	--
vernacularName	-- (see vernacular names extensions)	--	--
nomenclaturalCode	node.taxon.name.nomenclaturalCode	vocabulary	recommended vocabulary is missing
taxonomicStatus	for taxa: node.taxon.name.nomenclaturalCode.acceptedTaxonStatusLabel; for misapplied names: "misapplied"; for synonyms: "((homo|hetero)?typic )?synonym" depending on the relationship type; "invalid" for zoological "synonyms"	--	--
nomenclaturalStatus	node.taxon.name.status -> transform to vocabulary	multiple status possible	--
taxonRemarks	??	annotations and markers? - multiple possible for each object (taxon, name, ...)	--
modified	??	node.taxon.updated multiple possible for each related object	unclear about what record means, if this record is an aggregation of multiple records, which date should be used. E.g. should an update to a vernacular name also change the 'modified' value?
language	??	which language??	absolutely unclear which language is meant here. No documentation given. Language for scientific name, common name, rights term, references, ... are possible
rights	??	node.taxon.rights??	missing
rightsHolder	??	??	missing
accessRights	??	??	missing
bibliographicCitation	??	??	--
informationWithheld	--	currently not yet available, needs implementation for roles&rights	--
datasetId	??	node.classification.id ??	more self-explanatory examples should be given. Can it be used to separate multiple classifications within a given DwC-A file?
datasetName	??	node.classification.name ??	see datasetId
source	defaultValue like http://wp6-cichorieae.e-taxonomy.eu/portal/?q=cdm_dataportal/taxon/{id}	how to provide the static part as it is not part of the data but configured by the CDM Server. Must be given by the configuration	what is preferred, a human readable website or a RESTful webservice?
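
The higher-rank columns (kingdom to subgenus) could be filled for every taxon, not only for taxa of that rank, by walking up the classification. The following is a minimal sketch of the node.parent(*) lookup under stated assumptions: nodes are plain dictionaries with hypothetical keys `name`, `rank` and `parent`, ranks are plain strings, and names are given without author.

```python
def higher_rank_name(node, rank):
    """Walk from the node towards the root and return the name of the
    first taxon that has the requested rank, or None if there is none."""
    current = node
    while current is not None:
        if current["rank"] == rank:
            return current["name"]
        current = current.get("parent")
    return None


# Hypothetical three-level classification.
plantae = {"name": "Plantae", "rank": "kingdom", "parent": None}
asteraceae = {"name": "Asteraceae", "rank": "family", "parent": plantae}
lactuca = {"name": "Lactuca", "rank": "genus", "parent": asteraceae}

kingdom = higher_rank_name(lactuca, "kingdom")
family = higher_rank_name(lactuca, "family")
order = higher_rank_name(lactuca, "order")  # no node of this rank in the path
```

Ranks that are absent from the path simply yield no value, which leaves open how ranks in between should be handled.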

Extensions

Not yet covered here:

Identification History (not yet implemented by GBIF)
Measurement or Facts
Alternative Identifiers
Species Profile (not so important for CDM)

Resource Relationship

There are several kinds of resource relationships, some of which are handled differently:

TaxonInteractions (CDM-Interactions)
NameRelationships (CDM-Relations)
TaxonConceptRelationships (except for misapplied names as they are handled via DwC:Taxon.taxonomicStatus)

NOTE: As relations can be accessed from both sides, we need to take special care to remove duplicates.

NOTE-2: As relations originate from different classes, we need to use UUIDs rather than ids as identifiers. Otherwise the link is ambiguous.
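
The two notes can be illustrated with a short sketch. It assumes that each relationship carries its own UUID and is reachable from both related objects; the dictionary keys, the placeholder UUID strings and the function `export_relationships` are hypothetical and not part of the CDM API.

```python
def export_relationships(taxa):
    """Return one row per relationship, even if it is reachable from both ends."""
    seen = set()
    rows = []
    for taxon in taxa:
        for rel in taxon["relationships"]:
            # The relationship UUID is the deduplication key.
            if rel["uuid"] in seen:
                continue
            seen.add(rel["uuid"])
            rows.append((rel["uuid"], rel["from"], rel["type"], rel["to"]))
    return rows


# The same relationship object is attached to both taxa (placeholder UUIDs).
shared = {"uuid": "uuid-rel-1", "from": "uuid-taxon-a",
          "type": "congruent to", "to": "uuid-taxon-b"}
taxa = [
    {"uuid": "uuid-taxon-a", "relationships": [shared]},
    {"uuid": "uuid-taxon-b", "relationships": [shared]},
]
rows = export_relationships(taxa)
```

Because both ends are identified by UUID, the row stays unambiguous even when the related objects come from different classes.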

Types and Specimen

Several kinds of specimens are associated with a taxon:

IndividualsAssociations
TypeSpecimen for the taxon name and all synonym names (the specimens are associated with the synonym, not with the taxon)
Determinations (via DeterminationEvent)

Vernacular Names

For each accepted taxon get all description elements with feature Common_Name.

Open issues: how to handle common names of type TextData.

|DwcA|CDM|Problems|Documentation|
|coreid |commonTaxonName.inDescription.taxon.id|--|--|
|vernacularName|commonTaxonName.name|--|--|
|source|specimen.source|filter data provenance sources|how to handle multiple sources|
|language|commonTaxonName.language|--|--|
|temporal|??|missing|--|
|locationId|commonTaxonName.area.id|--|Example is missing; unclear if handling should be in accordance with distribution.locationId, if yes the documentation should be the same|
|locality|commonTaxonName.area.label|--|--|
|countryCode|commonTaxonName.area.iso3166_A2|--|--|
|sex|??|missing|--|
|lifeStage|??|missing|--|
|isPlural|??|missing|--|
|isPreferredName|??|missing|--|
|organismPart|??|missing|--|
|taxonRemarks|??|how to filter taxon specific annotations/marker|--|

Literature References

The main problem here is to define which literature needs to be mapped.

Currently we map the taxon sec reference and the name nomenclaturalReference.

As a consequence, references cited by description elements are exported only via the source attribute of the corresponding extensions, which holds an anySeparator-separated value that is neither atomized nor enriched.

See further comments in the general remarks part.

Note: The sec reference urgently needs to be deduplicated.

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|identifier(isbn/issn)|reference.isbn or reference.issn (depending on which is available)|current implementation checks if isbn is available, if not issn is returned|--|
|identifier(uri)|reference.uri|--|--|
|identifier(doi)|reference.extensionsType:DOI|implementation needed, how about multiple DOIs|--|
|identifier(lsid)|reference.lsid.toString|--|--|
|bibliographicCitation|reference.titleCache|--|--|
|title|reference.title|--|--|
|creator|reference.authorTeam.titleCache|--|--|
|date|reference.datePublished|check return type; if free text is present, the value does not conform to the suggested format YYYY-MM-DD|--|
|source|reference.inReference.titleCache|--|--|
|description|reference.referenceAbstract|what about annotations and markers, how to concatenate, what is the difference to taxonRemarks|--|
|subject|??|keywords or relationship type to the core taxon, which vocabulary to use??|--|
|language|--|missing|--|
|rights|reference.rights|--|--|
|taxonRemarks|??|??|--|
|type|??|strange field, which vocabulary to use, how to handle lists, ...|--|
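
Two of the rows above, the isbn/issn fallback and the date format check, can be sketched as follows. This is an illustration only: references are plain dictionaries with hypothetical keys, the identifiers are placeholders, and the date check tests only the shape YYYY-MM-DD, not calendar validity.

```python
import re

# Shape of the suggested date format; does not validate month or day ranges.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def pick_identifier(reference):
    """Prefer the ISBN; fall back to the ISSN; None if neither is set."""
    return reference.get("isbn") or reference.get("issn")


def iso_date_or_none(date_published):
    """Return the value only if it has the shape YYYY-MM-DD, else None."""
    if date_published and ISO_DATE.fullmatch(date_published):
        return date_published
    return None


book = {"isbn": "placeholder-isbn", "issn": None}
journal = {"isbn": None, "issn": "placeholder-issn"}
book_id = pick_identifier(book)
journal_id = pick_identifier(journal)
clean_date = iso_date_or_none("1998-05-01")
freetext_date = iso_date_or_none("ca. 1850")
```

Free-text dates are dropped here rather than guessed at; an export could instead write them to a remarks field.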

Taxon Description

For each accepted taxon get all description elements of class TextData.

Open issues: how to handle other description element classes (TaxonInteraction, QuantitativeData, CategoricalData).

Species Distributions

For each accepted taxon get all description elements with feature Distribution.

Open issues: how to handle distributions of type TextData.

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|locationId|distribution.area.id|--|--|
|locality|distribution.area.label|--|--|
|countryCode|distribution.area.iso3166_A2|--|--|
|lifeStage|--|missing|--|
|occurrenceStatus|distribution.status|extracted from PresenceAbsenceTerm ??|--|
|threatStatus|--|missing|--|
|establishmentMeans|--|extracted from PresenceAbsenceTerm ??|--|
|appendixCITES|--|missing|--|
|eventDate|--|missing|--|
|seasonalDate|--|missing|--|
|source|distribution.sources|filter data provenance sources|how to handle multiple sources|
|occurrenceRemarks|distribution.annotations and distribution.marker|implementation ??|--|

Simple Images

For each accepted taxon (taxon) get all description elements with media attached.

For each media (media) get the representation parts.

For each part map as follows.

Open issues: are there other media? Do we want to filter on image galleries?

|DwcA|CDM|Problems|Documentation|
|coreid |taxon.id|--|--|
|identifier|part.uri|--|--|
|title|media.titleCache|or media.title??|--|
|description|media.description.text|handle default language not available|--|
|spatial|--|missing|--|
|coordinates|--|missing|--|
|format|part.representation.mimeType|--|--|
|license|media.rights|implementation for collections, which field to take (abbreviated text, uri, ...)|--|
|created|media.mediaCreated|NOTE: not media.created as this is for the CDM record, not the media|--|
|creator|media.artist|--|--|
|contributor|--|missing|--|
|publisher|--|missing|--|
|audience|--|missing|--|

General Mapping Remarks

CDM -> DwcA(Tax)

taxonRemarks: the CDM does not mark annotations as being taxon specific, so it is difficult to create the taxonRemarks subset of the annotations
handling of literature is crucial for scientific data. Literature handling in DwC-A is very unclear. As far as we can see, literature can be added to data via the source attribute as an anySeparator-separated value that is neither atomized nor enriched. This creates a lot of redundant data and leads to loss of atomization.

Maybe the literature extension can be used for this purpose but it seems to be unclear how e.g. literature can be attached to distribution or descriptive data via the existing star schema.

The general problem here is that literature is usually referenced by other data but does not itself reference other data. As an extension in the star schema, however, it is mainly supposed to reference core data. Yet most links in scientific data are not between taxon/name data and literature data but between extension data (descriptions, distributions, etc.) and literature data. Even the former case can only be handled via the type or the subject attribute in the given schema.

in general, the correct use of fields that may hold sets is not clear. Don't we need a clearly defined separator character? Otherwise the data is hard to digest
some terms are not used consistently throughout the schema (e.g. 'source', 'bibliographicCitation' and maybe 'locationID'). This easily leads to misunderstandings and should be avoided
though using ids as identifiers rather than uuids or uris may reduce the size and is allowed in general, it looks like this is not possible, as resource relations refer to different classes and identifiers are therefore not unique (or we need a prefix for each class)
use of vocabularies: many fields recommend using a controlled vocabulary, and the descriptor file allows this vocabulary to be defined. This leads to problems when the vocabulary used is not a "controlled and publicly available" vocabulary, and also when we have terms from multiple vocabularies. Much more advice should be given on which vocabularies are recommended. Some of the given vocabularies are rather small and don't cover the full range of possible values. The largest known vocabulary available should always be given as well.
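
To illustrate the point about set-valued fields: with an agreed separator, multiple values can be written and read back without loss. The sketch below assumes " | " as the separator purely for illustration; the schema discussed here does not define one, and the function names are hypothetical.

```python
SEPARATOR = " | "  # assumed separator, not defined by the schema discussed here


def join_values(values):
    """Join values into one field; refuse values that contain the separator character."""
    for value in values:
        if "|" in value:
            raise ValueError(f"value contains the separator character: {value!r}")
    return SEPARATOR.join(values)


def split_values(field):
    """Split a field written by join_values back into its values."""
    if not field:
        return []
    return [part.strip() for part in field.split("|")]


# Placeholder source citations.
joined = join_values(["Reference A", "Reference B"])
restored = split_values(joined)
```

Rejecting values that contain the separator is the simplest policy; an escaping rule would be the alternative, but it would have to be part of the standard to be useful.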

Assistant Tool

Documentation notes are sometimes hard to read as they disappear after a while. It would be nice to be able to keep them open for an unlimited time by clicking on the annotated object.

Documentation

In the Meta Data How To Guide, footnote 13 is wrong. It links to http://tools.gbif.org/spreadsheet_processor instead of http://tools.gbif.org/spreadsheet-processor (see the last '-')

Meta Data Processor

Taxonomic Keywords are not processed
Second primary contact is not processed
Project description is not processed
single keywords are not trimmed
First name and last name are required, though organizations are allowed as associated parties (they don't have first and last names)
Metadata language is not processed
Resource Creator should allow teams (in our case we have a team of 3 creators)

Term Mapping

GBIF vocabularies

in general it is unclear which vocabulary to use. There are several possible URIs for the GBIF vocs such as http://rs.gbif.org/vocabulary/gbif/nomenclatural_status.xml and http://vocabularies.gbif.org/vocabularies/nomenc_status
GBIF nomen dubimum should be called nomen dubium: http://vocabularies.gbif.org/nomenc_status/dubimum
taxonomic status vocabulary uses camelCase whereas the documented examples use whitespace
taxonomic status vocabulary misses 'partial synonyms'
relationshipOfResource (Darwin Core Resource Relationship) recommends to use a controlled vocabulary but does not give an example
the recommended vocabulary for 'sex' (e.g. http://rs.gbif.org/vocabulary/gbif/sex.xml) is missing in the documentation
species distribution - occurrence status documentation is missing a recommended vocabulary
species distribution - establishmentMeans documentation is missing a recommended vocabulary

GermanSL Example given by GBIF

see link to GermanSL at http://code.google.com/p/gbif-ecat/wiki/DwCArchive#Static_mappings

This example has several errors and problems, such as the following:

Eml.xml file is missing completely
Core table
- index 1 is missing
- taxonRank does not use the recommended vocabulary but a 3-letter code based partly on German taxon ranks (e.g. GAT)
- there is a 9th and a 10th column holding values A,F,G,M,P,S and 0,x which are not referred to in the metaData.xml at all
SpeciesInfo.txt
- http://rs.gbif.org/terms/ellenberg/xxx can't be resolved to a given vocabulary
- isHybrid holds values like c, cs, csr which is not according to the description of the field as being a flag
Distribution.txt
- locationId does not use the correct codes. It uses e.g. BW for Baden-Württemberg, but the correct ISO 3166-2 code is DE-BW; on its own, BW is the country code for Botswana (http://en.wikipedia.org/wiki/ISO_3166-2:BW). So the DE- prefix is missing for all values and the metadata.xml doesn't address this issue.
Vernacular.txt
- index 3 refers to http://rs.tdwg.org/dwc/terms/locality which is usually a string defining a concrete area. Here it uses a country code (DE) which should go into the explicit field "countryCode"

Being an example, the data should be improved or removed.
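
The locationId problem noted for Distribution.txt could be repaired mechanically. The following is a minimal sketch, assuming all bare codes in the file are German federal state codes; the function name `to_iso_3166_2` and its default country are hypothetical.

```python
def to_iso_3166_2(code, country="DE"):
    """Prefix a bare subdivision code with the country code;
    leave codes that already contain a prefix untouched."""
    if "-" in code:
        return code
    return f"{country}-{code}"


fixed = to_iso_3166_2("BW")
unchanged = to_iso_3166_2("DE-BY")
```

This only works as long as the country is known from the dataset context, which is why the archive's metadata should state it.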
