i4Life

MS9 - CoL to EDIT Interface Specification available

This wiki page specifies interfaces for importing data from the Catalogue of Life (CoL) into the EDIT platform.

Prerequisites and Requirements

ETI has implemented a first version of an export of the complete CoL data, or a subset of it, in a format agreed on by the i4Life Global Partners. The service produces Darwin Core Archives (DwC-A) and can be accessed at the following URL:

http://dev.4d4life.eu/dca_export/

The Catalogue of Life to EDIT pipeline should be implemented as a DwC-A import procedure. It is therefore crucial to its success that the service delivers the DwC-A data in a consistent manner. The flow of data will be as follows:

CoL to EDIT pipeline diagram

Mapping CoL specific DwC-A data to CDM

The mapping of CoL-DwC-A datatypes is based on this DwC-A metadata file as well as some additions from the CoL-DwC-A taskgroup meeting on July 13th 2011.

Taxonomic Core

|| DwC-A || DwC-A Notes || CDM || CDM Notes ||
|| dwc:TaxonID || CoL taxon id || Taxon|Synonym.sources.idInSource || ||
|| dc:identifier || LSID || Taxon|Synonym.lsid || LSID ||
|| dwc:datasetID || CoL source database id || TaxonNode.classification || only relevant for accepted taxa ||
|| dwc:datasetName || Short name of source database plus CoL credits || TaxonNode.classification || this information will go with the classification. The exact fields have not been decided yet ||
|| dwc:acceptedNameUsageID || CoL taxon id of accepted taxon (relevant for synonyms only) || Synonym.synonymRelations.relatedTo || ||
|| dwc:parentNameUsageID || CoL taxon id of parent taxon (relevant for valid taxa only) || TaxonNode.parent || Parent-child relations are mapped through Classification/TaxonNodes in CDM ||
|| dwc:taxonomicStatus || Species 2000 status || for taxa: TaxonNode.taxon.name.nomenclaturalCode.acceptedTaxonStatusLabel; for misapplied names: "misapplied"; for synonyms: "((homo|hetero)?typic )?synonym" depending on the relationship type; "invalid" for zoological "synonyms" || ||
|| dwc:taxonRank || Full, accurate rank, whether infraspecific or not. "infraspecies" is used only if the exact infraspecific rank is unknown || TaxonNode.taxon.name.rank || ||
|| dwc:verbatimTaxonRank || Exact marker used in the scientific name, e.g. "var." or "subsp.", or nothing in the case of zoological names || NO MATCH || will be used for processing only ||
|| dwc:scientificName || Complete scientific name, including subspecific marker where appropriate || TaxonNode.taxon.name.titleCache || use only when atomized fields are not available or inconsistent ||
|| dwc:kingdom || Top level group; listed as kingdom but may be interpreted as domain or superkingdom. The following eight groups are recognized: Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, Protozoa, Viruses || TaxonNode.parent(*)[.taxon.name.rank=kingdom].taxon.name.titleCache || will only be used depending on rank and Nomenclatural Code (e.g. family in botany will be read when taxon is of rank family, but not if it is a genus) ||
|| dwc:phylum || Phylum in which the taxon has been classified || see kingdom || see kingdom ||
|| dwc:class || Class in which the taxon has been classified || see kingdom || see kingdom ||
|| dwc:order || Order in which the taxon has been classified || see kingdom || see kingdom ||
|| dwc:family || Family in which the taxon has been classified || see kingdom || see kingdom ||
|| dwc:genus || Genus in which the taxon has been classified || see kingdom || see kingdom ||
|| dwc:subgenus || || see kingdom || see kingdom ||
|| dwc:specificEpithet || Specific epithet; for hybrids, the multiplication symbol is included in the epithet || TaxonNode.taxon.name.specificEpithet || ||
|| dwc:infraspecificEpithet || Infraspecific epithet || TaxonNode.taxon.name.infraSpecificEpithet || ||
|| dwc:scientificNameAuthorship || Authorship || TaxonNode.taxon.name.authorshipCache || ||
|| dc:source || Acceptance status published in || -- || not completely clear yet ||
|| dwc:namePublishedIn || Reference in which the scientific name was first published || TaxonNode.taxon.name.nomenclaturalReference || ||
|| dwc:nameAccordingTo || Taxon scrutinized by || Taxon.credits || ||
|| dc:modified || Scrutiny date || -- || Not decided yet. ||
|| dc:description || Additional data for the taxon || TaxonNode.taxon.annotation || ||
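
The table above can be read as a lookup from DwC-A terms to CDM property paths. The following Python sketch illustrates that reading for a handful of rows; it is not the CDM import code. The names CORE_MAPPING and map_core_row are hypothetical, and terms without an entry are simply set aside rather than interpreted.

```python
# Hypothetical lookup: DwC-A term -> CDM property path (a subset of the table above).
CORE_MAPPING = {
    "dwc:parentNameUsageID": "TaxonNode.parent",
    "dwc:taxonRank": "TaxonNode.taxon.name.rank",
    "dwc:specificEpithet": "TaxonNode.taxon.name.specificEpithet",
    "dwc:infraspecificEpithet": "TaxonNode.taxon.name.infraSpecificEpithet",
    "dwc:scientificNameAuthorship": "TaxonNode.taxon.name.authorshipCache",
    "dwc:namePublishedIn": "TaxonNode.taxon.name.nomenclaturalReference",
}


def map_core_row(row):
    """Split one core record into (mapped, unmapped) dictionaries.

    `row` maps DwC-A terms to string values; empty values are skipped.
    Keys of `mapped` are CDM property paths, keys of `unmapped` are the
    original DwC-A terms that have no entry in CORE_MAPPING.
    """
    mapped, unmapped = {}, {}
    for term, value in row.items():
        if not value:
            continue
        if term in CORE_MAPPING:
            mapped[CORE_MAPPING[term]] = value
        else:
            unmapped[term] = value
    return mapped, unmapped
```

Keeping the unmapped terms in a separate dictionary makes it easy to see which columns, such as dc:source or dc:modified, still lack a decided CDM target.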

Extensions

The following extensions are used by the CoL-DwC-A.

Distribution

http://rs.gbif.org/extension/gbif/1.0/distribution.xml <==> http://wp5.e-taxonomy.eu/cdm/latest/index.htm?goto=3:133

|| DwC-A || Notes || CDM || CDM Notes ||
|| dwc:coreid || Original id prefixed with the standard, e.g. tdwg:AGE-BA; eez:polish; fao:18; iso3166-1-alpha-2:SN. Is left empty in case distribution is taken from free text string || Taxon.uuid || ||
|| dwc:locationID || namespace prefix : ID, example (tdwg:AGE-BA; tdwg:AND; eez:polish; fao:18) || Distribution.area.id || ||
|| dwc:locality || verbatim string, example (Buenos Aires, Argentina; Andaman Islands; Polish Exclusive Economic Zone; FAO fishing area 18) || Distribution.area.label || ||
|| dwc:occurrenceStatus || Distribution status (currently not yet implemented, reserved for future edition) || -- || see dwc:occurrenceStatus and dwc:establishmentMeans ||
|| dwc:establishmentMeans || The process by which the taxon became established (currently not yet implemented, reserved for future edition) || -- || see dwc:occurrenceStatus and dwc:establishmentMeans ||
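
The area identifiers in the examples follow a "namespace prefix : ID" pattern. A minimal Python sketch of splitting such a value is given here; the function name split_location_id is hypothetical, and returning (None, None) for an empty value is an assumption based on the note that the id is left empty when the distribution comes from a free text string.

```python
def split_location_id(location_id):
    """Split 'namespace:ID' into (namespace, local_id).

    Returns (None, None) for an empty value and (None, value) when no
    namespace prefix is present.
    """
    if not location_id or not location_id.strip():
        return None, None
    # Split at the first colon only, so prefixes such as
    # 'iso3166-1-alpha-2' (which contain hyphens) stay intact.
    namespace, sep, local_id = location_id.strip().partition(":")
    if not sep:
        return None, namespace
    return namespace, local_id
```

For example, split_location_id("tdwg:AGE-BA") yields the pair ("tdwg", "AGE-BA").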

Reference

http://rs.gbif.org/extension/gbif/1.0/references.xml <==> http://wp5.e-taxonomy.eu/cdm/latest/index.htm?goto=9:295

The reference dataset will contain a list of all references (the entire bibliography for a name), atomized to some degree.


|| DwC-A || Notes || CDM || CDM Notes ||
|| dwc:coreid || || Taxon.uuid || ||
|| dc:creator || Author || Reference.author || ||
|| dc:date || Year || Reference.datePublished || ||
|| dc:title || Title || Reference.title || ||
|| dc:description || Published In || Reference.referenceType? + reference.titleCache || this has to be evaluated within the complete datasets ||
|| dc:identifier || Uri || -- || has yet to be decided ||
|| dc:type || can be used to specify the type of reference (nomenclature, taxonomicStatus, vernacularName, distribution, ...) || UNUSED || CDM uses type of reference differently. References may be reused in different scenarios. ||

Species Profile

http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml


|| DwC-A || Notes || CDM || CDM Notes ||
|| dwc:coreid || || Taxon.uuid || ||
|| dwc:habitat || Life zone (currently not yet implemented, reserved for future edition). These comprise: marine, terrestrial, brackish, freshwater, unknown (vocabulary='http://www.catalogueoflife.org/dwc/habitats-classification-scheme') || Taxon.description.textData with Feature "Habitat" || not decided yet. Currently, 'Habitat' is a description feature that is connected to TextData, thus allowing for pure text entry only. This has to be thought over, because dwc:habitat comes with a controlled vocabulary ||

Vernacular Names

http://rs.gbif.org/extension/gbif/1.0/vernacularname.xml <==> http://wp5.e-taxonomy.eu/cdm/latest/index.htm?goto=3:130

|| DwC-A || Notes || CDM || CDM Notes ||
|| dwc:coreid || || Taxon.uuid || ||
|| dwc:vernacularName || Vernacular name || CommonTaxonName.name || ||
|| dc:language || Language || CommonTaxonName.language || ||
|| dwc:countryCode || Country in which the vernacular name is used || CommonTaxonName.area<WaterbodyOrCountry>.iso3166_A2 || It will be necessary to introduce a vocabulary for the countryCode; waterbodyOrCountry is not exactly iso3166_A2 ||
|| dwc:locality || Region in which the vernacular name is used || CommonTaxonName.area.label || ||
|| dc:transliteration || Transliteration || CommonTaxonName.annotations || ||

Metadata

Although metadata is currently not needed to make the CoL available in the CDM, it might be of interest in the future. The CoL-DwC-A Taskgroup decided to include metadata in EML format. Unfortunately, the CDM is not capable of processing EML data at the moment. Therefore, the EML data will be stored alongside the secundum reference of the CoL data.

Processing

This section describes the implementation details of the CoL-DwC-A import.

Dataflow

All dataflow from CoL-DwC-A to CDM should be implemented as streams.
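
One way to picture stream-based processing is a generator that reads a data file directly from the zipped archive and hands out one record at a time, so the complete file never has to be held in memory. The Python sketch below illustrates the idea only. It assumes a tab-delimited, UTF-8 file with a header line; the function name stream_rows and the member name in the usage note are hypothetical, and a real archive declares its actual layout in its metadata file.

```python
import csv
import io
import zipfile


def stream_rows(archive_path, member_name, delimiter="\t"):
    """Yield one dict per data row of a delimited text file inside a zip archive.

    Rows are produced lazily: the file is decoded and parsed only as far
    as the consumer iterates.
    """
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member_name) as raw:
            # Wrap the binary member in a text layer for the csv module.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text, delimiter=delimiter)
```

A consumer would iterate with something like `for row in stream_rows("col.zip", "taxa.txt"): ...`, processing and discarding each row before the next one is read.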

Core Data

Import of the core data is similar to the already existing Excel import functionality in the CDM.

As can be seen in the mapping of the taxonomic core data, the level of atomization in the CDM is much higher than in DwC-A and spans multiple classes. To handle this situation effectively, a class should be implemented that functions as a bucket for unstructured data. This unstructured data will then be processed into CDM objects and persisted to the datastore. This way, synergies for further development are created.
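
The bucket idea can be sketched as a small class that holds the raw values of one core record and offers a few helpers for later processing. This is an illustration in Python, not the CDM implementation; the class name RawRecord and its methods are hypothetical. Treating any record with a non-empty dwc:acceptedNameUsageID as a synonym follows the mapping table, where that field is marked as relevant for synonyms only.

```python
from dataclasses import dataclass, field


@dataclass
class RawRecord:
    """Hypothetical bucket for one unprocessed core record."""
    record_id: str
    values: dict = field(default_factory=dict)

    def get(self, term):
        """Return the stripped value for a term, or None if missing or empty."""
        value = self.values.get(term)
        if value is None:
            return None
        value = value.strip()
        return value or None

    def is_synonym(self):
        """True if the record names an accepted taxon it belongs to."""
        return self.get("dwc:acceptedNameUsageID") is not None
```

A later processing step could then decide, per record, whether to build a taxon or a synonym before anything is persisted.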

CoL to EDIT pipeline dataflow

Extensions

As can be seen in the mapping, DwC-A extension data are not as diverse as the core data and generally have a direct analogue among the CDM data types. A routine will be implemented for every DwC-A extension type that translates the data into CDM data types and links it to the corresponding record as specified by the dwc:coreid field.
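
The "one routine per extension type" approach can be sketched as a registry of handler functions keyed by extension, with the results grouped by dwc:coreid. The Python below is a simplified illustration: the registry, the decorator, and the tuple results standing in for CDM objects are all hypothetical.

```python
from collections import defaultdict

# Hypothetical registry: extension name -> handler function.
HANDLERS = {}


def handles(extension):
    """Register a function as the routine for one extension type."""
    def register(func):
        HANDLERS[extension] = func
        return func
    return register


@handles("distribution")
def import_distribution(row):
    # Stand-in for creating a CDM Distribution from the row.
    return ("Distribution", row.get("dwc:locality"))


@handles("vernacularname")
def import_vernacular_name(row):
    # Stand-in for creating a CDM CommonTaxonName from the row.
    return ("CommonTaxonName", row.get("dwc:vernacularName"))


def attach_extensions(extension, rows, by_core_id=None):
    """Translate extension rows and group the results by dwc:coreid."""
    by_core_id = defaultdict(list) if by_core_id is None else by_core_id
    handler = HANDLERS[extension]
    for row in rows:
        by_core_id[row["dwc:coreid"]].append(handler(row))
    return by_core_id
```

Adding support for another extension then only requires registering one more handler.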

Miscellaneous

dwc:occurrenceStatus and dwc:establishmentMeans

Currently there is no real analogue for these two fields in the CDM, as distribution in the CDM does not make assumptions about quantities but only about qualities of distribution. A new CDM.occurrenceStatus vocabulary will be created that maps the two vocabularies into one.
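
As a rough illustration of folding the two fields into one, the Python sketch below validates each value against its own term list and combines them into a single label. The term lists are illustrative placeholders, not the vocabularies this specification will use, and the function name merged_status and the comma-joined label format are assumptions.

```python
# Illustrative terms only; the actual vocabularies are not fixed here.
OCCURRENCE_STATUS = ("present", "absent", "doubtful")
ESTABLISHMENT_MEANS = ("native", "introduced")


def merged_status(occurrence_status=None, establishment_means=None):
    """Combine the two DwC-A values into one label of a merged vocabulary.

    Returns None when both values are empty and raises ValueError for a
    term that is not in the corresponding list.
    """
    parts = []
    for value, allowed in ((occurrence_status, OCCURRENCE_STATUS),
                           (establishment_means, ESTABLISHMENT_MEANS)):
        if value is None or not value.strip():
            continue
        term = value.strip().lower()
        if term not in allowed:
            raise ValueError(f"unknown term: {value!r}")
        parts.append(term)
    return ", ".join(parts) if parts else None
```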

Implementation

CoL-DwC(A) to CDM Import

Details on the import of CoL DwC-A data into the CDM can be found here.

Web Service Specification

The web service specification for accessing CoL data through the CDM API is described here.

Web Service Development

The current status of development of the web services can be found here.
