CoL2EDITPipeline

This wiki page specifies interfaces for importing data from the Catalogue of Life (CoL) into the EDIT platform.


MS9 - CoL to EDIT Interface Specification available

Prerequisites and Requirements

ETI has implemented a first version of a service that exports either the complete CoL data or a subset of it in a format agreed on by the i4Life Global Partners. The service produces Darwin Core Archive (DwC-A) files and can be accessed at the following URL:

http://dev.4d4life.eu/dca_export/

The Catalogue of Life to EDIT pipeline should be implemented as a DwC-A import procedure. Its success therefore depends on the service delivering the DwC-A data in a consistent manner. The flow of data will be as follows:

CoL to EDIT pipeline diagram
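
As a rough illustration of the first step of a DwC-A import, the sketch below reads an archive descriptor (meta.xml) to find the core file, the extension files, and their column-to-term assignments. The sample descriptor, the file names in it, and the helper names `describe_table` and `read_meta` are assumptions made for this example; they are not taken from the CoL export service.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core Archive descriptors (meta.xml).
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Hypothetical, heavily shortened descriptor used only for this example.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" fieldsTerminatedBy="\\t" ignoreHeaderLines="1">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/Distribution" fieldsTerminatedBy="\\t" ignoreHeaderLines="1">
    <files><location>distribution.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/locationID"/>
  </extension>
</archive>"""


def describe_table(element):
    """Summarize one <core> or <extension> element of the descriptor."""
    location = element.find("dwca:files/dwca:location", NS).text
    # The core uses <id>, extensions use <coreid>; compare against None
    # explicitly because childless elements are falsy.
    id_element = element.find("dwca:id", NS)
    if id_element is None:
        id_element = element.find("dwca:coreid", NS)
    columns = {
        int(f.get("index")): f.get("term").rsplit("/", 1)[-1]
        for f in element.findall("dwca:field", NS)
        if f.get("index") is not None  # fields without an index carry defaults only
    }
    return {
        "rowType": element.get("rowType"),
        "file": location,
        "id_index": int(id_element.get("index")),
        "columns": columns,
    }


def read_meta(meta_xml):
    """Return (core, extensions) descriptions parsed from a meta.xml string."""
    root = ET.fromstring(meta_xml)
    core = describe_table(root.find("dwca:core", NS))
    extensions = [describe_table(e) for e in root.findall("dwca:extension", NS)]
    return core, extensions


core, extensions = read_meta(SAMPLE_META)
print(core["file"], [e["file"] for e in extensions])
```

An importer could use the returned column maps to label the values of each data row before mapping them to CDM.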

Mapping CoL specific DwC-A data to CDM

The mapping of CoL-DwC-A datatypes is based on source:/trunk/cdmlib-apps/cdmlib-col/format/meta.tpl.txt and on some additions from the CoL-DwC-A taskgroup meeting on July 13th 2011.

Taxonomic Core

| DwC-A | DwC-A Notes | CDM | CDM Notes |
| dwc:taxonID | CoL taxon id | Taxon Synonym.sources.idInSource | |
| dc:identifier | LSID | Taxon Synonym.lsid | |
| dwc:datasetID | CoL source database id | TaxonNode.classification | only relevant for accepted taxa |
| dwc:datasetName | Short name of source database plus CoL credits | TaxonNode.classification | this information will go with the classification. The exact fields have not been decided yet |
| dwc:acceptedNameUsageID | CoL taxon id of accepted taxon (relevant for synonyms only) | Synonym.synonymRelations.relatedTo | |
| dwc:parentNameUsageID | CoL taxon id of parent taxon (relevant for valid taxa only) | TaxonNode.parent | Parent child relations are mapped through Classification/TaxonNodes in CDM |
| dwc:taxonomicStatus | Species 2000 status | for taxa: TaxonNode.taxon.name.nomenclaturalCode.acceptedTaxonStatusLabel; for misapplied names: "misapplied"; for synonyms: "synonym", "homotypic synonym" or "heterotypic synonym", depending on the relationship type; "invalid" for zoological "synonyms" | |
| dwc:taxonRank | Full, exact rank, whether infraspecific or not. "infraspecies" is used only when the exact infraspecific rank is unknown | TaxonNode.taxon.name.rank | |
| dwc:verbatimTaxonRank | exact marker used in the scientific name, e.g. "var." or "subsp.", or nothing in the case of zoological names | NO MATCH | will be used for processing only |
| dwc:scientificName | Complete scientific name, including subspecific marker where appropriate | TaxonNode.taxon.name.titleCache | use only when atomized fields are not available or inconsistent |
| dwc:kingdom | Top level group; listed as kingdom but may be interpreted as domain or superkingdom. The following eight groups are recognized: Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, Protozoa, Viruses | TaxonNode.parent(*)[.taxon.name.rank=kingdom].taxon.name.titleCache | will only be used depending on rank and Nomenclatural Code (e.g. family in botany will be read when taxon is of rank family, but not if it is a genus) |
| dwc:phylum | Phylum in which the taxon has been classified | see kingdom | see kingdom |
| dwc:class | Class in which the taxon has been classified | see kingdom | see kingdom |
| dwc:order | Order in which the taxon has been classified | see kingdom | see kingdom |
| dwc:family | Family in which the taxon has been classified | see kingdom | see kingdom |
| dwc:genus | Genus in which the taxon has been classified | see kingdom | see kingdom |
| dwc:subgenus | | see kingdom | see kingdom |
| dwc:specificEpithet | Specific epithet; for hybrids, the multiplication symbol is included in the epithet | TaxonNode.taxon.name.specificEpithet | |
| dwc:infraspecificEpithet | Infraspecific epithet | TaxonNode.taxon.name.infraSpecificEpithet | |
| dwc:scientificNameAuthorship | Authorship | TaxonNode.taxon.name.authorshipCache | |
| dc:source | Acceptance status published in | -- | not completely clear yet |
| dwc:namePublishedIn | Reference in which the scientific name was first published | TaxonNode.taxon.name.nomenclaturalReference | |
| dwc:nameAccordingTo | Taxon scrutinized by | Taxon.credits | |
| dc:modified | Scrutiny date | -- | Not decided yet. |
| dc:description | Additional data for the taxon | TaxonNode.taxon.annotation | |
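
The mapping above treats rows differently depending on whether they describe an accepted taxon or a synonym. The following sketch shows one way that split could look, using plain dictionaries instead of CDM classes. It assumes that a synonym row carries a non-empty dwc:acceptedNameUsageID that differs from its own dwc:taxonID; the function name `classify_core_rows` and the sample rows are made up for the example.

```python
def classify_core_rows(rows):
    """Split core rows into accepted taxa (with parent links) and synonyms.

    Returns two dicts:
      parents  -- taxonID of an accepted taxon -> parentNameUsageID (None for roots)
      synonyms -- taxonID of a synonym -> taxonID of its accepted taxon
    """
    parents = {}
    synonyms = {}
    for row in rows:
        taxon_id = row["taxonID"]
        accepted_id = row.get("acceptedNameUsageID") or None
        if accepted_id and accepted_id != taxon_id:
            # Assumption: only synonyms point at a different accepted taxon.
            synonyms[taxon_id] = accepted_id
        else:
            # parentNameUsageID is relevant for accepted (valid) taxa only.
            parents[taxon_id] = row.get("parentNameUsageID") or None
    return parents, synonyms


# Hypothetical sample rows.
sample_rows = [
    {"taxonID": "1", "scientificName": "Plantae"},
    {"taxonID": "2", "parentNameUsageID": "1", "scientificName": "Abies"},
    {"taxonID": "3", "acceptedNameUsageID": "2", "scientificName": "Picea (misplaced)"},
]
parents, synonyms = classify_core_rows(sample_rows)
print(parents, synonyms)
```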

Extensions

The following extensions are used by the CoL-DwC-A.

Distribution

| http://rs.gbif.org/extension/gbif/1.0/distribution.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=3:133 |


| DwC-A | Notes | CDM | CDM Notes |
| dwc:coreid | Original id prefixed with the standard, e.g. tdwg:AGE-BA; eez:polish; fao:18; iso3166-1-alpha-2:SN. Is left empty in case distribution is taken from free text string | Taxon.uuid | |
| dwc:locationID | namespace prefix : ID, for example tdwg:AGE-BA; tdwg:AND; eez:polish; fao:18 | Distribution.area.id | |
| dwc:locality | verbatim string, for example Buenos Aires, Argentina; Andaman Islands; Polish Exclusive Economic Zone; FAO fishing area 18 | Distribution.area.label | |
| dwc:occurrenceStatus | Distribution status (currently not yet implemented, reserved for future edition) | -- | see Miscellaneous |
| dwc:establishmentMeans | The process by which the taxon became established (currently not yet implemented, reserved for future edition) | -- | see Miscellaneous |
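
The area identifiers above follow a "namespace prefix : ID" pattern. A minimal sketch of how such a value could be split before looking up a CDM area is given below. The function name `split_location_id` is hypothetical, and treating an empty value as "free text only" is an assumption based on the notes in the table.

```python
def split_location_id(location_id):
    """Split 'prefix:ID' into (prefix, ID); return None for an empty value."""
    value = (location_id or "").strip()
    if not value:
        # Assumption: an empty id means the distribution came from free text.
        return None
    prefix, separator, local_id = value.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError("expected 'prefix:ID', got %r" % location_id)
    return prefix.lower(), local_id


for example in ("tdwg:AGE-BA", "eez:polish", "fao:18", "iso3166-1-alpha-2:SN"):
    print(split_location_id(example))
```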

Reference

| http://rs.gbif.org/extension/gbif/1.0/references.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=9:295 |

The reference dataset will contain a list of all references (the entire bibliography for a name), atomized to some degree.


| DwC-A | Notes | CDM | CDM Notes |
| | | | |
| dwc:coreid | | Taxon.uuid | |
| dc:creator | Author | Reference.author | |
| dc:date | Year | Reference.datePublished | |
| dc:title | Title | Reference.title | |
| dc:description | Published In | Reference.referenceType? + reference.titleCache | this has to be evaluated within the complete datasets |
| dc:identifier | Uri | -- | has yet to be decided |
| dc:type | can be used to specify the type of reference (nomenclature, taxonomicStatus, vernacularName, distribution, ...) | UNUSED | CDM uses type of reference differently. References may be reused in different scenarios. |

Species Profile

http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml


| DwC-A | Notes | CDM | CDM Notes |
| | | | |
| dwc:coreid | | Taxon.uuid | |
| dwc:habitat | Life zone (currently not yet implemented, reserved for future edition). The values comprise: marine, terrestrial, brackish, freshwater, unknown (vocabulary='http://www.catalogueoflife.org/dwc/habitats-classification-scheme') | Taxon.description.textData with Feature "Habitat" | not decided yet. Currently, 'Habitat' is a description feature that is connected to TextData, thus allowing for pure text entry only. This needs to be reconsidered, because dwc:habitat comes with a controlled vocabulary |

Vernacular Names

| http://rs.gbif.org/extension/gbif/1.0/vernacularname.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=3:130 |


| DwC-A | Notes | CDM | CDM Notes |
| | | | |
| dwc:coreid | | Taxon.uuid | |
| dwc:vernacularName | Vernacular name | CommonTaxonName.name | |
| dc:language | Language | CommonTaxonName.language | |
| dwc:countryCode | Country in which the vernacular name is used | CommonTaxonName.area<WaterbodyOrCountry>.iso3166_A2 | It will be necessary to introduce a vocabulary for the countryCode, because WaterbodyOrCountry is not exactly iso3166_A2 |
| dwc:locality | Region in which the vernacular name is used | CommonTaxonName.area.label | |
| dc:transliteration | Transliteration | CommonTaxonName.annotations | |

Metadata

Although metadata is currently not needed for making the CoL available in CDM, it might be of interest in the future. The CoL-DwC-A Taskgroup decided to include metadata in EML format. Unfortunately, the CDM is not capable of processing EML data at the moment. Therefore, the EML data will be stored alongside the secundum reference of the CoL data.

Processing

This section describes the implementation details of the CoL-DwC-A import.

Dataflow

All dataflow from CoL-DwC-A to CDM should be implemented as streams.
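
To illustrate what stream-based processing could mean here, the sketch below reads a tab-delimited DwC-A data file lazily, one row at a time, instead of loading it completely. The function name `stream_rows`, the column map, and the sample data are assumptions made for the example; the actual import is expected to live in the Java-based cdmlib.

```python
import csv
import io


def stream_rows(lines, columns, skip_header=1):
    """Yield one dict per data row without reading the whole file at once.

    lines       -- any iterable of text lines, e.g. an open file
    columns     -- mapping of column index -> term name
    skip_header -- number of leading header lines to ignore
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, values in enumerate(reader):
        if line_number < skip_header or not values:
            continue  # skip header lines and blank lines
        yield {
            name: values[index]
            for index, name in columns.items()
            if index < len(values)
        }


# Hypothetical sample data standing in for a core data file.
sample = io.StringIO("taxonID\tscientificName\n1\tAbies alba\n2\tAbies cilicica\n")
for row in stream_rows(sample, {0: "taxonID", 1: "scientificName"}):
    print(row)
```

Because `stream_rows` is a generator, downstream steps can consume and persist rows one by one, which keeps memory use independent of the size of the export.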

Core Data

Import of the core data is similar to the import that already exists at source:/trunk/cdmlib/cdmlib-io/src/main/java/eu/etaxonomy/cdm/io/excel/taxa in the CDM.

As can be seen in the mapping of the taxonomic core data in DwC-A, the level of atomization in the CDM is much higher and spans multiple classes. To handle this situation effectively, a class should be implemented that functions as a bucket for unstructured data. This unstructured data will then be processed into CDM objects and persisted to the datastore.

This approach creates synergies for further development.
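
The bucket idea can be sketched roughly as follows: one object holds the raw values of a core row, and a separate step derives structured name parts from it. The class name `RawRecord` and its method are hypothetical, and the rule of falling back to dwc:scientificName only when the atomized fields are missing follows the note in the core mapping table; none of this is existing cdmlib code.

```python
from dataclasses import dataclass, field

ATOMIZED_TERMS = (
    "genus",
    "specificEpithet",
    "infraspecificEpithet",
    "scientificNameAuthorship",
)


@dataclass
class RawRecord:
    """Bucket for the unstructured values of one DwC-A core row (hypothetical)."""

    values: dict = field(default_factory=dict)

    def name_parts(self):
        """Return atomized name parts, or the full name string as a fallback."""
        atomized = {
            term: (self.values.get(term) or "").strip() for term in ATOMIZED_TERMS
        }
        if atomized["genus"]:
            # Prefer atomized fields; drop the ones that are empty.
            return {term: value for term, value in atomized.items() if value}
        # Fallback: use the complete name only when atomized fields are missing.
        return {"titleCache": (self.values.get("scientificName") or "").strip()}


record = RawRecord({"genus": "Abies", "specificEpithet": "alba", "scientificNameAuthorship": "Mill."})
print(record.name_parts())
```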

CoL to EDIT pipeline dataflow

Extensions

As can be seen in the mapping, DwC-A extension data are not as diverse as the core data and generally have a direct analogy to a CDM data type. A routine for every DwC-A extension type will be implemented that translates the data into CDM datatypes and links it to the corresponding record as specified by the dwc:coreid field.
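
A minimal sketch of the "one routine per extension type" idea is shown below. Each routine converts one extension row and attaches the result to the core record identified by dwc:coreid. Taxa are plain dictionaries here, and the handler names, the row type keys, and the returned list of unmatched rows are assumptions made for the example.

```python
def import_distribution(taxon, row):
    """Attach a distribution area label to the taxon."""
    taxon.setdefault("distributions", []).append(row["locality"])


def import_vernacular_name(taxon, row):
    """Attach a (name, language) pair to the taxon."""
    taxon.setdefault("commonNames", []).append(
        (row["vernacularName"], row.get("language"))
    )


# One routine per extension row type (keys are hypothetical short names).
HANDLERS = {
    "Distribution": import_distribution,
    "VernacularName": import_vernacular_name,
}


def import_extension(row_type, rows, taxa_by_id):
    """Apply the routine for row_type to every row; return rows without a core record."""
    handler = HANDLERS[row_type]
    unmatched = []
    for row in rows:
        taxon = taxa_by_id.get(row["coreid"])
        if taxon is None:
            unmatched.append(row)  # keep for reporting instead of failing
            continue
        handler(taxon, row)
    return unmatched


taxa = {"2": {"name": "Abies alba"}}
leftover = import_extension(
    "VernacularName",
    [{"coreid": "2", "vernacularName": "silver fir", "language": "en"},
     {"coreid": "99", "vernacularName": "unknown"}],
    taxa,
)
print(taxa, leftover)
```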

Miscellaneous

dwc:occurrenceStatus and dwc:establishmentMeans

Currently there is no real analogy for these two fields in the CDM, as distribution in the CDM does not make assumptions about quantities but only about qualities of distribution. A new CDM.occurrenceStatus vocabulary will be created that maps the two vocabularies into one.
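
Since the merged vocabulary has not been defined yet, the following is only a sketch of the mechanism: a lookup that combines the two DwC-A values into a single status label. Every term and label in `MERGED_STATUS` is a placeholder invented for the example, not a proposal for the actual CDM.occurrenceStatus vocabulary.

```python
# Placeholder vocabulary: (occurrenceStatus, establishmentMeans) -> merged label.
# All entries are hypothetical and only demonstrate the lookup.
MERGED_STATUS = {
    ("present", "native"): "native",
    ("present", "introduced"): "introduced",
    ("present", ""): "present",
    ("absent", ""): "absent",
}


def merge_status(occurrence_status, establishment_means):
    """Return the merged label, or None when the combination is not mapped."""
    key = (
        (occurrence_status or "").strip().lower(),
        (establishment_means or "").strip().lower(),
    )
    return MERGED_STATUS.get(key)


print(merge_status("Present", "Native"))
print(merge_status("absent", None))
```

Returning None for unmapped combinations lets the import report them for review instead of guessing a status.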

Implementation

CoL-DwC(A) to CDM Import

Details related to the import of CoL DwC(A) data to the CDM can be found here

Web Service Specification

The web service specification for accessing CoL data through the CDM API is described here

Web Service Development

The current status of development of the web services can be found here

Updated by Andreas Müller almost 2 years ago · 62 revisions