CoL2EDITPipeline¶
This wiki page specifies interfaces for importing data from the Catalogue of Life (CoL) into the EDIT platform.
MS9 - CoL to EDIT Interface Specification available¶
Prerequisites and Requirements¶
ETI has implemented a first version of a service that exports the complete CoL data, or a subset of it, in a format agreed on by the i4Life Global Partners. The service produces Darwin Core Archives (DwC-A) and can be accessed at the following URL:
http://dev.4d4life.eu/dca_export/
The Catalogue of Life to EDIT pipeline should be implemented as a DwC-A import procedure. It is therefore crucial for its success that the service delivers the DwC-A data in a consistent manner. The flow of data will be as follows:
Mapping CoL specific DwC-A data to CDM¶
The mapping of CoL-DwC-A datatypes is based on source:/trunk/cdmlib-apps/cdmlib-col/format/meta.tpl.txt as well as some additions from the CoL-DwC-A taskgroup meeting on July 13th, 2011.
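A Darwin Core Archive declares which column holds which term in its meta.xml descriptor, so an import can read that assignment before applying the mapping. The following is a minimal Python sketch of that step, not the CDM library's implementation. The sample descriptor, the file names `taxa.txt` and `distribution.txt`, and the function name `read_table_descriptors` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text descriptor (meta.xml), in Clark notation.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"

# Hypothetical, heavily shortened descriptor used only for this example.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon" ignoreHeaderLines="1">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
  </core>
  <extension rowType="http://rs.gbif.org/terms/1.0/Distribution">
    <files><location>distribution.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/locationID"/>
  </extension>
</archive>"""


def read_table_descriptors(meta_xml):
    """Return one dict per <core>/<extension> element of a meta.xml string."""
    root = ET.fromstring(meta_xml)
    tables = []
    for element in root:
        kind = element.tag.replace(DWC_TEXT_NS, "")
        if kind not in ("core", "extension"):
            continue
        # The core uses <id>, extensions use <coreid> for the linking column.
        id_element = element.find(DWC_TEXT_NS + ("id" if kind == "core" else "coreid"))
        id_index = int(id_element.get("index")) if id_element is not None else None
        columns = {}
        for field in element.findall(DWC_TEXT_NS + "field"):
            if field.get("index") is not None:
                # Keep only the short term name, e.g. "scientificName".
                columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
        tables.append({
            "kind": kind,
            "row_type": element.get("rowType"),
            "file": element.findtext(DWC_TEXT_NS + "files/" + DWC_TEXT_NS + "location"),
            "id_index": id_index,
            "columns": columns,
        })
    return tables


if __name__ == "__main__":
    for table in read_table_descriptors(SAMPLE_META):
        print(table["kind"], table["file"], table["columns"])
```

Reading the column assignment from the descriptor rather than hard-coding positions keeps an import independent of the column order chosen by the export service.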
Taxonomic Core¶
DwC-A | DwC-A Notes | CDM | CDM Notes |
---|---|---|---|
dwc:taxonID | CoL taxon id | Taxon\|Synonym.sources.idInSource | |
dc:identifier | LSID | Taxon\|Synonym.lsid | |
dwc:datasetID | CoL source database id | TaxonNode.classification | only relevant for accepted taxa |
dwc:datasetName | Short name of source database plus CoL credits | TaxonNode.classification | this information will go with the classification. The exact fields have not been decided yet |
dwc:acceptedNameUsageID | CoL taxon id of accepted taxon (relevant for synonyms only) | Synonym.synonymRelations.relatedTo | |
dwc:parentNameUsageID | CoL taxon id of parent taxon (relevant for valid taxa only) | TaxonNode.parent | Parent-child relations are mapped through Classification/TaxonNodes in CDM |
dwc:taxonomicStatus | Species 2000 status | for taxa: TaxonNode.taxon.name.nomenclaturalCode.acceptedTaxonStatusLabel; for misapplied names: "misapplied"; for synonyms: "((homo\|hetero)?typic )?synonym" depending on the relationship type; "invalid" for zoological "synonyms" | |
dwc:taxonRank | Full, accurate rank, regardless of whether it is infraspecific or not. "infraspecies" is only used if the exact infraspecific rank is unknown | TaxonNode.taxon.name.rank | |
dwc:verbatimTaxonRank | Exact marker used in the scientific name, e.g. "var." or "subsp.", or nothing in the case of zoological names | NO MATCH; will be used for processing only | |
dwc:scientificName | Complete scientific name, including subspecific marker where appropriate | TaxonNode.taxon.name.titleCache | use only when atomized fields are not available or inconsistent |
dwc:kingdom | Top level group; listed as kingdom but may be interpreted as domain or superkingdom. The following eight groups are recognized: Animalia, Archaea, Bacteria, Chromista, Fungi, Plantae, Protozoa, Viruses | TaxonNode.parent(*)[.taxon.name.rank=kingdom].taxon.name.titleCache | will only be used depending on rank and Nomenclatural Code (e.g. family in botany will be read when taxon is of rank family, but not if it is a genus) |
dwc:phylum | Phylum in which the taxon has been classified | see kingdom | see kingdom |
dwc:class | Class in which the taxon has been classified | see kingdom | see kingdom |
dwc:order | Order in which the taxon has been classified | see kingdom | see kingdom |
dwc:family | Family in which the taxon has been classified | see kingdom | see kingdom |
dwc:genus | Genus in which the taxon has been classified | see kingdom | see kingdom |
dwc:subgenus | Subgenus in which the taxon has been classified | see kingdom | see kingdom |
dwc:specificEpithet | Specific epithet; for hybrids, the multiplication symbol is included in the epithet | TaxonNode.taxon.name.specificEpithet | |
dwc:infraspecificEpithet | Infraspecific epithet | TaxonNode.taxon.name.infraSpecificEpithet | |
dwc:scientificNameAuthorship | Authorship | TaxonNode.taxon.name.authorshipCache | |
dc:source | Acceptance status published in | -- | not completely clear yet |
dwc:namePublishedIn | Reference in which the scientific name was first published | TaxonNode.taxon.name.nomenclaturalReference | |
dwc:nameAccordingTo | Taxon scrutinized by | Taxon.credits | |
dc:modified | Scrutiny date | -- | Not decided yet. |
dc:description | Additional data for the taxon | TaxonNode.taxon.annotation | |
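The dwc:taxonomicStatus row decides whether a core record becomes an accepted taxon, a synonym or a misapplied name. The sketch below shows one way that decision could be made in Python, reusing the synonym pattern from the table. It is an illustration only: the function name, the return labels and the assumption that accepted taxa carry the label "accepted" or "valid" are not taken from the specification.

```python
import re

# Pattern for synonym labels as given in the dwc:taxonomicStatus mapping row.
SYNONYM_PATTERN = re.compile(r"((homo|hetero)?typic )?synonym")

# ASSUMPTION: labels used for accepted taxa; the real labels depend on the
# nomenclatural code and are not listed in this specification.
ACCEPTED_LABELS = {"accepted", "valid"}


def classify_taxonomic_status(status):
    """Map a dwc:taxonomicStatus value to a coarse record type."""
    value = (status or "").strip().lower()
    if value == "misapplied":
        return "misapplied"
    # "invalid" is the label for zoological "synonyms" in the mapping table.
    if value == "invalid" or SYNONYM_PATTERN.fullmatch(value):
        return "synonym"
    if value in ACCEPTED_LABELS:
        return "accepted"
    # Anything else is reported instead of being guessed at.
    return "unrecognized"


if __name__ == "__main__":
    for label in ("heterotypic synonym", "invalid", "valid", "misapplied", ""):
        print(repr(label), "->", classify_taxonomic_status(label))
```

Returning "unrecognized" for unknown labels, rather than defaulting to accepted, makes inconsistencies in the delivered data visible during import.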
Extensions¶
The following extensions are used by the CoL-DwC-A.
Distribution¶
| http://rs.gbif.org/extension/gbif/1.0/distribution.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=3:133 |
DwC-A | Notes | CDM | CDM Notes |
---|---|---|---|
dwc:coreid | Original id prefixed with the standard, e.g. tdwg:AGE-BA; eez:polish; fao:18, iso3166-1-alpha-2:SN. Is left empty in case distribution is taken from free text string | Taxon.uuid | |
dwc:locationID | namespace prefix : ID example (tdwg:AGE-BA; tdwg:AND; eez:polish; fao:18) | Distribution.area.id | |
dwc:locality | verbatim string, example (Buenos Aires, Argentina; Andaman Islands; Polish Exclusive Economic Zone; FAO fishing area 18) | Distribution.area.label | |
dwc:occurrenceStatus | Distribution status (not yet implemented; reserved for a future edition) | -- | see the Miscellaneous section |
dwc:establishmentMeans | The process by which the taxon became established (not yet implemented; reserved for a future edition) | -- | see the Miscellaneous section |
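The dwc:locationID values combine a namespace prefix and an identifier, so an import has to split them before looking up an area. A minimal Python sketch of that split, using the example values from the table, could look like this. The function name and the handling of values without a prefix are assumptions.

```python
def split_location_id(location_id):
    """Split a value such as 'tdwg:AGE-BA' into (prefix, identifier).

    Returns (None, None) for an empty value, which the table allows when the
    distribution was taken from a free text string.
    """
    text = (location_id or "").strip()
    if not text:
        return (None, None)
    prefix, separator, identifier = text.partition(":")
    if not separator:
        # ASSUMPTION: a value without ':' is treated as an identifier
        # that has no namespace prefix.
        return (None, text)
    return (prefix.strip(), identifier.strip())


if __name__ == "__main__":
    for value in ("tdwg:AGE-BA", "eez:polish", "fao:18", "iso3166-1-alpha-2:SN", ""):
        print(repr(value), "->", split_location_id(value))
```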
Reference¶
| http://rs.gbif.org/extension/gbif/1.0/references.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=9:295 |
The reference dataset will contain a list of all references (the entire bibliography for a name), atomized to some degree.
| DwC-A | Notes | CDM | CDM Notes |
|---|---|---|---|
| dwc:coreid | | Taxon.uuid | |
| dc:creator | Author | Reference.author | |
| dc:date | Year | Reference.datePublished | |
| dc:title | Title | Reference.title | |
| dc:description | Published In | Reference.referenceType? + reference.titleCache | this has to be evaluated within the complete datasets |
| dc:identifier | URI | -- | has yet to be decided |
| dc:type | can be used to specify the type of reference (nomenclature,taxonomicStatus,vernacularName,distribution,...) | UNUSED | CDM uses type of reference differently. References may be reused in different scenarios. |
Species Profile¶
http://rs.gbif.org/extension/gbif/1.0/speciesprofile.xml
| DwC-A | Notes | CDM | CDM Notes |
|---|---|---|---|
| dwc:coreid | | Taxon.uuid | |
| dwc:habitat | Life zone (not yet implemented; reserved for a future edition). The values comprise: marine, terrestrial, brackish, freshwater, unknown (vocabulary='http://www.catalogueoflife.org/dwc/habitats-classification-scheme') | Taxon.description.textData with Feature "Habitat" | Not decided yet. Currently, "Habitat" is a description feature that is connected to TextData, thus allowing pure text entry only. This has to be reconsidered, because dwc:habitat comes with a controlled vocabulary |
Vernacular Names¶
| http://rs.gbif.org/extension/gbif/1.0/vernacularname.xml | <==> | http://cybertaxonomy.org/cdm/latest/index.htm?goto=3:130 |
| DwC-A | Notes | CDM | CDM Notes |
|---|---|---|---|
| dwc:coreid | | Taxon.uuid | |
| dwc:vernacularName | Vernacular name | CommonTaxonName.name | |
| dc:language | Language | CommonTaxonName.language | |
| dwc:countryCode | Country in which the vernacular name is used | CommonTaxonName.area<WaterbodyOrCountry>.iso3166_A2 | It will be necessary to introduce a vocabulary for the countryCode; WaterbodyOrCountry is not exactly iso3166_A2 |
| dwc:locality | Region in which the vernacular name is used | CommonTaxonName.area.label | |
| dc:transliteration | Transliteration | CommonTaxonName.annotations | |
Metadata¶
Although metadata is currently not needed for making the CoL available in the CDM, it might be of interest in the future. The CoL-DwC-A taskgroup decided to include metadata in EML format. Unfortunately, the CDM is not capable of processing EML data at the moment. Therefore, the EML data will be stored alongside the secundum reference of the CoL data.
Processing¶
This section describes the implementation details of the CoL-DwC-A import.
Dataflow¶
All dataflow from CoL-DwC-A to CDM should be implemented as streams.
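As an illustration of what stream-based processing can mean here, the following Python sketch reads a data file from a zipped archive one row at a time, without loading the whole file into memory. It is a sketch under assumptions, not the CDM import code: the function names, the file name `taxa.txt`, the tab delimiter and the placeholder names in the sample data are all hypothetical.

```python
import csv
import io
import zipfile


def stream_rows(archive, member, delimiter="\t", skip_header_lines=1):
    """Yield one list of column values per data line of `member` in `archive`.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter=delimiter, quoting=csv.QUOTE_NONE)
            for line_number, row in enumerate(reader):
                # Skip header lines and empty lines; rows are produced lazily.
                if line_number >= skip_header_lines and row:
                    yield row


def build_sample_archive():
    """Create a tiny in-memory archive with made-up content for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("taxa.txt", "taxonID\tscientificName\n1\tAus bus\n2\tAus cus\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for row in stream_rows(build_sample_archive(), "taxa.txt"):
        print(row)
```

Because `stream_rows` is a generator, downstream steps can consume and persist records in batches while the file is still being read.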
Core Data¶
Import of the core data is similar to the already existing source:/trunk/cdmlib/cdmlib-io/src/main/java/eu/etaxonomy/cdm/io/excel/taxa import in the CDM.
As can be seen in the mapping of the taxonomic core data in DwC-A, the level of atomization in the CDM is much higher and spans multiple classes. To handle this situation effectively, a class should be implemented that functions as a bucket for unstructured data. This unstructured data will then be processed into CDM objects and persisted to the datastore.
This way, synergies for further development are created.
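The bucket idea can be sketched as a small container that holds one raw core row and answers simple questions about it before any CDM objects are built. The Python example below is only an illustration of that design. The class name `RawTaxonRecord`, its methods and the sample values are hypothetical, and the rules it applies are taken from the notes in the core mapping table.

```python
from dataclasses import dataclass, field


@dataclass
class RawTaxonRecord:
    """Bucket for one unstructured core row, keyed by Darwin Core term name."""

    values: dict = field(default_factory=dict)

    def get(self, term):
        """Return the trimmed value for `term`, or None if missing or blank."""
        value = (self.values.get(term) or "").strip()
        return value or None

    def is_synonym(self):
        # Per the mapping table, acceptedNameUsageID is relevant for synonyms only.
        return self.get("acceptedNameUsageID") is not None

    def name_parts(self):
        """Prefer atomized name fields; fall back to scientificName otherwise."""
        terms = ("genus", "specificEpithet", "infraspecificEpithet")
        parts = [self.get(term) for term in terms]
        if parts[0] and parts[1]:
            return {"atomized": True, "parts": [p for p in parts if p]}
        return {"atomized": False, "full_name": self.get("scientificName")}


if __name__ == "__main__":
    record = RawTaxonRecord({"taxonID": "1", "genus": "Aus", "specificEpithet": "bus"})
    print(record.is_synonym(), record.name_parts())
```

Keeping the raw values in one place means the later conversion into several structured objects can be written and tested separately from file parsing.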
Extensions¶
As can be seen in the mapping, DwC-A extension data are not as diverse as the core data and generally have a direct analogy to a CDM data type. A routine for every DwC-A extension type will be implemented that translates the data into CDM datatypes and links it to the corresponding record as specified by the dwc:coreid field.
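One routine per extension type, linked to the core through dwc:coreid, could be organised as a small registry keyed by the extension's row type. The Python sketch below illustrates that structure with plain dictionaries standing in for CDM objects. The function names, the dictionary layout and the sample data are assumptions; only the field names come from the extension tables.

```python
# Registry of translation routines, keyed by DwC-A extension row type.
HANDLERS = {}


def handles(row_type):
    """Register the decorated function as the routine for one row type."""
    def register(function):
        HANDLERS[row_type] = function
        return function
    return register


@handles("http://rs.gbif.org/terms/1.0/VernacularName")
def import_vernacular(row):
    return {"kind": "common_name",
            "name": row.get("vernacularName"),
            "language": row.get("language")}


@handles("http://rs.gbif.org/terms/1.0/Distribution")
def import_distribution(row):
    return {"kind": "distribution",
            "area_id": row.get("locationID"),
            "area_label": row.get("locality")}


def attach_extensions(taxa_by_id, row_type, rows):
    """Translate extension rows and attach each to the taxon named by coreid.

    Returns the rows whose coreid matches no known taxon.
    """
    handler = HANDLERS[row_type]
    unmatched = []
    for row in rows:
        taxon = taxa_by_id.get(row.get("coreid"))
        if taxon is None:
            unmatched.append(row)
            continue
        taxon.setdefault("facts", []).append(handler(row))
    return unmatched


if __name__ == "__main__":
    taxa = {"1": {"name": "Aus bus"}}  # made-up placeholder taxon
    rows = [{"coreid": "1", "vernacularName": "example name", "language": "en"}]
    attach_extensions(taxa, "http://rs.gbif.org/terms/1.0/VernacularName", rows)
    print(taxa)
```

Collecting unmatched rows instead of raising an error lets an import report dangling coreid values in one pass.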
Miscellaneous¶
dwc:occurrenceStatus and dwc:establishmentMeans¶
Currently there is no real analogy for these two fields in the CDM, as distribution in the CDM does not make assumptions about quantities but only about qualities of distribution. A new CDM.occurrenceStatus vocabulary will be created that maps the two vocabularies into one.
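How the two fields are merged has not been specified. Purely as an illustration, the Python sketch below combines the two optional values into a single label that could serve as a key into such a combined vocabulary. The function name, the label format and the sample values are assumptions, not part of the planned vocabulary.

```python
def combined_occurrence_status(occurrence_status, establishment_means):
    """Merge the two optional DwC values into one label.

    Returns None if both values are empty. The "status: means" format is a
    hypothetical choice made for this example.
    """
    parts = []
    for value in (occurrence_status, establishment_means):
        cleaned = (value or "").strip().lower()
        if cleaned:
            parts.append(cleaned)
    return ": ".join(parts) if parts else None


if __name__ == "__main__":
    print(combined_occurrence_status("present", "native"))
    print(combined_occurrence_status("", "Introduced"))
    print(combined_occurrence_status(None, None))
```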
Implementation¶
CoL-DwC(A) to CDM Import¶
Details related to the import of CoL DwC(A) data to the CDM can be found here.
Web Service Specification¶
The web service specification for accessing CoL data through the CDM API is described here.
Web Service Development¶
The current status of development of the web services can be found here.
Updated by Andreas Müller almost 2 years ago · 62 revisions