Importing checklist data in DwC-A from CoL to the CDM Database
This wiki page deals with issues related to the import of data from CoL to the CDM Database.
Problems experienced when importing data from CoL
The metadata file in the CoL DwC-A (meta.xml) describes a species profile file (speciesprofile.txt) but the file does not exist in the CoL DwC-A zip.
Ranks are sometimes not consistent with the "DwC taxonRank":http://rs.tdwg.org/dwc/terms/taxonRank field and its suggested vocabulary. For example, record 1129 has the scientific name "Champia parvula var. Prostrata" with verbatimRank "var." but taxonRank "infraspecies". This value is not part of the recommended vocabulary, as can be seen "here
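A check of this kind could be sketched as follows. This is only a minimal sketch: the function name, the column name `verbatimTaxonRank`, and the vocabulary passed in are assumptions for illustration and are not taken from the CoL export. The real vocabulary has to be supplied by the caller.

```python
from collections import Counter


def rank_mismatches(rows, vocabulary):
    """Count (verbatim rank, taxonRank) pairs whose taxonRank is not in `vocabulary`.

    `rows` is an iterable of dicts keyed by column name (assumed names).
    An empty taxonRank is also counted as a mismatch.
    """
    allowed = {term.lower() for term in vocabulary}
    counts = Counter()
    for row in rows:
        rank = (row.get("taxonRank") or "").strip()
        if rank.lower() not in allowed:
            verbatim = (row.get("verbatimTaxonRank") or "").strip()
            counts[(verbatim, rank)] += 1
    return counts
```

Grouping by the pair rather than listing single records makes it easier to see which verbatim ranks are systematically mapped to an unexpected taxonRank value.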
Sometimes the scientificName field holds the pure name without authorship. This is against the DwC scientific name definition, which clearly states that the authorship should be part of the scientific name.
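One way to find such records, assuming the core file also carries a populated `scientificNameAuthorship` column (an assumption, not confirmed for the CoL export), is to flag rows where the authorship string does not occur in the scientific name. The function name is hypothetical.

```python
def names_missing_authorship(rows):
    """Yield rows whose scientificName does not contain the stated authorship.

    `rows` is an iterable of dicts keyed by DwC term name (assumed columns).
    Rows without any authorship value are skipped, since nothing can be compared.
    """
    for row in rows:
        authorship = (row.get("scientificNameAuthorship") or "").strip()
        name = (row.get("scientificName") or "").strip()
        if authorship and authorship not in name:
            yield row
```

This is a plain substring test, so differences in spacing or punctuation between the two fields would be reported as well and need manual review.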
The URL of the original source is provided by CoL in its web service, as can be seen in this example, but it is not available in the CoL DwC-A zip.
The file containing the references is stored as reference.txt but is referenced as references.txt (plural) in meta.xml. As a result, a reader that follows meta.xml cannot find the file.
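Mismatches between meta.xml and the zip contents, such as the missing speciesprofile.txt or the reference.txt/references.txt naming problem, can be detected before the import starts. The following is a minimal sketch using only the Python standard library. It assumes that meta.xml sits at the top level of the zip and uses the DwC text namespace `http://rs.tdwg.org/dwc/text/`. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the DwC-A descriptor (meta.xml), in ElementTree notation.
DWC_TEXT_NS = "{http://rs.tdwg.org/dwc/text/}"


def missing_archive_files(archive):
    """Return the file names declared in meta.xml that are absent from the zip.

    `archive` is a path or a binary file-like object holding the DwC-A zip.
    """
    with zipfile.ZipFile(archive) as zf:
        present = set(zf.namelist())
        root = ET.fromstring(zf.read("meta.xml"))
    # Every core and extension element lists its data files as <files><location>.
    declared = [
        loc.text.strip()
        for loc in root.iter(DWC_TEXT_NS + "location")
        if loc.text
    ]
    return [name for name in declared if name not in present]
```

An empty result means every declared file was found. Any name returned points to a file that is either missing or stored under a different name.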
http://purl.org/dc/terms/description is used for "Published in". This is incorrect as DwC-A offers http://purl.org/dc/terms/source for in-references. .../description is meant for abstracts, remarks, notes, etc.
http://purl.org/dc/terms/type is used to distinguish the "type of the reference; pertaining to taxon, synonym or vernacular name". Distinguishing taxon and synonym is redundant as this is already done in core.txt. Linking to vernacular is ambiguous, as there may be multiple vernacular names related to one taxon, so it is unclear for which vernacular name the reference is a source. E.g. taxon 6979482 (Crassostrea gigas (Thunberg, 1793)) has 25 vernacular names attached and there are 25 references of type "vernacular", most of them redundant. There is no clear indicator of which reference belongs to which name. One can only guess that the order of both is the same and use that order to assign the references. This is a general problem of DwC-A, as there is no way to attach references correctly to extension data.
Escaping quotation marks
Some records include quotation marks (") (e.g. line 163882, taxon_id = 2387923). At the same time, no "fieldsEnclosedBy" character is defined. This creates problems, as many CSV readers use (") as the default field encloser. It would be a good idea to first check, for every taxon containing a ("), whether the character is really intended. Records where it is intended should be corrected by escaping the quotation marks.
This is even more CRITICAL for records with an odd number of quotation marks. Here the reader keeps looking for the closing quotation mark, so the end-of-line character is not found and the line cannot be separated correctly, resulting in corrupt data. This urgently needs to be fixed!
This applies to all data, core and extensions.
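Until the data is corrected, a reader can be told to treat (") as an ordinary character, and the affected lines can be listed for review. The sketch below uses Python's csv module, whose default quote character is also ("). It assumes tab-separated files. In practice the delimiter should be taken from fieldsTerminatedBy in meta.xml. Both function names are hypothetical.

```python
import csv


def read_unquoted(lines, delimiter="\t"):
    """Parse delimited lines without giving '"' any special meaning."""
    return list(csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE))


def lines_with_odd_quotes(lines):
    """Return (line_number, line) pairs whose '"' count is odd.

    Line numbers start at 1, matching what a text editor would show.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if line.count('"') % 2 == 1
    ]
```

With `csv.QUOTE_NONE` an unbalanced quotation mark stays inside its own field and cannot swallow the following line breaks, so every physical line is still read as one record.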
It seems that the download by genus doesn't work correctly. This is because the genus field isn't used as defined by the "DwC genus":http://rs.tdwg.org/dwc/terms/#genus term. The field should contain the name of the genus taxon within the classification rather than the genus epithet of the name. These are usually the same for accepted taxa but differ for synonyms. CoL fills the genus field with the genus epithet, but its filter treats the field as the genus in the classification. However, the correct use of the genus field is currently under discussion and details can be found "here
- There are a number of taxa without a parent. Is this on purpose? These are the taxa:
Achaea, Aenetus, Anachloris, Apoctena, Aponotoreas, Arcola, Austrocidaria, Boldenaria, Chrysolarentia, Chrysorthenches, Ctenarchis, Danaus, Diarsia, Dipaustica, Epicyme, Epiphthora, Epyaxa, Erebiola, Gingidiobora, Grypotheca, Helastia, Heloxycanus, Heterocrossa, Heteroteucha, Holocola, Homodotis, Hygraula, Leucotenes, Microdes, Morosaphycita, Paranotoreas, Parienia, Pasiphila, Pasiphilodes, Percnodaimon, Phaeosaces, Phrissogonus, Polychrosis, Prepalla, Proternia, Proteroeca, Protithona, Protosynaema, Speiredonia, Stegommata, Teia, Tingena, Tirumala, Tmetolophota, Uraba, Xanadoses, Zizina
There are 409 taxa called "not assigned". These should be removed in the correct way.
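Both findings, the taxa without a parent and the "not assigned" placeholders, can be reproduced with a simple scan over the core file. The sketch below is only an illustration. The column names (`parentNameUsageID`, `taxonomicStatus`, `taxonRank`, `scientificName`), the assumption that only kingdoms may legitimately lack a parent, and the function names are assumptions and must be adapted to the actual export.

```python
def taxa_without_parent(rows, root_ranks=("kingdom",)):
    """Return non-synonym taxa that have no parent and are not at a root rank."""
    orphans = []
    for row in rows:
        status = (row.get("taxonomicStatus") or "").lower()
        if "synonym" in status:
            continue  # synonyms point to an accepted name, not to a parent
        rank = (row.get("taxonRank") or "").lower()
        parent = (row.get("parentNameUsageID") or "").strip()
        if not parent and rank not in root_ranks:
            orphans.append(row)
    return orphans


def placeholder_taxa(rows, placeholder="not assigned"):
    """Return rows whose scientificName contains the placeholder text."""
    return [
        row
        for row in rows
        if placeholder in (row.get("scientificName") or "").lower()
    ]
```

The rows can come from `csv.DictReader` or any other source that yields one dict per record.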
It is unclear what the sec reference of CoL data should be. Currently only a few nameAccordingTo fields are filled. Also, CoL documents nameAccordingTo as "last scrutinized", which is slightly different from what nameAccordingTo is meant for.