task #8792: Issues to have in mind for ERMS import - EDIT - EDIT Project Management

task #8792

Updated by Andreas Müller over 4 years ago

The functioning of the ERMS import (and PESI export) is explained at: [[Erms_mapping]] 

 ## General information 

 Old databases are available at BGBM-PESISQL\SQLExpress. 

 Many issues have been handled in #1444. However as the ERMS-CDM is complex in often difficult there are many issues to take care for during upcoming imports. 

 ## Data check before import 

 * Check that the rule still is true that synonyms are directly linked to their accepted taxon and not via recursion (at least for field tu_accfinal (which has only less NULL values then tu_accfinal but never differs from tu_acctaxon in value) 

 ~~~ sql 
 SELECT    * 
 FROM [tu] INNER JOIN tu acc ON tu.tu_acctaxon = acc.id 
 WHERE tu.tu_accfinal <> tu.id AND acc.id <> acc.tu_accfinal 
 ~~~ 

 * tu_acctaxon should not have relevant information 

 ~~~ sql 
 SELECT * 
 FROM tu 
 WHERE tu_acctaxon is NOT NULL AND    tu_accfinal IS NULL OR tu_acctaxon <>    tu_accfinal  
 ~~~ 

 * Ranks 

   kingdom is identified by first number in tu_sp. 
   Only the kingdoms 2-7 should are supported, others should not exist and the code may not work 

   ~~~ sql 
   SELECT * 
   FROM tu 
   WHERE tu.tu_sp not like '#2#%' AND tu.tu_sp not like '#3#%' AND tu.tu_sp not like '#4#%' AND tu.tu_sp not like '#5#%' AND tu.tu_sp not like '#6#%' AND tu.tu_sp not like '#7#%' 
   ~~~ 

 * Run ErmsImportActivator in CHECK_ONLY mode (and move above queries to validation) 

 ## open issues 

 * Misapplications 

 Currently misapplications are identified by a UNION statement in ErmsTaxonImport.getAcceptedTaxaKeys(). It searches for misidentification patterns in tu.tu_unacceptreason . We also tried to evaluate the tu_authority field but searching for "sensu", "non", "auct". But realized that all these patterns are not used specific enough in ERMS. "Auct" is often simply used for misspellings, "sensu" is used for unassessed taxa, genus transfer, sensu in the meaning of sec and others or simply unclear. "Non" is generally to unspecific as it can be used for homonyms, too. 
 So only the above query is possible but discussion with VLIZ is needed on data cleaning or standardization. Maybe even an own tu_status for misapplications is possible. 
 Also the status of the following taxa should be checked again: 

 ~~~ sql 
 SELECT * 
 from    tu 
 where tu.tu_unacceptreason like 'misidentification%' AND not (trim(tu.tu_unacceptreason) like 'misidentification' OR trim(tu.tu_unacceptreason) like 'misidentifications') 
 ~~~ 

 * Name status 

 The name status is only guessed by the value in tu_unacceptreason. As this field is not standardized this is not exact. The current patterns are mostly taken from the SQL script + some obvious further mappings. 
 A further discussion with VLIZ about the semantics of some values is necessary for further mapping (especially on ISZN related issues). Start with values with high counts! 

 * Multiple open issues to be hanled by or discussed with VLIZ, mentioned in email to VLIZ 2019-12-23 

 * the status of taxa with status "alternate representation" should be discussed with VLIZ (and mayb ERS) 

 * interim published taxa 

 ~~~ sql 
 SELECT * 
 FROM    tu 
 WHERE tu_status = 9 
 ~~~ 

 None, or only very few should exist. Open question: should they be exported at all? 

 * type designations without type 

 ~~~ sql 
 SELECT *  
 FROM tu WHERE tu_typedesignation IS NOT NULL  
   AND tu_typetaxon IS NULL 
 ORDER BY tu_rank, tu_typedesignation 
 ~~~ 

 These have a typedesignation (by original designation, by subsequent designation, ...) but no name. All are of rank subgenus or higher (and therfore should have a name type, not a specimen type). 
 Current count = 3177. 

 Possible explanations: name types are not European or missing data (not discussed with VLIZ yet) 

 * distributions attached to synonyms 

 ~~~ sql 
 SELECT    v.*, t.tu_status, t.tu_displayname, t.id, t.tu_accfinal 
 FROM dr v 
 INNER JOIN tu t ON v.tu_id = t.id 
 WHERE    t.tu_accfinal <> t.id 
 ORDER BY    tu_status 
 ~~~ 

 very few distributions are attached to synonyms (n=23), this was 0 before. Check with VLIZ if this is dirty data. 

 ## Data cleaning: 

 * tu_accfinal IS NULL 

   ~~~ sql 
   SELECT * 
   FROM tu 
   WHERE tu_accfinal IS NULL 
   ~~~ 

   should be checked. VLIZ should fix these (n=479). Currently they are handled as accepted taxa. 

 * synonyms being the accepted taxon of there parent 

 ~~~ sql 
 SELECT * 
 FROM tu syn 
 LEFT JOIN tu AS accTaxon ON myTaxon.tu_accfinal = accTaxon.id 
 WHERE    syn.tu_accfinal IS NOT NULL AND myTaxon.id = accTaxon.tu_parent AND accTaxon.id <> myTaxon.id  
 ~~~ 

 These are often alternate representations (children) or autonyms (parents). They are explicitly handled in ErmsTaxonImport.getAcceptedTaxaKeys() 

 Note: accTaxon.id <> myTaxon.id is to remove Biota from query 

 * tu_status 

 There should be no synonyms with tu_status == 1 

 ~~~ sql 
 SELECT    * 
 FROM [tu] INNER JOIN tu acc ON tu.tu_acctaxon = acc.id 
 WHERE tu.tu_acctaxon <> tu.id AND tu.tu_status = 1 
 ~~~ 

 There should be no accepted taxa with status 2,3,5 (unaccepted, nomen nudem, alternate representation) 

 ~~~ sql 
 SELECT    * 
 FROM [tu] INNER JOIN tu acc ON tu.tu_acctaxon = acc.id 
 WHERE tu.tu_acctaxon = tu.id AND tu.tu_status NOT IN (1, 7, 10, 8, 9, 6) 
 ORDER BY tu.tu_status 
 ~~~ 

 ## other open issues 

 * is there any meaning in the tu.tsn attribute? 
 * trim on notes, links, and others (difficult as they are often ntext, but sortableNote can be used)

Back

Project

General

Profile

EDIT

task #8792