Project

General

Profile

Actions

task #7798

open

[AM, ERS]: Import (freetext) nameInSource correctly

Added by Andreas Müller over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Highest
Category:
cdmadapter
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
1:00 h
Severity:
normal
Tags:

Description

For occurrence sources and maybe also for common names and facts.

Occurrences:

  1. case: freetext exists parallel to link => if nameCache or fullnameCache are exactly equal then neglect freetext, otherwise store it parallel to name link; a list has been sent to ERS to clean up these cases (see #7798#note-4)
  2. freetext is only name

    a. name can not be found for any Name.nameCache => we could create new TaxonName for this, but will not as it polutes the DB, we better try to clean up theses cases over time (see #7798#note-6)
    b. >1 name is mentioned, separated by '/'. We better clone these sources (now or later)
    c. a name exists in DB, we try to find the best matching name by first searching in the synonymy and, if not exists use the single matching name, if >1 matching names exists, log the case and use the first matching name (prefer name without author)

Most important for import: try to reduce the number of names that only come via occurrence import. These are

  • "Orphaned name: A similar name" : names that do not belong to any taxon but a matching name was found in "synonymy" * Decide for invalid designations (why do they show up?) and orthographic variants
  • "Orphaned name: A similar misapplied name" : like above but for misapplications. * Decide on all
  • TBC

open issues:

  • test code
  • test for facts and common names

Related issues

Related to EDIT - bug #7406: [CHECK] Check if all original source names of occurrences are correctly importedResolvedAndreas Müller

Actions
Actions #2

Updated by Andreas Müller over 5 years ago

  • Target version changed from Unassigned CDM tickets to Euro+Med Migration
Actions #3

Updated by Andreas Müller over 5 years ago

  • Subject changed from Import freetext nameInSource correctly to AM: Import freetext nameInSource correctly
Actions #4

Updated by Andreas Müller over 5 years ago

im Berlin Model gibt es ja 2 Felder in die man den Namen, der in der Quelle für eine Verbreitungsangabe verwendet wurde, eintragen. Einmal als Freitext, das andere Mal als Link zu einem existierenden Namen.

In einigen Fällen sind beide Felder gefüllt, weichen aber (leicht) voneinander ab. Siehe Anhang.

Kannst du die bereinigen bzw. eine Regel angeben, wie sie behandelt werden sollen?

 SELECT os.OccurrenceSourceId, pt.PTRefFk,  n.fullNameCache, os.OldName, os.OldNameFk,oldN.nameCache, oldN.FullNameCache, n.nameCache, ar.Unit area, pt.RIdentifier, sumcat.emOccurSumCatId, sumcat.Short
FROM emOccurrence occ
 INNER JOIN PTaxon pt ON occ.PTNameFk = pt.PTNameFk AND occ.PTRefFk = pt.PTRefFk
 INNER JOIN Name n ON n.NameId = pt.PTNameFk
 INNER JOIN emOccurrenceSource os ON os.OccurrenceFk = occ.OccurrenceId
 LEFT OUTER JOIN Name oldN ON oldN.NameId = os.OldNameFk
  INNER JOIN emArea ar ON occ.AreaFk = ar.AreaId 
 LEFT OUTER JOIN emOccurSumCat sumcat ON occ.SummaryStatus = sumcat.emOccurSumCatId
WHERE occ.occurrenceId IN ( SELECT occurrenceId FROM v_cdm_exp_occurrenceAll ) 
AND ( OldName IS NOT NULL AND oldNameFk IS NULL OR OldName IS NOT NULL AND oldN.NameCache <> OldName )
AND oldN.nameCache IS NOT NULL
ORDER BY pt.PTRefFk, oldN.nameCache
Actions #5

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
Actions #6

Updated by Andreas Müller over 5 years ago

In einer nicht ganz kleinen Anzahl Fälle, für die der Name NUR als Freitext vorliegt, nicht als Link, wurde ein entsprechender Name in der Datenbank nicht gefunden.

Dies kann ein Hinweis darauf sein, dass eine abweichende Rechtschreibung vorliegt oder dass er in der Synonymie nicht vorkommt.

 SELECT os.OccurrenceSourceId, pt.PTRefFk,  n.fullNameCache, os.OldName, os.OldNameFk,oldN.nameCache, oldN.FullNameCache, n.nameCache, ar.Unit area, pt.RIdentifier, sumcat.emOccurSumCatId, sumcat.Short
FROM emOccurrence occ
 INNER JOIN PTaxon pt ON occ.PTNameFk = pt.PTNameFk AND occ.PTRefFk = pt.PTRefFk
 INNER JOIN Name n ON n.NameId = pt.PTNameFk
 INNER JOIN emOccurrenceSource os ON os.OccurrenceFk = occ.OccurrenceId
 LEFT OUTER JOIN Name oldN ON oldN.NameId = os.OldNameFk
  INNER JOIN emArea ar ON occ.AreaFk = ar.AreaId 
 LEFT OUTER JOIN emOccurSumCat sumcat ON occ.SummaryStatus = sumcat.emOccurSumCatId
WHERE occ.occurrenceId IN ( SELECT occurrenceId FROM v_cdm_exp_occurrenceAll ) 
AND ( OldName IS NOT NULL AND oldNameFk IS NULL)
AND OldName not like '%/%'
AND OldName NOT IN (SELECT DISTINCT NameCache FROM Name n1 WHERE nameCache IS NOT NULL)
AND occ2.SummaryStatus IS NOT NULL)
 ORDER BY pt.PTRefFk, oldN.nameCache, os.OccurrenceSourceId
Actions #7

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
  • % Done changed from 0 to 20
Actions #8

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
Actions #9

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
Actions #10

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
Actions #11

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
Actions #12

Updated by Andreas Müller over 5 years ago

  • Subject changed from AM: Import freetext nameInSource correctly to [CHECK]: Import freetext nameInSource correctly
  • Status changed from New to In Progress
Actions #13

Updated by Andreas Müller over 5 years ago

  • Description updated (diff)
  • % Done changed from 20 to 40
Actions #14

Updated by Andreas Müller over 5 years ago

  • Estimated time changed from 2:00 h to 1:00 h
Actions #16

Updated by Andreas Müller over 5 years ago

  • Related to bug #7406: [CHECK] Check if all original source names of occurrences are correctly imported added
Actions #17

Updated by Andreas Müller over 5 years ago

  • Subject changed from [CHECK]: Import freetext nameInSource correctly to [CHECK]: Import (freetext) nameInSource correctly
Actions #18

Updated by Andreas Müller over 5 years ago

  • Subject changed from [CHECK]: Import (freetext) nameInSource correctly to [AM, ERS]: Import (freetext) nameInSource correctly
  • Description updated (diff)

All names imported only by occurrences: (n=1435)

SELECT   n.nameId, n.nameCache, n.FullNameCache, n.uuid, count(*) as n
FROM            dbo.v_cdm_exp_namesOccurrenceSource v LEFT OUTER JOIN Name n ON n.NameID = v.NameId 
WHERE v.NameId NOT IN 
(SELECT        NameId
FROM            Name
WHERE (NameId IN (SELECT PTNameFk AS NameId FROM dbo.v_cdm_exp_taxaAll)) OR
        (NameId IN
            (SELECT        NameId
            FROM            dbo.v_cdm_exp_namesRelatedTo)) OR
        (NameId IN
            (SELECT        NameId
            FROM            dbo.v_cdm_exp_namesRelatedFrom)) OR
        (NameId IN
            (SELECT        NameId
            FROM            dbo.v_cdm_exp_namesCommonNameSource)) OR
        (NameId IN (7502960, 7709217, 7502034, 28349))
)
GROUP BY n.nameId, n.nameCache, n.FullNameCache, n.uuid
ORDER BY n DESC
Actions #19

Updated by Andreas Müller about 5 years ago

ERS decided to use synonymie names whereever possible. Equalness of existence of authors is not so important.
With this about 900 names remained as names not existing in current import but linked by occSources. These we explicitly added to the import via id in a separate query.

Actions #20

Updated by Andreas Müller about 5 years ago

  • % Done changed from 40 to 90

Eckhard noch fragen, was mit den Namen geschehen soll, bei denen LinkedName und FreeTextName nicht übereinstimmen.
Dann kann das Ticket wohl geschlossen werden.

Actions #21

Updated by Andreas Müller about 5 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 90 to 100

Andreas Müller wrote:

Eckhard noch fragen, was mit den Namen geschehen soll, bei denen LinkedName und FreeTextName nicht übereinstimmen.
Dann kann das Ticket wohl geschlossen werden.

Entscheidung: werden im CDM bearbeitet, grundsätzlich sollen im die wirklichen Originalschreibweisen gespeichert werden, daher müssen diese Fälle manuell aufgeräumt werden, was im CDM leichter ist (im BM wird in diesen Fällen nur die Freitext-Variante angezeigt).

Put this to resolved but do not close yet to remind to cleanup after import. Mail in post-migration folder exists. Also remember to decide on pure freetext names. Should they be created as names or should they be kept as freetext. Mail also exists in post-migration.

Actions

Also available in: Atom PDF