task #7798
open[AM, ERS]: Import (freetext) nameInSource correctly
100%
Description
For occurrence sources and maybe also for common names and facts.
Occurrences:
- case: freetext exists parallel to link => if nameCache or fullnameCache are exactly equal then neglect freetext, otherwise store it parallel to name link; a list has been sent to ERS to clean up these cases (see #7798#note-4)
freetext is only name
a. name can not be found for any Name.nameCache => we could create new TaxonName for this, but will not as it polutes the DB, we better try to clean up theses cases over time (see #7798#note-6)
b. >1 name is mentioned, separated by '/'. We better clone these sources (now or later)
c. a name exists in DB, we try to find the best matching name by first searching in the synonymy and, if not exists use the single matching name, if >1 matching names exists, log the case and use the first matching name (prefer name without author)
Most important for import: try to reduce the number of names that only come via occurrence import. These are
- "Orphaned name: A similar name" : names that do not belong to any taxon but a matching name was found in "synonymy" * Decide for invalid designations (why do they show up?) and orthographic variants
- "Orphaned name: A similar misapplied name" : like above but for misapplications. * Decide on all
- TBC
open issues:
- test code
- test for facts and common names
Related issues
Updated by Andreas Müller over 5 years ago
- Target version changed from Unassigned CDM tickets to Euro+Med Migration
Updated by Andreas Müller over 5 years ago
- Subject changed from Import freetext nameInSource correctly to AM: Import freetext nameInSource correctly
Updated by Andreas Müller over 5 years ago
im Berlin Model gibt es ja 2 Felder in die man den Namen, der in der Quelle für eine Verbreitungsangabe verwendet wurde, eintragen. Einmal als Freitext, das andere Mal als Link zu einem existierenden Namen.
In einigen Fällen sind beide Felder gefüllt, weichen aber (leicht) voneinander ab. Siehe Anhang.
Kannst du die bereinigen bzw. eine Regel angeben, wie sie behandelt werden sollen?
SELECT os.OccurrenceSourceId, pt.PTRefFk, n.fullNameCache, os.OldName, os.OldNameFk,oldN.nameCache, oldN.FullNameCache, n.nameCache, ar.Unit area, pt.RIdentifier, sumcat.emOccurSumCatId, sumcat.Short
FROM emOccurrence occ
INNER JOIN PTaxon pt ON occ.PTNameFk = pt.PTNameFk AND occ.PTRefFk = pt.PTRefFk
INNER JOIN Name n ON n.NameId = pt.PTNameFk
INNER JOIN emOccurrenceSource os ON os.OccurrenceFk = occ.OccurrenceId
LEFT OUTER JOIN Name oldN ON oldN.NameId = os.OldNameFk
INNER JOIN emArea ar ON occ.AreaFk = ar.AreaId
LEFT OUTER JOIN emOccurSumCat sumcat ON occ.SummaryStatus = sumcat.emOccurSumCatId
WHERE occ.occurrenceId IN ( SELECT occurrenceId FROM v_cdm_exp_occurrenceAll )
AND ( OldName IS NOT NULL AND oldNameFk IS NULL OR OldName IS NOT NULL AND oldN.NameCache <> OldName )
AND oldN.nameCache IS NOT NULL
ORDER BY pt.PTRefFk, oldN.nameCache
Updated by Andreas Müller over 5 years ago
In einer nicht ganz kleinen Anzahl Fälle, für die der Name NUR als Freitext vorliegt, nicht als Link, wurde ein entsprechender Name in der Datenbank nicht gefunden.
Dies kann ein Hinweis darauf sein, dass eine abweichende Rechtschreibung vorliegt oder dass er in der Synonymie nicht vorkommt.
SELECT os.OccurrenceSourceId, pt.PTRefFk, n.fullNameCache, os.OldName, os.OldNameFk,oldN.nameCache, oldN.FullNameCache, n.nameCache, ar.Unit area, pt.RIdentifier, sumcat.emOccurSumCatId, sumcat.Short
FROM emOccurrence occ
INNER JOIN PTaxon pt ON occ.PTNameFk = pt.PTNameFk AND occ.PTRefFk = pt.PTRefFk
INNER JOIN Name n ON n.NameId = pt.PTNameFk
INNER JOIN emOccurrenceSource os ON os.OccurrenceFk = occ.OccurrenceId
LEFT OUTER JOIN Name oldN ON oldN.NameId = os.OldNameFk
INNER JOIN emArea ar ON occ.AreaFk = ar.AreaId
LEFT OUTER JOIN emOccurSumCat sumcat ON occ.SummaryStatus = sumcat.emOccurSumCatId
WHERE occ.occurrenceId IN ( SELECT occurrenceId FROM v_cdm_exp_occurrenceAll )
AND ( OldName IS NOT NULL AND oldNameFk IS NULL)
AND OldName not like '%/%'
AND OldName NOT IN (SELECT DISTINCT NameCache FROM Name n1 WHERE nameCache IS NOT NULL)
AND occ2.SummaryStatus IS NOT NULL)
ORDER BY pt.PTRefFk, oldN.nameCache, os.OccurrenceSourceId
Updated by Andreas Müller over 5 years ago
- Description updated (diff)
- % Done changed from 0 to 20
Updated by Andreas Müller over 5 years ago
- Subject changed from AM: Import freetext nameInSource correctly to [CHECK]: Import freetext nameInSource correctly
- Status changed from New to In Progress
Updated by Andreas Müller over 5 years ago
- Description updated (diff)
- % Done changed from 20 to 40
Updated by Andreas Müller over 5 years ago
- Estimated time changed from 2:00 h to 1:00 h
Updated by Andreas Müller over 5 years ago
- Related to bug #7406: [CHECK] Check if all original source names of occurrences are correctly imported added
Updated by Andreas Müller over 5 years ago
- Subject changed from [CHECK]: Import freetext nameInSource correctly to [CHECK]: Import (freetext) nameInSource correctly
Updated by Andreas Müller over 5 years ago
- Subject changed from [CHECK]: Import (freetext) nameInSource correctly to [AM, ERS]: Import (freetext) nameInSource correctly
- Description updated (diff)
All names imported only by occurrences: (n=1435)
SELECT n.nameId, n.nameCache, n.FullNameCache, n.uuid, count(*) as n
FROM dbo.v_cdm_exp_namesOccurrenceSource v LEFT OUTER JOIN Name n ON n.NameID = v.NameId
WHERE v.NameId NOT IN
(SELECT NameId
FROM Name
WHERE (NameId IN (SELECT PTNameFk AS NameId FROM dbo.v_cdm_exp_taxaAll)) OR
(NameId IN
(SELECT NameId
FROM dbo.v_cdm_exp_namesRelatedTo)) OR
(NameId IN
(SELECT NameId
FROM dbo.v_cdm_exp_namesRelatedFrom)) OR
(NameId IN
(SELECT NameId
FROM dbo.v_cdm_exp_namesCommonNameSource)) OR
(NameId IN (7502960, 7709217, 7502034, 28349))
)
GROUP BY n.nameId, n.nameCache, n.FullNameCache, n.uuid
ORDER BY n DESC
Updated by Andreas Müller about 5 years ago
ERS decided to use synonymie names whereever possible. Equalness of existence of authors is not so important.
With this about 900 names remained as names not existing in current import but linked by occSources. These we explicitly added to the import via id in a separate query.
Updated by Andreas Müller about 5 years ago
- % Done changed from 40 to 90
Eckhard noch fragen, was mit den Namen geschehen soll, bei denen LinkedName und FreeTextName nicht übereinstimmen.
Dann kann das Ticket wohl geschlossen werden.
Updated by Andreas Müller about 5 years ago
- Status changed from In Progress to Resolved
- % Done changed from 90 to 100
Andreas Müller wrote:
Eckhard noch fragen, was mit den Namen geschehen soll, bei denen LinkedName und FreeTextName nicht übereinstimmen.
Dann kann das Ticket wohl geschlossen werden.
Entscheidung: werden im CDM bearbeitet, grundsätzlich sollen im die wirklichen Originalschreibweisen gespeichert werden, daher müssen diese Fälle manuell aufgeräumt werden, was im CDM leichter ist (im BM wird in diesen Fällen nur die Freitext-Variante angezeigt).
Put this to resolved but do not close yet to remind to cleanup after import. Mail in post-migration folder exists. Also remember to decide on pure freetext names. Should they be created as names or should they be kept as freetext. Mail also exists in post-migration.