feature request #7801closed
AM: Deduplicate references
Many references are duplicates. We could try to deduplicate them during import.
- check for parameters like annotations and extensions
Updated by Andreas Müller almost 4 years ago
Find deduplicated records in CDM
SELECT osb.id, ref.id, osb.idInSource, osb.idNamespace, ref.refType, ref.titleCache, ref.authorship_id, ref.abbrevTitleCache, ref.protectedTitleCache, ref.protectedAbbrevTitleCache, ref.abbrevTitle, ref.title, ref.volume, ref.* FROM Reference ref LEFT OUTER JOIN Reference_OriginalSourceBase MN ON MN.Reference_id = ref.id LEFT OUTER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id INNER JOIN (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN INNER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id WHERE idNamespace <> 'import to Berlin Model' GROUP BY MN.Reference_id HAVING n > 1) as drvTab2 ON drvTab2.Reference_id = ref.id WHERE (1 = 1) -- AND idNamespace <> 'RefDetail' -- AND titleCache like '%unde%' -- AND ab.protectedTitleCache = false -- AND ab.id NOT IN (SELECT Team_id FROM AgentBase_AgentBase MM WHERE MM.Team_id IS NOT NULL) -- AND idInSOurce = '1' AND idInSource like '7712094' -- AND ref.id IN (SELECT Reference_id FROM (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN GROUP BY MN.Reference_id HAVING n > 1) as drvTab) /* AND titleCache IN (SELECT titleCache FROM ( SELECT r2.titleCache, r2.abbrevTitleCache, r2.authorship_id, count(*) n FROM Reference r2 GROUP BY r2.titleCache, r2.abbrevTitleCache, r2.authorship_id HAVING n > 1) as drv ) */ ORDER BY ref.titleCache, ref.id, length(idInSource), idInSource, ref.refType
Updated by Andreas Müller about 1 year ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
The import did run long time ago. Deduplication was done as far as it was done. As we can't change much here other then trying to further find duplicates and deduplicate them I think we can close this ticket.