Actions
feature request #7801
closedAM: Deduplicate references
Status:
Closed
Priority:
Highest
Assignee:
Category:
cdmadapter
Target version:
Start date:
Due date:
% Done:
100%
Estimated time:
15:00 h
Severity:
normal
Description
Many references are duplicates. We could try to deduplicate them during import.
TODO:
- check for parameters like annotations and extensions
- RefDetails
Related issues
Updated by Andreas Müller about 5 years ago
- Related to feature request #7800: Parse preliminary RefDetails added
Updated by Andreas Müller about 5 years ago
- Related to feature request #7799: AM: Parse authorteams added
Updated by Andreas Müller almost 5 years ago
Find deduplicated records in CDM
SELECT osb.id, ref.id, osb.idInSource, osb.idNamespace, ref.refType, ref.titleCache, ref.authorship_id, ref.abbrevTitleCache, ref.protectedTitleCache,
ref.protectedAbbrevTitleCache, ref.abbrevTitle, ref.title, ref.volume, ref.*
FROM Reference ref LEFT OUTER JOIN Reference_OriginalSourceBase MN ON MN.Reference_id = ref.id
LEFT OUTER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id
INNER JOIN (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN INNER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id WHERE idNamespace <> 'import to Berlin Model' GROUP BY MN.Reference_id HAVING n > 1) as drvTab2 ON drvTab2.Reference_id = ref.id
WHERE (1 = 1)
-- AND idNamespace <> 'RefDetail'
-- AND titleCache like '%unde%'
-- AND ab.protectedTitleCache = false
-- AND ab.id NOT IN (SELECT Team_id FROM AgentBase_AgentBase MM WHERE MM.Team_id IS NOT NULL) -- AND idInSOurce = '1'
AND idInSource like '7712094'
-- AND ref.id IN (SELECT Reference_id FROM (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN GROUP BY MN.Reference_id HAVING n > 1) as drvTab)
/* AND titleCache IN (SELECT titleCache FROM (
SELECT r2.titleCache, r2.abbrevTitleCache, r2.authorship_id, count(*) n
FROM Reference r2
GROUP BY r2.titleCache, r2.abbrevTitleCache, r2.authorship_id
HAVING n > 1) as drv )
*/
ORDER BY ref.titleCache, ref.id, length(idInSource), idInSource, ref.refType
Updated by Andreas Müller almost 5 years ago
- Status changed from New to In Progress
- Priority changed from Priority14 to Highest
Updated by Andreas Müller about 2 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
The import did run long time ago. Deduplication was done as far as it was done. As we can't change much here other then trying to further find duplicates and deduplicate them I think we can close this ticket.
Actions