Project

General

Profile

Actions

feature request #7801

closed

AM: Deduplicate references

Added by Andreas Müller almost 4 years ago. Updated about 1 year ago.

Status:
Closed
Priority:
Highest
Category:
cdmadapter
Target version:
Start date:
Due date:
% Done:

100%

Estimated time:
15:00 h
Severity:
normal
Tags:

Description

Many references are duplicates. We could try to deduplicate them during import.

TODO:

  • check for parameters like annotations and extensions
  • RefDetails

Related issues

Related to EDIT - feature request #7800: Parse preliminary RefDetailsClosedAndreas Müller

Actions
Related to EDIT - feature request #7799: AM: Parse authorteamsResolvedAndreas Müller

Actions
Actions #1

Updated by Andreas Müller almost 4 years ago

Actions #2

Updated by Andreas Müller almost 4 years ago

Actions #4

Updated by Andreas Müller almost 4 years ago

Find deduplicated records in CDM

SELECT osb.id, ref.id, osb.idInSource, osb.idNamespace, ref.refType, ref.titleCache, ref.authorship_id, ref.abbrevTitleCache, ref.protectedTitleCache, 
ref.protectedAbbrevTitleCache, ref.abbrevTitle, ref.title, ref.volume, ref.*
FROM Reference ref LEFT OUTER JOIN Reference_OriginalSourceBase MN ON MN.Reference_id = ref.id 
LEFT OUTER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id
 INNER JOIN (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN INNER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id WHERE idNamespace <> 'import to Berlin Model' GROUP BY MN.Reference_id HAVING n > 1) as drvTab2 ON drvTab2.Reference_id = ref.id
WHERE  (1 = 1) 
-- AND idNamespace <> 'RefDetail'
-- AND titleCache like '%unde%'
-- AND ab.protectedTitleCache = false
-- AND ab.id NOT IN (SELECT Team_id FROM AgentBase_AgentBase MM WHERE MM.Team_id IS NOT NULL) -- AND idInSOurce = '1'
 AND idInSource like '7712094'
-- AND ref.id IN (SELECT Reference_id FROM (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN GROUP BY MN.Reference_id HAVING n > 1) as drvTab)

/* AND titleCache IN (SELECT titleCache FROM (
SELECT r2.titleCache, r2.abbrevTitleCache, r2.authorship_id, count(*) n
FROM Reference r2
GROUP BY r2.titleCache, r2.abbrevTitleCache, r2.authorship_id
HAVING n > 1) as drv )
*/

ORDER BY ref.titleCache, ref.id, length(idInSource), idInSource, ref.refType
Actions #5

Updated by Andreas Müller almost 4 years ago

  • Description updated (diff)
Actions #6

Updated by Andreas Müller almost 4 years ago

  • Status changed from New to In Progress
  • Priority changed from Priority14 to Highest
Actions #7

Updated by Andreas Müller about 1 year ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

The import did run long time ago. Deduplication was done as far as it was done. As we can't change much here other then trying to further find duplicates and deduplicate them I think we can close this ticket.

Actions

Also available in: Atom PDF