feature request #7801

AM: Deduplicate references

Added by Andreas Müller 4 months ago. Updated 3 months ago.

Status: In Progress
Priority: Highest
Category: cdmadapter
Target version:
Start date: 09/29/2018
Due date:
% Done: 0%
Estimated time: 15.00 h
Severity: normal
Tags:

Description

Many references are duplicates. We could try to deduplicate them during import.

TODO:

  • check for parameters like annotations and extensions
  • RefDetails
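
As a rough illustration of deduplication during import, the sketch below keeps an in-memory cache keyed on the fields that identify a reference and returns the cached instance whenever an equal key is seen again. All names here (`Reference`, `dedup_key`, `ReferenceDeduplicator`) are hypothetical and are not taken from the cdmadapter code. The choice of key fields is an assumption, and the sketch deliberately ignores annotations, extensions and RefDetails, which the TODO list above says still need to be checked.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reference:
    # Hypothetical, simplified stand-in for an imported reference record.
    title_cache: str
    abbrev_title_cache: str
    authorship_id: Optional[int]


Key = Tuple[str, str, Optional[int]]


def dedup_key(ref: Reference) -> Key:
    # Assumption: two references are duplicates if these three fields match
    # after trimming surrounding whitespace.
    return (ref.title_cache.strip(), ref.abbrev_title_cache.strip(), ref.authorship_id)


class ReferenceDeduplicator:
    """Returns the first reference seen for each key instead of a new copy."""

    def __init__(self) -> None:
        self._cache: Dict[Key, Reference] = {}

    def get_existing_or_register(self, ref: Reference) -> Reference:
        # setdefault stores ref only if the key is new, then returns the cached value.
        return self._cache.setdefault(dedup_key(ref), ref)
```

An import loop would call `get_existing_or_register` for every parsed reference and link the returned object, so that later duplicates point to the first instance instead of creating a new record.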

Related issues

Related to Edit - feature request #7800: Parse preliminary RefDetails (In Progress, 09/29/2018)
Related to Edit - feature request #7799: AM: Parse authorteams (In Progress, 09/29/2018)

Associated revisions

Revision f6275b1f (diff)
Added by Andreas Müller 3 months ago

ref #7801 unify cache initialization in deduplicationHelper

Revision db652f5c (diff)
Added by Andreas Müller 3 months ago

ref #7801 and ref #3787 deduplicate reference.authorstring and reference itself

History

#1 Updated by Andreas Müller 4 months ago

#2 Updated by Andreas Müller 4 months ago

#4 Updated by Andreas Müller 3 months ago

Find deduplicated records in CDM

SELECT osb.id, ref.id, osb.idInSource, osb.idNamespace, ref.refType, ref.titleCache, ref.authorship_id,
       ref.abbrevTitleCache, ref.protectedTitleCache, ref.protectedAbbrevTitleCache,
       ref.abbrevTitle, ref.title, ref.volume, ref.*
FROM Reference ref
LEFT OUTER JOIN Reference_OriginalSourceBase MN ON MN.Reference_id = ref.id
LEFT OUTER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id
INNER JOIN (
    SELECT Reference_id, count(*) as n
    FROM Reference_OriginalSourceBase MN
    INNER JOIN OriginalSourceBase osb ON osb.id = MN.sources_id
    WHERE idNamespace <> 'import to Berlin Model'
    GROUP BY MN.Reference_id
    HAVING n > 1
) as drvTab2 ON drvTab2.Reference_id = ref.id
WHERE (1 = 1)
-- AND idNamespace <> 'RefDetail'
-- AND titleCache like '%unde%'
-- AND ab.protectedTitleCache = false
-- AND ab.id NOT IN (SELECT Team_id FROM AgentBase_AgentBase MM WHERE MM.Team_id IS NOT NULL) -- AND idInSOurce = '1'
 AND idInSource like '7712094'
-- AND ref.id IN (SELECT Reference_id FROM (SELECT Reference_id, count(*) as n FROM Reference_OriginalSourceBase MN GROUP BY MN.Reference_id HAVING n > 1) as drvTab)

/* AND titleCache IN (SELECT titleCache FROM (
SELECT r2.titleCache, r2.abbrevTitleCache, r2.authorship_id, count(*) n
FROM Reference r2
GROUP BY r2.titleCache, r2.abbrevTitleCache, r2.authorship_id
HAVING n > 1) as drv )
*/

ORDER BY ref.titleCache, ref.id, length(idInSource), idInSource, ref.refType
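
The commented-out grouping filter in the query above can also be run on its own to list candidate duplicates that have not been merged yet. The following is a minimal sketch against an in-memory SQLite table. The table layout is a reduced, assumed subset of the CDM `Reference` table, the sample rows are invented, and `duplicate_groups` and `demo_connection` are hypothetical helper names.

```python
import sqlite3


def duplicate_groups(conn: sqlite3.Connection) -> list:
    # Groups references on the same three columns as the commented-out
    # subquery and keeps only groups with more than one member.
    sql = """
        SELECT titleCache, abbrevTitleCache, authorship_id, count(*) AS n
        FROM Reference
        GROUP BY titleCache, abbrevTitleCache, authorship_id
        HAVING count(*) > 1
        ORDER BY titleCache
    """
    return conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    # Assumed, reduced schema with invented sample data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Reference ("
        "id INTEGER PRIMARY KEY, titleCache TEXT, "
        "abbrevTitleCache TEXT, authorship_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Reference (id, titleCache, abbrevTitleCache, authorship_id) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "Sp. Pl.", "Sp. Pl.", 1),
            (2, "Sp. Pl.", "Sp. Pl.", 1),
            (3, "Sp. Pl.", "Sp. Pl.", 2),
            (4, "Gen. Pl.", "Gen. Pl.", 1),
        ],
    )
    return conn
```

With the sample data, only references 1 and 2 share all three grouping columns, so the result contains a single group with `n = 2`.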

#5 Updated by Andreas Müller 3 months ago

  • Description updated (diff)

#6 Updated by Andreas Müller 3 months ago

  • Status changed from New to In Progress
  • Priority changed from Priority14 to Highest
