feature request #10178

open

Implement fuzzy name matching

Added by Andreas Müller over 1 year ago. Updated 24 days ago.

Status:
In Progress
Priority:
New
Category:
cdmlib
Target version:
Start date:
Due date:
% Done:

50%

Estimated time:
Severity:
normal


Files

file.png (2.6 MB), Belen Escobari, 05/10/2023 12:45 PM
Rees-Taxamatching.docx (15.5 KB), Belen Escobari, 05/22/2023 04:37 PM
Actions #1

Updated by Belen Escobari over 1 year ago

Actions #2

Updated by Andreas Müller over 1 year ago

  • Target version changed from Unassigned CDM tickets to Release 5.44
Actions #3

Updated by Belen Escobari 11 months ago

  • % Done changed from 0 to 20

New methods were added to the classes NameServiceImpl and INameService.

One method calculates the Levenshtein distance between two strings. A second method retrieves the best-matching name in the database, based on its Levenshtein distance to the input name.
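For illustration, the two methods could look roughly like the following. This is a minimal sketch, not the code in NameServiceImpl: the class name FuzzyNameSketch and the method names levenshtein and bestMatch are hypothetical, and the candidates are passed in as plain strings instead of being read from the database.

```java
import java.util.List;

// Hypothetical helper for illustration; not the actual NameServiceImpl code.
final class FuzzyNameSketch {

    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the candidate closest to the input, or null if there are no candidates. */
    static String bestMatch(String input, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = levenshtein(input, candidate);
            if (d < bestDistance) { // ties keep the first candidate seen
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

With the candidates "Quercus rubra", "Quercus robur" and "Fagus sylvatica", the input "Quercus robor" would resolve to "Quercus robur" (distance 1).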

Actions #4

Updated by Belen Escobari 11 months ago

The query is parsed and each part (genus, epithet) is searched for in the database, one after the other.

An initial list is built from all genera in the database that start with the first character of the input genus (alternatively, all genera in the database could be listed regardless of the first character).

  1. The input genus is compared against each element of the initial list, and all exact or near matches are added to a map. The map holds the TaxonNameParts corresponding to each genus name together with the distance scored. The default similarity threshold is 70%.

  2. The input epithet is compared against the epithets in the map built in the previous step, and the distances are summed (genus distance + epithet distance). How this is done in the TaxaMatch algorithm still needs to be checked.

  3. The map is sorted by distance and the first x best matches are returned (x is chosen by the user).
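The three steps above can be sketched as follows. This is only an outline under stated assumptions: the Candidate class is a hypothetical stand-in for TaxonNameParts, the database is an in-memory list, and the 70% similarity is assumed to mean 1 - distance / length of the longer string, which the ticket does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GenusEpithetMatchSketch {

    /** Hypothetical stand-in for a parsed name; not the cdmlib TaxonNameParts class. */
    static final class Candidate {
        final String genus;
        final String epithet;
        final int distance; // genus distance + epithet distance
        Candidate(String genus, String epithet, int distance) {
            this.genus = genus;
            this.epithet = epithet;
            this.distance = distance;
        }
        @Override public String toString() { return genus + " " + epithet; }
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed definition of similarity: 1 - distance / length of the longer string. */
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longer;
    }

    static List<Candidate> match(String genus, String epithet, List<Candidate> database,
                                 double minSimilarity, int maxResults) {
        List<Candidate> hits = new ArrayList<>();
        for (Candidate c : database) {
            // Initial list: only genera sharing the first character with the input genus.
            if (genus.isEmpty() || c.genus.isEmpty() || c.genus.charAt(0) != genus.charAt(0)) {
                continue;
            }
            // Step 1: keep genera whose similarity reaches the threshold.
            if (similarity(genus, c.genus) < minSimilarity) {
                continue;
            }
            // Step 2: add the epithet distance to the genus distance.
            int total = levenshtein(genus, c.genus) + levenshtein(epithet, c.epithet);
            hits.add(new Candidate(c.genus, c.epithet, total));
        }
        // Step 3: sort by total distance and return the first maxResults entries.
        hits.sort(Comparator.comparingInt(c -> c.distance));
        return new ArrayList<>(hits.subList(0, Math.min(maxResults, hits.size())));
    }
}
```

Calling match("Quercuss", "robor", database, 0.7, 2) on a list containing Quercus robur, Quercus rubra and Fagus sylvatica would return Quercus robur first, followed by Quercus rubra; Fagus is dropped by the first-character filter.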

Further documentation:

SQL code written by Rees: http://www.cmar.csiro.au/datacentre/downloads/taxamatch/taxamatch1.sql
Workflow: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0107510.g002

Actions #5

Updated by Belen Escobari 10 months ago

I attach a Word file summarizing the TaxaMatch algorithm by Rees.

From my point of view, some points still need to be discussed.

Actions #6

Updated by Belen Escobari 10 months ago

  • Assignee changed from Andreas Müller to Belen Escobari
Actions #7

Updated by Belen Escobari 9 months ago

  • % Done changed from 20 to 50

All points described by Rees for fuzzy name matching are implemented, except authors (to be discussed). The algorithm allows a maximum total distance of 4 changes between the names.

Actions #8

Updated by Belen Escobari 7 months ago

The algorithm compares monomial and binomial names of the following kinds: genus, genus + infrageneric epithet, and genus + specific epithet. The default maximum distance is 2 for monomial names and 4 for binomial names. Author comparison still needs discussion.

Actions #9

Updated by Belen Escobari 7 months ago

The algorithm compares monomial names (Genus), binomial names (Genus + infrageneric epithet / Genus + specific epithet), and trinomial names (Genus + epithet + infraspecific epithet).

Actions #10

Updated by Andreas Müller 6 months ago

  • Description updated (diff)
Actions #11

Updated by Andreas Müller 5 months ago

  • Description updated (diff)
Actions #12

Updated by Belen Escobari 4 months ago

The algorithm now includes a comparison of authorities (cache). However, getAuthorshipCache(), the method used to obtain the authority, does not parse authorities as teams but as single authors. The basionym should be excluded from the comparison.

It should be discussed whether the distance from the authority comparison should be added to the total distance (the distance over the species name parts). In this publication, https://www.mdpi.com/2223-7747/10/5/974, the total score is calculated as: ReturnedScore = (TaxonScore × 0.9 + AuthorScore × 0.1) × 100
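As a point of reference for that discussion, the weighting from the cited publication can be written as a small function. This is a sketch only: the class and method names are hypothetical, and it assumes both partial scores are already normalized to the range 0 to 1, which the formula suggests but the ticket does not state.

```java
// Hypothetical illustration of the weighted score quoted from the publication.
final class WeightedScoreSketch {

    static final double TAXON_WEIGHT = 0.9;  // weight of the name (taxon) score
    static final double AUTHOR_WEIGHT = 0.1; // weight of the authorship score

    /**
     * Combines the two partial scores into one value on a 0 to 100 scale.
     * Assumption: taxonScore and authorScore are both in the range [0, 1].
     */
    static double returnedScore(double taxonScore, double authorScore) {
        return (taxonScore * TAXON_WEIGHT + authorScore * AUTHOR_WEIGHT) * 100;
    }
}
```

With this weighting, a perfect name match with a completely different authorship still scores about 90, so the authorship mainly acts as a tie-breaker between otherwise equal name matches.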

Actions #13

Updated by Belen Escobari 4 months ago

A new parser is needed for input names that contain "and" instead of "&" in the authorship.

Actions #14

Updated by Belen Escobari 4 months ago

As a temporary solution for names containing "and" in the authorship: if the input name contains "and", it is replaced by "&", and only then is the name parsed. This should work for all input names, as the input names don't include publications.
Ex authors are excluded from the authorship and only the last author is compared.
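A rough sketch of this preprocessing is shown below. The class and method names are hypothetical and this is not the code in cdmlib; it assumes that "and" and "ex" appear as separate words surrounded by whitespace, and it does not handle basionym authorship in parentheses.

```java
// Hypothetical preprocessing helpers; not the actual cdmlib implementation.
final class AuthorshipNormalizerSketch {

    /** Replaces the standalone word "and" by "&" before the name is handed to the parser. */
    static String replaceAnd(String name) {
        // Whitespace on both sides is required, so words like "Alexandra" are left alone.
        return name.replaceAll("\\s+and\\s+", " & ");
    }

    /** Drops ex authors and keeps only the author after the last " ex ". */
    static String dropExAuthors(String authorship) {
        int idx = authorship.lastIndexOf(" ex ");
        return idx < 0 ? authorship : authorship.substring(idx + " ex ".length()).trim();
    }
}
```

For example, "Abies alba Smith and Jones" becomes "Abies alba Smith & Jones", and the authorship "Hook. ex Benth." is reduced to "Benth." before comparison.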

Actions #15

Updated by Andreas Müller about 2 months ago

  • Description updated (diff)
Actions #16

Updated by Andreas Müller about 2 months ago

  • Description updated (diff)
Actions #17

Updated by Andreas Müller about 1 month ago

  • Target version changed from Release 5.44 to Release 5.43
Actions #19

Updated by Belen Escobari 24 days ago

New methods were implemented that accept a list of names as input.
