bug #6100

open

Name parser problems

Added by Andreas Kohlbecker almost 6 years ago. Updated 12 months ago.

Status: In Progress
Priority: Highest
Category: cdmlib
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Severity: normal
Found in Version:
Tags:

Description

When running the IAPT import (#6026), a lot of names cannot be parsed.

For a better overview I split the affected names into a couple of files. The name feature which has been chosen for the split may have something to do with the problem that occurred in the parser.

The names in Name-parsing-problems-Br.ter.txt are a special case which has been checked by Henning, see #6100#note-1 for details.


Files

Name-parsing-problems.txt (3.21 KB), Andreas Kohlbecker, 09/20/2016 12:54 PM
Name-parsing-problems-basionyms.txt (3.69 KB), Andreas Kohlbecker, 09/20/2016 12:54 PM
Name-parsing-problems-Br.ter.txt (809 Bytes), Andreas Kohlbecker, 09/20/2016 12:54 PM
Name-parsing-problems-ex-authors.txt (4.62 KB), Andreas Kohlbecker, 09/20/2016 12:54 PM
Name-parsing-problems-hybrid.txt (88 Bytes), Andreas Kohlbecker, 09/20/2016 12:54 PM

Related issues

Related to EDIT - task #3967: Rethink original spelling strategy (Duplicate, Andreas Müller)
Related to EDIT - task #9014: Unparsable name strings (In Progress, Andreas Müller)
Copied to EDIT - feature request #6428: Handle fungi authors in name parser (and model) (New, Andreas Müller)

Actions #1

Updated by Andreas Kohlbecker almost 6 years ago

  • Description updated (diff)
Actions #2

Updated by Andreas Kohlbecker almost 6 years ago

  • Description updated (diff)

Hello Andreas,

this is a correct special case of the Brummitt & Powell standard:

Since R.Br. occurs three times, the first is R.Br., the second R.Br.bis, and the third R.Br.ter.

So everything is fine with the author.

  • R.Br. - Robert Brown 1773-1858
  • R.Br.ter - Robert, of Campster Brown 1842-1895
  • R.Br.bis - Robert, of NZ Brown 1820-1906

Best regards,
Henning
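
The bis/ter convention described here can be sketched as a pattern that accepts an optional homonym suffix directly after the final dot of the abbreviation. This is only an illustration in Python; the pattern and the function name split_homonym_suffix are assumptions and are not taken from cdmlib.

```python
import re

# Hypothetical pattern, not cdmlib's: dotted initials, an abbreviated
# surname ending in a dot, and an optional "bis"/"ter" homonym suffix.
AUTHOR_ABBREV = re.compile(
    r"^(?P<base>(?:[A-Z]\.)*[A-Z][a-z]*\.)(?P<suffix>bis|ter)?$"
)


def split_homonym_suffix(abbrev):
    """Return (base abbreviation, suffix or None), or None if nothing matches."""
    m = AUTHOR_ABBREV.match(abbrev)
    if m is None:
        return None
    return m.group("base"), m.group("suffix")
```

With this sketch, all three Robert Browns share the base "R.Br." and differ only in the suffix, so a parser could accept the suffix without treating it as a separate token.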

Actions #3

Updated by Andreas Müller over 5 years ago

  • Priority changed from New to Highest
  • Target version changed from Unassigned CDM tickets to Release 4.6
Actions #4

Updated by Andreas Müller over 5 years ago

Hybrids:

  • Pterocypsela x mansuensis (Hayata) C.I Peng => the problem here is the missing dot after "C.I" ! Is this the correct author abbreviation? The hybrid itself parses correctly.
Actions #5

Updated by Wolf-Henning Kusber over 5 years ago

Andreas Müller wrote:

Hybrids:

  • Pterocypsela x mansuensis (Hayata) C.I Peng => the problem here is the missing dot after "C.I" ! Is this the correct author abbreviation? The hybrid itself parses correctly.

There is no dot missing. Seems to be a short syllable.
For Standard see IPNI: http://www.ipni.org/ipni/idAuthorSearch.do;jsessionid=345D3EDF92F1C4F8619E25BA977409DB?id=15619-1&back_page=%2Fipni%2FeditAdvAuthorSearch.do%3Bjsessionid%3D345D3EDF92F1C4F8619E25BA977409DB%3Ffind_abbreviation%3D%26find_surname%3DPeng%26find_isoCountry%3D%26find_forename%3D%26output_format%3Dnormal
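
If forms like "C.I Peng" are to be accepted, one option is to allow the last initial to appear without a dot. The following Python sketch shows the idea; the pattern and the name parse_author are hypothetical, and the real parser may handle initials quite differently.

```python
import re

# Assumed shape: one or more dotted initials, optionally one final initial
# without a dot (as in "C.I"), then a space and a capitalised surname.
INITIALS_SURNAME = re.compile(
    r"^(?P<initials>(?:[A-Z]\.)+[A-Z]?)\s(?P<surname>[A-Z][\w'-]+)$"
)


def parse_author(text):
    """Return (initials, surname), or None if the text has another shape."""
    m = INITIALS_SURNAME.match(text)
    if m is None:
        return None
    return m.group("initials"), m.group("surname")
```

Fully dotted forms such as "M.E. Barr" still match, so the relaxation only adds the dotless last initial.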

Actions #6

Updated by Andreas Müller over 5 years ago

Hybrids (2):

  • Swida x friedlanderi (W.H.Wagner jun.) Holub => the problem here is "jun." in the author

fixed with cdmlib|4c91a094f879af5

Actions #7

Updated by Andreas Müller over 5 years ago

Ex authors:

  • all the cases seem to follow the format Lycopersicon lycopersicoides A.Child ex (Dunal) J.M.H.Shaw, where the basionym follows the ex author. Is this somehow covered by the code? I do not know this way of formatting ex authors. I would expect "Lycopersicon lycopersicoides (Dunal) A.Child ex J.M.H.Shaw" or "Lycopersicon lycopersicoides (A.Child ex Dunal) J.M.H.Shaw"
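
To make the difference between the two expected forms and the reported one concrete, here is a rough Python sketch that parses only the authorship part (without the name). All names in it (AUTHORSHIP, split_ex, parse_authorship) are made up for illustration, and the patterns are deliberately loose; this is not how cdmlib does it.

```python
import re

TEAM = r"[^()]+?"  # any run of characters without parentheses (loose on purpose)

# Optional "(basionym authors)" followed by the combination authors.
AUTHORSHIP = re.compile(rf"^(?:\((?P<bas>{TEAM})\)\s*)?(?P<comb>{TEAM})$")


def split_ex(team):
    """Split "X ex Y" into (before_ex, after_ex); before_ex is None if absent."""
    if team is None:
        return None
    head, sep, tail = team.rpartition(" ex ")
    if not sep:
        return None, team.strip()
    return head.strip() or None, tail.strip()


def parse_authorship(text):
    """Return the basionym and combination teams, or None if the shape is unknown."""
    m = AUTHORSHIP.match(text.strip())
    if m is None:
        return None  # e.g. parentheses after "ex", as in the reported data
    return {
        "basionym": split_ex(m.group("bas")),
        "combination": split_ex(m.group("comb")),
    }
```

Both expected forms are accepted, while "A.Child ex (Dunal) J.M.H.Shaw" is rejected because the parentheses do not come first.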
Actions #9

Updated by Andreas Müller over 5 years ago

R.Br.ter => fixed with ce3ed240325c

Actions #10

Updated by Wolf-Henning Kusber over 5 years ago

Andreas Müller wrote:

Ex authors:

  • all the cases seem to follow the format Lycopersicon lycopersicoides A.Child ex (Dunal) J.M.H.Shaw, where the basionym follows the ex author. Is this somehow covered by the code? I do not know this way of formatting ex authors. I would expect "Lycopersicon lycopersicoides (Dunal) A.Child ex J.M.H.Shaw" or "Lycopersicon lycopersicoides (A.Child ex Dunal) J.M.H.Shaw"

Data errors in the original data, see: http://archive.bgbm.org/scripts/ASP/registration/regDetail.asp?Key=2769

Actions #11

Updated by Wolf-Henning Kusber over 5 years ago

Wolf-Henning Kusber wrote:

Andreas Müller wrote:

Hybrids (2):

  • Swida x friedlanderi (W.H.Wagner jun.) Holub => the problem here is "jun." in the author

fixed with cdmlib|4c91a094f879af5

According to IPNI, the standard author form is without "jun.": Swida x friedlanderi (W.H.Wagner) Holub

http://www.ipni.org/ipni/advPlantNameSearch.do;jsessionid=228A187EEC1CD3D86704AB30DC02DBB8?find_family=&find_genus=Swida&find_species=friedlanderi&find_infrafamily=&find_infragenus=&find_infraspecies=&find_authorAbbrev=&find_includePublicationAuthors=on&find_includePublicationAuthors=off&find_includeBasionymAuthors=on&find_includeBasionymAuthors=off&find_publicationTitle=&find_isAPNIRecord=on&find_isAPNIRecord=false&find_isGCIRecord=on&find_isGCIRecord=false&find_isIKRecord=on&find_isIKRecord=false&find_rankToReturn=all&output_format=normal&find_sortByFamily=on&find_sortByFamily=off&query_type=by_query&back_page=plantsearch

Second comment: if an author were "jun.", the standard is "f.", with a space before it if the surname is not abbreviated, or without a space if the surname is abbreviated. For Linnaeus (L.), "jun." becomes L.f.
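
The spacing rule for "f." can be written down in a few lines. A minimal Python sketch follows, with the hypothetical helper name filius_form; it only illustrates the spacing rule and does not decide whether "f." is the right standard form for a given author.

```python
def filius_form(author):
    """Append the filius marker "f." to an author string.

    Rule as stated in this comment: no space after an abbreviated surname
    (one that ends in a dot), a space after an unabbreviated surname.
    """
    author = author.strip()
    if author.endswith("."):
        return author + "f."
    return author + " f."
```

For the Linnaeus example this gives "L.f."; for an unabbreviated, made-up input such as "Smith" it gives "Smith f.".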

Actions #12

Updated by Andreas Müller over 5 years ago

Basionyms (I):

Most cases refer to subgen. as infrageneric marker. According to Art. 5A.1. of the code (http://www.iapt-taxon.org/nomen/main.php?page=art5), subg. is the correct marker. However, IPNI seems to use subgen. in all names: http://www.ipni.org/ipni/simplePlantNameSearch.do?find_wholeName=Pleione+subgen.+Scopulorum&output_format=normal&query_type=by_query&back_page=query_ipni.html

From the code I can't tell whether subgen. is merely not recommended or actually forbidden.

Actions #13

Updated by Andreas Müller over 5 years ago

Basionyms (cont.):

Another problem is the use of forma instead of "f.". I haven't found in the code whether abbreviation is required, so I guess both are correct?

Some problems are about author names:

  • again C.I Peng and similar P.I Mao
  • and other authors not recognized: ´t Hart and la Croix

There is also one wrong name:

  • Psoroma papuana () Aptroot & Diederich

And some typical fungi authors, e.g.

  • Phoma aliena (Fr.: Fr.) v.d. Aa & Boerema
  • Setulipes splachnoides (Horn.: Fr.) Bon
  • Wegelina barbirostris (Dufour: Fr.) M.E. Barr
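
For the fungi authorships with a colon, the parenthesised part could be split into the author before the colon and the author after it. Here is a Python sketch under that assumption; the pattern and the name split_sanctioned_authorship are hypothetical, and how the parser and model should represent these authors is left open (see #6428).

```python
import re

# Hypothetical pattern: "(Author: AuthorAfterColon) CombinationAuthors".
SANCTIONED = re.compile(
    r"^\((?P<basionym>[^():]+?)\s*:\s*(?P<sanctioning>[^():]+)\)"
    r"\s*(?P<combination>.+)$"
)


def split_sanctioned_authorship(authorship):
    """Return the three parts as a dict, or None if there is no colon form."""
    m = SANCTIONED.match(authorship)
    if m is None:
        return None
    return {key: value.strip() for key, value in m.groupdict().items()}
```

An authorship without a colon, such as "(Hayata) C.I Peng", returns None and would fall through to the normal handling.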
Actions #15

Updated by Andreas Müller over 5 years ago

fixed subgen. and forma recognition by 7301dfbfa163b (even if maybe not code compliant)
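
An alternative to recognizing the variants in the parser itself would be to map them to one form before parsing. A small Python sketch of that idea follows; the mapping and the name normalize_rank_markers are assumptions, and this is not what the commit above does.

```python
# Assumed mapping for illustration; the variants come from this ticket.
RANK_MARKER_VARIANTS = {
    "subgen.": "subg.",
    "forma": "f.",
}


def normalize_rank_markers(name):
    """Replace known rank-marker variants token by token.

    Note: splitting on whitespace also collapses repeated spaces.
    """
    tokens = name.split()
    return " ".join(RANK_MARKER_VARIANTS.get(token, token) for token in tokens)
```

Because the replacement is purely token-based, it would also rewrite a token "forma" that is not meant as a rank marker, so a real implementation would need to look at the position in the name.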

Actions #16

Updated by Wolf-Henning Kusber over 5 years ago

Andreas Müller wrote:

Basionyms (cont.):

Another problem is the use of forma instead of "f.". I haven't found in the code whether abbreviation is required, so I guess both are correct?

Some problems are about author names:

  • again C.I Peng and similar P.I Mao
  • and other authors not recognized: ´t Hart and la Croix

There is also one wrong name:

  • Psoroma papuana () Aptroot & Diederich

And some typical fungi authors, e.g.

  • Phoma aliena (Fr.: Fr.) v.d. Aa & Boerema
  • Setulipes splachnoides (Horn.: Fr.) Bon
  • Wegelina barbirostris (Dufour: Fr.) M.E. Barr

Psoroma papuana () Aptroot & Diederich
is data error for
Psoroma papuana Aptroot & Diederich
http://www.indexfungorum.org/names/NamesRecord.asp?RecordID=442505

Actions #17

Updated by Andreas Müller over 5 years ago

Author issues (C.I, la Croix and ´t Hart) fixed with 78c828989914a .
Maybe we should automatically map ´t Hart to 't Hart as ´ is a unicode character \u00B4 (but don't do this now)
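
The suggested mapping could look roughly like this in Python. The set of apostrophe-like characters is an assumption; only \u00B4 is mentioned in this ticket, the others are common look-alikes added for illustration.

```python
# Apostrophe-like characters mapped to the ASCII apostrophe.
# Only U+00B4 comes from this ticket; the rest are assumed look-alikes.
APOSTROPHE_VARIANTS = {
    "\u00B4": "'",  # ACUTE ACCENT, as in "´t Hart"
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u0060": "'",  # GRAVE ACCENT
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def normalize_apostrophes(text):
    """Return text with apostrophe look-alikes replaced by "'"."""
    return text.translate(_TABLE)
```

Applied before parsing, this would turn "´t Hart" into "'t Hart" without touching any other character.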

Actions #18

Updated by Andreas Müller over 5 years ago

Actions #19

Updated by Andreas Müller over 5 years ago

remaining issue from basionyms copied to new ticket: #6428

Actions #20

Updated by Andreas Müller over 5 years ago

  • Related to task #3967: Rethink original spelling strategy added
Actions #21

Updated by Andreas Müller over 5 years ago

Issues from Name-parsing-problems.txt:

  • most issues also about subgen., forma, C.I Peng, ´t Hart etc.

New issues:

  • Man in 't Veld in: Phytophthora multivesiculata Ilieva, Man in 't Veld, Veenbaas-Rijks & Pieters => this is critical as it looks like the start of a reference, needs to be handled explicitly => fixed with 454d0d3dec6c21
  • Polygala petræa Chodat : is æ a valid character in a name?
  • Rhyncho-Hypnum warmingii Hampe => is Rhyncho-Hypnum a valid genus name?
  • Arthrowallemia R.F. Castańeda, D. Garc¡a & Guarro => Garc¡a is probably incorrect and should be D. García
  • Sorokina caeruleogrisea Spooner, L‘ssøe & Lodge => is L‘ssøe correct? At IPNI I only found a mycologist Læssøe
  • Thymus x herberoi De la Torre, Vicedo, Alonso & Payá => problem in parser, capital D should be recognized => fixed with 454d0d3dec6c21
  • Chaetoceros schüttii var. circinalis Meunier => is ü a valid name character? I know from i, e, and o that the diaeresis is allowed (or at least in use), so probably it is
  • Brunneiapiospora K.D. Hyde, j. Fröhl. & J.E. Taylor => I guess this is a mistake and should be J. Fröhl.
  • Coelosphaerium evidenter-marginatum M.T.P.Azevedo & SantAnna => the standard form is Sant'Anna. Do we want to allow SantAnna? In any case, Sant'Anna is currently not recognized either. => fixed (at least for standard form) by e838ffccfc7e5a83b
  • Hibiscus tiliaceus 'hastatus' => original spelling not yet fully implemented, often discussed in #3967, #3966 and related tickets

Obviously incorrect:

  • Claviceps citrina Pažoutova, Fučíkovský;, Leyva-Mir & Flieger (;)
  • Helianthemum polifolium sensu auct. (sensu auct. should not be part of name)
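
Several of the cases above look like data errors rather than parser problems, so they could be flagged before parsing. The Python sketch below derives a few such checks from the listed examples; the check list and the name flag_suspicious are hypothetical and certainly incomplete.

```python
import re

# Hypothetical pre-parse checks derived from the cases listed above.
CHECKS = [
    ("empty parentheses", re.compile(r"\(\s*\)")),
    ("stray semicolon", re.compile(r";")),
    ("sensu phrase", re.compile(r"\bsensu\b")),
    # U+00A1 (inverted exclamation mark) or U+2018 (left single quotation
    # mark) between two word characters, as in the garbled author names.
    ("punctuation inside a word", re.compile(r"\w[\u00A1\u2018]\w")),
]


def flag_suspicious(name):
    """Return the labels of all checks that match the name string."""
    return [label for label, pattern in CHECKS if pattern.search(name)]
```

A name with an empty result is not necessarily correct; the checks only catch the specific patterns seen in this import.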
Actions #22

Updated by Andreas Müller over 5 years ago

  • Status changed from New to In Progress
Actions #23

Updated by Andreas Müller over 5 years ago

´t Hart still does not seem to work when compiled with Maven. The ´ character should be avoided, as it seems to be 2 characters (1 hidden).
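
To see which characters a string really contains, the code points can be listed one by one. The Python sketch below does that with the hypothetical helper describe_chars. It also shows one possible explanation for the "2 characters", namely that the UTF-8 bytes of ´ are read with a single-byte encoding; whether this is what happens in the Maven build is only a guess.

```python
import unicodedata


def describe_chars(text):
    """List each character as "U+XXXX NAME" so hidden ones become visible."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]


# ´ (U+00B4) is two bytes in UTF-8. If those bytes are decoded as Latin-1
# (an assumption about the build, not a confirmed cause), two characters
# come out instead of one.
acute = "\u00B4"
misread = acute.encode("utf-8").decode("latin-1")
```

Calling describe_chars(misread) then lists two code points, U+00C2 and U+00B4, where describe_chars(acute) lists only one.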

Actions #24

Updated by Andreas Müller over 1 year ago

  • Target version changed from Release 4.6 to Release 5.18
Actions #25

Updated by Andreas Müller over 1 year ago

  • Related to task #9014: Unparsable name strings added
Actions #26

Updated by Andreas Müller over 1 year ago

  • Target version changed from Release 5.18 to Release 5.19
Actions #27

Updated by Andreas Müller over 1 year ago

  • Target version changed from Release 5.19 to Release 5.21
Actions #28

Updated by Andreas Müller over 1 year ago

  • Target version changed from Release 5.21 to Release 5.22
Actions #29

Updated by Andreas Müller about 1 year ago

  • Target version changed from Release 5.22 to Release 5.25
Actions #30

Updated by Andreas Müller about 1 year ago

  • Tags set to parser
Actions #32

Updated by Andreas Müller 12 months ago

  • Target version changed from Release 5.25 to Release 5.34