NAME PARSER DOCUMENTATION¶
This page documents the CDM name parser.
The taxonomic name parser analyzes a free-text taxonomic reference for the following four components:
the Name Part,
the Authorship Part,
the Reference Part and
the Nomenclatural Status.
Not all of them are required.
The four parts are separated by the following separators:
|part|separator|example|
|authorship|any whitespace|Abies alba *L.*|
|reference|comma followed by whitespace, OR whitespace + 'in' + whitespace|Abies alba L., *Sp. Pl...* or Pinus alba *in Bull. Soc....*|
|nom. status|comma followed by whitespace|in Bull. Bot. 3: 99. 1987., *nom illeg.*|
Some valid name texts fully recognized by the parser are:
Abies alba (L.) Mill., Sp. Pl.: 105. 1846., nom illeg.
Abies alba (L.) Mill. in Bull. Bot. 3: 99. 1987., nom illeg.
The name part is required. The authorship part is required only if followed by the reference part. The reference part as well as the status part are not required.
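The separator rules above can be sketched in a few lines of Python. This is only an illustration, not the CDM implementation: `split_name_text` and the three regular expressions are hypothetical names, and the sketch assumes botanical names only (a zoological authorship such as "XXX, 1830" contains a comma and would be misread as a reference separator). It also assumes that the name part consists of a capitalized word followed by lower-case words, so lower-case author particles would be misassigned to the name.

```python
import re

# Hypothetical helpers for illustration; these are not CDM library functions.
STATUS_RE = re.compile(r",\s+((?:nom\.?|orth\.)\s.*)$")
REFERENCE_SEPARATOR_RE = re.compile(r",\s+|\s+in\s+")
# Assumed name shape: a capitalized word followed by lower-case words or markers.
NAME_RE = re.compile(r"[A-Z][a-zï]+(?:\s+[a-zï]+\.?)*")


def split_name_text(text):
    """Split a botanical name string into its (up to) four parts."""
    parts = {"name": None, "authorship": None, "separator": None,
             "reference": None, "status": None}
    # 1. The status comes last, after a comma and whitespace.
    match = STATUS_RE.search(text)
    if match:
        parts["status"] = match.group(1)
        text = text[:match.start()]
    # 2. The reference starts at the first ', ' or ' in '.
    match = REFERENCE_SEPARATOR_RE.search(text)
    if match:
        parts["separator"] = "in" if "in" in match.group(0) else ","
        parts["reference"] = text[match.end():]
        text = text[:match.start()]
    # 3. Name and authorship are separated by whitespace.
    match = NAME_RE.match(text)
    if match:
        parts["name"] = match.group(0)
        parts["authorship"] = text[match.end():].strip() or None
    else:
        parts["name"] = text
    return parts
```

Applied to the first example above, the sketch yields the name "Abies alba", the authorship "(L.) Mill.", the separator ",", the reference "Sp. Pl.: 105. 1846." and the status "nom illeg.".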
In the following sections, the four parts are described in detail:
Name Part¶
The name part recognizes uninomials, binomials and trinomials. The first word must start with a capital letter; all other words (except for infrageneric epithets) may contain only lower-case letters. Only Latin letters are allowed in names; the single exception is ï.
The name part parser distinguishes six syntaxes.
Uninomials¶
One word starting with a capital letter. As the rank is usually ambiguous for uninomials, the rank represents the parser's best guess and a warning is returned to check the rank.
Example: Cichorieae
Infrageneric Names¶
A capitalized word followed by the infrageneric marker and the infrageneric epithet.
Valid markers are: subgen., subg., sect., subsect., ser., subser., t.infgen.
Example: Desmometopa subg. LitoXXX
Species Aggregates¶
Species aggregates are recognized similarly to species except they are followed by a group marker.
Valid markers are: aggr., agg., group
Example: XXX
Species¶
Species names have a genus part (starting with a capital letter) and a species part (lower-case letters only).
Example: Abies alba
Infraspecific names¶
Infraspecific names have four parts: the genus part, the species part, the infraspecific marker and the infraspecific part. Only the first part may start with a capital letter.
Recognized markers are: subsp., convar., var., subvar., f., subf., f.spec., tax.infrasp., tax. infrasp.
Example:
Infraspecific names (old markers)¶
Some older names (not valid according to the nomenclatural code) use other infraspecific markers.
The recognition of these older names is not yet implemented.
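As a rough illustration of the implemented syntaxes, the following Python sketch classifies a name part with regular expressions built from the marker lists given in this section. The function `classify_name` and the pattern table are hypothetical and are not taken from the CDM code. The sketch assumes that parts are separated by single spaces and that an aggregate marker follows the species epithet directly.

```python
import re

# Marker lists copied from the documentation above.
INFRAGENERIC_MARKERS = ["subgen.", "subg.", "sect.", "subsect.", "ser.",
                        "subser.", "t.infgen."]
AGGREGATE_MARKERS = ["aggr.", "agg.", "group"]
INFRASPECIFIC_MARKERS = ["subsp.", "convar.", "var.", "subvar.", "f.", "subf.",
                         "f.spec.", "tax.infrasp.", "tax. infrasp."]


def alternation(markers):
    """Build a regex alternation, longest marker first."""
    ordered = sorted(markers, key=len, reverse=True)
    return "|".join(re.escape(marker) for marker in ordered)


GENUS = r"[A-Z][a-zï]+"   # capitalized word
EPITHET = r"[a-zï]+"      # lower-case word

# Hypothetical pattern table; more specific syntaxes are tried first.
PATTERNS = [
    ("infraspecific",
     rf"{GENUS} {EPITHET} (?:{alternation(INFRASPECIFIC_MARKERS)}) {EPITHET}"),
    ("aggregate", rf"{GENUS} {EPITHET} (?:{alternation(AGGREGATE_MARKERS)})"),
    # Infrageneric epithets may contain capital letters.
    ("infrageneric",
     rf"{GENUS} (?:{alternation(INFRAGENERIC_MARKERS)}) [A-Za-zï]+"),
    ("species", rf"{GENUS} {EPITHET}"),
    ("uninomial", GENUS),
]


def classify_name(name):
    """Return the label of the first syntax that matches, or None."""
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, name):
            return label
    return None
```

For instance, `classify_name("Cichorieae")` returns `"uninomial"` and `classify_name("Desmometopa subg. LitoXXX")` returns `"infrageneric"`. A name with a capitalized second word, such as "Abies Alba", matches none of the patterns and yields `None`.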
Authorship Part¶
The authorship part is divided into the original combination authorship and the combination authorship.
The former is put in parentheses.
Example (bot.): (L.) Mill.
Example (zoo.): (XXX, 1830) XXX, 1845
You can use either no authorship (only if not followed by any other part), the original combination authorship, the combination authorship or both.
The parser differentiates botanical and zoological authorship. The latter has a year following the author, separated by a comma. Botanical names only have authors.
Authorship may include single persons and teams. Team members are separated by _ & _.
A placeholder 'al.' may be used for further team members.
Both authorships may include ex-authors separated by _ ex _ or _ ex. _
Some valid author strings are:
Example (bot.): (Greuther & L'Hiver & al. ex Müller & Schmidt) Clark ex Ciardelli
Example (zoo.): _ _
The set of allowed special characters such as ' or - is beyond the scope of this documentation at the moment and will change in the future.
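The structure of a botanical authorship string can be sketched as follows. This is a simplified illustration, not the parser's actual code: `parse_author_group` and `parse_botanical_authorship` are hypothetical names, zoological authorships with years are not handled, and no check of allowed characters is made. Following botanical convention, the sketch assumes that the team before "ex" is the ex-author team.

```python
import re

# Optional original combination authorship in parentheses, then the rest.
AUTHORSHIP_RE = re.compile(
    r"^(?:\((?P<original>[^)]*)\)\s*)?(?P<combination>.*)$"
)


def parse_author_group(text):
    """Split 'A & B ex C' into an ex-author team and an author team."""
    parts = re.split(r"\s+ex\.?\s+", text.strip())
    if len(parts) > 2:
        raise ValueError(f"more than one 'ex' in {text!r}")
    teams = [re.split(r"\s+&\s+", part.strip()) for part in parts]
    if len(teams) == 2:
        # Assumption: the team before 'ex' holds the ex-authors.
        return {"ex_authors": teams[0], "authors": teams[1]}
    return {"ex_authors": [], "authors": teams[0]}


def parse_botanical_authorship(text):
    """Return the original and the combination authorship, if present."""
    match = AUTHORSHIP_RE.match(text.strip())
    original = match.group("original")
    combination = match.group("combination").strip()
    return {
        "original": parse_author_group(original) if original else None,
        "combination": parse_author_group(combination) if combination else None,
    }
```

For the botanical example above, the original combination authorship is split into the ex-author team Greuther, L'Hiver, al. and the author team Müller, Schmidt, while the combination authorship has the ex-author Clark and the author Ciardelli.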
Reference Part¶
The reference part follows the syntax:
{separator}{authorship{,}}{titleEditionVolume}{:}{detail}{.}{year}
Zoological new combinations should not have a reference part, since in zoology, it is not common to mention the new combination reference.
Separator¶
The separator between the reference part and the preceding authorship may be a comma (,) or the word 'in' surrounded by whitespace. The comma indicates a book, whereas 'in' stands for either a journal article or a book section. If the 'in' is not followed by a comma, the parser interprets the reference as an article; otherwise, as a book section. Reference type parsing should be improved in the future.
Reference Authorship¶
An author is only available for book sections. Articles and book sections are differentiated from each other by comparing the first four words that follow the separator. If these words include a comma and the words before the comma are likely to represent an author, the reference is recognized as a book section. Otherwise, it is treated as an article. In both cases, a warning is issued that reliable differentiation is not possible.
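The decision rule described here can be sketched as a small heuristic. The documentation does not say how the parser decides whether words are "likely to represent an author", so `looks_like_author_word` below is a stand-in assumption (a capitalized word or "&"), and both function names are hypothetical. The reference "Müller, Fl. Eur. 2: 10. 1900." in the usage note is an invented string.

```python
def looks_like_author_word(word):
    """Stand-in heuristic (assumption): capitalized word or '&'."""
    return word == "&" or word[:1].isupper()


def guess_reference_type(separator, reference):
    """Guess the reference type from the separator and the first four words."""
    if separator == ",":
        return "book"
    first_words = reference.split()[:4]
    for index, word in enumerate(first_words):
        if word.endswith(","):
            # Words before the first comma, with the comma itself removed.
            candidate = first_words[:index] + [word[:-1]]
            if all(looks_like_author_word(w) for w in candidate):
                return "book section"
            break
    return "article"
```

With this sketch, `guess_reference_type("in", "Bull. Bot. 3: 99. 1987.")` gives `"article"` because none of the first four words ends in a comma, whereas `guess_reference_type("in", "Müller, Fl. Eur. 2: 10. 1900.")` gives `"book section"`.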
TitleEditionVolume¶
The TitleEditionVolume part includes the title itself as well as optional edition part and volume parts.
The title itself allows most character combinations, but care must be taken if a : is included, as this is the separator for the subsequent detail part. Special characters like & and - are only allowed if they are both immediately preceded and immediately followed by ordinary characters. Ordinary brackets are allowed.
If only one of edition and volume exists, it is separated from the title by whitespace. If both exist, the volume is separated from the edition by a comma. Both are optional, so all four of the following formats are valid:
Sp. Pl.
Sp. Pl. ed. 3
Sp. Pl. ed. 3, 4
Sp. Pl. 4
As can be seen, the edition is recognized by a preceding ed., whereas the volume is just a number (or a number followed by another number in brackets, e.g. 4(5)).
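The four formats can be captured by a single regular expression, as in the following sketch. The pattern and the function name `parse_title_edition_volume` are hypothetical, and the restrictions on special characters in the title are not checked here.

```python
import re

VOLUME = r"\d+(?:\(\d+\))?"   # e.g. 4 or 4(5)

# Hypothetical pattern: the title, then either 'ed. N' with an optional
# ', volume', or a volume alone.
TEV_RE = re.compile(
    rf"^(?P<title>.+?)"
    rf"(?:\s+ed\.\s*(?P<edition>\d+)(?:,\s*(?P<vol_after_ed>{VOLUME}))?"
    rf"|\s+(?P<vol_only>{VOLUME}))?$"
)


def parse_title_edition_volume(text):
    """Split a TitleEditionVolume string into its three components."""
    match = TEV_RE.match(text.strip())
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "edition": match.group("edition"),
        "volume": match.group("vol_after_ed") or match.group("vol_only"),
    }
```

For "Sp. Pl. ed. 3, 4" the sketch returns the title "Sp. Pl.", the edition "3" and the volume "4"; for "Sp. Pl." both the edition and the volume are `None`.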
The detail part is separated by a colon : from the preceding TitleEditionVolume part and is separated from the year by . (botanical names only).
Typical detail information is recognized, such as single page numbers (345) or page ranges (345-348). Page numbers may be preceded by p. (p. 345) or pp. (pp. 345-348). Abbreviations indicating special parts of a reference, such as fig. or tab., are recognized as well. Roman numerals are not detected at the moment.
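The following sketch shows how the colon and the dot could be used to split a botanical reference into its TitleEditionVolume part, detail and year, and how plain page details could be read. Both functions are hypothetical. The sketch assumes a four-digit year and a title without a colon, and it handles only page numbers and ranges, not fig. or tab. details.

```python
import re

# Assumed layout: '<titleEditionVolume>: <detail>. <year>' with a 4-digit year.
REFERENCE_BODY_RE = re.compile(
    r"^(?P<tev>[^:]+):\s*(?P<detail>.+?)\.\s*(?P<year>\d{4})\.?$"
)
# A page number or page range, optionally preceded by 'p.' or 'pp.'.
PAGES_RE = re.compile(r"^(?:pp?\.\s*)?(?P<first>\d+)(?:-(?P<last>\d+))?$")


def split_reference_body(reference):
    """Return (titleEditionVolume, detail, year) or None if not matched."""
    match = REFERENCE_BODY_RE.match(reference.strip())
    if match is None:
        return None
    return (match.group("tev").strip(), match.group("detail"),
            match.group("year"))


def parse_pages(detail):
    """Return (first_page, last_page) as integers, or None if not matched."""
    match = PAGES_RE.match(detail.strip())
    if match is None:
        return None
    first = int(match.group("first"))
    last = int(match.group("last")) if match.group("last") else first
    return (first, last)
```

Here `split_reference_body("Sp. Pl.: 105. 1846.")` gives `("Sp. Pl.", "105", "1846")`, and `parse_pages("pp. 345-348")` gives `(345, 348)`. A Roman numeral such as "xii" is not matched and yields `None`, in line with the limitation noted above.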
Nomenclatural Status¶
The nomenclatural status is separated from the preceding text by a comma. The currently valid status values are:
nom. superfl., nom. nud., nom. illeg., nom. inval., nom. cons., nom. alternativ., nom. subnud., nom. rej., nom. prop., nom. provis., orth. var.
Multiple values, separated by commas, are possible.
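A status string can be checked against the list of valid values roughly as follows. The function `parse_statuses` is hypothetical. Because the examples on this page write "nom illeg." without the first dot, the sketch ignores dots and whitespace when comparing; whether the real parser is equally tolerant is an assumption.

```python
import re

# Valid values as listed in the documentation above.
VALID_STATUSES = [
    "nom. superfl.", "nom. nud.", "nom. illeg.", "nom. inval.", "nom. cons.",
    "nom. alternativ.", "nom. subnud.", "nom. rej.", "nom. prop.",
    "nom. provis.", "orth. var.",
]


def _key(value):
    """Comparison key that ignores dots and whitespace (assumption)."""
    return re.sub(r"[.\s]", "", value)


CANONICAL = {_key(status): status for status in VALID_STATUSES}


def parse_statuses(text):
    """Return the canonical status values found in a comma-separated string."""
    statuses = []
    for value in text.split(","):
        canonical = CANONICAL.get(_key(value))
        if canonical is None:
            raise ValueError(f"unknown nomenclatural status: {value.strip()!r}")
        statuses.append(canonical)
    return statuses
```

For example, `parse_statuses("nom illeg.")` returns `["nom. illeg."]`, and `parse_statuses("nom. cons., orth. var.")` returns both values in order. An unlisted value raises a `ValueError`.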
Updated by Andreas Müller about 1 year ago · 23 revisions