Palmweb data download¶
The data for the Palmweb data portal is to be compiled from two sources: the core checklist data and the wider descriptive data. The core checklist data is currently stored in a Sybase system housed at RBG Kew. This data is to be exported into a TCS/RDF file for upload into the EDIT CDM. The wider set of data required by Palmweb (taxon descriptions, common names and distribution data, for example) will be compiled from a range of other sources, including scientific papers and journals.
Current progress¶
The RDF output from the monocots checklist has been updated to include all publication data, and this file has been validated. All the comments made by the Kew workshop and the team in Berlin have been incorporated into the file structure. The full file is available for download from this wiki page as arecaceae.rdf.
Core fields data¶
The checklist data is stored in Kew’s monocots checklist database, a bespoke Sybase system that serves as a definitive checklist for RBG Kew. The database is served online at the World Checklist of Monocotyledons website: http://www.kew.org/wcsp/monocots.
For the download, the database is accessed using a Java/Hibernate application which maps to the database fields and outputs the data in the appropriate TCS/RDF tags. However, converting the information into TCS/RDF format is not entirely straightforward. The fields in the monocots database do not map directly onto the available TCS tags, so some degree of data manipulation is required before outputting the data: concatenating and splitting fields, for example, or removing certain characters that have specific purposes in the monocots system but are not needed in the TCS format. The steps taken in manipulating the data have been reviewed at a workshop held at Kew with the checklist editors, and have been circulated within WP5 as the work has progressed. These steps are detailed further below.
The data is packaged up according to the TCS/RDF schema found at the tdwg website: http://rs.tdwg.org/ontology/voc/TaxonName#
and the W3C RDF vocabulary description found at: http://www.w3.org/TR/rdf-schema/
The resulting files are being checked using the W3C RDF Validation service provided at: http://www.w3.org/RDF/Validator/
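Before submitting a file to the online validator, a quick local check can catch files that are not even well-formed XML (an unescaped '&' in an author string, for instance). The following is a minimal sketch using the JDK's built-in XML parser; it is not part of the export application described here, the class and method names are hypothetical, and it checks XML well-formedness only, not RDF validity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

/** Hypothetical helper: local XML well-formedness pre-check (not an RDF validator). */
final class WellFormedCheck {

    /** Returns true if the stream parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(InputStream in) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // RDF/XML relies on namespaces
        try {
            factory.newDocumentBuilder().parse(in);
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Parse errors, read failures and parser setup problems are all
            // reported as "not usable" in this sketch.
            return false;
        }
    }

    /** Convenience overload for checking a small fragment held in memory. */
    static boolean isWellFormed(String xml) {
        return isWellFormed(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<note>Govaerts &amp; Dransfield</note>"));
        System.out.println(isWellFormed("<note>Govaerts & Dransfield</note>"));
    }
}
```

A file that passes this check would still need to go through the W3C RDF Validation service, since well-formed XML is not necessarily valid RDF.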
Guidelines for compiling the core fields data¶
A full mapping of the monocots fields and their corresponding tags in the TDWG ontology can be found below.
General¶
guideline | description |
---|---|
Keys | For the download, the important point is that the relations within the file are kept consistent. The CDM will generate its own internal keys. The format for these keys is detailed below. |
Empty fields | Omit the tag altogether, don't leave tags without an enclosed value. |
Links between records within the file | Referential integrity within the CDM will be addressed separately. Just keep the record links within the file consistent. Format for keys: Taxon name - 'palm_tn_'. Taxon concept - 'palm_tc_' |
Hybrid x | Include always. |
& symbols | All instances of '&' need to be replaced with '&amp;'. |
Diacritics | The following initialisation of the BufferedWriter class will output the file in a suitable encoding standard: BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file),Charset.forName("UTF8"))); |
Publication data | Publication data should be tagged outside of the TaxonName and TaxonConcept tags. |
Publication data | Each instance of publication data should draw its key from the Publication_Edition table rather than Plant_Citation, which is more of an associative table. |
Null references | In the monocots, certain records are provided to take the place of null values. An example of this is in Plant_Name, record -9999 "no Basionym" which is used as a reference for the Accepted_plant_name field where that name is in fact accepted. In the tdwg ontology, such a record will equate to null, and therefore the tags themselves will be omitted. |
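Several of the general guidelines above (key prefixes, omitting empty tags, escaping '&', and dropping null-placeholder references) can be sketched as small helper methods. This is an illustrative sketch only, not the code of the export application: the class and method names are hypothetical, the `rdf:resource="#..."` link form is an assumption about the file layout, escaping '<' goes beyond what the guideline asks for, and -9999 is the only placeholder id named in the text.

```java
/** Hypothetical helpers illustrating the general guidelines. */
final class RdfGuidelines {

    /** Placeholder id used by the monocots database in place of a null (example from the text). */
    static final long NULL_REFERENCE_ID = -9999L;

    /** Key for a taxon name record, e.g. palm_tn_123. */
    static String taxonNameKey(long plantNameId) {
        return "palm_tn_" + plantNameId;
    }

    /** Key for a taxon concept record, e.g. palm_tc_123. */
    static String taxonConceptKey(long plantNameId) {
        return "palm_tc_" + plantNameId;
    }

    /** Escapes '&' (as required by the guideline) and '<' (added here so the output stays well-formed). */
    static String escapeXml(String value) {
        // '&' must be replaced first, otherwise the '&' in "&lt;" would be escaped again.
        return value.replace("&", "&amp;").replace("<", "&lt;");
    }

    /** Returns a complete element, or an empty string so that an empty field produces no tag at all. */
    static String element(String tag, String value) {
        if (value == null || value.trim().isEmpty()) {
            return "";
        }
        return "<" + tag + ">" + escapeXml(value.trim()) + "</" + tag + ">";
    }

    /** Returns a link to another name record, or nothing when the id is the null placeholder. */
    static String nameLink(String tag, long plantNameId) {
        if (plantNameId == NULL_REFERENCE_ID) {
            return "";
        }
        // Assumed link form; the real file may reference keys differently.
        return "<" + tag + " rdf:resource=\"#" + taxonNameKey(plantNameId) + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(element("tn:authorship", "Govaerts & Dransfield"));
        System.out.println(nameLink("tn:hasBasionym", 123));
    }
}
```

Returning an empty string for blank values and placeholder ids lets the caller append every field unconditionally while still honouring the rule that tags without a value are omitted.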
Specific tags¶
tag | table | fields | notes |
---|---|---|---|
Taxon Name | | | |
Plant Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' | |
tn:authorship | Authors | Author | Basionym authors and primary authors are concatenated together. The basionym author is placed within parentheses. |
tn:basionymAuthorship | Authors | Author | Where Author_type_id = 'PAR' |
tn:combinationAuthorship | Authors | Author | Where Author_type_id = 'PRM' |
tn:genusPart | Plant_Name | Genus | found in main table |
tn:hasBasionym | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' |
tn:infragenericEpithet | Plant_Name | Supraspecific_epithet | found in main table |
tn:infraspecificEpithet | Plant_Name | Infraspecific_epithet | found in main table |
tn:nameComplete | Plant_Name | Full_epithet | found in main table |
n/a | n/a | Hard-coded with the link to the relevant tdwg ontology: http://rs.tdwg.org/ontology/voc/TaxonName#ICBN | |
tn:rank | n/a | n/a | Contains a link to relevant tdwg ontology. The link provided depends on the rank of the particular record |
tn:rankString | Contains a string representing the rank of the record. This field is conditional on the data present. 1) Records lacking data in both Infraspecific_rank and Species are output as 'Genus'. 2) Records lacking data only in Infraspecific_rank are output as 'Species'. 3) If data is present in Infraspecific_rank, then that value is output. | ||
tn:specificEpithet | Plant_Name | Species | found in main table |
tn:year | Plant_Name | First_published | Data in this field is variable, and is manipulated to output only the 4 digits of the year value. |
tcom:publishedIn | Place_of_publication, Plant Name | Place_of_publication, Volume_and_page, First_published | Concatenated in the order shown here. Parentheses are included around the year field for clarity. |
tn:hasAnnotation tn:NomenclaturalNote | Hard-coded with the link to the relevant tdwg ontology - http://rs.tdwg.org/ontology/voc/TaxonName#PublicationStatus | ||
tn:note | Plant Name | Nomenclatural_remarks | This field is conditional. 1) If data is present in Nomenclatural_remarks field, then this field is output. 2) If this field is empty, output 'valid' |
Taxon Concept | | | |
tc:TaxonConcept | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tc_' |
tc:hasName | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' |
tc:accordingTo | In the case of the monocots checklist, this will always correspond to the WCP publication: 'Govaerts, R. & Dransfield, J. (2005). World Checklist of Palms: 1-223. The Board of Trustees of the Royal Botanic Gardens, Kew.' | ||
tc:primary | Always 'true' | ||
tc:hasRelationship tc:relationship | This series of tags records the relationships of this concept with other names. Accepted names list their synonyms. Synonyms list their accepted names. | ||
Contains the link to the part of the tdwg ontology representing the relationship, eg - http://rs.tdwg.org/ontology/voc/TaxonConcept#IsSynonymFor | |||
Contains the link to the related record. This Id is output as 'palm_tc_' | |||
Publication Citation | | | |
tn:PublicationCitation rdf:about | Publication Edition | Publication_edition_id | Primary key of Publication Edition table. Output as 'palm_pub_ed_' |
tn:authorship | Publication Edition | Article_author | |
tn:datePublished | Publication Edition | Published_date | This data is variable and occurs in different formats - ie, yyyy, mm yyyy, etc |
tpub:pages | Publication Edition | Page_number_from/Page_number_to | Concatenate the two fields together with a hyphen (-) |
Publication | Publication_id | Key of Publication table. Formatted: "palm_pub_" | |
tpub:parentPublicationString | Publication | Full_title | Title of the parent publication |
Contains the link to the part of the tdwg ontology representing the publication type, eg - http://rs.tdwg.org/ontology/voc/PublicationCitation#Book | |||
tpub:publisher | Publication | Published | String representing the publisher |
tpub:shortTitle | Publication | Abreviated_title | |
tpub:title | Publication Edition | Full_title | |
tpub:volume | Publication Edition | Volume | |
tpub:year | Publication Edition | Published_date |
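The field manipulations described in the notes column above (authorship concatenation, year extraction, the conditional rank string, the publishedIn concatenation, the 'valid' default and the page range) can be sketched as plain string functions. This is a minimal illustration under stated assumptions, not the code of the export application: all names are hypothetical, a single space is assumed as the separator, the first four-digit run is assumed to be the year, and parentheses are added around the year only when the stored value does not already start with one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the per-field rules from the mapping table. */
final class NameFieldRules {

    private static final Pattern FOUR_DIGITS = Pattern.compile("\\d{4}");

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** tn:authorship: basionym author in parentheses, followed by the primary author. */
    static String authorship(String basionymAuthor, String primaryAuthor) {
        StringBuilder sb = new StringBuilder();
        if (!isBlank(basionymAuthor)) {
            sb.append('(').append(basionymAuthor.trim()).append(')');
        }
        if (!isBlank(primaryAuthor)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(primaryAuthor.trim());
        }
        return sb.toString();
    }

    /** tn:year: the first run of four digits in First_published, or null if there is none. */
    static String year(String firstPublished) {
        if (isBlank(firstPublished)) return null;
        Matcher m = FOUR_DIGITS.matcher(firstPublished);
        return m.find() ? m.group() : null;
    }

    /** tn:rankString: the three conditional cases from the table. */
    static String rankString(String species, String infraspecificRank) {
        if (!isBlank(infraspecificRank)) return infraspecificRank.trim();
        return isBlank(species) ? "Genus" : "Species";
    }

    /** tcom:publishedIn: place, volume and page, then the year in parentheses. */
    static String publishedIn(String place, String volumeAndPage, String firstPublished) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {place, volumeAndPage}) {
            if (!isBlank(part)) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(part.trim());
            }
        }
        if (!isBlank(firstPublished)) {
            String y = firstPublished.trim();
            if (!y.startsWith("(")) y = "(" + y + ")"; // assumption: avoid doubled parentheses
            if (sb.length() > 0) sb.append(' ');
            sb.append(y);
        }
        return sb.toString();
    }

    /** tn:note: the nomenclatural remarks, or 'valid' when there are none. */
    static String note(String nomenclaturalRemarks) {
        return isBlank(nomenclaturalRemarks) ? "valid" : nomenclaturalRemarks.trim();
    }

    /** tpub:pages: from and to joined with a hyphen; a lone start page is returned as-is. */
    static String pages(String from, String to) {
        if (isBlank(from)) return null;
        return isBlank(to) ? from.trim() : from.trim() + "-" + to.trim();
    }
}
```

Methods that return null signal that the corresponding tag should be omitted rather than written empty.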
The monocots database contains a different set of publication types to the TDWG ontology (http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationTypeTerm). These need to be mapped over. I've circulated this question around WP5 and consulted the checklist editors here at Kew to produce the following mapping:
monocots | tdwg | notes |
---|---|---|
Book | Book | |
Journal | Journal | |
Personal Communication | Communication | |
Serial Flora | Book Series | |
Specimen | Book Series? | |
Electronic | Webpage? | "Electronic" is a fuzzy term. This is seen as the closest approximation. I'll soon produce a list describing the details of the electronic types so we can make a more accurate assessment of what these types actually refer to |
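The mapping in the table above can be held in a simple lookup. The sketch below is illustrative only: the class and method names are hypothetical, the 'Specimen' and 'Electronic' entries are as provisional as they are in the table, and the URI fragment is assumed to be the TDWG label with spaces removed, which should be checked against the ontology before use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup from monocots publication types to TDWG publication type URIs. */
final class PublicationTypeMapper {

    static final String TDWG_BASE = "http://rs.tdwg.org/ontology/voc/PublicationCitation#";

    private static final Map<String, String> MONOCOTS_TO_TDWG = new LinkedHashMap<>();
    static {
        MONOCOTS_TO_TDWG.put("Book", "Book");
        MONOCOTS_TO_TDWG.put("Journal", "Journal");
        MONOCOTS_TO_TDWG.put("Personal Communication", "Communication");
        MONOCOTS_TO_TDWG.put("Serial Flora", "Book Series");
        MONOCOTS_TO_TDWG.put("Specimen", "Book Series");  // provisional in the table
        MONOCOTS_TO_TDWG.put("Electronic", "Webpage");    // provisional in the table
    }

    /** Returns the assumed TDWG URI for a monocots type, or empty if the type is not mapped. */
    static Optional<String> typeUri(String monocotsType) {
        if (monocotsType == null) return Optional.empty();
        String label = MONOCOTS_TO_TDWG.get(monocotsType.trim());
        if (label == null) return Optional.empty();
        // Assumption: the fragment is the label without spaces.
        return Optional.of(TDWG_BASE + label.replace(" ", ""));
    }

    public static void main(String[] args) {
        System.out.println(typeUri("Serial Flora").orElse("(unmapped)"));
        System.out.println(typeUri("Thesis").orElse("(unmapped)"));
    }
}
```

Returning an empty Optional for unmapped types makes it easy to log them during an export run, so that new monocots types surface rather than being written with a wrong link.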
Further questions?¶
LSIDs | Will the CDM generate its own LSIDs? What about LSIDs generated here at Kew? |
---|---|
References/Publications | Is there a further set of data about references and publications we can use? |
The extended dataset¶
The process for compiling and importing the extended dataset is now being addressed under the milestone "Data Gathering - extended Palms data".
Next stage¶
The export process will be amended to take into account the findings of the Kew meeting and any other comments. A second file will be produced which will be circulated throughout the team for comment. After a review of this file we hope to be in a position to describe a definitive process for the RDF download and produce a file suitable for export into the EDIT CDM.
Updated by Andreas Müller almost 2 years ago · 11 revisions