Palmweb data download


The data for the Palmweb data portal is to be compiled from two sources: the core checklist data and the wider descriptive data. The core checklist data is currently stored in a Sybase system housed at RBG Kew. This data is to be exported into a TCS/RDF file for upload into the EDIT CDM. The wider set of data required by Palmweb (taxon descriptions, common names and distribution data, for example) will be compiled from a range of other sources, including scientific papers and journals.

Current progress

The RDF output from the monocots checklist has been updated to include all publication data, and this file has been validated. All the comments made by the Kew workshop and the team in Berlin have been incorporated into the file structure. The full file is available for download from this wiki page as arecaceae.rdf.

Core fields data

The checklist data is stored in Kew’s monocots checklist database, a bespoke Sybase system that serves as a definitive checklist for RBG Kew. The database is served online at the World Checklist of Monocotyledons website: http://www.kew.org/wcsp/monocots.

For the download, the database is accessed using a Java/Hibernate application which maps to the database fields and outputs the data in the appropriate TCS/RDF tags. Converting the information into a TCS/RDF format is not entirely straightforward, however. The fields in the monocots database do not map directly onto the available TCS tags, so some degree of data manipulation is required before outputting the data: concatenating and splitting fields, for example, or removing certain characters that have specific purposes in the monocots system but are not needed in the TCS format. The steps taken in manipulating the data have been reviewed at a workshop held at Kew with the checklist editors, and have been circulated within WP5 as the work has progressed. These steps are detailed further below.

The data is packaged up according to the TCS/RDF schema found at the TDWG website: http://rs.tdwg.org/ontology/voc/TaxonName#

and the W3C RDF vocabulary description found at: http://www.w3.org/TR/rdf-schema/

The resulting files are being checked using the W3C RDF Validation service provided at: http://www.w3.org/RDF/Validator/

Guidelines for compiling the core fields data

A full mapping of the monocots fields and their corresponding tags in the TDWG ontology can be found below or by following this link:

http://dev.e-taxonomy.eu/trac/attachment/wiki/SampleDataConversion/Monocotyledoneae/moncotots-tdwg%20mapping.doc

General

| guideline | description |
|---|---|
| Keys | For the download, the important point is that the relations within the file are kept consistent. The CDM will generate its own internal keys. The format for these keys is detailed below. |
| Empty fields | Omit the tag altogether; don't leave tags without an enclosed value. |
| Links between records within the file | Referential integrity within the CDM will be addressed separately. Just keep the record links within the file consistent. Format for keys: Taxon name - 'palm_tn_'. Taxon concept - 'palm_tc_'. |
| Hybrid x | Always include. |
| & symbols | All instances of '&' need to be replaced with the XML entity '&amp;'. |
| Diacritics | The following initialisation of the BufferedWriter class will output the file in a suitable encoding standard: BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), Charset.forName("UTF-8"))); |
| Publication data | Publication data should be tagged outside of the TaxonName and TaxonConcept tags. |
| Publication data | Each instance of publication data should draw its key from the Publication_Edition table rather than Plant_Citation, which is more of an associative table. |
| Null references | In the monocots database, certain records are provided to take the place of null values. An example of this is in Plant_Name, record -9999 "no Basionym", which is used as a reference for the Accepted_plant_name field where that name is in fact accepted. In the TDWG ontology, such a record will equate to null, and therefore the tags themselves will be omitted. |
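
The general guidelines above could be sketched in Java roughly as follows. This is an illustration only, not the code of the actual export application: the class and method names are hypothetical, and two details are assumptions: that each key is the stated prefix followed by the record's primary key, and that links between records are written as rdf:resource references to a local fragment.

```java
// Hypothetical helper class; names are illustrative and not taken from the export application.
final class PalmExportHelpers {

    // Placeholder record id used in Plant_Name to stand in for a null reference.
    static final long NULL_REFERENCE_ID = -9999L;

    // Assumption: the key is the prefix followed by the primary key of Plant_Name.
    static String taxonNameKey(long plantNameId) {
        return "palm_tn_" + plantNameId;
    }

    static String taxonConceptKey(long plantNameId) {
        return "palm_tc_" + plantNameId;
    }

    // Replaces every '&' with '&amp;'. Assumes the input is raw database text
    // that has not been escaped already.
    static String escapeAmpersands(String value) {
        return value.replace("&", "&amp;");
    }

    // Returns the complete element, or an empty string when there is no value,
    // so that empty tags are never written to the file.
    static String element(String tag, String value) {
        if (value == null || value.trim().isEmpty()) {
            return "";
        }
        return "<" + tag + ">" + escapeAmpersands(value.trim()) + "</" + tag + ">";
    }

    // Returns a link to another taxon name record, or an empty string when the id
    // is missing or points at the -9999 placeholder record.
    // Assumption: links are expressed as rdf:resource="#key".
    static String reference(String tag, Long plantNameId) {
        if (plantNameId == null || plantNameId == NULL_REFERENCE_ID) {
            return "";
        }
        return "<" + tag + " rdf:resource=\"#" + taxonNameKey(plantNameId) + "\"/>";
    }
}
```

Returning an empty string for missing values keeps the "omit the tag altogether" rule in one place, so the calling code can append the result of every field without checking for nulls itself.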

Specific tags

Taxon Name

| tag | table | fields | notes |
|---|---|---|---|
| | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' |
| tn:authorship | Authors | Author | Basionym authors and primary authors are concatenated. The basionym author is placed within parentheses. |
| tn:basionymAuthorship | Authors | Author | Where Author_type_id = 'PAR' |
| tn:combinationAuthorship | Authors | Author | Where Author_type_id = 'PRM' |
| tn:genusPart | Plant_Name | Genus | Found in main table |
| tn:hasBasionym | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' |
| tn:infragenericEpithet | Plant_Name | Supraspecific_epithet | Found in main table |
| tn:infraspecificEpithet | Plant_Name | Infraspecific_epithet | Found in main table |
| tn:nameComplete | Plant_Name | Full_epithet | Found in main table |
| | n/a | n/a | Hard-coded with the link to the relevant TDWG ontology: http://rs.tdwg.org/ontology/voc/TaxonName#ICBN |
| tn:rank | n/a | n/a | Contains a link to the relevant TDWG ontology. The link provided depends on the rank of the particular record. |
| tn:rankString | | | Contains a string representing the rank of the record. This field is conditional on the data present. 1) Records lacking data in both Infraspecific_rank and Species: output 'Genus'. 2) Records lacking data only in Infraspecific_rank: output 'Species'. 3) If data is present in Infraspecific_rank, then that value is output. |
| tn:specificEpithet | Plant_Name | Species | Found in main table |
| tn:year | Plant_Name | First_published | Data in this field is variable, and is manipulated to output only the 4 digits of the year value. |
| tcom:publishedIn | Place_of_publication, Plant_Name | Place_of_publication, Volume_and_page, First_published | Concatenated in the order shown here. Parentheses are included around the year field for clarity. |
| tn:hasAnnotation tn:NomenclaturalNote | | | Hard-coded with the link to the relevant TDWG ontology: http://rs.tdwg.org/ontology/voc/TaxonName#PublicationStatus |
| tn:note | Plant_Name | Nomenclatural_remarks | This field is conditional. 1) If data is present in the Nomenclatural_remarks field, then that data is output. 2) If the field is empty, output 'valid'. |

Taxon Concept

| tag | table | fields | notes |
|---|---|---|---|
| tc:TaxonConcept | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tc_' |
| tc:hasName | Plant_Name | Plant_name_id | Primary key of Plant_Name table. Output as 'palm_tn_' |
| tc:accordingTo | | | In the case of the monocots checklist, this will always correspond to the WCP publication: 'Govaerts, R. & Dransfield, J. (2005). World Checklist of Palms: 1-223. The Board of Trustees of the Royal Botanic Gardens, Kew.' |
| tc:primary | | | Always 'true' |
| tc:hasRelationship tc:relationship | | | This series of tags records the relationships of this concept with other names. Accepted names list their synonyms. Synonyms list their accepted names. |
| | | | Contains the link to the part of the TDWG ontology representing the relationship, e.g. http://rs.tdwg.org/ontology/voc/TaxonConcept#IsSynonymFor |
| | | | Contains the link to the related record. This Id is output as 'palm_tc_' |

Publication Citation

| tag | table | fields | notes |
|---|---|---|---|
| tn:PublicationCitation rdf:about | Publication_Edition | Publication_edition_id | Primary key of Publication_Edition table. Output as 'palm_pub_ed_' |
| tn:authorship | Publication_Edition | Article_author | |
| tn:datePublished | Publication_Edition | Published_date | This data is variable and occurs in different formats, e.g. yyyy, mm yyyy, etc. |
| tpub:pages | Publication_Edition | Page_number_from/Page_number_to | Concatenate the two fields with a hyphen. |
| | Publication | Publication_id | Key of Publication table. Formatted: "palm_pub_" |
| tpub:parentPublicationString | Publication | Full_title | Title of the parent publication |
| | | | Contains the link to the part of the TDWG ontology representing the publication type, e.g. http://rs.tdwg.org/ontology/voc/PublicationCitation#Book |
| tpub:publisher | Publication | Published | String representing the publisher |
| tpub:shortTitle | Publication | Abreviated_title | |
| tpub:title | Publication_Edition | Full_title | |
| tpub:volume | Publication_Edition | Volume | |
| tpub:year | Publication_Edition | Published_date | |
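
Several of the notes above describe small field manipulations (tn:authorship, tn:year, tn:rankString, tn:note and tpub:pages). A minimal Java sketch of those rules is given here to make them concrete. The class and method names are hypothetical, and the handling of edge cases is an assumption rather than the behaviour of the actual export application: in particular, the year is taken to be the first standalone run of four digits, and a page range with only one value present is output without a hyphen.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and edge-case handling are assumptions.
final class TaxonNameFields {

    // Assumption: the year is the first standalone run of exactly four digits.
    private static final Pattern FOUR_DIGIT_YEAR = Pattern.compile("\\b(\\d{4})\\b");

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // tn:authorship: basionym authors (type 'PAR') in parentheses, then primary authors (type 'PRM').
    static String authorship(String basionymAuthors, String primaryAuthors) {
        StringBuilder sb = new StringBuilder();
        if (!isBlank(basionymAuthors)) {
            sb.append('(').append(basionymAuthors.trim()).append(')');
        }
        if (!isBlank(primaryAuthors)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(primaryAuthors.trim());
        }
        return sb.toString();
    }

    // tn:year: only the four digits of the year, or null if First_published contains none.
    static String year(String firstPublished) {
        if (firstPublished == null) {
            return null;
        }
        Matcher m = FOUR_DIGIT_YEAR.matcher(firstPublished);
        return m.find() ? m.group(1) : null;
    }

    // tn:rankString: the three conditional cases from the table.
    static String rankString(String infraspecificRank, String species) {
        if (!isBlank(infraspecificRank)) {
            return infraspecificRank.trim();
        }
        return isBlank(species) ? "Genus" : "Species";
    }

    // tn:note: the nomenclatural remarks if present, otherwise 'valid'.
    static String note(String nomenclaturalRemarks) {
        return isBlank(nomenclaturalRemarks) ? "valid" : nomenclaturalRemarks.trim();
    }

    // tpub:pages: Page_number_from and Page_number_to joined with a hyphen.
    // Assumption: if only one of the two is present, it is output on its own.
    static String pages(String from, String to) {
        if (isBlank(from) && isBlank(to)) {
            return null;
        }
        if (isBlank(to)) {
            return from.trim();
        }
        if (isBlank(from)) {
            return to.trim();
        }
        return from.trim() + "-" + to.trim();
    }
}
```

Methods that return null signal that the corresponding tag should be omitted, in line with the guideline on empty fields.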

The monocots database contains a different set of publication types from the TDWG ontology (http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationTypeTerm). These need to be mapped over. I've circulated this question around WP5 and consulted the checklist editors here at Kew to produce the following mapping:

| monocots | tdwg | notes |
|---|---|---|
| Book | Book | |
| Journal | Journal | |
| Personal Communication | Communication | |
| Serial Flora | Book Series | |
| Specimen | Book Series? | |
| Electronic | Webpage? | "Electronic" is a fuzzy term. This is seen as the closest approximation. I'll soon produce a list describing the details of the electronic types so we can make a more accurate assessment of what these types actually refer to. |
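
In the export code, a mapping like this could be held as a simple lookup table. The Java sketch below is illustrative only: the class and method names are hypothetical, and it returns the TDWG term labels exactly as written in the table rather than full ontology URIs, since the URI for each term is not listed here. The two mappings marked with a question mark are flagged as provisional so they can be reviewed later.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical lookup class; the labels are copied from the mapping table.
final class PublicationTypeMapping {

    private static final Map<String, String> MONOCOTS_TO_TDWG = new HashMap<>();
    // Monocots types whose mapping carries a question mark in the table.
    private static final Set<String> PROVISIONAL = new HashSet<>();

    static {
        MONOCOTS_TO_TDWG.put("Book", "Book");
        MONOCOTS_TO_TDWG.put("Journal", "Journal");
        MONOCOTS_TO_TDWG.put("Personal Communication", "Communication");
        MONOCOTS_TO_TDWG.put("Serial Flora", "Book Series");
        MONOCOTS_TO_TDWG.put("Specimen", "Book Series");
        MONOCOTS_TO_TDWG.put("Electronic", "Webpage");
        PROVISIONAL.add("Specimen");
        PROVISIONAL.add("Electronic");
    }

    // Returns the TDWG term label for a monocots publication type.
    // Fails loudly on an unmapped type rather than guessing a default.
    static String toTdwgLabel(String monocotsType) {
        String label = MONOCOTS_TO_TDWG.get(monocotsType);
        if (label == null) {
            throw new IllegalArgumentException("Unmapped monocots publication type: " + monocotsType);
        }
        return label;
    }

    // True when the mapping is still under discussion.
    static boolean isProvisional(String monocotsType) {
        return PROVISIONAL.contains(monocotsType);
    }
}
```

Throwing on an unknown type is a design choice for this sketch: it surfaces any publication type that has not yet been discussed instead of silently writing a wrong term to the file.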

Further questions?

LSIDs: Will the CDM generate its own LSIDs? What about LSIDs generated here at Kew?
References/Publications: Is there a further set of data about references and publications we can use?

The extended dataset

The process for compiling and importing the extended dataset is now being addressed under the milestone "Data Gathering - extended Palms data".

Next stage

The export process will be amended to take into account the findings of the Kew meeting and any other comments. A second file will then be produced and circulated throughout the team for comment. After a review of this file we hope to be in a position to describe a definitive process for the RDF download and produce a file suitable for import into the EDIT CDM.

Updated by Andreas Müller almost 2 years ago · 11 revisions