Palmweb data download

The data for the Palmweb data portal is to be compiled from two sources: the core checklist data and the wider descriptive data. The core checklist data is currently stored in a Sybase system housed at RBG Kew and is to be exported into a TCS/RDF file for upload into the EDIT CDM. The wider set of data required by Palmweb (taxon descriptions, common names and distribution data, for example) will be compiled from a range of other sources, including scientific papers and journals.

Current progress

The RDF output from the monocots checklist has been updated to include all publication data, and this file has been validated. All the comments made by the Kew workshop and the team in Berlin have been incorporated into the file structure. The full file is available for download from this wiki page as arecaceae.rdf.

Core fields data

The checklist data is stored in Kew's monocots checklist database, a bespoke Sybase system that serves as a definitive checklist for RBG Kew. The database is served online at the World Checklist of Monocotyledons website: http://www.kew.org/wcsp/monocots.

For the download, the database is accessed using a Java/Hibernate application which maps to the database fields and outputs the data in the appropriate TCS/RDF tags. Converting the information into TCS/RDF format is not entirely straightforward, however. The fields in the monocots database do not map directly onto the available TCS tags, so some data manipulation is required before output: concatenating and splitting fields, for example, or removing characters that have specific purposes in the monocots system but are not needed in the TCS format. The steps taken in manipulating the data have been reviewed at a workshop held at Kew with the checklist editors, and have been circulated within WP5 as the work has progressed. These steps are detailed further below.

The data is packaged up according to the TCS/RDF schema found at the tdwg website: http://rs.tdwg.org/ontology/voc/TaxonName#

and the W3C RDF vocabulary description found at: http://www.w3.org/TR/rdf-schema/

The resulting files are being checked using the W3C RDF Validation service provided at: http://www.w3.org/RDF/Validator/
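
Before a file is submitted to the W3C service, a quick local check can catch basic problems such as unescaped '&' characters. The sketch below uses the JDK's standard XML parser. The class name RdfWellFormedCheck and its method are illustrative only and are not part of the export application; the check covers XML well-formedness, not RDF validity, so it does not replace the validation service.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks XML well-formedness only, not RDF semantics.
final class RdfWellFormedCheck {

    /** Returns true when the given text parses as well-formed, namespace-aware XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler throws on fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Parse, I/O or configuration failure: treat the input as not well-formed.
            return false;
        }
    }
}
```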

Guidelines for compiling the core fields data

A full mapping of the monocots fields and their corresponding tags in the tdwg ontology can be found below or by clicking this link:

http://dev.e-taxonomy.eu/trac/attachment/wiki/SampleDataConversion/Monocotyledoneae/moncotots-tdwg%20mapping.doc

General

guideline	description
Keys	For the download, the important point is that the relations within the file are kept consistent. The CDM will generate its own internal keys. The format for these keys is detailed below.
Empty fields	Omit the tag altogether, don't leave tags without an enclosed value.
Links between records within the file	Referential integrity within the CDM will be addressed separately. Just keep the record links within the file consistent. Format for keys: Taxon name - 'palm_tn_'. Taxon concept - 'palm_tc_'.
Hybrid x	Include always.
& symbols	All instances of '&' need to be replaced with '&amp;'.
Diacritics	The following initialisation of the BufferedWriter class will output the file in a suitable encoding standard: BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file),Charset.forName("UTF8")));
Publication data	Publication data should be tagged outside of the TaxonName and TaxonConcept tags.
Publication data	Each instance of publication data should draw its key from the Publication_Edition table rather than Plant_Citation, which is more of an associative table.
Null references	In the monocots database, certain records are provided to take the place of null values. An example of this is in Plant_Name, record -9999 "no Basionym", which is used as a reference for the Accepted_plant_name field where that name is in fact accepted. In the TDWG ontology, such a record equates to null, and therefore the tags themselves are omitted.
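
Several of the general guidelines above can be sketched as small helper methods. This is a minimal illustration, not the export application's actual code: the class and method names are hypothetical, appending the Plant_name_id directly to the key prefix is an assumption, and the escaping assumes the database values are raw, unescaped text.

```java
// Hypothetical helpers illustrating the general guidelines; names are not from the export application.
final class GeneralGuidelines {

    // Key prefixes taken from the guidelines; appending the record id is an assumption.
    static final String TAXON_NAME_PREFIX = "palm_tn_";
    static final String TAXON_CONCEPT_PREFIX = "palm_tc_";

    // Placeholder record id used by the monocots database in place of null ("no Basionym").
    static final long NULL_PLACEHOLDER_ID = -9999L;

    static String taxonNameKey(long plantNameId) {
        return TAXON_NAME_PREFIX + plantNameId;
    }

    static String taxonConceptKey(long plantNameId) {
        return TAXON_CONCEPT_PREFIX + plantNameId;
    }

    /** True when the id points at the placeholder record, so the tag should be omitted. */
    static boolean isNullReference(long id) {
        return id == NULL_PLACEHOLDER_ID;
    }

    /** Replaces every '&' with the XML entity; assumes the input is raw, unescaped text. */
    static String escapeAmpersands(String value) {
        return value.replace("&", "&amp;");
    }

    /** Returns the complete element, or an empty string when there is no value to enclose. */
    static String element(String tag, String value) {
        if (value == null || value.trim().isEmpty()) {
            return "";
        }
        return "<" + tag + ">" + escapeAmpersands(value.trim()) + "</" + tag + ">";
    }
}
```

Returning an empty string from element keeps the "omit the tag altogether" rule in one place, so callers never need to write an empty tag pair.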

Specific tags

tag	table	fields	notes
Taxon Name
	Plant Name	Plant_name_id	Primary key of Plant_Name table. Output as 'palm_tn_'
tn:authorship	Authors	Author	Basionym authors and primary authors are concatenated together. The basionym author is placed within parentheses.
tn:basionymAuthorship	Authors	Author	Where Author_type_id = 'PAR'
tn:combinationAuthorship	Authors	Author	Where Author_type_id = 'PRM'
tn:genusPart	Plant_Name	Genus	found in main table
tn:hasBasionym	Plant_Name	Plant_name_id	Primary key of Plant_Name table. Output as 'palm_tn_'
tn:infragenericEpithet	Plant_Name	Supraspecific_epithet	found in main table
tn:infraspecificEpithet	Plant_Name	Infraspecific_epithet	found in main table
tn:nameComplete	Plant_Name	Full_epithet	found in main table
	n/a	n/a	Hard-coded with the link to the relevant tdwg ontology: http://rs.tdwg.org/ontology/voc/TaxonName#ICBN
tn:rank	n/a	n/a	Contains a link to relevant tdwg ontology. The link provided depends on the rank of the particular record
tn:rankString			Contains a string representing the rank of the record. The value depends on the data present: 1) records with no data in either Infraspecific_rank or Species are output as 'Genus'; 2) records with no data in Infraspecific_rank only are output as 'Species'; 3) where Infraspecific_rank holds data, its value is output.
tn:specificEpithet	Plant_Name	Species	found in main table
tn:year	Plant_Name	First_published	Data in this field is variable, and is manipulated to output only the 4 digits of the year value.
tcom:publishedIn	Place_of_publication, Plant Name	Place_of_publication, Volume_and_page, First_published	Concatenated in the order shown here. Parentheses are included around the year field for clarity.
tn:hasAnnotation tn:NomenclaturalNote			Hard-coded with the link to the relevant tdwg ontology - http://rs.tdwg.org/ontology/voc/TaxonName#PublicationStatus
tn:note	Plant Name	Nomenclatural_remarks	This field is conditional. 1) If data is present in Nomenclatural_remarks field, then this field is output. 2) If this field is empty, output 'valid'
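
The conditional and concatenated Taxon Name fields above could be derived along the following lines. This is a sketch under assumptions rather than the export application's code: the class and method names are hypothetical, the single space between the parenthesised basionym author and the primary author is assumed, and the year is taken as the first four consecutive digits found in First_published.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers for the conditional Taxon Name tags; names are illustrative only.
final class TaxonNameFields {

    private static final Pattern FOUR_DIGITS = Pattern.compile("\\d{4}");

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** tn:authorship: basionym author in parentheses, followed by the primary author. */
    static String authorship(String basionymAuthor, String primaryAuthor) {
        StringBuilder sb = new StringBuilder();
        if (!isBlank(basionymAuthor)) {
            sb.append('(').append(basionymAuthor.trim()).append(')');
        }
        if (!isBlank(primaryAuthor)) {
            if (sb.length() > 0) {
                sb.append(' '); // separator is an assumption
            }
            sb.append(primaryAuthor.trim());
        }
        return sb.toString();
    }

    /** tn:year: first four consecutive digits in First_published, or null when none are found. */
    static String year(String firstPublished) {
        if (firstPublished == null) {
            return null;
        }
        Matcher m = FOUR_DIGITS.matcher(firstPublished);
        return m.find() ? m.group() : null;
    }

    /** tn:rankString: the infraspecific rank when present, otherwise 'Species' or 'Genus'. */
    static String rankString(String species, String infraspecificRank) {
        if (!isBlank(infraspecificRank)) {
            return infraspecificRank.trim();
        }
        return isBlank(species) ? "Genus" : "Species";
    }

    /** tn:note: the nomenclatural remarks when present, otherwise 'valid'. */
    static String note(String nomenclaturalRemarks) {
        return isBlank(nomenclaturalRemarks) ? "valid" : nomenclaturalRemarks.trim();
    }
}
```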

Taxon Concept
tc:TaxonConcept	Plant_Name	Plant_name_id	Primary key of Plant_Name table. Output as 'palm_tc_'
tc:hasName	Plant_Name	Plant_name_id	Primary key of Plant_Name table. Output as 'palm_tn_'
tc:accordingTo			In the case of the monocots checklist, this will always correspond to the WCP publication: 'Govaerts, R. & Dransfield, J. (2005). World Checklist of Palms: 1-223. The Board of Trustees of the Royal Botanic Gardens, Kew.'
tc:primary			Always 'true'
tc:hasRelationship tc:relationship			This series of tags records the relationships of this concept with other names. Accepted names list their synonyms. Synonyms list their accepted names.
			Contains the link to the part of the tdwg ontology representing the relationship, eg - http://rs.tdwg.org/ontology/voc/TaxonConcept#IsSynonymFor
			Contains the link to the related record. This Id is output as 'palm_tc_'
Publication Citation
tn:PublicationCitation rdf:about	Publication Edition	Publication_edition_id	Primary key of Publication Edition table. Output as 'palm_pub_ed_'
tn:authorship	Publication Edition	Article_author
tn:datePublished	Publication Edition	Published_date	This data is variable and occurs in different formats - ie, yyyy, mm yyyy, etc
tpub:pages	Publication Edition	Page_number_from/Page_number_to	Concatenate the two fields with a hyphen (-)
	Publication	Publication_id	Key of Publication table. Formatted: "palm_pub_"
tpub:parentPublicationString	Publication	Full_title	Title of the parent publication
			Contains the link to the part of the tdwg ontology representing the publication type, eg - http://rs.tdwg.org/ontology/voc/PublicationCitation#Book
tpub:publisher	Publication	Published	String representing the publisher
tpub:shortTitle	Publication	Abreviated_title
tpub:title	Publication Edition	Full_title
tpub:volume	Publication Edition	Volume
tpub:year	Publication Edition	Published_date
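
The publication keys and the tpub:pages value from the table above could be built as follows. The class and method names are hypothetical, appending the record id to each prefix is an assumption, and the handling of records with only one of the two page fields is a guess that the table does not specify.

```java
// Hypothetical helpers for the Publication Citation tags; names are illustrative only.
final class PublicationFields {

    // Prefixes taken from the table; appending the record id is an assumption.
    static final String EDITION_PREFIX = "palm_pub_ed_";
    static final String PUBLICATION_PREFIX = "palm_pub_";

    static String editionKey(long publicationEditionId) {
        return EDITION_PREFIX + publicationEditionId;
    }

    static String publicationKey(long publicationId) {
        return PUBLICATION_PREFIX + publicationId;
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /**
     * tpub:pages: joins Page_number_from and Page_number_to with a hyphen.
     * Returns the single value when only one is present (assumed behaviour),
     * and null when both are missing so that the tag can be omitted.
     */
    static String pages(String from, String to) {
        boolean hasFrom = !isBlank(from);
        boolean hasTo = !isBlank(to);
        if (hasFrom && hasTo) {
            return from.trim() + "-" + to.trim();
        }
        if (hasFrom) {
            return from.trim();
        }
        if (hasTo) {
            return to.trim();
        }
        return null;
    }
}
```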

The monocots database contains a different set of publication types to the TDWG ontology (http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationTypeTerm). These need to be mapped over. I've circulated this question around WP5 and consulted the checklist editors here at Kew to produce the following mapping:

monocots	tdwg	notes
Book	Book
Journal	Journal
Personal Communication	Communication
Serial Flora	Book Series
Specimen	Book Series?
Electronic	Webpage?	"Electronic" is a fuzzy term, and this is seen as the closest approximation. I'll soon produce a list describing the details of the electronic types so we can assess more accurately what these types actually refer to.
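
The mapping above could be held in a simple lookup table. In this sketch the class and method names are hypothetical, the values are the labels from the table rather than verified TDWG term URIs, and the two entries marked with a question mark in the table are flagged as tentative in the comments.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup for the publication type mapping; labels are copied from the table.
final class PublicationTypeMapping {

    private static final Map<String, String> MONOCOTS_TO_TDWG;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Book", "Book");
        m.put("Journal", "Journal");
        m.put("Personal Communication", "Communication");
        m.put("Serial Flora", "Book Series");
        m.put("Specimen", "Book Series"); // tentative: marked with '?' in the table
        m.put("Electronic", "Webpage");   // tentative: marked with '?' in the table
        MONOCOTS_TO_TDWG = Collections.unmodifiableMap(m);
    }

    /** Returns the TDWG label for a monocots publication type, or null when it is unmapped. */
    static String toTdwg(String monocotsType) {
        return monocotsType == null ? null : MONOCOTS_TO_TDWG.get(monocotsType.trim());
    }
}
```

Returning null for an unmapped type lets the caller omit the tag, in line with the rule for empty fields.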

Further questions?

LSIDs	Will the CDM generate its own LSIDs? What about LSIDs generated here at Kew?
References/Publications	Is there a further set of data about references and publications we can use?

The extended dataset

The process for compiling and importing the extended dataset is now being addressed under the milestone "Data Gathering - extended Palms data".

Next stage

The export process will be amended to take into account the findings of the Kew meeting and any other comments. A second file will be produced which will be circulated throughout the team for comment. After a review of this file we hope to be in a position to describe a definitive process for the RDF download and produce a file suitable for export into the EDIT CDM.

Files (4)

Updated by Andreas Müller about 2 years ago · 11 revisions

arecaceae.xml (16.6 KB)	first XML download from the monocots database	Andreas Müller, 07/10/2008 01:47 PM
publication data.doc (76.5 KB)	monocots publication data mapping	Andreas Müller, 07/10/2008 01:48 PM
arecaceae.rdf (17.8 MB)	latest version of the rdf output - UPDATED to include publication data	Andreas Müller, 07/10/2008 01:49 PM
moncotots-tdwg mapping.doc (257 KB)		David Taylor, 07/14/2008 02:18 PM
