CdmVersionTwoDiscussion » History » Version 55

Helene Fradin, 03/24/2009 11:00 AM

{{>toc}}

# CDM v2.0 Discussion

----
_This page discusses possible changes to [CDM v1.4](http://wp5.e-taxonomy.eu/cdm/v14/) to be included in CDM v2.0._

_See also Component C5.80 - Review of CDM v.1 and model for descriptive data in CDM v.2._

----
## DESCRIPTIVE DATA - PROPOSED REVISIONS

----
## 1. MAJOR - Character/Descriptor/Feature concept
**Impacted objects: Feature**

The Feature class is described in the class comments as follows: "The class for individual properties (also designed as character, type or category) of observed phenomena able to be described or measured."

**a. Issues**

It is very interesting that Feature objects are not typed, unlike Characters in SDD (Categorical, Quantitative, etc.) and in many other models. However, when information is needed about what kind of data a certain Feature supports, it is not clearly stated how to understand and use the different attributes. Moreover, there are a dozen categories of Features (Additional Publication, Image, Cultivation, Description, ...) that are rich but difficult to interpret during an import.

As a reminder, below is the list of the Feature class attributes:

   - supportsTextData -> feature can be described with TextData objects
   - supportsQuantitativeData -> feature can be described with QuantitativeData objects
   - supportsDistribution -> feature can be described with Distribution objects (geographical)
   - supportsIndividualAssociation -> feature can be described with IndividualsAssociation objects (between the described specimen and a second one - for instance a host, only for SpecimenDescription)
   - supportsTaxonInteraction -> feature can be described with TaxonInteraction objects (between the described Taxon and a second one - for instance a parasite, a prey or a hybrid parent, only for TaxonDescription)
   - supportsCommonTaxonName -> feature can be described with CommonTaxonName objects
   - recommendedModifierEnumeration -> set of TermVocabulary containing the Modifier objects recommended to be used for DescriptionElementBase elements
   - recommendedStatisticalMeasures -> set of StatisticalMeasure recommended to be used in case of QuantitativeData
   - supportedCategoricalEnumerations -> set of TermVocabulary containing the list of possible State to be used in CategoricalData

The flexibility of the Feature class is not a problem for the import of SDD descriptive data: for each character, a new DESCRIPTION Feature instance is created:

   - for SDD CategoricalCharacter, supportedCategoricalEnumerations is set with the states defined in the SDD StateDefinition elements.
   - for SDD QuantitativeCharacter, supportsQuantitativeData is set to true.
   - for SDD TextCharacter, supportsTextData is set to true.
   - for SDD SequenceCharacter: so far, these data are not imported, and I don't have an SDD example of this element being used. I guess it should be imported into a Sequence object?

However, exporting SDD data raises questions about the Feature object. I can see 3 different problems:

1. There is no safeguard to ensure that the DescriptionElementBase objects used in a description agree with the way the corresponding Feature has been defined (for example, a DescriptionElementBase associated with a Feature that only carries information in supportedCategoricalEnumerations could be of type QuantitativeData).

2. The SDD standard and most descriptive models require the definition of a descriptive system (list of characters, potential states, potential measures) before the structured descriptions are expressed through this descriptive system. It is difficult to export this descriptive system properly to SDD: I can either export all Feature instances (but most of them will be irrelevant to the exported descriptions), or build the descriptive system by scanning all descriptions to extract only the characters that are actually used in the descriptions concerned (loss of efficiency).

3. In SDD, categorical states do not have to be defined at the Character level; they can be defined at a more general level and shared. Therefore, supportedCategoricalEnumerations could well be empty: how do we know then that the Feature supports StateData?

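
To make problem 1 concrete, here is a minimal Java sketch of the kind of safeguard that is missing. It does not use the CDM classes: `FeatureSketch`, `ElementKind` and `isAllowed` are hypothetical names introduced only for illustration, and states are reduced to plain strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Kinds of description element considered in this sketch (hypothetical enum). */
enum ElementKind { TEXT, QUANTITATIVE, CATEGORICAL }

/** Simplified stand-in for Feature; not the CDM class. */
class FeatureSketch {
    final String label;
    boolean supportsTextData;
    boolean supportsQuantitativeData;
    // Stand-in for supportedCategoricalEnumerations (states as plain strings).
    final Set<String> supportedCategoricalStates = new HashSet<>();

    FeatureSketch(String label) {
        this.label = label;
    }

    /** True if an element of the given kind agrees with how this feature is declared. */
    boolean isAllowed(ElementKind kind) {
        switch (kind) {
            case TEXT:
                return supportsTextData;
            case QUANTITATIVE:
                return supportsQuantitativeData;
            case CATEGORICAL:
                // Inferred from the state list, which is the weakness noted in problem 3.
                return !supportedCategoricalStates.isEmpty();
            default:
                return false;
        }
    }
}

class FeatureCheckDemo {
    public static void main(String[] args) {
        FeatureSketch leafLength = new FeatureSketch("Leaf length");
        leafLength.supportsQuantitativeData = true;
        System.out.println(leafLength.isAllowed(ElementKind.QUANTITATIVE));
        System.out.println(leafLength.isAllowed(ElementKind.CATEGORICAL));
    }
}
```

A check of this kind could be called whenever a description element is attached to a Feature. Note that the categorical case has to be guessed from the state list, so a Feature whose states are defined elsewhere would be rejected.
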
**b. Example**

Consider the feature (character/descriptor in other models) "Leaf length". Below are examples corresponding to each problem described above:

1. A new Feature instance named "Leaf length" is created with the attribute supportsQuantitativeData set to true and supportedCategoricalEnumerations set to null. It is still possible to create a DescriptionElementBase of type CategoricalData with the attribute feature set to the "Leaf length" feature and, for example, the attribute states set to a list of StateData containing one item {"small"}. -> A feature declared as quantitative is used as a categorical feature.

2. Exporting 2 descriptions from the CDM, which each contain only 1 DescriptionElementBase, such as:

   Viola hederacea -> Leaf Length (mm) -> {Min = 2.3, Mean = 5.1, Max = 7.9, SD = 1.3, N = 20}

   Viola betonicifolia -> Leaf Length (mm) -> {Min = 2.9, Mean = 5.3, Max = 7.4, SD = 1.3, N = 21}

   There might be other Feature instances stored in the CDM ("Leaf complexity", "Body shape", "Flattening of body", ...), related or not to the descriptions of such plants.

   Therefore, when exporting the descriptive system, either all features are exported, most of them unused, or the descriptions have to be scanned one by one to detect only those actually used. The latter is fine for this simple example, but with potentially hundreds of descriptions and hundreds of characters the complexity increases quickly.

3. The states "small", "medium", "large" could be defined as DescriptiveConcept elements in SDD, and the CategoricalCharacter "Leaf length" could contain no StateDefinition elements, using instead the states defined more generally in CodedDescriptions. In this case, when the character "LeafLength" is imported, a Feature with no supportedCategoricalEnumerations is created. The type of this Feature is undefined even though it supports CategoricalData.

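
The scanning alternative from example 2 can be sketched as follows. This is only an illustration under simplifying assumptions: `DescriptionSketch` and `collectUsedFeatures` are hypothetical names, and a description is reduced to the list of feature names its elements refer to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Simplified stand-in for a description: just the names of the features it uses. */
class DescriptionSketch {
    final String taxon;
    final List<String> usedFeatures;

    DescriptionSketch(String taxon, String... usedFeatures) {
        this.taxon = taxon;
        this.usedFeatures = Arrays.asList(usedFeatures);
    }
}

class UsedFeatureScan {
    /** Visits every element of every description and keeps each feature once. */
    static Set<String> collectUsedFeatures(List<DescriptionSketch> descriptions) {
        Set<String> used = new LinkedHashSet<>();
        for (DescriptionSketch description : descriptions) {
            used.addAll(description.usedFeatures);
        }
        return used;
    }

    public static void main(String[] args) {
        List<DescriptionSketch> descriptions = new ArrayList<>();
        descriptions.add(new DescriptionSketch("Viola hederacea", "Leaf length"));
        descriptions.add(new DescriptionSketch("Viola betonicifolia", "Leaf length"));
        System.out.println(collectUsedFeatures(descriptions));
    }
}
```

The scan touches every description element once, so its cost grows with the total number of elements exported, which is the efficiency concern raised above.
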
**c. Current solution**

For now, all Feature instances are exported.

**d. Proposed change (NOT IMPLEMENTED)**

I think there should be a distinction, within the Feature attributes, between the type of data supported by the Feature (supportsTextData, supportsQuantitativeData, etc.) and the domain of possible values or frame of reference (recommendedStatisticalMeasures, supportedCategoricalEnumerations).

In practical terms:

   - I would add a boolean attribute: 'supportsCategoricalData' *(IMPLEMENTED)*,
   - I would remove the domain of possible values (recommendedModifierEnumeration, recommendedStatisticalMeasures, supportedCategoricalEnumerations) and create a new class, called for example PossibleValues or RecommendedValues, from which RecommendedModifiers, RecommendedStates and RecommendedStatisticalMeasures would inherit.
   - I would add an attribute (e.g. PossibleValuesDomains) that would be a Set<RecommendedValues>.

This doesn't prevent problem 1 from happening, but at least it clarifies the typing of Feature objects: it is set only through the boolean attributes 'supports...'.

It doesn't resolve problem 2. I would suggest attaching a DescriptiveSystem object to a DescriptionBase object (see item 6).

It resolves problem 3. The typing of the Feature will depend only on the boolean attributes.

[Gregor Hagedorn - 27/02/2009] One comment on PossibleStatisticalMeasures: at this point both SDD and CDM take the position that all statistical measures known to the system are in principle valid data and thus allowed. At the same time, the designer of a matrix has a valid interest in choosing preferred measures. This is the reason why we speak of "recommendedStatisticalMeasures". Example: "Leaf Length, Kurtosis = 2.3" is just as valid a statement (although highly unlikely) as "Leaf Length, mean = 12.3". However, "Flower color = Long" is simply wrong. Hence the strict enforcement of possible states.

The base class seems reasonable; I would, however, recommend renaming it from PossibleStates to AvailableStates.

[Andreas Müller - 27/02/2009] The PossibleValues class seems reasonable to me, but instead of having subclasses that all have the same structure we could use Java generics:

    import java.util.Set;

    // Generic container: T is any value type implementing IPossibleValue
    class PossibleValues<T extends IPossibleValue> {
        Set<T> supportedValues;
    }

and/or something similar for the vocabulary-based supported values, with IPossibleValue implemented by all relevant classes such as MeasurementUnit and StatisticalMeasure.
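
A slightly fuller, self-contained sketch of this generics idea is given below. Only the names `PossibleValues`, `IPossibleValue` and `supportedValues` come from the discussion. The `getLabel` method, the `StatisticalMeasureSketch` stand-in and the `add`/`supports` helpers are assumptions made for the example, not part of the CDM.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Marker interface for anything that can appear in a domain of values. */
interface IPossibleValue {
    String getLabel(); // hypothetical method, added only for the example
}

/** Simplified stand-in for the CDM StatisticalMeasure class. */
class StatisticalMeasureSketch implements IPossibleValue {
    private final String label;

    StatisticalMeasureSketch(String label) {
        this.label = label;
    }

    @Override
    public String getLabel() {
        return label;
    }
}

/** One generic class replaces the structurally identical subclasses. */
class PossibleValues<T extends IPossibleValue> {
    private final Set<T> supportedValues = new LinkedHashSet<>();

    void add(T value) {
        supportedValues.add(value);
    }

    boolean supports(T value) {
        return supportedValues.contains(value);
    }

    Set<T> getSupportedValues() {
        return Collections.unmodifiableSet(supportedValues);
    }
}

class PossibleValuesDemo {
    public static void main(String[] args) {
        StatisticalMeasureSketch mean = new StatisticalMeasureSketch("Mean");
        StatisticalMeasureSketch kurtosis = new StatisticalMeasureSketch("Kurtosis");

        PossibleValues<StatisticalMeasureSketch> recommended = new PossibleValues<>();
        recommended.add(mean);

        System.out.println(recommended.supports(mean));
        System.out.println(recommended.supports(kurtosis));
    }
}
```

The type parameter keeps a set of statistical measures from being mixed with, say, measurement units at compile time, without one subclass per value type.
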

[Hélène Fradin - 23/03/2009]

The updated proposed change (*NOT IMPLEMENTED*) is summarized in the diagram below.

Two new classes are suggested:

- PossibleValues: makes it possible to express a frame of reference of values (e.g. an extensive range of colors). The 4 types of values that can be listed all implement a new interface: IPossibleValue. PossibleValues inherits from TermBase, so that it can be described through the attribute representations.
- RecommendedValues: makes it possible to list the values that can be used in a certain context (e.g. for a group of taxa and a specific feature, only a limited range of colors can be used). It can reference a PossibleValues instance that corresponds to the larger frame of reference (e.g. a standard range of colors). It inherits from IdentifiableEntity: a title could be enough to describe this specific set.
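
The relation between the two classes might look roughly like the following sketch. It is not the proposed CDM design: `ValueFrame` and `RecommendedSubset` are hypothetical stand-ins for PossibleValues and RecommendedValues, values are plain strings, and the rule that a recommended value must belong to the referenced frame is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Frame of reference: the full range of values (stand-in for PossibleValues). */
class ValueFrame {
    final Set<String> values;

    ValueFrame(String... values) {
        this.values = new LinkedHashSet<>(Arrays.asList(values));
    }
}

/** Context-specific list that points back to its frame (stand-in for RecommendedValues). */
class RecommendedSubset {
    final String title;
    final ValueFrame frame;
    final Set<String> values = new LinkedHashSet<>();

    RecommendedSubset(String title, ValueFrame frame) {
        this.title = title;
        this.frame = frame;
    }

    /** Adds a value only if the frame of reference knows it (assumed rule). */
    boolean recommend(String value) {
        if (!frame.values.contains(value)) {
            return false;
        }
        values.add(value);
        return true;
    }
}

class RecommendedSubsetDemo {
    public static void main(String[] args) {
        ValueFrame colors = new ValueFrame("white", "yellow", "violet", "red");
        RecommendedSubset violaColors = new RecommendedSubset("Flower colors for Viola", colors);
        System.out.println(violaColors.recommend("violet"));
        System.out.println(violaColors.recommend("long"));
    }
}
```
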

![](PossibleValues.png)

----
## 2. HIGHLY CRITICAL - Mixed properties associated with mixed objects
**Impacted objects: all objects inheriting from VersionableEntity**

**a. Issue**

[Helene] Some very useful properties are available only for a restricted number of objects. I found that extremely hard when importing SDD data into the CDM, because I sometimes needed a property that I knew existed for other objects but was not available for the object under consideration.

[Gregor] I find your observation about the limitation that "essential general properties (title, description, media and original sources) are available only for a restricted number of objects" very interesting. I had some discussions with Markus, trying to get him to err on the side of sometimes allowing a property that is only necessary in very special use cases, rather than custom-tailoring properties to the currently perceived needs. I can understand that Markus wanted to have a clean model, but since we started out this way in SDD and in the end found that more and more things are shared, we at some point decided to move quite a bit (I am not claiming exactly the right bit) into the abstract base classes.

The "precision" aimed at is, in my view, also responsible for the deep class hierarchy, which hinders a ready understanding of the model. From the UML it is difficult to work out which properties a derived class has, because all inheritance layers contribute.

**d. Proposed change (NOT IMPLEMENTED)**

I think these properties should be made generic and therefore available at a higher level.

The specific attributes I am thinking of are: **representations** (Set<Representation>), **media** (Set<Media>), **sources** (Set<OriginalSource>).

To implement this, I can see 2 solutions: a drastic one and a less drastic one.

   - drastic (directly inspired by the use of the SDD Representation element): the problem is that it would impact the CDM at a high level, so I am probably overlooking important issues raised by this.

It consists in having these attributes at the level of the VersionableEntity object. However, as the Representation, Media and OriginalSource classes all inherit from VersionableEntity, they would have to be removed from this hierarchy of objects and defined independently.

The new VersionableEntity attribute would be:

   representations: Set<Representation>

and the Representation object, defined independently, would contain media and sources as attributes.

In parallel, redundant attributes in lower classes could be removed.

Therefore, any CDM object inheriting from VersionableEntity could be represented in the same way: a title and a description (possibly available in several languages), one or several images attached to the object, and one or several sources.

   - less drastic: to make these properties widely available, they could be moved further up in the hierarchy.

I would suggest:

     > adding to TermBase: sources + media
     > adding to Media: representations
     > adding to ReferencedEntityBase: media
     > adding to IdentifiableEntity: representations + media
     > adding to FeatureNode: representations + media + sources
     > removing media from DefinedTermBase
     > removing media from DescriptionElementBase
     > removing media from IdentifiableMediaEntity

[Andreas Müller - 27/02/2009] I clearly see your point that you needed some attributes that were not available due to the class hierarchy. However, I think that having too many attributes at the very high levels sometimes makes things more confusing for the user and opens possibilities which should not exist. This sometimes makes e.g. exporting data more difficult if you do not want to lose information. E.g. if you have representations for each versionable entity, you will have to check whether someone for some reason added a representation to a TaxonNameBase, which is a versionable entity. This doesn't make sense, because a TaxonNameBase is always meant to be a scientific name (otherwise you should use CommonName) and therefore only Latin should be available. Also, a TaxonNameBase does not really need media, but if this possibility exists people may start to save protologues as media directly with the name instead of using a TaxonNameDescription. So you will start having the same type of information in different places, and you have to check them all if you don't want to lose information. So I don't think that e.g. representations should be available to every class, because there are many classes that do not really need them.

Therefore I think we should keep the number of attributes of each class as limited as possible, while of course still being able to express what has to be expressed by adding the necessary attributes.

Maybe you could set up a table with all classes where, from your experience with SDD/CDM, you think representations, media or originalSources are really needed.

I also feel a bit uncomfortable with having media and textual representation within one class, because I think many representations are more abstract, so we will never need media for them. But I can see that this way of thinking may be influenced by the way we use representations now, which is only for defined terms. Many of the defined terms do not seem to need a media representation. If you use representation in a more general way this may change.

I know that my arguments may go against the open world assumption that is followed by the TDWG ontology, for example. But from my perspective the CDM should be a data model that is complex but still made to be used in concrete applications. Therefore it tries to be strict wherever possible. At the same time I am not sure if this is always the right way to go, so I am looking forward to the discussion about the above issues.

[Ben Clark - 03/03/2009] suggested that we could make a TermBase an IdentifiableEntity - IdentifiableEntities do have a collection of OriginalSources, and space for the IdInSource.

[Hélène Fradin - 24/03/2009] The diagram below represents the solution proposed by Ben, i.e. to make TermBase an IdentifiableEntity so that terms can have a collection of OriginalSources. I **TESTED** it in my environment and it works fine. However, I had to add the method getTitleCache() to a certain number of classes. I think it still needs to be discussed, because the representations attribute becomes redundant with titleCache, and then we are faced with exactly the problem of too many attributes mentioned by Andreas.
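
The redundancy mentioned here can be illustrated with a small sketch. It is not the tested CDM code: `IdentifiableEntitySketch` and `TermSketch` are hypothetical stand-ins, sources are plain strings, and deriving the title from the first representation is an assumption made for the example.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Stand-in for IdentifiableEntity: sources plus a cached title. */
abstract class IdentifiableEntitySketch {
    final Set<String> sources = new HashSet<>();

    abstract String getTitleCache();
}

/** Stand-in for TermBase once it extends IdentifiableEntity. */
class TermSketch extends IdentifiableEntitySketch {
    // Language code -> label; plays the role of the representations attribute.
    final Map<String, String> representations = new LinkedHashMap<>();

    /** The title repeats what the first representation already says. */
    @Override
    String getTitleCache() {
        return representations.isEmpty() ? "" : representations.values().iterator().next();
    }
}

class TermSketchDemo {
    public static void main(String[] args) {
        TermSketch term = new TermSketch();
        term.representations.put("en", "Leaf length");
        term.representations.put("fr", "Longueur de la feuille");
        term.sources.add("SDD import");
        System.out.println(term.getTitleCache());
    }
}
```

In this sketch every term gains sources through inheritance, but the title it must now provide carries no information beyond its representations.
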

![](TermBase.PNG)

----
## 3. MAJOR - Creation of a defined set of descriptions
**Impacted objects: new object**

**a. Issue**

Cf. mail exchanges between Gregor Hagedorn, Ben Clark and Helene Fradin in December 2009 "Keys and descriptions in the CDM".

There is no equivalent way of representing an SDD Dataset, or multi-access keys, in the CDM.

**d. Proposed change (NOT IMPLEMENTED)**

The solution proposed by Ben was a delimited set of taxa and their descriptions. It would certainly be helpful for the import/export between SDD and CDM.

[Gregor] Perhaps, to generalize this, a working set of taxa and a default character tree (to optionally create a subset of all taxa) could be provided? Such a working set could then carry a flag indicating that it has been suitably revised to serve as a multi-access key.

    public class WorkingSet {
        // One description per taxon in the working set
        private Map<Taxon, DescriptionBase> matrix;
        private DescriptiveSystem descriptiveSystem;
        // True once the set has been revised to serve as a multi-access key
        private boolean multiAccessKey;
        private Language defaultLanguage;
        ...
    }

----

## 4. MAJOR - Mapping use and referential objects

----
## 5. MAJOR - Problem with how the CDM handles the link between descriptions and scientific taxonomic names

**a. Issue**

The fact that structured descriptions (DescriptionBase objects) cannot always be linked with a scientific taxonomic name raises problems for grouping related descriptions. If the only way to group descriptions is through the association with an existing taxonomic hierarchy, the possibility of extracting sets of descriptions from the CDM is limited. In addition, when importing data into the CDM, information on non-taxonomic connections between descriptions is lost if it is not structured identically (e.g. use of the Scope class). A model such as SDD uses a Dataset object which contains a set of descriptions that can be tagged with a name, a description and media objects.

----
## 6. MINOR - Descriptive system
**Impacted objects: DescriptionBase**

**a. Issue**

There is no possibility of associating a set of features/characters/descriptors with a description or a set of descriptions.

**d. Proposed change (IMPLEMENTED as an attribute of DescriptionBase)**

Create a new object called DescriptiveSystem which contains at least a set of Feature objects, possibly associated with domains of values.

    public class DescriptiveSystem {
        private Set<Feature> features;
        // OR, with a domain of values per feature:
        // private Map<Feature, Set<PossibleValues>> features;
    }
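
The second variant, a map from each feature to its domains of values, could be sketched as below. This is an illustration only: `DescriptiveSystemSketch` and its methods are hypothetical, and features and values are reduced to plain strings rather than Feature and PossibleValues objects.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Stand-in for DescriptiveSystem: features mapped to their domains of values. */
class DescriptiveSystemSketch {
    private final Map<String, Set<String>> domains = new LinkedHashMap<>();

    /** Registers a feature, optionally with values from its domain. */
    void addFeature(String feature, String... values) {
        Set<String> domain = domains.computeIfAbsent(feature, key -> new LinkedHashSet<>());
        for (String value : values) {
            domain.add(value);
        }
    }

    /** Features to export: read directly, with no scan of the descriptions. */
    Set<String> getFeatures() {
        return domains.keySet();
    }

    /** Domain of values of a feature, empty if none was registered. */
    Set<String> getDomain(String feature) {
        return domains.getOrDefault(feature, new LinkedHashSet<>());
    }
}

class DescriptiveSystemDemo {
    public static void main(String[] args) {
        DescriptiveSystemSketch system = new DescriptiveSystemSketch();
        system.addFeature("Leaf length");
        system.addFeature("Leaf complexity", "simple", "compound");
        System.out.println(system.getFeatures());
        System.out.println(system.getDomain("Leaf complexity"));
    }
}
```

With such an object attached to a description, an export could list exactly the features of the descriptive system instead of all Feature instances, which is the point raised under problem 2 of item 1.
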

CURRENT INTERMEDIATE IMPLEMENTATION: http://dev.e-taxonomy.eu/trac/attachment/wiki/CdmVersionTwoDiscussion/DescriptionBase.gatcl.PNG

----

## 7. MINOR - How to express uncertainty or inapplicability?

----

## 8. MINOR - Handling of multiple languages

----

## 9. MINOR - Media properties and associations

IMPLEMENTED

----

## 10. MINOR - A default measurement unit for Feature

----
## 11. MINOR - Ordering of TermVocabulary for supportedCategoricalEnumerations in Feature
----

## 12. MINOR - Why is the setParent function not public in FeatureNode?

----

## 13. MINOR - How to distinguish between characters and groups, as they are both Feature objects?

Should the 'partOf' attribute be used?

----

## 14. MINOR - How to export and reimport multi-type characters/features/descriptors between CDM and SDD?