CdmLibraryFreetextSearch » History » Version 7

Andreas Kohlbecker, 01/24/2012 02:09 PM

# EDIT - Freetext Search
[Hibernate Search](http://www.hibernate.org/subprojects/search.html) brings the power of full text search engines to the persistence domain model by combining Hibernate Core with the capabilities of the [Apache Lucene](http://lucene.apache.org/) search engine. It therefore looks like a good idea to use Hibernate Search in the CDM Library to perform free text searches. Due to the architecture of some parts of the EDIT platform, however, there are some caveats and problems which have to be considered carefully before deciding to make full use of Hibernate Search / Lucene in the CDM Library.

## Benefits
With plain HQL it is possible to search, for example, for text snippets contained in TextData.multilanguageText: LIKE '%ext snip%'. The situation gets more complex when looking at specific use cases such as the following tickets: [[#842|implement advanced search]], [[#424|ALL. Image search]], [[#2432|Implement search for multiple areas]]. The image search, for example, would require a LIKE search over all of the following fields simultaneously, and such a query would perform poorly:

* Media.title
* Media.description
* Media.representations.parts.uri
* Description.title
* Description.taxon
* Description.elements.multilanguageText
* Description.elements.name

Hibernate Search / Lucene allows building documents which combine multiple fields that are distributed over the object graph. These documents are indexed and can thus be searched very quickly without the need to join multiple tables. Further benefits over plain HQL are:

* normalization
  * lowercase/uppercase - 'lactuca' finds 'Lactuca'
  * unicode (diacritics) - 'Angstrom' finds 'Ångström'
  * removing special characters from words - 'donalds' finds 'donald's'
* real term-based free text search instead of a phrase-based search with wildcards such as '*term_1 te*rm_2 ter*' in HQL
* can speed up existing find*() methods in the CDM Library
* Lucene can handle [spatial searches](http://wiki.apache.org/lucene-java/SpatialSearch)
* retrieve information from the Lucene index (titleCache, UUID, etc.) without the need to initialize any CDM entity
* retrieve Lucene documents together with the associated CDM entities; this is very useful if, for example, we search for taxa by scientific and by common names and want to return not only the matching taxa associated with the common name but also the common name itself
* ...
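The effect of normalization and term-based lookup described in the list above can be illustrated with a minimal, self-contained sketch. It uses only the JDK and is not how Hibernate Search or Lucene implement analysis; the class `FreetextSketch` and all of its method names are hypothetical and chosen for illustration only. Field values that would normally be spread over several objects are folded into one document, tokens are normalized (lowercased, diacritics and apostrophes removed), and a search is a lookup in a term index rather than a LIKE scan.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative in-memory term index; NOT the Hibernate Search / Lucene API. */
public class FreetextSketch {

    /** Maps each normalized term to the ids of the documents containing it. */
    private final Map<String, Set<String>> index = new HashMap<>();

    /** Lowercases the text and strips diacritics and apostrophes. */
    public static String normalize(String text) {
        // NFD splits accented characters into base letter + combining mark
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}", "")     // drop the combining marks
                .replaceAll("['\u2019]", "")  // drop straight and curly apostrophes
                .toLowerCase(Locale.ROOT);
    }

    /** Indexes all given field values as one document under the given id. */
    public void addDocument(String id, String... fieldValues) {
        for (String value : fieldValues) {
            // split on anything that is neither a letter nor a digit
            for (String token : normalize(value).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    /** Returns the ids of all documents containing the (normalized) term. */
    public Set<String> search(String term) {
        return index.getOrDefault(normalize(term), Collections.emptySet());
    }

    public static void main(String[] args) {
        FreetextSketch sketch = new FreetextSketch();
        // hypothetical values standing in for e.g. Media.title and Media.description
        sketch.addDocument("media-1", "Lactuca sativa",
                "Photo by \u00C5ngstr\u00F6m", "Donald's herbarium sheet");
        System.out.println(sketch.search("lactuca"));
        System.out.println(sketch.search("Angstrom"));
        System.out.println(sketch.search("donalds"));
    }
}
```

Because every token is normalized both at indexing time and at query time, 'lactuca', 'Angstrom' and 'donalds' each find the document in this sketch, while a partial term such as 'sativ' does not, which is the difference between term-based search and a wildcard LIKE.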
## Existing implementations
* Hibernate Search base configuration is done, with an individual index location for each instance in the CDM Server
* Indexing works
* findDescriptionElements() is based on Hibernate Search
* findTaxaByDescriptionElements() is based on Hibernate Search, with a more than 10-fold performance increase over plain HQL
* retrieving Lucene documents together with the associated CDM entities (see above)

## Open questions
1. Is the index for type A always updated when an associated object D, reachable as A.B.C.D, has been changed?

## Projects and tasks which require Hibernate Search
* ViBRANT: Task 2 - CDM Datastore as a ViBRANT Index (... allows humans to perform full text searches ...) **Due 30 July 2012**
* some tickets can only be solved satisfactorily if we use Hibernate Search:
  * [[#842|#842 implement advanced search]]
  * [[#424|#424 ALL. Image search]]
  * [[#2432|#2432 Implement search for multiple areas]]
  * ...

----

## Problems
### Lucene index in the local filesystem

Lucene stores its index in the file system, whereas the data is stored in the database. Usually multiple Taxonomic Editors are connected directly to a central database. When a CDM entity is updated by an editor, the new data is stored in the central database, but the Lucene index that gets updated is the one in the local file system where the editor is installed. The central index is therefore not updated.

Hibernate Search registers an update listener which triggers the indexing of inserted and updated entities. This listener runs inside the application that performs the write, which is why only that application's local index is updated.

## Solutions
All solutions are only applicable if one database server + cdmlib-remote instance is defined as the central repository.

Solutions which target the Lucene indexing:

1. Implement a custom HibernateUpdateListener which sends a notification to a web service, which in turn triggers the indexing process in the same way as the local update listener does.
1. Could Lucene clustering be a solution?
1. Provide a central index to which all Lucene instances write, e.g. a WebDAV share? Conflicts?
1. Any further ideas?
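The first idea in the list above, a listener that notifies a central web service instead of indexing locally, could look roughly like the following sketch. Everything in it is an assumption made for illustration: the names `IndexNotifier`, `CentralIndexer` and `RemoteUpdateListener` are hypothetical, the transport is replaced by a direct method call so that the example stays self-contained, and the real hook into Hibernate's event listener mechanism is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Illustrative sketch of "notify the central indexer"; all names are hypothetical. */
public class IndexNotificationSketch {

    /** Transport abstraction; a real implementation might call a web service here. */
    public interface IndexNotifier {
        void notifyChanged(String entityType, UUID uuid);
    }

    /** Stand-in for the central cdmlib-remote instance that owns the Lucene index. */
    public static class CentralIndexer implements IndexNotifier {
        /** Records what would be re-indexed; a real server would reload and index the entity. */
        public final List<String> reindexed = new ArrayList<>();

        @Override
        public void notifyChanged(String entityType, UUID uuid) {
            reindexed.add(entityType + ":" + uuid);
        }
    }

    /** Editor-side listener, assumed to be called after an entity was written to the database. */
    public static class RemoteUpdateListener {
        private final IndexNotifier notifier;

        public RemoteUpdateListener(IndexNotifier notifier) {
            this.notifier = notifier;
        }

        /** Forwards only type and UUID; the central side re-reads the entity itself. */
        public void onPostUpdate(String entityType, UUID uuid) {
            notifier.notifyChanged(entityType, uuid);
        }
    }

    public static void main(String[] args) {
        CentralIndexer central = new CentralIndexer();
        RemoteUpdateListener listener = new RemoteUpdateListener(central);
        listener.onPostUpdate("Taxon", UUID.randomUUID());
        System.out.println(central.reindexed);
    }
}
```

Sending only the entity type and UUID keeps the notification small and lets the central instance read the current state from the database, but it also means that a lost notification leaves the central index stale until the next full re-indexing.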

Solutions on the editor side:

1. Fully implement the HTTP Invoker remoting solution for the editor. There are some major problems to be solved for this solution: **@Niels: please could you name the existing problems here?**
1. Fully implement the RAP solution for the editor. There are some major problems to be solved for this solution: **@Niels: please could you name the existing problems here?**