Morphology preprocessors can be applied to the words being indexed to replace different forms of the same word with a base, normalized form, or to improve segmentation. For instance, the English stemmer normalizes both “dogs” and “dog” to “dog”, so searches for either keyword return the same results.
Manticore implements four types of morphology preprocessors: lemmatizers, stemmers, phonetic algorithms, and word segmentation algorithms.
morphology = morphology1[, morphology2, ...]
A list of morphology preprocessors to apply. Optional, default is empty (do not apply any preprocessor).
The morphology processors that come with built-in Manticore implementations are listed further below, under the built-in values of the morphology option.
You can also link with the libstemmer library for even more stemmers (see details below). With libstemmer, Manticore supports morphological processing for more than 15 other languages. Binary packages should come prebuilt with libstemmer support, too.
Lemmatizers require a dictionary that must be downloaded separately from the Manticore website and installed in the directory specified by the lemmatizer_base directive. The lemmatizer_cache directive lets you speed up lemmatizing (and therefore indexing) by spending more RAM on what is essentially an uncompressed cache of the dictionary.
Chinese segmentation using ICU is also available. Compared to n-grams, it is a much more precise but slightly slower way to segment Chinese documents. charset_table must contain all Chinese characters (you can use the “cjk” alias). With “morphology=icu_chinese”, documents are first pre-processed by ICU, then the result is processed by the tokenizer (according to your charset_table), and then the other morphology processors specified in the “morphology” option are applied. Only the parts of the text that contain Chinese are passed to ICU for segmentation; the rest can be modified by other means (different morphologies, charset_table, etc.).
Built-in English and Russian stemmers should be faster than their libstemmer counterparts, but can produce slightly different results, because they are based on an older version.
Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.
Built-in values available for use in the morphology option are as follows:
* none - do not perform any morphology processing
* lemmatize_ru - apply Russian lemmatizer and pick a single root form
* lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in CentOS or Ubuntu/Debian). For the lemmatizer to work correctly, make sure the specific Ukrainian characters are preserved in your charset_table, since by default they are not. To do that, override them like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the Ukrainian lemmatizer.
* lemmatize_en - apply English lemmatizer and pick a single root form
* lemmatize_de - apply German lemmatizer and pick a single root form
* lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
* lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
* lemmatize_en_all - apply English lemmatizer and index all possible root forms
* lemmatize_de_all - apply German lemmatizer and index all possible root forms
* stem_en - apply Porter’s English stemmer
* stem_ru - apply Porter’s Russian stemmer
* stem_enru - apply Porter’s English and Russian stemmers
* stem_cz - apply Czech stemmer
* stem_ar - apply Arabic stemmer
* soundex - replace keywords with their SOUNDEX code
* metaphone - replace keywords with their METAPHONE code
* icu_chinese - apply Chinese text segmentation using ICU
* libstemmer_* - refer to the list of supported languages for details
Several stemmers can be specified at once, separated by commas. They are applied to incoming words in the order they are listed, and processing stops once one of the stemmers actually modifies the word. Also, when the wordforms feature is enabled, the word is looked up in the wordforms dictionary first, and if there is a matching entry, stemmers are not applied at all. In other words, wordforms can be used to implement stemming exceptions.
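The ordering rule described above can be sketched in a few lines of Python. This is a toy model and not Manticore's implementation: toy_stem_en and toy_stem_sv are hypothetical stand-ins for stem_en and libstemmer_sv that each strip a single suffix, and normalize is a helper name invented for this example.

```python
from typing import Callable, Dict, List

Stemmer = Callable[[str], str]


def toy_stem_en(word: str) -> str:
    # Hypothetical stand-in for stem_en: strips a trailing "s" only.
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word


def toy_stem_sv(word: str) -> str:
    # Hypothetical stand-in for libstemmer_sv: strips a trailing "arna" only.
    return word[:-4] if word.endswith("arna") else word


def normalize(word: str, stemmers: List[Stemmer], wordforms: Dict[str, str]) -> str:
    # A wordforms match wins and bypasses the stemmers entirely.
    if word in wordforms:
        return wordforms[word]
    # Otherwise try the stemmers in the listed order and stop at the
    # first one that actually changes the word.
    for stem in stemmers:
        stemmed = stem(word)
        if stemmed != word:
            return stemmed
    return word


chain = [toy_stem_en, toy_stem_sv]
forms = {"gps": "gps"}  # stemming exception: keep "gps" as is

print(normalize("dogs", chain, forms))      # changed by the first stemmer
print(normalize("hundarna", chain, forms))  # first stemmer leaves it, second applies
print(normalize("gps", chain, forms))       # wordforms entry, no stemming
```

Because processing stops at the first stemmer that changes a word, the order of the values in the morphology list matters when more than one stemmer could apply to the same word.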
SQL:
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

HTTP:
POST /cli -d "CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'"

PHP:
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'float']
],[
    'morphology' => 'stem_en, libstemmer_sv'
]);

Python:
utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

JavaScript:
res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

Java:
utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

Config:
index products {
    morphology = stem_en, libstemmer_sv
    type = rt
    path = idx
    rt_field = title
    rt_attr_uint = price
}

morphology_skip_fields = field1[, field2, ...]
A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).
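To make the effect concrete, here is a minimal sketch of per-field skipping. It is a toy model, not how Manticore indexes data: toy_stem is a hypothetical stemmer that only strips a trailing "s", and index_terms is a helper name invented for this example.

```python
from typing import Dict, List, Set


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def index_terms(doc: Dict[str, str], skip_fields: Set[str]) -> Dict[str, List[str]]:
    terms: Dict[str, List[str]] = {}
    for field, text in doc.items():
        words = text.lower().split()
        if field in skip_fields:
            # Fields listed in skip_fields keep their words unchanged.
            terms[field] = words
        else:
            terms[field] = [toy_stem(w) for w in words]
    return terms


doc = {"title": "Running shoes", "name": "Adidas"}
result = index_terms(doc, skip_fields={"name"})
print(result)
```

In this toy run the brand name in the name field stays intact, while it would have lost its final "s" had the field been stemmed. Proper names and codes are the kind of field content where that matters.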
SQL:
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

HTTP:
POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

PHP:
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'name'=>['type'=>'text'],
    'price'=>['type'=>'float']
],[
    'morphology_skip_fields' => 'name',
    'morphology' => 'stem_en'
]);

Python:
utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

JavaScript:
res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

Java:
utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

Config:
index products {
    morphology_skip_fields = name
    morphology = stem_en
    type = rt
    path = idx
    rt_field = title
    rt_field = name
    rt_attr_uint = price
}

min_stemming_len = length
Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).
Stemmers are not perfect and may sometimes produce undesired results. For instance, running the keyword “gps” through the Porter stemmer for English results in “gp”, which is not really the intent. The min_stemming_len feature lets you suppress stemming based on the source word length, i.e. avoid stemming words that are too short. Keywords shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So, to avoid stemming 3-character keywords, you should specify 4 as the value. For finer-grained control, refer to the wordforms feature.
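The threshold rule is easy to get wrong by one, so here is a small sketch of it. This is only an illustration under stated assumptions: toy_stem is a hypothetical stemmer that strips a trailing "s" (it happens to turn “gps” into “gp”, like the example above), and stem_with_min_len is a helper name invented here.

```python
def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def stem_with_min_len(word: str, min_stemming_len: int = 1) -> str:
    # Words strictly shorter than the threshold are left alone;
    # words exactly as long as the threshold are still stemmed.
    if len(word) < min_stemming_len:
        return word
    return toy_stem(word)


print(stem_with_min_len("gps", 3))   # length equals the threshold, so it is stemmed
print(stem_with_min_len("gps", 4))   # shorter than the threshold, left as is
print(stem_with_min_len("dogs", 4))  # length equals the threshold, so it is stemmed
```

With the default of 1, every non-empty word passes the length check, which matches the "stem everything" default.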
SQL:
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

HTTP:
POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

PHP:
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'float']
],[
    'min_stemming_len' => '4',
    'morphology' => 'stem_en'
]);

Python:
utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

JavaScript:
res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

Java:
utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

Config:
index products {
    min_stemming_len = 4
    morphology = stem_en
    type = rt
    path = idx
    rt_field = title
    rt_attr_uint = price
}

index_exact_words = {0|1}
Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index).
When enabled, index_exact_words forces indexing to put the raw keywords in the index along with the stemmed versions. That, in turn, enables the exact form operator in the query language to work. This increases the index size and the indexing time. However, search performance is not affected at all.
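The trade-off can be pictured with a toy inverted index. This is a sketch under stated assumptions, not Manticore's storage format: toy_stem is a hypothetical stemmer that strips a trailing "s", build_index is a helper invented for this example, and marking raw keywords with a leading "=" is purely an illustrative convention.

```python
from collections import defaultdict
from typing import Dict, Set


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


def build_index(docs: Dict[int, str], index_exact_words: bool) -> Dict[str, Set[int]]:
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # The stemmed form is always indexed.
            postings[toy_stem(word)].add(doc_id)
            if index_exact_words:
                # The raw keyword is stored too, under a marked key (illustrative).
                postings["=" + word].add(doc_id)
    return dict(postings)


docs = {1: "dogs bark", 2: "dog barks"}
with_exact = build_index(docs, index_exact_words=True)
without_exact = build_index(docs, index_exact_words=False)

print(sorted(with_exact["dog"]))    # stemmed lookup matches both documents
print(sorted(with_exact["=dogs"]))  # exact lookup matches only the literal form
print(len(with_exact), len(without_exact))  # more entries when exact words are kept
```

The extra entries are where the larger index size and longer indexing time come from, while a stemmed lookup touches the same entry in both variants.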
SQL:
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

HTTP:
POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

PHP:
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'float']
],[
    'index_exact_words' => '1',
    'morphology' => 'stem_en'
]);

Python:
utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

JavaScript:
res = await utilsApi.sql('mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

Java:
utilsApi.sql("mode=raw&query=CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

Config:
index products {
    index_exact_words = 1
    morphology = stem_en
    type = rt
    path = idx
    rt_field = title
    rt_attr_uint = price
}