[[stemming-in-situ]]
=== Stemming in situ
For the sake of completeness, we will ((("stemming words", "stemming in situ")))finish this chapter by explaining how to
index stemmed words into the same field as unstemmed words. As an example,
analyzing the sentence _The quick foxes jumped_ would produce the following
terms:

[source,text]
------------------------------------
Pos 1: (the)
Pos 2: (quick)
Pos 3: (foxes,fox) <1>
Pos 4: (jumped,jump) <1>
------------------------------------
<1> The stemmed and unstemmed forms occupy the same position.

WARNING: Before using this approach, read the discussion below of whether stemming in situ is a good idea.

To achieve stemming in situ, we will use the
http://bit.ly/1ynIBCe[`keyword_repeat`]
token filter,((("keyword_repeat token filter"))) which, like the `keyword_marker`
token filter, marks each term as a keyword to prevent the subsequent
stemmer from touching it. However, it also repeats the term in the same
position, and this repeated term is stemmed.

Using the `keyword_repeat` token filter alone would result in the following:

[source,text]
------------------------------------
Pos 1: (the,the) <1>
Pos 2: (quick,quick) <1>
Pos 3: (foxes,fox)
Pos 4: (jumped,jump)
------------------------------------
<1> The stemmed and unstemmed forms are the same, and so are repeated
    needlessly.

To prevent the useless repetition of terms that are the same in their stemmed
and unstemmed forms, we add the
http://bit.ly/1B6xHUY[`unique`] token filter((("unique token filter"))) into the mix:

[source,json]
--------------------------------------------------
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true <1>
        }
      },
      "analyzer": {
        "in_situ": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat", <2>
            "porter_stem",
            "unique_stem" <3>
          ]
        }
      }
    }
  }
}
--------------------------------------------------
<1> The `unique` token filter is set to remove duplicate tokens
    only when they occur in the same position.
<2> The `keyword_repeat` token filter must appear before the
    stemmer.
<3> The `unique_stem` filter removes duplicate terms after the
    stemmer has done its work.
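
To check the result, the analyzer can be run through the `analyze` API. The
following is a sketch that assumes the `my_index` index defined above and the
request-body form of the API accepted by recent Elasticsearch versions; older
versions take the analyzer name as a query-string parameter instead:

[source,json]
--------------------------------------------------
GET /my_index/_analyze
{
  "analyzer": "in_situ", <1>
  "text": "The quick foxes jumped"
}
--------------------------------------------------
<1> The `in_situ` analyzer must already exist in the index settings.

In the response, `foxes` and `fox` should share one `position` value, as
should `jumped` and `jump`, while `the` and `quick` should each appear only
once.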

[[stemming-in-situ-good-idea]]
==== Is Stemming in situ a Good Idea?
People like the ((("stemming words", "stemming in situ", "good idea, or not")))idea of stemming in situ: ``Why use an unstemmed field
and a stemmed field if I can just use one combined field?'' But is it a
good idea? The answer is almost always no. There are two problems.

The first is the inability to separate exact matches from inexact matches. In
this chapter, we have seen that words with different meanings are often
conflated to the same stem word: `organs` and `organization` both stem to
`organ`.

A common pattern is to combine a query on a
stemmed field (to increase recall) with a query on an unstemmed field (to
improve relevance).((("language analyzers", "combining query on stemmed and unstemmed field"))) When the stemmed and unstemmed fields are separate, the
contribution of each field can be tuned by boosting one field over another.
But if both the stemmed and unstemmed forms
appear in the same field, there is no way to tune your search results.
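
For comparison, the two-field approach might be queried as in the following
sketch. The field names `title` (stemmed) and `title.std` (unstemmed) and the
boost value are hypothetical; they are not defined in this section and would
need a matching mapping:

[source,json]
--------------------------------------------------
GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "organs",
      "type": "most_fields",
      "fields": [ "title", "title.std^2" ] <1>
    }
  }
}
--------------------------------------------------
<1> Hypothetical fields: the `^2` boost gives matches on the unstemmed
    `title.std` field more weight than matches on the stemmed `title` field.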

The second issue has to do with how the ((("relevance scores", "stemming in situ and")))relevance score is calculated.
Part of the calculation depends on the
_inverse document frequency_: how often a word appears in all the documents
in our index.((("inverse document frequency", "stemming in situ and"))) Using in situ stemming for a document that contains the text
`jump jumped jumps` would result in these terms:

[source,text]
------------------------------------
Pos 1: (jump)
Pos 2: (jumped,jump)
Pos 3: (jumps,jump)
------------------------------------

While `jumped` and `jumps` appear once each and so would have the correct IDF,
`jump` appears three times, greatly reducing its value as a search term in
comparison with the unstemmed forms.

For these reasons, we recommend against using stemming in situ.