[[stemming]]
    == Reducing Words to Their Root Form

    Most languages of the world are inflected, meaning (((“languages”, “inflection in”)))(((“words”, “stemming”, see=”stemming words”)))(((“stemming words”)))that words can change
    their form to express differences in the following:

    • Number: fox, foxes
    • Tense: pay, paid, paying
    • Gender: waiter, waitress
    • Person: hear, hears
    • Case: I, me, my
    • Aspect: ate, eaten
    • Mood: so be it, were it so

    While inflection aids expressivity, it interferes(((“inflection”))) with retrievability, as a
    single root word sense (or meaning) may be represented by many different
    sequences of letters.(((“English”, “inflection in”))) English is a weakly inflected language (you could
    ignore inflections and still get reasonable search results), but some other
    languages are highly inflected and need extra work in order to achieve
    high-quality search results.

    Stemming attempts to remove the differences between inflected forms of a
    word, in order to reduce each word to its root form. For instance, foxes may
    be reduced to the root fox, removing the difference between singular and
    plural in the same way that we removed the difference between lowercase and
    uppercase.

    The root form of a word may not even be a real word. The words jumping and
    jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as
    the same terms are produced at index time and at search time, search will just
    work.
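
The point about index time and search time can be sketched in a few lines of Python. This is a toy illustration, not how Elasticsearch stems text: `toy_stem`, `SUFFIX_RULES`, `build_index`, and `search` are hypothetical names, and the suffix rules are deliberately crude, chosen only to reproduce the fox and jumpi examples above.

```python
from collections import defaultdict

# Hypothetical suffix rules, tried in order; the first match wins.
SUFFIX_RULES = [("iness", "i"), ("ing", "i"), ("es", ""), ("s", "")]


def toy_stem(word):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def build_index(docs):
    """Map each stemmed term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query with the same function that was used at index time."""
    return index.get(toy_stem(query), set())


docs = {1: "quick brown foxes", 2: "the fox is jumping", 3: "sheer jumpiness"}
index = build_index(docs)
```

With this index, `search(index, "Fox")` matches documents 1 and 2, and `search(index, "jumpiness")` matches documents 2 and 3, even though jumpi is not a real word. What matters is that both sides pass through the same function.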

    If stemming were easy, there would be only one implementation. Unfortunately,
    stemming is an inexact science that (((“stemming words”, “understemming and overstemming”)))suffers from two issues: understemming
    and overstemming.

    Understemming is the failure to reduce words with the same meaning to the same
    root. For example, jumped and jumps may be reduced to jump, while
    jumping may be reduced to jumpi. Understemming reduces retrieval (recall):
    relevant documents are not returned.

    Overstemming is the failure to keep two words with distinct meanings separate.
    For instance, general and generate may both be stemmed to gener.
    Overstemming reduces precision: irrelevant documents are returned when they
    shouldn’t be.
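
The effect of both problems can be put into numbers with a small Python sketch. Everything here is an assumption made for illustration: the `STEMS` table hard-codes the stems quoted above in place of a real stemmer, each document holds a single word, and the relevance judgments are invented. Recall is the fraction of relevant documents that are returned; precision is the fraction of returned documents that are relevant.

```python
# Stems quoted in the text, hard-coded to stand in for a real stemmer.
STEMS = {
    "jumped": "jump", "jumps": "jump", "jumping": "jumpi",
    "general": "gener", "generate": "gener",
}

# One single-word document per term keeps the arithmetic easy to follow.
DOCS = {1: "jumped", 2: "jumps", 3: "jumping", 4: "general", 5: "generate"}


def retrieve(query):
    """Return the IDs of documents whose stem equals the query's stem."""
    target = STEMS[query]
    return {doc_id for doc_id, word in DOCS.items() if STEMS[word] == target}


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually returned."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the returned documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved)


# Understemming: document 3 is relevant to "jumps" but is never returned.
jump_hits = retrieve("jumps")
jump_recall = recall(jump_hits, relevant={1, 2, 3})

# Overstemming: document 5 is returned for "general" but is not relevant.
general_hits = retrieve("general")
general_precision = precision(general_hits, relevant={4})
```

Under these assumptions the query for jumps finds two of the three relevant documents, so recall is 2/3, while the query for general returns two documents of which only one is relevant, so precision is 1/2.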

    .Lemmatization


    A lemma is the canonical, or dictionary, form (((“lemma”)))of a set of related words: the
    lemma of paying, paid, and pays is pay. Usually the lemma resembles
    the words it is related to, but sometimes it doesn’t: the lemma of is,
    was, am, and being is be.
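
Because a lemma need not share letters with its inflected forms, suffix stripping cannot find it; some kind of dictionary lookup is required. The Python sketch below shows the idea with a hand-written table that covers only the forms quoted above. `LEMMAS`, `lemmatize`, and `group_by_lemma` are hypothetical names, and a real lemmatizer would rely on a full dictionary and on context, neither of which this table provides.

```python
# Hand-written lemma table covering only the forms quoted in the text.
LEMMAS = {
    "paying": "pay", "paid": "pay", "pays": "pay",
    "is": "be", "was": "be", "am": "be", "being": "be",
}


def lemmatize(word):
    """Return the dictionary form if known; otherwise return the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


def group_by_lemma(words):
    """Group surface forms under their shared lemma."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatize(word), []).append(word)
    return groups


groups = group_by_lemma(["was", "paid", "am", "pays", "fox"])
```

Here was and am end up under be, paid and pays under pay, and fox, which is not in the table, is left as its own group.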

    Lemmatization, like stemming, tries to group related words,(((“lemmatisation”))) but it goes one
    step further than stemming in that it tries to group words by their word
    sense, or meaning. The same word may represent two meanings: for example,
    wake can mean to wake up or a funeral. While lemmatization would
    try to distinguish these two word senses, stemming would incorrectly conflate
    them.

    Lemmatization is a much more complicated and expensive process that needs to
    understand the context in which words appear in order to make decisions
    about what they mean. In practice, stemming appears to be just as effective
    as lemmatization, but with a much lower cost.


    First we will discuss the two classes of stemmers available in Elasticsearch—<> and <>—and then look at how to
    choose the right stemmer for your needs in <>. Finally,
    we will discuss options for tailoring stemming in <> and
    <>.