[[icu-tokenizer]]
=== icu_tokenizer
The icu_tokenizer
uses the same Unicode Text Segmentation algorithm as thestandard
tokenizer,(((“words”, “identifying”, “using icu_tokenizer”)))(((“Unicode Text Segmentation algorithm”)))(((“icu_tokenizer”))) but adds better support for some Asian languages by
using a dictionary-based approach to identify words in Thai, Lao, Chinese,
Japanese, and Korean, and using custom rules to break Myanmar and Khmer text
into syllables.
For instance, compare the tokens (((“standard tokenizer”, “icu_tokenizer versus”)))produced by the standard
andicu_tokenizers
, respectively, when tokenizing ``Hello. I am from Bangkok.’’ in
Thai:
[source,js]
GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ
The standard
tokenizer produces two tokens, one for each sentence: สวัสดี
,ผมมาจากกรุงเทพฯ
. That is useful only if you want to search for the whole
sentence I am from Bangkok.'', but not if you want to search for just
Bangkok.’’
[source,js]
GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ
The icu_tokenizer
, on the other hand, is able to break up the text into the
individual words (สวัสดี
, ผม
, มา
, จาก
, กรุงเทพฯ)
, making them
easier to search.
In contrast, the standard
tokenizer ``over-tokenizes’’ Chinese and Japanese
text, often breaking up whole words into single characters. Because there
are no spaces between words, it can be difficult to tell whether consecutive
characters are separate words or form a single word. For instance:
向 means facing, 日 means sun, and 葵 means hollyhock. When
written together, 向日葵 means sunflower.五 means five or fifth, 月 means month, and 雨 means rain.
The first two characters written together as 五月 mean the month
of May, and adding the third character, 五月雨 means
continuous rain. When combined with a fourth character, 式,
meaning style, the word 五月雨式 becomes an adjective for anything
consecutive or unrelenting.
Although each character may be a word in its own right, tokens are more
meaningful when they retain the bigger original concept instead of just the
component parts:
[source,js]
GET /_analyze?tokenizer=standard
向日葵
GET /_analyze?tokenizer=icu_tokenizer
向日葵
The standard
tokenizer in the preceding example would emit each character
as a separate token: 向
, 日
, 葵
. The icu_tokenizer
would
emit the single token 向日葵
(sunflower).
Another difference between the standard
tokenizer and the icu_tokenizer
is
that the latter will break a word containing characters written in different
scripts (for example, βeta
) into separate tokens—β
, eta
—while the
former will emit the word as a single token: βeta
.