Tokenizers
Single Token
The Single Token Tokenizer returns the entire input as a single token.
Letter
The Letter Tokenizer identifies tokens as sequences of Unicode runes that belong to the Letter category.
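To illustrate the rule, here is a minimal sketch that uses only the Go standard library. It is not bleve's implementation: the `token` type and the `letterTokens` function are hypothetical names chosen for this example, and `unicode.IsLetter` is assumed to be an adequate stand-in for the Letter category.

```go
package main

import (
	"fmt"
	"unicode"
)

// token is a hypothetical stand-in for a tokenizer's output type.
type token struct {
	Term  string
	Start int // byte offset of the first byte of the term
	End   int // byte offset one past the last byte of the term
}

// letterTokens emits one token per maximal run of Unicode letters.
// Digits, punctuation, and spaces all act as separators.
func letterTokens(input string) []token {
	var out []token
	start := -1 // -1 means "not currently inside a token"
	for i, r := range input {
		if unicode.IsLetter(r) {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 {
			out = append(out, token{input[start:i], start, i})
			start = -1
		}
	}
	// Flush a token that runs to the end of the input.
	if start >= 0 {
		out = append(out, token{input[start:], start, len(input)})
	}
	return out
}

func main() {
	for _, t := range letterTokens("user42 über-cool") {
		fmt.Printf("%q [%d,%d)\n", t.Term, t.Start, t.End)
	}
}
```

Note that "user42" yields only "user", because digits are not letters, and that offsets are counted in bytes rather than runes.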
Regular Expression
The Regular Expression Tokenizer tokenizes input using a configurable regular expression. The regular expression should match the token text itself, not the separators between tokens. See the Whitespace Tokenizer for an example of its usage.
Whitespace
The Whitespace Tokenizer identifies tokens as sequences of Unicode runes that are NOT part of the Space category.
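A rough standard-library sketch of this behavior follows. It assumes that `unicode.IsSpace` approximates the Space category closely enough for illustration, and `whitespaceTokens` is a hypothetical name rather than part of bleve.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// whitespaceTokens splits input at runs of whitespace and keeps everything
// else, including punctuation, as part of the surrounding token.
func whitespaceTokens(input string) []string {
	return strings.FieldsFunc(input, unicode.IsSpace)
}

func main() {
	fmt.Printf("%q\n", whitespaceTokens("full-text  search,\tfast!"))
}
```

Unlike a letter-based split, punctuation stays attached here: "search," keeps its trailing comma, which is usually something a later token filter has to deal with.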
Unicode
The Unicode Tokenizer uses the segment library to perform Unicode Text Segmentation on word boundaries. It is recommended over the ICU Tokenizer for all languages except those that require the dictionary-based tokenization ICU supports.
ICU
The ICU tokenizer uses the ICU library to tokenize the input using Unicode Text Segmentation on word boundaries.
NOTE: This tokenizer requires bleve be built with the optional ICU package. See the page on Building.
Exception
The Exception Tokenizer lets you define exceptions: sections of the input stream that match regular expressions. These sections are left intact as single tokens. Any input not matching these regular expressions is passed to the child tokenizer.