Character group tokenizer

The char_group tokenizer breaks text into terms whenever it encounters a character that is in a defined set. It is mostly useful for cases where simple custom tokenization is desired and the overhead of using the pattern tokenizer is not acceptable.

Configuration

The char_group tokenizer accepts the following parameters:

tokenize_on_chars

A list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. Accepts single characters, such as -, as well as the following character groups: whitespace, letter, digit, punctuation, symbol.

max_token_length

The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
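Both parameters can also be set when registering the tokenizer in the analysis settings of an index. The sketch below uses the same Python client as the examples that follow; the index, analyzer, and tokenizer names (my-index, my_analyzer, my_tokenizer) are placeholders, and splitting on whitespace and punctuation with a 20-character cap is just an illustrative choice:

resp = client.indices.create(
    index="my-index",
    settings={
        "analysis": {
            "analyzer": {
                # Hypothetical custom analyzer built on the tokenizer below.
                "my_analyzer": {"tokenizer": "my_tokenizer"}
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "char_group",
                    # Start a new token at any whitespace or punctuation character.
                    "tokenize_on_chars": ["whitespace", "punctuation"],
                    # Split any token longer than 20 characters at 20-character intervals.
                    "max_token_length": 20
                }
            }
        }
    },
)
print(resp)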

Example output

Python:

resp = client.indices.analyze(
    tokenizer={
        "type": "char_group",
        "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n"
        ]
    },
    text="The QUICK brown-fox",
)
print(resp)
Ruby:

response = client.indices.analyze(
  body: {
    tokenizer: {
      type: 'char_group',
      tokenize_on_chars: [
        'whitespace',
        '-',
        "\n"
      ]
    },
    text: 'The QUICK brown-fox'
  }
)
puts response
JavaScript:

const response = await client.indices.analyze({
  tokenizer: {
    type: "char_group",
    tokenize_on_chars: ["whitespace", "-", "\n"],
  },
  text: "The QUICK brown-fox",
});
console.log(response);
Console:

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}

returns

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
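The text is split into the terms [ The, QUICK, brown, fox ]: a token boundary is placed at every whitespace character and at the - character. Note that the tokenizer does not change case, which is why QUICK remains uppercase; lowercasing is the job of a token filter.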