Shingle token filter

The shingle token filter generates word n-grams, or shingles, from input text. For example, with the default settings, the string slow green turtle produces the following one- and two-word shingles: slow, slow green, green, green turtle, and turtle.

This token filter is often used in conjunction with other filters to enhance search accuracy by indexing phrases rather than individual tokens. For more information, see Phrase suggester.

Parameters

The shingle token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| min_shingle_size | Optional | Integer | The minimum number of tokens to concatenate. Default is 2. |
| max_shingle_size | Optional | Integer | The maximum number of tokens to concatenate. Default is 2. |
| output_unigrams | Optional | Boolean | Whether to include unigrams (individual tokens) in the output. Default is true. |
| output_unigrams_if_no_shingles | Optional | Boolean | Whether to output unigrams if no shingles are generated. Default is false. |
| token_separator | Optional | String | The separator used to concatenate tokens into a shingle. Default is a space (" "). |
| filler_token | Optional | String | A token inserted into empty positions or gaps between tokens, such as positions left by a stop filter. Default is an underscore (_). |

If output_unigrams and output_unigrams_if_no_shingles are both set to true, output_unigrams_if_no_shingles is ignored.
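To see how non-default settings change the output without creating an index, you can pass an inline shingle filter to the _analyze API. The following request is a minimal sketch; the parameter values (a three-token maximum, an underscore separator, and unigrams disabled) are chosen only for illustration:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 3,
      "token_separator": "_",
      "output_unigrams": false
    }
  ],
  "text": "slow green turtle"
}

With these settings, the response should contain only multi-word shingles, such as slow_green, slow_green_turtle, and green_turtle.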

Example

The following example request creates a new index named my-shingle-index and configures an analyzer with a shingle filter:

PUT /my-shingle-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  }
}

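The index above defines the analyzer but no field mappings. To apply the analyzer at index time, you would typically map a text field to it. The following sketch uses a hypothetical field name, description, which does not appear in the original example:

PUT /my-shingle-index/_mapping
{
  "properties": {
    "description": {
      "type": "text",
      "analyzer": "my_shingle_analyzer"
    }
  }
}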

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

GET /my-shingle-index/_analyze
{
  "analyzer": "my_shingle_analyzer",
  "text": "slow green turtle"
}


The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow green",
      "start_offset": 0,
      "end_offset": 10,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "green",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "green turtle",
      "start_offset": 5,
      "end_offset": 17,
      "type": "shingle",
      "position": 1,
      "positionLength": 2
    },
    {
      "token": "turtle",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
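Each two-word shingle starts at the position of its first unigram and has a positionLength of 2, indicating that it spans two token positions. Because phrases are indexed as single tokens, a plain match query against a shingled field can rank exact phrase matches highly without a separate phrase query. The following sketch assumes the hypothetical description field from the mapping example above:

GET /my-shingle-index/_search
{
  "query": {
    "match": {
      "description": "green turtle"
    }
  }
}

Because the same analyzer is applied to the query text by default, the query is also shingled, so a document containing the green turtle shingle matches that phrase as a single term.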