Perform text analysis

The perform text analysis API analyzes a text string and returns the resulting tokens.

If you use the Security plugin, you must have the manage index privilege. If you only want to analyze text, you must have the manage cluster privilege.

Path and HTTP methods

  GET /_analyze
  GET /{index}/_analyze
  POST /_analyze
  POST /{index}/_analyze

Although you can issue an analyze request using either GET or POST, the two have important distinctions. A GET request causes data to be cached in the index so that the next time the data is requested, it is retrieved faster. A POST request sends a string that does not already exist to the analyzer to be compared to data that is already in the index. POST requests are not cached.

Path parameter

You can include the following optional path parameter in your request.

Parameter | Data type | Description
:--- | :--- | :---
index | String | Index that is used to derive the analyzer.

Query parameters

You can include the following optional query parameters in your request.

Field | Data type | Description
:--- | :--- | :---
analyzer | String | The name of the analyzer to apply to the text field. The analyzer can be built in or configured in the index. If analyzer is not specified, the analyze API uses the analyzer defined in the mapping of the field named in field. If field is not specified, the analyze API uses the default analyzer for the index. If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.
attributes | Array of Strings | Array of token attributes for filtering the output of the explain field.
char_filter | Array of Strings | Array of character filters for preprocessing characters before the tokenizer runs.
explain | Boolean | If true, causes the response to include token attributes and additional details. Defaults to false.
field | String | Field for deriving the analyzer. If you specify field, you must also specify the index path parameter. If you specify analyzer, it overrides the value of field. If you do not specify field, the analyze API uses the default analyzer for the index. If you do not specify the index path parameter, or the index does not have a default analyzer, the analyze API uses the standard analyzer.
filter | Array of Strings | Array of token filters to apply after the tokenizer runs.
normalizer | String | Normalizer for converting text into a single token.
tokenizer | String | Tokenizer for converting the text field into tokens.
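
The precedence rules in the analyzer and field descriptions can be summarized in a short Python sketch. The function name and its arguments are hypothetical and only mirror the rules as described above; they are not part of any OpenSearch client. The sketch also assumes that a field without a mapped analyzer falls through to the next rule.

```python
from typing import Optional


def resolve_analyzer(
    analyzer: Optional[str] = None,
    field: Optional[str] = None,
    index: Optional[str] = None,
    field_analyzers: Optional[dict] = None,
    index_default: Optional[str] = None,
) -> str:
    """Return the analyzer name that the documented rules would select.

    field_analyzers (hypothetical) maps field names to the analyzer in
    their mapping; index_default is the index's default analyzer, if any.
    """
    # 1. An explicit analyzer wins, even if field is also given.
    if analyzer:
        return analyzer
    # 2. field is only valid together with the index path parameter.
    if field is not None:
        if index is None:
            raise ValueError("field requires the index path parameter")
        mapped = (field_analyzers or {}).get(field)
        if mapped:
            return mapped
        # Assumption: an unmapped field falls through to the next rule.
    # 3. Use the index default analyzer, if the index has one.
    if index is not None and index_default:
        return index_default
    # 4. No index, or no default analyzer on the index.
    return "standard"
```

For example, `resolve_analyzer(analyzer="keyword", field="name", index="books2")` returns `keyword` because the explicit analyzer overrides the field.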

The following query parameter is required.

Field | Data type | Description
:--- | :--- | :---
text | String or Array of Strings | Text to analyze. If you provide an array of strings, the text is analyzed as a multi-value field.
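
As a minimal sketch of how the path parameter and the fields above fit together, the following Python helper assembles the request path and JSON body. The helper name is hypothetical, and the validation only reflects the two constraints stated in this section (text is required, and field requires index).

```python
import json
from typing import Optional, Union


def build_analyze_request(
    text: Union[str, list],
    index: Optional[str] = None,
    **fields,
) -> tuple:
    """Return (path, JSON body) for an analyze request."""
    if not text:
        raise ValueError("text is required")
    if "field" in fields and index is None:
        raise ValueError("field requires the index path parameter")
    # The index path parameter is optional.
    path = f"/{index}/_analyze" if index else "/_analyze"
    body = {"text": text, **fields}
    return path, json.dumps(body, ensure_ascii=False)


path, body = build_analyze_request("OpenSearch text analysis", analyzer="standard")
print(path)
print(body)
```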

Example requests

Analyze array of text strings

When you pass an array of strings to the text field, it is analyzed as a multi-value field.

  GET /_analyze
  {
    "analyzer" : "standard",
    "text" : ["first array element", "second array element"]
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "first",
        "start_offset" : 0,
        "end_offset" : 5,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "array",
        "start_offset" : 6,
        "end_offset" : 11,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "element",
        "start_offset" : 12,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 2
      },
      {
        "token" : "second",
        "start_offset" : 20,
        "end_offset" : 26,
        "type" : "<ALPHANUM>",
        "position" : 3
      },
      {
        "token" : "array",
        "start_offset" : 27,
        "end_offset" : 32,
        "type" : "<ALPHANUM>",
        "position" : 4
      },
      {
        "token" : "element",
        "start_offset" : 33,
        "end_offset" : 40,
        "type" : "<ALPHANUM>",
        "position" : 5
      }
    ]
  }
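
In this response, the offsets of the second array element continue where the first element ended, leaving a gap of one character. The following Python sketch checks that reading against the sample values only. Treat the one-character gap as an observation about this example, not as documented behavior.

```python
texts = ["first array element", "second array element"]

# (token, start_offset, end_offset) triples copied from the sample response.
tokens = [
    ("first", 0, 5), ("array", 6, 11), ("element", 12, 19),
    ("second", 20, 26), ("array", 27, 32), ("element", 33, 40),
]

# Assumption: offsets behave as if the values were joined by one character.
joined = " ".join(texts)

# Each offset pair slices the matching word out of the joined string.
recovered = [joined[start:end] for _, start, end in tokens]
print(recovered)
```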

Apply a built-in analyzer

If you omit the index path parameter, you can apply any of the built-in analyzers to the text string.

The following request analyzes text using the standard built-in analyzer:

  GET /_analyze
  {
    "analyzer" : "standard",
    "text" : "OpenSearch text analysis"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "opensearch",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "text",
        "start_offset" : 11,
        "end_offset" : 15,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "analysis",
        "start_offset" : 16,
        "end_offset" : 24,
        "type" : "<ALPHANUM>",
        "position" : 2
      }
    ]
  }
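
Outside of a console, the same request can be prepared with Python's standard library. This is only a sketch: the base URL is an assumption (a local cluster without the Security plugin), the helper name is hypothetical, and a secured cluster would also need authentication headers.

```python
import json
import urllib.request

# Assumption: a local, unsecured cluster. Adjust for a real deployment.
BASE_URL = "http://localhost:9200"


def prepare_analyze(body: dict, index: str = "", method: str = "GET"):
    """Build (but do not send) an HTTP request for the analyze API."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


request = prepare_analyze(
    {"analyzer": "standard", "text": "OpenSearch text analysis"}
)
print(request.get_method(), request.full_url)

# To send it against a running cluster:
#     with urllib.request.urlopen(request) as response:
#         tokens = json.load(response)["tokens"]
```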

Apply a custom analyzer

You can create your own analyzer and specify it in an analyze request.

In this scenario, a custom analyzer lowercase_ascii_folding has been created and associated with the books2 index. The analyzer converts text to lowercase and converts non-ASCII characters to ASCII.

The following request applies the custom analyzer to the provided text:

  GET /books2/_analyze
  {
    "analyzer" : "lowercase_ascii_folding",
    "text" : "Le garçon m'a SUIVI."
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "le",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "garcon",
        "start_offset" : 3,
        "end_offset" : 9,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "m'a",
        "start_offset" : 10,
        "end_offset" : 13,
        "type" : "<ALPHANUM>",
        "position" : 2
      },
      {
        "token" : "suivi",
        "start_offset" : 14,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 3
      }
    ]
  }

Apply a custom transient analyzer

You can build a custom transient analyzer from tokenizers, token filters, or character filters. Use the filter parameter to specify token filters.

The following request uses the uppercase token filter to convert the text to uppercase:

  GET /_analyze
  {
    "tokenizer" : "keyword",
    "filter" : ["uppercase"],
    "text" : "OpenSearch filter"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "OPENSEARCH FILTER",
        "start_offset" : 0,
        "end_offset" : 17,
        "type" : "word",
        "position" : 0
      }
    ]
  }

The following request uses the html_strip character filter to remove HTML tags from the text before it is tokenized:

  GET /_analyze
  {
    "tokenizer" : "keyword",
    "filter" : ["lowercase"],
    "char_filter" : ["html_strip"],
    "text" : "<b>Leave</b> right now!"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "leave right now!",
        "start_offset" : 3,
        "end_offset" : 23,
        "type" : "word",
        "position" : 0
      }
    ]
  }
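
The offsets in this response point into the original text, not into the filtered text. The following Python sketch, which uses only the sample values from this example, slices the original string with the returned offsets to show the difference.

```python
text = "<b>Leave</b> right now!"

# Values copied from the sample response.
token = {"token": "leave right now!", "start_offset": 3, "end_offset": 23}

# Slicing the original text with the offsets returns the source span,
# including the closing tag that html_strip removed before tokenization.
source_span = text[token["start_offset"]:token["end_offset"]]
print(source_span)  # Leave</b> right now!
```

The span is longer than the token itself because it still contains the markup that the character filter removed.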

You can combine filters using an array.

The following request combines the lowercase token filter with a stop token filter that removes the words listed in the stopwords array:

  GET /_analyze
  {
    "tokenizer" : "whitespace",
    "filter" : ["lowercase", {"type": "stop", "stopwords": ["to", "in"]}],
    "text" : "how to train your dog in five steps"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "how",
        "start_offset" : 0,
        "end_offset" : 3,
        "type" : "word",
        "position" : 0
      },
      {
        "token" : "train",
        "start_offset" : 7,
        "end_offset" : 12,
        "type" : "word",
        "position" : 2
      },
      {
        "token" : "your",
        "start_offset" : 13,
        "end_offset" : 17,
        "type" : "word",
        "position" : 3
      },
      {
        "token" : "dog",
        "start_offset" : 18,
        "end_offset" : 21,
        "type" : "word",
        "position" : 4
      },
      {
        "token" : "five",
        "start_offset" : 25,
        "end_offset" : 29,
        "type" : "word",
        "position" : 6
      },
      {
        "token" : "steps",
        "start_offset" : 30,
        "end_offset" : 35,
        "type" : "word",
        "position" : 7
      }
    ]
  }
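
The positions in this response skip 1 and 5, which are the places where the stop filter removed a word. A short Python sketch, using only the sample values, recovers the removed words from the gaps. Splitting on white space here stands in for the whitespace tokenizer and is an approximation for this example.

```python
# Positions copied from the sample response.
positions = [0, 2, 3, 4, 6, 7]

# Positions that never appear mark where a token was removed.
removed = sorted(set(range(max(positions) + 1)) - set(positions))

words = "how to train your dog in five steps".split()
print([words[p] for p in removed])  # ['to', 'in']
```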

Specify an index

You can analyze text using an index’s default analyzer, or you can specify a different analyzer.

The following request analyzes the provided text using the default analyzer associated with the books index:

  GET /books/_analyze
  {
    "text" : "OpenSearch analyze test"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "opensearch",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "analyze",
        "start_offset" : 11,
        "end_offset" : 18,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "test",
        "start_offset" : 19,
        "end_offset" : 23,
        "type" : "<ALPHANUM>",
        "position" : 2
      }
    ]
  }

The following request analyzes the provided text using the keyword analyzer, which returns the entire text value as a single token:

  GET /books/_analyze
  {
    "analyzer" : "keyword",
    "text" : "OpenSearch analyze test"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "OpenSearch analyze test",
        "start_offset" : 0,
        "end_offset" : 23,
        "type" : "word",
        "position" : 0
      }
    ]
  }

Derive the analyzer from an index field

You can pass text and a field in the index. The API looks up the field’s analyzer and uses it to analyze the text.

If the mapping does not exist, the API uses the standard analyzer, which converts all text to lowercase and splits it into tokens at word boundaries.

The following request causes the analysis to be based on the mapping for name:

  GET /books2/_analyze
  {
    "field" : "name",
    "text" : "OpenSearch analyze test"
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "opensearch",
        "start_offset" : 0,
        "end_offset" : 10,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "analyze",
        "start_offset" : 11,
        "end_offset" : 18,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "test",
        "start_offset" : 19,
        "end_offset" : 23,
        "type" : "<ALPHANUM>",
        "position" : 2
      }
    ]
  }

Specify a normalizer

Instead of using a keyword field, you can use the normalizer associated with the index. A normalizer causes the analysis chain to produce a single token.

In this example, the books2 index includes a normalizer called to_lower_fold_ascii that converts text to lowercase and translates non-ASCII text to ASCII.

The following request applies to_lower_fold_ascii to the text:

  GET /books2/_analyze
  {
    "normalizer" : "to_lower_fold_ascii",
    "text" : "C'est le garçon qui m'a suivi."
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "c'est le garcon qui m'a suivi.",
        "start_offset" : 0,
        "end_offset" : 30,
        "type" : "word",
        "position" : 0
      }
    ]
  }
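
To get a feel for what such a normalizer does without a cluster, the following Python sketch approximates lowercasing and ASCII folding locally. It is a rough stand-in, not the filter that OpenSearch applies: characters without a Unicode decomposition (for example, "ø") are simply dropped here, whereas a real ASCII folding filter may map them to a replacement.

```python
import unicodedata


def lower_fold_ascii(text: str) -> str:
    """Rough local stand-in for a lowercase + ASCII-folding normalizer."""
    # Decompose accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop every character that is not plain ASCII (the combining marks).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower()


print(lower_fold_ascii("C'est le garçon qui m'a suivi."))
# c'est le garcon qui m'a suivi.
```

As in the response, the whole input stays one string: the sketch transforms characters but never splits the text.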

You can create a custom transient normalizer with token and character filters.

The following request uses the uppercase token filter to convert the given text to all uppercase:

  GET /_analyze
  {
    "filter" : ["uppercase"],
    "text" : "That is the boy who followed me."
  }

The previous request returns the following fields:

  {
    "tokens" : [
      {
        "token" : "THAT IS THE BOY WHO FOLLOWED ME.",
        "start_offset" : 0,
        "end_offset" : 32,
        "type" : "word",
        "position" : 0
      }
    ]
  }

Get token details

You can obtain additional details for all tokens by setting the explain field to true.

The following request provides detailed token information for the reverse filter used with the standard tokenizer:

  GET /_analyze
  {
    "tokenizer" : "standard",
    "filter" : ["reverse"],
    "text" : "OpenSearch analyze test",
    "explain" : true,
    "attributes" : ["keyword"]
  }

The previous request returns the following fields:

  {
    "detail" : {
      "custom_analyzer" : true,
      "charfilters" : [ ],
      "tokenizer" : {
        "name" : "standard",
        "tokens" : [
          {
            "token" : "OpenSearch",
            "start_offset" : 0,
            "end_offset" : 10,
            "type" : "<ALPHANUM>",
            "position" : 0
          },
          {
            "token" : "analyze",
            "start_offset" : 11,
            "end_offset" : 18,
            "type" : "<ALPHANUM>",
            "position" : 1
          },
          {
            "token" : "test",
            "start_offset" : 19,
            "end_offset" : 23,
            "type" : "<ALPHANUM>",
            "position" : 2
          }
        ]
      },
      "tokenfilters" : [
        {
          "name" : "reverse",
          "tokens" : [
            {
              "token" : "hcraeSnepO",
              "start_offset" : 0,
              "end_offset" : 10,
              "type" : "<ALPHANUM>",
              "position" : 0
            },
            {
              "token" : "ezylana",
              "start_offset" : 11,
              "end_offset" : 18,
              "type" : "<ALPHANUM>",
              "position" : 1
            },
            {
              "token" : "tset",
              "start_offset" : 19,
              "end_offset" : 23,
              "type" : "<ALPHANUM>",
              "position" : 2
            }
          ]
        }
      ]
    }
  }
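
The detail object lists the tokens after each stage of the analysis chain, which makes it possible to compare stages side by side. The following Python sketch walks an abbreviated copy of the sample response. The function name is hypothetical, and the sample data omits offsets, types, and positions for brevity.

```python
def token_stages(detail: dict) -> list:
    """Return [(stage name, [token text, ...]), ...] in pipeline order."""
    stages = []
    tokenizer = detail.get("tokenizer")
    if tokenizer:
        stages.append(
            (tokenizer["name"], [t["token"] for t in tokenizer["tokens"]])
        )
    # Token filters appear in the order they were specified in the request.
    for token_filter in detail.get("tokenfilters", []):
        stages.append(
            (token_filter["name"], [t["token"] for t in token_filter["tokens"]])
        )
    return stages


# Abbreviated copy of the sample response.
detail = {
    "custom_analyzer": True,
    "charfilters": [],
    "tokenizer": {
        "name": "standard",
        "tokens": [{"token": "OpenSearch"}, {"token": "analyze"}, {"token": "test"}],
    },
    "tokenfilters": [
        {
            "name": "reverse",
            "tokens": [{"token": "hcraeSnepO"}, {"token": "ezylana"}, {"token": "tset"}],
        }
    ],
}

for name, tokens in token_stages(detail):
    print(f"{name}: {tokens}")
```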

Set a token limit

You can set a limit on the number of tokens generated. Setting a lower value reduces a node’s memory usage. The default value is 10000.

The following request limits the tokens to four:

  PUT /books2
  {
    "settings" : {
      "index.analyze.max_token_count" : 4
    }
  }

The preceding request is an index API request rather than an analyze API request. See Dynamic index settings for additional details.

Response fields

The text analysis endpoints return the following response fields.

Field | Data type | Description
:--- | :--- | :---
tokens | Array | Array of tokens derived from the text. See token object.
detail | Object | Details about the analysis and each token. Included only when you request token details. See detail object.

Token object

Field | Data type | Description
:--- | :--- | :---
token | String | The token’s text.
start_offset | Integer | The token’s starting position within the original text string. Offsets are zero-based.
end_offset | Integer | The token’s ending position within the original text string.
type | String | Classification of the token: <ALPHANUM>, <NUM>, and so on. The tokenizer usually sets the type, but some filters define their own types. For example, the synonym filter defines the <SYNONYM> type.
position | Integer | The token’s position within the token stream. Positions can skip values when a token filter removes tokens.
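
When working with responses in code, it can help to give the token object a typed shape. The following Python sketch is one way to do that. The class and function names are hypothetical, and the sketch assumes a plain response in which each token carries exactly the five fields listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token: str
    start_offset: int
    end_offset: int
    type: str
    position: int


def parse_tokens(response: dict) -> list:
    """Convert the tokens array of an analyze response into Token objects."""
    return [Token(**item) for item in response.get("tokens", [])]


sample = {
    "tokens": [
        {
            "token": "opensearch",
            "start_offset": 0,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 0,
        }
    ]
}
first = parse_tokens(sample)[0]
print(first.token, first.end_offset - first.start_offset)
```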

Detail object

Field | Data type | Description
:--- | :--- | :---
custom_analyzer | Boolean | Whether the analyzer applied to the text is custom or built in.
charfilters | Array | List of character filters applied to the text.
tokenizer | Object | Name of the tokenizer applied to the text and a list of tokens with content before the token filters were applied.
tokenfilters | Array | List of token filters applied to the text. Each token filter includes the filter’s name and a list of tokens with content after the filters were applied. Token filters are listed in the order they are specified in the request.

See token object for token field descriptions.