Standard analyzer

The standard analyzer is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

Example output

Python:

  resp = client.indices.analyze(
      analyzer="standard",
      text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  )
  print(resp)

Ruby:

  response = client.indices.analyze(
    body: {
      analyzer: 'standard',
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
  )
  puts response

JavaScript:

  const response = await client.indices.analyze({
    analyzer: "standard",
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  });
  console.log(response);

Console:

  POST _analyze
  {
    "analyzer": "standard",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }

The above sentence would produce the following terms:

  [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
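
The analyze API returns more than the bare terms: each token also carries character offsets, a type, and a position. As a minimal sketch, the helper below pulls just the term strings out of a response body. The helper name terms_from_analyze_response is hypothetical, and the sample dictionary is hand-written and abbreviated to the first three tokens to show the response shape; it was not captured from a live cluster.

```python
from typing import Any, Mapping


def terms_from_analyze_response(resp: Mapping[str, Any]) -> list:
    """Return only the term strings from an _analyze response body."""
    return [token["token"] for token in resp["tokens"]]


# Hand-written, abbreviated sample in the shape of an _analyze response.
sample_response = {
    "tokens": [
        {"token": "the", "start_offset": 0, "end_offset": 3,
         "type": "<ALPHANUM>", "position": 0},
        {"token": "2", "start_offset": 4, "end_offset": 5,
         "type": "<NUM>", "position": 1},
        {"token": "quick", "start_offset": 6, "end_offset": 11,
         "type": "<ALPHANUM>", "position": 2},
    ]
}

print(terms_from_analyze_response(sample_response))
```

With the Python client, the same helper should work on the resp object from the example above, since the response can be indexed like a dictionary.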

Configuration

The standard analyzer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.
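
The stopwords parameter takes either the name of a pre-defined list or an explicit array. The following is a minimal sketch of both forms as Python settings dictionaries. The helper name standard_analyzer_settings, the analyzer name my_analyzer, and the three-word list are made up for illustration.

```python
from typing import List, Union


def standard_analyzer_settings(
    name: str,
    stopwords: Union[str, List[str]],
    max_token_length: int = 255,
) -> dict:
    """Build an index settings body that configures a standard analyzer."""
    return {
        "analysis": {
            "analyzer": {
                name: {
                    "type": "standard",
                    "max_token_length": max_token_length,
                    "stopwords": stopwords,
                }
            }
        }
    }


# Form 1: a pre-defined list, referenced by name.
predefined = standard_analyzer_settings("my_analyzer", "_english_")

# Form 2: an explicit array of stop words (illustrative values).
explicit = standard_analyzer_settings("my_analyzer", ["and", "is", "the"])

print(predefined["analysis"]["analyzer"]["my_analyzer"]["stopwords"])
print(explicit["analysis"]["analyzer"]["my_analyzer"]["stopwords"])
```

Either dictionary could then be passed as the settings argument when creating an index.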

Example configuration

In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes), and to use the pre-defined list of English stop words:

Python:

  resp = client.indices.create(
      index="my-index-000001",
      settings={
          "analysis": {
              "analyzer": {
                  "my_english_analyzer": {
                      "type": "standard",
                      "max_token_length": 5,
                      "stopwords": "_english_"
                  }
              }
          }
      },
  )
  print(resp)

  resp1 = client.indices.analyze(
      index="my-index-000001",
      analyzer="my_english_analyzer",
      text="The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  )
  print(resp1)

Ruby:

  response = client.indices.create(
    index: 'my-index-000001',
    body: {
      settings: {
        analysis: {
          analyzer: {
            my_english_analyzer: {
              type: 'standard',
              max_token_length: 5,
              stopwords: '_english_'
            }
          }
        }
      }
    }
  )
  puts response

  response = client.indices.analyze(
    index: 'my-index-000001',
    body: {
      analyzer: 'my_english_analyzer',
      text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
  )
  puts response

JavaScript:

  const response = await client.indices.create({
    index: "my-index-000001",
    settings: {
      analysis: {
        analyzer: {
          my_english_analyzer: {
            type: "standard",
            max_token_length: 5,
            stopwords: "_english_",
          },
        },
      },
    },
  });
  console.log(response);

  const response1 = await client.indices.analyze({
    index: "my-index-000001",
    analyzer: "my_english_analyzer",
    text: "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  });
  console.log(response1);

Console:

  PUT my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_english_analyzer": {
            "type": "standard",
            "max_token_length": 5,
            "stopwords": "_english_"
          }
        }
      }
    }
  }

  POST my-index-000001/_analyze
  {
    "analyzer": "my_english_analyzer",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
  }

The above example produces the following terms:

  [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
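
To see why jumped becomes jumpe and d, and why both occurrences of "the" disappear, here is a rough pure-Python approximation of the three steps involved: tokenize, split over-long tokens, then lowercase and drop stop words. It is a sketch under stated assumptions, not the real implementation: the regular expression only imitates the Unicode Text Segmentation rules for simple English text, the function names are hypothetical, and STOP_WORDS is a small illustrative subset rather than the full _english_ list.

```python
import re

# Illustrative subset only; NOT the full _english_ stop word list.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})

# Runs of word characters, optionally joined by apostrophes (so "dog's"
# stays whole). A crude stand-in for Unicode Text Segmentation.
WORD_RE = re.compile(r"\w+(?:'\w+)*")


def split_token(token, max_token_length):
    """Split a token into chunks of at most max_token_length characters."""
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


def approximate_standard_analyzer(text, max_token_length=255,
                                  stop_words=frozenset()):
    """Tokenize, split long tokens, lowercase, then remove stop words."""
    terms = []
    for word in WORD_RE.findall(text):
        for piece in split_token(word, max_token_length):
            term = piece.lower()
            if term not in stop_words:
                terms.append(term)
    return terms


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(approximate_standard_analyzer(text, 5, STOP_WORDS))
```

Calling the function with its defaults (no stop words, a limit of 255) yields the unconfigured term list shown earlier, with "the" kept and jumped left intact.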

Definition

The standard analyzer consists of:

Tokenizer

Standard Tokenizer

Token Filters

Lower Case Token Filter

Stop Token Filter (disabled by default)

To customize the standard analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in standard analyzer and can serve as a starting point:

Python:

  resp = client.indices.create(
      index="standard_example",
      settings={
          "analysis": {
              "analyzer": {
                  "rebuilt_standard": {
                      "tokenizer": "standard",
                      "filter": [
                          "lowercase"
                      ]
                  }
              }
          }
      },
  )
  print(resp)

Ruby:

  response = client.indices.create(
    index: 'standard_example',
    body: {
      settings: {
        analysis: {
          analyzer: {
            rebuilt_standard: {
              tokenizer: 'standard',
              filter: [
                'lowercase'
              ]
            }
          }
        }
      }
    }
  )
  puts response

JavaScript:

  const response = await client.indices.create({
    index: "standard_example",
    settings: {
      analysis: {
        analyzer: {
          rebuilt_standard: {
            tokenizer: "standard",
            filter: ["lowercase"],
          },
        },
      },
    },
  });
  console.log(response);

Console:

  PUT /standard_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "rebuilt_standard": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          }
        }
      }
    }
  }

You’d add any token filters after lowercase.
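
As a sketch of that last step, the settings below append one more filter after lowercase. The choice of asciifolding (a built-in token filter that converts accented characters to their ASCII equivalents), the analyzer name standard_plus_folding, and the helper name are illustrative assumptions, not part of the original example.

```python
def rebuilt_standard_settings(extra_filters):
    """Return index settings for a rebuilt standard analyzer.

    Any extra token filters are appended after "lowercase", so they
    always see lowercased tokens.
    """
    return {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name, chosen for illustration.
                "standard_plus_folding": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", *extra_filters],
                }
            }
        }
    }


settings = rebuilt_standard_settings(["asciifolding"])
print(settings["analysis"]["analyzer"]["standard_plus_folding"]["filter"])
```

The resulting dictionary could be passed as the settings argument to client.indices.create, in the same way as the rebuilt_standard example. Filter order matters because each filter receives the output of the one before it.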