Thai analyzer

To apply the built-in thai analyzer to a text field, specify it in the field mapping when you create the index:

  PUT /thai-index
  {
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "thai"
        }
      }
    }
  }


Stem exclusion

You can pass a stem_exclusion list to this language analyzer when you define it in the index settings:

  PUT index_with_stem_exclusion_thai_analyzer
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "stem_exclusion_thai_analyzer": {
            "type": "thai",
            "stem_exclusion": ["อำนาจ", "การอนุมัติ"]
          }
        }
      }
    }
  }


Thai analyzer internals

The thai analyzer is built using the following components:

  • Tokenizer: thai

  • Token filters:

    • lowercase
    • decimal_digit
    • stop (Thai)
    • keyword_marker
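
To make the filter chain above more concrete, the following is a minimal plain-Python approximation of what lowercase, decimal_digit, and stop do to tokens that the tokenizer has already produced. It is not the analyzer's actual implementation: the function names and the one-word stopword set are hypothetical, the real Thai stopword list is much longer, and word segmentation by the thai tokenizer is not reproduced here.

```python
import unicodedata


def fold_decimal_digits(token: str) -> str:
    """Approximate decimal_digit: map any Unicode decimal digit to 0-9."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in token
    )


def apply_filters(tokens, stopwords):
    """Approximate the chain lowercase -> decimal_digit -> stop, in that order."""
    kept = []
    for token in tokens:
        token = fold_decimal_digits(token.lower())
        if token in stopwords:
            continue  # the stop filter drops the token entirely
        kept.append(token)
    return kept


# Hypothetical one-entry stopword set, for illustration only.
sample_stopwords = {"\u0e17\u0e35\u0e48"}  # the Thai word "ที่"

# "\u0e51\u0e52\u0e53" is the Thai-digit spelling of 123.
print(fold_decimal_digits("\u0e51\u0e52\u0e53"))  # prints 123
print(apply_filters(["OpenSearch", "\u0e51\u0e52\u0e53", "\u0e17\u0e35\u0e48"], sample_stopwords))
```

The keyword marking step is left out of the sketch because it only flags tokens and does not change or remove them.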

Custom Thai analyzer

To create a custom Thai analyzer that you can modify, define the tokenizer and token filters explicitly in the index settings:

  PUT /thai-index
  {
    "settings": {
      "analysis": {
        "filter": {
          "thai_stop": {
            "type": "stop",
            "stopwords": "_thai_"
          },
          "thai_keywords": {
            "type": "keyword_marker",
            "keywords": []
          }
        },
        "analyzer": {
          "thai_analyzer": {
            "tokenizer": "thai",
            "filter": [
              "lowercase",
              "decimal_digit",
              "thai_stop",
              "thai_keywords"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "thai_analyzer"
        }
      }
    }
  }


Generated tokens

Use the following request to examine the tokens that the analyzer generates for the content field:

  POST /thai-index/_analyze
  {
    "field": "content",
    "text": "นักเรียนกำลังศึกษาอยู่ที่มหาวิทยาลัยไทย หมายเลข 123456."
  }


The response contains the generated tokens:

  {
    "tokens": [
      {"token": "นักเรียน", "start_offset": 0, "end_offset": 8, "type": "word", "position": 0},
      {"token": "กำลัง", "start_offset": 8, "end_offset": 13, "type": "word", "position": 1},
      {"token": "ศึกษา", "start_offset": 13, "end_offset": 18, "type": "word", "position": 2},
      {"token": "มหาวิทยาลัย", "start_offset": 25, "end_offset": 36, "type": "word", "position": 5},
      {"token": "ไทย", "start_offset": 36, "end_offset": 39, "type": "word", "position": 6},
      {"token": "หมายเลข", "start_offset": 40, "end_offset": 47, "type": "word", "position": 7},
      {"token": "123456", "start_offset": 48, "end_offset": 54, "type": "word", "position": 8}
    ]
  }
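
In the response above, the position values jump from 2 to 5. This is most likely where the stop filter removed tokens. The following Python sketch locates such gaps and prints the stretch of input text that falls between the surrounding tokens. The helper name find_position_gaps is hypothetical, and the token list is copied from the response with the type field omitted. The sketch assumes that offsets can be used directly as Python string indices, which holds when every character is in the Basic Multilingual Plane, as in this input.

```python
text = "นักเรียนกำลังศึกษาอยู่ที่มหาวิทยาลัยไทย หมายเลข 123456."

# Tokens copied from the _analyze response, keeping only the fields used here.
tokens = [
    {"token": "นักเรียน", "start_offset": 0, "end_offset": 8, "position": 0},
    {"token": "กำลัง", "start_offset": 8, "end_offset": 13, "position": 1},
    {"token": "ศึกษา", "start_offset": 13, "end_offset": 18, "position": 2},
    {"token": "มหาวิทยาลัย", "start_offset": 25, "end_offset": 36, "position": 5},
    {"token": "ไทย", "start_offset": 36, "end_offset": 39, "position": 6},
    {"token": "หมายเลข", "start_offset": 40, "end_offset": 47, "position": 7},
    {"token": "123456", "start_offset": 48, "end_offset": 54, "position": 8},
]


def find_position_gaps(tokens):
    """Return (missing_positions, (start, end)) for each jump in token positions."""
    gaps = []
    for prev, cur in zip(tokens, tokens[1:]):
        if cur["position"] - prev["position"] > 1:
            missing = list(range(prev["position"] + 1, cur["position"]))
            gaps.append((missing, (prev["end_offset"], cur["start_offset"])))
    return gaps


gaps = find_position_gaps(tokens)
for missing, (start, end) in gaps:
    # Show which positions were dropped and the input text they covered.
    print(missing, repr(text[start:end]))
```

For this response the sketch reports a single gap: positions 3 and 4, covering offsets 18 to 25 of the input. Position gaps matter for phrase queries, because the remaining tokens keep their original distance from each other.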