CJK width token filter

The cjk_width token filter normalizes Chinese, Japanese, and Korean (CJK) tokens by converting full-width ASCII characters to their standard (half-width) ASCII equivalents and half-width katakana characters to their full-width equivalents.

Converting full-width ASCII characters

In CJK texts, ASCII characters (such as letters and numbers) can appear in full-width form, occupying the space of two half-width characters. Full-width ASCII characters are typically used in East Asian typography for alignment with the width of CJK characters. However, a full-width character and its half-width equivalent are different Unicode code points, so a query typed in standard ASCII does not match the full-width form. For the purposes of indexing and searching, these full-width characters therefore need to be normalized to their standard (half-width) ASCII equivalents.

The following example illustrates ASCII character normalization:

  Full-width: ＡＢＣＤＥ １２３４５
  Normalized (half-width): ABCDE 12345
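
The mapping itself is simple to reproduce outside the search engine. The following Python sketch is an illustration only, not the filter's actual implementation: it relies on the fact that the Unicode full-width ASCII variants (U+FF01 through U+FF5E) sit at a fixed offset of 0xFEE0 from their ASCII equivalents. The function name fold_fullwidth_ascii is hypothetical.

```python
FULLWIDTH_START = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_END = 0xFF5E    # FULLWIDTH TILDE
OFFSET = 0xFEE0           # distance between a full-width form and its ASCII equivalent


def fold_fullwidth_ascii(text: str) -> str:
    """Replace full-width ASCII variants with half-width ASCII (hypothetical helper)."""
    return "".join(
        # Shift characters inside the full-width range; leave everything else untouched.
        chr(ord(ch) - OFFSET) if FULLWIDTH_START <= ord(ch) <= FULLWIDTH_END else ch
        for ch in text
    )


# Full-width "ABCDE 12345", written with escapes so the input is unambiguous.
sample = "\uff21\uff22\uff23\uff24\uff25 \uff11\uff12\uff13\uff14\uff15"
print(fold_fullwidth_ascii(sample))  # ABCDE 12345
```

Characters outside the full-width range, including standard ASCII and CJK ideographs, pass through unchanged.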

Converting half-width katakana characters

The cjk_width token filter converts half-width katakana characters to their full-width counterparts, which are the standard form used in Japanese text. This normalization, illustrated in the following example, is important for consistency in text processing and searching:

  Half-width katakana: ｶﾀｶﾅ
  Normalized (full-width) katakana: カタカナ
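
Half-width katakana do not sit at a fixed offset from their full-width forms, so a lookup is needed. The following Python sketch is an approximation under stated assumptions, not the filter's implementation: it applies the Unicode compatibility mapping (NFKC) from the standard unicodedata module, one character at a time, to the half-width katakana range (U+FF65 through U+FF9F). The function name fold_halfwidth_katakana is hypothetical, and the real filter may treat the half-width voiced and semi-voiced sound marks differently from this sketch.

```python
import unicodedata

HALFWIDTH_KATAKANA_START = 0xFF65  # HALFWIDTH KATAKANA MIDDLE DOT
HALFWIDTH_KATAKANA_END = 0xFF9F    # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK


def fold_halfwidth_katakana(text: str) -> str:
    """Replace half-width katakana with full-width katakana (hypothetical helper)."""
    out = []
    for ch in text:
        if HALFWIDTH_KATAKANA_START <= ord(ch) <= HALFWIDTH_KATAKANA_END:
            # NFKC maps each half-width form to its standard full-width counterpart.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            # Everything outside the half-width katakana range is left as is.
            out.append(ch)
    return "".join(out)


# Half-width "katakana", written with escapes so the input is unambiguous.
sample = "\uff76\uff80\uff76\uff85"
print(fold_halfwidth_katakana(sample))  # カタカナ
```

Because the mapping is applied per character, a half-width voiced sound mark is converted on its own rather than merged with the preceding kana, which is one way this sketch can differ from the filter's output.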

Example

The following example request creates a new index named cjk_width_example_index and defines an analyzer with the cjk_width filter:

  PUT /cjk_width_example_index
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "cjk_width_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["cjk_width"]
          }
        }
      }
    }
  }

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

  POST /cjk_width_example_index/_analyze
  {
    "analyzer": "cjk_width_analyzer",
    "text": "Ｔｏｋｙｏ ２０２４ ｶﾀｶﾅ"
  }

The response contains the generated tokens:

  {
    "tokens": [
      {
        "token": "Tokyo",
        "start_offset": 0,
        "end_offset": 5,
        "type": "<ALPHANUM>",
        "position": 0
      },
      {
        "token": "2024",
        "start_offset": 6,
        "end_offset": 10,
        "type": "<NUM>",
        "position": 1
      },
      {
        "token": "カタカナ",
        "start_offset": 11,
        "end_offset": 15,
        "type": "<KATAKANA>",
        "position": 2
      }
    ]
  }