Normalizers

Normalizers are similar to analyzers except that they may emit only a single token. As a consequence, they do not have a tokenizer and accept only a subset of the available char filters and token filters. Only filters that work on a per-character basis are allowed. For instance, a lowercasing filter is allowed, but a stemming filter is not, because it needs to look at the keyword as a whole. The filters that can currently be used in a normalizer definition are: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, persian_normalization, scandinavian_folding, serbian_normalization, sorani_normalization, uppercase.
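As a rough illustration of the restriction above, a normalizer that lists a stemming filter is expected to be rejected when the index is created. In this sketch the index name rejected-index and the normalizer name stemming_normalizer are hypothetical, and porter_stem stands in for any token filter that is not on the list:

```console
# Hypothetical index and normalizer names, used only for illustration.
# porter_stem is a stemming filter, so it is not on the allowed list.
PUT rejected-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "stemming_normalizer": {
          "type": "custom",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  }
}
```

Elasticsearch should answer this request with an error instead of creating the index. Dropping porter_stem from the filter list should let the same request through.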

Elasticsearch ships with a built-in lowercase normalizer. Any other form of normalization requires a custom configuration.
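Because the lowercase normalizer is built in, a keyword field can reference it by name without any analysis settings. A minimal sketch, assuming a hypothetical index called my-index with a hypothetical field called code:

```console
# Hypothetical index and field names.
# "lowercase" refers to the built-in normalizer, so no "settings" block is needed.
PUT my-index
{
  "mappings": {
    "properties": {
      "code": {
        "type": "keyword",
        "normalizer": "lowercase"
      }
    }
  }
}
```

With this mapping, values such as ABC-1 and abc-1 would be stored as the same keyword.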

Custom normalizers

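Custom normalizers take a list of character filters and a list of token filters. The following example defines a normalizer that maps « and » to straight double quotes, then lowercases and ASCII-folds the value, and applies it to the keyword field foo.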

    PUT index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "quote": {
              "type": "mapping",
              "mappings": [
                "« => \"",
                "» => \""
              ]
            }
          },
          "normalizer": {
            "my_normalizer": {
              "type": "custom",
              "char_filter": ["quote"],
              "filter": ["lowercase", "asciifolding"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "foo": {
            "type": "keyword",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
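
One way to check what a custom normalizer produces is to run sample text through the analyze API, which accepts a normalizer parameter. The request below is a sketch that assumes the index defined above already exists; the sample text is made up for illustration:

```console
# Assumes the index and my_normalizer from the example above.
# The sample text is invented for illustration.
GET index/_analyze
{
  "normalizer": "my_normalizer",
  "text": "«Crème Brûlée»"
}
```

The response should contain a single token: the guillemets replaced by straight double quotes by the quote char filter, and the text between them lowercased and folded to ASCII, giving creme brulee inside the quotes.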