Elision token filter

The elision token filter removes elided articles from the beginning of tokens. Elision is common in languages such as French, in which a short word, such as an article, is contracted with the following word: the article's final vowel is omitted and replaced with an apostrophe (for example, le étudiant becomes l'étudiant).

The elision token filter comes preconfigured in the following language analyzers: catalan, french, irish, and italian.

Parameters

The custom elision token filter can be configured with the following parameters.

- articles (array of strings): Required if articles_path is not configured. Defines which articles or short words should be removed when they appear as part of an elision.
- articles_path (string): Required if articles is not configured. Specifies the path to a custom list of articles that should be removed during the analysis process.
- articles_case (Boolean): Optional. Specifies whether the filter is case sensitive when matching elisions. Default is false.
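
As a sketch of the articles_path option, the following request configures the filter to read its article list from a file instead of an inline array. The file name analysis/french_articles.txt is hypothetical; the path is resolved on each node, and the file is assumed to list one article per line, without apostrophes:

```json
PUT /french_texts
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles_path": "analysis/french_articles.txt"
        }
      }
    }
  }
}
```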

Example

The default set of French elisions is l', m', t', qu', n', s', j', d', c', jusqu', quoiqu', lorsqu', and puisqu'. You can update this by configuring the french_elision token filter. The following example request creates a new index named french_texts and configures an analyzer with a french_elision filter:

PUT /french_texts
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type": "elision",
          "articles": [ "l", "t", "m", "d", "n", "s", "j" ]
        }
      },
      "analyzer": {
        "french_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "french_elision"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "french_analyzer"
      }
    }
  }
}


Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /french_texts/_analyze
{
  "analyzer": "french_analyzer",
  "text": "L'étudiant aime l'école et le travail."
}


The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "étudiant",
      "start_offset": 0,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "aime",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "école",
      "start_offset": 16,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "et",
      "start_offset": 24,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "le",
      "start_offset": 27,
      "end_offset": 29,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "travail",
      "start_offset": 30,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    }
  ]
}
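
Note that le is kept as a token: the filter removes only articles joined to the following word by an apostrophe, not standalone words. To experiment with a filter configuration without creating an index, you can define the filter inline in an _analyze request. The following is a sketch that mirrors the french_elision definition above:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "elision",
      "articles": [ "l", "t", "m", "d", "n", "s", "j" ]
    }
  ],
  "text": "L'étudiant aime l'école."
}
```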