Elision token filter
Removes specified elisions from the beginning of tokens. For example, you can use this filter to change l'avion to avion.
By default, the filter removes the following French elisions: l', m', t', qu', n', s', j', d', c', jusqu', quoiqu', lorsqu', puisqu'
Customized versions of this filter are included in several of Elasticsearch’s built-in language analyzers.
This filter uses Lucene’s ElisionFilter.
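The removal rule described above can be sketched in plain Python. This is an illustration only, not Lucene's implementation: the function name `strip_elision` and the whitespace split used as a stand-in tokenizer are assumptions made for this example, and the sketch accepts both the straight (') and typographic (’) apostrophes because both appear in this page's examples.

```python
# Default French elisions listed above, written without the trailing apostrophe.
DEFAULT_ARTICLES = frozenset({
    "l", "m", "t", "qu", "n", "s", "j", "d", "c",
    "jusqu", "quoiqu", "lorsqu", "puisqu",
})

# Straight and typographic apostrophes.
APOSTROPHES = ("'", "\u2019")


def strip_elision(token, articles=DEFAULT_ARTICLES):
    """Return the token without a leading elision and its apostrophe, if present."""
    # Locate the first apostrophe of either kind.
    positions = [i for i in (token.find(a) for a in APOSTROPHES) if i != -1]
    if not positions:
        return token
    cut = min(positions)
    # Remove the prefix and the apostrophe only when the prefix is a known elision.
    if token[:cut] in articles:
        return token[cut + 1:]
    return token


# A whitespace split stands in for a real tokenizer here.
tokens = "j’examine près du wharf".split()
print([strip_elision(t) for t in tokens])
```

A token such as aujourd'hui is left unchanged by this sketch, because aujourd is not in the list of elisions.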
Example
The following analyze API request uses the elision filter to remove j' from j’examine près du wharf:
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        "elision"
    ],
    text="j’examine près du wharf",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      'elision'
    ],
    text: 'j’examine près du wharf'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: ["elision"],
  text: "j’examine près du wharf",
});
console.log(response);

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["elision"],
  "text" : "j’examine près du wharf"
}
The filter produces the following tokens:
[ examine, près, du, wharf ]
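To get from the raw analyze API response to a compact token list like the one above, something along these lines may help. It is a minimal sketch that assumes the response body is a mapping with a `tokens` list whose entries each carry a `token` field; the helper name `token_texts` and the abbreviated, hand-written sample response are illustrative only.

```python
def token_texts(analyze_response):
    """Collect the text of each token from an analyze API response body."""
    return [entry["token"] for entry in analyze_response["tokens"]]


# Abbreviated, hand-written response for illustration; real entries also
# carry fields such as offsets, type, and position.
sample_response = {
    "tokens": [
        {"token": "examine"},
        {"token": "près"},
        {"token": "du"},
        {"token": "wharf"},
    ]
}

print(token_texts(sample_response))
```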
Add to an analyzer
The following create index API request uses the elision filter to configure a new custom analyzer.
resp = client.indices.create(
    index="elision_example",
    settings={
        "analysis": {
            "analyzer": {
                "whitespace_elision": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "elision"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'elision_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          whitespace_elision: {
            tokenizer: 'whitespace',
            filter: [
              'elision'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "elision_example",
  settings: {
    analysis: {
      analyzer: {
        whitespace_elision: {
          tokenizer: "whitespace",
          filter: ["elision"],
        },
      },
    },
  },
});
console.log(response);

PUT /elision_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_elision": {
          "tokenizer": "whitespace",
          "filter": [ "elision" ]
        }
      }
    }
  }
}
Configurable parameters
articles
(Required*, array of strings) List of elisions to remove.
To be removed, the elision must be at the beginning of a token and be immediately followed by an apostrophe. Both the elision and apostrophe are removed.
For custom elision filters, either this parameter or articles_path must be specified.
articles_path
(Required*, string) Path to a file that contains a list of elisions to remove.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each elision in the file must be separated by a line break.
To be removed, the elision must be at the beginning of a token and be immediately followed by an apostrophe. Both the elision and apostrophe are removed.
For custom elision filters, either this parameter or articles must be specified.
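As a rough illustration of the file format described for articles_path, the sketch below writes a UTF-8 file with one elision per line and builds a filter definition that points at it. The helper names, the file name french_elisions.txt, and the relative path analysis/french_elisions.txt are hypothetical. The sketch also assumes the file uses the same apostrophe-free form as the articles array in this page's customization example.

```python
import tempfile
from pathlib import Path


def write_articles_file(path, articles):
    """Write the elisions to path, UTF-8 encoded, separated by line breaks."""
    Path(path).write_text("\n".join(articles), encoding="utf-8")


def elision_filter_from_file(articles_path):
    """Build a custom elision filter definition that reads elisions from a file."""
    return {"type": "elision", "articles_path": articles_path}


# Demo in a temporary directory; on a real node the file would need to sit at
# an absolute path or under the config location (hypothetical path below).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "french_elisions.txt"
    write_articles_file(demo_file, ["l", "m", "qu", "jusqu"])
    written_lines = demo_file.read_text(encoding="utf-8").splitlines()

filter_definition = elision_filter_from_file("analysis/french_elisions.txt")
print(written_lines)
print(filter_definition)
```

The resulting dictionary would take the place of a filter entry under analysis.filter in an index's settings, in the same position as the inline articles variant.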
articles_case
(Optional, Boolean) If true, elision matching is case insensitive. If false, elision matching is case sensitive. Defaults to false.
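The effect of articles_case can be illustrated with a small plain-Python sketch. It only mimics the behavior described for the parameters above and is not the filter's actual implementation; the function name `strip_elision_with_case` and its `ignore_case` argument are made up for this example.

```python
def strip_elision_with_case(token, articles, ignore_case=False):
    """Strip a leading elision, optionally matching it case-insensitively."""
    for i, ch in enumerate(token):
        # Stop at the first straight or typographic apostrophe.
        if ch in ("'", "\u2019"):
            prefix = token[:i]
            if ignore_case:
                known = {a.lower() for a in articles}
                matched = prefix.lower() in known
            else:
                matched = prefix in set(articles)
            # Drop the elision and the apostrophe only on a match.
            return token[i + 1:] if matched else token
    return token


articles = ["l", "m", "t", "qu", "n", "s", "j"]

# Case-sensitive matching (the default): the uppercase "L" does not match "l".
print(strip_elision_with_case("L'avion", articles))

# Case-insensitive matching: "L" matches "l", so the elision is removed.
print(strip_elision_with_case("L'avion", articles, ignore_case=True))
```

Under these assumptions, the first call leaves L'avion untouched and the second returns avion.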
Customize
To customize the elision filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom case-insensitive elision filter that removes the l', m', t', qu', n', s', and j' elisions:
resp = client.indices.create(
    index="elision_case_insensitive_example",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": [
                        "elision_case_insensitive"
                    ]
                }
            },
            "filter": {
                "elision_case_insensitive": {
                    "type": "elision",
                    "articles": [
                        "l",
                        "m",
                        "t",
                        "qu",
                        "n",
                        "s",
                        "j"
                    ],
                    "articles_case": True
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'elision_case_insensitive_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              'elision_case_insensitive'
            ]
          }
        },
        filter: {
          elision_case_insensitive: {
            type: 'elision',
            articles: [
              'l',
              'm',
              't',
              'qu',
              'n',
              's',
              'j'
            ],
            articles_case: true
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "elision_case_insensitive_example",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["elision_case_insensitive"],
        },
      },
      filter: {
        elision_case_insensitive: {
          type: "elision",
          articles: ["l", "m", "t", "qu", "n", "s", "j"],
          articles_case: true,
        },
      },
    },
  },
});
console.log(response);

PUT /elision_case_insensitive_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "elision_case_insensitive" ]
        }
      },
      "filter": {
        "elision_case_insensitive": {
          "type": "elision",
          "articles": [ "l", "m", "t", "qu", "n", "s", "j" ],
          "articles_case": true
        }
      }
    }
  }
}