Trim token filter
Removes leading and trailing whitespace from each token in a stream. While this can change the length of a token, the trim filter does not change a token’s offsets.
The trim filter uses Lucene’s TrimFilter.
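The behavior described above can be sketched in plain Python. This is an illustration only, not Lucene's implementation: the `Token` class and `trim_filter` function are hypothetical names, and Python's `str.strip()` is assumed to approximate the filter's notion of whitespace.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a token in an analysis stream."""
    text: str
    start_offset: int
    end_offset: int


def trim_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Strip whitespace from each token's text, leaving offsets untouched."""
    for token in tokens:
        # Only the text changes; start_offset and end_offset are copied as-is.
        yield replace(token, text=token.text.strip())


trimmed = list(trim_filter([Token(" fox ", 0, 5)]))
print(trimmed)
```

The trimmed token's text is shorter than the span its offsets describe, which mirrors the note above about length changing while offsets stay fixed.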
Many commonly used tokenizers, such as the standard or whitespace tokenizer, remove whitespace by default. When using these tokenizers, you don’t need to add a separate trim filter.
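As a rough illustration of why the extra filter is redundant in that case, the sketch below uses Python's `str.split()` as a stand-in for whitespace tokenization. This is an assumption made for illustration; the real standard and whitespace tokenizers are implemented in Lucene and differ from `str.split()` in other respects.

```python
text = "  quick   brown fox  "

# str.split() with no arguments splits on runs of whitespace and drops
# empty strings, loosely mimicking a whitespace-splitting tokenizer.
tokens = text.split()

# Trimming each token afterwards changes nothing, because no token
# carries leading or trailing whitespace to begin with.
trimmed = [token.strip() for token in tokens]

print(tokens)
print(tokens == trimmed)
```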
Example
To see how the trim filter works, you first need to produce a token containing whitespace.
The following analyze API request uses the keyword tokenizer to produce a token for " fox ".
resp = client.indices.analyze(
    tokenizer="keyword",
    text=" fox ",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    text: ' fox '
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  text: " fox ",
});
console.log(response);
GET _analyze
{
  "tokenizer" : "keyword",
  "text" : " fox "
}
The API returns the following response. Note that the " fox " token contains the original text’s whitespace, and that its start_offset and end_offset of 0 and 5 span all five characters of the input.
{
  "tokens": [
    {
      "token": " fox ",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
To remove the whitespace, add the trim filter to the previous analyze API request.
resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "trim"
    ],
    text=" fox ",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'trim'
    ],
    text: ' fox '
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["trim"],
  text: " fox ",
});
console.log(response);
GET _analyze
{
  "tokenizer" : "keyword",
  "filter" : ["trim"],
  "text" : " fox "
}
The API returns the following response. The returned fox token does not include any leading or trailing whitespace. Although the token is now three characters long, its start_offset and end_offset remain 0 and 5.
{
  "tokens": [
    {
      "token": "fox",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    }
  ]
}
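Because the offsets are unchanged, they still point at the untrimmed span of the input. The sketch below checks this against the response shown above; the `response` dictionary is copied by hand from that output rather than fetched from a cluster.

```python
text = " fox "

# Response body copied from the analyze API output above.
response = {
    "tokens": [
        {
            "token": "fox",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0,
        }
    ]
}

token = response["tokens"][0]

# Width of the span the offsets describe in the original text.
offset_span = token["end_offset"] - token["start_offset"]

# Slicing the input by the offsets recovers the untrimmed text.
original_slice = text[token["start_offset"]:token["end_offset"]]

print(len(token["token"]), offset_span)
print(repr(original_slice))
```

The token is three characters long, while the offsets still cover the five-character input.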
Add to an analyzer
The following create index API request uses the trim filter to configure a new custom analyzer.
resp = client.indices.create(
    index="trim_example",
    settings={
        "analysis": {
            "analyzer": {
                "keyword_trim": {
                    "tokenizer": "keyword",
                    "filter": [
                        "trim"
                    ]
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'trim_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          keyword_trim: {
            tokenizer: 'keyword',
            filter: [
              'trim'
            ]
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "trim_example",
  settings: {
    analysis: {
      analyzer: {
        keyword_trim: {
          tokenizer: "keyword",
          filter: ["trim"],
        },
      },
    },
  },
});
console.log(response);
PUT trim_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "keyword_trim": {
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        }
      }
    }
  }
}
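Once the index exists, the analyzer can be tried out by passing its name to the analyze API. The following is a minimal sketch using the Python client; the helper name `analyze_with_keyword_trim` is hypothetical, and `client` is assumed to be an already connected `Elasticsearch` instance.

```python
def analyze_with_keyword_trim(client, text):
    """Run `text` through the keyword_trim analyzer of the trim_example index."""
    # `client` is assumed to be a connected Elasticsearch client.
    resp = client.indices.analyze(
        index="trim_example",
        analyzer="keyword_trim",
        text=text,
    )
    # Return just the token strings from the response body.
    return [token["token"] for token in resp["tokens"]]
```

Given the behavior shown in the earlier example, calling `analyze_with_keyword_trim(client, " fox ")` would be expected to return a single `fox` token.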