Pattern replace token filter
Uses a regular expression to match and replace token substrings.
The pattern_replace filter uses Java’s regular expression syntax. By default, the filter replaces matching substrings with an empty substring (""). Replacement substrings can use Java’s $g syntax, where g is a capture group number (for example, $1), to reference capture groups from the original token text.
A poorly written regular expression may run slowly or throw a StackOverflowError, causing the node running the expression to exit suddenly.
Read more about pathological regular expressions and how to avoid them.
This filter uses Lucene’s PatternReplaceFilter.
Example
The following analyze API request uses the pattern_replace filter to prepend watch to the substring dog in foxes jump lazy dogs.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_replace",
      "pattern": "(dog)",
      "replacement": "watch$1"
    }
  ],
  "text": "foxes jump lazy dogs"
}
The filter produces the following tokens.
[ foxes, jump, lazy, watchdogs ]
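The same substitution can be reproduced outside Elasticsearch with plain java.util.regex. The following is a minimal sketch, not the filter's implementation: the class PatternReplaceSketch and its analyze helper are hypothetical names, and splitting on whitespace only approximates the whitespace tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PatternReplaceSketch {

    // Hypothetical helper: split the text on whitespace, then replace
    // every match of the regex inside each resulting token.
    static List<String> analyze(String text, String regex, String replacement) {
        Pattern pattern = Pattern.compile(regex);
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // "$1" in the replacement refers to capture group 1 of the match.
            tokens.add(pattern.matcher(token).replaceAll(replacement));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [foxes, jump, lazy, watchdogs]
        System.out.println(analyze("foxes jump lazy dogs", "(dog)", "watch$1"));
    }
}
```

Only the token dogs contains a match, so the other tokens pass through unchanged.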
Configurable parameters
all
(Optional, Boolean) If true, all substrings matching the pattern parameter’s regular expression are replaced. If false, the filter replaces only the first matching substring in each token. Defaults to true.
pattern
(Required, string) Regular expression, written in Java’s regular expression syntax. The filter replaces token substrings matching this pattern with the substring in the replacement parameter.
replacement
(Optional, string) Replacement substring. Defaults to an empty substring ("").
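The effect of these parameters on a single token can be illustrated with java.util.regex. This is a sketch under the assumption that the all parameter corresponds to the difference between Matcher.replaceAll and Matcher.replaceFirst; the class AllParameterSketch and its replace helper are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllParameterSketch {

    // Hypothetical helper taking the same three settings as the filter.
    static String replace(String token, String regex, String replacement, boolean all) {
        Matcher matcher = Pattern.compile(regex).matcher(token);
        // all = true rewrites every match; all = false rewrites only the first.
        return all ? matcher.replaceAll(replacement) : matcher.replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("a-b-c", "-", "", true));   // prints abc
        System.out.println(replace("a-b-c", "-", "", false));  // prints ab-c
        System.out.println(replace("a-b-c", "-", "_", true));  // prints a_b_c
    }
}
```

An empty replacement, the default, simply deletes the matched substrings.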
Customize and add to an analyzer
To customize the pattern_replace filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
The following create index API request configures a new custom analyzer using a custom pattern_replace filter, my_pattern_replace_filter.
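The my_pattern_replace_filter filter uses the regular expression [£|€] to match and remove the currency symbols £ and €. Inside a character class, | is a literal character rather than an alternation operator, so this pattern also matches the | character itself; [£€] would match only the two currency symbols. The filter’s all parameter is false, meaning only the first matching symbol in each token is removed.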
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      },
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "[£|€]",
          "replacement": "",
          "all": false
        }
      }
    }
  }
}
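To see what this configuration does to a token, the pattern can be tried directly with java.util.regex. The sketch below is an approximation, not the analyzer itself: CurrencySymbolSketch and removeFirst are hypothetical names, the whole input string stands in for the single token that the keyword tokenizer emits, and the symbols are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.regex.Pattern;

public class CurrencySymbolSketch {

    static final String POUND = "\u00A3"; // pound sign
    static final String EURO = "\u20AC";  // euro sign

    // The pattern from the request above, and a variant without the pipe.
    static final String SOURCE_PATTERN = "[" + POUND + "|" + EURO + "]";
    static final String SYMBOLS_ONLY = "[" + POUND + EURO + "]";

    // Hypothetical helper: remove only the first match, as with "all": false.
    static String removeFirst(String token, String regex) {
        return Pattern.compile(regex).matcher(token).replaceFirst("");
    }

    public static void main(String[] args) {
        // Leading pound sign is removed, leaving 100.
        System.out.println(removeFirst(POUND + "100", SOURCE_PATTERN));
        // Only the first symbol is removed; the euro sign remains.
        System.out.println(removeFirst(POUND + EURO + "100", SOURCE_PATTERN));
        // The pipe is a literal inside the character class, so it is removed too.
        System.out.println(removeFirst("a|b", SOURCE_PATTERN));
        // Without the pipe in the class, the token is left unchanged.
        System.out.println(removeFirst("a|b", SYMBOLS_ONLY));
    }
}
```

The third call shows why a | inside square brackets is usually unintended when the goal is to match one of several characters.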