Word delimiter graph token filter
Splits tokens at non-alphanumeric characters. The word_delimiter_graph filter also performs optional token normalization based on a set of rules. By default, the filter uses the following rules:
- Split tokens at non-alphanumeric characters. The filter uses these characters as delimiters. For example: Super-Duper → Super, Duper
- Remove leading or trailing delimiters from each token. For example: XL---42+'Autocoder' → XL, 42, Autocoder
- Split tokens at letter case transitions. For example: PowerShot → Power, Shot
- Split tokens at letter-number transitions. For example: XL500 → XL, 500
- Remove the English possessive ('s) from the end of each token. For example: Neil's → Neil
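The default rules above can be approximated in a few lines of plain Python. This is only an illustrative sketch for ASCII input, not the Lucene implementation: the function name split_default and both regular expressions are assumptions made for this example, and the sketch ignores Unicode character classes, offsets, and positions.

```python
import re

def split_default(text: str) -> list[str]:
    """Approximate the default splitting rules for ASCII input (sketch only)."""
    # Drop an English possessive ('s) that follows a letter and ends a word part.
    text = re.sub(r"(?<=[A-Za-z])'[sS](?![A-Za-z0-9])", "", text)
    tokens = []
    # Split at runs of non-alphanumeric characters. Empty chunks left by
    # leading, trailing, or repeated delimiters yield no tokens.
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        # Within a chunk, break at lower-to-upper case changes and at
        # letter-number transitions.
        tokens.extend(re.findall(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+", chunk))
    return tokens

print(split_default("XL---42+'Autocoder'"))  # → ['XL', '42', 'Autocoder']
print(split_default("PowerShot"))            # → ['Power', 'Shot']
```

To see what the real filter produces for a given input, use the analyze API rather than relying on this sketch.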
The word_delimiter_graph filter uses Lucene’s WordDelimiterGraphFilter.
The word_delimiter_graph filter was designed to remove punctuation from complex identifiers, such as product IDs or part numbers. For these use cases, we recommend using the word_delimiter_graph filter with the keyword tokenizer.
Avoid using the word_delimiter_graph filter to split hyphenated words, such as wi-fi. Because users often search for these words both with and without hyphens, we recommend using the synonym_graph filter instead.
Example
The following analyze API request uses the word_delimiter_graph filter to split Neil's-Super-Duper-XL500--42+AutoCoder into normalized tokens using the filter’s default rules:
Python
resp = client.indices.analyze(
    tokenizer="keyword",
    filter=[
        "word_delimiter_graph"
    ],
    text="Neil's-Super-Duper-XL500--42+AutoCoder",
)
print(resp)
Ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'keyword',
    filter: [
      'word_delimiter_graph'
    ],
    text: "Neil's-Super-Duper-XL500--42+AutoCoder"
  }
)
puts response
JavaScript
const response = await client.indices.analyze({
  tokenizer: "keyword",
  filter: ["word_delimiter_graph"],
  text: "Neil's-Super-Duper-XL500--42+AutoCoder",
});
console.log(response);
Console
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
The filter produces the following tokens:
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
Add to an analyzer
The following create index API request uses the word_delimiter_graph filter to configure a new custom analyzer.
Python
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "keyword",
                    "filter": [
                        "word_delimiter_graph"
                    ]
                }
            }
        }
    },
)
print(resp)
Ruby
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'keyword',
            filter: [
              'word_delimiter_graph'
            ]
          }
        }
      }
    }
  }
)
puts response
JavaScript
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "keyword",
          filter: ["word_delimiter_graph"],
        },
      },
    },
  },
});
console.log(response);
Console
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}
Avoid using the word_delimiter_graph filter with tokenizers that remove punctuation, such as the standard tokenizer. This could prevent the word_delimiter_graph filter from splitting tokens correctly. It can also interfere with the filter’s configurable parameters, such as catenate_all or preserve_original. We recommend using the keyword or whitespace tokenizer instead.
Configurable parameters
adjust_offsets
(Optional, Boolean) If true, the filter adjusts the offsets of split or catenated tokens to better reflect their actual position in the token stream. Defaults to true.
Set adjust_offsets to false if your analyzer uses filters, such as the trim filter, that change the length of tokens without changing their offsets. Otherwise, the word_delimiter_graph filter could produce tokens with illegal offsets.
catenate_all
(Optional, Boolean) If true, the filter produces catenated tokens for chains of alphanumeric characters separated by non-alphabetic delimiters. For example: super-duper-xl-500 → [ superduperxl500, super, duper, xl, 500 ]. Defaults to false.
Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
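As a rough illustration of the flatten_graph advice, the sketch below builds index settings in which a catenating word_delimiter_graph filter is followed by flatten_graph. The function name and the names my_catenating_filter and my_index_analyzer are hypothetical, chosen only for this example.

```python
def index_settings_with_flatten(filter_name: str = "my_catenating_filter") -> dict:
    """Build analysis settings that flatten the token graph before indexing."""
    return {
        "analysis": {
            "filter": {
                # Custom filter with catenation enabled (hypothetical name).
                filter_name: {
                    "type": "word_delimiter_graph",
                    "catenate_all": True,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name. flatten_graph must come after
                # the word_delimiter_graph filter in the chain.
                "my_index_analyzer": {
                    "tokenizer": "keyword",
                    "filter": [filter_name, "flatten_graph"],
                }
            },
        }
    }

settings = index_settings_with_flatten()
print(settings["analysis"]["analyzer"]["my_index_analyzer"]["filter"])
```

The returned dictionary could be passed as the settings argument of client.indices.create, in the same way as the create index examples on this page.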
catenate_numbers
(Optional, Boolean) If true, the filter produces catenated tokens for chains of numeric characters separated by non-alphabetic delimiters. For example: 01-02-03 → [ 010203, 01, 02, 03 ]. Defaults to false.
Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
catenate_words
(Optional, Boolean) If true, the filter produces catenated tokens for chains of alphabetical characters separated by non-alphabetic delimiters. For example: super-duper-xl → [ superduperxl, super, duper, xl ]. Defaults to false.
Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the match_phrase query and other queries that rely on token position for matching. Avoid setting this parameter to true if you plan to use these queries.
generate_number_parts
(Optional, Boolean) If true, the filter includes tokens consisting of only numeric characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
generate_word_parts
(Optional, Boolean) If true, the filter includes tokens consisting of only alphabetical characters in the output. If false, the filter excludes these tokens from the output. Defaults to true.
ignore_keywords
(Optional, Boolean) If true, the filter skips tokens with a keyword attribute of true. Defaults to false.
preserve_original
(Optional, Boolean) If true, the filter includes the original version of any split tokens in the output. This original version includes non-alphanumeric delimiters. For example: super-duper-xl-500 → [ super-duper-xl-500, super, duper, xl, 500 ]. Defaults to false.
Setting this parameter to true produces multi-position tokens, which are not supported by indexing.
If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.
protected_words
(Optional, array of strings) Array of tokens the filter won’t split.
protected_words_path
(Optional, string) Path to a file that contains a list of tokens the filter won’t split.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each token in the file must be separated by a line break.
split_on_case_change
(Optional, Boolean) If true, the filter splits tokens at letter case transitions. For example: camelCase → [ camel, Case ]. Defaults to true.
split_on_numerics
(Optional, Boolean) If true, the filter splits tokens at letter-number transitions. For example: j2se → [ j, 2, se ]. Defaults to true.
stem_english_possessive
(Optional, Boolean) If true, the filter removes the English possessive ('s) from the end of each token. For example: O'Neil's → [ O, Neil ]. Defaults to true.
type_table
(Optional, array of strings) Array of custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
For example, the following array maps the plus (+) and hyphen (-) characters as alphanumeric, which means they won’t be treated as delimiters:
[ "+ => ALPHA", "- => ALPHA" ]
Supported types include:
- ALPHA (Alphabetical)
- ALPHANUM (Alphanumeric)
- DIGIT (Numeric)
- LOWER (Lowercase alphabetical)
- SUBWORD_DELIM (Non-alphanumeric delimiter)
- UPPER (Uppercase alphabetical)
type_table_path
(Optional, string) Path to a file that contains custom type mappings for characters. This allows you to map non-alphanumeric characters as numeric or alphanumeric to avoid splitting on those characters.
For example, the contents of this file may contain the following:
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT
# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see https://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
Supported types include:
- ALPHA (Alphabetical)
- ALPHANUM (Alphanumeric)
- DIGIT (Numeric)
- LOWER (Lowercase alphabetical)
- SUBWORD_DELIM (Non-alphanumeric delimiter)
- UPPER (Uppercase alphabetical)
This file path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each mapping in the file must be separated by a line break.
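To make the mapping format concrete, here is a small Python sketch that reads lines in the "character => TYPE" form into a dictionary. It is not the parser the filter uses: the function name parse_type_table is hypothetical, and accepting a Unicode escape with either one or two leading backslashes is an assumption made only because the sample above shows two.

```python
import re

SUPPORTED_TYPES = {"ALPHA", "ALPHANUM", "DIGIT", "LOWER", "SUBWORD_DELIM", "UPPER"}
# Assumption of this sketch: \uXXXX may be written with one or more backslashes.
_UNICODE_ESCAPE = re.compile(r"\\+u([0-9A-Fa-f]{4})")

def parse_type_table(lines):
    """Parse 'char => TYPE' lines, skipping blank lines and '#' comments."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        left, sep, right = line.partition("=>")
        if not sep:
            raise ValueError(f"missing '=>' in mapping: {raw!r}")
        # Decode a Unicode escape on the left-hand side, if present.
        char = _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), left.strip())
        type_name = right.strip()
        if len(char) != 1 or type_name not in SUPPORTED_TYPES:
            raise ValueError(f"invalid mapping: {raw!r}")
        mapping[char] = type_name
    return mapping

sample = [
    "# Map the $ and ',' characters to DIGIT",
    "$ => DIGIT",
    r"\\u002C => DIGIT",
    r"\\u200D => ALPHANUM",
]
print(parse_type_table(sample))
```

A check like this can catch a misspelled type name or a malformed line before the file is placed in the config location.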
Customize
To customize the word_delimiter_graph filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a word_delimiter_graph filter that uses the following rules:
- Split tokens at non-alphanumeric characters, except the hyphen (-) character.
- Remove leading or trailing delimiters from each token.
- Do not split tokens at letter case transitions.
- Do not split tokens at letter-number transitions.
- Remove the English possessive ('s) from the end of each token.
Python
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "keyword",
                    "filter": [
                        "my_custom_word_delimiter_graph_filter"
                    ]
                }
            },
            "filter": {
                "my_custom_word_delimiter_graph_filter": {
                    "type": "word_delimiter_graph",
                    "type_table": [
                        "- => ALPHA"
                    ],
                    "split_on_case_change": False,
                    "split_on_numerics": False,
                    "stem_english_possessive": True
                }
            }
        }
    },
)
print(resp)
Ruby
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'keyword',
            filter: [
              'my_custom_word_delimiter_graph_filter'
            ]
          }
        },
        filter: {
          my_custom_word_delimiter_graph_filter: {
            type: 'word_delimiter_graph',
            type_table: [
              '- => ALPHA'
            ],
            split_on_case_change: false,
            split_on_numerics: false,
            stem_english_possessive: true
          }
        }
      }
    }
  }
)
puts response
JavaScript
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "keyword",
          filter: ["my_custom_word_delimiter_graph_filter"],
        },
      },
      filter: {
        my_custom_word_delimiter_graph_filter: {
          type: "word_delimiter_graph",
          type_table: ["- => ALPHA"],
          split_on_case_change: false,
          split_on_numerics: false,
          stem_english_possessive: true,
        },
      },
    },
  },
});
console.log(response);
Console
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
Differences between word_delimiter_graph and word_delimiter
Both the word_delimiter_graph and word_delimiter filters produce tokens that span multiple positions when any of the following parameters are true:
- catenate_all
- catenate_numbers
- catenate_words
- preserve_original
However, only the word_delimiter_graph filter assigns multi-position tokens a positionLength attribute, which indicates the number of positions a token spans. This ensures the word_delimiter_graph filter always produces valid token graphs.
The word_delimiter filter does not assign multi-position tokens a positionLength attribute. This means it produces invalid graphs for streams including these tokens.
While indexing does not support token graphs containing multi-position tokens, queries, such as the match_phrase query, can use these graphs to generate multiple sub-queries from a single query string.
To see how token graphs produced by the word_delimiter and word_delimiter_graph filters differ, check out the following example.
Example
Basic token graph
Both the word_delimiter and word_delimiter_graph filters produce the following token graph for PowerShot2000 when the following parameters are false:
- catenate_all
- catenate_numbers
- catenate_words
- preserve_original
This graph does not contain multi-position tokens. All tokens span only one position.
word_delimiter_graph graph with a multi-position token
The word_delimiter_graph filter produces the following token graph for PowerShot2000 when catenate_words is true.
This graph correctly indicates the catenated PowerShot token spans two positions.
word_delimiter graph with a multi-position token
When catenate_words is true, the word_delimiter filter produces the following token graph for PowerShot2000.
Note that the catenated PowerShot token should span two positions but only spans one in the token graph, making it invalid.
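The effect of the missing positionLength can be sketched by enumerating the paths through each graph. The token positions below are written by hand from the description above, not read from Elasticsearch, and the Token tuple and graph_paths function are hypothetical helpers for this illustration.

```python
from collections import namedtuple

# position: where the token starts; position_length: how many positions it spans.
Token = namedtuple("Token", "term position position_length")

def graph_paths(tokens):
    """Enumerate every term sequence that walks the graph from start to end."""
    end = max(t.position + t.position_length for t in tokens)
    by_start = {}
    for t in tokens:
        by_start.setdefault(t.position, []).append(t)

    def walk(position):
        if position == end:
            return [[]]
        paths = []
        for t in by_start.get(position, []):
            for rest in walk(t.position + t.position_length):
                paths.append([t.term] + rest)
        return paths

    return walk(0)

# word_delimiter_graph: the catenated token spans two positions.
graph_tokens = [Token("PowerShot", 0, 2), Token("Power", 0, 1),
                Token("Shot", 1, 1), Token("2000", 2, 1)]
# word_delimiter: every token spans one position.
legacy_tokens = [Token("PowerShot", 0, 1), Token("Power", 0, 1),
                 Token("Shot", 1, 1), Token("2000", 2, 1)]

print(graph_paths(graph_tokens))
# → [['PowerShot', '2000'], ['Power', 'Shot', '2000']]
print(graph_paths(legacy_tokens))
# → [['PowerShot', 'Shot', '2000'], ['Power', 'Shot', '2000']]
```

In the second result, the path PowerShot, Shot, 2000 repeats Shot after the catenated token, which is the kind of sequence an invalid graph can yield.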