Delimited payload token filter
The older name delimited_payload_filter
is deprecated and should not be used with new indices. Use delimited_payload
instead.
Separates a token stream into tokens and payloads based on a specified delimiter.
For example, you can use the delimited_payload filter with a | delimiter to split the|1 quick|2 fox|3 into the tokens the, quick, and fox with respective payloads of 1, 2, and 3.
This filter uses Lucene’s DelimitedPayloadTokenFilter.
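The splitting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Lucene implementation: the helper name split_delimited is hypothetical, whitespace tokenization is assumed, and the sketch assumes that each token is split at the first occurrence of the delimiter.

```python
def split_delimited(text, delimiter="|"):
    """Illustrative only: whitespace-tokenize, then split each token
    into (term, payload) at the first occurrence of the delimiter."""
    pairs = []
    for raw in text.split():
        term, sep, payload = raw.partition(delimiter)
        # A token without a delimiter carries no payload.
        pairs.append((term, payload if sep else None))
    return pairs


print(split_delimited("the|1 quick|2 fox|3"))
```

The payloads here are still raw strings. In the real filter they are converted to bytes according to the encoding parameter.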
Payloads
A payload is user-defined binary data associated with a token position and stored as base64-encoded bytes.
Elasticsearch does not store token payloads by default. To store payloads, you must:
- Set the term_vector mapping parameter to with_positions_payloads or with_positions_offsets_payloads for any field storing payloads.
- Use an index analyzer that includes the delimited_payload filter.
You can view stored payloads using the term vectors API.
Example
The following analyze API request uses the delimited_payload filter with the default | delimiter to split the|0 brown|10 fox|5 is|0 quick|10 into tokens and payloads.
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["delimited_payload"],
  "text": "the|0 brown|10 fox|5 is|0 quick|10"
}
The filter produces the following tokens:
[ the, brown, fox, is, quick ]
Note that the analyze API does not return stored payloads. For an example that includes returned payloads, see Return stored payloads.
Add to an analyzer
The following create index API request uses the delimited_payload filter to configure a new custom analyzer.
PUT delimited_payload
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_delimited_payload": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
Configurable parameters
delimiter
(Optional, string) Character used to separate tokens from payloads. Defaults to |.
encoding
(Optional, string) Data type for the stored payload. Valid values are:
- float: (Default) Float
- identity: Characters
- int: Integer
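To make the three encoding values concrete, here is a minimal Python sketch of how a payload string might become bytes under each one. The function name encode_payload is hypothetical. The byte layouts are assumptions: big-endian 4-byte values for float and int, and UTF-8 bytes for identity. The float layout does agree with the base64 payloads shown in the term vectors response on this page.

```python
import base64
import struct


def encode_payload(value, encoding="float"):
    """Illustrative only: turn the text after the delimiter into bytes.
    Assumes big-endian 4-byte float/int and UTF-8 for identity."""
    if encoding == "float":
        return struct.pack(">f", float(value))
    if encoding == "int":
        return struct.pack(">i", int(value))
    if encoding == "identity":
        return value.encode("utf-8")
    raise ValueError(f"unsupported encoding: {encoding}")


# Base64 is how the term vectors API displays payload bytes.
print(base64.b64encode(encode_payload("3", "float")).decode("ascii"))
print(encode_payload("10", "int"))
print(encode_payload("noun", "identity"))
```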
Customize and add to an analyzer
To customize the delimited_payload filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following create index API request uses a custom delimited_payload filter to configure a new custom analyzer. The custom delimited_payload filter uses the + delimiter to separate tokens from payloads. Payloads are encoded as integers.
PUT delimited_payload_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_plus_delimited": {
          "tokenizer": "whitespace",
          "filter": [ "plus_delimited" ]
        }
      },
      "filter": {
        "plus_delimited": {
          "type": "delimited_payload",
          "delimiter": "+",
          "encoding": "int"
        }
      }
    }
  }
}
Return stored payloads
Use the create index API to create an index that:
- Includes a field that stores term vectors with payloads.
- Uses a custom index analyzer with the delimited_payload filter.
PUT text_payloads
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_payloads",
        "analyzer": "payload_delimiter"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_delimiter": {
          "tokenizer": "whitespace",
          "filter": [ "delimited_payload" ]
        }
      }
    }
  }
}
Add a document containing payloads to the index.
POST text_payloads/_doc/1
{
  "text": "the|0 brown|3 fox|4 is|0 quick|10"
}
Use the term vectors API to return the document’s tokens and base64-encoded payloads.
GET text_payloads/_termvectors/1
{
  "fields": [ "text" ],
  "payloads": true
}
The API returns the following response:
{
  "_index": "text_payloads",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 8,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "sum_doc_freq": 5,
        "doc_count": 1,
        "sum_ttf": 5
      },
      "terms": {
        "brown": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "payload": "QEAAAA=="
            }
          ]
        },
        "fox": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 2,
              "payload": "QIAAAA=="
            }
          ]
        },
        "is": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 3,
              "payload": "AAAAAA=="
            }
          ]
        },
        "quick": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 4,
              "payload": "QSAAAA=="
            }
          ]
        },
        "the": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "payload": "AAAAAA=="
            }
          ]
        }
      }
    }
  }
}
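The base64 strings in the response are not human-readable, so it can help to decode them back into numbers. The following Python sketch does that for the payloads above. It assumes the default float encoding and a big-endian 4-byte layout, which is consistent with the values in this response. The helper name decode_float_payload is hypothetical.

```python
import base64
import struct


def decode_float_payload(b64_payload):
    """Decode a base64 payload string into a float, assuming the
    default float encoding stored as 4 big-endian bytes."""
    raw = base64.b64decode(b64_payload)
    return struct.unpack(">f", raw)[0]


# Payloads copied from the term vectors response above.
payloads = {
    "the": "AAAAAA==",
    "brown": "QEAAAA==",
    "fox": "QIAAAA==",
    "is": "AAAAAA==",
    "quick": "QSAAAA==",
}

decoded = {term: decode_float_payload(p) for term, p in payloads.items()}
print(decoded)
# {'the': 0.0, 'brown': 3.0, 'fox': 4.0, 'is': 0.0, 'quick': 10.0}
```

The decoded values match the payloads in the indexed text the|0 brown|3 fox|4 is|0 quick|10. A field indexed with the int encoding would need a different unpack format.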