Fingerprint token filter
Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.
For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:
- Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ]
- Removes a duplicate instance of the very token
- Concatenates the token stream to a single output token: [ fox quick the very was ]
Output tokens produced by this filter are useful for fingerprinting and clustering a body of text as described in the OpenRefine project.
This filter uses Lucene’s FingerprintFilter.
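The sort, de-duplicate, and concatenate behavior can be sketched in Python. This is a simplified model of the steps above, not Lucene's actual FingerprintFilter implementation:

```python
def fingerprint(tokens, separator=" "):
    """Sort tokens alphabetically, drop duplicates, and join into one output token."""
    unique_sorted = sorted(set(tokens))
    return separator.join(unique_sorted)

print(fingerprint(["the", "fox", "was", "very", "very", "quick"]))
# fox quick the very was
```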
Example
The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:
GET _analyze
{
"tokenizer" : "whitespace",
"filter" : ["fingerprint"],
"text" : "zebra jumps over resting resting dog"
}
The filter produces the following token:
[ dog jumps over resting zebra ]
Add to an analyzer
The following create index API request uses the fingerprint
filter to configure a new custom analyzer.
PUT fingerprint_example
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_fingerprint": {
"tokenizer": "whitespace",
"filter": [ "fingerprint" ]
}
}
}
}
}
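To check the new analyzer, you can run an analyze API request against the index. The text value here is illustrative:

GET fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}

This should produce the same single token as the earlier example: [ dog jumps over resting zebra ]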
Configurable parameters
max_output_size
(Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. If the concatenated token exceeds this length, the filter outputs no token at all.
separator
(Optional, string) Character to use to concatenate the token stream input. Defaults to a space.
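The effect of both parameters can be modeled with a short Python sketch. This simplification of the behavior described above (not the Lucene implementation) shows that an over-long result yields no token rather than a truncated one:

```python
def fingerprint(tokens, max_output_size=255, separator=" "):
    """Sort, de-duplicate, and join tokens; emit nothing if the result is too long."""
    output = separator.join(sorted(set(tokens)))
    # A concatenated token longer than max_output_size produces
    # no output token, not a truncated token.
    return output if len(output) <= max_output_size else None

print(fingerprint(["b", "a", "a"], separator="+"))  # a+b
print(fingerprint(["b", "a"], max_output_size=2))   # None ("a b" is 3 characters)
```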
Customize
To customize the fingerprint
filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom fingerprint filter that uses + to concatenate token streams. The filter also limits output tokens to 100 characters or fewer.
PUT custom_fingerprint_example
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_": {
"tokenizer": "whitespace",
"filter": [ "fingerprint_plus_concat" ]
}
},
"filter": {
"fingerprint_plus_concat": {
"type": "fingerprint",
"max_output_size": 100,
"separator": "+"
}
}
}
}
}
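You can verify the custom filter with an analyze API request that references it by name. The text value here is illustrative:

GET custom_fingerprint_example/_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "fingerprint_plus_concat" ],
  "text": "zebra jumps over resting resting dog"
}

Because the custom filter joins tokens with +, this should produce: [ dog+jumps+over+resting+zebra ]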