Fingerprint token filter

Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.

For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:

  1. Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ].
  2. Removes the duplicate instance of the very token.
  3. Concatenates the token stream into a single output token: [ fox quick the very was ]
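
The three steps above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not Lucene's implementation: the function name fingerprint is hypothetical, and Python's default string ordering is assumed to match the filter's sort order for simple lowercase ASCII tokens.

```python
def fingerprint(tokens):
    """Sort, de-duplicate, and join tokens into one fingerprint string.

    Illustrative sketch only; `fingerprint` is a hypothetical helper,
    not part of Elasticsearch or Lucene.
    """
    # set() drops duplicate tokens; sorted() orders the survivors.
    unique_sorted = sorted(set(tokens))
    # Join with a single space, the separator the steps above use.
    return " ".join(unique_sorted)


print(fingerprint(["the", "fox", "was", "very", "very", "quick"]))
# prints: fox quick the very was
```

Because the output no longer depends on word order or repetition, two texts that contain the same set of words produce the same fingerprint.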

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text as described in the OpenRefine project.

This filter uses Lucene’s FingerprintFilter.

Example

The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:

  GET _analyze
  {
    "tokenizer" : "whitespace",
    "filter" : ["fingerprint"],
    "text" : "zebra jumps over resting resting dog"
  }

The filter produces the following token:

  [ dog jumps over resting zebra ]

Add to an analyzer

The following create index API request uses the fingerprint filter to configure a new custom analyzer.

  PUT fingerprint_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "whitespace_fingerprint": {
            "tokenizer": "whitespace",
            "filter": [ "fingerprint" ]
          }
        }
      }
    }
  }

Configurable parameters

max_output_size

(Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. If the concatenated token is longer than this, the filter produces no output token.

separator

(Optional, string) Character to use to concatenate the token stream input. Defaults to a space.
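
To make the effect of the two parameters concrete, the following Python sketch models them as described above. It is an assumption-laden illustration rather than the filter's actual code: the function name fingerprint_with_limits is hypothetical, and the length check simply compares the joined string's length against max_output_size.

```python
from typing import Iterable, Optional


def fingerprint_with_limits(
    tokens: Iterable[str],
    separator: str = " ",
    max_output_size: int = 255,
) -> Optional[str]:
    """Model the separator and max_output_size parameters.

    Hypothetical helper for illustration; returns None where the
    filter would emit no token at all.
    """
    # Sort and de-duplicate, then join with the configured separator.
    joined = separator.join(sorted(set(tokens)))
    # Output longer than the limit is dropped entirely, not truncated.
    if len(joined) > max_output_size:
        return None
    return joined


words = "zebra jumps over resting resting dog".split()
print(fingerprint_with_limits(words, separator="+", max_output_size=100))
# prints: dog+jumps+over+resting+zebra
print(fingerprint_with_limits(words, max_output_size=10))
# prints: None
```

The second call shows the point most easily missed: a fingerprint that exceeds the limit is discarded rather than shortened.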

Customize

To customize the fingerprint filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom fingerprint filter that uses + to concatenate token streams. The filter also limits output tokens to 100 characters or fewer.

  PUT custom_fingerprint_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "whitespace_": {
            "tokenizer": "whitespace",
            "filter": [ "fingerprint_plus_concat" ]
          }
        },
        "filter": {
          "fingerprint_plus_concat": {
            "type": "fingerprint",
            "max_output_size": 100,
            "separator": "+"
          }
        }
      }
    }
  }