Fingerprint analyzer

The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted, deduplicated, and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
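The process can be sketched in plain Python. This is an illustrative approximation, not the analyzer's actual implementation; the fingerprint function name and the regex-based tokenization are assumptions for the sketch:

```python
import re
import unicodedata

def fingerprint(text, separator=" ", stopwords=frozenset()):
    # Lowercase and fold extended characters (e.g. "ö") to their ASCII base
    normalized = unicodedata.normalize("NFKD", text.lower())
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Split into word tokens and drop any configured stop words
    tokens = re.findall(r"\w+", ascii_text)
    # Sort, deduplicate, and concatenate into a single token
    unique = sorted(set(t for t in tokens if t not in stopwords))
    return separator.join(unique)

print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# → and consistent godel is said sentence this yes
```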

Example output

Python:

  resp = client.indices.analyze(
      analyzer="fingerprint",
      text="Yes yes, Gödel said this sentence is consistent and.",
  )
  print(resp)

Ruby:

  response = client.indices.analyze(
    body: {
      analyzer: 'fingerprint',
      text: 'Yes yes, Gödel said this sentence is consistent and.'
    }
  )
  puts response

JavaScript:

  const response = await client.indices.analyze({
    analyzer: "fingerprint",
    text: "Yes yes, Gödel said this sentence is consistent and.",
  });
  console.log(response);

Console:

  POST _analyze
  {
    "analyzer": "fingerprint",
    "text": "Yes yes, Gödel said this sentence is consistent and."
  }

The above sentence would produce the following single term:

  [ and consistent godel is said sentence this yes ]

Configuration

The fingerprint analyzer accepts the following parameters:

separator

The character to use to concatenate the terms. Defaults to a space.

max_output_size

The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.
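To illustrate the separator and max_output_size parameters together, the following Python-client sketch (the index name my-index-000002 and the "+" separator are hypothetical choices, and a local cluster is assumed) defines an analyzer that joins terms with "+" and discards any fingerprint longer than 100 characters:

```python
from elasticsearch import Elasticsearch

# Assumes a locally running cluster, as in the examples above
client = Elasticsearch("http://localhost:9200")

resp = client.indices.create(
    index="my-index-000002",  # hypothetical index name
    settings={
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "type": "fingerprint",
                    "separator": "+",        # join terms with "+" instead of a space
                    "max_output_size": 100,  # discard fingerprints over 100 chars
                }
            }
        }
    },
)
print(resp)
```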

Example configuration

In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:

Python:

  resp = client.indices.create(
      index="my-index-000001",
      settings={
          "analysis": {
              "analyzer": {
                  "my_fingerprint_analyzer": {
                      "type": "fingerprint",
                      "stopwords": "_english_"
                  }
              }
          }
      },
  )
  print(resp)

  resp1 = client.indices.analyze(
      index="my-index-000001",
      analyzer="my_fingerprint_analyzer",
      text="Yes yes, Gödel said this sentence is consistent and.",
  )
  print(resp1)

Ruby:

  response = client.indices.create(
    index: 'my-index-000001',
    body: {
      settings: {
        analysis: {
          analyzer: {
            my_fingerprint_analyzer: {
              type: 'fingerprint',
              stopwords: '_english_'
            }
          }
        }
      }
    }
  )
  puts response

  response = client.indices.analyze(
    index: 'my-index-000001',
    body: {
      analyzer: 'my_fingerprint_analyzer',
      text: 'Yes yes, Gödel said this sentence is consistent and.'
    }
  )
  puts response

JavaScript:

  const response = await client.indices.create({
    index: "my-index-000001",
    settings: {
      analysis: {
        analyzer: {
          my_fingerprint_analyzer: {
            type: "fingerprint",
            stopwords: "_english_",
          },
        },
      },
    },
  });
  console.log(response);

  const response1 = await client.indices.analyze({
    index: "my-index-000001",
    analyzer: "my_fingerprint_analyzer",
    text: "Yes yes, Gödel said this sentence is consistent and.",
  });
  console.log(response1);

Console:

  PUT my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_fingerprint_analyzer": {
            "type": "fingerprint",
            "stopwords": "_english_"
          }
        }
      }
    }
  }

  POST my-index-000001/_analyze
  {
    "analyzer": "my_fingerprint_analyzer",
    "text": "Yes yes, Gödel said this sentence is consistent and."
  }

The above example produces the following term:

  [ consistent godel said sentence yes ]

Definition

The fingerprint analyzer consists of:

Tokenizer

  Standard Tokenizer

Token Filters (in order)

  Lower Case Token Filter
  ASCII Folding Token Filter
  Stop Token Filter (disabled by default)
  Fingerprint Token Filter

If you need to customize the fingerprint analyzer beyond the configuration parameters, you must recreate it as a custom analyzer and modify it, usually by adding token filters. The following example recreates the built-in fingerprint analyzer, which you can use as a starting point for further customization:

Python:

  resp = client.indices.create(
      index="fingerprint_example",
      settings={
          "analysis": {
              "analyzer": {
                  "rebuilt_fingerprint": {
                      "tokenizer": "standard",
                      "filter": [
                          "lowercase",
                          "asciifolding",
                          "fingerprint"
                      ]
                  }
              }
          }
      },
  )
  print(resp)

Ruby:

  response = client.indices.create(
    index: 'fingerprint_example',
    body: {
      settings: {
        analysis: {
          analyzer: {
            rebuilt_fingerprint: {
              tokenizer: 'standard',
              filter: [
                'lowercase',
                'asciifolding',
                'fingerprint'
              ]
            }
          }
        }
      }
    }
  )
  puts response

JavaScript:

  const response = await client.indices.create({
    index: "fingerprint_example",
    settings: {
      analysis: {
        analyzer: {
          rebuilt_fingerprint: {
            tokenizer: "standard",
            filter: ["lowercase", "asciifolding", "fingerprint"],
          },
        },
      },
    },
  });
  console.log(response);

Console:

  PUT /fingerprint_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "rebuilt_fingerprint": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "asciifolding",
              "fingerprint"
            ]
          }
        }
      }
    }
  }