Min hash token filter

The min_hash token filter uses a MinHash approximation algorithm to generate hashes for a set of tokens (typically produced by an analyzed field). These hashes are useful for estimating the similarity between documents.
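
Conceptually, MinHash reduces each document's token set to a small, fixed-length signature and compares signatures instead of the full sets. The following standalone Python sketch illustrates that idea with its own illustrative tokenization and hashing scheme; it is not the filter's internal implementation:

```python
# Minimal, self-contained sketch of the MinHash idea (illustrative only,
# not the min_hash filter's internal implementation). Each token set is
# reduced to a fixed-length signature of per-seed minimum hash values; the
# fraction of signature positions on which two documents agree approximates
# the Jaccard similarity of their token sets.
import hashlib

def minhash_signature(tokens, num_hashes=128):
    signature = []
    for seed in range(num_hashes):
        # Simulate a family of hash functions by prefixing each token with a seed.
        signature.append(min(
            int(hashlib.sha1(f"{seed}:{token}".encode()).hexdigest(), 16)
            for token in tokens
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

doc_a = {"opensearch", "is", "a", "powerful", "search", "engine"}
doc_b = {"opensearch", "is", "a", "very", "powerful", "search", "engine"}
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))
```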

Parameters

The min_hash token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `hash_count` | Optional | Integer | The number of hash values to generate for each token. Increasing this value generally improves the accuracy of similarity estimation but increases the computational cost. Default is `1`. |
| `bucket_count` | Optional | Integer | The number of hash buckets to use. This affects the granularity of the hashing. A larger number of buckets provides finer granularity and reduces hash collisions but requires more memory. Default is `512`. |
| `hash_set_size` | Optional | Integer | The number of hashes to retain in each bucket. This can influence the hashing quality. Larger set sizes may lead to better similarity detection but consume more memory. Default is `1`. |
| `with_rotation` | Optional | Boolean | When set to `true`, the filter populates empty buckets with the value from the first non-empty bucket found to its circular right, provided that `hash_set_size` is `1` (see the sketch following this table). Defaults to `true` if `bucket_count` is greater than `1`; otherwise, defaults to `false`. |
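
The rotation behavior described for `with_rotation` can be pictured with a small Python sketch (illustrative only; `None` stands in for an empty bucket, and this is not the filter's internal implementation):

```python
def fill_with_rotation(buckets):
    """Sketch of with_rotation: each empty bucket (None) is filled with the
    value of the first non-empty bucket to its circular right."""
    n = len(buckets)
    filled = list(buckets)
    for i, value in enumerate(buckets):
        if value is None:
            # Scan to the circular right for the first non-empty bucket.
            for offset in range(1, n):
                candidate = buckets[(i + offset) % n]
                if candidate is not None:
                    filled[i] = candidate
                    break
    return filled

print(fill_with_rotation([None, "h1", None, None, "h2"]))
# ['h1', 'h1', 'h2', 'h2', 'h2']
```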

Example

The following example request creates a new index named minhash_index and configures an analyzer with a min_hash filter:

```json
PUT /minhash_index
{
  "settings": {
    "analysis": {
      "filter": {
        "minhash_filter": {
          "type": "min_hash",
          "hash_count": 3,
          "bucket_count": 512,
          "hash_set_size": 1,
          "with_rotation": false
        }
      },
      "analyzer": {
        "minhash_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "minhash_filter"
          ]
        }
      }
    }
  }
}
```
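
Beyond ad hoc `_analyze` calls, you would typically assign the analyzer to a text field in the index mapping so that documents are min-hashed at index time. The following request is a sketch of that step; the field name `content` is a hypothetical example and not part of the original configuration:

```json
PUT /minhash_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "minhash_analyzer"
    }
  }
}
```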


Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /minhash_index/_analyze
{
  "analyzer": "minhash_analyzer",
  "text": "OpenSearch is very powerful."
}
```


The response contains the generated tokens (the tokens are not human readable because they represent hashes):

```json
{
  "tokens" : [
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    {
      "token" : "\u0000\u0000㳠锯ੲ걌䐩䉵",
      "start_offset" : 0,
      "end_offset" : 27,
      "type" : "MIN_HASH",
      "position" : 0
    },
    ...
```

To demonstrate the usefulness of the min_hash token filter, you can use the following Python script to compare two strings using the previously created analyzer:

```python
from opensearchpy import OpenSearch

# Initialize the OpenSearch client with authentication
host = 'https://localhost:9200'  # Update if using a different host/port
auth = ('admin', 'admin')        # Username and password

# Create the OpenSearch client with SSL verification turned off
client = OpenSearch(
    hosts=[host],
    http_auth=auth,
    use_ssl=True,
    verify_certs=False,   # Disable SSL certificate validation
    ssl_show_warn=False   # Suppress SSL warnings in the output
)

# Analyzes text and returns the MinHash tokens
def analyze_text(index, text):
    response = client.indices.analyze(
        index=index,
        body={
            "analyzer": "minhash_analyzer",
            "text": text
        }
    )
    return [token['token'] for token in response['tokens']]

# Analyze two similar texts
tokens_1 = analyze_text('minhash_index', 'OpenSearch is a powerful search engine.')
tokens_2 = analyze_text('minhash_index', 'OpenSearch is a very powerful search engine.')

# Calculate the Jaccard similarity of the two token sets
set_1 = set(tokens_1)
set_2 = set(tokens_2)
shared_tokens = set_1.intersection(set_2)
jaccard_similarity = len(shared_tokens) / len(set_1.union(set_2))

print(f"Jaccard Similarity: {jaccard_similarity}")
```

The output contains the Jaccard similarity score:

```
Jaccard Similarity: 0.8571428571428571
```
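
For reference, the Jaccard similarity computed by the script is the ratio of shared tokens to the total number of distinct tokens across both sets: J(A, B) = |A ∩ B| / |A ∪ B|. A score of 1.0 means the two MinHash token sets are identical, and a score near 0 means they share almost nothing; the value above reflects the close overlap between the two example sentences.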