Ingest-attachment plugin

The ingest-attachment plugin enables OpenSearch to extract content and other information from files using Apache Tika, a text extraction library. Supported document formats include PPT, PDF, RTF, ODF, and many more (see Supported Document Formats in the Tika documentation).

The input field must be a base64-encoded binary.

Installing the plugin

Install the ingest-attachment plugin using the following command:

  ./bin/opensearch-plugin install ingest-attachment

Attachment processor options

Name | Required | Default | Description
--- | --- | --- | ---
field | Yes | N/A | The field from which to get the base64-encoded binary.
target_field | No | attachment | The field that stores the attachment information.
properties | No | All properties | An array of properties that should be stored. Can be content, language, date, title, author, keywords, content_type, or content_length.
indexed_chars | No | 100_000 | The number of characters used for extraction to prevent fields from becoming too large. Use -1 for no limit.
indexed_chars_field | No | null | The name of a document field whose value overrides the number of characters used for extraction, for example, indexed_chars.
ignore_missing | No | false | When true, the processor exits without modifying the document when the specified field does not exist.
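
As a rough illustration of how these options fit together, the following Node.js sketch assembles a pipeline body from the options listed above. The helper name buildAttachmentPipeline and its validation behavior are assumptions made for this example and are not part of the plugin; only the option names and the allowed property values come from the table.

```javascript
// Hypothetical helper (not part of the plugin): builds the request body for
// PUT _ingest/pipeline/<name> from the documented attachment processor options.
const ALLOWED_PROPERTIES = [
  "content", "language", "date", "title",
  "author", "keywords", "content_type", "content_length",
];

const OPTIONAL_KEYS = [
  "target_field", "properties", "indexed_chars",
  "indexed_chars_field", "ignore_missing",
];

function buildAttachmentPipeline(options, description = "Extract attachment information") {
  // "field" is the only required option.
  if (typeof options.field !== "string" || options.field === "") {
    throw new Error("The field option is required");
  }
  for (const property of options.properties ?? []) {
    if (!ALLOWED_PROPERTIES.includes(property)) {
      throw new Error(`Unknown property: ${property}`);
    }
  }
  // Copy only the options that were supplied so that OpenSearch applies
  // its own defaults for everything else.
  const attachment = { field: options.field };
  for (const key of OPTIONAL_KEYS) {
    if (options[key] !== undefined) {
      attachment[key] = options[key];
    }
  }
  return { description, processors: [{ attachment }] };
}

console.log(JSON.stringify(buildAttachmentPipeline({ field: "data", ignore_missing: true }), null, 2));
```

Leaving unset options out of the body, rather than writing the defaults explicitly, keeps the pipeline definition short and avoids hard-coding values that the table already documents.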

Example

The following steps show you how to get started with the ingest-attachment plugin.

Step 1: Create an index for storing your attachments

The following command creates an index for storing your attachments:

  PUT /example-attachment-index
  {
    "mappings": {
      "properties": {}
    }
  }

Step 2: Create a pipeline

The following command creates a pipeline containing the attachment processor:

  PUT _ingest/pipeline/attachment
  {
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
          "field" : "data"
        }
      }
    ]
  }

Step 3: Store an attachment

Convert the attachment to a base64 string so that you can pass it as data. In this example, the base64 command converts the file lorem.rtf:

  base64 lorem.rtf

Alternatively, you can use Node.js to read the file as a base64 string, as shown in the following commands:

  import * as fs from "node:fs/promises";
  import path from "node:path";
  // import.meta.dirname requires Node.js 20.11 or later.
  const filePath = path.join(import.meta.dirname, "lorem.rtf");
  // Reading with the base64 encoding returns the file contents as a base64 string.
  const base64File = await fs.readFile(filePath, { encoding: "base64" });
  console.log(base64File);

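The .rtf file contains the text Lorem ipsum dolor sit amet, which is encoded as the following base64 string:

  e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=

The following request indexes the base64 string using the attachment pipeline: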

  PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
  {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }
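
To show where the base64 string comes from, the following Node.js sketch encodes the raw RTF source and wraps it in the request body. The RTF source in the code was obtained by decoding the string above; the helper name toAttachmentDocument is hypothetical, and sending the request is left to whichever client you use.

```javascript
// RTF source of lorem.rtf, obtained by decoding the base64 string above.
// Backslashes are escaped for JavaScript; the file uses CRLF line endings.
const rtfSource = "{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }";

// Hypothetical helper: wraps file bytes in the document shape that the
// pipeline expects, with the base64 string in the "data" field.
function toAttachmentDocument(bytes) {
  return { data: Buffer.from(bytes).toString("base64") };
}

const attachmentDocument = toAttachmentDocument(Buffer.from(rtfSource, "latin1"));
console.log(JSON.stringify(attachmentDocument));
```

The base64 string encodes the entire file, including the RTF markup, not only the visible text.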

Query results

With the attachment processed, you can now search through the data using search queries, as shown in the following example:

  POST example-attachment-index/_search
  {
    "query": {
      "match": {
        "attachment.content": "ipsum"
      }
    }
  }

OpenSearch responds with the following:

  {
    "took": 5,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 1,
        "relation": "eq"
      },
      "max_score": 1.1724279,
      "hits": [
        {
          "_index": "example-attachment-index",
          "_id": "lorem_rtf",
          "_score": 1.1724279,
          "_source": {
            "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
            "attachment": {
              "content_type": "application/rtf",
              "language": "pt",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
            }
          }
        }
      ]
    }
  }
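
As a small sketch of how a client might consume this response, the following Node.js code reads the extracted fields from each hit. The searchResponse object is a trimmed copy of the response above, and the helper name summarizeHits is an assumption made for this example.

```javascript
// Trimmed copy of the search response shown above.
const searchResponse = {
  hits: {
    total: { value: 1, relation: "eq" },
    hits: [
      {
        _index: "example-attachment-index",
        _id: "lorem_rtf",
        _score: 1.1724279,
        _source: {
          attachment: {
            content_type: "application/rtf",
            language: "pt",
            content: "Lorem ipsum dolor sit amet",
            content_length: 28,
          },
        },
      },
    ],
  },
};

// Hypothetical helper: returns one summary object per hit. Optional chaining
// guards against documents that were indexed without the pipeline.
function summarizeHits(response) {
  return response.hits.hits.map((hit) => ({
    id: hit._id,
    contentType: hit._source.attachment?.content_type,
    content: hit._source.attachment?.content,
  }));
}

console.log(summarizeHits(searchResponse));
```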

Extracted information

The following fields can be extracted using the plugin:

  • content
  • language
  • date
  • title
  • author
  • keywords
  • content_type
  • content_length

To extract only a subset of these fields, define them in the properties of the pipeline processor, as shown in the following example:

  PUT _ingest/pipeline/attachment
  {
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
          "field" : "data",
          "properties": ["content", "title", "author"]
        }
      }
    ]
  }
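
To make the effect of properties concrete, the following Node.js sketch filters the fields extracted for lorem.rtf down to the requested subset. It is an illustration only and not the plugin's code: the helper name selectProperties is hypothetical, and the sketch assumes that a requested property with no value in the file is simply left out, as the search response earlier in this example suggests, since it contains no title or author.

```javascript
// Fields extracted for lorem.rtf when no "properties" filter is configured
// (copied from the search response earlier in this example).
const extractedFields = {
  content_type: "application/rtf",
  language: "pt",
  content: "Lorem ipsum dolor sit amet",
  content_length: 28,
};

// Illustrative only: keeps the requested properties that have a value.
function selectProperties(fields, properties) {
  return Object.fromEntries(
    Object.entries(fields).filter(([name]) => properties.includes(name))
  );
}

// lorem.rtf has no title or author in the earlier response, so only content remains.
console.log(selectProperties(extractedFields, ["content", "title", "author"]));
```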

Limit the extracted content

To prevent the processor from extracting too many characters and exhausting node memory, extraction is limited to 100_000 characters by default. You can change this limit using the indexed_chars setting. For example, you can use -1 for unlimited characters, but make sure that your OpenSearch node has enough heap space to extract the content of large documents.

You can also define this limit per document using the indexed_chars_field setting, which names a field in the incoming document. If a document contains that field, its value overrides the indexed_chars setting for that document, as shown in the following example:

  PUT _ingest/pipeline/attachment
  {
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
          "field" : "data",
          "indexed_chars" : 10,
          "indexed_chars_field" : "max_chars"
        }
      }
    ]
  }

With the attachment pipeline configured, a request that does not specify max_chars uses the pipeline's indexed_chars value of 10 characters, as shown in the following example:

  PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
  {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  }

Alternatively, you can set max_chars in an individual document in order to extract up to 15 characters, as shown in the following example:

  PUT example-attachment-index/_doc/lorem_rtf?pipeline=attachment
  {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "max_chars": 15
  }
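
The way the two settings interact can be sketched in Node.js as follows. This is an illustration of the behavior described in this section, not the plugin's implementation: the function names resolveIndexedChars and limitContent are hypothetical, and the real plugin applies the limit during extraction, so its output may differ in details such as whitespace.

```javascript
// Illustration of the documented semantics, not the plugin's implementation.
// Returns the per-document limit if the document supplies one in the field
// named by indexed_chars_field; otherwise returns the pipeline limit.
function resolveIndexedChars(processorConfig, sourceDocument) {
  const pipelineLimit = processorConfig.indexed_chars ?? 100000;
  const fieldName = processorConfig.indexed_chars_field;
  const override = fieldName ? sourceDocument[fieldName] : undefined;
  return typeof override === "number" ? override : pipelineLimit;
}

// Truncates text to the limit; -1 means no limit.
function limitContent(text, limit) {
  return limit < 0 ? text : text.slice(0, limit);
}

const processorConfig = { field: "data", indexed_chars: 10, indexed_chars_field: "max_chars" };
const fullText = "Lorem ipsum dolor sit amet";

// No max_chars in the document: the pipeline limit of 10 applies.
console.log(limitContent(fullText, resolveIndexedChars(processorConfig, {})));
// max_chars set to 15 in the document: the per-document limit applies.
console.log(limitContent(fullText, resolveIndexedChars(processorConfig, { max_chars: 15 })));
```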