corpora

The corpora element contains all the document corpora used by the workload. You can use document corpora across workloads by copying and pasting any corpora definitions.

Example

The following example defines a single corpus called movies with 11658903 documents and 1544799789 uncompressed bytes:

  "corpora": [
    {
      "name": "movies",
      "documents": [
        {
          "source-file": "movies-documents.json",
          "document-count": 11658903,
          "uncompressed-bytes": 1544799789
        }
      ]
    }
  ]

Obtain the document-count and uncompressed-bytes values for your source file from the command line.
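The document-count and uncompressed-bytes values have to match the source file. As a minimal sketch, and assuming the source file holds one JSON document per line with no action lines, both values could be computed with a short script. The corpus_stats function is a hypothetical helper for illustration and is not part of OpenSearch Benchmark.

```python
import os


def corpus_stats(path):
    """Return (document_count, uncompressed_bytes) for a source file.

    Assumption: the file contains one JSON document per line and no
    bulk action lines, so every non-empty line counts as one document.
    """
    document_count = 0
    with open(path, "rb") as source:
        for line in source:
            if line.strip():  # skip blank lines
                document_count += 1
    return document_count, os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical usage: point this at your own uncompressed source file.
    path = "movies-documents.json"
    if os.path.exists(path):
        count, size = corpus_stats(path)
        print(f'"document-count": {count}, "uncompressed-bytes": {size}')
```

Run the script against the uncompressed file, because uncompressed-bytes describes the size after decompression.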

Configuration options

Use the following options with corpora.

| Parameter | Required | Type | Description |
|---|---|---|---|
| name | Yes | String | The name of the document corpus. Because OpenSearch Benchmark uses this name in its directories, use only lowercase names without white spaces. |
| documents | Yes | JSON array | An array of document files. |
| meta | No | JSON object | A mapping of key-value pairs with additional metadata for a corpus. |
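The constraints in this table can be checked before running a workload. The following is a sketch only: validate_corpus is a hypothetical helper, not an OpenSearch Benchmark function, and it checks nothing beyond the two required options and the naming advice above.

```python
import re


def validate_corpus(corpus):
    """Return a list of problems found in one corpus definition.

    Hypothetical helper: it only checks the required options and the
    naming advice from the table, not the full workload schema.
    """
    problems = []

    name = corpus.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name is required and must be a string")
    elif name != name.lower() or re.search(r"\s", name):
        problems.append("name should be lowercase without white spaces")

    if not isinstance(corpus.get("documents"), list):
        problems.append("documents is required and must be an array")

    return problems


if __name__ == "__main__":
    print(validate_corpus({"name": "My Movies", "documents": []}))
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in a corpus definition at once.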

Each entry in the documents array consists of the following options.

| Parameter | Required | Type | Description |
|---|---|---|---|
| source-file | Yes | String | The name of the file containing the documents for the workload. When using OpenSearch Benchmark locally, documents are contained in a JSON file. When providing a base-url, use a compressed file format: .zip, .bz2, .gz, .tar, .tar.gz, .tgz, or .tar.bz2. The compressed file must contain exactly one JSON file with the same name. |
| document-count | Yes | Integer | The number of documents in the source-file. OpenSearch Benchmark uses this number to determine which client indexes which part of the document corpus: each of the N clients receives one Nth of the corpus. When the source contains documents with a parent-child relationship, specify the number of parent documents. |
| base-url | No | String | An HTTP(S), Amazon Simple Storage Service (Amazon S3), or Google Cloud Storage URL that points to the root path where OpenSearch Benchmark can obtain the corresponding source file. |
| source-format | No | String | Defines the format OpenSearch Benchmark uses to interpret the data file specified in source-file. Only bulk is supported. |
| compressed-bytes | No | Integer | The size, in bytes, of the compressed source file, indicating how much data OpenSearch Benchmark downloads. |
| uncompressed-bytes | No | Integer | The size, in bytes, of the source file after decompression, indicating how much disk space the decompressed source file needs. |
| target-index | No | String | Defines the name of the index that the bulk operation should target. OpenSearch Benchmark automatically derives this value when only one index is defined in the indices element. The value of target-index is ignored when the includes-action-and-meta-data setting is true. |
| target-type | No | String | Defines the document type that bulk operations should target in the target index. OpenSearch Benchmark automatically derives this value when only one index is defined in the indices element and the index has only one type. The value of target-type is ignored when the includes-action-and-meta-data setting is true. |
| includes-action-and-meta-data | No | Boolean | When set to true, indicates that the documents file already contains an action and metadata line for each document. When false, indicates that the documents file contains only documents. Default is false. |
| meta | No | JSON object | A mapping of key-value pairs with additional metadata for a source file. |
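Two of these options are easier to picture in code: how document-count lets N clients each take one Nth of the corpus, and what a documents-only file lacks compared to a file that includes action and metadata lines. The sketch below is illustrative and makes assumptions: client_ranges and bulk_lines are hypothetical helpers, the way the remainder is spread across clients is a guess rather than OpenSearch Benchmark's documented behavior, and the action line uses the index action of the OpenSearch Bulk API.

```python
import json


def client_ranges(document_count, clients):
    """Split [0, document_count) into one contiguous range per client.

    Assumption: leftover documents go to the first clients, one each.
    """
    base, extra = divmod(document_count, clients)
    ranges, start = [], 0
    for client in range(clients):
        size = base + (1 if client < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def bulk_lines(documents, target_index):
    """Pair each document with an action and metadata line.

    This mirrors what a source file must already contain when
    includes-action-and-meta-data is true.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": target_index}}))
        lines.append(json.dumps(doc))
    return lines


if __name__ == "__main__":
    print(client_ranges(10, 3))
    print("\n".join(bulk_lines([{"title": "Alien"}], "movies")))
```

With a document-count that is too low or too high, ranges like these would no longer line up with the actual file, which is why the value has to be accurate.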