Real-time table

A real-time table is the main table type in Manticore. It allows adding, updating, and deleting documents with immediate availability of the changes. Real-time table settings can be defined in a configuration file or online via CREATE/UPDATE/DELETE/ALTER commands.

A real-time table internally consists of one or multiple plain tables called chunks. There can be:

  • multiple disk chunks, stored on disk with the same structure as any plain table
  • a single RAM chunk, stored in memory and used as an accumulator of changes

RAM chunk size is controlled by rt_mem_limit. Once the limit is exceeded, the RAM chunk is flushed to disk in the form of a disk chunk. When there are too many disk chunks, they can be merged into one for better performance using the OPTIMIZE command.
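
For example, an SQL sketch of merging the disk chunks of a table and checking the result (the table name products is an assumption; OPTIMIZE runs in the background, so the change may not be visible immediately):

  OPTIMIZE TABLE products;
  SHOW TABLE products STATUS;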

SQL:

  CREATE TABLE products(title text, price float) morphology='stem_en';

JSON:

  POST /cli -d "CREATE TABLE products(title text, price float) morphology='stem_en'"

PHP:

  $index = new \Manticoresearch\Index($client);
  $index->setName('products');
  $index->create([
      'title'=>['type'=>'text'],
      'price'=>['type'=>'float'],
  ]);

Python:

  utilsApi.sql('CREATE TABLE forum(title text, price float)')

Javascript:

  res = await utilsApi.sql('CREATE TABLE forum(title text, price float)');

Java:

  utilsApi.sql("CREATE TABLE forum(title text, price float)");

CONFIG:

  table products {
      type = rt
      path = tbl
      rt_field = title
      rt_attr_uint = price
      stored_fields = title
  }

Response

SQL:

  Query OK, 0 rows affected (0.00 sec)

JSON:

  {
      "total":0,
      "error":"",
      "warning":""
  }

👍 What you can do with a real-time table:

⛔ What you cannot do with a real-time table:

  • Index data with the help of indexer
  • Link it with sources for easy indexing from external storages
  • Update its killlist_target; it's just not needed, as the real-time table takes control of it automatically

Real-time table files structure

  Extension   Description
  .lock       lock file
  .ram        RAM chunk
  .meta       RT table headers
  .*.sp*      disk chunks (see plain table format)

Plain table

A plain table is a basic element for non-percolate searching. It can be defined only in a configuration file in the Plain mode; it's not supported in the RT mode. It's normally used together with a source to process data from an external storage, and afterwards it can be attached to a real-time table.

👍 What you can do with a plain table:

⛔ What you cannot do with a plain table:

  • insert more data into a table after it’s built
  • delete data from it
  • create/delete/alter a plain table online (you need to define it in a configuration file)
  • use UUID for automatic ID generation. When you fetch data from an external storage it must include a unique identifier for each document

Except for numeric attributes (including MVA), the rest of the data in a plain table is immutable. If you need to update or add new records, you need to rebuild the table. While the table is being rebuilt, the existing table is still available for serving requests. When a new version of the table is ready, a process called rotation is performed, which puts the new version online and discards the old one.
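
A rebuild with rotation is typically performed with the indexer tool; a minimal command-line sketch (the table name mytable is an assumption):

  indexer mytable --rotate

indexer builds the new table files next to the current ones, and the running searchd then picks them up and performs the rotation.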


Plain table example

A plain table can only be defined in a configuration file. It's not supported by the CREATE TABLE command.

  source source {
      type = mysql
      sql_host = localhost
      sql_user = myuser
      sql_pass = mypass
      sql_db = mydb
      sql_query = SELECT id, title, description, category_id from mytable
      sql_attr_uint = category_id
      sql_field_string = title
  }

  table tbl {
      type = plain
      source = source
      path = /path/to/table
  }

Plain table building performance

The speed of plain indexing depends on several factors:

  • how fast the source can provide the data
  • tokenization settings
  • your hardware (CPU, amount of RAM, disk performance)

Plain table building scenarios

Rebuild fully when needed

In the simplest usage scenario, we would use a single plain table which we just fully rebuild from time to time. This works fine for smaller data sets if you accept that:

  • the table will not be as fresh as the data in the source
  • indexing duration grows with the data: the more data you have in the source, the longer it will take to build the table

Main+delta

If you have a bigger data set and still want to use a plain table rather than a real-time one, what you can do is:

  • make another, smaller table for incremental indexing
  • combine both of them using a distributed table

This lets you rebuild the bigger table rarely (say, once per week), save the position of the freshest indexed document, and then use the smaller table to process anything new or updated in your source. Since you only need to fetch the updates from your storage, you can do it much more frequently (say, once per minute or even every few seconds).

After a while the delta table's indexing duration will become too long, and that will be the moment when you need to rebuild the bigger table and empty the smaller one.

This is called the main+delta schema and you can learn more about it in this interactive course.

When you build the smaller "delta" table, it can get documents that are already in the "main" table. To let Manticore know that documents from the current table should take precedence, there's a mechanism called kill list and the corresponding directive killlist_target.

More information on this topic can be found here.
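
A hedged configuration sketch of this scheme (all table, source and column names, as well as the hard-coded timestamp, are illustrative assumptions; in practice the position of the freshest indexed document is usually tracked in a helper table):

  source src_main {
      type = mysql
      sql_host = localhost
      sql_user = myuser
      sql_pass = mypass
      sql_db = mydb
      # full pass over the data
      sql_query = SELECT id, title, body FROM documents
  }

  source src_delta : src_main {
      # only what changed since the last full rebuild
      sql_query = SELECT id, title, body FROM documents WHERE updated_at > '2024-01-01 00:00:00'
  }

  table main {
      type = plain
      source = src_main
      path = /path/to/main
  }

  table delta {
      type = plain
      source = src_delta
      path = /path/to/delta
      # suppress older versions of documents that also exist in "main"
      killlist_target = main:kl
  }

  table documents_all {
      type = distributed
      local = main
      local = delta
  }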

Plain table files structure

  Extension   Description
  .spa        stores document attributes in row-wise mode
  .spb        stores blob attributes in row-wise mode: strings, MVA, json
  .spc        stores document attributes in columnar mode
  .spd        stores matching document ID lists for each word ID
  .sph        stores table header information
  .sphi       stores histograms of attribute values
  .spi        stores word lists (word IDs and pointers to .spd file)
  .spidx      stores secondary indexes data
  .spk        stores kill-lists
  .spl        lock file
  .spm        stores a bitmap of killed documents
  .spp        stores hit (aka posting, aka word occurrence) lists for each word ID
  .spt        stores additional data structures to speed up lookups by document ids
  .spe        stores skip-lists to speed up doc-list filtering
  .spds       stores document texts
  .tmp        temporary files during index_settings_and_status
  .new.sp*    new version of a plain table before rotation
  .old.sp*    old version of a plain table after rotation

Plain and real-time table settings

Defining table schema in a configuration file

  table <index_name>[:<parent table name>] {
      ...
  }

Plain:

  table <table name> {
      type = plain
      path = /path/to/table
      source = <source_name>
      source = <another source_name>
      [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
  }

Real-time:

  table <table name> {
      type = rt
      path = /path/to/table
      rt_field = <full-text field name>
      rt_field = <another full-text field name>
      [rt_attr_uint = <integer field name>]
      [rt_attr_uint = <another integer field name, limit by N bits>:N]
      [rt_attr_bigint = <bigint field name>]
      [rt_attr_bigint = <another bigint field name>]
      [rt_attr_multi = <multi-integer (MVA) field name>]
      [rt_attr_multi = <another multi-integer (MVA) field name>]
      [rt_attr_multi_64 = <multi-bigint (MVA) field name>]
      [rt_attr_multi_64 = <another multi-bigint (MVA) field name>]
      [rt_attr_float = <float field name>]
      [rt_attr_float = <another float field name>]
      [rt_attr_bool = <boolean field name>]
      [rt_attr_bool = <another boolean field name>]
      [rt_attr_string = <string field name>]
      [rt_attr_string = <another string field name>]
      [rt_attr_json = <json field name>]
      [rt_attr_json = <another json field name>]
      [rt_attr_timestamp = <timestamp field name>]
      [rt_attr_timestamp = <another timestamp field name>]
      [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
      [rt_mem_limit = <RAM chunk max size, default 128M>]
      [optimize_cutoff = <max number of RT table disk chunks>]
  }

Common plain and real-time table settings

type

  type = plain
  type = rt

Table type: “plain” or “rt” (real-time)

Value: plain (default), rt

path

  path = path/to/table

Absolute or relative path, without extension, where the table is stored or where to look for it.

Value: path to the table, mandatory

stored_fields

  stored_fields = title, content

By default when a table is defined in a configuration file, full-text fields’ original content is both indexed and stored. This setting lets you specify the fields that should have their original values stored.

Value: comma separated list of full-text fields that should be stored. Empty value (i.e. stored_fields =) disables storing original values for all the fields.

Note that in the case of a real-time table the fields listed in stored_fields should also be declared as rt_field.

Note also that you don't need to list attributes in stored_fields, since their original values are stored anyway. stored_fields can only be used for full-text fields.

See also docstore_block_size, docstore_compression for document storage compression options.

SQL:

  CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)

JSON:

  POST /cli -d "
  CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)"

PHP:

  $params = [
      'body' => [
          'columns' => [
              'title'=>['type'=>'text'],
              'content'=>['type'=>'text', 'options' => ['indexed', 'stored']],
              'name'=>['type'=>'text', 'options' => ['indexed']],
              'price'=>['type'=>'float']
          ]
      ],
      'index' => 'products'
  ];
  $index = new \Manticoresearch\Index($client);
  $index->create($params);

Python:

  utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)')

Javascript:

  res = await utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)');

Java:

  utilsApi.sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");

CONFIG:

  table products {
      stored_fields = title, content # we want to store only "title" and "content", "name" shouldn't be stored
      type = rt
      path = tbl
      rt_field = title
      rt_field = content
      rt_field = name
      rt_attr_uint = price
  }

stored_only_fields

  stored_only_fields = title,content

A list of fields that will be stored in the table but not indexed. It is similar to stored_fields, except that when a field is specified in stored_only_fields it is only stored, not indexed, and can't be searched with full-text queries. It can only be returned with search results (see the configuration sketch after the lists below).

Value: comma separated list of fields that should be stored only, not indexed. Default is empty. Note that in the case of a real-time table the fields listed in stored_only_fields should also be declared as rt_field.

Note also that you don't need to list attributes in stored_only_fields, since their original values are stored anyway. Comparing a stored-only field to a string attribute, the former (stored field):

  • is stored on disk and doesn't require memory
  • is stored compressed
  • can only be fetched; you can't sort/filter/group by its value

The latter (string attribute):

  • is stored on disk and in memory
  • is stored uncompressed
  • can be used for sorting, grouping, filtering and anything else you want to do with attributes.
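
A minimal configuration sketch, assuming a real-time table named products where the raw "content" field should be fetchable with search results but not full-text searchable:

  table products {
      type = rt
      path = tbl
      rt_field = title
      rt_field = content
      # "content" is only stored; it can't be matched by full-text queries
      stored_only_fields = content
      rt_attr_uint = price
  }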

Real-time table settings:

optimize_cutoff

Max number of RT table disk chunks. Read more here.

rt_field

  rt_field = subject

Full-text fields to be indexed. The names must be unique. The order is preserved, so field values in INSERT statements without an explicit list of inserted columns must be in the same order as configured.

Full-text field declaration. Multi-value, optional.

rt_attr_uint

  rt_attr_uint = gid

Unsigned integer attribute declaration

Value: field_name or field_name:N, can be multiple records. N is the max number of bits to keep.

rt_attr_bigint

  rt_attr_bigint = gid

BIGINT attribute declaration

Value: field name, multiple records allowed

rt_attr_multi

  rt_attr_multi = tags

Multi-valued attribute (MVA) declaration. Declares the UNSIGNED INTEGER (unsigned 32-bit) MVA attribute. Multi-value (ie. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_multi_64

  rt_attr_multi_64 = wide_tags

Multi-valued attribute (MVA) declaration. Declares the BIGINT (signed 64-bit) MVA attribute. Multi-value (ie. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_float

  rt_attr_float = lat
  rt_attr_float = lon

Floating point attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional. Declares a single precision, 32-bit IEEE 754 format float attribute.

Value: field name, multiple records allowed.

rt_attr_bool

  rt_attr_bool = available

Boolean attribute declaration. Multi-value (there might be multiple attributes declared), optional. Declares a 1-bit unsigned integer attribute.

Value: field name, multiple records allowed.

rt_attr_string

  rt_attr_string = title

String attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Value: field name, multiple records allowed.

rt_attr_json

  rt_attr_json = properties

JSON attribute declaration. Multi-value (ie. there may be more than one such attribute declared), optional.

Value: field name, multiple records allowed.

rt_attr_timestamp

  rt_attr_timestamp = date_added

Timestamp attribute declaration. Multi-value (an arbitrary number of attributes is allowed), optional.

Value: field name, multiple records allowed.

rt_mem_limit

  rt_mem_limit = 512M

RAM chunk size limit. Optional, default is 128M.

RT table keeps some data in memory (“RAM chunk”) and also maintains a number of on-disk tables (“disk chunks”). This directive lets you control the RAM chunk size. Once there’s too much data to keep in RAM, RT table will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

The limit is pretty strict: the RT table never allocates more memory than it's limited to. The memory is not preallocated either; hence, specifying a 512 MB limit and inserting only 3 MB of data results in allocating 3 MB, not 512 MB.

The rt_mem_limit is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. The real-time table learns from your data insertion pace and adapts the actual limit to decrease RAM consumption and increase data write speed. How it works:

  • By default, the RAM chunk flush threshold is 50% of rt_mem_limit. This share is called the "rt_mem_limit rate".
  • As soon as the RAM chunk accumulates rt_mem_limit * rate of data (50% of rt_mem_limit by default), Manticore starts saving the RAM chunk as a new disk chunk.
  • While the new disk chunk is being saved, Manticore checks how many new/replaced documents have appeared.
  • Upon saving the new disk chunk, the rt_mem_limit rate is updated.
  • The rate is reset to 50% whenever you restart searchd.

For example, if we saved 90M docs to a disk chunk and 10M more docs arrived while saving, the rate is 90%, so next time we collect up to 90% of rt_mem_limit before starting to flush. The higher the insertion speed, the lower the rt_mem_limit rate. The rate varies in the range of 33.3% to 95%. You can see a table's current rate in SHOW TABLE STATUS.
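
For instance, you can check the current rate, along with other table metrics, with an SQL sketch like this (the table name t is an assumption; the exact set of returned variables depends on the Manticore version):

  SHOW TABLE t STATUS;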

How to change rt_mem_limit and optimize_cutoff

In the RT mode, the RAM chunk size limit and the maximum number of disk chunks can be changed using ALTER TABLE. To set rt_mem_limit to 1 gigabyte for table 't', run the query ALTER TABLE t rt_mem_limit='1G'. To change the maximum number of disk chunks, run ALTER TABLE t optimize_cutoff='5'.

In the plain mode, rt_mem_limit and optimize_cutoff can be changed as follows (see the example below):

  • change the value in the table configuration
  • run ALTER TABLE <index_name> RECONFIGURE
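
For example (an SQL sketch using the table name t from the text above):

  ALTER TABLE t rt_mem_limit='1G';
  ALTER TABLE t optimize_cutoff='5';
  ALTER TABLE t RECONFIGURE;

The first two statements work in the RT mode; RECONFIGURE applies the updated configuration file settings in the plain mode.
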
Important notes about RAM chunks

  • An RT table is quite similar to a distributed table consisting of multiple local tables. The local tables are called "disk chunks".
  • The RAM chunk internally consists of multiple "segments".
  • While disk chunks are stored on disk, the segments of the RAM chunk are special RAM-only "tables".
  • Any transaction you make to a real-time table generates a new segment, and RAM chunk segments are merged after each transaction commit. Therefore it is beneficial to do bulk INSERTs of hundreds or thousands of documents rather than hundreds or thousands of separate single-document inserts, to avoid the overhead of merging RAM chunk segments (see the example after this list).
  • When the number of segments exceeds 32, they are merged so the count stays at or below 32.
  • An RT table always has a single RAM chunk (which may be empty) and one or multiple disk chunks.
  • Merging larger segments takes longer, which is why it may be suboptimal to have a very large RAM chunk (and therefore a large rt_mem_limit).
  • The number of disk chunks depends on the amount of data in the table and the rt_mem_limit setting.
  • searchd flushes the RAM chunk to disk (not as a disk chunk, it just persists it) on shutdown and periodically according to rt_flush_period. Flushing several gigabytes to disk may take some time.
  • A large RAM chunk puts more pressure on the storage:
    • when the RAM chunk is flushed to disk into the .ram file
    • when the RAM chunk is full and is dumped to disk as a disk chunk.
  • Until the RAM chunk is persisted on disk, the binary log serves as its backup in case of a sudden daemon shutdown. The larger your rt_mem_limit, the longer it will take to replay the binlog on start to recover the RAM chunk.
  • The RAM chunk may perform slightly slower than a disk chunk.
  • Even though the RAM chunk itself doesn't take more memory than rt_mem_limit, Manticore can take more in some cases, e.g. if you begin a transaction to insert data and don't commit it for some time, the data already transmitted to Manticore within the transaction is kept in memory.
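
As noted above, bulk INSERTs reduce the segment-merging overhead. A minimal SQL sketch (the table and values are illustrative assumptions; Manticore generates document ids automatically when they are not provided):

  INSERT INTO products (title, price) VALUES
      ('first product', 10.0),
      ('second product', 20.0),
      ('third product', 30.0);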

Plain table settings:

source

  source = srcpart1
  source = srcpart2
  source = srcpart3

Specifies the document source to get documents from when the current table is indexed. There must be at least one source. The sources can be of different types (e.g. one can be mysql, another postgresql). Read more about indexing from external storages here.

Value: name of the source to build the table from, mandatory. Can be multiple records.

killlist_target

  killlist_target = main:kl

Sets the table(s) that the kill-list will be applied to. It suppresses matches in the targeted table that are updated or deleted in the current table. In :kl mode the documents to suppress are taken from the kill-list. In :id mode all document ids from the current table are suppressed in the targeted one. If neither is specified, both modes take effect. Read more about kill-lists here.

Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed

columnar_attrs

  columnar_attrs = *
  columnar_attrs = id, attr1, attr2, attr3

Specifies what attributes should be stored in the columnar storage instead of the default row-wise storage.

You can use columnar_attrs = * to store all attributes of the supported data types in the columnar storage.

id is also supported.

Creating a real-time table online via CREATE TABLE

General syntax of CREATE TABLE

  CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]

Data types:

Read more about data types here.

Each type is listed with its equivalent in a configuration file, notes, and aliases:

  • text - rt_field. Options: indexed, stored; default is both. To keep text stored, but not indexed, specify only "stored". To keep text indexed only, specify only "indexed". Alias: string
  • integer - rt_attr_uint. Integer. Aliases: int, uint
  • bigint - rt_attr_bigint. Big integer
  • float - rt_attr_float. Float
  • multi - rt_attr_multi. Multi-integer
  • multi64 - rt_attr_multi_64. Multi-bigint
  • bool - rt_attr_bool. Boolean
  • json - rt_attr_json. JSON
  • string - rt_attr_string. String. Adding the indexed option makes the value full-text indexed while staying filterable, sortable and groupable at the same time
  • timestamp - rt_attr_timestamp. Timestamp
  • bit(n) - rt_attr_uint with field_name:N, where N is the max number of bits to keep
Examples:

  CREATE TABLE products (title text, price float) morphology='stem_en'

This creates the table "products" with two fields, "title" (full-text) and "price" (float), and sets the "morphology" setting to "stem_en".

  CREATE TABLE products (title text indexed, description text stored, author text, price float)

This creates the table "products" with three full-text fields and a float attribute "price":

  • field "title": indexed, but not stored
  • field "description": stored, but not indexed
  • field "author": both stored and indexed

Engine

  create table ... engine='columnar';
  create table ... engine='rowwise';

Changes the default attribute storage for all attributes in the table. It can be overridden by specifying engine separately for each attribute (see the example below).

See columnar_attrs on how to enable columnar storage for a plain table.

Values:

  • columnar - enables columnar storage for all table attributes except for json
  • rowwise (default) - doesn’t change anything, i.e. makes Manticore use the traditional row-wise storage for the table
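
For instance, a hedged SQL sketch of a per-attribute override (column names are illustrative): the table defaults to columnar storage while one attribute explicitly stays row-wise:

  CREATE TABLE products (title text, price float engine='rowwise', tags multi) engine='columnar';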

Other settings

The following settings work the same way for both real-time and plain tables in either mode, whether specified in a configuration file or online via the CREATE or ALTER command.

Accessing table files

Manticore uses two access modes to read table data: seek+read and mmap.

In the seek+read mode the server performs the pread system call to read document lists and keyword positions, i.e. the *.spd and *.spp files. Internal read buffers are used to optimize reading; their size can be tuned with the read_buffer_docs and read_buffer_hits options. There is also the preopen option that controls how Manticore opens files at start.

In the mmap access mode the search server just maps the table's files into memory with the mmap system call, and the OS caches their contents by itself. The read_buffer_docs and read_buffer_hits options have no effect for the corresponding files in this mode. The mmap reader can also lock the table's data in memory via the privileged mlock call, which prevents the OS from swapping the cached data out to disk.

To control which access mode will be used, the access_plain_attrs, access_blob_attrs, access_doclists and access_hitlists options are available, with the following values:

  • file - the server reads the table files from disk with seek+read, using internal buffers on file access
  • mmap - the server maps the table files into memory and the OS caches their contents on file access
  • mmap_preread - the server maps the table files into memory and a background thread reads them once to warm up the cache
  • mlock - the server maps the table files into memory and then executes the mlock() system call to cache the file contents and lock them in memory, preventing the OS from swapping them out

The settings and the values they accept:

  • access_plain_attrs - mmap, mmap_preread (default), mlock; controls how .spa (plain attributes), .spe (skip lists), .spi (word lists), .spt (lookups), .spm (killed docs) will be read
  • access_blob_attrs - mmap, mmap_preread (default), mlock; controls how .spb (blob attributes: string, MVA and json attributes) will be read
  • access_doclists - file (default), mmap, mlock; controls how .spd (doc lists) data will be read
  • access_hitlists - file (default), mmap, mlock; controls how .spp (hit lists) data will be read

Here is a summary that can help you select your desired mode:

  • plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs: keep it on disk - mmap; keep it in memory - mmap; cached in memory on server start - mmap_preread (default); lock it in memory - mlock
  • row-wise string, multi-value (MVA) and json attributes: keep it on disk - mmap; keep it in memory - mmap; cached in memory on server start - mmap_preread (default); lock it in memory - mlock
  • columnar numeric, string and multi-value attributes: keep it on disk - always; keep it in memory - only by means of the OS; cached in memory on server start - no; lock it in memory - not supported
  • doc lists: keep it on disk - file (default); keep it in memory - mmap; cached in memory on server start - no; lock it in memory - mlock
  • hit lists: keep it on disk - file (default); keep it in memory - mmap; cached in memory on server start - no; lock it in memory - mlock

The recommendations are:
  • If you want the best search response time and have enough memory - use row-wise attributes and mlock for attributes and for doclists/hitlists
  • If you can’t afford lower performance on start and are ready to wait longer on start until it’s warmed up - use --force-preread. If you want searchd to be able to restart faster - stay with mmap_preread
  • If you want to save RAM, but still have enough RAM for all the attributes - do not use mlock; the OS will then decide what should be in memory at any given moment depending on what is read from disk more frequently
  • If row-wise attributes don’t fit into RAM - use columnar attributes
  • If full-text search performance is not a priority and you want to save RAM - use access_doclists/access_hitlists=file

The default mode is to:

  • mmap
  • preread non-columnar attributes
  • seek+read columnar attributes with no preread
  • seek+read doclists/hitlists with no preread

which provides decent search performance, optimal memory usage and faster searchd restart in most cases.
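
For illustration, a hedged configuration sketch (the table and source names are assumptions) that trades RAM for search latency by locking attributes in memory while keeping doc/hit lists on disk:

  table products {
      type = plain
      path = /path/to/products
      source = src_products
      access_plain_attrs = mlock
      access_blob_attrs = mlock
      access_doclists = file
      access_hitlists = file
  }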

attr_update_reserve

  attr_update_reserve = 256k

Sets the space to be reserved for blob attribute updates. Optional, default value is 128k. When blob attributes (multi-value attributes (MVA), strings, JSON) are updated, their length may change. If the updated string (or MVA or JSON) is shorter than the old one, it overwrites the old one in the *.spb file. But if the updated string is longer, the updates are written to the end of the *.spb file. This file is memory mapped, that’s why resizing it may be a rather slow process, depending on the OS implementation of memory mapped files. To avoid frequent resizes, you can specify the extra space to be reserved at the end of the .spb file by using this setting.

Value: size, default 128k.

docstore_block_size

  docstore_block_size = 32k

Size of the block of documents used by document storage. Optional, default is 16kb. When stored_fields or stored_only_fields are specified, original document text is stored inside the table. To use less disk space, documents are compressed. To get more efficient disk access and better compression ratios on small documents, documents are concatenated into blocks. When indexing, documents are collected until their total size reaches the threshold. After that, this block of documents is compressed. This option can be used to get better compression ratio (by increasing block size) or to get faster access to document text (by decreasing block size).

Value: size, default 16k.

docstore_compression

  docstore_compression = lz4hc

Type of compression used to compress blocks of documents used by document storage. When stored_fields or stored_only_fields are specified, document storage stores compressed document blocks. ‘lz4’ has fast compression and decompression speeds, ‘lz4hc’ (high compression) has the same fast decompression but compression speed is traded for better compression ratio. ‘none’ disables compression.

Value: lz4 (default), lz4hc, none.

docstore_compression_level

  docstore_compression_level = 12

Compression level in document storage when ‘lz4hc’ compression is used. When ‘lz4hc’ compression is used, compression level can be fine-tuned to get better performance or better compression ratio. Does not work with ‘lz4’ compression.

Value: 1-12 (default 9).
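
Since these settings can also be set online (see "Other settings" above), here is a hedged SQL sketch of creating a table with tuned document storage; the field list is an assumption:

  CREATE TABLE products (title text, content text stored indexed)
  docstore_block_size='32k' docstore_compression='lz4hc' docstore_compression_level='10';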

preopen

  preopen = 1

This option tells searchd that it should pre-open all table files on startup (or rotation) and keep them open while it runs. Currently, the default mode is not to pre-open the files. Pre-opened tables take a few (currently 2) file descriptors per table. However, they save on per-query open() calls, and they are also invulnerable to subtle race conditions that may happen during table rotation under high load. On the other hand, when serving many tables (100s to 1000s), it still might be desirable to open them on a per-query basis in order to save file descriptors.

Value: 0 (default), 1.

read_buffer_docs

  read_buffer_docs = 1M

Per-keyword read buffer size for document lists. The higher the value, the higher the per-query RAM use, but possibly the lower the I/O time.

Value: size, default 256k, min 8k.

read_buffer_hits

  read_buffer_hits = 1M

Per-keyword read buffer size for hit lists. The higher the value, the higher the per-query RAM use, but possibly the lower the I/O time.

Value: size, default 256k, min 8k.

Plain table disk footprint settings

inplace_enable

  inplace_enable = {0|1}

Whether to enable in-place table inversion. Optional, default is 0 (use separate temporary files).

inplace_enable greatly reduces the indexing disk footprint of a plain table, at the cost of slightly slower indexing (it uses around 2x less disk, but yields around 90-95% of the original indexing performance).

Indexing involves two major phases. The first phase collects, processes, and partially sorts documents by keyword, and writes the intermediate result to temporary files (.tmp*). The second phase fully sorts the documents and creates the final table files. Thus, rebuilding a production table on the fly involves around a 3x peak disk footprint: one copy for the intermediate temporary files, a second copy for the newly constructed table, and a third copy for the old table that keeps serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) That might be too much disk footprint for big data collections, and inplace_enable allows you to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them on completion. However, this might require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive does not affect searchd in any way, it only affects indexer.

CONFIG:

  table products {
      inplace_enable = 1
      path = products
      source = src_base
  }

inplace_hit_gap

  inplace_hit_gap = size

In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive does not affect searchd in any way, it only affects indexer.

CONFIG:

  table products {
      inplace_hit_gap = 1M
      inplace_enable = 1
      path = products
      source = src_base
  }

inplace_reloc_factor

  inplace_reloc_factor = 0.1

Controls relocation buffer size within indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

CONFIG:

  table products {
      inplace_reloc_factor = 0.1
      inplace_enable = 1
      path = products
      source = src_base
  }

inplace_write_factor

  inplace_write_factor = 0.1

Controls in-place write buffer size within indexing memory arena. Optional, default is 0.1.

This directive does not affect searchd in any way, it only affects indexer.

CONFIG:

  table products {
      inplace_write_factor = 0.1
      inplace_enable = 1
      path = products
      source = src_base
  }

Natural language processing specific settings

The following settings are supported. They are all described in section NLP and tokenization.