Processing fetched data

By default the first column from the result set of sql_query is indexed as the document id.

The document ID MUST be the very first field, and it MUST be a UNIQUE, SIGNED (NON-ZERO) INTEGER in the range from -9223372036854775808 to 9223372036854775807.

You can specify up to 32 full-text fields (formally, up to SPH_MAX_FIELDS from sphinx.h), and an arbitrary number of attributes. All the columns that are neither the document ID (the first one) nor attributes will be indexed as full-text fields.
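
For illustration, a minimal source for a hypothetical documents table might map its columns like this: id becomes the document ID, group_id and date_added become attributes, and title and content are indexed as full-text fields (the table and column names are assumptions for the example):

  sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content FROM documents
  sql_attr_uint = group_id
  sql_attr_timestamp = date_added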

Declaration of attributes:

sql_attr_bigint

Declares a 64-bit signed integer.
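
For example (the column name is hypothetical):

  sql_attr_bigint = user_id64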

sql_attr_bool

Declares a boolean attribute. It’s equivalent to an integer attribute with a bit count of 1.
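
For example (the column name is hypothetical):

  sql_attr_bool = is_deleted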

sql_attr_float

Declares a floating point attribute.

The values will be stored in single precision, 32-bit IEEE 754 format. The representable range is approximately from 1e-38 to 1e+38. The number of decimal digits that can be stored precisely is approximately 7.

One important usage of the float attributes is storing latitude and longitude values (in radians), for further usage in query-time geosphere distance calculations.
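
For example, declaring a pair of hypothetical latitude/longitude columns (already converted to radians on the SQL side):

  sql_attr_float = lat_radians
  sql_attr_float = long_radians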

sql_attr_json

Declares a JSON attribute.

When indexing JSON attributes, Manticore expects a text column with JSON-formatted data. JSON attributes support arbitrary JSON data with no limitation on nesting levels or types.
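
For example (the column name is hypothetical; the column is expected to contain JSON text such as {"tags": ["red", "blue"], "weight": 1.5}):

  sql_attr_json = properties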

sql_attr_multi

Declares a multi-value attribute.

Plain attributes only allow attaching one value per document. However, there are cases (such as tags or categories) when it is desirable to attach multiple values of the same attribute and be able to apply filtering or grouping to the value lists.

An MVA can take its values either from a column (like the rest of the data types), in which case the column in the result set must provide a string with multiple integer values separated by commas, or from a separate query that fetches the values.

In the query case, the engine runs the query, groups the result by document ID, and attaches the values to their corresponding documents in the table. Values whose ID is not found in the table are discarded.

The declaration format is as follows (backslashes are for clarity only; everything can be declared in a single line as well):

  sql_attr_multi = ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE \
      [;QUERY] \
      [;RANGED-QUERY]

where

  • ATTR-TYPE is uint, bigint or timestamp
  • SOURCE-TYPE is field, query, ranged-query, or ranged-main-query
  • QUERY is an SQL query used to fetch all ( docid, attrvalue ) pairs
  • RANGED-QUERY is an optional SQL query used to fetch min and max ID values, similar to sql_query_range (used with the ranged-query SOURCE-TYPE). When using the ranged-main-query SOURCE-TYPE, omit the RANGED-QUERY and the same query as sql_query_range will be used automatically (a useful option in complex inheritance setups that saves having to manually duplicate the same query many times).

Example:

  sql_attr_multi = uint tag from field
  sql_attr_multi = uint tag from query; SELECT id, tag FROM tags
  sql_attr_multi = bigint tag from ranged-query; \
      SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
      SELECT MIN(id), MAX(id) FROM tags

sql_attr_string

Declares a string attribute. There is a fixed size limit of 4GB per value.
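
For example (the column name is hypothetical):

  sql_attr_string = title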

sql_attr_timestamp

Declares a UNIX timestamp attribute.

Timestamps can store date and time in the range of Jan 01, 1970 to Jan 19, 2038 with a precision of one second. The expected column value should be a timestamp in UNIX format, i.e. a 32-bit unsigned integer number of seconds elapsed since midnight, January 01, 1970, GMT. Timestamps are internally stored and handled as integers everywhere. But in addition to working with timestamps as integers, it’s also legal to use them along with different date-based functions, such as the time segments sorting mode, or day/week/month/year extraction for GROUP BY.

Note that DATE or DATETIME column types in MySQL cannot be directly used as timestamp attributes in Manticore; you need to explicitly convert such columns using the UNIX_TIMESTAMP() function (if the data is in range).

Note that timestamps cannot represent dates before January 01, 1970, and UNIX_TIMESTAMP() in MySQL will not return anything meaningful for such dates. If you only need to work with dates, not times, consider the TO_DAYS() function in MySQL instead.
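
For example, a DATETIME column can be converted right in the main query on the MySQL side (table and column names are hypothetical):

  sql_query = SELECT id, title, UNIX_TIMESTAMP(added_at) AS added_ts, content FROM documents
  sql_attr_timestamp = added_ts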

sql_attr_uint

Declares an unsigned integer attribute.

You can specify the bit count for integer attributes by appending ‘:BITCOUNT’ to the attribute name (see the example below). Attributes with less than the default 32-bit size, i.e. bitfields, perform slower.

  sql_attr_uint = group_id
  sql_attr_uint = forum_id:9 # 9 bits for forum_id

sql_field_string

Declares a combo string attribute/text field. The values will be indexed as a full-text field, but also stored in a string attribute with the same name. Note that it should only be used when you are sure you want the field to be searchable both in a full-text manner and as an attribute (with the ability to sort and group by it). If you just want to be able to fetch the original value of the field, you don’t need to do anything for it unless you removed the field from the stored fields list via stored_fields.

  sql_field_string = name

sql_file_field

Declares a file based field.

This directive makes indexer interpret the field contents as a file name, and load and process the referred file. Files larger than max_file_field_buffer are skipped. Any errors during file loading (IO errors, exceeded limits, etc) will be reported as indexing warnings and will not terminate indexing early. No content will be indexed for such files.

  sql_file_field = field_name
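
A slightly fuller sketch (table and column names are hypothetical): the main query returns a path on disk, and sql_file_field makes indexer load and index the contents of that file instead of the path itself:

  sql_query = SELECT id, title, body_file FROM documents
  sql_file_field = body_file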

sql_joined_field

Joined/payload field fetch query. Multi-value, optional, default is empty list of queries.

sql_joined_field lets you use two different features: joined fields, and payloads (payload fields). Its syntax is as follows:

  sql_joined_field = FIELD-NAME 'from' ( 'query' | 'payload-query' | 'ranged-query' | 'ranged-main-query' ); \
      QUERY [ ; RANGE-QUERY ]

where

  • FIELD-NAME is a joined/payload field name
  • QUERY is an SQL query that must fetch values for further processing
  • RANGE-QUERY is an optional SQL query that fetches a range of values to process

Joined fields let you avoid JOIN and/or GROUP_CONCAT statements in the main document fetch query (sql_query). This can be useful when the SQL-side JOIN is slow, or needs to be offloaded to the Manticore side, or simply to emulate MySQL-specific GROUP_CONCAT functionality in case your database server does not support it.

The query must return exactly 2 columns: document ID, and text to append to the joined field. Document IDs can be duplicated, but they must be in ascending order. All the text rows fetched for a given ID will be concatenated together, and the concatenation result will be indexed as the entire contents of the joined field. Rows will be concatenated in the order returned from the query, and separating whitespace will be inserted between them. For instance, if the joined field query returns the following rows:

  ( 1, 'red' )
  ( 1, 'right' )
  ( 1, 'hand' )
  ( 2, 'mysql' )
  ( 2, 'manticore' )

then the indexing results would be equivalent to adding a new text field with a value of ‘red right hand’ to document 1 and ‘mysql manticore’ to document 2, including the keyword positions inside the field in the order they come from the query. If the rows need to be in a specific order, that needs to be explicitly defined in the query.

Joined fields are only indexed differently. There are no other differences between joined fields and regular text fields.

When a single query is not efficient enough or does not work because of database driver limitations, ranged queries can be used. They work similarly to the ranged queries in the main indexing loop. The range will be queried for and fetched upfront once, then multiple queries with different $start and $end substitutions will be run to fetch the actual data.

When using ranged-main-query, omit the ranged query and the same query as sql_query_range will be used automatically (a useful option in complex inheritance setups that saves having to manually duplicate the same query many times).

Payloads let you create a special field in which, instead of keyword positions, so-called user payloads are stored. Payloads are custom integer values attached to every keyword. They can then be used at search time to affect the ranking.

The payload query must return exactly 3 columns: document ID, keyword, and integer payload value. Document IDs can be duplicated, but they must be in ascending order. Payloads must be unsigned integers within the 24-bit range, i.e. from 0 to 16777215. For reference, payloads are currently internally stored as in-field keyword positions, but that is not guaranteed and might change in the future.

Currently, the only method to account for payloads is to use SPH_RANK_PROXIMITY_BM25 ranker. On tables with payload fields, it will automatically switch to a variant that matches keywords in those fields, computes a sum of matched payloads multiplied by field weights, and adds that sum to the final rank.

Example:

  sql_joined_field = \
      tagstext from query; \
      SELECT docid, CONCAT('tag',tagid) FROM tags ORDER BY docid ASC
  sql_joined_field = tag from ranged-query; \
      SELECT id, tag FROM tags WHERE id>=$start AND id<=$end ORDER BY id ASC; \
      SELECT MIN(id), MAX(id) FROM tags
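
The examples above cover the query and ranged-query variants; a payload field declaration might look like this (table and column names are hypothetical), with each row supplying a document ID, a keyword, and its integer payload value:

  sql_joined_field = tagweights from payload-query; \
      SELECT docid, tag, tagweight FROM tags ORDER BY docid ASC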

sql_column_buffers

  sql_column_buffers = <colname>=<size>[K|M] [, ...]

Per-column buffer sizes. Optional, default is empty (deduce the sizes automatically). Applies to odbc and mssql source types only.

ODBC and MS SQL drivers sometimes cannot return the maximum actual column size to be expected. For instance, NVARCHAR(MAX) columns always report their length as 2147483647 bytes to indexer even though the actually used length is likely considerably less. However, the receiving buffers still need to be allocated upfront, and their sizes have to be determined. When the driver does not report the column length at all, Manticore allocates default 1 KB buffers for each non-char column, and 1 MB buffers for each char column. Driver-reported column length also gets clamped by an upper limit of 8 MB, so in case the driver reports (almost) a 2 GB column length, it will be clamped and an 8 MB buffer will be allocated instead for that column. These hard-coded limits can be overridden using the sql_column_buffers directive, either to save memory on actually shorter columns, or to overcome the 8 MB limit on actually longer columns. The directive value must be a comma-separated list of selected column names and sizes:

Example:

  sql_query = SELECT id, mytitle, mycontent FROM documents
  sql_column_buffers = mytitle=64K, mycontent=10M

Ranged queries

The main query, which needs to fetch all the documents, can impose a read lock on the whole table and stall concurrent queries (e.g. INSERTs into a MyISAM table), waste a lot of memory for the result set, etc. To avoid this, Manticore supports so-called ranged queries. With ranged queries, Manticore first fetches min and max document IDs from the table, and then substitutes different ID intervals into the main query text and runs the modified query to fetch another chunk of documents. Here’s an example.

Ranged query usage example:

  sql_query_range = SELECT MIN(id),MAX(id) FROM documents
  sql_range_step = 1000
  sql_query = SELECT * FROM documents WHERE id>=$start AND id<=$end

If the table contains document IDs from 1 to, say, 2345, then sql_query would be run three times:

  1. with $start replaced with 1 and $end replaced with 1000;
  2. with $start replaced with 1001 and $end replaced with 2000;
  3. with $start replaced with 2001 and $end replaced with 2345.

Obviously, that’s not much of a difference for a 2000-row table, but when it comes to indexing a 10-million-row table, ranged queries might be of some help.

sql_query_range

Defines the range query. The query specified in this option must fetch min and max document IDs that will be used as range boundaries. It must return exactly two integer fields, min ID first and max ID second; the field names are ignored. When enabled, sql_query will be required to contain $start and $end macros. Note that the intervals specified by $start..$end will not overlap, so you should not remove document IDs that are exactly equal to $start or $end from your query.

sql_range_step

Defines the range query step. Default is 1024.

sql_ranged_throttle

Can be used to throttle ranged queries. By default there is no throttling. Values are expected in milliseconds.

Throttling can be useful when indexer imposes too much load on the database server. It causes indexer to sleep for the given amount of milliseconds once per each ranged query step. This sleep is unconditional, and is performed before the fetch query.

  sql_ranged_throttle = 1000 # sleep for 1 sec before each query step

Fetching from XML streams

The xmlpipe2 source type lets you pass arbitrary full-text and attribute data to Manticore in yet another custom XML format. It also allows you to specify the schema (i.e. the set of fields and attributes) either in the XML stream itself, or in the source settings.

Declaration of XML stream

The xmlpipe_command directive is mandatory and contains the shell command invoked to produce the XML stream which gets indexed. The command can simply read an XML file, but it can also be a program that generates the XML content on the fly.
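
For example (the script name is hypothetical; any command that writes well-formed XML to stdout will do):

  xmlpipe_command = /usr/local/bin/build_products_xml.sh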

XML file format

When indexing an xmlpipe2 source, indexer runs the given command, opens a pipe to its stdout, and expects a well-formed XML stream. Here’s sample stream data:

Example of XML document stream with schema built-in:

  <?xml version="1.0" encoding="utf-8"?>
  <sphinx:docset>
  <sphinx:schema>
  <sphinx:field name="subject"/>
  <sphinx:field name="content"/>
  <sphinx:attr name="published" type="timestamp"/>
  <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
  </sphinx:schema>
  <sphinx:document id="1234">
  <content>this is the main content <![CDATA[and this <cdata> entry
  must be handled properly by xml parser lib]]></content>
  <published>1012325463</published>
  <subject>note how field/attr tags can be
  in <strong class="red">randomized</strong> order</subject>
  <misc>some undeclared element</misc>
  </sphinx:document>
  <sphinx:document id="1235">
  <subject>another subject</subject>
  <content>here comes another document, and i am given to understand,
  that in-document field order must not matter, sir</content>
  <published>1012325467</published>
  </sphinx:document>
  <!-- ... even more sphinx:document entries here ... -->
  <sphinx:killlist>
  <id>1234</id>
  <id>4567</id>
  </sphinx:killlist>
  </sphinx:docset>

Arbitrary fields and attributes are allowed. They also can occur in the stream in arbitrary order within each document; the order is ignored. There is a restriction on maximum field length; fields longer than 2 MB will be truncated to 2 MB (this limit can be changed in the source).

The schema, i.e. the complete list of fields and attributes, must be declared before any document can be parsed. This can be done either in the configuration file using xmlpipe_field and xmlpipe_attr_XXX settings, or right in the stream using the <sphinx:schema> element. <sphinx:schema> is optional. It is only allowed to occur as the very first sub-element in <sphinx:docset>. If there is no in-stream schema definition, settings from the configuration file will be used. Otherwise, stream settings take precedence. Note that the document ID should be specified as the id property of the <sphinx:document> tag (e.g. <sphinx:document id="1235">) and is supposed to be a unique signed positive non-zero 64-bit integer.

Unknown tags (which were declared neither as fields nor as attributes) will be ignored with a warning. In the example above, <misc> will be ignored. All embedded tags and their attributes (such as <strong> in <subject> in the example above) will be silently ignored.

Support for incoming stream encodings depends on whether iconv is installed on the system. xmlpipe2 is parsed using the libexpat parser, which understands US-ASCII, ISO-8859-1, UTF-8 and a few UTF-16 variants natively. The Manticore configure script will also check for libiconv presence, and utilize it to handle other encodings. libexpat also enforces the requirement to use the UTF-8 charset on the Manticore side, because the parsed data it returns is always in UTF-8.

XML elements (tags) recognized by xmlpipe2 (and their attributes where applicable) are:

  • sphinx:docset - Mandatory top-level element, denotes and contains xmlpipe2 document set.
  • sphinx:schema - Optional element, must either occur as the very first child of sphinx:docset, or never occur at all. Declares the document schema. Contains field and attribute declarations. If present, overrides per-source settings from the configuration file.
  • sphinx:field - Optional element, child of sphinx:schema. Declares a full-text field. Known attributes are:
    • “name”, specifies the XML element name that will be treated as a full-text field in the subsequent documents.
    • “attr”, specifies whether to also store this field as a string attribute. Possible value is “string” (see the schema sketch after this list).
  • sphinx:attr - Optional element, child of sphinx:schema. Declares an attribute. Known attributes are:
    • “name”, specifies the element name that should be treated as an attribute in the subsequent documents.
    • “type”, specifies the attribute type. Possible values are “int”, “bigint”, “timestamp”, “bool”, “float”, “multi” and “json”.
    • “bits”, specifies the bit size for “int” attribute type. Valid values are 1 to 32.
    • “default”, specifies the default value for this attribute that should be used if the attribute’s element is not present in the document.
  • sphinx:document - Mandatory element, must be a child of sphinx:docset. Contains arbitrary other elements with field and attribute values to be indexed, as declared either using sphinx:field and sphinx:attr elements or in the configuration file. The only known attribute is “id” that must contain the unique integer document ID.
  • sphinx:killlist - Optional element, child of sphinx:docset. Contains a number of “id” elements whose contents are document IDs to be put into a kill-list of the table. The kill-list is used in multi-table searches to suppress documents found in other tables of the search.
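
For instance, a minimal in-stream schema combining these declarations might look like this (field and attribute names are hypothetical):

  <sphinx:schema>
  <sphinx:field name="title" attr="string"/>
  <sphinx:attr name="tags" type="multi"/>
  <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
  </sphinx:schema>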

Data definition in source configuration

If the XML doesn’t define a schema, the data types of the document elements must be defined in the source configuration.

  • xmlpipe_field - declares a text field.
  • xmlpipe_field_string - declares a text field/string attribute. The value will be both indexed as a text field and stored as a string attribute.
  • xmlpipe_attr_uint - declares an integer attribute
  • xmlpipe_attr_timestamp - declares a timestamp attribute
  • xmlpipe_attr_bool - declares a boolean attribute
  • xmlpipe_attr_float - declares a float attribute
  • xmlpipe_attr_bigint - declares a big integer attribute
  • xmlpipe_attr_multi - declares a multi-value attribute with integers
  • xmlpipe_attr_multi_64 - declares a multi-value attribute with 64-bit integers
  • xmlpipe_attr_string - declares a string attribute
  • xmlpipe_attr_json - declares a JSON attribute

Specific XML source settings

If xmlpipe_fixup_utf8 is set, it enables Manticore-side UTF-8 validation and filtering to prevent the XML parser from choking on non-UTF-8 documents. By default this option is disabled.

On certain occasions it might be hard or even impossible to guarantee that the incoming XMLpipe2 document bodies are in perfectly valid and conforming UTF-8 encoding. For instance, documents with national single-byte encodings could sneak into the stream. The libexpat XML parser is fragile, meaning that it will stop processing in such cases. The UTF-8 fixup feature lets you avoid that. When fixup is enabled, Manticore will preprocess the incoming stream before passing it to the XML parser and replace invalid UTF-8 sequences with spaces.

  xmlpipe_fixup_utf8 = 1

Example of XML source without schema in configuration:

  source xml_test_1
  {
      type = xmlpipe2
      xmlpipe_command = cat /tmp/products_today.xml
  }

Example of XML source with schema in configuration:

  source xml_test_2
  {
      type = xmlpipe2
      xmlpipe_command = cat /tmp/products_today.xml
      xmlpipe_field = subject
      xmlpipe_field = content
      xmlpipe_attr_timestamp = published
      xmlpipe_attr_uint = author_id:16
  }