Collations
Collations essentially affect the string attribute comparisons. They specify both the character set encoding and the strategy that Manticore uses to compare strings when doing ORDER BY
or GROUP BY
with a string attribute involved.
String attributes are stored as is when indexing, and no character set or language information is attached to them. That’s okay as long as Manticore only needs to store and return the strings to the calling application verbatim. But when you ask Manticore to sort by a string value, that request immediately becomes quite ambiguous.
First, single-byte (ASCII, or ISO-8859-1, or Windows-1251) strings need to be processed differently that the UTF-8 ones that may encode every character with a variable number of bytes. So we need to know what is the character set type to interpret the raw bytes as meaningful characters properly.
Second, we additionally need to know the language-specific string sorting rules. For instance, when sorting according to US rules in en_US locale, the accented character ï
(small letter i
with diaeresis) should be placed somewhere after z
. However, when sorting with French rules and fr_FR locale in mind, it should be placed between i
and j
. And some other set of rules might choose to ignore accents at all, allowing ï
and i
to be mixed arbitrarily.
Third, but not least, we might need case-sensitive sorting in some scenarios and case-insensitive sorting in some others.
Collations combine all of the above: the character set, the language rules, and the case sensitivity. Manticore currently provides the following four collations.
libc_ci
libc_cs
utf8_general_ci
binary
The first two collations rely on several standard C library (libc) calls and can thus support any locale that is installed on your system. They provide case-insensitive (_ci
) and case-sensitive (_cs
) comparisons respectively. By default they will use C locale, effectively resorting to bytewise comparisons. To change that, you need to specify a different available locale using collation_libc_locale directive. The list of locales available on your system can usually be obtained with the locale
command:
$ locale -a
C
en_AG
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_NG
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZW.utf8
es_ES
fr_FR
POSIX
ru_RU.utf8
ru_UA.utf8
The specific list of the system locales may vary. Consult your OS documentation to install additional needed locales.
utf8_general_ci
and binary
locales are built-in into Manticore. The first one is a generic collation for UTF-8 data (without any so-called language tailoring); it should behave similar to utf8_general_ci
collation in MySQL. The second one is a simple bytewise comparison.
Collation can be overridden via SQL on a per-session basis using SET collation_connection
statement. All subsequent SQL queries will use this collation. Otherwise all queries will use the server default collation or as specified in collation_server configuration directive. Manticore currently defaults to libc_ci
collation.
Collations affect all string attribute comparisons, including those within ORDER BY
and GROUP BY
, so differently ordered or grouped results can be returned depending on the collation chosen. Note that collations don’t affect full-text searching, for that use charset_table
Cost-based optimizer
When Manticore executes a fullscan query, it can either use plain scan to check every document against the filters, or it can use additional data and/or algorithms to speed up the query execution. To decide which approach to take, Manticore uses a query cost-based optimizer (“CBO” also known as “query optimizer”).
The CBO may decide to replace one or more query filters with one of the following entities if it determines that it will improve performance:
- A docid index, which uses a special docid-only secondary index stored in files with the
.spt
extension. In addition to improving filters on document ids, the docid index is also used to speed up document id to row id lookups, and to speed up the application of large killlists on daemon startup. - A columnar scan, which uses columnar storage and can only be used on a columnar attribute. It still scans every value and tests it against the filter, but it is heavily optimized and is usually faster than the default approach.
- Secondary indexes, which are generated for all attributes by default. They use the PGM index together with Manticore’s built-in inverted index to retrieve the list of row ids corresponding to a value or range of values. The secondary indexes are stored in files with the
.spidx
extension.
The optimizer estimates the cost of each execution path using different attribute statistics, including:
- Information on the data distribution within an attribute (histograms, stored in
.sphi
files). Histograms are generated automatically when data is indexed and are the main source of information for the CBO. - Information from PGM (secondary indexes), which is used to estimate the number of document lists to read. This helps to estimate doclist merge performance and to choose the correct merge algorithm (priority queue merge or bitmap merge).
- Columnar encoding statistics, which are used to estimate columnar data decompression performance.
- A columnar min-max tree. The CBO uses histograms to estimate the number of documents left after the filter was applied, but it also needs to estimate how many documents the filter had to process. For columnar attributes, partial evaluation of the min-max tree is used for that purpose.
The optimizer calculates the execution cost for every filter used in a query. Because certain filters can be replaced with several different entities (e.g., for a document id, Manticore can use a plain scan, a docid index lookup, a columnar scan (if the document id is columnar), and a secondary index), the optimizer evaluates every available combination. Note that there is a maximum limit of 1024 combinations.
To estimate query execution costs, the optimizer calculates the estimated costs of the most significant operations that are performed when the query is executed. It uses preset constants to represent the cost of each operation.
The optimizer compares the costs of each execution path and chooses the path with the lowest cost to execute the query.
Another thing to consider is multithreaded query execution (when pseudo_sharding
is enabled). The CBO knows that some queries can be executed in multiple threads and takes that into account. The CBO favors smaller query execution times (i.e., latency) over throughput. For example, if a query using a columnar scan can be executed in multiple threads (and occupy multiple CPU cores) and is faster than a query executed in a single thread using secondary indexes, multithreaded execution will be preferred.
Queries using secondary indexes and docid indexes always run in a single thread, as benchmarks show that there is little to no benefit in making them multithreaded.
Currently, the optimizer only uses CPU costs and does not consider memory or disk usage.
Updating table schema
Updating table schema in RT mode
ALTER TABLE table ADD COLUMN column_name [{INTEGER|INT|BIGINT|FLOAT|BOOL|MULTI|MULTI64|JSON|STRING|TIMESTAMP|TEXT [INDEXED [ATTRIBUTE]]}] [engine='columnar']
ALTER TABLE table DROP COLUMN column_name
It supports adding one field at a time for RT tables. Supported data types are:
int
- integer attributetimestamp
- timestamp attributebigint
- big integer attributefloat
- float attributebool
- boolean attributemulti
- multi-valued integer attributemulti64
- multi-valued bigint attributejson
- json attributestring
/text attribute
/string attribute
- string attributetext
/text indexed stored
/string indexed stored
- full-text indexed field with original value stored in docstoretext indexed
/string indexed
- full-text indexed field, indexed only (the original value is not stored in docstore)text indexed attribute
/string indexed attribute
- full text indexed field + string attribute (not storing the original value in docstore)text stored
/string stored
- the value will be only stored in docstore, not full-text indexed, not a string attribute- adding
engine='columnar'
to any attribute (except for json) will make it stored in the columnar storage
Important notes:
- ❗It’s recommended to backup table files before
ALTER
ing it to avoid data corruption in case of a sudden power interruption or other similar issues. - Querying a table is impossible while a column is being added.
- Newly created attribute’s values are set to 0.
ALTER
will not work for distributed tables and tables without any attributes.DROP COLUMN
will fail if a table has only one field.- When dropping a field which is both a full-text field and a string attribute the first
ALTER DROP
drops the attribute, the second one drops the full-text field. - Adding/dropping full-text field is only supported in the RT mode.
- Example
Example
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
+------------+-----------+
mysql> alter table rt add column test integer;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| group_id | uint |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt drop column group_id;
mysql> desc rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | bigint |
| text | field |
| date_added | timestamp |
| test | uint |
+------------+-----------+
mysql> alter table rt add column title text indexed;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt add column title text attribute;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
| title | string | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| title | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
mysql> alter table rt drop column title;
mysql> desc rt;
+------------+-----------+------------+
| Field | Type | Properties |
+------------+-----------+------------+
| id | bigint | |
| text | text | indexed |
| date_added | timestamp | |
| test | uint | |
+------------+-----------+------------+
Updating table FT settings in RT mode
ALTER TABLE table ft_setting='value'[, ft_setting2='value']
You can also use ALTER
to modify full-text settings of your table in the RT mode. Just remember that it doesn’t affect existing documents, it only affects new ones. Take a look at the example where we:
- create a table with a full-text field and
charset_table
that allows only 3 searchable characters:a
,b
andc
. - then we insert document ‘abcd’ and find it by query
abcd
, thed
just gets ignored since it’s not in thecharset_table
array - then we understand, that we want
d
to be searchable too, so we add it with help ofALTER
- but the same query
where match('abcd')
still says it searched byabc
, because the existing document remembers previous contents ofcharset_table
- then we add another document
abcd
and search byabcd
again - now it finds the both documents and
show meta
says it used two keywords:abc
(to find the old document) andabcd
(for the new one).
- Example
Example
mysql> create table rt(title text) charset_table='a,b,c';
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> alter table rt charset_table='a,b,c,d';
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 1 |
| total_found | 1 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
+---------------+-------+
mysql> insert into rt(title) values('abcd');
mysql> select * from rt where match('abcd');
+---------------------+-------+
| id | title |
+---------------------+-------+
| 1514630637682688055 | abcd |
| 1514630637682688054 | abcd |
+---------------------+-------+
mysql> show meta;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| total | 2 |
| total_found | 2 |
| time | 0.000 |
| keyword[0] | abc |
| docs[0] | 1 |
| hits[0] | 1 |
| keyword[1] | abcd |
| docs[1] | 1 |
| hits[1] | 1 |
+---------------+-------+
Updating table FT settings in plain mode
ALTER TABLE table RECONFIGURE
ALTER
can also reconfigure an RT table in the plain mode), so that new tokenization, morphology and other text processing settings from the configuration file take effect for new documents. Note, that the existing document will be left intact. Internally, it forcibly saves the current RAM chunk as a new disk chunk and adjusts the table header, so that new documents are tokenized using the updated full-text settings.
- Example
Example
mysql> show table rt settings;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| settings | |
+---------------+-------+
1 row in set (0.00 sec)
mysql> alter table rt reconfigure;
Query OK, 0 rows affected (0.00 sec)
mysql> show table rt settings;
+---------------+----------------------+
| Variable_name | Value |
+---------------+----------------------+
| settings | morphology = stem_en |
+---------------+----------------------+
1 row in set (0.00 sec)
Rebuild secondary index
ALTER TABLE table REBUILD SECONDARY
ALTER
can also be used to rebuild secondary indexes in a given table. Sometimes a secondary index can be disabled for the whole table or for one/multiple attributes in it:
- On
UPDATE
of an attribute: in this case its secondary index gets disabled. - In case Manticore loads a table with old formatted secondary indexes: in this case secondary indexes will be disabled for the whole table.
ALTER TABLE table REBUILD SECONDARY
rebuilds secondary indexes from attribute data and enables them again.
- Example
Example
ALTER TABLE rt REBUILD SECONDARY;
Response
Query OK, 0 rows affected (0.00 sec)