Merging tables
Merging two existing plain tables can be more efficient than indexing the data from scratch and desired in some cases (such as merging ‘main’ and ‘delta’ tables instead of simply rebuilding ‘main’ in the ‘main+delta’ partitioning scheme). So indexer
has an option to do that. Merging tables is normally faster than rebuilding, but still not instant on huge tables. Basically, it will need to read the contents of the both tables once and write the result once. Merging 100 GB and 1 GB table, for example, will result in 202 GB of I/O (but that’s still likely less than the indexing from scratch requires).
The basic command syntax is as follows:
sudo -u manticore indexer --merge DSTINDEX SRCINDEX [--rotate] [--drop-src]
Unless --drop-src
is specified only the DSTINDEX table will be affected: the contents of SRCINDEX will be merged into it.
--rotate
switch will be required if DSTINDEX is already being served by searchd
.
The typical usage pattern is to merge a smaller update from SRCINDEX into DSTINDEX. Thus, when merging attributes the values from SRCINDEX will win if duplicate document IDs are encountered. Note, however, that the “old” keywords will not be automatically removed in such cases. For example, if there’s a keyword “old” associated with document 123 in DSTINDEX, and a keyword “new” associated with it in SRCINDEX, document 123 will be found by both keywords after the merge. You can supply an explicit condition to remove documents from DSTINDEX to mitigate that; the relevant switch is --merge-dst-range
:
sudo -u manticore indexer --merge main delta --merge-dst-range deleted 0 0
This switch lets you apply filters to the destination table along with merging. There can be several filters; all of their conditions must be met in order to include the document in the resulting merged table. In the example above, the filter passes only those records where ‘deleted’ is 0, eliminating all records that were flagged as deleted.
--drop-src
allows dropping SRCINDEX after the merge and before rotating the tables, which is important in case you specify DSTINDEX in killlist_target
of DSTINDEX, otherwise when rotating the tables the documents that have been merged into DSTINDEX may be suppressed by SRCINDEX.
Killlist in plain tables
When using plain tables there is a problem generated by the need of having the data in the table as fresh as possible.
In this case one or more secondary (also know as delta) tables are used to capture the modified data between the time the main table was created and and current time. The modified data can mean new, updated or deleted documents. The search becomes a search over the main table and the delta table. This works with no obstacle when you just add new documents to the delta table, but when it comes to updated or deleted documents there remains the following issue.
If a document is present in both main and delta tables it can cause issues at searching as the engine will see two versions of a document and won’t know how to pick the right one. So the delta needs to tell somehow to the search that there are deleted documents in the main table that should be forgotten. Here comes kill lists.
Table kill-list
Table can maintain a list of document ids that can be used to suppress records in other tables. This feature is available for plain tables using database sources or plain tables using XML sources. In case of database sources, the source needs to provide an additional query defined by sql_query_killlist
. It will store in the table a list of documents that can be used by the server to remove documents from other plain tables.
This query is expected to return a number of 1-column rows, each containing just the document ID.
In many cases the query is a union between a query that gets a list of updated documents and a list of deleted documents, e.g.:
sql_query_killlist = \
SELECT id FROM documents WHERE updated_ts>=@last_reindex UNION \
SELECT id FROM documents_deleted WHERE deleted_ts>=@last_reindex
Removing documents in a plain table
A plain table can contain a directive called killlist_target that will tell the server it can provide a list of document ids that should be removed from certain existing tables. The table can use either it’s document ids as the source for this list or provide a separate list.
killlist_target
Sets the table(s) that the kill-list will be applied to. Optional, default value is empty.
When you use plain tables you often need to maintain not a single table, but a set of them to be able to add/update/delete new documents sooner (read about delta table updates). In order to suppress matches in the previous (main) table that were updated or deleted in the next (delta) table you need to:
- Create a kill-list in the delta table using sql_query_killlist
- Specify main table as
killlist_target
in delta table settings:
- CONFIG
CONFIG
table products {
killlist_target = main:kl
path = products
source = src_base
}
When killlist_target
is specified, kill-list is applied to all the tables listed in it on searchd
startup. If any of the tables from killlist_target
are rotated, kill-list is reapplied to these tables. When kill-list is applied, tables that were affected save these changes to disk.
killlist_target
has 3 modes of operation:
killlist_target = main:kl
. Document ids from the kill-list of the delta table are suppressed in the main table (seesql_query_killlist
).killlist_target = main:id
. All document ids from delta table are suppressed in the main table. Kill-list is ignored.killlist_target = main
. Both document ids from delta table and its kill-list are suppressed in the main table.
Multiple targets can be specified separated by comma like
killlist_target = table_one:kl,table_two:kl
You can change killlist_target
settings for a table without rebuilding it by using ALTER
.
But since the ‘old’ main table has already written the changes to disk, the documents that were deleted in it will remain deleted even if it is no longer in the killlist_target
of the delta table.
- SQL
- HTTP
SQL HTTP
ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'
POST /cli -d "
ALTER TABLE delta KILLLIST_TARGET='new_main_table:kl'"
Attaching a plain table to a real-time table
A plain table can be converted into a real-time table or added to an existing real-time table.
The first case is useful when you need to regenerated a real-time table completely which may be needed for example if tokenization settings need an update. Then preparing a plain table and converting it into a real-time table may be easier than preparing a batch job to perform INSERTs for adding all the data into a real-time table.
In the second you normally want to add a large bulk of new data to a real-time table and again creating a plain table with that data is easier than populating the existing real-time table.
Attaching table - general syntax
The ATTACH
statement allows to convert a plain table to be attached to an existing real-time table.
ATTACH TABLE plain_table TO TABLE rt_table [WITH TRUNCATE]
ATTACH TABLE
statement lets you move data from a plain table to an RT table.
After a successful ATTACH
the data originally stored in the source plain table becomes a part of the target RT table, and the source plain table becomes unavailable (until the next rebuild). ATTACH
does not result in any table data changes. Basically, it just renames the files (making the source table a new disk chunk of the target RT table) and updates the metadata. So it is a generally quick operation which might (frequently) complete as fast as under a second.
Note that when a table is attached to an empty RT table the fields, attributes and text processing settings (tokenizer, wordforms, etc) from the source table are copied over and take effect. The respective parts of the RT table definition from the configuration file will be ignored.
When TRUNCATE
option is used RT table gets truncated prior to attaching source plain table. This allows to make operation atomic or make sure that the attached source plain table will be the only data in the target RT table.
ATTACH TABLE
comes with a number of restrictions. Most notably, the target RT table is currently required to be either empty or have the same settings as the source plain table. In case the source plain table gets attached to a non-empty RT table the RT table data collected so far gets stored as a regular disk chunk and table being attached becomes the newest disk chunk and documents with same IDs get killed. The complete list is as follows:
- Target RT table needs to be either empty or have same settings
- Source plain table needs to have phrase_boundary_step\=0, stopword_step\=1.
- Example
Example
Before ATTACH the RT table is empty and has 3 fields:
mysql> DESC rt;
Empty set (0.00 sec)
mysql> SELECT * FROM rt;
+-----------+---------+
| Field | Type |
+-----------+---------+
| id | integer |
| testfield | field |
| testattr | uint |
+-----------+---------+
3 rows in set (0.00 sec)
The plain table is not empty:
mysql> SELECT * FROM plain WHERE MATCH('test');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 1304 | 1 | 1313643256 |
| 2 | 1304 | 1 | 1313643256 |
| 3 | 1304 | 1 | 1313643256 |
| 4 | 1304 | 1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)
Attaching:
mysql> ATTACH TABLE plain TO TABLE rt;
Query OK, 0 rows affected (0.00 sec)
The RT table now has 5 fields:
mysql> DESC rt;
+------------+-----------+
| Field | Type |
+------------+-----------+
| id | integer |
| title | field |
| content | field |
| group_id | uint |
| date_added | timestamp |
+------------+-----------+
5 rows in set (0.00 sec)
And it’s not empty:
mysql> SELECT * FROM rt WHERE MATCH('test');
+------+--------+----------+------------+
| id | weight | group_id | date_added |
+------+--------+----------+------------+
| 1 | 1304 | 1 | 1313643256 |
| 2 | 1304 | 1 | 1313643256 |
| 3 | 1304 | 1 | 1313643256 |
| 4 | 1304 | 1 | 1313643256 |
+------+--------+----------+------------+
4 rows in set (0.00 sec)
The plain table was removed:
mysql> SELECT * FROM plain WHERE MATCH('test');
ERROR 1064 (42000): no enabled local indexes to search