Indexers on Jina Hub
Indexers are a subtype of Hub Executors that store or retrieve data. They are designed to replace DocumentArray and DocumentArrayMemmap in large-scale applications.
They are split by usage and interface. These types are:
Storage
This category stores data behind a CRUD-like interface. These Executors are reliable and performant in write/read/update/delete operations. They can only search by a Document’s id.
Example Hub Executors:
Vector Searcher
These usually implement a form of similarity search, based on the embeddings created by the encoders you have chosen in your Flow.
Example Hub Executors:
Compound Indexer
Compound indexers are composed of a Searcher, for storing vectors and computing matches, and a Storage, for retrieving the document’s original metadata.
Example Hub Executors:
If you want to develop a composite like these, check the guide here.
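To make the split of responsibilities concrete, here is a minimal pure-Python sketch of a compound indexer. It does not use Jina or any Hub Executor. The class names (`ToySearcher`, `ToyStorage`, `ToyCompoundIndexer`), the dict-based metadata, and the cosine-similarity scoring are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToySearcher:
    """Hypothetical Searcher: keeps only ids and vectors, computes matches."""

    def __init__(self):
        self.vectors = {}  # doc id -> embedding

    def add(self, doc_id, embedding):
        self.vectors[doc_id] = embedding

    def search(self, query, top_k):
        # Exhaustive scan over all stored vectors, best score first.
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self.vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


class ToyStorage:
    """Hypothetical Storage: keeps the original metadata, looked up by id only."""

    def __init__(self):
        self.metadata = {}  # doc id -> metadata dict

    def put(self, doc_id, metadata):
        self.metadata[doc_id] = metadata

    def get(self, doc_id):
        return self.metadata[doc_id]


class ToyCompoundIndexer:
    """Combines both: the Searcher finds ids, the Storage fills in metadata."""

    def __init__(self):
        self.searcher = ToySearcher()
        self.storage = ToyStorage()

    def index(self, doc_id, embedding, metadata):
        self.searcher.add(doc_id, embedding)
        self.storage.put(doc_id, metadata)

    def search(self, query, top_k=2):
        matches = self.searcher.search(query, top_k)
        return [{'id': doc_id, 'score': score, **self.storage.get(doc_id)}
                for doc_id, score in matches]


indexer = ToyCompoundIndexer()
indexer.index('a', [1.0, 0.0], {'text': 'first'})
indexer.index('b', [0.0, 1.0], {'text': 'second'})
indexer.index('c', [1.0, 1.0], {'text': 'third'})
results = indexer.search([1.0, 0.1], top_k=2)
```

The Searcher never sees the metadata and the Storage never computes similarity, which is the separation the compound indexers above rely on.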
Tip
Besides these, there are other special Indexers:
SimpleIndexer. This uses the Jina built-in DocumentArrayMemmap class to store data on disk and read it into RAM as needed.
DocCache. It is not used for storing and retrieving data directly, but for caching and avoiding duplication of data during the indexing process.
MatchMerger. It is used for merging the results retrieved from shards. It merges the shard results by aggregating all matches under the corresponding Document in the original search request.
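The merging step described for MatchMerger can be sketched in plain Python. This is not the Executor's actual code: the function name `merge_shard_matches`, the `(match_id, score)` tuples, and the choice to sort by descending score after aggregation are assumptions for illustration.

```python
from collections import defaultdict


def merge_shard_matches(shard_results, top_k=None):
    """Aggregate the matches returned by each shard under their query Document.

    shard_results: one dict per shard, mapping query id -> [(match_id, score), ...].
    Assumption: a higher score means a better match.
    """
    merged = defaultdict(list)
    for shard in shard_results:
        for query_id, matches in shard.items():
            merged[query_id].extend(matches)

    result = {}
    for query_id, matches in merged.items():
        ranked = sorted(matches, key=lambda match: match[1], reverse=True)
        result[query_id] = ranked if top_k is None else ranked[:top_k]
    return result


shard_0 = {'q1': [('d1', 0.9), ('d3', 0.4)]}
shard_1 = {'q1': [('d2', 0.7)], 'q2': [('d5', 0.2)]}
merged = merge_shard_matches([shard_0, shard_1])
merged_top_2 = merge_shard_matches([shard_0, shard_1], top_k=2)
```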
CRUD operations and the Executor endpoints
The Executors implemented and provided by Jina implement a CRUD interface as follows:
Operation | Endpoint | Implemented in |
---|---|---|
Create | /index | Storage |
Read | /search | Searcher |
Update | /update | Storage |
Delete | /delete | Storage |
The Create, Update, and Delete operations are implemented by the Storage Indexers, while the Read operation is implemented by the /search endpoints of the Search Indexers. The /search endpoints do not correspond perfectly to the Read operation: they search for similar results and do not return a specific Document by id. Some Indexers do implement a /fill_embedding endpoint, which functions as a Read by id. Please refer to the specific documentation or implementation of the Executor for details.
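The Storage side of the table above can be pictured as a small endpoint-to-handler mapping. The sketch below is plain Python, not a Jina Executor: `ToyStorageIndexer`, `ENDPOINTS`, and `dispatch` are hypothetical names, and Documents are modelled as plain dicts. The /search endpoint is left out because it belongs to the Searcher.

```python
class ToyStorageIndexer:
    """Hypothetical Storage Indexer covering Create, Update, and Delete."""

    def __init__(self):
        self.docs = {}  # doc id -> doc dict

    def index(self, docs):
        # Create: store every incoming Document under its id.
        for doc in docs:
            self.docs[doc['id']] = doc

    def update(self, docs):
        # Update: overwrite only Documents whose id is already stored.
        for doc in docs:
            if doc['id'] in self.docs:
                self.docs[doc['id']] = doc

    def delete(self, docs):
        # Delete: remove by id, ignoring ids that are not stored.
        for doc in docs:
            self.docs.pop(doc['id'], None)


storage = ToyStorageIndexer()
ENDPOINTS = {
    '/index': storage.index,
    '/update': storage.update,
    '/delete': storage.delete,
}


def dispatch(endpoint, docs):
    """Route a request to the handler registered for the endpoint."""
    return ENDPOINTS[endpoint](docs)


dispatch('/index', [{'id': '1', 'text': 'hello'}, {'id': '2', 'text': 'world'}])
dispatch('/update', [{'id': '1', 'text': 'hello again'}, {'id': '9', 'text': 'ignored'}])
dispatch('/delete', [{'id': '2'}])
```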
Recommended All-in-one Indexers
We recommend you use one of our pre-built full solution Indexers. These support both CRUD and Search operations without the need for any manual configuration or further operations. They are:
SimpleIndexer. This works well with small amounts of data, <200k. Recommended usage is for experimentation and early prototyping. It uses exhaustive search.
HNSWPostgresIndexer. This is a robust, scalable Indexer that supports replication and sharding, and uses the HNSWlib library for approximate nearest neighbor search.
Check their respective documentation pages for more information.
Your Custom Solution
If, on the other hand, you want to use another combination of Storage Indexer and Search Indexer, you will need to do some manual patching. The recommended usage of these Executors is to split them into Indexing vs Search Flows. In the Indexing Flow, you perform write, update, and delete. In order to search them, you need to start a Search Flow, dump the data from the Index Flow, and load it into the Search Flow.
See the figure below for how this looks:
In the above case, the Storage could be based on MongoDBStorage, while the Search Flow could be based on HnswlibSearcher.
Tip
For showcase code, check our integration tests.
The split between indexing and search Flows allows you to continuously serve requests in your application (in the search Flow), while still being able to write or modify the underlying data. Then when you want to update the state of the searchable data for your users, you perform a dump and rolling update.
Synchronizing State via Rolling Update
The communication between index and search Flows is done via this pair of actions. The dump action tells the indexer to export its internal data (from whatever format it stores it in) to a disk location, optimized to be read by the shards in your search Flow. At the other end, the rolling update tells the search Flow to recreate its internal state with the new version of the data.
Looking at the test, we can see how this is called:
flow_storage.post(
    on='/dump',
    target_executor='indexer_storage',
    parameters={
        'dump_path': dump_path,
        'shards': shards,
        'timeout': -1,
    },
)
where
flow_storage is the Flow with the storage Indexer
target_executor is the name of the Executor, defined in your flow.yml
dump_path is the path (on local disk) where you want the data to be stored. NOTE: The folder needs to be empty. Otherwise, the dump will be cancelled.
shards is the number of shards you have in your search Flow. NOTE: This doesn’t change the value in the Flow. You need to keep track of how you configured your search Flow.
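The two notes on dump_path and shards can be illustrated with a small standalone sketch. It does not reproduce the dump format used by any Jina Indexer: `toy_dump`, `toy_load_shard`, the JSON files, and the round-robin split are assumptions chosen to keep the example short.

```python
import json
import os
import tempfile


def toy_dump(docs, dump_path, shards):
    """Write docs (id -> embedding) into one file per shard.

    Returns False without writing anything if dump_path is not empty,
    mirroring the note that a non-empty folder cancels the dump.
    """
    os.makedirs(dump_path, exist_ok=True)
    if os.listdir(dump_path):
        return False

    # Assumption: Documents are spread round-robin across the shards.
    buckets = [[] for _ in range(shards)]
    for position, (doc_id, embedding) in enumerate(sorted(docs.items())):
        buckets[position % shards].append({'id': doc_id, 'embedding': embedding})

    for shard_id, bucket in enumerate(buckets):
        with open(os.path.join(dump_path, f'shard_{shard_id}.json'), 'w') as fp:
            json.dump(bucket, fp)
    return True


def toy_load_shard(dump_path, shard_id):
    """Read back the slice of the dump that belongs to one search shard."""
    with open(os.path.join(dump_path, f'shard_{shard_id}.json')) as fp:
        return json.load(fp)


docs = {'a': [0.1], 'b': [0.2], 'c': [0.3]}
with tempfile.TemporaryDirectory() as dump_path:
    first_dump = toy_dump(docs, dump_path, shards=2)
    dumped_files = sorted(os.listdir(dump_path))
    shard_0_ids = [doc['id'] for doc in toy_load_shard(dump_path, 0)]
    shard_1_ids = [doc['id'] for doc in toy_load_shard(dump_path, 1)]
    # The folder now contains files, so a second dump is refused.
    second_dump = toy_dump(docs, dump_path, shards=2)
```

Because the number of files is fixed at dump time, the `shards` value passed to the dump has to match the number of shards the search Flow was started with.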
For performing the rolling update, we can see the usage in the same test:
flow_query.rolling_update(deployment_name='indexer_query', uses_with={'dump_path': dump_path})
where
flow_query is the Flow with the searcher Indexer
deployment_name is the name of the Executor, defined in your flow.yml
dump_path is the folder where you exported the data, from the above dump call
Note
dump_path needs to be accessible by local reference. It can, however, be a network location or an internal Docker location that you have mapped.
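The general idea behind a rolling update, reloading state without taking the whole search Flow offline, can be sketched as follows. This is the generic pattern only and an assumption about the mechanics, not a description of Jina's internals: `ToyReplica` and `toy_rolling_update` are hypothetical names, and the "data" is just a version string.

```python
class ToyReplica:
    """Hypothetical search replica holding one version of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.online = True


def toy_rolling_update(replicas, new_data):
    """Reload replicas one at a time.

    Returns, for each step, how many replicas were still online while
    one of them was being reloaded.
    """
    online_during_step = []
    for replica in replicas:
        replica.online = False  # take a single replica out of rotation
        online_during_step.append(sum(r.online for r in replicas))
        replica.data = new_data  # recreate its state from the new dump
        replica.online = True  # put it back before touching the next one
    return online_during_step


replicas = [ToyReplica('r0', 'dump_v1'), ToyReplica('r1', 'dump_v1')]
online_counts = toy_rolling_update(replicas, 'dump_v2')
```

With two replicas, one stays online at every step, so search requests can keep being served while the data is swapped.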
Indexer Cheat Sheet
Index Size | RPS | Latency p95 | Best Indexer + configuration |
---|---|---|---|
fit into memory | < 20 | any | SimpleIndexer + use default |
any | > 20 | any | HNSWPostgresIndexer + use k8s & replicas |
not fit into memory | any | any | HNSWPostgresIndexer + use shards |
not fit into memory | > 20 | any | HNSWPostgresIndexer + use k8s & shards & replicas |
any | any | small | HNSWPostgresIndexer + use k8s & shards & replicas |
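One possible reading of the table as a decision function is sketched below. The function name `suggest_indexer`, the order in which the rows are checked, and the treatment of exactly 20 RPS (which the table does not cover) are assumptions, so treat this as a starting point rather than an official rule.

```python
def suggest_indexer(fits_in_memory, rps, low_latency_required=False):
    """Return (indexer, configuration) following the cheat sheet rows.

    Assumption: the most demanding matching row wins, and an RPS of
    exactly 20 is treated like the "< 20" case.
    """
    if low_latency_required:
        return 'HNSWPostgresIndexer', 'k8s & shards & replicas'
    if not fits_in_memory and rps > 20:
        return 'HNSWPostgresIndexer', 'k8s & shards & replicas'
    if not fits_in_memory:
        return 'HNSWPostgresIndexer', 'shards'
    if rps > 20:
        return 'HNSWPostgresIndexer', 'k8s & replicas'
    return 'SimpleIndexer', 'default'
```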
The Jina Hub offers multiple Indexers for different use cases. In many production use cases, Indexers rely heavily on shards and replicas. There are four major questions to answer when deciding on an Indexer and its configuration.
Does my data fit into memory?
Estimate the total number N of Documents that you want to index. Understand the average size x of a Document. Does N * x fit into memory?
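The N * x check is simple arithmetic, sketched here in Python. The numbers are made up for illustration (512-dimensional float32 embeddings, 16 GiB of RAM), and the estimate ignores metadata and any overhead of the index structure itself.

```python
def fits_into_memory(num_docs, avg_doc_bytes, available_bytes):
    """Return True if N * x is at most the memory you are willing to use."""
    return num_docs * avg_doc_bytes <= available_bytes


# Assumed example: each Document is dominated by a 512-dim float32 embedding.
avg_doc_bytes = 512 * 4  # 2048 bytes per Document
available_bytes = 16 * 1024 ** 3  # 16 GiB

one_million = fits_into_memory(1_000_000, avg_doc_bytes, available_bytes)
hundred_million = fits_into_memory(100_000_000, avg_doc_bytes, available_bytes)
```

Under these assumptions, one million Documents need about 2 GB and fit, while one hundred million need about 205 GB and do not.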
How many requests per second (RPS) does the system need to handle?
RPS is typically used for knowing how big to scale distributed systems. Depending on your use-case you might have completely different RPS expectations.
What latency do your users expect?
Latency is typically measured via the p95 or p99 percentile, that is, the time within which 95% or 99% of the requests are answered.
Tip
A webshop might want a really low latency in order to increase user experience. A high-quality Q&A chatbot might be OK with having answers only after one or even several seconds.
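To see what a p95 figure means in practice, the sketch below computes percentiles from a list of measured latencies using the nearest-rank method. The helper name `percentile` and the sample latencies are assumptions. Monitoring tools may use a slightly different interpolation.

```python
import math


def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError('need at least one latency sample')
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Assumed sample: mostly fast requests with two slow outliers.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 18, 17, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Here the median is 15 ms but the p95 is 900 ms, which is why tail percentiles rather than averages are used to describe what users experience.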
Do you need instant failure recovery?
When running any service in the cloud, an underlying machine could die at any time. Usually, a new machine will spawn and take over. However, this might take a few minutes. If you need instant failure recovery, you need to use replicas. Jina provides this via the HNSWPostgresIndexer in combination with replicas inside Kubernetes (k8s).