Segment replication
With segment replication, segment files are copied across shards instead of documents being indexed on each shard copy. This improves indexing throughput and lowers resource utilization at the expense of increased network utilization.
When the primary shard sends a checkpoint to replica shards on a refresh, a new segment replication event is triggered on replica shards. This happens:
- When a new replica shard is added to a cluster.
- When there are segment file changes on a primary shard refresh.
- During peer recovery, such as replica shard recovery and shard relocation (explicit allocation using the move allocation command or automatic shard rebalancing).
Segment replication is the first feature in a series of features designed to decouple reads and writes in order to lower compute costs.
Use cases
- Users who have high write loads but do not have high search requirements and are comfortable with longer refresh times.
- Users with very high loads who want to add new nodes, because documents do not need to be indexed again on each node when a new node is added to the cluster.
- OpenSearch cluster deployments with low replica counts, such as those used for log analytics.
Segment replication configuration
Setting the default replication type for a cluster affects all newly created indexes. However, you can specify a different replication type when creating an index. Index-level settings always override cluster-level settings.
Creating an index with the segment replication type
To set segment replication as the replication strategy for an index, create the index with replication.type set to SEGMENT:
PUT /my-index1
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT"
    }
  }
}
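The same request can also be issued from a script. The following is a minimal sketch that uses only the Python standard library; the helper name `build_create_index_request` and the base URL are illustrative assumptions, not part of OpenSearch, and authentication and TLS are left out.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, replication_type="SEGMENT"):
    """Build (but do not send) a PUT request that creates an index
    with the given replication.type setting."""
    body = {"settings": {"index": {"replication.type": replication_type}}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical endpoint; replace it with your cluster address.
request = build_create_index_request("http://localhost:9200", "my-index1")
```

To send the request, pass `request` to `urllib.request.urlopen`. Building the request separately from sending it makes the payload easy to inspect before it reaches the cluster.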
In segment replication, the primary shard usually generates more network traffic than the replicas because it copies segment files to the replicas. Thus, it’s beneficial to distribute primary shards equally between the nodes. To ensure balanced primary shard distribution, set the dynamic cluster.routing.allocation.balance.prefer_primary setting to true. For more information, see Cluster settings.
Segment replication currently does not support the wait_for value in the refresh query parameter.
For the best performance, we recommend enabling both of the following settings:
- Segment replication backpressure.
- Balanced primary shard allocation:
curl -X PUT "$host/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.balance.prefer_primary": true,
    "segrep.pressure.enabled": true
  }
}
'
Setting the replication type on a cluster
You can set the default replication type for newly created cluster indexes in the opensearch.yml file as follows:
cluster.indices.replication.strategy: 'SEGMENT'
This cluster-level setting cannot be enabled through the REST API.
This setting is not applied to system indexes and hidden indexes. By default, all system and hidden indexes in OpenSearch will still use document replication even if this setting is enabled.
Creating an index with the document replication type
Even when the default replication type is set to segment replication, you can create an index that uses document replication by setting replication.type to DOCUMENT:
PUT /my-index1
{
  "settings": {
    "index": {
      "replication.type": "DOCUMENT"
    }
  }
}
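The precedence rules described in this section can be summarized in a small model. This is an illustrative sketch only, not OpenSearch code: the function name and its parameters are hypothetical, and giving an explicit index-level setting priority on a system or hidden index is an assumption of the model.

```python
from typing import Optional

DOCUMENT = "DOCUMENT"
SEGMENT = "SEGMENT"


def effective_replication_type(
    cluster_default: str = DOCUMENT,
    index_setting: Optional[str] = None,
    is_system_or_hidden: bool = False,
) -> str:
    """Return the replication type a new index would use in this simplified model."""
    # An explicit index-level replication.type overrides the cluster default.
    # (Assumption: the model applies this rule to every index.)
    if index_setting is not None:
        return index_setting
    # System and hidden indexes do not pick up the cluster-level default
    # and stay on document replication.
    if is_system_or_hidden:
        return DOCUMENT
    # Otherwise the cluster-level default applies.
    return cluster_default


# Cluster default is SEGMENT, but the index explicitly asks for DOCUMENT.
print(effective_replication_type(SEGMENT, index_setting=DOCUMENT))
```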
Considerations
When using segment replication, consider the following:
- Enabling segment replication for an existing index requires reindexing.
- Cross-cluster replication does not currently use segment replication to copy between clusters.
- Segment replication leads to increased network congestion on primary shards. See Issue - Optimize network bandwidth on primary shards.
- Integration with remote-backed storage as the source of replication is currently not supported.
- Read-after-write guarantees: The wait_until refresh policy is not compatible with segment replication. If you use the wait_until refresh policy while ingesting documents, you’ll get a response only after the primary node has refreshed and made those documents searchable. Replica shards will respond only after having written to their local translog. We are exploring other mechanisms for providing read-after-write guarantees. For more information, see the corresponding GitHub issue.
- System indexes will continue to use document replication internally until read-after-write guarantees are available. In this case, document replication does not hinder the overall performance because there are few system indexes.
Benchmarks
During initial benchmarks, segment replication users reported 40% higher throughput than when using document replication with the same cluster setup.
The following benchmarks were collected with OpenSearch Benchmark using the stackoverflow and nyc_taxi datasets.
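The benchmarks demonstrate the effect of the following configurations on segment replication:
- Increasing the workload size
- Increasing the number of primary shards
- Increasing the number of replicas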
Your results may vary based on the cluster topology, hardware used, shard count, and merge settings.
Increasing the workload size
The following table lists benchmarking results for the nyc_taxi dataset with the following configuration:
- 10 m5.xlarge data nodes
- 40 primary shards, 1 replica each (80 shards total)
- 4 primary shards and 4 replica shards per node
Metric | Statistic | Document Replication (40 GB primary shard, 80 GB total) | Segment Replication (40 GB primary shard, 80 GB total) | Percent difference | Document Replication (240 GB primary shard, 480 GB total) | Segment Replication (240 GB primary shard, 480 GB total) | Percent difference
---|---|---|---|---|---|---|---
Store size | | 85.2781 | 91.2268 | N/A | 515.726 | 558.039 | N/A
Index throughput (number of requests per second) | Minimum | 148,134 | 185,092 | 24.95% | 100,140 | 168,335 | 68.10%
Index throughput (number of requests per second) | Median | 160,110 | 189,799 | 18.54% | 106,642 | 170,573 | 59.95%
Index throughput (number of requests per second) | Maximum | 175,196 | 190,757 | 8.88% | 108,583 | 172,507 | 58.87%
Error rate | | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00%
As the size of the workload increases, the benefits of segment replication are amplified because the replicas are not required to index the larger dataset. In general, segment replication leads to higher throughput at lower resource costs than document replication in all cluster configurations, not accounting for replication lag.
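The percent difference values in the benchmark tables appear to be computed relative to the document replication result; this is an inference from the published numbers rather than a documented formula. Under that assumption, the following sketch reproduces the minimum throughput figures for both workload sizes (the function name is illustrative):

```python
def percent_difference(document_value: float, segment_value: float) -> float:
    """Relative change of the segment replication result with respect to
    the document replication result, expressed as a percentage."""
    return (segment_value - document_value) / document_value * 100.0


# Minimum index throughput values taken from the workload size table.
minimum_throughput = {
    "40 GB primary": (148_134, 185_092),
    "240 GB primary": (100_140, 168_335),
}

for workload, (document, segment) in minimum_throughput.items():
    print(f"{workload}: {percent_difference(document, segment):.2f}%")
```

The results match the table entries of 24.95% and 68.10%, which shows how the gap widens as the workload grows.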
Increasing the number of primary shards
The following table lists benchmarking results for the nyc_taxi dataset for 40 and 100 primary shards.
Metric | Statistic | Document Replication (40 primary shards, 1 replica) | Segment Replication (40 primary shards, 1 replica) | Percent difference | Document Replication (100 primary shards, 1 replica) | Segment Replication (100 primary shards, 1 replica) | Percent difference
---|---|---|---|---|---|---|---
Index throughput (number of requests per second) | Minimum | 148,134 | 185,092 | 24.95% | 151,404 | 167,391 | 9.55%
Index throughput (number of requests per second) | Median | 160,110 | 189,799 | 18.54% | 154,796 | 172,995 | 10.52%
Index throughput (number of requests per second) | Maximum | 175,196 | 190,757 | 8.88% | 166,173 | 174,655 | 4.86%
Error rate | | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00%
As the number of primary shards increases, the benefits of segment replication over document replication decrease. While segment replication is still beneficial with a larger number of primary shards, the difference in performance becomes less pronounced because there are more primary shards per node that must copy segment files across the cluster.
Increasing the number of replicas
The following table lists benchmarking results for the stackoverflow dataset for 1 and 9 replicas.
Metric | Statistic | Document Replication (10 primary shards, 1 replica) | Segment Replication (10 primary shards, 1 replica) | Percent difference | Document Replication (10 primary shards, 9 replicas) | Segment Replication (10 primary shards, 9 replicas) | Percent difference
---|---|---|---|---|---|---|---
Index throughput (number of requests per second) | Median | 72,598.10 | 90,776.10 | 25.04% | 16,537.00 | 14,429.80 | −12.74%
Index throughput (number of requests per second) | Maximum | 86,130.80 | 96,471.00 | 12.01% | 21,472.40 | 38,235.00 | 78.07%
CPU usage (%) | p50 | 17 | 18.857 | 10.92% | 69.857 | 8.833 | −87.36%
CPU usage (%) | p90 | 76 | 82.133 | 8.07% | 99 | 86.4 | −12.73%
CPU usage (%) | p99 | 100 | 100 | 0% | 100 | 100 | 0%
CPU usage (%) | p100 | 100 | 100 | 0% | 100 | 100 | 0%
Memory usage (%) | p50 | 35 | 23 | −34.29% | 42 | 40 | −4.76%
Memory usage (%) | p90 | 59 | 57 | −3.39% | 59 | 63 | 6.78%
Memory usage (%) | p99 | 69 | 61 | −11.59% | 66 | 70 | 6.06%
Memory usage (%) | p100 | 72 | 62 | −13.89% | 69 | 72 | 4.35%
Error rate | | 0.00% | 0.00% | 0.00% | 0.00% | 2.30% | 2.30%
As the number of replicas increases, the amount of time required for primary shards to keep replicas up to date (known as the replication lag) also increases. This is because segment replication copies the segment files directly from primary shards to replicas.
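To see why the replica count matters, consider a deliberately simplified model in which a primary shard sends a full copy of its new segment files to every replica after a refresh. The function name and the 512 MiB figure below are hypothetical and are not taken from the benchmark; real transfers depend on which files each replica already has.

```python
def primary_egress_bytes(new_segment_bytes: int, replica_count: int) -> int:
    """Bytes a primary shard sends for one replication event when every
    replica receives its own copy of the new segment files (simplified model)."""
    if new_segment_bytes < 0 or replica_count < 0:
        raise ValueError("sizes and counts must be non-negative")
    return new_segment_bytes * replica_count


# Hypothetical amount of new segment data produced by one refresh.
new_segments = 512 * 1024 * 1024

one_replica = primary_egress_bytes(new_segments, 1)
nine_replicas = primary_egress_bytes(new_segments, 9)
print(nine_replicas // one_replica)  # ratio of bytes sent with 9 replicas vs. 1
```

In this model, the data a primary shard must send grows linearly with the number of replicas, which is consistent with the longer replication lag described above.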
The benchmarking results show a non-zero error rate as the number of replicas increases. The error rate indicates that the segment replication backpressure mechanism is initiated when replicas cannot keep up with the primary shard. However, the error rate is offset by the significant CPU and memory gains that segment replication provides.