Uploading data to YDB
This section provides recommendations on how to efficiently upload data to YDB.
There are anti-patterns and non-optimal settings for uploading data. They don’t guarantee acceptable data uploading performance.
To accelerate data uploads, consider the following recommendations:
- Shard a table when creating it. This lets you effectively use the system bandwidth as soon as you start uploading data.
By default, a new table consists of a single shard. YDB supports automatic table sharding by data volume. This means that a table shard is divided into two shards when it reaches a certain size.
The acceptable size for splitting a table shard is 2 GB. As the number of shards grows, the data upload bandwidth increases, but it remains low for some time at first.
Therefore, when uploading a large amount of data for the first time, we recommend initially creating a table with the desired number of shards. You can calculate the number of shards based on 1 GB of data per shard in a resulting set. - Insert multiple rows in each transaction to reduce the overhead of the transactions themselves.
Each transaction in YDB has some overhead. To reduce the total overhead, you should make transactions that insert multiple rows. Good performance indicators terminate a transaction when it reaches 1 MB of data or 100,000 rows.
When uploading data, avoid transactions that insert a single row. - Within each transaction, insert rows from the primary key-sorted set to minimize the number of shards that the transaction affects.
In YDB, transactions that span multiple shards have higher overhead compared to transactions that involve exactly one shard. Moreover, this overhead increases with the growing number of table shards involved in the transaction.
We recommend selecting rows to be inserted in a particular transaction so that they’re located in a small number of shards, ideally, in one. - If you need to push data to multiple tables, we recommend pushing data to a single table within a single query.
- If you need to push data to a table with a synchronous secondary index, we recommend that you first push data to a table and, when done, build a secondary index.
- You should avoid writing data sequentially in ascending or descending order of the primary key.
Writing data to a table with a monotonically increasing key causes all new data to be written to the end of the table, since all tables in YDB are sorted by ascending primary key. As YDB splits table data into shards based on key ranges, inserts are always processed by the same server that is responsible for the “last” shard. Concentrating the load on a single server results in slow data uploading and inefficient use of a distributed system. - Some use cases require writing the initial data (often large amounts) to a table before enabling OLTP workloads. In this case, transactionality at the level of individual queries is not required and you can use
BulkUpsert
calls in the API and SDK. Since no transactionality is used, this approach has much lower overhead as compared to YQL queries. In case of a successful response to the query, theBulkUpsert
method guarantees that all data added within this query is committed.
Warning
The BulkUpsert
method isn’t supported on tables with secondary indexes.
We recommend the following algorithm for efficiently uploading data to YDB:
- Create a table with the desired number of shards based on 1 GB of data per shard.
- Sort the source data set by the expected primary key.
- Partition the resulting data set by the number of shards in the table. Each part will contain a set of consecutive rows.
- Upload the resulting parts to the table shards concurrently.
- Make a
COMMIT
after every 100,000 rows or 1 MB of data.