Bulk inserts
There are a couple of ways you can load lots of data quickly. The naive approach is to simply call Model.create()
in a loop:
data_source = [
    {'field1': 'val1-1', 'field2': 'val1-2'},
    {'field1': 'val2-1', 'field2': 'val2-2'},
    # ...
]

for data_dict in data_source:
    MyModel.create(**data_dict)
The above approach is slow for several reasons:

1. If you are not wrapping the loop in a transaction then each call to create() happens in its own transaction. That is going to be really slow!
2. There is a decent amount of Python logic getting in your way, and each InsertQuery must be generated and parsed into SQL.
3. That's a lot of data (in terms of raw bytes of SQL) you are sending to your database to parse.
4. We are retrieving the last insert id, which causes an additional query to be executed in some cases.
You can get a significant speedup by simply wrapping this in a transaction with atomic().
# This is much faster.
with db.atomic():
    for data_dict in data_source:
        MyModel.create(**data_dict)
The above code still suffers from points 2, 3 and 4. We can get another big boost by using insert_many(). This method accepts a list of tuples or dictionaries and inserts multiple rows in a single query:
data_source = [
    {'field1': 'val1-1', 'field2': 'val1-2'},
    {'field1': 'val2-1', 'field2': 'val2-2'},
    # ...
]

# Fastest way to INSERT multiple rows.
MyModel.insert_many(data_source).execute()
The insert_many() method also accepts a list of row-tuples, provided you also specify the corresponding fields:
# We can INSERT tuples as well...
data = [('val1-1', 'val1-2'),
        ('val2-1', 'val2-2'),
        ('val3-1', 'val3-2')]

# But we need to indicate which fields the values correspond to.
MyModel.insert_many(data, fields=[MyModel.field1, MyModel.field2]).execute()
It is also a good practice to wrap the bulk insert in a transaction:
# You can, of course, wrap this in a transaction as well:
with db.atomic():
    MyModel.insert_many(data, fields=[MyModel.field1, MyModel.field2]).execute()
Note
SQLite users should be aware of some caveats when using bulk inserts. Specifically, your SQLite3 version must be 3.7.11.0 or newer to take advantage of the bulk insert API. Additionally, by default SQLite limits the number of bound variables in a SQL query to 999 for SQLite versions prior to 3.32.0 (2020-05-22), and to 32766 for SQLite versions 3.32.0 and newer.
Inserting rows in batches
Depending on the number of rows in your data source, you may need to break it up into chunks. SQLite in particular typically has a limit of 999 or 32766 variables per query, depending on version, so the batch size would then be 999 // row length or 32766 // row length.
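As a rough sketch of that calculation (not part of Peewee's API), assuming a hypothetical MyModel whose rows insert two values each, the maximum safe batch size can be derived from the bound-variable limit of the local SQLite library:

import sqlite3

# Hypothetical: MyModel inserts two values per row (field1, field2).
FIELDS_PER_ROW = 2

# SQLite 3.32.0 (2020-05-22) raised the default bound-variable limit
# from 999 to 32766.
if sqlite3.sqlite_version_info >= (3, 32, 0):
    max_vars = 32766
else:
    max_vars = 999

batch_size = max_vars // FIELDS_PER_ROW  # e.g. 999 // 2 == 499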
You can write a loop to batch your data into chunks (in which case it is strongly recommended you use a transaction):
# Insert rows 100 at a time.
with db.atomic():
    for idx in range(0, len(data_source), 100):
        MyModel.insert_many(data_source[idx:idx+100]).execute()
Peewee comes with a chunked() helper function which you can use for efficiently chunking a generic iterable into a series of batch-sized iterables:
from peewee import chunked

# Insert rows 100 at a time.
with db.atomic():
    for batch in chunked(data_source, 100):
        MyModel.insert_many(batch).execute()
Alternatives
The Model.bulk_create() method behaves much like Model.insert_many(), but instead it accepts a list of unsaved model instances to insert, and it optionally accepts a batch-size parameter. To use the bulk_create() API:
# Read list of usernames from a file, for example.
with open('user_list.txt') as fh:
    # Create a list of unsaved User instances.
    users = [User(username=line.strip()) for line in fh.readlines()]

# Wrap the operation in a transaction and batch INSERT the users
# 100 at a time.
with db.atomic():
    User.bulk_create(users, batch_size=100)
Note
If you are using Postgresql (which supports the RETURNING clause), then the previously-unsaved model instances will have their new primary key values automatically populated.
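As a brief illustrative sketch, reusing the users list from the bulk_create() example above on a Postgresql database:

# After bulk_create() on Postgresql, the RETURNING clause populates
# each instance's primary key.
for user in users[:3]:
    print(user.id, user.username)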
In addition, Peewee also offers Model.bulk_update(), which can efficiently update one or more columns on a list of models. For example:
# First, create 3 users with usernames u1, u2, u3.
u1, u2, u3 = [User.create(username='u%s' % i) for i in (1, 2, 3)]
# Now we'll modify the user instances.
u1.username = 'u1-x'
u2.username = 'u2-y'
u3.username = 'u3-z'
# Update all three users with a single UPDATE query.
User.bulk_update([u1, u2, u3], fields=[User.username])
Note
For large lists of objects, you should specify a reasonable batch_size and wrap the call to bulk_update() with Database.atomic():
with database.atomic():
    User.bulk_update(list_of_users, fields=['username'], batch_size=50)
Alternatively, you can use the Database.batch_commit() helper to process chunks of rows inside batch-sized transactions. This method also provides a workaround for databases other than Postgresql when the primary key of newly-created rows must be obtained.
# List of row data to insert.
row_data = [{'username': 'u1'}, {'username': 'u2'}, ...]

# Assume there are 789 items in row_data. The following code will result in
# 8 total transactions (7x100 rows + 1x89 rows).
for row in db.batch_commit(row_data, 100):
    User.create(**row)
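Because Model.create() returns the saved instance, one possible sketch (using the names from the example above) for collecting the newly-assigned primary keys on a database that lacks RETURNING:

created_users = []
for row in db.batch_commit(row_data, 100):
    # create() returns the saved instance with its primary key populated.
    created_users.append(User.create(**row))

new_ids = [user.id for user in created_users]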
Bulk-loading from another table
If the data you would like to bulk load is stored in another table, you can also create INSERT queries whose source is a SELECT query. Use the Model.insert_from() method:
res = (TweetArchive
       .insert_from(
           Tweet.select(Tweet.user, Tweet.message),
           fields=[TweetArchive.user, TweetArchive.message])
       .execute())
The above query is equivalent to the following SQL:
INSERT INTO "tweet_archive" ("user_id", "message")
SELECT "user_id", "message" FROM "tweet";