Bulk inserts
There are a couple of ways you can load lots of data quickly. The naive approach is to simply call Model.create() in a loop:
data_source = [
    {'field1': 'val1-1', 'field2': 'val1-2'},
    {'field1': 'val2-1', 'field2': 'val2-2'},
    # ...
]

for data_dict in data_source:
    MyModel.create(**data_dict)
The above approach is slow for a couple of reasons:

1. If you are not wrapping the loop in a transaction then each call to create() happens in its own transaction. That is going to be really slow!
2. There is a decent amount of Python logic getting in your way, and each InsertQuery must be generated and parsed into SQL.
3. That's a lot of data (in terms of raw bytes of SQL) you are sending to your database to parse.
4. We are retrieving the last insert id, which causes an additional query to be executed in some cases.

You can get a significant speedup by simply wrapping this in a transaction with atomic().
# This is much faster.
with db.atomic():
    for data_dict in data_source:
        MyModel.create(**data_dict)
The above code still suffers from points 2, 3 and 4. We can get another big boost by using insert_many(). This method accepts a list of tuples or dictionaries, and inserts multiple rows in a single query:
data_source = [
    {'field1': 'val1-1', 'field2': 'val1-2'},
    {'field1': 'val2-1', 'field2': 'val2-2'},
    # ...
]

# Fastest way to INSERT multiple rows.
MyModel.insert_many(data_source).execute()
The insert_many() method also accepts a list of row-tuples, provided you also specify the corresponding fields:
# We can INSERT tuples as well...
data = [('val1-1', 'val1-2'),
        ('val2-1', 'val2-2'),
        ('val3-1', 'val3-2')]

# But we need to indicate which fields the values correspond to.
MyModel.insert_many(data, fields=[MyModel.field1, MyModel.field2]).execute()
It is also a good practice to wrap the bulk insert in a transaction:
# You can, of course, wrap this in a transaction as well:
with db.atomic():
    MyModel.insert_many(data, fields=[MyModel.field1, MyModel.field2]).execute()
Note
SQLite users should be aware of some caveats when using bulk inserts. Specifically, your SQLite3 version must be 3.7.11.0 or newer to take advantage of the bulk insert API. Additionally, by default SQLite limits the number of bound variables in a SQL query to 999.
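As a rough sketch of how you might account for these limits at runtime (the 8-fields-per-row figure is an assumption for illustration, and the 999-variable default can be higher on newer SQLite builds):

import sqlite3

# Verify the runtime SQLite library supports the multi-row VALUES syntax.
assert sqlite3.sqlite_version_info >= (3, 7, 11), 'bulk insert API unavailable'

# Hypothetical sizing: with 8 bound values per row and a 999-variable limit,
# a safe batch size is 999 // 8 == 124 rows per INSERT.
max_vars = 999
fields_per_row = 8
batch_size = max_vars // fields_per_row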
Inserting rows in batches
Depending on the number of rows in your data source, you may need to break it up into chunks. SQLite in particular typically has a limit of 999 variables-per-query (batch size would then be roughly 1000 / row length).

You can write a loop to batch your data into chunks (in which case it is strongly recommended you use a transaction):
# Insert rows 100 at a time.
with db.atomic():
    for idx in range(0, len(data_source), 100):
        MyModel.insert_many(data_source[idx:idx+100]).execute()
Peewee comes with a chunked() helper function which you can use for efficiently chunking a generic iterable into a series of batch-sized iterables:
from peewee import chunked

# Insert rows 100 at a time.
with db.atomic():
    for batch in chunked(data_source, 100):
        MyModel.insert_many(batch).execute()
Alternatives
The Model.bulk_create() method behaves much like Model.insert_many(), but instead it accepts a list of unsaved model instances to insert, and it optionally accepts a batch-size parameter. To use the bulk_create() API:
# Read list of usernames from a file, for example.
with open('user_list.txt') as fh:
    # Create a list of unsaved User instances.
    users = [User(username=line.strip()) for line in fh.readlines()]

# Wrap the operation in a transaction and batch INSERT the users
# 100 at a time.
with db.atomic():
    User.bulk_create(users, batch_size=100)
Note
If you are using Postgresql (which supports the RETURNING clause), then the previously-unsaved model instances will have their new primary-key values automatically populated.
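For example, a minimal sketch of what that back-fill looks like, continuing the snippet above (the usernames here are assumptions for illustration):

users = [User(username='u1'), User(username='u2')]

with db.atomic():
    User.bulk_create(users, batch_size=100)

# On Postgresql the RETURNING clause back-fills the primary keys;
# on other databases the id attributes remain unset after bulk_create().
print([u.id for u in users])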
In addition, Peewee also offers Model.bulk_update(), which can efficiently update one or more columns on a list of models. For example:
# First, create 3 users with usernames u1, u2, u3.
u1, u2, u3 = [User.create(username='u%s' % i) for i in (1, 2, 3)]

# Now we'll modify the user instances.
u1.username = 'u1-x'
u2.username = 'u2-y'
u3.username = 'u3-z'

# Update all three users with a single UPDATE query.
User.bulk_update([u1, u2, u3], fields=[User.username])
Note
For large lists of objects, you should specify a reasonable batch_size and wrap the call to bulk_update() with Database.atomic():
with database.atomic():
    User.bulk_update(list_of_users, fields=['username'], batch_size=50)
Alternatively, you can use the Database.batch_commit() helper to process chunks of rows inside batch-sized transactions. This method also provides a workaround for databases besides Postgresql, when the primary-key of the newly-created rows must be obtained.
# List of row data to insert.
row_data = [{'username': 'u1'}, {'username': 'u2'}, ...]

# Assume there are 789 items in row_data. The following code will result in
# 8 total transactions (7x100 rows + 1x89 rows).
for row in db.batch_commit(row_data, 100):
    User.create(**row)
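Because each row still goes through User.create(), a sketch like the following (assuming the same row_data) collects the newly-created instances with their primary keys populated on any backend:

created = []
for row in db.batch_commit(row_data, 100):
    # create() retrieves the last insert id, so the primary key is
    # available here regardless of the database backend.
    created.append(User.create(**row))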
Bulk-loading from another table
If the data you would like to bulk load is stored in another table, you can also create INSERT queries whose source is a SELECT query. Use the Model.insert_from() method:
res = (TweetArchive
       .insert_from(
           Tweet.select(Tweet.user, Tweet.message),
           fields=[TweetArchive.user, TweetArchive.message])
       .execute())
The above query is equivalent to the following SQL:
INSERT INTO "tweet_archive" ("user_id", "message")
SELECT "user_id", "message" FROM "tweet";