Jobs: pausing and resuming crawls
Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.
Scrapy supports this functionality out of the box by providing the following facilities:
- a scheduler that persists scheduled requests on disk
- a duplicates filter that persists visited requests on disk
- an extension that keeps some spider state (key/value pairs) persistent between batches
Job directory
To enable persistence support, define a job directory through the JOBDIR setting. This directory stores all the data required to keep the state of a single job (i.e. a spider run). It must not be shared by different spiders, or even by different jobs/runs of the same spider, because it is meant to hold the state of a single job.
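One way to keep job directories from being shared is to derive the path from both the spider name and a run identifier. The following is a minimal sketch, not part of Scrapy: the helper names job_dir and new_run_id, and the crawls base directory, are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def new_run_id() -> str:
    """Return a timestamp-based identifier for a fresh run (hypothetical helper)."""
    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def job_dir(spider_name: str, run_id: str, base: str = "crawls") -> str:
    """Build a JOBDIR value that is unique to one spider and one run.

    The same run_id must be reused to resume that run; a new run_id
    starts a new job with its own directory.
    """
    return str(PurePosixPath(base) / f"{spider_name}-{run_id}")
```

For example, job_dir("somespider", "1") yields crawls/somespider-1, which can be passed as the JOBDIR value. To resume a paused run, pass the same run identifier again rather than generating a new one.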
How to use it
To start a spider with persistence support enabled, run it like this:
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Keeping persistent state between batches
Sometimes you’ll want to keep some persistent spider state between pause/resume batches. You can use the spider.state attribute for that, which should be a dict. A built-in extension takes care of serializing, storing and loading that attribute from the job directory when the spider starts and stops.
Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):
    def parse_item(self, response):
        # parse item here
        self.state['items_count'] = self.state.get('items_count', 0) + 1
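To make the load-on-start, save-on-stop cycle concrete, here is a simplified, standard-library-only sketch of the idea. It is not Scrapy's actual extension code: the file name, the helper functions and run_batch are all assumptions chosen for illustration.

```python
import pickle
from pathlib import Path

STATE_FILE = "spider.state"  # file name assumed for this sketch


def load_state(job_dir: str) -> dict:
    """Load the saved state dict, or start empty on the first batch."""
    path = Path(job_dir) / STATE_FILE
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return {}


def save_state(job_dir: str, state: dict) -> None:
    """Serialize the state dict into the job directory."""
    Path(job_dir).mkdir(parents=True, exist_ok=True)
    with (Path(job_dir) / STATE_FILE).open("wb") as f:
        pickle.dump(state, f)


def run_batch(job_dir: str, items_parsed: int) -> int:
    """Simulate one pause/resume batch that counts parsed items."""
    state = load_state(job_dir)  # what happens when the spider starts
    for _ in range(items_parsed):
        state["items_count"] = state.get("items_count", 0) + 1
    save_state(job_dir, state)  # what happens when the spider stops
    return state.get("items_count", 0)
```

Calling run_batch twice against the same directory, first with 3 items and then with 2, leaves items_count at 5, because the second batch picks up the counter saved by the first.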
Persistence gotchas
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Cookies expiration
Cookies may expire. So, if you don’t resume your spider quickly, the scheduled requests may no longer work. This won’t be an issue if your spider doesn’t rely on cookies.
Request serialization
For persistence to work, Request objects must be serializable with pickle, except for the callback and errback values passed to their __init__ method, which must be methods of the running Spider class.
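A quick way to see what "serializable with pickle" rules out is to try pickling the data you plan to attach to a request. The sketch below uses only the standard library. The helper is_picklable and the two example dicts are hypothetical, and the check covers plain data only, not Scrapy's own handling of callbacks.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if obj can be serialized with pickle (hypothetical helper)."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain data such as strings and numbers pickles without trouble.
ok_meta = {"page": 2, "category": "books"}

# A lambda stored in the data cannot be pickled, so a request carrying
# it could not be persisted to the job directory.
bad_meta = {"transform": lambda value: value.strip()}
```

Here is_picklable(ok_meta) is True and is_picklable(bad_meta) is False. Replacing the lambda with plain data, or with logic that lives in a spider method, avoids the problem.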
If you wish to log the requests that couldn’t be serialized, you can set the SCHEDULER_DEBUG setting to True in the project’s settings. It is False by default.
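As a minimal sketch, assuming the project keeps its settings in a settings.py module, the option can be switched on like this:

```python
# settings.py (module name assumed; use your project's settings module)

# Log requests that could not be serialized to the job directory.
SCHEDULER_DEBUG = True
```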