AutoThrottle extension
This is an extension for automatically throttling crawling speed based on load of both the Scrapy server and the website you are crawling.
Design goals
- be nicer to sites instead of using default download delay of zero
- automatically adjust Scrapy to the optimum crawling speed, so the user doesn’t have to tune the download delays to find the optimum one. The user only needs to specify the maximum concurrent requests it allows, and the extension does the rest.
How it works
The AutoThrottle extension adjusts download delays dynamically to make the spider send AUTOTHROTTLE_TARGET_CONCURRENCY concurrent requests on average to each remote website.
It uses download latency to compute the delays. The main idea is the following: if a server needs latency seconds to respond, a client should send a request each latency/N seconds to have N requests processed in parallel.
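As a quick illustration of the latency/N idea, the sketch below computes the interval between requests for a given latency and desired parallelism. The function name request_interval is hypothetical and is not part of Scrapy.

```python
def request_interval(latency: float, n: float) -> float:
    """Seconds to wait between requests so that about `n` are in flight.

    Hypothetical helper for illustration; not part of the Scrapy API.
    """
    if latency < 0 or n <= 0:
        raise ValueError("latency must be >= 0 and n must be > 0")
    return latency / n


# A server that answers in 2 seconds, with 4 requests wanted in parallel:
interval = request_interval(2.0, 4)  # 0.5 seconds between requests
```

With N below 1 (for example 0.5) the interval becomes longer than the latency, so on average fewer than one request is in flight.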
Instead of adjusting the delays, one can just set a small fixed download delay and impose hard limits on concurrency using the CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP options. This provides a similar effect, but there are some important differences:
- because the download delay is small, there will be occasional bursts of requests;
- non-200 (error) responses can often be returned faster than regular responses, so with a small download delay and a hard concurrency limit the crawler will send requests to the server faster when the server starts to return errors. This is the opposite of what a crawler should do: in case of errors it makes more sense to slow down, because the errors may be caused by the high request rate.
AutoThrottle doesn’t have these issues.
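The second difference can be made concrete with a rough model. The sketch below assumes, as a simplification that is not taken from Scrapy itself, that a crawler with a hard concurrency limit keeps all its slots busy, so its request rate is about limit / latency, capped by the fixed download delay. The function name approx_request_rate is hypothetical.

```python
def approx_request_rate(concurrency_limit: int, latency: float,
                        download_delay: float) -> float:
    """Rough steady-state requests per second sent to one site.

    Simplified model (an assumption for illustration): every concurrency
    slot is refilled as soon as a response arrives, and a fixed download
    delay allows at most 1 / download_delay requests per second.
    """
    rate = concurrency_limit / latency
    if download_delay > 0:
        rate = min(rate, 1.0 / download_delay)
    return rate


# Regular responses take 0.5 s; error pages come back in 0.25 s.
healthy_rate = approx_request_rate(8, 0.5, 0.01)   # 16.0 requests/s
failing_rate = approx_request_rate(8, 0.25, 0.01)  # 32.0 requests/s
```

Under this model the request rate doubles exactly when the server starts failing, which is the behaviour the list above warns about.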
Throttling algorithm
AutoThrottle algorithm adjusts download delays based on the following rules:
- spiders always start with a download delay of AUTOTHROTTLE_START_DELAY;
- when a response is received, the target download delay is calculated as latency / N where latency is a latency of the response, and N is AUTOTHROTTLE_TARGET_CONCURRENCY.
- download delay for next requests is set to the average of previous download delay and the target download delay;
- latencies of non-200 responses are not allowed to decrease the delay;
- download delay can’t become less than DOWNLOAD_DELAY or greater than AUTOTHROTTLE_MAX_DELAY
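The rules above can be sketched as a single update function. This is a minimal illustration of the documented rules only, not the extension's actual source code, which may contain further adjustments. The function name next_download_delay and its parameter names are hypothetical; the comments map them to the settings they stand for.

```python
def next_download_delay(
    previous_delay: float,
    latency: float,
    status: int,
    target_concurrency: float = 1.0,  # stands for AUTOTHROTTLE_TARGET_CONCURRENCY
    min_delay: float = 0.0,           # stands for DOWNLOAD_DELAY
    max_delay: float = 60.0,          # stands for AUTOTHROTTLE_MAX_DELAY
) -> float:
    """Return the delay to use after one response, following the listed rules."""
    # Target delay: latency / N.
    target_delay = latency / target_concurrency
    # New delay: average of the previous delay and the target delay.
    new_delay = (previous_delay + target_delay) / 2.0
    # Keep the delay within [min_delay, max_delay].
    new_delay = min(max(new_delay, min_delay), max_delay)
    # A non-200 response is not allowed to decrease the delay.
    if status != 200 and new_delay < previous_delay:
        return previous_delay
    return new_delay


delay = 5.0  # stands for AUTOTHROTTLE_START_DELAY
for latency, status in [(1.0, 200), (1.0, 200), (0.1, 503)]:
    delay = next_download_delay(delay, latency, status)
# delay goes 5.0 -> 3.0 -> 2.0, and the fast 503 response leaves it at 2.0
```

Averaging with the previous delay means a single unusually fast or slow response moves the delay only halfway toward its target.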
Note
The AutoThrottle extension honours the standard Scrapy settings for concurrency and delay. This means that it will respect CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options and never set a download delay lower than DOWNLOAD_DELAY.
In Scrapy, the download latency is measured as the time elapsed between establishing the TCP connection and receiving the HTTP headers.
Note that these latencies are very hard to measure accurately in a cooperative multitasking environment because Scrapy may be busy processing a spider callback, for example, and unable to attend downloads. However, these latencies should still give a reasonable estimate of how busy Scrapy (and ultimately, the server) is, and this extension builds on that premise.
Settings
The settings used to control the AutoThrottle extension are:
AUTOTHROTTLE_ENABLED
AUTOTHROTTLE_START_DELAY
AUTOTHROTTLE_MAX_DELAY
AUTOTHROTTLE_TARGET_CONCURRENCY
AUTOTHROTTLE_DEBUG
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
DOWNLOAD_DELAY
For more information see How it works.
AUTOTHROTTLE_ENABLED
Default: False
Enables the AutoThrottle extension.
AUTOTHROTTLE_START_DELAY
Default: 5.0
The initial download delay (in seconds).
AUTOTHROTTLE_MAX_DELAY
Default: 60.0
The maximum download delay (in seconds) to be set in case of high latencies.
AUTOTHROTTLE_TARGET_CONCURRENCY
New in version 1.1.
Default: 1.0
Average number of requests Scrapy should be sending in parallel to remote websites.
By default, AutoThrottle adjusts the delay to send a single concurrent request to each of the remote websites. Set this option to a higher value (e.g. 2.0) to increase the throughput and the load on remote servers. A lower AUTOTHROTTLE_TARGET_CONCURRENCY value (e.g. 0.5) makes the crawler more conservative and polite.
Note that CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS_PER_IP options are still respected when AutoThrottle extension is enabled. This means that if AUTOTHROTTLE_TARGET_CONCURRENCY is set to a value higher than CONCURRENT_REQUESTS_PER_DOMAIN or CONCURRENT_REQUESTS_PER_IP, the crawler won’t reach this number of concurrent requests.
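One way to picture this interaction is as a ceiling on the average concurrency the crawler can reach. The sketch below is an illustration, not Scrapy code: the function name concurrency_ceiling is hypothetical, and it assumes that a non-zero per-IP limit is the one that applies in place of the per-domain limit.

```python
def concurrency_ceiling(target_concurrency: float,
                        per_domain_limit: int,
                        per_ip_limit: int = 0) -> float:
    """Highest average concurrency reachable for one remote site.

    Hypothetical helper. Assumes a non-zero per-IP limit takes the place
    of the per-domain limit as the hard limit.
    """
    hard_limit = per_ip_limit if per_ip_limit else per_domain_limit
    return min(target_concurrency, hard_limit)


# A target of 4.0 with a per-domain limit of 2 cannot exceed 2.
ceiling = concurrency_ceiling(4.0, 2)
```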
At any given time, Scrapy may be sending more or fewer concurrent requests than AUTOTHROTTLE_TARGET_CONCURRENCY; it is a suggested value the crawler tries to approach, not a hard limit.
AUTOTHROTTLE_DEBUG
Default: False
Enable AutoThrottle debug mode which will display stats on every response received, so you can see how the throttling parameters are being adjusted in real time.
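Putting the settings together, a project's settings.py might look like the sketch below. The setting names come from this page; the values are illustrative choices, not recommendations, and should be tuned for the sites being crawled.

```python
# settings.py (illustrative values only)

AUTOTHROTTLE_ENABLED = True            # off by default
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per site
AUTOTHROTTLE_DEBUG = True              # show throttling stats per response

# Standard settings that AutoThrottle still respects.
DOWNLOAD_DELAY = 0.5                   # the delay never drops below this
CONCURRENT_REQUESTS_PER_DOMAIN = 8     # hard limit on concurrency per domain
```

Here the target concurrency stays below the per-domain limit, so the hard limit does not prevent the crawler from approaching the target.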