Scheduler
The scheduler component receives requests from the engine and stores them into persistent and/or non-persistent data structures. It also returns those requests to the engine when it asks for the next request to be downloaded.
Overriding the default scheduler
You can use your own custom scheduler class by supplying its full Python path in the SCHEDULER setting.
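For example, with a hypothetical MyScheduler class defined in myproject/schedulers.py:

```python
# settings.py
SCHEDULER = "myproject.schedulers.MyScheduler"
```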
Minimal scheduler interface
class scrapy.core.scheduler.BaseScheduler
The scheduler component is responsible for storing requests received from the engine, and feeding them back to the engine upon request.
The original sources of said requests are:
- Spider: the start_requests method, requests created for URLs in the start_urls attribute, and request callbacks
- Spider middleware: the process_spider_output and process_spider_exception methods
- Downloader middleware: the process_request, process_response and process_exception methods
The order in which the scheduler returns its stored requests (via the next_request method) plays a great part in determining the order in which those requests are downloaded.
The methods defined in this class constitute the minimal interface that the Scrapy engine will interact with.
close(reason: str) → Optional[Deferred]
Called when the spider is closed by the engine. It receives the reason why the crawl finished as an argument, and it’s useful for executing cleanup code.
Parameters
reason (str) – a string which describes the reason why the spider was closed
abstract enqueue_request(request: Request) → bool
Process a request received by the engine.
Return True if the request is stored correctly, False otherwise.
If False, the engine will fire a request_dropped signal, and will not make further attempts to schedule the request at a later time. For reference, the default Scrapy scheduler returns False when the request is rejected by the dupefilter.
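For instance, a small extension (the class and method names below are hypothetical) could listen for the request_dropped signal to log requests the scheduler refused; it would need to be enabled through the EXTENSIONS setting:

```python
from scrapy import signals


class DroppedRequestLogger:
    """Hypothetical extension that logs requests rejected by the scheduler."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # request_dropped is sent by the engine when enqueue_request returns False
        crawler.signals.connect(ext.on_request_dropped, signal=signals.request_dropped)
        return ext

    def on_request_dropped(self, request, spider):
        spider.logger.info("Scheduler rejected %s", request.url)
```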
classmethod from_crawler(crawler: Crawler)
Factory method which receives the current Crawler object as argument.
abstract has_pending_requests() → bool
Return True if the scheduler has enqueued requests, False otherwise.

abstract next_request() → Optional[Request]
Return the next Request to be processed, or None to indicate that there are no requests to be considered ready at the moment.
Returning None implies that no request from the scheduler will be sent to the downloader in the current reactor cycle. The engine will continue calling next_request until has_pending_requests is False.

open(spider: Spider) → Optional[Deferred]
Called when the spider is opened by the engine. It receives the spider instance as an argument, and it’s useful for executing initialization code.
Parameters
spider (Spider) – the spider object for the current crawl
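Putting the interface together, the following is a minimal sketch of a custom scheduler: a FIFO, memory-only implementation with no duplicate filtering or prioritization (the SimpleScheduler name is hypothetical). Since BaseScheduler already provides default from_crawler, open and close implementations, only the three abstract methods need to be defined:

```python
from collections import deque
from typing import Optional

from scrapy.core.scheduler import BaseScheduler
from scrapy.http import Request


class SimpleScheduler(BaseScheduler):
    """A FIFO, memory-only scheduler with no duplicate filtering."""

    def __init__(self):
        self.requests = deque()

    def enqueue_request(self, request: Request) -> bool:
        # Always accept; a production scheduler would typically consult a
        # dupefilter here and return False for rejected requests.
        self.requests.append(request)
        return True

    def next_request(self) -> Optional[Request]:
        # FIFO order; returning None tells the engine nothing is ready yet.
        return self.requests.popleft() if self.requests else None

    def has_pending_requests(self) -> bool:
        return len(self.requests) > 0
```

Such a class would be enabled by pointing the SCHEDULER setting at its import path, as shown above.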
Default Scrapy scheduler
class scrapy.core.scheduler.Scheduler(dupefilter, jobdir: Optional[str] = None, dqclass=None, mqclass=None, logunser: bool = False, stats=None, pqclass=None, crawler: Optional[Crawler] = None)
Default Scrapy scheduler. This implementation also handles duplication filtering via the dupefilter.
This scheduler stores requests into several priority queues (defined by the SCHEDULER_PRIORITY_QUEUE setting). In turn, said priority queues are backed by either memory or disk based queues (respectively defined by the SCHEDULER_MEMORY_QUEUE and SCHEDULER_DISK_QUEUE settings).
Request prioritization is almost entirely delegated to the priority queue. The only prioritization performed by this scheduler is using the disk-based queue if present (i.e. if the JOBDIR setting is defined) and falling back to the memory-based queue if a serialization error occurs. If the disk queue is not present, the memory one is used directly.
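For reference, a settings.py sketch touching the relevant settings; the queue class values shown are the documented defaults, while the JOBDIR path is hypothetical (JOBDIR is unset by default, which disables the disk queue):

```python
# settings.py
JOBDIR = "crawls/my-spider-1"  # hypothetical path; enables the disk queue
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleLifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.LifoMemoryQueue"
SCHEDULER_DEBUG = False  # set to True to log unserializable requests
```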
Parameters
- dupefilter (scrapy.dupefilters.BaseDupeFilter instance or similar: any class that implements the BaseDupeFilter interface) – An object responsible for checking and filtering duplicate requests. The value for the DUPEFILTER_CLASS setting is used by default.
- jobdir (str or None) – The path of a directory to be used for persisting the crawl’s state. The value for the JOBDIR setting is used by default. See Jobs: pausing and resuming crawls.
- dqclass (class) – A class to be used as persistent request queue. The value for the SCHEDULER_DISK_QUEUE setting is used by default.
- mqclass (class) – A class to be used as non-persistent request queue. The value for the SCHEDULER_MEMORY_QUEUE setting is used by default.
- logunser (bool) – A boolean that indicates whether or not unserializable requests should be logged. The value for the SCHEDULER_DEBUG setting is used by default.
- stats (scrapy.statscollectors.StatsCollector instance or similar: any class that implements the StatsCollector interface) – A stats collector object to record stats about the request scheduling process. The value for the STATS_CLASS setting is used by default.
- pqclass (class) – A class to be used as priority queue for requests. The value for the SCHEDULER_PRIORITY_QUEUE setting is used by default.
- crawler (scrapy.crawler.Crawler) – The crawler object corresponding to the current crawl.
__len__() → int
Return the total amount of enqueued requests.

close(reason: str) → Optional[Deferred]
Dump pending requests to disk if there is a disk queue, and return the result of the dupefilter’s close method.
enqueue_request(request: Request) → bool
Unless the received request is filtered out by the dupefilter, attempt to push it into the disk queue, falling back to pushing it into the memory queue.
Increment the appropriate stats, such as scheduler/enqueued, scheduler/enqueued/disk and scheduler/enqueued/memory.
Return True if the request was stored successfully, False otherwise.

classmethod from_crawler(crawler) → SchedulerTV
Factory method that initializes the scheduler with arguments taken from the crawl settings.
has_pending_requests() → bool
Return True if the scheduler has enqueued requests, False otherwise.

next_request() → Optional[Request]
Return a Request object from the memory queue, falling back to the disk queue if the memory queue is empty. Return None if there are no more enqueued requests.
Increment the appropriate stats, such as scheduler/dequeued, scheduler/dequeued/disk and scheduler/dequeued/memory.

open(spider: Spider) → Optional[Deferred]
Initialize the memory queue. Initialize the disk queue if the jobdir attribute is a valid directory. Return the result of the dupefilter’s open method.
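The default scheduler can also be extended rather than replaced. As a sketch (the class name and cap value are hypothetical), a subclass could override enqueue_request to reject new requests while the backlog is too large:

```python
from scrapy.core.scheduler import Scheduler


class BacklogCapScheduler(Scheduler):
    """Hypothetical scheduler that rejects requests when the backlog is full."""

    MAX_BACKLOG = 10_000  # arbitrary cap, for illustration only

    def enqueue_request(self, request) -> bool:
        # __len__ returns the number of currently enqueued requests
        if len(self) >= self.MAX_BACKLOG:
            return False  # the engine fires request_dropped and moves on
        return super().enqueue_request(request)
```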