Core API
New in version 0.15.
This section documents the Scrapy core API, and it's intended for developers of extensions and middlewares.
Crawler API
The main entry point to the Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it's the only way for extensions to access them and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed extensions, and it's configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how you configure the downloader middlewares.
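As a sketch of this pattern, the following hypothetical extension (the class name, stat key and log message are illustrative, not part of Scrapy) receives the Crawler in from_crawler and uses it to reach the stats collector and connect a signal handler; to enable it, its path would be added to the EXTENSIONS setting:

```python
from scrapy import signals


class SpiderOpenedLogger:
    """Hypothetical extension that hooks into the crawler via from_crawler."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler gives access to settings, signals, stats and the engine.
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        self.stats.inc_value("spider_opened_count")  # illustrative stat key
        spider.logger.info("Spider opened: %s", spider.name)
```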
- class scrapy.crawler.Crawler(spidercls, settings)[source]
The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.
settings
This is used by extensions & middlewares to access the Scrapy settings of this crawler.
For an introduction on Scrapy settings see Settings.
For the API see the Settings class.

signals
This is used by extensions & middlewares to hook themselves into Scrapy functionality.
For an introduction on signals see Signals.
For the API see the SignalManager class.

stats
This is used by extensions & middlewares to record stats of their behaviour, or access stats collected by other extensions.
For an introduction on stats collection see Stats Collection.
For the API see the StatsCollector class.

extensions
Most extensions won't need to access this attribute.
For an introduction on extensions and a list of available extensions on Scrapy see Extensions.
engine
The execution engine, which coordinates the core crawling logic between the scheduler, downloader and spiders.
Some extensions may want to access the Scrapy engine, to inspect or modify the downloader and scheduler behaviour, although this is an advanced use and this API is not yet stable.
spider
Spider currently being crawled. This is an instance of the spider class provided while constructing the crawler, and it is created after the arguments given in the crawl() method.

crawl(*args, **kwargs)[source]
Starts the crawler by instantiating its spider class with the given args and kwargs arguments, while setting the execution engine in motion.
Returns a deferred that is fired when the crawl is finished.
stop()[source]
Starts a graceful stop of the crawler and returns a deferred that is fired when the crawler is stopped.
- class scrapy.crawler.CrawlerRunner(settings=None)[source]
This is a convenient helper class that keeps track of, manages and runs crawlers inside an already setup reactor.
The CrawlerRunner object must be instantiated with a Settings object.
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)[source]
Run a crawler with the provided arguments.
It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- **crawler_or_spidercls** ([`Crawler`](#scrapy.crawler.Crawler) instance, `Spider` subclass or string) – already created crawler, or a spider class or spider's name inside the project to create it
- **args** ([_list_](https://docs.python.org/3/library/stdtypes.html#list)) – arguments to initialize the spider
- **kwargs** ([_dict_](https://docs.python.org/3/library/stdtypes.html#dict)) – keyword arguments to initialize the spider
- property crawlers
Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls)[source]
Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.

join()[source]
Returns a deferred that is fired when all managed crawlers have completed their executions.

stop()[source]
Stops simultaneously all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
- class scrapy.crawler.CrawlerProcess(settings=None, install_root_handler=True)[source]
Bases: scrapy.crawler.CrawlerRunner
A class to run multiple scrapy crawlers in a process simultaneously.
This class extends CrawlerRunner by adding support for starting a reactor and handling shutdown signals, like the keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than CrawlerRunner if you aren't running another reactor within your application.
The CrawlerProcess object must be instantiated with a Settings object.
Parameters: install_root_handler – whether to install root logging handler (default: True)
This class shouldn't be needed (since Scrapy is responsible for using it accordingly) unless writing scripts that manually handle the crawling process. See Run Scrapy from a script for an example.
crawl(crawler_or_spidercls, *args, **kwargs)
Run a crawler with the provided arguments.
It will call the given Crawler's crawl() method, while keeping track of it so it can be stopped later.
If crawler_or_spidercls isn't a Crawler instance, this method will try to create one using this parameter as the spider class given to it.
Returns a deferred that is fired when the crawling is finished.
Parameters:
- **crawler_or_spidercls** ([`Crawler`](#scrapy.crawler.Crawler) instance, `Spider` subclass or string) – already created crawler, or a spider class or spider's name inside the project to create it
- **args** ([_list_](https://docs.python.org/3/library/stdtypes.html#list)) – arguments to initialize the spider
- **kwargs** ([_dict_](https://docs.python.org/3/library/stdtypes.html#dict)) – keyword arguments to initialize the spider
- property crawlers
Set of crawlers started by crawl() and managed by this class.

create_crawler(crawler_or_spidercls)
Return a Crawler object.
- If crawler_or_spidercls is a Crawler, it is returned as-is.
- If crawler_or_spidercls is a Spider subclass, a new Crawler is constructed for it.
- If crawler_or_spidercls is a string, this function finds a spider with this name in a Scrapy project (using spider loader), then creates a Crawler instance for it.

join()
Returns a deferred that is fired when all managed crawlers have completed their executions.

start(stop_after_crawl=True)[source]
This method starts a reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.
If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().
Parameters: stop_after_crawl (boolean) – whether to stop the reactor when all crawlers have finished

stop()
Stops simultaneously all the crawling jobs taking place.
Returns a deferred that is fired when they all have ended.
Settings API
scrapy.settings.SETTINGS_PRIORITIES
Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take more precedence over lesser ones when setting and retrieving values in the Settings class.

    SETTINGS_PRIORITIES = {
        'default': 0,
        'command': 10,
        'project': 20,
        'spider': 30,
        'cmdline': 40,
    }

For a detailed explanation on each settings source, see: Settings.
scrapy.settings.get_settings_priority(priority)[source]
Small helper function that looks up a given string priority in the SETTINGS_PRIORITIES dictionary and returns its numerical value, or directly returns a given numerical priority.
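For example, a small sketch whose printed values follow the SETTINGS_PRIORITIES table above:

```python
from scrapy.settings import SETTINGS_PRIORITIES, get_settings_priority

print(get_settings_priority("cmdline"))  # 40, looked up in SETTINGS_PRIORITIES
print(get_settings_priority(15))         # 15, numeric priorities pass through unchanged
print(SETTINGS_PRIORITIES["spider"])     # 30
```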
- class scrapy.settings.Settings(values=None, priority='project')[source]
Bases: scrapy.settings.BaseSettings
This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.
It is a direct subclass and supports all methods of BaseSettings. Additionally, after instantiation of this class, the new object will have the global default settings described on Built-in settings reference already populated.
- class scrapy.settings.BaseSettings(values=None, priority='project')[source]
Instances of this class behave like dictionaries, but store priorities along with their (key, value) pairs, and can be frozen (i.e. marked immutable).
Key-value entries can be passed on initialization with the values argument, and they would take the priority level (unless values is already an instance of BaseSettings, in which case the existing priority levels will be kept). If the priority argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.
Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
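A short sketch of the priority behaviour just described (the setting name is only an example):

```python
from scrapy.settings import BaseSettings

settings = BaseSettings({"CONCURRENT_REQUESTS": 16}, priority="project")

# A lower-priority write is ignored because 'default' < 'project'...
settings.set("CONCURRENT_REQUESTS", 8, priority="default")
print(settings.getint("CONCURRENT_REQUESTS"))  # 16

# ...while a higher-priority write wins.
settings.set("CONCURRENT_REQUESTS", 32, priority="cmdline")
print(settings["CONCURRENT_REQUESTS"])  # 32
```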
copy()[source]
Make a deep copy of current settings.
This method returns a new instance of the Settings class, populated with the same values and their priorities.
Modifications to the new object won't be reflected on the original settings.

copy_to_dict()[source]
Make a copy of current settings and convert to a dict.
This method returns a new dict populated with the same values and their priorities as the current settings.
Modifications to the returned dict won't be reflected on the original settings.
This method can be useful, for example, for printing settings in the Scrapy shell.
freeze()[source]
Disable further changes to the current settings.
After calling this method, the present state of the settings will become immutable. Trying to change values through the set() method and its variants won't be possible and will be alerted.

frozencopy()[source]
Return an immutable copy of the current settings.
Alias for a freeze() call in the object returned by copy().
get(name, default=None)[source]
Get a setting value without affecting its original type.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
getbool(name, default=False)[source]
Get a setting value as a boolean.
`1`, `'1'`, `True` and `'True'` return `True`, while `0`, `'0'`, `False`, `'False'` and `None` return `False`.
For example, settings populated through environment variables set to `'0'` will return `False` when using this method.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
getdict(name, default=None)[source]
Get a setting value as a dictionary. If the setting original type is a dictionary, a copy of it will be returned. If it is a string it will be evaluated as a JSON dictionary. In the case that it is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
getfloat(name, default=0.0)[source]
Get a setting value as a float.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
getint(name, default=0)[source]
Get a setting value as an int.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
getlist(name, default=None)[source]
Get a setting value as a list. If the setting original type is a list, a copy of it will be returned. If it's a string it will be split by ",".
For example, settings populated through environment variables set to `'one,two'` will return a list `['one', 'two']` when using this method.
Parameters:
- **name** (_string_) – the setting name
- **default** (_any_) – the value to return if no setting is found
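A sketch of these conversion helpers with string values, e.g. as they might arrive from environment variables (the setting names are standard Scrapy settings, used here only as examples):

```python
from scrapy.settings import Settings

settings = Settings({
    "RETRY_ENABLED": "0",
    "DOWNLOAD_DELAY": "2.5",
    "SPIDER_MODULES": "shop.spiders,news.spiders",
})

print(settings.getbool("RETRY_ENABLED"))    # False
print(settings.getfloat("DOWNLOAD_DELAY"))  # 2.5
print(settings.getlist("SPIDER_MODULES"))   # ['shop.spiders', 'news.spiders']
print(settings.getint("MISSING_KEY", 7))    # 7, the provided default
```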
getpriority(name)[source]
Return the current numerical priority value of a setting, or None if the given name does not exist.
Parameters: name (string) – the setting name
getwithbase(name)[source]
Get a composition of a dictionary-like setting and its `_BASE` counterpart.
Parameters: name (string) – name of the dictionary-like setting
maxpriority()[source]
Return the numerical value of the highest priority present throughout all settings, or the numerical value for default from SETTINGS_PRIORITIES if there are no settings stored.

set(name, value, priority='project')[source]
Store a key/value attribute with a given priority.
Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won't have any effect.
Parameters:
- **name** (_string_) – the setting name
- **value** (_any_) – the value to associate with the setting
- **priority** (_string_ or [_int_](https://docs.python.org/3/library/functions.html#int)) – the priority of the setting. Should be a key of `SETTINGS_PRIORITIES` or an integer
setmodule(module, priority='project')[source]
Store settings from a module with a given priority.
This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.
Parameters:
- **module** (_module object_ or _string_) – the module or the path of the module
- **priority** (_string_ or [_int_](https://docs.python.org/3/library/functions.html#int)) – the priority of the settings. Should be a key of `SETTINGS_PRIORITIES` or an integer
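A sketch of setmodule() with a hypothetical myproject/custom_settings.py module:

```python
from scrapy.settings import Settings

# Suppose myproject/custom_settings.py (hypothetical) contains:
#     DOWNLOAD_DELAY = 1.5
#     CONCURRENT_REQUESTS = 8
#     _helper = "..."   # lowercase names are not picked up

settings = Settings()
# The module can be passed as an imported module object or as a path string.
settings.setmodule("myproject.custom_settings", priority="project")
print(settings.getfloat("DOWNLOAD_DELAY"))     # 1.5
print(settings.getint("CONCURRENT_REQUESTS"))  # 8
```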
update(values, priority='project')[source]
Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
If values is a string, it is assumed to be JSON-encoded and parsed into a dict with json.loads() first. If it is a BaseSettings instance, the per-key priorities will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.
Parameters:
- **values** (dict or string or [`BaseSettings`](#scrapy.settings.BaseSettings)) – the settings names and values
- **priority** (_string_ or [_int_](https://docs.python.org/3/library/functions.html#int)) – the priority of the settings. Should be a key of `SETTINGS_PRIORITIES` or an integer
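A sketch of update() with a dict and with a JSON-encoded string (the setting name is only an example):

```python
from scrapy.settings import BaseSettings

settings = BaseSettings()
settings.update({"LOG_LEVEL": "INFO"}, priority="project")
# A string value is parsed with json.loads() before being applied.
settings.update('{"LOG_LEVEL": "DEBUG"}', priority="cmdline")
print(settings["LOG_LEVEL"])  # DEBUG, because 'cmdline' outranks 'project'
```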
SpiderLoader API
- class scrapy.spiderloader.SpiderLoader[source]
This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

from_settings(settings)[source]
This class method is used by Scrapy to create an instance of the class. It's called with the current project settings, and it loads the spiders found recursively in the modules of the SPIDER_MODULES setting.
Parameters: settings (Settings instance) – project settings
load(spider_name)[source]
Get the Spider class with the given name. It'll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.
Parameters: spider_name (str) – spider class name

list()[source]
Get the names of the available spiders in the project.

find_by_request(request)[source]
List the spiders' names that can handle the given request. Will try to match the request's url against the domains of the spiders.
Parameters: request (Request instance) – queried request
Signals API
- class scrapy.signalmanager.SignalManager(sender=_Anonymous)[source]

connect(receiver, signal, **kwargs)[source]
Connect a receiver function to a signal.
The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.
Parameters:
- **receiver** (_callable_) – the function to be connected
- **signal** ([_object_](https://docs.python.org/3/library/functions.html#object)) – the signal to connect to
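For example, a sketch of connecting a handler to one of Scrapy's predefined signals; inside an extension you would normally use the already-created crawler.signals manager rather than instantiating your own:

```python
from scrapy import signals
from scrapy.signalmanager import SignalManager


def item_scraped_handler(item, response, spider):
    # item_scraped handlers receive the item, the response and the spider.
    spider.logger.info("Scraped an item from %s", response.url)


manager = SignalManager()
manager.connect(item_scraped_handler, signal=signals.item_scraped)
```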
disconnect(receiver, signal, **kwargs)[source]
Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.

disconnect_all(signal, **kwargs)[source]
Disconnect all receivers from the given signal.
Parameters: signal (object) – the signal to disconnect from

send_catch_log(signal, **kwargs)[source]
Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).

send_catch_log_deferred(signal, **kwargs)[source]
Like send_catch_log() but supports returning Deferred objects from signal handlers.
Returns a Deferred that gets fired once all signal handlers' deferreds were fired. Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
Stats Collector API
There are several Stats Collectors available under the scrapy.statscollectors module and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).
- class scrapy.statscollectors.StatsCollector[source]

get_value(key, default=None)[source]
Return the value for the given stats key or default if it doesn't exist.

get_stats()[source]
Get all stats from the currently running spider as a dict.

set_value(key, value)[source]
Set the given value for the given stats key.

set_stats(stats)[source]
Override the current stats with the dict passed in the stats argument.

inc_value(key, count=1, start=0)[source]
Increment the value of the given stats key, by the given count, assuming the start value given (when it's not set).

max_value(key, value)[source]
Set the given value for the given key only if the current value for the same key is lower than value. If there is no current value for the given key, the value is always set.

min_value(key, value)[source]
Set the given value for the given key only if the current value for the same key is greater than value. If there is no current value for the given key, the value is always set.

clear_stats()[source]
Clear all stats.
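As a sketch of the API above from a component that already holds a crawler reference (the stat keys are illustrative only):

```python
def record_example_stats(crawler, response_size):
    # In extensions and middlewares, the stats collector is available as crawler.stats.
    stats = crawler.stats
    stats.set_value("example/start_reason", "manual")
    stats.inc_value("example/responses_seen")             # 0 -> 1 (start defaults to 0)
    stats.max_value("example/max_response_size", response_size)
    stats.min_value("example/min_response_size", response_size)
    print(stats.get_value("example/responses_seen"))
    print(stats.get_stats())                              # all stats as a dict
```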
The following methods are not part of the stats collection API but are instead used when implementing custom stats collectors: