Core API
New in version 0.15.
This section documents the Scrapy core API. It is intended for developers of extensions and middlewares.
Crawler API
The main entry point to the Scrapy API is the Crawler object, which is passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access those components and hook their functionality into Scrapy.
The Extension Manager is responsible for loading and keeping track of installed extensions. It is configured through the EXTENSIONS setting, which contains a dictionary of all available extensions and their order, similar to how you configure the downloader middlewares.
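As a sketch, an EXTENSIONS setting might look like the following; the extension paths and order values are illustrative, and `myproject.extensions.MyExtension` is a hypothetical custom extension:

```python
# Illustrative EXTENSIONS setting: it maps extension import paths to their
# order values. The entries below are examples, not a required configuration.
EXTENSIONS = {
    'scrapy.extensions.corestats.CoreStats': 500,
    'myproject.extensions.MyExtension': 600,  # hypothetical custom extension
}
```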
- class scrapy.crawler.Crawler(spidercls, settings)
The Crawler object must be instantiated with a scrapy.spiders.Spider subclass and a scrapy.settings.Settings object.
settings
The settings manager of this crawler.
It is used by extensions and middlewares to access the Scrapy settings.
For an introduction on Scrapy settings see Settings.
For the API see the Settings class.
signals
The signals manager of this crawler.
It is used by extensions and middlewares to hook their functionality into Scrapy.
For an introduction on signals see Signals.
For the API see the SignalManager class.
stats
The stats collector of this crawler.
It is used by extensions and middlewares to record stats of their behaviour, or to access stats collected by other extensions.
For an introduction on stats collection see Stats Collection.
For the API see the StatsCollector class.
extensions
The extension manager that keeps track of enabled extensions.
Most extensions won't need to access this attribute.
For an introduction on extensions and a list of available extensions see Extensions.
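A minimal sketch of the pattern described above: an extension built through from_crawler that reads a setting and records stats. The MYEXT_ENABLED setting name and the 'myext/spiders_opened' stat key are hypothetical, and the signal hookup is shown as a comment because it needs a running crawler:

```python
class MyExtension:
    """Sketch of an extension wired up through from_crawler.

    MYEXT_ENABLED and the 'myext/spiders_opened' stat key are hypothetical
    names used for illustration only.
    """

    def __init__(self, enabled, stats):
        self.enabled = enabled
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler object is the single entry point to core components:
        enabled = crawler.settings.getbool('MYEXT_ENABLED')  # settings manager
        ext = cls(enabled, crawler.stats)                    # stats collector
        # Hook into Scrapy through the signals manager, e.g.:
        # crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        self.stats.inc_value('myext/spiders_opened')
```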
Settings API
scrapy.settings.SETTINGS_PRIORITIES
Dictionary that sets the key name and priority level of the default settings priorities used in Scrapy.
Each item defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take more precedence over lesser ones when setting and retrieving values in the Settings class.
SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}
For a detailed explanation of each settings source, see: Settings.
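The lookup behaviour described above can be sketched in a few lines; this mirrors the documented rule (string names are looked up, integers pass through), not necessarily Scrapy's internal code:

```python
SETTINGS_PRIORITIES = {
    'default': 0,
    'command': 10,
    'project': 20,
    'spider': 30,
    'cmdline': 40,
}

def get_priority(priority):
    # A string is looked up by name; an integer is used as-is.
    if isinstance(priority, str):
        return SETTINGS_PRIORITIES[priority]
    return priority
```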
- class scrapy.settings.Settings(values={}, priority='project')
This object stores Scrapy settings for the configuration of internal components, and can be used for any further customization.
After instantiation of this class, the new object will have the global default settings described in the Built-in settings reference already populated.
Additional values can be passed on initialization with the values argument, and they will take the given priority level. If the latter argument is a string, the priority name will be looked up in SETTINGS_PRIORITIES. Otherwise, a specific integer should be provided.
Once the object is created, new settings can be loaded or updated with the set() method, and can be accessed with the square bracket notation of dictionaries, or with the get() method of the instance and its value conversion variants. When requesting a stored key, the value with the highest priority will be retrieved.
set(name, value, priority='project')
Store a key/value attribute with a given priority.
Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won't have any effect.
Parameters:
- name (string) – the setting name
- value (any) – the value to associate with the setting
- priority (string or int) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer
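The "highest priority wins" rule for storing and retrieving values can be sketched as follows; this is a simplified model of the documented behaviour, not Scrapy's actual implementation:

```python
class PrioritySettingsSketch:
    """Simplified model: each name keeps the value with the highest priority."""

    def __init__(self):
        self._store = {}  # name -> (value, priority)

    def set(self, name, value, priority=20):
        current = self._store.get(name)
        # An equal or greater priority replaces the stored value;
        # a lower one is ignored.
        if current is None or priority >= current[1]:
            self._store[name] = (value, priority)

    def get(self, name, default=None):
        entry = self._store.get(name)
        return entry[0] if entry is not None else default
```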
setdict(values, priority='project')
Store key/value pairs with a given priority.
This is a helper function that calls set() for every item of values with the provided priority.
Parameters:
- values (dict) – the setting names and values
- priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
setmodule(module, priority='project')
Store settings from a module with a given priority.
This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.
Parameters:
- module (module object or string) – the module or the path of the module
- priority (string or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
getbool(name, default=False)
Get a setting value as a boolean. For example, 1, '1' and True all return True, while 0, '0', False and None return False.
For instance, a setting populated through an environment variable with the value '0' will return False when accessed through this method.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
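The conversion rules listed above can be sketched as a standalone helper (a sketch of the documented rules, not Scrapy's internal code):

```python
def to_bool(value):
    """Convert a raw setting value to bool using the documented rules:
    1, '1' and True -> True; 0, '0', False and None -> False."""
    if value in (True, 1, '1'):
        return True
    if value in (False, 0, '0', None):
        return False
    raise ValueError(f"Unsupported boolean value: {value!r}")
```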
getlist(name, default=None)
Get a setting value as a list. If the setting's original type is a list, it is returned as-is. If it is a string, it is split on ",".
For instance, a setting populated through an environment variable with the value 'one,two' will return the list ['one', 'two'] when accessed through this method.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
getdict(name, default=None)
Get a setting value as a dictionary. If the setting's original type is a dictionary, a copy of it will be returned. If it is a string, it will be evaluated as a JSON dictionary.
Parameters:
- name (string) – the setting name
- default (any) – the value to return if no setting is found
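The list and dictionary conversion variants described above can be sketched together; this is a simplified model of the documented behaviour:

```python
import json

def to_list(value):
    # A list is returned as-is; a string is split on commas.
    if isinstance(value, list):
        return value
    return value.split(',')

def to_dict(value):
    # A dict is copied; a string is evaluated as a JSON dictionary.
    if isinstance(value, dict):
        return dict(value)
    return json.loads(value)
```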
copy()
Make a deep copy of the current settings.
This method returns a new instance of the Settings class, populated with the same values and their priorities.
Modifications to the new object won't be reflected on the original settings.
freeze()
Disable further changes to the current settings.
After calling this method, the present state of the settings will become immutable. Trying to change values through the set() method and its variants won't be possible and will be alerted.
SpiderLoader API
- class scrapy.loader.SpiderLoader
This class is in charge of retrieving and handling the spider classes defined across the project.
Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.
from_settings(settings)
This class method is used by Scrapy to create an instance of the class. It is called with the current project settings, and it loads the spiders found in the modules of the SPIDER_MODULES setting.
Parameters:
- settings (Settings instance) – project settings
load(spider_name)
Get the Spider class with the given name. It will look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.
Parameters:
- spider_name (str) – spider class name
find_by_request(request)
List the names of the spiders that can handle the given request. It will try to match the request's url against the domains of the spiders.
Parameters:
- request (Request instance) – queried request
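The load() and find_by_request() behaviour can be sketched with a minimal in-memory loader. This is a simplified model: the request is represented here by its url string, the spider registry is passed in directly (a real loader reads SPIDER_MODULES), and the spider and domain names used are hypothetical:

```python
from urllib.parse import urlparse

class SpiderLoaderSketch:
    """Minimal in-memory model of the SpiderLoader behaviour above."""

    def __init__(self, spiders):
        # spiders: dict mapping spider name -> spider class
        self._spiders = spiders

    def load(self, spider_name):
        # Raises KeyError when no spider with that name was loaded.
        return self._spiders[spider_name]

    def find_by_request(self, url):
        # Simplified: match the url's host against each spider's
        # allowed_domains, including subdomains.
        host = urlparse(url).netloc
        return [
            name for name, cls in self._spiders.items()
            if any(host == d or host.endswith('.' + d)
                   for d in getattr(cls, 'allowed_domains', []))
        ]
```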
Signals API
- class scrapy.signalmanager.SignalManager
connect(receiver, signal)
Connect a receiver function to a signal.
The signal can be any object, although Scrapy comes with some predefined signals that are documented in Signals.
Parameters:
- receiver (callable) – the function to be connected
- signal (object) – the signal to connect to
send_catch_log(signal, **kwargs)
Send a signal, catch exceptions and log them.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
send_catch_log_deferred(signal, **kwargs)
Like send_catch_log() but supports returning deferreds from signal handlers.
Returns a deferred that gets fired once all signal handlers' deferreds have fired.
The keyword arguments are passed to the signal handlers (connected through the connect() method).
disconnect(receiver, signal)
Disconnect a receiver function from a signal. This has the opposite effect of the connect() method, and the arguments are the same.
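The connect / send_catch_log / disconnect contract can be sketched with a tiny dispatcher. This is a simplified model: exceptions are caught and returned in the results instead of propagating, standing in for the logging described above:

```python
class SignalManagerSketch:
    """Tiny dispatcher modelling the SignalManager contract above."""

    def __init__(self):
        self._receivers = {}  # signal object -> list of receivers

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def disconnect(self, receiver, signal):
        # Opposite of connect(), with the same arguments.
        self._receivers[signal].remove(receiver)

    def send_catch_log(self, signal, **kwargs):
        # Keyword arguments are passed to every connected handler;
        # exceptions are caught (and would be logged) rather than raised.
        results = []
        for receiver in self._receivers.get(signal, []):
            try:
                results.append((receiver, receiver(**kwargs)))
            except Exception as exc:
                results.append((receiver, exc))
        return results
```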
Stats Collector API
There are several Stats Collectors available under the scrapy.statscollectors module, and they all implement the Stats Collector API defined by the StatsCollector class (which they all inherit from).