Settings

Settings

Scrapy设置(settings)提供了定制Scrapy组件的方法。可以控制包括核心(core)，插件(extension)，pipeline及spider组件。比如设置Json Pipeliine、LOG_LEVEL

获取设置值(Populating the settings)

设置可以通过多种方式设置，每个方式具有不同的优先级。下面以优先级降序的方式给出方式列表:

命令行选项(Command line Options)(最高优先级)每个spider的设置项目设置模块(Project settings module)

命令行选项(Command line options) 命令行传入的参数具有最高的优先级。您可以使用command line 选项 -s (或 —set) 来覆盖一个(或更多)选项。

样例:

  scrapy crawl myspider -s LOG_FILE=scrapy.log

每个spider的设置 scrapy.spiders.Spider.custom_settings

  class MySpider(scrapy.Spider):
  name = 'myspider'
  custom_settings = {
      'SOME_SETTING': 'some value',
  }

项目设置模块(Project settings module)项目设置模块是您Scrapy项目的标准配置文件

myproject.settings

如何访问配置(settings)

In a spider, the settings are available through self.settings:

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']
    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())

Settings can be accessed through the scrapy.crawler.Crawler.settings attribute of the Crawler that is passed to from_crawler method in extensions, middlewares and item pipelines:

class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("log is enabled!")
    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))

案例

添加一行代码print("Existing settings: %s" % self.settings['LOG_FILE'])

# -*- coding:utf-8 -*-
import scrapy
from tutorial.items import RecruitItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import logging
class RecruitSpider(CrawlSpider):
    name = "tencent_crawl"
    allowed_domains = ["hr.tencent.com"]
    start_urls = [
        "http://hr.tencent.com/position.php?&start=0#a"
    ]
    #提取匹配 'http://hr.tencent.com/position.php?&start=\d+'的链接
    page_lx = LinkExtractor(allow=('start=\d+'))
    rules = [
        #提取匹配,并使用spider的parse方法进行分析;并跟进链接(没有callback意味着follow默认为True)
        Rule(page_lx, callback='parseContent',follow=True)
    ]
    def parseContent(self, response):
        print response.url
        print("Existing settings: %s" % self.settings['LOG_FILE'])
        self.logger.info('Parse function called on %s', response.url)
        for sel in response.xpath('//*[@class="even"]'):
            name = sel.xpath('./td[1]/a/text()').extract()[0]
            detailLink = sel.xpath('./td[1]/a/@href').extract()[0]
            catalog =None
            if sel.xpath('./td[2]/text()'):
                catalog = sel.xpath('./td[2]/text()').extract()[0]
            recruitNumber = sel.xpath('./td[3]/text()').extract()[0]
            workLocation = sel.xpath('./td[4]/text()').extract()[0]
            publishTime = sel.xpath('./td[5]/text()').extract()[0]
            #print name, detailLink, catalog,recruitNumber,workLocation,publishTime
            item = RecruitItem()
            item['name']=name.encode('utf-8')
            item['detailLink']=detailLink.encode('utf-8')
            if catalog:
                item['catalog']=catalog.encode('utf-8')
            item['recruitNumber']=recruitNumber.encode('utf-8')
            item['workLocation']=workLocation.encode('utf-8')
            item['publishTime']=publishTime.encode('utf-8')
            yield item

内置设置参考手册

BOT_NAME

默认: 'scrapybot'

当您使用 startproject 命令创建项目时其也被自动赋值。
CONCURRENT_ITEMS

默认: 100

Item Processor(即 Item Pipeline) 同时处理(每个response的)item的最大值。
CONCURRENT_REQUESTS

默认: 16

Scrapy downloader 并发请求(concurrent requests)的最大值。
DEFAULT_REQUEST_HEADERS 默认:

  {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
  }

Scrapy HTTP Request使用的默认header。

DEPTH_LIMIT

默认: 0

爬取网站最大允许的深度(depth)值。如果为0，则没有限制。
DOWNLOAD_DELAY

默认: 0

下载器在下载同一个网站下一个页面前需要等待的时间。该选项可以用来限制爬取速度，减轻服务器压力。同时也支持小数:

  DOWNLOAD_DELAY = 0.25    # 250 ms of delay

该设置影响(默认启用的) RANDOMIZE_DOWNLOAD_DELAY 设置。默认情况下，Scrapy在两个请求间不等待一个固定的值，而是使用0.5到1.5之间的一个随机值 * DOWNLOAD_DELAY 的结果作为等待间隔。

DOWNLOAD_TIMEOUT

默认: 180

下载器超时时间(单位: 秒)。
ITEM_PIPELINES

默认: {}

保存项目中启用的pipeline及其顺序的字典。该字典默认为空，值(value)任意。不过值(value)习惯设置在0-1000范围内。

样例:

  ITEM_PIPELINES = {
      'mybot.pipelines.validate.ValidateMyItem': 300,
      'mybot.pipelines.validate.StoreMyItem': 800,
  }

LOG_ENABLED

默认: True

是否启用logging。
LOG_ENCODING

默认: 'utf-8'

logging使用的编码。
LOG_LEVEL

默认: 'DEBUG'

log的最低级别。可选的级别有: CRITICAL、 ERROR、WARNING、INFO、DEBUG 。
USER_AGENT

默认: "Scrapy/VERSION (+http://scrapy.org)"

爬取的默认User-Agent，除非被覆盖。