Link Extractors
Link extractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects) that will eventually be followed.
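Scrapy provides a ready-made one, available via from scrapy.linkextractors import LinkExtractor, but you can also create your own custom link extractor to suit your needs by implementing a simple interface.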
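The only public method that every link extractor has is extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. Link extractors are meant to be instantiated once, and their extract_links method called several times with different responses to extract the links to follow.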
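Link extractors are used in the CrawlSpider class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you don't subclass CrawlSpider, as their purpose is very simple: to extract links.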
Built-in link extractors reference
The link extractor classes bundled with Scrapy are provided in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is the same as LxmlLinkExtractor:
- from scrapy.linkextractors import LinkExtractor
There used to be other link extractor classes in previous Scrapy versions, but they are deprecated now.
LxmlLinkExtractor
- class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=True, unique=True, process_value=None)
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml's robust HTMLParser.
Parameters:
- allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
- deny_domains (str or list) – a single value or a list of strings containing domains which won't be considered for extracting the links.
- deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
- restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted URL (using scrapy.utils.url.canonicalize_url). Defaults to True.
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
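- process_value (callable) – a function which receives each value extracted from the scanned tags and attributes and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.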
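For example, to extract links from this code:
- <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function as process_value:
- import re

  def process_value(value):
      # Capture the path passed to goToPage(); the literal "(" must be escaped.
      m = re.search(r"javascript:goToPage\('(.*?)'", value)
      if m:
          return m.group(1)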