Link Extractors
A link extractor is an object that extracts links from responses.
The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching scrapy.link.Link objects from a Response object.
Link extractors are used in CrawlSpider spiders through a set of Rule objects. You can also use link extractors in regular spiders.
Link extractor reference
The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. For convenience it can also be imported as scrapy.linkextractors.LinkExtractor:
    from scrapy.linkextractors import LinkExtractor
LxmlLinkExtractor
- class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)
LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.
Parameters:
- allow (a regular expression (or list of regular expressions)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
- deny (a regular expression (or list of regular expressions)) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won’t exclude any links.
- allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links.
- deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links.
- deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to scrapy.linkextractors.IGNORED_EXTENSIONS.
Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.
- restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.
- restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.
- restrict_text (a regular expression (or list of regular expressions)) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.
- tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
- attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',).
- canonicalize (boolean) – canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you’re using LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.
- unique (boolean) – whether duplicate filtering should be applied to extracted links.
- process_value (callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.
For example, to extract links from this code:
    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
You can use the following function in process_value:
    import re

    def process_value(value):
        # Capture the path passed to goToPage(). With no match the function
        # returns None, which makes the link extractor ignore the link.
        m = re.search(r"javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)
- strip (boolean) – whether to strip whitespace from extracted attributes. According to the HTML5 standard, leading and trailing whitespace must be stripped from href attributes of <a>, <area> and many other elements, the src attribute of <img>, <iframe> elements, etc., so LinkExtractor strips space chars by default. Set strip=False to turn it off (e.g. if you’re extracting URLs from elements or attributes which allow leading/trailing whitespace).
Only links that match the settings passed to the __init__ method of the link extractor are returned.
Duplicate links are omitted.