urllib.robotparser — Parser for robots.txt
This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the Web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.
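For illustration, a minimal robots.txt using the directives this module understands might look like the following. The host and paths are made up; the numeric values deliberately echo the example session at the end of this page (a Request-rate of 3/20 means 3 requests per 20 seconds):

    User-agent: *
    Crawl-delay: 6
    Request-rate: 3/20
    Disallow: /cgi-bin/search
    Sitemap: http://www.example.com/sitemap.xml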
class urllib.robotparser.RobotFileParser(url='')

    This class provides methods to read, parse and answer questions about the robots.txt file at url.

    set_url(url)
        Sets the URL referring to a robots.txt file.

    read()
        Reads the robots.txt URL and feeds it to the parser.

    parse(lines)
        Parses the lines argument. (A sketch that feeds parse() directly, without a network fetch, follows this class description.)

    can_fetch(useragent, url)
        Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.

    mtime()
        Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically (see the refresh sketch at the end of this page).

    modified()
        Sets the time the robots.txt file was last fetched to the current time.

    crawl_delay(useragent)
        Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
        New in version 3.6.
    request_rate(useragent)
        Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter or it doesn't apply to the useragent specified or the robots.txt entry for this parameter has invalid syntax, return None.
        New in version 3.6.
    site_maps()
        Returns the contents of the Sitemap parameter from robots.txt in the form of a list(). If there is no such parameter or the robots.txt entry for this parameter has invalid syntax, return None.
        New in version 3.8.
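The rules do not have to come from a network fetch: parse() accepts an iterable of lines directly, which is convenient for testing. A minimal sketch, using made-up robots.txt content (the example.com URLs are hypothetical):

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.parse([
...     "User-agent: *",
...     "Disallow: /private/",
...     "Sitemap: http://www.example.com/sitemap.xml",
... ])
>>> rp.can_fetch("*", "http://www.example.com/private/page.html")
False
>>> rp.site_maps()
['http://www.example.com/sitemap.xml']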
The following example demonstrates basic use of the RobotFileParser class:
>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True
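For a long-running spider, mtime() and modified() support the periodic re-check mentioned above. A minimal sketch of that pattern, assuming a one-hour refresh interval (an arbitrary choice) and the same URL as above; the helper name polite_can_fetch is ours, not part of the module:

    import time
    import urllib.robotparser

    REFRESH_SECONDS = 3600  # arbitrary refresh interval, not mandated by the module

    rp = urllib.robotparser.RobotFileParser("http://www.musi-cal.com/robots.txt")
    rp.read()

    def polite_can_fetch(url, useragent="*"):
        # mtime() reports when robots.txt was last fetched; re-read it when
        # the cached copy is older than the chosen interval.  A successful
        # read() refreshes the timestamp via modified().
        if time.time() - rp.mtime() > REFRESH_SECONDS:
            rp.read()
        return rp.can_fetch(useragent, url)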