Scrapy Tutorial
In this tutorial, we’ll assume that Scrapy is already installed on your system. If that’s not the case, see the Installation guide.
We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:
- Creating a new Scrapy project
- Writing a spider to crawl a site and extract data
- Exporting the scraped data using the command line
- Changing the spider to recursively follow links
- Using spider arguments
Scrapy is written in Python. If you’re new to the language, you might want to start by getting an idea of what the language is like, to get the most out of Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you’re new to programming and want to start with Python, the following books may be useful to you:
- Automate the Boring Stuff With Python
- How To Think Like a Computer Scientist
- Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers, as well as the suggested resources in the learnpython-subreddit.
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
- scrapy startproject tutorial
This will create a tutorial directory with the following contents:
- tutorial/
-     scrapy.cfg            # deploy configuration file
-     tutorial/             # project's Python module, you'll import your code from here
-         __init__.py
-         items.py          # project items definition file
-         middlewares.py    # project middlewares file
-         pipelines.py      # project pipelines file
-         settings.py       # project settings file
-         spiders/          # a directory where you'll later put your spiders
-             __init__.py
Our first Spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     def start_requests(self):
-         urls = [
-             'http://quotes.toscrape.com/page/1/',
-             'http://quotes.toscrape.com/page/2/',
-         ]
-         for url in urls:
-             yield scrapy.Request(url=url, callback=self.parse)
-     def parse(self, response):
-         page = response.url.split("/")[-2]
-         filename = 'quotes-%s.html' % page
-         with open(filename, 'wb') as f:
-             f.write(response.body)
-         self.log('Saved file %s' % filename)
As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:
- name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
- start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
- parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
How to run our spider
To put our spider to work, go to the project’s top level directory and run:
- scrapy crawl quotes
This command runs the spider with name quotes that we’ve just added, which will send some requests for the quotes.toscrape.com domain. You will get an output similar to this:
- ... (omitted for brevity)
- 2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
- 2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
- 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
- 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
- 2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
- 2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
- 2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
- 2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
- ...
Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.
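The file names come from plain string handling on the response URL. A minimal sketch of that single step follows, so it can be tried without running a crawl; page_filename is a hypothetical helper introduced here for illustration, not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the output file name the way the tutorial's parse() does."""
    # A URL ending in "/" splits into [..., 'page', '1', ''],
    # so the second-to-last piece is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that this relies on the trailing slash: for a URL without one, the second-to-last piece would be "page" rather than the page number.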
Note
If you are wondering why we haven’t parsed the HTML yet, hold on, we will cover that soon.
What just happened under the hood?
Scrapy schedules the scrapy.Request objects returned by the start_requests method of the Spider. Upon receiving a response for each one, it instantiates Response objects and calls the callback method associated with the request (in this case, the parse method), passing the response as argument.
A shortcut to the start_requests method
Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     start_urls = [
-         'http://quotes.toscrape.com/page/1/',
-         'http://quotes.toscrape.com/page/2/',
-     ]
-     def parse(self, response):
-         page = response.url.split("/")[-2]
-         filename = 'quotes-%s.html' % page
-         with open(filename, 'wb') as f:
-             f.write(response.body)
The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.
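The default behaviour described above can be pictured with a toy base class. This is a sketch under assumptions and not Scrapy’s code: MiniSpider and QuotesSketch are hypothetical names, and a request is modelled as a plain (url, callback) tuple instead of a scrapy.Request.

```python
class MiniSpider:
    """Toy stand-in for a spider base class; not Scrapy's real implementation."""

    start_urls = []

    def start_requests(self):
        # One (url, callback) pair per start URL; parse() is used
        # because no other callback was assigned.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class QuotesSketch(MiniSpider):
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        # A real callback would receive a response object; a string stands in here.
        return f"parsed {response}"


spider = QuotesSketch()
requests = list(spider.start_requests())
for url, callback in requests:
    print(callback(url))
```

The subclass only lists URLs and defines parse(); the pairing of each URL with that callback happens in the inherited method.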
Extracting data
The best way to learn how to extract data with Scrapy is trying selectors using the Scrapy shell. Run:
- scrapy shell 'http://quotes.toscrape.com/page/1/'
Note
Remember to always enclose URLs in quotes when running Scrapy shell from the command line, otherwise URLs containing arguments (i.e. the & character) will not work.
On Windows, use double quotes instead:
- scrapy shell "http://quotes.toscrape.com/page/1/"
You will see something like:
- [ ... Scrapy log here ... ]
- 2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
- [s] Available Scrapy objects:
- [s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
- [s] crawler <scrapy.crawler.Crawler object at 0x7fa91d888c90>
- [s] item {}
- [s] request <GET http://quotes.toscrape.com/page/1/>
- [s] response <200 http://quotes.toscrape.com/page/1/>
- [s] settings <scrapy.settings.Settings object at 0x7fa91d888c10>
- [s] spider <DefaultSpider 'default' at 0x7fa91c8af990>
- [s] Useful shortcuts:
- [s] shelp() Shell help (print this help)
- [s] fetch(req_or_url) Fetch request (or URL) and update local objects
- [s] view(response) View response in a browser
Using the shell, you can try selecting elements using CSS with the response object:
- >>> response.css('title')
- [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data.
To extract the text from the title above, you can do:
- >>> response.css('title::text').getall()
- ['Quotes to Scrape']
There are two things to note here: one is that we’ve added ::text to the CSS query, to mean we want to select only the text elements directly inside the <title> element. If we don’t specify ::text, we’d get the full title element, including its tags:
- >>> response.css('title').getall()
- ['<title>Quotes to Scrape</title>']
The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all. When you know you just want the first result, as in this case, you can do:
- >>> response.css('title::text').get()
- 'Quotes to Scrape'
As an alternative, you could’ve written:
- >>> response.css('title::text')[0].get()
- 'Quotes to Scrape'
However, using .get() directly on a SelectorList instance avoids an IndexError and returns None when it doesn’t find any element matching the selection.
There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.
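The difference between indexing and a tolerant lookup can be tried with an ordinary list. The sketch below is only an analogue of the behaviour described above; first_or_default is a hypothetical helper, not a Scrapy function.

```python
def first_or_default(results, default=None):
    """Return the first result, or `default` when nothing matched."""
    return results[0] if results else default


matches = []  # pretend the selector matched nothing on this page

# Tolerant lookup: no exception, just a fallback value.
safe = first_or_default(matches)

# Plain indexing on the same empty result raises IndexError.
try:
    matches[0]
    raised = False
except IndexError:
    raised = True

print(safe, raised)  # None True
```

With the tolerant version, a missing element leaves one field empty instead of aborting the whole callback.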
Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:
- >>> response.css('title::text').re(r'Quotes.*')
- ['Quotes to Scrape']
- >>> response.css('title::text').re(r'Q\w+')
- ['Quotes']
- >>> response.css('title::text').re(r'(\w+) to (\w+)')
- ['Quotes', 'Scrape']
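The same three patterns can be checked against the title string with the standard re module. This is a sketch, not how the re() method is implemented: flat_findall is a hypothetical helper that flattens captured groups into a single list so the results line up with the shell output above.

```python
import re

title = "Quotes to Scrape"


def flat_findall(pattern: str, text: str) -> list:
    """re.findall, with tuples of captured groups flattened into one list."""
    flat = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            flat.extend(match)   # several groups: add each captured string
        else:
            flat.append(match)   # no group or a single group: already a string
    return flat


print(flat_findall(r"Quotes.*", title))
print(flat_findall(r"Q\w+", title))
print(flat_findall(r"(\w+) to (\w+)", title))
```

Note that when a pattern contains groups, only the captured parts are returned, not the whole match.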
In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser’s developer tools to inspect the HTML and come up with a selector (see Using your browser’s Developer Tools for scraping).
Selector Gadget is also a nice tool to quickly find CSS selectors for visually selected elements, which works in many browsers.
XPath: a brief intro
Besides CSS, Scrapy selectors also support using XPath expressions:
- >>> response.xpath('//title')
- [<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
- >>> response.xpath('//title/text()').get()
- 'Quotes to Scrape'
XPath expressions are very powerful, and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood. You can see that if you read closely the text representation of the selector objects in the shell.
While perhaps not as popular as CSS selectors, XPath expressions offer more power because, besides navigating the structure, they can also look at the content. Using XPath, you’re able to select things like the link that contains the text “Next Page”. This makes XPath very well suited to scraping, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
We won’t cover much of XPath here, but you can read more about using XPath with Scrapy Selectors here. To learn more about XPath, we recommend this tutorial to learn XPath through examples, and this tutorial to learn “how to think in XPath”.
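To get a feel for selecting by content rather than by structure, here is a small sketch using the standard library’s xml.etree.ElementTree, which supports only a limited subset of XPath. The pager markup is a hypothetical, well-formed snippet made up for this illustration; Scrapy selectors accept full XPath expressions and real-world HTML, so treat this as an analogy only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager markup used only for this illustration.
PAGER = """
<ul class="pager">
  <li class="previous"><a href="/page/1/">Previous Page</a></li>
  <li class="next"><a href="/page/3/">Next Page</a></li>
</ul>
"""

root = ET.fromstring(PAGER.strip())

# Select <a> elements by their text content, not by their position or class.
next_links = root.findall(".//a[.='Next Page']")
hrefs = [a.get("href") for a in next_links]
print(hrefs)  # ['/page/3/']
```

The predicate in square brackets is what a purely structural selector cannot express: it picks the link because of what it says.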
Extracting quotes and authors
Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.
Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:
- <div class="quote">
- <span class="text">“The world as we have created it is a process of our
- thinking. It cannot be changed without changing our thinking.”</span>
- <span>
- by <small class="author">Albert Einstein</small>
- <a href="/author/Albert-Einstein">(about)</a>
- </span>
- <div class="tags">
- Tags:
- <a class="tag" href="/tag/change/page/1/">change</a>
- <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
- <a class="tag" href="/tag/thinking/page/1/">thinking</a>
- <a class="tag" href="/tag/world/page/1/">world</a>
- </div>
- </div>
Let’s open up scrapy shell and play a bit to find out how to extract the data we want:
- $ scrapy shell 'http://quotes.toscrape.com'
We get a list of selectors for the quote HTML elements with:
- >>> response.css("div.quote")
- [<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
- <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
- ...]
Each of the selectors returned by the query above allows us to run further queries over their sub-elements. Let’s assign the first selector to a variable, so that we can run our CSS selectors directly on a particular quote:
- >>> quote = response.css("div.quote")[0]
Now, let’s extract text, author and the tags from that quote using the quote object we just created:
- >>> text = quote.css("span.text::text").get()
- >>> text
- '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
- >>> author = quote.css("small.author::text").get()
- >>> author
- 'Albert Einstein'
Given that the tags are a list of strings, we can use the .getall() method to get all of them:
- >>> tags = quote.css("div.tags a.tag::text").getall()
- >>> tags
- ['change', 'deep-thoughts', 'thinking', 'world']
Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:
- >>> for quote in response.css("div.quote"):
- ...     text = quote.css("span.text::text").get()
- ...     author = quote.css("small.author::text").get()
- ...     tags = quote.css("div.tags a.tag::text").getall()
- ...     print(dict(text=text, author=author, tags=tags))
- {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
- {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
- ...
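The shape of the resulting dictionary can also be reproduced offline. The sketch below parses the quote markup shown earlier with the standard library’s xml.etree.ElementTree instead of Scrapy selectors; this works only because that particular snippet happens to be well-formed XML, which real pages often are not.

```python
import xml.etree.ElementTree as ET

# The quote markup from the tutorial, pasted as a string.
QUOTE_HTML = """
<div class="quote">
  <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
  <span>
    by <small class="author">Albert Einstein</small>
    <a href="/author/Albert-Einstein">(about)</a>
  </span>
  <div class="tags">
    Tags:
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
    <a class="tag" href="/tag/thinking/page/1/">thinking</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
"""

quote = ET.fromstring(QUOTE_HTML.strip())
item = {
    # Single values: take the first matching element's text.
    "text": quote.findtext("span[@class='text']"),
    "author": quote.findtext(".//small[@class='author']"),
    # Multiple values: collect the text of every matching element.
    "tags": [a.text for a in quote.findall(".//a[@class='tag']")],
}
print(item["author"], item["tags"])
```

As in the shell session, single-valued fields use a "first match" lookup while the tags use an "all matches" lookup.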
Extracting data in our spider
Let’s get back to our spider. Until now, it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into our spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     start_urls = [
-         'http://quotes.toscrape.com/page/1/',
-         'http://quotes.toscrape.com/page/2/',
-     ]
-     def parse(self, response):
-         for quote in response.css('div.quote'):
-             yield {
-                 'text': quote.css('span.text::text').get(),
-                 'author': quote.css('small.author::text').get(),
-                 'tags': quote.css('div.tags a.tag::text').getall(),
-             }
If you run this spider, it will output the extracted data with the log:
- 2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
- {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
- 2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
- {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
- scrapy crawl quotes -o quotes.json
That will generate a quotes.json file containing all scraped items, serialized in JSON.
For historic reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you’ll end up with a broken JSON file.
You can also use other formats, like JSON Lines:
- scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it’s stream-like: you can easily append new records to it. It doesn’t have the same problem as JSON when you run the command twice. Also, as each record is a separate line, you can process big files without having to fit everything in memory; there are tools like JQ to help do that at the command line.
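The difference between the two formats when appending can be reproduced with the standard json module alone. The sketch below simulates two runs writing to the same output; the sample records are made up for illustration, and the strings stand in for file contents.

```python
import json

# Hypothetical items from two separate runs.
run_1 = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]
run_2 = [{"author": "André Gide"}]

# JSON: appending a second array after the first gives "[...][...]",
# which is no longer a single valid JSON document.
appended_json = json.dumps(run_1) + json.dumps(run_2)
try:
    json.loads(appended_json)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines: one object per line, so appending just adds more lines,
# and each line can be parsed on its own.
appended_jl = "".join(json.dumps(item) + "\n" for item in run_1 + run_2)
records = [json.loads(line) for line in appended_jl.splitlines()]

print(json_ok, len(records))  # False 3
```

Reading line by line is also what makes it possible to process a large file without loading it all at once.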
In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. You don’t need to implement any item pipelines, though, if you just want to store the scraped items.
Following links
Let’s say, instead of just scraping the stuff from the first two pages from http://quotes.toscrape.com, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links from them.
The first thing is to extract the link to the page we want to follow. Examining our page, we can see there is a link to the next page with the following markup:
- <ul class="pager">
- <li class="next">
- <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
- </li>
- </ul>
We can try extracting it in the shell:
- >>> response.css('li.next a').get()
- '<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that, Scrapy supports a CSS extension that lets you select the attribute contents, like this:
- >>> response.css('li.next a::attr(href)').get()
- '/page/2/'
There is also an attrib property available (see Selecting element attributes for more):
- >>> response.css('li.next a').attrib['href']
- '/page/2/'
Let’s now see our spider modified to recursively follow the link to the next page, extracting data from it:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     start_urls = [
-         'http://quotes.toscrape.com/page/1/',
-     ]
-     def parse(self, response):
-         for quote in response.css('div.quote'):
-             yield {
-                 'text': quote.css('span.text::text').get(),
-                 'author': quote.css('small.author::text').get(),
-                 'tags': quote.css('div.tags a.tag::text').getall(),
-             }
-         next_page = response.css('li.next a::attr(href)').get()
-         if next_page is not None:
-             next_page = response.urljoin(next_page)
-             yield scrapy.Request(next_page, callback=self.parse)
Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.
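How a relative link becomes an absolute URL can be explored with the standard library’s urllib.parse.urljoin. This sketch shows the general resolution rules only, with the page URL playing the role of the response URL; it does not call Scrapy.

```python
from urllib.parse import urljoin

page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative link replaces the whole path.
print(urljoin(page_url, "/page/2/"))
# A path-relative link is resolved against the current "directory".
print(urljoin(page_url, "../2/"))
# An absolute link is returned unchanged.
print(urljoin(page_url, "http://quotes.toscrape.com/page/3/"))
```

The first two calls both produce http://quotes.toscrape.com/page/2/, which is why joining is safe to apply whether or not the extracted link was already absolute.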
What you see here is Scrapy’s mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules you define, and extract different kinds of data depending on the page being visited.
In our example, it creates a sort of loop, following all the links to the next page until it doesn’t find one, which is handy for crawling blogs, forums and other sites with pagination.
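The loop described above can be imitated in a few lines of plain Python. Everything here is a hypothetical stand-in chosen for illustration: SITE is a made-up three-page site held in a dict, Request is a simple namedtuple, and crawl() is a toy scheduler that is far simpler than Scrapy’s engine.

```python
from collections import deque, namedtuple

# Hypothetical stand-ins: a three-page "site" and a minimal request type.
Request = namedtuple("Request", ["url", "callback"])
SITE = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["q4"], "next": None},
}


def parse(url, page):
    # Yield one item per quote, then a request for the next page if there is one.
    for quote in page["quotes"]:
        yield {"text": quote, "page": url}
    if page["next"] is not None:
        yield Request(page["next"], parse)


def crawl(start_url):
    items, queue = [], deque([Request(start_url, parse)])
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, SITE[request.url]):
            if isinstance(result, Request):
                queue.append(result)  # schedule the follow-up request
            else:
                items.append(result)  # collect the scraped item
    return items


items = crawl("/page/1/")
print([item["text"] for item in items])  # ['q1', 'q2', 'q3', 'q4']
```

The crawl stops on its own once a page yields no further request, which matches the behaviour on the last page of a paginated site.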
A shortcut for creating Requests
As a shortcut for creating Request objects you can use response.follow:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     start_urls = [
-         'http://quotes.toscrape.com/page/1/',
-     ]
-     def parse(self, response):
-         for quote in response.css('div.quote'):
-             yield {
-                 'text': quote.css('span.text::text').get(),
-                 'author': quote.css('span small::text').get(),
-                 'tags': quote.css('div.tags a.tag::text').getall(),
-             }
-         next_page = response.css('li.next a::attr(href)').get()
-         if next_page is not None:
-             yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string; this selector should extract necessary attributes:
- for href in response.css('ul.pager a::attr(href)'):
-     yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:
- for a in response.css('ul.pager a'):
-     yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use response.follow_all instead:
- anchors = response.css('ul.pager a')
- yield from response.follow_all(anchors, callback=self.parse)
or, shortening it further:
- yield from response.follow_all(css='ul.pager a', callback=self.parse)
More examples and patterns
Here is another spider that illustrates callbacks and following links, this time for scraping author information:
- import scrapy
- class AuthorSpider(scrapy.Spider):
-     name = 'author'
-     start_urls = ['http://quotes.toscrape.com/']
-     def parse(self, response):
-         author_page_links = response.css('.author + a')
-         yield from response.follow_all(author_page_links, self.parse_author)
-         pagination_links = response.css('li.next a')
-         yield from response.follow_all(pagination_links, self.parse)
-     def parse_author(self, response):
-         def extract_with_css(query):
-             return response.css(query).get(default='').strip()
-         yield {
-             'name': extract_with_css('h3.author-title::text'),
-             'birthdate': extract_with_css('.author-born-date::text'),
-             'bio': extract_with_css('.author-description::text'),
-         }
This spider will start from the main page. It will follow all the links to the author pages, calling the parse_author callback for each of them, and also the pagination links with the parse callback, as we saw before.
Here we’re passing callbacks to response.follow_all as positional arguments to make the code shorter; it also works for Request.
The parse_author callback defines a helper function to extract and clean up the data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.
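The idea behind duplicate filtering can be sketched with a set of already-seen URLs. This is a simplified illustration, not Scrapy’s filter: filter_duplicates is a hypothetical helper that compares bare URL strings, and the author links are sample values made up for the example.

```python
def filter_duplicates(urls):
    """Yield each URL the first time it is seen and skip later repeats."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already scheduled once, so drop the repeat
        seen.add(url)
        yield url


# Hypothetical links: the same author appears under two different quotes.
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",
]
unique = list(filter_duplicates(author_links))
print(unique)  # ['/author/Albert-Einstein', '/author/J-K-Rowling']
```

Because the check happens before a request is sent, the spider code itself can yield the same link as often as it likes.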
Hopefully by now you have a good understanding of how to use the mechanism of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.
Using spider arguments
You can provide command line arguments to your spiders by using the -a option when running them:
- scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the Spider’s __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
- import scrapy
- class QuotesSpider(scrapy.Spider):
-     name = "quotes"
-     def start_requests(self):
-         url = 'http://quotes.toscrape.com/'
-         tag = getattr(self, 'tag', None)
-         if tag is not None:
-             url = url + 'tag/' + tag
-         yield scrapy.Request(url, self.parse)
-     def parse(self, response):
-         for quote in response.css('div.quote'):
-             yield {
-                 'text': quote.css('span.text::text').get(),
-                 'author': quote.css('small.author::text').get(),
-             }
-         next_page = response.css('li.next a::attr(href)').get()
-         if next_page is not None:
-             yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you’ll notice that it will only visit URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
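The URL-building step can be tested on its own with a toy class whose keyword arguments become attributes, as described above for -a options. ArgsSketch and its start_url() method are hypothetical names used only for this sketch; the class does not inherit from scrapy.Spider.

```python
class ArgsSketch:
    """Toy class: keyword arguments become attributes, mimicking -a options."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        # getattr with a default keeps this safe when no tag was passed.
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(ArgsSketch(tag="humor").start_url())  # http://quotes.toscrape.com/tag/humor
print(ArgsSketch().start_url())             # http://quotes.toscrape.com/
```

Values given on the command line arrive as strings, so an argument meant to be a number would need converting before use.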
You can learn more about handling spider arguments here.
Next steps
This tutorial covered only the basics of Scrapy, but there are a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the command-line tool, spiders, selectors and other things the tutorial hasn’t covered, like modeling the scraped data. If you prefer to play with an example project, check the Examples section.