Debugging Spiders

This document explains the most common techniques for debugging spiders. Consider the following Scrapy spider:

    import scrapy
    from myproject.items import MyItem

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = (
            'http://example.com/page1',
            'http://example.com/page2',
        )

        def parse(self, response):
            # <processing code not shown>
            # collect `item_urls`
            for item_url in item_urls:
                yield scrapy.Request(item_url, self.parse_item)

        def parse_item(self, response):
            # <processing code not shown>
            item = MyItem()
            # populate `item` fields
            # and extract item_details_url
            yield scrapy.Request(item_details_url, self.parse_details, cb_kwargs={'item': item})

        def parse_details(self, response, item):
            # populate more `item` fields
            return item

Basically this is a simple spider which parses two pages of items (the start_urls). Items also have a details page with additional information, so we use the cb_kwargs functionality of Request to pass a partially populated item.
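
As a minimal, self-contained sketch of how cb_kwargs behaves (the spider name and URLs below are hypothetical, shown only for illustration): every key of the dict passed as cb_kwargs becomes a keyword argument of the callback, so the key must match the callback's parameter name, and that parameter can carry a default for requests that do not supply it:

    import scrapy

    class CbKwargsSketchSpider(scrapy.Spider):
        # hypothetical spider, only to illustrate the cb_kwargs mechanics
        name = 'cbkwargs_sketch'
        start_urls = ['http://example.com/page1']

        def parse(self, response):
            # each key of the cb_kwargs dict is passed to the callback as a
            # keyword argument, so 'item' must match a parameter of parse_details
            yield scrapy.Request(
                'http://example.com/item-details',
                callback=self.parse_details,
                cb_kwargs={'item': {'url': response.url}},
            )

        def parse_details(self, response, item=None):
            # `item` is the dict passed through cb_kwargs above; the default
            # covers requests that did not supply it
            if item:
                item['details_url'] = response.url
            return item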

Parse Command

The most basic way of checking the output of your spider is to use the parse command. It allows you to check the behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use, but does not allow debugging code inside a method.

In order to see the item scraped from a specific url:

    $ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> STATUS DEPTH LEVEL 2 <<<
    # Scraped Items ------------------------------------------------------------
    [{'url': <item_url>}]

    # Requests -----------------------------------------------------------------
    []

Using the --verbose or -v option we can see the status at each depth level:

    $ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>
    [ ... scrapy log lines crawling example.com spider ... ]

    >>> DEPTH LEVEL: 1 <<<
    # Scraped Items ------------------------------------------------------------
    []

    # Requests -----------------------------------------------------------------
    [<GET item_details_url>]


    >>> DEPTH LEVEL: 2 <<<
    # Scraped Items ------------------------------------------------------------
    [{'url': <item_url>}]

    # Requests -----------------------------------------------------------------
    []

Checking items scraped from a single start_url can also be easily achieved using:

    $ scrapy parse --spider=myspider -d 3 'http://example.com/page1'

Scrapy Shell

While the parse command is very useful for checking the behaviour of a spider, it is of little help to check what happens inside a callback, besides showing the response received and the output. How to debug the situation when parse_details sometimes receives no item?

Fortunately, the shell is your bread and butter in this case (see Invoking the shell from spiders to inspect responses):

    from scrapy.shell import inspect_response

    def parse_details(self, response, item=None):
        if item:
            # populate more `item` fields
            return item
        else:
            inspect_response(response, self)
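
For illustration, once inspect_response() is reached, Scrapy pauses the crawl and opens a shell on that response, so you can examine it interactively; the URL and selector below are hypothetical:

    $ scrapy crawl myspider
    [ ... log lines until inspect_response() is reached in parse_details ... ]
    >>> response.url
    'http://example.com/item-details'
    >>> response.xpath('//h1/text()').get()    # hypothetical selector
    >>> view(response)                         # shell shortcut: open the response in a browser
    >>> # hit Ctrl-D (Ctrl-Z in Windows) to exit the shell and resume the crawl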

See also: Invoking the shell from spiders to inspect responses.

Open in browser

Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser function for that. Here is an example of how you would use it:

    from scrapy.utils.response import open_in_browser

    def parse_details(self, response):
        # response.text is the decoded body; open the page if the expected
        # text is missing, so you can see what was actually received
        if "item name" not in response.text:
            open_in_browser(response)

open_in_browser will open a browser with the response received by Scrapy at that point, adjusting the base tag so that images and styles are displayed properly.
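
open_in_browser is an ordinary function, so you are not limited to calling it from a callback; as a usage sketch, you can also call it from a Scrapy shell session on the response you are currently inspecting:

    $ scrapy shell 'http://example.com/page1'
    [ ... shell starts with `response` already fetched ... ]
    >>> from scrapy.utils.response import open_in_browser
    >>> open_in_browser(response)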

Logging

Logging is another useful option for getting information about your spider run. Although not as convenient, it comes with the advantage that the logs will be available in all future runs should they be necessary again:

    def parse_details(self, response, item=None):
        if item:
            # populate more `item` fields
            return item
        else:
            self.logger.warning('No item received for %s', response.url)

For more information, check the Logging section.
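
For those log lines to actually be available in later runs, they have to be written somewhere persistent. A minimal sketch using the standard LOG_FILE and LOG_LEVEL settings (the file name is only an example):

    # settings.py (excerpt)
    LOG_FILE = 'myspider.log'   # example path; log output goes to this file instead of stderr
    LOG_LEVEL = 'WARNING'       # keep WARNING and above, e.g. the message from parse_details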