How to Filter Duplicate Pages

Scrapy deduplicates pages (i.e. prevents the same page from being crawled twice) through RFPDupeFilter.

RFPDupeFilter actually filters based on request_fingerprint, which is implemented as follows:

    import hashlib
    import weakref

    from scrapy.utils.url import canonicalize_url

    # Cache of fingerprints already computed for a given request object
    _fingerprint_cache = weakref.WeakKeyDictionary()

    def request_fingerprint(request, include_headers=None):
        if include_headers:
            include_headers = tuple([h.lower() for h in sorted(include_headers)])
        cache = _fingerprint_cache.setdefault(request, {})
        if include_headers not in cache:
            # Fingerprint = sha1 over method, canonicalized URL, body,
            # and (optionally) the selected headers.
            # (Python 2-era source; newer Scrapy wraps these values in to_bytes().)
            fp = hashlib.sha1()
            fp.update(request.method)
            fp.update(canonicalize_url(request.url))
            fp.update(request.body or '')
            if include_headers:
                for hdr in include_headers:
                    if hdr in request.headers:
                        fp.update(hdr)
                        for v in request.headers.getlist(hdr):
                            fp.update(v)
            cache[include_headers] = fp.hexdigest()
        return cache[include_headers]

As you can see, the deduplication fingerprint is sha1(method + canonicalized URL + body + headers), where headers are only included if include_headers is passed.

Because the fingerprint covers the whole request, two requests that point to the same page but differ in query parameters or body are treated as distinct, so the proportion of duplicates this actually removes is not very high.
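
To make the consequence concrete, here is a minimal sketch. It uses scrapy.utils.request.request_fingerprint (recent Scrapy releases steer you towards the request fingerprinter component instead), and the example.com URLs are made up for illustration:

    from scrapy import Request
    from scrapy.utils.request import request_fingerprint

    # canonicalize_url sorts query parameters, so these two requests get the
    # same fingerprint and the second one is filtered out as a duplicate.
    r1 = Request('http://example.com/page?a=1&b=2')
    r2 = Request('http://example.com/page?b=2&a=1')
    assert request_fingerprint(r1) == request_fingerprint(r2)

    # A differing query value (e.g. a session id) changes the fingerprint, so the
    # request is crawled again even if the server returns the same page.
    r3 = Request('http://example.com/page?a=1&b=2&session=xyz')
    assert request_fingerprint(r1) != request_fingerprint(r3)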

If we want to compute our own deduplication fingerprint, we need to implement a custom filter and configure Scrapy to use it.

The filter below deduplicates by URL only:

    from scrapy.dupefilter import RFPDupeFilter  # scrapy.dupefilters in newer Scrapy versions

    class SeenURLFilter(RFPDupeFilter):
        """A dupe filter that considers only the request URL."""

        def __init__(self, path=None):
            self.urls_seen = set()
            RFPDupeFilter.__init__(self, path)

        def request_seen(self, request):
            # Returning True tells the scheduler to drop the request as a duplicate.
            if request.url in self.urls_seen:
                return True
            self.urls_seen.add(request.url)
            return False
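
Note also a general Scrapy behaviour, not specific to this filter: requests created with dont_filter=True bypass the dupe filter entirely, so SeenURLFilter never gets asked about them. A hypothetical spider to illustrate:

    import scrapy

    class NoFilterSpider(scrapy.Spider):
        name = 'no_filter_example'  # hypothetical spider, for illustration only
        start_urls = ['http://example.com/']

        def parse(self, response):
            # dont_filter=True skips SeenURLFilter (and any other dupe filter),
            # so re-requesting the same URL is not dropped as a duplicate.
            yield scrapy.Request(response.url, callback=self.parse_again, dont_filter=True)

        def parse_again(self, response):
            self.logger.info('Fetched %s a second time', response.url)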

Don't forget to enable it in your settings:

    DUPEFILTER_CLASS = 'scraper.custom_filters.SeenURLFilter'
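
If you only want the filter for a single spider rather than project-wide, the same setting can also go into that spider's custom_settings (the scraper.custom_filters path below just mirrors the hypothetical module used above):

    import scrapy

    class SeenUrlSpider(scrapy.Spider):
        name = 'seen_url_example'  # hypothetical spider name
        # Per-spider override: only this spider uses SeenURLFilter.
        custom_settings = {
            'DUPEFILTER_CLASS': 'scraper.custom_filters.SeenURLFilter',
        }
        start_urls = ['http://example.com/']

        def parse(self, response):
            pass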