URL Canonicalizer Spider middleware


/ Published in: Python
Save to your folder(s)



Copy this code and paste it in your HTML
  1. # A spider middleware to canonicalize the urls of all requests generated from a spider.
  2.  
  3. from scrapy.http import Request
  4. from scrapy.utils.url import canonicalize_url
  5.  
  6. class UrlCanonicalizerMiddleware(object):
  7. def process_spider_output(self, response, result, spider):
  8. for r in result:
  9. if isinstance(r, Request):
  10. curl = canonicalize_url(r.url)
  11. if curl != r.url:
  12. r = r.replace(url=curl)
  13. yield r
  14.  
  15. # Snippet imported from snippets.scrapy.org (which no longer works)
  16. # author: pablo
  17. # date : Sep 07, 2010
  18.  

Report this snippet


Comments

RSS Icon Subscribe to comments

You need to login to post a comment.