Revision: 59337
Initial Code
Initial URL
Initial Description
Initial Title
Initial Tags
Initial Language
at September 1, 2012 07:15 by scrapy
Initial Code
# A spider middleware to canonicalize the urls of all requests generated from a spider. from scrapy.http import Request from scrapy.utils.url import canonicalize_url class UrlCanonicalizerMiddleware(object): def process_spider_output(self, response, result, spider): for r in result: if isinstance(r, Request): curl = canonicalize_url(r.url) if curl != r.url: r = r.replace(url=curl) yield r # Snippet imported from snippets.scrapy.org (which no longer works) # author: pablo # date : Sep 07, 2010
Initial URL
Initial Description
Initial Title
URL Canonicalizer Spider middleware
Initial Tags
url
Initial Language
Python