Posted by scrapy on 09/01/12


DropElderMiddleware


Published in: Python
# This middleware drops items older than a given date and time (21:17:27 on 7 November 2010).
# It also keeps track of the newest item it sees. If a page drops any items, none of that
# page's requests/URLs are yielded, and scraping stops.
#
# This needs to be a middleware in order to end the scraping on a per-URL basis, since a
# pipeline sees all items without distinguishing which page they came from.
#
# It could possibly be broken into a more general middleware that yields URLs conditionally,
# with the conditions based on the items yielded by the scraped page.
#
# Maybe the middleware could work with existing pipelines, merely noting what happened to
# the items belonging to a particular page as they passed through the pipeline, although
# this might warrant changes to the infrastructure.

# Import paths are assumed from Scrapy releases of that period; later releases
# dropped scrapy.log and scrapy.xlib.pydispatch, so paths may differ by version.
from datetime import datetime

from scrapy import log, signals
from scrapy.http import Request
from scrapy.xlib.pydispatch import dispatcher


class DropElderMiddleware(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        # Restore lastcheck (hard-coded placeholder).
        self.lastcheck = datetime(2010, 11, 7, 21, 17, 27)
        self.newest = self.lastcheck

    def process_spider_output(self, response, result, spider):
        # Hold back every request until all items on the page have been checked.
        reqs = []
        nold = 0
        nnew = 0
        for i in result:
            if isinstance(i, Request):
                reqs.append(i)
            elif i['date'] >= self.lastcheck:
                if i['date'] >= self.newest:
                    self.newest = i['date']
                nnew += 1
                yield i
            else:
                nold += 1
        log.msg("Scraped url: %s" % (response.url,), level=log.INFO)
        log.msg("%i items scraped (%i dropped)" % (nold + nnew, nold), level=log.INFO)
        if not nold:
            # Nothing was dropped, so keep following this page's requests.
            for req in reqs:
                yield req

    def spider_closed(self, spider):
        # Save lastcheck (placeholder: the value is only printed).
        print(self.newest)

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: Chris2048
# date : Dec 03, 2010
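The per-page rule described in the header comments can be tried without Scrapy installed. The following is a minimal sketch, not part of the original snippet: `FakeRequest` and `filter_page` are hypothetical names introduced only for illustration, and the function collects results into lists instead of yielding them lazily as the middleware does.

```python
from datetime import datetime


class FakeRequest(object):
    """Hypothetical stand-in for scrapy.http.Request so the sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url


def filter_page(results, lastcheck):
    """Return (fresh_items, follow_ups, newest) for one page's output.

    Items dated before `lastcheck` are dropped. If any item was dropped,
    the page's follow-up requests are discarded as well.
    """
    fresh, requests, dropped = [], [], 0
    newest = lastcheck
    for entry in results:
        if isinstance(entry, FakeRequest):
            requests.append(entry)
        elif entry['date'] >= lastcheck:
            newest = max(newest, entry['date'])
            fresh.append(entry)
        else:
            dropped += 1
    follow_ups = [] if dropped else requests
    return fresh, follow_ups, newest


lastcheck = datetime(2010, 11, 7, 21, 17, 27)
page = [
    {'date': datetime(2010, 11, 8, 9, 0, 0)},   # newer than lastcheck: kept
    {'date': datetime(2010, 11, 1, 9, 0, 0)},   # older than lastcheck: dropped
    FakeRequest('http://example.com/page/2'),   # discarded, because an item was dropped
]
fresh, follow_ups, newest = filter_page(page, lastcheck)
print(len(fresh), len(follow_ups), newest)
```

Because one item on the sample page predates `lastcheck`, the follow-up request is withheld. A page containing only newer items would return its requests unchanged.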
