If you do a lot of scraping, you have to rotate IP addresses. Come on, nobody wants to be blocked!
For that problem, I find Crawlera to be an awesome tool.
Add it to your Scrapy spider and you're good to go.
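For reference, here's roughly what that looks like with the scrapy-crawlera middleware package. This is a minimal sketch, assuming the standard settings names from its docs; the API key is a placeholder:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'  # placeholder: use your own key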
There is a catch though... you'll be using a new IP on every request.
Normally that's fine, but for complex crawlers that do more than gather data, you sometimes need to convince a system that you are a real user.
Real users generally don't use a new IP each request.
Let's look at a modified version of the first spider from the Scrapy docs.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/page/1/'
        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': 'create'},
            callback=self.parse
        )

    def parse(self, response):
        self.crawlera_session_id = response.headers.get('X-Crawlera-Session', '')
        url = 'http://quotes.toscrape.com/page/2/'
        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': self.crawlera_session_id},
            callback=self.parse2,
        )

    def parse2(self, response):
        url = 'http://quotes.toscrape.com/page/4/'
        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': self.crawlera_session_id},
            callback=self.parse3,
        )

    def parse3(self, response):
        # final callback; stub added here so the example runs as written
        pass
To get a fixed IP address for our crawler, we need to start a Crawlera session. This is done by including the header X-Crawlera-Session with a value of create in the first request.
yield scrapy.Request(
    url=url,
    headers={'X-Crawlera-Session': 'create'},  # start Crawlera session
    callback=self.parse
)
That's it. We have our session and fixed IP; now it's just a matter of making sure we keep it going. The ID for the current Crawlera session must be included in every subsequent request.
In the first callback, parse() in this case, we grab the session ID from the response headers and store it in crawlera_session_id.
def parse(self, response):
    # getting the id from the headers; store it in an instance variable so
    # that we don't have to access the headers of each subsequent Response
    self.crawlera_session_id = response.headers.get('X-Crawlera-Session', '')
    url = 'http://quotes.toscrape.com/page/2/'
    yield scrapy.Request(
        url=url,
        # include the id in this and every subsequent Request
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=self.parse2,
    )
def parse2(self, response):
    url = 'http://quotes.toscrape.com/page/4/'
    yield scrapy.Request(
        url=url,
        # this time conveniently accessed from the instance variable!
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=self.parse3,
    )
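Since every request after the first carries the same header, you can cut down on the repetition with a small helper method. session_request below is my own hypothetical shorthand, not anything from Scrapy or Crawlera:

def session_request(self, url, callback):
    # hypothetical helper: attach the saved Crawlera session id so the
    # follow-up request reuses the same IP
    return scrapy.Request(
        url=url,
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=callback,
    )

Each callback can then just yield self.session_request(url, self.parse2) and stay focused on the actual parsing.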