Fixed IPs with Crawlera and Scrapy


If you do a lot of scraping, you have to rotate IP addresses. Come on, nobody wants to be blocked!

For that problem, I find Crawlera to be an awesome tool.

Add it to your Scrapy spider and you're good to go.
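
If you haven't wired it up yet, something like this in settings.py should do it. This assumes the scrapy-crawlera middleware package, and the API key is a placeholder:

# settings.py
# minimal scrapy-crawlera setup; swap in your own API key
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}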

There is a catch though... you'll be using a new IP on every request.

Normally that is fine, but for complex crawlers that do more than gather data, you sometimes need to convince a system that you are a real user.

Real users generally don't use a new IP each request.

Let's look at a modified example of the first spider from the Scrapy docs.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/page/1/'

        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': 'create'},
            callback=self.parse
        )

    def parse(self, response):
        self.crawlera_session_id = response.headers.get('X-Crawlera-Session', '')
        url = 'http://quotes.toscrape.com/page/2/'
        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': self.crawlera_session_id},
            callback=self.parse2,
        )

    def parse2(self, response):
        url = 'http://quotes.toscrape.com/page/4/'
        yield scrapy.Request(
            url=url,
            headers={'X-Crawlera-Session': self.crawlera_session_id},
            callback=self.parse3,
        )

    def parse3(self, response):
        # last stop; all three requests went out on the same IP
        self.log(response.url)

What we need to do for a fixed IP address on our crawler is start a Crawlera session. This is done by including the X-Crawlera-Session header with a value of create.

yield scrapy.Request(
    url=url,
    headers={'X-Crawlera-Session': 'create'},  # start Crawlera session
    callback=self.parse
)

That's it. We have our session and fixed IP; now it's just a matter of making sure we keep it going. The id for the current Crawlera session must be included in every subsequent request.

In the first callback, parse() in this case, we can grab the session id from the X-Crawlera-Session response header and stash it on the spider as crawlera_session_id.

def parse(self, response):
    # getting the id from the headers, store it in an instance variable so that
    # we don't have to access the headers of each subsequent Response
    self.crawlera_session_id = response.headers.get('X-Crawlera-Session', '')

    url = 'http://quotes.toscrape.com/page/2/'

    yield scrapy.Request(
        url=url,
        # include id in this and every subsequent Request
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=self.parse2,
    )

def parse2(self, response):
    url = 'http://quotes.toscrape.com/page/4/'

    yield scrapy.Request(
        url=url,
        # this time conveniently accessed from the instance variable!
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=self.parse3,
    )
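
Since the session id has to ride along with every request after the first, you could cut the repetition with a small helper on the spider. This is just a sketch of my own, not part of the example above; session_request is a name I made up:

def session_request(self, url, callback):
    # hypothetical convenience wrapper: builds a Request that carries
    # the current Crawlera session id, so callbacks don't have to
    # repeat the headers dict every time
    return scrapy.Request(
        url=url,
        headers={'X-Crawlera-Session': self.crawlera_session_id},
        callback=callback,
    )

With that in place, parse2() shrinks to a one-liner: yield self.session_request('http://quotes.toscrape.com/page/4/', self.parse3).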