
How to Use Proxies with Scrapy

Scrapy is an enterprise-grade web crawling framework that offers a highly structured and scalable approach to data extraction. Its built-in 'HttpProxyMiddleware' provides a robust foundation for routing requests through the ProxyLabs residential gateway, handling the heavy lifting of proxy assignment and credential injection. Unlike simple HTTP clients, Scrapy is designed to manage complex crawling logic across thousands of concurrent requests, making it the industry standard for large-scale structured data extraction. By integrating with residential proxies, Scrapy can bypass most IP-based rate limits and geo-fencing, allowing you to collect data from localized versions of websites with ease. The framework's modular architecture also allows for the implementation of custom middlewares that can track bandwidth usage, detect proxy-level bans, and dynamically rotate sessions, ensuring that your scraping operations remain stable and cost-effective over long periods of time.

Focus: working config first, then the mistakes that usually cause traffic to bypass the proxy or break under concurrency.

Using Proxies with Scrapy: What to Know

Scrapy's proxy integration is powered by the 'HttpProxyMiddleware', which acts as an intermediary between your spiders and the downloader. When you yield a Request with a 'proxy' key in its meta dictionary, the middleware intercepts it and injects the necessary 'Proxy-Authorization' header before passing it to the Twisted-based downloader. For HTTPS requests, Twisted establishes an HTTP CONNECT tunnel through the ProxyLabs gateway, so the conversation with the target server remains encrypted end to end and unreadable by the proxy itself.

The true power of Scrapy lies in its asynchronous architecture. Built on the Twisted event loop, Scrapy can manage hundreds of concurrent proxy connections with minimal CPU and memory overhead. This makes it significantly more efficient than synchronous libraries like 'requests' for large-scale data extraction. However, this high concurrency must be managed carefully when using residential proxies, as opening too many connections simultaneously can lead to higher failure rates and increased costs if not properly throttled.
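To keep that concurrency under control, a conservative throttling setup in 'settings.py' might look like the sketch below. The numbers are illustrative starting points to tune per target, not official recommendations:

```python
# settings.py -- illustrative starting values, tune per target
CONCURRENT_REQUESTS = 32            # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # avoid hammering one site through one peer
DOWNLOAD_DELAY = 0.5                # base delay between requests to a domain

# Let AutoThrottle adapt the crawl rate to the target's observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
```

Lower CONCURRENT_REQUESTS_PER_DOMAIN is usually the first knob to turn when residential failure rates climb.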

Sticky sessions are essential for crawling flows that require persistence, such as maintaining a login state across multiple pages. In Scrapy, you can implement this by appending a session ID to the proxy username. Because the 'HttpProxyMiddleware' is stateless, it is your responsibility as the developer to pass the same session ID in the 'proxy' meta key of every related Request. This ensures that the ProxyLabs gateway routes all of those requests through the same residential peer, preserving the continuity of your crawling session.

Managing bandwidth in Scrapy is a critical operational task. Because Scrapy is so efficient at downloading data, a poorly configured spider can quickly consume your residential quota. We recommend implementing a custom middleware that tracks the size of every response, using the 'Content-Length' header when present and falling back to the length of the response body. This data can be used to generate live reports on your scraping costs and to automatically stop the spider if it exceeds a predefined budget. Combining this with Scrapy's built-in duplicate URL filtering ensures that you never pay for the same data twice.
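A minimal sketch of such a middleware is below. 'BYTE_BUDGET' is a hypothetical custom setting, not a built-in Scrapy one, and for clarity the sketch only logs a warning when the budget is exhausted; in a real project you would raise scrapy.exceptions.CloseSpider there instead:

```python
class BandwidthBudgetMiddleware:
    """Downloader middleware sketch: count response bytes against a budget.

    BYTE_BUDGET is an illustrative custom setting, not a built-in one.
    """

    def __init__(self, byte_budget):
        self.byte_budget = byte_budget
        self.bytes_used = 0

    @classmethod
    def from_crawler(cls, crawler):
        # Read the (custom) budget setting; default to 1 GiB
        return cls(crawler.settings.getint("BYTE_BUDGET", 2**30))

    def process_response(self, request, response, spider):
        # Prefer the Content-Length header; fall back to the body size
        declared = response.headers.get(b"Content-Length")
        size = int(declared) if declared else len(response.body)
        self.bytes_used += size
        if self.bytes_used > self.byte_budget:
            # In a real project: raise scrapy.exceptions.CloseSpider("budget")
            spider.logger.warning("Bandwidth budget exceeded: %d bytes",
                                  self.bytes_used)
        return response
```

Register it in DOWNLOADER_MIDDLEWARES with a priority after 'HttpProxyMiddleware' so it sees every proxied response.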

Ban detection in Scrapy often requires going beyond simple HTTP status codes. Many modern anti-bot systems will return a 200 OK status code but serve a CAPTCHA or a 'Challenge' page instead of the requested data. To handle this, you should extend your proxy middleware to inspect the response body for common 'soft ban' signals. When a ban is detected, the middleware can drop the current proxy session and return the original Request, forcing Scrapy to retry the task with a fresh residential IP from the gateway.
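One way to sketch that soft-ban check is below. The marker list, retry cap, and session-rotation scheme are all assumptions to adapt to the signals your target actually emits; the proxy username format follows the sticky-session convention used elsewhere in this guide:

```python
import random


class SoftBanRetryMiddleware:
    """Sketch: treat 200 responses containing challenge markers as bans."""

    BAN_MARKERS = (b"captcha", b"verify you are human", b"access denied")
    MAX_BAN_RETRIES = 3

    def process_response(self, request, response, spider):
        if response.status != 200:
            return response  # let the standard retry middleware handle it
        if not any(m in response.body.lower() for m in self.BAN_MARKERS):
            return response  # looks like real content

        retries = request.meta.get("ban_retries", 0)
        if retries >= self.MAX_BAN_RETRIES:
            return response  # give up; let the spider see the ban page

        # New session ID in the proxy username => fresh residential exit IP
        session = f"session-{random.randrange(10**8)}"
        proxy = f"http://your-username-{session}:[email protected]:8080"
        retry = request.replace(dont_filter=True)
        retry.meta.update(proxy=proxy, ban_retries=retries + 1)
        return retry
```

Returning a Request from process_response() is what makes Scrapy reschedule the task instead of passing the ban page to your spider callback.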

Scrapy-Playwright is an increasingly popular combination for scraping modern web applications. When using these tools together, proxy configuration is typically handled at the Playwright launch level in your 'settings.py'. This ensures that the browser instances used for rendering are already configured to route traffic through ProxyLabs. This hybrid approach gives you the power of a full browser for rendering and the efficiency of Scrapy for scheduling and data processing, all while maintaining the anonymity of residential proxies.
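Assuming the scrapy-playwright package is installed, a sketch of that launch-level wiring in 'settings.py' could look like the following; the gateway host and credentials are placeholders:

```python
# settings.py -- requires the scrapy-playwright package
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Route every launched browser through the ProxyLabs gateway
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        "server": "http://gateway.proxylabs.io:8080",
        "username": "your-username",
        "password": "your-password",
    },
}
```

Requests then opt into browser rendering with meta={"playwright": True}, while non-rendered requests keep using the plain downloader.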

Finally, always consider the geographic distribution of your scraping tasks. ProxyLabs allows you to target specific countries and cities, which is invaluable for tasks like SEO monitoring or price comparison across regions. In Scrapy, you can automate this by defining a 'proxy_country' attribute on your spider and using a middleware to dynamically construct the proxy URL for every request. This allows a single spider codebase to operate globally, providing a truly scalable solution for international data collection.
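A sketch of that middleware is below. The 'proxy_country' attribute and the username format are assumptions based on the geo-targeting examples in this guide, not Scrapy built-ins:

```python
class GeoProxyMiddleware:
    """Sketch: build a geo-targeted proxy URL from a spider attribute."""

    def process_request(self, request, spider):
        # A per-request override wins over the spider-level default
        country = request.meta.get("proxy_country",
                                   getattr(spider, "proxy_country", None))
        if country:
            request.meta["proxy"] = (
                f"http://your-username-country-{country}:"
                f"[email protected]:8080"
            )
        return None  # continue normal downloader processing
```

With this in place, the same spider code runs against any region by changing a single attribute, e.g. proxy_country = "DE".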

Installation

pip install scrapy

Working Examples

Rotating Proxy via Middleware
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 1,
}

# In your spider
import scrapy

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"
    start_urls = ["https://httpbin.org/ip"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={
                    "proxy": "http://your-username:[email protected]:8080",
                },
                errback=self.handle_error,
            )

    def parse(self, response):
        self.logger.info(f"IP: {response.text}")
        yield {"ip": response.json().get("origin")}

    def handle_error(self, failure):
        self.logger.error(f"Request failed: {failure.value}")
Sticky Session Spider
import scrapy

class StickySessionSpider(scrapy.Spider):
    name = "sticky_spider"

    def start_requests(self):
        # Same session ID = same exit IP for all requests
        proxy = "http://your-username-session-abc123:[email protected]:8080"

        for page in range(1, 21):
            yield scrapy.Request(
                f"https://example.com/products?page={page}",
                callback=self.parse_page,
                meta={"proxy": proxy},
                errback=self.handle_error,
                dont_filter=True,
            )

    def parse_page(self, response):
        for product in response.css("div.product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }

    def handle_error(self, failure):
        self.logger.error(f"Failed: {failure.request.url} - {failure.value}")
Geo-Targeted Scraping
import scrapy

class GeoSpider(scrapy.Spider):
    name = "geo_spider"

    custom_settings = {
        "CONCURRENT_REQUESTS": 8,
        "DOWNLOAD_DELAY": 1,
        "RETRY_TIMES": 3,
    }

    countries = ["US", "GB", "DE", "FR", "JP"]

    def start_requests(self):
        for country in self.countries:
            proxy = f"http://your-username-country-{country}:[email protected]:8080"
            yield scrapy.Request(
                "https://example.com/pricing",
                callback=self.parse_pricing,
                meta={"proxy": proxy, "country": country},
                errback=self.handle_error,
                dont_filter=True,
            )

    def parse_pricing(self, response):
        country = response.meta["country"]
        yield {
            "country": country,
            "price": response.css(".price::text").get(),
            "currency": response.css(".currency::text").get(),
        }

    def handle_error(self, failure):
        self.logger.error(f"Failed: {failure.value}")

What matters in practice

  • Sophisticated built-in retry middleware that can be configured to automatically handle proxy dropouts and temporary gateway errors.
  • Per-request proxy assignment via the Request meta dictionary, providing granular control over the network identity of every task.
  • Highly configurable concurrency limits that allow you to balance scraping throughput with target sensitivity and proxy stability.
  • Support for an interactive Telnet console that enables live debugging of proxy performance and spider state during long-running crawls.
  • Custom pipeline support for collecting and saving proxy-specific metrics, such as bandwidth consumption per geo-location or session.
  • Native integration with signals, allowing you to programmatically adjust proxy configurations based on the spider's real-time success rate.

Operational Notes

01

Use the 'dont_filter=True' flag in your Requests if you are scraping the same URL multiple times with different proxy countries to check for localized content or pricing.

02

Adjust the CONCURRENT_REQUESTS_PER_DOMAIN setting based on your target's sensitivity. High concurrency through a single residential IP can lead to fast detection and blacklisting.

03

Enable the HTTPCACHE_ENABLED setting during development. This avoids consuming expensive residential bandwidth on repeated test runs of your spider logic.
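For development, a cache setup like the following fragment works well; the ignore list is an illustrative choice so that ban responses are never replayed from cache:

```python
# settings.py -- development only; disable before production runs
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire
HTTPCACHE_DIR = "httpcache"        # stored under the project's .scrapy dir
HTTPCACHE_IGNORE_HTTP_CODES = [403, 429, 503]  # don't cache ban responses
```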

04

Add 403 Forbidden and 429 Too Many Requests status codes to your RETRY_HTTP_CODES. Scrapy will then automatically retry these requests with a fresh residential IP from the ProxyLabs pool.
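In 'settings.py' that looks like the fragment below; note that overriding RETRY_HTTP_CODES replaces the default list, so the codes you still want must be restated:

```python
# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 3
# Overriding replaces the defaults, so restate them and add 403
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
```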

05

Use a custom Downloader Middleware to inject a random User-Agent for every request alongside the proxy IP. This prevents anti-bot systems from flagging your Scrapy-based fingerprint.
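A minimal sketch of such a middleware follows. The User-Agent strings are examples only; in production you would maintain a current list or generate them with a dedicated library:

```python
import random


class RandomUserAgentMiddleware:
    """Sketch: assign a random User-Agent to every outgoing request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # let the request proceed to the downloader
```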

06

Monitor the bandwidth usage in your custom middleware to avoid unexpected costs. Scrapy can download significant amounts of data very quickly when running at high concurrency.

Frequently Asked Questions

Do I need to implement a custom middleware for rotating proxies in Scrapy?

For basic IP rotation, you do not need a custom middleware because the ProxyLabs gateway automatically rotates the exit IP for every new connection. Simply passing the authenticated proxy URL in the 'proxy' key of the Request meta dictionary is sufficient for most use cases. However, if you need more advanced features like geo-targeting based on spider attributes, automatic session management for multi-page flows, or sophisticated ban detection (such as identifying CAPTCHA pages that return a 200 status code), then a custom Downloader Middleware is highly recommended.

How can I avoid getting banned when using Scrapy with residential proxies?

Avoiding bans requires a multi-layered approach beyond just using residential proxies. You must also randomize your User-Agent strings, manage cookies carefully to avoid cross-session contamination, and implement realistic delays between requests. Scrapy's AutoThrottle extension (enabled with 'AUTOTHROTTLE_ENABLED') is particularly useful here, as it automatically adjusts the crawling speed based on the target server's response time. Additionally, ensure that your 'CONCURRENT_REQUESTS' setting is not so high that it overwhelms the residential peer, which could lead to increased latency and potential detection.

Can I use Scrapy with JavaScript rendering services like Splash through a proxy?

Yes, you can use Scrapy with Splash while routing the traffic through residential proxies. You must pass the proxy configuration to Splash through the 'splash:on_request()' hook in your Lua script or by providing the 'proxy' argument in the Splash request parameters. This allows Splash to render the page using a residential IP, which is essential for scraping JavaScript-heavy sites that have strict geographic or anti-bot restrictions. Be aware that rendering pages in Splash consumes more bandwidth, so prioritize blocking unnecessary assets like images and ads.
