Scrapy is one of the most powerful Python frameworks for web scraping, but without proper proxy setup, your scrapers will get blocked within minutes of operation. In this guide, I will show all the ways to integrate proxies into Scrapy: from the simplest setup to advanced IP rotation methods with automatic error handling.
The material is based on real experience scraping large e-commerce platforms and protected sites. You will receive ready-to-use code examples that can be immediately implemented in your projects.
Why Scrapy Gets Blocked Without Proxies
Modern websites use multi-layered protection against scraping. Even if you have set up a User-Agent and delays between requests, your IP address reveals automation through several indicators:
- Request Frequency: one IP making 100+ requests per minute is a clear sign of a bot
- Behavior Patterns: sequentially browsing pages without random transitions
- Lack of JavaScript: Scrapy does not execute JS, which is easily detectable
- Geolocation: access from a data center instead of a home network
The result is an IP ban lasting several hours or days. Marketplaces (Amazon, Wildberries, Ozon), social networks, and sites behind Cloudflare use especially aggressive protection. Proxies solve this problem by distributing requests across multiple IP addresses.
Important: Even with proxies, you need to adhere to rate limits. Recommended speed: 1-3 requests per second per IP. For high-speed scraping, use a pool of 50+ proxies with rotation.
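In Scrapy, these limits map onto the built-in throttling settings. Here is a minimal settings.py sketch; the specific numbers are starting points to tune per target site, not fixed recommendations:

```python
# settings.py -- keep each IP around 1-3 requests per second
DOWNLOAD_DELAY = 0.5                  # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True       # jitter the delay (0.5x-1.5x) to look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap parallelism per site

# AutoThrottle adapts the delay to the server's observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```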
Basic Proxy Setup in Scrapy
The simplest way is to specify the proxy directly in the spider settings. This method is suitable for testing or scraping small amounts of data with a single proxy server.
Method 1: Through Meta in Request
```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def start_requests(self):
        proxy = 'http://username:password@proxy.example.com:8080'
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'proxy': proxy}
            )

    def parse(self, response):
        # Your parsing logic
        self.log(f'Scraped {response.url} via {response.meta["proxy"]}')
```
The proxy format depends on the protocol and authentication method:
- http://proxy.example.com:8080 (without authentication)
- http://user:pass@proxy.example.com:8080 (with username/password)
- socks5://user:pass@proxy.example.com:1080 (SOCKS5; note that Scrapy's built-in HttpProxyMiddleware only speaks HTTP, so SOCKS proxies require a third-party download handler)
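These URLs follow the standard scheme://user:pass@host:port layout, and the components can be inspected with urllib.parse from the standard library (roughly what Scrapy does internally when it extracts the credentials):

```python
from urllib.parse import urlparse

proxy = 'http://username:password@proxy.example.com:8080'
parts = urlparse(proxy)

# Each credential and address component is available as an attribute
scheme = parts.scheme        # 'http'
username = parts.username    # 'username'
password = parts.password    # 'password'
hostname = parts.hostname    # 'proxy.example.com'
port = parts.port            # 8080
```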
Method 2: Global Settings in settings.py
```python
# settings.py
HTTPPROXY_ENABLED = True
HTTPPROXY_AUTH_ENCODING = 'utf-8'

# Note: the built-in HttpProxyMiddleware reads the proxy from the
# standard environment variables, not from settings.py, so export
# them in the shell before launching the spider:
#   export http_proxy='http://username:password@proxy.example.com:8080'
#   export https_proxy='http://username:password@proxy.example.com:8080'
```
This method is convenient for quick tests but unsuitable for production: there is no IP rotation, a single failed proxy halts the entire scraper, and you cannot use different proxies for different sites.
Creating Custom Proxy Middleware
For production scraping, you need your own middleware that will manage the proxy pool, handle errors, and automatically rotate IPs. Here is a basic implementation:
```python
# middlewares.py
import random

from scrapy.exceptions import NotConfigured


class RandomProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        # Load the proxy list from settings
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise NotConfigured('PROXY_LIST not configured')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Choose a random proxy from the pool
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

    def process_exception(self, request, exception, spider):
        # On error, retry the request through another proxy
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        # Bypass the dupefilter so the retried request is not dropped
        request.dont_filter = True
        spider.logger.warning(f'Proxy error, switching to: {proxy}')
        return request
```
Now configure the use of middleware in settings.py:
```python
# settings.py

# List of proxies (can be loaded from a file or API)
PROXY_LIST = [
    'http://user1:pass1@proxy1.example.com:8080',
    'http://user2:pass2@proxy2.example.com:8080',
    'http://user3:pass3@proxy3.example.com:8080',
    # ... add 50+ proxies for effective rotation
]

# Connect the middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# Retry attempts on errors
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```
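The PROXY_LIST comment above mentions loading the list from a file. A hypothetical load_proxy_list() helper could look like this (the function name and the one-URL-per-line file format are my own convention, not part of Scrapy):

```python
import os
import tempfile


def load_proxy_list(path):
    """Read one proxy URL per line; skip blank lines and '#' comments."""
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith('#')
        ]


# Quick demonstration with a temporary file standing in for proxies.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('# my pool\n'
              'http://user1:pass1@proxy1.example.com:8080\n'
              '\n'
              'http://user2:pass2@proxy2.example.com:8080\n')
    path = tmp.name

proxies = load_proxy_list(path)
os.unlink(path)
```

In settings.py you would then write PROXY_LIST = load_proxy_list('proxies.txt').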
Proxy Rotation: Three Working Methods
Randomly selecting a proxy (as in the example above) is the simplest but not the most effective method. Let's consider three rotation strategies for different scenarios.
Method 1: Round-robin (Sequential Rotation)
Proxies are selected in a round-robin manner. Suitable for evenly distributing the load:
```python
class RoundRobinProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.current_index = 0

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Take the next proxy in a round-robin manner
        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        request.meta['proxy'] = proxy
```
Method 2: Smart Rotation with Blacklist
Track problematic proxies and temporarily exclude them from rotation:
```python
import random
import time
from collections import defaultdict


class SmartProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.proxy_errors = defaultdict(int)
        self.blacklist = set()
        self.blacklist_timeout = 300  # 5 minutes
        self.blacklist_time = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def get_working_proxies(self):
        # Remove from the blacklist any proxies whose timeout has expired
        current_time = time.time()
        expired = [
            proxy for proxy, ban_time in self.blacklist_time.items()
            if current_time - ban_time > self.blacklist_timeout
        ]
        for proxy in expired:
            self.blacklist.discard(proxy)
            self.proxy_errors[proxy] = 0
        # Return the working proxies
        return [p for p in self.proxy_list if p not in self.blacklist]

    def process_request(self, request, spider):
        working_proxies = self.get_working_proxies()
        if not working_proxies:
            spider.logger.error('All proxies are blacklisted!')
            return
        proxy = random.choice(working_proxies)
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        # If we get a block, count it against the proxy
        if response.status in [403, 429, 503]:
            proxy = request.meta.get('proxy')
            self.proxy_errors[proxy] += 1
            if self.proxy_errors[proxy] >= 3:
                self.blacklist.add(proxy)
                self.blacklist_time[proxy] = time.time()
                spider.logger.warning(
                    f'Proxy {proxy} blacklisted for {self.blacklist_timeout}s'
                )
        return response
```
Method 3: Rotation via Provider API
Many proxy providers (including residential proxies) offer a rotating endpoint: a single URL that automatically changes the IP with each request:
```python
# settings.py
# Single endpoint with automatic rotation
ROTATING_PROXY = 'http://username:password@rotating.proxy.com:8080'
```

```python
# middlewares.py
class RotatingProxyMiddleware:
    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        proxy = crawler.settings.get('ROTATING_PROXY')
        return cls(proxy)

    def process_request(self, request, spider):
        # One URL, but each request exits through a new IP
        request.meta['proxy'] = self.proxy
```
This is the most convenient method for production: no need to manage a proxy pool, the provider takes care of the quality of IPs and replaces problematic ones. It works especially effectively with residential proxies, where the pool of IPs can reach millions of addresses.
Authentication: Username/Password vs IP Whitelist
Proxy providers offer two authentication methods. The choice affects connection speed and ease of setup.
Username:Password Authentication
The username and password are passed in the proxy URL. Scrapy automatically converts them into the HTTP header Proxy-Authorization:
```python
proxy = 'http://username:password@proxy.example.com:8080'
request.meta['proxy'] = proxy

# Scrapy will automatically add the header:
# Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```
Pros: works from any IP, easy to change proxies in code.
Cons: slight overhead on each request (~50-100ms), credentials in plain text in the code.
IP Whitelist Authentication
You add your server's IP to the provider's whitelist; after that, requests need no credentials at all:
```python
proxy = 'http://proxy.example.com:8080'  # no username/password needed
request.meta['proxy'] = proxy
```
Pros: faster by 50-100ms, safer (no credentials in code).
Cons: works only from specific IPs, need to update the whitelist when changing servers.
Recommendation for Production:
Use IP whitelisting for scraping from dedicated servers (AWS, Google Cloud, Hetzner). For development and testing from a local machine, use username:password authentication.
Error Handling and Automatic IP Switching
Even with quality proxies, there will be errors: timeouts, connection refusals, blocks. Proper error handling is critical for the stable operation of the scraper.
Handling HTTP Statuses
```python
class ProxyMiddleware:
    def process_response(self, request, response, spider):
        # Status codes that call for a proxy switch and a retry
        ban_codes = [403, 407, 429, 503]
        if response.status in ban_codes:
            proxy = request.meta.get('proxy')
            spider.logger.warning(
                f'Got {response.status} from {proxy}, retrying...'
            )
            # Retry with a new proxy; get_new_proxy() is your
            # pool-selection logic (e.g. random.choice over PROXY_LIST)
            request.meta['proxy'] = self.get_new_proxy()
            # Bypass the dupefilter so the retry is not dropped
            request.dont_filter = True
            return request
        return response
```
Handling Network Exceptions
```python
from twisted.internet.error import (
    ConnectionLost,
    ConnectionRefusedError,
    TimeoutError,
)


class ProxyMiddleware:
    def process_exception(self, request, exception, spider):
        # Proxy connection errors
        proxy_errors = (
            TimeoutError,
            ConnectionRefusedError,
            ConnectionLost,
        )
        if isinstance(exception, proxy_errors):
            proxy = request.meta.get('proxy')
            spider.logger.error(
                f'Proxy {proxy} connection failed: {exception}'
            )
            # Switch the proxy and retry the request
            request.meta['proxy'] = self.get_new_proxy()
            request.dont_filter = True
            return request
        # For other errors, fall back to standard handling
        return None
```
Detecting Blocks by Content
Some sites return HTTP 200 but show a captcha or block page:
```python
class ProxyMiddleware:
    def process_response(self, request, response, spider):
        # Phrases that indicate a block page
        ban_indicators = [
            'captcha',
            'access denied',
            'blocked',
            'unusual traffic',
            'robot check',
        ]
        body_text = response.text.lower()
        if any(indicator in body_text for indicator in ban_indicators):
            spider.logger.warning(
                f'Ban page detected from {request.meta.get("proxy")}'
            )
            # Switch the proxy and retry
            request.meta['proxy'] = self.get_new_proxy()
            request.dont_filter = True
            return request
        return response
```
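All three snippets above call self.get_new_proxy() without defining it. One possible implementation, assuming the middleware keeps a PROXY_LIST-style pool as in the earlier examples (the ProxyPoolMixin name and the mark_bad() helper are my own, not Scrapy API):

```python
import random


class ProxyPoolMixin:
    """Hypothetical mixin supplying get_new_proxy() to the middlewares above."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.bad_proxies = set()  # proxies that recently failed

    def mark_bad(self, proxy):
        self.bad_proxies.add(proxy)

    def get_new_proxy(self, exclude=None):
        # Prefer proxies that have not failed recently
        candidates = [
            p for p in self.proxy_list
            if p not in self.bad_proxies and p != exclude
        ]
        if not candidates:
            # Pool exhausted: forgive past failures and start over
            self.bad_proxies.clear()
            candidates = ([p for p in self.proxy_list if p != exclude]
                          or list(self.proxy_list))
        return random.choice(candidates)


# Demonstration: with p1 marked bad and p2 excluded, only p3 remains
pool = ProxyPoolMixin(['http://p1:8080', 'http://p2:8080', 'http://p3:8080'])
pool.mark_bad('http://p1:8080')
chosen = pool.get_new_proxy(exclude='http://p2:8080')
```

A middleware would inherit from the mixin (class ProxyMiddleware(ProxyPoolMixin): ...) and call self.mark_bad(proxy) from process_exception or process_response before requesting a replacement.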
Which Type of Proxy to Choose for Scrapy
The choice of proxy type depends on the target site, budget, and required scraping speed. Here is a comparison of the main options:
| Proxy Type | Speed | Cost | When to Use |
|---|---|---|---|
| Data Center Proxies | High (50-200ms) | Low ($1-3/IP) | Simple sites without protection, APIs, internal tools |
| Residential Proxies | Medium (300-800ms) | Medium ($5-15/GB) | E-commerce, social networks, Cloudflare sites, geo-targeting |
| Mobile Proxies | Low (500-1500ms) | High ($50-150/IP) | Mobile applications, Instagram, TikTok, maximum protection |
Selection Recommendations
For scraping marketplaces (Amazon, Wildberries, Ozon, AliExpress), use residential proxies only. These sites aggressively ban data center IPs; rotation and geo-targeting are needed (e.g., Russian IPs for Wildberries).
For scraping news sites, blogs, and forums, data center proxies will suffice. Protection is minimal; speed and low traffic cost are what matter.
For scraping sites behind Cloudflare, residential proxies are mandatory. Cloudflare detects data centers almost instantly. Add the cloudscraper library to Scrapy to bypass JS challenges.
For scraping Google Search and SEO tools, use residential proxies with geo-targeting. Google shows different results for different countries and cities.
Tip: Start with a pool of 10 residential proxies for testing. If you still receive blocks, increase the pool to 50-100 IPs. For high-speed scraping (1000+ requests/minute), use a rotating endpoint with a pool of 10,000+ IPs.
Advanced Techniques: Sessions and Sticky IP
When scraping some sites, you need to maintain a single IP throughout the session (authorization, shopping cart, multi-step forms). Here's how to implement sticky sessions in Scrapy.
Sticky IP for One Domain
```python
import random
from urllib.parse import urlparse


class StickyProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        # Mapping: domain -> proxy
        self.domain_proxy_map = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Extract the domain from the URL
        domain = urlparse(request.url).netloc
        # If this domain already has a proxy, reuse it
        if domain in self.domain_proxy_map:
            proxy = self.domain_proxy_map[domain]
        else:
            # Otherwise choose a new one and remember it
            proxy = random.choice(self.proxy_list)
            self.domain_proxy_map[domain] = proxy
            spider.logger.info(f'Assigned {proxy} to {domain}')
        request.meta['proxy'] = proxy
```
Sticky IP with Session Timeout
A more advanced option: the proxy is tied to the domain for a certain period (e.g., 10 minutes), then it changes:
```python
import random
import time
from urllib.parse import urlparse


class SessionProxyMiddleware:
    def __init__(self, proxy_list, session_timeout=600):
        self.proxy_list = proxy_list
        self.session_timeout = session_timeout  # 10 minutes
        # Mapping: domain -> (proxy, creation time)
        self.sessions = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        timeout = crawler.settings.getint('PROXY_SESSION_TIMEOUT', 600)
        return cls(proxy_list, timeout)

    def get_proxy_for_domain(self, domain):
        current_time = time.time()
        # Check whether there is an active session
        if domain in self.sessions:
            proxy, created_at = self.sessions[domain]
            # If the session has not expired, keep the same proxy
            if current_time - created_at < self.session_timeout:
                return proxy
        # Otherwise start a new session with a new proxy
        new_proxy = random.choice(self.proxy_list)
        self.sessions[domain] = (new_proxy, current_time)
        return new_proxy

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        proxy = self.get_proxy_for_domain(domain)
        request.meta['proxy'] = proxy
```
Integration with Cookie Middleware
For full sessions, proxies and cookies must stay in sync: cookies obtained through one IP should not be replayed through another. Scrapy's CookiesMiddleware keeps a separate cookie jar for each value of the cookiejar request meta key, so binding the jar to the proxy means a proxy change automatically brings a fresh set of cookies:

```python
# settings.py
# Enable cookie middleware
COOKIES_ENABLED = True
COOKIES_DEBUG = False
```

```python
# middlewares.py
# Give each proxy its own cookie jar: CookiesMiddleware keeps a
# separate jar per 'cookiejar' meta key, so keying the jar by the
# proxy URL means a proxy change brings fresh cookies.
class ProxyCookieMiddleware:
    def process_request(self, request, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            request.meta['cookiejar'] = proxy
```

Register this middleware with a higher priority number than your proxy middleware so its process_request runs after the proxy has been assigned.
Conclusion
Proper proxy setup in Scrapy is the foundation for stable scraping without blocks. We have covered all key aspects: from basic integration to advanced rotation and session management techniques.
Key takeaways:
- For production, use custom middleware with smart rotation that blacklists problematic IPs
- Handle all types of errors: HTTP statuses, network exceptions, content blocks
- Choose the type of proxy for the task: data centers for simple sites, residential for protected ones
- For sites with authorization, use sticky sessions with proxy binding to the domain
- Start with a pool of 10-50 proxies, scale up as load increases
If you plan to scrape protected sites (marketplaces, social networks, sites behind Cloudflare), I recommend residential proxies: they provide maximum anonymity and minimal risk of blocks. For high-speed scraping, choose providers offering a rotating endpoint with a pool of 10,000+ IP addresses.
All code examples from this article have been tested on Scrapy 2.x and are ready for production use. Adapt them to your tasks and scale as your project grows.