Scrapy is one of the most powerful Python frameworks for web scraping, but without proper proxy setup, your scrapers will get blocked within minutes of operation. In this guide, I will show all the ways to integrate proxies into Scrapy: from the simplest setup to advanced IP rotation methods with automatic error handling.
The material is based on real experience scraping large e-commerce platforms and protected sites. You will receive ready-to-use code examples that can be immediately implemented in your projects.
Why Scrapy Gets Blocked Without Proxies
Modern websites use multi-layered protection against scraping. Even if you have set up a User-Agent and delays between requests, your IP address reveals automation through several indicators:
- Request Frequency: one IP making 100+ requests per minute is a clear sign of a bot
- Behavior Patterns: sequentially browsing pages without random transitions
- Lack of JavaScript: Scrapy does not execute JS, which is easily detectable
- Geolocation: access from a data center instead of a home network
The result is an IP ban lasting several hours or days. Marketplaces (Amazon, Wildberries, Ozon), social networks, and sites behind Cloudflare use especially aggressive protection. Proxies solve this problem by distributing requests across multiple IP addresses.
Important: Even with proxies, you need to adhere to rate limits. Recommended speed: 1-3 requests per second per IP. For high-speed scraping, use a pool of 50+ proxies with rotation.
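In Scrapy, these limits map onto the built-in throttling settings. Here is a minimal settings.py sketch; the specific numbers are starting points to tune per target site, not fixed recommendations:

```python
# settings.py -- keep each IP around 1-3 requests per second
DOWNLOAD_DELAY = 0.5                  # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True       # jitter the delay (0.5x-1.5x) to look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap parallelism per site

# AutoThrottle adapts the delay to the server's observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```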
Basic Proxy Setup in Scrapy
The simplest way is to specify the proxy directly in the spider settings. This method is suitable for testing or scraping small amounts of data with a single proxy server.
Method 1: Through Meta in Request
```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def start_requests(self):
        proxy = 'http://username:password@proxy.example.com:8080'
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'proxy': proxy}
            )

    def parse(self, response):
        # Your parsing logic
        self.log(f'Scraped {response.url} via {response.meta["proxy"]}')
```
The proxy format depends on the protocol and authentication method:
- http://proxy.example.com:8080 (without authentication)
- http://user:pass@proxy.example.com:8080 (with username/password)
- socks5://user:pass@proxy.example.com:1080 (SOCKS5; note that Scrapy's built-in HttpProxyMiddleware only speaks HTTP, so SOCKS proxies require a third-party download handler)
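These URLs follow the standard scheme://user:pass@host:port layout, and the components can be inspected with urllib.parse from the standard library (roughly what Scrapy does internally when it extracts the credentials):

```python
from urllib.parse import urlparse

proxy = 'http://username:password@proxy.example.com:8080'
parts = urlparse(proxy)

# Each credential and address component is available as an attribute
scheme = parts.scheme        # 'http'
username = parts.username    # 'username'
password = parts.password    # 'password'
hostname = parts.hostname    # 'proxy.example.com'
port = parts.port            # 8080
```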
Method 2: Global Settings in settings.py
```python
# settings.py
HTTPPROXY_ENABLED = True
HTTPPROXY_AUTH_ENCODING = 'utf-8'

# Note: the built-in HttpProxyMiddleware reads the proxy from the
# standard environment variables, not from settings.py, so export
# them in the shell before launching the spider:
#   export http_proxy='http://username:password@proxy.example.com:8080'
#   export https_proxy='http://username:password@proxy.example.com:8080'
```
This method is convenient for quick tests but unsuitable for production: there is no IP rotation, a single failed proxy halts the entire scraper, and you cannot use different proxies for different sites.
Creating Custom Proxy Middleware
For production scraping, you need your own middleware that will manage the proxy pool, handle errors, and automatically rotate IPs. Here is a basic implementation:
```python
# middlewares.py
import random

from scrapy.exceptions import NotConfigured


class RandomProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        # Load the proxy list from settings
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        if not proxy_list:
            raise NotConfigured('PROXY_LIST not configured')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Choose a random proxy from the pool
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

    def process_exception(self, request, exception, spider):
        # On error, retry the request through another proxy
        proxy = random.choice(self.proxy_list)
        request.meta['proxy'] = proxy
        # Bypass the dupefilter so the retried request is not dropped
        request.dont_filter = True
        spider.logger.warning(f'Proxy error, switching to: {proxy}')
        return request
```
Now configure the use of middleware in settings.py:
```python
# settings.py

# List of proxies (can be loaded from a file or API)
PROXY_LIST = [
    'http://user1:pass1@proxy1.example.com:8080',
    'http://user2:pass2@proxy2.example.com:8080',
    'http://user3:pass3@proxy3.example.com:8080',
    # ... add 50+ proxies for effective rotation
]

# Connect the middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

# Retry attempts on errors
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```
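The PROXY_LIST comment above mentions loading the list from a file. A hypothetical load_proxy_list() helper could look like this (the function name and the one-URL-per-line file format are my own convention, not part of Scrapy):

```python
import os
import tempfile


def load_proxy_list(path):
    """Read one proxy URL per line; skip blank lines and '#' comments."""
    with open(path) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith('#')
        ]


# Quick demonstration with a temporary file standing in for proxies.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('# my pool\n'
              'http://user1:pass1@proxy1.example.com:8080\n'
              '\n'
              'http://user2:pass2@proxy2.example.com:8080\n')
    path = tmp.name

proxies = load_proxy_list(path)
os.unlink(path)
```

In settings.py you would then write PROXY_LIST = load_proxy_list('proxies.txt').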
Proxy Rotation: Three Working Methods
Randomly selecting a proxy (as in the example above) is the simplest but not the most effective method. Let's consider three rotation strategies for different scenarios.
Method 1: Round-robin (Sequential Rotation)
Proxies are selected in a round-robin manner. Suitable for evenly distributing the load:
```python
class RoundRobinProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.current_index = 0

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Take the next proxy in a round-robin manner
        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        request.meta['proxy'] = proxy
```
Method 2: Smart Rotation with Blacklist
Track problematic proxies and temporarily exclude them from rotation:
```python
import random
import time
from collections import defaultdict


class SmartProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.proxy_errors = defaultdict(int)
        self.blacklist = set()
        self.blacklist_timeout = 300  # 5 minutes
        self.blacklist_time = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def get_working_proxies(self):
        # Remove from the blacklist any proxies whose timeout has expired
        current_time = time.time()
        expired = [
            proxy for proxy, ban_time in self.blacklist_time.items()
            if current_time - ban_time > self.blacklist_timeout
        ]
        for proxy in expired:
            self.blacklist.discard(proxy)
            self.proxy_errors[proxy] = 0
        # Return the working proxies
        return [p for p in self.proxy_list if p not in self.blacklist]

    def process_request(self, request, spider):
        working_proxies = self.get_working_proxies()
        if not working_proxies:
            spider.logger.error('All proxies are blacklisted!')
            return
        proxy = random.choice(working_proxies)
        request.meta['proxy'] = proxy

    def process_response(self, request, response, spider):
        # If we get a block, count it against the proxy
        if response.status in [403, 429, 503]:
            proxy = request.meta.get('proxy')
            self.proxy_errors[proxy] += 1
            if self.proxy_errors[proxy] >= 3:
                self.blacklist.add(proxy)
                self.blacklist_time[proxy] = time.time()
                spider.logger.warning(
                    f'Proxy {proxy} blacklisted for {self.blacklist_timeout}s'
                )
        return response
```
Method 3: Rotation via Provider API
Many proxy providers (including residential proxies) offer a rotating endpoint: a single URL that automatically changes the IP with each request:
```python
# settings.py
# Single endpoint with automatic rotation
ROTATING_PROXY = 'http://username:password@rotating.proxy.com:8080'
```

```python
# middlewares.py
class RotatingProxyMiddleware:
    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        proxy = crawler.settings.get('ROTATING_PROXY')
        return cls(proxy)

    def process_request(self, request, spider):
        # One URL, but each request exits through a new IP
        request.meta['proxy'] = self.proxy
```
This is the most convenient method for production: no need to manage a proxy pool, the provider takes care of the quality of IPs and replaces problematic ones. It works especially effectively with residential proxies, where the pool of IPs can reach millions of addresses.
Authentication: Username/Password vs IP Whitelist
Proxy providers offer two authentication methods. The choice affects connection speed and ease of setup.
Username:Password Authentication
The username and password are passed in the proxy URL. Scrapy automatically converts them into the HTTP header Proxy-Authorization:
```python
proxy = 'http://username:password@proxy.example.com:8080'
request.meta['proxy'] = proxy

# Scrapy will automatically add the header:
# Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```
Pros: works from any IP, easy to change proxies in code.
Cons: slight overhead on each request (~50-100ms), credentials in plain text in the code.
IP Whitelist Authentication
You add your server's IP to the provider's whitelist; after that, requests need no credentials at all:
```python
proxy = 'http://proxy.example.com:8080'  # no username/password needed
request.meta['proxy'] = proxy
```
Pros: faster by 50-100ms, safer (no credentials in code).
Cons: works only from specific IPs, need to update the whitelist when changing servers.
Recommendation for Production:
Use IP whitelisting for scraping from dedicated servers (AWS, Google Cloud, Hetzner). For development and testing from a local machine, use username:password authentication.
Error Handling and Automatic IP Switching
Even with quality proxies, there will be errors: timeouts, connection refusals, blocks. Proper error handling is critical for the stable operation of the scraper.
Handling HTTP Statuses
```python
class ProxyMiddleware:
    def process_response(self, request, response, spider):
        # Status codes that call for a proxy switch and a retry
        ban_codes = [403, 407, 429, 503]
        if response.status in ban_codes:
            proxy = request.meta.get('proxy')
            spider.logger.warning(
                f'Got {response.status} from {proxy}, retrying...'
            )
            # Retry with a new proxy; get_new_proxy() is your
            # pool-selection logic (e.g. random.choice over PROXY_LIST)
            request.meta['proxy'] = self.get_new_proxy()
            # Bypass the dupefilter so the retry is not dropped
            request.dont_filter = True
            return request
        return response
```
Handling Network Exceptions
```python
from twisted.internet.error import (
    ConnectionLost,
    ConnectionRefusedError,
    TimeoutError,
)


class ProxyMiddleware:
    def process_exception(self, request, exception, spider):
        # Proxy connection errors
        proxy_errors = (
            TimeoutError,
            ConnectionRefusedError,
            ConnectionLost,
        )
        if isinstance(exception, proxy_errors):
            proxy = request.meta.get('proxy')
            spider.logger.error(
                f'Proxy {proxy} connection failed: {exception}'
            )
            # Switch the proxy and retry the request
            request.meta['proxy'] = self.get_new_proxy()
            request.dont_filter = True
            return request
        # For other errors, fall back to standard handling
        return None
```
Detecting Blocks by Content
Some sites return HTTP 200 but show a captcha or block page:
```python
class ProxyMiddleware:
    def process_response(self, request, response, spider):
        # Phrases that indicate a block page
        ban_indicators = [
            'captcha',
            'access denied',
            'blocked',
            'unusual traffic',
            'robot check',
        ]
        body_text = response.text.lower()
        if any(indicator in body_text for indicator in ban_indicators):
            spider.logger.warning(
                f'Ban page detected from {request.meta.get("proxy")}'
            )
            # Switch the proxy and retry
            request.meta['proxy'] = self.get_new_proxy()
            request.dont_filter = True
            return request
        return response
```
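All three snippets above call self.get_new_proxy() without defining it. One possible implementation, assuming the middleware keeps a PROXY_LIST-style pool as in the earlier examples (the ProxyPoolMixin name and the mark_bad() helper are my own, not Scrapy API):

```python
import random


class ProxyPoolMixin:
    """Hypothetical mixin supplying get_new_proxy() to the middlewares above."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        self.bad_proxies = set()  # proxies that recently failed

    def mark_bad(self, proxy):
        self.bad_proxies.add(proxy)

    def get_new_proxy(self, exclude=None):
        # Prefer proxies that have not failed recently
        candidates = [
            p for p in self.proxy_list
            if p not in self.bad_proxies and p != exclude
        ]
        if not candidates:
            # Pool exhausted: forgive past failures and start over
            self.bad_proxies.clear()
            candidates = ([p for p in self.proxy_list if p != exclude]
                          or list(self.proxy_list))
        return random.choice(candidates)


# Demonstration: with p1 marked bad and p2 excluded, only p3 remains
pool = ProxyPoolMixin(['http://p1:8080', 'http://p2:8080', 'http://p3:8080'])
pool.mark_bad('http://p1:8080')
chosen = pool.get_new_proxy(exclude='http://p2:8080')
```

A middleware would inherit from the mixin (class ProxyMiddleware(ProxyPoolMixin): ...) and call self.mark_bad(proxy) from process_exception or process_response before requesting a replacement.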
Which Type of Proxy to Choose for Scrapy
The choice of proxy type depends on the target site, budget, and required scraping speed. Here is a comparison of the main options:
| Proxy Type | Speed | Cost | When to Use |
|---|---|---|---|
| Data Center Proxies | High (50-200ms) | Low ($1-3/IP) | Simple sites without protection, APIs, internal tools |
| Residential Proxies | Medium (300-800ms) | Medium ($5-15/GB) | E-commerce, social networks, Cloudflare sites, geo-targeting |
| Mobile Proxies | Low (500-1500ms) | High ($50-150/IP) | Mobile applications, Instagram, TikTok, maximum protection |
Selection Recommendations
For scraping marketplaces (Amazon, Wildberries, Ozon, AliExpress), use residential proxies only. These sites aggressively ban data center IPs; rotation and geo-targeting are needed (e.g., Russian IPs for Wildberries).
For scraping news sites, blogs, and forums, data center proxies will suffice. Protection is minimal; speed and low traffic cost are what matter.
For scraping sites behind Cloudflare, residential proxies are mandatory. Cloudflare detects data centers almost instantly. Add the cloudscraper library to Scrapy to bypass JS challenges.
For scraping Google Search and SEO tools, use residential proxies with geo-targeting. Google shows different results for different countries and cities.
Tip: Start with a pool of 10 residential proxies for testing. If you still receive blocks, increase the pool to 50-100 IPs. For high-speed scraping (1000+ requests/minute), use a rotating endpoint with a pool of 10,000+ IPs.
Advanced Techniques: Sessions and Sticky IP
When scraping some sites, you need to maintain a single IP throughout the session (authorization, shopping cart, multi-step forms). Here's how to implement sticky sessions in Scrapy.
Sticky IP for One Domain
```python
import random
from urllib.parse import urlparse


class StickyProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxy_list = proxy_list
        # Mapping: domain -> proxy
        self.domain_proxy_map = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        return cls(proxy_list)

    def process_request(self, request, spider):
        # Extract the domain from the URL
        domain = urlparse(request.url).netloc
        # If this domain already has a proxy, reuse it
        if domain in self.domain_proxy_map:
            proxy = self.domain_proxy_map[domain]
        else:
            # Otherwise choose a new one and remember it
            proxy = random.choice(self.proxy_list)
            self.domain_proxy_map[domain] = proxy
            spider.logger.info(f'Assigned {proxy} to {domain}')
        request.meta['proxy'] = proxy
```
Sticky IP with Session Timeout
A more advanced option: the proxy is tied to the domain for a certain period (e.g., 10 minutes), then it changes:
```python
import random
import time
from urllib.parse import urlparse


class SessionProxyMiddleware:
    def __init__(self, proxy_list, session_timeout=600):
        self.proxy_list = proxy_list
        self.session_timeout = session_timeout  # 10 minutes
        # Mapping: domain -> (proxy, creation time)
        self.sessions = {}

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist('PROXY_LIST')
        timeout = crawler.settings.getint('PROXY_SESSION_TIMEOUT', 600)
        return cls(proxy_list, timeout)

    def get_proxy_for_domain(self, domain):
        current_time = time.time()
        # Check whether there is an active session
        if domain in self.sessions:
            proxy, created_at = self.sessions[domain]
            # If the session has not expired, keep the same proxy
            if current_time - created_at < self.session_timeout:
                return proxy
        # Otherwise start a new session with a new proxy
        new_proxy = random.choice(self.proxy_list)
        self.sessions[domain] = (new_proxy, current_time)
        return new_proxy

    def process_request(self, request, spider):
        domain = urlparse(request.url).netloc
        proxy = self.get_proxy_for_domain(domain)
        request.meta['proxy'] = proxy
```
Integration with Cookie Middleware
For full sessions, proxies and cookies must stay in sync: cookies obtained through one IP should not be replayed through another. Scrapy's CookiesMiddleware keeps a separate cookie jar for each value of the cookiejar request meta key, so binding the jar to the proxy means a proxy change automatically brings a fresh set of cookies:

```python
# settings.py
# Enable cookie middleware
COOKIES_ENABLED = True
COOKIES_DEBUG = False
```

```python
# middlewares.py
# Give each proxy its own cookie jar: CookiesMiddleware keeps a
# separate jar per 'cookiejar' meta key, so keying the jar by the
# proxy URL means a proxy change brings fresh cookies.
class ProxyCookieMiddleware:
    def process_request(self, request, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            request.meta['cookiejar'] = proxy
```

Register this middleware with a higher priority number than your proxy middleware so its process_request runs after the proxy has been assigned.
Conclusion
Proper proxy setup in Scrapy is the foundation for stable scraping without blocks. We have covered all key aspects: from basic integration to advanced rotation and session management techniques.
Key takeaways:
- For production, use custom middleware with smart rotation that blacklists problematic IPs
- Handle all types of errors: HTTP statuses, network exceptions, content blocks
- Choose the type of proxy for the task: data centers for simple sites, residential for protected ones
- For sites with authorization, use sticky sessions with proxy binding to the domain
- Start with a pool of 10-50 proxies, scale up as load increases
If you plan to scrape protected sites (marketplaces, social networks, sites behind Cloudflare), I recommend residential proxies: they provide maximum anonymity and minimal risk of blocks. For high-speed scraping, choose providers offering a rotating endpoint with a pool of 10,000+ IP addresses.
All code examples from this article have been tested on Scrapy 2.x and are ready for production use. Adapt them to your tasks and scale as your project grows.