Rate limiting is one of the most common reasons why scrapers fail, API integrations break, and automated scripts receive a 429 Too Many Requests status. The server sees too many requests from a single IP and simply stops responding. In this article, we will discuss how to build a proxy infrastructure that works around request limits without bans and failures, with code examples in Python and Node.js.
What is rate limiting and why regular delays don't help
Rate limiting is a server protection mechanism that limits the number of requests from a single source over a specified period of time. The source is most often an IP address, but advanced systems also take into account authorization tokens, User-Agent, cookies, and even behavioral patterns.
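To make the mechanism concrete, here is a minimal sketch of how a server might count requests per source. The class name `SlidingWindowLimiter`, the limit of 3, and the sample IP addresses are assumptions for illustration only; real systems differ in their algorithms and in what they use as the key.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical server-side limiter: at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key (e.g. client IP) -> timestamps of recent requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # a real server would answer 429 here
        recent.append(now)
        return True


limiter = SlidingWindowLimiter(limit=3, window=60)
# Four requests from one IP at the same moment: the fourth is rejected
results_one_ip = [limiter.allow("203.0.113.5", now=0.0) for _ in range(4)]
# A different IP has its own, untouched counter
result_other_ip = limiter.allow("203.0.113.6", now=0.0)
print(results_one_ip, result_other_ip)
```

The counter is kept per key, so a second IP starts from zero even while the first one is exhausted.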
When your script exceeds the limit, the server returns one of the following responses:
- 429 Too Many Requests: standard HTTP status for rate limiting
- 503 Service Unavailable: sometimes used instead of 429
- 403 Forbidden: if the IP is already blacklisted
- Empty response or timeout: during aggressive blocking
The first thought of most developers is to add time.sleep(1) between requests. This only works with very soft limits (for example, 60 requests per minute). But real scenarios are more complex:
Real limits of popular platforms:
- Twitter/X API (free): 500,000 tweets per month, but no more than 15 requests every 15 minutes
- Google Search: ~100 requests per day from one IP without authorization
- Wildberries, Ozon: aggressive rate limiting, with blocks after 30-50 requests per minute
- GitHub API: 60 requests/hour without a token, 5000/hour with a token
- Cloudflare-protected sites: can block after just 10-20 requests per minute
If you need to collect 100,000 product cards from a marketplace or monitor prices in real time, delays alone won't help. A different architecture is needed, and this is where proxies become a necessity rather than an option.
It is important to understand that rate limiting is most often tied to the IP address. If you have 100 different IPs, you effectively have 100 independent "quotas." This is the key principle of bypassing limits through proxies.
How proxies solve the request limit problem
The mechanism is simple: each request to the target server goes out from a different IP address. From the server's perspective, these are different users. Each IP's quota is barely touched, so blocking does not occur.
Let's consider the difference between working without proxies and with a pool of proxies using a specific example. Suppose the server allows 10 requests per minute from one IP:
| Scenario | Requests per minute | Blocking | Time for 10,000 requests |
|---|---|---|---|
| One IP, no proxy | 10 | Yes, once the limit is exceeded | ~16.7 hours |
| 10 proxies, rotation | 100 | No | ~1.7 hours |
| 100 proxies, rotation | 1000 | No | ~10 minutes |
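The figures in the table follow from simple arithmetic, which can be checked with a short sketch. The function name `minutes_to_finish` is an assumption, and the calculation is idealized: it presumes every IP runs exactly at the per-IP limit and that no request fails.

```python
def minutes_to_finish(total_requests, proxies, limit_per_ip_per_min):
    """Minutes needed if every IP runs exactly at the per-IP limit (idealized)."""
    throughput = proxies * limit_per_ip_per_min  # requests per minute across the pool
    return total_requests / throughput


for n in (1, 10, 100):
    minutes = minutes_to_finish(10_000, n, 10)
    print(f"{n:>3} IPs: {minutes:.0f} min ({minutes / 60:.1f} h)")
```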
In addition to scaling throughput, proxies provide several other advantages when working with rate limiting:
- Session isolation: if one IP gets banned, the others continue to work
- Geographic distribution: requests come from different regions, reducing suspicion
- Sticky sessions: the ability to "stick" to one IP for multi-step scenarios (authorization + action)
- Load control: you can accurately dose requests to each IP without exceeding the limit
What type of proxy to choose for your task
Not all proxies are equally effective against rate limiting. The choice of type depends on the target site, the volume of requests, and the budget. Let's discuss three main types:
Residential Proxies
These are IP addresses of real home users. They look like regular internet traffic and are rarely subject to blocking. Residential proxies are the optimal choice for sites with aggressive protection: marketplaces (Wildberries, Ozon), social networks, Cloudflare-protected resources. The main downside is the higher price compared to data center proxies.
Mobile Proxies
IP addresses from mobile operators (3G/4G/5G). Their distinguishing feature is that one IP can be used by thousands of real subscribers simultaneously, so sites are very reluctant to block such addresses. Mobile proxies show the best results where residential proxies are already starting to get blocked, for example, during high-frequency scraping of Instagram or when working with APIs of platforms that analyze connection types.
Data Center Proxies
Fast and cheap IPs from server data centers. They are ideal for scraping sites without serious protection: open APIs, news aggregators, public databases. For tasks with rate limiting, you need more of them (as they are more likely to end up in blacklists), but with proper rotation, they handle large volumes of requests well.
| Proxy Type | Anonymity | Speed | Price | Best Scenario |
|---|---|---|---|---|
| Residential | Very High | Average | $$ | Marketplaces, social networks, Cloudflare |
| Mobile | Maximum | Average | $$$ | Instagram API, high-frequency scraping |
| Data Centers | Average | High | $ | Open APIs, public data |
IP Rotation Strategies: Per-Request, Sticky Sessions, Round-Robin
The mere presence of proxies does not solve the problem; it is important to manage them correctly. There are several rotation strategies, each suited to its own scenarios.
Per-Request Rotation (New IP for Each Request)
Each HTTP request goes through a new IP address. This is the most aggressive strategy for bypassing rate limiting: the server never gets the chance to accumulate a counter for one IP. Suitable for:
- Scraping product cards (each card is a separate request)
- Gathering data from search engines
- Any stateless requests that do not require a session
Sticky Sessions (Fixed IP for a Session)
One IP is used throughout the session (usually 1-30 minutes). This is critically important for scenarios where authorization is needed: logging into an account, performing an action, logging out. If the IP changes between steps, the server may block the session as suspicious.
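One way to implement sticky sessions on the client side is to pin each logical session to a proxy for a fixed time. The sketch below is a minimal illustration under assumptions: the class name `StickySessions`, the session keys, and the proxy hostnames are hypothetical, and many proxy providers offer their own sticky-session mechanism that would replace this logic.

```python
import time


class StickySessions:
    """Hypothetical helper: pins each session key to one proxy for `ttl` seconds."""

    def __init__(self, proxies, ttl=600.0):
        self.proxies = list(proxies)
        self.ttl = ttl
        self.assigned = {}  # session key -> (proxy, expiry time)
        self.next_index = 0

    def proxy_for(self, session_key, now=None):
        now = time.monotonic() if now is None else now
        entry = self.assigned.get(session_key)
        if entry and now < entry[1]:
            return entry[0]  # still pinned to the same proxy
        # Not pinned yet, or the pin expired: take the next proxy in round-robin order
        proxy = self.proxies[self.next_index % len(self.proxies)]
        self.next_index += 1
        self.assigned[session_key] = (proxy, now + self.ttl)
        return proxy


sticky = StickySessions(
    ["http://proxy-a.example:8080", "http://proxy-b.example:8080"],  # placeholder hosts
    ttl=600,
)
first = sticky.proxy_for("account-42", now=0)
again = sticky.proxy_for("account-42", now=300)   # same session, within the TTL
other = sticky.proxy_for("account-43", now=300)   # a different session
print(first == again, other != first)
```

All steps of one scenario (login, action, logout) would call `proxy_for` with the same key, so they leave from the same IP.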
Round-Robin with Request Limits per IP
The most precise strategy. You know the server limit (for example, 10 requests per minute) and distribute requests across the proxy pool so that each IP never exceeds this threshold. This requires implementing a queue considering the time of the last request for each IP.
Formula for calculating the required number of proxies:
N proxies = (Target request speed/min) ÷ (Server limit/min per IP)
Example: you need 500 requests/min and the server limit is 10/min, so you need at least 50 proxies.
Add 20% reserve in case of blocks: total 60 proxies.
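The same calculation as a small helper, in case you want to keep it next to your configuration. The function name `required_proxies` and the `reserve_percent` parameter are assumptions; the 20% default mirrors the reserve suggested above.

```python
import math


def required_proxies(target_rpm, limit_per_ip_rpm, reserve_percent=20):
    """Return (minimum proxies, proxies including the reserve)."""
    base = math.ceil(target_rpm / limit_per_ip_rpm)
    with_reserve = math.ceil(base * (100 + reserve_percent) / 100)
    return base, with_reserve


base, with_reserve = required_proxies(500, 10)
print(f"Minimum: {base}, with reserve: {with_reserve}")
```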
Python Code Examples: Requests, Aiohttp, Scrapy
Let's move on to practice. Below are ready-made templates for the three most popular Python tools.
1. Requests + Manual Proxy Rotation
The simplest option is a list of proxies and a random selection for each request:
import random
import time

import requests

# Placeholder proxy URLs in the form http://user:pass@host:port
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
    # ... add as many as needed
]

def get_random_proxy():
    proxy = random.choice(PROXIES)
    return {"http": proxy, "https": proxy}

def fetch_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = get_random_proxy()  # a new random proxy on every attempt
        try:
            response = requests.get(
                url,
                proxies=proxy,
                timeout=10,
                headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"},
            )
            if response.status_code == 429:
                print(f"Rate limited on {proxy}, switching...")
                time.sleep(1)
                continue
            return response
        except requests.RequestException as e:
            print(f"Attempt {attempt+1} failed: {e}")
            time.sleep(2)
    return None  # all attempts failed

# Usage
urls = ["https://example.com/item/1", "https://example.com/item/2"]
for url in urls:
    result = fetch_with_retry(url)
    if result:
        print(f"OK: {url} - {len(result.content)} bytes")
2. Smart Proxy Pool Considering Rate Limit
A more advanced option is the ProxyPool class, which tracks the last usage time of each IP and does not exceed the established limit:
import time
from collections import defaultdict
from threading import Lock

import requests

class ProxyPool:
    def __init__(self, proxies, rate_limit=10, window=60):
        """
        proxies: list of strings in the form 'http://user:pass@host:port'
        rate_limit: maximum requests from one IP per window seconds
        window: time window in seconds
        """
        self.proxies = proxies
        self.rate_limit = rate_limit
        self.window = window
        self.usage = defaultdict(list)  # proxy -> [timestamps]
        self.lock = Lock()

    def get_available_proxy(self):
        now = time.time()
        with self.lock:
            for proxy in self.proxies:
                # Clear outdated timestamps
                self.usage[proxy] = [
                    t for t in self.usage[proxy]
                    if now - t < self.window
                ]
                if len(self.usage[proxy]) < self.rate_limit:
                    self.usage[proxy].append(now)
                    return {"http": proxy, "https": proxy}
        return None  # All proxies have exhausted their limit

    def fetch(self, url, **kwargs):
        proxy = self.get_available_proxy()
        # Loop instead of recursing, so a long wait cannot exhaust the call stack
        while proxy is None:
            print("All proxies rate-limited, waiting...")
            time.sleep(5)
            proxy = self.get_available_proxy()
        try:
            return requests.get(url, proxies=proxy, timeout=10, **kwargs)
        except requests.RequestException as e:
            print(f"Request failed: {e}")
            return None

# Usage (placeholder proxy URLs)
pool = ProxyPool(
    proxies=[
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
    ],
    rate_limit=10,  # 10 requests per minute per IP
    window=60,
)
for i in range(100):
    r = pool.fetch(f"https://example.com/page/{i}")
    if r:
        print(f"Page {i}: {r.status_code}")
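The pool above sleeps for a fixed 5 seconds when every proxy is exhausted. A possible refinement is to compute exactly when the next slot frees up. The sketch below is an assumption-level illustration: the helper names `seconds_until_slot` and `shortest_wait` are hypothetical, and they work on the same kind of `proxy -> [timestamps]` mapping that `ProxyPool.usage` holds.

```python
def seconds_until_slot(timestamps, rate_limit, window, now):
    """How long until a proxy with these request timestamps may send again."""
    recent = sorted(t for t in timestamps if now - t < window)
    if len(recent) < rate_limit:
        return 0.0
    # A slot frees when the oldest of the last `rate_limit` requests leaves the window
    oldest_blocking = recent[-rate_limit]
    return max(0.0, oldest_blocking + window - now)


def shortest_wait(usage, rate_limit, window, now):
    """Smallest wait across all proxies; `usage` maps proxy -> list of timestamps."""
    return min(
        (seconds_until_slot(ts, rate_limit, window, now) for ts in usage.values()),
        default=0.0,
    )


# Two proxies, limit of 2 requests per 60 seconds, current time t=40
usage = {"proxy-a": [0, 10], "proxy-b": [5, 30]}
print(shortest_wait(usage, rate_limit=2, window=60, now=40))  # proxy-a frees up first, after 20s
```

Inside `fetch`, the fixed `time.sleep(5)` could then be replaced by a sleep of `shortest_wait(...)` seconds, computed under the same lock that protects `usage`.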
3. Aiohttp for Asynchronous Scraping
The asynchronous approach allows you to use dozens of proxies in parallel without blocking threads:
import asyncio
import itertools

import aiohttp

# Placeholder proxy URLs in the form http://user:pass@host:port
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

async def fetch(session, url, proxy):
    try:
        async with session.get(
            url,
            proxy=proxy,
            timeout=aiohttp.ClientTimeout(total=10),
        ) as response:
            if response.status == 429:
                await asyncio.sleep(2)
                return None
            return await response.text()
    except Exception as e:
        print(f"Error: {e}")
        return None

async def main(urls):
    # At most 50 simultaneous connections across all proxies
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [
            fetch(session, url, next(proxy_cycle))
            for url in urls
        ]
        results = await asyncio.gather(*tasks)
    return results

urls = [f"https://example.com/item/{i}" for i in range(200)]
results = asyncio.run(main(urls))
print(f"Collected: {sum(1 for r in results if r is not None)} pages")
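The example above caps total connections but not the load on each individual proxy. One possible way to bound concurrency per proxy is an `asyncio.Semaphore` for each one. This is a sketch under assumptions: `run_with_proxy_limits` and `fake_worker` are hypothetical names, and the fake worker stands in for a real `aiohttp` request so the block runs without network access.

```python
import asyncio


async def run_with_proxy_limits(urls, proxies, worker, per_proxy=2):
    """Run worker(url, proxy) for every URL, with at most `per_proxy` concurrent calls per proxy."""
    semaphores = {proxy: asyncio.Semaphore(per_proxy) for proxy in proxies}

    async def guarded(url, proxy):
        async with semaphores[proxy]:
            return await worker(url, proxy)

    # Assign proxies round-robin, then run everything concurrently
    tasks = [guarded(url, proxies[i % len(proxies)]) for i, url in enumerate(urls)]
    return await asyncio.gather(*tasks)


async def demo():
    active, peak = {}, {}

    async def fake_worker(url, proxy):
        # Stand-in for a real request; records how many calls overlap per proxy
        active[proxy] = active.get(proxy, 0) + 1
        peak[proxy] = max(peak.get(proxy, 0), active[proxy])
        await asyncio.sleep(0.01)
        active[proxy] -= 1
        return url

    urls = [f"https://example.com/item/{i}" for i in range(20)]
    results = await run_with_proxy_limits(urls, ["proxy-a", "proxy-b"], fake_worker, per_proxy=2)
    return results, peak


results, peak = asyncio.run(demo())
print(len(results), peak)
```

In a real scraper, `worker` would wrap the `fetch(session, url, proxy)` coroutine from the example above.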
4. Scrapy with Rotation via Middleware
For Scrapy, there is a ready-made solution: scrapy-rotating-proxies. However, you can write your own middleware:
# middlewares.py
import random

class RotatingProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Pick a random proxy for every outgoing request
        proxy = random.choice(self.proxies)
        request.meta["proxy"] = proxy

    def process_response(self, request, response, spider):
        if response.status == 429:
            spider.logger.warning(f"Rate limited, proxy: {request.meta.get('proxy')}")
            # Logic to exclude the problematic proxy can be added here
        return response

# settings.py (placeholder proxy URLs)
PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 10
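The "logic to exclude the problematic proxy" mentioned in the middleware comment could take many forms. One simple option, sketched here as an assumption rather than a standard Scrapy feature, is a cooldown tracker: a proxy that returned 429 is skipped for a fixed time. The class name `ProxyCooldown` and the 120-second default are hypothetical.

```python
import random
import time


class ProxyCooldown:
    """Hypothetical tracker: a proxy that returned 429 is skipped for `cooldown` seconds."""

    def __init__(self, proxies, cooldown=120.0):
        self.proxies = list(proxies)
        self.cooldown = cooldown
        self.blocked_until = {}  # proxy -> time when it may be used again

    def mark_limited(self, proxy, now=None):
        now = time.monotonic() if now is None else now
        self.blocked_until[proxy] = now + self.cooldown

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        return [p for p in self.proxies if self.blocked_until.get(p, float("-inf")) <= now]

    def choose(self, now=None):
        candidates = self.available(now)
        return random.choice(candidates) if candidates else None


tracker = ProxyCooldown(["proxy-a", "proxy-b"], cooldown=120)
tracker.mark_limited("proxy-a", now=0)
print(tracker.available(now=60))   # proxy-a is still cooling down
print(tracker.available(now=120))  # proxy-a is back in the pool
```

In the middleware, `process_response` could call `mark_limited` on a 429 and `process_request` could call `choose` instead of `random.choice`.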
Node.js Code Examples: Axios, Got, Puppeteer
Node.js is a popular choice for browser automation and working with APIs. Here are ready-made patterns for working with proxies.
1. Axios with Proxy Rotation
const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

// Placeholder proxy URLs in the form http://user:pass@host:port
const proxies = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
];
let proxyIndex = 0;

function getNextProxy() {
  const proxy = proxies[proxyIndex % proxies.length];
  proxyIndex++;
  return proxy;
}

async function fetchWithProxy(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const proxyUrl = getNextProxy();
    const agent = new HttpsProxyAgent(proxyUrl);
    try {
      const response = await axios.get(url, {
        httpsAgent: agent,
        httpAgent: agent,
        proxy: false, // let the agent do the proxying, not axios' built-in proxy option
        timeout: 10000,
        headers: {
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        },
      });
      return response.data;
    } catch (error) {
      if (error.response?.status === 429) {
        console.log(`Rate limited, switching proxy...`);
        await new Promise(r => setTimeout(r, 1000));
        continue;
      }
      console.error(`Attempt ${i + 1} failed:`, error.message);
    }
  }
  return null;
}

// Usage
(async () => {
  const urls = Array.from({length: 50}, (_, i) => `https://example.com/item/${i}`);
  const results = await Promise.allSettled(
    urls.map(url => fetchWithProxy(url))
  );
  const successful = results.filter(r => r.status === 'fulfilled' && r.value).length;
  console.log(`Success: ${successful}/${urls.length}`);
})();
2. Puppeteer with Proxy and Rate Limiting Bypass
For sites with JavaScript rendering and Cloudflare protection, a headless browser is needed:
const puppeteer = require('puppeteer');

const proxies = [
  'proxy1.example.com:8080',
  'proxy2.example.com:8080',
];

async function scrapeWithProxy(url, proxyHost) {
  const browser = await puppeteer.launch({
    args: [
      `--proxy-server=${proxyHost}`,
      '--no-sandbox',
      '--disable-setuid-sandbox',
    ],
    headless: true,
  });
  try {
    const page = await browser.newPage();
    // Proxy authentication
    await page.authenticate({
      username: 'user',
      password: 'pass',
    });
    // Set a realistic User-Agent
    await page.setUserAgent(
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    );
    const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
    // Check for rate limit: HTTP status of the main document first, page title as a fallback
    const title = await page.title();
    if (response?.status() === 429 || title.includes('429') || title.includes('Too Many')) {
      console.log('Rate limited, need to switch proxy');
      return null;
    }
    const data = await page.evaluate(() => {
      return document.querySelector('.price')?.textContent || null;
    });
    return data;
  } finally {
    // Always close the browser, even if page setup or navigation fails
    await browser.close();
  }
}

// Rotation by tasks
(async () => {
  const urls = ['https://example.com/product/1', 'https://example.com/product/2'];
  for (let i = 0; i < urls.length; i++) {
    const proxy = proxies[i % proxies.length];
    const result = await scrapeWithProxy(urls[i], proxy);
    console.log(`${urls[i]}: ${result}`);
    await new Promise(r => setTimeout(r, 500)); // small delay
  }
})();
Advanced Techniques: Headers, Fingerprinting, Bypassing Cloudflare
Changing IP is a necessary but not always sufficient condition. Modern protection systems analyze dozens of request parameters. Let's discuss what else needs to be considered.
HTTP Headers: Minimum Required Set
A request without normal headers looks like a bot even with an IP change. Always add:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Cache-Control": "max-age=0",
}
Handling the Retry-After Header
When receiving a 429 response, the server often indicates how long to wait. Proper handling of this header allows you to avoid wasting requests:
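import time

def handle_rate_limit(response, attempt=0):
    """Sleep if the response is a 429. Returns True when the caller should retry.

    attempt: number of the current retry (starting from 0), used for the fallback delay
    """
    if response.status_code != 429:
        return False
    retry_after = response.headers.get("Retry-After")
    if retry_after and retry_after.isdigit():
        wait_time = int(retry_after)
        print(f"Rate limited. Waiting {wait_time} seconds...")
        time.sleep(wait_time + 1)  # +1 second buffer
    else:
        # Exponential delay if the header is missing or is not a number of seconds
        time.sleep(min(2 ** attempt, 60))
    return True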
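Note that Retry-After may carry either a number of seconds or an HTTP date. A parser that accepts both forms might look like the sketch below; the function name `parse_retry_after` is an assumption, and it relies only on the standard library.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Return seconds to wait for a Retry-After value, or None if it cannot be parsed."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now))
print(parse_retry_after("soon"))
```

When the function returns None, falling back to exponential delay is a reasonable default.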
TLS Fingerprinting and How to Bypass It
Advanced systems (Cloudflare, Akamai, PerimeterX) analyze the TLS fingerprint, a distinctive signature of how your client sets up its TLS connection. The standard requests library has an easily recognizable fingerprint. Solutions:
- curl_cffi (Python): emulates Chrome/Firefox fingerprints at the TLS level
- tls-client (Go/Python): a similar tool with support for different browser profiles
- Playwright/Puppeteer: a real browser, so the TLS fingerprint is genuine by default
# pip install curl-cffi
from curl_cffi import requests as cffi_requests

response = cffi_requests.get(
    "https://cloudflare-protected-site.com/api/data",
    impersonate="chrome120",  # Emulating Chrome 120
    proxies={"https": "http://user:pass@proxy1.example.com:8080"},  # placeholder proxy URL
)
print(response.json())
Managing Cookies and Sessions
If a site uses cookies to track sessions, changing IP without changing cookies is pointless. Always create a new session when switching proxies:
import requests

def create_fresh_session(proxy_url):
    """Create a new session with clean cookies for each proxy"""
    session = requests.Session()
    session.proxies = {"http": proxy_url, "https": proxy_url}
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    })
    # Cookies are not carried over from the previous session
    return session

# For each new IP, a new session (PROXIES is the list of proxy URLs from the earlier examples)
for proxy in PROXIES:
    session = create_fresh_session(proxy)
    response = session.get("https://example.com/protected-page")
    # Process the response...
Common Mistakes When Working with Proxies and Rate Limiting
Even with properly configured proxies, developers regularly fall into the same traps. Here are the most common mistakes and how to avoid them.
Checklist: What to Check Before Starting the Scraper
- Realistic HTTP headers added (User-Agent, Accept, Accept-Language)
- A new session is created when switching proxies (new cookies)
- Statuses 429, 503, 403 are handled with retry logic
- A delay between requests is implemented (at least 100-500 ms)
- The number of proxies matches the target request speed
- Proxies are checked for functionality before starting (health check)
- Errors and statistics for each proxy are logged
- A timeout for requests is set (no more than 15-30 seconds)
Error 1: Using "Dead" Proxies
Always check proxies before adding them to the pool and periodically during operation. One non-working proxy in the cycle means lost requests and timeouts:
import requests

def check_proxy(proxy_url, test_url="https://httpbin.org/ip", timeout=5):
    try:
        r = requests.get(
            test_url,
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,
        )
        return r.status_code == 200
    except requests.RequestException:
        # Connection errors and timeouts both mean the proxy is unusable
        return False

# Filter working proxies before starting (PROXIES is your list of proxy URLs)
working_proxies = [p for p in PROXIES if check_proxy(p)]
print(f"Working proxies: {len(working_proxies)}/{len(PROXIES)}")
Error 2: Ignoring Protocol Type
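The scheme in a proxy URL describes how your client talks to the proxy, not which sites it can reach. An HTTP proxy handles HTTPS traffic by opening a tunnel with the CONNECT method, so it works for HTTPS sites only if the proxy supports CONNECT. SOCKS5 proxies operate below HTTP and can relay any TCP-based protocol. Check which protocols your provider supports and use the matching scheme; when in doubt, SOCKS5 or HTTPS proxies are the safer choice for modern sites: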
# SOCKS5 proxy in requests (requires pip install requests[socks])
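# Placeholder proxy URLs; use socks5h:// instead of socks5:// to resolve DNS through the proxy
proxies = {
    "http": "socks5://user:pass@proxy1.example.com:1080",
    "https": "socks5://user:pass@proxy1.example.com:1080",
}
# HTTPS proxy
proxies = {
    "http": "https://user:pass@proxy1.example.com:8080",
    "https": "https://user:pass@proxy1.example.com:8080",
}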
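A mistyped scheme or a missing port is easy to overlook in a long proxy list. A small validation step before building the pool can catch this early. The sketch below is an illustration under assumptions: the function name `describe_proxy`, the set of accepted schemes, and the requirement of an explicit port are choices made for this example, and the hostname is a placeholder.

```python
from urllib.parse import urlsplit

# Schemes this sketch accepts; adjust to what your HTTP client actually supports
SUPPORTED_SCHEMES = {"http", "https", "socks5", "socks5h"}


def describe_proxy(proxy_url):
    """Validate a proxy URL and return its parts, or raise ValueError."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported proxy scheme: {parts.scheme!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError("Proxy URL must include host and port")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "has_auth": parts.username is not None,
    }


info = describe_proxy("socks5://user:pass@proxy.example:1080")
print(info)
```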
Error 3: Lack of Exponential Backoff
If you immediately repeat the request after a 429, you only worsen the situation. The correct strategy is exponential delay with jitter (random deviation):
import random
import time

import requests

def exponential_backoff(attempt, base=1, max_wait=60):
    """
    attempt: attempt number (starting from 0)
    base: base delay in seconds
    max_wait: maximum delay
    """
    wait = min(base * (2 ** attempt), max_wait)
    # Jitter ±25% to prevent thundering herd
    jitter = wait * 0.25 * random.uniform(-1, 1)
    return wait + jitter

# Usage in retry logic (url and proxy are assumed to be defined as in the examples above)
for attempt in range(5):
    response = requests.get(url, proxies=proxy, timeout=10)
    if response.status_code == 429:
        wait = exponential_backoff(attempt)
        print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt+1})")
        time.sleep(wait)
    else:
        break
Error 4: One Thread for All Proxies
If you have 50 proxies but one execution thread, you are using at most 1 proxy at a time. Use ThreadPoolExecutor or an asynchronous approach to use the entire pool in parallel:
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch_url(args):
    url, proxy = args
    try:
        # Route both http:// and https:// URLs through the proxy
        r = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
        return url, r.status_code, len(r.text)
    except Exception as e:
        return url, None, str(e)

# Use all proxies in parallel (urls and proxies are your lists of URLs and proxy URLs)
tasks = [(url, proxies[i % len(proxies)]) for i, url in enumerate(urls)]
with ThreadPoolExecutor(max_workers=len(proxies)) as executor:
    futures = {executor.submit(fetch_url, task): task for task in tasks}
    for future in as_completed(futures):
        url, status, size = future.result()
        print(f"{url}: {status} ({size})")
Conclusion and Recommendations
Rate limiting is a solvable problem if approached systematically. Key takeaways from this guide:
- A proxy pool, not a single proxy, is the minimum unit for serious work. The number of proxies is determined by the formula: target speed ÷ server limit per IP.
- Rotation strategy is important: per-request for stateless requests, sticky sessions for authorized scenarios.
- IP is not the only parameter: headers, cookies, TLS fingerprint, and behavioral patterns are also analyzed by protection systems.
- Handle 429 correctly: exponential backoff, Retry-After header, switch proxies when blocked.
- The type of proxy depends on the goal: data center proxies for open APIs, residential for marketplaces, mobile for maximum protection.
If you are scraping marketplaces (Wildberries, Ozon), collecting data from protected APIs, or automating at high speeds, we recommend starting with residential proxies: they provide a good balance between anonymity and speed, and their IP addresses rarely end up in blacklists. For tasks that require maximum resilience against blocks at high request frequencies, consider mobile proxies: their IPs are shared by thousands of real users, so sites are very reluctant to block them.