If you're scraping marketplaces, monitoring competitor prices, or collecting data from websites, you know the problem: sites block IP addresses, throw captchas, or return empty pages. The ban rate (the share of requests that get blocked) can reach 70-90%, making scraping impossible. In this article, we'll look at concrete techniques that can bring the ban rate down to 5-10% and keep data collection stable.
We'll cover both technical solutions (proxy rotation, HTTP headers, fingerprinting) and behavioral patterns (delays, imitating user actions). All methods have been tested in practice when scraping Wildberries, Ozon, Avito, and international platforms.
Why websites block scrapers: main triggers
Before examining protection methods, it's important to understand how sites identify automated traffic. Modern anti-bot systems (Cloudflare, Akamai, DataDome, Imperva) analyze dozens of parameters for each request. Here are the main blocking triggers:
Network-level triggers:
- Too many requests from a single IP address (e.g., 100+ requests per minute)
- IP from known data center ranges (AWS, Google Cloud, Hetzner)
- Geographic mismatch: IP from Russia requesting English version of site
- Absence of reverse DNS record for IP address
HTTP-level triggers:
- Missing or incorrect HTTP headers (User-Agent, Accept-Language, Referer)
- Header order differs from browser standard
- TLS/SSL version doesn't match declared browser
- Missing cookies or incorrect usage
Browser-level triggers (JavaScript):
- Absence of JavaScript execution (if using simple HTTP client)
- Browser fingerprinting: Canvas, WebGL, AudioContext, installed fonts
- Absence of mouse movement, scrolling, clicks
- Browser window size (headless browsers often have non-standard sizes)
- Presence of automation: navigator.webdriver, window.chrome properties
Behavioral triggers:
- Too fast navigation between pages (less than 1 second)
- Identical intervals between requests (e.g., exactly every 2 seconds)
- Sequential page traversal (1, 2, 3, 4...) without skips
- Absence of typical user actions: search, filters, viewing images
For example, when scraping Wildberries, a typical mistake is sending requests every 0.5 seconds from one IP. Cloudflare's anti-bot system will instantly identify the pattern and block the IP for 24 hours. A real user spends 5-15 seconds viewing a product card, scrolls the page, clicks on images.
Proxy rotation: how to properly change IP addresses
Using proxies is a basic method for reducing ban rate. But it's important not just to buy proxies, but to configure rotation correctly. Here are proven strategies:
Choosing proxy type for scraping
| Proxy Type | Ban Rate | Speed | When to Use |
|---|---|---|---|
| Datacenter Proxies | High (40-60%) | Very High | Simple sites without protection, mass scraping with large IP pool |
| Residential Proxies | Low (5-15%) | Medium | Marketplaces (Wildberries, Ozon), sites with Cloudflare, social networks |
| Mobile Proxies | Very Low (2-8%) | Low | Sites with aggressive protection, mobile app versions |
For scraping marketplaces (Wildberries, Ozon, Avito), residential proxies are recommended — they have IPs of real home users, which are difficult to distinguish from regular traffic. Datacenter proxies are suitable for less protected sites or when maximum speed is needed with large data volumes.
IP address rotation strategies
Strategy 1: Time-based rotation
Change IP every 5-10 minutes. This is the optimal balance: long enough not to raise suspicion with frequent changes, but frequent enough not to accumulate request history on one IP.
Example: When scraping a catalog of 1000 products with 3-second intervals between requests, one IP will be active for approximately 100 requests, then rotation occurs.
Strategy 2: Request count-based rotation
Change IP after 50-150 requests. This helps avoid accumulating suspicious activity on one address. Add randomness: not exactly 100 requests, but from 80 to 120.
Example: Configure the script so that after a random number of requests (80-120), proxy rotation from the pool occurs.
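A minimal sketch of this strategy (the proxy endpoints below are placeholders; substitute your own pool):

```python
import random

class ProxyRotator:
    """Rotate to a new proxy after a randomized number of requests (80-120 by default)."""

    def __init__(self, proxies, low=80, high=120):
        self.proxies = proxies
        self.low, self.high = low, high
        self._count = 0
        self._next_rotation = random.randint(low, high)
        self.current = random.choice(proxies)

    def get(self):
        """Return the proxy to use for the next request, rotating when due."""
        if self._count >= self._next_rotation:
            # Pick a fresh proxy and re-randomize the rotation threshold
            self.current = random.choice(self.proxies)
            self._count = 0
            self._next_rotation = random.randint(self.low, self.high)
        self._count += 1
        return self.current

# Placeholder endpoints; substitute your own proxy pool
rotator = ProxyRotator(["http://p1.example.com:8080", "http://p2.example.com:8080"])
proxy = rotator.get()
# requests.get(url, proxies={"http": proxy, "https": proxy})
```

The randomized threshold avoids the "exactly every N requests" pattern that anti-bot systems can spot just as easily as fixed delays.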
Strategy 3: Sticky sessions (session proxies)
For sites requiring authorization or working with shopping carts, use sticky sessions — IP binding for the session duration (10-30 minutes). This allows maintaining cookies and doesn't raise suspicion when changing IP within one session.
Example: When scraping a personal account on Ozon, use one IP for login and all subsequent requests within a 15-minute session.
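A minimal sketch of tracking a sticky session's lifetime (the proxy URL and 15-minute TTL are placeholders; real providers usually pin a session via a session ID in the proxy username or a dedicated port):

```python
import time

class StickySession:
    """Track how long the current sticky proxy has been in use."""

    def __init__(self, proxy_url, ttl=15 * 60, clock=time.monotonic):
        # e.g. "http://user-sess1:pass@gw.example.com:8000" (hypothetical format)
        self.proxy_url = proxy_url
        self.ttl = ttl
        self._clock = clock
        self._started = clock()

    def expired(self):
        """True once the sticky window has elapsed and a fresh session is needed."""
        return self._clock() - self._started > self.ttl

sticky = StickySession("http://user-sess1:pass@gw.example.com:8000")
# Use the same endpoint for login and all follow-up requests in the window:
# proxies = {"http": sticky.proxy_url, "https": sticky.proxy_url}
```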
Important: Don't use the same IP for different tasks. If an IP was blocked when scraping one site, don't use it immediately for another — wait 24-48 hours.
Proxy pool size
Minimum pool size depends on scraping intensity:
- Low intensity (up to 10,000 requests per day): 10-20 proxies
- Medium intensity (10,000 - 100,000 requests per day): 50-100 proxies
- High intensity (more than 100,000 requests per day): 200+ proxies or residential with automatic rotation
For residential proxies with rotation on each request (rotating proxies), pool size can be smaller, as the provider automatically substitutes a new IP from their pool of millions of addresses.
User-Agent and HTTP headers: imitating a real browser
Even with good proxies, you can be blocked if HTTP headers look suspicious. Sites analyze not only User-Agent, but also header order, their values, and correspondence to each other.
Proper User-Agent
Don't use the same User-Agent for all requests. Create a list of popular browsers and randomly select from it:
```python
user_agents = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    # Safari on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
]
```
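A per-session pick from such a list might look like this (shortened pool for illustration; note that keeping the chosen User-Agent stable within one session, and varying it across sessions, is arguably safer than switching it on every request from the same IP):

```python
import random

# Shortened pool for illustration; use the full list from above in practice
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_user_agent():
    """Pick one User-Agent per scraping session."""
    return random.choice(user_agents)

headers = {"User-Agent": pick_user_agent()}
```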
Mistake: Using outdated browser versions (e.g., Chrome 80) — this will immediately raise suspicion. Update the User-Agent list every 2-3 months, tracking current versions on whatismybrowser.com.
Complete set of HTTP headers
Modern browsers send 15-20 headers. Here's the minimum necessary set for imitating Chrome:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
    "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
}
```
Note the Sec-Fetch-* and sec-ch-ua-* headers — they appeared in new Chrome versions and their absence can reveal automation.
Header order matters
Browsers send headers in a specific order: Chrome, for example, puts Host first, then Connection, User-Agent, and so on. Python's requests library sends a different default header set in a different order, which reveals automation.
Solution: use HTTP clients that imitate browser header ordering and TLS fingerprints (curl_cffi for Python, got-scraping for Node.js), or headless browsers (Puppeteer, Playwright, Selenium), which generate headers like a real browser.
Delays between requests: optimal intervals
One of the simplest but most effective methods for reducing ban rate is proper delays between requests. A real user cannot open 10 pages per second, so too-fast requests instantly trigger blocking.
Random delays instead of fixed
Don't use fixed delays (e.g., exactly 2 seconds between requests). Anti-bot systems easily identify such patterns. Use random intervals:
```python
import random
import time

# Instead of a fixed delay
time.sleep(2)  # ❌ Bad: the constant interval is an obvious pattern

# Use a random interval
delay = random.uniform(2.5, 5.5)  # ✅ Good
time.sleep(delay)
```
Recommended intervals for different sites
| Site Type | Minimum Delay | Recommended Delay | Examples |
|---|---|---|---|
| Marketplaces with protection | 3-5 sec | 5-10 sec | Wildberries, Ozon, Lamoda |
| Classified ads | 2-4 sec | 4-8 sec | Avito, Yula, CIAN |
| News sites | 1-2 sec | 2-4 sec | RBC, Kommersant, Vedomosti |
| APIs without restrictions | 0.5-1 sec | 1-2 sec | Open APIs, RSS feeds |
Adaptive delays based on server responses
Advanced approach — dynamically change delays depending on server responses:
```python
import random
import time

import requests

# url, headers, and proxies are defined as in the previous examples
base_delay = 3.0        # Base delay in seconds
delay_multiplier = 1.0

response = requests.get(url, headers=headers, proxies=proxies)

# Captcha or 429 in the response: back off
if response.status_code == 429 or 'captcha' in response.text.lower():
    delay_multiplier *= 1.5
    print(f"Protection detected, increasing delay to {base_delay * delay_multiplier:.1f}s")
# Everything is fine: speed up slightly
elif response.status_code == 200:
    delay_multiplier = max(1.0, delay_multiplier * 0.95)

time.sleep(random.uniform(base_delay * delay_multiplier,
                          base_delay * delay_multiplier * 1.5))
```
This approach automatically slows the scraper down when protection is triggered and cautiously speeds it up again while the site responds normally.
Fingerprinting protection: Canvas, WebGL, fonts
If the site uses JavaScript for verification, simple HTTP headers are not enough. Modern anti-bot systems create a browser "fingerprint" based on dozens of parameters: Canvas, WebGL, installed fonts, time zone, screen resolution, and others.
Main fingerprinting parameters
Canvas fingerprinting
The site draws an invisible image in Canvas and reads it. Different browsers and operating systems render the image differently, creating a unique fingerprint. Headless browsers often generate identical Canvas, which reveals automation.
WebGL fingerprinting
Similar to Canvas, but uses 3D rendering. Information about graphics card, drivers, supported extensions is read. Headless browsers often show software rendering (SwiftShader) instead of real GPU.
Installed fonts
JavaScript can determine the list of installed fonts. Headless browsers usually have a minimal set of system fonts, which differs from a real user with installed Microsoft Office, Adobe, and other programs.
Navigator properties
The navigator.webdriver, navigator.plugins, and navigator.languages properties can reveal automation. For example, stock Selenium sets navigator.webdriver === true, which anti-bot systems detect instantly.
Tools for bypassing fingerprinting
To bypass fingerprinting, use specialized tools:
- Undetected ChromeDriver (Python) — modified version of Selenium that hides automation signs
- Puppeteer Stealth (Node.js) — plugin for Puppeteer that substitutes fingerprint parameters
- Playwright with stealth — similar to Puppeteer, but with better support for different browsers
- Anti-detect browsers (Dolphin Anty, AdsPower, Multilogin) — for those who don't want to write code, these browsers automatically substitute fingerprint
Example of using undetected-chromedriver in Python:
```python
import undetected_chromedriver as uc

# Create a browser with detection protection
options = uc.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = uc.Chrome(options=options)
driver.get('https://example.com')

# Check that navigator.webdriver is undefined
webdriver_status = driver.execute_script("return navigator.webdriver")
print(f"navigator.webdriver: {webdriver_status}")  # Should be None/undefined
```
Managing cookies and sessions
Many sites use cookies to track user behavior. Proper cookie management helps avoid blocking and look like a real user.
Saving and reusing cookies
Instead of creating a new session for each request, save cookies and reuse them. This imitates the behavior of a real user returning to the site:
```python
import pickle

import requests

session = requests.Session()

# First visit: get cookies
response = session.get('https://example.com')

# Save cookies to a file
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Later, load the cookies back
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))

# Requests now look like they come from a returning user
response = session.get('https://example.com/catalog')
```
Warming up session before scraping
Don't start scraping immediately with target pages. Imitate real user behavior:
- Open the site's homepage
- Wait 2-5 seconds
- Open a category or section page
- Wait 3-7 seconds
- Only after this start scraping target pages
This creates activity history in cookies and reduces the likelihood of blocking.
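The warm-up steps above can be sketched as follows (the URLs and pause ranges are illustrative; the HTTP client is injected as a callable so the sketch stays client-agnostic):

```python
import random
import time

def warm_up(session_get, path, sleep=time.sleep):
    """Visit entry pages with human-like pauses before hitting target URLs.

    session_get is any callable like requests.Session().get; path is a list
    of (url, min_pause, max_pause) tuples.
    """
    for url, lo, hi in path:
        session_get(url)
        sleep(random.uniform(lo, hi))

# Hypothetical warm-up route; substitute the target site's real pages
warmup_path = [
    ("https://example.com/", 2, 5),         # homepage
    ("https://example.com/catalog", 3, 7),  # category page
]
# warm_up(session.get, warmup_path)
```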
Handling session cookies and tokens
Some sites generate unique tokens on first visit and check them in subsequent requests. For example, Wildberries uses a token in the x-requested-with header. Always save such tokens from the first response and send them in subsequent requests.
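A minimal, hypothetical sketch of capturing and replaying such a token (the header name here is invented; use whatever the target site actually sets, which you can see in DevTools → Network on the first page load):

```python
def extract_token(response_headers, name="x-session-token"):
    """Case-insensitively pull a token header from the first response.

    The default header name is hypothetical; pass the real one observed
    on the target site.
    """
    for key, value in response_headers.items():
        if key.lower() == name.lower():
            return value
    return None

# Usage sketch with requests:
# first = session.get("https://example.com")
# token = extract_token(first.headers, "x-some-token")
# if token:
#     session.headers["x-some-token"] = token  # replay on every request
```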
JavaScript rendering: when it's necessary
Many modern sites load content via JavaScript. If you use a simple HTTP client (requests in Python, axios in Node.js), you'll get an empty page or stub. In such cases, JavaScript rendering is necessary.
When JavaScript rendering is needed
- Site uses React, Vue, Angular — content loads after initial page load
- Data is loaded via AJAX/Fetch requests
- Site requires JavaScript execution to generate tokens or cookies
- Bot protection is present requiring JS code execution (e.g., Cloudflare Challenge)
Tools for JavaScript rendering
| Tool | Language | Speed | Protection Bypass |
|---|---|---|---|
| Selenium | Python, Java, C# | Slow | Medium (with undetected-chromedriver) |
| Puppeteer | Node.js | Medium | Good (with puppeteer-extra-plugin-stealth) |
| Playwright | Python, Node.js, Java | Fast | Excellent |
| Splash | HTTP API | Medium | Weak |
For most tasks, Playwright is recommended — it's faster than Selenium, better bypasses protection, and has a more convenient API.
Alternative: intercepting API requests
Often you can avoid JavaScript rendering if you find the API requests the site uses to load data. Open DevTools (F12) → Network tab → XHR/Fetch filter and see what requests the site sends. Then repeat these requests directly via HTTP client.
Example: Wildberries loads product data via API https://catalog.wb.ru/catalog/.... Instead of rendering the entire page, you can request this API directly, which is 10-20 times faster.
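As a generic sketch (the endpoint and query parameters below are hypothetical; use the ones you actually observe in DevTools), a direct API request can be built with the standard library:

```python
from urllib.request import Request

# Hypothetical endpoint discovered in DevTools → Network → XHR/Fetch
API_URL = "https://example.com/api/v1/products?page={page}"

def build_api_request(page):
    """Build a direct request to the JSON API instead of rendering the page."""
    return Request(
        API_URL.format(page=page),
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "application/json",
        },
    )

# Usage sketch:
# import json
# from urllib.request import urlopen
# data = json.loads(urlopen(build_api_request(1)).read())
```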
Bypassing captcha: automated solutions
Even with proper proxies and headers, you may encounter captcha. There are several approaches to solving it:
Captcha types and solving methods
reCAPTCHA v2 ("I'm not a robot" checkbox)
Solved via recognition services: 2Captcha, Anti-Captcha, CapMonster. Cost: $1-3 per 1000 solutions. Solution time: 10-30 seconds.
reCAPTCHA v3 (invisible, score-based)
More complex. Analyzes user behavior and assigns a score from 0 to 1. Bypass: using headless browsers with proper fingerprint + imitating user actions (mouse movement, clicks).
hCaptcha
Analog of reCAPTCHA, used on many sites. Solved via the same recognition services. Cost: $0.5-2 per 1000 solutions.
Cloudflare Challenge
JavaScript-challenge that checks the browser. Bypass: using specialized libraries (cloudscraper for Python, cloudflare-scraper for Node.js) or services (FlareSolverr).
Integrating captcha recognition service
Example of 2Captcha integration in Python:
```python
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_API_KEY')

try:
    # Solve reCAPTCHA v2
    result = solver.recaptcha(
        sitekey='6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-',
        url='https://example.com'
    )
    # Get the solution token
    captcha_token = result['code']
    # Submit the form with the token
    response = requests.post('https://example.com/submit', data={
        'g-recaptcha-response': captcha_token
    })
except Exception as e:
    print(f"Captcha solving error: {e}")
```
Important: Solving captcha slows down scraping by 10-30 times and increases costs. Use it only when other methods don't work. First try improving proxies, fingerprint, and delays.
Rate limiting: how not to exceed site limits
Many sites have explicit or implicit limits on the number of requests. Exceeding these limits leads to temporary or permanent IP blocking.
Determining site limits
Pay attention to HTTP headers in server responses:
- X-RateLimit-Limit — maximum number of requests per period
- X-RateLimit-Remaining — how many requests remain
- X-RateLimit-Reset — when the limit resets (Unix timestamp)
- Retry-After — how many seconds to wait before retrying
If you receive status code 429 (Too Many Requests), the limit has been exceeded. Read the Retry-After header and wait the specified time before the next request.
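A small helper for honoring this might look like the sketch below (the 60-second fallback is an assumption; note that Retry-After can also be an HTTP date, which this sketch doesn't parse):

```python
def wait_for_retry(status_code, headers, default_wait=60.0):
    """Return how many seconds to wait before retrying.

    Honors the Retry-After header on 429; the 60-second fallback is
    an assumption, not a standard value.
    """
    if status_code != 429:
        return 0.0
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            # Retry-After may also be an HTTP date; fall back rather than parse it
            return default_wait
    return default_wait

# Usage sketch with requests:
# wait = wait_for_retry(resp.status_code, resp.headers)
# if wait:
#     time.sleep(wait)
```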
Implementing rate limiter
Create a request rate control mechanism:
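A minimal sliding-window limiter sketch (the 30-requests-per-minute default is illustrative, not tuned to any particular site; the clock and sleep functions are injectable so the logic can be tested without actually waiting):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most max_requests per `period` seconds."""

    def __init__(self, max_requests=30, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self._timestamps = deque()
        self._clock = clock
        self._sleep = sleep

    def _prune(self, now):
        # Drop request timestamps that have left the sliding window
        while self._timestamps and now - self._timestamps[0] >= self.period:
            self._timestamps.popleft()

    def acquire(self):
        """Block until a request may be sent, then record it."""
        now = self._clock()
        self._prune(now)
        if len(self._timestamps) >= self.max_requests:
            # Wait until the oldest request exits the window
            self._sleep(self.period - (now - self._timestamps[0]))
            self._prune(self._clock())
        self._timestamps.append(self._clock())

# limiter = RateLimiter(max_requests=30, period=60)
# limiter.acquire()  # call before every HTTP request
```

Combine this with the randomized delays above: the limiter caps the overall request rate, while the random pauses break up the interval pattern.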