If you're scraping marketplaces, monitoring competitor prices, or collecting data from websites, you know the problem: sites block IP addresses, throw captchas, or return empty pages. The ban rate (the share of requests that get blocked) can reach 70-90%, making scraping impossible. In this article, we'll look at concrete techniques that can bring the ban rate down to 5-10% and keep data collection stable.
We'll cover both technical solutions (proxy rotation, HTTP headers, fingerprinting) and behavioral patterns (delays, imitating user actions). All methods have been tested in practice when scraping Wildberries, Ozon, Avito, and international platforms.
Why websites block scrapers: main triggers
Before examining protection methods, it's important to understand how sites identify automated traffic. Modern anti-bot systems (Cloudflare, Akamai, DataDome, Imperva) analyze dozens of parameters for each request. Here are the main blocking triggers:
Network-level triggers:
- Too many requests from a single IP address (e.g., 100+ requests per minute)
- IP from known data center ranges (AWS, Google Cloud, Hetzner)
- Geographic mismatch: IP from Russia requesting English version of site
- Absence of reverse DNS record for IP address
HTTP-level triggers:
- Missing or incorrect HTTP headers (User-Agent, Accept-Language, Referer)
- Header order differs from browser standard
- TLS/SSL version doesn't match declared browser
- Missing cookies or incorrect usage
Browser-level triggers (JavaScript):
- Absence of JavaScript execution (if using simple HTTP client)
- Browser fingerprinting: Canvas, WebGL, AudioContext, installed fonts
- Absence of mouse movement, scrolling, clicks
- Browser window size (headless browsers often have non-standard sizes)
- Presence of automation: navigator.webdriver, window.chrome properties
Behavioral triggers:
- Too fast navigation between pages (less than 1 second)
- Identical intervals between requests (e.g., exactly every 2 seconds)
- Sequential page traversal (1, 2, 3, 4...) without skips
- Absence of typical user actions: search, filters, viewing images
For example, when scraping Wildberries, a typical mistake is sending requests every 0.5 seconds from one IP. Cloudflare's anti-bot system will instantly identify the pattern and block the IP for 24 hours. A real user spends 5-15 seconds viewing a product card, scrolls the page, clicks on images.
Proxy rotation: how to properly change IP addresses
Using proxies is a basic method for reducing ban rate. But it's important not just to buy proxies, but to configure rotation correctly. Here are proven strategies:
Choosing proxy type for scraping
| Proxy Type | Ban Rate | Speed | When to Use |
|---|---|---|---|
| Datacenter Proxies | High (40-60%) | Very High | Simple sites without protection, mass scraping with large IP pool |
| Residential Proxies | Low (5-15%) | Medium | Marketplaces (Wildberries, Ozon), sites with Cloudflare, social networks |
| Mobile Proxies | Very Low (2-8%) | Low | Sites with aggressive protection, mobile app versions |
For scraping marketplaces (Wildberries, Ozon, Avito), residential proxies are recommended — they have IPs of real home users, which are difficult to distinguish from regular traffic. Datacenter proxies are suitable for less protected sites or when maximum speed is needed with large data volumes.
IP address rotation strategies
Strategy 1: Time-based rotation
Change IP every 5-10 minutes. This is the optimal balance: long enough not to raise suspicion with frequent changes, but frequent enough not to accumulate request history on one IP.
Example: When scraping a catalog of 1000 products with 3-second intervals between requests, one IP will be active for approximately 100 requests, then rotation occurs.
Strategy 2: Request count-based rotation
Change IP after 50-150 requests. This helps avoid accumulating suspicious activity on one address. Add randomness: not exactly 100 requests, but from 80 to 120.
Example: Configure the script so that after a random number of requests (80-120), proxy rotation from the pool occurs.
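A minimal sketch of this strategy (the proxy endpoints below are placeholders; substitute your own pool):

```python
import random

class ProxyRotator:
    """Rotate to a new proxy after a randomized number of requests (80-120 by default)."""

    def __init__(self, proxies, low=80, high=120):
        self.proxies = proxies
        self.low, self.high = low, high
        self._count = 0
        self._next_rotation = random.randint(low, high)
        self.current = random.choice(proxies)

    def get(self):
        """Return the proxy to use for the next request, rotating when due."""
        if self._count >= self._next_rotation:
            # Pick a fresh proxy and re-randomize the rotation threshold
            self.current = random.choice(self.proxies)
            self._count = 0
            self._next_rotation = random.randint(self.low, self.high)
        self._count += 1
        return self.current

# Placeholder endpoints; substitute your own proxy pool
rotator = ProxyRotator(["http://p1.example.com:8080", "http://p2.example.com:8080"])
proxy = rotator.get()
# requests.get(url, proxies={"http": proxy, "https": proxy})
```

The randomized threshold avoids the "exactly every N requests" pattern that anti-bot systems can spot just as easily as fixed delays.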
Strategy 3: Sticky sessions (session proxies)
For sites requiring authorization or working with shopping carts, use sticky sessions — IP binding for the session duration (10-30 minutes). This allows maintaining cookies and doesn't raise suspicion when changing IP within one session.
Example: When scraping a personal account on Ozon, use one IP for login and all subsequent requests within a 15-minute session.
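A minimal sketch of tracking a sticky session's lifetime (the proxy URL and 15-minute TTL are placeholders; real providers usually pin a session via a session ID in the proxy username or a dedicated port):

```python
import time

class StickySession:
    """Track how long the current sticky proxy has been in use."""

    def __init__(self, proxy_url, ttl=15 * 60, clock=time.monotonic):
        # e.g. "http://user-sess1:pass@gw.example.com:8000" (hypothetical format)
        self.proxy_url = proxy_url
        self.ttl = ttl
        self._clock = clock
        self._started = clock()

    def expired(self):
        """True once the sticky window has elapsed and a fresh session is needed."""
        return self._clock() - self._started > self.ttl

sticky = StickySession("http://user-sess1:pass@gw.example.com:8000")
# Use the same endpoint for login and all follow-up requests in the window:
# proxies = {"http": sticky.proxy_url, "https": sticky.proxy_url}
```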
Important: Don't use the same IP for different tasks. If an IP was blocked when scraping one site, don't use it immediately for another — wait 24-48 hours.
Proxy pool size
Minimum pool size depends on scraping intensity:
- Low intensity (up to 10,000 requests per day): 10-20 proxies
- Medium intensity (10,000 - 100,000 requests per day): 50-100 proxies
- High intensity (more than 100,000 requests per day): 200+ proxies or residential with automatic rotation
For residential proxies with rotation on each request (rotating proxies), pool size can be smaller, as the provider automatically substitutes a new IP from their pool of millions of addresses.
User-Agent and HTTP headers: imitating a real browser
Even with good proxies, you can be blocked if HTTP headers look suspicious. Sites analyze not only User-Agent, but also header order, their values, and correspondence to each other.
Proper User-Agent
Don't use the same User-Agent for all requests. Create a list of popular browsers and randomly select from it:
```python
user_agents = [
    # Chrome on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    # Safari on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
]
```
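A per-session pick from such a list might look like this (shortened pool for illustration; note that keeping the chosen User-Agent stable within one session, and varying it across sessions, is arguably safer than switching it on every request from the same IP):

```python
import random

# Shortened pool for illustration; use the full list from above in practice
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_user_agent():
    """Pick one User-Agent per scraping session."""
    return random.choice(user_agents)

headers = {"User-Agent": pick_user_agent()}
```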
Mistake: Using outdated browser versions (e.g., Chrome 80) — this will immediately raise suspicion. Update the User-Agent list every 2-3 months, tracking current versions on whatismybrowser.com.
Complete set of HTTP headers
Modern browsers send 15-20 headers. Here's the minimum necessary set for imitating Chrome:
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
    "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
}
```
Note the Sec-Fetch-* and sec-ch-ua-* headers — they appeared in new Chrome versions and their absence can reveal automation.
Header order matters
Browsers send headers in a specific order: Chrome, for example, puts Host first, then Connection, User-Agent, and so on. Python's requests library sends a different default header set in a different order, which reveals automation.
Solution: use HTTP clients that imitate browser header ordering and TLS fingerprints (curl_cffi for Python, got-scraping for Node.js), or headless browsers (Puppeteer, Playwright, Selenium), which generate headers like a real browser.
Delays between requests: optimal intervals
One of the simplest but most effective methods for reducing ban rate is proper delays between requests. A real user cannot open 10 pages per second, so too-fast requests instantly trigger blocking.
Random delays instead of fixed
Don't use fixed delays (e.g., exactly 2 seconds between requests). Anti-bot systems easily identify such patterns. Use random intervals:
```python
import random
import time

# Instead of a fixed delay
time.sleep(2)  # ❌ Bad: the constant interval is an obvious pattern

# Use a random interval
delay = random.uniform(2.5, 5.5)  # ✅ Good
time.sleep(delay)
```
Recommended intervals for different sites
| Site Type | Minimum Delay | Recommended Delay | Examples |
|---|---|---|---|
| Marketplaces with protection | 3-5 sec | 5-10 sec | Wildberries, Ozon, Lamoda |
| Classified ads | 2-4 sec | 4-8 sec | Avito, Yula, CIAN |
| News sites | 1-2 sec | 2-4 sec | RBC, Kommersant, Vedomosti |
| APIs without restrictions | 0.5-1 sec | 1-2 sec | Open APIs, RSS feeds |
Adaptive delays based on server responses
Advanced approach — dynamically change delays depending on server responses:
```python
import random
import time

import requests

# url, headers, and proxies are defined as in the previous examples
base_delay = 3.0        # Base delay in seconds
delay_multiplier = 1.0

response = requests.get(url, headers=headers, proxies=proxies)

# Captcha or 429 in the response: back off
if response.status_code == 429 or 'captcha' in response.text.lower():
    delay_multiplier *= 1.5
    print(f"Protection detected, increasing delay to {base_delay * delay_multiplier:.1f}s")
# Everything is fine: speed up slightly
elif response.status_code == 200:
    delay_multiplier = max(1.0, delay_multiplier * 0.95)

time.sleep(random.uniform(base_delay * delay_multiplier,
                          base_delay * delay_multiplier * 1.5))
```
This approach automatically slows the scraper down when protection is triggered and cautiously speeds it up again while the site responds normally.
Fingerprinting protection: Canvas, WebGL, fonts
If the site uses JavaScript for verification, simple HTTP headers are not enough. Modern anti-bot systems create a browser "fingerprint" based on dozens of parameters: Canvas, WebGL, installed fonts, time zone, screen resolution, and others.
Main fingerprinting parameters
Canvas fingerprinting
The site draws an invisible image in Canvas and reads it. Different browsers and operating systems render the image differently, creating a unique fingerprint. Headless browsers often generate identical Canvas, which reveals automation.
WebGL fingerprinting
Similar to Canvas, but uses 3D rendering. Information about graphics card, drivers, supported extensions is read. Headless browsers often show software rendering (SwiftShader) instead of real GPU.
Installed fonts
JavaScript can determine the list of installed fonts. Headless browsers usually have a minimal set of system fonts, which differs from a real user with installed Microsoft Office, Adobe, and other programs.
Navigator properties
The navigator.webdriver, navigator.plugins, and navigator.languages properties can reveal automation. For example, stock Selenium sets navigator.webdriver === true, which anti-bot systems detect instantly.
Tools for bypassing fingerprinting
To bypass fingerprinting, use specialized tools:
- Undetected ChromeDriver (Python) — modified version of Selenium that hides automation signs
- Puppeteer Stealth (Node.js) — plugin for Puppeteer that substitutes fingerprint parameters
- Playwright with stealth — similar to Puppeteer, but with better support for different browsers
- Anti-detect browsers (Dolphin Anty, AdsPower, Multilogin) — for those who don't want to write code, these browsers automatically substitute fingerprint
Example of using undetected-chromedriver in Python:
```python
import undetected_chromedriver as uc

# Create a browser with detection protection
options = uc.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = uc.Chrome(options=options)
driver.get('https://example.com')

# Check that navigator.webdriver is undefined
webdriver_status = driver.execute_script("return navigator.webdriver")
print(f"navigator.webdriver: {webdriver_status}")  # Should be None/undefined
```
Managing cookies and sessions
Many sites use cookies to track user behavior. Proper cookie management helps avoid blocking and look like a real user.
Saving and reusing cookies
Instead of creating a new session for each request, save cookies and reuse them. This imitates the behavior of a real user returning to the site:
```python
import pickle

import requests

session = requests.Session()

# First visit: get cookies
response = session.get('https://example.com')

# Save cookies to a file
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)

# Later, load the cookies back
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))

# Requests now look like they come from a returning user
response = session.get('https://example.com/catalog')
```
Warming up session before scraping
Don't start scraping immediately with target pages. Imitate real user behavior:
- Open the site's homepage
- Wait 2-5 seconds
- Open a category or section page
- Wait 3-7 seconds
- Only after this start scraping target pages
This creates activity history in cookies and reduces the likelihood of blocking.
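The warm-up steps above can be sketched as follows (the URLs and pause ranges are illustrative; the HTTP client is injected as a callable so the sketch stays client-agnostic):

```python
import random
import time

def warm_up(session_get, path, sleep=time.sleep):
    """Visit entry pages with human-like pauses before hitting target URLs.

    session_get is any callable like requests.Session().get; path is a list
    of (url, min_pause, max_pause) tuples.
    """
    for url, lo, hi in path:
        session_get(url)
        sleep(random.uniform(lo, hi))

# Hypothetical warm-up route; substitute the target site's real pages
warmup_path = [
    ("https://example.com/", 2, 5),         # homepage
    ("https://example.com/catalog", 3, 7),  # category page
]
# warm_up(session.get, warmup_path)
```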
Handling session cookies and tokens
Some sites generate unique tokens on first visit and check them in subsequent requests. For example, Wildberries uses a token in the x-requested-with header. Always save such tokens from the first response and send them in subsequent requests.
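A minimal, hypothetical sketch of capturing and replaying such a token (the header name here is invented; use whatever the target site actually sets, which you can see in DevTools → Network on the first page load):

```python
def extract_token(response_headers, name="x-session-token"):
    """Case-insensitively pull a token header from the first response.

    The default header name is hypothetical; pass the real one observed
    on the target site.
    """
    for key, value in response_headers.items():
        if key.lower() == name.lower():
            return value
    return None

# Usage sketch with requests:
# first = session.get("https://example.com")
# token = extract_token(first.headers, "x-some-token")
# if token:
#     session.headers["x-some-token"] = token  # replay on every request
```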
JavaScript rendering: when it's necessary
Many modern sites load content via JavaScript. If you use a simple HTTP client (requests in Python, axios in Node.js), you'll get an empty page or stub. In such cases, JavaScript rendering is necessary.
When JavaScript rendering is needed
- Site uses React, Vue, Angular — content loads after initial page load
- Data is loaded via AJAX/Fetch requests
- Site requires JavaScript execution to generate tokens or cookies
- Bot protection is present requiring JS code execution (e.g., Cloudflare Challenge)
Tools for JavaScript rendering
| Tool | Language | Speed | Protection Bypass |
|---|---|---|---|
| Selenium | Python, Java, C# | Slow | Medium (with undetected-chromedriver) |
| Puppeteer | Node.js | Medium | Good (with puppeteer-extra-plugin-stealth) |
| Playwright | Python, Node.js, Java | Fast | Excellent |
| Splash | HTTP API | Medium | Weak |
For most tasks, Playwright is recommended — it's faster than Selenium, better bypasses protection, and has a more convenient API.
Alternative: intercepting API requests
Often you can avoid JavaScript rendering if you find the API requests the site uses to load data. Open DevTools (F12) → Network tab → XHR/Fetch filter and see what requests the site sends. Then repeat these requests directly via HTTP client.
Example: Wildberries loads product data via API https://catalog.wb.ru/catalog/.... Instead of rendering the entire page, you can request this API directly, which is 10-20 times faster.
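As a generic sketch (the endpoint and query parameters below are hypothetical; use the ones you actually observe in DevTools), a direct API request can be built with the standard library:

```python
from urllib.request import Request

# Hypothetical endpoint discovered in DevTools → Network → XHR/Fetch
API_URL = "https://example.com/api/v1/products?page={page}"

def build_api_request(page):
    """Build a direct request to the JSON API instead of rendering the page."""
    return Request(
        API_URL.format(page=page),
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "application/json",
        },
    )

# Usage sketch:
# import json
# from urllib.request import urlopen
# data = json.loads(urlopen(build_api_request(1)).read())
```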
Bypassing captcha: automated solutions
Even with proper proxies and headers, you may encounter captcha. There are several approaches to solving it:
Captcha types and solving methods
reCAPTCHA v2 ("I'm not a robot" checkbox)
Solved via recognition services: 2Captcha, Anti-Captcha, CapMonster. Cost: $1-3 per 1000 solutions. Solution time: 10-30 seconds.
reCAPTCHA v3 (invisible, score-based)
More complex. Analyzes user behavior and assigns a score from 0 to 1. Bypass: using headless browsers with proper fingerprint + imitating user actions (mouse movement, clicks).
hCaptcha
Analog of reCAPTCHA, used on many sites. Solved via the same recognition services. Cost: $0.5-2 per 1000 solutions.
Cloudflare Challenge
JavaScript-challenge that checks the browser. Bypass: using specialized libraries (cloudscraper for Python, cloudflare-scraper for Node.js) or services (FlareSolverr).
Integrating captcha recognition service
Example of 2Captcha integration in Python:
```python
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_API_KEY')

try:
    # Solve reCAPTCHA v2
    result = solver.recaptcha(
        sitekey='6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-',
        url='https://example.com'
    )
    # Get the solution token
    captcha_token = result['code']
    # Submit the form with the token
    response = requests.post('https://example.com/submit', data={
        'g-recaptcha-response': captcha_token
    })
except Exception as e:
    print(f"Captcha solving error: {e}")
```
Important: Solving captcha slows down scraping by 10-30 times and increases costs. Use it only when other methods don't work. First try improving proxies, fingerprint, and delays.
Rate limiting: how not to exceed site limits
Many sites have explicit or implicit limits on the number of requests. Exceeding these limits leads to temporary or permanent IP blocking.
Determining site limits
Pay attention to HTTP headers in server responses:
- X-RateLimit-Limit — maximum number of requests per period
- X-RateLimit-Remaining — how many requests remain
- X-RateLimit-Reset — when the limit resets (Unix timestamp)
- Retry-After — how many seconds to wait before retrying
If you receive status code 429 (Too Many Requests), the limit has been exceeded. Read the Retry-After header and wait the specified time before the next request.
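A small helper for honoring this might look like the sketch below (the 60-second fallback is an assumption; note that Retry-After can also be an HTTP date, which this sketch doesn't parse):

```python
def wait_for_retry(status_code, headers, default_wait=60.0):
    """Return how many seconds to wait before retrying.

    Honors the Retry-After header on 429; the 60-second fallback is
    an assumption, not a standard value.
    """
    if status_code != 429:
        return 0.0
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            # Retry-After may also be an HTTP date; fall back rather than parse it
            return default_wait
    return default_wait

# Usage sketch with requests:
# wait = wait_for_retry(resp.status_code, resp.headers)
# if wait:
#     time.sleep(wait)
```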
Implementing rate limiter
Create a request rate control mechanism:
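A minimal sliding-window limiter sketch (the 30-requests-per-minute default is illustrative, not tuned to any particular site; the clock and sleep functions are injectable so the logic can be tested without actually waiting):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most max_requests per `period` seconds."""

    def __init__(self, max_requests=30, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.period = period
        self._timestamps = deque()
        self._clock = clock
        self._sleep = sleep

    def _prune(self, now):
        # Drop request timestamps that have left the sliding window
        while self._timestamps and now - self._timestamps[0] >= self.period:
            self._timestamps.popleft()

    def acquire(self):
        """Block until a request may be sent, then record it."""
        now = self._clock()
        self._prune(now)
        if len(self._timestamps) >= self.max_requests:
            # Wait until the oldest request exits the window
            self._sleep(self.period - (now - self._timestamps[0]))
            self._prune(self._clock())
        self._timestamps.append(self._clock())

# limiter = RateLimiter(max_requests=30, period=60)
# limiter.acquire()  # call before every HTTP request
```

Combine this with the randomized delays above: the limiter caps the overall request rate, while the random pauses break up the interval pattern.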