Protection Against Blocking When Making Mass Requests: Techniques and Tools
Account and IP address blocking is the main problem when scraping, automating, and performing mass operations on social media. Modern anti-bot systems analyze dozens of parameters: from request frequency to browser fingerprints. In this guide, we will explore specific mechanisms of automation detection and practical ways to bypass them.
Automation Detection Mechanisms
Modern protection systems use multi-level analysis to identify bots. Understanding these mechanisms is critically important for choosing the right bypass strategy.
Key Analysis Parameters
IP Reputation: Anti-bot systems check the history of the IP address, its affiliation with data centers, and its presence on blacklists. IPs from known proxy pools are blocked more frequently.
Request Frequency: A human physically cannot send 100 requests per minute. Systems analyze not only the total number but also the distribution over time—uniform intervals between requests reveal a bot.
Behavior Patterns: Sequence of actions, scroll depth, mouse movements, time spent on the page. A bot that instantly clicks links without delays is easily recognized.
Technical Fingerprints: User-Agent, HTTP headers, header order, TLS fingerprint, Canvas/WebGL fingerprinting. Inconsistencies in these parameters are a red flag for anti-bot systems.
| Parameter | What is Analyzed | Risk of Detection |
|---|---|---|
| IP Address | Reputation, ASN, geolocation | High |
| User-Agent | Browser version, OS, device | Medium |
| TLS Fingerprint | Cipher suite, extensions | High |
| HTTP/2 Fingerprint | Header order, settings | High |
| Canvas/WebGL | Graphics rendering | Medium |
| Behavior | Clicks, scrolling, time | High |
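Before investing in countermeasures, it is worth seeing what a bare HTTP client actually reveals. A minimal check (using the public echo service httpbin.org purely for illustration):
import requests

# httpbin.org echoes the request headers back, which makes it easy to compare
# a bare client with a real browser
response = requests.get('https://httpbin.org/headers', timeout=10)
print(response.json()['headers'])
# The default User-Agent ("python-requests/2.x") is an immediate giveaway,
# before an anti-bot system even looks at TLS or behavioral signals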
Rate Limiting and Request Frequency Control
Controlling the speed of requests is the first line of defense against blocks. Even with proxy rotation, overly aggressive scraping will lead to bans.
Dynamic Delays
Fixed intervals (e.g., exactly 2 seconds between requests) are easily recognized. Use random delays with a normal distribution:
import time
import random
import numpy as np
def human_delay(min_delay=1.5, max_delay=4.0, mean=2.5, std=0.8):
"""
Generate a delay with a normal distribution
simulating human behavior
"""
delay = np.random.normal(mean, std)
# Limit the range
delay = max(min_delay, min(delay, max_delay))
# Add micro-delays for realism
delay += random.uniform(0, 0.3)
time.sleep(delay)
# Usage
for url in urls:
response = session.get(url)
human_delay(min_delay=2, max_delay=5, mean=3, std=1)
Adaptive Rate Limiting
A more advanced approach is to adapt the speed based on server responses. If you receive 429 (Too Many Requests) or 503 codes, automatically reduce the pace:
class AdaptiveRateLimiter:
def __init__(self, initial_delay=2.0):
self.current_delay = initial_delay
self.min_delay = 1.0
self.max_delay = 30.0
self.error_count = 0
def wait(self):
time.sleep(self.current_delay + random.uniform(0, 0.5))
def on_success(self):
# Gradually speed up on successful requests
self.current_delay = max(
self.min_delay,
self.current_delay * 0.95
)
self.error_count = 0
def on_rate_limit(self):
# Sharply slow down on blocking
self.error_count += 1
self.current_delay = min(
self.max_delay,
self.current_delay * (1.5 + self.error_count * 0.5)
)
print(f"Rate limit hit. New delay: {self.current_delay:.2f}s")
# Application
limiter = AdaptiveRateLimiter(initial_delay=2.0)
for url in urls:
limiter.wait()
response = session.get(url)
if response.status_code == 429:
limiter.on_rate_limit()
time.sleep(60) # Pause before retrying
elif response.status_code == 200:
limiter.on_success()
else:
# Handle other errors
pass
Practical Tip: The optimal speed varies for different sites. Large platforms (Google, Facebook) tolerate 5-10 requests per minute from one IP. Smaller sites may block at 20-30 requests per hour. Always start conservatively and gradually increase the load while monitoring the error rate.
Proxy Rotation and IP Address Management
Using a single IP address for mass requests guarantees blocking. Proxy rotation distributes the load and reduces the risk of detection.
Rotation Strategies
1. Request-based Rotation: Change the IP after every request or after every N requests. Suitable for scraping search engines, where the anonymity of each individual request matters.
2. Time-based Rotation: Change IP every 5-15 minutes. Effective for working with social networks, where session stability is important.
3. Sticky Sessions: Use one IP for the entire user session (authorization, sequence of actions). Critical for sites with CSRF protection.
import time
import requests
from itertools import cycle
class ProxyRotator:
def __init__(self, proxy_list, rotation_type='request', rotation_interval=10):
"""
rotation_type: 'request' (every request) or 'time' (by time)
rotation_interval: number of requests or seconds
"""
self.proxies = cycle(proxy_list)
self.current_proxy = next(self.proxies)
self.rotation_type = rotation_type
self.rotation_interval = rotation_interval
self.request_count = 0
self.last_rotation = time.time()
def get_proxy(self):
if self.rotation_type == 'request':
self.request_count += 1
if self.request_count >= self.rotation_interval:
self.current_proxy = next(self.proxies)
self.request_count = 0
print(f"Rotated to: {self.current_proxy}")
elif self.rotation_type == 'time':
if time.time() - self.last_rotation >= self.rotation_interval:
self.current_proxy = next(self.proxies)
self.last_rotation = time.time()
print(f"Rotated to: {self.current_proxy}")
return {'http': self.current_proxy, 'https': self.current_proxy}
# Example usage
proxy_list = [
'http://user:pass@proxy1.example.com:8000',
'http://user:pass@proxy2.example.com:8000',
'http://user:pass@proxy3.example.com:8000',
]
rotator = ProxyRotator(proxy_list, rotation_type='request', rotation_interval=5)
for url in urls:
proxies = rotator.get_proxy()
response = requests.get(url, proxies=proxies, timeout=10)
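The rotator above covers strategies 1 and 2. For sticky sessions (strategy 3), the simplest approach is to bind one proxy to one logical user session and keep it for the session's lifetime. A minimal sketch under that assumption (pool handling is intentionally simplified):
class StickySessionPool:
    """One proxy per logical user session, held until the session ends."""
    def __init__(self, proxy_list):
        self.available = list(proxy_list)
        self.assigned = {}  # session_id -> proxy

    def acquire(self, session_id):
        if session_id not in self.assigned:
            # Hand out the next free proxy; a real pool would handle exhaustion
            self.assigned[session_id] = self.available.pop(0)
        proxy = self.assigned[session_id]
        return {'http': proxy, 'https': proxy}

    def release(self, session_id):
        proxy = self.assigned.pop(session_id, None)
        if proxy:
            self.available.append(proxy)

# One account = one session = one IP for the whole authorization flow
pool = StickySessionPool(proxy_list)
session = requests.Session()
response = session.get(
    'https://example.com/login',
    proxies=pool.acquire('account_42'),
    timeout=10
)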
Choosing the Type of Proxy
| Proxy Type | Trust Level | Speed | Usage |
|---|---|---|---|
| Data Centers | Low | High | Simple scraping, API |
| Residential | High | Medium | Social networks, protected sites |
| Mobile | Very High | Medium | Instagram, TikTok, anti-fraud |
For mass operations on social media and platforms with serious protection, use residential proxies. They appear as regular home connections and rarely get blacklisted. Data center proxies are suitable for less protected resources where speed is important.
Browser Fingerprinting and TLS Fingerprints
Even with IP rotation, you can be identified by technical fingerprints of the browser and TLS connection. These parameters are unique to each client and difficult to spoof.
TLS Fingerprinting
When establishing an HTTPS connection, the client sends a ClientHello message listing its supported cipher suites and extensions. This combination differs between client implementations: for example, Python's requests library goes through OpenSSL, whose fingerprint is easily distinguishable from Chrome's.
Problem: Standard libraries (requests, urllib, curl) have fingerprints different from real browsers. Services like Cloudflare, Akamai, DataDome actively use TLS fingerprinting to block bots.
Solution: Use libraries that mimic browser TLS fingerprints. For Python, this includes curl_cffi, tls_client, or playwright/puppeteer for full browser emulation.
# Installation: pip install curl-cffi
from curl_cffi import requests
# Mimicking Chrome 110
response = requests.get(
'https://example.com',
impersonate="chrome110",
proxies={'https': 'http://proxy:port'}
)
# Alternative: tls_client
import tls_client
session = tls_client.Session(
client_identifier="chrome_108",
random_tls_extension_order=True
)
response = session.get('https://example.com')
HTTP/2 Fingerprinting
In addition to TLS, anti-bot systems analyze HTTP/2-level parameters: pseudo-header and header order, SETTINGS frame values, and stream priorities. Standard HTTP libraries do not reproduce the exact ordering used by Chrome or Firefox.
# Chrome's header order; the ':'-prefixed entries are HTTP/2 pseudo-headers,
# set by the HTTP/2 layer itself and shown here only to illustrate ordering
headers = {
':method': 'GET',
':authority': 'example.com',
':scheme': 'https',
':path': '/',
'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
'accept': 'text/html,application/xhtml+xml...',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
}
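Reproducing this ordering by hand is brittle, and the pseudo-headers cannot be set through a regular headers dict at all. In practice, impersonation libraries handle this layer: curl-impersonate (which curl_cffi wraps) replicates the chosen browser's HTTP/2 SETTINGS and header order along with its TLS fingerprint, so only headers that differ from the browser defaults need to be set manually:
from curl_cffi import requests as cffi_requests

# Impersonation covers both the TLS ClientHello and the HTTP/2 frames;
# add only the headers you actually need to override
response = cffi_requests.get(
    'https://example.com',
    impersonate="chrome110",
    headers={'accept-language': 'en-US,en;q=0.9'}
)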
Canvas and WebGL Fingerprinting
Browsers render graphics differently depending on the GPU, drivers, and OS. Sites use this to create a unique device fingerprint. When using headless browsers (Selenium, Puppeteer), it is important to mask signs of automation:
// Puppeteer: hiding headless mode
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const browser = await puppeteer.launch({
headless: true,
args: [
'--disable-blink-features=AutomationControlled',
'--no-sandbox',
'--disable-setuid-sandbox',
`--proxy-server=${proxyUrl}`
]
});
const page = await browser.newPage();
// Overriding navigator.webdriver
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => false,
});
});
Headers, Cookies, and Session Management
Proper handling of HTTP headers and cookies is critical for simulating a real user. Errors in these parameters are a common cause of blocks.
Required Headers
The minimum set of headers to simulate a Chrome browser:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Cache-Control': 'max-age=0',
}
session = requests.Session()
session.headers.update(headers)
Managing Cookies
Many sites set tracking cookies on the first visit and check for their presence on subsequent requests. The absence of cookies or discrepancies is a sign of a bot.
import requests
import pickle
class SessionManager:
def __init__(self, session_file='session.pkl'):
self.session_file = session_file
self.session = requests.Session()
self.load_session()
def load_session(self):
"""Load saved session"""
try:
with open(self.session_file, 'rb') as f:
cookies = pickle.load(f)
self.session.cookies.update(cookies)
except FileNotFoundError:
pass
def save_session(self):
"""Save cookies for reuse"""
with open(self.session_file, 'wb') as f:
pickle.dump(self.session.cookies, f)
def request(self, url, **kwargs):
response = self.session.get(url, **kwargs)
self.save_session()
return response
# Usage
manager = SessionManager('instagram_session.pkl')
response = manager.request('https://www.instagram.com/explore/')
Important: When rotating proxies, remember to reset cookies if they are tied to a specific IP. A mismatch between IP and cookies (e.g., cookies with US geolocation and IP from Germany) will raise suspicions.
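A minimal way to enforce this, assuming the ProxyRotator from the previous section: clear the cookie jar whenever the rotator hands out a new exit IP.
session = requests.Session()
last_proxy = None

for url in urls:
    proxies = rotator.get_proxy()
    if proxies['http'] != last_proxy:
        # New exit IP: drop cookies tied to the previous identity
        session.cookies.clear()
        last_proxy = proxies['http']
    response = session.get(url, proxies=proxies, timeout=10)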
Referer and Origin
The Referer and Origin headers indicate where the user came from. Their absence or incorrect values are a red flag.
# Correct sequence: main → category → product
session = requests.Session()
# Step 1: visit the main page
response = session.get('https://example.com/')
# Step 2: navigate to the category
response = session.get(
'https://example.com/category/electronics',
headers={'Referer': 'https://example.com/'}
)
# Step 3: view the product
response = session.get(
'https://example.com/product/12345',
headers={'Referer': 'https://example.com/category/electronics'}
)
Simulating Human Behavior
Technical parameters are only half the story. Modern anti-bot systems analyze behavioral patterns: how the user interacts with the page, how much time they spend, and how the mouse moves.
Scrolling and Mouse Movement
When using Selenium or Puppeteer, add random mouse movements and page scrolling:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import random
import time
def human_like_mouse_move(driver):
    """Random mouse movement across the page"""
    action = ActionChains(driver)
    for _ in range(random.randint(3, 7)):
        # Use small relative offsets: move_by_offset is cumulative, and large
        # jumps quickly leave the viewport (MoveTargetOutOfBoundsException)
        x = random.randint(-100, 100)
        y = random.randint(-80, 80)
        action.move_by_offset(x, y)
        action.pause(random.uniform(0.1, 0.3))
    try:
        action.perform()
    except Exception:
        pass  # ignore moves that end up outside the viewport edge
def human_like_scroll(driver):
"""Simulating natural scrolling"""
total_height = driver.execute_script("return document.body.scrollHeight")
current_position = 0
while current_position < total_height:
# Random scroll step
scroll_step = random.randint(100, 400)
current_position += scroll_step
driver.execute_script(f"window.scrollTo(0, {current_position});")
# Pause with variation
time.sleep(random.uniform(0.5, 1.5))
# Sometimes scroll back a bit (as people do)
if random.random() < 0.2:
back_scroll = random.randint(50, 150)
current_position -= back_scroll
driver.execute_script(f"window.scrollTo(0, {current_position});")
time.sleep(random.uniform(0.3, 0.8))
# Usage
driver = webdriver.Chrome()
driver.get('https://example.com')
human_like_mouse_move(driver)
time.sleep(random.uniform(2, 4))
human_like_scroll(driver)
Time on Page
Real users spend time on the page: reading content, looking at images. A bot that instantly clicks links is easily recognized.
def realistic_page_view(driver, url, min_time=5, max_time=15):
"""
Realistic page view with activity
"""
driver.get(url)
# Initial delay (loading and "reading")
time.sleep(random.uniform(2, 4))
# Scrolling
human_like_scroll(driver)
# Additional activity
total_time = random.uniform(min_time, max_time)
elapsed = 0
while elapsed < total_time:
action_choice = random.choice(['scroll', 'mouse_move', 'pause'])
if action_choice == 'scroll':
# Small scroll up/down
scroll_amount = random.randint(-200, 300)
driver.execute_script(f"window.scrollBy(0, {scroll_amount});")
pause = random.uniform(1, 3)
elif action_choice == 'mouse_move':
human_like_mouse_move(driver)
pause = random.uniform(0.5, 2)
else: # pause
pause = random.uniform(2, 5)
time.sleep(pause)
elapsed += pause
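A usage sketch, assuming the driver and helper functions defined above (URLs are placeholders):
driver = webdriver.Chrome()
pages = [
    'https://example.com/',
    'https://example.com/category/electronics',
]
for url in pages:
    # Each page gets a realistic dwell time with scrolling and mouse activity
    realistic_page_view(driver, url, min_time=8, max_time=20)
    time.sleep(random.uniform(1, 3))  # short pause before the next navigation
driver.quit()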
Navigation Patterns
Avoid suspicious patterns: direct transitions to deep pages, ignoring the main page, sequentially visiting all elements without skipping.
Good Practices:
- Start from the main page or popular sections
- Use the site's internal navigation instead of direct URLs
- Sometimes go back or navigate to other sections
- Vary the depth of viewing: do not always reach the end
- Add "errors": transitions to non-existent links, returns
Bypassing Cloudflare, DataDome, and Other Protections
Specialized anti-bot systems require a comprehensive approach. They use JavaScript challenges, CAPTCHA, and real-time behavior analysis.
Cloudflare
Cloudflare uses multiple layers of protection: Browser Integrity Check, JavaScript Challenge, CAPTCHA. To bypass basic protection, a correct TLS fingerprint and JavaScript execution are usually sufficient:
# Option 1: cloudscraper (automatic JS challenge solution)
import cloudscraper
scraper = cloudscraper.create_scraper(
browser={
'browser': 'chrome',
'platform': 'windows',
'desktop': True
}
)
response = scraper.get('https://protected-site.com')
# Option 2: undetected-chromedriver (for complex cases)
import undetected_chromedriver as uc
options = uc.ChromeOptions()
options.add_argument('--proxy-server=http://proxy:port')
driver = uc.Chrome(options=options)
driver.get('https://protected-site.com')
# Wait for the challenge to pass
time.sleep(5)
# Get cookies for requests
cookies = driver.get_cookies()
session = requests.Session()
for cookie in cookies:
session.cookies.set(cookie['name'], cookie['value'])
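Cloudflare ties its clearance cookie to the client fingerprint, so the follow-up requests session should at least reuse the browser's exact User-Agent (and ideally the same exit IP):
# Reuse the exact User-Agent the challenge was solved with
user_agent = driver.execute_script("return navigator.userAgent")
session.headers.update({'User-Agent': user_agent})
response = session.get('https://protected-site.com/some-page')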
DataDome
DataDome analyzes user behavior in real time: mouse movements, typing patterns, and timings. Bypassing it generally requires a full browser with simulated activity:
from playwright.sync_api import sync_playwright
import random
import time
def bypass_datadome(url, proxy=None):
with sync_playwright() as p:
browser = p.chromium.launch(
headless=False, # DataDome detects headless
proxy={'server': proxy} if proxy else None
)
context = browser.new_context(
viewport={'width': 1920, 'height': 1080},
user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'
)
page = context.new_page()
# Inject scripts to mask automation
page.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => false});
window.chrome = {runtime: {}};
""")
page.goto(url)
# Simulating human behavior
time.sleep(random.uniform(2, 4))
# Random mouse movements
for _ in range(random.randint(5, 10)):
page.mouse.move(
random.randint(100, 1800),
random.randint(100, 1000)
)
time.sleep(random.uniform(0.1, 0.3))
# Scrolling
page.evaluate(f"window.scrollTo(0, {random.randint(300, 800)})")
time.sleep(random.uniform(1, 2))
content = page.content()
browser.close()
return content
CAPTCHA
For automatic CAPTCHA solving, use recognition services (2captcha, Anti-Captcha) or avoidance strategies:
- Reduce request frequency to a level that does not trigger CAPTCHA
- Use clean residential IPs with a good reputation
- Work through authorized accounts (they have a higher CAPTCHA threshold)
- Distribute the load over time (avoid peak hours)
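If avoidance is not enough, the recognition services mentioned above expose simple HTTP APIs. A sketch of the classic 2captcha flow for reCAPTCHA v2 (submit the sitekey, then poll for the token); the endpoints and parameters follow the service's public documentation, but verify them against the current docs for your task type:
import time
import requests

API_KEY = 'YOUR_2CAPTCHA_KEY'  # placeholder

def solve_recaptcha_v2(sitekey, page_url, timeout=180):
    """Submit a reCAPTCHA v2 task to 2captcha and poll until the token is ready."""
    submit = requests.post('https://2captcha.com/in.php', data={
        'key': API_KEY,
        'method': 'userrecaptcha',
        'googlekey': sitekey,
        'pageurl': page_url,
        'json': 1,
    }, timeout=30).json()
    if submit.get('status') != 1:
        raise RuntimeError(f"2captcha submit failed: {submit}")
    task_id = submit['request']

    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(5)  # solving usually takes 15-60 seconds
        result = requests.get('https://2captcha.com/res.php', params={
            'key': API_KEY,
            'action': 'get',
            'id': task_id,
            'json': 1,
        }, timeout=30).json()
        if result.get('status') == 1:
            return result['request']  # the g-recaptcha-response token
        if result.get('request') != 'CAPCHA_NOT_READY':
            raise RuntimeError(f"2captcha error: {result}")
    raise TimeoutError("CAPTCHA was not solved in time")
The returned token is then submitted in the page's g-recaptcha-response field (or passed to the site's verification endpoint, depending on the integration).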
Monitoring and Handling Blocks
Even with the best practices, blocks are inevitable. It is important to detect them quickly and handle them correctly.
Block Indicators
| Signal | Description | Action |
|---|---|---|
| HTTP 429 | Too Many Requests | Increase delays, change IP |
| HTTP 403 | Forbidden (IP ban) | Change proxy, check fingerprint |
| CAPTCHA | Verification required | Solve or reduce activity |
| Empty Response | Content not loading | Check JavaScript, cookies |
| Redirect to /blocked | Explicit blocking | Complete strategy change |
Retry System
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def create_session_with_retries():
"""
Session with automatic retries and error handling
"""
session = requests.Session()
retry_strategy = Retry(
total=5,
backoff_factor=2, # 2, 4, 8, 16, 32 seconds
status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET", "POST"]  # named method_whitelist in older urllib3 versions
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
return session
def safe_request(url, session, max_attempts=3):
"""
Request with block handling
"""
for attempt in range(max_attempts):
try:
response = session.get(url, timeout=15)
# Check for blocking
if response.status_code == 403:
print(f"IP blocked. Rotating proxy...")
# Logic for changing proxy
continue
elif response.status_code == 429:
wait_time = int(response.headers.get('Retry-After', 60))
print(f"Rate limited. Waiting {wait_time}s...")
time.sleep(wait_time)
continue
elif 'captcha' in response.text.lower():
print("CAPTCHA detected")
# Logic for solving CAPTCHA or skipping
return None
return response
except requests.exceptions.Timeout:
print(f"Timeout on attempt {attempt + 1}")
time.sleep(5 * (attempt + 1))
except requests.exceptions.ProxyError:
print("Proxy error. Rotating...")
# Change proxy
continue
return None
Logging and Analytics
Track metrics to optimize strategy:
import logging
from collections import defaultdict
from datetime import datetime
class ScraperMetrics:
def __init__(self):
self.stats = {
'total_requests': 0,
'successful': 0,
'rate_limited': 0,
'blocked': 0,
'captcha': 0,
'errors': 0,
'proxy_failures': defaultdict(int)
}
def log_request(self, status, proxy=None):
self.stats['total_requests'] += 1
if status == 200:
self.stats['successful'] += 1
elif status == 429:
self.stats['rate_limited'] += 1
elif status == 403:
self.stats['blocked'] += 1
if proxy:
self.stats['proxy_failures'][proxy] += 1
def get_success_rate(self):
if self.stats['total_requests'] == 0:
return 0
return (self.stats['successful'] / self.stats['total_requests']) * 100
def print_report(self):
print(f"\n=== Scraping Report ===")
print(f"Total requests: {self.stats['total_requests']}")
print(f"Success rate: {self.get_success_rate():.2f}%")
print(f"Rate limited: {self.stats['rate_limited']}")
print(f"Blocked: {self.stats['blocked']}")
print(f"CAPTCHA: {self.stats['captcha']}")
if self.stats['proxy_failures']:
print(f"\nProblematic proxies:")
for proxy, count in sorted(
self.stats['proxy_failures'].items(),
key=lambda x: x[1],
reverse=True
)[:5]:
print(f" {proxy}: {count} failures")
# Usage
metrics = ScraperMetrics()
session = create_session_with_retries()
for url in urls:
    response = safe_request(url, session)
    if response:
        # current_proxy is whatever IP the active rotation scheme is using
        # (e.g. the ProxyRotator shown earlier)
        metrics.log_request(response.status_code, current_proxy)
metrics.print_report()
Optimal Metrics: A success rate above 95% is an excellent result. 80-95% is acceptable, but there is room for improvement. Below 80%—reconsider your strategy: perhaps the rate limiting is too aggressive, the proxies are poor, or there are issues with fingerprinting.
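These thresholds are easy to wire into the ScraperMetrics class above so that a degrading run is caught early rather than only in the final report:
def check_health(metrics, min_success_rate=80.0, min_requests=50):
    """Warn once enough requests have accumulated for the rate to be meaningful."""
    if metrics.stats['total_requests'] < min_requests:
        return
    rate = metrics.get_success_rate()
    if rate < min_success_rate:
        print(f"WARNING: success rate {rate:.1f}% -- "
              f"slow down, rotate proxies, or review fingerprints")

# Call periodically inside the scraping loop, e.g. every 25 requests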
Conclusion
Protection against blocking during mass requests is not solved by any single trick. It requires combining sensible rate limiting, proxy rotation matched to the target, realistic TLS and HTTP fingerprints, careful session and cookie management, human-like behavior, and continuous monitoring of block indicators. Start conservatively, measure the success rate, and adjust only the layers that the target's anti-bot system actually checks.