Parsing product characteristics from marketplaces is a critically important task for sellers, analysts, and price aggregators. Wildberries, Ozon, Yandex.Market, and other platforms actively block automated data collection using advanced anti-bot systems. Without properly configured proxies, your parser will get banned after just 50-100 requests. In this article, we will discuss which types of proxies are suitable for parsing, how to set up IP rotation, and how to bypass the protection of the largest marketplaces.
Why Marketplaces Block Parsing and How It Works
Modern marketplaces lose millions of rubles due to parsing: competitors copy product descriptions, aggregators intercept traffic, and server load increases. Therefore, Wildberries, Ozon, Yandex.Market, and other platforms have implemented multi-layered protection against automated data collection.
How Marketplaces Identify Parsing:
- Request Frequency from One IP — if more than 100 requests per minute come from a single address, it is a clear sign of a bot. A regular user views 5-10 product cards in that time.
- Absence of JavaScript — simple parsers using requests or curl do not execute JS code that loads prices and characteristics. The site sees that content is requested without rendering.
- User-Agent and Headers — default headers from libraries (e.g., "python-requests/2.28.0") instantly reveal a bot. The absence of Accept-Language and Referer headers is also suspicious.
- Browser Fingerprint — advanced systems (Cloudflare, Kasada, DataDome) analyze Canvas, WebGL, fonts, and browser extensions. Headless browsers like Puppeteer are easily identified by the absence of certain parameters.
- Behavioral Patterns — a bot opens product cards at the same interval (e.g., exactly every 2 seconds), does not scroll the page, and does not move the mouse. This distinguishes it from a real person.
Consequences of Blocking: Temporary IP ban for 1-24 hours, CAPTCHA on every request, complete blocking of the data center IP range. For businesses, this means stopping data collection and losing competitive advantage.
Real Case: A price aggregator parsed Wildberries with 10 data center IPs, making 500 requests per hour from each. After 3 days, the entire /24 range received a permanent ban — they had to change the proxy provider and switch to residential IPs with rotation.
Comparison of Proxy Types for Product Parsing
Three main types of proxies are used for parsing product characteristics. Each has its advantages and limitations depending on the volume of data, budget, and speed requirements.
| Proxy Type | Speed | Ban Risk | Cost | When to Use |
|---|---|---|---|---|
| Data Center Proxies | High (50-200 ms) | High | Low | Parsing small volumes (up to 10,000 products/day), testing the parser |
| Residential Proxies | Medium (200-800 ms) | Low | High (based on traffic) | Parsing Wildberries, Ozon with bot protection, large volumes of data |
| Mobile Proxies | Medium (300-1000 ms) | Very Low | Very High | Parsing with maximum protection, bypassing strict blocks, critical projects |
Data Center Proxies are IP addresses of servers in data centers (AWS, Hetzner, OVH). They are fast and cheap, but marketplaces can easily identify them through ASN databases. Suitable for parsing small catalogs (up to 10,000 products per day) or platforms without serious protection. Cost: from $1-3 per IP per month.
Residential Proxies are IPs of home users obtained legally through SDKs in applications. Marketplaces perceive them as regular buyers. Ideal for parsing Wildberries, Ozon, Yandex.Market in large volumes. Cost: from $5-15 per 1 GB of traffic (approximately 10,000-30,000 requests).
Mobile Proxies are IPs of mobile operators (MTS, Beeline, MegaFon). The most reliable type for bypassing protection, but expensive and slow. Use only for critical tasks where blocking is unacceptable. Cost: from $50-150 per IP per month with rotation.
Residential or Data Center: What to Choose for Your Task
The choice of proxy type depends on three factors: the volume of parsing, the level of protection of the platform, and the budget. Let's discuss specific usage scenarios.
When Data Center Proxies Are Suitable
Scenario 1: Testing the Parser
You are developing a new parser and checking the data-extraction logic. You need to parse 100-500 products for debugging. In this case, residential proxies are overkill. Take 5-10 data center IPs and make 50-100 requests per hour from each. This is enough for testing without blocks.
Scenario 2: Parsing Platforms Without Protection
Small regional marketplaces, classified ads like Avito (in some categories), online stores on OpenCart often do not have serious anti-bot systems. Here, data centers work stably under moderate load (up to 200 requests per hour from an IP).
Scenario 3: Limited Budget and Small Volumes
If you need to parse 5,000-10,000 products per day and the budget is limited, try data centers with aggressive rotation (changing IP every 50-100 requests). Yes, there will be more blocks, but with the right retry logic setup (repeating the request with a new IP), it works.
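The retry logic mentioned above can be sketched in Python — a minimal example, assuming a hypothetical pool of data center proxy URLs (the addresses and credentials are placeholders) and the requests library:

```python
import random
import requests

# Hypothetical pool of data center proxy URLs (credentials are placeholders)
PROXY_POOL = [
    "http://user:pass@192.0.2.1:8080",
    "http://user:pass@192.0.2.2:8080",
    "http://user:pass@192.0.2.3:8080",
]

def fetch_with_retry(url, max_retries=3):
    """Retry the request through a fresh random proxy on every failure."""
    last_error = None
    for _ in range(max_retries):
        proxy = random.choice(PROXY_POOL)
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
            if resp.status_code == 200:
                return resp
            last_error = f"HTTP {resp.status_code}"  # e.g. 403/429 -> try another IP
        except requests.RequestException as exc:
            last_error = str(exc)
    raise RuntimeError(f"all {max_retries} attempts failed: {last_error}")
```

In a real parser you would also blacklist proxies that fail repeatedly instead of picking purely at random.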
When Residential Proxies Are Needed
Scenario 1: Parsing Wildberries and Ozon
These platforms use Cloudflare, DataDome, and their own anti-bot systems. From data centers, you will get CAPTCHA or a ban after 20-50 requests. Residential proxies with rotation every 5-10 minutes allow you to parse hundreds of thousands of products without problems. One client parsed the entire Wildberries catalog (20+ million products) in a week using a pool of 1,000 residential IPs.
Scenario 2: Parsing with Authorization
Some product characteristics (wholesale prices, stock levels) are only available to authorized users. If you are parsing through an account, using data centers will lead to account blocking. Residential proxies simulate the actions of a real user, reducing the risk of a ban.
Scenario 3: Geo-Targeting
Prices and availability of products on Wildberries, Ozon, and Yandex.Market depend on the user's region. To collect data for Moscow, St. Petersburg, and Yekaterinburg simultaneously, you need residential proxies with city-level selection. Data center proxies do not allow precise control over geolocation.
Formula for Choosing Proxy Type:
- Volume < 10,000 products/day + no strict protection = data centers
- Volume > 10,000 products/day + Wildberries/Ozon = residential
- Parsing with authorization + risk of account ban = residential
- Need geo-targeting by cities in Russia = residential
- Critical project + zero tolerance for blocks = mobile
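The formula above can be encoded as a small helper function. This is a sketch: the function name, flags, and the 10,000-products threshold simply restate the rules from the list, and the strictest requirement wins.

```python
def choose_proxy_type(daily_volume, strict_protection=False, uses_account=False,
                      needs_geo=False, zero_block_tolerance=False):
    """Encode the proxy-selection rules of thumb; the strictest requirement wins."""
    if zero_block_tolerance:
        return "mobile"          # critical project, blocks unacceptable
    if strict_protection or uses_account or needs_geo or daily_volume > 10_000:
        return "residential"     # Wildberries/Ozon, accounts, geo-targeting
    return "datacenter"          # small volumes, weak protection
```

For example, `choose_proxy_type(5_000)` recommends data center proxies, while `choose_proxy_type(50_000, strict_protection=True)` recommends residential.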
Setting Up IP Rotation: Intervals and Strategies
IP rotation is the automatic change of the proxy server after a certain number of requests or time. Properly configuring rotation is key to stable parsing without blocks.
Types of Proxy Rotation
1. Time-based Rotation
The IP changes after a fixed interval: 5 minutes, 10 minutes, 30 minutes. This is the simplest method but not the most effective. If you make 200 requests in 5 minutes, and the platform's limit is 100 requests from an IP, you will still get banned.
When to Use: For residential proxies with low load (up to 50 requests per IP). For example, parsing Wildberries with an interval of 3-5 seconds between requests — rotating every 10 minutes will be optimal.
2. Request-based Rotation
The IP changes after N requests: 50, 100, 200. This is more precise than time-based rotation but requires tracking the request counter in the parser's code.
When to Use: For data centers and aggressive parsing. For example, if you know that Ozon blocks after 80 requests from an IP — set the rotation to every 70 requests as a buffer.
3. Per-request Rotation
Each request goes through a new IP. Maximum protection against blocks, but the most expensive strategy for residential proxies (traffic consumption increases due to establishing new connections).
When to Use: To bypass the strictest protections (Cloudflare in "Under Attack" mode), parsing with a high risk of account ban, collecting data from competitors who monitor parsing.
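All three strategies fit into one small rotator: it hands out the current proxy and switches to the next one after N requests or after a time limit, whichever comes first; setting max_requests=1 gives per-request rotation. A minimal Python sketch (the class and parameter names are illustrative, not a standard API):

```python
import itertools
import time

class ProxyRotator:
    """Hand out the current proxy; switch to the next one after
    max_requests requests or max_age seconds, whichever comes first.
    max_requests=1 gives per-request rotation."""

    def __init__(self, proxies, max_requests=50, max_age=600):
        self._pool = itertools.cycle(proxies)
        self.max_requests = max_requests
        self.max_age = max_age
        self._rotate()

    def _rotate(self):
        self.current = next(self._pool)
        self._count = 0
        self._started = time.monotonic()

    def get(self):
        """Proxy to use for the next request (rotates when a limit is hit)."""
        expired = time.monotonic() - self._started >= self.max_age
        if self._count >= self.max_requests or expired:
            self._rotate()
        self._count += 1
        return self.current
```

Usage: `rotator = ProxyRotator(proxy_list, max_requests=50, max_age=600)` and call `rotator.get()` before each request.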
Recommended Rotation Intervals for Popular Platforms
| Platform | Proxy Type | Rotation Interval | Delay Between Requests |
|---|---|---|---|
| Wildberries | Residential | Every 5-10 minutes or 50 requests | 2-4 seconds |
| Ozon | Residential | Every 7-12 minutes or 60 requests | 3-5 seconds |
| Yandex.Market | Residential | Every 10-15 minutes or 80 requests | 2-3 seconds |
| Avito (product category) | Data Centers | Every 15-20 minutes or 100 requests | 1-2 seconds |
| AliExpress | Residential | Every 3-5 minutes or 30 requests | 4-6 seconds |
Important Note: These figures are the result of testing in 2024. Marketplaces constantly update their protection, so it is recommended to start with conservative settings (fewer requests, longer delays) and gradually increase the load while monitoring the block percentage.
"Smart" Rotation Strategy
Instead of fixed intervals, use adaptive rotation based on server responses:
- HTTP 429 (Too Many Requests) — immediately change the IP and blacklist it for 30-60 minutes.
- HTTP 403 (Forbidden) or CAPTCHA — change the IP and increase the delay between requests by 50%.
- HTTP 503 (Service Unavailable) — the issue may not be the proxy but overload on the site itself. Pause for 30-60 seconds without changing the IP.
- Successful requests in a row > 100 — you can slightly reduce the delay or increase the number of requests before rotation.
This logic is implemented in the parser's code and can save up to 30-40% of proxy traffic by avoiding unnecessary rotations.
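A minimal sketch of this adaptive logic in Python. The class only decides what to do after each response; the surrounding parser is assumed to perform the actual rotation, blacklisting, and pausing (the action strings are illustrative names):

```python
class AdaptiveRotation:
    """Map response codes to actions, per the rules above.
    The caller performs the rotation/blacklisting/pausing itself."""

    def __init__(self, delay=3.0):
        self.delay = delay          # seconds between requests
        self.success_streak = 0

    def on_response(self, status):
        if status == 429:
            self.success_streak = 0
            return "rotate_and_blacklist"   # blacklist this IP for 30-60 min
        if status == 403:
            self.delay *= 1.5               # +50% delay after a ban/CAPTCHA
            self.success_streak = 0
            return "rotate"
        if status == 503:
            self.success_streak = 0
            return "pause_30_60s"           # site overload: wait, keep the same IP
        self.success_streak += 1
        if self.success_streak > 100:
            self.delay = max(1.0, self.delay * 0.9)  # cautiously speed up
            self.success_streak = 0
        return "continue"
```

Keeping the decision logic separate from the HTTP code makes it easy to tune the thresholds per platform.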
Bypassing Anti-Bot Systems of Wildberries, Ozon, and Yandex.Market
Modern marketplaces use multi-layered protection: from simple User-Agent checks to advanced browser fingerprinting. Proxies alone are not enough — a comprehensive bypass strategy is needed.
Level 1: Correct HTTP Headers
The minimum set of headers that your parser should send:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: ru-RU,ru;q=0.9,en;q=0.8
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0
Critical Points:
- User-Agent must match a real browser. Use current versions of Chrome 120+, Firefox 121+. Do not use old versions (Chrome 90) — this is a red flag.
- Accept-Language should be "ru-RU" for Russian platforms. If you parse with an "en-US" header, the site notices the mismatch (an IP from Russia but an English language preference).
- Sec-Fetch-* headers appeared in Chrome 76+ and are mandatory for modern sites. Their absence reveals an old parser.
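In Python, the header set above can be attached to a requests session so that every call sends it instead of the library's default "python-requests" headers (the commented product URL is a hypothetical placeholder):

```python
import requests

# The full header set from the list above
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "ru-RU,ru;q=0.9,en;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)  # replaces the python-requests defaults
# response = session.get("https://example.com/product/12345")  # hypothetical URL
```

A session also keeps cookies between requests, which is itself a sign of a "real" visitor.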
Level 2: Executing JavaScript
Wildberries and Ozon load prices, characteristics, and stock levels via JavaScript after the page loads. If your parser using requests/curl simply downloads HTML, it will receive an empty page or a placeholder.
Solution: Use headless browsers — Puppeteer (Node.js), Playwright (Python/Node.js), Selenium. They fully render the page, execute JS, and obtain the final HTML.
The Problem with Headless Browsers: Sites easily identify them by parameters such as navigator.webdriver === true, the absence of plugins, and characteristic Canvas dimensions. The detection rate of plain headless Chrome is about 80-90%.
Solution to the Problem: Use libraries for stealth mode:
- puppeteer-extra-plugin-stealth (Node.js) — masks Puppeteer as regular Chrome, patches 30+ fingerprint parameters.
- undetected-chromedriver (Python) — patched version of Selenium ChromeDriver that is not detected by most anti-bot systems.
- playwright-stealth (Python) — equivalent for Playwright with support for Firefox and WebKit.
Level 3: Bypassing Cloudflare and DataDome
Wildberries uses Cloudflare Bot Management, while Ozon uses DataDome. These systems analyze not only the IP and headers but also behavior: scrolling speed, mouse movements, and page load time.
Signs of Cloudflare Challenge: Instead of content, you see a "Checking your browser..." page with a 5-second delay. In the code, this is a JavaScript challenge that checks the browser.
How to Bypass:
- FlareSolverr — a proxy service that automatically solves Cloudflare Challenge. You send it the URL, and it returns cookies for bypassing. Works in 70-80% of cases.
- Playwright with Waiting — load the page in a headless browser, wait 10-15 seconds (while JS executes), extract cookies and use them in regular HTTP requests. Saves resources: the browser is only needed to obtain cookies, then parse via requests.
- Residential Proxies + Stealth Browser — this combination gives 95%+ successful bypasses. Cloudflare sees the real user's IP and the correct browser fingerprint.
Important: Cloudflare constantly updates its protection. What worked in December 2024 may not work in March 2025. Always have a backup plan: manual CAPTCHA solving through services like 2Captcha/AntiCaptcha or switching to the marketplace API (if available).
Level 4: Simulating User Behavior
Advanced anti-bot systems track behavioral patterns. A real user scrolls the page, moves the mouse, and sometimes goes back. A bot opens product cards at a perfect interval of 2.000 seconds.
How to Simulate:
- Randomizing Delays — instead of a fixed 3 seconds, use random.uniform(2.5, 5.0). Add occasional long pauses (15-30 seconds) to simulate the user getting distracted.
- Scrolling the Page — in Puppeteer/Playwright, add scrolling before data extraction: await page.evaluate(() => window.scrollBy(0, 500)).
- Mouse Movements — the ghost-cursor library for Puppeteer generates realistic cursor movement trajectories.
- Transitions via Search — do not open product cards directly by URL. First go to the homepage, perform a search, and click the product in the results. This looks natural.
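The delay randomization from the first point can be sketched as a small helper; the 5% "distraction" probability is an illustrative choice, not a tested constant:

```python
import random

def human_delay(base_min=2.5, base_max=5.0, distraction_chance=0.05):
    """Randomized pause between requests: usually base_min-base_max seconds,
    with an occasional long 15-30 s 'distraction' pause."""
    if random.random() < distraction_chance:
        return random.uniform(15.0, 30.0)
    return random.uniform(base_min, base_max)

# Between requests, instead of a fixed time.sleep(3):
# time.sleep(human_delay())
```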
Popular Parsing Tools with Proxy Support
For parsing product characteristics, you don't necessarily have to write code from scratch. There are ready-made tools with a visual interface, proxy support, and automatic bypass of protection.
Octoparse — No-Code Parser
Description: A desktop application for Windows/Mac with a visual parser builder. You click on elements of the page (product name, price, characteristics), and the program automatically creates extraction rules.
Proxy Support: Built-in. In the settings, you specify the list of proxies, and the program automatically rotates them. Supports HTTP, HTTPS, SOCKS5. There is integration with providers like Bright Data, Smartproxy.
Pros: No coding required, works with JavaScript sites, built-in task scheduler, export to Excel/CSV/JSON.
Cons: Paid subscription starting at $75/month, slower than Python code, limitations on the number of pages in the free version.
When to Use: For small projects (up to 50,000 products), if you are not a programmer or need a quick prototype.
ParseHub — Cloud Parser
Description: An alternative to Octoparse, but it works in the cloud. You configure the parser in a desktop application, and it runs on ParseHub's servers. Convenient for long tasks (parsing 100,000+ products).
Proxy Support: Only in paid plans (from $149/month). You can upload your list of proxies or use ParseHub's built-in residential IPs.
Pros: Does not overload your computer, automatic pagination handling, API for integration.
Cons: Expensive, slow support, difficulties with configuration for complex sites.
Scrapy (Python) — For Programmers
Description: A framework for creating parsers in Python. The most flexible and fastest option — you can parse millions of products per day. Requires intermediate knowledge of Python.
Proxy Support: Through middleware. Popular solutions: scrapy-rotating-proxies (rotation from a list), scrapy-proxy-pool (integration with provider APIs). Setup takes 10-15 minutes.
Pros: Free, very fast (asynchronous requests), full control over logic, large community.
Cons: Requires coding, difficulties with JavaScript sites (requires integration with Splash or Playwright).
When to Use: For serious projects with a volume of 100,000+ products per day, if you have a programmer on the team.
Apify — Marketplace of Ready-Made Parsers
Description: A platform with thousands of ready-made parsers (called "actors") for popular sites. There are ready-made solutions for Amazon, eBay, AliExpress. For Russian marketplaces, the selection is smaller, but you can order development.
Proxy Support: Built-in for all actors. Apify provides its own residential proxies (payment by traffic) or you can connect your own.
Pros: Ready-made solutions, cloud execution, API for automation, built-in proxies.
Cons: Expensive (from $49/month + proxy payment), dependence on the platform, limitations on customization.
Comparison of Tools
| Tool | Is Code Required? | Price | Speed | For Whom |
|---|---|---|---|---|
| Octoparse | No | From $75/month | Average | Marketers, analysts without programming |
| ParseHub | No | From $149/month | Average | The same audience, but with cloud execution |
| Scrapy | Yes (Python) | Free | Very High | Programmers, large volumes of data |
| Apify | No (ready-made actors) | From $49/month + traffic | High | Business, needs ready-made solutions |
| Puppeteer/Playwright | Yes (JS/Python) | Free | Average (heavy browsers) | Programmers, complex JS sites |
Step-by-Step Proxy Setup in the Parser
Let's consider practical proxy setup using popular tools. These instructions are suitable for parsing any marketplaces, not just Russian ones.
Setup in Octoparse
Step 1: Open Octoparse and create a new parsing task. Enter the URL of the starting page (for example, a product category on Wildberries).
Step 2: Go to the menu "Settings" → "Advanced Settings" → "Proxy". Select "Use custom proxy".
Step 3: Add proxies in the format:
http://username:password@proxy-server.com:8080
socks5://username:password@proxy-server.com:1080
Step 4: Enable the "Rotate proxy" option and set the rotation interval. For Wildberries, it is recommended to "Rotate on every 50 requests" or "Rotate every 10 minutes".
Step 5: Click "Test Proxy" — Octoparse will check the availability of each proxy. Remove non-working ones from the list.
Step 6: In the "Speed" section, set the delay between requests: 2-4 seconds for residential proxies, 3-5 seconds for data centers.
Setup in Scrapy (Python)
Step 1: Install the library for rotating proxies:
pip install scrapy-rotating-proxies
Step 2: Create a file proxies.txt with the list of proxies (one per line):
http://user:pass@1.2.3.4:8080
http://user:pass@5.6.7.8:8080
socks5://user:pass@9.10.11.12:1080
Step 3: In the settings.py file of your Scrapy project, add:
ROTATING_PROXY_LIST_PATH = 'proxies.txt'
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
# Delay between requests (in seconds)
DOWNLOAD_DELAY = 3
# Randomize delay (±50%)
RANDOMIZE_DOWNLOAD_DELAY = True
# Concurrent requests (no more than 16 for residential proxies)
CONCURRENT_REQUESTS = 8
Step 4: Scrapy will automatically rotate proxies with each request. If a proxy returns an error (HTTP 403, 429, timeout), it is marked as "bad" and temporarily excluded from rotation.
Setup in Puppeteer (Node.js)
Step 1: Install Puppeteer and the plugin for stealth mode:
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Step 2: Create a script with proxy support:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const proxyList = [
  'http://user:pass@proxy1.com:8080',
  'http://user:pass@proxy2.com:8080'
];
let currentProxyIndex = 0;

async function scrapeWithProxy(url) {
  // Round-robin over the proxy pool
  const proxy = new URL(proxyList[currentProxyIndex]);
  currentProxyIndex = (currentProxyIndex + 1) % proxyList.length;

  // Chrome ignores credentials embedded in --proxy-server,
  // so pass only scheme://host:port and authenticate separately below
  const browser = await puppeteer.launch({
    headless: true,
    args: [`--proxy-server=${proxy.protocol}//${proxy.host}`]
  });
  const page = await browser.newPage();

  // Proxy authorization (if the proxy requires it)
  await page.authenticate({
    username: proxy.username,
    password: proxy.password
  });

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Data extraction (the selectors are site-specific)
  const data = await page.evaluate(() => ({
    title: document.querySelector('.product-title')?.innerText,
    price: document.querySelector('.product-price')?.innerText
    // Add more fields as needed
  }));

  await browser.close();
  return data;
}
Step 3: Call the scrapeWithProxy function with the target URL to start scraping.