
Data Collection for Machine Learning Datasets: How to Scrape Thousands of Pages Without Blocks and CAPTCHAs

We discuss how to collect large volumes of data for ML datasets without bans and CAPTCHAs: which proxies to choose and how to structure the process.

📅 March 14, 2026

The quality of an ML model directly depends on the quality and volume of training data. However, as soon as you start collecting thousands of pages, websites begin to block requests, show CAPTCHAs, and ban IPs. In this article, we will discuss how to build a reliable data collection pipeline for datasets: which tools to use, how to bypass protections, and which type of proxy is suitable for each task.

Why Websites Block Data Collection and What to Do About It

When you launch automated data collection, the website sees not an ordinary user but a stream of requests from a single IP address. This immediately raises red flags for protection systems such as Cloudflare, DataDome, PerimeterX, and other anti-bot solutions. The result: a CAPTCHA, a temporary block, or a complete IP ban.

The problem is particularly acute for ML projects because a dataset requires not 100 pages but tens of thousands. To train even a simple text classification model, you need at least 5,000–10,000 examples; for computer vision, hundreds of thousands of images. Collecting such a volume from a single IP is practically impossible.

Protection systems analyze the following parameters:

  • Request Frequency: more than 10–20 requests per minute from a single IP is already suspicious.
  • User-Agent and Headers: standard scraper headers are easily recognized.
  • Absence of Cookies and Session Data: a real browser always carries history.
  • IP Geolocation: a data center IP from the Netherlands on a Russian-language site looks suspicious.
  • Behavior Pattern: a person reads a page for 30–60 seconds; a bot, 0.3 seconds.

The solution is a combination of the right proxies, IP rotation, and mimicking real user behavior. Let's break down each element in more detail.

Where to Get Data for ML Datasets: Main Sources

Before discussing tools, it is important to understand where data for training models comes from. Sources are divided into several categories, and each requires its own approach.

Open Datasets (No Scraping Required)

The first thing to check is existing open datasets. Kaggle, Hugging Face Datasets, Google Dataset Search, and the UCI Machine Learning Repository host thousands of ready-made datasets. If your task is standard (text classification, object recognition, sentiment analysis), a suitable dataset may already exist. This saves weeks of work.

Web Scraping (Requires Proxies)

When ready-made data is unavailable or does not fit your specifications, you need scraping. Typical tasks include:

  • Collecting reviews from Wildberries, Ozon, Yandex.Market for sentiment analysis
  • Scraping news sites for training language models
  • Collecting product images for computer vision models
  • Scraping job postings from hh.ru, SuperJob for HR models
  • Collecting price data from marketplaces for forecasting models
  • Scraping social networks (VK, Twitter/X) for NLP tasks

API Platforms (Partially Closed)

Some platforms provide official APIs: the Twitter/X API, Reddit API, Google Places API. The problem: they are expensive, have limits, and often do not provide the required volume of data for free. Therefore, many ML teams combine APIs with scraping.

Synthetic Data

A separate approach is generating synthetic data with GPT-4 or other LLMs. However, real data is still needed as a seed (few-shot examples), so scraping remains the primary data collection tool for most ML projects.

Tools for Data Collection Without Coding

The good news is that you do not need to be a developer to collect data for ML datasets. There are ready-made no-code and low-code tools that can work with proxies and bypass basic protections.

No-Code Scrapers

| Tool | Suitable For | Proxy Support | Difficulty |
|---|---|---|---|
| Octoparse | Websites, tables, pagination | ✅ Yes | Low |
| ParseHub | Dynamic websites (JS) | ✅ Yes | Low |
| Apify | Ready-made actors for 100+ websites | ✅ Built-in | Medium |
| Bright Data IDE | Complex protected websites | ✅ Built-in | Medium |
| Scrapy Cloud | Large-scale scraping | ✅ Via middleware | High |

For most ML data collection tasks, Octoparse or Apify is sufficient. Octoparse lets you set up a scraper visually in 20–30 minutes: you point at elements on the page, configure pagination, plug in proxies, and start collecting. The result is exported as CSV or JSON, ready to be used for training.

Apify is particularly convenient if you need to scrape popular platforms: they have ready-made "actors" for Instagram, Twitter/X, Amazon, Google Maps, LinkedIn, and dozens of other sites. You simply set the parameters and receive structured data.

Which Type of Proxy to Choose for ML Datasets

Choosing the right type of proxy is one of the key factors for success in data collection. A mistake here can be costly: either you will be blocked halfway, or you will overpay for unnecessary power. Let's break down three main types.

Residential Proxies β€” For Protected Websites

Residential proxies are IP addresses of real home users. For anti-bot systems, they are indistinguishable from regular visitors. This makes them ideal for scraping websites with serious protection: marketplaces (Wildberries, Ozon), social networks, news aggregators.

The main advantage for ML tasks: you can collect data with geographical tagging. If you are training a model on regional content, you choose proxies from the desired region of Russia or another country. This is especially important for geolocation classification tasks or analyzing regional dialects.

Mobile Proxies β€” For Social Networks and Mobile Platforms

Mobile proxies use IPs from mobile operators (4G/5G). They enjoy the highest level of trust with platforms, because one mobile IP is genuinely shared by hundreds of people at once (all subscribers behind one cell tower exit through the same IP). This means that even active data collection from a mobile IP looks normal.

Mobile proxies are especially needed if you are collecting data from VK, TikTok, or Instagram β€” platforms that aggressively block data center IPs.

Data Center Proxies β€” For Open Sources and Speed

Data center proxies are fast and cheap. They are not tied to real users, so protection systems recognize them more easily. But for many ML tasks this is enough: if you are scraping Wikipedia, open archives, GitHub, public APIs, or sites without serious protection, data center proxies perform excellently and cost significantly less.

How to Choose the Type of Proxy for Your ML Task:

  • Marketplaces (Wildberries, Ozon, Avito): residential proxies with rotation
  • Social Networks (VK, Instagram, TikTok): mobile proxies
  • News Sites, Forums, Wikipedia: data center proxies
  • Google Search, Yandex: residential or mobile proxies
  • Open Archives, Common Crawl: data center proxies

Practical Scenarios: Text, Images, Prices, Reviews

Let's analyze specific data collection scenarios for popular types of ML tasks, indicating sources, tools, and the required type of proxy.

Scenario 1: Review Dataset for Sentiment Analysis (NLP)

Task: collect 50,000 reviews with ratings from Wildberries to train a sentiment classification model.

Source: Wildberries, product reviews with 1–5 star ratings (the ratings already provide the labels).
Tool: Octoparse or a ready-made Python script with the requests library.
Proxy: Residential with rotation; Wildberries actively blocks data center IPs.
Collection Speed: one request every 3–5 seconds with pauses, roughly 50,000 reviews in 2–3 days.

What You Get: A CSV file with columns: review text, rating (1–5), product category, date. This is a ready-made training dataset; the labels are already embedded in the data.
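
As a rough illustration of this scenario, here is a minimal Python sketch. The review URL, JSON layout, and proxy addresses are hypothetical placeholders (Wildberries publishes no official reviews API); only the CSV layout matches the dataset described above.

```python
import csv
import io
import random
import time

FIELDS = ["text", "rating", "category", "date"]

def reviews_to_csv(reviews, fileobj):
    """Write review dicts into the CSV layout described above."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(reviews)

def fetch_reviews(product_ids, proxy_pool):
    """Collect reviews through rotating residential proxies.

    The URL and JSON layout below are hypothetical placeholders;
    adapt them to however you actually obtain the review pages.
    """
    import requests  # third-party: pip install requests

    collected = []
    for i, pid in enumerate(product_ids):
        proxy = proxy_pool[i % len(proxy_pool)]  # rotate the IP
        resp = requests.get(
            f"https://example.com/reviews/{pid}",  # placeholder URL
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        resp.raise_for_status()
        collected.extend(resp.json().get("reviews", []))  # assumed layout
        time.sleep(random.uniform(3, 5))  # the 3-5 s pause from above
    return collected

buf = io.StringIO()
reviews_to_csv(
    [{"text": "Great quality", "rating": 5, "category": "shoes", "date": "2026-01-10"}],
    buf,
)
print(buf.getvalue().splitlines()[0])  # the header row
```

The CSV writer is separated from the fetch loop so you can swap the fetch logic (no-code tool export, API, or scraper) without touching the dataset format.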

Scenario 2: Image Dataset for Computer Vision

Task: collect 100,000 images of products from several categories to train a classification model.

Source: Ozon and Yandex.Market, product photos with categories.
Tool: Apify (there are ready-made actors for e-commerce) or ParseHub.
Proxy: Residential proxies with geographical rotation across Russia.
Important: Download images through proxies, not directly; CDN servers can also block bulk downloads.

What You Get: Folders with images sorted by category, a structure directly accepted by ImageDataGenerator in Keras or DataLoader in PyTorch.
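
A minimal sketch of that folder-per-class layout plus a proxied download helper. The URL and proxy are placeholders, and `download_image` assumes the third-party requests library.

```python
from pathlib import Path

def image_path(root, category, image_id, ext="jpg"):
    """Build root/<category>/<id>.<ext>, the folder-per-class layout
    that Keras ImageDataGenerator and torchvision ImageFolder read."""
    return Path(root) / category / f"{image_id}.{ext}"

def download_image(url, dest, proxy):
    """Download one image THROUGH the proxy (sketch, not a full client)."""
    import requests  # third-party: pip install requests

    dest.parent.mkdir(parents=True, exist_ok=True)
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    resp.raise_for_status()
    dest.write_bytes(resp.content)

p = image_path("dataset", "sneakers", "12345")
print(p.as_posix())  # dataset/sneakers/12345.jpg
```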

Scenario 3: Text Corpus for Language Model

Task: collect a large corpus of Russian texts for fine-tuning a language model on a specific topic, for example legal texts or medical articles.

Source: Thematic forums, news sites, Habr, professional portals.
Tool: Scrapy Cloud or Octoparse for structured collection.
Proxy: Data center proxies with rotation; most text sites do not have strict protection, and speed is more important than anonymity.
Speed: With data center proxies, you can make 50–100 requests per minute and collect a million documents in a few days.
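
For this scenario, a sketch of a Scrapy `settings.py` fragment with throttling and retries. The proxy middleware shown (the third-party scrapy-rotating-proxies package) is one common option, not the only one, and the proxy URLs are placeholders.

```python
# settings.py sketch: throttling, retries, and one proxy-rotation option.
ROBOTSTXT_OBEY = True             # respect robots.txt
CONCURRENT_REQUESTS = 8           # modest parallelism for text sites
DOWNLOAD_DELAY = 1.0              # base delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay (0.5x to 1.5x)
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 503]     # retry rate-limit and overload responses

# Proxy rotation via the third-party scrapy-rotating-proxies package:
# DOWNLOADER_MIDDLEWARES = {
#     "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
#     "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
# }
# ROTATING_PROXY_LIST = [
#     "http://user:pass@198.51.100.1:8000",
#     "http://user:pass@198.51.100.2:8000",
# ]
```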

Scenario 4: Job Dataset for HR Model

Task: collect 200,000 job postings from hh.ru to train a recommendation or profession classification model.

Source: hh.ru. They have an official API, but with limits; for large volumes, scraping is needed.
Tool: Apify (there is an actor for hh.ru) or Octoparse.
Proxy: Residential proxies; hh.ru is well protected and blocks data center IPs.
What You Get: Structured data (job title, description, salary, requirements, region, industry), an excellent dataset for NLP and recommendation systems.

How to Avoid Blocks When Collecting Data in Bulk

Even with good proxies, you can get banned if you do not follow basic rules. Here are proven methods that help collect data reliably and without loss.

IP and Session Rotation

The most important rule: do not use one IP for thousands of requests. Set up rotation so that the IP changes every 10–50 requests. Most tools (Octoparse, Apify, Scrapy) support this out of the box when connecting a proxy pool.

Additionally, change session cookies along with the IP; this mimics a new user, not just an address change.
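
The "new IP every N requests" rule can be sketched as a small helper. The proxy pool here is hypothetical; a real pool comes from your provider's dashboard or API.

```python
PROXY_POOL = [  # hypothetical endpoints; substitute your provider's list
    "http://user:pass@198.51.100.1:8000",
    "http://user:pass@198.51.100.2:8000",
    "http://user:pass@198.51.100.3:8000",
]

def proxy_for_request(n, pool=PROXY_POOL, per_ip=25):
    """Pick the proxy for request number n, moving to the next IP
    every per_ip requests (25 sits inside the 10-50 range above)."""
    return pool[(n // per_ip) % len(pool)]

# Pair each IP switch with a fresh requests.Session() so cookies reset
# too, mimicking a brand-new visitor rather than one changing address.
print(proxy_for_request(0))   # first IP in the pool
print(proxy_for_request(25))  # second IP, after 25 requests
```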

Proper Delays Between Requests

Add random delays between requests: not a fixed 2 seconds, but a random 1 to 5 seconds. A fixed interval is easily detected as a bot pattern; randomness mimics human behavior.

For particularly protected sites, add longer pauses: after every 100 requests, take a break of 30–60 seconds. This reduces speed but drastically decreases the risk of blocking.
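
The delay scheme above can be sketched as a pure helper; the ranges mirror the 1–5 second and 30–60 second recommendations.

```python
import random

def next_delay(request_count, base=(1.0, 5.0), long_every=100,
               long_pause=(30.0, 60.0)):
    """Pick the pause before the next request: usually a random 1-5 s,
    but a 30-60 s break after every long_every-th request."""
    if request_count > 0 and request_count % long_every == 0:
        return random.uniform(*long_pause)
    return random.uniform(*base)

# Usage inside a scraping loop:
# for n, url in enumerate(urls, start=1):
#     fetch(url)
#     time.sleep(next_delay(n))
print(round(next_delay(1), 2))    # somewhere in 1.0-5.0
print(round(next_delay(100), 2))  # somewhere in 30.0-60.0
```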

Correct Request Headers

Set the User-Agent to a current browser (latest Chrome, Firefox). Add standard HTTP headers: Accept-Language, Accept-Encoding, Referer. The absence of these headers is a clear sign of a bot for most protection systems.
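
A plausible header set for such requests. The Chrome version string and Referer are examples; keep the version reasonably current and the Referer pointed at a believable page.

```python
# Headers that make a scripted request resemble a current browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"  # example version; keep current
    ),
    "Accept-Language": "ru-RU,ru;q=0.9,en-US;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://example.com/",  # placeholder; use a plausible page
}

# e.g. requests.get(url, headers=BROWSER_HEADERS, proxies=...)
print(sorted(BROWSER_HEADERS))
```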

Collecting During Off-Peak Hours

Launch bulk collection at night (from 02:00 to 06:00 Moscow time). During this window, traffic on websites is minimal and anti-bot systems are tuned less aggressively, so the same request rate attracts less attention.

Error Handling and Retries

Set up automatic handling of response codes:

  • 429 (Too Many Requests): increase the delay, change the IP, wait 5–10 minutes.
  • 403 (Forbidden): the IP is blocked; definitely change the proxy.
  • 503 (Service Unavailable): temporary server overload; retry in 1–2 minutes.
  • 200 with a CAPTCHA: a higher-quality proxy is needed (residential instead of data center).
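
These rules can be sketched as a small dispatch function. The action names and wait times are illustrative choices, not a library API.

```python
def handle_status(status, saw_captcha=False):
    """Map a response to a recovery action from the list above.
    Returns (action, wait_seconds)."""
    if status == 429:
        return ("rotate_proxy", 300)   # back off ~5 min and change IP
    if status == 403:
        return ("rotate_proxy", 0)     # IP is banned; switch immediately
    if status == 503:
        return ("retry", 90)           # transient overload; retry shortly
    if status == 200 and saw_captcha:
        return ("upgrade_proxy", 0)    # residential instead of data center
    return ("ok", 0)

print(handle_status(429))                    # ('rotate_proxy', 300)
print(handle_status(200, saw_captcha=True))  # ('upgrade_proxy', 0)
```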

Geographic Matching of Proxies and Site

Use proxies from the same country as the target site. If you scrape Wildberries, choose Russian IPs; if you collect data from a German site, you need German proxies. Mismatched geolocation is one of the most common blocking triggers.

Checklist: Setting Up a Data Collection Pipeline for ML

Use this checklist before launching any large-scale data collection for a dataset:

📋 Preparation

  • ☐ Check for an existing dataset on Kaggle / Hugging Face
  • ☐ Study the robots.txt of the target site
  • ☐ Determine the volume of data and structure of the dataset
  • ☐ Choose a scraping tool (Octoparse, Apify, Scrapy)
  • ☐ Choose the type of proxy for the task (residential / mobile / data center)

⚙️ Setup

  • ☐ Connect a proxy pool with IP rotation
  • ☐ Set the User-Agent (current Chrome/Firefox)
  • ☐ Add standard HTTP headers
  • ☐ Set random delays (1–5 seconds)
  • ☐ Set up error handling (429, 403, 503)
  • ☐ Specify the data export format (CSV, JSON, JSONL)

🧪 Testing

  • ☐ Run a test on 100–500 records
  • ☐ Check the quality and completeness of the data
  • ☐ Ensure there are no blocks on the test volume
  • ☐ Check the collection speed and calculate the time for the full dataset

🚀 Launch and Monitoring

  • ☐ Launch during nighttime (02:00–06:00 MSK)
  • ☐ Set up error notifications
  • ☐ Periodically check the quality of the collected data
  • ☐ Save intermediate results (checkpoint every 10,000 records)

🧹 Post-Processing

  • ☐ Remove duplicates
  • ☐ Clean HTML tags and special characters from texts
  • ☐ Check class balance (for classification tasks)
  • ☐ Split into train/validation/test sets
  • ☐ Save in a format compatible with your ML framework
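
The dedup, clean, and split steps above can be sketched in pure Python. The 80/10/10 ratios and the regex-based tag stripping are common defaults, not requirements.

```python
import random
import re

def clean_text(text):
    """Strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe_and_split(records, key="text", ratios=(0.8, 0.1, 0.1), seed=42):
    """Drop duplicates by `key`, shuffle deterministically, and split
    into train/validation/test according to `ratios`."""
    seen, unique = set(), []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = int(len(unique) * ratios[0])
    n_val = int(len(unique) * ratios[1])
    return (unique[:n_train],
            unique[n_train:n_train + n_val],
            unique[n_train + n_val:])

rows = [{"text": f"review {i}"} for i in range(10)] + [{"text": "review 0"}]
train, val, test = dedupe_and_split(rows)
print(len(train), len(val), len(test))  # 8 1 1
```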

Conclusion

Collecting data for ML datasets is not a one-time task, but a systematic process. The main takeaways from this article: the right choice of proxy determines whether you will reach the end or get stuck on blocks. Residential proxies are needed for protected marketplaces and aggregators, mobile proxies for social networks, and data center proxies for open text sources. Tools like Octoparse and Apify allow you to build a pipeline without coding. And adhering to basic rules (IP rotation, random delays, correct headers) enables you to collect hundreds of thousands of records without loss.

If you plan to collect data from marketplaces, news sites, or thematic portals for training ML models, we recommend starting with residential proxies β€” they provide the highest level of trust from protection systems and the lowest risk of blocks even during large-scale data collection.
