
Proxies for AI Training: How to Collect Millions of Data Points for Models Without Blocks

We discuss how to properly organize data collection for training AI models through proxies: from choosing the type of IP to configuring rotation and bypassing anti-bot systems.

📅 March 5, 2026

Training AI models requires vast amounts of data — texts, images, videos, and structured information from websites. The problem is that during mass scraping, websites quickly block IP addresses, flagging the activity as bot traffic. In this article, we discuss how to organize data collection through proxies properly, which type of IP to choose for different tasks, and how to set up infrastructure for stable operation.

Why proxies are needed for AI training

Modern language models like GPT, LLaMA, or Claude are trained on billions of tokens of text. Computer vision models require tens of millions of images. Recommendation systems analyze user behavior on thousands of websites. All this data needs to be sourced somewhere.

The main problem is that websites actively protect themselves against mass scraping. If you send 100+ requests per minute from a single IP, you will be blocked within 5-10 minutes. Reasons for blocking include:

  • Rate limiting: limiting the number of requests from a single IP (usually 10-60 requests per minute)
  • Anti-bot systems: Cloudflare, Akamai, PerimeterX analyze behavior and block suspicious activity
  • Geographical restrictions: some content is only available from specific countries
  • Protection from competitors: marketplaces and aggregators block mass collection of prices and products

Proxies solve this problem by distributing requests through thousands of different IP addresses. Instead of making 1000 requests from one IP, you make 1-2 requests from each of 500-1000 different addresses — this looks like activity from regular users.

Which type of proxy to choose for data collection

For AI training, three types of proxies are used, each with its own advantages and limitations. The choice depends on the data source, volume, and budget of the project.

| Proxy Type | Speed | Website Trust | Cost | When to Use |
|---|---|---|---|---|
| Datacenter | 100-1000 Mbps | Low | $0.5-2/IP | Open APIs, simple sites without protection |
| Residential | 10-50 Mbps | High | $5-15/GB | Social networks, sites with Cloudflare, e-commerce |
| Mobile | 5-30 Mbps | Very high | $10-30/GB | Mobile applications, strict protection |

Datacenter Proxies: Speed for Large Volumes

Datacenter proxies are IP addresses from servers in cloud providers (AWS, Google Cloud, Hetzner). The main advantage is speed and low cost. One datacenter IP can handle hundreds of requests per second.

They are suitable for data collection from sources that do not use aggressive protection: open APIs (GitHub, Wikipedia, Stack Overflow), government databases, news sites without Cloudflare, scientific publications. If a site serves data without JavaScript rendering and does not check the browser fingerprint — datacenter proxies will work.

The downside is that many sites maintain a blacklist of datacenter IPs. Instagram, Facebook, Google Search, and large marketplaces block datacenter IPs almost immediately. For such sources, residential proxies are needed.

Residential Proxies: Getting Past Strict Protection

Residential proxies use the IP addresses of real home users. To the website, such a request looks like a visit from a regular home connection. This makes it possible to get past Cloudflare and Akamai and to collect data from social networks and other protected platforms.

Residential proxies are necessary for:

  • Instagram, Facebook, Twitter/X (collecting posts, comments, profiles)
  • Google Search (scraping search results for NLP models)
  • Marketplaces: Amazon, eBay, Wildberries (products, reviews, prices)
  • Sites with geo-restrictions (content available only from certain countries)

The cost is higher — payment for traffic ($5-15 per GB). To save costs, use residential proxies only for critical sources, while scraping simple sites through datacenter proxies.

Mobile Proxies: For Mobile Applications

Mobile proxies use IP addresses from mobile operators (4G/5G). They are rarely needed — mainly for collecting data from mobile applications (TikTok, Instagram app, mobile games) or when the site distinguishes between mobile and desktop traffic.

The advantage of mobile IPs is that operators use CGNAT (one IP shared by hundreds of users), so websites rarely ban such addresses outright: a block would cut off real users along with the bot. However, for most AI training tasks, residential proxies are sufficient.

Types of Data Sources and Proxy Requirements

Different types of data require different approaches to proxies. Let's consider popular sources for training AI models.

Text Data for NLP Models

For training language models, texts are collected from news sites, forums, blogs, social networks, Wikipedia, and specialized resources. The volumes can reach tens of terabytes of text.

Proxy recommendations:

  • News sites and blogs: datacenter (speed is more important)
  • Forums like Reddit, Quora: residential (there is rate limiting)
  • Twitter, Facebook, Instagram: only residential with rotation every 5-10 minutes

A feature of text scraping is that you need to preserve the structure (headings, paragraphs, metadata). Use headless browsers (Puppeteer, Playwright) for JavaScript sites or simple HTTP clients (requests, axios) for static pages.
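For static pages, even the standard library is enough to keep the document structure intact. A minimal sketch using `html.parser` (the tag-to-field mapping here is illustrative; a real pipeline would cover more tags and metadata):

```python
from html.parser import HTMLParser

class StructuredExtractor(HTMLParser):
    """Collects headings and paragraphs with their tag names preserved."""
    def __init__(self):
        super().__init__()
        self._current = None  # tag we are currently inside
        self.blocks = []      # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in ('h1', 'h2', 'h3', 'p'):
            self._current = tag

    def handle_data(self, data):
        if self._current and data.strip():
            self.blocks.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

extractor = StructuredExtractor()
extractor.feed('<h1>Title</h1><p>First paragraph.</p><h2>Section</h2>')
print(extractor.blocks)
# [('h1', 'Title'), ('p', 'First paragraph.'), ('h2', 'Section')]
```

Storing `(tag, text)` pairs rather than bare text keeps the heading hierarchy, which matters when preparing training corpora.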

Images for Computer Vision

Training recognition models requires millions of labeled images. Sources include Google Images, Pinterest, Instagram, specialized photo stocks, and e-commerce sites (product photos).

The problem is that images are large (average size 200-500 KB), so traffic is consumed quickly. When using residential proxies (payment per GB), this is critical. Optimization strategy: first collect image URLs through residential proxies, then download the actual files through datacenter proxies or directly (if the CDN does not check the referrer).
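The two-phase strategy can be reduced to one routing decision per request. A sketch, where the CDN host list is a hypothetical example you would fill in for your own sources:

```python
from urllib.parse import urlparse

# Hypothetical CDN hosts that serve image files without checking the referrer;
# their files can be downloaded via cheap datacenter IPs
CDN_HOSTS = {'cdn.example-shop.com', 'images.example-shop.com'}

def pick_proxy_type(url: str, phase: str) -> str:
    """Phase 1 (discovering image URLs on protected pages) goes through
    residential IPs; phase 2 (downloading the files) uses datacenter IPs
    whenever the host is a known permissive CDN."""
    if phase == 'discover':
        return 'residential'
    host = urlparse(url).hostname
    return 'datacenter' if host in CDN_HOSTS else 'residential'

print(pick_proxy_type('https://example-shop.com/catalog', 'discover'))      # residential
print(pick_proxy_type('https://cdn.example-shop.com/img/1.jpg', 'download'))  # datacenter
```

Since the image files account for almost all of the traffic, moving just the download phase off residential proxies cuts the per-GB bill dramatically.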

Structured Data from E-commerce

Data about products, prices, and reviews are used to train recommendation systems and pricing models. Sources include Amazon, eBay, Wildberries, Ozon, AliExpress.

All major marketplaces use Cloudflare or their own anti-bot systems. Residential proxies with rotation are essential. Additionally, the correct browser fingerprint is important — use tools like puppeteer-extra-plugin-stealth to mask automation.

Video and Audio Data

YouTube, TikTok, podcast platforms are sources for training speech and video recognition models. The problem is the huge traffic (one video = hundreds of MB). For such tasks, residential proxies are economically unfeasible.

Solution: use residential proxies only for obtaining metadata and video links, and download through datacenter proxies or special tools like yt-dlp (which can bypass YouTube restrictions without proxies).

IP Rotation Strategies for Different Volumes

IP rotation is a key point for stable scraping. Incorrect configuration will lead either to blocks or overpayment for traffic.

Request-based Rotation (Rotating Proxies)

Each request goes through a new IP. Suitable for mass scraping of different sites when there is no need to maintain a session. For example, collecting texts from 10,000 different news sites — each site sees only 1-2 requests from one IP.

import requests

# Rotating proxy - each request uses a new IP
proxies = {
    'http': 'http://username:password@rotating.proxycove.com:12345',
    'https': 'http://username:password@rotating.proxycove.com:12345'
}

urls = ['https://site1.com', 'https://site2.com', ...]
for url in urls:
    response = requests.get(url, proxies=proxies)
    # Each request goes with a new IP
    parse_data(response.text)

The advantage is maximum protection against blocks. The downside is that this mode is unsuitable for sites that require authorization or persistent cookies.

Time-based Rotation (Sticky Sessions)

The IP is retained for 5-30 minutes, then changed. Suitable for scraping a single site with pagination, when you need to go through pages 1, 2, 3... while maintaining a session.

import requests
import random
import string
import time

# Sticky session - IP is retained for ~10 minutes.
# Many providers encode the session ID in the proxy username;
# the exact format is provider-specific.
session_id = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
proxies = {
    'http': f'http://username-session-{session_id}:password@sticky.proxycove.com:12345',
    'https': f'http://username-session-{session_id}:password@sticky.proxycove.com:12345',
}

# All requests for 10 minutes go from one IP
for page in range(1, 100):
    url = f'https://site.com/catalog?page={page}'
    response = requests.get(url, proxies=proxies)
    parse_page(response.text)
    time.sleep(2)  # delay between requests

Adjust the session time based on the site's rate limit. If the limit is 60 requests per minute, set the session to 1-2 minutes and make no more than 50 requests.

Pool of Static IPs

You receive a list of 100-1000 IPs and manage the distribution of requests yourself. Suitable for complex scenarios where full control is needed: parallel scraping of different sections of a site, load balancing, custom rotation logic.

import requests
from itertools import cycle

# Pool of 500 static IPs
ip_pool = [
    'http://user:pass@ip1.proxycove.com:12345',
    'http://user:pass@ip2.proxycove.com:12345',
    # ... 500 addresses
]

proxy_cycle = cycle(ip_pool)

for url in urls:
    proxy = next(proxy_cycle)  # get the next IP from the pool
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})
    parse_data(response.text)

This approach provides maximum flexibility but requires more code for error handling (if an IP is blocked, it needs to be excluded from the pool).
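That error handling can be wrapped into a small pool class. A sketch: round-robin rotation plus a failure counter that evicts an IP after repeated errors (the threshold of 3 is an arbitrary illustrative choice):

```python
import itertools

class ProxyPool:
    """Round-robin pool that drops an IP after repeated failures."""
    def __init__(self, proxies, max_failures=3):
        self.active = list(proxies)
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(self.active)

    def get(self):
        if not self.active:
            raise RuntimeError('all proxies in the pool are banned')
        return next(self._cycle)

    def mark_failed(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
        if self.failures[proxy] >= self.max_failures and proxy in self.active:
            self.active.remove(proxy)
            # rebuild the cycle so the banned IP is never returned again
            self._cycle = itertools.cycle(self.active)

pool = ProxyPool(['http://ip1:8080', 'http://ip2:8080'], max_failures=2)
pool.mark_failed('http://ip1:8080')
pool.mark_failed('http://ip1:8080')
print(pool.get())  # only http://ip2:8080 remains
```

In the scraping loop, a request exception translates into `pool.mark_failed(proxy)`, so dead IPs fall out of rotation automatically.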

Bypassing Anti-Bot Systems During Scraping

Proxies solve the problem of IP blocks, but modern websites analyze dozens of parameters to identify bots. Even with residential IPs, you can be blocked if the browser fingerprint indicates automation.

What Anti-Bot Systems Check

  • User-Agent: must match a real browser (Chrome, Firefox), should not contain the words "headless" or "bot"
  • Headers: the set of headers must be typical for a browser (Accept, Accept-Language, Accept-Encoding, Referer)
  • TLS fingerprint: SSL connection parameters differ between browsers and scripts
  • JavaScript fingerprint: WebGL, Canvas, AudioContext, fonts, plugins, screen resolution
  • Behavior: mouse movements, scrolling speed, clicks (for sites with JavaScript rendering)
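The first two checks are the easiest to satisfy from plain Python: send a header set that a real desktop browser would send. A sketch (the User-Agent strings are real Chrome 120 values; the list would be longer in practice):

```python
import random

# A couple of real desktop Chrome User-Agent strings (a short sample list)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

def browser_headers(referer=None):
    """Returns a header set typical for a real desktop browser."""
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
    }
    if referer:
        headers['Referer'] = referer
    return headers

h = browser_headers(referer='https://www.google.com/')
```

This helps only against header checks; TLS and JavaScript fingerprinting still require a real browser, as described below.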

Tools for Masking Automation

To bypass advanced protection, use headless browsers with masking plugins:

// Puppeteer with stealth plugin
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({
    headless: true,
    args: [
        // Chrome ignores credentials inside --proxy-server,
        // so only host:port goes here
        '--proxy-server=http://residential.proxycove.com:12345',
        '--disable-blink-features=AutomationControlled'
    ]
});

const page = await browser.newPage();
// Proxy credentials are supplied separately
await page.authenticate({ username: 'username', password: 'password' });
// Set a realistic viewport
await page.setViewport({ width: 1920, height: 1080 });

// Add random delays
await page.goto('https://protected-site.com');
await new Promise(r => setTimeout(r, 2000 + Math.random() * 3000));

const data = await page.evaluate(() => {
    return document.querySelector('.data').innerText;
});

await browser.close();

For Python, use Playwright with similar settings or Selenium with undetected-chromedriver — a library that automatically patches ChromeDriver to bypass detection.

Bypassing Cloudflare and Other WAFs

Cloudflare uses a JavaScript challenge to verify the browser. Simple HTTP clients (requests, axios) cannot pass it. Solutions include:

  • Headless Browser: Puppeteer/Playwright with stealth plugin can pass most challenges
  • Ready-made Solutions: libraries like cloudscraper (Python) or puppeteer-extra-plugin-recaptcha
  • Bypass Services: specialized APIs (FlareSolverr, Anti-Captcha) solve challenges for you

Important: even with the correct fingerprint, make pauses between requests. Sending 100 requests per second with a perfect browser fingerprint still looks suspicious. The optimal speed is 10-30 requests per minute from one IP.
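A fixed `sleep(3)` between requests is itself a detectable pattern, so it helps to jitter the interval around the target rate. A minimal sketch (the jitter width of ±50% is an illustrative choice):

```python
import random

def next_delay(target_rpm=20, jitter=0.5):
    """Delay in seconds between requests from one IP, with random jitter
    so the interval pattern does not look machine-perfect."""
    base = 60.0 / target_rpm  # e.g. 3 s for 20 requests per minute
    return base * (1 + random.uniform(-jitter, jitter))

delays = [next_delay(target_rpm=20) for _ in range(1000)]
```

With `target_rpm=20` each delay lands between 1.5 and 4.5 seconds while averaging 3 seconds, which keeps the per-IP rate inside the 10-30 requests per minute window mentioned above.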

Data Collection Infrastructure Architecture

When collecting data for AI training on an industrial scale, a well-thought-out architecture is necessary. A simple script on one server will not handle scraping terabytes of data.

Components of the Collection System

1. Task Queue

Stores a list of URLs for scraping. Use Redis, RabbitMQ, or AWS SQS. Allows distributing tasks among workers and reassigning failed tasks.
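Whatever the backend, it pays to deduplicate URLs before they enter the queue, otherwise workers waste proxy traffic refetching the same pages. A minimal in-memory sketch of that frontier logic (in production the seen-set would live in Redis alongside the queue):

```python
from collections import deque

class URLFrontier:
    """Queue of URLs that silently drops duplicates."""
    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def push(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        return self._queue.popleft() if self._queue else None

frontier = URLFrontier()
for url in ['https://site.com/1', 'https://site.com/2', 'https://site.com/1']:
    frontier.push(url)
# the duplicate of /1 is dropped; two URLs remain queued
```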

2. Workers

Processes that take tasks from the queue and perform scraping. Run 10-100 workers in parallel on different servers. Each worker uses its own proxy or proxy pool.

3. Data Storage

Where the collected data is stored. For texts — S3/MinIO (object storage). For structured data — PostgreSQL or MongoDB. For large volumes — data lake (AWS S3 + Athena, Google Cloud Storage).

4. Monitoring

Tracking scraping speed, error rate, traffic consumption. Use Grafana + Prometheus or ready-made solutions like Datadog. Set up alerts for critical metrics (error rate >10%, speed drops by 2 times).

Example Architecture in Python

# worker.py - scraping process
import redis
import requests
import json
from datetime import datetime

# Connect to Redis (task queue)
queue = redis.Redis(host='redis-server', port=6379)
# Proxy pool
proxies_pool = load_proxies_from_config()

while True:
    # Take a task from the queue
    task = queue.blpop('parsing_queue', timeout=5)
    if not task:
        continue
    
    url = task[1].decode('utf-8')
    proxy = get_next_proxy(proxies_pool)
    
    try:
        response = requests.get(
            url, 
            proxies={'http': proxy, 'https': proxy},
            timeout=30,
            headers={'User-Agent': get_random_user_agent()}
        )
        
        # Parse data
        data = parse_html(response.text)
        
        # Save to S3
        save_to_s3(data, f'data/{datetime.now().isoformat()}/{hash(url)}.json')
        
        # Log success
        log_success(url, proxy)
        
    except Exception as e:
        # On error, return the task to the queue
        queue.rpush('parsing_queue', url)
        log_error(url, proxy, str(e))
        mark_proxy_as_failed(proxy)

This architecture allows for horizontal scaling — simply add new servers with workers. If one worker fails, the others continue to operate.

Tools for Automating Collection

For industrial scraping, specialized frameworks are used that solve typical tasks out of the box.

Scrapy — Framework for Python

Scrapy is the most popular tool for web scraping in Python. It supports out of the box: parallel scraping (hundreds of requests simultaneously), automatic retries on errors, middleware for proxy and User-Agent rotation, exporting to JSON, CSV, XML, databases.

# settings.py - Scrapy configuration with proxies
ROTATING_PROXY_LIST = [
    'http://user:pass@proxy1.proxycove.com:12345',
    'http://user:pass@proxy2.proxycove.com:12345',
    # ... list of proxies
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Concurrency
CONCURRENT_REQUESTS = 100
DOWNLOAD_DELAY = 0.5  # delay between requests

Scrapy is suitable for static sites (HTML without JavaScript). For dynamic sites, use Scrapy + Splash (headless browser) or switch to Playwright.

Crawlee — Framework for Node.js

Crawlee (formerly Apify SDK) is an equivalent of Scrapy for JavaScript. The advantage is native integration with Puppeteer and Playwright, built-in proxy rotation, automatic queue management, adaptive scraping speed (slows down on errors).

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy1.proxycove.com:12345',
        'http://user:pass@proxy2.proxycove.com:12345',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxConcurrency: 50,
    requestHandler: async ({ page, request }) => {
        await page.waitForSelector('.data');
        const data = await page.$$eval('.item', items => 
            items.map(item => ({
                title: item.querySelector('h2').innerText,
                price: item.querySelector('.price').innerText
            }))
        );
        await saveData(data);
    },
});

await crawler.run(['https://site.com/catalog']);

Apache Nutch — For Large-Scale Crawling

If you need to collect data from the entire internet (like search engines), use Apache Nutch. It is a distributed crawler that runs on top of Hadoop. It can process petabytes of data, automatically discovers new pages through links, and supports crawling policies (robots.txt, sitemap.xml).

Nutch is more complex to set up but is indispensable for collecting Common Crawl-like datasets. For working with proxies, use the proxy-rotator plugin.

Optimizing Speed and Cost

Collecting data for AI training is an expensive endeavor. With volumes in terabytes, proxy costs can reach tens of thousands of dollars per month. Let's consider how to optimize expenses without compromising quality.

Combine Proxy Types

Do not use residential proxies for all tasks. Divide sources into three categories:

  • No Protection: datacenter proxies ($0.5-2/IP) — open APIs, simple sites, government databases
  • Medium Protection: residential rotating ($5-10/GB) — news sites with Cloudflare, forums
  • High Protection: residential sticky sessions ($10-15/GB) — social networks, marketplaces

Example: you scrape 100 news sites. 70 of them operate without Cloudflare — use datacenter proxies. 30 with protection — use residential proxies. Savings will amount to 60-70% of the proxy budget.
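The split can live in a simple lookup that the scraper consults per host. A sketch with hypothetical endpoints and site names; the safe default for unclassified hosts is the expensive tier:

```python
# Hypothetical mapping of protection tier to proxy endpoint
PROXY_BY_TIER = {
    'none':   'http://user:pass@datacenter.proxycove.com:12345',
    'medium': 'http://user:pass@rotating.proxycove.com:12345',
    'high':   'http://user:pass@sticky.proxycove.com:12345',
}

# Sites classified by hand or by a probe request (illustrative names)
SITE_TIERS = {
    'plain-news.example': 'none',
    'cf-news.example': 'medium',
    'marketplace.example': 'high',
}

def proxy_for(host):
    """Unknown hosts default to the safe (but expensive) tier."""
    return PROXY_BY_TIER[SITE_TIERS.get(host, 'high')]
```

Reclassifying a host is then a one-line change in `SITE_TIERS` rather than an edit to the scraping code.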

Cache Requests

If you scrape one site multiple times (for example, daily news collection), cache immutable pages. Use Redis or local storage for HTML caching.

import hashlib
import redis
import requests

cache = redis.Redis(host='localhost', port=6379)

def fetch_with_cache(url, proxies):
    # Check the cache
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cached = cache.get(cache_key)
    
    if cached:
        return cached.decode('utf-8')
    
    # If not in cache - make a request
    response = requests.get(url, proxies=proxies)
    html = response.text
    
    # Save to cache for 24 hours
    cache.setex(cache_key, 86400, html)
    return html

Optimize Traffic

When using residential proxies (payment per GB), it is critical to reduce traffic volume:

  • Disable loading images, CSS, fonts if they are not needed (in Puppeteer: page.setRequestInterception)
  • Use compression (gzip, brotli) — most proxies support it
  • Scrape only the necessary elements — do not download the entire page if only one block is needed
  • For APIs, use JSON instead of HTML (5-10 times less traffic)
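In Python the same request-blocking idea works with Playwright's `page.route`. The predicate below is the part worth keeping testable; the route call in the comment shows where it would plug in (the set of blocked resource types is a typical choice, not a requirement):

```python
# Resource types that carry no text data and only burn proxy traffic
BLOCKED_RESOURCES = {'image', 'media', 'font', 'stylesheet'}

def should_block(resource_type: str) -> bool:
    return resource_type in BLOCKED_RESOURCES

# Plugged into Playwright (async API) it would look roughly like this,
# not executed here:
#   await page.route('**/*', lambda route: route.abort()
#       if should_block(route.request.resource_type)
#       else route.continue_())
```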

Distribute Load Over Time

Many sites have different loads throughout the day. Scrape during nighttime hours (according to the server time of the site) — lower chance of hitting rate limiting. Also consider weekends — on Saturday and Sunday, protection may be weaker.

Monitor Metrics

Track key indicators for optimization:

| Metric | Norm | What to Do When Deviating |
|---|---|---|
| Success rate | >90% | Increase delays, change proxy type |
| Average speed | 50-200 req/min per worker | Add workers or proxies |
| Cost per 1,000 records | $0.5-5 | Optimize traffic, use datacenter proxies |
| Duplicate rate | <5% | Improve deduplication, check crawling logic |

Conclusion

Collecting data for training AI models is a complex task that requires the right choice of proxies, setting up rotation, bypassing protections, and optimizing costs. Key points include:

  • For simple sources (APIs, unprotected sites), use datacenter proxies — they are fast and cheap
  • For protected platforms (social networks, marketplaces, sites with Cloudflare), residential proxies are essential
  • Set up rotation based on the task: request-based for mass scraping of different sites, sticky sessions for working with a single site
  • Use headless browsers with masking plugins to bypass anti-bot systems
  • Build a scalable architecture with task queues and parallel workers
  • Optimize expenses: combine proxy types, cache requests, reduce traffic

With the right setup, you can collect terabytes of data consistently and economically. Start with a small pilot project on 10-20 sources, refine the process, and then scale to industrial volumes.

If you plan to collect data from protected platforms (social networks, e-commerce, sites with anti-bot systems), we recommend using residential proxies — they provide a high level of trust and a minimal block rate. For simple sources and APIs, datacenter proxies are sufficient, offering maximum speed at a low cost.