News websites are among the most protected resources on the internet. Cloudflare, rate limiting, IP blocks — all of this makes news scraping a serious technical challenge. In this guide, we will discuss how to properly configure proxies for stable data collection from news portals, which type of proxy to choose for different tasks, and how to bypass modern protection systems.
Why News Websites Block Scrapers
News portals are particularly sensitive to automated data collection for several reasons. First, content is their main asset, which they monetize through advertising and subscriptions. Mass scraping allows competitors to copy materials and reduces unique traffic. Second, high bot traffic increases server and CDN costs.
Modern news websites use multi-layered protection:
- Cloudflare and similar services — check JavaScript execution, browser TLS fingerprints, behavioral patterns
- Rate Limiting — limit the number of requests from a single IP (usually 10-50 requests per minute)
- User-Agent Blocking — ban standard headers from libraries (Python-requests, curl)
- CAPTCHA — shown during suspicious activity
- Geoblocking — some news portals are only accessible from certain countries
Typical signs by which news websites detect scrapers: the same IP makes many requests in a row, absence of JavaScript, non-standard order of HTTP headers, too fast request speed (a human cannot open 10 pages per second), absence of cookies and referrer.
Important: Scraping news websites is in a gray area. Always check the robots.txt and Terms of Service of the target resource. For commercial use of data, it is recommended to use official APIs or enter into partnership agreements.
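You can check a site's rules programmatically with Python's built-in urllib.robotparser before scraping. The robots.txt content and URLs below are hypothetical examples; in production you would load the real file over the network.

```python
from urllib import robotparser

# Hypothetical robots.txt of a news site
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())
# In production, fetch the live file instead:
# rp.set_url('https://news-site.com/robots.txt'); rp.read()

print(rp.can_fetch('*', 'https://news-site.com/article-1'))   # allowed
print(rp.can_fetch('*', 'https://news-site.com/admin/panel')) # disallowed
print(rp.crawl_delay('*'))  # honor this between requests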
Which Type of Proxy to Choose for News Scraping
The choice of proxy type depends on the scale of the task, budget, and the level of protection of the target websites. Let's consider three main options and their applicability for news scraping.
| Proxy Type | Speed | Cost | When to Use |
|---|---|---|---|
| Datacenter Proxies | High (50-100 ms) | Low | Websites without Cloudflare, large data volumes, testing |
| Residential Proxies | Medium (200-500 ms) | High | Websites with Cloudflare, strict protection, geo-targeting |
| Mobile Proxies | Medium (300-600 ms) | Very High | Maximum protection, mobile versions of news sites |
Datacenter Proxies for News Scraping
Suitable for scraping news websites without serious protection: regional publications, blogs, small news portals. Advantages: high speed (important when scraping hundreds of sources), low cost (can rent a pool of 50-100 IPs), stable connection.
Disadvantages: easily detected by ASN (affiliation with a datacenter), often already blacklisted by major sites, do not pass Cloudflare Challenge in 70% of cases. Use datacenter proxies for mass scraping of RSS feeds, sitemap.xml, API endpoints, or for collecting metadata (headlines, publication dates) without loading the full content.
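For the RSS use case, the feed itself can be parsed with the standard library alone; a datacenter proxy only enters the picture in the HTTP request. The feed below and the proxy URL in the comment are hypothetical.

```python
import xml.etree.ElementTree as ET

def fetch_rss_titles(xml_text):
    """Extract (title, link) pairs from an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext('title'), item.findtext('link'))
        for item in root.iter('item')
    ]

# In production, fetch the feed through a datacenter proxy, e.g.:
# xml_text = requests.get('https://news-site.com/rss',
#     proxies={'http': 'http://user:pass@dc-proxy.example.com:8080',
#              'https': 'http://user:pass@dc-proxy.example.com:8080'},
#     timeout=10).text

sample = """<rss version="2.0"><channel>
<item><title>Headline 1</title><link>https://news-site.com/a1</link></item>
<item><title>Headline 2</title><link>https://news-site.com/a2</link></item>
</channel></rss>"""
print(fetch_rss_titles(sample))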
Residential Proxies — The Gold Standard
Residential proxies are IP addresses of real home users provided by Internet Service Providers. For news websites, they appear as regular visitors, which is critically important when working with protected resources.
When residential proxies are mandatory: scraping large news portals (CNN, BBC, Reuters, RBC, Kommersant), sites behind Cloudflare or similar protection, data collection from specific countries (geo-targeting), long sessions with authentication. Residential proxies pass JavaScript checks from Cloudflare, have a clean IP reputation, and support sticky sessions (fixing IP for 10-30 minutes).
Practical advice: use residential proxies with time-based rotation (sticky sessions), not per request. For example, one IP works for 10 minutes, collects 20-30 articles, then changes. This looks more natural than changing IP for each request.
Mobile Proxies for Special Cases
Mobile proxies use IPs from mobile operators (MTS, Beeline, Tele2). They have maximum trust, as millions of people use mobile internet to read news. Use them for scraping mobile versions of news sites (often have simplified protection), sites with extremely strict protection, AMP pages from Google News.
A feature of mobile proxies: IPs often change automatically (mobile operators use CGNAT), one IP can be shared by hundreds of users simultaneously, making blocking pointless. Disadvantage — high price, so use them selectively, only for the most protected targets.
Bypassing Cloudflare and Other Anti-Bot Systems
Cloudflare is the main enemy of news scrapers. About 40% of major news portals use Cloudflare to protect against bots. Standard libraries (requests, urllib) do not pass checks, as Cloudflare analyzes TLS fingerprints, JavaScript execution, HTTP header order, and behavioral patterns.
Strategies for Bypassing Cloudflare
1. Headless Browsers (Selenium, Playwright, Puppeteer)
Emulate a real browser with JavaScript execution. Cloudflare sees the correct TLS fingerprint of Chrome/Firefox and allows the request. Cons: slow (2-5 seconds per page), requires many resources (RAM, CPU).
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Setting up a proxy for Selenium.
# Note: Chrome ignores credentials embedded in --proxy-server, so pass an
# IP-whitelisted proxy here; for username/password authentication use
# selenium-wire or a proxy-auth extension instead.
chrome_options = Options()
chrome_options.add_argument('--proxy-server=http://proxy.example.com:8080')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-blink-features=AutomationControlled')

driver = webdriver.Chrome(options=chrome_options)
driver.implicitly_wait(10)  # wait up to 10 s when locating elements
driver.get('https://news-site.com/article')
html = driver.page_source
driver.quit()
```
2. Libraries with TLS Fingerprinting (curl_cffi, tls-client)
Imitate the TLS fingerprint of a real browser without launching a headless browser. Work 10-20 times faster than Selenium but do not execute JavaScript. Suitable for sites with basic Cloudflare checks (without JS challenge).
```python
from curl_cffi import requests

proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080',
}

response = requests.get(
    'https://news-site.com/article',
    proxies=proxies,
    impersonate='chrome110',  # imitate the TLS fingerprint of Chrome 110
)
print(response.text)
```
3. Cloudflare-bypass Services (scraperapi, scrapingbee)
Paid APIs that automatically bypass Cloudflare. You send the URL, and they return the ready HTML. Pros: no need to deal with technical details, automatic proxy rotation, CAPTCHA handling. Cons: expensive for large volumes (from $50/month for 100K requests).
Correct HTTP Headers
Even with proxies, it is important to send correct headers; otherwise, the site will identify the bot by non-standard User-Agent or absence of Accept-Language.
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Cache-Control': 'max-age=0',
}
```
Periodically update the User-Agent — use current versions of browsers. You can check your fingerprint on sites like whoer.net or browserleaks.com.
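Rotating the User-Agent can be sketched as picking a random string from a small pool per request. The UA strings below are examples; refresh them with current browser versions, and keep the rest of the header set consistent with the chosen browser.

```python
import random

# Example pool of current desktop browser User-Agents (update periodically)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def build_headers():
    """Build a header set with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'ru-RU,ru;q=0.9,en;q=0.8',
    }

print(build_headers())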
Setting Up IP Rotation and Request Management
Proper proxy rotation is key to stable scraping without blocks. News websites track the frequency of requests from a single IP, and exceeding the limit leads to temporary or permanent bans.
Types of Proxy Rotation
Request-based Rotation — each request goes through a new IP. Suitable for quickly scraping a large number of different sites, minimizes the risk of bans due to request frequency. Cons: not suitable for sites with sessions (cookies, authentication), may look suspicious to some protections.
Time-based Rotation (Sticky Sessions) — one IP is used for a fixed time (5-30 minutes), then changes. Suitable for scraping a single news portal with many pages, retains cookies and sessions, looks like the behavior of a real user. Recommended for most news scraping tasks.
Geolocation-based Rotation — changing IPs from different countries/cities. Used for scraping geo-dependent content (regional news), bypassing geoblocks.
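The time-based (sticky session) rotation described above can be sketched as a small helper class. This is not a library API; the ttl default and the injectable clock are implementation choices made here for clarity and testability.

```python
import itertools
import time

class StickyProxyRotator:
    """Keep one proxy for `ttl` seconds, then move to the next in the pool."""

    def __init__(self, proxy_list, ttl=600, clock=time.monotonic):
        self._pool = itertools.cycle(proxy_list)
        self._ttl = ttl
        self._clock = clock  # injectable for testing
        self._current = next(self._pool)
        self._since = clock()

    def get(self):
        # Rotate to the next proxy once the sticky window has expired
        if self._clock() - self._since >= self._ttl:
            self._current = next(self._pool)
            self._since = self._clock()
        return {'http': self._current, 'https': self._current}

rotator = StickyProxyRotator(
    ['http://user:pass@proxy1.example.com:8080',
     'http://user:pass@proxy2.example.com:8080'],
    ttl=600,  # 10-minute sticky sessions
)
print(rotator.get())
```

Each scraped request then calls `rotator.get()` for its proxies dict, and the IP changes automatically every 10 minutes without per-request churn.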
Optimal Request Frequency
Even with proxy rotation, requests should not be made too frequently. Safe intervals for different types of sites:
- Major News Portals (RBC, Kommersant, Vedomosti) — 2-5 seconds between requests from a single IP
- Medium Sites — 1-3 seconds
- Small Blogs and Regional Publications — 0.5-1 second
Add random delays (randomization) to make the request pattern look natural:
```python
import time
import random
import requests

def fetch_article(url, proxies, headers):
    response = requests.get(url, proxies=proxies, headers=headers)
    # Random delay from 2 to 5 seconds before the next request
    delay = random.uniform(2, 5)
    time.sleep(delay)
    return response.text
```
Example of Proxy Rotation from a Pool
If you have a list of proxies, you can implement simple rotation manually:
```python
import itertools
import requests

# Proxy pool
proxy_list = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

# Create an infinite iterator over the pool
proxy_pool = itertools.cycle(proxy_list)

def get_next_proxy():
    proxy = next(proxy_pool)
    return {'http': proxy, 'https': proxy}

# Usage
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
urls = ['https://news1.com/article', 'https://news2.com/article']
for url in urls:
    proxies = get_next_proxy()
    response = requests.get(url, proxies=proxies, headers=headers)
    print(f'Fetched {url} via {proxies["http"]}')
```
Code Examples: Python + Scrapy + Proxies
Scrapy is a professional framework for scraping that natively supports proxies, middleware, rotation, and error handling. Let's consider a complete example of a news site scraper with proxy rotation.
Installing Dependencies
```shell
pip install scrapy scrapy-rotating-proxies
```
Configuring Scrapy with Proxies (settings.py)
```python
# settings.py

# Enable middleware for proxy rotation
ROTATING_PROXY_LIST = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Settings for bypassing blocks
CONCURRENT_REQUESTS = 8          # no more than 8 concurrent requests
DOWNLOAD_DELAY = 2               # delay of 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # randomize the delay (0.5x-1.5x of DOWNLOAD_DELAY)

# User-Agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Retry attempts on errors
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```
Spider for News Scraping
```python
# news_spider.py
import scrapy
from datetime import datetime

class NewsSpider(scrapy.Spider):
    name = 'news_parser'

    # List of news sites to scrape
    start_urls = [
        'https://example-news.com/latest',
    ]

    def parse(self, response):
        # Parse the list of articles on the homepage
        articles = response.css('article.news-item')
        for article in articles:
            article_url = article.css('a.title::attr(href)').get()
            if article_url:
                # Follow to the article page
                yield response.follow(article_url, callback=self.parse_article)

    def parse_article(self, response):
        # Extract article data
        yield {
            'url': response.url,
            'title': response.css('h1.article-title::text').get(),
            'date': response.css('time.published::attr(datetime)').get(),
            'author': response.css('span.author::text').get(),
            'text': ' '.join(response.css('div.article-body p::text').getall()),
            'tags': response.css('a.tag::text').getall(),
            'scraped_at': datetime.now().isoformat(),
        }
```
Running the Scraper
```shell
# Save to JSON
scrapy crawl news_parser -o news_data.json

# Save to CSV
scrapy crawl news_parser -o news_data.csv
```
Simple Scraper with Requests + BeautifulSoup
If complex logic is not needed, you can use the combination of requests + BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup
import time
import random

# Setting up the proxy
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

def parse_news_article(url):
    try:
        response = requests.get(url, proxies=proxies, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extracting data (selectors depend on the site)
        title = soup.find('h1', class_='article-title').text.strip()
        date = soup.find('time', class_='published')['datetime']
        text = ' '.join(p.text for p in soup.find_all('p', class_='article-text'))

        return {
            'url': url,
            'title': title,
            'date': date,
            'text': text,
        }
    except Exception as e:
        print(f'Error parsing {url}: {e}')
        return None

# Scraping the list of articles
urls = [
    'https://news-site.com/article-1',
    'https://news-site.com/article-2',
]

for url in urls:
    article_data = parse_news_article(url)
    if article_data:
        print(article_data)
    # Delay between requests
    time.sleep(random.uniform(2, 4))
```
Common Mistakes When Scraping News
Even with the correct proxy setup, scrapers often get blocked due to technical errors. Let's discuss the most common problems and their solutions.
Error 1: Too High Request Frequency
Symptoms: HTTP 429 (Too Many Requests), temporary IP bans, CAPTCHA. Reason: the scraper makes 10-50 requests per second from a single IP. Solution: add delays (time.sleep()), use DOWNLOAD_DELAY in Scrapy, limit CONCURRENT_REQUESTS.
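A sketch of handling HTTP 429 with exponential backoff: the retry count, base delay, and cap below are illustrative choices, and the helper honors a numeric Retry-After header when the site sends one.

```python
import random
import time

def backoff_delays(max_retries=4, base=2.0, cap=60.0):
    """Exponential backoff schedule: base * 2^attempt, capped at `cap`."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]

def fetch_with_backoff(session, url, **kwargs):
    """Retry on 429, waiting longer each time (assumes numeric Retry-After)."""
    for delay in backoff_delays():
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get('Retry-After', delay))
        time.sleep(wait + random.uniform(0, 1))  # jitter to avoid lockstep
    return response

print(backoff_delays())
```

`session` here is any requests.Session-like object, so the helper composes with the proxy-rotation code shown earlier.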
Error 2: Using One Proxy for All Requests
Symptoms: proxy gets banned quickly, even with delays. Reason: one IP makes hundreds of requests to one site. Solution: use a proxy pool with rotation, for large sites — at least 10-20 proxies, for sticky sessions change IP every 10-15 minutes.
Error 3: Ignoring Cookies
Many news websites set cookies on the first visit and check for them in subsequent requests; the absence of cookies is a sign of a bot. Solution: use requests.Session() for automatic cookie storage; in Scrapy, keep COOKIES_ENABLED = True (it is the default).
```python
import requests

session = requests.Session()
session.proxies = {
    'http': 'http://proxy.com:8080',
    'https': 'http://proxy.com:8080',
}

# First request — get cookies
response1 = session.get('https://news-site.com')
# Subsequent requests automatically send cookies
response2 = session.get('https://news-site.com/article')
```
Error 4: Incorrect Handling of Redirects
News websites often use redirects (301, 302) for mobile versions, regional subdomains, AMP pages. If the scraper does not follow redirects, it receives an empty page. Solution: in requests, this is enabled by default (allow_redirects=True), check the final URL via response.url.
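A small, hypothetical helper for inspecting where a request ended up: requests follows redirects by default and records intermediate responses in `response.history`, so the hop chain can be reconstructed from any response object.

```python
def resolve_final_url(response):
    """Report whether a request was redirected and through which URLs."""
    hops = [r.url for r in response.history] + [response.url]
    return {
        'redirected': len(hops) > 1,
        'final_url': response.url,
        'hops': hops,
    }

# Usage with requests (network call omitted here):
# response = requests.get('https://news-site.com/article', proxies=proxies, timeout=10)
# print(resolve_final_url(response))
```

Logging the hop chain makes it obvious when a site silently sends the scraper to a mobile subdomain or an AMP page.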
Error 5: Scraping Dynamic Content Without JavaScript
Many modern news websites load content via JavaScript (React, Vue). The requests library receives an empty HTML skeleton without articles. Solution: use Selenium/Playwright to execute JavaScript, check the Network in DevTools — perhaps the data is loaded via API (can scrape JSON directly).
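When DevTools shows the articles arriving as JSON from an internal endpoint, you can fetch and parse that payload directly instead of rendering JavaScript. The endpoint and field names below are hypothetical; inspect the real response shape in the Network tab first.

```python
import json

def extract_articles(payload_text):
    """Pull title/url pairs out of a (hypothetical) news API JSON payload."""
    payload = json.loads(payload_text)
    return [
        {'title': item['title'], 'url': item['url']}
        for item in payload.get('articles', [])
    ]

# In production the payload comes from the internal API, e.g.:
# payload_text = requests.get('https://news-site.com/api/v1/latest',
#                             proxies=proxies, timeout=10).text

sample = '{"articles": [{"title": "Headline", "url": "https://news-site.com/a1"}]}'
print(extract_articles(sample))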
Scaling: Scraping Hundreds of Sources
When you need to scrape not just one news site, but hundreds of sources simultaneously (news aggregators, media monitoring), a scalable architecture is required.
Distributed Scraping with Scrapy Cloud
Scrapy Cloud (from the creators of Scrapy) allows you to run scrapers in the cloud with automatic scaling. Advantages: no need for your own servers, automatic proxy rotation, monitoring and logs. Cost: from $9/month for the basic plan.
Task Queues (Celery + Redis)
For self-deployment, use Celery — a distributed task system. Architecture: Redis stores a queue of URLs to scrape, several workers (servers) take tasks from the queue and scrape in parallel, each worker uses its own proxy pool.
```python
# tasks.py
import random

import requests
from celery import Celery

app = Celery('news_parser', broker='redis://localhost:6379/0')

@app.task
def parse_article(url, proxy):
    proxies = {'http': proxy, 'https': proxy}
    response = requests.get(url, proxies=proxies, timeout=10)
    # Parsing and saving the data would happen here
    return response.text

# Adding tasks to the queue
urls = ['https://news1.com/article', 'https://news2.com/article']
proxies = ['http://proxy1.com:8080', 'http://proxy2.com:8080']

for url in urls:
    proxy = random.choice(proxies)
    parse_article.delay(url, proxy)  # asynchronous execution
```
Monitoring and Error Handling
In large-scale scraping, monitoring is critically important: how many URLs have been processed, how many errors, which proxies have been banned. Use Sentry for tracking Python errors, Grafana + Prometheus for metrics (requests per second, response time), logging in ELK Stack (Elasticsearch, Logstash, Kibana).
Tip: Create a system for automatic proxy checking. Every 5-10 minutes, send a test request through each proxy to whoer.net or httpbin.org. If the proxy does not respond or is banned — exclude it from the pool and add a new one.
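A minimal sketch of such a health check, assuming requests and a thread pool; httpbin.org/ip serves as the test endpoint, and a proxy that fails or times out is dropped from the pool.

```python
import concurrent.futures

import requests

def check_proxy(proxy, test_url='https://httpbin.org/ip', timeout=5):
    """Return True if the proxy answers the test URL within the timeout."""
    try:
        r = requests.get(test_url,
                         proxies={'http': proxy, 'https': proxy},
                         timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

def filter_alive(proxy_list, max_workers=10):
    """Check the whole pool in parallel and keep only working proxies."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check_proxy, proxy_list))
    return [p for p, alive in zip(proxy_list, results) if alive]
```

Run `filter_alive` on a schedule (e.g. every 5-10 minutes) and top the pool back up from your provider when it shrinks.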
Optimizing Proxy Costs
When scraping hundreds of sources, proxy costs can reach thousands of dollars per month. Optimization strategies: use datacenter proxies for simple sites (RSS, API), residential proxies only for protected ones, cache data — do not scrape the same article twice, scrape during off-peak hours (site load is lower at night, less risk of bans).
Example: for scraping 500 news sites, you can use 80% datacenter proxies (for RSS and simple sites) and 20% residential proxies (for the top 100 protected portals). This will reduce costs by 3-5 times.
Conclusion
Scraping news websites is a technically complex task that requires the right choice of proxies, rotation setup, and bypassing anti-bot systems. Key takeaways from the article: for protected news portals (Cloudflare, strict rate limiting), use residential proxies with sticky sessions; for mass scraping of hundreds of sources, datacenter proxies with fast rotation are suitable; always add delays between requests (2-5 seconds) and correct HTTP headers; for bypassing Cloudflare, use headless browsers (Selenium, Playwright) or libraries with TLS fingerprinting.
When scaling, use distributed systems (Celery, Scrapy Cloud) and error monitoring. Remember that scraping should be ethical — comply with robots.txt, do not create excessive load on servers, and respect copyright on content.
If you plan to scrape large news portals protected by Cloudflare, we recommend using residential proxies — they provide a high level of trust and minimal risk of blocks. For tasks where speed and data volume are important (scraping RSS, API endpoints), datacenter proxies will suffice.