Parsing medical data is a task that requires a special approach to proxy selection. Medical portals, clinical trial databases, and pharmaceutical resources use advanced protection systems against automated data collection. In this article, we will discuss how to properly configure proxies for safe parsing of medical information, avoid blocks, and efficiently collect the necessary data.
Why Medical Sites Block Parsing
Medical portals and databases are particularly sensitive to automated information collection for several reasons. First, many of them operate on a commercial basis and sell access to data through paid subscriptions. Automated parsing may violate terms of service and licensing agreements.
Second, medical data often contains confidential information protected by law (HIPAA in the US, GDPR in Europe). Resource owners are required to control access to such data and prevent unauthorized distribution. Therefore, they use advanced protection systems:
- Rate limiting — limiting the number of requests from a single IP address within a time unit (usually 10-50 requests per minute)
- Fingerprinting — analyzing browser characteristics, HTTP headers, resource loading order
- CAPTCHA — systems like reCAPTCHA v3 that trigger on suspicious activity
- IP blocking — temporary or permanent blocking of data center IP addresses
- Cloudflare and analogs — bot protection at the CDN level
The third reason is server load. Medical databases often contain millions of records, and mass parsing can create significant strain on the infrastructure. Therefore, administrators actively combat automated data collection by tracking behavior patterns typical of bots: identical intervals between requests, linear page traversal, absence of JavaScript and cookies.
Important: Before starting to parse medical data, be sure to study the website's terms of use and applicable legislation. Some data may be copyright protected or contain personal information about patients. Ensure that your activities are legal and do not violate third-party rights.
Which Type of Proxy to Choose for Medical Data
Choosing the right type of proxy is critical for successful parsing of medical data. Different sources require different approaches. Let's consider the main types of proxies and their applicability:
| Type of Proxy | Advantages | Disadvantages | When to Use |
|---|---|---|---|
| Data Center Proxies | High speed (100+ Mbps), low cost, stable connection | Easily detected, often blocked on protected sites | Open databases without strict protection (PubMed, WHO) |
| Residential Proxies | Real IPs of home users, low risk of blocking, bypass Cloudflare | Higher cost, variable speed, may be unstable | Protected commercial databases (Elsevier, Springer), sites with Cloudflare |
| Mobile Proxies | Maximum trust (IPs of mobile operators), virtually never blocked | Most expensive, limited geography, may be slower | Highly protected resources when residential proxies do not help |
| ISP Proxies | Data center speed + residential trust, static IPs | Average cost, limited availability | Long-term parsing from a single IP when stability is needed |
For most medical data parsing tasks, residential proxies are recommended: they offer the best balance between cost and effectiveness. Data center proxies are suitable only for open sources without protection. Mobile proxies are a last resort for when the other types fail.
Recommendations for Specific Sources
- PubMed, PubMed Central — data center proxies are sufficient, but with a speed limit of 3 requests per second
- ClinicalTrials.gov — data center proxies, there is an official API
- Elsevier, Springer, Wiley — residential proxies are mandatory, use advanced fingerprinting
- DrugBank, RxList — residential proxies, active protection against parsing
- FDA, EMA databases — data center proxies are suitable, but with slow parsing speed
Main Sources of Medical Data and Their Protection
Medical data is distributed across many sources, each with its own specifics and level of protection. Understanding these features will help properly configure the parsing strategy.
Open Government Databases
PubMed/PubMed Central — the largest database of medical publications, contains over 35 million records. The National Library of Medicine (NLM) provides an official E-utilities API, which is the preferred way to access data. Direct parsing of the web interface is possible but limited to 3 requests per second from a single IP. Exceeding the limit results in a temporary block for 24 hours.
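Since the E-utilities API is the preferred access route, here is a minimal sketch of querying its `esearch` endpoint. The endpoint URL and the `db`/`term`/`retmax`/`retmode` parameters follow NLM's documented interface; the helper function is our own illustration.

```python
import requests

# NLM E-utilities search endpoint: returns PMIDs matching a query
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_params(term, retmax=20):
    """Assemble query parameters for an esearch request (JSON output)."""
    return {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}

params = build_esearch_params("diabetes")
# Uncomment to perform the actual request (respect the 3 req/s limit):
# response = requests.get(EUTILS_ESEARCH, params=params, timeout=30)
# pmids = response.json()["esearchresult"]["idlist"]
```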
ClinicalTrials.gov — a database of clinical trials, contains information on over 400,000 studies in 220 countries. It also provides an API for programmatic access. The web interface is protected by rate limiting — a maximum of 100 requests in 5 minutes from a single IP. Basic bot protection is used, but without Cloudflare.
FDA Drugs Database — a database of FDA-approved drugs. Open access through the web interface and the openFDA API. Limitations: 240 requests per minute for anonymous users, 1000 requests per minute with an API key. Blocks are rare, but aggressive parsing may lead to them.
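As an illustration of the openFDA route, a hedged sketch of building a drug-label query. The `api.fda.gov/drug/label.json` endpoint and the `search`/`limit`/`api_key` parameters follow openFDA's documented conventions; adapt the search expression to the dataset you need.

```python
import requests

# openFDA drug label endpoint
OPENFDA_LABEL = "https://api.fda.gov/drug/label.json"

def build_openfda_params(search, limit=10, api_key=None):
    """Query parameters for openFDA; an API key raises the rate limit."""
    params = {"search": search, "limit": limit}
    if api_key:
        params["api_key"] = api_key
    return params

params = build_openfda_params('openfda.brand_name:"aspirin"')
# response = requests.get(OPENFDA_LABEL, params=params, timeout=30)
```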
Commercial Scientific Publishers
Elsevier (ScienceDirect) — one of the largest publishers of scientific literature. Uses multi-layered protection: Cloudflare, browser fingerprinting, user behavior analysis. Detects patterns of automated downloads: sequential access to articles, absence of JavaScript, atypical User-Agent. Upon detecting parsing, it blocks IPs at the account level and may block the entire institution. Residential proxies with rotation and full browser emulation are mandatory.
Springer Nature — similar protection, additionally tracks page scrolling speed and mouse movements. Uses machine learning to detect bots. It is recommended to parse no more than 10-15 articles per hour from a single IP, with randomized delays between requests.
Wiley Online Library — less aggressive protection, but still requires the use of proxies. Allows about 50 requests per hour from a single IP without blocking. Uses session cookies to track activity.
Pharmaceutical Databases
DrugBank — a comprehensive database of drugs. The free version is limited to the web interface, while the commercial version provides an API and data dumps. The web version is protected by Cloudflare and rate limiting — a maximum of 20 requests per minute. It detects automation by the absence of cookies and JavaScript.
RxList, Drugs.com — popular drug reference guides for consumers. Use Cloudflare and actively combat parsing. Block data center IPs almost instantly. Residential proxies and slow parsing speed (5-10 pages per minute) are required.
Setting Up IP Rotation for Long-Term Parsing
Proper IP address rotation is a key factor for successful parsing of medical data. There are two main approaches: rotation at the request level and time-based rotation.
Request-Level Rotation
In this approach, each request is sent through a new IP address. This minimizes the risk of blocking, but may cause issues with sites that track sessions via cookies. It is suitable for parsing lists and catalogs where session state maintenance is not required.
Most residential proxy providers offer automatic rotation through a special endpoint. For example, when using a rotating proxy endpoint, each new TCP connection receives a new IP. This works automatically with libraries like requests in Python, as a new connection is created for each request by default.
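A minimal sketch of that setup. The gateway host, port, and credentials are placeholders; substitute your provider's rotating endpoint.

```python
import requests

# Hypothetical rotating gateway -- every new TCP connection gets a new exit IP
ROTATING_ENDPOINT = "http://username:password@rotating.example-provider.com:10000"

def build_proxies(endpoint):
    """requests expects the same gateway for both schemes."""
    return {"http": endpoint, "https": endpoint}

proxies = build_proxies(ROTATING_ENDPOINT)

# Each requests.get() call opens a fresh connection, so each request
# should exit from a different IP:
# for term in ["diabetes", "cancer"]:
#     requests.get(f"https://pubmed.ncbi.nlm.nih.gov/?term={term}",
#                  proxies=proxies, timeout=30)
```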
Time-Based Rotation (Sticky Sessions)
Sticky sessions allow the use of one IP address for a certain period (usually 5-30 minutes), after which an automatic change occurs. This is useful for sites that require authentication or track session state via cookies. You can parse several pages from one IP, mimicking the behavior of a real user, after which the IP changes automatically.
For medical sites, it is recommended to use sticky sessions lasting 10-15 minutes. During this time, you can parse 10-20 pages (depending on delays), after which the IP changes, and you start a "new session." This looks natural and reduces the risk of detection.
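Many residential providers pin a sticky session through a session ID embedded in the proxy username. The `user-session-<id>` pattern below is a common convention, not a universal one, so check your provider's documentation for the exact format.

```python
import random
import string

def new_session_id(length=8):
    """Random session identifier; a new ID means a new sticky IP."""
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

def sticky_proxy_url(user, password, host, port, session_id):
    """Embed the session ID in the username (provider-specific format)."""
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

# Start a "new session" (and thus get a new IP) every 15-20 pages
url = sticky_proxy_url("username", "password", "proxy.example.com", 8080, new_session_id())
proxies = {"http": url, "https": url}
```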
IP Address Pool Size
For long-term parsing, the size of the available IP address pool is important. If you use the same set of 100 IPs over a week, the site may notice the pattern and block all those addresses. Residential proxies usually provide access to millions of IPs, which practically eliminates the reuse of the same address.
When using data center proxies, it is recommended to have a pool of at least 500-1000 IPs for medium volume parsing (10,000-50,000 pages per month). For large-scale parsing (hundreds of thousands of pages), it is better to use residential proxies with their huge IP pools.
Rotation Tips for Different Sources:
- PubMed — rotation is not mandatory, one IP is sufficient while adhering to the rate limit
- Commercial Publishers — sticky sessions of 10-15 minutes, new IP every 15-20 pages
- Pharmaceutical Databases — rotation on each request or sticky sessions of 5 minutes
- Sites with Cloudflare — sticky sessions are mandatory, request-level rotation does not work
Python Code Examples for Parsing with Proxies
Let's consider practical examples of configuring proxies for parsing medical data using popular Python libraries. We will start with a basic example and gradually complicate it.
Basic Setup with Requests Library
```python
import requests
from time import sleep
import random

# Proxy setup (replace with your data)
PROXY_HOST = "proxy.example.com"
PROXY_PORT = "8080"
PROXY_USER = "username"
PROXY_PASS = "password"

proxies = {
    'http': f'http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}',
    'https': f'http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}'
}

# Headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

# Example request to PubMed
url = "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes"

try:
    response = requests.get(url, proxies=proxies, headers=headers, timeout=30)
    print(f"Status code: {response.status_code}")
    print(f"Content length: {len(response.content)}")
    # Add a delay between requests (mandatory for PubMed)
    sleep(random.uniform(1.0, 3.0))
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
```
Advanced Setup with Rotation and Retry Logic
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from time import sleep
import random

class ProxyRotator:
    def __init__(self, proxy_list):
        """
        proxy_list: list of dictionaries with proxies
        [{'http': 'http://user:pass@host:port', 'https': '...'}, ...]
        """
        self.proxy_list = proxy_list
        self.current_index = 0

    def get_next_proxy(self):
        """Get the next proxy from the list"""
        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        return proxy

def create_session_with_retries():
    """Create a session with automatic retries on errors"""
    session = requests.Session()
    # Set up automatic retries
    retry_strategy = Retry(
        total=3,                                      # maximum 3 attempts
        backoff_factor=1,                             # delay between attempts: 1, 2, 4 seconds
        status_forcelist=[429, 500, 502, 503, 504],   # codes that trigger a retry
        allowed_methods=["GET", "POST"]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def scrape_with_rotation(urls, proxy_rotator):
    """Parse a list of URLs with proxy rotation"""
    session = create_session_with_retries()
    results = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    for url in urls:
        # Get a new proxy for each request
        proxy = proxy_rotator.get_next_proxy()
        try:
            response = session.get(
                url,
                proxies=proxy,
                headers=headers,
                timeout=30
            )
            if response.status_code == 200:
                results.append({
                    'url': url,
                    'status': 'success',
                    'content_length': len(response.content)
                })
                print(f"✓ Success: {url}")
            else:
                results.append({
                    'url': url,
                    'status': 'failed',
                    'error': f"Status code: {response.status_code}"
                })
                print(f"✗ Failed: {url} (Status: {response.status_code})")
        except requests.exceptions.RequestException as e:
            results.append({
                'url': url,
                'status': 'error',
                'error': str(e)
            })
            print(f"✗ Error: {url} ({e})")
        # Random delay between requests (important!)
        sleep(random.uniform(2.0, 5.0))
    return results

# Example usage
proxy_list = [
    {
        'http': 'http://user1:pass1@proxy1.example.com:8080',
        'https': 'http://user1:pass1@proxy1.example.com:8080'
    },
    {
        'http': 'http://user2:pass2@proxy2.example.com:8080',
        'https': 'http://user2:pass2@proxy2.example.com:8080'
    }
]

rotator = ProxyRotator(proxy_list)

urls_to_scrape = [
    "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes",
    "https://pubmed.ncbi.nlm.nih.gov/?term=cancer",
    "https://pubmed.ncbi.nlm.nih.gov/?term=covid"
]

results = scrape_with_rotation(urls_to_scrape, rotator)
```
Using Selenium for JavaScript Sites
Many modern medical sites use JavaScript to load content. In such cases, a headless browser is necessary:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def create_proxy_driver(proxy_host, proxy_port, proxy_user, proxy_pass):
    """Create a Chrome WebDriver with a proxy"""
    chrome_options = Options()
    # Headless mode (no GUI)
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # Proxy setup
    chrome_options.add_argument(f'--proxy-server=http://{proxy_host}:{proxy_port}')
    # Hide automation flags (important for bypassing detection)
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    # User-Agent
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
    driver = webdriver.Chrome(options=chrome_options)
    # For proxies with authentication (proxy_user/proxy_pass), you need to use
    # an extension or configure it through capabilities (a more complex option)
    return driver

def scrape_with_selenium(url, driver):
    """Parse a page, waiting for JavaScript to load"""
    driver.get(url)
    # Wait for the results to load (e.g., search results)
    try:
        wait = WebDriverWait(driver, 10)
        wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, "results-article"))
        )
        # Extract data
        articles = driver.find_elements(By.CLASS_NAME, "results-article")
        data = []
        for article in articles:
            try:
                title = article.find_element(By.CLASS_NAME, "docsum-title").text
                authors = article.find_element(By.CLASS_NAME, "docsum-authors").text
                data.append({
                    'title': title,
                    'authors': authors
                })
            except Exception:
                continue
        return data
    except Exception as e:
        print(f"Error waiting for elements: {e}")
        return []

# Example usage
proxy_host = "proxy.example.com"
proxy_port = "8080"
proxy_user = "username"
proxy_pass = "password"

driver = create_proxy_driver(proxy_host, proxy_port, proxy_user, proxy_pass)
try:
    url = "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes"
    results = scrape_with_selenium(url, driver)
    for result in results:
        print(f"Title: {result['title']}")
        print(f"Authors: {result['authors']}\n")
finally:
    driver.quit()
```
Controlling Request Rate and Working Around Rate Limits
Rate limiting is one of the main protections medical sites have against parsing. Properly setting the request speed is critical for long-term parsing without blocks.
Determining a Safe Speed
The first step is to determine the limits of a specific site. This can be done experimentally by gradually increasing the request speed until 429 (Too Many Requests) errors or blocks occur. For most medical sites, safe values are:
- PubMed — a maximum of 3 requests per second (official recommendation)
- ClinicalTrials.gov — 20 requests per minute is safe, up to 100 in 5 minutes is acceptable
- Commercial Publishers — 10-15 requests per hour from a single IP
- Pharmaceutical Databases — 5-10 requests per minute
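The ramp-up probe described above can be sketched like this. The URL and proxy settings are placeholders, and `find_safe_rate` performs real requests, so run it sparingly against a target you are permitted to test.

```python
import time
import requests

def probe_rates():
    """Request rates (requests/min) to try, from conservative to aggressive."""
    return [5, 10, 20, 30, 60]

def find_safe_rate(url, proxies=None, per_rate_requests=5):
    """Ramp up the request rate until the server returns 429;
    report the last rate that caused no complaints (None if even
    the slowest rate was rejected)."""
    safe = None
    for rate in probe_rates():
        delay = 60.0 / rate
        for _ in range(per_rate_requests):
            response = requests.get(url, proxies=proxies, timeout=30)
            if response.status_code == 429:
                return safe
            time.sleep(delay)
        safe = rate
    return safe
```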
Implementing a Rate Limiter in Python
```python
import time
import requests
from collections import deque

class RateLimiter:
    def __init__(self, max_calls, period):
        """
        max_calls: maximum number of calls
        period: time period in seconds
        For example: RateLimiter(3, 1) = 3 requests per second
        """
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()

    def __call__(self, func):
        """Decorator that limits the rate of function calls"""
        def wrapper(*args, **kwargs):
            now = time.time()
            # Remove calls that have fallen outside the period
            while self.calls and self.calls[0] < now - self.period:
                self.calls.popleft()
            # If the limit is reached, wait until the oldest call expires
            if len(self.calls) >= self.max_calls:
                sleep_time = self.period - (now - self.calls[0])
                if sleep_time > 0:
                    print(f"Rate limit reached, sleeping {sleep_time:.2f}s")
                    time.sleep(sleep_time)
                # Drop only the calls that expired while we slept (clearing
                # the whole deque here would allow a burst above the limit)
                now = time.time()
                while self.calls and self.calls[0] < now - self.period:
                    self.calls.popleft()
            # Record the call time
            self.calls.append(time.time())
            # Execute the function
            return func(*args, **kwargs)
        return wrapper

# Example usage (headers and proxies as defined in the earlier examples)
@RateLimiter(max_calls=3, period=1)  # 3 requests per second
def fetch_pubmed_page(url):
    response = requests.get(url, headers=headers, proxies=proxies)
    return response

# The function now automatically adheres to the rate limit
for i in range(10):
    result = fetch_pubmed_page(f"https://pubmed.ncbi.nlm.nih.gov/?term=test&page={i}")
    print(f"Page {i} fetched")
```
Adaptive Rate Limiting
A more advanced approach is to adaptively change the speed based on server responses. If we receive 429 or 503 errors, we automatically reduce the speed:
```python
import time
import random
import requests

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=1.0, max_delay=60.0):
        self.current_delay = initial_delay
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.success_count = 0

    def wait(self):
        """Wait before the next request"""
        # Add jitter so the intervals look natural
        actual_delay = self.current_delay * random.uniform(0.8, 1.2)
        time.sleep(actual_delay)

    def on_success(self):
        """Called on a successful request"""
        self.success_count += 1
        # After 10 successful requests, speed up a little
        if self.success_count >= 10:
            self.current_delay = max(
                self.initial_delay,
                self.current_delay * 0.9
            )
            self.success_count = 0

    def on_rate_limit(self):
        """Called when receiving 429 or similar errors"""
        # Double the delay, up to the maximum
        self.current_delay = min(
            self.current_delay * 2,
            self.max_delay
        )
        self.success_count = 0
        print(f"Rate limit hit! Increasing delay to {self.current_delay:.2f}s")

    def on_error(self):
        """Called on other errors"""
        # Increase the delay moderately
        self.current_delay = min(
            self.current_delay * 1.5,
            self.max_delay
        )
        self.success_count = 0

# Example usage (urls_to_scrape, proxies, and headers as defined earlier)
limiter = AdaptiveRateLimiter(initial_delay=2.0, max_delay=30.0)

for url in urls_to_scrape:
    limiter.wait()
    try:
        response = requests.get(url, proxies=proxies, headers=headers)
        if response.status_code == 200:
            limiter.on_success()
            # Process the data
        elif response.status_code == 429:
            limiter.on_rate_limit()
            # Retry later
        else:
            limiter.on_error()
    except requests.exceptions.RequestException:
        limiter.on_error()
```
Correct Headers and User-Agent for Medical Sites
Medical sites analyze HTTP headers for bot detection. Incorrect or missing headers are a common reason for blocks even when using quality proxies.
Mandatory Headers
The minimum set of headers that must be present in each request:
```python
headers = {
    # User-Agent — must be a current browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    # Accept — content types accepted by the browser
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    # Accept-Language — user language
    'Accept-Language': 'en-US,en;q=0.9',
    # Accept-Encoding — compression support
    'Accept-Encoding': 'gzip, deflate, br',
    # Connection — keep the connection alive
    'Connection': 'keep-alive',
    # Upgrade-Insecure-Requests — automatic upgrade to HTTPS
    'Upgrade-Insecure-Requests': '1',
    # DNT — Do Not Track (optional but adds realism)
    'DNT': '1',
    # Sec-Fetch-* headers (important for modern browsers)
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    # Cache-Control
    'Cache-Control': 'max-age=0'
}
```
User-Agent Rotation
Using the same User-Agent can be suspicious. It is recommended to rotate between several current browsers:
```python
import random
import requests

USER_AGENTS = [
    # Chrome on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    # Chrome on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    # Firefox on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    # Firefox on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
    # Safari on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    # Edge on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
]

def get_random_headers():
    """Get headers with a random User-Agent"""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'DNT': '1'
    }

# Usage (urls and proxies as defined earlier)
for url in urls:
    headers = get_random_headers()
    response = requests.get(url, headers=headers, proxies=proxies)
```
Referer and Origin for Forms
When working with search forms or sending POST requests, be sure to add the Referer and Origin headers:
```python
import requests

# For POST requests to a search form
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://example.com',
    'Referer': 'https://example.com/search',
    'Connection': 'keep-alive'
}

# POST request with form data
data = {
    'query': 'diabetes',
    'page': '1'
}

response = requests.post(
    'https://example.com/search',
    headers=headers,
    data=data,
    proxies=proxies
)
```
Common Problems and Their Solutions
When parsing medical data, specific problems arise. Let's consider the most common ones and how to solve them.
Problem: Cloudflare Blocks All Requests
Symptoms: You receive a page with the text "Checking your browser" or a 403 Forbidden error mentioning Cloudflare.
Solution:
- Use residential proxies instead of data center proxies — Cloudflare blocks data center IPs by default
- Switch to Selenium or Puppeteer — headless browsers pass Cloudflare checks better
- Use the cloudscraper library for Python — it automatically bypasses basic Cloudflare protection
- Enable cookies and JavaScript — Cloudflare checks for their presence
- Add TLS fingerprinting — use curl_cffi to mimic a real browser at the TLS level
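A minimal sketch of the cloudscraper route mentioned above (a third-party package, `pip install cloudscraper`). Note that it only defeats basic JavaScript challenges, not every Cloudflare mode, and the target URL here is a placeholder.

```python
# cloudscraper is a third-party package; the import is guarded so the
# sketch degrades gracefully where it is not installed.
try:
    import cloudscraper
    # create_scraper() returns a drop-in replacement for requests.Session
    # that solves basic Cloudflare JavaScript challenges automatically
    scraper = cloudscraper.create_scraper()
    # response = scraper.get("https://example.com", timeout=30)
except ImportError:
    scraper = None
```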
Problem: Receiving 429 Too Many Requests Error
Symptoms: After several successful requests, the server starts returning 429.
Solution:
- Increase the delay between requests — try starting with 3-5 seconds
- Enable IP rotation — sending each request through a new IP sidesteps per-IP rate limits
- Check the Retry-After header in the 429 response — it indicates how many seconds to wait
- Use exponential backoff on retries — 1s, 2s, 4s, 8s, etc.
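The last two points combine naturally: honor `Retry-After` when the server sends it, and fall back to exponential backoff otherwise. A sketch, with the URL as a placeholder:

```python
import time
import requests

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff schedule: 1s, 2s, 4s, 8s, ..., capped."""
    return min(base * (2 ** attempt), cap)

def get_with_backoff(url, max_retries=4, **kwargs):
    """Retry on 429, preferring the server's Retry-After hint
    over the computed backoff delay."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30, **kwargs)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", backoff_delay(attempt)))
        time.sleep(wait)
    return response

# Usage: get_with_backoff("https://example.com/page", proxies=proxies)
```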
Problem: Proxies Are Slow or Frequently Disconnect
Symptoms: Timeout errors, very long page load times, connection drops.
Solution:
- Increase the timeout in requests to 30-60 seconds — residential proxies may be slower
- Use geographically close proxies — if parsing a European site, use European IPs
- Check the quality of the proxy provider — cheap proxies are often unstable
- Add retry logic — automatically retry the request on connection error
- Use connection pooling — reuse TCP connections via requests.Session()
Problem: The Site Requires Authentication or Subscription
Symptoms: Access to full text articles is restricted, login is required.
Solution:
- Use institutional access — many universities and hospitals have subscriptions
- Check for Open Access versions — many articles are available for free through repositories
- Use APIs instead of parsing — some publishers provide APIs for researchers
- Parse only metadata (titles, authors, abstracts) — they are usually available for free
Problem: JavaScript Content Does Not Load
Symptoms: The HTML does not contain the required data, only loading spinners or empty containers are visible.
Solution:
- Switch to Selenium/Puppeteer — they execute JavaScript
- Find the API endpoint — open DevTools in the browser, go to the Network tab, and find XHR requests with data
- Use requests-html — a library with built-in JavaScript rendering on top of the requests API (it drives a headless Chromium under the hood)