
Proxies for Medical Data Scraping: How to Collect Information Without Getting Blocked

Learn how to safely scrape medical data from clinical studies, drug databases, and medical journals without getting blocked.

📅 March 9, 2026

Scraping medical data is a task that requires a special approach to proxy selection. Medical portals, clinical trial databases, and pharmaceutical resources use advanced protection against automated data collection. In this article, we will cover how to configure proxies for safely scraping medical information, avoid blocks, and collect the data you need efficiently.

Why Medical Sites Block Scraping

Medical portals and databases are particularly sensitive to automated information collection for several reasons. First, many of them operate on a commercial basis and sell access to data through paid subscriptions; automated scraping may violate their terms of service and licensing agreements.

Second, medical data often contains confidential information protected by law (HIPAA in the US, GDPR in Europe). Resource owners are required to control access to such data and prevent unauthorized distribution. Therefore, they use advanced protection systems:

  • Rate limiting — capping the number of requests from a single IP address per unit of time (typically 10-50 requests per minute)
  • Fingerprinting — analyzing browser characteristics, HTTP headers, resource loading order
  • CAPTCHA — systems like reCAPTCHA v3 that trigger on suspicious activity
  • IP blocking — temporary or permanent blocking of data center IP addresses
  • Cloudflare and analogs — bot protection at the CDN level

The third reason is server load. Medical databases often contain millions of records, and bulk scraping can put significant strain on the infrastructure. Administrators therefore actively fight automated collection by tracking behavior patterns typical of bots: identical intervals between requests, linear page traversal, and the absence of JavaScript execution and cookies.
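These behavioral signatures are easy to blunt with basic randomization. A minimal sketch (function and parameter names are illustrative):

```python
import random

def humanize_schedule(page_ids, base_delay=2.0, jitter=0.5):
    """Return (page, delay) pairs with shuffled order and randomized
    delays, avoiding the fixed intervals and linear traversal that
    bot detectors look for."""
    pages = list(page_ids)
    random.shuffle(pages)  # non-linear page traversal
    return [(page, base_delay + random.uniform(-jitter, jitter)) for page in pages]

schedule = humanize_schedule(range(10))
```

Each entry is then fetched after sleeping for its delay, so no two runs produce the same timing pattern.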

Important: Before scraping medical data, be sure to study the website's terms of use and the applicable legislation. Some data may be copyright-protected or contain personal information about patients. Ensure that your activities are legal and do not violate third-party rights.

Which Type of Proxy to Choose for Medical Data

Choosing the right type of proxy is critical for successfully scraping medical data. Different sources require different approaches. Let's look at the main proxy types and where each applies:

  • Data Center Proxies — Advantages: high speed (100+ Mbps), low cost, stable connection. Disadvantages: easily detected, often blocked on protected sites. When to use: open databases without strict protection (PubMed, WHO).
  • Residential Proxies — Advantages: real IPs of home users, low risk of blocking, bypass Cloudflare. Disadvantages: higher cost, variable speed, may be unstable. When to use: protected commercial databases (Elsevier, Springer) and sites behind Cloudflare.
  • Mobile Proxies — Advantages: maximum trust (mobile operator IPs), virtually never blocked. Disadvantages: most expensive, limited geography, may be slower. When to use: highly protected resources when residential proxies do not help.
  • ISP Proxies — Advantages: data center speed with residential trust, static IPs. Disadvantages: average cost, limited availability. When to use: long-term scraping from a single IP when stability is needed.

For most medical data scraping tasks, residential proxies are recommended: they offer the best balance between cost and effectiveness. Data center proxies are suitable only for open sources without protection, and mobile proxies should be reserved for cases where the other types fail.

Recommendations for Specific Sources

  • PubMed, PubMed Central — data center proxies are sufficient, but keep to the limit of 3 requests per second
  • ClinicalTrials.gov — data center proxies; an official API is available
  • Elsevier, Springer, Wiley — residential proxies are mandatory; emulate a full browser fingerprint
  • DrugBank, RxList — residential proxies; these sites actively fight scraping
  • FDA, EMA databases — data center proxies work, but keep the request rate low

Main Sources of Medical Data and Their Protection

Medical data is spread across many sources, each with its own specifics and level of protection. Understanding these differences helps you configure the right scraping strategy.

Open Government Databases

PubMed/PubMed Central — the largest database of medical publications, with over 35 million records. The National Library of Medicine (NLM) provides an official E-utilities API, which is the preferred way to access the data. Scraping the web interface directly is possible but limited to 3 requests per second per IP; exceeding the limit results in a temporary 24-hour block.
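Since the E-utilities API is the sanctioned route, a query can be built as a plain URL instead of scraping the web interface. A sketch following NCBI's E-utilities conventions (with an NCBI API key the official limit rises from 3 to 10 requests per second):

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, retmax=20, api_key=None):
    """Build an esearch URL returning PubMed IDs that match `term` as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        params["api_key"] = api_key  # raises the official rate limit
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

url = esearch_url("diabetes type 2", retmax=5)
# Fetch it as usual, e.g. requests.get(url, proxies=proxies, timeout=30)
```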

ClinicalTrials.gov — a database of clinical trials, contains information on over 400,000 studies in 220 countries. It also provides an API for programmatic access. The web interface is protected by rate limiting — a maximum of 100 requests in 5 minutes from a single IP. Basic bot protection is used, but without Cloudflare.

FDA Drugs Database — a database of FDA-approved drugs. Open access through the web interface and the openFDA API. Limits: 240 requests per minute per IP for anonymous users; registering a free API key raises the daily quota substantially. Blocks are rare, but aggressive scraping may trigger them.
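openFDA queries are likewise plain URLs. A sketch (the `search` field syntax follows the openFDA documentation; `api_key` is optional):

```python
from urllib.parse import quote

OPENFDA_LABEL = "https://api.fda.gov/drug/label.json"

def openfda_label_url(brand_name, limit=1, api_key=None):
    """Build an openFDA drug-label query filtered by brand name."""
    url = f'{OPENFDA_LABEL}?search=openfda.brand_name:"{quote(brand_name)}"&limit={limit}'
    if api_key:
        url += f"&api_key={api_key}"
    return url

url = openfda_label_url("aspirin", limit=2)
```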

Commercial Scientific Publishers

Elsevier (ScienceDirect) — one of the largest publishers of scientific literature. Uses multi-layered protection: Cloudflare, browser fingerprinting, and user behavior analysis. It detects patterns of automated downloading: sequential access to articles, absence of JavaScript, atypical User-Agent strings. Upon detecting scraping, it blocks at the account level and may block an entire institution. Residential proxies with rotation and full browser emulation are mandatory.

Springer Nature — similar protection; it additionally tracks page scrolling speed and mouse movements and uses machine learning to detect bots. It is recommended to fetch no more than 10-15 articles per hour from a single IP, with randomized delays between requests.

Wiley Online Library — less aggressive protection, but still requires the use of proxies. Allows about 50 requests per hour from a single IP without blocking. Uses session cookies to track activity.

Pharmaceutical Databases

DrugBank — a comprehensive database of drugs. The free version is limited to the web interface, while the commercial version provides an API and data dumps. The web version is protected by Cloudflare and rate limiting — a maximum of 20 requests per minute. It detects automation by the absence of cookies and JavaScript.

RxList, Drugs.com — popular consumer drug reference guides. They use Cloudflare and actively fight scraping, blocking data center IPs almost instantly. Residential proxies and a slow request rate (5-10 pages per minute) are required.

Setting Up IP Rotation for Long-Term Scraping

Proper IP address rotation is a key factor in successfully scraping medical data over the long term. There are two main approaches: per-request rotation and time-based rotation.

Request-Level Rotation

In this approach, each request is sent through a new IP address. This minimizes the risk of blocking but can cause issues on sites that track sessions via cookies. It is suitable for scraping lists and catalogs where no session state needs to be maintained.

Most residential proxy providers offer automatic rotation through a special endpoint. For example, when using a rotating proxy endpoint, each new TCP connection receives a new IP. This works automatically with libraries like requests in Python, as a new connection is created for each request by default.
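With such an endpoint, per-request rotation requires no extra logic on your side: just avoid reusing a Session so each call opens a fresh connection. A sketch with a placeholder gateway (host, port, and credentials come from your provider):

```python
import requests

# Placeholder rotating gateway -- substitute your provider's endpoint
ROTATING_PROXY = "http://username:password@rotating.example.com:9000"
proxies = {"http": ROTATING_PROXY, "https": ROTATING_PROXY}

def fetch_with_fresh_ip(url, headers=None):
    """Each bare requests.get() opens a new TCP connection through the
    gateway, so each call typically exits from a different IP."""
    return requests.get(url, proxies=proxies, headers=headers or {}, timeout=30)
```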

Time-Based Rotation (Sticky Sessions)

Sticky sessions let you keep one IP address for a set period (usually 5-30 minutes), after which it changes automatically. This is useful for sites that require authentication or track session state via cookies. You can fetch several pages from one IP, mimicking the behavior of a real user, before the IP changes automatically.

For medical sites, sticky sessions of 10-15 minutes are recommended. In that window you can fetch 10-20 pages (depending on delays), after which the IP changes and you start a "new session." This looks natural and reduces the risk of detection.
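Sticky sessions are usually requested by encoding a session ID (and sometimes a TTL) into the proxy username. The exact format varies by provider; the `-session-…-sessTime-…` pattern below is only an illustrative assumption, so check your provider's documentation:

```python
import random
import string

def sticky_proxy_url(user, password, host, port, ttl_minutes=10):
    """Build a proxy URL with a random session ID embedded in the username.
    Reusing the same ID keeps the same exit IP; generating a new ID
    starts a fresh "session" with a new IP."""
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    # NOTE: the username format is provider-specific (illustrative here)
    return f"http://{user}-session-{session_id}-sessTime-{ttl_minutes}:{password}@{host}:{port}"

proxy_url = sticky_proxy_url("myuser", "mypass", "gate.example.com", 7000)
```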

IP Address Pool Size

For long-term scraping, the size of the available IP address pool matters. If you reuse the same set of 100 IPs over a week, the site may notice the pattern and block all of those addresses. Residential proxies usually provide access to millions of IPs, which practically eliminates reuse of the same address.

When using data center proxies, it is recommended to have a pool of at least 500-1000 IPs for medium-volume scraping (10,000-50,000 pages per month). For large-scale scraping (hundreds of thousands of pages), it is better to use residential proxies with their huge IP pools.

Rotation Tips for Different Sources:

  • PubMed — rotation is not mandatory, one IP is sufficient while adhering to the rate limit
  • Commercial Publishers — sticky sessions of 10-15 minutes, new IP every 15-20 pages
  • Pharmaceutical Databases — rotation on each request or sticky sessions of 5 minutes
  • Sites with Cloudflare — sticky sessions are mandatory, request-level rotation does not work

Python Code Examples for Scraping with Proxies

Let's look at practical examples of configuring proxies for scraping medical data with popular Python libraries. We will start with a basic example and gradually build on it.

Basic Setup with Requests Library

import requests
from time import sleep
import random

# Proxy setup (replace with your data)
PROXY_HOST = "proxy.example.com"
PROXY_PORT = "8080"
PROXY_USER = "username"
PROXY_PASS = "password"

proxies = {
    'http': f'http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}',
    'https': f'http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}'
}

# Headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

# Example request to PubMed
url = "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes"

try:
    response = requests.get(url, proxies=proxies, headers=headers, timeout=30)
    print(f"Status code: {response.status_code}")
    print(f"Content length: {len(response.content)}")
    
    # Adding delay between requests (mandatory for PubMed)
    sleep(random.uniform(1.0, 3.0))
    
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

Advanced Setup with Rotation and Retry Logic

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from time import sleep
import random

class ProxyRotator:
    def __init__(self, proxy_list):
        """
        proxy_list: list of dictionaries with proxies
        [{'http': 'http://user:pass@host:port', 'https': '...'}, ...]
        """
        self.proxy_list = proxy_list
        self.current_index = 0
    
    def get_next_proxy(self):
        """Get the next proxy from the list"""
        proxy = self.proxy_list[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxy_list)
        return proxy

def create_session_with_retries():
    """Create a session with automatic retries on errors"""
    session = requests.Session()
    
    # Setting up automatic retries
    retry_strategy = Retry(
        total=3,  # maximum 3 attempts
        backoff_factor=1,  # delay between attempts: 1, 2, 4 seconds
        status_forcelist=[429, 500, 502, 503, 504],  # codes for retrying
        allowed_methods=["GET", "POST"]
    )
    
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    
    return session

def scrape_with_rotation(urls, proxy_rotator):
    """Scrape a list of URLs with proxy rotation"""
    session = create_session_with_retries()
    results = []
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    
    for url in urls:
        # Get a new proxy for each request
        proxy = proxy_rotator.get_next_proxy()
        
        try:
            response = session.get(
                url, 
                proxies=proxy, 
                headers=headers, 
                timeout=30
            )
            
            if response.status_code == 200:
                results.append({
                    'url': url,
                    'status': 'success',
                    'content_length': len(response.content)
                })
                print(f"✓ Success: {url}")
            else:
                results.append({
                    'url': url,
                    'status': 'failed',
                    'error': f"Status code: {response.status_code}"
                })
                print(f"✗ Failed: {url} (Status: {response.status_code})")
        
        except requests.exceptions.RequestException as e:
            results.append({
                'url': url,
                'status': 'error',
                'error': str(e)
            })
            print(f"✗ Error: {url} ({e})")
        
        # Random delay between requests (important!)
        sleep(random.uniform(2.0, 5.0))
    
    return results

# Example usage
proxy_list = [
    {
        'http': 'http://user1:pass1@proxy1.example.com:8080',
        'https': 'http://user1:pass1@proxy1.example.com:8080'
    },
    {
        'http': 'http://user2:pass2@proxy2.example.com:8080',
        'https': 'http://user2:pass2@proxy2.example.com:8080'
    }
]

rotator = ProxyRotator(proxy_list)

urls_to_scrape = [
    "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes",
    "https://pubmed.ncbi.nlm.nih.gov/?term=cancer",
    "https://pubmed.ncbi.nlm.nih.gov/?term=covid"
]

results = scrape_with_rotation(urls_to_scrape, rotator)

Using Selenium for JavaScript Sites

Many modern medical sites use JavaScript to load content. In such cases, a headless browser is necessary:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def create_proxy_driver(proxy_host, proxy_port, proxy_user, proxy_pass):
    """Create Chrome WebDriver with proxy"""
    
    chrome_options = Options()
    
    # Headless mode (no GUI)
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    
    # Proxy setup
    chrome_options.add_argument(f'--proxy-server=http://{proxy_host}:{proxy_port}')
    
    # Disable automation (important for bypassing detection)
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    
    # User-Agent
    chrome_options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
    
    driver = webdriver.Chrome(options=chrome_options)
    
    # For proxies with authentication, you need to use an extension
    # or configure through capabilities (more complex option)
    
    return driver

def scrape_with_selenium(url, driver):
    """Scrape a page, waiting for JavaScript content to load"""
    
    driver.get(url)
    
    # Wait for the element to load (e.g., search results)
    try:
        wait = WebDriverWait(driver, 10)
        wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, "results-article"))
        )
        
        # Extract data
        articles = driver.find_elements(By.CLASS_NAME, "results-article")
        
        data = []
        for article in articles:
            try:
                title = article.find_element(By.CLASS_NAME, "docsum-title").text
                authors = article.find_element(By.CLASS_NAME, "docsum-authors").text
                
                data.append({
                    'title': title,
                    'authors': authors
                })
            except Exception:  # skip articles missing expected fields
                continue
        
        return data
        
    except Exception as e:
        print(f"Error waiting for elements: {e}")
        return []

# Example usage
proxy_host = "proxy.example.com"
proxy_port = "8080"
proxy_user = "username"
proxy_pass = "password"

driver = create_proxy_driver(proxy_host, proxy_port, proxy_user, proxy_pass)

try:
    url = "https://pubmed.ncbi.nlm.nih.gov/?term=diabetes"
    results = scrape_with_selenium(url, driver)
    
    for result in results:
        print(f"Title: {result['title']}")
        print(f"Authors: {result['authors']}\n")
        
finally:
    driver.quit()
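As noted in the comments above, plain Selenium cannot pass user:password credentials through `--proxy-server`. One common workaround is the third-party selenium-wire package, which accepts an authenticated proxy URL directly; the options shape below follows its documented API, but treat it as an assumption to verify against the package docs:

```python
def build_seleniumwire_options(user, password, host, port):
    """Proxy configuration in the shape expected by selenium-wire
    (assumed from its docs): one authenticated URL for http and https."""
    proxy_url = f"http://{user}:{password}@{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }

def create_authenticated_driver(user, password, host, port):
    # Requires: pip install selenium-wire
    from seleniumwire import webdriver  # imported lazily; optional dependency
    return webdriver.Chrome(
        seleniumwire_options=build_seleniumwire_options(user, password, host, port)
    )

options = build_seleniumwire_options("username", "password", "proxy.example.com", 8080)
```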

Controlling Request Rate and Handling Rate Limits

Rate limiting is one of the main defenses medical sites deploy against scraping. Tuning the request rate correctly is critical for long-term scraping without blocks.

Determining a Safe Speed

The first step is to determine the limits of a specific site. This can be done experimentally by gradually increasing the request speed until 429 (Too Many Requests) errors or blocks occur. For most medical sites, safe values are:

  • PubMed — a maximum of 3 requests per second (official recommendation)
  • ClinicalTrials.gov — 20 requests per minute is safe, up to 100 in 5 minutes is acceptable
  • Commercial Publishers — 10-15 requests per hour from a single IP
  • Pharmaceutical Databases — 5-10 requests per minute

Implementing a Rate Limiter in Python

import time
import requests
from collections import deque

class RateLimiter:
    def __init__(self, max_calls, period):
        """
        max_calls: maximum number of calls
        period: time period in seconds
        For example: RateLimiter(3, 1) = 3 requests per second
        """
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()
    
    def __call__(self, func):
        """Decorator for limiting the rate of function calls"""
        def wrapper(*args, **kwargs):
            now = time.time()
            
            # Remove old calls outside the period
            while self.calls and self.calls[0] < now - self.period:
                self.calls.popleft()
            
            # If the limit is reached, wait until the oldest call leaves the window
            if len(self.calls) >= self.max_calls:
                sleep_time = self.period - (now - self.calls[0])
                if sleep_time > 0:
                    print(f"Rate limit reached, sleeping {sleep_time:.2f}s")
                    time.sleep(sleep_time)
                # Drop only the oldest call; clearing the whole deque here
                # would allow a burst of max_calls immediately after the wait
                self.calls.popleft()
            
            # Record the call time
            self.calls.append(time.time())
            
            # Execute the function
            return func(*args, **kwargs)
        
        return wrapper

# Example usage
@RateLimiter(max_calls=3, period=1)  # 3 requests per second
def fetch_pubmed_page(url):
    response = requests.get(url, headers=headers, proxies=proxies)
    return response

# Now the function automatically adheres to the rate limit
for i in range(10):
    result = fetch_pubmed_page(f"https://pubmed.ncbi.nlm.nih.gov/?term=test&page={i}")
    print(f"Page {i} fetched")

Adaptive Rate Limiting

A more advanced approach is to adaptively change the speed based on server responses. If we receive 429 or 503 errors, we automatically reduce the speed:

import time
import random

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=1.0, max_delay=60.0):
        self.current_delay = initial_delay
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.success_count = 0
    
    def wait(self):
        """Wait before the next request"""
        # Add randomness for naturalness
        actual_delay = self.current_delay * random.uniform(0.8, 1.2)
        time.sleep(actual_delay)
    
    def on_success(self):
        """Called on a successful request"""
        self.success_count += 1
        
        # After 10 successful requests, speed up a bit
        if self.success_count >= 10:
            self.current_delay = max(
                self.initial_delay,
                self.current_delay * 0.9
            )
            self.success_count = 0
    
    def on_rate_limit(self):
        """Called when receiving 429 or similar errors"""
        # Double the delay, but not more than the maximum
        self.current_delay = min(
            self.current_delay * 2,
            self.max_delay
        )
        self.success_count = 0
        print(f"Rate limit hit! Increasing delay to {self.current_delay:.2f}s")
    
    def on_error(self):
        """Called on other errors"""
        # Slightly increase the delay
        self.current_delay = min(
            self.current_delay * 1.5,
            self.max_delay
        )
        self.success_count = 0

# Example usage
limiter = AdaptiveRateLimiter(initial_delay=2.0, max_delay=30.0)

for url in urls_to_scrape:
    limiter.wait()
    
    try:
        response = requests.get(url, proxies=proxies, headers=headers)
        
        if response.status_code == 200:
            limiter.on_success()
            # Process data
            
        elif response.status_code == 429:
            limiter.on_rate_limit()
            # Retry later
            
        else:
            limiter.on_error()
            
    except requests.exceptions.RequestException:
        limiter.on_error()

Correct Headers and User-Agent for Medical Sites

Medical sites analyze HTTP headers to detect bots. Incorrect or missing headers are a common cause of blocks even when using quality proxies.

Mandatory Headers

The minimum set of headers that must be present in each request:

headers = {
    # User-Agent — must be a current browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    
    # Accept — content types accepted by the browser
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    
    # Accept-Language — user language
    'Accept-Language': 'en-US,en;q=0.9',
    
    # Accept-Encoding — support for compression
    'Accept-Encoding': 'gzip, deflate, br',
    
    # Connection — keep the connection alive
    'Connection': 'keep-alive',
    
    # Upgrade-Insecure-Requests — automatic transition to HTTPS
    'Upgrade-Insecure-Requests': '1',
    
    # DNT — Do Not Track (optional but adds realism)
    'DNT': '1',
    
    # Sec-Fetch-* headers (important for modern browsers)
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    
    # Cache-Control
    'Cache-Control': 'max-age=0'
}

User-Agent Rotation

Using the same User-Agent for every request looks suspicious. It is recommended to rotate among several current browsers:

import random

USER_AGENTS = [
    # Chrome on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    
    # Chrome on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    
    # Firefox on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    
    # Firefox on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
    
    # Safari on Mac
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    
    # Edge on Windows
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
]

def get_random_headers():
    """Get headers with a random User-Agent"""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'DNT': '1'
    }

# Usage
for url in urls:
    headers = get_random_headers()
    response = requests.get(url, headers=headers, proxies=proxies)

Referer and Origin for Forms

When working with search forms or sending POST requests, be sure to add the Referer and Origin headers:

# For POST requests to a search form
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://example.com',
    'Referer': 'https://example.com/search',
    'Connection': 'keep-alive'
}

# POST request with form data
data = {
    'query': 'diabetes',
    'page': '1'
}

response = requests.post(
    'https://example.com/search',
    headers=headers,
    data=data,
    proxies=proxies
)

Common Problems and Their Solutions

When scraping medical data, a number of recurring problems come up. Let's look at the most common ones and how to solve them.

Problem: Cloudflare Blocks All Requests

Symptoms: You receive a page with the text "Checking your browser" or a 403 Forbidden error mentioning Cloudflare.

Solution:

  • Use residential proxies instead of data center proxies — Cloudflare flags data center IP ranges by default
  • Switch to Selenium or Puppeteer — headless browsers pass Cloudflare checks better
  • Use the cloudscraper library for Python — it automatically bypasses basic Cloudflare protection
  • Enable cookies and JavaScript — Cloudflare checks for their presence
  • Add TLS fingerprinting — use curl_cffi to mimic a real browser at the TLS level
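The cloudscraper route from the list above can be sketched as follows. `create_scraper()` returns a drop-in replacement for a requests session, so proxies and headers work the same way (the package handles only Cloudflare's basic JavaScript challenge, not the newer managed challenges):

```python
def fetch_behind_cloudflare(url, proxies=None):
    """Fetch a Cloudflare-protected page with cloudscraper
    (pip install cloudscraper)."""
    import cloudscraper  # imported lazily; optional dependency

    scraper = cloudscraper.create_scraper(
        # mimic a desktop Chrome on Windows at the header level
        browser={"browser": "chrome", "platform": "windows", "desktop": True}
    )
    return scraper.get(url, proxies=proxies, timeout=30)
```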

Problem: Receiving 429 Too Many Requests Error

Symptoms: After several successful requests, the server starts returning 429.

Solution:

  • Increase the delay between requests — try starting with 3-5 seconds
  • Enable IP rotation — routing each request through a new IP keeps you under per-IP limits
  • Check the Retry-After header in the 429 response — it indicates how many seconds to wait
  • Use exponential backoff on retries — 1s, 2s, 4s, 8s, and so on
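The last two tips combine naturally into one helper: honor Retry-After when the server provides it, and fall back to exponential backoff with jitter otherwise. A minimal sketch:

```python
import random

def backoff_delay(attempt, response_headers=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).
    A numeric Retry-After header takes priority; otherwise the delay
    doubles each attempt, capped and jittered to avoid fixed patterns."""
    if response_headers:
        retry_after = response_headers.get("Retry-After", "")
        if retry_after.isdigit():
            return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.8, 1.2)
```

Without Retry-After, attempts 0, 1, 2, 3 wait roughly 1s, 2s, 4s, 8s, give or take 20% jitter.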

Problem: Proxies Are Slow or Frequently Disconnect

Symptoms: Timeout errors, very long page load times, connection drops.

Solution:

  • Increase the timeout in requests to 30-60 seconds — residential proxies may be slower
  • Use geographically close proxies — if parsing a European site, use European IPs
  • Check the quality of the proxy provider — cheap proxies are often unstable
  • Add retry logic — automatically retry the request on connection error
  • Use connection pooling — reuse TCP connections via requests.Session()

Problem: The Site Requires Authentication or Subscription

Symptoms: Access to full text articles is restricted, login is required.

Solution:

  • Use institutional access — many universities and hospitals have subscriptions
  • Check for Open Access versions — many articles are available for free through repositories
  • Use APIs instead of parsing — some publishers provide APIs for researchers
  • Parse only metadata (titles, authors, abstracts) — they are usually available for free

Problem: JavaScript Content Does Not Load

Symptoms: The HTML does not contain the required data, only loading spinners or empty containers are visible.

Solution:

  • Switch to Selenium/Puppeteer — they execute JavaScript
  • Find the API endpoint — open DevTools in the browser, go to the Network tab, and find XHR requests with data
  • Use requests-html — a library with...