Parsing job boards is one of the most in-demand scenarios for data collection in HR analytics, labor market monitoring, and recruitment automation. However, job vacancy sites actively protect against automated data collection: they block IPs after 50-100 requests, show CAPTCHAs, and ban suspicious accounts. In this article, we will discuss how to properly set up proxies for stable parsing of hh.ru, Superjob, LinkedIn, and other platforms without blocks.
Why job boards block parsing and how protection works
Job vacancy sites lose money on parsing: data is sold to competitors, aggregators are created without licenses, and employers bypass paid placements. Therefore, all major platforms have implemented multi-layered protection against automated data collection.
Main methods of job board protection:
- Rate limiting by IP — hh.ru blocks IPs after 80-120 requests per hour, Superjob — after 50-70 requests. The block can last from 1 hour to a day.
- Browser fingerprinting — sites analyze User-Agent, HTTP headers, screen resolution, installed fonts. If the data does not match a real browser, the request is blocked.
- JavaScript checks — many sites use Cloudflare or their own scripts to check that the request comes from a real browser, not a bot.
- Honeypot traps — hidden links and fields that are only visible to the parser. If the bot clicks on them, the IP gets blacklisted.
- CAPTCHA for suspicious activity — appears after a series of rapid requests or when using data center IPs.
Without proxies, you can scrape a maximum of 100-200 job vacancies, after which your IP will be banned. For large-scale data collection (thousands of vacancies daily), proxies become an essential tool.
Important: Parsing must comply with the website's terms of use. Many job boards provide official APIs for legal access to data. For example, hh.ru has a free API with a request limit, which is suitable for most tasks.
What type of proxy to choose for parsing job vacancies
The choice of proxy type depends on the scale of parsing, budget, and speed requirements. Let's discuss three main options with specific use cases.
| Proxy Type | Speed | Risk of Ban | When to Use |
|---|---|---|---|
| Data Center | High (50-200 ms) | High | Testing the parser, collecting public data without authorization |
| Residential | Medium (200-800 ms) | Low | Large-scale parsing of hh.ru, Superjob with IP rotation |
| Mobile | Medium (300-1000 ms) | Very Low | Parsing with authorization, bypassing strict LinkedIn protection |
Data Center Proxies for Parsing
This is the fastest and cheapest option, but it comes with limitations. Data center IPs are easily recognized by websites, so they are only suitable for simple tasks: parsing job vacancy lists without authorization, collecting public data, testing the parser before launching on residential proxies.
When data center proxies work:
- Parsing a small volume of data (up to 500 vacancies per day)
- Collecting data from sites without strict protection (small regional job boards)
- Using official APIs with IP rotation to bypass rate limits
- Parsing RSS feeds and XML job vacancy files
For hh.ru and Superjob, data center proxies will work unstably: you will receive a CAPTCHA after 20-30 requests, and many IPs are already on the blacklists of these sites.
Residential Proxies — Optimal Choice for Job Boards
Residential proxies use IP addresses of real home users, so websites perceive them as regular visitors. This is the optimal balance of price and quality for parsing job vacancies.
Advantages for parsing job boards:
- Low risk of blocking — to hh.ru and Superjob, a residential IP looks like a regular home visitor
- Large pool of IP addresses — rotation can be set for each request or every 5-10 minutes
- Geographical targeting — you can parse vacancies from a specific city using IPs from that region
- Stability — one residential IP can handle 200-500 requests without blocking
For large-scale parsing (over 1000 vacancies per day), residential proxies with IP rotation are the standard solution. You set the IP change every 5-10 minutes, add random delays between requests (3-7 seconds), and achieve stable data collection without blocks.
Mobile Proxies for LinkedIn and Parsing with Authorization
Mobile proxies use IPs from mobile operators. Their main advantage is that one IP is used by hundreds of real users simultaneously, so websites cannot block such an address without risking blocking thousands of ordinary visitors.
When mobile proxies are needed:
- Parsing LinkedIn — this platform has the strictest protection against bots and aggressively blocks data center and even residential IPs
- Working with authorization — if you need to parse closed vacancies or profile data, mobile IPs reduce the risk of account bans
- Parsing foreign job boards — Indeed, Glassdoor, Monster use advanced protection systems where mobile IPs work more reliably
- Bypassing strict blocks — if your residential proxies start receiving CAPTCHAs, switching to mobile will solve the problem
The downside of mobile proxies is their high cost and lower speed. But for critical tasks where blocking is unacceptable, this is the best choice.
Features of Parsing hh.ru: Protection and Bypass Methods
hh.ru is the largest Russian job vacancy site with the most advanced protection against parsing among domestic job boards. The site uses a combination of rate limiting, fingerprinting, and behavioral analysis to identify bots.
How hh.ru Protection Works
1. IP address limits: After 80-120 requests per hour from one IP, the site starts showing a CAPTCHA or returns HTTP 429 (Too Many Requests). The block lasts from 1 to 6 hours depending on the aggressiveness of the parsing.
2. User-Agent and header checks: hh.ru analyzes HTTP request headers. If the User-Agent does not match a real browser or standard headers (Accept-Language, Accept-Encoding) are missing, the request is blocked.
3. JavaScript checks: Some hh.ru pages require JavaScript execution to load data. A simple HTTP parser without a headless browser will not be able to retrieve the full content.
4. Honeypot links: There are hidden elements on the pages that only the parser can see. If your script clicks on these links, the IP gets blacklisted for 24 hours.
Strategy for Bypassing hh.ru Protection with Proxies
For stable parsing of hh.ru without blocks, use the following configuration:
Optimal settings for parsing hh.ru:
- Proxy type: Residential with IP rotation every 5-10 minutes
- Delay between requests: 4-8 seconds (random value)
- User-Agent: Rotation of real User-Agents from modern browsers (latest versions of Chrome, Firefox, Safari)
- Headers: Full set of standard browser headers (Accept, Accept-Language, Accept-Encoding, Referer)
- Cookies: Saving and passing cookies between requests within one session
- Request limit: No more than 60-80 requests per IP, after which change the proxy
Example of a safe sequence of actions:
- Connect to a residential proxy with an IP from the desired region (e.g., Moscow)
- Make the first request to the main page of hh.ru, receive and save cookies
- Wait 5-7 seconds (simulate reading the page)
- Make a request to the job search page with the necessary filters
- Parse the list of vacancies (usually 20-50 on the page)
- For each vacancy, make a request to the detailed page with a delay of 4-6 seconds
- After 60-70 requests, change the proxy and repeat the cycle
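The sequence above can be sketched in Python with `requests`. This is a minimal sketch, not a production crawler: the proxy URL, limits, and URLs are placeholders, and `requests.Session` handles the cookie persistence from step 2 automatically.

```python
import random
import time

import requests

# Placeholder: substitute your provider's gateway and credentials
PROXY = "http://user:password@proxy.example.com:8080"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "ru-RU,ru;q=0.9,en;q=0.8",
}
MAX_REQUESTS_PER_IP = 70  # step 7: rotate after 60-70 requests


def should_rotate(requests_made, limit=MAX_REQUESTS_PER_IP):
    """True once the per-IP request budget is spent and the proxy must change."""
    return requests_made >= limit


def crawl_cycle(search_url, vacancy_urls):
    session = requests.Session()  # keeps cookies between requests (step 2)
    session.headers.update(HEADERS)
    session.proxies = {"http": PROXY, "https": PROXY}

    session.get("https://hh.ru/", timeout=30)   # step 2: warm-up, collect cookies
    time.sleep(random.uniform(5, 7))            # step 3: "read" the page

    session.get(search_url, timeout=30)         # step 4: search page with filters
    made = 2

    for url in vacancy_urls:                    # steps 5-6: detailed pages
        if should_rotate(made):
            break                               # step 7: switch proxy, start a new cycle
        session.get(url, timeout=30)
        made += 1
        time.sleep(random.uniform(4, 6))

# Usage (requires a working proxy):
# crawl_cycle("https://hh.ru/search/vacancy?text=python", ["https://hh.ru/vacancy/123"])
```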
With this strategy, you can parse 1000-2000 vacancies per day from one stream without a single block. If you need a larger volume, run several parallel streams with different proxies.
Tip: hh.ru provides a free API for accessing job vacancies. For most tasks (labor market analysis, salary monitoring), the API will be a more stable solution than parsing HTML. Proxies can be used for IP rotation when working with the API to bypass rate limits.
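As a sketch of the API route: the public endpoint `api.hh.ru/vacancies` accepts `text`, `area`, and `per_page` parameters (in the hh.ru region dictionary, `area=1` corresponds to Moscow). The helper names below are illustrative, and the optional proxy argument is only needed when you spread API calls across several IPs to stay under the per-IP rate limit.

```python
import requests

API_URL = "https://api.hh.ru/vacancies"  # public vacancy search endpoint


def build_params(query, area=1, per_page=50, page=0):
    # area=1 is Moscow in the hh.ru region dictionary; per_page is capped at 100
    return {"text": query, "area": area, "per_page": per_page, "page": page}


def fetch_vacancies(query, proxy=None):
    # Proxy is optional here: only useful for rotating IPs against rate limits
    proxies = {"http": proxy, "https": proxy} if proxy else None
    resp = requests.get(API_URL, params=build_params(query),
                        proxies=proxies, timeout=30)
    resp.raise_for_status()
    return resp.json()  # data["found"] holds the total number of matches

# Usage (network call):
# data = fetch_vacancies("python developer")
```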
Parsing Superjob, LinkedIn, and Foreign Platforms
Superjob: Features of Protection
Superjob has less strict protection compared to hh.ru, but still actively fights against parsing. The main differences:
- Lower rate limit: Blocking occurs after 50-70 requests per hour (compared to 80-120 for hh.ru)
- Less strict header checks: A simplified set of headers can be used
- No JavaScript protection: Most data is accessible through a simple HTTP request without a headless browser
- Regional blocking: Some vacancies are only available from IPs of a specific region
For Superjob, residential proxies with rotation every 10-15 minutes and a delay of 3-5 seconds between requests are sufficient. This will allow you to reliably parse 500-1000 vacancies per day.
LinkedIn: The Strictest Protection
LinkedIn is a different story. The platform uses advanced machine learning algorithms to identify bots and has one of the most aggressive protection systems among all social networks and job boards.
Features of LinkedIn Protection:
- Mandatory authorization: Most data is only available to authorized users
- Behavioral analysis: LinkedIn analyzes action patterns: scrolling speed, mouse movements, time on the page
- Account blocking: In case of suspicious activity, not only the IP but also the account itself gets blocked
- Profile view limits: Free accounts can view a limited number of profiles per month
- Mandatory JavaScript execution: Parsing is impossible without a headless browser
LinkedIn Parsing Strategy:
- Use mobile proxies — they provide the lowest risk of blocking. One mobile IP can be used for 100-200 profile views per day.
- Headless browser is mandatory — use Puppeteer or Playwright with a real browser fingerprint setup (screen resolution, WebGL, Canvas).
- Slow parsing speed — no more than 20-30 profiles per hour from one account. Add delays of 10-20 seconds between views.
- Simulate real behavior — scrolling the page, random clicks, transitions between profile sections.
- Warm up accounts — new LinkedIn accounts cannot be used for parsing immediately. You need to simulate the activity of a regular user for 1-2 weeks.
- Account rotation — use multiple accounts with different proxies to distribute the load.
Parsing LinkedIn is the most challenging task among all job boards. If you need data from this platform, consider using the official Sales Navigator API or third-party services that provide data legally.
Foreign Job Boards: Indeed, Glassdoor, Monster
Foreign platforms usually have stricter protection than Russian sites (except for hh.ru). Main features:
- Indeed — uses Cloudflare with JavaScript checks. A headless browser and residential/mobile proxies from the country whose vacancies you are parsing are needed.
- Glassdoor — requires authorization to view most data. Actively blocks data center IPs. Use residential proxies and slow parsing speed (delay of 8-12 seconds).
- Monster — has an API for partners, but for HTML parsing, residential proxies with geographical targeting to the desired country are needed.
For all foreign platforms, geographical targeting of proxies is critically important. If you are parsing vacancies in the USA, use American residential IPs. Requests from IPs in other countries may raise suspicion and lead to blocking.
Setting Up IP Rotation and Delays Between Requests
Properly setting up proxy rotation is key to stable parsing without blocks. Let's discuss two main strategies: rotation for each request and time-based rotation.
Rotation for Each Request (Rotating Proxies)
In this approach, each HTTP request comes from a new IP address. This is the safest method, but it has limitations:
Advantages:
- It is impossible to track the activity of a single IP
- You can make more requests in a unit of time
- No need to track limits for each IP
Disadvantages:
- It is impossible to maintain a session (cookies are lost when changing IP)
- Not suitable for parsing with authorization
- Some sites block requests if the IP changes too frequently
Rotation for each request is suitable for parsing public pages of hh.ru and Superjob without authorization. It is configured through the proxy provider's parameter (usually a special endpoint with automatic rotation).
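In code, per-request rotation is trivial on the client side, because the rotation happens on the provider's gateway. A minimal sketch, assuming a hypothetical rotating endpoint (the hostname and credentials are placeholders):

```python
import requests

# Hypothetical rotating gateway: the provider gives one endpoint and swaps
# the exit IP behind it on every new connection
ROTATING_ENDPOINT = "http://user:password@rotate.example.com:8000"


def build_proxies(endpoint=ROTATING_ENDPOINT):
    # requests expects the same proxy URL for both schemes
    return {"http": endpoint, "https": endpoint}


def fetch(url):
    # A fresh connection per call means a fresh exit IP per request
    return requests.get(url, proxies=build_proxies(), timeout=30)

# Usage (network call):
# html = fetch("https://hh.ru/search/vacancy?text=python").text
```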
Time-Based Rotation (Sticky Sessions)
In this approach, one IP is used for a certain period (5-30 minutes), after which it is automatically changed. This is the optimal option for most job board parsing tasks.
Recommended rotation intervals:
| Site | Rotation Interval | Max Requests per IP | Delay Between Requests |
|---|---|---|---|
| hh.ru | 5-10 minutes | 60-80 | 4-8 seconds |
| Superjob | 10-15 minutes | 50-70 | 3-5 seconds |
| LinkedIn | 30-60 minutes | 20-40 | 10-20 seconds |
| Indeed | 10-20 minutes | 40-60 | 5-10 seconds |
| Glassdoor | 15-30 minutes | 30-50 | 8-12 seconds |
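Sticky rotation can be driven by a small timer on the client side. The sketch below is an assumption-heavy illustration: many providers encode the sticky session in the proxy username (for example `user-session-<id>`), but the exact scheme is provider-specific, so treat `proxy_url` as a template rather than a working format.

```python
import time


class StickyRotator:
    """Switch to a new sticky proxy session after a fixed interval."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock          # injectable clock, handy for testing
        self.session_no = 0
        self.started = clock()

    def session_id(self):
        # When the interval expires, bump the session number: a new id
        # tells the gateway to hand out a different sticky IP
        if self.clock() - self.started >= self.interval_s:
            self.session_no += 1
            self.started = self.clock()
        return self.session_no

    def proxy_url(self, user, password, host, port):
        # Provider-specific convention, shown here only as an assumption
        return f"http://{user}-session-{self.session_id()}:{password}@{host}:{port}"

# For hh.ru the table above suggests a 5-10 minute interval:
# rotator = StickyRotator(interval_s=7 * 60)
```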
Setting Up Random Delays
A fixed delay between requests (e.g., exactly 5 seconds) looks suspicious to protection systems. A real user cannot act with such precision. Always use random delays within a range.
Examples of implementing random delays:
```python
import time
import random

# Delay from 4 to 8 seconds
delay = random.uniform(4, 8)
time.sleep(delay)

# More complex logic: sometimes make a long pause
if random.random() < 0.1:  # 10% probability
    time.sleep(random.uniform(15, 30))  # simulating user distraction
else:
    time.sleep(random.uniform(4, 8))
```
```javascript
// JavaScript / Node.js
const sleep = (min, max) => {
  const delay = Math.random() * (max - min) + min;
  return new Promise(resolve => setTimeout(resolve, delay * 1000));
};

// Usage
await sleep(4, 8); // delay of 4-8 seconds

// With a probability of a long pause
if (Math.random() < 0.1) {
  await sleep(15, 30); // 10% probability of a long pause
} else {
  await sleep(4, 8);
}
```
Adding random long pauses (15-30 seconds) with a probability of 5-10% makes the parser's behavior even more similar to that of a real user, who might get distracted by a phone call or another task.
Handling CAPTCHA and Other Blocks
Even with proper proxy and delay settings, you may encounter CAPTCHA or other types of blocks. Let's discuss how to properly respond to these situations.
Types of Job Board Blocks
1. HTTP 429 Too Many Requests — the most common type of block. The site clearly indicates that you have exceeded the request limit. Usually, the response header contains Retry-After, which indicates how many seconds you can wait before retrying the request.
How to handle: Immediately change the proxy and add the current IP to the blacklist for the time specified in Retry-After (usually 1-6 hours). If Retry-After is absent, add the IP to the blacklist for 2 hours.
2. HTTP 403 Forbidden — the IP is blocked at the server level. This is a more serious block that can last from several hours to a day.
How to handle: Change the proxy and add the IP to a long-term blacklist (24 hours). Analyze logs: you may be parsing too aggressively or using data center IPs where residential ones are needed.
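The 429/403 handling described above boils down to a per-proxy blacklist with an expiry time. A minimal sketch using the durations from the text (2 hours for a 429 without Retry-After, 24 hours for a 403); the class and method names are illustrative:

```python
import time

# Blacklist durations from the text: 2 h for 429 without Retry-After, 24 h for 403
DEFAULT_BAN_SECONDS = {429: 2 * 3600, 403: 24 * 3600}


class ProxyBlacklist:
    """Track which proxies are temporarily banned and until when."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.banned_until = {}

    def ban(self, proxy, status, retry_after=None):
        # Prefer the server's Retry-After header (seconds) when present
        seconds = int(retry_after) if retry_after else DEFAULT_BAN_SECONDS.get(status, 3600)
        self.banned_until[proxy] = self.clock() + seconds

    def is_banned(self, proxy):
        return self.clock() < self.banned_until.get(proxy, 0.0)

# Typical use after a request:
# if resp.status_code in (429, 403):
#     blacklist.ban(proxy, resp.status_code, resp.headers.get("Retry-After"))
```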
3. CAPTCHA — the site shows a "I'm not a robot" check. This means that your behavior appeared suspicious, but the IP has not yet been completely blocked.
How to handle: There are three options:
- Change the proxy — the simplest way. The current IP is added to the blacklist for 6-12 hours.
- Automatic CAPTCHA solving — using services like 2Captcha, Anti-Captcha, CapSolver. They cost $1-3 for 1000 solutions.
- Manual solving — if parsing is not time-critical, you can send the CAPTCHA for manual solving by an operator.
4. Cloudflare Challenge — a JavaScript check that requires code execution in the browser. A regular HTTP library will not pass this check.
How to handle: Use a headless browser (Puppeteer, Playwright, Selenium) with a real fingerprint setup. Libraries like puppeteer-extra-plugin-stealth help bypass headless mode detection.
Integration of CAPTCHA Solving Services
If you decide to solve CAPTCHA automatically, here is an example of integration with the popular service 2Captcha:
```python
# Python using the 2captcha-python library
from twocaptcha import TwoCaptcha
import requests

solver = TwoCaptcha('YOUR_API_KEY')

try:
    # Solve reCAPTCHA v2
    result = solver.recaptcha(
        sitekey='6Le-wvkSAAAAAPBMRTvw0Q4Muexq9bi0DJwx_mJ-',
        url='https://hh.ru/search/vacancy',
        proxy={
            'type': 'HTTPS',
            'uri': 'login:password@ip:port'
        }
    )

    # Get the solution token
    captcha_token = result['code']

    # Submit the request with the token
    response = requests.post(
        'https://hh.ru/search/vacancy',
        data={
            'g-recaptcha-response': captcha_token,
            # other form parameters
        },
        proxies={
            'http': 'http://login:password@ip:port',
            'https': 'http://login:password@ip:port'
        }
    )
except Exception as e:
    print(f'Error solving CAPTCHA: {e}')
```
Solving one CAPTCHA takes 10-30 seconds and costs about $0.001-0.003. For large-scale parsing, this can be expensive, so it is better to set up parsing to minimize CAPTCHA occurrences.
Monitoring and Alerting System
For stable parser operation, it is important to set up monitoring of blocks and automatic alerts:
What to monitor:
- Percentage of successful requests — if it drops below 90%, check the proxies and settings
- Number of CAPTCHAs per hour — if more than 5-10, you are parsing too aggressively
- Average response speed of proxies — if it sharply increases, proxies may be overloaded
- Number of 429/403 errors — an indicator of proxy quality and correctness of settings
- List of blocked IPs — if the same IP is constantly blocked, exclude it from the pool
Set up notifications (Telegram, email, Slack) if the percentage of successful requests falls below a threshold. This will allow you to quickly respond to problems and avoid losing parsing time.
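The success-rate check can be kept as a couple of rolling counters. A minimal sketch of the metrics listed above (the class name is illustrative; wiring the alert to Telegram or Slack is left out):

```python
class ParserMonitor:
    """Rolling counters for the health metrics listed above."""

    def __init__(self, success_threshold=0.90, min_requests=20):
        self.success_threshold = success_threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self.ok = 0
        self.total = 0
        self.captchas = 0

    def record(self, status, captcha=False):
        self.total += 1
        if status == 200 and not captcha:
            self.ok += 1
        if captcha:
            self.captchas += 1

    def success_rate(self):
        return self.ok / self.total if self.total else 1.0

    def should_alert(self):
        # Fire the Telegram/email/Slack notification from here
        return (self.total >= self.min_requests
                and self.success_rate() < self.success_threshold)
```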
Setting Up Proxies in Popular Parsing Tools
Let's discuss how to set up proxies in the most popular tools for parsing job boards: Python (requests, Scrapy), Node.js (axios, Puppeteer), and ready-made solutions.
Python: requests and Scrapy
Python is the most popular language for parsing due to libraries like requests, BeautifulSoup, and Scrapy.
Example with the requests library:
```python
import requests
import random
import time

# List of proxies (get from your provider; hosts and credentials are placeholders)
PROXIES = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
    'http://user:password@proxy3.example.com:8080',
]

# List of full User-Agents for rotation
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]


def parse_vacancy(url):
    proxy = random.choice(PROXIES)
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'ru-RU,ru;q=0.9,en;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    proxies = {'http': proxy, 'https': proxy}
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if response.status_code == 200:
            return response.text
        elif response.status_code == 429:
            print(f'Rate limit for {proxy}, changing proxy')
            # Temporarily remove the proxy from the list
            return None
        else:
            print(f'Error {response.status_code}')
            return None
    except requests.RequestException as e:
        print(f'Request error: {e}')
        return None


# Usage
for i in range(100):
    html = parse_vacancy('https://hh.ru/vacancy/123456')
    if html:
        # Process data
        pass
    # Random delay
    time.sleep(random.uniform(4, 8))
```
Example of Scrapy setup:
```python
# settings.py

# Enable proxy support
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'scrapy_rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'scrapy_rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# List of proxies (hosts and credentials are placeholders)
ROTATING_PROXY_LIST = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
    'http://user:password@proxy3.example.com:8080',
]

# Automatic ban detection
ROTATING_PROXY_BAN_POLICY = 'scrapy_rotating_proxies.policy.BanDetectionPolicy'

# Delay between requests
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True  # random delay of ±50%

# User-Agent rotation
DOWNLOADER_MIDDLEWARES.update({
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
})

# Maximum concurrent requests
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```
Node.js: Puppeteer with Proxies
For parsing sites with JavaScript (LinkedIn, Indeed), a headless browser is needed. Puppeteer is the most popular solution for Node.js.
```javascript
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Plugin to bypass headless browser detection
puppeteer.use(StealthPlugin());

async function parseWithProxy() {
  // Chromium ignores credentials embedded in --proxy-server, so pass only
  // host:port here and authenticate separately via page.authenticate()
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--proxy-server=proxy1.example.com:8080',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-blink-features=AutomationControlled'
    ]
  });
  const page = await browser.newPage();

  // Proxy credentials (placeholders)
  await page.authenticate({ username: 'user', password: 'password' });

  // Set a real, full User-Agent
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
  );

  // Continue with your parsing logic...

  await browser.close();
}
```