```html

7 Proven Methods for Getting Past Cloudflare Detection When Working Through a Proxy

Cloudflare handles more than 20% of all web traffic and uses a multi-layered bot protection system. When you work through a proxy server, the likelihood of hitting a captcha or a block rises sharply. This guide breaks down the technical side of detection and the practical bypass methods that work as of 2024.

How Cloudflare identifies proxies and bots

Cloudflare uses a comprehensive analysis system that checks dozens of parameters of every request. Understanding the detection mechanisms is the first step toward getting past the protection.

Main detection methods

TLS fingerprinting: Cloudflare analyzes the parameters of the SSL/TLS handshake (cipher suites, extensions, and their order). Each HTTP client has a distinctive "fingerprint". For example, Python requests uses OpenSSL with a characteristic cipher set that is easy to tell apart from Chrome or Firefox.

When analyzing a request, Cloudflare compares the TLS fingerprint against the declared User-Agent. If you claim Chrome 120 but the TLS parameters match Python requests, the request is flagged as a bot immediately.

Check	What is analyzed	Detection risk
TLS fingerprint	Cipher suites, extensions, TLS version	High
HTTP/2 fingerprint	Header order, SETTINGS frames	High
IP reputation	IP history, datacenter ownership	Medium
JavaScript challenge	JS execution, canvas fingerprint, WebGL	High
Behavioral analysis	Request patterns, timing, mouse movement	Medium

Since 2023, Cloudflare has made heavy use of machine learning to analyze behavioral patterns. The system tracks not only technical parameters but also the intervals between requests, the order of user actions, mouse movement, and scrolling.

Masking the TLS fingerprint

TLS fingerprinting is one of the most effective bot detection methods. Standard HTTP clients (requests, curl, axios) produce fingerprints that cannot be mistaken for a real browser's. The solution is to use specialized libraries that imitate a browser's TLS behavior.

Using curl-impersonate

curl-impersonate is a modified build of curl that reproduces the TLS and HTTP/2 fingerprints of popular browsers. It supports Chrome, Firefox, Safari, and Edge.

# Build curl-impersonate (Chrome flavor)
git clone https://github.com/lwthiker/curl-impersonate
cd curl-impersonate
mkdir build && cd build
../configure
make chrome-build

# Use it with Chrome 120 impersonation
# (the available curl_chromeNNN wrapper scripts depend on the version you built)
curl_chrome120 -x http://username:password@proxy.example.com:8080 \
  -H "Accept-Language: en-US,en;q=0.9" \
  https://example.com

Python: the tls-client library

For Python there is the tls-client wrapper, which wraps a Go TLS client library (bogdanfinn/tls-client) and offers an interface similar to requests.

import tls_client

# Create a session with a Chrome 120 fingerprint
session = tls_client.Session(
    client_identifier="chrome_120",
    random_tls_extension_order=True
)

# Proxy configuration: tls-client reads the proxy from the session
# (or from a per-request `proxy=` argument), not from `proxies=`
session.proxies = {
    'http': 'http://username:password@proxy.example.com:8080',
    'https': 'http://username:password@proxy.example.com:8080'
}

# Make the request; the User-Agent must match client_identifier
response = session.get(
    'https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }
)

print(response.status_code)

Important: when using tls-client, the User-Agent in the headers must match the chosen client_identifier. A mismatch leads to immediate detection.

Checking the TLS fingerprint

Before you start scraping, check your TLS fingerprint. Use a service such as tls.peet.ws for the analysis (ja3er.com was another common choice, but it may no longer be online).

# Check the fingerprint
response = session.get('https://tls.peet.ws/api/all')
print(response.json()['tls']['ja3'])

# Compare with a real Chrome fingerprint:
# https://kawayiyi.com/tls-fingerprint-database/

Configuring HTTP headers correctly

Even with a correct TLS fingerprint, incorrect HTTP headers will give a bot away. Cloudflare checks not only which headers are present but also their order, value formats, and logical consistency.

Required headers for Chrome

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Sec-Ch-Ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Ch-Ua-Platform': '"Windows"',
    'Cache-Control': 'max-age=0'
}

The Sec-Ch-Ua-* headers appeared in Chrome 89 and are part of the User-Agent Client Hints API. Their absence alongside a modern Chrome User-Agent is a clear bot signal.

Header order matters

In HTTP/2, each browser sends its headers in a fixed, characteristic order. Python requests and other standard clients send headers in their own order, which differs from any browser's (and requests speaks only HTTP/1.1, so it cannot match an HTTP/2 fingerprint at all). Use libraries that support a custom header order.

Tip: use the browser's DevTools (Network tab → right-click the request → Copy → Copy as cURL) to get an exact copy of the headers from a real browser. Then adapt them to your code.

Generating User-Agents dynamically

Using the same User-Agent for every request increases the risk of detection. Build a pool of current User-Agents and rotate through them.

import random

# Pool of User-Agents (Chrome 120 / Firefox 121 generation; refresh it regularly)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
]

def get_random_headers():
    ua = random.choice(USER_AGENTS)

    # Adjust the remaining headers to the chosen UA
    if 'Chrome' in ua:
        return {
            'User-Agent': ua,
            'Sec-Ch-Ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
            # ... other Chrome headers
        }
    elif 'Firefox' in ua:
        return {
            'User-Agent': ua,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            # ... Firefox headers
        }
    # Safari and any other browser: send at least the matching User-Agent
    # instead of falling through and returning None
    return {'User-Agent': ua}

Using headless browsers

When Cloudflare serves a JavaScript challenge or applies advanced detection, the only dependable way through is a real browser. Headless browsers handle JavaScript and cookies automatically and produce a fingerprint much closer to a real user's.

Playwright with anti-detection patches

Playwright is a modern alternative to Selenium with better performance. However, stock Playwright is easily detected through navigator.webdriver and other signals. Use playwright-stealth to mask them.

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def bypass_cloudflare(url, proxy):
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": f"http://{proxy['host']}:{proxy['port']}",
                "username": proxy['username'],
                "password": proxy['password']
            },
            args=[
                '--disable-blink-features=AutomationControlled',
                '--disable-dev-shm-usage',
                '--no-sandbox'
            ]
        )
        
        context = browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            locale='en-US',
            timezone_id='America/New_York'
        )
        
        page = context.new_page()
        stealth_sync(page)  # Apply the anti-detection patches
        
        # Navigate to the page
        page.goto(url, wait_until='networkidle', timeout=30000)
        
        # Wait for the Cloudflare challenge to clear (usually 5-10 seconds)
        page.wait_for_timeout(8000)
        
        # Check whether the challenge was passed
        html = page.content()
        if 'Just a moment' in html:
            print('Cloudflare challenge not passed')
            browser.close()
            return None
        
        # Extract cookies for later use
        cookies = context.cookies()
        
        browser.close()
        return {'html': html, 'cookies': cookies}

# Usage
proxy_config = {
    'host': 'proxy.example.com',
    'port': 8080,
    'username': 'user',
    'password': 'pass'
}

result = bypass_cloudflare('https://example.com', proxy_config)

Puppeteer Extra with plugins

In the Node.js ecosystem, the go-to option is puppeteer-extra with puppeteer-extra-plugin-stealth. The plugin automatically applies more than a dozen evasion techniques.

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

async function bypassCloudflare(url, proxyUrl) {
    // Chromium ignores credentials embedded in --proxy-server,
    // so split them out and pass them through page.authenticate()
    const proxy = new URL(proxyUrl);

    const browser = await puppeteer.launch({
        headless: 'new',
        args: [
            `--proxy-server=${proxy.protocol}//${proxy.host}`,
            '--disable-blink-features=AutomationControlled',
            '--window-size=1920,1080'
        ]
    });

    try {
        const page = await browser.newPage();

        if (proxy.username) {
            await page.authenticate({
                username: decodeURIComponent(proxy.username),
                password: decodeURIComponent(proxy.password)
            });
        }

        // Set the viewport and user agent
        await page.setViewport({ width: 1920, height: 1080 });
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');

        // Override navigator.webdriver
        await page.evaluateOnNewDocument(() => {
            delete Object.getPrototypeOf(navigator).webdriver;
        });

        // Navigate to the page
        await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

        // Wait for the challenge to clear
        // (page.waitForTimeout was removed in newer Puppeteer releases)
        await new Promise(resolve => setTimeout(resolve, 8000));

        // Get the content and cookies
        const content = await page.content();
        const cookies = await page.cookies();

        return { content, cookies };
    } finally {
        await browser.close();
    }
}

// Usage example
bypassCloudflare('https://example.com', 'http://user:pass@proxy.example.com:8080')
    .then(result => console.log('Success'))
    .catch(err => console.error(err));

Performance: headless browsers consume significantly more resources (200-500 MB of RAM per session). For high-load tasks, use them only to obtain cookies, then switch to HTTP clients that reuse those cookies.

Choosing a proxy type for Cloudflare

The proxy type strongly affects the pass rate. Cloudflare maintains a database of datacenter IPs and applies stricter checks to them.

Proxy type	Pass rate	Speed	Cost	Recommendation
Datacenter	30-40%	High	Low	Only with headless browsers
Residential	85-95%	Medium	High	Best overall choice
Mobile	90-98%	Medium	Very high	For critical tasks
ISP (Static Residential)	80-90%	High	Medium	Balance of price and quality

Why residential proxies are more effective

Residential proxies use the IP addresses of real devices (home routers, smartphones). Cloudflare cannot block these IPs in bulk, because that would also block ordinary users. Reported figures suggest that residential IPs see captchas 15-20 times less often than datacenter IPs.

Geolocation matters a great deal when working with residential proxies. If the target site is aimed at the United States, using proxies from Asia raises suspicion. Choose providers with broad coverage and city-level targeting.

Mobile proxies for maximum reliability

Mobile proxies use the IP addresses of mobile carriers (4G/5G). Mobile networks reassign IPs dynamically (for example, after airplane mode is toggled), which provides an almost unlimited supply of clean IPs. The chance of a mobile IP being blocked is close to zero.

# Example: rotating a mobile IP through the provider's API
# (the /rotate endpoint is provider-specific)
import requests
import time

def rotate_mobile_ip(proxy_api_url):
    """Change the mobile proxy's IP"""
    response = requests.get(f"{proxy_api_url}/rotate", timeout=30)
    if response.status_code == 200:
        print("IP changed successfully")
        time.sleep(5)  # Wait for the change to take effect
        return True
    return False

# Usage with a mobile proxy
mobile_proxy = "http://user:pass@mobile.proxy.com:8080"

for i in range(10):
    # Make the request
    response = requests.get(
        'https://example.com',
        proxies={'http': mobile_proxy, 'https': mobile_proxy},
        timeout=30
    )
    
    # Rotate the IP after every request
    rotate_mobile_ip('https://api.proxy.com/mobile')

Managing cookies and sessions

After a challenge is passed, the server sets cookies (cf_clearance, __cf_bm, and others) that confirm the client's legitimacy; the older __cfduid cookie was retired by Cloudflare in 2021. Managing these cookies correctly avoids repeated checks.

Extracting and reusing cf_clearance

The cf_clearance cookie is typically valid for 30-60 minutes (the site owner configures the exact lifetime). Once obtained through a headless browser, it can be used in ordinary HTTP requests, provided they come from the same IP and send the same User-Agent.

import requests
import pickle
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# Must equal the user_agent passed to browser.new_context() in bypass_cloudflare():
# cf_clearance is only honored together with the User-Agent that earned it
BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')

class CloudflareCookieManager:
    def __init__(self, cookie_file='cf_cookies.pkl'):
        self.cookie_file = cookie_file
        self.cookies = self.load_cookies()
    
    def load_cookies(self):
        """Load saved cookies"""
        try:
            with open(self.cookie_file, 'rb') as f:
                data = pickle.load(f)
                # Check expiry
                if data['expires'] > datetime.now():
                    return data['cookies']
        except FileNotFoundError:
            pass
        return None
    
    def save_cookies(self, cookies, ttl_minutes=30):
        """Save cookies with a TTL"""
        data = {
            'cookies': cookies,
            'expires': datetime.now() + timedelta(minutes=ttl_minutes)
        }
        with open(self.cookie_file, 'wb') as f:
            pickle.dump(data, f)
    
    def get_cf_clearance(self, url, proxy):
        """Obtain cf_clearance through the browser"""
        if self.cookies:
            return self.cookies
        
        # bypass_cloudflare() from the previous section expects a dict,
        # so split the proxy URL into its parts
        parts = urlsplit(proxy)
        result = bypass_cloudflare(url, {
            'host': parts.hostname,
            'port': parts.port,
            'username': parts.username,
            'password': parts.password
        })
        if result is None:
            raise RuntimeError('Cloudflare challenge not passed')
        
        # Convert to the requests format
        cookies_dict = {c['name']: c['value'] for c in result['cookies']}
        self.save_cookies(cookies_dict)
        self.cookies = cookies_dict
        
        return cookies_dict
    
    def make_request(self, url, proxy, retries=1):
        """Request with automatic cookie management"""
        cookies = self.get_cf_clearance(url, proxy)
        
        response = requests.get(
            url,
            cookies=cookies,
            proxies={'http': proxy, 'https': proxy},
            headers={'User-Agent': BROWSER_UA}
        )
        
        # Challenged again: refresh the cookies, at most `retries` times
        if response.status_code == 403 or 'cf-browser-verification' in response.text:
            if retries <= 0:
                return response
            print("Cookies expired, fetching new ones...")
            self.cookies = None
            return self.make_request(url, proxy, retries - 1)
        
        return response

# Usage
manager = CloudflareCookieManager()
response = manager.make_request(
    'https://example.com/api/data',
    'http://user:pass@proxy.example.com:8080'
)

Binding cookies to an IP address

Cloudflare ties cf_clearance to the IP address (and the User-Agent) that passed the challenge. Using the cookie from a different IP results in a block. When working with rotating proxies, store a separate set of cookies for each IP.

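import hashlib
from datetime import datetime, timedelta

class IPBoundCookieManager:
    def __init__(self):
        self.cookies_by_ip = {}
    
    def get_ip_hash(self, proxy_url):
        """Build a hash that identifies the proxy"""
        # Keyed on the proxy URL: with a rotating gateway, one URL can map
        # to many exit IPs, so this only works for static (sticky) endpoints
        return hashlib.md5(proxy_url.encode()).hexdigest()
    
    def fetch_cookies_with_browser(self, target_url, proxy_url):
        """Solve the challenge in a browser; plug in bypass_cloudflare() or similar"""
        raise NotImplementedError
    
    def get_cookies_for_proxy(self, proxy_url, target_url):
        """Get the cookies for a specific proxy"""
        ip_hash = self.get_ip_hash(proxy_url)
        
        if ip_hash in self.cookies_by_ip:
            cookies_data = self.cookies_by_ip[ip_hash]
            if cookies_data['expires'] > datetime.now():
                return cookies_data['cookies']
        
        # Obtain fresh cookies through the browser
        new_cookies = self.fetch_cookies_with_browser(target_url, proxy_url)
        
        self.cookies_by_ip[ip_hash] = {
            'cookies': new_cookies,
            'expires': datetime.now() + timedelta(minutes=30)
        }
        
        return new_cookies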

Proxy rotation and request rate control

Even with the right technical setup, too many requests from one IP will trigger rate limiting. Cloudflare analyzes traffic patterns and detects abnormal activity.

Proxy rotation strategies

There are three main approaches to rotation: round-robin (sequential), random, and sticky sessions (pinned to a session). For Cloudflare, sticky sessions with a per-IP request limit work best. The rotator below enforces the per-IP limit and a cooldown, and picks the least-used available proxy on each call, so pin a proxy to a session yourself if you need strict stickiness.

import time
import random
import requests
from collections import defaultdict
from datetime import datetime, timedelta

class SmartProxyRotator:
    def __init__(self, proxy_list, max_requests_per_ip=20, cooldown_minutes=10):
        self.proxy_list = proxy_list
        self.max_requests_per_ip = max_requests_per_ip
        self.cooldown_minutes = cooldown_minutes
        
        # Usage counters
        self.usage_count = defaultdict(int)
        self.last_used = {}
        self.cooldown_until = {}
    
    def get_proxy(self):
        """Return the next available proxy"""
        available_proxies = []
        
        for proxy in self.proxy_list:
            # Check the cooldown
            if proxy in self.cooldown_until:
                if datetime.now() < self.cooldown_until[proxy]:
                    continue
                else:
                    # Reset the counter once the cooldown is over
                    self.usage_count[proxy] = 0
                    del self.cooldown_until[proxy]
            
            # Check the request limit
            if self.usage_count[proxy] < self.max_requests_per_ip:
                available_proxies.append(proxy)
        
        if not available_proxies:
            # Every proxy is cooling down: wait for the earliest one
            # (clamped at zero so time.sleep never gets a negative value)
            wait_time = max(0.0, min(
                (self.cooldown_until[p] - datetime.now()).total_seconds()
                for p in self.cooldown_until
            ))
            print(f"All proxies are on cooldown. Waiting {wait_time:.0f} seconds...")
            time.sleep(wait_time + 1)
            return self.get_proxy()
        
        # Pick the least-used proxy
        proxy = min(available_proxies, key=lambda p: self.usage_count[p])
        
        self.usage_count[proxy] += 1
        self.last_used[proxy] = datetime.now()
        
        # Start the cooldown when the limit is reached
        if self.usage_count[proxy] >= self.max_requests_per_ip:
            self.cooldown_until[proxy] = datetime.now() + timedelta(
                minutes=self.cooldown_minutes
            )
            print(f"Proxy {proxy} reached its limit. Cooldown: {self.cooldown_minutes} minutes.")
        
        return proxy
    
    def add_delay(self):
        """Sleep for a random interval between requests (imitates a human)"""
        delay = random.uniform(2, 5)  # 2-5 seconds
        time.sleep(delay)

# Usage
proxy_pool = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
    # ... up to 50-100 proxies for stable operation
]

rotator = SmartProxyRotator(
    proxy_pool,
    max_requests_per_ip=15,  # Conservative value
    cooldown_minutes=15
)

# Make the requests
for i in range(1000):
    proxy = rotator.get_proxy()
    
    response = requests.get(
        'https://example.com/page',
        proxies={'http': proxy, 'https': proxy},
        headers=get_random_headers()
    )
    
    print(f"Request {i+1}: {response.status_code}")
    rotator.add_delay()

Adaptive rate limiting

A more advanced approach is to adjust the request rate dynamically based on the server's responses. If 429 errors or captchas start to appear, slow down automatically.

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=3.0):
        self.delay = initial_delay
        self.min_delay = 1.0
        self.max_delay = 30.0
        self.success_streak = 0
        self.failure_streak = 0
    
    def on_success(self):
        """Successful request: the rate can go up"""
        self.success_streak += 1
        self.failure_streak = 0
        
        if self.success_streak >= 10:
            # Reduce the delay by 10%
            self.delay = max(self.min_delay, self.delay * 0.9)
            self.success_streak = 0
    
    def on_failure(self, status_code):
        """Error: slow down"""
        self.failure_streak += 1
        self.success_streak = 0
        
        if status_code == 429:  # Rate limited
            # Back off aggressively
            self.delay = min(self.max_delay, self.delay * 2.0)
        elif status_code == 403:  # Possibly blocked
            self.delay = min(self.max_delay, self.delay * 1.5)
        
        print(f"Delay is now {self.delay:.2f}s")
    
    def wait(self):
        """Wait before the next request"""
        # Add ±20% jitter
        actual_delay = self.delay * random.uniform(0.8, 1.2)
        time.sleep(actual_delay)

Ready-made tools and libraries

Building your own solution from scratch takes time and expertise. Ready-made tools exist that automate the process of getting past Cloudflare.

cloudscraper (Python)

cloudscraper is a layer on top of requests that solves JavaScript challenges automatically. It works against basic protection but may not cope with the more advanced checks.

import cloudscraper

# Create the scraper with a Chrome-on-Windows browser profile
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)

# Proxy configuration
proxies = {
    'http': 'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080'
}

# Make the request
response = scraper.get('https://example.com', proxies=proxies)

if response.status_code == 200:
    print("Passed successfully")
    print(response.text)
else:
    print(f"Error: {response.status_code}")

FlareSolverr (language-agnostic)

FlareSolverr is a proxy server that runs a headless browser to solve Cloudflare challenges. It is driven through an HTTP API, so it works with any programming language.

# Start FlareSolverr with Docker
docker run -d \
  --name=flaresolverr \
  -p 8191:8191 \
  -e LOG_LEVEL=info \
  ghcr.io/flaresolverr/flaresolverr:latest

# Use it from Python
import requests

def solve_cloudflare(url, proxy=None):
    flaresolverr_url = "http://localhost:8191/v1"
    
    payload = {
        "cmd": "request.get",
        "url": url,
        "maxTimeout": 60000
    }
    
    if proxy:
        payload["proxy"] = {
            "url": proxy
        }
    
    response = requests.post(flaresolverr_url, json=payload)
    result = response.json()
    
    if result['status'] == 'ok':
        return {
            'html': result['solution']['response'],
            'cookies': result['solution']['cookies'],
            'user_agent': result['solution']['userAgent']
        }
    else:
        raise Exception(f"FlareSolverr error: {result['message']}")

# Usage example
result = solve_cloudflare(
    'https://example.com',
    proxy='http://user:pass@proxy.example.com:8080'
)

print(result['html'])

undetected-chromedriver

A patched version of Selenium's ChromeDriver that automatically applies many anti-detection techniques. It is easier to use than Playwright but less flexible.

import time
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def bypass_with_uc(url, proxy):
    options = uc.ChromeOptions()
    # Chrome ignores credentials in --proxy-server, so pass a proxy that
    # authenticates by IP allowlist (or a local forwarding proxy)
    options.add_argument(f'--proxy-server={proxy}')
    options.add_argument('--disable-blink-features=AutomationControlled')
    
    driver = uc.Chrome(options=options, version_main=120)
    
    try:
        driver.get(url)
        
        # Wait for the Cloudflare challenge to disappear
        WebDriverWait(driver, 20).until_not(
            EC.presence_of_element_located((By.ID, "cf-spinner-please-wait"))
        )
        
        # Wait a little longer to be safe
        time.sleep(3)
        
        # Collect the result
        html = driver.page_source
        cookies = driver.get_cookies()
        
        return {'html': html, 'cookies': cookies}
    
    finally:
        driver.quit()

# Usage
result = bypass_with_uc(
    'https://example.com',
    'http://proxy.example.com:8080'
)

Combined approach: the optimal strategy is to use a headless browser only to obtain the initial cookies, then switch to HTTP clients (tls-client, cloudscraper) that reuse those cookies. This balances reliability and performance.

Conclusion

Getting past Cloudflare through a proxy requires a comprehensive approach: a correct TLS fingerprint, authentic HTTP headers, quality proxies, and sensible session management. Key recommendations:

Use residential or mobile proxies instead of datacenter ones
Use libraries with a correct TLS fingerprint (tls-client, curl-impersonate)
For complex cases, use headless browsers with anti-detection patches
Store and reuse cf_clearance cookies
Rotate proxies with rate limiting in mind (no more than 15-20 requests per IP)
Add random delays between requests (2-5 seconds)

Cloudflare's protection keeps evolving, so it is important to update your tools regularly and adjust your strategy. Follow changes in fingerprinting techniques and test your solutions against the current versions of the protection.

For stable operation, use professional proxy services with a large IP pool and automatic rotation.

```
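
The article's advice that the Sec-Ch-Ua-* headers must agree with the declared User-Agent can be sketched as a small helper. This is a minimal illustration, not part of any library: the function name chrome_client_hints is hypothetical, and the "Not_A Brand";v="8" entry is copied from the article's Chrome 120 example (real Chrome changes this placeholder brand between versions, so copy the exact value from a real browser).

```python
import re


def chrome_client_hints(user_agent: str) -> dict:
    """Build Sec-Ch-Ua* headers that agree with a Chrome User-Agent string.

    Hypothetical helper: returns an empty dict for non-Chrome User-Agents,
    because Firefox and Safari do not send Sec-Ch-Ua headers.
    """
    match = re.search(r'Chrome/(\d+)\.', user_agent)
    if match is None:
        return {}
    major = match.group(1)

    # Derive the platform from the same User-Agent so the two never disagree
    if 'Windows' in user_agent:
        platform = 'Windows'
    elif 'Macintosh' in user_agent:
        platform = 'macOS'
    else:
        platform = 'Linux'

    return {
        # Assumption: placeholder brand taken from the article's Chrome 120 example
        'Sec-Ch-Ua': f'"Not_A Brand";v="8", "Chromium";v="{major}", "Google Chrome";v="{major}"',
        'Sec-Ch-Ua-Mobile': '?0',
        'Sec-Ch-Ua-Platform': f'"{platform}"',
    }


if __name__ == '__main__':
    ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36')
    headers = {'User-Agent': ua, **chrome_client_hints(ua)}
    for name, value in headers.items():
        print(f'{name}: {value}')
```

Deriving the version and platform from the User-Agent itself, rather than hard-coding them next to it, removes one common source of mismatched headers when the pool is rotated.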

How to get past Cloudflare detection when using a proxy
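
To see how the adaptive rate limiting rules from the article behave, the sketch below replays a list of status codes without sleeping or sending any requests. It assumes the same constants as the article's AdaptiveRateLimiter (double the delay on 429, multiply by 1.5 on 403, cut 10% after ten consecutive successes, clamp to 1-30 seconds); the function name simulate_delays is hypothetical, and treating every other status code as a success is a simplification.

```python
def simulate_delays(status_codes, initial_delay=3.0, min_delay=1.0, max_delay=30.0):
    """Return the delay (in seconds) in effect after each response.

    Hypothetical dry-run helper: it mirrors the back-off rules described
    in the article but performs no network I/O and no sleeping.
    """
    delay = initial_delay
    success_streak = 0
    history = []

    for status in status_codes:
        if status == 429:
            # Rate limited: back off aggressively
            delay = min(max_delay, delay * 2.0)
            success_streak = 0
        elif status == 403:
            # Possibly blocked: back off more gently
            delay = min(max_delay, delay * 1.5)
            success_streak = 0
        else:
            # Treated as a success: speed up by 10% after every ten in a row
            success_streak += 1
            if success_streak >= 10:
                delay = max(min_delay, delay * 0.9)
                success_streak = 0
        history.append(round(delay, 2))

    return history


if __name__ == '__main__':
    # Two rate-limit responses followed by a block
    print(simulate_delays([200, 429, 429, 403]))
    # Ten clean responses in a row earn one 10% speed-up
    print(simulate_delays([200] * 10))
```

Replaying recorded status codes this way is a cheap check that the chosen multipliers recover at an acceptable pace before they are used against a live site.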