Setting Up Proxies in Python's requests and aiohttp in 5 Minutes


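Proxy servers come up often in Python when you build scrapers, automate data collection, or test web services. Both requests and aiohttp support proxies flexibly, but each has configuration details that are easy to get wrong. This guide covers the synchronous and the asynchronous approach, shows HTTP and SOCKS5 examples, and walks through IP rotation and error handling.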

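Basic proxy setup in requests

The requests library is the de facto standard for HTTP requests in Python. Proxies are configured through the proxies parameter, a dictionary that maps a URL scheme to the address of the proxy server.

The simplest HTTP proxy example: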

import requests

# Configure the proxy (placeholder address)
proxies = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}

# Send the request through the proxy
response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())  # e.g. {'origin': '123.45.67.89'}

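Note that for HTTPS requests the proxy value usually still starts with http://, not https://. The scheme in the value describes how the client talks to the proxy: the connection to the proxy is opened over plain HTTP, and the HTTPS traffic is then tunneled through it with the CONNECT method.

Using environment variables:

requests automatically reads proxy settings from the HTTP_PROXY and HTTPS_PROXY environment variables: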

import os
import requests

# Configure the proxy through environment variables
os.environ['HTTP_PROXY'] = 'http://123.45.67.89:8080'
os.environ['HTTPS_PROXY'] = 'http://123.45.67.89:8080'

# The proxy is picked up automatically
response = requests.get('https://httpbin.org/ip')
print(response.json())

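This approach is convenient for containers (Docker) or when the proxy is configured at the system level. When different requests need different proxies, passing the proxies parameter explicitly is more flexible.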
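
Because environment variables apply to the whole process, it can help to scope them to a single block of code. The sketch below is one possible way to do that using only the standard library. The helper name temporary_proxy_env is hypothetical (it is not part of requests), and the proxy address is a placeholder.

```python
import os
from contextlib import contextmanager

PROXY_VARS = ('HTTP_PROXY', 'HTTPS_PROXY')

@contextmanager
def temporary_proxy_env(proxy_url):
    """Set HTTP_PROXY/HTTPS_PROXY for the duration of a with-block."""
    saved = {name: os.environ.get(name) for name in PROXY_VARS}
    for name in PROXY_VARS:
        os.environ[name] = proxy_url
    try:
        yield
    finally:
        # Restore the previous values, or remove the variables if they were unset
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value

# Placeholder proxy address; requests calls made inside the block would see it
with temporary_proxy_env('http://123.45.67.89:8080'):
    print(os.environ['HTTPS_PROXY'])
```

Outside the with-block the variables return to their previous state, so other code in the same process is not affected.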

Authentication and SOCKS5 in requests

Most commercial proxy services require a username and password. In requests, the credentials are embedded in the proxy URL.

HTTP proxy with authentication:

import requests

# Format: http://username:password@host:port
# (user123, pass456 and proxy.example.com are placeholders)
proxies = {
    'http': 'http://user123:pass456@proxy.example.com:8080',
    'https': 'http://user123:pass456@proxy.example.com:8080'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())
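
Credentials that contain characters such as @ or : break the URL format shown above unless they are percent-encoded. The following is a minimal sketch of one way to build the proxy URL safely with the standard library. The helper names build_proxy_url and as_proxies, the host, and the credentials are all made up for illustration.

```python
from urllib.parse import quote

def build_proxy_url(host, port, username=None, password=None, scheme='http'):
    """Return scheme://[user:pass@]host:port with percent-encoded credentials."""
    if username is None:
        return f'{scheme}://{host}:{port}'
    # safe='' also encodes '/', ':' and '@', which would otherwise break the URL
    user = quote(username, safe='')
    pwd = quote(password or '', safe='')
    return f'{scheme}://{user}:{pwd}@{host}:{port}'

def as_proxies(proxy_url):
    """Build the dict that requests expects, using one proxy for both schemes."""
    return {'http': proxy_url, 'https': proxy_url}

proxies = as_proxies(build_proxy_url('proxy.example.com', 8080, 'user123', 'p@ss:word'))
print(proxies['https'])  # http://user123:p%40ss%3Aword@proxy.example.com:8080
```

The resulting dictionary can be passed as the proxies argument in the same way as the hand-written one.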

SOCKS5 proxy setup:

SOCKS5 support needs an extra dependency: install the requests[socks] extra, which pulls in PySocks.

pip install "requests[socks]"

SOCKS5 usage example:

import requests

# SOCKS5 without authentication
proxies = {
    'http': 'socks5://123.45.67.89:1080',
    'https': 'socks5://123.45.67.89:1080'
}

# SOCKS5 with authentication (placeholder credentials and host)
proxies_auth = {
    'http': 'socks5://user:password@proxy.example.com:1080',
    'https': 'socks5://user:password@proxy.example.com:1080'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies_auth)
print(response.json())

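SOCKS5 is especially useful with residential proxies because it tunnels arbitrary TCP traffic rather than only HTTP. The protocol itself also supports UDP, which some applications need, although requests only sends HTTP traffic over TCP through it.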
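
One detail worth knowing: with the socks5:// scheme, requests resolves hostnames on the client, while socks5h:// asks the proxy to resolve them, which avoids leaking DNS lookups from the local machine. The small sketch below rewrites the scheme; use_remote_dns is a hypothetical helper name and the address is a placeholder.

```python
def use_remote_dns(proxy_url):
    """Rewrite socks5:// to socks5h:// so hostnames are resolved by the proxy."""
    prefix = 'socks5://'
    if proxy_url.startswith(prefix):
        return 'socks5h://' + proxy_url[len(prefix):]
    return proxy_url  # other schemes are returned unchanged

proxy = use_remote_dns('socks5://123.45.67.89:1080')
proxies = {'http': proxy, 'https': proxy}
print(proxies['https'])  # socks5h://123.45.67.89:1080
```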

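Proxy rotation in requests

When scraping large volumes of data, sending everything from a single IP address tends to get that address blocked. Proxy rotation cycles through several IPs to spread the load and stay under per-IP rate limits.

Simple rotation through a list: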

import requests
import itertools

# List of proxy servers (placeholder hosts and credentials)
proxy_list = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
    'http://user:password@proxy3.example.com:8080',
]

# Create an endless iterator over the list
proxy_pool = itertools.cycle(proxy_list)

# Send requests, switching to the next proxy each time
for i in range(10):
    proxy = next(proxy_pool)
    proxies = {'http': proxy, 'https': proxy}

    try:
        response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
        print(f"Request {i+1}: IP = {response.json()['origin']}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")

Rotation with a session to keep cookies:

import requests
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxy_pool = cycle(proxy_list)
        self.session = requests.Session()
    
    def get(self, url, **kwargs):
        proxy = next(self.proxy_pool)
        self.session.proxies = {'http': proxy, 'https': proxy}
        return self.session.get(url, **kwargs)

# Usage (placeholder hosts and credentials)
proxy_list = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
]

rotator = ProxyRotator(proxy_list)

for i in range(5):
    response = rotator.get('https://httpbin.org/ip', timeout=5)
    print(f"Request {i+1}: {response.json()['origin']}")

Random rotation for less predictable behavior:

import requests
import random

# Placeholder hosts and credentials
proxy_list = [
    'http://user:password@proxy1.example.com:8080',
    'http://user:password@proxy2.example.com:8080',
    'http://user:password@proxy3.example.com:8080',
]

def get_random_proxy():
    proxy = random.choice(proxy_list)
    return {'http': proxy, 'https': proxy}

# Use a random proxy for each request
for i in range(5):
    response = requests.get('https://httpbin.org/ip', proxies=get_random_proxy(), timeout=5)
    print(f"Request {i+1}: {response.json()['origin']}")

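Random rotation can work better against sites that watch for request patterns: a strictly sequential IP order may look suspicious, while random selection looks more like traffic from unrelated users.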
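
Plain random.choice can pick the same proxy several times in a row, which partly defeats the purpose of rotating. As a rough sketch, the class below chooses randomly but never repeats the previous proxy. NonRepeatingChooser is a hypothetical name and the proxy hosts are placeholders.

```python
import random

class NonRepeatingChooser:
    """Pick proxies at random, never returning the same one twice in a row."""

    def __init__(self, proxy_list, rng=None):
        if len(set(proxy_list)) < 2:
            raise ValueError('need at least two distinct proxies to avoid repeats')
        self.proxy_list = list(proxy_list)
        self.rng = rng or random.Random()
        self.last = None

    def next_proxy(self):
        # Exclude the proxy used for the previous request
        candidates = [p for p in self.proxy_list if p != self.last]
        self.last = self.rng.choice(candidates)
        return self.last

chooser = NonRepeatingChooser([
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
])
proxy = chooser.next_proxy()
proxies = {'http': proxy, 'https': proxy}
```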

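Proxy setup in aiohttp

The aiohttp library performs asynchronous HTTP requests, which makes it a good fit for high-load scrapers. Its proxy configuration differs from requests: each request takes a single proxy parameter (singular).

Basic example with an HTTP proxy: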

import aiohttp
import asyncio

async def fetch_with_proxy():
    proxy = 'http://123.45.67.89:8080'
    
    async with aiohttp.ClientSession() as session:
        async with session.get('https://httpbin.org/ip', proxy=proxy) as response:
            data = await response.json()
            print(data)

# Run
asyncio.run(fetch_with_proxy())

Proxy with authentication:

In aiohttp, credentials are passed either as an aiohttp.BasicAuth object or directly in the proxy URL:

import aiohttp
import asyncio

async def fetch_with_auth_proxy():
    # Option 1: credentials in the URL (placeholder values)
    proxy = 'http://user123:pass456@proxy.example.com:8080'
    
    async with aiohttp.ClientSession() as session:
        async with session.get('https://httpbin.org/ip', proxy=proxy) as response:
            print(await response.json())

# Option 2: via BasicAuth (suits some proxies better)
async def fetch_with_basic_auth():
    proxy = 'http://proxy.example.com:8080'
    proxy_auth = aiohttp.BasicAuth('user123', 'pass456')
    
    async with aiohttp.ClientSession() as session:
        async with session.get('https://httpbin.org/ip', 
                                proxy=proxy, 
                                proxy_auth=proxy_auth) as response:
            print(await response.json())

asyncio.run(fetch_with_auth_proxy())

SOCKS5 in aiohttp:

SOCKS5 requires the aiohttp-socks library:

pip install aiohttp-socks

import asyncio
from aiohttp_socks import ProxyConnector
import aiohttp

async def fetch_with_socks5():
    # Placeholder credentials and host
    connector = ProxyConnector.from_url('socks5://user:password@proxy.example.com:1080')
    
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://httpbin.org/ip') as response:
            print(await response.json())

asyncio.run(fetch_with_socks5())

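aiohttp is also a good choice when working with mobile proxies to scrape social networks or marketplaces: its asynchronous model handles hundreds of requests concurrently without blocking the executing thread.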

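Asynchronous rotation and proxy pools

High-load scrapers need efficient proxy rotation, failure handling, and automatic replacement of dead IPs. This section looks at more advanced patterns for aiohttp.

A class that manages a proxy pool: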

import aiohttp
import asyncio
from itertools import cycle
from typing import List, Optional

class ProxyPool:
    def __init__(self, proxy_list: List[str]):
        self.proxy_list = proxy_list
        self.proxy_cycle = cycle(proxy_list)
        self.failed_proxies = set()
    
    def get_next_proxy(self) -> Optional[str]:
        """Return the next proxy that has not been marked as failed."""
        for _ in range(len(self.proxy_list)):
            proxy = next(self.proxy_cycle)
            if proxy not in self.failed_proxies:
                return proxy
        return None  # every proxy has failed

    def mark_failed(self, proxy: str):
        """Mark a proxy as unavailable."""
        self.failed_proxies.add(proxy)
        print(f"Proxy {proxy} marked as unavailable")

    async def fetch(self, session: aiohttp.ClientSession, url: str, **kwargs):
        """Send a request, switching to another proxy on failure."""
        max_retries = 3

        for attempt in range(max_retries):
            proxy = self.get_next_proxy()
            if not proxy:
                raise RuntimeError("No proxies available")

            try:
                async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=10), **kwargs) as response:
                    return await response.json()
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                print(f"Proxy {proxy} failed: {e}")
                self.mark_failed(proxy)
                continue

        raise RuntimeError(f"Request failed after {max_retries} attempts")

# Usage
async def main():
    # Placeholder hosts and credentials
    proxy_list = [
        'http://user:password@proxy1.example.com:8080',
        'http://user:password@proxy2.example.com:8080',
        'http://user:password@proxy3.example.com:8080',
    ]

    pool = ProxyPool(proxy_list)

    async with aiohttp.ClientSession() as session:
        # Send 10 requests; the pool rotates proxies automatically
        tasks = [pool.fetch(session, 'https://httpbin.org/ip') for _ in range(10)]
        results = await asyncio.gather(*tasks, return_exceptions=True)

        for i, result in enumerate(results):
            if isinstance(result, Exception):
                print(f"Request {i+1} failed: {result}")
            else:
                print(f"Request {i+1}: IP = {result.get('origin')}")

asyncio.run(main())
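
The pool above bans a failed proxy for good, so a temporary outage can eventually drain the whole pool. One possible refinement, sketched here with made-up names (CooldownProxyPool, cooldown), re-admits a proxy once a cooldown period has passed. The clock is injectable so the behavior can be checked without waiting.

```python
import time

class CooldownProxyPool:
    """Round-robin pool that skips a failed proxy until its cooldown expires."""

    def __init__(self, proxy_list, cooldown=60.0, clock=time.monotonic):
        self.proxy_list = list(proxy_list)
        self.cooldown = cooldown
        self.clock = clock
        self.failed_at = {}  # proxy -> time of its most recent failure
        self._index = 0

    def mark_failed(self, proxy):
        self.failed_at[proxy] = self.clock()

    def is_available(self, proxy):
        failed = self.failed_at.get(proxy)
        return failed is None or self.clock() - failed >= self.cooldown

    def get_next_proxy(self):
        """Return the next available proxy in round-robin order, or None."""
        for _ in range(len(self.proxy_list)):
            proxy = self.proxy_list[self._index % len(self.proxy_list)]
            self._index += 1
            if self.is_available(proxy):
                return proxy
        return None  # every proxy is currently cooling down
```

The method names match ProxyPool, so under that assumption the class could be swapped into the fetch logic above without other changes.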

Parallel processing with a concurrency limit:

import aiohttp
import asyncio
from itertools import cycle

async def fetch_url(session, url, proxy, semaphore):
    async with semaphore:  # cap the number of concurrent requests
        try:
            async with session.get(url, proxy=proxy, timeout=aiohttp.ClientTimeout(total=10)) as response:
                data = await response.json()
                return {'url': url, 'ip': data.get('origin'), 'status': response.status}
        except Exception as e:
            return {'url': url, 'error': str(e)}

async def main():
    urls = ['https://httpbin.org/ip' for _ in range(50)]  # 50 requests
    # Placeholder hosts and credentials
    proxy_list = [
        'http://user:password@proxy1.example.com:8080',
        'http://user:password@proxy2.example.com:8080',
    ]
    proxy_cycle = cycle(proxy_list)

    # Limit: at most 10 concurrent requests
    semaphore = asyncio.Semaphore(10)
    
    async with aiohttp.ClientSession() as session:
        tasks = [
            fetch_url(session, url, next(proxy_cycle), semaphore)
            for url in urls
        ]
        results = await asyncio.gather(*tasks)
        
        # Summarize the results
        successful = [r for r in results if 'ip' in r]
        failed = [r for r in results if 'error' in r]

        print(f"Successful requests: {len(successful)}")
        print(f"Failed requests: {len(failed)}")

asyncio.run(main())

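Using asyncio.Semaphore matters when working with proxies: too many simultaneous connections through one IP can get it blocked by the target site or by the proxy provider.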
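
A single semaphore caps total concurrency, not concurrency per IP. If the concern is the load on each individual proxy, one option is to keep a separate semaphore per proxy. The sketch below simulates requests with asyncio.sleep instead of real network calls. PerProxyLimiter and the proxy names are hypothetical.

```python
import asyncio
from collections import defaultdict

class PerProxyLimiter:
    """Hand out one semaphore per proxy so each IP has its own concurrency cap."""

    def __init__(self, max_per_proxy):
        self.max_per_proxy = max_per_proxy
        self._semaphores = {}

    def slot(self, proxy):
        # Created lazily, so the semaphore is made inside the running event loop
        if proxy not in self._semaphores:
            self._semaphores[proxy] = asyncio.Semaphore(self.max_per_proxy)
        return self._semaphores[proxy]

async def demo():
    limiter = PerProxyLimiter(max_per_proxy=2)
    active = defaultdict(int)
    peak = defaultdict(int)

    async def fake_request(proxy):
        async with limiter.slot(proxy):
            active[proxy] += 1
            peak[proxy] = max(peak[proxy], active[proxy])
            await asyncio.sleep(0.01)  # stands in for session.get(..., proxy=proxy)
            active[proxy] -= 1

    proxies = ['proxy-a', 'proxy-b']
    await asyncio.gather(*(fake_request(proxies[i % 2]) for i in range(10)))
    return dict(peak)  # highest number of simultaneous requests seen per proxy

print(asyncio.run(demo()))
```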

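Error and timeout handling

Going through a proxy increases the number of things that can fail: timeouts, dropped connections, refusals from the proxy server. Handling these errors properly is what keeps a scraper stable.

Common errors when using proxies:

Error	Cause	Fix
`ProxyError`	Proxy server is unavailable	Switch to another proxy
`ConnectTimeout`	Proxy did not respond in time	Increase the timeout or switch proxies
HTTP 407 (Proxy Authentication Required)	Wrong username or password	Check the credentials
`SSLError`	SSL certificate problem	Disable SSL verification (not recommended)
`TooManyRedirects`	Proxy causes a redirect loop	Switch proxies or limit redirects

Handling errors in requests: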

import time

import requests
from requests.exceptions import ProxyError, ConnectTimeout, RequestException

def fetch_with_retry(url, proxies, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(
                url,
                proxies=proxies,
                timeout=(5, 10),  # (connect timeout, read timeout)
                allow_redirects=True,
                verify=True  # verify the SSL certificate
            )
            response.raise_for_status()  # raises HTTPError on 4xx/5xx
            return response.json()

        except ProxyError as e:
            print(f"Attempt {attempt + 1}: proxy unavailable - {e}")
        except ConnectTimeout as e:
            print(f"Attempt {attempt + 1}: connection timed out - {e}")
        except requests.exceptions.HTTPError as e:
            print(f"Attempt {attempt + 1}: HTTP error {e.response.status_code}")
            if e.response.status_code == 407:  # Proxy Authentication Required
                print("Proxy authentication failed!")
                break  # retrying will not fix bad credentials
        except RequestException as e:
            print(f"Attempt {attempt + 1}: general error - {e}")

        if attempt < max_retries - 1:
            print("Waiting 2 seconds before retrying...")
            time.sleep(2)

    raise RuntimeError(f"Request failed after {max_retries} attempts")

# Usage (placeholder credentials and host)
proxy = 'http://user:password@proxy.example.com:8080'
proxies = {'http': proxy, 'https': proxy}
try:
    data = fetch_with_retry('https://httpbin.org/ip', proxies)
    print(data)
except Exception as e:
    print(f"Fatal error: {e}")
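
The retry loop above waits a fixed 2 seconds between attempts. A common alternative is exponential backoff, optionally with random jitter so that many workers do not retry at the same instant. This is only a sketch of the delay calculation; backoff_delays and its parameters are illustrative names, not part of requests.

```python
import random

def backoff_delays(retries, base=1.0, factor=2.0, max_delay=30.0, jitter=0.0, rng=None):
    """Yield one delay in seconds per retry: base, base*factor, ... capped at max_delay."""
    rng = rng or random.Random()
    for attempt in range(retries):
        delay = min(base * factor ** attempt, max_delay)
        # Jitter adds up to `jitter * delay` extra seconds, chosen at random
        yield delay + rng.uniform(0, jitter * delay)

print(list(backoff_delays(4)))  # [1.0, 2.0, 4.0, 8.0]
```

Each yielded value could replace the constant passed to time.sleep in the retry loop.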

Handling errors in aiohttp:

import aiohttp
import asyncio
from aiohttp import ClientError, ClientProxyConnectionError

async def fetch_with_retry(session, url, proxy, max_retries=3):
    for attempt in range(max_retries):
        try:
            timeout = aiohttp.ClientTimeout(total=10, connect=5)
            async with session.get(url, proxy=proxy, timeout=timeout) as response:
                response.raise_for_status()
                return await response.json()
                
        except ClientProxyConnectionError as e:
            print(f"Attempt {attempt + 1}: could not connect to the proxy - {e}")
        except asyncio.TimeoutError:
            print(f"Attempt {attempt + 1}: timed out")
        except aiohttp.ClientHttpProxyError as e:
            print(f"Attempt {attempt + 1}: proxy HTTP error - {e}")
            if e.status == 407:
                print("Proxy authentication failed!")
                break  # retrying will not fix bad credentials
        except ClientError as e:
            print(f"Attempt {attempt + 1}: general client error - {e}")

        if attempt < max_retries - 1:
            await asyncio.sleep(2)

    raise RuntimeError(f"Request failed after {max_retries} attempts")

async def main():
    proxy = 'http://user:password@proxy.example.com:8080'  # placeholder credentials and host
    async with aiohttp.ClientSession() as session:
        try:
            data = await fetch_with_retry(session, 'https://httpbin.org/ip', proxy)
            print(data)
        except Exception as e:
            print(f"Fatal error: {e}")

asyncio.run(main())

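Timeout settings:

Sensible timeouts are essential for stability. Recommended starting values:

Connect timeout: 5-10 seconds (time to establish the connection to the proxy)
Read timeout: 10-30 seconds (time to receive the response from the target site)
Total timeout: 30-60 seconds (overall time budget for the request)

For slower residential proxies, consider raising the connect timeout to 20-30 seconds, since routing through a real ISP connection can take longer.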
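
The recommended values can be kept in one place and converted to the shape each library expects. The sketch below assumes a hypothetical TimeoutProfile class. The specific numbers are picked from the ranges above and are not library defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeoutProfile:
    connect: float  # seconds to establish the connection to the proxy
    read: float     # seconds to wait for response data
    total: float    # overall budget for the request

    def for_requests(self):
        """requests takes a (connect, read) tuple; it has no total timeout."""
        return (self.connect, self.read)

    def for_aiohttp(self):
        """Keyword arguments intended for aiohttp.ClientTimeout(**kwargs)."""
        return {'total': self.total, 'connect': self.connect, 'sock_read': self.read}

# Illustrative profiles based on the ranges recommended above
DATACENTER = TimeoutProfile(connect=5, read=10, total=30)
RESIDENTIAL = TimeoutProfile(connect=20, read=30, total=60)

print(RESIDENTIAL.for_requests())  # (20, 30)
```

With requests this would be passed as timeout=RESIDENTIAL.for_requests(), and with aiohttp as timeout=aiohttp.ClientTimeout(**RESIDENTIAL.for_aiohttp()).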

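Best practices and optimization

Working with proxies effectively means following a few rules that minimize blocks and maximize throughput.

1. Use sessions to reuse connections: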

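import asyncio
import aiohttp
import requests

proxy = 'http://123.45.67.89:8080'     # placeholder proxy address
urls = ['https://httpbin.org/ip'] * 3  # placeholder URL list

# requests: a session reuses TCP connections
session = requests.Session()
session.proxies = {'http': proxy, 'https': proxy}

for url in urls:
    response = session.get(url)  # faster than calling requests.get() each time

# aiohttp: share one session across all asynchronous requests
async def fetch(session, url):
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))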

2. Set a realistic User-Agent and headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',  # add 'br' only if a Brotli decoder is installed
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

proxy = 'http://123.45.67.89:8080'  # placeholder proxy address
proxies = {'http': proxy, 'https': proxy}
response = requests.get('https://example.com', headers=headers, proxies=proxies)

3. Limit the request rate (requests per second):

import time
import requests

class RateLimiter:
    def __init__(self, max_requests_per_second):
        self.max_requests = max_requests_per_second
        self.interval = 1.0 / max_requests_per_second
        self.last_request_time = 0
    
    def wait(self):
        elapsed = time.time() - self.last_request_time
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self.last_request_time = time.time()

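# Usage: at most 2 requests per second
limiter = RateLimiter(2)
proxy = 'http://123.45.67.89:8080'     # placeholder proxy address
urls = ['https://httpbin.org/ip'] * 5  # placeholder URL list
proxies = {'http': proxy, 'https': proxy}

for url in urls:
    limiter.wait()
    response = requests.get(url, proxies=proxies)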
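
The limiter above blocks with time.sleep, which would stall the event loop in aiohttp code. A rough asynchronous counterpart is sketched below. AsyncRateLimiter is a hypothetical name, and the clock and sleep functions are injectable so the spacing logic can be tested without real delays.

```python
import asyncio
import time

class AsyncRateLimiter:
    """Space out calls so that at most max_per_second start each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_slot = None  # earliest time the next request may start
        self._lock = None

    async def wait(self):
        if self._lock is None:
            self._lock = asyncio.Lock()  # created lazily, inside the running loop
        async with self._lock:
            now = self.clock()
            if self.next_slot is not None and now < self.next_slot:
                await self.sleep(self.next_slot - now)
                now = self.next_slot
            self.next_slot = now + self.interval

async def demo():
    limiter = AsyncRateLimiter(20)  # 20 requests per second
    for _ in range(3):
        await limiter.wait()
        # an aiohttp request through the proxy would go here

asyncio.run(demo())
```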

4. Log and monitor your proxies:

import logging
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProxyMonitor:
    def __init__(self):
        self.stats = defaultdict(lambda: {'success': 0, 'failed': 0, 'total_time': 0})
    
    def log_request(self, proxy, success, response_time):
        stats = self.stats[proxy]
        if success:
            stats['success'] += 1
        else:
            stats['failed'] += 1
        stats['total_time'] += response_time
        
        # Log a summary every 10 requests
        total = stats['success'] + stats['failed']
        if total % 10 == 0:
            avg_time = stats['total_time'] / total
            success_rate = stats['success'] / total * 100
            logger.info(f"Proxy {proxy}: {total} requests, success rate {success_rate:.1f}%, avg time {avg_time:.2f}s")

monitor = ProxyMonitor()

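# In the request code
import time

import requests

proxy = 'http://123.45.67.89:8080'  # placeholder proxy address
proxies = {'http': proxy, 'https': proxy}
url = 'https://httpbin.org/ip'

start = time.time()
try:
    response = requests.get(url, proxies=proxies, timeout=10)
    monitor.log_request(proxy, True, time.time() - start)
except Exception as e:
    monitor.log_request(proxy, False, time.time() - start)
    logger.error(f"Proxy {proxy} failed: {e}")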
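
When proxy URLs carry credentials, logging them verbatim writes the password into the log files. A minimal sketch of masking the password before logging follows. mask_proxy is a hypothetical helper, it assumes URLs of the form scheme://user:password@host:port, and the URL shown is a placeholder.

```python
from urllib.parse import urlsplit

def mask_proxy(proxy_url):
    """Return the proxy URL with the password replaced by '***'."""
    parts = urlsplit(proxy_url)
    if parts.password is None:
        return proxy_url  # nothing to hide
    host = parts.hostname or ''
    if parts.port is not None:
        host = f'{host}:{parts.port}'
    return f'{parts.scheme}://{parts.username}:***@{host}'

print(mask_proxy('http://user:secret@proxy.example.com:8080'))
# http://user:***@proxy.example.com:8080
```

The masked string can be used as the key and log label in a monitor like the one above, while the real URL is still passed to the HTTP client.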

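5. Cache DNS lookups for speed:

# aiohttp with a DNS cache
import asyncio

import aiohttp
from aiohttp.resolver import AsyncResolver  # requires the aiodns package

proxy = 'http://123.45.67.89:8080'  # placeholder proxy address

async def fetch(url):
    resolver = AsyncResolver(nameservers=['8.8.8.8', '8.8.4.4'])
    connector = aiohttp.TCPConnector(resolver=resolver, ttl_dns_cache=300)

    async with aiohttp.ClientSession(connector=connector) as session:
        # Resolved addresses are cached for 5 minutes
        async with session.get(url, proxy=proxy) as response:
            return await response.json()

asyncio.run(fetch('https://httpbin.org/ip'))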

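6. Handle CAPTCHAs and blocks:

Recommendation: on a 403 or 429 status, or when a CAPTCHA appears:

Switch to a proxy with an IP from a different subnet
Increase the delay between requests (up to 5-10 seconds)
Change the User-Agent and other headers
Reuse cookies from an earlier successful session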
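
Part of the checklist above can be turned into a small decision function. The sketch below is an assumption-laden illustration: block_action is a hypothetical helper, the default and maximum delays follow the 5-10 second range mentioned above, and only the numeric form of the Retry-After header is handled.

```python
def block_action(status, headers=None, default_delay=5.0, max_delay=10.0):
    """Return (switch_proxy, delay_seconds) for a response status code."""
    headers = headers or {}
    if status == 429:
        retry_after = headers.get('Retry-After', '')
        # Retry-After may also be an HTTP date; only the seconds form is parsed here
        delay = float(retry_after) if retry_after.isdigit() else default_delay
        return True, min(delay, max_delay)
    if status == 403:
        return True, default_delay
    return False, 0.0

print(block_action(429, {'Retry-After': '3'}))  # (True, 3.0)
print(block_action(200))                        # (False, 0.0)
```

A plain dict lookup is case-sensitive, so header names should be normalized first, or the response's own headers object passed in when the client offers case-insensitive lookup.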

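Comparing proxies in requests and aiohttp

The choice between requests and aiohttp depends on the task and the data volume. The key differences are listed below; the throughput figures are rough orders of magnitude and depend heavily on proxy latency.

Criterion	requests	aiohttp
Execution model	Synchronous (blocking)	Asynchronous (non-blocking)
Throughput	~10-50 requests/s	~100-1000 requests/s
Code simplicity	Easier for beginners	Requires familiarity with async/await
Proxy configuration	`proxies` dictionary	`proxy` parameter
SOCKS5 support	Via `requests[socks]`	Via `aiohttp-socks`
Memory usage	Lower (single thread)	Higher (many tasks)
Best suited for	Simple scripts, <100 requests	Scrapers, >1000 requests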

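When to use requests:

Simple scripts for one-off tasks
Prototyping and testing
Low request volumes (up to 100 per minute)
When simple, readable code matters
When integrating with synchronous libraries

When to use aiohttp:

Scraping large volumes of data (thousands of pages)
Real-time monitoring of many sources
High-load API services
When processing speed is critical
WebSocket connections through a proxy

A practical performance comparison: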

# Test: 100 requests through a proxy
import asyncio
import time

import aiohttp
import requests

proxy = 'http://123.45.67.89:8080'  # placeholder proxy address
url = 'https://httpbin.org/ip'

# requests (synchronous) - ~50 seconds
start = time.time()
proxies = {'http': proxy, 'https': proxy}
for i in range(100):
    response = requests.get(url, proxies=proxies)
print(f"requests: {time.time() - start:.2f} s")

# aiohttp (asynchronous) - ~5 seconds
async def fetch(session):
    async with session.get(url, proxy=proxy) as response:
        await response.read()  # read the body so the connection is released

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session) for _ in range(100)))

start = time.time()
asyncio.run(fetch_all())
print(f"aiohttp: {time.time() - start:.2f} s")

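For high-speed scraping through datacenter proxies, aiohttp can come out 10-20 times faster than requests because it processes requests concurrently rather than one after another.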
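
The size of the speedup follows from simple arithmetic: sequential time grows with the number of requests, while concurrent time grows with the number of "waves" of requests. The back-of-the-envelope sketch below assumes a fixed 0.5 s latency per request, a value chosen only to match the ~50 s and ~5 s figures above. estimated_seconds is a hypothetical helper.

```python
import math

def estimated_seconds(num_requests, latency, concurrency=1):
    """Rough wall-clock estimate: requests run in waves of `concurrency`."""
    waves = math.ceil(num_requests / concurrency)
    return waves * latency

sequential = estimated_seconds(100, latency=0.5)                  # 50.0
concurrent = estimated_seconds(100, latency=0.5, concurrency=10)  # 5.0
print(sequential / concurrent)  # 10.0
```

Real numbers vary with proxy latency, connection reuse, and any limits imposed by the target site, so this is a model of the trend, not a benchmark.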

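Conclusion

Configuring proxies for requests and aiohttp is a basic skill for building scrapers, automating data collection, and working around geographic restrictions. requests suits simple scripts and prototypes because its synchronous API is easy to follow, while aiohttp's asynchronous architecture delivers high throughput when you need to handle thousands of requests.

The key points for using proxies effectively in Python: handle errors and timeouts properly, rotate IP addresses to spread the load, reuse connections through sessions, send realistic headers and a realistic User-Agent, and monitor how your proxy servers perform. SOCKS5 proxies need an extra library: requests[socks] or aiohttp-socks.

When choosing a proxy type for scraping, consider the task. Fast datacenter proxies suit high-load scrapers that send thousands of requests. Residential proxies, which use real users' IPs, are the better choice for getting past strict anti-bot systems and for working with social networks. Mobile proxies, which use mobile carriers' IPs, fit tasks that need maximum anonymity and traffic that looks like it comes from mobile devices.

If you plan to build a high-performance scraper or automate data collection from many sources, residential proxies are worth trying: they offer a high level of anonymity, a low risk of blocks, and work reliably with most protected web services. For technical tasks that demand high processing speed, datacenter proxies with low latency and high bandwidth are also a good fit.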