Back to Blog

Robots.txt and Proxies: How to Legally Scrape Competitors Without Getting Banned

A complete guide to ethical scraping through proxies: how to comply with robots.txt, avoid blocks, and gather competitor data without legal risks.

📅 March 5, 2026

Scraping competitor data is a common practice for marketers, marketplace sellers, and agencies. You monitor prices on Wildberries, collect listings from Avito, and analyze competitors' assortments. However, most websites block mass requests, and ignoring the robots.txt file can lead to legal issues. In this article, we will discuss how to use proxies for ethical scraping: adhering to website rules, avoiding blocks, and collecting data without risks to your business.

What is robots.txt and why is it needed for websites

The robots.txt file is a text document located at the root of a website that tells search engine bots and scrapers which sections can be crawled and which are disallowed. For example, an online store may prohibit indexing of the cart or user account pages so that these pages do not appear in Google.

A typical robots.txt file looks like this:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /api/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml

Directive breakdown:

  • User-agent: * — rules for all bots (the asterisk means "any bot")
  • Disallow: /admin/ — crawling the /admin/ section is disallowed
  • Crawl-delay: 10 — a delay of 10 seconds between requests
  • User-agent: Googlebot — special rules for Google (everything is allowed)

Why websites use robots.txt:

  1. Protection against server overload — mass scraping creates load on the server, slowing down performance for real users
  2. Hiding technical pages — carts, payment forms, and API endpoints should not be indexed
  3. Protection of commercial data — marketplaces do not want competitors to easily extract their entire product catalog
  4. Traffic savings — each bot request costs the website owner money

Important: robots.txt is a recommendation, not a technical barrier. The file does not physically block access, but ignoring the rules can lead to your IP being blocked or lawsuits (especially in the US and Europe).

Is scraping legal? What the law says

Data scraping exists in a legal gray area. Rules differ from country to country, but there are general principles worth knowing to avoid lawsuits.

Legislation in Russia

In Russia, there is no specific law on scraping, but general norms apply:

  • Copyright (Civil Code of the Russian Federation, Article 1259) — you cannot copy unique texts, photographs, or product descriptions without the permission of the copyright holder. Scraping prices and characteristics is usually safe, as these are factual data.
  • Personal data (152-FZ) — it is prohibited to collect personal data of users (full name, phone numbers, email) without consent. This applies to scraping social media profiles or contact databases.
  • Unfair competition (Article 14.33 of the Administrative Code of the Russian Federation) — if scraping is used to copy a business model or mislead customers, fines can reach up to 500,000 rubles.

Legislation in the USA and Europe

In the USA and EU, the laws are stricter:

  • CFAA (Computer Fraud and Abuse Act, USA) — unauthorized access to computer systems is a crime, and violating robots.txt can be interpreted as "unauthorized access." A well-known case: hiQ Labs v. LinkedIn (2022) — the court held that scraping publicly accessible data does not by itself violate the CFAA, but bypassing technical barriers (such as CAPTCHAs or IP blocks) is a different matter.
  • GDPR (General Data Protection Regulation, EU) — collecting personal data of EU citizens without explicit consent is prohibited. Fines can reach up to 20 million euros or 4% of the company's annual turnover.
  • Terms of Service — many websites explicitly prohibit scraping in their rules. Violating this can lead to a lawsuit for breach of contract.

Practical advice: Before scraping, check three documents: robots.txt, Terms of Service, and Privacy Policy of the target website. If scraping is explicitly prohibited — look for alternative data sources (public APIs, partner programs, ready-made datasets).

What is safe to scrape

| Type of data | Risk | Comment |
|---|---|---|
| Product prices | Low | Factual data, not protected by copyright |
| Product characteristics | Low | Technical data is safe |
| Unique descriptions | High | Protected by copyright |
| Product photos | High | Permission from the copyright holder is required |
| User contacts | Critical | Violation of 152-FZ and GDPR |
| Public statistics | Low | Open data is safe |

Ethical scraping: how to collect data without violations

Ethical scraping is a balance between business tasks and respect for website owners. You can collect the necessary data without creating problems for the target resource and without violating laws.

Basic principles of ethical scraping

  1. Comply with robots.txt — if a section is disallowed for scraping, do not attempt to bypass it. Look for alternative data sources.
  2. Limit request speed — do not send 1000 requests per second. Make delays of 2-10 seconds between requests to avoid overloading the server.
  3. Use your scraper's User-Agent — do not disguise yourself as a regular user. Specify an honest User-Agent, for example: "MyCompanyParser/1.0 (contact@mycompany.com)". This allows website administrators to contact you if issues arise.
  4. Scrape only public data — do not attempt to access closed sections, APIs, or databases.
  5. Do not resell copied data — use the collected information for internal needs (competitor analysis, price monitoring), not for creating a competing service.
  6. Cache data — do not request the same page multiple times. Save results locally and update them on a schedule (once a day, once a week).
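Principles 2 (delays), 3 (an honest User-Agent), and 6 (caching) can be combined into one small helper. This is only a sketch, not a production implementation: polite_fetch and the cache directory are hypothetical names, and the requests library is assumed to be installed.

```python
import hashlib
import random
import time
from pathlib import Path

import requests

CACHE_DIR = Path("scrape_cache")   # hypothetical local cache directory
CACHE_DIR.mkdir(exist_ok=True)

# Honest User-Agent so site administrators can contact you (principle 3)
HEADERS = {"User-Agent": "MyCompanyParser/1.0 (contact@mycompany.com)"}

def polite_fetch(url: str, min_delay: float = 2.0, max_delay: float = 10.0) -> str:
    """Fetch a page with a randomized delay and a local cache (principles 2 and 6)."""
    cache_file = CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_text()      # cached copy: no request at all
    time.sleep(random.uniform(min_delay, max_delay))
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    cache_file.write_text(response.text)
    return response.text
```

On repeat calls the page is served from disk, so the target site sees each URL at most once per cache lifetime.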

When NOT to scrape

There are situations when scraping creates more problems than benefits:

  • The website provides an API — many marketplaces (Wildberries, Ozon, Yandex.Market) have official APIs for partners. Use them instead of scraping — it is faster, more legal, and more reliable.
  • The data is protected by CAPTCHA or authentication — bypassing protection may be considered hacking.
  • The website explicitly prohibits scraping in its Terms of Service — the risk of a lawsuit is too high.
  • You are collecting personal data — this violates GDPR and 152-FZ with huge fines.

How to correctly read and comply with robots.txt

The robots.txt file is located at the root of the domain: https://example.com/robots.txt. Always check this file before starting to scrape.

Main directives of robots.txt

| Directive | Meaning | Example |
|---|---|---|
| User-agent | Which bot the rules apply to | User-agent: * (all bots) |
| Disallow | Sections prohibited for crawling | Disallow: /admin/ |
| Allow | Allowed sections (exceptions to Disallow) | Allow: /public/ |
| Crawl-delay | Minimum delay between requests (in seconds) | Crawl-delay: 10 |
| Sitemap | Link to the sitemap (list of all pages) | Sitemap: https://example.com/sitemap.xml |

Examples of robots.txt and how to interpret them

Example 1: Complete prohibition of scraping

User-agent: *
Disallow: /

This means: "All bots are prohibited from crawling the entire site." Scraping such a site is a violation of the owner's rules. Look for alternative data sources.

Example 2: Selective restrictions

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cart/
Allow: /products/
Crawl-delay: 5

This means: "You can scrape the /products/ section (products), but /admin/, /api/, and /cart/ are prohibited. Make a delay of 5 seconds between requests." These are normal conditions — you can scrape products while adhering to limits.

Example 3: Rules for specific bots

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
Crawl-delay: 10

This means: "Google can crawl the entire site, but all other bots cannot." If you are not Google, scraping is prohibited.
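You can verify this interpretation with Python's built-in urllib.robotparser by feeding it the rules as text:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/products/"))    # True
print(rp.can_fetch("MyParser/1.0", "https://example.com/products/")) # False
```

Any bot whose name does not match "Googlebot" falls under the catch-all User-agent: * block and is denied.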

How to check robots.txt before scraping

Most programming languages have libraries for automatically checking robots.txt. Here is an example in Python:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Check if scraping the page is allowed
url = "https://example.com/products/item123"
user_agent = "MyParser/1.0"

if rp.can_fetch(user_agent, url):
    print("Scraping allowed")
else:
    print("Scraping prohibited by robots.txt")

This will automatically check the rules and inform you whether scraping a specific URL is allowed.
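The same module also exposes the Crawl-delay directive. Here is a sketch that uses parse() on inline rules, so it runs without a network request:

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyParser/1.0", "https://example.com/products/item123"))  # True
print(rp.crawl_delay("MyParser/1.0"))  # 10 — wait at least this long between requests
```

In a real scraper you would call set_url()/read() as above and then feed the crawl_delay() value into your time.sleep() calls.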

Rate Limiting and delays between requests

Rate Limiting is a protection mechanism for websites against overload. If you send too many requests in a short period, the server may block your IP or show a CAPTCHA.

Why it is important to observe delays

  • Avoiding IP blocks — websites track the frequency of requests from a single IP. If you send 100 requests per minute, you will be blocked as a bot.
  • Reducing server load — mass scraping can "bring down" a site, especially if it is a small resource on cheap hosting.
  • Complying with Crawl-delay from robots.txt — if the site specifies a delay of 10 seconds, ignoring this rule is unethical.
  • More natural behavior — regular users do not open 10 pages per second. Delays make your scraper resemble a real person.
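Beyond fixed delays, it helps to back off when the server explicitly signals overload with HTTP 429 (Too Many Requests). A sketch assuming the requests library; fetch_with_backoff is a hypothetical helper, not part of any library:

```python
import time

import requests

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Retry with exponentially growing pauses while the server answers 429."""
    delay = 2.0
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        time.sleep(delay)   # the server asked us to slow down: wait longer each time
        delay *= 2
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")
```

Doubling the pause after each 429 lets the scraper recover from temporary limits without hammering the site.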

Recommended delays for different tasks

| Type of scraping | Delay between requests | Comment |
|---|---|---|
| Small site (up to 1,000 pages) | 5-10 seconds | Low server capacity |
| Medium site (online store) | 2-5 seconds | Optimal balance |
| Large marketplace (Wildberries, Ozon) | 1-3 seconds | Powerful infrastructure, but strong protection |
| API endpoints | Per the API limits (usually 10-100 requests/min) | Check the API documentation |
| Social networks (Instagram, VK) | 10-30 seconds | Very strict limits, high risk of a ban |

How to implement delays in code

Here is an example in Python using the time library:

import time
import requests

urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

for url in urls:
    response = requests.get(url)
    print(f"Scraped: {url}")
    
    # Delay 3 seconds before the next request
    time.sleep(3)

For more complex scenarios, use random delays to make behavior even more natural:

import time
import random
import requests

for url in urls:
    response = requests.get(url)
    
    # Random delay from 2 to 5 seconds
    delay = random.uniform(2, 5)
    time.sleep(delay)

Proxy rotation for ethical scraping

Even if you comply with robots.txt and make delays, scraping a large volume of data from a single IP can raise suspicions. Proxy rotation helps distribute requests across different IP addresses, mimicking the behavior of many real users.

Why proxy rotation is needed

  • Bypassing Rate Limiting — if the limit is 100 requests/hour from one IP, then 10 proxies will give you 1000 requests/hour.
  • Geographical distribution — for scraping regional data (prices on Wildberries in Moscow and Vladivostok), proxies from different cities are needed.
  • Reducing suspicion — requests from different IPs look like traffic from real users.
  • Failover — if one proxy is blocked, the scraper automatically switches to another.

Which proxies to use for ethical scraping

| Type of proxy | Pros | Cons | When to use |
|---|---|---|---|
| Residential | Real IPs of home users, low risk of a ban | More expensive than other types | Social networks, marketplaces with strong protection |
| Mobile | IPs of mobile operators, maximum trust | Most expensive, fewer available IPs | Instagram, TikTok, mobile applications |
| Data center | Cheap, high speed | Easily detected, often blacklisted | Simple websites, testing |

Recommendation for ethical scraping: Use residential proxies with automatic rotation. They provide a balance between cost and reliability, and their IPs look like regular users.

Proxy rotation strategies

  1. Rotation for each request — each request comes from a new IP. Suitable for scraping sites with strict limits (social networks, marketplaces).
  2. Time-based rotation (every 5-10 minutes) — one IP is used for several requests, then changes. More natural behavior.
  3. Sticky sessions — one IP is used for the entire user session (e.g., authorization + scraping the personal account). Essential for sites with authentication.
  4. Geographical rotation — each region is scraped through proxies from that region. Example: to scrape Moscow prices on Wildberries, use a proxy located in Moscow.

Example of proxy rotation in Python

import requests
import random
import time

# List of proxies (replace with real ones)
proxies_list = [
    {"http": "http://user:pass@proxy1.example.com:8080",
     "https": "http://user:pass@proxy1.example.com:8080"},
    {"http": "http://user:pass@proxy2.example.com:8080",
     "https": "http://user:pass@proxy2.example.com:8080"},
    {"http": "http://user:pass@proxy3.example.com:8080",
     "https": "http://user:pass@proxy3.example.com:8080"}
]

urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

for url in urls:
    # Choose a random proxy
    proxy = random.choice(proxies_list)
    
    try:
        response = requests.get(url, proxies=proxy, timeout=10)
        print(f"Scraped {url} through {proxy}")
    except Exception as e:
        print(f"Error with proxy {proxy}: {e}")
    
    # Delay 3 seconds
    time.sleep(3)

Practical cases: scraping marketplaces and competitors

Let's consider real scenarios of ethical scraping for business.

Case 1: Price monitoring on Wildberries

Task: You sell products on Wildberries and want to track competitors' prices to adjust your own.

Problems:

  • Wildberries blocks IPs with frequent requests
  • Prices depend on the delivery region
  • You need to scrape 100-500 products daily

Ethical solution:

  1. Check robots.txt — Wildberries allows scraping product cards but prohibits API endpoints.
  2. Use residential proxies — for each region (Moscow, St. Petersburg, Novosibirsk), take proxies from that region.
  3. Rotation for each request — scrape each product with a new IP.
  4. Delay 2-3 seconds — pause between requests.
  5. Scrape once a day — do not update prices every hour; daily monitoring is sufficient.

Result: You receive up-to-date competitor prices without blocks. Wildberries does not see abnormal load since requests are distributed over time and IPs.
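Step 2 (region-bound proxies) can be sketched as a lookup over per-region pools. The endpoints and the proxy_for_region helper below are hypothetical placeholders:

```python
import random

# Hypothetical regional proxy pools — replace with real endpoints
REGION_PROXIES = {
    "moscow": ["http://user:pass@msk1.example.com:8080",
               "http://user:pass@msk2.example.com:8080"],
    "novosibirsk": ["http://user:pass@nsk1.example.com:8080"],
}

def proxy_for_region(region: str) -> dict:
    """Pick a random proxy from the pool of the requested region."""
    address = random.choice(REGION_PROXIES[region])
    return {"http": address, "https": address}

print(proxy_for_region("moscow"))
```

Requests for Moscow prices then go through a Moscow exit IP, so the marketplace shows the prices a Moscow customer would see.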

Case 2: Scraping listings on Avito

Task: You are a realtor and want to collect all listings for apartment sales in your city for market analysis.

Problems:

  • Avito shows a CAPTCHA when it detects suspicious activity
  • You need to scrape 5000+ listings
  • Data is updated daily

Ethical solution:

  1. Check robots.txt — Avito allows scraping listing pages but with a Crawl-delay of 5 seconds.
  2. Use residential proxies — rotate every 10 requests (not for every request to avoid looking suspicious).
  3. Delay 5-7 seconds — comply with the Crawl-delay from robots.txt.
  4. Scrape at night — when the load on the site is minimal (2-6 AM).
  5. Cache data — do not scrape the same listing twice; save results in a database.

Result: Overnight, you collect all new listings without CAPTCHA and blocks. Avito does not experience overload, and you get the necessary data.
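Step 5 (caching results in a database) can be sketched with the standard sqlite3 module; the table layout and the save_if_new helper are illustrative assumptions:

```python
import sqlite3

# Minimal listing cache: store each listing once, skip ones we already have.
# ":memory:" keeps the demo self-contained; use a file path in practice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS listings (url TEXT PRIMARY KEY, html TEXT)")

def save_if_new(url: str, html: str) -> bool:
    """Insert the listing unless it is already cached; True means it was new."""
    try:
        conn.execute("INSERT INTO listings VALUES (?, ?)", (url, html))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False

print(save_if_new("https://www.avito.ru/example-listing-1", "<html>...</html>"))  # True
print(save_if_new("https://www.avito.ru/example-listing-1", "<html>...</html>"))  # False
```

The PRIMARY KEY on url guarantees a listing is never scraped and stored twice, which is exactly the "do not scrape the same listing twice" rule.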

Case 3: Analyzing a competitor's assortment

Task: You are the owner of an electronics online store and want to know what new products have appeared at a competitor's store.

Problems:

  • The competitor's website is on protected hosting with an anti-bot system
  • You need to scrape a catalog of 10,000 products
  • You want to do this weekly

Ethical solution:

  1. Check robots.txt — scraping /catalog/ is allowed, but /admin/ and /api/ are prohibited.
  2. Use Sitemap — instead of manually crawling all pages, take the URL list from sitemap.xml (this is faster and does not create unnecessary load).
  3. Residential proxies with rotation every 5 minutes — one IP makes 20-30 requests, then changes.
  4. Delay 3-5 seconds — mimic the behavior of a regular user.
  5. Scrape only new products — compare the current catalog with the previous one and scrape only the changes.

Result: You receive a list of competitor new arrivals weekly without blocks. The competitor's site does not experience problems, and you gain a competitive advantage.
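Steps 2 and 5 can be sketched with the standard library: parse sitemap.xml into a URL list, then diff it against the previous run so only new products are fetched. The competitor domain below is a placeholder; in practice you would download the sitemap first:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap fragment standing in for the downloaded sitemap.xml
sitemap_xml = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://competitor.example.com/catalog/item1</loc></url>
  <url><loc>https://competitor.example.com/catalog/item2</loc></url>
</urlset>
"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]

# Step 5: compare with the previous run and keep only new pages
previous_run = {"https://competitor.example.com/catalog/item1"}
new_urls = sorted(set(urls) - previous_run)
print(new_urls)  # only item2 is new
```

Only new_urls then go into the scraping queue, which keeps the weekly load on the competitor's site to a minimum.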

Tools for automation while complying with rules

There are ready-made tools that simplify ethical scraping and automatically comply with robots.txt.

Scrapy (Python)

Scrapy is a popular framework for scraping in Python. It automatically checks robots.txt and complies with the rules.

Setting up robots.txt compliance in Scrapy:

# settings.py

# Enable robots.txt compliance
ROBOTSTXT_OBEY = True

# Delay between requests (in seconds)
DOWNLOAD_DELAY = 3

# Random delay (from 0.5 to 1.5 * DOWNLOAD_DELAY)
RANDOMIZE_DOWNLOAD_DELAY = True

# Limit concurrent requests to one domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# User-Agent of your scraper
USER_AGENT = 'MyCompanyParser/1.0 (+http://mycompany.com/bot)'

With these settings, Scrapy will automatically check robots.txt before scraping and will comply with all rules.

Apify (cloud platform)

Apify is a cloud platform for web scraping without code. You create a scraper through a visual interface, and Apify automatically manages proxies and compliance with limits.

Benefits for ethical scraping:

  • Built-in proxy rotation (residential and data center)
  • Automatic compliance with robots.txt
  • Delay settings through the interface
  • Scheduling (scraping once a day/week)

Octoparse (no-code scraper)

Octoparse is a desktop application for scraping without programming. It is suitable for marketers and sellers who do not know how to code.

How to set up ethical scraping in Octoparse:

  1. Open task settings
  2. Enable "Respect robots.txt"
  3. Set a delay of 3-5 seconds
  4. Connect proxies in the "Proxy Settings" section
  5. Set up a launch schedule

Puppeteer/Playwright (JavaScript)

Puppeteer and Playwright are libraries for browser automation. They are suitable for scraping sites with JavaScript rendering.

Example of ethical scraping with Puppeteer:

const puppeteer = require('puppeteer');
const robotsParser = require('robots-parser');

async function ethicalScrape(url) {
  // Download and parse robots.txt (robots-parser expects the file contents,
  // not a User-Agent, as its second argument)
  const robotsUrl = 'https://example.com/robots.txt';
  const robotsTxt = await (await fetch(robotsUrl)).text();
  const robots = robotsParser(robotsUrl, robotsTxt);

  if (!robots.isAllowed(url, 'MyParser/1.0')) {
    console.log('Scraping prohibited by robots.txt');
    return;
  }

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Delay before loading the page
  await new Promise(resolve => setTimeout(resolve, 3000));

  await page.goto(url);
  const data = await page.evaluate(() => {
    return document.querySelector('h1').innerText;
  });

  console.log(data);
  await browser.close();
}

ethicalScrape('https://example.com/product1');

Conclusion

Ethical scraping through proxies is a balance between business tasks and respect for website owners. By complying with robots.txt, making delays between requests, and using proxy rotation, you can collect the necessary data without legal risks and blocks. The main principles: check robots.txt before scraping, limit request speed, use an honest User-Agent, and scrape only public data. This will protect your business from lawsuits and ensure stable operation of scrapers.

If you plan to scrape marketplaces, competitor websites, or collect data for market analysis, we recommend using residential proxies with automatic rotation. They provide a balance between cost and reliability, and their IPs appear as regular users.
