Data collection via proxies is a common practice for marketers, analysts, and business owners. But where is the line between legal parsing and breaking the law? This article covers the legal side of working with data: what can be collected, which methods are allowed, and how to avoid violating GDPR and Russian personal data legislation.
Legal Basics of Data Collection: What the Law Says
Data collection via proxies is regulated by several legal norms depending on the jurisdiction. In Russia, the main document is Federal Law No. 152-FZ "On Personal Data," in Europe — GDPR (General Data Protection Regulation), and in the USA — various industry laws and case law.
The key principle: data collection itself is not illegal. What may be illegal is the method of obtaining the data, the way it is used, or a violation of the website owner's rights. Proxies in this context are simply a technical tool, like a browser or an internet connection.
It is important to understand that using proxies does not automatically make data collection illegal. Proxies are a means of protecting privacy and working around technical restrictions (geo-blocking, rate limits), not a tool for illegal activity.
Russian legislation distinguishes between several categories of data:
- Public Data — information published in open access without restrictions (prices in stores, news, public profiles)
- Personal Data — information related to a specific individual (full name, phone number, email, address)
- Commercial Secrets — data that has commercial value and is protected by the owner
- Technical Data — logs, metrics, analytics that do not contain personal information
Each category has its own rules for collection and use. For example, parsing competitor prices on Wildberries or Ozon is the collection of public data that does not violate the personal data law. However, collecting email addresses from someone else's database is already a violation.
Public Data: What Can Be Parsed Without Restrictions
Public data is information that the website owner has consciously placed in open access without requiring authorization or payment. Collecting such data via proxies is completely legal if technical and ethical norms are followed.
| Type of Data | Examples | Legal Status |
|---|---|---|
| Product Prices | Wildberries, Ozon, Yandex.Market | Legal |
| Product Descriptions | Specifications, photos, reviews | Legal (considering copyright) |
| News and Articles | Media sites, blogs | Legal (for analysis, not publication) |
| Job Listings | hh.ru, Avito Work | Legal |
| Advertisements | Avito, Youla (without contacts) | Legal |
| Weather and Geodata | Open APIs, weather services | Legal |
Typical scenarios for legal use of proxies for collecting public data:
- Monitoring Competitor Prices — sellers on marketplaces track prices daily through parsers to remain competitive
- Real Estate Market Analysis — agencies collect data on listings from Avito and CIAN for analytics
- Job Monitoring — HR agencies parse hh.ru for salary and market demand analysis
- News Collection — media monitoring collects publications for clients (PR agencies, analysts)
For such tasks, data center proxies are usually used — they provide high speed and stability when parsing large volumes of data. The main thing is to maintain reasonable intervals between requests to avoid excessive load on servers.
Personal Data: Where the Red Line Is Drawn
Personal data is information that directly or indirectly relates to a specific person. The collection of such data is regulated most strictly, and it is important to clearly understand the boundaries of what is permissible.
According to 152-FZ, personal data includes:
- Full Name
- Date and Place of Birth
- Residential Address
- Phone Number
- Email Address
- Passport Data
- Photographs (if they can identify a person)
- IP Addresses (in some jurisdictions)
Prohibited: Collection of personal data without the consent of the data subject or without a legal basis. For example, parsing phone numbers and emails from social media profiles for mailings is a direct violation of 152-FZ, with fines of up to 500,000 rubles.
However, there are exceptions when the collection of personal data is legal:
- Data is Publicly Posted by the Subject — if a person has published their phone number in an advertisement on Avito, you can see it and use it to contact them regarding that advertisement
- Processing for Journalistic Purposes — media can collect public data for preparing materials
- Statistical and Research Purposes — if the data is anonymized and does not allow for the identification of a specific person
- Explicit Consent Exists — the person has given written consent for their data to be processed
A practical example for marketers: you can collect a list of companies and their phone numbers from public sources (company websites, directories like 2GIS). But you CANNOT parse personal phone numbers of employees from their VK or Instagram profiles for cold calls — that is a violation.
| Scenario | Legality | Comment |
|---|---|---|
| Parsing Phone Numbers from Avito Ads | Legal | Data is publicly posted for contact |
| Parsing Emails from LinkedIn Profiles | Gray Area | Violates LinkedIn ToS, but not always the law |
| Collecting Full Names and Phone Numbers from Closed VK Groups | Prohibited | Violation of 152-FZ and ToS |
| Parsing Company Contacts from 2GIS | Legal | Public directory |
| Collecting Emails from Company Websites for B2B Mailings | Legal | Contacts are posted for communication |
GDPR and International Requirements When Working with Proxies
If you are collecting data from websites aimed at a European audience, or your company works with clients from the EU, you must comply with GDPR (General Data Protection Regulation). Fines for the most serious violations can reach 20 million euros or 4% of the company's worldwide annual turnover, whichever is higher.
Key principles of GDPR that are important when collecting data:
- Lawfulness, Fairness, and Transparency — data collection must have a legal basis (consent, contract, legitimate interest)
- Purpose Limitation — data is collected only for a specific stated purpose
- Data Minimization — collect only the data that is actually necessary
- Accuracy — data must be current and correct
- Storage Limitation — do not keep data longer than necessary
- Integrity and Confidentiality — protect data from breaches
Using proxies when working with European websites does not exempt you from complying with GDPR. GDPR protects people who are in the EU regardless of their citizenship, so if you are parsing personal data about them, you are required to:
- Have a legal basis for processing (e.g., legitimate interest for market analysis)
- Ensure the ability to delete data upon request from the subject ("right to be forgotten")
- Not transfer data to third parties without consent
- Protect data from breaches (encryption, access control)
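The deletion and storage-limitation duties above can be sketched in a few lines. This is a minimal illustration, not a compliance tool: the `RecordStore` class, the `StoredRecord` fields, and the 90-day `RETENTION` value are all assumptions made for the example, and a real system would work against a database rather than a list in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical retention period; the right value depends on the stated purpose.
RETENTION = timedelta(days=90)


@dataclass
class StoredRecord:
    subject_id: str        # key that identifies the data subject
    payload: dict          # the collected fields
    collected_at: datetime


class RecordStore:
    """In-memory stand-in for a real database table."""

    def __init__(self) -> None:
        self._records = []

    def add(self, record: StoredRecord) -> None:
        self._records.append(record)

    def erase_subject(self, subject_id: str) -> int:
        """Delete every record of one subject; return how many were removed."""
        before = len(self._records)
        self._records = [r for r in self._records if r.subject_id != subject_id]
        return before - len(self._records)

    def purge_expired(self, now: datetime) -> int:
        """Delete records older than RETENTION; return how many were removed."""
        before = len(self._records)
        self._records = [
            r for r in self._records if now - r.collected_at <= RETENTION
        ]
        return before - len(self._records)

    def __len__(self) -> int:
        return len(self._records)
```

Calling `erase_subject` when a deletion request arrives and `purge_expired` on a schedule keeps both obligations in one place, and the returned counts can go into an audit log.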
Practical Advice: If you are collecting data for market analytics (prices, assortment, trends), any processing of personal data involved can usually rely on "legitimate interest" under GDPR. But if you are collecting emails for mailings, explicit consent from each recipient is needed.
When using residential proxies to access European websites, ensure that the proxy provider also complies with GDPR — this is important for the data processing chain.
Robots.txt and Terms of Service: Legal Force of Restrictions
One of the most contentious issues in web scraping is whether robots.txt files and user agreements (Terms of Service, ToS) prohibiting automated data collection have legal force.
Robots.txt
The robots.txt file is a technical recommendation for search bots, not a legal document. In most jurisdictions, violating robots.txt is not a crime in itself. However, there are nuances:
- USA — there are precedents where courts recognized violations of robots.txt as "unauthorized access" (CFAA), but this is a controversial practice
- Europe — robots.txt usually has no legal force but can be used as evidence of violating ToS
- Russia — there is no clear judicial practice, but ignoring robots.txt may be regarded as creating excessive load on the server
Practical recommendation: comply with robots.txt if you do not want to take risks. If you need data from closed sections — contact the website owner for an API or official permission.
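Checking robots.txt before each request is easy to automate. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the `shop.example` URLs, and the bot name are hypothetical. A real crawler would load the file from the site itself with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /cart/
Crawl-delay: 2
"""

BOT_NAME = "example-price-bot"  # hypothetical bot name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules let BOT_NAME fetch this URL."""
    return parser.can_fetch(BOT_NAME, url)


parser = build_parser(ROBOTS_TXT)
catalog_allowed = is_allowed(parser, "https://shop.example/catalog/phones")
account_allowed = is_allowed(parser, "https://shop.example/account/orders")
crawl_delay = parser.crawl_delay(BOT_NAME)  # seconds requested by the site
```

Here the catalog page is allowed, the account page is not, and the site asks for a two-second pause between requests.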
Terms of Service (ToS)
User agreements are contracts between you and the website owner. Many large platforms (Facebook, LinkedIn, Amazon) explicitly prohibit automated data collection in their ToS.
The legal force of ToS depends on several factors:
| Factor | Impact on Legal Force |
|---|---|
| You are registered on the site | ToS has full force of contract — violation may lead to blocking and lawsuits |
| You are not registered | ToS has limited force — you did not explicitly accept the terms |
| Data is Public | ToS may prohibit commercial use but not personal use |
| You are creating load on the server | Violation of ToS + possible liability for DDoS |
Notable court precedents:
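- hiQ Labs vs LinkedIn (2019, USA): the Ninth Circuit held that parsing publicly available profiles likely does not violate the CFAA, even if prohibited by ToS
- Ryanair vs PR Aviation (2015, EU): the Court of Justice of the EU held that the Database Directive does not stop the owner of a database that is not protected by copyright or the sui generis right from restricting its use by contract, so ToS limits on parsing public flight data can be enforceable
- eBay vs Bidder's Edge (2000, USA): the court issued a preliminary injunction against the parser, citing the load its crawlers placed on eBay servers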
Conclusion: ToS may prohibit you from using the site, but it does not always prohibit the collection of public data. However, violating ToS always carries the risk of account blocking and potential lawsuits.
Legal Methods of Data Collection for Business
There are many completely legal ways to collect data for business purposes. The key is to use the right tools and adhere to ethical norms.
1. Using Official APIs
Many platforms provide official APIs for data access. This is the safest way:
- Google Maps API — for geodata and information about places
- Twitter API — for analyzing mentions and trends
- Wildberries API — for sellers (access to their data)
- OpenWeatherMap API — for weather data
APIs usually have request limits (rate limits), but you get structured data and legal protection.
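When an API signals that a rate limit has been hit (commonly with HTTP status 429), the client should wait before retrying. The function below is a minimal sketch of one common approach, exponential backoff that honors a numeric `Retry-After` header when the server sends one. The function name and the `base` and `cap` defaults are assumptions for illustration. A `Retry-After` value given as an HTTP date is not handled here and falls back to the backoff.

```python
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Return how many seconds to wait before retry number `attempt` (0-based).

    If the server sent a Retry-After header with a number of seconds,
    that value wins. Otherwise the delay doubles on every attempt and
    is limited by `cap`.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    return min(cap, base * 2 ** attempt)
```

With the defaults, `retry_delay(0)` gives 1 second, `retry_delay(3)` gives 8 seconds, and any attempt from 6 onward is held at the 60-second cap.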
2. Parsing Public Data Ethically
If there is no API, you can parse public pages while adhering to the following rules:
- Maintain Intervals — pause between requests (1-3 seconds) to avoid creating load
- Respect robots.txt — even if it is not legally required
- Use User-Agent — honestly identify your bot
- Parse During Off-Peak Hours — server load is lower at night
For such tasks, residential proxies are suitable — they mimic regular users and are less likely to be blocked by anti-bot systems.
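The rules listed above translate into a few small helpers. This is a sketch under stated assumptions: the bot name, the contact address, and the off-peak window of 01:00 to 05:59 are hypothetical, and the hour check expects a time already expressed in the target site's local time. The request object is only built, not sent.

```python
import random
from datetime import datetime
from urllib.request import Request

# Hypothetical identity: a bot name plus a contact address for site owners.
USER_AGENT = "example-price-bot/1.0 (+mailto:bot-owner@example.com)"

MIN_PAUSE_S, MAX_PAUSE_S = 1.0, 3.0  # the 1-3 second interval from the list
OFF_PEAK_HOURS = range(1, 6)         # assumed quiet window: 01:00-05:59


def build_request(url: str) -> Request:
    """Create a request that identifies the bot honestly."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def next_pause() -> float:
    """Pick a random pause in seconds between two consecutive requests."""
    return random.uniform(MIN_PAUSE_S, MAX_PAUSE_S)


def is_off_peak(moment: datetime) -> bool:
    """Return True if the given site-local time falls into the quiet window."""
    return moment.hour in OFF_PEAK_HOURS
```

A crawl loop would call `is_off_peak` before starting a batch and sleep for `next_pause()` seconds after every request.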
3. Purchasing Ready-Made Datasets
Many companies sell legally collected data:
- Statistical Data — Rosstat, World Bank, UN
- Market Research — Nielsen, GfK, Kantar
- Company Databases — SPARK, Kontur.Focus (legal B2B databases)
- Industry Data — specialized providers for real estate, finance, retail
4. Crowdsourcing and Surveys
Collect data directly from users with their consent:
- Online surveys (Google Forms, SurveyMonkey)
- Loyalty programs that offer bonuses in exchange for data
- User-generated content (reviews, comments on your site)
- Affiliate programs that include data exchange
What is Prohibited: Actions with High Legal Risk
Some data collection methods are unequivocally illegal or carry a high risk of litigation. Avoid the following practices:
Categorically Prohibited:
- Hacking and Bypassing Protection — bypassing CAPTCHA, password cracking, exploiting vulnerabilities (Article 272 of the Criminal Code of the Russian Federation — up to 7 years)
- Collecting Data from Closed Accounts — parsing closed social media profiles, private groups
- DDoS Attacks — excessive load on the server leading to denial of service (Article 273 of the Criminal Code of the Russian Federation)
- Collecting Financial Data — card numbers, CVV, banking details (Article 159.6 of the Criminal Code of the Russian Federation — fraud)
- Parsing Competitors' Databases — theft of commercial secrets (Article 183 of the Criminal Code of the Russian Federation)
- Collecting Medical Data — diagnoses, medical history without consent (special category of PD)
Gray Area — High Risk:
- Parsing Emails for Spam — even if the email is public, mass mailing without consent violates 152-FZ and advertising laws
- Aggressive Parsing — thousands of requests per second may be regarded as an attack
- Bypassing Blocks via Proxies — if the site has blocked you, continuing to parse may be considered unauthorized access
- Parsing Paid Content — bypassing paid subscriptions, closed materials
Real examples of court cases:
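- Facebook vs Power Ventures (2016): the Ninth Circuit held that Power Ventures violated the CFAA by continuing to access Facebook after receiving a cease-and-desist letter; the roughly $3 million originally awarded to Facebook was later reduced on remand
- LinkedIn vs hiQ Labs (2022): the Ninth Circuit again sided with hiQ on the CFAA question, but the district court then found that hiQ had breached LinkedIn's User Agreement, and the case ended in a settlement that included a judgment against hiQ
- Clearview AI (2022): data protection authorities in several European countries, including Italy and France, fined the company for collecting photos from social networks for facial recognition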
Safe Practices: How to Protect Your Business from Claims
To minimize legal risks when collecting data via proxies, follow these recommendations:
1. Document Legal Grounds
Create an internal document that explains:
- What data you are collecting
- From which sources (only public)
- For what purposes (market analysis, price monitoring)
- How you protect data from breaches
- How long you store the data
This will help prove good faith in case of claims.
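One way to keep such a document consistent is to store it as structured data next to the parser itself. The sketch below is only an illustration: the `CollectionRecord` class, its field names, and the sample values are assumptions, not a required legal format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CollectionRecord:
    data_types: list[str]   # what data is collected
    sources: list[str]      # where it comes from (public sources only)
    purpose: str            # why it is collected
    protection: str         # how it is protected from breaches
    retention_days: int     # how long it is stored


def to_json(record: CollectionRecord) -> str:
    """Serialize the record so it can be versioned alongside the code."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


# Hypothetical example entry for a price-monitoring project.
record = CollectionRecord(
    data_types=["product prices", "product names"],
    sources=["https://shop.example/catalog"],
    purpose="competitor price monitoring",
    protection="encrypted storage, access limited to analysts",
    retention_days=90,
)
```

Keeping the output of `to_json(record)` under version control gives a dated history of what was collected and why.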
2. Use Technical Security Measures
- Rate Limiting — limit the speed of requests (no more than 1-2 per second)
- Honest User-Agent — do not disguise as a browser, specify your bot's name
- Contact Email — add an email for contact in the User-Agent
- Proxy Rotation — use mobile proxies or residential ones to distribute the load
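The rate limit from the list above can be enforced in code instead of relying on ad hoc sleeps. Below is a minimal sketch of a fixed-interval limiter. The `RateLimiter` class is hypothetical, and the injectable `clock` and `sleep` parameters exist only so the logic can be tested without real waiting.

```python
import time


class RateLimiter:
    """Allow at most `max_per_second` calls by spacing them evenly."""

    def __init__(self, max_per_second: float,
                 clock=time.monotonic, sleep=time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # the first call never waits

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Typical use is to create `limiter = RateLimiter(max_per_second=1)` once and call `limiter.wait()` immediately before every request, so the limit holds no matter how fast the rest of the code runs.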
3. Anonymize Personal Data
If you have collected data containing personal information:
- Remove full names, phone numbers, emails immediately after processing
- Aggregate data (instead of "Ivan, 35 years old, Moscow" → "men aged 30-40, Moscow")
- Use hashing for identifiers
- Do not store more data than necessary for the task
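The steps above can be sketched as a single cleaning function. The field names, the secret key, and the ten-year age buckets are assumptions for illustration. Note that keyed hashing of an identifier is generally treated as pseudonymization rather than full anonymization, because whoever holds the key can still link records to a person.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-secret-key"

# Fields removed from every record after processing.
DIRECT_IDENTIFIERS = {"full_name", "phone", "email"}


def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed SHA-256 hash (HMAC)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def age_bucket(age: int) -> str:
    """Map an exact age to a ten-year group, e.g. 35 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def anonymize(record: dict) -> dict:
    """Drop direct identifiers and the exact age, keep aggregated fields."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "age"
    }
    cleaned["user_key"] = pseudonymize(record["email"])
    cleaned["age_group"] = age_bucket(record["age"])
    return cleaned
```

For a record with a name, email, phone, age 35, and city Moscow, the result keeps only the city, the age group `30-39`, and a 64-character `user_key` that lets repeated records be matched without storing the email itself.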
4. Obtain Consent When Possible
If you plan to use data for marketing or mailings:
- Add a consent checkbox for processing personal data
- Explain how the data will be used
- Provide an option to unsubscribe
- Keep records of consent confirmations
5. Consult with Lawyers
If your business critically depends on data collection, hire a lawyer specializing in IT law. They can help:
- Draft a Privacy Policy and Terms of Use
- Conduct an audit for GDPR and 152-FZ compliance
- Prepare responses to claims from website owners
- Register personal data processing with Roskomnadzor (if required)
Checklist for Legal Data Collection:
✅ Collect only public data
✅ Do not create excessive load on servers
✅ Comply with robots.txt (if possible)
✅ Do not collect personal data without consent
✅ Anonymize data before storage
✅ Use data only for stated purposes
✅ Protect data from breaches
✅ Be ready to delete data upon request from the subject
Conclusion
Data collection via proxies is a legal and widespread practice if legal and ethical norms are followed. Key principles: collect only public data, do not violate the rights of personal data subjects, do not create excessive load on servers, and use data in good faith.
Most business tasks — monitoring prices on marketplaces, analyzing competitors, collecting news, conducting market research — fully comply with legal frameworks. The main thing is to understand the boundaries and not to cross them.
If you plan to collect data for analytics or monitoring, we recommend using residential proxies — they provide a high level of anonymity and minimal risk of blocks, allowing you to work with data legally and effectively. For tasks requiring high processing speed, data center proxies are suitable, and for working with mobile platforms — mobile proxies.
Remember: technology is neutral; it is how you use it that matters. Proxies are a tool for legal data work, not a way to circumvent the law. Follow the rules, respect the rights of others, and your business will be protected from legal risks.