If you are scraping marketplaces, monitoring competitor prices, or collecting data for analytics, GDPR (General Data Protection Regulation) compliance directly affects your business. Fines can reach €20 million or 4% of the company's annual turnover, and European regulators are actively imposing them. In this guide, we will discuss what data can be collected legally, how to properly use proxies for compliance, and what protective measures to implement in the web scraping process.
It is important to understand: GDPR regulates not the scraping itself, but the processing of personal data of EU citizens. Even if your company is located outside of Europe, the regulation applies to you if you collect data from European users.
What is GDPR and how does it apply to web scraping
GDPR (General Data Protection Regulation) is a European regulation on the protection of personal data that came into effect in May 2018. It applies to any company or individual that processes personal data of citizens of the European Union, regardless of the location of the company itself.
For web scraping, this means: if you scrape public websites and collect information about European users (names, emails, phone numbers, addresses, behavioral data), you automatically become subject to GDPR regulation. This applies to all popular tasks:
- Scraping marketplaces (Wildberries, Ozon, Amazon EU), if you collect data about sellers or buyers
- Monitoring competitor prices, if the data includes company contact information
- Collecting contacts for B2B: emails, phone numbers, job titles of company employees
- Social media analysis: user profiles, comments, activity
- Aggregating listings (real estate, job vacancies, services) with contact information
The key point: GDPR does not prohibit web scraping as such. It establishes rules for processing personal data. If you collect only public non-personal information (product prices, specifications, descriptions without reference to specific individuals), GDPR does not formally apply. But as soon as names, contacts, or user identifiers appear in the data, the requirements of the regulation come into force.
Important: Fines for violating GDPR can reach up to €20 million or 4% of the company's annual turnover (whichever is higher). In 2023, European regulators issued fines totaling over €2.5 billion. The largest fines were imposed on Meta (€1.2 billion), Amazon (€746 million), and TikTok (€345 million).
What data is considered personal under GDPR
GDPR defines personal data very broadly: it is any information relating to an identified or identifiable natural person. In practice, for web scraping, personal data includes:
| Data Category | Examples in Scraping | Risk Level |
|---|---|---|
| Direct Identifiers | Full name, email, phone, address, profile photo, social media username | High |
| Indirect Identifiers | IP address, cookie ID, device fingerprint, geolocation, browsing history | Medium |
| Special Categories | Racial origin, political views, religion, health, biometrics | Critical |
| Business Information | Job title, company, work email/phone, LinkedIn profile | Medium |
| Non-Personal Data | Product prices, specifications, descriptions, statistics without reference to individuals | Low |
A common mistake is to assume that publicly available data can be freely collected and used. GDPR does not make exceptions for public information. If you scrape LinkedIn profiles, contacts from corporate websites, or ads with phone numbers, you are handling personal data, and the requirements of the regulation fully apply.
Special attention should be paid to IP addresses. The Court of Justice of the EU ruled in 2016 (the Breyer case) that dynamic IP addresses are personal data, since the internet provider can identify the user behind them. This matters when using proxies: if you log the IP addresses of end users while scraping, that is processing of personal data.
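If IP addresses must be kept temporarily (for example, for rate-limit accounting), one way to reduce risk is to store a keyed hash instead of the raw address. The sketch below is illustrative, not a prescribed method; the function name is hypothetical, and the salt would in practice come from a secrets manager:

```python
import hashlib
import hmac

# Assumed: a secret salt loaded from a secrets manager, never hard-coded in production
SECRET_SALT = b"load-me-from-a-secrets-manager"

def pseudonymize_ip(ip: str) -> str:
    """Replace a raw IP with a keyed hash so logs no longer contain a direct identifier."""
    return hmac.new(SECRET_SALT, ip.encode(), hashlib.sha256).hexdigest()
```

The same IP always maps to the same hash, so rate-limit counters still work, but the log no longer holds the identifier itself.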
Legal grounds for data collection in scraping
GDPR requires a legal basis for processing personal data. The following bases are applicable to web scraping (Article 6 GDPR):
1. Consent of the data subject
The most obvious, but least applicable basis for scraping. Consent must be:
- Freely given (not coerced or bundled with unrelated terms)
- Specific (given for a particular purpose)
- Informed (the user understands what you are doing with the data)
- Revocable (can be easily withdrawn)
In scraping, obtaining such consent is practically impossible: you collect data automatically, without interacting with users. Therefore, this basis is rarely applied.
2. Legitimate Interests
The most commonly used basis for web scraping. You may process data if it is necessary for your legitimate interests, provided that the interests of the data subject do not outweigh yours. Examples of legitimate interests include:
- Monitoring competitor prices, to inform your own pricing strategy
- Market analysis, for business analytics and research
- Fraud detection, collecting data to protect against fraud
- Service improvement, aggregating public data to create a useful product
It is important to conduct a Legitimate Interest Assessment (LIA): document why your interest outweighs the interests of users. For example, scraping product prices from a marketplace is a legitimate interest; collecting emails for spam is a violation.
3. Performance of a contract or public task
These grounds are rarely applicable in scraping. Performance of a contract is relevant if you collect data to provide a service under a contract with the user (for example, a job aggregator collects data to show to users). Public task applies to government bodies.
Practical advice:
Document the legal basis for each type of data collected. Create an internal document (Data Processing Record) where you describe: what data you collect, for what purposes, on what basis, how you store and protect it. This is the first thing regulators will ask for during an audit.
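What such a record might look like in practice is a matter of convention; a plain structured entry per type of collected data is enough. A minimal sketch, with purely illustrative field names and values:

```python
# One entry per type of data collected; fields loosely follow Article 30 GDPR
processing_record = {
    "data_collected": ["product title", "price", "SKU"],
    "purpose": "competitor price monitoring",
    "legal_basis": "legitimate interests (Article 6(1)(f))",
    "retention_period_days": 60,
    "storage": "encrypted database, EU region",
    "processors": ["proxy provider (DPA signed)", "cloud hosting (DPA signed)"],
}
```

Keeping records like this as structured data (rather than free-form documents) makes them easy to audit and to update when the scraper changes.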
The role of proxies in GDPR compliance: protection and anonymization
Proxy servers play a dual role in the context of GDPR compliance in web scraping. On one hand, they help minimize the collection of personal data and protect privacy. On the other hand, they can create risks themselves if used improperly.
How proxies help comply with GDPR
1. Anonymization of requests. When you use residential proxies for scraping, the target site sees the IP address of the proxy server, not your actual IP. This means the site cannot directly identify your company as the source of requests. For GDPR, this is important if you want to minimize the disclosure of your own data.
2. Geographical distribution. Residential and mobile proxies allow you to make requests from IP addresses of different countries. This is useful for collecting region-specific data (for example, prices in different EU countries) without the need for physical presence. At the same time, you comply with the principle of minimization by collecting only data available in a specific region.
3. IP rotation to minimize traces. Automatic rotation of IP addresses through proxies helps avoid creating a profile of your scraping activity on the target site. This reduces the risk that the site will collect and store your metadata (request times, behavior patterns), which may themselves be personal data.
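A simple rotation scheme picks a different proxy per request so that no single IP accumulates a behavioral profile. A minimal stdlib sketch; the pool entries are placeholder endpoints, not real providers:

```python
import random
import urllib.request

# Hypothetical pool of proxy endpoints (placeholders, replace with your provider's)
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_with_rotation(url: str) -> bytes:
    """Route each request through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=10) as resp:
        return resp.read()
```

Most commercial providers also offer a single rotating gateway endpoint, which achieves the same effect without managing a pool yourself.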
Risks of using proxies in the context of GDPR
1. Logging of data by the proxy provider. If your proxy provider logs your requests and the IP addresses of target users, it becomes a Data Processor under GDPR. You are required to enter into a Data Processing Agreement (DPA) with them, outlining data protection obligations. Choose providers that offer a no-log policy or are willing to sign a DPA.
2. Using proxies to bypass protections. Some sites block scraping through technical measures (rate limiting, CAPTCHA, IP blocks). Using proxies to circumvent these measures may violate not only GDPR but also other laws (such as the Computer Fraud and Abuse Act in the US or the E-Commerce Directive in the EU). GDPR is not the issue here, but legal risks exist.
3. Proxies from unreliable providers. If you use cheap public proxies or proxies with unknown IP sources, there is a risk that these IPs are compromised or used for illegal activities. This may lead to the collected data being considered unlawfully obtained.
| Proxy Type | Benefits for GDPR | Risks |
|---|---|---|
| Residential Proxies | Real IPs of home users, high anonymity, low risk of blocking | Need to ensure that IP owners consented to the provider |
| Mobile Proxies | IPs of mobile operators, ideal for social media, rarely blocked | High cost, less control over geolocation |
| Data Center Proxies | High speed, low price, full provider control | Easily detected, more frequently blocked, unsuitable for sensitive tasks |
Data minimization principle: collect only what is necessary
One of the key principles of GDPR is data minimization (Article 5). You must collect only the personal data that is truly necessary to achieve the stated purpose. This directly affects the setup of scraping.
Practical steps for minimization
1. Filter data at the collection stage. Do not save the entire page ā extract only the necessary fields. For example, if you are scraping a marketplace for price monitoring, do not save seller names, their ratings, or contacts. Collect only the product name, price, SKU.
```python
# Bad: saving everything, including personal data
product_data = {
    'title': title,
    'price': price,
    'seller_name': seller_name,    # Personal data!
    'seller_email': seller_email,  # Personal data!
    'seller_rating': seller_rating,
    'reviews': reviews,            # May contain buyer names!
}

# Good: only what is necessary for price monitoring
product_data = {
    'title': title,
    'price': price,
    'sku': sku,
    'availability': availability,
}
```
2. Anonymize or pseudonymize data. If you need to track dynamics (for example, price changes for a specific seller), do not store the seller's name; create a hash from their ID. This is pseudonymization: the data cannot be read directly, but records for the same seller can still be matched.
```python
import hashlib

# Pseudonymize the seller ID: store a hash instead of the raw identifier
seller_id_hash = hashlib.sha256(seller_id.encode()).hexdigest()

product_data = {
    'title': title,
    'price': price,
    'seller_hash': seller_id_hash,  # Same seller always maps to the same hash
}
```

Note that a plain hash of a short identifier can be brute-forced; adding a secret salt makes the pseudonymization considerably stronger.
3. Delete data after use. GDPR requires that data is not stored longer than necessary (storage limitation). If you collect prices for a daily report, delete data older than 30-60 days. Set up automatic database cleanup.
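Such a cleanup can be a short scheduled job. A minimal sketch against SQLite, assuming a hypothetical `products` table with an ISO-formatted `scraped_at` column; adapt the table and retention period to your own schema and documented purpose:

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 60  # illustrative; choose a period that matches your stated purpose

def purge_expired_rows(conn: sqlite3.Connection) -> int:
    """Delete product rows older than the retention window; returns rows removed."""
    cutoff = (datetime.utcnow() - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute("DELETE FROM products WHERE scraped_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Run it from cron or your task scheduler; logging the returned row count gives you an audit trail that retention is actually enforced.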
4. Do not collect special categories of data. Avoid collecting data on race, health, political views, religion (Article 9 GDPR). Explicit consent or very compelling grounds are required for them. In scraping, this is almost impossible to justify.
Practical example: A company scraped LinkedIn to collect contacts of HR specialists. They collected full names, emails, profile photos, current positions, previous workplaces. Under GDPR, this is excessive: for a mailing, email and job title are sufficient. Photos, work history, and full names are unnecessary personal data that increase risks.
Secure storage of collected data
GDPR requires ensuring the security of personal data (Article 32). If you collect data through scraping, you must protect it from leaks, unauthorized access, and loss. Here is a minimum set of measures:
Technical security measures
- Data encryption at rest. Store the database with collected data in encrypted form. Use AES-256 or similar standards. Cloud providers (AWS, Google Cloud, Azure) offer automatic disk encryption.
- Data encryption in transit. All requests to APIs, databases, and proxies must go over HTTPS/TLS. Never transmit personal data over unencrypted channels.
- Access control. Limit access to the database: only authorized employees should see the collected data. Use role-based access control (RBAC) and log all data access.
- Regular backups. Make backups, but store them as securely as the main data: encrypted, with access behind two-factor authentication.
- Monitoring and auditing. Set up a monitoring system to detect suspicious activity (e.g., mass data downloads). Regularly conduct security audits.
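One way to implement encryption at rest at the application level is an AEAD cipher such as AES-256-GCM. A minimal sketch using the third-party `cryptography` library (an assumption, not the only option; full-disk or database-level encryption achieves the same goal):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Generate once and keep in a secrets manager, never in source control
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

nonce = os.urandom(12)  # must be unique per encryption
plaintext = b'{"title": "Widget", "price": "19.99"}'
ciphertext = aesgcm.encrypt(nonce, plaintext, None)  # safe to write to disk

# Decryption requires both the key and the nonce stored alongside the ciphertext
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext
```

GCM also authenticates the data, so tampering with stored records is detected at decryption time.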
Organizational measures
- Privacy policy. Create an internal document describing how you collect, store, and use data. This is the basis for compliance.
- Employee training. All employees with access to data must understand GDPR requirements and the consequences of violations.
- Appointment of a DPO (Data Protection Officer). If your main activity involves regular and systematic monitoring of data subjects on a large scale, GDPR requires appointing a data protection officer.
- Data breach response plan. Prepare a procedure for a data breach. GDPR requires notifying the regulator within 72 hours of discovering a breach.
Data storage security checklist:
- [ ] Database is encrypted (AES-256 or higher)
- [ ] Password + 2FA access for all users
- [ ] Logging of all data access
- [ ] Regular backups (encrypted, in separate storage)
- [ ] Automatic deletion of data older than N days
- [ ] Firewall and protection against SQL injection
- [ ] Regular software updates and security patches
How to handle data deletion requests
GDPR grants data subjects (the people whose data you collected) a number of rights. For web scraping, the most relevant are:
- Right to Access. The user can request a copy of all data you hold about them. You must provide it within 30 days.
- Right to Erasure / "Right to be Forgotten." The user can request the deletion of all their data. You must comply with the request if there are no legal grounds for retention.
- Right to Rectification. If the data is inaccurate, the user can request it to be corrected.
- Right to Restriction. Temporary freezing of data processing until a dispute is resolved.
The problem with scraping: you often do not know whose data you collected. Users did not register with you, did not provide an email for contact. How can they send a request? How do you identify them?
Practical solutions
1. Create a public request form. Place a "GDPR Data Subject Requests" page on your website with a form where the user can specify their email and describe what data they want to delete/get. Indicate that you will respond within 30 days.
2. Verify requests. Ensure that the request came from the actual data owner. Ask for confirmation (for example, send a code to the email that the user specified as theirs). This will protect against fraudulent requests.
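The verification step can be as simple as a one-time code emailed to the address the requester claims to own. A minimal stdlib sketch; the function names and in-memory store are illustrative, and a real system would persist pending codes with an expiry:

```python
import secrets

# In-memory store of pending verifications (illustrative; persist with a TTL in production)
pending = {}

def start_verification(email: str) -> str:
    """Generate a one-time code to send to the address the requester claims to own."""
    code = secrets.token_hex(4)  # 8 hex characters
    pending[email] = code
    return code  # in practice, email this code rather than returning it to the caller

def confirm_verification(email: str, submitted_code: str) -> bool:
    """Only process the GDPR request if the submitted code matches."""
    return secrets.compare_digest(pending.get(email, ""), submitted_code)
```

`secrets.compare_digest` avoids timing side channels when comparing codes, and `secrets.token_hex` gives unpredictable values, unlike `random`.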
3. Automate deletion. Create a script that deletes all related data from the database based on email or another identifier. Important: deletion must be complete, covering the main database, backups, and logs.
```python
import sqlite3
from datetime import datetime

# Example script for deleting data by email
def delete_user_data(conn: sqlite3.Connection, email: str) -> str:
    # Delete from the main database
    conn.execute("DELETE FROM scraped_contacts WHERE email = ?", (email,))
    # Delete from logs (if stored)
    conn.execute("DELETE FROM activity_logs WHERE user_email = ?", (email,))
    # Queue for removal from backups that cannot be purged immediately
    conn.execute(
        "INSERT INTO deletion_queue (email, requested_at) VALUES (?, ?)",
        (email, datetime.utcnow().isoformat()),
    )
    conn.commit()
    # Log the deletion request itself (needed as compliance evidence)
    log_gdpr_request('deletion', email)
    return "Data deleted successfully"
```
4. Document all requests. Keep a log of all GDPR requests: who requested, when, what was done. This will be needed during a regulator audit.
5. Respond on time. You have 30 days to respond (the deadline can be extended by a further two months in complex cases, but you must notify the requester of the extension). Missing the deadline is a violation of GDPR.
Important: If you cannot identify the user in your database (for example, you only collected aggregated data without email), you have the right to refuse the request. But this must be justified: "We do not store personal data that allows you to be identified." This is another argument in favor of data minimization.
Practical GDPR compliance checklist for scraping
Use this checklist before launching any web scraping project involving personal data of EU citizens:
Stage 1: Planning
- [ ] Determine whether the collected data contains personal information (full name, email, IP, phone numbers, etc.)
- [ ] If yes, determine the legal basis for collection (most often: legitimate interests)
- [ ] Conduct a Legitimate Interest Assessment (LIA) and document the result
- [ ] Identify the minimum set of data necessary for your purpose
- [ ] Set a data retention period (for example, 30 days)
Stage 2: Setting up infrastructure
- [ ] Choose a proxy provider with a no-log policy or willingness to sign a DPA
- [ ] Set up database encryption (AES-256)
- [ ] Set up access control (RBAC) for collected data
- [ ] Enable logging of all data access
- [ ] Set up automatic deletion of data older than the established period
- [ ] Set up encrypted backups
Stage 3: Developing the scraper
- [ ] Implement data filtering at the collection stage (do not save unnecessary fields)
- [ ] Use pseudonymization or anonymization where possible
- [ ] Do not collect special categories of data (race, health, religion, etc.)
- [ ] Use HTTPS for all requests
- [ ] Set up IP rotation through proxies to minimize traces
Stage 4: Documentation
- [ ] Create a Data Processing Record: what data, for what purpose, on what basis, how long you store it
- [ ] Prepare a Privacy Policy for your website
- [ ] If you use contractors (proxy provider, cloud storage), sign a DPA with each
- [ ] Create a data breach response plan
Stage 5: Handling data subject requests
- [ ] Create a public form for GDPR requests on your website
- [ ] Set up a request verification process
- [ ] Automate data deletion upon request
- [ ] Keep a log of all GDPR requests
- [ ] Respond to requests within 30 days
Stage 6: Monitoring and auditing
- [ ] Regularly check what data is actually being collected (new fields may appear)
- [ ] Conduct security audits of the data storage (quarterly or semi-annually)
- [ ] Train employees on GDPR requirements
- [ ] Stay updated on legislative and judicial developments
Proxy type recommendation:
For tasks requiring a high level of compliance and risk minimization, we recommend using residential or mobile proxies from reputable providers. They provide better anonymity and reduce the likelihood that your requests will be associated with mass scraping. Avoid cheap public proxies: they may be compromised and create additional legal risks.
Conclusion
GDPR compliance in web scraping is not an obstacle for business, but a set of rules that protect both you and users. Key principles: collect only necessary data, justify the legal basis, protect the collected information, and be ready to delete data upon request. Fines for violations can reach €20 million, but they can be avoided entirely by following the practices described in this article.
Using the right tools (proxies, encryption, automated deletion) reduces risks and simplifies compliance. Document every step: what data you collect, why, and how you store it. This will not only protect against fines but also increase trust among clients and partners.
If you plan large-scale web scraping involving the processing of personal data of EU citizens, we recommend consulting with a lawyer specializing in GDPR. Investments in compliance at the start of a project are much cheaper than fines and reputational losses due to violations.
For safe and anonymous web scraping, we recommend using residential proxies: they provide a high level of anonymity, minimize the risk of blocks, and help comply with data minimization principles. Choose providers with a transparent privacy policy and a willingness to sign a Data Processing Agreement.