Sentiment analysis helps marketers understand how customers feel about a brand, product, or service. However, quality analysis is impossible without properly collected data. In this guide, we will explore where and how to collect information for sentiment analysis, which tools to use, and how to avoid blocks during scraping.
Main Data Sources for Sentiment Analysis
Quality sentiment analysis requires diverse data sources. The more information you collect from different channels, the more accurate the picture of your brand's perception will be.
| Source | Data Type | Collection Difficulty | Value for Analysis |
|---|---|---|---|
| Social Media (VK, Telegram) | Comments, Posts, Mentions | Medium | High |
| Marketplaces (Wildberries, Ozon) | Customer Reviews, Ratings | High | Very High |
| Review Sites (Irecommend, Otzovik) | Detailed Reviews | Medium | High |
| News Portals | Articles, Comments | Low | Medium |
| Forums and Q&A Sites | Discussions, Questions | Medium | Medium |
| YouTube | Video Comments | Medium | High |
For most brands, marketplaces and social media are priority sources — this is where the majority of customer opinions are concentrated. Review sites provide more detailed feedback, but the volume of data is usually smaller there.
Data Collection from Social Media
Social media is a goldmine for sentiment analysis. People freely express their opinions about brands, share their experiences with products, and leave comments under promotional posts.
VKontakte
VK provides an API for collecting public data, but with limitations on the number of requests. For large-scale monitoring, scraping through the web interface will be necessary. The main types of data to collect include:
- Comments under posts from your brand or competitors
- Mentions of the brand in public posts and groups
- Reviews in thematic communities (for example, "Overheard"-style groups in your niche)
- Discussions in industry groups
An important point: VK actively fights against automated data collection. When scraping without proxies, you will quickly encounter a captcha or temporary block. For stable operation, use residential proxies with Russian IP addresses — they mimic regular users and rarely get blocked.
Telegram
Telegram has become an important channel for monitoring public opinion. Several approaches can be used here:
- Official Telegram API — allows you to collect messages from public channels and chats. Requires application registration and obtaining API keys.
- Parsing Libraries — for example, Telethon or Pyrogram for Python. They simplify working with the API and allow you to automate data collection.
- Monitoring Mentions — track where and how your brand is mentioned in public channels.
Telegram is less aggressive in blocking scraping than VK, but it is still advisable to use proxies for large-scale tasks — especially if you are monitoring hundreds of channels simultaneously.
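Once messages are collected (for example, via Telethon's `client.iter_messages` for a public channel), the next step is filtering them for brand mentions. The sketch below shows that filtering step with sample strings standing in for real messages; the brand name "AcmeShop" is a hypothetical placeholder.

```python
import re

def find_brand_mentions(messages, brand):
    """Return messages that mention the brand (case-insensitive, whole word)."""
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    return [m for m in messages if pattern.search(m)]

# In a real pipeline these strings would come from Telethon's
# client.iter_messages("channel_name") — shown here as sample data.
sample = [
    "Just bought from AcmeShop, delivery was fast",
    "Anyone tried acmeshop support? Mixed feelings",
    "Unrelated chat message",
]
mentions = find_brand_mentions(sample, "AcmeShop")
print(len(mentions))  # 2
```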
YouTube
Comments under product review videos are a valuable source of detailed opinions. The YouTube Data API allows you to collect comments legally, but it has quotas on the number of requests. To bypass these, you can:
- Create several API keys and rotate them
- Use scraping through the web interface with proxies
- Combine both approaches for maximum efficiency
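The key-rotation idea can be sketched as follows. This is a minimal illustration, not the YouTube client library itself: `fake_fetch` simulates an API call, and in real code `fetch` would wrap a YouTube Data API request that raises when the daily quota is exhausted.

```python
from itertools import cycle

class QuotaExceeded(Exception):
    """Raised by the fetch callable when a key's daily quota runs out."""

def fetch_with_key_rotation(fetch, keys):
    """Try each API key in turn until one succeeds or all are exhausted."""
    pool = cycle(keys)
    for _ in range(len(keys)):
        key = next(pool)
        try:
            return fetch(key)
        except QuotaExceeded:
            continue  # this key is spent today, move on to the next one
    raise RuntimeError("All API keys have exhausted their quota")

# Simulated fetch: the first key is over quota, the second works.
def fake_fetch(key):
    if key == "KEY_A":
        raise QuotaExceeded()
    return {"comments": ["great video"], "key_used": key}

result = fetch_with_key_rotation(fake_fetch, ["KEY_A", "KEY_B"])
print(result["key_used"])  # KEY_B
```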
Scraping Reviews from Marketplaces and Review Sites
Reviews on marketplaces are the most structured and relevant data source for sentiment analysis in e-commerce. Here, customers leave ratings and detailed comments immediately after purchase.
Wildberries
Wildberries actively protects against scraping. When trying to collect reviews from one IP address, you will quickly get blocked. Typical signs of a bot that the platform tracks include:
- Requests that are too fast (more than 1-2 per second)
- The same User-Agent in all requests
- Lack of cookies and session history
- Requests from data center IPs (not residential addresses)
For successful scraping of Wildberries, it is necessary to:
- Use residential proxies — they have the IPs of regular users and do not raise suspicion. For scraping a Russian marketplace, Russian IPs are needed.
- Set up proxy rotation — change IPs after every 20-30 requests or every 5-10 minutes.
- Add delays — pause for 2-5 seconds between requests, mimicking human behavior.
- Rotate User-Agent — use different browsers and versions for each request.
- Maintain cookies — keep the session for each proxy address.
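The rotation and delay rules above can be combined into a small helper. This is a sketch under stated assumptions: the proxy URLs and User-Agent strings are placeholders, cookie handling is left out, and only the rotation/pacing logic is shown.

```python
import random
import time
from itertools import cycle

# Placeholder values — substitute your own proxy list and a larger,
# current User-Agent pool in real use.
PROXIES = ["http://user:pass@203.0.113.10:8000",
           "http://user:pass@203.0.113.11:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]

class PoliteSession:
    """Rotate the proxy every `rotate_every` requests, pick a random
    User-Agent per request, and pause between requests."""

    def __init__(self, proxies, rotate_every=25, delay=(2, 5)):
        self.pool = cycle(proxies)
        self.rotate_every = rotate_every
        self.delay = delay
        self.count = 0
        self.current = next(self.pool)

    def next_request(self):
        if self.count and self.count % self.rotate_every == 0:
            self.current = next(self.pool)  # switch to the next IP
        self.count += 1
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        time.sleep(random.uniform(*self.delay))  # mimic human pacing
        return self.current, headers

session = PoliteSession(PROXIES)
proxy, headers = session.next_request()
```

Before each real HTTP request, call `next_request()` and pass the returned proxy and headers to your HTTP client.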
Tip: For scraping marketplaces, it's better to use ready-made tools with built-in protection against blocks than to write your own scripts. This saves time and reduces the risk of bans.
Ozon
Ozon uses similar protection mechanisms, but applies them less aggressively than Wildberries. Key features of scraping include:
- Reviews are loaded dynamically via AJAX requests — you need to analyze network traffic
- There is pagination — one product can have hundreds of reviews across dozens of pages
- Reviews contain ratings by parameters (quality, compliance with description, etc.) — valuable structured information
Yandex.Market
Yandex.Market has a strict bot protection system. Here, the use of residential proxies is essential, as data center IPs are blocked almost instantly. Reviews on the Market are especially valuable, as they often contain detailed descriptions of product usage experiences.
Review Sites (Irecommend, Otzovik)
Specialized review platforms provide the most detailed opinions — users write entire articles about their experiences. Scraping here is usually easier than on marketplaces, but still requires proxies for large-scale data collection.
Monitoring News Sites and Forums
News portals and forums provide insight into public opinion about your industry and brand in a broader context.
News Sites
For monitoring news, use:
- RSS Feeds — many news sites provide RSS with the latest publications. This is a legal and convenient way to collect data.
- Google News API — allows you to search for mentions of your brand in news worldwide.
- Scraping Comments — discussions with valuable insights often unfold under news articles.
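RSS feeds are simple enough to parse with the standard library alone. The sketch below uses `xml.etree.ElementTree` on a small inline feed; in practice the XML would be downloaded from a news site's feed URL (for example with `urllib.request`).

```python
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml):
    """Extract (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items

# A small inline feed used for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Brand X opens new store</title>
        <link>https://example.com/a</link></item>
  <item><title>Market roundup</title>
        <link>https://example.com/b</link></item>
</channel></rss>"""

for title, link in parse_rss_items(SAMPLE_FEED):
    print(title, "->", link)
```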
Forums and Communities
Thematic forums (e.g., automotive, technical, women's) contain expert opinions and detailed discussions. Scraping forums is usually technically easier but requires more time for post-processing due to the unstructured format.
Tools for Automating Data Collection
The choice of tool depends on your technical skills, budget, and the scale of the task.
Ready-Made Monitoring Services (No Code)
| Service | Data Sources | Features |
|---|---|---|
| Brand Analytics | Social Media, News, Forums | Built-in sentiment analysis, expensive |
| IQBuzz | Social Media, Media | Good for the Russian market |
| Babkee | Reviews from Marketplaces | Specialization in e-commerce |
| Popsters | Social Media | Competitor content analytics |
Ready-made services are convenient but expensive and do not provide full control over the data. For specific tasks or large volumes, it's more profitable to set up your own collection system.
Tools for Self-Scraping
If you are ready to delve into technical details, here are popular tools:
- Octoparse — a visual parser without code. You set up data collection through the interface by clicking on page elements. Supports proxies and task scheduling.
- ParseHub — similar to Octoparse, works well with dynamic JavaScript sites.
- Scrapy (Python) — a powerful framework for writing your own parsers. Requires programming skills but offers maximum flexibility.
- Beautiful Soup + Requests (Python) — a simple combination for scraping static sites.
- Selenium / Puppeteer — tools for controlling the browser. Needed for sites with bot protection and complex JavaScript logic.
Specialized APIs for Social Media
Many platforms provide official APIs:
- VK API — allows you to get public posts, comments, information about communities
- Telegram API — access to messages from public channels and chats
- YouTube Data API — collecting comments, information about videos and channels
APIs are convenient because they are legal and structured, but they have limitations on the number of requests and do not always provide access to all necessary data.
Why Proxies are Necessary for Scraping
Scraping without proxies is like trying to discreetly photograph hundreds of people from one spot. You will quickly be noticed and asked to leave. Proxies solve several critical problems:
Bypassing Rate Limiting
Most websites limit the number of requests from one IP address. For example, Wildberries may block an IP after 50-100 requests per hour. With proxies, you distribute the load across dozens or hundreds of IP addresses, bypassing these limits.
Avoiding Blocks
Websites use complex algorithms to detect bots. If all your requests come from one IP, it is a clear sign of automation. Proxies simulate requests from different users in various locations.
Accessing Geo-Specific Content
Some reviews and comments may only be shown to users from certain regions. For example, on marketplaces, prices and reviews may differ for Moscow and other regions. Proxies from the required cities provide access to the complete picture.
Which Type of Proxy to Choose
| Proxy Type | Pros | Cons | When to Use |
|---|---|---|---|
| Residential | Real user IPs, minimal ban risk | More expensive than other types | Marketplaces, social media with strong protection |
| Mobile | Mobile operator IPs, hardly ever get banned | Most expensive, fewer IPs in the pool | Instagram, TikTok, mobile applications |
| Data Center | Fast, cheap | Easily identified as proxies, often blocked | Simple sites without protection, news portals |
For sentiment analysis, the optimal choice is residential proxies. They provide a balance between cost and reliability. When scraping Russian marketplaces and social media, choose proxies with Russian IP addresses.
Setting Up a Data Collection System: Step-by-Step Guide
Let's go through setting up a data collection system using the example of scraping reviews from Wildberries with Octoparse and residential proxies.
Step 1: Preparing Proxies
- Purchase residential proxies with Russian IPs (at least 10-20 addresses for stable operation)
- Obtain a list of proxies in the format: `IP:PORT:USERNAME:PASSWORD`
- Check the functionality of each proxy through online checking services
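If you later feed the same list to your own scripts rather than Octoparse, each line in that format needs to be converted into a proxy URL. A minimal sketch (the addresses and credentials are made-up examples):

```python
def proxy_to_url(line, scheme="http"):
    """Convert an IP:PORT:USERNAME:PASSWORD line into a proxy URL
    accepted by most HTTP clients."""
    ip, port, user, password = line.strip().split(":")
    return f"{scheme}://{user}:{password}@{ip}:{port}"

# Example lines with placeholder credentials.
raw = ["203.0.113.10:8000:alice:s3cret",
       "203.0.113.11:8000:alice:s3cret"]
proxies = [proxy_to_url(p) for p in raw]
print(proxies[0])  # http://alice:s3cret@203.0.113.10:8000
```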
Step 2: Setting Up Octoparse
- Download and install Octoparse from the official website
- Create a new scraping task: enter the product page URL on Wildberries
- Go to the reviews section on the product page
- In the visual editor of Octoparse, highlight the elements to be collected:
- Review text
- Rating (number of stars)
- Publication date
- Author's name
- Pros and cons (if any)
- Set up pagination to collect reviews from all pages
Step 3: Connecting Proxies in Octoparse
- Open task settings → "Proxy" section
- Select "Rotate proxy" mode
- Import your list of proxies
- Set the rotation interval: every 20-30 requests or every 5 minutes
- Check the operation of the proxies through the built-in tester
Step 4: Setting Up Scraping Parameters
- Set a delay between requests: 3-5 seconds (to mimic human behavior)
- Enable User-Agent rotation for additional masking
- Set up error handling: when an IP is blocked, automatically switch to the next proxy
- Set limits: a maximum of 50-100 reviews from one IP before rotation
Step 5: Launching and Monitoring
- Run the task in test mode on 10-20 reviews
- Check the quality of the collected data: make sure all fields are filled in correctly
- If everything works — launch full-scale collection
- Monitor the process: keep an eye on the number of errors and blocks
- Set up automatic data export to CSV or a database
Important: Always make the first run on a small scale. This will help identify configuration issues before you exhaust all your proxy traffic or receive mass blocks.
Step 6: Post-Processing Data
After collecting data, it is necessary to clean and prepare it for analysis:
- Remove duplicate reviews
- Clean the text from HTML tags and special characters
- Normalize dates to a single format
- Check for empty fields
- Export to a format suitable for your analysis system (CSV, JSON, database)
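The cleaning steps above can be sketched with the standard library. This is an illustrative pipeline, not a complete one: the date formats and the deduplication key (author plus cleaned text) are assumptions you should adapt to your data.

```python
import html
import re
from datetime import datetime

def clean_review(raw):
    """Strip HTML tags, unescape entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def normalize_date(value, formats=("%d.%m.%Y", "%Y-%m-%d")):
    """Bring dates in the given formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable — flag for manual review

def deduplicate(reviews):
    """Drop exact duplicates by (author, cleaned text)."""
    seen, unique = set(), []
    for r in reviews:
        key = (r["author"], r["text"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Two copies of the same review, one with HTML markup.
raw = [
    {"author": "anna", "text": clean_review("<b>Great&nbsp;quality!</b>"),
     "date": normalize_date("05.03.2024")},
    {"author": "anna", "text": clean_review("Great quality!"),
     "date": normalize_date("2024-03-05")},
]
print(len(deduplicate(raw)))  # 1
```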
Best Practices and Common Mistakes
What to Do (Best Practices)
- Start Small — first set up collection from one source, debug the process, then scale to other platforms.
- Collect Metadata — save not only the review text but also the date, author, rating, number of likes. This is important for in-depth analysis.
- Regularly Update Data — sentiment changes over time. Set up automatic collection of new reviews daily or weekly.
- Make Backups — save raw data before processing. If the analysis algorithm changes, you can reprocess old data.
- Document the Process — record parser settings, data sources, collection periods. This will help with analysis and scaling.
- Monitor Quality — regularly check a random sample of collected data for correctness.
What to Avoid (Common Mistakes)
- Scraping Without Proxies — a quick way to get your IP blocked. Even for small volumes, use at least a few proxies.
- Too Aggressive Scraping — requests every second will raise suspicion. Add random delays of 2-5 seconds.
- Using Data Center Proxies for Social Media — Instagram, Facebook, VK easily identify and block them. For social media, only residential or mobile proxies.
- Ignoring robots.txt — while this is not a legal requirement, gross violations can lead to server-level IP bans.
- Collecting Personal Data — do not collect emails, phone numbers, or other private information. This violates data protection laws.
- Lack of Error Handling — the parser should correctly handle 404 errors, timeouts, and changes in page structure.
- Insufficient Proxy Rotation — if you use one proxy for too long, it will get blocked. Change IPs every 20-50 requests.
Performance Optimization
For collecting large volumes of data (thousands of reviews per day):
- Parallelization — run multiple scraping threads simultaneously, each with its own proxy
- Task Queues — use systems like Celery (for Python) to manage scraping tasks
- Caching — save already scraped pages to avoid scraping them again
- Incremental Collection — only collect new reviews since the last run, not everything from scratch
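Incremental collection can be sketched by keeping a set of already-seen review IDs between runs. The state file name and the review-dict shape here are illustrative assumptions.

```python
import json
from pathlib import Path

def filter_new_reviews(fetched, seen_ids):
    """Return reviews not seen before, plus the updated ID set."""
    new = [r for r in fetched if r["id"] not in seen_ids]
    updated = seen_ids | {r["id"] for r in fetched}
    return new, updated

def save_state(seen_ids, path):
    Path(path).write_text(json.dumps(sorted(seen_ids)))

def load_state(path):
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

# First run: everything is new; second run: only review 3 is new.
seen = set()
batch1 = [{"id": 1, "text": "ok"}, {"id": 2, "text": "bad"}]
new1, seen = filter_new_reviews(batch1, seen)
batch2 = [{"id": 2, "text": "bad"}, {"id": 3, "text": "great"}]
new2, seen = filter_new_reviews(batch2, seen)
print([r["id"] for r in new2])  # [3]
```

Between runs, persist `seen` with `save_state` and restore it with `load_state` so each scheduled job only processes fresh reviews.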
Legal Aspects
Scraping is in a gray area of legislation. To minimize risks:
- Collect only publicly available data (without authorization)
- Do not resell collected data
- Use data only for internal analysis and product improvement
- Remove personal data (names, photos) before analysis
- Maintain a reasonable load on the servers of the sites
Conclusion
Collecting data for sentiment analysis is the foundation for understanding customer attitudes towards your brand. A properly set up collection system provides a continuous flow of relevant information from social media, marketplaces, and other sources.
Key takeaways from this guide:
- Use diverse data sources — social media, marketplaces, review sites, forums
- Choose tools according to your level: ready-made services for quick start, custom parsers for flexibility
- Residential proxies are a must for stable scraping of protected platforms
- Set up the system gradually: start with one source, then scale
- Automate regular data collection to track sentiment dynamics
Start with scraping one or two sources that are most important for your business. Debug the process, set up automation, and only then add new platforms. Data quality is more important than quantity — it's better to have 1000 accurate and relevant reviews than 10,000 with junk and duplicates.
If you plan to collect data from Russian marketplaces or social media, we recommend using residential proxies with Russian IPs — they provide stable operation without blocks and give access to geo-specific content. For scraping mobile applications and platforms like Instagram, mobile proxies are suitable, which are practically indistinguishable from regular users.