Sentiment analysis helps marketers understand how customers feel about a brand, product, or service. However, quality analysis is impossible without properly collected data. In this guide, we will explore where and how to collect information for sentiment analysis, which tools to use, and how to avoid blocks during scraping.
Main Data Sources for Sentiment Analysis
Quality sentiment analysis requires diverse data sources. The more information you collect from different channels, the more accurate the picture of your brand's perception will be.
| Source | Data Type | Collection Difficulty | Value for Analysis |
|---|---|---|---|
| Social Media (VK, Telegram) | Comments, Posts, Mentions | Medium | High |
| Marketplaces (Wildberries, Ozon) | Customer Reviews, Ratings | High | Very High |
| Review Sites (Irecommend, Otzovik) | Detailed Reviews | Medium | High |
| News Portals | Articles, Comments | Low | Medium |
| Forums and Q&A Sites | Discussions, Questions | Medium | Medium |
| YouTube | Video Comments | Medium | High |
For most brands, marketplaces and social media are priority sources — this is where the majority of customer opinions are concentrated. Review sites provide more detailed feedback, but the volume of data is usually smaller there.
Data Collection from Social Media
Social media is a goldmine for sentiment analysis. People freely express their opinions about brands, share their experiences with products, and leave comments under promotional posts.
VKontakte
VK provides an API for collecting public data, but with limitations on the number of requests. For large-scale monitoring, scraping through the web interface will be necessary. The main types of data to collect include:
- Comments under posts from your brand or competitors
- Mentions of the brand in public posts and groups
- Reviews in thematic communities (for example, "Overheard"-style groups in your niche)
- Discussions in industry groups
An important point: VK actively fights against automated data collection. When scraping without proxies, you will quickly encounter a captcha or temporary block. For stable operation, use residential proxies with Russian IP addresses — they mimic regular users and rarely get blocked.
Telegram
Telegram has become an important channel for monitoring public opinion. Several approaches can be used here:
- Official Telegram API — allows you to collect messages from public channels and chats. Requires application registration and obtaining API keys.
- Parsing Libraries — for example, Telethon or Pyrogram for Python. They simplify working with the API and allow you to automate data collection.
- Monitoring Mentions — track where and how your brand is mentioned in public channels.
Telegram is less aggressive in blocking scraping than VK, but it is still advisable to use proxies for large-scale tasks — especially if you are monitoring hundreds of channels simultaneously.
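Once messages are collected (for example, via Telethon's `client.iter_messages` for a public channel), the next step is filtering them for brand mentions. The sketch below shows that filtering step with sample strings standing in for real messages; the brand name "AcmeShop" is a hypothetical placeholder.

```python
import re

def find_brand_mentions(messages, brand):
    """Return messages that mention the brand (case-insensitive, whole word)."""
    pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
    return [m for m in messages if pattern.search(m)]

# In a real pipeline these strings would come from Telethon's
# client.iter_messages("channel_name") — shown here as sample data.
sample = [
    "Just bought from AcmeShop, delivery was fast",
    "Anyone tried acmeshop support? Mixed feelings",
    "Unrelated chat message",
]
mentions = find_brand_mentions(sample, "AcmeShop")
print(len(mentions))  # 2
```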
YouTube
Comments under product review videos are a valuable source of detailed opinions. The YouTube Data API allows you to collect comments legally, but it has quotas on the number of requests. To bypass these, you can:
- Create several API keys and rotate them
- Use scraping through the web interface with proxies
- Combine both approaches for maximum efficiency
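The key-rotation idea can be sketched as follows. This is a minimal illustration, not the YouTube client library itself: `fake_fetch` simulates an API call, and in real code `fetch` would wrap a YouTube Data API request that raises when the daily quota is exhausted.

```python
from itertools import cycle

class QuotaExceeded(Exception):
    """Raised by the fetch callable when a key's daily quota runs out."""

def fetch_with_key_rotation(fetch, keys):
    """Try each API key in turn until one succeeds or all are exhausted."""
    pool = cycle(keys)
    for _ in range(len(keys)):
        key = next(pool)
        try:
            return fetch(key)
        except QuotaExceeded:
            continue  # this key is spent today, move on to the next one
    raise RuntimeError("All API keys have exhausted their quota")

# Simulated fetch: the first key is over quota, the second works.
def fake_fetch(key):
    if key == "KEY_A":
        raise QuotaExceeded()
    return {"comments": ["great video"], "key_used": key}

result = fetch_with_key_rotation(fake_fetch, ["KEY_A", "KEY_B"])
print(result["key_used"])  # KEY_B
```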
Scraping Reviews from Marketplaces and Review Sites
Reviews on marketplaces are the most structured and relevant data source for sentiment analysis in e-commerce. Here, customers leave ratings and detailed comments immediately after purchase.
Wildberries
Wildberries actively protects against scraping. When trying to collect reviews from one IP address, you will quickly get blocked. Typical signs of a bot that the platform tracks include:
- Requests that are too fast (more than 1-2 per second)
- The same User-Agent in all requests
- Lack of cookies and session history
- Requests from data center IPs (not residential addresses)
For successful scraping of Wildberries, it is necessary to:
- Use residential proxies — they have the IPs of regular users and do not raise suspicion. For scraping a Russian marketplace, Russian IPs are needed.
- Set up proxy rotation — change IPs after every 20-30 requests or every 5-10 minutes.
- Add delays — pause for 2-5 seconds between requests, mimicking human behavior.
- Rotate User-Agent — use different browsers and versions for each request.
- Maintain cookies — keep the session for each proxy address.
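The rotation and delay rules above can be combined into a small helper. This is a sketch under stated assumptions: the proxy URLs and User-Agent strings are placeholders, cookie handling is left out, and only the rotation/pacing logic is shown.

```python
import random
import time
from itertools import cycle

# Placeholder values — substitute your own proxy list and a larger,
# current User-Agent pool in real use.
PROXIES = ["http://user:pass@203.0.113.10:8000",
           "http://user:pass@203.0.113.11:8000"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
]

class PoliteSession:
    """Rotate the proxy every `rotate_every` requests, pick a random
    User-Agent per request, and pause between requests."""

    def __init__(self, proxies, rotate_every=25, delay=(2, 5)):
        self.pool = cycle(proxies)
        self.rotate_every = rotate_every
        self.delay = delay
        self.count = 0
        self.current = next(self.pool)

    def next_request(self):
        if self.count and self.count % self.rotate_every == 0:
            self.current = next(self.pool)  # switch to the next IP
        self.count += 1
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        time.sleep(random.uniform(*self.delay))  # mimic human pacing
        return self.current, headers

session = PoliteSession(PROXIES)
proxy, headers = session.next_request()
```

Before each real HTTP request, call `next_request()` and pass the returned proxy and headers to your HTTP client.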
Tip: For scraping marketplaces, it's better to use ready-made tools with built-in protection against blocks than to write your own scripts. This saves time and reduces the risk of bans.
Ozon
Ozon uses similar protection mechanisms, but applies them less aggressively than Wildberries. Key features of scraping include:
- Reviews are loaded dynamically via AJAX requests — you need to analyze network traffic
- There is pagination — one product can have hundreds of reviews across dozens of pages
- Reviews contain ratings by parameters (quality, compliance with description, etc.) — valuable structured information
Yandex.Market
Yandex.Market has a strict bot protection system. Here, the use of residential proxies is essential, as data center IPs are blocked almost instantly. Reviews on the Market are especially valuable, as they often contain detailed descriptions of product usage experiences.
Review Sites (Irecommend, Otzovik)
Specialized review platforms provide the most detailed opinions — users write entire articles about their experiences. Scraping here is usually easier than on marketplaces, but still requires proxies for large-scale data collection.
Monitoring News Sites and Forums
News portals and forums provide insight into public opinion about your industry and brand in a broader context.
News Sites
For monitoring news, use:
- RSS Feeds — many news sites provide RSS with the latest publications. This is a legal and convenient way to collect data.
- Google News API — allows you to search for mentions of your brand in news worldwide.
- Scraping Comments — discussions with valuable insights often unfold under news articles.
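RSS feeds are simple enough to parse with the standard library alone. The sketch below uses `xml.etree.ElementTree` on a small inline feed; in practice the XML would be downloaded from a news site's feed URL (for example with `urllib.request`).

```python
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml):
    """Extract (title, link) pairs from an RSS 2.0 feed string."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items

# A small inline feed used for illustration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Brand X opens new store</title>
        <link>https://example.com/a</link></item>
  <item><title>Market roundup</title>
        <link>https://example.com/b</link></item>
</channel></rss>"""

for title, link in parse_rss_items(SAMPLE_FEED):
    print(title, "->", link)
```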
Forums and Communities
Thematic forums (e.g., automotive, technical, women's) contain expert opinions and detailed discussions. Scraping forums is usually technically easier but requires more time for post-processing due to the unstructured format.
Tools for Automating Data Collection
The choice of tool depends on your technical skills, budget, and the scale of the task.
Ready-Made Monitoring Services (No Code)
| Service | Data Sources | Features |
|---|---|---|
| Brand Analytics | Social Media, News, Forums | Built-in sentiment analysis, expensive |
| IQBuzz | Social Media, Media | Good for the Russian market |
| Babkee | Reviews from Marketplaces | Specialization in e-commerce |
| Popsters | Social Media | Competitor content analytics |
Ready-made services are convenient but expensive and do not provide full control over the data. For specific tasks or large volumes, it's more profitable to set up your own collection system.
Tools for Self-Scraping
If you are ready to delve into technical details, here are popular tools:
- Octoparse — a visual parser without code. You set up data collection through the interface by clicking on page elements. Supports proxies and task scheduling.
- ParseHub — similar to Octoparse, works well with dynamic JavaScript sites.
- Scrapy (Python) — a powerful framework for writing your own parsers. Requires programming skills but offers maximum flexibility.
- Beautiful Soup + Requests (Python) — a simple combination for scraping static sites.
- Selenium / Puppeteer — tools for controlling the browser. Needed for sites with bot protection and complex JavaScript logic.
Specialized APIs for Social Media
Many platforms provide official APIs:
- VK API — allows you to get public posts, comments, information about communities
- Telegram API — access to messages from public channels and chats
- YouTube Data API — collecting comments, information about videos and channels
APIs are convenient because they are legal and structured, but they have limitations on the number of requests and do not always provide access to all necessary data.
Why Proxies are Necessary for Scraping
Scraping without proxies is like trying to discreetly photograph hundreds of people from one spot. You will quickly be noticed and asked to leave. Proxies solve several critical problems:
Bypassing Rate Limiting
Most websites limit the number of requests from one IP address. For example, Wildberries may block an IP after 50-100 requests per hour. With proxies, you distribute the load across dozens or hundreds of IP addresses, bypassing these limits.
Avoiding Blocks
Websites use complex algorithms to detect bots. If all your requests come from one IP, it is a clear sign of automation. Proxies simulate requests from different users in various locations.
Accessing Geo-Specific Content
Some reviews and comments may only be shown to users from certain regions. For example, on marketplaces, prices and reviews may differ for Moscow and other regions. Proxies from the required cities provide access to the complete picture.
Which Type of Proxy to Choose
| Proxy Type | Pros | Cons | When to Use |
|---|---|---|---|
| Residential | Real user IPs, minimal ban risk | More expensive than other types | Marketplaces, social media with strong protection |
| Mobile | Mobile operator IPs, hardly ever get banned | Most expensive, fewer IPs in the pool | Instagram, TikTok, mobile applications |
| Data Center | Fast, cheap | Easily identified as proxies, often blocked | Simple sites without protection, news portals |
For sentiment analysis, the optimal choice is residential proxies. They provide a balance between cost and reliability. When scraping Russian marketplaces and social media, choose proxies with Russian IP addresses.
Setting Up a Data Collection System: Step-by-Step Guide
Let's go through setting up a data collection system using the example of scraping reviews from Wildberries with Octoparse and residential proxies.
Step 1: Preparing Proxies
- Purchase residential proxies with Russian IPs (at least 10-20 addresses for stable operation)
- Obtain a list of proxies in the format: `IP:PORT:USERNAME:PASSWORD`
- Check the functionality of each proxy through online checking services
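If you later feed the same list to your own scripts rather than Octoparse, each line in that format needs to be converted into a proxy URL. A minimal sketch (the addresses and credentials are made-up examples):

```python
def proxy_to_url(line, scheme="http"):
    """Convert an IP:PORT:USERNAME:PASSWORD line into a proxy URL
    accepted by most HTTP clients."""
    ip, port, user, password = line.strip().split(":")
    return f"{scheme}://{user}:{password}@{ip}:{port}"

# Example lines with placeholder credentials.
raw = ["203.0.113.10:8000:alice:s3cret",
       "203.0.113.11:8000:alice:s3cret"]
proxies = [proxy_to_url(p) for p in raw]
print(proxies[0])  # http://alice:s3cret@203.0.113.10:8000
```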
Step 2: Setting Up Octoparse
- Download and install Octoparse from the official website
- Create a new scraping task: enter the product page URL on Wildberries
- Go to the reviews section on the product page
- In the visual editor of Octoparse, highlight the elements to be collected:
- Review text
- Rating (number of stars)
- Publication date
- Author's name
- Pros and cons (if any)
- Set up pagination to collect reviews from all pages
Step 3: Connecting Proxies in Octoparse
- Open task settings → "Proxy" section
- Select "Rotate proxy" mode
- Import your list of proxies
- Set the rotation interval: every 20-30 requests or every 5 minutes
- Check the operation of the proxies through the built-in tester
Step 4: Setting Up Scraping Parameters
- Set a delay between requests: 3-5 seconds (to mimic human behavior)
- Enable User-Agent rotation for additional masking
- Set up error handling: when an IP is blocked, automatically switch to the next proxy
- Set limits: a maximum of 50-100 reviews from one IP before rotation
Step 5: Launching and Monitoring
- Run the task in test mode on 10-20 reviews
- Check the quality of the collected data: make sure all fields are filled in correctly
- If everything works — launch full-scale collection
- Monitor the process: keep an eye on the number of errors and blocks
- Set up automatic data export to CSV or a database
Important: Always make the first run on a small scale. This will help identify configuration issues before you exhaust all your proxy traffic or receive mass blocks.
Step 6: Post-Processing Data
After collecting data, it is necessary to clean and prepare it for analysis:
- Remove duplicate reviews
- Clean the text from HTML tags and special characters
- Normalize dates to a single format
- Check for empty fields
- Export to a format suitable for your analysis system (CSV, JSON, database)
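The cleaning steps above can be sketched with the standard library. This is an illustrative pipeline, not a complete one: the date formats and the deduplication key (author plus cleaned text) are assumptions you should adapt to your data.

```python
import html
import re
from datetime import datetime

def clean_review(raw):
    """Strip HTML tags, unescape entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def normalize_date(value, formats=("%d.%m.%Y", "%Y-%m-%d")):
    """Bring dates in the given formats to ISO 8601 (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable — flag for manual review

def deduplicate(reviews):
    """Drop exact duplicates by (author, cleaned text)."""
    seen, unique = set(), []
    for r in reviews:
        key = (r["author"], r["text"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Two copies of the same review, one with HTML markup.
raw = [
    {"author": "anna", "text": clean_review("<b>Great&nbsp;quality!</b>"),
     "date": normalize_date("05.03.2024")},
    {"author": "anna", "text": clean_review("Great quality!"),
     "date": normalize_date("2024-03-05")},
]
print(len(deduplicate(raw)))  # 1
```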
Best Practices and Common Mistakes
What to Do (Best Practices)
- Start Small — first set up collection from one source, debug the process, then scale to other platforms.
- Collect Metadata — save not only the review text but also the date, author, rating, number of likes. This is important for in-depth analysis.
- Regularly Update Data — sentiment changes over time. Set up automatic collection of new reviews daily or weekly.
- Make Backups — save raw data before processing. If the analysis algorithm changes, you can reprocess old data.
- Document the Process — record parser settings, data sources, collection periods. This will help with analysis and scaling.
- Monitor Quality — regularly check a random sample of collected data for correctness.
What to Avoid (Common Mistakes)
- Scraping Without Proxies — a quick way to get your IP blocked. Even for small volumes, use at least a few proxies.
- Too Aggressive Scraping — requests every second will raise suspicion. Add random delays of 2-5 seconds.
- Using Data Center Proxies for Social Media — Instagram, Facebook, VK easily identify and block them. For social media, only residential or mobile proxies.
- Ignoring robots.txt — while this is not a legal requirement, gross violations can lead to server-level IP bans.
- Collecting Personal Data — do not collect emails, phone numbers, or other private information. This violates data protection laws.
- Lack of Error Handling — the parser should correctly handle 404 errors, timeouts, and changes in page structure.
- Insufficient Proxy Rotation — if you use one proxy for too long, it will get blocked. Change IPs every 20-50 requests.
Performance Optimization
For collecting large volumes of data (thousands of reviews per day):
- Parallelization — run multiple scraping threads simultaneously, each with its own proxy
- Task Queues — use systems like Celery (for Python) to manage scraping tasks
- Caching — save already scraped pages to avoid scraping them again
- Incremental Collection — only collect new reviews since the last run, not everything from scratch
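Incremental collection can be sketched by keeping a set of already-seen review IDs between runs. The state file name and the review-dict shape here are illustrative assumptions.

```python
import json
from pathlib import Path

def filter_new_reviews(fetched, seen_ids):
    """Return reviews not seen before, plus the updated ID set."""
    new = [r for r in fetched if r["id"] not in seen_ids]
    updated = seen_ids | {r["id"] for r in fetched}
    return new, updated

def save_state(seen_ids, path):
    Path(path).write_text(json.dumps(sorted(seen_ids)))

def load_state(path):
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()

# First run: everything is new; second run: only review 3 is new.
seen = set()
batch1 = [{"id": 1, "text": "ok"}, {"id": 2, "text": "bad"}]
new1, seen = filter_new_reviews(batch1, seen)
batch2 = [{"id": 2, "text": "bad"}, {"id": 3, "text": "great"}]
new2, seen = filter_new_reviews(batch2, seen)
print([r["id"] for r in new2])  # [3]
```

Between runs, persist `seen` with `save_state` and restore it with `load_state` so each scheduled job only processes fresh reviews.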
Legal Aspects
Scraping is in a gray area of legislation. To minimize risks:
- Collect only publicly available data (without authorization)
- Do not resell collected data
- Use data only for internal analysis and product improvement
- Remove personal data (names, photos) before analysis
- Maintain a reasonable load on the servers of the sites
Conclusion
Collecting data for sentiment analysis is the foundation for understanding customer attitudes towards your brand. A properly set up collection system provides a continuous flow of relevant information from social media, marketplaces, and other sources.
Key takeaways from this guide:
- Use diverse data sources — social media, marketplaces, review sites, forums
- Choose tools according to your level: ready-made services for quick start, custom parsers for flexibility
- Residential proxies are a must for stable scraping of protected platforms
- Set up the system gradually: start with one source, then scale
- Automate regular data collection to track sentiment dynamics
Start with scraping one or two sources that are most important for your business. Debug the process, set up automation, and only then add new platforms. Data quality is more important than quantity — it's better to have 1000 accurate and relevant reviews than 10,000 with junk and duplicates.
If you plan to collect data from Russian marketplaces or social media, we recommend using residential proxies with Russian IPs — they provide stable operation without blocks and give access to geo-specific content. For scraping mobile applications and platforms like Instagram, mobile proxies are suitable, which are practically indistinguishable from regular users.