How to Collect Data for Sentiment Analysis from Social Media and Review Sites: Tools and Methods

A complete guide to data collection for sentiment analysis: which sources to use, how to scrape social media and review sites without getting blocked, and which proxies to choose for stable operation.

March 9, 2026

Sentiment analysis helps marketers understand how customers feel about a brand, product, or service. However, quality analysis is impossible without properly collected data. In this guide, we will explore where and how to collect information for sentiment analysis, which tools to use, and how to avoid blocks during scraping.

Main Data Sources for Sentiment Analysis

For quality sentiment analysis, diverse data sources are needed. The more information you collect from different channels, the more accurate the picture of your brand's perception will be.

| Source | Data Type | Collection Difficulty | Value for Analysis |
| --- | --- | --- | --- |
| Social media (VK, Telegram) | Comments, posts, mentions | Medium | High |
| Marketplaces (Wildberries, Ozon) | Customer reviews, ratings | High | Very high |
| Review sites (Irecommend, Otzovik) | Detailed reviews | Medium | High |
| News portals | Articles, comments | Low | Medium |
| Forums and Q&A sites | Discussions, questions | Medium | Medium |
| YouTube | Video comments | Medium | High |

For most brands, marketplaces and social media are priority sources — this is where the majority of customer opinions are concentrated. Review sites provide more detailed feedback, but the volume of data is usually smaller there.

Data Collection from Social Media

Social media is a goldmine for sentiment analysis. People freely express their opinions about brands, share their experiences with products, and leave comments under promotional posts.

VKontakte

VK provides an API for collecting public data, but with limitations on the number of requests. For large-scale monitoring, scraping through the web interface will be necessary. The main types of data to collect include:

  • Comments under posts from your brand or competitors
  • Mentions of the brand in public posts and groups
  • Reviews in thematic communities (for example, "Overheard"-style groups for your niche)
  • Discussions in industry groups

An important point: VK actively fights against automated data collection. When scraping without proxies, you will quickly encounter a captcha or temporary block. For stable operation, use residential proxies with Russian IP addresses — they mimic regular users and rarely get blocked.

Telegram

Telegram has become an important channel for monitoring public opinion. Several approaches can be used here:

  • Official Telegram API — allows you to collect messages from public channels and chats. Requires application registration and obtaining API keys.
  • Parsing Libraries — for example, Telethon or Pyrogram for Python. They simplify working with the API and allow you to automate data collection.
  • Monitoring Mentions — track where and how your brand is mentioned in public channels.

Telegram is less aggressive in blocking scraping than VK, but it is still advisable to use proxies for large-scale tasks — especially if you are monitoring hundreds of channels simultaneously.
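
The Telethon approach mentioned above can be sketched as follows. This is a minimal example, assuming you have already registered an application and obtained your `api_id` and `api_hash`; the channel name and session name are placeholders:

```python
import asyncio

CHANNELS = ["@brand_mentions_channel"]  # placeholder public channel names


def normalize_channel(name):
    """Ensure a channel reference has the leading @."""
    return name if name.startswith("@") else "@" + name


async def collect_messages(api_id, api_hash, limit=100):
    """Collect recent text messages from the configured public channels."""
    # Telethon is imported lazily so the pure helpers above stay importable
    from telethon import TelegramClient

    rows = []
    async with TelegramClient("sentiment_session", api_id, api_hash) as client:
        for channel in CHANNELS:
            async for msg in client.iter_messages(normalize_channel(channel), limit=limit):
                if msg.text:  # skip media-only posts
                    rows.append({"channel": channel, "date": msg.date, "text": msg.text})
    return rows

# Usage: asyncio.run(collect_messages(API_ID, API_HASH))
```

On the first run, Telethon will prompt for a login code; after that, the saved session file lets the script run unattended.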

YouTube

Comments under product review videos are a valuable source of detailed opinions. The YouTube Data API allows you to collect comments legally, but it has quotas on the number of requests. To bypass these, you can:

  • Create several API keys and rotate them
  • Use scraping through the web interface with proxies
  • Combine both approaches for maximum efficiency
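
Key rotation for the YouTube Data API can be sketched like this: each request to the `commentThreads` endpoint takes the next key from the pool, spreading quota usage across projects. The key values below are placeholders:

```python
import itertools

# Placeholder keys; each comes from a separate Google Cloud project
API_KEYS = ["KEY_1", "KEY_2"]
_key_cycle = itertools.cycle(API_KEYS)

COMMENTS_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def comment_request_params(video_id, page_token=None):
    """Build query parameters for one commentThreads.list call, rotating keys."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": 100,        # API maximum per page
        "key": next(_key_cycle),  # next key in the pool
    }
    if page_token:
        params["pageToken"] = page_token
    return params


def fetch_comments_page(video_id, page_token=None):
    import requests  # any HTTP client works here
    resp = requests.get(COMMENTS_URL,
                        params=comment_request_params(video_id, page_token),
                        timeout=10)
    return resp.json()
```

Each response contains a `nextPageToken`; pass it back in as `page_token` to walk through all comments on a video.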

Scraping Reviews from Marketplaces and Review Sites

Reviews on marketplaces are the most structured and relevant data source for sentiment analysis in e-commerce. Here, customers leave ratings and detailed comments immediately after purchase.

Wildberries

Wildberries actively protects against scraping. When trying to collect reviews from one IP address, you will quickly get blocked. Typical signs of a bot that the platform tracks include:

  • Requests that are too fast (more than 1-2 per second)
  • The same User-Agent in all requests
  • Lack of cookies and session history
  • Requests from data center IPs (not residential addresses)

For successful scraping of Wildberries, it is necessary to:

  1. Use residential proxies — they have the IPs of regular users and do not raise suspicion. For scraping a Russian marketplace, Russian IPs are needed.
  2. Set up proxy rotation — change IPs after every 20-30 requests or every 5-10 minutes.
  3. Add delays — pause for 2-5 seconds between requests, mimicking human behavior.
  4. Rotate User-Agent — use different browsers and versions for each request.
  5. Maintain cookies — keep the session for each proxy address.
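
The five rules above can be sketched in Python. The proxy entries and User-Agent strings are placeholders, and any HTTP client would work in place of requests:

```python
import itertools
import random
import time

# Placeholder residential proxies in ip:port:user:pass format
PROXIES = [
    "203.0.113.10:8000:user:pass",
    "203.0.113.11:8000:user:pass",
]
# Placeholder User-Agent strings; use a realistic, current set in practice
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/120.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

proxy_cycle = itertools.cycle(PROXIES)


def proxy_url(entry):
    """Convert an ip:port:user:pass entry into a requests-style proxy URL."""
    ip, port, user, password = entry.split(":")
    return f"http://{user}:{password}@{ip}:{port}"


def fetch(url, session=None):
    """Fetch one page with a rotated proxy, random User-Agent, and human-like delay."""
    import requests  # imported lazily so the helpers above stay importable

    proxy = proxy_url(next(proxy_cycle))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    time.sleep(random.uniform(2, 5))        # pause 2-5 s between requests
    s = session or requests.Session()       # a Session keeps cookies per proxy
    return s.get(url, headers=headers,
                 proxies={"http": proxy, "https": proxy}, timeout=15)
```

In a real setup you would keep one `Session` object per proxy address, so that cookies and session history stay consistent with the IP making the requests.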

Tip: For scraping marketplaces, it's better to use ready-made tools with built-in protection against blocks than to write your own scripts. This saves time and reduces the risk of bans.

Ozon

Ozon uses similar protection mechanisms, but they are less aggressive than Wildberries. Key features of scraping include:

  • Reviews are loaded dynamically via AJAX requests — you need to analyze network traffic
  • There is pagination — one product can have hundreds of reviews across dozens of pages
  • Reviews contain ratings by parameters (quality, compliance with description, etc.) — valuable structured information
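
Walking the pagination can be sketched as a simple URL builder. The base URL and parameter names below are hypothetical; find the real AJAX endpoint in your browser's network tab:

```python
def review_page_urls(product_id, pages):
    """Build URLs for each page of a paginated reviews endpoint.

    The base URL and query parameters are hypothetical placeholders;
    inspect the site's network traffic to find the real endpoint.
    """
    base = "https://marketplace.example/api/reviews"
    return [f"{base}?product={product_id}&page={p}" for p in range(1, pages + 1)]
```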

Yandex.Market

Yandex.Market has a strict bot protection system. Here, the use of residential proxies is essential, as data center IPs are blocked almost instantly. Reviews on the Market are especially valuable, as they often contain detailed descriptions of product usage experiences.

Review Sites (Irecommend, Otzovik)

Specialized review platforms provide the most detailed opinions — users write entire articles about their experiences. Scraping here is usually easier than on marketplaces, but still requires proxies for large-scale data collection.

Monitoring News Sites and Forums

News portals and forums provide insight into public opinion about your industry and brand in a broader context.

News Sites

For monitoring news, use:

  • RSS Feeds — many news sites provide RSS with the latest publications. This is a legal and convenient way to collect data.
  • Google News API — allows you to search for mentions of your brand in news worldwide.
  • Scraping Comments — discussions with valuable insights often unfold under news articles.
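
Filtering an RSS feed for brand mentions can be done with the standard library alone. A minimal sketch; the brand name and feed content here are examples:

```python
import xml.etree.ElementTree as ET


def find_mentions(rss_xml, brand):
    """Return (title, link) pairs for RSS items mentioning the brand."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", "")
        desc = item.findtext("description", "")
        if brand.lower() in (title + " " + desc).lower():
            hits.append((title, item.findtext("link", "")))
    return hits
```

In production, fetch the feed on a schedule, store seen links, and pass only new items to your sentiment model.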

Forums and Communities

Thematic forums (e.g., automotive, technical, women's) contain expert opinions and detailed discussions. Scraping forums is usually technically easier but requires more time for post-processing due to the unstructured format.

Tools for Automating Data Collection

The choice of tool depends on your technical skills, budget, and the scale of the task.

Ready-Made Monitoring Services (No Code)

| Service | Data Sources | Features |
| --- | --- | --- |
| Brand Analytics | Social media, news, forums | Built-in sentiment analysis, expensive |
| IQBuzz | Social media, media | Good for the Russian market |
| Babkee | Marketplace reviews | Specialization in e-commerce |
| Popsters | Social media | Competitor content analytics |

Ready-made services are convenient but expensive and do not provide full control over the data. For specific tasks or large volumes, it's more profitable to set up your own collection system.

Tools for Self-Scraping

If you are ready to delve into technical details, here are popular tools:

  • Octoparse — a visual parser without code. You set up data collection through the interface by clicking on page elements. Supports proxies and task scheduling.
  • ParseHub — similar to Octoparse, works well with dynamic JavaScript sites.
  • Scrapy (Python) — a powerful framework for writing your own parsers. Requires programming skills but offers maximum flexibility.
  • Beautiful Soup + Requests (Python) — a simple combination for scraping static sites.
  • Selenium / Puppeteer — tools for controlling the browser. Needed for sites with bot protection and complex JavaScript logic.
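
For static sites, the Beautiful Soup + Requests combination mentioned above fits in a few lines. The `.review-text` selector is hypothetical and depends on the target site's markup:

```python
def clean_text(raw):
    """Collapse runs of whitespace in extracted text."""
    return " ".join(raw.split())


def parse_reviews(html):
    """Extract review texts from a static HTML page.

    The .review-text selector is a placeholder; adjust it to the real markup.
    """
    from bs4 import BeautifulSoup  # imported lazily

    soup = BeautifulSoup(html, "html.parser")
    return [clean_text(node.get_text()) for node in soup.select(".review-text")]


def fetch_reviews(url):
    import requests

    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    return parse_reviews(resp.text)
```

If the reviews only appear after JavaScript runs, this approach will return an empty list; that is the signal to switch to Selenium or Puppeteer.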

Specialized APIs for Social Media

Many platforms provide official APIs:

  • VK API — allows you to get public posts, comments, information about communities
  • Telegram API — access to messages from public channels and chats
  • YouTube Data API — collecting comments, information about videos and channels

APIs are convenient because they are legal and structured, but they have limitations on the number of requests and do not always provide access to all necessary data.
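
The VK API's `wall.getComments` method, for example, can be called over plain HTTP. A minimal sketch; the token and IDs are placeholders:

```python
VK_API_URL = "https://api.vk.com/method/wall.getComments"


def vk_params(owner_id, post_id, access_token, offset=0):
    """Query parameters for one wall.getComments call."""
    return {
        "owner_id": owner_id,          # negative for communities
        "post_id": post_id,
        "count": 100,                  # API maximum per request
        "offset": offset,
        "access_token": access_token,
        "v": "5.131",                  # API version
    }


def fetch_all_comments(owner_id, post_id, access_token):
    """Page through all comments under one post."""
    import requests  # imported lazily

    items, offset = [], 0
    while True:
        data = requests.get(VK_API_URL,
                            params=vk_params(owner_id, post_id, access_token, offset),
                            timeout=10).json()
        batch = data.get("response", {}).get("items", [])
        if not batch:
            return items
        items.extend(batch)
        offset += len(batch)
```

Mind the request-rate limits: spread calls out over time or the API will start returning errors instead of data.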

Why Proxies are Necessary for Scraping

Scraping without proxies is like trying to discreetly photograph hundreds of people from one spot. You will quickly be noticed and asked to leave. Proxies solve several critical problems:

Bypassing Rate Limiting

Most websites limit the number of requests from one IP address. For example, Wildberries may block an IP after 50-100 requests per hour. With proxies, you distribute the load across dozens or hundreds of IP addresses, bypassing these limits.

Avoiding Blocks

Websites use complex algorithms to detect bots. If all your requests come from one IP, it is a clear sign of automation. Proxies simulate requests from different users in various locations.

Accessing Geo-Specific Content

Some reviews and comments may only be shown to users from certain regions. For example, on marketplaces, prices and reviews may differ for Moscow and other regions. Proxies from the required cities provide access to the complete picture.

Which Type of Proxy to Choose

| Proxy Type | Pros | Cons | When to Use |
| --- | --- | --- | --- |
| Residential | Real user IPs, minimal ban risk | More expensive than other types | Marketplaces, social media with strong protection |
| Mobile | Mobile operator IPs, hardly ever banned | Most expensive, fewer IPs in the pool | Instagram, TikTok, mobile applications |
| Data center | Fast, cheap | Easily identified as proxies, often blocked | Simple sites without protection, news portals |

For sentiment analysis, the optimal choice is residential proxies. They provide a balance between cost and reliability. When scraping Russian marketplaces and social media, choose proxies with Russian IP addresses.

Setting Up a Data Collection System: Step-by-Step Guide

Let's go through setting up a data collection system using the example of scraping reviews from Wildberries with Octoparse and residential proxies.

Step 1: Preparing Proxies

  1. Purchase residential proxies with Russian IPs (at least 10-20 addresses for stable operation)
  2. Obtain a list of proxies in the format: IP:PORT:USERNAME:PASSWORD
  3. Check the functionality of each proxy through online checking services

Step 2: Setting Up Octoparse

  1. Download and install Octoparse from the official website
  2. Create a new scraping task: enter the product page URL on Wildberries
  3. Go to the reviews section on the product page
  4. In the visual editor of Octoparse, highlight the elements to be collected:
    • Review text
    • Rating (number of stars)
    • Publication date
    • Author's name
    • Pros and cons (if any)
  5. Set up pagination to collect reviews from all pages

Step 3: Connecting Proxies in Octoparse

  1. Open task settings → "Proxy" section
  2. Select "Rotate proxy" mode
  3. Import your list of proxies
  4. Set the rotation interval: every 20-30 requests or every 5 minutes
  5. Check the operation of the proxies through the built-in tester

Step 4: Setting Up Scraping Parameters

  1. Set a delay between requests: 3-5 seconds (to mimic human behavior)
  2. Enable User-Agent rotation for additional masking
  3. Set up error handling: when an IP is blocked, automatically switch to the next proxy
  4. Set limits: a maximum of 50-100 reviews from one IP before rotation
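
The error-handling rule from step 3 can be sketched as a small failover helper. Treating 403 and 429 as "blocked" is a common convention, not a guarantee about any specific site:

```python
def fetch_with_failover(url, proxies, fetch, max_tries=None):
    """Try each proxy in turn until one succeeds.

    `fetch(url, proxy)` must return (status_code, body); 403/429 responses
    and connection errors both trigger a switch to the next proxy.
    """
    tries = max_tries or len(proxies)
    last_error = None
    for proxy in proxies[:tries]:
        try:
            status, body = fetch(url, proxy)
        except Exception as exc:          # network error: try the next proxy
            last_error = exc
            continue
        if status in (403, 429):          # blocked: rotate to the next proxy
            last_error = RuntimeError(f"blocked via {proxy}")
            continue
        return body
    raise RuntimeError(f"all proxies failed: {last_error}")
```

Injecting the `fetch` callable keeps the failover logic independent of the HTTP client, which also makes it easy to test.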

Step 5: Launching and Monitoring

  1. Run the task in test mode on 10-20 reviews
  2. Check the quality of the collected data: make sure all fields are filled in correctly
  3. If everything works — launch full-scale collection
  4. Monitor the process: keep an eye on the number of errors and blocks
  5. Set up automatic data export to CSV or a database

Important: Always make the first run on a small scale. This will help identify configuration issues before you exhaust all your proxy traffic or receive mass blocks.

Step 6: Post-Processing Data

After collecting data, it is necessary to clean and prepare it for analysis:

  1. Remove duplicate reviews
  2. Clean the text from HTML tags and special characters
  3. Normalize dates to a single format
  4. Check for empty fields
  5. Export to a format suitable for your analysis system (CSV, JSON, database)
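
The cleaning steps above can be sketched with the standard library alone. The date formats tried here are examples; extend the list to match whatever your sources actually produce:

```python
import html
import re
from datetime import datetime


def clean_review(text):
    """Strip HTML tags and entities, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html.unescape(text))
    return " ".join(text.split())


def normalize_date(raw):
    """Try several common formats; return an ISO date string or None."""
    for fmt in ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def postprocess(rows):
    """Deduplicate, clean, and normalize a list of {'text', 'date'} dicts."""
    seen, out = set(), []
    for row in rows:
        text = clean_review(row.get("text", ""))
        if not text or text in seen:
            continue  # drop empty fields and duplicates
        seen.add(text)
        out.append({**row, "text": text, "date": normalize_date(row.get("date", ""))})
    return out
```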

Best Practices and Common Mistakes

What to Do (Best Practices)

  • Start Small — first set up collection from one source, debug the process, then scale to other platforms.
  • Collect Metadata — save not only the review text but also the date, author, rating, number of likes. This is important for in-depth analysis.
  • Regularly Update Data — sentiment changes over time. Set up automatic collection of new reviews daily or weekly.
  • Make Backups — save raw data before processing. If the analysis algorithm changes, you can reprocess old data.
  • Document the Process — record parser settings, data sources, collection periods. This will help with analysis and scaling.
  • Monitor Quality — regularly check a random sample of collected data for correctness.

What to Avoid (Common Mistakes)

  • Scraping Without Proxies — a quick way to get your IP blocked. Even for small volumes, use at least a few proxies.
  • Too Aggressive Scraping — requests every second will raise suspicion. Add random delays of 2-5 seconds.
  • Using Data Center Proxies for Social Media — Instagram, Facebook, VK easily identify and block them. For social media, only residential or mobile proxies.
  • Ignoring robots.txt — while this is not a legal requirement, gross violations can lead to server-level IP bans.
  • Collecting Personal Data — do not collect emails, phone numbers, or other private information. This violates data protection laws.
  • Lack of Error Handling — the parser should correctly handle 404 errors, timeouts, and changes in page structure.
  • Insufficient Proxy Rotation — if you use one proxy for too long, it will get blocked. Change IPs every 20-50 requests.

Performance Optimization

For collecting large volumes of data (thousands of reviews per day):

  • Parallelization — run multiple scraping threads simultaneously, each with its own proxy
  • Task Queues — use systems like Celery (for Python) to manage scraping tasks
  • Caching — save already scraped pages to avoid scraping them again
  • Incremental Collection — only collect new reviews since the last run, not everything from scratch
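
Incremental collection can be as simple as remembering the last processed review ID between runs. A sketch assuming reviews arrive sorted newest-first; the state file name is arbitrary:

```python
import json
from pathlib import Path

STATE_FILE = Path("scrape_state.json")  # arbitrary location for run state


def load_last_seen(path=STATE_FILE):
    """Return the last processed review ID, or None on the first run."""
    if path.exists():
        return json.loads(path.read_text()).get("last_review_id")
    return None


def filter_new(reviews, last_seen):
    """Keep only reviews newer than the last processed ID.

    Assumes `reviews` is sorted newest-first, as review pages usually are.
    """
    fresh = []
    for review in reviews:
        if review["id"] == last_seen:
            break  # everything from here on was collected last time
        fresh.append(review)
    return fresh


def save_last_seen(review_id, path=STATE_FILE):
    path.write_text(json.dumps({"last_review_id": review_id}))
```

After each successful run, call `save_last_seen` with the newest review's ID so the next run only fetches what changed.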

Legal Aspects

Scraping is in a gray area of legislation. To minimize risks:

  • Collect only publicly available data (without authorization)
  • Do not resell collected data
  • Use data only for internal analysis and product improvement
  • Remove personal data (names, photos) before analysis
  • Maintain a reasonable load on the servers of the sites

Conclusion

Collecting data for sentiment analysis is the foundation for understanding customer attitudes towards your brand. A properly set up collection system provides a continuous flow of relevant information from social media, marketplaces, and other sources.

Key takeaways from this guide:

  • Use diverse data sources — social media, marketplaces, review sites, forums
  • Choose tools according to your level: ready-made services for quick start, custom parsers for flexibility
  • Residential proxies are a must for stable scraping of protected platforms
  • Set up the system gradually: start with one source, then scale
  • Automate regular data collection to track sentiment dynamics

Start with scraping one or two sources that are most important for your business. Debug the process, set up automation, and only then add new platforms. Data quality is more important than quantity — it's better to have 1000 accurate and relevant reviews than 10,000 with junk and duplicates.

If you plan to collect data from Russian marketplaces or social media, we recommend using residential proxies with Russian IPs — they provide stable operation without blocks and give access to geo-specific content. For scraping mobile applications and platforms like Instagram, mobile proxies are suitable, which are practically indistinguishable from regular users.
