RAG (Retrieval-Augmented Generation) systems require high-quality data for training. YouTube is a vast source of structured content: videos with subtitles, metadata, and comments. In this article, we will explore how to effectively collect YouTube data for RAG while avoiding blocks and adhering to API limits.
What is RAG and why YouTube data is needed
RAG (Retrieval-Augmented Generation) is an approach to building AI systems where a language model is supplemented by a knowledge base. Instead of relying solely on the data the model was trained on, RAG retrieves relevant information from an external source and uses it to generate responses.
YouTube contains millions of hours of content with subtitles in various languages. This makes the platform a valuable source of data for RAG systems in various fields:
- Educational systems — lectures, tutorials, courses with timestamps
- Technical documentation — video guides on programming, DevOps, software setup
- Medical knowledge bases — lectures by doctors, case studies
- Business analytics — interviews with experts, case studies, market reviews
- Product support — product reviews, FAQs in video format
The advantage of YouTube data is its structure: subtitles with timestamps, metadata (categories, tags), and social context (comments, likes). All of this helps the RAG system to understand not only the content but also the context of the information.
What YouTube data is useful for RAG systems
For the effective operation of a RAG system, several types of data need to be collected. Each type serves its purpose in the process of information retrieval and generation.
Subtitles (Transcripts)
The primary source of textual data. YouTube provides two types of subtitles:
- Automatic — generated by Google's speech recognition algorithms. Available for most videos in English and other popular languages. Accuracy is 85-95% depending on audio quality.
- Manual — uploaded by authors or the community. More accurate, often containing formatting and additional context.
Subtitles include timestamps, which allow linking text to specific moments in the video. This is critical for creating accurate citations in RAG responses.
Video Metadata
Metadata helps the RAG system understand the context and relevance of the information:
| Data Type | Application in RAG |
|---|---|
| Title and Description | Semantic search, topic identification |
| Tags and Categories | Content classification, filtering |
| Publication Date | Relevance of information (important for technical topics) |
| Duration | Assessment of topic depth |
| Statistics (views, likes) | Assessment of the quality and popularity of the source |
| Channel Information | Determining the authority of the source |
Comments
Comments provide additional context: viewer questions, author clarifications, discussions. For RAG systems, this is valuable because:
- Comments often contain FAQs on the video topic
- Authors may post corrections and additions
- Discussions reveal different perspectives on the issue
Working with YouTube Data API v3: setup and limits
YouTube Data API v3 is the official way to obtain data. It provides access to metadata, statistics, and comments. Subtitles are obtained through separate methods.
Obtaining an API Key
To work with the API, you need a key from the Google Cloud Console:
1. Go to console.cloud.google.com
2. Create a new project or select an existing one
3. Enable YouTube Data API v3 in the "APIs & Services" section
4. Create credentials → API key
5. Copy the key — it will be needed for all requests
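Before spending quota, it can be worth verifying that the key is active with the cheapest call available (videos.list costs a single quota unit). A minimal sketch, assuming the standard `requests` library; the helper names are illustrative, not part of Google's client libraries:

```python
import requests

def build_key_check_request(api_key):
    """Build the cheapest possible request: videos.list costs 1 quota unit."""
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {'part': 'id', 'id': 'dQw4w9WgXcQ', 'key': api_key}
    return url, params

def check_api_key(api_key):
    """Return True if the key is accepted. A 400 or 403 response usually
    means the key is invalid or YouTube Data API v3 is not enabled."""
    url, params = build_key_check_request(api_key)
    response = requests.get(url, params=params, timeout=10)
    return response.status_code == 200
```

Running `check_api_key` once at startup fails fast instead of burning quota on a misconfigured project.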
Limits and Quotas
The YouTube API uses a quota system. Each request "costs" a certain number of units:
| Operation | Quota Cost (units) |
|---|---|
| Video Search (search.list) | 100 units |
| Getting Video Data (videos.list) | 1 unit |
| Getting Comments (commentThreads.list) | 1 unit |
The default daily limit is 10,000 units. This is approximately 100 search requests or 10,000 metadata requests. To increase the quota, you need to apply to Google.
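Because search.list is 100 times more expensive than videos.list, it helps to budget quota before launching a crawl. A small tracker along these lines (a sketch; the class is our own, with costs taken from the table above) keeps a run from silently exhausting the daily limit:

```python
# Quota cost per API method, from the table above
UNIT_COST = {'search.list': 100, 'videos.list': 1, 'commentThreads.list': 1}

class QuotaTracker:
    """Track estimated quota spend against the 10,000-unit daily default."""

    def __init__(self, daily_limit=10_000):
        self.daily_limit = daily_limit
        self.used = 0

    def can_afford(self, method):
        return self.used + UNIT_COST[method] <= self.daily_limit

    def record(self, method):
        """Record a call, refusing it if it would exceed the daily limit."""
        if not self.can_afford(method):
            raise RuntimeError(f'Daily quota would be exceeded by {method}')
        self.used += UNIT_COST[method]

tracker = QuotaTracker()
tracker.record('search.list')       # 100 units
for _ in range(50):
    tracker.record('videos.list')   # 1 unit each
print(tracker.used)                 # 150
```

Calling `record()` before each request makes the crawl stop cleanly instead of hitting 403 quota errors mid-run.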
Basic Example of Working with the API
```python
import requests

API_KEY = 'your_api_key'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

# Search for videos by query
def search_videos(query, max_results=10):
    url = f'{BASE_URL}/search'
    params = {
        'part': 'snippet',
        'q': query,
        'type': 'video',
        'maxResults': max_results,
        'key': API_KEY
    }
    response = requests.get(url, params=params)
    return response.json()

# Get video metadata
def get_video_details(video_id):
    url = f'{BASE_URL}/videos'
    params = {
        'part': 'snippet,contentDetails,statistics',
        'id': video_id,
        'key': API_KEY
    }
    response = requests.get(url, params=params)
    return response.json()

# Example usage
results = search_videos('machine learning tutorial', max_results=5)
for item in results.get('items', []):
    video_id = item['id']['videoId']
    title = item['snippet']['title']
    print(f'ID: {video_id}, Title: {title}')

    # Get detailed information
    details = get_video_details(video_id)
    stats = details['items'][0]['statistics']
    print(f"Views: {stats.get('viewCount')}, Likes: {stats.get('likeCount')}")
```
Parsing Video Subtitles: Automatic and Manual
YouTube Data API v3 does include a captions endpoint, but downloading caption tracks requires OAuth authorization from the video owner, so for third-party collection subtitles are obtained by alternative methods.
Using the youtube-transcript-api Library
The simplest way is the youtube-transcript-api library for Python. It extracts subtitles directly, without an API key:
```python
from youtube_transcript_api import YouTubeTranscriptApi

# Getting subtitles. Note: this is the classic pre-1.0 interface of
# youtube-transcript-api; 1.x releases use YouTubeTranscriptApi().fetch() instead.
video_id = 'dQw4w9WgXcQ'

try:
    # Try Russian subtitles first, fall back to English
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['ru', 'en'])

    # Output subtitles with timestamps
    for entry in transcript:
        start_time = entry['start']
        duration = entry['duration']
        text = entry['text']
        print(f"[{start_time:.2f}s] {text}")
except Exception as e:
    print(f"Error retrieving subtitles: {e}")

# Getting the list of available languages
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
for transcript in transcript_list:
    print(f"Language: {transcript.language}, Automatic: {transcript.is_generated}")
```
The library automatically detects available subtitles and can translate them into other languages (if YouTube provides such functionality).
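When a video has several tracks, a common policy is to prefer manual subtitles over auto-generated ones within your language priority. A pure-Python selection helper illustrating that policy (the function is our own, not part of youtube-transcript-api):

```python
def pick_transcript(available, preferred_langs):
    """Pick the best transcript track.

    available: list of (language_code, is_generated) tuples, e.g. as
    reported by list_transcripts(). Prefer earlier languages in
    preferred_langs; within a language, prefer manual over auto-generated.
    """
    def rank(track):
        lang, is_generated = track
        # False (manual) sorts before True (auto-generated)
        return (preferred_langs.index(lang), is_generated)

    candidates = [t for t in available if t[0] in preferred_langs]
    if not candidates:
        return None
    return min(candidates, key=rank)

tracks = [('en', True), ('en', False), ('ru', True)]
print(pick_transcript(tracks, ['ru', 'en']))  # ('ru', True)
print(pick_transcript(tracks, ['en']))        # ('en', False)
```

The same ranking idea extends naturally to scoring tracks by expected accuracy when building a multilingual corpus.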
Processing Timestamps for RAG
For RAG systems, it is important to maintain the link between text and timestamps. This allows for the creation of accurate citations:
```python
def format_timestamp(seconds):
    """Convert seconds to MM:SS format"""
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"

def create_chunks_with_timestamps(transcript, video_id=None, chunk_size=500):
    """Break subtitles into chunks while preserving timestamps"""
    chunks = []
    current_chunk = ""
    chunk_start_time = 0

    for i, entry in enumerate(transcript):
        if len(current_chunk) == 0:
            chunk_start_time = entry['start']
        current_chunk += entry['text'] + " "

        # If we reached the required size or the end of the subtitles
        if len(current_chunk) >= chunk_size or i == len(transcript) - 1:
            chunks.append({
                'text': current_chunk.strip(),
                'start_time': chunk_start_time,
                'timestamp': format_timestamp(chunk_start_time),
                'video_id': video_id  # passed in explicitly, not read from a global
            })
            current_chunk = ""
    return chunks

# Usage
transcript = YouTubeTranscriptApi.get_transcript(video_id)
chunks = create_chunks_with_timestamps(transcript, video_id=video_id)
for chunk in chunks[:3]:  # First 3 chunks
    print(f"[{chunk['timestamp']}] {chunk['text'][:100]}...")
```
Collecting Metadata: Titles, Descriptions, Tags
Metadata enriches the context for the RAG system. Here is a complete example of collecting all necessary data:
```python
import requests

def collect_video_metadata(video_id, api_key):
    """Collect full video metadata; returns None for unavailable videos."""
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {
        'part': 'snippet,contentDetails,statistics,topicDetails',
        'id': video_id,
        'key': api_key
    }
    response = requests.get(url, params=params)
    data = response.json()

    if 'items' not in data or len(data['items']) == 0:
        return None

    item = data['items'][0]
    snippet = item['snippet']
    stats = item.get('statistics', {})
    content = item.get('contentDetails', {})

    metadata = {
        'video_id': video_id,
        'title': snippet['title'],
        'description': snippet['description'],
        'channel_title': snippet['channelTitle'],
        'channel_id': snippet['channelId'],
        'published_at': snippet['publishedAt'],
        'tags': snippet.get('tags', []),
        'category_id': snippet.get('categoryId'),
        'duration': content.get('duration'),
        'view_count': int(stats.get('viewCount', 0)),
        'like_count': int(stats.get('likeCount', 0)),
        'comment_count': int(stats.get('commentCount', 0)),
        'topics': item.get('topicDetails', {}).get('topicCategories', [])
    }
    return metadata

# Example usage
metadata = collect_video_metadata('dQw4w9WgXcQ', API_KEY)
if metadata:
    print(f"Title: {metadata['title']}")
    print(f"Channel: {metadata['channel_title']}")
    print(f"Views: {metadata['view_count']:,}")
    print(f"Tags: {', '.join(metadata['tags'][:5])}")
```
Determining Content Relevance
For technical topics, the freshness of information is important. Let's add a function to assess relevance:
```python
from datetime import datetime

def calculate_content_freshness(published_date_str):
    """Assess content freshness from the ISO publication date."""
    published_date = datetime.fromisoformat(published_date_str.replace('Z', '+00:00'))
    age_days = (datetime.now(published_date.tzinfo) - published_date).days

    if age_days < 30:
        return 'very_fresh'
    elif age_days < 180:
        return 'fresh'
    elif age_days < 365:
        return 'moderate'
    else:
        return 'old'

def calculate_quality_score(metadata):
    """Calculate source quality score (0-7)."""
    score = 0

    # Popularity
    views = metadata['view_count']
    if views > 100000:
        score += 3
    elif views > 10000:
        score += 2
    elif views > 1000:
        score += 1

    # Engagement (likes relative to views)
    if views > 0:
        like_ratio = metadata['like_count'] / views
        if like_ratio > 0.05:
            score += 2
        elif like_ratio > 0.02:
            score += 1

    # Freshness
    freshness = calculate_content_freshness(metadata['published_at'])
    if freshness == 'very_fresh':
        score += 2
    elif freshness == 'fresh':
        score += 1

    return score

# Usage
metadata = collect_video_metadata('dQw4w9WgXcQ', API_KEY)
quality = calculate_quality_score(metadata)
freshness = calculate_content_freshness(metadata['published_at'])

print(f"Quality Score: {quality}/7")
print(f"Freshness: {freshness}")
```
Parsing Comments for Contextual Analysis
Comments can contain valuable information: corrections to errors in the video, additional resources, frequent questions. For RAG systems, this provides additional context.
```python
def get_video_comments(video_id, api_key, max_results=100):
    """Retrieve top-level comments for a video, paginating as needed."""
    url = 'https://www.googleapis.com/youtube/v3/commentThreads'
    comments = []
    next_page_token = None

    while len(comments) < max_results:
        params = {
            'part': 'snippet',
            'videoId': video_id,
            'maxResults': min(100, max_results - len(comments)),
            'order': 'relevance',  # Sort by relevance
            'key': api_key
        }
        if next_page_token:
            params['pageToken'] = next_page_token

        response = requests.get(url, params=params)
        data = response.json()

        if 'items' not in data:
            break

        for item in data['items']:
            top_comment = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'author': top_comment['authorDisplayName'],
                'text': top_comment['textDisplay'],
                'like_count': top_comment['likeCount'],
                'published_at': top_comment['publishedAt'],
                'reply_count': item['snippet']['totalReplyCount']
            })

        next_page_token = data.get('nextPageToken')
        if not next_page_token:
            break

    return comments

def filter_valuable_comments(comments, min_likes=5):
    """Filter valuable comments.

    Value criteria:
    1. Many likes (popularity)
    2. Has replies (sparked discussion)
    3. Long text (detailed comment)
    """
    valuable = []
    for comment in comments:
        if (comment['like_count'] >= min_likes or
                comment['reply_count'] > 0 or
                len(comment['text']) > 200):
            valuable.append(comment)
    return valuable

# Usage
comments = get_video_comments('dQw4w9WgXcQ', API_KEY, max_results=50)
valuable_comments = filter_valuable_comments(comments)

print(f"Total comments: {len(comments)}")
print(f"Valuable comments: {len(valuable_comments)}")

for comment in valuable_comments[:3]:
    print(f"\n[{comment['like_count']} likes] {comment['author']}:")
    print(comment['text'][:200])
```
Using Proxies for Scaling Data Collection
When scaling data collection, two problems arise: YouTube API limits (10,000 quota units per day) and blocks when parsing subtitles. Proxies help solve both issues.
When Proxies are Needed for YouTube Parsing
- Exceeding API Quotas — using multiple API keys through different IPs can increase the daily limit
- Parsing Subtitles Bypassing API — the youtube-transcript-api library makes direct requests that may be blocked at high frequency
- Data Collection from Different Regions — some videos are only available in certain countries
- Parallel Collection — distributing the load across multiple IPs to speed up the process
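Whichever path is chosen, throttled requests (HTTP 429 or 403 quota errors) are best handled with exponential backoff before rotating to a new IP. A small stdlib-only sketch; the helper and the `fetch`/`sleep` conventions are our own, used here only to illustrate the retry pattern:

```python
import random
import time

def with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on None (meaning 'throttled'), doubling the
    delay each attempt with a little jitter. `sleep` is injectable so
    tests don't actually wait."""
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:
            return result
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        sleep(delay)
    raise RuntimeError('Still throttled after retries')

# Usage: a fake fetcher that is throttled twice, then succeeds
attempts = []
def flaky_fetch():
    attempts.append(1)
    return {'ok': True} if len(attempts) > 2 else None

result = with_backoff(flaky_fetch, sleep=lambda d: None)
print(result, len(attempts))  # {'ok': True} 3
```

In a real collector, `fetch` would wrap the HTTP call and return None on a 429/403 response; combining backoff with proxy rotation reduces both wasted quota and block rates.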
Choosing the Type of Proxy
| Proxy Type | Advantages | When to Use |
|---|---|---|
| Data Center | High speed, low cost | Working with API, small volumes |
| Residential | Low risk of blocks, real IPs | Mass subtitle parsing, bypassing restrictions |
| Mobile | Maximum trust, rare blocks | Data collection from mobile apps, critical tasks |
For most RAG system tasks, residential proxies are suitable — they provide a balance between cost and reliability when scaling parsing.
Setting Up Proxies in Code
```python
import requests
from youtube_transcript_api import YouTubeTranscriptApi

# Setting up the proxy
PROXY = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'http://username:password@proxy-server:port'
}

# Working with the API through a proxy
def get_video_details_with_proxy(video_id, api_key, proxy):
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {
        'part': 'snippet,statistics',
        'id': video_id,
        'key': api_key
    }
    response = requests.get(url, params=params, proxies=proxy, timeout=10)
    return response.json()

# Parsing subtitles through a proxy.
# Note: the pre-1.0 youtube-transcript-api accepts a requests-style
# proxies dict directly, so there is no need to touch its internals.
class ProxiedTranscriptApi:
    def __init__(self, proxy):
        self.proxy = proxy

    def get_transcript(self, video_id, languages=('en',)):
        return YouTubeTranscriptApi.get_transcript(
            video_id, languages=list(languages), proxies=self.proxy
        )

# Usage
api = ProxiedTranscriptApi(PROXY)
transcript = api.get_transcript('dQw4w9WgXcQ', languages=['ru', 'en'])
print(f"Retrieved {len(transcript)} segments of subtitles")
```
Proxy Rotation for Scaling
When collecting data from thousands of videos, it is important to distribute the load across several proxies:
```python
import random
import time

class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.current_index = 0

    def get_next_proxy(self):
        """Sequential (round-robin) rotation"""
        proxy = self.proxies[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxies)
        return proxy

    def get_random_proxy(self):
        """Random rotation"""
        return random.choice(self.proxies)

# List of proxies
PROXY_LIST = [
    {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'},
    {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'},
    {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'},
]

rotator = ProxyRotator(PROXY_LIST)

def collect_data_with_rotation(video_ids):
    results = []
    for video_id in video_ids:
        proxy = rotator.get_next_proxy()
        try:
            # Get metadata
            metadata = get_video_details_with_proxy(video_id, API_KEY, proxy)

            # Get subtitles
            api = ProxiedTranscriptApi(proxy)
            transcript = api.get_transcript(video_id)

            results.append({
                'video_id': video_id,
                'metadata': metadata,
                'transcript': transcript
            })

            # Delay between requests
            time.sleep(1)
        except Exception as e:
            print(f"Error for {video_id}: {e}")
            continue
    return results

# Usage
video_ids = ['video1', 'video2', 'video3', 'video4', 'video5']
data = collect_data_with_rotation(video_ids)
print(f"Data collected for {len(data)} videos")
```
Processing and Preparing Data for RAG
After collecting data, it needs to be processed and structured for the effective operation of RAG systems.
Creating Vector Embeddings
RAG systems use vector search to find relevant fragments. Text needs to be transformed into embeddings:
```python
from sentence_transformers import SentenceTransformer

# Load the model for creating embeddings
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def create_embeddings_from_transcript(transcript_chunks):
    """Create embeddings for transcript chunks"""
    embeddings = []
    for chunk in transcript_chunks:
        # Combine text with metadata for better context
        text_with_context = f"{chunk['title']} | {chunk['text']}"

        # Create the embedding
        embedding = model.encode(text_with_context)

        embeddings.append({
            'video_id': chunk['video_id'],
            'timestamp': chunk['timestamp'],
            'text': chunk['text'],
            'embedding': embedding.tolist(),
            'metadata': {
                'title': chunk['title'],
                'channel': chunk['channel'],
                'views': chunk['views']
            }
        })
    return embeddings

def prepare_rag_data(video_data):
    """Prepare all data for RAG"""
    all_chunks = []

    for video in video_data:
        # Metadata is expected in the flat form produced by collect_video_metadata()
        metadata = video['metadata']
        transcript = video['transcript']

        # Break subtitles into chunks
        chunks = create_chunks_with_timestamps(transcript)

        # Add metadata to each chunk
        for chunk in chunks:
            chunk['video_id'] = video['video_id']  # set explicitly for every chunk
            chunk['title'] = metadata['title']
            chunk['channel'] = metadata['channel_title']
            chunk['views'] = metadata['view_count']
            all_chunks.append(chunk)

    # Create embeddings
    embeddings = create_embeddings_from_transcript(all_chunks)
    return embeddings

# Usage
rag_data = prepare_rag_data(collected_videos)
print(f"Prepared {len(rag_data)} fragments for RAG")
```
Saving to a Vector Database
For efficient searching, embeddings are saved in specialized databases. Popular options include Pinecone, Weaviate, Qdrant, ChromaDB.
```python
import chromadb

# Initialize ChromaDB (local vector DB).
# Note: recent chromadb versions use PersistentClient; the old
# Settings(chroma_db_impl="duckdb+parquet") configuration is deprecated.
client = chromadb.PersistentClient(path="./youtube_rag_db")

# Create the collection
collection = client.create_collection(
    name="youtube_transcripts",
    metadata={"description": "YouTube video transcripts for RAG"}
)

def store_in_vector_db(embeddings_data, collection):
    """Save embeddings to the vector DB"""
    ids = []
    embeddings = []
    documents = []
    metadatas = []

    for i, item in enumerate(embeddings_data):
        ids.append(f"{item['video_id']}_{i}")
        embeddings.append(item['embedding'])
        documents.append(item['text'])

        # Convert the MM:SS timestamp back to seconds for a deep link
        minutes, seconds = item['timestamp'].split(':')
        t_seconds = int(minutes) * 60 + int(seconds)

        metadatas.append({
            'video_id': item['video_id'],
            'timestamp': item['timestamp'],
            'title': item['metadata']['title'],
            'channel': item['metadata']['channel'],
            'views': str(item['metadata']['views']),
            'youtube_url': f"https://youtube.com/watch?v={item['video_id']}&t={t_seconds}s"
        })

    # Add everything to the collection
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas
    )
    print(f"Saved {len(ids)} embeddings to vector DB")

# Usage
store_in_vector_db(rag_data, collection)
```
Searching and Generating Answers
The final step is implementing RAG search and answer generation:
```python
def search_youtube_knowledge(query, collection, model, top_k=3):
    """Search for relevant fragments from YouTube"""
    # Create the query embedding
    query_embedding = model.encode(query).tolist()

    # Search in the vector DB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    # Format the results
    sources = []
    for i in range(len(results['ids'][0])):
        sources.append({
            'text': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'distance': results['distances'][0][i] if 'distances' in results else None
        })
    return sources

def generate_rag_answer(query, sources, llm_api_key):
    """Generate an answer based on the retrieved sources"""
    # Build the context from the retrieved sources
    context = "\n\n".join([
        f"Source: {s['metadata']['title']} ({s['metadata']['timestamp']})\n{s['text']}"
        for s in sources
    ])

    # Prompt for the LLM
    prompt = f"""Based on the following fragments from YouTube videos, answer the user's question.
Make sure to cite sources with timestamps.

Context:
{context}

Question: {query}

Answer:"""

    # Call your LLM here (OpenAI, Claude, a local model). Example with OpenAI:
    # response = openai.ChatCompletion.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}]
    # )
    # answer = response.choices[0].message.content

    # Placeholder return while no LLM is wired in
    return {
        'answer': 'This will be the LLM answer',
        'sources': sources
    }

# Usage
query = "How to set up a proxy in Python?"
sources = search_youtube_knowledge(query, collection, model, top_k=3)

print("Found sources:")
for source in sources:
    print(f"\n{source['metadata']['title']}")
    print(f"Time: {source['metadata']['timestamp']}")
    print(f"Link: {source['metadata']['youtube_url']}")
    print(f"Text: {source['text'][:200]}...")
```
Optimizing RAG Quality
A few tips for improving the quality of RAG systems using YouTube data:
- Filter low-quality content — use metrics of views, likes, and recency of publication
- Preserve context — add video and channel titles to each text chunk
- Optimize chunk sizes — for technical content, 300-500 words is optimal
- Use metadata for ranking — fresher and more popular content may receive priority
- Add comments — they often contain important clarifications and FAQs
- Check video availability — some videos may be deleted or become private
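The last point can be automated: videos.list returns an empty `items` array for deleted or private videos, so availability reduces to checking that field. A small sketch for pruning stale chunks (the function names are our own):

```python
def video_is_available(videos_list_response):
    """videos.list returns an empty 'items' array for deleted or private
    videos, so an empty list means the video is gone."""
    return bool(videos_list_response.get('items'))

def prune_unavailable(chunks, availability_by_id):
    """Drop stored chunks whose source video is no longer available."""
    return [c for c in chunks if availability_by_id.get(c['video_id'], False)]

# Usage with mocked API responses
live = {'items': [{'id': 'abc123'}]}
gone = {'items': []}
print(video_is_available(live), video_is_available(gone))  # True False

chunks = [{'video_id': 'abc123', 'text': '...'}, {'video_id': 'gone404', 'text': '...'}]
kept = prune_unavailable(chunks, {'abc123': True, 'gone404': False})
print(len(kept))  # 1
```

Running such a check periodically keeps the knowledge base from citing videos that viewers can no longer open.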
Tip: For large-scale YouTube data collection, it is recommended to use a combination of the official API (for metadata) and parsing through proxies (for subtitles). This allows you to bypass limits and get the most information.
Conclusion
Collecting YouTube data for RAG systems is a multi-step process that includes working with APIs, parsing subtitles, processing metadata, and creating vector embeddings. Key points:
- YouTube Data API v3 provides metadata, statistics, and comments with a default limit of 10,000 quota units per day
- Subtitles are parsed through the youtube-transcript-api library or direct requests
- Timestamps are critically important for creating accurate citations
- Metadata (views, likes, date) helps assess the quality and relevance of content
- Comments add context and often contain FAQs
- Proxies are necessary for scaling and bypassing limits
- Vector embeddings and specialized databases provide fast semantic search
With the right process setup, you can collect tens of thousands of quality content fragments per day, creating a powerful knowledge base for RAG systems in any subject area.
If you plan to scale YouTube data collection while bypassing limits and blocks, we recommend using residential proxies — they provide stability when parsing thousands of videos and minimize the risk of blocks from YouTube.