RAG (Retrieval-Augmented Generation) systems require high-quality data for training. YouTube is a vast source of structured content: videos with subtitles, metadata, and comments. In this article, we will explore how to effectively collect YouTube data for RAG while avoiding blocks and adhering to API limits.
What is RAG and why YouTube data is needed
RAG (Retrieval-Augmented Generation) is an approach to building AI systems where a language model is supplemented by a knowledge base. Instead of relying solely on the data the model was trained on, RAG retrieves relevant information from an external source and uses it to generate responses.
YouTube contains millions of hours of content with subtitles in various languages. This makes the platform a valuable source of data for RAG systems in various fields:
- Educational systems — lectures, tutorials, courses with timestamps
- Technical documentation — video guides on programming, DevOps, software setup
- Medical knowledge bases — lectures by doctors, case studies
- Business analytics — interviews with experts, case studies, market reviews
- Product support — product reviews, FAQs in video format
The advantage of YouTube data is its structure: subtitles with timestamps, metadata (categories, tags), and social context (comments, likes). All of this helps the RAG system to understand not only the content but also the context of the information.
What YouTube data is useful for RAG systems
For the effective operation of a RAG system, several types of data need to be collected. Each type serves its purpose in the process of information retrieval and generation.
Subtitles (Transcripts)
The primary source of textual data. YouTube provides two types of subtitles:
- Automatic — generated by Google's speech recognition algorithms. Available for most videos in English and other popular languages. Accuracy is 85-95% depending on audio quality.
- Manual — uploaded by authors or the community. More accurate, often containing formatting and additional context.
Subtitles include timestamps, which allow linking text to specific moments in the video. This is critical for creating accurate citations in RAG responses.
Video Metadata
Metadata helps the RAG system understand the context and relevance of the information:
| Data Type | Application in RAG |
|---|---|
| Title and Description | Semantic search, topic identification |
| Tags and Categories | Content classification, filtering |
| Publication Date | Relevance of information (important for technical topics) |
| Duration | Assessment of topic depth |
| Statistics (views, likes) | Assessment of the quality and popularity of the source |
| Channel Information | Determining the authority of the source |
Comments
Comments provide additional context: viewer questions, author clarifications, discussions. For RAG systems, this is valuable because:
- Comments often contain FAQs on the video topic
- Authors may post corrections and additions
- Discussions reveal different perspectives on the issue
Working with YouTube Data API v3: setup and limits
YouTube Data API v3 is the official way to obtain data. It provides access to metadata, statistics, and comments. Subtitles are obtained through separate methods.
Obtaining an API Key
To work with the API, you need a key from the Google Cloud Console:
1. Go to console.cloud.google.com
2. Create a new project or select an existing one
3. Enable YouTube Data API v3 in the "APIs & Services" section
4. Create credentials → API key
5. Copy the key — it will be needed for all requests
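Before spending quota, it can be worth verifying that the key is active with the cheapest call available (videos.list costs a single quota unit). A minimal sketch, assuming the standard `requests` library; the helper names are illustrative, not part of Google's client libraries:

```python
import requests

def build_key_check_request(api_key):
    """Build the cheapest possible request: videos.list costs 1 quota unit."""
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {'part': 'id', 'id': 'dQw4w9WgXcQ', 'key': api_key}
    return url, params

def check_api_key(api_key):
    """Return True if the key is accepted. A 400 or 403 response usually
    means the key is invalid or YouTube Data API v3 is not enabled."""
    url, params = build_key_check_request(api_key)
    response = requests.get(url, params=params, timeout=10)
    return response.status_code == 200
```

Running `check_api_key` once at startup fails fast instead of burning quota on a misconfigured project.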
Limits and Quotas
The YouTube API uses a quota system. Each request "costs" a certain number of units:
| Operation | Quota Cost (units) |
|---|---|
| Video Search (search.list) | 100 units |
| Getting Video Data (videos.list) | 1 unit |
| Getting Comments (commentThreads.list) | 1 unit |
The default daily limit is 10,000 units. This is approximately 100 search requests or 10,000 metadata requests. To increase the quota, you need to apply to Google.
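Because search.list is 100 times more expensive than videos.list, it helps to budget quota before launching a crawl. A small tracker along these lines (a sketch; the class is our own, with costs taken from the table above) keeps a run from silently exhausting the daily limit:

```python
# Quota cost per API method, from the table above
UNIT_COST = {'search.list': 100, 'videos.list': 1, 'commentThreads.list': 1}

class QuotaTracker:
    """Track estimated quota spend against the 10,000-unit daily default."""

    def __init__(self, daily_limit=10_000):
        self.daily_limit = daily_limit
        self.used = 0

    def can_afford(self, method):
        return self.used + UNIT_COST[method] <= self.daily_limit

    def record(self, method):
        """Record a call, refusing it if it would exceed the daily limit."""
        if not self.can_afford(method):
            raise RuntimeError(f'Daily quota would be exceeded by {method}')
        self.used += UNIT_COST[method]

tracker = QuotaTracker()
tracker.record('search.list')       # 100 units
for _ in range(50):
    tracker.record('videos.list')   # 1 unit each
print(tracker.used)                 # 150
```

Calling `record()` before each request makes the crawl stop cleanly instead of hitting 403 quota errors mid-run.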
Basic Example of Working with the API
```python
import requests

API_KEY = 'your_api_key'
BASE_URL = 'https://www.googleapis.com/youtube/v3'

# Search for videos by query
def search_videos(query, max_results=10):
    url = f'{BASE_URL}/search'
    params = {
        'part': 'snippet',
        'q': query,
        'type': 'video',
        'maxResults': max_results,
        'key': API_KEY
    }
    response = requests.get(url, params=params)
    return response.json()

# Get video metadata
def get_video_details(video_id):
    url = f'{BASE_URL}/videos'
    params = {
        'part': 'snippet,contentDetails,statistics',
        'id': video_id,
        'key': API_KEY
    }
    response = requests.get(url, params=params)
    return response.json()

# Example usage
results = search_videos('machine learning tutorial', max_results=5)
for item in results.get('items', []):
    video_id = item['id']['videoId']
    title = item['snippet']['title']
    print(f'ID: {video_id}, Title: {title}')

    # Get detailed information
    details = get_video_details(video_id)
    stats = details['items'][0]['statistics']
    print(f"Views: {stats.get('viewCount')}, Likes: {stats.get('likeCount')}")
```
Parsing Video Subtitles: Automatic and Manual
YouTube Data API v3 does include a captions endpoint, but downloading caption tracks requires OAuth authorization from the video owner, so for third-party collection subtitles are obtained by alternative methods.
Using the youtube-transcript-api Library
The simplest way is the youtube-transcript-api library for Python. It extracts subtitles directly, without an API key:
```python
from youtube_transcript_api import YouTubeTranscriptApi

# Getting subtitles. Note: this is the classic pre-1.0 interface of
# youtube-transcript-api; 1.x releases use YouTubeTranscriptApi().fetch() instead.
video_id = 'dQw4w9WgXcQ'

try:
    # Try Russian subtitles first, fall back to English
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['ru', 'en'])

    # Output subtitles with timestamps
    for entry in transcript:
        start_time = entry['start']
        duration = entry['duration']
        text = entry['text']
        print(f"[{start_time:.2f}s] {text}")
except Exception as e:
    print(f"Error retrieving subtitles: {e}")

# Getting the list of available languages
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
for transcript in transcript_list:
    print(f"Language: {transcript.language}, Automatic: {transcript.is_generated}")
```
The library automatically detects available subtitles and can translate them into other languages (if YouTube provides such functionality).
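When a video has several tracks, a common policy is to prefer manual subtitles over auto-generated ones within your language priority. A pure-Python selection helper illustrating that policy (the function is our own, not part of youtube-transcript-api):

```python
def pick_transcript(available, preferred_langs):
    """Pick the best transcript track.

    available: list of (language_code, is_generated) tuples, e.g. as
    reported by list_transcripts(). Prefer earlier languages in
    preferred_langs; within a language, prefer manual over auto-generated.
    """
    def rank(track):
        lang, is_generated = track
        # False (manual) sorts before True (auto-generated)
        return (preferred_langs.index(lang), is_generated)

    candidates = [t for t in available if t[0] in preferred_langs]
    if not candidates:
        return None
    return min(candidates, key=rank)

tracks = [('en', True), ('en', False), ('ru', True)]
print(pick_transcript(tracks, ['ru', 'en']))  # ('ru', True)
print(pick_transcript(tracks, ['en']))        # ('en', False)
```

The same ranking idea extends naturally to scoring tracks by expected accuracy when building a multilingual corpus.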
Processing Timestamps for RAG
For RAG systems, it is important to maintain the link between text and timestamps. This allows for the creation of accurate citations:
```python
def format_timestamp(seconds):
    """Convert seconds to MM:SS format"""
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"

def create_chunks_with_timestamps(transcript, video_id=None, chunk_size=500):
    """Break subtitles into chunks while preserving timestamps"""
    chunks = []
    current_chunk = ""
    chunk_start_time = 0

    for i, entry in enumerate(transcript):
        if len(current_chunk) == 0:
            chunk_start_time = entry['start']
        current_chunk += entry['text'] + " "

        # If we reached the required size or the end of the subtitles
        if len(current_chunk) >= chunk_size or i == len(transcript) - 1:
            chunks.append({
                'text': current_chunk.strip(),
                'start_time': chunk_start_time,
                'timestamp': format_timestamp(chunk_start_time),
                'video_id': video_id  # passed in explicitly, not read from a global
            })
            current_chunk = ""
    return chunks

# Usage
transcript = YouTubeTranscriptApi.get_transcript(video_id)
chunks = create_chunks_with_timestamps(transcript, video_id=video_id)
for chunk in chunks[:3]:  # First 3 chunks
    print(f"[{chunk['timestamp']}] {chunk['text'][:100]}...")
```
Collecting Metadata: Titles, Descriptions, Tags
Metadata enriches the context for the RAG system. Here is a complete example of collecting all necessary data:
```python
import requests

def collect_video_metadata(video_id, api_key):
    """Collect full video metadata; returns None for unavailable videos."""
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {
        'part': 'snippet,contentDetails,statistics,topicDetails',
        'id': video_id,
        'key': api_key
    }
    response = requests.get(url, params=params)
    data = response.json()

    if 'items' not in data or len(data['items']) == 0:
        return None

    item = data['items'][0]
    snippet = item['snippet']
    stats = item.get('statistics', {})
    content = item.get('contentDetails', {})

    metadata = {
        'video_id': video_id,
        'title': snippet['title'],
        'description': snippet['description'],
        'channel_title': snippet['channelTitle'],
        'channel_id': snippet['channelId'],
        'published_at': snippet['publishedAt'],
        'tags': snippet.get('tags', []),
        'category_id': snippet.get('categoryId'),
        'duration': content.get('duration'),
        'view_count': int(stats.get('viewCount', 0)),
        'like_count': int(stats.get('likeCount', 0)),
        'comment_count': int(stats.get('commentCount', 0)),
        'topics': item.get('topicDetails', {}).get('topicCategories', [])
    }
    return metadata

# Example usage
metadata = collect_video_metadata('dQw4w9WgXcQ', API_KEY)
if metadata:
    print(f"Title: {metadata['title']}")
    print(f"Channel: {metadata['channel_title']}")
    print(f"Views: {metadata['view_count']:,}")
    print(f"Tags: {', '.join(metadata['tags'][:5])}")
```
Determining Content Relevance
For technical topics, the freshness of information is important. Let's add a function to assess relevance:
```python
from datetime import datetime

def calculate_content_freshness(published_date_str):
    """Assess content freshness from the ISO publication date."""
    published_date = datetime.fromisoformat(published_date_str.replace('Z', '+00:00'))
    age_days = (datetime.now(published_date.tzinfo) - published_date).days

    if age_days < 30:
        return 'very_fresh'
    elif age_days < 180:
        return 'fresh'
    elif age_days < 365:
        return 'moderate'
    else:
        return 'old'

def calculate_quality_score(metadata):
    """Calculate source quality score (0-7)."""
    score = 0

    # Popularity
    views = metadata['view_count']
    if views > 100000:
        score += 3
    elif views > 10000:
        score += 2
    elif views > 1000:
        score += 1

    # Engagement (likes relative to views)
    if views > 0:
        like_ratio = metadata['like_count'] / views
        if like_ratio > 0.05:
            score += 2
        elif like_ratio > 0.02:
            score += 1

    # Freshness
    freshness = calculate_content_freshness(metadata['published_at'])
    if freshness == 'very_fresh':
        score += 2
    elif freshness == 'fresh':
        score += 1

    return score

# Usage
metadata = collect_video_metadata('dQw4w9WgXcQ', API_KEY)
quality = calculate_quality_score(metadata)
freshness = calculate_content_freshness(metadata['published_at'])

print(f"Quality Score: {quality}/7")
print(f"Freshness: {freshness}")
```
Parsing Comments for Contextual Analysis
Comments can contain valuable information: corrections to errors in the video, additional resources, frequent questions. For RAG systems, this provides additional context.
```python
def get_video_comments(video_id, api_key, max_results=100):
    """Retrieve top-level comments for a video, paginating as needed."""
    url = 'https://www.googleapis.com/youtube/v3/commentThreads'
    comments = []
    next_page_token = None

    while len(comments) < max_results:
        params = {
            'part': 'snippet',
            'videoId': video_id,
            'maxResults': min(100, max_results - len(comments)),
            'order': 'relevance',  # Sort by relevance
            'key': api_key
        }
        if next_page_token:
            params['pageToken'] = next_page_token

        response = requests.get(url, params=params)
        data = response.json()

        if 'items' not in data:
            break

        for item in data['items']:
            top_comment = item['snippet']['topLevelComment']['snippet']
            comments.append({
                'author': top_comment['authorDisplayName'],
                'text': top_comment['textDisplay'],
                'like_count': top_comment['likeCount'],
                'published_at': top_comment['publishedAt'],
                'reply_count': item['snippet']['totalReplyCount']
            })

        next_page_token = data.get('nextPageToken')
        if not next_page_token:
            break

    return comments

def filter_valuable_comments(comments, min_likes=5):
    """Filter valuable comments.

    Value criteria:
    1. Many likes (popularity)
    2. Has replies (sparked discussion)
    3. Long text (detailed comment)
    """
    valuable = []
    for comment in comments:
        if (comment['like_count'] >= min_likes or
                comment['reply_count'] > 0 or
                len(comment['text']) > 200):
            valuable.append(comment)
    return valuable

# Usage
comments = get_video_comments('dQw4w9WgXcQ', API_KEY, max_results=50)
valuable_comments = filter_valuable_comments(comments)

print(f"Total comments: {len(comments)}")
print(f"Valuable comments: {len(valuable_comments)}")

for comment in valuable_comments[:3]:
    print(f"\n[{comment['like_count']} likes] {comment['author']}:")
    print(comment['text'][:200])
```
Using Proxies for Scaling Data Collection
When scaling data collection, two problems arise: YouTube API limits (10,000 quota units per day) and blocks when parsing subtitles. Proxies help solve both issues.
When Proxies are Needed for YouTube Parsing
- Exceeding API Quotas — using multiple API keys through different IPs can increase the daily limit
- Parsing Subtitles Bypassing API — the youtube-transcript-api library makes direct requests that may be blocked at high frequency
- Data Collection from Different Regions — some videos are only available in certain countries
- Parallel Collection — distributing the load across multiple IPs to speed up the process
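Whichever path is chosen, throttled requests (HTTP 429 or 403 quota errors) are best handled with exponential backoff before rotating to a new IP. A small stdlib-only sketch; the helper and the `fetch`/`sleep` conventions are our own, used here only to illustrate the retry pattern:

```python
import random
import time

def with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on None (meaning 'throttled'), doubling the
    delay each attempt with a little jitter. `sleep` is injectable so
    tests don't actually wait."""
    for attempt in range(max_retries):
        result = fetch()
        if result is not None:
            return result
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        sleep(delay)
    raise RuntimeError('Still throttled after retries')

# Usage: a fake fetcher that is throttled twice, then succeeds
attempts = []
def flaky_fetch():
    attempts.append(1)
    return {'ok': True} if len(attempts) > 2 else None

result = with_backoff(flaky_fetch, sleep=lambda d: None)
print(result, len(attempts))  # {'ok': True} 3
```

In a real collector, `fetch` would wrap the HTTP call and return None on a 429/403 response; combining backoff with proxy rotation reduces both wasted quota and block rates.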
Choosing the Type of Proxy
| Proxy Type | Advantages | When to Use |
|---|---|---|
| Data Center | High speed, low cost | Working with API, small volumes |
| Residential | Low risk of blocks, real IPs | Mass subtitle parsing, bypassing restrictions |
| Mobile | Maximum trust, rare blocks | Data collection from mobile apps, critical tasks |
For most RAG system tasks, residential proxies are suitable — they provide a balance between cost and reliability when scaling parsing.
Setting Up Proxies in Code
```python
import requests
from youtube_transcript_api import YouTubeTranscriptApi

# Setting up the proxy
PROXY = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'http://username:password@proxy-server:port'
}

# Working with the API through a proxy
def get_video_details_with_proxy(video_id, api_key, proxy):
    url = 'https://www.googleapis.com/youtube/v3/videos'
    params = {
        'part': 'snippet,statistics',
        'id': video_id,
        'key': api_key
    }
    response = requests.get(url, params=params, proxies=proxy, timeout=10)
    return response.json()

# Parsing subtitles through a proxy.
# Note: the pre-1.0 youtube-transcript-api accepts a requests-style
# proxies dict directly, so there is no need to touch its internals.
class ProxiedTranscriptApi:
    def __init__(self, proxy):
        self.proxy = proxy

    def get_transcript(self, video_id, languages=('en',)):
        return YouTubeTranscriptApi.get_transcript(
            video_id, languages=list(languages), proxies=self.proxy
        )

# Usage
api = ProxiedTranscriptApi(PROXY)
transcript = api.get_transcript('dQw4w9WgXcQ', languages=['ru', 'en'])
print(f"Retrieved {len(transcript)} segments of subtitles")
```
Proxy Rotation for Scaling
When collecting data from thousands of videos, it is important to distribute the load across several proxies:
```python
import random
import time

class ProxyRotator:
    def __init__(self, proxy_list):
        self.proxies = proxy_list
        self.current_index = 0

    def get_next_proxy(self):
        """Sequential (round-robin) rotation"""
        proxy = self.proxies[self.current_index]
        self.current_index = (self.current_index + 1) % len(self.proxies)
        return proxy

    def get_random_proxy(self):
        """Random rotation"""
        return random.choice(self.proxies)

# List of proxies
PROXY_LIST = [
    {'http': 'http://user:pass@proxy1:port', 'https': 'http://user:pass@proxy1:port'},
    {'http': 'http://user:pass@proxy2:port', 'https': 'http://user:pass@proxy2:port'},
    {'http': 'http://user:pass@proxy3:port', 'https': 'http://user:pass@proxy3:port'},
]

rotator = ProxyRotator(PROXY_LIST)

def collect_data_with_rotation(video_ids):
    results = []
    for video_id in video_ids:
        proxy = rotator.get_next_proxy()
        try:
            # Get metadata
            metadata = get_video_details_with_proxy(video_id, API_KEY, proxy)

            # Get subtitles
            api = ProxiedTranscriptApi(proxy)
            transcript = api.get_transcript(video_id)

            results.append({
                'video_id': video_id,
                'metadata': metadata,
                'transcript': transcript
            })

            # Delay between requests
            time.sleep(1)
        except Exception as e:
            print(f"Error for {video_id}: {e}")
            continue
    return results

# Usage
video_ids = ['video1', 'video2', 'video3', 'video4', 'video5']
data = collect_data_with_rotation(video_ids)
print(f"Data collected for {len(data)} videos")
```
Processing and Preparing Data for RAG
After collecting data, it needs to be processed and structured for the effective operation of RAG systems.
Creating Vector Embeddings
RAG systems use vector search to find relevant fragments. Text needs to be transformed into embeddings:
```python
from sentence_transformers import SentenceTransformer

# Load the model for creating embeddings
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def create_embeddings_from_transcript(transcript_chunks):
    """Create embeddings for transcript chunks"""
    embeddings = []
    for chunk in transcript_chunks:
        # Combine text with metadata for better context
        text_with_context = f"{chunk['title']} | {chunk['text']}"

        # Create the embedding
        embedding = model.encode(text_with_context)

        embeddings.append({
            'video_id': chunk['video_id'],
            'timestamp': chunk['timestamp'],
            'text': chunk['text'],
            'embedding': embedding.tolist(),
            'metadata': {
                'title': chunk['title'],
                'channel': chunk['channel'],
                'views': chunk['views']
            }
        })
    return embeddings

def prepare_rag_data(video_data):
    """Prepare all data for RAG"""
    all_chunks = []

    for video in video_data:
        # Metadata is expected in the flat form produced by collect_video_metadata()
        metadata = video['metadata']
        transcript = video['transcript']

        # Break subtitles into chunks
        chunks = create_chunks_with_timestamps(transcript)

        # Add metadata to each chunk
        for chunk in chunks:
            chunk['video_id'] = video['video_id']  # set explicitly for every chunk
            chunk['title'] = metadata['title']
            chunk['channel'] = metadata['channel_title']
            chunk['views'] = metadata['view_count']
            all_chunks.append(chunk)

    # Create embeddings
    embeddings = create_embeddings_from_transcript(all_chunks)
    return embeddings

# Usage
rag_data = prepare_rag_data(collected_videos)
print(f"Prepared {len(rag_data)} fragments for RAG")
```
Saving to a Vector Database
For efficient searching, embeddings are saved in specialized databases. Popular options include Pinecone, Weaviate, Qdrant, ChromaDB.
```python
import chromadb

# Initialize ChromaDB (local vector DB).
# Note: recent chromadb versions use PersistentClient; the old
# Settings(chroma_db_impl="duckdb+parquet") configuration is deprecated.
client = chromadb.PersistentClient(path="./youtube_rag_db")

# Create the collection
collection = client.create_collection(
    name="youtube_transcripts",
    metadata={"description": "YouTube video transcripts for RAG"}
)

def store_in_vector_db(embeddings_data, collection):
    """Save embeddings to the vector DB"""
    ids = []
    embeddings = []
    documents = []
    metadatas = []

    for i, item in enumerate(embeddings_data):
        ids.append(f"{item['video_id']}_{i}")
        embeddings.append(item['embedding'])
        documents.append(item['text'])

        # Convert the MM:SS timestamp back to seconds for a deep link
        minutes, seconds = item['timestamp'].split(':')
        t_seconds = int(minutes) * 60 + int(seconds)

        metadatas.append({
            'video_id': item['video_id'],
            'timestamp': item['timestamp'],
            'title': item['metadata']['title'],
            'channel': item['metadata']['channel'],
            'views': str(item['metadata']['views']),
            'youtube_url': f"https://youtube.com/watch?v={item['video_id']}&t={t_seconds}s"
        })

    # Add everything to the collection
    collection.add(
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas
    )
    print(f"Saved {len(ids)} embeddings to vector DB")

# Usage
store_in_vector_db(rag_data, collection)
```
Searching and Generating Answers
The final step is implementing RAG search and answer generation:
```python
def search_youtube_knowledge(query, collection, model, top_k=3):
    """Search for relevant fragments from YouTube"""
    # Create the query embedding
    query_embedding = model.encode(query).tolist()

    # Search in the vector DB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    # Format the results
    sources = []
    for i in range(len(results['ids'][0])):
        sources.append({
            'text': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'distance': results['distances'][0][i] if 'distances' in results else None
        })
    return sources

def generate_rag_answer(query, sources, llm_api_key):
    """Generate an answer based on the retrieved sources"""
    # Build the context from the retrieved sources
    context = "\n\n".join([
        f"Source: {s['metadata']['title']} ({s['metadata']['timestamp']})\n{s['text']}"
        for s in sources
    ])

    # Prompt for the LLM
    prompt = f"""Based on the following fragments from YouTube videos, answer the user's question.
Make sure to cite sources with timestamps.

Context:
{context}

Question: {query}

Answer:"""

    # Call your LLM here (OpenAI, Claude, a local model). Example with OpenAI:
    # response = openai.ChatCompletion.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}]
    # )
    # answer = response.choices[0].message.content

    # Placeholder return while no LLM is wired in
    return {
        'answer': 'This will be the LLM answer',
        'sources': sources
    }

# Usage
query = "How to set up a proxy in Python?"
sources = search_youtube_knowledge(query, collection, model, top_k=3)

print("Found sources:")
for source in sources:
    print(f"\n{source['metadata']['title']}")
    print(f"Time: {source['metadata']['timestamp']}")
    print(f"Link: {source['metadata']['youtube_url']}")
    print(f"Text: {source['text'][:200]}...")
```
Optimizing RAG Quality
A few tips for improving the quality of RAG systems using YouTube data:
- Filter low-quality content — use metrics of views, likes, and recency of publication
- Preserve context — add video and channel titles to each text chunk
- Optimize chunk sizes — for technical content, 300-500 words is optimal
- Use metadata for ranking — fresher and more popular content may receive priority
- Add comments — they often contain important clarifications and FAQs
- Check video availability — some videos may be deleted or become private
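The last point can be automated: videos.list returns an empty `items` array for deleted or private videos, so availability reduces to checking that field. A small sketch for pruning stale chunks (the function names are our own):

```python
def video_is_available(videos_list_response):
    """videos.list returns an empty 'items' array for deleted or private
    videos, so an empty list means the video is gone."""
    return bool(videos_list_response.get('items'))

def prune_unavailable(chunks, availability_by_id):
    """Drop stored chunks whose source video is no longer available."""
    return [c for c in chunks if availability_by_id.get(c['video_id'], False)]

# Usage with mocked API responses
live = {'items': [{'id': 'abc123'}]}
gone = {'items': []}
print(video_is_available(live), video_is_available(gone))  # True False

chunks = [{'video_id': 'abc123', 'text': '...'}, {'video_id': 'gone404', 'text': '...'}]
kept = prune_unavailable(chunks, {'abc123': True, 'gone404': False})
print(len(kept))  # 1
```

Running such a check periodically keeps the knowledge base from citing videos that viewers can no longer open.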
Tip: For large-scale YouTube data collection, it is recommended to use a combination of the official API (for metadata) and parsing through proxies (for subtitles). This allows you to bypass limits and get the most information.
Conclusion
Collecting YouTube data for RAG systems is a multi-step process that includes working with APIs, parsing subtitles, processing metadata, and creating vector embeddings. Key points:
- YouTube Data API v3 provides metadata, statistics, and comments with a default limit of 10,000 quota units per day
- Subtitles are parsed through the youtube-transcript-api library or direct requests
- Timestamps are critically important for creating accurate citations
- Metadata (views, likes, date) helps assess the quality and relevance of content
- Comments add context and often contain FAQs
- Proxies are necessary for scaling and bypassing limits
- Vector embeddings and specialized databases provide fast semantic search
With the right process setup, you can collect tens of thousands of quality content fragments per day, creating a powerful knowledge base for RAG systems in any subject area.
If you plan to scale YouTube data collection while bypassing limits and blocks, we recommend using residential proxies — they provide stability when parsing thousands of videos and minimize the risk of blocks from YouTube.