Scraping YouTube Transcripts for Competitor Content Strategy (Python)
Most content marketers analyze their competitors by looking at video titles, thumbnails, and view counts. This is surface-level analysis.
If you want to truly understand why a competitor's video went viral, or what specific topics they are covering that you are missing, you need to look at the actual words spoken in the video.
The problem: Watching 100 hours of competitor YouTube videos to take notes is impossible.
The solution: Use Python to programmatically download the transcripts of every video your competitor has ever published, feed that text into a Natural Language Processing (NLP) pipeline, and instantly generate a map of their entire content strategy.
In this guide, we will build a Python pipeline that extracts YouTube transcripts at scale and uses AI to find content gaps you can exploit.
Why Transcripts > Titles
Titles are designed for clickbait. Transcripts contain the actual value.
By analyzing transcripts, you can extract:
- Keyword Density: What specific industry terms do they mention most often?
- Sponsorship Data: Are they consistently mentioning a specific brand or affiliate link in the middle of their videos?
- Content Gaps: If a competitor has 50 videos on "React" but never mentions "Server Components," you have just found a highly targeted topic you can rank for.
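The keyword-density idea above doesn't need anything fancy to start with. Here is a minimal sketch using only the standard library (the stopword list and sample text are illustrative, not part of any real pipeline):

```python
from collections import Counter
import re

# A tiny stopword list for illustration; a real pipeline would use a fuller one
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
             "that", "this", "you", "we", "for", "on", "with", "so"}

def keyword_density(transcript: str, top_n: int = 5):
    """Return the most frequent non-stopword terms in a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)

sample = ("Server components change how React apps fetch data. "
          "With server components, React renders on the server first.")
print(keyword_density(sample, top_n=3))
# → [('server', 3), ('components', 2), ('react', 2)]
```

Run this over every transcript in your dataset and the top terms per channel become a rough topic map for free.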
Architecture: The Python Extraction Pipeline
We don't need to use heavy tools like Selenium or Puppeteer for this. YouTube's internal API for transcripts is surprisingly accessible if you know how to call it. We will use the youtube-transcript-api library, which mimics the requests made by the YouTube web player.
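Assuming a standard Python 3 environment, the dependencies install with pip (the `openai` package is only needed for the analysis step later):

```shell
pip install youtube-transcript-api pandas openai
```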
The Python Script
This script takes a list of YouTube Video IDs, downloads the full text transcripts, cleans the data, and saves it to a Pandas DataFrame for analysis.
```python
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter
import pandas as pd
import time

# List of competitor video IDs (found in the YouTube URL: watch?v=VIDEO_ID)
COMPETITOR_VIDEOS = [
    "dQw4w9WgXcQ",
    "jNQXAC9IVRw",
    "3JZ_D3ELwOQ"
]

def extract_transcripts(video_ids):
    print(f"🚀 Starting transcript extraction for {len(video_ids)} videos...")
    dataset = []
    formatter = TextFormatter()

    for video_id in video_ids:
        try:
            print(f"Fetching transcript for {video_id}...")

            # Fetch the transcript (defaults to English).
            # Note: this is the classic static API; youtube-transcript-api 1.0+
            # replaced it with YouTubeTranscriptApi().fetch(video_id).
            transcript_list = YouTubeTranscriptApi.get_transcript(video_id)

            # Format the list of caption segments into a single block of text
            full_text = formatter.format_transcript(transcript_list)

            # Approximate video length from the last caption's timestamp
            duration_seconds = transcript_list[-1]['start'] + transcript_list[-1]['duration']

            dataset.append({
                "video_id": video_id,
                "transcript": full_text.replace('\n', ' '),  # Clean newlines
                "word_count": len(full_text.split()),
                "duration_minutes": round(duration_seconds / 60, 2)
            })

            # Be polite to the API
            time.sleep(2)

        except Exception as e:
            print(f"❌ Failed to fetch {video_id}: {e}")

    # Convert to a pandas DataFrame for easy analysis
    return pd.DataFrame(dataset)

# Run the extraction
df_transcripts = extract_transcripts(COMPETITOR_VIDEOS)

# Save to CSV
df_transcripts.to_csv("competitor_transcripts.csv", index=False)
print("✅ Saved transcripts to competitor_transcripts.csv")

# Example analysis: find videos that mention a specific keyword
keyword = "pricing"
mentions = df_transcripts[df_transcripts['transcript'].str.contains(keyword, case=False)]
print(f"\nFound {len(mentions)} videos mentioning '{keyword}'.")
```
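The `COMPETITOR_VIDEOS` list expects bare video IDs, but when you collect links from a channel page you usually have full URLs. A small helper, sketched here with only the standard library, normalizes the common URL shapes:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url: str):
    """Pull the video ID out of common YouTube URL shapes; None if unrecognized."""
    parsed = urlparse(url)
    if parsed.hostname in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        if parsed.path.startswith(("/shorts/", "/embed/")):
            return parsed.path.split("/")[2]
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return None

print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # → dQw4w9WgXcQ
print(extract_video_id("https://youtu.be/jNQXAC9IVRw"))                 # → jNQXAC9IVRw
```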
Step 2: LLM Topic Extraction
Once you have the CSV of transcripts, the real magic happens. You can pass these transcripts to the OpenAI API to summarize the core arguments and find weaknesses.
```python
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

# Take one transcript from the DataFrame built in the previous step
transcript_text = df_transcripts.iloc[0]["transcript"]

# Truncate to fit comfortably within the context window
transcript_snippet = transcript_text[:10000]

prompt = f"""
Analyze the following YouTube video transcript.
1. What are the 3 main topics covered?
2. What important related topics did the creator FAIL to mention? (Content Gaps)

Transcript: {transcript_snippet}
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
```
Cost Considerations
Extracting transcripts is incredibly cheap, but processing them with AI can add up.
| Component | 100 Videos | 10,000 Videos | Cost Optimization Strategy |
|---|---|---|---|
| Extraction (Python) | $0.00 | $0.00 | The youtube-transcript-api is free and open-source. |
| Storage (CSV/DB) | $0.00 | $1.00 | Text data is extremely lightweight. |
| LLM Analysis (GPT-4o-mini) | $2.00 | $200.00 | Only send the first 20% of the transcript to the LLM (intros usually contain the thesis). |
| Total | $2.00 | $201.00 | ROI: Discovering one untapped SEO keyword can drive thousands of views. |
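The truncation strategy from the last row is trivial to implement. A sketch, resting on the assumption stated in the table (intros usually carry the thesis):

```python
def first_fraction(transcript: str, fraction: float = 0.2) -> str:
    """Keep roughly the first `fraction` of a transcript, by word count."""
    words = transcript.split()
    cutoff = max(1, int(len(words) * fraction))
    return " ".join(words[:cutoff])

sample = "one two three four five six seven eight nine ten"
print(first_fraction(sample))  # → "one two"
```

For videos where the payoff comes late (tutorials, rankings), consider sending the first and last 10% instead.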
Best Practices
Do's
✅ Handle Missing Transcripts Gracefully - Not all videos have transcripts enabled. Some only have auto-generated ones, and some have none. Wrap your API calls in try/except blocks so one failed video doesn't crash your entire script.
✅ Use NLP Libraries for Keyword Extraction - Before paying for OpenAI, use free Python libraries like spaCy or NLTK to extract the most common noun phrases (e.g., "machine learning", "sales funnel") to build a basic topic map.
✅ Respect Rate Limits - Even though this isn't a traditional API, hitting YouTube's servers 1,000 times a second will get your IP temporarily banned. Add a time.sleep(2) between requests.
Don'ts
❌ Don't rely on auto-generated punctuation - YouTube's auto-generated transcripts often lack periods and commas. If you are feeding this text into an LLM, be aware that it is one massive run-on sentence.
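Because auto-generated transcripts arrive as one unpunctuated stream, you can't split them on sentence boundaries. One workable approach (a sketch, not the only option) is to cut them into overlapping word windows before sending them to an LLM, so no chunk loses its surrounding context entirely:

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50):
    """Split unpunctuated transcript text into overlapping word windows."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words("word " * 1200, chunk_size=500, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # → 3 [500, 500, 300]
```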
❌ Don't scrape copyrighted content for republication - Scraping transcripts for internal competitive analysis is generally considered fair use. Republishing them as auto-generated blog posts on your own site is a different matter: you are reproducing the creator's words wholesale, which invites copyright claims.
❌ Don't forget translation - If your competitor operates globally, use the API's built-in translation feature to pull transcripts in your native language.
Conclusion
Content strategy should not be based on guessing what your competitors are talking about. It should be based on hard data.
Before (Manual Research):
- You watch 5 competitor videos at 2x speed.
- You take subjective notes.
- You miss the subtle trends and keywords they are targeting.
After (Programmatic Analysis):
- You download 500 competitor transcripts in 10 minutes.
- Python and AI map out their entire content matrix.
- You identify exactly what topics they are ignoring, and you create content to fill those gaps.
The investment: A simple Python script. The return: A content strategy grounded in data instead of guesswork.
Need to scale this across thousands of channels without getting blocked? SociaVault provides enterprise-grade extraction APIs. Try it free: sociavault.com