Social Media Data Extraction: The Complete Guide
Social media platforms hold massive amounts of data: profiles, posts, comments, followers, and engagement metrics, all of it valuable for marketing, research, and product development.
But getting that data? That's where it gets complicated.
This guide covers everything: what data exists, how to get it, legal considerations, and practical code examples.
What Data Can You Extract?
User Profiles
Most platforms expose some combination of:
- Basic info: Username, display name, bio, profile picture
- Metrics: Followers, following, post count
- Verification: Blue checkmarks, business accounts
- Metadata: Account creation date, location, links
Content
- Posts/Videos: Text, media URLs, captions
- Engagement: Likes, comments, shares, views
- Timestamps: When content was posted
- Hashtags/Mentions: Tags and user mentions
Engagement Data
- Comments: Text, author, timestamp, replies
- Reactions: Like types, emoji reactions
- Shares/Reposts: Who shared, when
Network Data
- Followers: List of accounts following a user
- Following: List of accounts a user follows
- Connections: Mutual follows, relationships
Platform-by-Platform Breakdown
TikTok
| Data Type | Availability | Method |
|---|---|---|
| Profiles | ✅ Easy | API |
| Videos | ✅ Easy | API |
| Comments | ✅ Easy | API |
| Followers | ⚠️ Limited | API (first 200) |
| Analytics | ❌ Private | Business API only |
// Get TikTok profile
const profile = await fetch(
  'https://api.sociavault.com/v1/scrape/tiktok/profile?username=charlidamelio',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());

console.log({
  username: profile.data.username,
  followers: profile.data.follower_count,
  likes: profile.data.like_count,
  videos: profile.data.video_count
});
Instagram
| Data Type | Availability | Method |
|---|---|---|
| Public profiles | ✅ Easy | API |
| Public posts | ✅ Easy | API |
| Reels | ✅ Easy | API |
| Comments | ✅ Easy | API |
| Stories | ⚠️ Limited | Requires login |
| Private accounts | ❌ No | Not accessible |
// Get Instagram profile
const profile = await fetch(
  'https://api.sociavault.com/v1/scrape/instagram/profile?username=natgeo',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());

// Get recent posts
const posts = await fetch(
  'https://api.sociavault.com/v1/scrape/instagram/posts?username=natgeo&count=20',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());
YouTube
| Data Type | Availability | Method |
|---|---|---|
| Channels | ✅ Easy | API |
| Videos | ✅ Easy | API |
| Comments | ✅ Easy | API |
| Transcripts | ✅ Easy | API |
| Analytics | ⚠️ Limited | Creator Studio only |
// Get YouTube channel
const channel = await fetch(
  'https://api.sociavault.com/v1/scrape/youtube/channel?channelId=UCX6OQ3DkcsbYNE6H8uQQuVA',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());

// Get video transcript (great for AI/RAG)
const transcript = await fetch(
  'https://api.sociavault.com/v1/scrape/youtube/transcript?videoId=dQw4w9WgXcQ',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());
Twitter/X
| Data Type | Availability | Method |
|---|---|---|
| Profiles | ✅ Available | API |
| Tweets | ✅ Available | API |
| Replies | ✅ Available | API |
| Followers | ⚠️ Limited | Paginated |
| Analytics | ❌ No | Not accessible |
// Get Twitter user
const user = await fetch(
  'https://api.sociavault.com/v1/scrape/twitter/user?username=elonmusk',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());

// Search tweets (encode the query so the space is sent safely)
const query = encodeURIComponent('AI startups');
const tweets = await fetch(
  `https://api.sociavault.com/v1/scrape/twitter/search?query=${query}&count=50`,
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());
LinkedIn
| Data Type | Availability | Method |
|---|---|---|
| Public profiles | ⚠️ Limited | API |
| Companies | ✅ Available | API |
| Posts | ⚠️ Limited | API |
| Connections | ❌ No | Private |
Reddit
| Data Type | Availability | Method |
|---|---|---|
| Profiles | ✅ Easy | API |
| Posts | ✅ Easy | API |
| Comments | ✅ Easy | API |
| Subreddits | ✅ Easy | API |
| Upvotes | ✅ Easy | API |
// Get subreddit posts
const posts = await fetch(
  'https://api.sociavault.com/v1/scrape/reddit/posts?subreddit=programming&sort=hot&count=50',
  { headers: { 'Authorization': `Bearer ${API_KEY}` } }
).then(r => r.json());
Extraction Methods
1. Official APIs
Pros:
- Legal and sanctioned
- Stable endpoints
- Good documentation
Cons:
- Expensive (Twitter: $100/mo+)
- Limited access
- Strict rate limits
- Long approval processes
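Strict rate limits usually show up as HTTP 429 responses, and the common way to cope is to retry with an increasing delay. Here is a minimal sketch of that idea. `RateLimitError`, `call_with_backoff`, and the delay values are illustrative assumptions, not part of any platform's SDK; you would raise `RateLimitError` from your own client when it sees a 429.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error your client raises when it sees HTTP 429."""


def call_with_backoff(
    request_fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request_fn, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without actually waiting.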
2. Third-Party APIs (Recommended)
Pros:
- One API for all platforms
- No approval wait
- Affordable pricing
- Handles complexity for you
Cons:
- Costs per request
- Dependent on provider
// One API for everything
// Note: this assumes every platform exposes a /profile endpoint that takes a
// username. The YouTube and Twitter examples earlier use /channel and /user.
const platforms = ['tiktok', 'instagram', 'youtube', 'twitter'];
const profiles = await Promise.all(
  platforms.map(platform =>
    fetch(`https://api.sociavault.com/v1/scrape/${platform}/profile?username=creator123`, {
      headers: { 'Authorization': `Bearer ${API_KEY}` }
    }).then(r => r.json())
  )
);
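One thing to watch with the fan-out pattern: `Promise.all` rejects as soon as any single request fails, so one missing account discards every result. A rough Python equivalent that keeps the successful results might look like the sketch below. `fetch_all_profiles` and its `fetch_one` callback are hypothetical names; `fetch_one` stands in for whatever function performs the actual API call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def fetch_all_profiles(
    platforms: Iterable[str],
    fetch_one: Callable[[str], dict],
) -> Dict[str, Optional[dict]]:
    """Fetch one profile per platform in parallel; failed platforms map to None."""
    platforms = list(platforms)

    def safe_fetch(platform: str) -> Optional[dict]:
        try:
            return fetch_one(platform)
        except Exception:
            # Broad catch for the sketch: keep going if one platform errors out
            return None

    with ThreadPoolExecutor(max_workers=max(1, len(platforms))) as pool:
        results = list(pool.map(safe_fetch, platforms))
    return dict(zip(platforms, results))
```

In real code you would likely narrow the `except` clause to the errors your HTTP client raises and log the failure instead of silently returning `None`.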
3. Web Scraping
Pros:
- Full control
- No API costs
Cons:
- Breaks constantly
- Legal gray area
- Resource intensive
- Requires maintenance
See our Web Scraping vs API comparison.
4. Browser Extensions
Pros:
- Visual interface
- Works with your session
Cons:
- Manual process
- Doesn't scale
- Limited features
Python Implementation
import os
import requests
from typing import Dict, List
from dataclasses import dataclass

API_KEY = os.getenv('SOCIAVAULT_API_KEY')
API_BASE = 'https://api.sociavault.com/v1/scrape'


@dataclass
class Profile:
    platform: str
    username: str
    name: str
    followers: int
    following: int
    posts: int
    bio: str
    avatar_url: str

class SocialDataExtractor:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.headers = {'Authorization': f'Bearer {api_key}'}

    def get_profile(self, platform: str, username: str) -> Profile:
        response = requests.get(
            f'{API_BASE}/{platform}/profile',
            params={'username': username},
            headers=self.headers,
            timeout=30
        )
        response.raise_for_status()
        data = response.json()['data']

        # Field names differ by platform, so try each known variant in turn
        return Profile(
            platform=platform,
            username=username,
            name=data.get('nickname') or data.get('full_name') or data.get('name') or '',
            followers=data.get('follower_count') or data.get('followers') or 0,
            following=data.get('following_count') or data.get('following') or 0,
            posts=data.get('video_count') or data.get('posts_count') or 0,
            bio=data.get('bio') or data.get('description') or '',
            avatar_url=data.get('avatar_url') or data.get('profile_pic_url') or ''
        )

    def get_posts(self, platform: str, username: str, count: int = 20) -> List[Dict]:
        # TikTok exposes videos; the other platforms expose posts
        endpoint = 'videos' if platform == 'tiktok' else 'posts'
        response = requests.get(
            f'{API_BASE}/{platform}/{endpoint}',
            params={'username': username, 'count': count},
            headers=self.headers,
            timeout=30
        )
        response.raise_for_status()
        data = response.json()['data']  # parse the body once
        return data.get('posts') or data.get('videos') or []

    def search(self, platform: str, query: str, count: int = 50) -> List[Dict]:
        response = requests.get(
            f'{API_BASE}/{platform}/search',
            params={'query': query, 'count': count},
            headers=self.headers,
            timeout=30
        )
        response.raise_for_status()
        return response.json()['data']

# Usage
extractor = SocialDataExtractor(API_KEY)

# Get profile
profile = extractor.get_profile('tiktok', 'charlidamelio')
print(f"{profile.name}: {profile.followers:,} followers")

# Get recent posts
posts = extractor.get_posts('instagram', 'natgeo', count=10)
for post in posts:
    caption = post.get('caption') or ''  # captions can be missing or empty
    print(f"- {post.get('like_count', 0):,} likes: {caption[:50]}...")
JavaScript/TypeScript Implementation
interface Profile {
  platform: string;
  username: string;
  name: string;
  followers: number;
  following: number;
  posts: number;
  bio: string;
  avatarUrl: string;
}

interface Post {
  id: string;
  caption: string;
  likeCount: number;
  commentCount: number;
  timestamp: string;
  mediaUrl: string;
}

class SocialDataExtractor {
  private apiKey: string;
  private baseUrl = 'https://api.sociavault.com/v1/scrape';

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  // Shared request helper: builds the URL, sends the auth header, unwraps `data`
  private async fetch<T>(endpoint: string, params: Record<string, string>): Promise<T> {
    const url = new URL(`${this.baseUrl}${endpoint}`);
    Object.entries(params).forEach(([k, v]) => url.searchParams.set(k, v));

    // Inside the method body, `fetch` refers to the global fetch, not this.fetch
    const response = await fetch(url.toString(), {
      headers: { 'Authorization': `Bearer ${this.apiKey}` }
    });

    if (!response.ok) {
      throw new Error(`API error: ${response.status}`);
    }

    const json = await response.json();
    return json.data;
  }

  async getProfile(platform: string, username: string): Promise<Profile> {
    const data = await this.fetch<any>(`/${platform}/profile`, { username });

    // Field names differ by platform, so fall back through the known variants
    return {
      platform,
      username,
      name: data.nickname || data.full_name || data.name || '',
      followers: data.follower_count || data.followers || 0,
      following: data.following_count || data.following || 0,
      posts: data.video_count || data.posts_count || 0,
      bio: data.bio || data.description || '',
      avatarUrl: data.avatar_url || data.profile_pic_url || ''
    };
  }

  async getPosts(platform: string, username: string, count = 20): Promise<Post[]> {
    const endpoint = platform === 'tiktok' ? '/videos' : '/posts';
    const data = await this.fetch<any>(`/${platform}${endpoint}`, {
      username,
      count: count.toString()
    });

    return (data.posts || data.videos || []).map((post: any) => ({
      id: post.id || post.post_id,
      caption: post.caption || post.description || '',
      likeCount: post.like_count || post.likes || 0,
      commentCount: post.comment_count || post.comments || 0,
      timestamp: post.timestamp || post.created_at || '',
      mediaUrl: post.url || post.media_url || ''
    }));
  }

  async search(platform: string, query: string, count = 50): Promise<any[]> {
    return this.fetch<any[]>(`/${platform}/search`, { query, count: count.toString() });
  }
}

// Usage
const extractor = new SocialDataExtractor(process.env.SOCIAVAULT_API_KEY!);

const profile = await extractor.getProfile('tiktok', 'charlidamelio');
console.log(`${profile.name}: ${profile.followers.toLocaleString()} followers`);
Common Use Cases
1. Influencer Marketing
Find and vet creators:
def analyze_influencer(username, platforms=('tiktok', 'instagram')):
    # calculate_posting_frequency() and get_top_posts() are helpers you supply
    results = {}

    for platform in platforms:
        profile = extractor.get_profile(platform, username)
        posts = extractor.get_posts(platform, username, count=30)

        # Average likes + comments per post; guard against accounts with no posts
        total = sum(p.get('like_count', 0) + p.get('comment_count', 0) for p in posts)
        avg_engagement = total / len(posts) if posts else 0
        engagement_rate = (avg_engagement / profile.followers) * 100 if profile.followers > 0 else 0

        results[platform] = {
            'followers': profile.followers,
            'engagement_rate': round(engagement_rate, 2),
            'posting_frequency': calculate_posting_frequency(posts),
            'top_content': get_top_posts(posts, 3)
        }

    return results
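The two helpers used above, `calculate_posting_frequency` and `get_top_posts`, are left undefined. One plausible way to write them is sketched here. It assumes each post dict carries `like_count`, `comment_count`, and a `timestamp` in Unix seconds; those field names and the "posts per week" definition are assumptions, so adjust them to the response shape you actually get.

```python
from typing import Dict, List


def get_top_posts(posts: List[Dict], n: int = 3) -> List[Dict]:
    """Return the n posts with the most likes plus comments."""
    def engagement(post: Dict) -> int:
        return post.get("like_count", 0) + post.get("comment_count", 0)

    return sorted(posts, key=engagement, reverse=True)[:n]


def calculate_posting_frequency(posts: List[Dict]) -> float:
    """Posts divided by the number of weeks the sample spans.

    Assumes each post has a 'timestamp' field in Unix seconds.
    """
    stamps = sorted(p["timestamp"] for p in posts if "timestamp" in p)
    if len(stamps) < 2:
        return 0.0  # not enough data to measure a span
    span_days = (stamps[-1] - stamps[0]) / 86400
    if span_days == 0:
        return 0.0
    return round(len(stamps) / (span_days / 7), 2)
```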
2. Market Research
Monitor industry trends:
def track_hashtag(hashtag, platform='tiktok', days=7):
    # Note: get_hashtag_posts() is not part of the SocialDataExtractor class
    # shown earlier, and `days` is not yet used to filter the results.
    posts = extractor.get_hashtag_posts(platform, hashtag, count=500)

    return {
        'total_posts': len(posts),
        'total_views': sum(p.get('view_count', 0) for p in posts),
        'avg_engagement': calculate_avg_engagement(posts),
        'top_creators': get_top_creators(posts),
        'trending_sounds': extract_sounds(posts),
        'peak_posting_times': analyze_timestamps(posts)
    }
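As an illustration, two of the helpers referenced in `track_hashtag` could be built on `collections.Counter`. This is a sketch under assumptions: it expects each post to have an `author` string and a `timestamp` in Unix seconds, and it reports peak times as UTC hours. None of those details come from the API documentation.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def get_top_creators(posts: List[Dict], n: int = 5) -> List[Tuple[str, int]]:
    """Rank authors by how many posts they have in the sample."""
    counts = Counter(p["author"] for p in posts if p.get("author"))
    return counts.most_common(n)


def analyze_timestamps(posts: List[Dict], n: int = 3) -> List[Tuple[int, int]]:
    """Return the n busiest UTC hours as (hour, post_count) pairs."""
    hours = Counter(
        datetime.fromtimestamp(p["timestamp"], tz=timezone.utc).hour
        for p in posts
        if "timestamp" in p
    )
    return hours.most_common(n)
```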
3. Competitor Analysis
Compare social performance:
def compare_competitors(usernames):
    results = []

    for username in usernames:
        data = {
            'username': username,
            'platforms': {}
        }

        for platform in ['tiktok', 'instagram', 'youtube']:
            try:
                profile = extractor.get_profile(platform, username)
                data['platforms'][platform] = {
                    'followers': profile.followers,
                    'posts': profile.posts
                }
            except (requests.RequestException, KeyError):
                # Account missing or request failed on this platform
                data['platforms'][platform] = None

        results.append(data)

    # Rank by combined follower count across platforms that returned data
    return sorted(results, key=lambda x: sum(
        p['followers'] for p in x['platforms'].values() if p
    ), reverse=True)
4. Content Research
Find what works:
def analyze_top_content(username, platform='instagram'):
    posts = extractor.get_posts(platform, username, count=100)

    # Sort by likes, treating a missing like_count as zero
    sorted_posts = sorted(posts, key=lambda p: p.get('like_count', 0), reverse=True)
    top_posts = sorted_posts[:10]

    return {
        'top_posts': top_posts,
        'common_themes': extract_themes(top_posts),
        'optimal_length': avg_caption_length(top_posts),
        'best_hashtags': get_common_hashtags(top_posts),
        'best_posting_times': get_posting_times(top_posts)
    }
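The caption-based helpers are straightforward to sketch. The version below assumes each post stores its text under a `caption` key (possibly `None`) and that hashtags match the simple pattern `#\w+`. Both are assumptions; real hashtag rules vary by platform.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Simplified hashtag pattern; platforms differ on which characters they allow
HASHTAG_RE = re.compile(r"#(\w+)")


def get_common_hashtags(posts: List[Dict], n: int = 10) -> List[Tuple[str, int]]:
    """Count hashtags across captions, case-insensitively."""
    counts = Counter(
        tag.lower()
        for post in posts
        for tag in HASHTAG_RE.findall(post.get("caption") or "")
    )
    return counts.most_common(n)


def avg_caption_length(posts: List[Dict]) -> float:
    """Mean caption length in characters; missing captions count as empty."""
    if not posts:
        return 0.0
    total = sum(len(post.get("caption") or "") for post in posts)
    return total / len(posts)
```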
Legal Considerations
What's Generally OK
- Public data (no login required)
- Personal/internal use
- Research with consent
- Aggregate, anonymized data
What's Risky
- Private/protected data
- Violating ToS at scale
- Reselling personal data
- Scraping after cease & desist
Best Practices
- Respect robots.txt - At least read it
- Don't scrape private data - Stick to public info
- Rate limit requests - Don't hammer servers
- Store responsibly - Follow GDPR/CCPA
- Use APIs when available - Safer legally
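Two of these practices translate directly into code. The sketch below checks a URL against robots.txt rules using Python's standard `urllib.robotparser`, and spaces out requests with a small throttle. `is_allowed` and `Throttle` are illustrative names, and the sketch assumes you have already downloaded the robots.txt text yourself.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each request. The first call returns at once, and later calls sleep only for whatever part of the interval has not yet elapsed.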
Getting Started
- Sign up at sociavault.com
- Get 50 free credits - No credit card required
- Test in playground - Try endpoints at dashboard/playground
- Build your integration - Use the code examples above
Ready to extract social media data?
Get started at sociavault.com.