Social Media Data Quality: Validate, Clean & Enrich Extracted Data
You extract 10,000 Instagram profiles. You build dashboards, run analytics, generate reports. Everything looks great—until stakeholders notice:
- Follower counts are duplicated (same number for different days)
- 30% of locations are "Unknown" or null
- Engagement rates are wildly incorrect (someone with 100 followers and 50k likes)
- Bot accounts skew your influencer analysis
- Dates are in 5 different formats
The brutal truth: Garbage in, garbage out. Bad data = bad decisions = lost trust in your entire system.
The problem: Social media APIs return inconsistent, incomplete, and sometimes incorrect data. Most developers dump it straight into their database without validation. When analytics break, they blame the API—but the real issue is lack of data quality checks.
The solution: Production-grade data quality pipelines that validate, clean, standardize, and enrich data before storage. This guide shows you how to build bulletproof data quality systems.
Why Data Quality Matters (Real Examples)
Example 1: The $2M Influencer Mistake
Scenario: E-commerce brand pays $2M to top Instagram influencer
The data:
- Profile: 5 million followers
- Avg engagement: 150k likes per post
- Engagement rate: 3% (seems legit)
The problem: No bot detection
- 4.2 million followers were bots
- Real followers: 800k
- Real engagement rate: 0.3% (10x worse than reported)
- Result: Campaign flopped, $2M wasted
What data quality would have caught:
- Follower growth pattern (gained 3M followers in 1 week = bought followers)
- Comment quality (90% generic comments like "Nice!" from accounts with no posts)
- Engagement timing (all likes arrive in first 5 minutes = bot behavior)
Example 2: The Duplicate Data Disaster
Scenario: Social listening tool monitors brand mentions
The problem: No duplicate detection
- API returns same mention multiple times (webhooks + polling)
- Database stores 3 copies of each mention
- Metrics show 3x actual volume
- Result: False crisis alerts, wasted money on duplicates, broken analytics
What data quality would have caught:
- Duplicate detection by post ID + platform
- Deduplication before storing
Example 3: The Missing Data Mess
Scenario: Analyze competitor engagement rates
The problem: Inconsistent null handling
- Some profiles missing follower counts (API returned null)
- Engagement rate calculation:
likes / null = NaN - Dashboard shows "NaN" everywhere
- Result: Unusable analytics, lost stakeholder trust
What data quality would have caught:
- Missing value detection and default handling
- Validation before calculations
The 6 Pillars of Data Quality
Pillar 1: Completeness (Handle Missing Data)
Check: Are required fields present?
// data-completeness.js
function validateCompleteness(profile) {
const required = ['username', 'platform', 'follower_count'];
const missing = required.filter(field => !profile[field]);
if (missing.length > 0) {
return {
valid: false,
error: `Missing required fields: ${missing.join(', ')}`,
severity: 'CRITICAL'
};
}
// Check for suspicious nulls
const warnings = [];
if (!profile.engagement_rate && profile.avg_likes > 0) {
warnings.push('Engagement rate missing but likes data exists (can calculate)');
}
if (!profile.location) {
warnings.push('Location missing (less critical)');
}
return {
valid: true,
warnings
};
}
// Handle missing data
function handleMissingData(profile) {
// Strategy 1: Fill with defaults
profile.location = profile.location || 'Unknown';
profile.bio = profile.bio || '';
// Strategy 2: Calculate from other fields
if (!profile.engagement_rate && profile.follower_count > 0) {
profile.engagement_rate =
((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
}
// Strategy 3: Flag as incomplete
profile.data_quality_flags = profile.data_quality_flags || [];
if (!profile.verified_status) {
profile.data_quality_flags.push('VERIFICATION_STATUS_UNKNOWN');
}
return profile;
}
Best practices:
- Critical fields (username, platform): Reject record if missing
- Calculable fields (engagement rate): Derive from other fields
- Optional fields (location, bio): Use defaults or flag as incomplete
Pillar 2: Accuracy (Detect Incorrect Data)
Check: Does the data make sense?
// data-accuracy.js
function validateAccuracy(profile) {
const errors = [];
// Follower count sanity checks
if (profile.follower_count < 0) {
errors.push({
field: 'follower_count',
value: profile.follower_count,
error: 'Negative follower count (impossible)',
severity: 'CRITICAL'
});
}
if (profile.follower_count > 500000000) {
errors.push({
field: 'follower_count',
value: profile.follower_count,
error: 'Follower count exceeds Instagram record (500M max)',
severity: 'HIGH'
});
}
// Engagement rate sanity checks
if (profile.engagement_rate < 0 || profile.engagement_rate > 100) {
errors.push({
field: 'engagement_rate',
value: profile.engagement_rate,
error: 'Engagement rate must be 0-100%',
severity: 'CRITICAL'
});
}
// Following/follower ratio checks (bot detection)
const followRatio = profile.following_count / profile.follower_count;
if (followRatio > 10 && profile.follower_count > 1000) {
errors.push({
field: 'follow_ratio',
value: followRatio,
error: 'Suspicious follow ratio (likely bot)',
severity: 'MEDIUM',
action: 'FLAG_AS_BOT'
});
}
// Post count sanity check
if (profile.post_count > 10000 && profile.account_age_days < 30) {
errors.push({
field: 'post_count',
value: profile.post_count,
error: 'Impossible posting frequency (333 posts/day)',
severity: 'HIGH',
action: 'FLAG_AS_BOT'
});
}
// Engagement impossibility check
if (profile.avg_likes > profile.follower_count * 2) {
errors.push({
field: 'avg_likes',
value: profile.avg_likes,
error: 'Likes exceed 2x followers (likely bought engagement)',
severity: 'HIGH',
action: 'FLAG_AS_SUSPICIOUS'
});
}
return { valid: errors.length === 0, errors };
}
// Auto-correct obvious errors
function correctAccuracyErrors(profile) {
// Fix negative values
if (profile.follower_count < 0) {
profile.follower_count = Math.abs(profile.follower_count);
}
// Cap at platform maximums
const platformMaxFollowers = {
instagram: 500000000,
tiktok: 150000000,
twitter: 150000000
};
const max = platformMaxFollowers[profile.platform];
if (profile.follower_count > max) {
profile.follower_count = max;
profile.data_quality_flags.push('FOLLOWER_COUNT_CAPPED');
}
// Recalculate engagement rate if suspicious
if (profile.engagement_rate < 0 || profile.engagement_rate > 100) {
profile.engagement_rate =
((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
profile.data_quality_flags.push('ENGAGEMENT_RATE_RECALCULATED');
}
return profile;
}
Accuracy checks:
- Range validation (follower count 0-500M)
- Ratio checks (following/follower ratio)
- Impossible values (negative numbers, percentages > 100%)
- Cross-field validation (likes > followers * 2)
Pillar 3: Consistency (Standardize Formats)
Check: Is data in consistent formats?
// data-consistency.js
function standardizeProfile(profile) {
// Standardize platform names
const platformMap = {
'ig': 'instagram',
'insta': 'instagram',
'Instagram': 'instagram',
'INSTAGRAM': 'instagram',
'tiktok': 'tiktok',
'TikTok': 'tiktok',
'tt': 'tiktok'
};
profile.platform = platformMap[profile.platform] || profile.platform.toLowerCase();
// Standardize usernames (remove @ symbol, lowercase)
if (profile.username.startsWith('@')) {
profile.username = profile.username.substring(1);
}
profile.username = profile.username.toLowerCase().trim();
// Standardize dates (ISO 8601)
if (profile.created_at) {
profile.created_at = new Date(profile.created_at).toISOString();
}
// Standardize numbers (remove commas, parse as int)
if (typeof profile.follower_count === 'string') {
profile.follower_count = parseInt(profile.follower_count.replace(/,/g, ''), 10);
}
// Standardize locations
profile.location = standardizeLocation(profile.location);
// Standardize boolean fields
profile.is_verified = profile.is_verified === true ||
profile.is_verified === 'true' ||
profile.is_verified === 1;
// Standardize arrays (ensure they're arrays, not strings)
if (typeof profile.hashtags === 'string') {
profile.hashtags = profile.hashtags.split(',').map(h => h.trim());
}
return profile;
}
function standardizeLocation(location) {
if (!location) return 'Unknown';
// Location aliases
const locationMap = {
'NYC': 'New York, NY',
'LA': 'Los Angeles, CA',
'SF': 'San Francisco, CA',
'United States': 'USA',
'U.S.': 'USA',
'United Kingdom': 'UK'
};
return locationMap[location] || location;
}
// Standardize post content
function standardizePost(post) {
// Standardize hashtags (lowercase, no #)
post.hashtags = post.hashtags.map(tag =>
tag.toLowerCase().replace(/^#/, '')
);
// Standardize mentions (lowercase, no @)
post.mentions = post.mentions.map(mention =>
mention.toLowerCase().replace(/^@/, '')
);
// Standardize media types
const mediaTypeMap = {
'image': 'photo',
'picture': 'photo',
'img': 'photo',
'vid': 'video',
'clip': 'video',
'carousel_album': 'carousel'
};
post.media_type = mediaTypeMap[post.media_type] || post.media_type;
// Standardize URLs (ensure https)
if (post.url && post.url.startsWith('http://')) {
post.url = post.url.replace('http://', 'https://');
}
return post;
}
Consistency rules:
- Platform names always lowercase
- Usernames without @ symbol
- Dates in ISO 8601 format
- Numbers as integers, not strings
- Booleans as true/false, not 1/0 or "true"/"false"
Pillar 4: Uniqueness (Deduplicate Data)
Check: Are there duplicate records?
// data-uniqueness.js
import crypto from 'crypto';
function generateRecordHash(profile) {
// Create unique hash from identifying fields
const key = `${profile.platform}:${profile.username}:${profile.created_at}`;
return crypto.createHash('md5').update(key).digest('hex');
}
async function checkForDuplicate(profile, db) {
const hash = generateRecordHash(profile);
// Check if hash exists
const existing = await db.query(`
SELECT id, created_at, data_hash
FROM profiles
WHERE data_hash = $1
`, [hash]);
if (existing.rows.length > 0) {
return {
isDuplicate: true,
existingId: existing.rows[0].id,
action: 'UPDATE_EXISTING' // Or 'SKIP' or 'CREATE_NEW'
};
}
return { isDuplicate: false };
}
// Fuzzy duplicate detection (for slight variations)
function fuzzyDuplicateCheck(profile, recentProfiles) {
for (const existing of recentProfiles) {
// Same platform and username = duplicate
if (existing.platform === profile.platform &&
existing.username === profile.username) {
// Check if it's truly the same or an update
const timeDiff = new Date(profile.created_at) - new Date(existing.created_at);
if (timeDiff < 60000) { // Less than 1 minute apart
return {
isDuplicate: true,
reason: 'SAME_PROFILE_RAPID_SUCCESSION',
existingId: existing.id
};
}
// Check if data is identical (not an update)
if (existing.follower_count === profile.follower_count &&
existing.following_count === profile.following_count &&
existing.post_count === profile.post_count) {
return {
isDuplicate: true,
reason: 'IDENTICAL_DATA',
existingId: existing.id
};
}
// Different data = legitimate update
return {
isDuplicate: false,
isUpdate: true,
existingId: existing.id
};
}
}
return { isDuplicate: false };
}
// Deduplication pipeline
async function deduplicateAndStore(profile, db) {
// Generate hash
profile.data_hash = generateRecordHash(profile);
// Check for exact duplicate
const duplicateCheck = await checkForDuplicate(profile, db);
if (duplicateCheck.isDuplicate) {
if (duplicateCheck.action === 'UPDATE_EXISTING') {
// Update existing record with new data
await db.query(`
UPDATE profiles
SET
follower_count = $1,
following_count = $2,
post_count = $3,
updated_at = NOW()
WHERE id = $4
`, [
profile.follower_count,
profile.following_count,
profile.post_count,
duplicateCheck.existingId
]);
return { action: 'UPDATED', id: duplicateCheck.existingId };
} else {
return { action: 'SKIPPED', reason: 'DUPLICATE' };
}
}
// Not a duplicate - insert new record
const result = await db.query(`
INSERT INTO profiles (
platform, username, follower_count, following_count,
post_count, data_hash, created_at
) VALUES ($1, $2, $3, $4, $5, $6, NOW())
RETURNING id
`, [
profile.platform,
profile.username,
profile.follower_count,
profile.following_count,
profile.post_count,
profile.data_hash
]);
return { action: 'CREATED', id: result.rows[0].id };
}
Deduplication strategies:
- Hash-based (exact duplicates)
- Fuzzy matching (slight variations)
- Time-window dedup (same record within 1 minute = duplicate)
- Update vs duplicate (same account, different data = update)
Pillar 5: Validity (Spam and Bot Detection)
Check: Is this a real account or bot/spam?
// bot-detection.js
function calculateBotScore(profile) {
let score = 0;
const flags = [];
// Check 1: Follow ratio (bots follow many, have few followers)
const followRatio = profile.following_count / profile.follower_count;
if (followRatio > 5 && profile.follower_count < 1000) {
score += 25;
flags.push('HIGH_FOLLOW_RATIO');
}
// Check 2: No profile picture
if (!profile.profile_image_url || profile.profile_image_url.includes('default')) {
score += 15;
flags.push('NO_PROFILE_IMAGE');
}
// Check 3: No bio
if (!profile.bio || profile.bio.length < 10) {
score += 10;
flags.push('MINIMAL_BIO');
}
// Check 4: Random username (lots of numbers)
const numberCount = (profile.username.match(/\d/g) || []).length;
if (numberCount > 8) {
score += 20;
flags.push('RANDOM_USERNAME');
}
// Check 5: Low post count for age
const postsPerDay = profile.post_count / profile.account_age_days;
if (postsPerDay < 0.01 && profile.account_age_days > 30) {
score += 15;
flags.push('INACTIVE_ACCOUNT');
}
// Check 6: Impossible engagement (likes > followers * 3)
if (profile.avg_likes > profile.follower_count * 3) {
score += 30;
flags.push('SUSPICIOUS_ENGAGEMENT');
}
// Check 7: Recent follower spike
if (profile.follower_growth_last_7_days > profile.follower_count * 0.5) {
score += 25;
flags.push('SUDDEN_FOLLOWER_SPIKE');
}
return {
botScore: Math.min(score, 100),
flags,
classification: score >= 70 ? 'LIKELY_BOT' :
score >= 40 ? 'SUSPICIOUS' :
'LIKELY_HUMAN'
};
}
// Advanced: Comment quality analysis
function analyzeCommentQuality(comments) {
const genericComments = [
'nice', 'cool', 'great', 'awesome', 'love it', '❤️', '🔥',
'dm me', 'check my profile', 'follow me', 'check bio'
];
let genericCount = 0;
let spamCount = 0;
comments.forEach(comment => {
const text = comment.text.toLowerCase();
// Check for generic comments
if (genericComments.some(generic => text === generic || text.includes(generic))) {
genericCount++;
}
// Check for spam patterns
if (text.includes('check my bio') ||
text.includes('dm me') ||
text.includes('link in bio')) {
spamCount++;
}
});
const genericPercentage = (genericCount / comments.length) * 100;
const spamPercentage = (spamCount / comments.length) * 100;
return {
genericPercentage,
spamPercentage,
quality: genericPercentage < 30 ? 'HIGH' :
genericPercentage < 60 ? 'MEDIUM' :
'LOW',
isSuspicious: genericPercentage > 70 || spamPercentage > 20
};
}
// Filter out bots
function filterBots(profiles) {
return profiles.filter(profile => {
const botAnalysis = calculateBotScore(profile);
profile.bot_score = botAnalysis.botScore;
profile.bot_flags = botAnalysis.flags;
// Only keep likely humans
return botAnalysis.classification !== 'LIKELY_BOT';
});
}
Bot detection signals:
- High following/follower ratio
- No profile image or bio
- Random username with many numbers
- Suspicious engagement patterns
- Recent follower spikes
- Generic comments (❤️, "nice", etc.)
Pillar 6: Timeliness (Detect Stale Data)
Check: Is this data still fresh?
// data-timeliness.js
function checkDataFreshness(profile) {
const now = new Date();
const lastUpdated = new Date(profile.last_updated);
const ageInHours = (now - lastUpdated) / (1000 * 60 * 60);
// Define freshness thresholds by data type
const freshnessThresholds = {
follower_count: 24, // Update daily
post_metrics: 6, // Update every 6 hours for recent posts
profile_info: 168, // Update weekly (7 days)
story_metrics: 1 // Update hourly (stories expire in 24h)
};
const warnings = [];
if (ageInHours > freshnessThresholds.follower_count) {
warnings.push({
field: 'follower_count',
staleness: `${ageInHours.toFixed(1)} hours old`,
recommendation: 'UPDATE_NOW'
});
}
return {
isFresh: warnings.length === 0,
warnings,
lastUpdated: lastUpdated.toISOString(),
ageInHours: ageInHours.toFixed(1)
};
}
// Auto-refresh stale data
async function refreshStaleData(profiles, apiClient) {
const staleProfiles = profiles.filter(p => {
const freshness = checkDataFreshness(p);
return !freshness.isFresh;
});
console.log(`Found ${staleProfiles.length} stale profiles, refreshing...`);
for (const profile of staleProfiles) {
try {
const freshData = await apiClient.getProfile(profile.platform, profile.username);
// Update database with fresh data
await db.query(`
UPDATE profiles
SET
follower_count = $1,
following_count = $2,
post_count = $3,
last_updated = NOW()
WHERE id = $4
`, [
freshData.follower_count,
freshData.following_count,
freshData.post_count,
profile.id
]);
} catch (error) {
console.error(`Failed to refresh ${profile.username}:`, error);
}
}
}
Freshness rules:
- Follower counts: Update daily
- Post metrics: Update every 6 hours
- Profile info: Update weekly
- Stories: Update hourly (time-sensitive)
Data Enrichment: Add Value to Raw Data
Enrichment 1: Calculate Derived Metrics
// data-enrichment.js
function enrichProfile(profile) {
// Calculate engagement rate
profile.engagement_rate =
((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
// Calculate follower growth rate
if (profile.follower_count_7_days_ago) {
const growth = profile.follower_count - profile.follower_count_7_days_ago;
profile.follower_growth_rate = (growth / profile.follower_count_7_days_ago) * 100;
}
// Calculate post frequency
profile.posts_per_week = (profile.post_count / profile.account_age_days) * 7;
// Calculate account quality score
profile.quality_score = calculateQualityScore(profile);
return profile;
}
function calculateQualityScore(profile) {
let score = 0;
// Verified accounts get bonus
if (profile.is_verified) score += 20;
// Engagement rate scoring
if (profile.engagement_rate >= 5) score += 30;
else if (profile.engagement_rate >= 2) score += 20;
else if (profile.engagement_rate >= 1) score += 10;
// Follower count scoring
if (profile.follower_count >= 100000) score += 20;
else if (profile.follower_count >= 10000) score += 15;
else if (profile.follower_count >= 1000) score += 10;
// Content consistency scoring
if (profile.posts_per_week >= 3) score += 15;
else if (profile.posts_per_week >= 1) score += 10;
// Complete profile scoring
if (profile.bio && profile.bio.length > 50) score += 10;
if (profile.profile_image_url) score += 5;
return Math.min(score, 100);
}
Enrichment 2: Add External Data
// external-enrichment.js
async function enrichWithExternalData(profile) {
// Add geographic data
if (profile.location) {
profile.geo_data = await fetchGeoData(profile.location);
}
// Add demographic estimates
profile.audience_demographics = await estimateDemographics(profile);
// Add brand safety scores
profile.brand_safety_score = await checkBrandSafety(profile);
// Add category/niche detection
profile.categories = await detectCategories(profile);
return profile;
}
async function fetchGeoData(location) {
// Use geocoding API
const response = await fetch(
`https://api.geocoding.com/search?q=${encodeURIComponent(location)}`
);
const data = await response.json();
return {
city: data.city,
state: data.state,
country: data.country,
country_code: data.country_code,
latitude: data.lat,
longitude: data.lon,
timezone: data.timezone
};
}
async function estimateDemographics(profile) {
// Analyze follower sample or use ML model
// Simplified example
return {
age_ranges: {
'13-17': 10,
'18-24': 35,
'25-34': 30,
'35-44': 15,
'45+': 10
},
gender: {
male: 45,
female: 50,
other: 5
},
top_countries: ['USA', 'UK', 'Canada']
};
}
function detectCategories(profile) {
const categories = [];
const bio = (profile.bio || '').toLowerCase();
const recentPosts = profile.recent_posts || [];
// Keyword-based categorization
const categoryKeywords = {
fitness: ['fitness', 'workout', 'gym', 'health', 'training'],
fashion: ['fashion', 'style', 'outfit', 'ootd', 'clothing'],
food: ['food', 'recipe', 'cooking', 'chef', 'restaurant'],
travel: ['travel', 'adventure', 'explore', 'wanderlust'],
tech: ['tech', 'coding', 'developer', 'startup', 'ai']
};
for (const [category, keywords] of Object.entries(categoryKeywords)) {
const matchCount = keywords.filter(kw =>
bio.includes(kw) ||
recentPosts.some(post => post.caption?.toLowerCase().includes(kw))
).length;
if (matchCount >= 2) {
categories.push(category);
}
}
return categories;
}
Complete Data Quality Pipeline
Putting it all together:
// quality-pipeline.js
async function processDataQualityPipeline(rawProfile) {
console.log(`Processing profile: ${rawProfile.username}`);
// Stage 1: Validate completeness
const completeness = validateCompleteness(rawProfile);
if (!completeness.valid) {
console.error('Completeness check failed:', completeness.error);
return { status: 'REJECTED', reason: completeness.error };
}
// Stage 2: Handle missing data
let profile = handleMissingData(rawProfile);
// Stage 3: Validate accuracy
const accuracy = validateAccuracy(profile);
if (!accuracy.valid) {
const criticalErrors = accuracy.errors.filter(e => e.severity === 'CRITICAL');
if (criticalErrors.length > 0) {
console.error('Critical accuracy errors:', criticalErrors);
return { status: 'REJECTED', reason: 'ACCURACY_FAILED', errors: criticalErrors };
}
// Auto-correct non-critical errors
profile = correctAccuracyErrors(profile);
}
// Stage 4: Standardize format
profile = standardizeProfile(profile);
// Stage 5: Check for duplicates
const duplicate = await checkForDuplicate(profile, db);
if (duplicate.isDuplicate) {
console.log('Duplicate detected, updating existing record');
return { status: 'UPDATED', id: duplicate.existingId };
}
// Stage 6: Bot detection
const botAnalysis = calculateBotScore(profile);
profile.bot_score = botAnalysis.botScore;
profile.bot_flags = botAnalysis.flags;
if (botAnalysis.classification === 'LIKELY_BOT') {
console.warn('Bot detected, flagging profile');
profile.is_bot = true;
}
// Stage 7: Check freshness
const freshness = checkDataFreshness(profile);
if (!freshness.isFresh) {
console.log('Data is stale, scheduling refresh');
// Could trigger refresh here
}
// Stage 8: Enrich data
profile = enrichProfile(profile);
profile = await enrichWithExternalData(profile);
// Stage 9: Store with quality metadata
profile.data_quality_score = calculateDataQualityScore(profile);
profile.processed_at = new Date().toISOString();
const result = await db.query(`
INSERT INTO profiles (
platform, username, follower_count, following_count,
engagement_rate, bot_score, quality_score, data_quality_flags,
processed_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, NOW())
RETURNING id
`, [
profile.platform,
profile.username,
profile.follower_count,
profile.following_count,
profile.engagement_rate,
profile.bot_score,
profile.quality_score,
JSON.stringify(profile.data_quality_flags)
]);
console.log(`Profile processed successfully: ID ${result.rows[0].id}`);
return {
status: 'SUCCESS',
id: result.rows[0].id,
qualityScore: profile.data_quality_score,
flags: profile.data_quality_flags
};
}
function calculateDataQualityScore(profile) {
let score = 100;
// Deduct for missing data
if (!profile.location) score -= 5;
if (!profile.bio) score -= 5;
// Deduct for bot suspicion
score -= profile.bot_score * 0.3;
// Deduct for data quality flags
score -= profile.data_quality_flags.length * 5;
// Bonus for verified accounts
if (profile.is_verified) score += 10;
return Math.max(0, Math.min(100, score));
}
Monitoring Data Quality
Track quality metrics over time:
// quality-monitoring.js
async function generateQualityReport() {
// Overall quality metrics
const stats = await db.query(`
SELECT
COUNT(*) as total_profiles,
AVG(data_quality_score) as avg_quality_score,
COUNT(*) FILTER (WHERE is_bot = true) as bot_count,
COUNT(*) FILTER (WHERE data_quality_score < 50) as low_quality_count,
COUNT(*) FILTER (WHERE data_quality_flags IS NOT NULL) as flagged_count
FROM profiles
WHERE created_at >= NOW() - INTERVAL '7 days'
`);
const report = stats.rows[0];
console.log('Data Quality Report (Last 7 Days)');
console.log('===================================');
console.log(`Total Profiles: ${report.total_profiles}`);
console.log(`Average Quality Score: ${parseFloat(report.avg_quality_score).toFixed(2)}/100`);
console.log(`Bots Detected: ${report.bot_count} (${(report.bot_count / report.total_profiles * 100).toFixed(1)}%)`);
console.log(`Low Quality: ${report.low_quality_count} (${(report.low_quality_count / report.total_profiles * 100).toFixed(1)}%)`);
console.log(`Flagged Records: ${report.flagged_count} (${(report.flagged_count / report.total_profiles * 100).toFixed(1)}%)`);
// Alert if quality drops
if (parseFloat(report.avg_quality_score) < 70) {
await sendAlert({
type: 'DATA_QUALITY_DEGRADATION',
severity: 'HIGH',
message: `Average data quality score dropped to ${report.avg_quality_score}/100`,
recommendation: 'Investigate data sources and validation pipeline'
});
}
}
// Run daily
cron.schedule('0 9 * * *', generateQualityReport);
Best Practices
Do's
✅ Validate at ingestion - Catch bad data before it enters database
✅ Standardize formats - Consistent data is queryable data
✅ Track quality metrics - Monitor for degradation
✅ Enrich proactively - Add value beyond raw API data
✅ Document quality rules - Team alignment on standards
✅ Flag, don't delete - Keep questionable data with warnings
Don'ts
❌ Don't trust API data blindly - Always validate
❌ Don't skip deduplication - Duplicates break analytics
❌ Don't ignore bots - They skew metrics
❌ Don't use null inconsistently - Standardize missing value handling
❌ Don't forget to monitor - Quality degrades over time
Conclusion
Data quality is the difference between insights you trust and analytics you ignore:
Before data quality pipeline:
- Duplicates, nulls, inconsistent formats
- Bot accounts skew metrics
- Missing data breaks calculations
- Stakeholders lose trust in data
After data quality pipeline:
- Clean, standardized, validated data
- Bot detection and filtering
- Complete records with enrichment
- Confidence in analytics and decisions
The investment: 30-50 hours to build. The return: Trustworthy data forever.
Ready to extract high-quality social data? SociaVault provides clean, structured data with built-in validation. Try it free: sociavault.com
Found this helpful?
Share it with others who might benefit
Ready to Try SociaVault?
Start extracting social media data with our powerful API