Back to Blog
Technical

Social Media Data Quality: Validate, Clean & Enrich Extracted Data

November 7, 2025
17 min read
By SociaVault Team
data qualitydata validationdata cleaningdata enrichmentETL

Social Media Data Quality: Validate, Clean & Enrich Extracted Data

You extract 10,000 Instagram profiles. You build dashboards, run analytics, generate reports. Everything looks great—until stakeholders notice:

  • Follower counts are duplicated (same number for different days)
  • 30% of locations are "Unknown" or null
  • Engagement rates are wildly incorrect (someone with 100 followers and 50k likes)
  • Bot accounts skew your influencer analysis
  • Dates are in 5 different formats

The brutal truth: Garbage in, garbage out. Bad data = bad decisions = lost trust in your entire system.

The problem: Social media APIs return inconsistent, incomplete, and sometimes incorrect data. Most developers dump it straight into their database without validation. When analytics break, they blame the API—but the real issue is lack of data quality checks.

The solution: Production-grade data quality pipelines that validate, clean, standardize, and enrich data before storage. This guide shows you how to build bulletproof data quality systems.

Why Data Quality Matters (Real Examples)

Example 1: The $2M Influencer Mistake

Scenario: E-commerce brand pays $2M to top Instagram influencer

The data:

  • Profile: 5 million followers
  • Avg engagement: 150k likes per post
  • Engagement rate: 3% (seems legit)

The problem: No bot detection

  • 4.2 million followers were bots
  • Real followers: 800k
  • Real engagement rate: 0.3% (10x worse than reported)
  • Result: Campaign flopped, $2M wasted

What data quality would have caught:

  • Follower growth pattern (gained 3M followers in 1 week = bought followers)
  • Comment quality (90% generic comments like "Nice!" from accounts with no posts)
  • Engagement timing (all likes arrive in first 5 minutes = bot behavior)

Example 2: The Duplicate Data Disaster

Scenario: Social listening tool monitors brand mentions

The problem: No duplicate detection

  • API returns same mention multiple times (webhooks + polling)
  • Database stores 3 copies of each mention
  • Metrics show 3x actual volume
  • Result: False crisis alerts, wasted money on duplicates, broken analytics

What data quality would have caught:

  • Duplicate detection by post ID + platform
  • Deduplication before storing

Example 3: The Missing Data Mess

Scenario: Analyze competitor engagement rates

The problem: Inconsistent null handling

  • Some profiles missing follower counts (API returned null)
  • Engagement rate calculation: likes / null = NaN
  • Dashboard shows "NaN" everywhere
  • Result: Unusable analytics, lost stakeholder trust

What data quality would have caught:

  • Missing value detection and default handling
  • Validation before calculations

The 6 Pillars of Data Quality

Pillar 1: Completeness (Handle Missing Data)

Check: Are required fields present?

// data-completeness.js
function validateCompleteness(profile) {
  const required = ['username', 'platform', 'follower_count'];
  const missing = required.filter(field => !profile[field]);
  
  if (missing.length > 0) {
    return {
      valid: false,
      error: `Missing required fields: ${missing.join(', ')}`,
      severity: 'CRITICAL'
    };
  }
  
  // Check for suspicious nulls
  const warnings = [];
  if (!profile.engagement_rate && profile.avg_likes > 0) {
    warnings.push('Engagement rate missing but likes data exists (can calculate)');
  }
  
  if (!profile.location) {
    warnings.push('Location missing (less critical)');
  }
  
  return {
    valid: true,
    warnings
  };
}

// Handle missing data
function handleMissingData(profile) {
  // Strategy 1: Fill with defaults
  profile.location = profile.location || 'Unknown';
  profile.bio = profile.bio || '';
  
  // Strategy 2: Calculate from other fields
  if (!profile.engagement_rate && profile.follower_count > 0) {
    profile.engagement_rate = 
      ((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
  }
  
  // Strategy 3: Flag as incomplete
  profile.data_quality_flags = profile.data_quality_flags || [];
  if (!profile.verified_status) {
    profile.data_quality_flags.push('VERIFICATION_STATUS_UNKNOWN');
  }
  
  return profile;
}

Best practices:

  • Critical fields (username, platform): Reject record if missing
  • Calculable fields (engagement rate): Derive from other fields
  • Optional fields (location, bio): Use defaults or flag as incomplete

Pillar 2: Accuracy (Detect Incorrect Data)

Check: Does the data make sense?

// data-accuracy.js
function validateAccuracy(profile) {
  const errors = [];
  
  // Follower count sanity checks
  if (profile.follower_count < 0) {
    errors.push({
      field: 'follower_count',
      value: profile.follower_count,
      error: 'Negative follower count (impossible)',
      severity: 'CRITICAL'
    });
  }
  
  if (profile.follower_count > 500000000) {
    errors.push({
      field: 'follower_count',
      value: profile.follower_count,
      error: 'Follower count exceeds Instagram record (500M max)',
      severity: 'HIGH'
    });
  }
  
  // Engagement rate sanity checks
  if (profile.engagement_rate < 0 || profile.engagement_rate > 100) {
    errors.push({
      field: 'engagement_rate',
      value: profile.engagement_rate,
      error: 'Engagement rate must be 0-100%',
      severity: 'CRITICAL'
    });
  }
  
  // Following/follower ratio checks (bot detection)
  const followRatio = profile.following_count / profile.follower_count;
  if (followRatio > 10 && profile.follower_count > 1000) {
    errors.push({
      field: 'follow_ratio',
      value: followRatio,
      error: 'Suspicious follow ratio (likely bot)',
      severity: 'MEDIUM',
      action: 'FLAG_AS_BOT'
    });
  }
  
  // Post count sanity check
  if (profile.post_count > 10000 && profile.account_age_days < 30) {
    errors.push({
      field: 'post_count',
      value: profile.post_count,
      error: 'Impossible posting frequency (333 posts/day)',
      severity: 'HIGH',
      action: 'FLAG_AS_BOT'
    });
  }
  
  // Engagement impossibility check
  if (profile.avg_likes > profile.follower_count * 2) {
    errors.push({
      field: 'avg_likes',
      value: profile.avg_likes,
      error: 'Likes exceed 2x followers (likely bought engagement)',
      severity: 'HIGH',
      action: 'FLAG_AS_SUSPICIOUS'
    });
  }
  
  return { valid: errors.length === 0, errors };
}

// Auto-correct obvious errors
function correctAccuracyErrors(profile) {
  // Fix negative values
  if (profile.follower_count < 0) {
    profile.follower_count = Math.abs(profile.follower_count);
  }
  
  // Cap at platform maximums
  const platformMaxFollowers = {
    instagram: 500000000,
    tiktok: 150000000,
    twitter: 150000000
  };
  
  const max = platformMaxFollowers[profile.platform];
  if (profile.follower_count > max) {
    profile.follower_count = max;
    profile.data_quality_flags.push('FOLLOWER_COUNT_CAPPED');
  }
  
  // Recalculate engagement rate if suspicious
  if (profile.engagement_rate < 0 || profile.engagement_rate > 100) {
    profile.engagement_rate = 
      ((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
    profile.data_quality_flags.push('ENGAGEMENT_RATE_RECALCULATED');
  }
  
  return profile;
}

Accuracy checks:

  • Range validation (follower count 0-500M)
  • Ratio checks (following/follower ratio)
  • Impossible values (negative numbers, percentages > 100%)
  • Cross-field validation (likes > followers * 2)

Pillar 3: Consistency (Standardize Formats)

Check: Is data in consistent formats?

// data-consistency.js
function standardizeProfile(profile) {
  // Standardize platform names
  const platformMap = {
    'ig': 'instagram',
    'insta': 'instagram',
    'Instagram': 'instagram',
    'INSTAGRAM': 'instagram',
    'tiktok': 'tiktok',
    'TikTok': 'tiktok',
    'tt': 'tiktok'
  };
  
  profile.platform = platformMap[profile.platform] || profile.platform.toLowerCase();
  
  // Standardize usernames (remove @ symbol, lowercase)
  if (profile.username.startsWith('@')) {
    profile.username = profile.username.substring(1);
  }
  profile.username = profile.username.toLowerCase().trim();
  
  // Standardize dates (ISO 8601)
  if (profile.created_at) {
    profile.created_at = new Date(profile.created_at).toISOString();
  }
  
  // Standardize numbers (remove commas, parse as int)
  if (typeof profile.follower_count === 'string') {
    profile.follower_count = parseInt(profile.follower_count.replace(/,/g, ''), 10);
  }
  
  // Standardize locations
  profile.location = standardizeLocation(profile.location);
  
  // Standardize boolean fields
  profile.is_verified = profile.is_verified === true || 
                        profile.is_verified === 'true' || 
                        profile.is_verified === 1;
  
  // Standardize arrays (ensure they're arrays, not strings)
  if (typeof profile.hashtags === 'string') {
    profile.hashtags = profile.hashtags.split(',').map(h => h.trim());
  }
  
  return profile;
}

function standardizeLocation(location) {
  if (!location) return 'Unknown';
  
  // Location aliases
  const locationMap = {
    'NYC': 'New York, NY',
    'LA': 'Los Angeles, CA',
    'SF': 'San Francisco, CA',
    'United States': 'USA',
    'U.S.': 'USA',
    'United Kingdom': 'UK'
  };
  
  return locationMap[location] || location;
}

// Standardize post content
function standardizePost(post) {
  // Standardize hashtags (lowercase, no #)
  post.hashtags = post.hashtags.map(tag => 
    tag.toLowerCase().replace(/^#/, '')
  );
  
  // Standardize mentions (lowercase, no @)
  post.mentions = post.mentions.map(mention =>
    mention.toLowerCase().replace(/^@/, '')
  );
  
  // Standardize media types
  const mediaTypeMap = {
    'image': 'photo',
    'picture': 'photo',
    'img': 'photo',
    'vid': 'video',
    'clip': 'video',
    'carousel_album': 'carousel'
  };
  
  post.media_type = mediaTypeMap[post.media_type] || post.media_type;
  
  // Standardize URLs (ensure https)
  if (post.url && post.url.startsWith('http://')) {
    post.url = post.url.replace('http://', 'https://');
  }
  
  return post;
}

Consistency rules:

  • Platform names always lowercase
  • Usernames without @ symbol
  • Dates in ISO 8601 format
  • Numbers as integers, not strings
  • Booleans as true/false, not 1/0 or "true"/"false"

Pillar 4: Uniqueness (Deduplicate Data)

Check: Are there duplicate records?

// data-uniqueness.js
import crypto from 'crypto';

function generateRecordHash(profile) {
  // Create unique hash from identifying fields
  const key = `${profile.platform}:${profile.username}:${profile.created_at}`;
  return crypto.createHash('md5').update(key).digest('hex');
}

async function checkForDuplicate(profile, db) {
  const hash = generateRecordHash(profile);
  
  // Check if hash exists
  const existing = await db.query(`
    SELECT id, created_at, data_hash
    FROM profiles
    WHERE data_hash = $1
  `, [hash]);
  
  if (existing.rows.length > 0) {
    return {
      isDuplicate: true,
      existingId: existing.rows[0].id,
      action: 'UPDATE_EXISTING' // Or 'SKIP' or 'CREATE_NEW'
    };
  }
  
  return { isDuplicate: false };
}

// Fuzzy duplicate detection (for slight variations)
function fuzzyDuplicateCheck(profile, recentProfiles) {
  for (const existing of recentProfiles) {
    // Same platform and username = duplicate
    if (existing.platform === profile.platform && 
        existing.username === profile.username) {
      
      // Check if it's truly the same or an update
      const timeDiff = new Date(profile.created_at) - new Date(existing.created_at);
      
      if (timeDiff < 60000) { // Less than 1 minute apart
        return {
          isDuplicate: true,
          reason: 'SAME_PROFILE_RAPID_SUCCESSION',
          existingId: existing.id
        };
      }
      
      // Check if data is identical (not an update)
      if (existing.follower_count === profile.follower_count &&
          existing.following_count === profile.following_count &&
          existing.post_count === profile.post_count) {
        return {
          isDuplicate: true,
          reason: 'IDENTICAL_DATA',
          existingId: existing.id
        };
      }
      
      // Different data = legitimate update
      return {
        isDuplicate: false,
        isUpdate: true,
        existingId: existing.id
      };
    }
  }
  
  return { isDuplicate: false };
}

// Deduplication pipeline
async function deduplicateAndStore(profile, db) {
  // Generate hash
  profile.data_hash = generateRecordHash(profile);
  
  // Check for exact duplicate
  const duplicateCheck = await checkForDuplicate(profile, db);
  
  if (duplicateCheck.isDuplicate) {
    if (duplicateCheck.action === 'UPDATE_EXISTING') {
      // Update existing record with new data
      await db.query(`
        UPDATE profiles
        SET 
          follower_count = $1,
          following_count = $2,
          post_count = $3,
          updated_at = NOW()
        WHERE id = $4
      `, [
        profile.follower_count,
        profile.following_count,
        profile.post_count,
        duplicateCheck.existingId
      ]);
      
      return { action: 'UPDATED', id: duplicateCheck.existingId };
    } else {
      return { action: 'SKIPPED', reason: 'DUPLICATE' };
    }
  }
  
  // Not a duplicate - insert new record
  const result = await db.query(`
    INSERT INTO profiles (
      platform, username, follower_count, following_count, 
      post_count, data_hash, created_at
    ) VALUES ($1, $2, $3, $4, $5, $6, NOW())
    RETURNING id
  `, [
    profile.platform,
    profile.username,
    profile.follower_count,
    profile.following_count,
    profile.post_count,
    profile.data_hash
  ]);
  
  return { action: 'CREATED', id: result.rows[0].id };
}

Deduplication strategies:

  • Hash-based (exact duplicates)
  • Fuzzy matching (slight variations)
  • Time-window dedup (same record within 1 minute = duplicate)
  • Update vs duplicate (same account, different data = update)

Pillar 5: Validity (Spam and Bot Detection)

Check: Is this a real account or bot/spam?

// bot-detection.js
function calculateBotScore(profile) {
  let score = 0;
  const flags = [];
  
  // Check 1: Follow ratio (bots follow many, have few followers)
  const followRatio = profile.following_count / profile.follower_count;
  if (followRatio > 5 && profile.follower_count < 1000) {
    score += 25;
    flags.push('HIGH_FOLLOW_RATIO');
  }
  
  // Check 2: No profile picture
  if (!profile.profile_image_url || profile.profile_image_url.includes('default')) {
    score += 15;
    flags.push('NO_PROFILE_IMAGE');
  }
  
  // Check 3: No bio
  if (!profile.bio || profile.bio.length < 10) {
    score += 10;
    flags.push('MINIMAL_BIO');
  }
  
  // Check 4: Random username (lots of numbers)
  const numberCount = (profile.username.match(/\d/g) || []).length;
  if (numberCount > 8) {
    score += 20;
    flags.push('RANDOM_USERNAME');
  }
  
  // Check 5: Low post count for age
  const postsPerDay = profile.post_count / profile.account_age_days;
  if (postsPerDay < 0.01 && profile.account_age_days > 30) {
    score += 15;
    flags.push('INACTIVE_ACCOUNT');
  }
  
  // Check 6: Impossible engagement (likes > followers * 3)
  if (profile.avg_likes > profile.follower_count * 3) {
    score += 30;
    flags.push('SUSPICIOUS_ENGAGEMENT');
  }
  
  // Check 7: Recent follower spike
  if (profile.follower_growth_last_7_days > profile.follower_count * 0.5) {
    score += 25;
    flags.push('SUDDEN_FOLLOWER_SPIKE');
  }
  
  return {
    botScore: Math.min(score, 100),
    flags,
    classification: score >= 70 ? 'LIKELY_BOT' : 
                   score >= 40 ? 'SUSPICIOUS' : 
                   'LIKELY_HUMAN'
  };
}

// Advanced: Comment quality analysis
function analyzeCommentQuality(comments) {
  const genericComments = [
    'nice', 'cool', 'great', 'awesome', 'love it', '❤️', '🔥',
    'dm me', 'check my profile', 'follow me', 'check bio'
  ];
  
  let genericCount = 0;
  let spamCount = 0;
  
  comments.forEach(comment => {
    const text = comment.text.toLowerCase();
    
    // Check for generic comments
    if (genericComments.some(generic => text === generic || text.includes(generic))) {
      genericCount++;
    }
    
    // Check for spam patterns
    if (text.includes('check my bio') || 
        text.includes('dm me') ||
        text.includes('link in bio')) {
      spamCount++;
    }
  });
  
  const genericPercentage = (genericCount / comments.length) * 100;
  const spamPercentage = (spamCount / comments.length) * 100;
  
  return {
    genericPercentage,
    spamPercentage,
    quality: genericPercentage < 30 ? 'HIGH' : 
            genericPercentage < 60 ? 'MEDIUM' : 
            'LOW',
    isSuspicious: genericPercentage > 70 || spamPercentage > 20
  };
}

// Filter out bots
function filterBots(profiles) {
  return profiles.filter(profile => {
    const botAnalysis = calculateBotScore(profile);
    profile.bot_score = botAnalysis.botScore;
    profile.bot_flags = botAnalysis.flags;
    
    // Only keep likely humans
    return botAnalysis.classification !== 'LIKELY_BOT';
  });
}

Bot detection signals:

  • High following/follower ratio
  • No profile image or bio
  • Random username with many numbers
  • Suspicious engagement patterns
  • Recent follower spikes
  • Generic comments (❤️, "nice", etc.)

Pillar 6: Timeliness (Detect Stale Data)

Check: Is this data still fresh?

// data-timeliness.js
function checkDataFreshness(profile) {
  const now = new Date();
  const lastUpdated = new Date(profile.last_updated);
  const ageInHours = (now - lastUpdated) / (1000 * 60 * 60);
  
  // Define freshness thresholds by data type
  const freshnessThresholds = {
    follower_count: 24, // Update daily
    post_metrics: 6, // Update every 6 hours for recent posts
    profile_info: 168, // Update weekly (7 days)
    story_metrics: 1 // Update hourly (stories expire in 24h)
  };
  
  const warnings = [];
  
  if (ageInHours > freshnessThresholds.follower_count) {
    warnings.push({
      field: 'follower_count',
      staleness: `${ageInHours.toFixed(1)} hours old`,
      recommendation: 'UPDATE_NOW'
    });
  }
  
  return {
    isFresh: warnings.length === 0,
    warnings,
    lastUpdated: lastUpdated.toISOString(),
    ageInHours: ageInHours.toFixed(1)
  };
}

// Auto-refresh stale data
async function refreshStaleData(profiles, apiClient) {
  const staleProfiles = profiles.filter(p => {
    const freshness = checkDataFreshness(p);
    return !freshness.isFresh;
  });
  
  console.log(`Found ${staleProfiles.length} stale profiles, refreshing...`);
  
  for (const profile of staleProfiles) {
    try {
      const freshData = await apiClient.getProfile(profile.platform, profile.username);
      
      // Update database with fresh data
      await db.query(`
        UPDATE profiles
        SET 
          follower_count = $1,
          following_count = $2,
          post_count = $3,
          last_updated = NOW()
        WHERE id = $4
      `, [
        freshData.follower_count,
        freshData.following_count,
        freshData.post_count,
        profile.id
      ]);
      
    } catch (error) {
      console.error(`Failed to refresh ${profile.username}:`, error);
    }
  }
}

Freshness rules:

  • Follower counts: Update daily
  • Post metrics: Update every 6 hours
  • Profile info: Update weekly
  • Stories: Update hourly (time-sensitive)

Data Enrichment: Add Value to Raw Data

Enrichment 1: Calculate Derived Metrics

// data-enrichment.js
function enrichProfile(profile) {
  // Calculate engagement rate
  profile.engagement_rate = 
    ((profile.avg_likes + profile.avg_comments) / profile.follower_count) * 100;
  
  // Calculate follower growth rate
  if (profile.follower_count_7_days_ago) {
    const growth = profile.follower_count - profile.follower_count_7_days_ago;
    profile.follower_growth_rate = (growth / profile.follower_count_7_days_ago) * 100;
  }
  
  // Calculate post frequency
  profile.posts_per_week = (profile.post_count / profile.account_age_days) * 7;
  
  // Calculate account quality score
  profile.quality_score = calculateQualityScore(profile);
  
  return profile;
}

function calculateQualityScore(profile) {
  let score = 0;
  
  // Verified accounts get bonus
  if (profile.is_verified) score += 20;
  
  // Engagement rate scoring
  if (profile.engagement_rate >= 5) score += 30;
  else if (profile.engagement_rate >= 2) score += 20;
  else if (profile.engagement_rate >= 1) score += 10;
  
  // Follower count scoring
  if (profile.follower_count >= 100000) score += 20;
  else if (profile.follower_count >= 10000) score += 15;
  else if (profile.follower_count >= 1000) score += 10;
  
  // Content consistency scoring
  if (profile.posts_per_week >= 3) score += 15;
  else if (profile.posts_per_week >= 1) score += 10;
  
  // Complete profile scoring
  if (profile.bio && profile.bio.length > 50) score += 10;
  if (profile.profile_image_url) score += 5;
  
  return Math.min(score, 100);
}

Enrichment 2: Add External Data

// external-enrichment.js
async function enrichWithExternalData(profile) {
  // Add geographic data
  if (profile.location) {
    profile.geo_data = await fetchGeoData(profile.location);
  }
  
  // Add demographic estimates
  profile.audience_demographics = await estimateDemographics(profile);
  
  // Add brand safety scores
  profile.brand_safety_score = await checkBrandSafety(profile);
  
  // Add category/niche detection
  profile.categories = await detectCategories(profile);
  
  return profile;
}

async function fetchGeoData(location) {
  // Use geocoding API
  const response = await fetch(
    `https://api.geocoding.com/search?q=${encodeURIComponent(location)}`
  );
  const data = await response.json();
  
  return {
    city: data.city,
    state: data.state,
    country: data.country,
    country_code: data.country_code,
    latitude: data.lat,
    longitude: data.lon,
    timezone: data.timezone
  };
}

async function estimateDemographics(profile) {
  // Analyze follower sample or use ML model
  // Simplified example
  return {
    age_ranges: {
      '13-17': 10,
      '18-24': 35,
      '25-34': 30,
      '35-44': 15,
      '45+': 10
    },
    gender: {
      male: 45,
      female: 50,
      other: 5
    },
    top_countries: ['USA', 'UK', 'Canada']
  };
}

function detectCategories(profile) {
  const categories = [];
  const bio = (profile.bio || '').toLowerCase();
  const recentPosts = profile.recent_posts || [];
  
  // Keyword-based categorization
  const categoryKeywords = {
    fitness: ['fitness', 'workout', 'gym', 'health', 'training'],
    fashion: ['fashion', 'style', 'outfit', 'ootd', 'clothing'],
    food: ['food', 'recipe', 'cooking', 'chef', 'restaurant'],
    travel: ['travel', 'adventure', 'explore', 'wanderlust'],
    tech: ['tech', 'coding', 'developer', 'startup', 'ai']
  };
  
  for (const [category, keywords] of Object.entries(categoryKeywords)) {
    const matchCount = keywords.filter(kw => 
      bio.includes(kw) || 
      recentPosts.some(post => post.caption?.toLowerCase().includes(kw))
    ).length;
    
    if (matchCount >= 2) {
      categories.push(category);
    }
  }
  
  return categories;
}

Complete Data Quality Pipeline

Putting it all together:

// quality-pipeline.js
async function processDataQualityPipeline(rawProfile) {
  console.log(`Processing profile: ${rawProfile.username}`);
  
  // Stage 1: Validate completeness
  const completeness = validateCompleteness(rawProfile);
  if (!completeness.valid) {
    console.error('Completeness check failed:', completeness.error);
    return { status: 'REJECTED', reason: completeness.error };
  }
  
  // Stage 2: Handle missing data
  let profile = handleMissingData(rawProfile);
  
  // Stage 3: Validate accuracy
  const accuracy = validateAccuracy(profile);
  if (!accuracy.valid) {
    const criticalErrors = accuracy.errors.filter(e => e.severity === 'CRITICAL');
    if (criticalErrors.length > 0) {
      console.error('Critical accuracy errors:', criticalErrors);
      return { status: 'REJECTED', reason: 'ACCURACY_FAILED', errors: criticalErrors };
    }
    
    // Auto-correct non-critical errors
    profile = correctAccuracyErrors(profile);
  }
  
  // Stage 4: Standardize format
  profile = standardizeProfile(profile);
  
  // Stage 5: Check for duplicates
  const duplicate = await checkForDuplicate(profile, db);
  if (duplicate.isDuplicate) {
    console.log('Duplicate detected, updating existing record');
    return { status: 'UPDATED', id: duplicate.existingId };
  }
  
  // Stage 6: Bot detection
  const botAnalysis = calculateBotScore(profile);
  profile.bot_score = botAnalysis.botScore;
  profile.bot_flags = botAnalysis.flags;
  
  if (botAnalysis.classification === 'LIKELY_BOT') {
    console.warn('Bot detected, flagging profile');
    profile.is_bot = true;
  }
  
  // Stage 7: Check freshness
  const freshness = checkDataFreshness(profile);
  if (!freshness.isFresh) {
    console.log('Data is stale, scheduling refresh');
    // Could trigger refresh here
  }
  
  // Stage 8: Enrich data
  profile = enrichProfile(profile);
  profile = await enrichWithExternalData(profile);
  
  // Stage 9: Store with quality metadata
  profile.data_quality_score = calculateDataQualityScore(profile);
  profile.processed_at = new Date().toISOString();
  
  const result = await db.query(`
    INSERT INTO profiles (
      platform, username, follower_count, following_count,
      engagement_rate, bot_score, quality_score, data_quality_flags,
      processed_at
    ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, NOW())
    RETURNING id
  `, [
    profile.platform,
    profile.username,
    profile.follower_count,
    profile.following_count,
    profile.engagement_rate,
    profile.bot_score,
    profile.quality_score,
    JSON.stringify(profile.data_quality_flags)
  ]);
  
  console.log(`Profile processed successfully: ID ${result.rows[0].id}`);
  
  return {
    status: 'SUCCESS',
    id: result.rows[0].id,
    qualityScore: profile.data_quality_score,
    flags: profile.data_quality_flags
  };
}

function calculateDataQualityScore(profile) {
  let score = 100;
  
  // Deduct for missing data
  if (!profile.location) score -= 5;
  if (!profile.bio) score -= 5;
  
  // Deduct for bot suspicion
  score -= profile.bot_score * 0.3;
  
  // Deduct for data quality flags
  score -= profile.data_quality_flags.length * 5;
  
  // Bonus for verified accounts
  if (profile.is_verified) score += 10;
  
  return Math.max(0, Math.min(100, score));
}

Monitoring Data Quality

Track quality metrics over time:

// quality-monitoring.js
async function generateQualityReport() {
  // Overall quality metrics
  const stats = await db.query(`
    SELECT 
      COUNT(*) as total_profiles,
      AVG(data_quality_score) as avg_quality_score,
      COUNT(*) FILTER (WHERE is_bot = true) as bot_count,
      COUNT(*) FILTER (WHERE data_quality_score < 50) as low_quality_count,
      COUNT(*) FILTER (WHERE data_quality_flags IS NOT NULL) as flagged_count
    FROM profiles
    WHERE created_at >= NOW() - INTERVAL '7 days'
  `);
  
  const report = stats.rows[0];
  
  console.log('Data Quality Report (Last 7 Days)');
  console.log('===================================');
  console.log(`Total Profiles: ${report.total_profiles}`);
  console.log(`Average Quality Score: ${parseFloat(report.avg_quality_score).toFixed(2)}/100`);
  console.log(`Bots Detected: ${report.bot_count} (${(report.bot_count / report.total_profiles * 100).toFixed(1)}%)`);
  console.log(`Low Quality: ${report.low_quality_count} (${(report.low_quality_count / report.total_profiles * 100).toFixed(1)}%)`);
  console.log(`Flagged Records: ${report.flagged_count} (${(report.flagged_count / report.total_profiles * 100).toFixed(1)}%)`);
  
  // Alert if quality drops
  if (parseFloat(report.avg_quality_score) < 70) {
    await sendAlert({
      type: 'DATA_QUALITY_DEGRADATION',
      severity: 'HIGH',
      message: `Average data quality score dropped to ${report.avg_quality_score}/100`,
      recommendation: 'Investigate data sources and validation pipeline'
    });
  }
}

// Run daily
cron.schedule('0 9 * * *', generateQualityReport);

Best Practices

Do's

Validate at ingestion - Catch bad data before it enters database
Standardize formats - Consistent data is queryable data
Track quality metrics - Monitor for degradation
Enrich proactively - Add value beyond raw API data
Document quality rules - Team alignment on standards
Flag, don't delete - Keep questionable data with warnings

Don'ts

Don't trust API data blindly - Always validate
Don't skip deduplication - Duplicates break analytics
Don't ignore bots - They skew metrics
Don't use null inconsistently - Standardize missing value handling
Don't forget to monitor - Quality degrades over time

Conclusion

Data quality is the difference between insights you trust and analytics you ignore:

Before data quality pipeline:

  • Duplicates, nulls, inconsistent formats
  • Bot accounts skew metrics
  • Missing data breaks calculations
  • Stakeholders lose trust in data

After data quality pipeline:

  • Clean, standardized, validated data
  • Bot detection and filtering
  • Complete records with enrichment
  • Confidence in analytics and decisions

The investment: 30-50 hours to build. The return: Trustworthy data forever.

Ready to extract high-quality social data? SociaVault provides clean, structured data with built-in validation. Try it free: sociavault.com

Found this helpful?

Share it with others who might benefit

Ready to Try SociaVault?

Start extracting social media data with our powerful API