TL;DR: Bluesky crossed 50 million users in 2026 and has become the de facto destination for journalists, academics, and a meaningful chunk of the B2B world that left X. Unlike most social platforms, Bluesky is built on the open AT Protocol, which makes data extraction technically easier — but with its own quirks. This guide covers what's accessible, how to access it, and how to avoid the mistakes most teams make when starting.

A friend who covers tech for a major outlet told me last fall that her team had to rebuild their entire monitoring infrastructure when X's API became prohibitively expensive. They tried four alternatives. Three of them broke within a month. Bluesky was the one that stuck — not because it was the biggest, but because the data was the most accessible and the audience had migrated.

If you're building any kind of social listening, content analysis, or audience research tool in 2026 and you're not paying attention to Bluesky, you're missing one of the most important platforms in the new media landscape. The growth has been quiet by tech-press standards but undeniable by usage metrics. Tens of thousands of journalists, researchers, and B2B marketers are there. The conversation that used to happen on Twitter happens here now.

This guide covers the practical reality of extracting data from Bluesky. The protocol is more open than most platforms, which is genuinely useful — but it has constraints that aren't immediately obvious. Let's walk through what works.

What Makes Bluesky Different

Most social platforms are walled gardens. Their data is locked behind authentication, rate-limited APIs, and aggressive anti-scraping measures. Bluesky is fundamentally different.

Bluesky is built on the AT Protocol (Authenticated Transfer Protocol), an open standard for decentralized social networking. The architecture means:

Public posts are genuinely public. Anyone can fetch them via the AT Protocol's public endpoints. No login required. Rate limits exist, but they are far more permissive than Instagram's or TikTok's.

Data structure is documented and stable. Unlike scraping unofficial endpoints that change weekly, AT Protocol data structures are versioned and stable. A scraper built today won't break next week because someone deployed a UI change.

Federation enables third-party access. Anyone can run their own AT Protocol relay or client. Bluesky's data isn't owned by Bluesky the company — it's distributed across the protocol.

This makes Bluesky scraping closer to public web scraping than to the cat-and-mouse game you play with Instagram or TikTok. The legal and technical risks are lower. The data quality is higher.

What You Can Access

The data you can pull from Bluesky's public AT Protocol:

Profile data: display name, handle (e.g., @user.bsky.social), bio, follower count, following count, post count, profile image, banner, join date, badges, pinned post.

Posts (skeets): post text, timestamp, author, mentions, hashtags, URLs, embedded images, embedded video (since 2025), reposts, quote posts, reply context, like count, repost count, reply count.

Threads: full reply trees with all branches, including deeply nested conversations.

Lists and starter packs: custom curated lists of users, including who's on each list and who follows the list.

Feeds: algorithmic feeds (the equivalent of For You) and chronological feeds, including custom feeds built by third parties.

Following / followers: complete relationship graphs for any public account.

Search results: keyword search across indexed public posts.

What you can't access through the public API:

DMs (private)
Account email or phone (private)
Mute lists (private to the account owner)
Notifications (private)
Account-level analytics (only available to account owners)

The public-vs-private split is intuitive and matches what users would expect.
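Among the accessible post fields listed above, mentions, hashtags, and URLs are not something you have to regex out of the post text: the AT Protocol attaches them to the post record as rich-text facets. Here is a minimal sketch of reading them, assuming a record shaped like the `record` object inside a searchPosts result. The function name extract_entities is ours for illustration, not part of any SDK.

```python
def extract_entities(record: dict) -> dict:
    """Collect mention DIDs, hashtags, and link URLs from a post record's facets."""
    entities = {"mentions": [], "tags": [], "links": []}
    for facet in record.get("facets") or []:
        for feature in facet.get("features", []):
            kind = feature.get("$type", "")
            # Feature types are namespaced, e.g. app.bsky.richtext.facet#mention
            if kind.endswith("#mention"):
                entities["mentions"].append(feature.get("did"))
            elif kind.endswith("#tag"):
                entities["tags"].append(feature.get("tag"))
            elif kind.endswith("#link"):
                entities["links"].append(feature.get("uri"))
    return entities
```

Posts without any rich text simply have no facets key, which is why the sketch treats a missing or null value as an empty list.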

The Three Approaches to Extracting Bluesky Data

Three options, depending on your needs.

Option 1: Direct AT Protocol API

The most flexible approach. You hit Bluesky's public endpoints directly:

// Fetch a profile
const response = await fetch(
  "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile?actor=bsky.app",
  { method: "GET" },
);
const profile = await response.json();

console.log(profile.displayName); // "Bluesky"
console.log(profile.followersCount); // huge number
console.log(profile.followsCount);
console.log(profile.postsCount);

Most read endpoints don't require authentication. For some operations (writing, advanced filtering) you need an account. Bluesky's developer documentation at docs.bsky.app is comprehensive.

Pros: No middleman, lowest cost, full control, supports custom queries.

Cons: You handle pagination, rate limits, error retries, and data normalization yourself. Building a robust pipeline is real work.
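To give a sense of the pagination work mentioned above: list-style endpoints return a `cursor` field that you pass back to get the next page. The sketch below follows that cursor until it runs out. The names iter_items and PageFetcher, and the idea of injecting the HTTP call as a function, are our assumptions for illustration rather than anything Bluesky prescribes.

```python
from typing import Callable, Iterator, Optional

# A page fetcher takes a cursor (None for the first page) and returns the
# decoded JSON body of one page. This signature is an assumption of the sketch.
PageFetcher = Callable[[Optional[str]], dict]

def iter_items(fetch_page: PageFetcher, items_key: str, max_pages: int = 10) -> Iterator[dict]:
    """Yield items from successive pages by following the `cursor` field."""
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get(items_key, [])
        cursor = page.get("cursor")
        if not cursor:
            # No cursor in the response means there are no more pages
            break
```

In practice fetch_page would wrap a requests.get call that adds the cursor to the query parameters, and items_key would be the list field of the endpoint you call (for example "posts" for searchPosts). The max_pages cap is there so a bug or an unexpectedly large result set cannot loop forever.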

Option 2: AT Protocol SDK Libraries

For developers who don't want to call the raw API directly:

import { AtpAgent } from "@atproto/api";

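// Point the unauthenticated agent at the public AppView host used earlier;
// the bsky.social host expects a logged-in session for this call
const agent = new AtpAgent({ service: "https://public.api.bsky.app" });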

// Search posts
const result = await agent.app.bsky.feed.searchPosts({
  q: "climate change",
  limit: 50,
});

result.data.posts.forEach((post) => {
  console.log(post.author.displayName, ":", post.record.text);
});

The official SDK handles auth (when needed) and error handling, and it provides typed responses. That's less work than raw HTTP calls.

Pros: Cleaner code, official support, full feature coverage.

Cons: Still requires you to handle infrastructure (running scripts, scheduling, storing results).

Option 3: A Managed API Like SociaVault

For teams that want Bluesky data alongside other platforms in a consistent format:

const response = await fetch(
  "https://api.sociavault.com/v1/scrape/bluesky/profile?handle=user.bsky.social",
  { headers: { "x-api-key": "YOUR_API_KEY" } },
);
const data = await response.json();

The advantage: same authentication, same response format, same billing as your other platform integrations (Instagram, TikTok, YouTube, etc.). You don't have to maintain Bluesky-specific infrastructure.

Pros: Unified workflow across platforms, built-in retries and normalization.

Cons: Slightly higher per-call cost than direct API. You don't see every Bluesky-specific field if you need something niche.

For most teams running multi-platform monitoring, option 3 is the right pick. For Bluesky-only research projects with substantial volume, option 1 or 2 makes more sense.

Building a Real-Time Bluesky Monitor

Here's a practical example: a script that monitors Bluesky for mentions of a brand and logs them to a CSV.

import csv
from datetime import datetime
import requests
import time

BRAND_KEYWORDS = ['acme', 'acme corp', 'acmesoftware']
OUTPUT_FILE = 'bluesky_mentions.csv'
CHECK_INTERVAL_SECONDS = 300  # Every 5 minutes

def search_bluesky(keyword: str, since: str | None = None) -> list:
    """Search recent Bluesky posts for a keyword; `since` is an optional ISO 8601 timestamp."""
    url = 'https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts'
    params = {'q': keyword, 'limit': 100, 'sort': 'latest'}
    if since:
        params['since'] = since

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get('posts', [])

def append_to_csv(posts: list):
    """Save posts to CSV."""
    with open(OUTPUT_FILE, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for post in posts:
            writer.writerow([
                post.get('uri'),
                post.get('author', {}).get('handle'),
                post.get('author', {}).get('displayName'),
                post.get('record', {}).get('text', '')[:500],
                post.get('record', {}).get('createdAt'),
                post.get('likeCount', 0),
                post.get('repostCount', 0),
                post.get('replyCount', 0),
            ])

def main():
    seen_uris = set()

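    # Write the CSV header only if the file doesn't exist yet ('x' mode),
    # so restarting the script doesn't wipe earlier mentions
    try:
        with open(OUTPUT_FILE, 'x', newline='', encoding='utf-8') as f:
            csv.writer(f).writerow([
                'uri', 'handle', 'display_name', 'text',
                'created_at', 'likes', 'reposts', 'replies'
            ])
    except FileExistsError:
        pass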

    while True:
        new_posts = []
        for keyword in BRAND_KEYWORDS:
            try:
                posts = search_bluesky(keyword)
                for post in posts:
                    uri = post.get('uri')
                    if uri and uri not in seen_uris:
                        seen_uris.add(uri)
                        new_posts.append(post)
            except Exception as e:
                print(f'Error fetching "{keyword}": {e}')

        if new_posts:
            append_to_csv(new_posts)
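            # datetime.utcnow() is deprecated since Python 3.12; log a timezone-aware timestamp instead
            print(f'[{datetime.now().astimezone().isoformat(timespec="seconds")}] Saved {len(new_posts)} new mentions')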

        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == '__main__':
    main()

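Run it as a long-running process and you have a basic brand mention monitor in roughly 70 lines of code. To schedule it via cron instead, drop the while loop and persist the seen URIs between runs, since seen_uris lives only in memory. From there, extend it to push alerts to Slack, run sentiment analysis, or feed your CRM.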

What's Different About Bluesky Data

A few practical differences from other platforms worth knowing.

Handles can change

A Bluesky user can change their handle (e.g., @oldname.bsky.social → @newname.bsky.social). The DID (decentralized identifier) is the stable reference. If you're storing references, store the DID, not the handle.
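One way to act on this: post URIs returned by the API are at:// URIs whose first segment is the author's DID, so the stable identifier is available even if you only kept the URI. The sketch below extracts it and keys authors by DID, treating the handle as a label that can be overwritten. The function names are hypothetical, and the sample shapes assume the `author` object found in search results.

```python
def did_from_at_uri(uri: str) -> str:
    """Return the DID that forms the authority part of an at:// URI."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    authority = uri[len(prefix):].split("/", 1)[0]
    if not authority.startswith("did:"):
        raise ValueError(f"URI authority is not a DID: {authority!r}")
    return authority

def latest_handles(posts: list[dict]) -> dict[str, str]:
    """Map each author DID to the most recent handle seen for it."""
    handles: dict[str, str] = {}
    for post in posts:
        author = post.get("author", {})
        # Prefer the explicit DID; fall back to parsing it out of the post URI
        did = author.get("did") or did_from_at_uri(post["uri"])
        handles[did] = author.get("handle", "")
    return handles
```

With this layout a rename shows up as a changed value under the same key, rather than as a new, unrelated account.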

Verification looks different

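Bluesky verification doesn't work like X's paid checkmark. Accounts can self-verify by using a domain they control as their handle (e.g., @reuters.com), Bluesky and its designated trusted verifiers have issued blue checks since 2025, and third-party labelers add further trust signals. Don't assume Twitter/X-style verification semantics.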

Custom feeds matter

Unlike most platforms, users on Bluesky often follow custom feeds rather than the algorithmic For You. A well-built custom feed for a topic can have hundreds of thousands of subscribers. Knowing what feeds exist in your space tells you a lot about the conversation.

Reply graphs are deep

Bluesky conversations tend to be more substantive than X. Fewer posts, more replies per post, deeper threads. If you're analyzing engagement, look at reply depth and quality, not just like counts.
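Reply depth is easy to measure once you have a thread tree. The sketch below assumes a node shaped like the `thread` object returned by app.bsky.feed.getPostThread, where each node may carry a `replies` list of child nodes. The helper names are ours.

```python
def reply_depth(node: dict) -> int:
    """Length of the longest reply chain under this node (0 if it has no replies)."""
    replies = node.get("replies") or []
    return max((1 + reply_depth(child) for child in replies), default=0)

def reply_total(node: dict) -> int:
    """Total number of replies in the tree under this node."""
    replies = node.get("replies") or []
    return sum(1 + reply_total(child) for child in replies)
```

Keep in mind that the endpoint's depth parameter limits how many reply levels come back, so the measured depth is bounded by what you requested. Deleted or blocked replies appear as placeholder nodes without their own replies, which the sketch counts as leaves.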

The firehose is genuinely available

Bluesky exposes a real-time firehose of all public activity through the AT Protocol. For sufficiently advanced use cases, you can subscribe to every public event in real time. This is unprecedented at scale on a social platform — most platforms either don't offer it or charge enterprise rates.
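The raw firehose is a binary stream, so many teams start with Jetstream, a service run by Bluesky that re-emits the same events as JSON over WebSocket. The sketch below only covers the parsing step: it takes one JSON message and returns the post if the event is a newly created post. The event layout (kind, commit.operation, commit.collection, commit.record) is our reading of the Jetstream format and should be checked against its current documentation; the function name is hypothetical.

```python
import json
from typing import Optional

POST_COLLECTION = "app.bsky.feed.post"

def post_from_event(raw: str) -> Optional[dict]:
    """Return a flat dict for newly created posts, or None for any other event."""
    event = json.loads(raw)
    if event.get("kind") != "commit":
        return None  # identity and account events carry no post content
    commit = event.get("commit", {})
    if commit.get("operation") != "create" or commit.get("collection") != POST_COLLECTION:
        return None  # skip likes, follows, deletes, and so on
    record = commit.get("record", {})
    return {
        "uri": f"at://{event['did']}/{POST_COLLECTION}/{commit['rkey']}",
        "did": event["did"],
        "text": record.get("text", ""),
        "created_at": record.get("createdAt"),
    }
```

You would call this on each message received from a WebSocket client connected to a Jetstream instance, then apply the same keyword matching and storage as in the polling monitor.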

Common Pitfalls

A few things teams trip over.

Treating Bluesky like a Twitter/X clone. It's not. The audience composition, posting cadence, and engagement patterns are genuinely different. Don't assume Twitter monitoring tactics will work directly.

Underestimating the audience size. "Only" 50 million users sounds smaller than X's much larger numbers, but the Bluesky audience is concentrated in journalists, technologists, academics, and policy people — exactly the audiences that drive narrative-shaping. Per-user influence is high.

Ignoring rate limits. While more permissive than other platforms, Bluesky does rate-limit. Burst requests will get you 429s. Build in exponential backoff.

Storing handles instead of DIDs. Mentioned above but worth repeating. If your system breaks because someone renamed themselves, you've stored the wrong identifier.

Missing custom feeds. Most teams monitor profiles and search. The custom-feed layer is where a lot of the conversation organizes itself, and most data tools ignore it.
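For the rate-limit pitfall above, exponential backoff can be sketched in a few lines. This version is deliberately independent of any HTTP library: the caller supplies a function that raises RateLimited when it sees a 429. That exception class, the function name, and the default delays are all assumptions chosen for illustration, not values Bluesky documents.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimited(Exception):
    """Raised by the caller's request function when the server answers 429."""

def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on RateLimited with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% random jitter so parallel workers don't retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

If the 429 response includes a header telling you when the limit resets, waiting until that time is usually better than guessing. The sleep parameter is injectable mainly so the retry logic can be tested without real waiting.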

Frequently Asked Questions

How does Bluesky compare to Twitter/X for data extraction in 2026?

Bluesky is dramatically easier and cheaper to extract data from. The AT Protocol is open. The API is free for most read operations. There's no equivalent of X's $100-$5,000/month API tiers. For data infrastructure, Bluesky is the better foundation.

Will Bluesky lock down its API?

Unlikely. The platform was specifically architected around the open AT Protocol, and locking it down would contradict the core value proposition that drove its growth. Some specific endpoints may evolve, but the public-data accessibility is structural.

Can I do at-scale data extraction without violating terms?

Yes. Bluesky's terms permit programmatic access to public data through their official API, which is exactly what you'd be doing. The federated nature of the protocol means there's no platform-level objection to third-party data access of public content.

How do I find historical data?

The Bluesky API supports pagination back through historical posts. For deeper historical data, AT Protocol relays archive content and some third-party services maintain public archives. Coverage is good but not complete — Bluesky predates 2023, but most data accumulation started after 2024.

Is Bluesky data useful for sentiment analysis?

Yes, particularly for topics that have moved to Bluesky. Climate, journalism, US/UK politics, technology criticism, science communication — sentiment on Bluesky often diverges meaningfully from X. For a complete picture, monitoring both platforms is increasingly necessary.

What about Bluesky's anti-scraping measures?

There aren't really any in the typical sense. The platform is built on protocol-level openness. If you're hammering an endpoint with malformed requests, you'll get rate-limited. If you're making reasonable requests, you'll get reasonable responses indefinitely.

Try SociaVault free: 50 free credits to extract Bluesky data alongside other platforms.

Bluesky Scraping API Guide: How to Extract Data from the Fastest-Growing Social Platform of 2026