Bluesky's Decentralized Architecture Compared to Mastodon and Twitter/X

We're now living through the second Twitter/X exodus. The first one saw a massive spike in users of the Fediverse, more commonly known through its most popular implementation, Mastodon. This time around, it's Bluesky that's winning the popularity contest, with as many as a million new signups each day (!).

Two years ago, we analyzed the architecture of Mastodon and identified some challenges the network might encounter when scaling. The only logical next step is to take a closer look at Bluesky!

The network is currently almost entirely centralized: on the surface, it's not that different from Twitter/X. Nearly all services are run by a single company. However, Bluesky claims to have a decentralized architecture. And that's what we'll focus on: the elements that might make Bluesky significantly different from its competitors.

Is Bluesky fundamentally built in a way that guarantees that 15 years from now, the network won't share X's fate? Or are we just trading one network for another, with different leadership (although Threads already fills that niche)?

Where Bluesky comes from

The company was founded in 2021 in a rather unusual way: with the help of a no-strings-attached $13m grant from Twitter itself. The fact that Bluesky is now seen as an X alternative might sound ironic, or maybe this was Twitter's then-CEO Jack Dorsey's plan all along. We might never know!

The research on the underlying technology started earlier, in 2019, in response to the following tweet:

twitter status

Gergely Orosz's two-part series covers the company's early history, its development culture, and the evolution of the architecture.

Bluesky as we use it today has been in development since 2022, with invite-only access in 2023 and public access available only since this year. So, the project and the codebase are still very young, and evolving fast. New features might be tricky to deliver quickly, though. With the recent influx of users, I suspect most work now is around scaling rather than app or protocol development. But such scaling problems are what each startup dreams of, after all!

How to build a social network

If you were to build a new microblogging platform from scratch, the implementation would have to cover several aspects.

Of course, you can slice a system in many different ways, but since we're focusing on Bluesky, we'll follow its approach. Hence, the top-level components should include the following:

  • user data storage: the content that users produce (posts, likes, follows)
  • authentication & authorization: the way we can prove to the service that "it's us," our credentials, ways of determining what we can and can't do in a system
  • real-time streaming of new content: the "firehose," so that you can process the latest posts instantly
  • moderation: required by law, but also by users—there are whole categories of content that some users don't want to see, or that are simply illegal
  • feed generation: what makes it to the top of your feed? That's the almighty "Algorithm", which is said to influence everything, including elections
  • data aggregation: viewing a post might sound mundane on a social network, but we need to count the likes, reposts, engagements, etc.

X, Mastodon, Threads, and Bluesky indeed implement all of these, though in quite different ways.

How is X built

Analyzing the architecture of X from the point of view of centralization, the diagram is pretty simple:

x architecture

X is fully centralized. Of course, enormous complexity is hidden beneath this simple "X" box. Twitter's engineers had to adopt, and often invent, entirely new ways of scaling software to keep up (rather successfully) with the amounts of data that the network processes (or had to process).

However, X is completely in control of your data, your identity on X, and how the data is moderated (currently a flash point) and presented in a feed. The "Algorithm" is partly open-source; that helps with transparency, but you can only live with that knowledge without much chance of impacting how it works (unless you're close friends with Elon, of course).

Hence, X is a monolith. Can we do better?

How is Mastodon built

Taking our high-level centralization-oriented architectural view, the Mastodon/Fediverse network would present itself more or less as follows:

mastodon architecture

Each Mastodon instance is, in a way, a world on its own. The server's administrator and, by extension, the server's software decide the rules of participation and how the feed is generated (by default, there's no "Algorithm" at all, just a chronological list of posts). The server instance stores the user credentials, posts, and interactions and handles moderation and queries (data aggregation)—all in a single package.

The Fediverse architecture is often compared to that of email, where your messages (posts) are in the custody of your email provider (Mastodon server). Communication flows peer-to-peer; there's no central post office tasked with distribution. This works great for email but makes data aggregation a challenge for a social network: there's no single global view of any post's state, which includes replies, "favorites," and boosts.

The email-like architecture translates directly into usability issues: for example, two users with accounts on different servers might see different reply threads under the same post! Decentralization comes at a cost.

Speaking of email, it is a success story regarding decentralization. And Mastodon tries to follow its route. Just as there are some huge email providers, there are disproportionately big Mastodon instances (mastodon.social is the biggest one). Still, the network sees a distribution of users among multiple servers. That's a success of Mastodon as well:

top mastodon instances

(We have our own instance as well, softwaremill.social!)

Because of such architecture, there is a certain level of server lock-in, which is often cited as a weakness of Mastodon's architecture: your identity is tied to a server, and it's challenging to move to another one. In fact, without the cooperation of the original server, it might be impossible. So, choose your Mastodon server wisely!

How is Bluesky built

Let's finally get to Bluesky itself! You can view the Bluesky architecture in two ways: as it is used now and as it can potentially be used in the future.

When it comes to Bluesky today, for 99% of the users, a centralization-focused architectural diagram would look somewhat familiar:

bluesky architecture

That is, your data, identity, feed generation, and data aggregation are in Bluesky's (the company) control. If they pulled the plug, almost the entire network would disappear.

An important caveat here: an archive of your data might survive, as all activity on Bluesky is public and probably scraped. However, you wouldn't be able to regain your identity and continue posting on an alternative site. Of course, to be fair, the same is true if your Mastodon instance disappeared (or X, for that matter).

But in the introduction, we wondered if something in Bluesky's architecture makes it more resilient to hostile takeovers and differentiates it from X or Threads.

That's where the 1% comes in (it's probably less than that; I don't know the exact numbers). Bluesky's architecture allows—maybe even encourages—decentralization. While it's not in widespread use right now, there's potential.

An updated architectural diagram, taking into account the entire Bluesky network as it operates today (that is, including the services that operate independently), would instead look as follows:

bluesky architecture updated

In other words: X is fully centralized. Mastodon is fully decentralized. And Bluesky is somewhere in between.

"The organization is a future adversary"

The above quote is how the Bluesky team describes their approach: Bluesky's leadership might be trustworthy now, but who knows what's going to happen to the company in the future.

Bluesky is a public benefit company, so profit is not its (only) goal. However, there is VC capital involved, as well as people, which means things might go either way. OpenAI is a great example here: a company that is now open only in its name and has turned from research into a rather traditional for-profit business.

The quoted approach is reflected in how Bluesky is written and governed. Almost all components (except for a single crucial one) are open-source. Just as with X's Algorithm, this is an excellent start regarding transparency. But we can go further and self-host almost the entire Bluesky infrastructure if we have enough time & money to spare.

An overview of Bluesky's components

Let's summarize the components of Bluesky's architecture, classified by the roles we identified earlier.

Role | Bluesky name | Open source? | Self-hosting cost
---- | ---- | ---- | ----
user data storage | PDS (Personal Data Server) | yes | low
authentication & authorization | PDS + PLC server (Public Ledger of Credentials) | yes | low
real-time streaming | Relay | yes | high
moderation | Labeler | yes | low to high
feed generation | Feed generator | yes | low to high
data aggregation | AppView | no | very high

In a simplified view, the data flow is as follows:

data flow

User data storage

Let's start with the PDS servers. That's where your data lives: all of your (public) activity, such as the posts you create, the accounts you follow, your likes and blocks. Importantly, a PDS also contains the private keys used to sign your content to prove that you have created it.

What a PDS does not contain (a core difference from Mastodon) is any information about your followers and their actions on your posts; there's no peer-to-peer communication between PDS servers.
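To make "your data lives on a PDS" a bit more concrete, here's a minimal Python sketch of reading an account's public records straight from its PDS. It assumes the `com.atproto.repo.listRecords` XRPC endpoint and the `app.bsky.feed.post` collection name, as I understand them from the AT Protocol documentation; neither is spelled out in this article, and `pds.example.com` is just a placeholder host.

```python
import json
import urllib.parse
import urllib.request


def list_records_url(pds: str, repo_did: str, collection: str, limit: int = 10) -> str:
    """Build the URL for listing one collection of records in one repository.

    Assumption: the PDS exposes the com.atproto.repo.listRecords XRPC method.
    """
    query = urllib.parse.urlencode(
        {"repo": repo_did, "collection": collection, "limit": limit}
    )
    return f"{pds.rstrip('/')}/xrpc/com.atproto.repo.listRecords?{query}"


def list_records(pds: str, repo_did: str, collection: str, limit: int = 10) -> list:
    """Fetch the records over plain HTTPS (performs a network call)."""
    url = list_records_url(pds, repo_did, collection, limit)
    with urllib.request.urlopen(url, timeout=10) as response:
        # Assumption: the response is a JSON object with a "records" array.
        return json.load(response).get("records", [])
```

Note what you can ask for here: posts, likes, or follows created by that account. There's no equivalent call for "who follows this account", because that information lives in other people's repositories.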

If you sign up to Bluesky, you'll be assigned to a Bluesky-managed PDS. That's what a majority of users do. Specifically, this also means that the above private keys are in Bluesky's custody. By default, they're in complete control.

However, a PDS is the easiest component to self-host, realizing "own your data". It's relatively low-cost and low-maintenance. Unless you (and a small group of other users with whom you might share a self-hosted PDS) create posts like mad, such an instance will be a low-traffic web server, serving an occasional HTTP request and propagating the data further for aggregation.

It's important to note that the complexity of self-hosting a PDS server is significantly lower than that of a Mastodon instance. In the case of Mastodon, you have to host the entire stack (data storage, aggregation, streaming, etc.). This ties your identity to that instance and has higher operational costs.

With Bluesky, these roles are separated into services (very much in the "microservices" spirit), and if you want to own your data, it's much simpler.

Indeed, the Bluesky services managed by Bluesky (the company) consume and aggregate the data stored on self-hosted PDSs. I haven't been able to find any numbers on how many self-hosted PDSs are out there, but I've seen people share that they've successfully done this.

Authentication & authorization

The question of identity on the web is a long-standing problem without a definite good solution (possibly, one will never exist). Still, Bluesky takes an interesting approach by combining identity with domain names.

Their argument is that domain names are a well-understood concept, human-readable, and somewhat decentralized (although ICANN is ultimately at the top), which is true and convincing.

It's not a permanent mapping, though: you aren't tied to a single domain name for a lifetime from the moment you register. Initially, you are assigned a [username].bsky.social subdomain. This can be changed later, thanks to an intermediate identifier, which builds upon the DID (Decentralized Identifiers) W3C standard.

did document

Your domain name should map to a DID identifier, either through a DNS record or through an https:// request to a well-known endpoint on that domain. This identifier is then looked up in a (Bluesky-managed) PLC server containing a DID document. The DID document should point back at your domain name; only when this circular relationship is in place can a DID document be considered valid. The DID document contains information such as your public key (for verifying signatures) and the address of your PDS server.
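The circular check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the `/.well-known/atproto-did` path and the `alsoKnownAs` and `service` field names reflect my reading of the AT Protocol and DID specifications, and only the HTTPS variant of the lookup is shown (the DNS variant is left out).

```python
import json
import urllib.request
from typing import Optional

PLC_DIRECTORY = "https://plc.directory"  # the Bluesky-managed PLC server


def fetch_did_via_https(handle: str) -> str:
    """Ask the domain which DID it claims (network call, assumed endpoint path)."""
    url = f"https://{handle}/.well-known/atproto-did"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8").strip()


def fetch_did_document(did: str) -> dict:
    """Look up the current DID document in the PLC directory (network call)."""
    with urllib.request.urlopen(f"{PLC_DIRECTORY}/{did}", timeout=10) as response:
        return json.load(response)


def document_confirms_handle(did_document: dict, handle: str) -> bool:
    """The DID document must point back at the domain name."""
    return f"at://{handle}" in did_document.get("alsoKnownAs", [])


def pds_endpoint(did_document: dict) -> Optional[str]:
    """Find the address of the PDS listed in the DID document, if any."""
    for service in did_document.get("service", []):
        if service.get("type") == "AtprotoPersonalDataServer":
            return service.get("serviceEndpoint")
    return None
```

A handle would be accepted only if `document_confirms_handle(fetch_did_document(fetch_did_via_https(handle)), handle)` holds: the domain points at the DID, and the DID document points back at the domain.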

A DID document is not immutable; any updates must be signed using your private key, which requires access to your PDS server. Moreover, the DID identifier is a hash of the original (root) document. That way, you get a chain of DID document updates, which anybody can verify (using the public keys). You might think of this as a personal "mini-blockchain."
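Here's a toy Python model of that "mini-blockchain", to show why the chain is verifiable by anybody. It is deliberately simplified: operations are plain JSON dictionaries, the `did:example:` prefix and the field names are made up for illustration, and signatures are omitted entirely (the real PLC operations are signed and use a different encoding). The 24-character base32 truncation mirrors what I believe the PLC method does, but treat that detail as an assumption.

```python
import base64
import hashlib
import json


def _canonical_bytes(operation: dict) -> bytes:
    # Stable JSON encoding so that the same operation always hashes the same way.
    return json.dumps(operation, sort_keys=True, separators=(",", ":")).encode()


def op_hash(operation: dict) -> str:
    """Content hash of a single operation."""
    return hashlib.sha256(_canonical_bytes(operation)).hexdigest()


def identifier_from_root(root: dict) -> str:
    """Derive the identifier from the hash of the root operation."""
    digest = hashlib.sha256(_canonical_bytes(root)).digest()
    return "did:example:" + base64.b32encode(digest).decode().lower()[:24]


def append_update(chain: list, changes: dict) -> list:
    """Return a new chain whose last entry links back to the previous one."""
    update = {**changes, "prev": op_hash(chain[-1])}
    return chain + [update]


def chain_is_valid(identifier: str, chain: list) -> bool:
    """Check that the chain starts at the right root and every link matches."""
    if not chain or chain[0].get("prev") is not None:
        return False
    if identifier_from_root(chain[0]) != identifier:
        return False
    return all(
        chain[i].get("prev") == op_hash(chain[i - 1]) for i in range(1, len(chain))
    )
```

Because the identifier is derived from the root operation, nobody can swap in a different history without changing the identifier itself; the signatures (not modeled here) are what stop a third party from appending updates.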

For example, my DID is did:plc:vpnzitxf5i6lzony3macvcbr, which is published in a DNS TXT record on warski.org. Through a request to plc.directory/did:plc:vpn..., you can view my current DID document. It points back to warski.org, and to my Bluesky-managed PDS, https://shiitake.us-east.host.bsky.network (I lost the lottery with the server's name). And indeed, my Bluesky profile is bound to that domain name.

This architecture allows one to migrate to another display name (domain name) without much hassle. That's true even if you lose control over your old domain name—the "only" thing you need is to have control over the signing key in the PDS—directly or through Bluesky's custody. That's also a significant difference compared to Mastodon.

The one central component that plays a crucial role here is the PLC server, which Bluesky maintains; it maps the DID identifiers to DID documents. This might be seen as a single point of failure, but there are two important caveats. First, all data is public and self-verifiable. Hence, if any manipulations occur, the community might quickly call that out (which, reportedly, did already happen due to the deletion of some records for legal reasons).

Secondly, any manipulation in the public ledger other than deleting records would be trivially verifiable by third parties (provided Bluesky doesn't control the private keys, which is the case today). Hence, while the system isn't ideal, it offers good transparency and opportunities for future improvements: introducing mirrored PLC servers, transferring the maintenance of the public directory to a separate entity, or maybe this is (finally!) a good use-case for a private blockchain.

Critics of Bluesky's approach to identity management point out (among other aspects) that this solution is centralized, for two reasons: reliance on DNS and the single PLC server instance. And that's true—Mastodon's design is more decentralized, but this comes with a significant tradeoff in user experience.

In Mastodon, you truly own your identity (and server); as long as somebody sells you access to the Internet, you might publish on your (private) Mastodon instance. With Bluesky, it is theoretically possible that, for example, some future government will ban you from obtaining a domain name. Hence, you won't be able to maintain an identity on Bluesky's network.

Real-time streaming

The Relay service gathers real-time data from all of the PDSs (through WebSockets) and publishes it as a "firehose." As the network grows, this will be an increasingly complex piece of software, and an increasingly costly one in terms of the compute, distribution, and bandwidth it needs to operate.

Multiple downstream components, which we'll discuss next and which provide the core Bluesky functionality, then consume the firehose.
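Conceptually, a Relay is a fan-in: many PDS event streams in, one ordered stream out. The Python sketch below models only that idea, using in-memory lists instead of WebSocket connections; the event shape (`time`, `text`) and the `seq` counter are illustrative assumptions, not the actual wire format.

```python
import heapq
from typing import Dict, Iterable, Iterator, Tuple


def _tag(pds_name: str, stream: Iterable[dict]) -> Iterator[Tuple[int, str, dict]]:
    # Attach the origin PDS to every event so it survives the merge.
    for event in stream:
        yield (event["time"], pds_name, event)


def relay(pds_streams: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Merge per-PDS streams (each already ordered by time) into one firehose."""
    merged = heapq.merge(
        *(_tag(name, stream) for name, stream in pds_streams.items()),
        key=lambda tagged: tagged[:2],  # order by (time, pds name)
    )
    for seq, (_, pds_name, event) in enumerate(merged, start=1):
        yield {"seq": seq, "pds": pds_name, **event}


if __name__ == "__main__":
    streams = {
        "pds-a": [{"time": 1, "text": "hello"}, {"time": 4, "text": "again"}],
        "pds-b": [{"time": 2, "text": "hi there"}],
    }
    for item in relay(streams):
        print(item)
```

Even in this toy version you can see where the cost comes from: the Relay has to hold a connection to every PDS and push every single event to every downstream consumer.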

Currently, there's only one relay operated by Bluesky (the company); however, in theory, there might be multiple ones. Also, in theory, Bluesky would make it possible for such a third-party relay to consume all of the data from the PDSs it hosts. But you might also imagine Bluesky charging for that (even if it's only to cover egress costs) or denying data access to its potential competitors.

Of course, that's all hypothetical and in the future; currently, Bluesky acts as an open, trustworthy company, but we have to keep in mind the "organization is a future adversary" motto.

Labelers & feed generators

Among the firehose's clients are the Labeler and Feed Generator services. A Labeler assigns labels to posts: these might include content warnings, flags manually assigned by moderators, or anything else. A human might be involved, or an AI might do the labeling. The labels are then consumed by the next component in the pipeline, which handles data aggregation (discussed in the next section).
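As a rough illustration of what a Labeler does, here's a minimal rule-based one in Python. Everything specific in it is hypothetical: the keyword list, the `spoiler` label value, and the shape of the firehose events (`uri`, `text`) are invented for this sketch, and a real Labeler would also sign and publish its labels.

```python
import re
from typing import Iterable, Iterator

# Hypothetical rule set: posts mentioning these words get a "spoiler" label.
SPOILER_WORDS = {"spoiler", "spoilers", "finale"}


def label_posts(firehose: Iterable[dict]) -> Iterator[dict]:
    """Consume post events and emit a label for each post that matches a rule."""
    for event in firehose:
        words = set(re.findall(r"[a-z]+", event.get("text", "").lower()))
        if words & SPOILER_WORDS:
            yield {"uri": event["uri"], "val": "spoiler"}
```

The Labeler never touches the post itself; it only publishes a statement about it. Whether that label hides the post, blurs it, or does nothing is decided further down the pipeline, based on what each user has subscribed to.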

Similarly, a Feed Generator consumes the firehose and is responsible for generating feeds—that's the "Algorithm." Given a user, the service's task is to create the feed—this might be a simple chronological order, popular posts, posts about a particular topic, or even a static set of posts. Again, the method of assembling the feed is arbitrary—from manual to AI-driven.
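A Feed Generator can be surprisingly small. The Python sketch below builds a topic feed: it returns only references to posts (a "skeleton"), leaving the job of filling in the content to the aggregation layer. The output shape is loosely modeled on what I understand Bluesky's feed skeleton to look like, and the post fields (`uri`, `text`, `created_at`) are assumptions made for this example.

```python
from typing import Iterable, List


def topic_feed(posts: Iterable[dict], topic: str, limit: int = 3) -> List[dict]:
    """Return references to the newest posts that mention the given topic."""
    matching = [post for post in posts if topic.lower() in post["text"].lower()]
    # ISO 8601 timestamps in the same format and timezone sort correctly as strings.
    matching.sort(key=lambda post: post["created_at"], reverse=True)
    return [{"post": post["uri"]} for post in matching[:limit]]
```

Swapping the sorting key or the filter gives you a different "Algorithm": chronological, most liked, hand-picked, or driven by a model.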

These two components can be and are already run independently from the ones that Bluesky manages. Bluesky provides some out-of-the-box labelers (some of which you can configure, some of which you can't), but you can also subscribe to others. Same with feed generators: "Popular With Friends" is managed by Bluesky (the company), but there are many others.

bluesky labeler

Creating your own Labeler is relatively easy: it amounts to hosting a web service. However, if you end up with a particularly popular labeler, the costs of running such a service might become non-trivial! You can check some existing popular labelers here:

third-party labelers

Data aggregation

We finally arrive at the largest and most important component: the AppView. That's where the combined data from the Relay (and hence all PDSs), Labelers, and Feed Generators are stored.

The task of AppView is to aggregate all the incoming data into what you later see in your Bluesky feed and when viewing individual posts. That's where posts are organized into threads, likes and reposts are counted, and some moderation is applied (some individual moderation settings are also applied on the PDS, which acts as a proxy when accessing a feed). AppView also determines how many followers you have, as this information is dispersed among the records stored on your followers' PDSs.
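The follower count is a good example of why aggregation has to happen somewhere central. Each "follow" is a record in the follower's own repository, so the only way to count followers is to scan everybody's records and invert the relationship. Here's a minimal Python sketch of that inversion; the record shape (`author`, `type`, `subject`) is a simplification invented for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(records: Iterable[dict]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Turn a stream of per-user records into like counts and follower counts."""
    likes: Counter = Counter()
    followers = defaultdict(set)
    for record in records:
        if record["type"] == "like":
            likes[record["subject"]] += 1
        elif record["type"] == "follow":
            # A set, so that a repeated follow record isn't counted twice.
            followers[record["subject"]].add(record["author"])
    follower_counts = {did: len(authors) for did, authors in followers.items()}
    return dict(likes), follower_counts
```

In this toy version everything fits in two dictionaries. At the scale of the whole network, the same computation is what requires the databases and data pipelines that make this component so expensive to run.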

This component is also the most data-intensive and hardest to scale. It's where NoSQL databases, search clusters, real-time data pipelines, and the like come into play. It's also currently the only closed-source component; however, there are plans to change this. Or maybe at least factor out more functionality into separate services, as happened with Labelers and Feed Generators?

There's currently only one AppView instance in operation. However, nothing stops you from writing your own! As long as you've got Relays that allow you to consume the data, you might create one that aggregates data to your liking. The ideal architecture of a federated network of services, with full-scale third-party participants, using the Bluesky architecture is presented in the paper by Martin Kleppmann and the Bluesky team:

bluesky architecture ideal

Mastodon is like email, Bluesky is like the web

While Mastodon's architecture is often said to be inspired by email, Bluesky's architecture is similar to that of the web. PDSs correspond to individual web servers hosting web pages. Anyone can host a web server with a number of pages, and similarly, anybody can host a PDS with a number of records (posts, likes, follows). Whether somebody is aware of the existence of such a web server/PDS and consumes the content is another question!

We've got PDSs (web servers) and posts (web pages). What are Relays and AppViews, then? Google! Or rather, search engines that index the content of the web. Just as Google or Bing make your content discoverable to the world, Bluesky, the company, plays a similar role: it indexes the content of all PDSs and makes it searchable and public.

Still, one significant difference exists in the current picture. If we were to translate the current setup of Bluesky's components to the world of the web, we would end up with a single search engine, indexing the entire web but also hosting all of the web pages! It's a very unbalanced setup.

Is Bluesky different from Twitter, then?

From a technical perspective, the architecture is there to make Bluesky significantly different from Twitter/X. However, for that to happen, a number of social and business factors must align. Any progress in this area can only be made with full cooperation from Bluesky (the company).

One option for decreasing the current centralization could be to further separate data ownership from data indexing: in web terms, separating web hosting from web indexing. We wouldn't want an Internet where everything is Google-controlled, not because Google is evil, but just to have a healthy ecosystem. The same applies here.

Ideally, third-party services would emerge as alternatives to Bluesky-managed PDS hosting. But changing your PDS from one provider to another needs to be almost trivial, baked right into the Bluesky app's UI. I'm unsure if Bluesky (the company) has the incentives to do that (looking from a purely business perspective).

Also, we can only talk about decentralization once we have at least two big Relay / AppView implementations. The technical possibilities are there, but what good are they if they remain unused? Note that it's not a matter of just having another Relay / AppView: such a service needs to amass a high number of users to be a viable Bluesky alternative—high enough so that it can't be ignored and can't be data-starved with a flip of a switch from some future Bluesky owner.

It's a challenge for the community, but one that might be overcome. There are a number of open-source projects to create and a number of business models to discover, as running a Twitter-like service is expensive.

Bringing the above to life would be in line with Bluesky's "credible exit" goal. While the company and the team fully acknowledge the current rather centralized situation, their stated development goal is to provide a way of moving away from Bluesky, along with your identity and data, if the company goes out of business.

The work done by the Bluesky team so far builds confidence: they have separated services with well-defined roles (PDS, Labeler, Feed Generator, Relay), which might be run independently from the main Bluesky infrastructure, and they have respected these promises. However, we need to keep in mind that if we truly want to have a better Twitter/X alternative, we have to be prepared for the scenario in which, 15 years from now, the company experiences a hostile takeover.

Bigger picture

Bluesky is also worth studying for another reason. It's only one "social mode" and one application built on top of the AT Protocol (Authenticated Transfer). The protocol's vision goes way beyond microblogging: it provides a way to build and maintain an online identity (DIDs and DNS) and publish authenticated data on the web (using signed PDS records).

Just as we have the web, with hyperlinks being the core building block, here we have a vast ocean of signed AT Proto records, referencing each other using AT Proto URIs.
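To show what such a reference looks like, here's a small Python parser for AT URIs. It assumes the common `at://<authority>/<collection>/<record key>` layout, which is my understanding of the scheme rather than something defined in this article, and it skips the stricter validation a real implementation would need. The record key in the usage note is made up.

```python
from typing import NamedTuple, Optional


class AtUri(NamedTuple):
    authority: str             # a DID or a handle (domain name)
    collection: Optional[str]  # e.g. "app.bsky.feed.post"
    rkey: Optional[str]        # the record key within the collection


def parse_at_uri(uri: str) -> AtUri:
    """Split an AT URI into its authority, collection, and record key."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"malformed AT URI: {uri!r}")
    collection = parts[1] if len(parts) > 1 else None
    rkey = parts[2] if len(parts) > 2 else None
    return AtUri(parts[0], collection, rkey)
```

For instance, `parse_at_uri("at://did:plc:vpnzitxf5i6lzony3macvcbr/app.bsky.feed.post/3kexample")` yields my DID as the authority, the post collection, and the record key. Since the authority is a DID rather than a server address, the reference stays valid even if the account moves to a different PDS.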

Especially in the age of generative AI, having a way of authenticating that a specific person or organization has produced some data might be more valuable than ever.

And finally: follow us on Bluesky!