time to read 5 min | 925 words

Unless I get good feedback / questions on the other posts in the series, this is likely to be the last post on the topic. I was trying to show what kind of system and constraints you have to deal with if you wanted to build a social media platform without breaking the bank.

I talked about the expected numbers that we have for the system, and then set out to explain each part of it independently. Along the way, I was pretty careful not to mention any one particular technological solution. We are going to need:

  • Caching
  • Object storage (S3 compatible API)
  • Content Delivery Network
  • Key/value store
  • Queuing and worker infrastructure

Note that the whole thing is generic and there are very few constraints on the architecture. That is by design: if your architecture can hit the lowest common denominator, you aren’t tied to any one particular provider. For that matter, you can likely set things up to use multiple disparate providers without too much of a hassle.

My goal with this system was to be able to accept 2,500 posts per second and to handle 250,000 reads per second. That sounds like a lot, but most of the load is meant to be handled by the CDN and the infrastructure, not the core servers. Caching in a social network is somewhat problematic, since a lot of the work is personalized by its very nature. That said, there is still quite a lot that can be cached, especially the more popular posts and threads.

If we assume that only about 10% of the read load hits our servers, that is 25,000 reads per second. If we have just 25 servers handling this (five in each of five separate data centers), each server has to accept 1,000 requests per second. On the one hand, that is a lot; on the other hand, most of the cost is supposed to be authorization, minor logic, etc. We can also add more application servers at this point and scale linearly.

Just to give some indication of costs, a dedicated server with 8 cores & 32 GB of RAM will cost about $100 a month, with no charge for traffic. Assuming that I’m running 25 of these, that will cost me $2,500 a month. I can safely double or triple that amount without much trouble, I think.

Having to deal with 1,000 requests per second on a single server requires paying attention to what you are doing, but it isn’t really that hard, to be frank. RavenDB can handle more than a million queries a second, for example.

One thing that I didn’t touch on, however, which is quite important, is the notion of whales. In this context, a whale is a user with a lot of followers. Let’s take Mr. Beat as an example: he has 15 million followers and is a prolific poster. In our current implementation, we’ll need to add to the timelines of all his followers every time he posts something. Mrs. Bold, on the other hand, has 12 million followers. At one point, Mr. Beat and Mrs. Bold got into a post fight, which looked like this:

  1. Mr. Beat: I think that Mrs. Bold has a Broccoli bandana.
  2. Mrs. Bold: @mrBeat How dare you, you sniveling knave
  3. Mr. Beat: @boldMr2 I dare, you green teeth monster
  4. Mrs. Bold: @mrBeat You are a yellow belly deer
  5. Mr. Beat: @boldMr2 Your momma is a dear

This incredibly witty exchange happened over a three-minute span. Let’s consider what it does, given the architecture that we outlined so far:

  • Post #1 – written to 15 million timelines.
  • Posts #2–5 – written to the timelines of everyone that follows both of them (because of the mentions), let’s call that 10 million timelines per post.

That is 55 million timeline writes to process within the span of a few minutes. If other whales join in (and they might), the number of writes we’ll have to process will skyrocket.

Instead, we are going to take advantage of the fact that only a small number of accounts are actually followed by many people. We’ll place the limit at 10,000 followers; past that point, we no longer process timeline writes for such accounts. Instead, we place the burden on the client side. The code for showing the timeline then becomes something like this:
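Roughly like the following sketch, where the timeline store interface and the key naming are my own assumptions rather than the actual implementation:

```typescript
// A minimal sketch of read-time merging; store interfaces and key naming
// are assumptions for illustration.
interface TimelineStore {
  // returns the most recent post ids in the timeline, newest first
  get(timelineId: string, take: number): Promise<bigint[]>;
}

interface FollowsStore {
  // the followed accounts that crossed the 10,000 followers threshold
  highProfile(userId: string): Promise<string[]>;
}

async function showTimeline(
  userId: string,
  timelines: TimelineStore,
  follows: FollowsStore,
  take = 128
): Promise<bigint[]> {
  // The precomputed (write-side) timeline only contains posts from
  // accounts below the follower threshold.
  const precomputed = timelines.get(`timelines/${userId}/private`, take);

  // High-profile accounts are resolved at read time instead; their
  // timelines are hot, so these reads should mostly hit the cache.
  const whales = await follows.highProfile(userId);
  const whaleReads = whales.map((w) =>
    timelines.get(`timelines/${w}/public`, take)
  );

  const all = (await Promise.all([precomputed, ...whaleReads])).flat();

  // Post ids are roughly time-sorted, so merging is just a descending
  // sort and a page-size cut.
  const unique = [...new Set(all)];
  unique.sort((a, b) => (a < b ? 1 : a > b ? -1 : 0));
  return unique.slice(0, take);
}
```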

In other words, we record the high-profile users in the system, and instead of doing the work for them on write, we do it on read. The benefit of doing it in this manner is that reads of the high-profile users’ timelines will have very high cache utilization.

Given that the number of high-profile people you’ll follow is naturally limited, that can save quite a lot of work.

The code above can be improved, of course. There is usually a lot of variance between timelines, so a high-profile user who has been quiet for a day or two shouldn’t show up in the current timeline and can be skipped entirely. You also need to do a bit more work around time frames, which means that the timeline should allow querying by most recent post id, but that is not too hard to implement either.

And with that, we are at the end. I think that I covered quite a few edge cases and interesting details, and hopefully it made for an interesting read.

As usual, I really appreciate any and all feedback.

time to read 3 min | 551 words

A social media platform has to deal with the concept of now as well as its history. Most users are interacting with the current state of the system: looking at their timeline, watching current posts, etc. At the same time, there is a wealth of information that you can get from looking at the past.

It isn’t out of the question that you’ll have users diving into the history of posts of another user going as far back as possible. That can be a parent whose kids just left the house, looking at baby pictures or it can be a new friend, trying to learn some interesting tidbits before a party (when we still had those).

It can also be automated processes, such as: “5 years ago you posted…”

The architecture that I presented in these posts is relatively agnostic to such a scenario. Given the timeline feature, going back in time means that you can fairly easily discriminate based on age. Older sections in the timeline can be moved to a lower storage tier (basically, moved to HDD instead of NVMe, for example). They are still accessible, still available, but far cheaper to store.

I don’t believe that you can usually go with an archive tier for the timelines, not unless you are willing to be effectively unable to serve them when a user requests them, but a policy of moving old and rarely used timeline sections and posts to HDD is absolutely doable. Note that things like intelligent tiering are not a good solution for our needs. Intelligent tiering moves items based on age and access, but while we want to move items by age, older items are still accessed, just far more rarely, and we don’t want those rare accesses to pull them back into hot storage.

That said, certain posts are likely to generate activity for a long time, so we can’t send data to cold storage based on age alone; we need to take recent access patterns into account as well. On the other hand, consider a post from a few years ago that talks about Broccoli, back when people still did that. Mr. Beat discovers that Mrs. Bold has such a post and blasts it all over social media. Very quickly, that old post becomes very active. That means that we should have a way to move data back to hot storage if there is enough access.
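A sketch of what such a tiering policy could look like; the thresholds and the section metadata shape here are purely illustrative:

```typescript
// A sketch of an age-based tiering policy with access-based promotion;
// thresholds and the metadata shape are assumptions.
type Tier = "hot" | "cold";

interface SectionStats {
  createdAt: Date;
  recentHits: number; // accesses within the current sliding window
}

const DEMOTE_AFTER_DAYS = 90;
const PROMOTE_AFTER_HITS = 100; // enough recent traffic to justify hot storage

function nextTier(current: Tier, stats: SectionStats, now = new Date()): Tier {
  const ageDays =
    (now.getTime() - stats.createdAt.getTime()) / (24 * 60 * 60 * 1000);

  // Old sections age out to cold storage by default...
  if (current === "hot" && ageDays > DEMOTE_AFTER_DAYS) return "cold";

  // ...but a burst of activity (an old post going viral) pulls the section
  // back to hot storage. Rare, occasional reads deliberately do not.
  if (current === "cold" && stats.recentHits >= PROMOTE_AFTER_HITS) return "hot";

  return current;
}
```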

Ideally, we can rely on the underlying storage to do that for us, but we have to know how it actually works behind the scenes and understand what is going on there. The nice thing about this is that unlike most of the details we discussed so far, this is something that we can punt down the road. We already have the architecture in place that will allow us to introduce this cost-saving measure later; we don’t have to have it figured out from day one. Given that we have multi-level caches, we can probably just age old information out to cold storage and not have to think about it too much.

When we have enough data that this is a serious concern, on the other hand… we will have the time and resources to also handle it.

time to read 3 min | 555 words

Quite a few of the features that we consider native to social media actually came about as a result of users’ behavior, not pre-planned actions. For example, #tags and @mentions were both created by users and then adopted as official features by the social media giants.

I already touched on how mentions are handled: as part of writing the post document itself, we insert the post id into the timeline of the mentioned user. For tagging, the process is very similar. Each tag has a timeline, and we can insert posts into the tag’s timeline. From there on, the process is basically identical to what we already described.

I want to stop for a second and emphasize the coolness factor of a significant feature being handled (in the backend) via simply:
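Something along these lines, assuming the timeline append operation that the rest of the series relies on (the interface and key naming are mine):

```typescript
// A sketch of tag handling; the timeline append interface is an assumption.
interface TimelineAppend {
  append(timelineId: string, postId: bigint): Promise<void>;
}

async function recordTags(
  postId: bigint,
  text: string,
  timelines: TimelineAppend
): Promise<void> {
  const tags = text.match(/#\w+/g) ?? [];
  for (const tag of new Set(tags.map((t) => t.toLowerCase()))) {
    // each tag is just another timeline that the post gets appended to
    await timelines.append(`timelines/tags/${tag.slice(1)}`, postId);
  }
}
```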

There is probably UI work to do here, but that is roughly all you’ll need to manage tags. Presumably you’ll want some better policies, but that is the core behavior you’ll need to support this sort of feature.

What about searching? Full text search indexes are nothing new, and you can get an off-the-shelf solution to manage your searches easily enough. That is likely to be one of the annoying pieces of the system. Luckily, we can usually handle things by offering two tiers of searching. The first tier covers posts in the recent past (a month or two) and must answer queries very quickly; the second is full data search, for which we have a much longer SLA. By far, most queries are going to hit the recent data set, which makes the task itself easier. The actual choice of indexing solution and its usage is fairly irrelevant at this point. You’ll need something that is distributed, but there is enough variety there that you can get away with selecting pretty much anything.

We aren’t going to need to provide sophisticated full text search features, we just want users to be able to find results by text queries.

You’ll note that throughout this series of posts, I’m not trying to find novel ways to get the best solution. I’m using practical options for the actual use case presented and in many cases, I can get away with a lot by changing the requirements just slightly. For that matter, a lot of the limitations that I accept are real limitations that you’ll find with other social media networks as well.

Finally, I just wanted to show how we can enable a basic search capability using a minimal amount of code, given the infrastructure we have so far:
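A sketch of the idea follows; the tokenization is deliberately naive and the store interface is an assumption:

```typescript
// A sketch of search-as-timelines; every distinct term gets its own
// timeline, and searching for a term is just reading that timeline.
interface TimelineStore {
  append(timelineId: string, postId: bigint): Promise<void>;
  get(timelineId: string, take: number): Promise<bigint[]>;
}

// On write: append the post id to the timeline of each term in the post.
async function indexPost(
  postId: bigint,
  text: string,
  timelines: TimelineStore
): Promise<void> {
  const terms = new Set(
    text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter((t) => t.length > 2)
  );
  for (const term of terms) {
    await timelines.append(`timelines/search/${term}`, postId);
  }
}

// On read: the search result is the term's timeline.
function search(term: string, timelines: TimelineStore, take = 128) {
  return timelines.get(`timelines/search/${term.toLowerCase()}`, take);
}
```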

As you can see, we built a simple full text search here. To query it, you get the timeline for a particular term, which gives you the list of posts containing it.

For tags and searches, as you can imagine, this can be a huge list, which is partly why timelines are built on the concept of sections that can be so easily distributed.

The solution above isn’t actually a good one for full text search. I can’t easily turn it into a phrase search, and there are many other features that I’ll likely want to have, but it is a good example of how the infrastructure that we built for one part of the system can be utilized for a completely different purpose.

time to read 4 min | 623 words

I touched briefly on the issue of post statistics in a previous post, but it deserves its own post. There are all sorts of metrics that we want to track on a post. Here are just a few of them:

[Figure: per-post metrics such as likes, replies, shares and views]

Unlike most of the items that we discussed so far, these details are going to be very relevant for both reads and writes. In particular, it is very common for these numbers to be updated concurrently, especially on popular posts. At the simplest level, these can be represented as a map<key, int64>. That gives us the maximum flexibility for our needs and can also be utilized for additional use cases in the future.

Given that this is effectively a distributed counter problem, there are all sorts of ways that we can handle it. At the client level, we send the increment operation to the server and optimistically update the displayed value. That gets us 90% there in terms of the UX, but there is a lot to handle behind the scenes.

A good algorithm to use for this is the PN Counter model from the CRDT playbook. RavenDB implements these for you, for example. In essence, that means that we have the following data model:
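A sketch of what such a document could look like; the field names are assumptions, but the structure follows the PN counter idea:

```typescript
// A sketch of the per-node counter model; field names are assumptions.
interface NodeCount {
  value: number; // the counter value as known by this node
  etag: number;  // monotonically increasing change marker for the node
}

// one entry per node / data center that incremented the counter
type DistributedCounter = Record<string, NodeCount>;

interface PostStats {
  id: string;
  likes: DistributedCounter;
  replies: DistributedCounter;
}

// Example document:
const stats: PostStats = {
  id: "posts/1352410889870336005/stats",
  likes: {
    "dc-brisbane": { value: 182_004, etag: 92 },
    "dc-odessa": { value: 81_112, etag: 88 },
  },
  replies: {
    "dc-brisbane": { value: 671, etag: 45 },
  },
};

// Merging replicas: for each node's entry, the higher etag wins.
function merge(a: DistributedCounter, b: DistributedCounter): DistributedCounter {
  const merged: DistributedCounter = { ...a };
  for (const [node, count] of Object.entries(b)) {
    if (!merged[node] || merged[node].etag < count.etag) {
      merged[node] = count;
    }
  }
  return merged;
}

// The value shown to users is simply the sum across nodes.
const total = (c: DistributedCounter): number =>
  Object.values(c).reduce((sum, n) => sum + n.value, 0);
```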

The likes and replies objects have a property for each node that incremented the value. Each entry contains the value that we have for that node as well as the etag of the change. It is easy to merge such a model between different versions, because we can always take the entry with the higher etag to get the latest value. In this way, we can allow concurrent and distributed updates across the entire system, and it will converge to the right value in the end. Another option may be to push the commands all the way to the owning data center, where we would apply the operations, but that may put a high load on hot posts in the system. Better to distribute this globally and not really concern ourselves with the matter.

Looking at Twitter, there are about 200 billion tweets a year. That means that we have to be ready for quite a few of those values. Keeping them in a dedicated system is a good idea, since they have a far different read & write skew than other parts of our system. As posts are read, however, we’ll likely want some mechanism for folding those counters into the post itself, so we can remove them from the rest of the system. An easy way to handle that is to do so on an hourly basis. So instead of the format above, we’ll have:
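Perhaps something like this sketch, where the exact shape is an assumption:

```typescript
// A sketch of the hourly-bucketed format; the shape is an assumption.
// Only the recent buckets live in the counter system; consolidated totals
// live on the post document itself.
const recentStats = {
  id: "posts/1352410889870336005/stats",
  "2021-02-01T14:00": {
    likes: { "dc-brisbane": { value: 312, etag: 97 } },
  },
  "2021-02-01T15:00": {
    likes: { "dc-brisbane": { value: 45, etag: 99 } },
    replies: { "dc-odessa": { value: 7, etag: 98 } },
  },
};
```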

Here we have the last two hours of update operations on the post. Once every hour, we consolidate all the updates from two hours ago and write them to the post itself. When we get to the point where the post has no more pending updates, we can safely delete the value.

The reason you want to add this complexity is that there is a big difference between all the posts in a social media platform and the active working set. The working set tends to be far smaller, which can dramatically reduce the amount of data we need to keep and manage. Assuming that the working set is at 25 million posts or so across the network seems reasonable, and that amount of data can be easily handled by any server instance you care to use. Managing 200 billion per year, on the other hand, puts us in a different class of problem, and we would need more and more resources down the line.

time to read 5 min | 834 words

In the series so far, we talked about reading and writing posts, updating the timeline and distributing it, etc. I mentioned the challenges of caching data when we have to deal with updates in the background, but I haven’t really dug into them. Edits and updates are a pain to handle, because they invalidate the cache, which is one of the primary ways we can scale cheaply.

We need to consider a few different aspects of the problem. What sort of updates do we have in the context of a social media platform? We can easily disable editing of posts, of course, but we have to be able to support deletes. A user may post about Broccoli, which is verboten, and we have to be able to remove that. And of course, users will want to be able to delete their own posts, lest their salad tendencies come back to haunt them in the future. Another reason we need to handle updates is this:

[Figure: a post’s interaction counters (likes, replies, shares) shown live]

We keep track of the interactions on the post, and we need to update them as they change. In fact, in many cases we want to update them “live”. How are we going to handle this? I discussed the caching aspects of this earlier; the general idea is that we have two caching layers in place.


[Figure: read path – client to CDN to API, with the API serving from its local state]

A user getting data will first hit the CDN, which may have the data cached, and then the API, which will get the data from the backend. The API endpoint will query its own local state, and other pieces of the puzzle are responsible for the data distribution.

When changes happen, we need to deal with them, like so:

[Figure: update flow – changes pushed from the owning data center to the other data centers and their caches]

Each change means a policy decision. For example, deleting a post means that we need to push an update to all the data centers. The same applies to updates to the post itself. In general, updating the content of the post and updating its view counts aren’t really that different. We’ll usually want to avoid editing post content for non-technical reasons, not for lack of ability to do so.

Another important aspect to take into account is latency and updates. Depending on the interaction model with the CDN, we likely have it set up to cache data based on duration, so API requests are cached for a period of a few seconds to a minute or two. That is usually good enough to reduce the load on our servers and still retain a good enough level of freshness.

Another advantage that we can use is the fact that when we get to high numbers, we can reduce the update rate. Consider:

[Figure: interaction counts displayed in rounded form once they grow large]

We now need to update the post only once every 100,000 likes or shares, or once per 10,000 replies. On top of that, we can skip an update if one happened recently enough. That is the kind of thing that can reduce the load curve significantly.
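As a sketch, with the thresholds taken from the numbers above and the function itself being an assumption of mine:

```typescript
// A sketch of rate-limited propagation; thresholds follow the text above.
function shouldPushUpdate(
  metric: "likes" | "shares" | "replies",
  count: number,
  lastPushedAt: Date,
  now = new Date()
): boolean {
  const step = metric === "replies" ? 10_000 : 100_000;

  // only push when the counter crosses another step boundary...
  const crossedStep = count % step === 0;

  // ...and not if we already pushed an update very recently
  const pushedRecently = now.getTime() - lastPushedAt.getTime() < 60_000;

  return crossedStep && !pushedRecently;
}
```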

There are also live updates to consider. Typically, that means that we’ll have the client connected to a server via web socket, and we need to be able to tell it that a post has been updated. We can do that using the same cache update mechanism. The update cache command is placed on a queue, and the web socket servers process messages from there. A client indicates what post ids it is interested in, and the web socket server notifies it about changes to them.

The idea is that we can completely separate the different pieces of the system. We have the posts storage and the timeline as one system and the live updates as a separate one. There is some complexity here around cache usage, but it is actually better to assume that stuff will not work than to try for cache coherency.

For example, a client may get an update saying that a particular post was changed, but when it queries for the new post details, it gets a “not modified” response. This is a classic race condition, one that can cause a lot of trouble for the backend people to eradicate. If we don’t try, we can simply state that on the client side, getting a not-modified response after an update notification is not an error. Instead, the client schedules another query for the post after the cache period has elapsed.

A core design tenet in the system is to assume failure and timing issues, and to avoid forcing a unified view of the system, because that is hard. Punting on the problem even just a little bit allows for a much better architecture.

time to read 12 min | 2245 words

In the series of posts so far, when discussing reads, I punted on the part where we know what to read. I mentioned that we get a whole batch of post ids from somewhere and discussed how that is going to work, but that was it. Now I want to talk about exactly how this works.

The timeline concept is a fairly simple one. We have a list of post ids that the user goes through. As they browse through the list, items are added at the top, etc. This is basically the Twitter model. An alternative is that as you scroll, if there are new items in the list, you are shown them before older values (the Facebook approach), but that is more complex.

Conceptually, the timeline is as simple as:
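(What follows is a conceptual sketch only; the production structure is discussed later in this post.)

```typescript
// A conceptual sketch only; the real structure (loose ids plus
// compressed, immutable sections) is described later in this post.
class Timeline {
  private postIds: bigint[] = [];

  append(postId: bigint): void {
    this.postIds.push(postId);
  }

  // Readers record the position they started at and scroll backward.
  read(fromPosition: number, take: number): bigint[] {
    const start = Math.max(0, fromPosition - take);
    return this.postIds.slice(start, fromPosition).reverse();
  }

  get length(): number {
    return this.postIds.length;
  }
}
```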

In other words, we just have a list of post ids; we add items at the end, and as we scroll, we keep track of where we started. That is sufficient to get us pretty much all the features that we want, surprisingly enough.

When you go to the home page and look at your timeline, you’ll typically start with whatever the latest value is; we record the last position we saw and then start scrolling backward in time. In other words, if we have 10,000 items in the timeline, we record that the position we started at was 10,000 and then start going back toward zero. If there are new items, the size of the list will increase and we can jump back to the top, etc.

That is simple enough, but how does this actually help us? It may be good if I wanted to see the public timeline of the entire network, but what about the actual features? I don’t care that a restaurant in Prague is now offering discounted deliveries, for example. I care about the accounts that I follow.

The idea is that we don’t have just a single such timeline, but many. In fact, pretty much all operations in the social platform can be represented using the timeline abstraction.

Let’s consider typical usage of an account. I’m adding posts, but I also want to be able to see people talking to me or about tags that I follow. How is that going to work?

Well, I’m actually going to have two timelines:

  • Public Timeline – where we’ll add all the posts from the user, and maybe posts that mention or reply to them, etc.
  • Private Timeline – where we’ll add posts from users that you follow, mentions, replies to discussions the user took part in, etc.

In both cases, the behavior of the system is identical: we simply go through the list. If you recall, I left a lot unsaid when I discussed writing posts, in particular how to publish them to interested parties. This is where we start to apply policies. Part of the process of adding a post is to figure out what timelines it should go to.

By moving most of the cost to the write side, we drastically reduce the overall complexity. Furthermore, it also makes a lot more sense, given that most posts aren’t going to have a wide blast radius.

When posting a message, we need to consider the following:

  • Is this a high impact account? (Let’s say, > 50,000 followers) If so, we’ll have special behaviors.
  • Who is following this user?
  • Is this a reply to another post?
  • Are there mentions in this post? If so, we need to apply the logic based on the mention policies.

As you can imagine, this is quite involved, but in general, the way it will work is something like this:
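Here is a sketch of that publish step; the stores and policy helpers are stand-ins of mine, not the real implementation:

```typescript
// A hedged sketch of the publish step; all interfaces are assumptions.
interface Post {
  id: bigint;
  authorId: string;
  replyTo?: bigint;
  mentions: string[];
}

interface Stores {
  timelines: { append(timelineId: string, postId: bigint): Promise<void> };
  followers: {
    of(userId: string): Promise<string[]>;
    isHighImpact(userId: string): Promise<boolean>; // > 50,000 followers
  };
  policies: {
    allowsMention(target: string, author: string): Promise<boolean>;
  };
}

async function publishPost(post: Post, stores: Stores): Promise<void> {
  const targets = new Set<string>();

  // The author's own (public) timeline always gets the post.
  targets.add(`timelines/${post.authorId}/public`);

  // A reply is also appended to the timeline of the parent post.
  if (post.replyTo !== undefined) {
    targets.add(`timelines/posts/${post.replyTo}`);
  }

  // Mentions are routed according to the mention policies.
  for (const user of post.mentions) {
    if (await stores.policies.allowsMention(user, post.authorId)) {
      targets.add(`timelines/${user}/private`);
    }
  }

  // High impact accounts skip follower fan-out entirely; their posts are
  // resolved on the read side instead.
  if (!(await stores.followers.isHighImpact(post.authorId))) {
    for (const follower of await stores.followers.of(post.authorId)) {
      targets.add(`timelines/${follower}/private`);
    }
  }

  await Promise.all(
    [...targets].map((t) => stores.timelines.append(t, post.id))
  );
}
```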

The key here is that the whole thing works by selecting which timelines we’ll publish to. Furthermore, you can see that a user has multiple timelines, and in the case of a mention, we can apply additional policies to decide how the post gets routed. This is complex and often changing, but it also happens a lot less often than reads, so it is a net benefit to move all the costs to the write side.

Another thing to notice in this case is how we handle a reply. A reply is just appended to the timeline of the parent post. In other words, it is timelines all the way down. We want to have a single, simple abstraction that handles as much of the system as we can. In this case, we need to handle replies on a post, which can number anywhere from very few to hundreds of thousands. By generating a timeline for the post as well, we reuse all the same behaviors and it just works.

As for the timeline itself? It is merely a queue of post ids that lets you set a position in it in an efficient manner. The list of post ids above is how it works conceptually, but we have to think about the numbers here. How big can a timeline get?

  • The user’s own Public Timeline is limited by how many posts they can make. In general, even very heavy users will not top a few hundred posts a day and low thousands a month. That means we have a reasonable upper bound on how big this timeline can grow. Ten years of posting 5,000 posts a month will get you over half a million, but I would assume that is the top rate for anything that isn’t an automated system.
  • Your Private Timeline is impacted by how many people you follow and how prolific they are. There is a natural limit to how many accounts a person can follow, so there is a bound here, but assuming that you follow 1,000 accounts that each post 1,000 posts a month, that adds up to a million posts a month. Over a ten-year span, that would be 120 million posts. That said, we’ll discuss other properties of the private timeline below.
  • A post’s timeline holds all the replies that were made to the post. Most posts have very few replies, but some garner a lot. It took me a minute to find a tweet on Twitter that had close to 400,000 replies, for example.

So a timeline may be big, potentially very big. However, there is an interesting issue here, how much do we actually need to keep?

The purpose of the Private Timeline, for example, is to show you the front page of the site. How far back do we need to keep data for it? Is there a reason to keep your timeline from three years ago? The answer is probably no. We can cap the private timeline at a certain size and likely benefit from a lot of space savings. On the other hand, the replies to a post can be quite interesting, and while they can grow very big, it probably makes sense to never trim them.

So we have the concept of a timeline, but what is it actually going to be?

In terms of REST API, we are going to have the following endpoints:

  • GET /timelines?id=1351081943163123854
  • GET /timelines/sections/F4BE2048BF51F3DCC69EA4CA4ED08F12A36BD6524C9F12018BA0CE6F7C076BB2

In other words, we can access the timeline or a section in that timeline. I would rather show the output and then discuss what it all means. The first endpoint gives us the timeline itself, and looks like this:
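A reconstruction of what such a response could look like; the field names are assumptions and the second section hash is purely illustrative:

```json
{
  "id": "1351081943163123854",
  "posts": [
    1352410889870336005,
    1351081958063951875
  ],
  "sections": [
    "F4BE2048BF51F3DCC69EA4CA4ED08F12A36BD6524C9F12018BA0CE6F7C076BB2",
    "D79B2A0E8C3F41AA95C1B7D2E60F8834A1C5D9B3E7F60218CB4A0D6E9F1C2035"
  ]
}
```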

There are a few things to note here. When we ask for a timeline, we get the most recent posts in it, as well as the past few sections. The way a timeline works, you can always append post ids to it, which works great, except that at some point the sheer size involved starts to be problematic.

If we consider a big timeline, one with 400,000 posts in it, that comes to about 3.2 MB just to store the post ids (8 bytes each). In practice, due to concurrency and distribution concerns, we can’t have a single actual list of post ids, so we need some better management. Another factor is that you very rarely need or want to get the entire timeline; you want to start from the top and work your way down.

We can handle that easily enough using a two stage approach. First, all the new post ids appended to a timeline are written in a “loose” form, each one with its own entry. Once we hit a certain limit (128, for example), we know that this is likely to grow bigger. At that point, we grab the loose post ids in the timeline and gather them into a section. A section is an independently addressed part of the timeline. The idea is that we gather all of the post ids currently loose in the timeline, write them into a single object and compress it. Then we use the hash of the resulting object as its key in an object store.
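A sketch of that folding step, assuming gzip for the compression and a SHA-256 content address; the document shape and object store API are mine:

```typescript
// A sketch of folding loose ids into a compressed, content-addressed
// section; the document shape and object store API are assumptions.
import { gzipSync } from "node:zlib";
import { createHash } from "node:crypto";

interface TimelineDoc {
  loose: bigint[];     // recently appended post ids
  sections: string[];  // hashes of compressed sections, newest first
}

interface ObjectStore {
  put(key: string, data: Buffer): Promise<void>;
}

const SECTION_THRESHOLD = 128; // loose ids before folding into a section

async function maybeCreateSection(
  timeline: TimelineDoc,
  objectStore: ObjectStore
): Promise<void> {
  if (timeline.loose.length < SECTION_THRESHOLD) return;

  // compress the loose ids into a single object...
  const payload = gzipSync(Buffer.from(timeline.loose.map(String).join("\n")));

  // ...and address it by its own hash, so sections are immutable and
  // trivially cacheable on a CDN
  const hash = createHash("sha256")
    .update(payload)
    .digest("hex")
    .toUpperCase();

  await objectStore.put(`sections/${hash}`, payload);
  timeline.sections.unshift(hash);
  timeline.loose = [];
}
```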

Side note: timeline sections are immutable. Once a section is created, it cannot change. You can add additional filtering on the timeline on read, on the other hand. The timeline also has to handle the case where posts in it have since been deleted, given that we cannot modify it. For ease of implementation, we’ll also allow duplicates among the post ids. Clients are expected to handle and ignore duplicate post ids that happen within a certain time range.

The reason that the timeline section is compressed is to reduce the size, obviously. In my testing, I was able to get a 65% reduction in size without taking any special effort. Throwing the compressed data into object storage (S3 and the like) also means that it is much easier to scale reads on the sections. If we have a user who is very popular, we can move their timeline into compressed sections sooner to reduce load. This design explicitly acknowledges the problems with distributed systems and concurrency. It is possible for a compressed section to contain an id that also appears in the loose portion of the timeline. The responsibility for handling such a scenario is on the client code, which can do so far more easily than the server side.

After compressing the loose posts in the timeline, we record the new section hash and allow clients to access it. It might be easier to see how that would work in the following image:

[Figure: a timeline made of loose post ids at the top and hash-addressed compressed sections below]

Given a post id size of 8 bytes, and assuming that we can compress it by 65% (my naïve tests using Brotli & GZip say yes, and we can probably do better than that), we can generate a new section every thousand post ids or so, which would end up at about 2KB in size. Even a very big timeline with hundreds of thousands of entries would end up with just a few hundred sections at the top.

The entire mechanism is very limited, quite intentionally. The external operations we allow on a timeline are append and get, with the client expected to understand how to go deeper into the timeline. The limitations and expectations placed on the client (allowing duplicate post ids, handling post ids that point to deleted posts, etc.) are all there to make it easy to scale the system out.

Consider a typical use case: I go into a popular account and look at their posts. Effectively, I’m browsing their public timeline. My interaction with the server goes like this:

  • Get a list of the post ids in the timeline. The first step is: GET /timelines?id=1351081943163123854
    • This gets me the list of loose post ids and the recent sections.
    • Notice that this API call is open for caching as well, so we can get the scaling benefit of that as well.
  • Get the actual posts, which I can do with the batch post read API that we discussed earlier.

In many cases, the cost of getting the timeline for the first time will be amortized over the reading time of many posts. The bulk read API gets me 128 posts at a time, so as the user is reading, I can fetch the next batch and have the next part ready immediately.

Once I’m done with the loose posts, I can go into the compressed sections and do the same there. If each section holds about 1,000 post ids, that will be sufficient for quite some time. And because I’m driving this from the client, it is very easy for me to scale. Throwing the data into an object store like S3 means that I get CDN support easily, and my scaling issue becomes “serve a lot of small files”, which is a very well understood problem.

I still have to take permissions into account, but that is already something that we handle in the batch read API. Notice that for the common case of public posts, pretty much the whole thing has caching and distribution baked in, and the amount of work that we can let the rest of the system handle is very high.

Hot spots in the system are going to be handled by the infrastructure, not by our own code, big machines or clever algorithms. They are going to be handled by an architecture that makes it simple to manage them, giving plenty of room for caching and the CDN to take charge and reduce our costs.

That is, after all, the whole point of this series of posts.

time to read 9 min | 1673 words

In my previous post, we looked at the process of handling a request for a batch of posts and getting their results. The code made no assumptions about where it is running and, aside from specifying whether or not it is okay to allow caching, did no work in that regard.

Caching is important; it matters a lot for performance. To the point where if you aren’t using caching, you are past willful neglect and into the territory of intentional malpractice. The difference can be between needing 18,500 cores to serve a website and needing less than 400 to serve a much busier one. And in practice, the difference will likely be even more pronounced.

Because it is so important, we need to take it into account at the architecture level. Another aspect we have to consider is data distribution. Assuming we want to build a global social media platform, we have to access it from multiple locations, which means, in turn, that we have to consider the fallacies of distributed computing in our system. Locality of reference is another key factor to take into account, which means that you have to consider the flow of data in the system.

Let’s assume that we have the following datacenters around the world:

[Figure: world map showing the data centers, including Brisbane and Odessa]

We are using geo routing and the relevant infrastructure to make sure that you’ll always hit the nearest data center to you.

Let’s say that we have a Mr. Beat in our social platform, who is very popular and likes to post controversial messages, such as the need to abolish peppers from your menus. Mr. Beat is located in Australia, so when he posts yet another “peppers have no place in the kitchen” post, the data center in Brisbane is going to be the one to field the request.

A system like that does best if we can avoid any and all required coordination between the different data centers; as such, we are going to use gossip to share the data among all the data centers. In other words, the post will be written in the Brisbane data center and then replicated to the rest of the data centers. There are several ways that we can implement such a feature.

The simplest way is to replicate all the data to all the data centers. This way, we can always access it from the local store. However, that presents two challenges:

  • First, there is latency involved in replicating information across the globe. We may get a request in the Odessa data center for a post originating in Brisbane. If the network devils decided to have a party, we may not have that particular post in Odessa yet. What do you do then? This is where the format of the post id comes into play. If we can’t find the post in our local storage, we can figure out who owns it (based on the machine id segment) and go ask the owning data center.
  • The second problem is that in many cases, the data is purely local. Consider a Twitter account with a mostly local audience, for example. For the most part, everyone following it is going to be located nearby. I assume you may have a very small minority of followers who left the area and are still interested in following what is going on, but for the most part, this is not that common. That means that replicating all the information to all the data centers is likely a waste.

Given these two facts together, we are actually better off using a different model for data distribution and caching. The post id is generated using the following mechanism:

[Figure: post id layout – timestamp, machine id and sequence number segments]

The machine segment in the post id (10 bits long) gives us a good indication of where the post was created. It was originally used only for the purpose of generating unique ids, but we can make far greater use of it. We decide that the data center that created a post also owns it. In other words, Mr. Beat’s posts are “owned” by the Brisbane data center. The actual posts are held in a key/value store, but lookups by id behave differently based on that ownership.

Here is the relevant code:
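A sketch of that logic; the context interface and helper names are assumptions:

```typescript
// A hedged reconstruction of the lookup described below; the context
// interface and helper names are assumptions.
interface Post { id: bigint; content: string; }

interface DataCenterContext {
  selfDc: string;
  kv: {
    get(id: bigint): Promise<Post | null | undefined>; // undefined = not present
    set(id: bigint, value: Post | null, opts: { remote: boolean }): Promise<void>;
  };
  dataCenterOf(machineId: number): string;
  fetchFromDc(dc: string, id: bigint): Promise<Post | null>;
}

// The 10-bit machine segment sits above the 14-bit sequence number in the
// Snowflake-style id layout used in this series.
const machineIdOf = (postId: bigint): number =>
  Number((postId >> 14n) & 0x3ffn);

async function getPost(
  postId: bigint,
  ctx: DataCenterContext
): Promise<Post | null> {
  // First, check the local key/value store; this covers posts we own as
  // well as remote posts (and nulls) we have already cached.
  const local = await ctx.kv.get(postId);
  if (local !== undefined) return local;

  // We are the owner and don't have it: it doesn't exist or was removed.
  const ownerDc = ctx.dataCenterOf(machineIdOf(postId));
  if (ownerDc === ctx.selfDc) return null;

  // Otherwise fetch from the owning data center and cache the result,
  // nulls included, so repeated misses don't keep crossing the globe.
  const remote = await ctx.fetchFromDc(ownerDc, postId);
  await ctx.kv.set(postId, remote, { remote: true });
  return remote;
}
```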

The idea is that we first check the key/value store in the current data center for the post id. If it isn’t found, we check whether we are the owner of the post. If we are, the post doesn’t exist or was removed. If it belongs to another data center, we go and fetch it from there. Note that we’ll record a null if needed, so we won’t go fetch a missing value every time it is requested.

In other words, the first time that an item is requested from the owning data center, we place it in our own key/value store for next time. We do so with an indication to the key/value store that this is a remote value. Remote values may be purged early, put on a least-frequently-viewed rotation, etc.

There are other things that we need to deal with, of course. Concurrency, for example: if we have multiple concurrent requests for the same missing id, we’ll have multiple chats between data centers to fetch the same value. I don’t think that this is a problem. It is likely to only happen the first time, and then the value is cached. We might want to monitor how often remote values are requested and only fetch them after a certain number of requests, but in all honesty, it probably doesn’t matter.

The key/value store architecture is already likely to make use of both disk and memory. We can take advantage of the fact that a remote key isn’t important locally and not persist it to disk right away, or not in a durable format. When we need to evict the value from memory, we can check whether it had enough hits to warrant writing it locally or not.

The most complex issue, however, is related to the cache itself. We have a strong ownership model, in which the data center that created a post is its owner. What happens when we need to update a post?

Twitter, for example, doesn’t have an Edit feature. That is a great reduction in the complexity we have to deal with, but it isn’t all that simple. An update to a post can also be a delete. For example, let’s say that Mr. Beat posted: “eating steamed Br0ccoli”. Such a violation of community standards cannot stand, even though Mr. Beat cleverly disguised his broccoli tendencies by typo-ing the forbidden term. Mr. Beat is very popular, and his posts have likely spread across many data centers. An admin marking the post as deleted also has to deal with the possibility that the post is cached in other locations as well.

We can try to keep track of this, but to be fair, it is easier to simply queue a delete command on all the data centers except the owner. That will ensure that they remove the cached version and have to re-read it from the owning data center.

Everything that I described so far was about behavior, but we also ought to talk about policies. As part of the work we do when writing a post, we can apply all sorts of interesting policies. For example, we may know that Mr. Beat is popular globally, and as part of writing a post from him, preemptively send that post to all the other data centers. If we have an update to a post, instead of sending a delete command to the rest of the data centers and letting it refresh automatically, we can send the updated post content immediately.

I intentionally don’t want to dig too deeply into those policies. They are important, but they aren’t on the same level as the infrastructure I describe here. They are like the cherry on top: they can take something good and make it great. But given that those are policies that can be applied on a per-item basis and modified as you go along, there isn’t a reason to go there yet.

Finally, there is a last aspect to discuss: Expiry. 

In most social media, the now is important beyond all else. In other words, you are very unlikely to be seeing posts from two years ago, and if you are, it matters a lot less if you have to deal with slightly higher latency. Expiring remote content that is over three months old, for example, and not placing it in our local key/value store at all can be a great way to handle long-tail issues. For that matter, given that old content is rarely accessed, we can also optimize our storage by compressing old posts instead of holding on to them directly.

And after all of this discussion, I want to point out that you are also likely to want another layer of caching in place. In many cases, the API calls you make are good candidates for at least short-term caching. In other words, if you put them behind something like Cloudflare, given that we explicitly state which post ids we want, we can set a cache duration of 1–3 minutes without needing to worry too much about updates. That can massively reduce the number of requests that we actually have to handle, and it costs a lot less. Under what scenarios would that be useful?

Consider people going to view a popular user, such as Mr. Beat’s page. The list of posts there is going to be the same for everyone, and even a short cache duration would massively help our load.

As you can see, the design of the system assumes caching and actively works to make it possible for you to utilize the cache at multiple levels.

time to read 4 min | 769 words

So far in this series of posts, I looked into how we write posts. The media goes to an S3 compatible API, and the posts themselves go to a key/value store. Reading them back, on the other hand, isn’t that simple. For the media, I’m going to assume that the S3 storage is connected to a CDN and that this is handled, but I want to focus on the manner in which we deal with reading posts. In particular, I’m not talking here about how we display the timeline for a user. That is going to be the subject of another post; right now, I’m assuming that this is handled and talking about the next step. We have a list of post ids that we want to get, and we need to manage that.

The endpoint in question would look like this:

GET /api/v1/read?post=1352410889870336005&post=1351081958063951875

The result of this API is a JSON object with the post ids as keys and the contents of the posts as values.

This simple API is about as simple as you can imagine, but even from this simple scenario you can see a few interesting details:

  • The API is using GET, which means that there is a natural limit to the size of the URL. This is good and by design. We will likely limit this to a maximum of 128 items at a time anyway.
  • The API is inherently about dealing with batches of information.
  • The media is handled separately (the client fetches it directly from the CDN), so we can return far less information.

In many cases, this is going to be a unique set of posts. For example, when you view your timeline, it is likely that you’ll see a set of posts no one else is asking for. However, in many other cases, you’ll send a request that is similar or identical to what others send.

When you are looking at a popular thread, for example, you’ll be asking for the same post ids as everyone else, which means that there is a good chance to easily add caching for this via a CDN or the like and benefit greatly as a result.

Internally, the implementation of this API is probably just going to issue direct reads by id to the key/value store and return the results. There should be a minimal amount of processing involved, usually, except for one factor: authorization.

Assuming that the key/value interface has a get(id) method, the backend code for this critical API should be something like the code below. Note that this is server side code; I’m not showing any client side code in this series of posts. This is the backend code that handles reading a batch of ids from the client.
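A sketch of such a handler; the store interfaces and the response shape are assumptions:

```typescript
// A hedged reconstruction of the batch read handler; store interfaces and
// the response shape are assumptions.
interface Post {
  id: bigint;
  authorId: string;
  protected: boolean; // visible only to accounts the author follows
  content: string;
}

interface Stores {
  kv: { get(id: bigint): Promise<Post | null> };
  follows: { doesFollow(who: string, whom: string): Promise<boolean> };
}

interface ReadResult {
  posts: Record<string, Post>;
  allowCaching: boolean;
}

async function readPosts(
  postIds: bigint[],
  requester: string,
  stores: Stores
): Promise<ReadResult> {
  // One promise per id; we assume the latency to the key/value servers is
  // low enough that a burst of point reads behaves like a near call.
  const fetched = await Promise.all(postIds.map((id) => stores.kv.get(id)));

  const posts: Record<string, Post> = {};
  const followChecks = new Map<string, Promise<boolean>>();
  let allowCaching = true;

  for (const post of fetched) {
    // deleted / missing posts are cleared from the results
    if (post === null) continue;

    if (post.protected) {
      // a personalized response must not be cached downstream
      allowCaching = false;

      // issue at most a single follow check per post author
      if (!followChecks.has(post.authorId)) {
        followChecks.set(
          post.authorId,
          stores.follows.doesFollow(post.authorId, requester)
        );
      }
      if (!(await followChecks.get(post.authorId)!)) continue;
    }

    posts[post.id.toString()] = post;
  }

  return { posts, allowCaching };
}
```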

The code itself assumes that there is nothing to be gained from batch operations on the key/value store itself, mind. That isn’t always the case, but I’ll assume it here. We issue N async promises to the key/value store and wait to get them all back. This assumes that the latency from the API node to the key/value servers is minimal and lets us batch a lot of remote calls into near calls.

The vast majority of the function is dedicated to the auth behavior. A post can be marked as public or protected, and if it is the latter, we need to ensure that only people that the author of the post follows will be able to see it. You’ll note that I’m doing a lot of this work in an async manner. For example, we’ll only issue a single check per post author, and we can safely assume that most posts are public anyway. I’m including the “full” code here to give you an indication of the level of complexity that I would expect to see in the API.

You should also note that we indicate whether we allow caching of the results or not. In the case of a request that includes a protected post, we don’t allow it. But for the most part, we can expect a high percentage of requests to contain only public posts and benefit from caching.

Because we are running in a distributed system, we also have to take into account all sorts of interesting race conditions. For example, you may be trying to read a post that has been removed. We explicitly clear all such null items from the results. Another way to handle that is to replace the content of the post and set a marker flag, but we’ll touch on that in another post.

Finally, the code above doesn’t handle caching or distribution. That is going to be handled both above and below this code. I’ll have a dedicated post around that tomorrow.

time to read 6 min | 1076 words

This design deals with creating what is effectively a Twitter clone, and seeing how we can do that efficiently. A really nice feature of Twitter is that it has just one type of interaction: a tweet. The actual tweet may be a share, a reply, a mention or any number of other things, but those are properties on the post, not a different model entirely. Contrast that with Facebook, on the other hand, where you have Posts and Replies as very distinct items.

As it turns out, that can be utilized quite effectively to build the core foundation of the system with great efficiency. There are two separate sides to a social network, the write and read portions, and the read side is massively bigger than the write side. Twitter currently gets about 6,000 tweets a second, for example, but it has 186 million daily users.

We are going to base the architecture on a few assumptions:

  • We are going to favor reads over writes.
  • Reads’ speed should be a priority at all times.
  • It is fine to take some (finite, small) amount of time to show a post to followers.
  • Favor the users’ experience over actual guarantees.

What this means is that when we write a new post, the process is going to be roughly as follows (a sketch follows the list):

  • Post the new message to a queue and send confirmation to the client.
  • Add the new post to the user’s timeline on the client side directly.
  • Done.
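A minimal sketch of that flow; the queue client and the endpoint shape are assumptions:

```typescript
// A minimal sketch of the accept-post endpoint; the queue client and the
// id generator are assumptions (id generation is covered later in this post).
interface Queue {
  publish(topic: string, message: object): Promise<void>;
}

interface IdGenerator {
  next(): bigint;
}

async function acceptPost(
  req: { userId: string; content: string },
  queue: Queue,
  ids: IdGenerator
): Promise<{ status: number; postId: string }> {
  const postId = ids.next();

  // Enqueueing is the only synchronous work; processing, validation and
  // fan-out all happen later, on the worker side.
  await queue.publish("new-posts", {
    postId: postId.toString(),
    userId: req.userId,
    content: req.content,
    postedAt: Date.now(),
  });

  // The client adds the post to its local timeline view immediately, so
  // the user perceives the write as complete.
  return { status: 202, postId: postId.toString() };
}
```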

A really important detail here: the process of placing an item on the queue is simple, trivial to scale on an effectively infinite basis, and can easily handle huge spikes in load.

On the other hand, the client code showing the user the message in their timeline is usually sufficient for a good user experience.

There is the need to process the post and send it to followers ASAP, but that is “as soon as possible” in human terms. In other words, if it takes 30 seconds or two minutes, it isn’t a big deal.

With just those details, we are pretty much done with the write side. We accepted the post and pretended to the user that we are done processing it, but that is roughly about it. All the rest of the work that we need to do now is to see how we can most easily generate the read portion of things.

There are some other considerations to take into account. We need to deal not just with text but also with images and videos. A pretty core part of the infrastructure is going to be object storage with an S3 compatible API. The S3 API has become an industry standard and is pretty widely supported, which helps reduce the dependency issue. If needed, we can use MinIO, run on Backblaze, etc.

When a user sends a new post, any media elements of the post are stored directly in the S3 storage, and then the post itself is written to a queue. Workers will fetch items from the queue and process them. Such processing may entail things like:

  • Stripping EXIF data from images.
  • Re-encoding videos.
  • Analyzing content for language / issues. For example, we never want to have posts about Broccoli, so we can remove / reject them at this stage.

This is where a lot of the business logic will reside, mind. During the write portion, we have time; this is an asynchronous process that we can afford to let take a while. Scaling workers that read from a queue is a cheap, simple and easy technique, after all. That means that we can afford to shift most of the required work to this part of the process.

For example, maybe a user posted a reply to a message that only allows replies from users mentioned in the post? That sort of thing.

Once processed, we end up with the following architecture:

[Figure: write path – media in S3 compatible storage, posts flowing through a queue to workers and into the key/value store]

The keys for each post are numeric (this will be important later). We can generate them using the Snowflake method:

[Figure: Snowflake-style id layout – 40-bit timestamp, 10-bit machine id, 14-bit sequence number]

In other words, we use 40 bits with 16 millisecond precision for the time, 10 bits (1,024 values) for the machine id and 14 bits (16,384 values) for the sequence number. The 16 ms precision is already roughly the granularity that you can expect from most computer clocks, so we aren’t actually losing much by giving the extra precision up, and it means that we don’t really have to think about it. A single instance can generate 16K ids every 16 ms, or about a million ids per second. More than enough for our needs.
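A sketch of a generator following that layout (overflow within a single tick is elided here):

```typescript
// A sketch of the id layout just described: 40 bits of time at 16 ms
// precision, 10 bits of machine id, 14 bits of sequence number.
class SnowflakeIds {
  private sequence = 0n;
  private lastTick = 0n;

  constructor(private readonly machineId: bigint) {
    if (machineId < 0n || machineId > 1023n) {
      throw new Error("machine id must fit in 10 bits");
    }
  }

  next(nowMs: number = Date.now()): bigint {
    const tick = BigInt(nowMs) >> 4n; // divide by 16 => 16 ms precision

    if (tick === this.lastTick) {
      this.sequence = (this.sequence + 1n) & 16383n; // 14-bit sequence
    } else {
      this.lastTick = tick;
      this.sequence = 0n;
    }

    // [ 40 bits tick ][ 10 bits machine id ][ 14 bits sequence ]
    return (tick << 24n) | (this.machineId << 14n) | this.sequence;
  }
}
```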

The key thing about those ids is that they are going to be roughly sorted, which will be very nice to use later on. When accepting a post, we’ll generate an id for it and then place the post in the key/value store under that id. All other work from that point on is about working with those ids, but we’ll discuss that in more detail when we talk about timelines.

For now, I think that this post gives a good (and intentionally partial) view of how I expect to handle a new write:

  • Upload any media to S3 compatible API.
  • Generate a new ID for the post.
  • Run whatever processing you need for the post and the media.
  • Write the post to the key/value store under the relevant id.
    • This also includes the appropriate references for the parent post, any associated media, etc.
  • Publish to the appropriate timelines. (I’ll discuss this in a future post)

I’m using the term key/value store generically here, because we’ll do a lookup per id and find the relevant JSON for the post. Such systems can scale pretty much linearly with very little work. Given that we use roughly time based ids and given the time based nature of most social interactions, we can usually move most posts to archive mode in a very natural way. But that would be a separate optimization step that I don’t think would actually be relevant at this point. It is good to have such options, though.

And that is pretty much it for writes. There are probably pieces here that I’m missing, but I expect that they are related to the business processing that you’ll want to do on the posts, not the actual infrastructure. In my next post, I’ll deal with the other side: how do we actually read a post? Given the difference in scale, I think that this is a much more interesting scenario.

time to read 5 min | 832 words

Following the discussion a few days ago, I thought that I would share my high level architecture for building a social media platform in a way that would make sense. In other words, building software that is performant, efficient and doesn’t waste multiples of your yearly budget on unnecessary hardware.

It turns out that 12 years ago, I wrote a post that discusses how I would re-architect Twitter. At the time, the Fail Whale made repeated appearances several times a week and Twitter couldn’t really handle its load. I think that a lot of the things that I wrote then are still applicable and would probably allow you to scale your system without breaking the bank. That said, I would like to think that I have learned a lot since that time, so it is worth re-visiting the topic.

Let’s outline the scenario. In terms of features, we are talking about cloning the key parts of Twitter. Core features include:

  • Tweets
  • Replies
  • Mentions
  • Tags

Such an application does quite a lot at the frontend, which I’m not going to touch. I’m focusing solely on the backend processing here. There are also a lot of other things that we’ll likely need to deal with (metrics, analytics, etc.), which are separate and not that interesting. They can be handled via existing analytics platforms and don’t require specialized behavior.

One of the best parts of a social media platform is that by its very nature, it is eventually consistent. It doesn’t matter if I post a tweet and you see it now or in 5 seconds. That gives us a huge amount of flexibility in how we can implement this system efficiently.

Let’s talk about the numbers I can easily find. There is a problem with those stats, however: a lot of them are old, some of them very old, nearly a decade!

Given that I’m writing this blog to myself (and you, my dear reader), I’m going to make some assumptions so we can move forward:

  • 50 million users, but we’ll assume that they are more engaged than the usual group.
  • Out of which 50% (25,000,000) would post in a given month.
  • 80% of the users post < 5 posts a month. That means 20 million users that post very rarely.
  • 20% of the users (5 million or so) post more frequently, with a maximum of around 300 posts a month.
  • 1% of the more active users (50,000) post even more frequently, to the tune of a couple of hundred posts a day.

Checking my math, that means that:

  • 50,000 highly active users with 150 posts a day for a total of 225 million posts.
  • 5 million active users with 300 posts a month for another 1.5 billion posts.
  • 20 million other users with 5 posts a month, giving us another 100 million posts.

Total monthly posts in this case would be:

  • 1.745 billion posts a month.
  • 2.4 million posts an hour.
  • 670 posts a second.

That assumes a constant load on the system, which is probably not correct. For example, the 2016 Super Bowl saw a record of 152,000 tweets per minute, with close to 17 million tweets posted during the game.

What this means is that the load is highly variable. We may see anywhere from low hundreds of posts per second up to thousands. Note that 152,000 posts per minute is “just” 2,533 posts per second, which sounds a lot less scary, even if it means the same thing.

We’ll start by stating that we want to process 2,500 posts per second as the current maximum acceptable target.

One very important factor that we have to understand is what exactly we mean by “processing” a post. For our purposes, that means recording the act of the post and doing so within an acceptable time frame; we’ll call that 200 ms latency at the 99.99th percentile.

I’m going to focus on text-only mode, because for binaries (pictures and movies) the solution is to throw the data on a CDN and link to it; nothing more really needs to be done. Most CDNs will already handle things like re-encoding, formatting, etc., so that isn’t something that you need to worry about to start with.

Now that I have laid the groundwork, we can start thinking about how to actually process this. That is going to be handled in a few separate pieces: first, how we accept the post and process it, and then how we distribute it to all the followers. I’ll start dissecting those issues in my next post.
