Querying your way to madness: the Facebook timeline
Facebook certainly changed the way we do things. That isn't always for the best, as can be seen from the sheer amount of time humanity as a whole spends watching cat videos.
One of the ways that Facebook impacted our professional lives is that a lot of people look at it as a blueprint for how they want their application to be. I am not going to touch on whether that is a good thing or not; suffice to say that this is a well known model that is very easy for a lot of users to grasp.
It is also a pretty hard model to actually design and build. I recently had a call from a friend who was tasked with building a Facebook-like timeline. Like most such tasks, we have the notion of a social network, with people following other people. I assume that this isn't actually YASNP (yet another social network project), but I didn't check too deeply.
The question was how to actually build the timeline. The first thing that most people would try is something like this:
var user = session.Load<User>(userId);
var timelineItems =
    session.Query<Items>()
        .Where(x => x.Source.In(user.Following))
        .OrderByDescending(x => x.PublishedAt)
        .Take(30)
        .ToList();
Now, this looks good, and it would work, as long as you have a small number of users and no one follows a lot of people. And as long as you don’t have a lot of items. And as long as you don’t have to do any additional work. When any of those assumptions is broken… well, welcome to unpleasantville, population: you.
It can’t work. And I don’t care what technology you are using for storage. You can’t create a good solution using queries for something like the timeline.
Nitpicker corner:
- If you have users that are following a LOT of people (and you will have those), you are likely to get into problems with the query.
- The more items you have, the slower this query becomes, since you need to sort them all before you can return results. And you are likely to have a LOT of them.
- You can’t really shard this query nicely or easily.
- You can’t apply additional filtering in any meaningful way.
Let us consider the following scenario. Let us assume that I care about what this Rohit person posts, but I really don’t care about Farmville.
Now, try to imagine doing this sort of thing in a query. For fun, assume that I do care about Farmville updates from some people, but not from all of them. That is what I mean when I say that you want to do meaningful filtering.
You can’t do this using queries. Not in any real way.
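To make that concrete, here is a minimal sketch of the kind of per-user filtering rules involved. This is my own illustration, not code from the post; the FeedSettings shape, the MutedApplications set and the IsMatch signature are all assumptions (the post's own code later calls user.IsMatch(item); the check here is written against raw values only to keep the sketch self-contained):

using System.Collections.Generic;

public class FeedSettings
{
    // Applications (Farmville, for example) this user never wants to see
    public HashSet<string> MutedApplications = new HashSet<string>();

    // Exceptions: "I do want Farmville updates, but only from these friends"
    public Dictionary<string, HashSet<string>> AllowedPostersPerApplication =
        new Dictionary<string, HashSet<string>>();

    public bool IsMatch(string application, string posterId)
    {
        if (MutedApplications.Contains(application) == false)
            return true; // not muted, always show

        HashSet<string> allowedPosters;
        return AllowedPostersPerApplication.TryGetValue(application, out allowedPosters)
               && allowedPosters.Contains(posterId);
    }
}

Expressing rules like these inside the timeline query itself would mean a different WHERE clause per user, which no index can serve efficiently.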
Instead, you have to turn it around. You would do something like this:
var user = session.Load<User>(userId);
var timelineItems = session.Query<TimeLineItems>()
    .Where(x => x.ForUser == userId)
    .OrderBy(x => x.Date)
    .ToList();
Note how we structure this. There is a set of TimeLineItems objects, which store a bit of information about a set of items. Usually we would have one per user per day. Something like:
- users/123/timeline/2013-03-12
- users/123/timeline/2013-03-13
- users/123/timeline/2013-03-14
That means that we get well scoped values: we only need to work with a single, small set of items. This is easily sharded, and because the ids are well known, we can also just load the documents by id instead of querying for them.
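Because the ids are predictable, the read side can skip querying altogether. A minimal sketch, assuming the id format above (so userId looks like "users/123") and RavenDB's multi-load overload of session.Load:

// Build the well known ids for the last week and load them directly.
var ids = Enumerable.Range(0, 7)
    .Select(daysAgo => string.Format("{0}/timeline/{1:yyyy-MM-dd}",
        userId, DateTime.UtcNow.AddDays(-daysAgo)))
    .ToArray();

TimeLineItems[] lastWeek = session.Load<TimeLineItems>(ids);

One round trip, no index, no sorting of the entire items collection.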
Of course, that means that you have to have something that builds those timeline documents. That is usually an async process that runs whenever a user updates something. It goes something like this:
public void UpdateFollowers(string itemId)
{
    var item = session.Include<Item>(x => x.UserId)
        .Load(itemId);

    var user = session.Load<User>(item.UserId);

    // user.Followers is a list of documents holding batches of followers.
    // We assume that we might have a LOT of them, so we use this technique
    // to avoid loading them all into memory at once:
    // http://ayende.com/blog/96257/document-based-modeling-auctions-bids
    foreach (var followersDocId in user.Followers)
    {
        NotifyFollowers(followersDocId, item);
    }
}

public void NotifyFollowers(string followersDocId, Item item)
{
    var followers = session.Include<FollowersCollection>(x => x.Followers)
        .Load(followersDocId);

    foreach (var follower in followers.Followers)
    {
        var user = session.Load<User>(follower);
        if (user.IsMatch(item) == false)
            continue;
        AddToTimeLine(follower, item);
    }
}
As you can see, we are batching the operation: we load the followers in batches and, based on their settings, decide whether or not to add the item to their timeline.
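The AddToTimeLine helper is not shown in the post. Here is a minimal sketch of what it might look like, assuming the one-document-per-user-per-day layout from above; the TimeLineItems and TimeLineEntry shapes, and the item.Id / item.Source properties, are my guesses rather than the author's code:

public void AddToTimeLine(string followerId, Item item)
{
    // Well known id, one timeline document per user per day,
    // assuming followerId looks like "users/123".
    var timelineId = string.Format("{0}/timeline/{1:yyyy-MM-dd}",
        followerId, item.PublishedAt);

    var timeline = session.Load<TimeLineItems>(timelineId);
    if (timeline == null)
    {
        timeline = new TimeLineItems
        {
            ForUser = followerId,
            Date = item.PublishedAt.Date,
            Items = new List<TimeLineEntry>()
        };
        session.Store(timeline, timelineId);
    }

    // Store just enough to render the feed; the full item can still be loaded by its id.
    timeline.Items.Add(new TimeLineEntry
    {
        ItemId = item.Id,
        Source = item.Source,
        PublishedAt = item.PublishedAt
    });
}

One nice property of this layout is that a write only ever touches a single user's single day, so the documents stay small and contention stays local.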
Note that this has a lot of implications. Different people will see the item show up in their timeline at different times (though usually very close to one another). Your memory usage is kept low, because you are only processing some of the followers at any given point in time. For users with a LOT of followers, and there will be some like those, you might want to build special code paths, but this should be fine even at its current stage.
What about post factum operations? Let us say that I want to start following a new user. This requires special treatment: you would have to read the latest timeline items from the newly followed user and start merging them into the existing timeline. Likewise when you need to remove someone, or when you add a new filter.
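For illustration only, here is a rough sketch of that back-fill when someone starts following a new user. The method name, the choice of Item.Source as the property to query on (taken from the shape of the first query in the post), and the reuse of IsMatch and AddToTimeLine are my assumptions:

public void StartFollowing(string userId, string userIdToFollow)
{
    var user = session.Load<User>(userId);
    user.Following.Add(userIdToFollow);

    // Back-fill: take the newest items from the newly followed user and
    // merge them into the follower's existing timeline documents.
    var recentItems = session.Query<Item>()
        .Where(x => x.Source == userIdToFollow)
        .OrderByDescending(x => x.PublishedAt)
        .Take(30)
        .ToList();

    foreach (var item in recentItems)
    {
        if (user.IsMatch(item) == false)
            continue;
        AddToTimeLine(userId, item);
    }
}

Unfollowing and filter changes work the same way, only in reverse: you go back over the existing timeline documents and remove or re-evaluate entries.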
It is a lot more work than just changing the query, sure. But you can get things done this way. And you cannot get anywhere with the query only approach.
Comments
Ha. I've just been helping build an app with a Raven backend that has a timeline on it.
Hit lots of similar questions. Thankfully the number of 'followers' is limited, so mostly it was down to volume of 'things' on the timeline.
Had a fractional load for the timeline and an index that did the heavy lifting of items with a subject User and a 'context user' - i.e. the subject of the thing and the person who was looking at it. Worked quite well. All items returned implemented an interface to translate them into timeline items for display.
Lots of room for improvement, but interesting all the same.
So in essence you are applying CQRS. Not CQRS as in the full-fledged eventing architecture, but CQRS as in thinking about which responsibilities to put on the query side and which on the command side.
@ayende @adrian hall When I worked for a popular social website in the Netherlands we created strikingly similar constructs. Querying would just kill SQL Server. Instead we created 'events' that were propagated to the right persons. At display time we would also filter events based on a combination of the security settings of the owners and the viewers. Indeed, none of this could have been done efficiently at query time alone.
// Ryan
Tackling the same problem at the moment and came to similar conclusions. I opted for fanout-on-write to well known collections based on the auction and bids model. To make it easier, I generalized the auction-and-bids handling so I have better code reuse across various interactions. As Oren said, this simplifies things because we end up with well known document ids we can grab with Load.
Even so, this gets interesting. For example, for a post, we can have likes, shares, comments, and follows (stored separately, A & B style). When displaying posts, we fetch random samples of each (for "you, Rohit, Sam and 5 others liked this post", etc.). We also need to determine the user's relationship to people interacting with the post, so they can be offered the opportunity to follow the other users if desired.
A few things make this easier for us: 1. SignalR, so that we eliminate polling on updates. 2. An async CQS-ish architecture which allows us to do some intelligent caching with invalidation. For instance, when a post is added, we queue an update to the timelines of all the user's followers. The post id is pushed to an A & B collection "user/rohit/timeline", and also to a capped list in Redis. The post body is preprocessed and cached, then a call to SignalR is queued to handle the real-time updates.
Still, the users need to query the timeline. The ids of the first few hundred items are cached per user, so virtual scrolling is smooth for the first few pages. Timeline queries are accomplished by simply hitting the "user/rohit/timeline" collection.
For me there are still questions about performance, because I tend to prefer reading from indexes where possible, so I flatten the documents related to interactions in M/R. My fear is the amount of disk IO this will incur, because these collections tend to be "hot".
You could use a Reactive Programming library*, which lets you define LINQ queries against ObservableCollections that self-update the result set as items in the source are changed.
*e.g. https://github.com/wasabii/OLinq
This problem looks like it would be much better handled by modeling timelines as streams.
I don't see what the issue is with the query approach. Shouldn't a sorted secondary index resolve any query time issues and rapidly give you a sorted timeline?
David, Try it with the filtering example I just gave...
There's actually a presentation from Facebook on this matter: Megafeed (broadcasting writes to your friends) or Multifeed (multi-fetch and aggregate stories at read time).
Here is the link to the presentation:
http://www.infoq.com/presentations/Facebook-News-Feed
Here are the slides:
http://qconnewyork.com/2012/dl/qcon-newyork-2012/slides/facebook%20news-feed%20qcon%202012.pdf
@GregYoung, I was waiting for your streaming opinion ;-)
Ayende, the method you gave to accomplish this requires local filtering after getting the data from the database. How is that better than doing a query and then filtering locally?
David, It happens once, instead of on every query.
Oh I see, so you're saying to do this at storage time and not retrieval time. What happens though when a user changes their settings and now is interested in Farmville? Now you have to go through months/years of data to recreate their feed.
David, Yes, you do. Still better than anything else.
I've just finished working on Yet Another Social Network Project recently and I faced the exact issue described here.
We ended up coming up with something halfway between the pure query technique and Ayende's solution. As a two-developer team with very limited time, we couldn't build out a solution to deal with every post-factum change (e.g. following a new user and back-filling history, or updating all historic posts in everyone's timeline when a post is modified).
For anyone interested, the entire project is open source, so you can get an idea about how we tackled it: http://github.com/bowerbird/bowerbird-web