Reliable paging through a data set while it is being modified
What happens when you want to to page through result sets of a query while the underlying data set is being constantly modified?
This seems like a tough problem, and one that you wouldn’t expect to encounter very often, right? But a really good example of just this issue is the notion of a feed in a social network. To take Twitter for simplicity, you have many people generating twits, and other users are browsing through their timeline.
What ends up happening is that the user browsing through their timeline is actually trying to page through the results, but at the same time, you are going to get more updates to the timeline while you are reading it. One of the key requirements that we have here, then, is that we want to be sure that we aren’t actually creating a lot of jitter for the user as they scroll through the timeline. Luckily, because this is a hard problem, the users are already quite familiar with some of the side affects. It would surprise no one to see the same twit multiple times in the timeline. It can be because of a retweet or a like by a user you follow, or it can be a result of the way paging is done.
Now that we understand what we want to achieve, let’s see how you can try getting there. The simplest way to handle this is to ask the database for some sort of a stable reference for the query. So instead of executing the query and being done with it, you’ll have the server maintain it for a period of time and send you the pages as you need them. This is simple, easy to implement and costly in terms of system resources. You’ll need to keep the query results in memory for each one of your users, and that can be quite a lot of memory to keep around just in case. Especially given the different between human’s interaction times and the speed of modern systems.
Another way to do that is to ask the database engine to generate a way to re-create the query as it was at that time. This is sometimes called a continuation token or some such. That works great, usually, but come with its own complications. For example, imagine that I’m doing this on the following query:
from Users order by LastName limit 5
Which gives us the following result:
And I got the first five users, and now I want to get the next five. Between the first and second query, a user whose last name is “Aardvark” was inserted in the system. At this point, what would you expect to get from the query? We have two choices here, as you can see below:
The problem is that from my perspective, both of those has serious issues. To start with, to compute the results shown in orange, you’ll need to jump through some serious hoops on the backend, and the result looks strange. To get the results in green is quite easy, but it will mean that you missed out on seeing Aardvark.
You might have noticed that the key issue here isn’t so much with the way we build the query, but the order in which we need to traverse it. We ask to sort it by last name, but we also want to get results as they come by. As it turns out, the process becomes a whole lot simpler if we unify these concepts. If we issue the following query, for example, our complexity threshold drops by a lot.
from Messages order by CreatedAt desc limit 5
This is because we now get the following results:
And paging through this now is pretty easy, if we want to page down, we can now issue the query to get the next page of data:
from Messages order by CreatedAt desc where CreatedAt < 217 limit 5
By making sure that we are paging and filtering on the same property, we can easily scroll through the results without having to do too much work, either in the application or in our database backend. We can also query if there are new stuff that were missed by checking CreatedAt > 222, of course.
But there is one wrinkle here. I intentionally used the CreateAt field but put numeric values there. Did you notice that there was no 220 value? That one was created on an isolated node and didn’t arrive yet. When it will show up in the local database, we’ll need to decide if we’ll give it a new value (making sure it will show up in the timeline) or store it as is, meaning that it might get lost.
These type of questions are probably more relevant at the business level. You might want to apply different behaviors based on how many likes a twit has, for example.
Another option is to have an UpdatedAt field as well, which can allow you to quickly answer the question: “What items in the range I scan has changed?”. This method also allows for simpler model for handling updates, but much of that depends on the kind of behavior you want to get. This method handles updates, including updates to the parts seen and unseen in a reasonable way and predictable cost.
Comments
Not sure if I got it right: You propose to change the sorting from "LastName" to "CreatedAt". That might be nice from a technical point of view. But as a user I want to get e.g. my phonebook in alphabetical order - not sorted by modify date or so.
Tom, I do not expect your contacts to change that often, do you? Sorting them by name is reasonable. Sorting Tweets or Reddit Posts by Name, Author etc. is not reasonable - the volume of posts ist just too high.
TomCollins, That isn't really the same scenario, though. In the one case, you are not really interested in also seeing updates. Your phone book doesn't receive too many new contacts as you browse through them. Consider looking at the reddit front page, for example. You don't look at things by any order but time.
Comment preview