Reminding you that the end of year coupon has 2 remaining uses.
You can use: 0x21-celebrate-new-year to purchase any of the following with a 21% discount:
This is valid for today only, so hurry up.
Reminding you that the end of year coupon has 2 remaining uses.
You can use: 0x21-celebrate-new-year to purchase any of the following with a 21% discount:
This is valid for today only, so hurry up.
The major portion of the RavenDB core team is co-located in our main office in Israel. That means that we get to do things like work together, throw ideas off one another, etc.
But one of the things that I found stifling in other work places is a primary focus on the current work. I mean, the work is important, but one of the best ways to stagnate is to keep doing the same thing over and over again. Sure, you are doing great and can spin that yarn very well, but there is always that power loom that is going to show up any minute. So we try to do some skill sharpening.
There are a bunch of ways we do that. We try to have a lecture every week (given by one of our devs). The topics so far range from graph theory to distributed gossip algorithms to new frameworks and features that we deal with to the structure of the CPU and memory architecture.
We also have mini debuggatons. I’m not sure if the name fit, but basically, we show a particular problem, split into pairs and try to come up with a root cause and a solution. This post is written while everyone else in the office is busy looking at WinDBG and finding a memory leak issue, in the meantime, I’m posting to the blog and fixing bugs.
We were talking about comics at lunch. And the next day, Tal came to the office with the following.
Nitpicker corner: This is a doodle, not a finished product, yes, we know that there are typos.
I like this enough that we will probably be doing something significant with this. Opinions are welcome.
Do you have a RavenDB as a super hero idea? Or a Dilbert’s style skit for RavenDB?
I’ve had a couple of good days in which I was able to really sit down and code. Got a lot of cool stuff done. One of the minor requirements we had was to send some data over the wire. That always lead to a lot of interesting decisions.
How to actually send it? One way? RPC? TCP? HTTP?
I like using HTTP for those sort of things. There are a lot of reasons for that, it is easy to work with and debug, it is easy to secure, it doesn’t require special permissions or firewall issues.
Mostly, though, it is because of one overriding priority, it is easy to manage in production. There are good logs for that. Because this is my requirement, it means that this applies to the type of HTTP operations that I’m making.
Here is one such example that I’m going to need to send:
public class RequestVoteRequest : BaseMessage
{
public long Term { get; set; }
public string CandidateId { get; set; }
public long LastLogIndex { get; set; }
public long LastLogTerm { get; set; }
public bool TrialOnly { get; set; }
public bool ForcedElection { get; set; }
}
A very simple way to handle that would be to simply POST the data as JSON, and you are done. That would work, but it would have a pretty big issue. It would mean that all requests would look the same over the wire. That would mean that if we are looking at the logs, we’ll only be able to see that stuff happened ,but won’t be able to tell what happened.
So instead of that, I’m going with a better model in which we will make request like this one:
http://localhost:8080/raft/RequestVoteRequest?&Term=19&CandidateId=oren&LastLogIndex=2&LastLogTerm=8&TrialOnly=False&ForcedElection=False&From=eini
That means that if I’m looking at the HTTP logs, it would be obvious both what I’m doing. We will also reply using HTTP status codes, so we should be able to figure out the interaction between servers just by looking at the logs, or better yet, have Fiddler capture the traffic.
Reminding you that the end of year coupon has 7 remaining uses.
You can use: 0x21-celebrate-new-year to purchase any of:
Happy Holidays and a great new years.
The release of 3.0 means that we are free to focus on the greater features that we have in store for vNext, but that doesn’t mean that we aren’t still working on small features that are meant to make your life easier.
In this case, this feature is here to deal with index definition churn. Consider the case where you have a large database, and you want to define an index, but aren’t quite sure exactly what you want. This is especially the case when you are working with the more complex indexes. The problem is that creating the index would mean having to process all the data in your database. That is usually fine, because RavenDB can still operate normally under those conditions, but it also means that you have to wait for this to happen, check the results, make a change to the index, and so on.
Because of the way we are optimizing new index creation, we are trying to do as much as possible, as soon as possible. So by the time we expose partial results for the index, you are already waiting for too long. This isn’t a problem for standard usage, because this behavior reduces the overall time for indexing a new index, but it is a problem when you are developing, because it increase the time for first visible result.
Because of that, we have added this feature:
This will create a temporary test index:
This index is going to operate on a partial data set, instead of all the data, but aside from that, it is identical in all respects to any other index, and you can test it out using all the normal tools we offer. Once you have change the index definition to your liking, you can make the index permanent, which actually goes into processing all documents.
This make rapid index prototyping on a large dataset much easier.
I’m reading a lot, and I thought that I would post a bit about my favorite subjects. I decided to summarize this year with great books that don’t really fall into standard categories, which I really enjoyed.
The AlterWorld – By a Russian author, and with a great background there (how to identify a Russian was great), and are really good. The premise is that you can get stuck in a MMORPG and it is beautifully done. Unlike a fantasy book, the notion of levels, gaining strength and power is really nice. Especially since the hero isn’t actually taking the direct path to that. There is also a lot of interaction with the real world, and in general, this is a fully featured universe that is really good. It looks like there are going to be 3 more books, which is absolutely wonderful from my point of view.
Those books were good enough that I started playing RPGs again, just because it was so much fun reading the status messages in the books. If you know of other books in the same space, I would love to know about it.
NPCs tells the tale from the point of view of Non Player Characters, which is quite interesting and done in a very believable way.
Caverns & Creatures is a series of books (lost count, there are a lot of short stories as well as full length books) that deals with the idea of people getting stuck in RPG world. This one is mostly meant for humor’s sake, I think. And it does get to toilet level humor all too frequently, but it is entertaining.
Waldo Rabbit tells the tale of a guy that really tries to be an evil overload, but his idea of scary beast is a… rabbit. It is a really well written, and I’m looking forward for the next book.
Wizard 2.0 talks about finding proof that the entire world is a computer simulation, and what happens when certain people find out about it. My guess is that this is written by a programmer, because the parts where they talk about software and programming wasn’t made up in whole cloth and didn’t piss me off at all. This is also really good series, and I’m looking forward to reading the 3rd book. I especially liked that there isn’t some big Save The World theme going on, this is just life as you know it, if you are a bunch of pixels.
Velveteen is a “superhero” novel, but a very different one than the usual one. I’m not really sure how to categorize it, but it was a really great read.
Daniel Black’s is a single book series, with a second book, Black Coven set to follow Fimbulwinter. It is a great book, with a very well written background and story. What is more, the hero doesn’t rely on brute force or the author to rescue him when he stupidly gets into trouble, he thinks and plans, and that is quite great to read. I’m eagerly waiting for the next book.
Voron uses a fairly standard journaling system to ensure ACID compliance. Whenever you commit a transaction, we write the changed pages to the log.
Let us take the simplest example, here is a very simple page:
Now we write the following code:
for(var i = 0; i< 3; i++ )
{
using(var tx = env.NewTransaction(TransactionFlags.ReadWrite))
{
var users = tx.ReadTree("users");
users.Add("Users/" + NextUserId(), GetUserData());
tx.Commit();
}
}
After executing this code, the page is going to look like this:
But what is going to happen in the log? We have to commit each transaction separately, so what we end up is the following journal file:
The log contains 3 entries for the same page. Which make sense, because we changed that in 3 different transactions. Now, let us say that we have a crash, and we need to apply the journal. We are going to scan through the journal, and only write Page #23 from tx #43. For the most part, we don’t care for the older version of the page. That is all well and good, until incremental backups comes into play.
The default incremental backup approach used by Voron is simplicity itself. Just create a stable copy of the journal files, and you are pretty much done, there isn’t anything else that you need to do. Restoring involves applying those transactions in the order they appear in the journal, and rainbows and unicorns and peace on Earth.
However, in many cases, the journal files contain a lot of outdated data. I mean, we really don’t care for the state of Page #23 in tx #42. This came into play when we started to consider how we’ll create Raft snapshots on an ongoing basis for large datasets. Just using incremental backup was enough to give us a lot of promise, but it also expose the issue with the size of the journal becoming an issue.
That is when we introduced the notion of a minimal incremental backup. A minimal incremental backup is a lot more complex to create, but it actually restore in the exact same way as a normal incremental backup.
Conceptually, it is a very simple idea. Read the journals, but instead of just copying them as is, scan through them and find the latest version of all the pages that appear in the journal. In this case, we’ll have a Page #23 from tx #43. And then we generate a single transaction for all the latest versions of all the pages in the transactions we read.
We tried it on the following code:
for (int xi = 0; xi < 100; xi++)
{
using (var tx = envToSnapshot.NewTransaction(TransactionFlags.ReadWrite))
{
var tree = envToSnapshot.CreateTree(tx, "test");
for (int i = 0; i < 1000; i++)
{
tree.Add("users/" + i, "john doe/" + i);
}
tx.Commit();
}
}
This is an interesting experiment, because we are making modifications to the same keys (and hence, probably the same pages), multiple times. This also reflects a common scenario in which we have a high rate of updates.
Min incremental backup created an 8Kb file to restore this. While the standard incremental backup created a 67Kb file for the purpose.
That doesn’t sounds much, until you realize that those are compressed files, and the uncompressed sizes were 80Kb for the minimal incremental backup, and 1.57Mb for the incremental backup. Restoring the min incremental backup is a lot more efficient. However, it ain’t all roses.
An incremental backup will result in the ability to replay transactions one at a time. A min incremental backup will merge transactions, so you get the same end results, but you can’t stop midway (for example, to do partial rollback). Taking a min incremental backup is also more expensive, instead of doing primarily file I/O, we have to read the journals, understand them and output the set of pages that we actually care about.
For performance and ease of use, we limit the size of a merged transaction generated by a min incremental backup to about 512 MB. So if you have made changes to over 512MB of your data since the last backup, we’ll still generate a merged view of all the pages, but we’ll actually apply that across multiple transactions. The idea is to avoid trying to apply a very big transaction and consume all resources from the system in this manner.
Note that this feature was developed specifically to enable better behavior when snapshotting state machines for use within the Raft protocol. Because Raft is using snapshots to avoid having an infinite log, and because the Voron journal is effectively the log of all the disk changes made in the system, that was something that we had to do. Otherwise we couldn’t rely on incremental backups for snapshots (we would have just switched the Raft log with the Voron journal, probably we no save in space). That would have forced us to rely on full backups, and we don’t want to take a multi GB backup very often if we can potentially avoid it.
My previous few posts has talked about specific algorithms for gossip protocols, specifically: HyParView and Plumtrees. They dealt with the technical behavior of the system, the process in which we are sending data over the cluster to all the nodes. In this post, I want to talk a bit about what kind of messages we are going to send in such a system.
The obvious one is to try to keep the entire state of the system up to date using gossip. So whenever we make a change, we gossip about it to the entire network, and we are able to get to an eventually consistent system in which all nodes have roughly the same data. There is one problem with that, you now have a lot of nodes with the same data on them. At some point, that stop making sense. Usually gossip is used when you have a large group of servers, and just keep all the data on all the nodes is not a good idea unless your data set is very small.
So you don’t do that. Gossip is usually used to disseminate a small data set, one that can fit comfortably inside a single machine (usually it is a very small data set, a few hundred MB at most). Let us consider a few types of messages that would fit in a gossip setting.
The obvious example is the actual topology of the whole network. A node joining up the cluster will announce its presence, and that will percolate to the entire cluster, eventually. That can allow you to have an idea (note, this isn’t a certainty) about what is the structure of the cluster, and maybe make decisions based on it.
The system wide configuration data is also a good candidate for gossip, for example, you can use gossip as a distributed service locator in the cluster. Whenever a new SMTP server comes online, it announces itself via gossip to the cluster. It is added to the list of SMTP servers in all the nodes that heard about it, and then it get used. In this kind of system, you have to take into account that servers can be down for a long period of time, and miss up on messages. Gossip does not guarantee that the messages will arrive, after all. Oh, it is going to do its best, but you need to also build an anti entropy system. If a server finds that it missed up on too much, it can request one of its peers to send it a full snapshot of the current global state as that peer know it.
Going in the same vein, nodes can gossip about the health state of the network. If I’m trying to send an email via an SMTP server, and it is down, I’m going to try another server, and let the network know that I’ve failed to talk to that particular server. If enough nodes fail to communicate with the server, that become part of the state of the system, so nodes that learned about it can avoid that server for a period of time.
Moving into a different direction, you can also do gossip queries, that can be done by sending a gossip message on the cluster with a specific query to it. A typical example might be “which node has a free 10GB that I can use?”. Such queries typically carry with them a timeout element. You send the query, and any matches are sent back to (either directly or also via gossip). After a predefined timeout, you can assume that you got all the replies that you are going to get, so you can operate on that. More interesting is when you want to query for the actual data held in each node. If we want to find all the users who logged in today, for example.
The problem with doing something like that is that you might have a large result set, and you’ll need to have some way to work with that. You don’t want to send it all to a single destination, and what would you do with it, anyway? For that reason, most of the time gossip queries are actually aggregation. We can use that to get an estimate of certain things in our cluster. If we wanted to get the number of users per country, that would be a good candidate for this, for example. Note that you won’t necessarily get accurate results, if you have failures, so there are aggregation methods for getting a good probability of the actual value.
For fun, here is an interesting exercise. Look at trending topics in a large number of conversations. In this case, whenever you would get a new message, you would analyze the topics for this message, and periodically (every second, let us say), you’ll gossip to your peers about this. In this case, we don’t just blindly pass the gossip between nodes. Instead, we’ll use a slightly different method. Each second, every node will contact its peers to send them the current trending topics in the node. Each time the trending topics change, a version number is incremented. In addition, the node also send its peer the node ids and versions of the messages it got from other nodes. The peer, in reply, will send a confirmation about all the node ids and versions that it has. So the origin node can fill in about any new information that it go, or ask to get updates for information that it doesn’t have.
This reduce the number of updates that flow throughout the cluster, while still maintain an eventually consistent model. We’ll be able to tell, from each node, what are the current trending topics globally.
Unlike the previous two posts, this is going to be short. Primarily because what I wanted to talk about it what impresses me most with both HyParView and Plumtree. The really nice thing about them is that they are pretty simple, easy to understand and produce good results.
But the fun part, and what make it impressive is that they manage to achieve that with a small set of simple rules, and without any attempt to create a global view. They operate just fine with potentially very small set of the data overall, but still manage to operate, self optimize and get to the correct result. In fact, I did some very minor attempts to play with this at large scale, and we see a pretty amazing emergent behavior. Without anyone knowing what is going on globally, we are able to get to the optimal number of interactions in the cluster to distribute information.
That is really pretty cool.
And because this post is too short, I’ll leave you with a question. Given that you have this kind of infrastructure, what would you do with it? What sort of information or operations would you try to send using this way?
And 4 more posts are pending...
There are posts all the way to Feb 17, 2025