Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 634 words

Still going on with the discussion on how to handle failures in a sharded cluster, we are back to the question of how to handle the scenario of one node in a cluster going down. The question is, what should be the system behavior in such a scenario.

In my previous post, I discussed one alternative option:

ShardingStatus status;
va recentPosts = session.Query<Post>()
          .ShardingStatus( out status )
          .OrderByDescending(x=>x.PublishedAt)
          .Take(20)
          .ToList();

I said that I really don’t like this option. But deferred the discussion on exactly why.

Basically, the entire problem boils down to a very simple fact, manual memory management doesn’t work.

Huh? What is the relation between handling failures in a cluster to manual memory management? Oren, did you get your wires crossed again and continued a different blog post altogether?

Well, no. It is the same basic principle. Requiring users to add a specific handler for this result in several scenarios, none of them ideal.

First, what happen if we don’t specify this? We are back to the “ignore & swallow the error” or “throw and kill the entire system”.

Let us assume that we go with the first option, the developer has a way to get the error if they want it, but if they don’t care, we will just ignore this error. The problem with this approach is that it is entirely certain that developers will not add this, at least, not before the first time we have a node fail in production and the system will simply ignore this and show the wrong results.

The other option, throw an exception if the user didn’t ask for the sharding status and we have a failing node, is arguably worse. We now have a ticking time bomb. If a node goes down, the entire system will go down. The reason that I say that this is worse than the previous option is that the natural inclination of most developers is to simply stick the ShardingStatus() there and “fix” the problem. Of course, this is basically the same as the first option, but this time, the API actually let the user down the wrong path.

Second, this is forcing a local solution on a global problem. We are trying to force the user to handle errors at a place where the only thing that they care about is the actual business problem.

Third, this alternative doesn’t handle scenarios where we are doing other things, like loading by id. How would you get the ShardingStatus from this call?

session.Load<Post>("tri/posts/1");

Anything that you come up with is likely to introduce additional complexity and make things much harder to work with.

As I said, I intensely dislike this option. A much better alternative exists, and I’ll discuss this in the next post…

time to read 2 min | 276 words

Still going on with the discussion on how to handle failures in a sharded cluster, we are back to the question of how to handle the scenario of one node in a cluster going down. The question is, what should be the system behavior in such a scenario.

In the previous post, we discussed the option of simply ignoring the failure, and the option of simply failing entirely. Both options are unpalatable, because we either transparently hide some data from the user (which reside on the failing node) or we take the entire system down when a single node is down.

Another option that was suggested in the mailing list is to actually expose this to the user, like so:

ShardingStatus status;
va recentPosts = session.Query<Post>()
          .ShardingStatus( out status )
          .OrderByDescending(x=>x.PublishedAt)
          .Take(20)
          .ToList();

This will give us the status information about potentially failing shards.

I intensely dislike this option, and I’ll discuss the reasons why on the next post. In the meantime, I would like to hear your opinion about this API choice.

time to read 2 min | 246 words

In my previous post, I discussed how to handle partial failure in sharded cluster scenario. This is particularly hard because this is the case where one node out of a 3 nodes (or more) cluster is failing, and it is entirely likely that we can give the user at least partial service properly.

The most obvious, and probably easiest option, is to simply catch and log the error for the failing server and not notify the calling application code about this. The nice thing about this option is that if you have a failing server, you don’t have your entire system goes down, and can handle this at least partially.

The sad part about this option is that there really is a good chance that you won’t notice that some part of the system is down, and that you are actually returning only partial results. That can lead to some really nasty scenarios, such as the case where we “lose” an order, or a payment, and we don’t show this to the user.

That can lead to some really frustrating scenarios where a customer is saying “but I see the payment right here” and the help desk says “I am sorry, but I don’t see it, therefor you don’t get any service, have a good day, bye”.

Still… that is probably the best case scenario considering the alternative being the entire system being unavailable if any single node is down.

Or is it… ?

time to read 1 min | 187 words

An interesting question came up recently. How do we want to handle sharding failures?

For example, let us say that I have a 3 nodes clusters of RavenDB, serving posts for a blog (just to give some random example). The way the sharding has been setup, we are doing sharding using Round Robin based on posts (so each post goes to a different machine, and anything related to post goes to the same node as the post). Here is how it can be set:

image

Now, we want to display the main page, and we would like to show the most recent posts. We can do this using the following code:

image

The question is, what would happen if the second server if offline?

I’ll give several alternative in the next few posts.

time to read 2 min | 376 words

We got several questions about this in the mailing list, so I thought that this would be a good time to discuss this in the blog.

One of the best part about RavenDB is that we are able to deliver quickly and continuously.  That means that we can deliver changes to the users very rapidly, often resulting in response times of less than an hour from “I have an issue” to “it is already fixed and you can download it”.

That is awesome on a lot of level, but it lack something very important, stability. In other words, by pushing things so rapidly, we are living on the bleeding edge. Which is great, except that you tend to bleed.

That is why we split the RavenDB release process into Unstable and Stable. Unstable builds are released on a “several times a day” basis, and only require that we will pass our internal test suite. This test suite is hefty, over 1,500 tests so far, but it is something that can be run in about 15 minutes or so on the developer machine to make sure that our changes didn’t break anything.

The release process for Stable version is much more involved. First, of course, we run the standard suite of tests. Then we have a separate set of tests, which are stress testing RavenDB by trying to see if there are any concurrency issues.

Next, we take the current bits and push them to our own internal production systems. For example, at the time of this writing, this blog (and all of our assets) are currently running on RavenDB build 726 and have been running that way for a few days. This allows us to test several things. That there are no breaking changes, that this build can survive running in production over extended period of time and that the overall performance remains excellent.

Finally, we ask users to take those builds for a spin, and they are usually far more rough on RavenDB than we are.

After all of that, we move to a set of performance tests, comparing the system behavior on a wide range of operations compared to the old version.

And then… we can do a stable release push. Phew!

time to read 1 min | 134 words

After several months in public beta, I am proud to announce that RavenHQ, the RavenDB as a Service  on the cloud has dropped the beta label and is now ready for full production use.

That means that we now accept signups from the general public, you no longer need an AppHarbor account and you can use it directly. It also means that you can safely start using RavenHQ for production purposes.

RavenHQ is a fully-managed cloud of RavenDB servers and scalable plans, you’ll never have to worry about installation, updates, availability, performance, security or backups again.

We offer both standard and high availability plans, and are the perfect fit for RavenDB users who can safely outsource all the operational support of your databases in the RavenHQ’s team capable hands.

time to read 2 min | 228 words

Recently I updated several of my VS plugins, and immediately after that I noticed a changed in the way VS behaves.

image

This is actually quite nice, and useful. (Although I was a bit confused for a time about when it shows and when it doesn’t, but that is beside the point).

the problem is that in many cases, like the first two example, it is actually quite easy to press on those notifications. And if you do that, you get:

image

And that is annoying. Sure, I get that you want to encourage people to buy your products, I even agree. But this sort of a feature is something that is very easy to invoke by mistake, and it completely throws you out of your current context.

I have VSCommands installed for the simple reason that I really like the Reload All feature. But it isn’t worth it if I have to be careful where I put my mouse in VS.

This single feature is the reason that I uninstalled it.

A bad test

time to read 1 min | 78 words

image

This is a bad test, because what it does is ensuring that something does not works. I just finished implementing the session.Advaned.Defer support, and this test got my attention by failing the build.

Bad test, you should be telling me when I broke something, not when I added new functionality.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}