Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 437 words

This is bloody strange. I have a test failing in our unit tests, which isn’t an all too uncommon occurrence after a big work. The only problem is that this test shouldn’t fail, no one touched this part.

For reference, here is the commit where this is failing. You can reproduce this by running the Raven.Tryouts console project.

Note that it has to be done in Release mode. When that happens, we consistently get the following error:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.
at Raven.Client.Connection.MultiGetOperation.<TryResolveConflictOrCreateConcurrencyException>d__b.MoveNext() in c:\Work\ravendb\Raven.Client.Lightweight\Connection\MultiGetOperation.cs:line 156
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)

Here is the problem with this stack trace:

image

Now, this only happens in release mode, but it happens there consistently. Now, I’ve verified that this isn’t an issue of me running old version of the code. So this isn’t possible. There is no concurrency going on, all the data this method is touching is only touched by this thread.

What is more, the exception is not thrown from inside the foreach loop in line 139. I’ve verified that by putting a try catch around the inside of the loop and still getting the NRE when thrown outside it. In fact, I have tried to figure it out in pretty much any way I can. Attaching a debugger make the problem disappear, as does disabling optimization, or anything like that.

In fact, at this point I’m going to throw up my hands in disgust, because this is not something that I can figure out. Here is why, this is my “fix” for this issue. I replaced the following failing code:

image

With the following passing code:

image

And yes, this should make zero difference to the actual behavior, but it does. I’m suspecting a JIT issue.

time to read 4 min | 800 words

We are pretty much done with RavenDB 3.0, we are waiting for fixes to internal apps we use to process orders and support customers, and then we can actually make a release. In the meantime, that means that we need to start looking beyond the 3.0 release. We had a few people internally focus on post 3.0 work for the past few months, and we have a rough outline for what we done there. Primarily we are talking about better distribution and storage models.

Storage models – the polyglot database

Under this umbrella we put dedicated database engines to support specific needs. We are talking about distributed counters (high scale out, rapid throughput), time series and event store as the primary areas that we are focused on. For example, the counters stuff is pretty much complete, but we didn’t have time to actually make that into a fully mature product.

I talked about this several times in the past, so I’ll not get into too many details here.

Distribution models

We have been working on a Raft implementation for the past few months, and it is now in the stage where we are starting to integrate it into the rest of our software. Raft is planned to be the core replication protocol for the time series and events databases. But you are probably going to see if first as topology super layer for RavenDB and RavenFS.

Distributed topology management

Replication support in RavenDB and RavenFS follow the multi master system. You can write to any node, and your write will be distributed by the server to all the nodes. This has several advantages, in particular, the fact that we can operate in disconnected or partially disconnected manner, and that we need little coordination between clients to get everything working. It also has the disadvantage of allow conflicts. In fact, if you are writing to multiple replicating nodes, and aren’t careful about how you are splitting writes, you are pretty much guaranteed to have conflicts. We have repeatedly heard that this is both a good thing and something that customers really don’t want to deal with.

It is a good thing because we don’t have data loss, it is a bad thing because if you aren’t ready to handle this, some of your data is inaccessible because of the conflict until it is resolved.

Because of that, we are considering implementing a server side topology management system. The actual replication mechanics are going to remain the same. But the difference is how we are going to decide how to work with it.

A cluster (in this case, a set of RavenDB servers and all databases and file systems on them)  is composed of cooperating nodes. The cluster is managed via Raft, which is used to store the topology information of the cluster. Topology include each of the nodes in the system, as well as any of the databases and file systems on the cluster. The cluster will select a leader, and that leader will also be the primary node for writes for all databases. In other words, assume we have a 3 node cluster, and 5 databases in the cluster. All the databases are replicated to all three nodes, and a single node is going to serve as the write primary for all operations.

During normal operations, clients will query any server for the replication topology (and cache that) every 5 minutes or so. If a node is down, we’ll switch over to an alternative node. If the leader is down, we’ll query all other nodes to try to find out who the new leader is, then continue using that leader’s topology from now on. This give us the advantage that a down server cause clients to switch over and stay switched. That avoid an operational hazard when you bring a down node back up again.

Clients will include the topology version they have in all communication with the server. If the topology version doesn’t match, the server will return an error, and the client will query all nodes it knows about to find the current topology version. It will always chose the latest topology version, and continue from there.

Note that there are still a chance for conflicts, a leader may become disconnected from the network, but not be aware of that, and accept writes to the database. Another node will take over as the cluster leader and clients will start writing to it. There is a gap where a conflict can occur, but it is pretty small one, and we have good mechanisms to deal with conflicts, anyway.

We are also thinking about exposing a system similar to the topology for clients directly. Basically, a small distributed and consistent key/value store. Mostly meant for configuration.

Thoughts?

time to read 1 min | 105 words

In Oredev, beside sitting in a booth and demoing why RavenDB is cool for about one trillion times, I also gave a talk. I intended it to be a demo packed 60 minutes, but then I realized that I only have 40 minutes for the entire thing.

The only thing to do was to speed things out, I think I breathed twice throughout the entire presentation. And I think it went great.

RAVENDB: WOW! FEATURES - THE THINGS THAT YOU DIDN'T KNOW THAT YOUR DATABASE CAN DO FOR YOU from Øredev Conference on Vimeo.

RavenDB 3.0 RTM!

time to read 1 min | 135 words

RavenDB 3.0 is out and about!

RavenDB

It is available on out downloads page and on Nuget. You can read all about what is new with RavenDB 3.0 here.

This is a stable release, fully supported. It is the culmination of over a year and a half of work, a very large team and enough improvements to make you dance a jig.

You can play with the new version here, and all of our systems has been running on 3.0 for a while now, of course.

And with that, I’m exhausted, thrilled and very excited. Have fun playing with 3.0, and check by tomorrow to see some of the cool Wow features.

Open-mouthed smile

time to read 4 min | 621 words

So far we tackled the idea of large compute cluster, and a large storage cluster. I mentioned that the problem with the large storage cluster is that it doesn’t handle consistency within itself. Two concurrent requests can hit two storage nodes and make concurrent operations that aren’t synchronized between themselves. That is usually a good thing, since that is what you want for high throughput system. The less coordination you can get away with, the more you can actually do.

So far, so good, but that isn’t always suitable. Let us consider a case where we need to have a consistent approach, for some business reason. The typical example would be transactions in a bank, but I hate this example, because in the real world banks deal with inconsistency all the time, this is an explicit part of their business model. Let us talk about auctions and bids, instead. We have an auction service, which allow us to run a large number of auctions.

For each auction, users can place bids, and it is important for us that bids are always processed sequentially per auction, because we have to know who place a bid that is immediately rejected ($1 commission) or a wining bid that was later overbid (no commission except for the actual winner). We’ll leave aside the fact that this is something that we can absolutely figure out from the auction history and say that we need to have it immediate and consistent. How do we go about doing this?

Remember, we have enough load on the system that we are running a cluster with a hundred nodes in it. The rough topology is still this:

image

We have the consensus cluster, which decide on the network topology. In other words, it decide which set of servers is responsible for which auction. What happens next is where it gets interesting.

Instead of just a set of cooperating nodes that share the data between them and each of which can accept both reads and writes, we are going to twist things a bit. Each set of servers is their own consensus cluster for that particular auction. In other words, we first go to the root consensus cluster to get the topology information, then we add another command to the auction’s log. That command go through the same distributed consensus algorithm between the three nodes. The overall cluster is composed of many consensus clusters for each auction.

This means that we have a fully consistent set of operations across the entire cluster, even in the presence of failure. Which is quite nice. The problem here is that you have to have a good way to distinguish between the different consensuses. In this case, an auction is the key per consensus, but it isn’t always so each to make such distinction, and it is important that an auction cannot grow large enough to overwhelm the set of servers that it is actually using. In those cases, you can’t really do much beyond relax the constraints and go in a different manner.

For optimization purposes, you usually don’t run an independent consensus for each of the auctions. Or rather, you do, but you make sure that you’ll share the same communication resources, so for auctions/123 the nodes are D,E,U with E being the leader, while for auctions/321 the nodes are also D,E,U but U is the leader. This gives you the ability to spread processing power among the cluster, and the communication channels (TCP connections, for example) are shared between both auctions consensuses. 

time to read 1 min | 100 words

Barring anything major, we’ll be releasing RavenDB 3.0 in 5 days Smile.

It will be  a stable release and you’re encourage to move to it as soon as it is available, using the Esent database.

The Voron database is still in RC mode (mostly because we’re paranoid and want to have more real world experience before we go full forward with this), but it is going to be fully supported.

Upgrading instances will use Esent, and new databases will default to Esent unless you explicitly select Voron.

time to read 3 min | 464 words

In my previous post, I talked about how we can design a large cluster for compute bound operations. The nice thing about this is that is that the actual amount of shared data that you need is pretty small, and you can just distribute that information among your nodes, then let them do stateless computation on that, and you are done.

A much more common scenario is when can’t just do stateless operations, but need to keep track of what is actually going on. The typical example is a set of users changing data. For example, let us say that we want to keep track of the pages each user visit on our site. (Yes, that is a pretty classic Big Table scenario, I’ll ignore the prior art issue for now). How would we design such a system?

Well, we still have the same considerations. We don’t want a single point of failures, and we want to have very large number of machines and make the most of their resources.

In this case, we are merely going to change the way we look at the data. We still have the following topology:

image

There is the consensus cluster, which is responsible for cluster wide immediately consistent operations. And there are all the other nodes, which actually handle processing requests and keeping the data.

What kind of decisions do we get to make in the consensus cluster? Those would be:

  • Adding & removing nodes from the entire cluster.
  • Changing the distribution of the data in the cluster.

In other words, the state that the consensus cluster is responsible for is the entire cluster topology. When a request comes in, the cluster topology is used to decide into which set of nodes to direct it to.

Typically in such systems, we want to keep the data on three separate nodes, so we get a request, then route it to one of those three nodes that match this. This is done by sharding the data according the the actual user id whose page views we are trying to track.

Distributing the sharding configuration is done as described in the compute cluster example, and the actual handling of requests, or sending the data between the sharded instances is handled by the cluster nodes directly.

Note that in this scenario, you cannot ensure any kind of safety. Two requests for the same user might hit different nodes, and do separate operations without being able to consider the concurrent operation. Usually, that is a good thing, but that isn’t always the case. But that is an issue of the next post.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}