Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 5 min | 925 words

As a developer, it is easy to operate at the level of message passing between systems, utilizing sophisticated infrastructure to communicate between nodes and applications. Everything works, everything is in its place, and we can focus on the Big Picture.

That works great, because while everything is humming, you don’t need to know any of the crap that lies beneath the surface. Unfortunately, if you are developing distributed systems, you kind of really need to know these things. As in, you can’t do a proper job without knowing them.

We’ll start from the basics. Any networked service needs to listen on the network, which means that you need to provide the service with the details of what to listen on.

Now, typically you’ll write something like http://my-awesome-service and leave it at that, not thinking about this any further, but let’s break this apart for a bit. When you hand the TCP listener an address such as this one, it doesn’t really know what it can do with it. At the TCP level, such a thing is meaningless.

At the TCP level, we deal with IP addresses and ports. For simplicity’s sake, I’m going to ignore IPv4 vs. IPv6, because they don’t matter that much at this level (except that they do, but we’ll get to that later). This means that we need to have some way to translate “my-awesome-service” into an IP address. At this point, you are probably recalling that there is such a thing as DNS (Domain Name System), which is exactly the solution for that. And you would be right, almost.
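
As a minimal sketch of that translation step (Python is used here purely for illustration, and `resolve` is a hypothetical helper name), the standard library’s `socket.getaddrinfo` turns a host name into the IP addresses that a TCP listener or client can actually work with:

```python
import socket

def resolve(host: str, port: int) -> list[str]:
    """Return the distinct IP addresses that `host` maps to, in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element is the IP for both IPv4 and IPv6 results
        if ip not in addresses:
            addresses.append(ip)
    return addresses

# A numeric address needs no DNS lookup, so this works without network access.
print(resolve("127.0.0.1", 80))
```

Calling `resolve("my-awesome-service", 80)` would only tell you where the name points, not whether that address exists on the machine that runs your server.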

The problem is that we aren’t dealing with a flat network view. In other words, we can’t assume that the IP address that “my-awesome-service” is mapped to is actually available on the end machine our server is running on. But how can that be? The whole point is that I can just point my client there and the server is there.

The following is a non-exhaustive list (but certainly exhausting) of… things that are in the way.

  • NAT (Network Address Translation)
  • Firewalls
  • Routers
  • Load balancers
  • Proxies
  • VPN
  • Monitoring tools
  • Security filtering
  • Containers

In the simplest scenario, imagine that you have pointed my-awesome-service to IP address 31.167.56.251. However, instead of your service being there, there is a proxy that will forward any connections on port 80 to an internal address at 10.0.12.11 on port 8080. That server is actually doing traffic analysis, metering and billing for the cloud provider you are using, after which it will pass the connection on to the actual machine you are using, at 10.0.15.23 on port 8888. You might think that the journey is over, but you actually forgot that you are running your service as a container. So the host machine needs to forward that to the guest container on 127.0.0.3 on port 38880. And believe it or not, this is probably skipping half a dozen steps that actually occur in production.

If you want to look at the network route, you can do that using “traceroute” or “tracert”. Here is what a portion of the output looks like from my home to ayende.com. Note that this is just the route to the IP address that the DNS says is hosting ayende.com, not the routing from that server to the actual code that runs this blog.

myhome.mynet [10.0.0.138]
bzq-179-37-1.cust.bezeqint.net [212.179.37.1]
10.250.0.170
bzq-25-77-10.cust.bezeqint.net [212.25.77.10]
bzq-219-189-2.dsl.bezeqint.net [62.219.189.2]
bzq-219-189-57.cablep.bezeqint.net [62.219.189.57]
ae9.cr1-lon2.ip4.gtt.net [46.33.89.185]
et-2-1-0.cr3-sea2.ip4.gtt.net [141.136.110.193]
amazon-gw.ip4.gtt.net [173.205.58.86]
52.95.52.146
52.95.52.159
54.239.42.215
54.239.43.130
52.93.13.44
52.93.13.35
52.93.12.110
52.93.12.133

This stuff is complex, but usually, you don’t need to dive that deeply into this. At a certain point, you are going to call the network engineers and let them figure it out. We are focused on the developer aspect of understanding distributed systems.

Now, here is the question, to test if you are paying attention. What did your service bind to, to be able to listen over the network?

If you just gave it “http://my-awesome-service” as the configuration value, there isn’t much it can do. It cannot bind to 31.167.56.251, since that IP address does not exist on the container. So the very first thing that we need to understand as distributed systems developers is that “what do I bind to” can be very different from “what do I type to get to the service”.
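
One way to keep the two concerns apart is to configure them separately. The sketch below is a hypothetical Python illustration (the `ServiceConfig` fields and the `start_listener` helper are made-up names, not taken from any product): the address the process binds to is a local interface, while the public URL is only what clients are told to use.

```python
import socket
from dataclasses import dataclass

@dataclass
class ServiceConfig:
    bind_host: str   # a local interface address, e.g. "0.0.0.0" or "127.0.0.1"
    bind_port: int   # 0 asks the OS for any free port
    public_url: str  # what clients type; never passed to bind()

def start_listener(cfg: ServiceConfig) -> socket.socket:
    """Bind a TCP listener to the local address only."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((cfg.bind_host, cfg.bind_port))
    listener.listen()
    return listener

cfg = ServiceConfig("127.0.0.1", 0, "http://my-awesome-service")
listener = start_listener(cfg)
host, port = listener.getsockname()
print(f"bound to {host}:{port}, advertised as {cfg.public_url}")
listener.close()
```

The public URL never reaches `bind()`; it is only useful for things like telling other nodes or clients how to reach this one.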

This is the first problem, and it is usually a major hurdle to grok. Mostly because when we are developing internally, you’ll typically use either machine names or IP addresses and you’ll typically consider a flat network view, not the segmented one that is actually in place. Docker and containers actually make you think about some of that a lot more, but even so most people don’t consider this too much.

I’m actually skipping a bunch of details here. For example, a server may want to listen on multiple IPs (internal & external), maybe with different behaviors on each. A server may actually have multiple network cards and want to listen on both (for more bandwidth). It is also quite common to have a dedicated operations network, so the server will listen on the public network to respond to general queries, but things like SNMP or the management interface are only exposed to the (completely separated) ops network. And sometimes things are even weirder, with crazy topologies that look like Escher paintings.

This post has gone on long enough, but I’ll have at least another post in this series.

time to read 6 min | 1039 words

RavenDB is a distributed database. It has been a distributed database since pretty much the very start, although over time we have been making the distribution part easier and easier. You might be able to tell that the design of RavenDB was heavily influenced by the Dynamo paper, and RavenDB implements a multi master system that allows every node to accept writes and disseminate them across the network.

This is great, because it ensures that we have high stability in the face of errors, but it also opens us up to some interesting failure modes. In particular, if a document is modified on two nodes at the same time, there is no way to immediately detect that. This is unlike a single master system, where such a thing would be detected, but which requires at least a majority of the nodes to be up. A scenario where we have concurrent modifications of the same document on different servers is called a conflict, and is something that RavenDB is quite able to detect and handle.

For a very long time, we had multiple mechanisms to handle such conflicts. You could specify that RavenDB would resolve them automatically, in favor of a particular node, or using the latest version, or by specifying a resolution strategy on the server or the client. But by default, if you did nothing, a conflict would cause an exception and require you to resolve it.

No one ever handled that exception, and very few users set the conflict resolution or did something meaningful with it. We typically heard about it as support calls about “this document is not accessible and the sky has just fallen”. Which is perfectly understandable from the point of view of the user, but incredibly frustrating from ours. Here we are, careful to account for correctness in behavior in a distributed system, properly detecting conflicts and bringing them to the attention of the user, and the result is… they just want the error to go away.

In the vast majority of the cases, the user didn’t care about the conflict at all. It wasn’t important and any version would do. And that is after we went to all the trouble of making sure that you have a powerful conflict resolution option that allows you to do some really fun things. The overwhelming response we got was “make this go away”. The problem is that we can’t really make such a thing go away; this is fundamentally an issue that a multi master distributed system must handle. And just throwing one of the conflicted versions under the bus didn’t sit right with us.

RavenDB is an ACID database because I strongly believe that transactions matter, that your data is important and should be respected, not shredded to pieces on a moment’s notice in fear of someone figuring out that there has been a conflict. I wrote about another aspect of this issue previously: what the user expects and the right thing are decidedly at odds here. In particular because the right thing (handling conflicts) can be hard for the user, and something that you would typically do only on some parts of your domain model.

Because of this, with RavenDB 4.0 we moved to automatic conflict resolution. Unless configured otherwise, whenever RavenDB discovers a conflict, it will automatically resolve it (in an arbitrary but consistent manner across the cluster). Here is what this looks like:

image

Notice the flags? This document is the result of conflict resolution. In this case, we had both 1st and 2nd as conflicting versions, and we chose one of them.

But didn’t I just finish telling you that RavenDB doesn’t shred your data? The key here is that in addition to the Resolved flag, we also have the HasRevisions flag. In this case, the database doesn’t have revisions defined, but even so, we have revisions for this document. Let us look at them, shall we?

image

We have three of them:

Created on Node A: image
Created on Node B: image
Resolved: image

Pay special attention to the flags. You can see that we have three revisions here: the conflicted versions as well as the resolved document. We’ll be reporting these documents in the studio, so an admin can go and take a look and verify that nothing was missed. This also applies to conflict resolution that wasn’t done by arbitrarily choosing a winner.

Remember, this is the default configuration, so you can set RavenDB to manual mode, in which case you’ll get an error on accessing a conflicted document and will need to resolve it manually, or you can define a script that would resolve the conflict. This can be defined on a per collection basis or globally for the database.

Here is an example of how you can handle conflicts using a script:

Regardless of the way you choose to resolve the conflict, you will still have all the conflicting versions available after the resolution, so if your script missed something, no data has been lost.
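
To make the idea concrete, here is a minimal Python sketch. It is not RavenDB’s scripting API; the `resolve_conflict` helper, the `merge_tags` resolver, the flag strings attached to the revisions, and the sample documents are all assumptions made for illustration. It picks a winner (arbitrarily but deterministically by default, or through a custom resolver) and keeps every conflicting version as a revision.

```python
import json

def resolve_conflict(versions, resolver=None):
    """Return (resolved_document, revisions) for a list of conflicting versions."""
    if resolver is None:
        # Arbitrary but consistent: any node comparing the same versions picks the same winner.
        resolved = max(versions, key=lambda doc: json.dumps(doc, sort_keys=True))
    else:
        resolved = resolver(versions)
    # Nothing is thrown away: every conflicting version is kept next to the result.
    revisions = [{"flags": "Conflicted", "doc": v} for v in versions]
    revisions.append({"flags": "Resolved", "doc": resolved})
    return resolved, revisions

def merge_tags(versions):
    """Hypothetical custom resolver: union the tags instead of picking one side."""
    merged = dict(versions[0])
    merged["tags"] = sorted({t for v in versions for t in v.get("tags", [])})
    return merged

node_a = {"name": "1st", "tags": ["a"]}
node_b = {"name": "2nd", "tags": ["b"]}

winner, revisions = resolve_conflict([node_a, node_b])
merged, _ = resolve_conflict([node_a, node_b], resolver=merge_tags)
print(winner["name"], len(revisions), merged["tags"])
```

Because the default choice depends only on the content of the versions, the order in which a node happens to see them does not change the outcome.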

The idea is that we want to enable you to deploy a distributed system without too much hassle, but without taking undue risks or running in a configuration that is only suitable for demos. I think that this is the kind of feature that you would never really notice, until you notice that it just saved you a bunch of sleepless nights.

And as this is written at 00:12 AM, I think that I’ll listen to my own advice, hit the post button and head to bed.

Update: Here is how you configure this in the studio:

image

time to read 1 min | 184 words

One of the things that we have been saving is the hooking together of all the work we have been doing to expose how RavenDB works into the operations dashboard. This has just landed in the nightly and can give you a lot of insight into exactly what is going on inside your server.

You can see some of the screenshots below. The idea is that in addition to exposing all of these metrics over dedicated endpoints and SNMP, we will also save users the trouble of setting up monitoring and just show them what is going on directly.

Operators can just head to this page and see what is going on, and it is meant to be put as a background for users to observe this during routine operations.

image

image

image

time to read 3 min | 437 words

This post isn’t about RavenDB, at least not directly. At Hibernating Rhinos, we use all sorts of tools to communicate. They range from email (direct, groups and mailing lists) to Slack, Skype, bug tracking and the odd face to face interaction thingie.

The problem is that some of these discussions happen in different circles. For example, a few devs working on the UI might talk with each other and make decisions about what we need to do, and then later a bug pops up with a “fix the interaction of the replication components”. This is particularly bad when we are doing this face to face, but it can also be something like: “Optimize the process as per the slack discussion” or “The example that led to this bug is a perfect model of Foobarizm”.

There are two problems with this approach. First, if you weren’t part of the discussion, and usually you wouldn’t be, this is like sitting in a cafe with your parents and their high school friends, listening to them talking about other high school friends. It is both a pain and utterly incomprehensible. The other problem is that even if you were part of the conversation, you might not be able to connect the dots to this particular bug report. Or worse, it might be you a few weeks or months later, looking at the bug report and wondering what was going on there.

This is especially the case when we investigate something a few months or years after the change was made, and we do some archaeology to figure out what is actually going on there. At that time, you might look at a piece of code, run blame to see where it came from, track down the specific commit and issue number that were responsible for the change and end up scratching your head and trying to figure out what was meant there, because the text assumes context not in evidence.

The other side here is that we can create dozens of issues per week, and they range from “move the text so it will align in this view” to “fix the race condition on failure of recovering node on the cusp of promotion”. Some of them are worth further treatment, with full explanation and discussion, but a three day chase to resolve an issue that ended up needing to move a piece of code three lines higher isn’t going to get a good description.

What we do need to pay attention to is that we leave enough information to figure out what the story was behind the issue, without making it a chore to actually create issues.

time to read 1 min | 90 words

This is part of a PR related to making sure that disposing once works. It contains this code:

image

This loses critically important information. Namely, the stack trace of the original exception. That leaves aside the issue that an aggregate exception may contain multiple exceptions as well.

In general, and I know this is old hat, whenever you see “throw e;” or “throw e.InnerException;” of any kind, you should always treat it as a bug.

time to read 4 min | 783 words

“In theory, there is no difference between theory and the real world.”

One of the more annoying things to learn was that the kind of things that you worry about from inside the product are almost never the kind of things that your users worry about. Case in point, we spend an amazing amount of time making sure that RavenDB is crash proof, that you will not have data corruption and that transactions are atomic and durable in the face of what are sometimes horribly broken environments. Users often just assume “this must work” and move along, having no idea how hard some of these things are.

But that much, I get. It makes sense that you would just assume that things should work this way. In fact, one of the reasons that RavenDB exists is that none of the NoSQL products at the time provided what I considered to be basic functionality. Since then I learned that what a user considers basic functionality and what a database considers basic functionality are two very distinct things.

But I think that the most shocking thing was that users tend to not care about data consistency anywhere near the level you would expect them to. We spend time and effort and a whole lot of coding to ensure that it would be possible to reason about the behavior of a distributed and concurrent system in a fairly predictable manner, that data is never lost or misplaced, and no one notices. What is worse, when you get things right, and another database engine gets it clearly wrong, users will sometimes want to use the other guy’s (wrong) implementation, because doing the clearly wrong thing is easier for them.

For example, consider the case of two concurrent modifications to the same document. If you do nothing, you’ll get a Last Write Wins scenario. You can also do the proper thing and error when the second write comes, because it is based on a now out of date version of the document. A few weeks ago I got a frantic call from one of the marketing & sales people about “I broke our database” and “found a major issue”. That was quite strange, given that the person talking to me wasn’t a developer. Instead, she was using one of our internal systems to update a customer purchase and got an error. She then proceeded to figure out that she could reproduce this error at will. All she had to do was edit the same customer record at the same time as a colleague was also editing it. Whoever saved the record first would succeed, and the second would get an error.

For the developers among you, that is Optimistic Concurrency in action, absolutely expected and what we want in this scenario. But I had to give a full explanation of how this is not a bug, tell the marketing guys to put the “Serious Bug Fixed, Upgrade Immediately” email template down and that this is how it is meant to work. The problem, by the way, wasn’t that they couldn’t understand the issue. They just couldn’t figure out why they got an error in the first place; surely the “system” was supposed to figure out what to do there and not give them an error.
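
For readers who have not seen it spelled out, here is a minimal Python sketch of optimistic concurrency. It is an in-memory toy, not the internal system from the story; `RecordStore`, `ConcurrencyError` and the sample data are hypothetical names used only for illustration.

```python
class ConcurrencyError(Exception):
    """Raised when a write is based on an out-of-date version of a record."""

class RecordStore:
    def __init__(self):
        self._records = {}  # key -> (version, data)

    def load(self, key):
        """Return (version, data); a missing record has version 0."""
        return self._records.get(key, (0, None))

    def save(self, key, data, expected_version):
        current_version, _ = self.load(key)
        if current_version != expected_version:
            raise ConcurrencyError(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._records[key] = (current_version + 1, data)

store = RecordStore()
store.save("customers/1", {"plan": "basic"}, expected_version=0)

# Two people open the same record at the same time.
version_a, _ = store.load("customers/1")
version_b, _ = store.load("customers/1")

store.save("customers/1", {"plan": "pro"}, expected_version=version_a)  # first save works
try:
    store.save("customers/1", {"plan": "enterprise"}, expected_version=version_b)
    second_save_failed = False
except ConcurrencyError:
    second_save_failed = True  # rejected instead of silently overwriting the first save

print(second_save_failed)
```

Dropping the version check in `save` would turn this into Last Write Wins: the second save would succeed and the first person’s change would quietly disappear.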

I’ll freely admit that we skimp on the UX of our internal systems because… well, they are internal, and it is easier to train a few dozen people on how the systems work than to train the systems how people work at that scale. But this really hit home because even after I explained the issue, asked them what they expected to happen and how this is supposed to work, I couldn’t get through. An error shown to them is obviously something that is wrong in the system. And being able to generate an error by their own actions means that the system is broken.

It took showing the same exact behavior in the accounting software (made by an external company) before they were half convinced that this isn’t actually an issue.

Now, to be fair, our marketing people aren’t technical, so they aren’t expected to understand concurrency and the handling thereof, and typically any error from the internal system means that something is broken at the infrastructure level, so I can absolutely understand where they are coming from.

The sad thing is, this isn’t isolated to non technical people, and we have to be careful to design things in such a manner that they match what the user expects. And the users in this case are the developers working with RavenDB and the ops teams responsible for its care and feeding. I’ll talk about one such decision in the next post.

time to read 1 min | 110 words

I had to reject the following change in a recent PR. In this context, the flags and conflicted.Flags are the same, and that wasn’t the problem. Can you spot the issue?

image

The problem is that the second version does an allocation. It does this silently, and you need to know about this issue to know that this happens. There is good discussion on this in this StackOverflow question.

It looks like this has been fixed in the JIT for CoreCLR and will be part of the 2.1 release when it is out.

time to read 3 min | 453 words

There is a reason why people talk about idiomatic code. Code that is idiomatic to the language matches what it expects and is generally faster / easier to work with for both developers and the compiler / runtime.

During a PR review, I ran into this code:

image

The idiomatic manner for writing this code would have been any of:

  • “@id” == property
  • Constants.Documents.Metadata.Id == property
  • property.Equals(Constants.Documents.Metadata.Id)
  • Constants.Documents.Metadata.Id.Equals(property)

I can argue that the second option is the most idiomatic, and that the 3rd option can fail with Null Reference Exception if the property is null, but all of them are pretty clear.

Now, RavenDB has a lot of non-idiomatic code, usually when we need to get more performance. For example:

image

This is code that is doing very much what is done above, but it does this on the raw byte buffer, and it knows that it is accessing UTF8 characters, so we can do some nice optimizations there to compare by just doing two instructions.

Indeed, when queried, the developer answered:

Most of the time its going to be false and comparing ints is cheaper than strings

There are several problems with this. First, this particular piece of code isn’t in a part of the code that is extremely performance sensitive. The string buffer work above is for processing requests from the network, a piece of code that can be called tens and hundreds of thousands of times per second. Performance there matters, a lot. This code is meant to be called as part of streaming results to the user, so it is likely to handle a very large volume of data. Performance there matters, for sure, but we need to consider how much it matters.

Second, let us peek into what will actually happen if we drop the property.Length check. The call will end up calling to the native string routines in the CLR, and the relevant portion is:

image

In other words, this check is already going to happen, we didn’t really save anything from making it.

Third, and the most subtle of them all: this check compares against a constant, whose value is “@id”. It also checks that property.Length is equal to 3. The whole point of using a constant is that we need to replace it in just one location. But in this case, we will likely change the constant value, not realize that there is a hardcoded length elsewhere in the code and fail miserably with hard to explain behavior.
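
The failure mode is easy to reproduce in any language. The following Python sketch is only an illustration (the original code is C#, and `METADATA_ID`, `is_id_brittle` and `is_id` are stand-in names): the brittle version keeps working until someone changes the constant.

```python
METADATA_ID = "@id"  # stand-in for the constant discussed above

def is_id_brittle(prop: str, constant: str = METADATA_ID) -> bool:
    # Hardcoded length: silently wrong as soon as the constant is not 3 characters long.
    return len(prop) == 3 and prop == constant

def is_id(prop: str, constant: str = METADATA_ID) -> bool:
    # Plain equality: no separate length to keep in sync with the constant.
    return prop == constant

# Both agree while the constant is "@id".
print(is_id_brittle("@id"), is_id("@id"))

# Someone later changes the constant's value; only the plain comparison still matches.
print(is_id_brittle("@identifier", "@identifier"), is_id("@identifier", "@identifier"))
```

Nothing fails loudly in the brittle version. It simply stops matching, which is exactly the hard to explain behavior described above.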

time to read 1 min | 113 words

The law of Demeter goes like this: a method m of an object O may only invoke the methods of the following kinds of objects:

  1. O itself
  2. m's parameters
  3. Any objects created/instantiated within m
  4. O's direct component objects
  5. A global variable, accessible by O, in the scope of m

And then we have this snippet from a 1,741 line index that was sent to us to diagnose some performance problems.

image

There are at least two separate leaks of customer data here, by the way, can you spot them?

This is it for this post, I really don’t have anything else left to say.

time to read 2 min | 394 words

Sometimes we get requests from customers to evaluate and help specify the kind of hardware their RavenDB servers are going to run on. One of the more recent ones was to evaluate a couple of options and select the optimal one.

We got the specs of the two variants and had a look. Then I went and took a look at the actual costs. These are physical machines, and the cost of each of the options we were given, even maxed out, was around 2 – 3K $.

One of the machines that they wanted was using a 7,200 RPM hard disk and was somewhat cheaper than the other. To the point where we got some pushback from the customer about selecting the other option (with SSD). It took me a while to figure out what was going on.

This organization is using RavenDB (and the new machines in question) to run their core business. One of the primary factors in their business is the speed at which they can serve requests (for that business, SEO is a super critical metric). This business is also primarily focused on getting eyes on the site, which means that their organizational structure looks like this:

image

And the behavior of the organization follows the money. The marketing & sales department is much larger and can steer the entire organization, while the tech (which the entire organization depends on) is decidedly second string, for both decision making and budgeting.

I have run into this several times before, but it took me a long while to figure out that in many cases, the result of such arrangements is that the tech department relies on other people (in this case, us) to tell the organization at large what needs to be done. “It isn’t us (the poor relation) that needs this expensive (add 300$ or so) add on, but the database guys say that it really matters (and you’ll take the blame if you don’t approve it)”.

I don’t have anything to add, I’m afraid, I just wanted to share this observation. I’m hoping that understanding the motivation can help alleviate some of the hair pulling involved in those kinds of interactions (yes, water is wet, and spending a very small amount of money is actually worth it).
