Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 855 words

This week or early next week, we’ll have the RavenDB 4.0 beta out. I’m really excited about this release, because it finalize a lot of our work for the past two years. In the alpha version, we were able to show off some major performance improvements and a few hints of the things that we had planned, but it was still at the infrastructure stage. Now we are talking about unveiling almost all of our new functionality and design.

The most obvious change you’ll see is that we made a fundamental  change in how we are handle clustering. In prior versions of RavenDB, clusters were created by connecting together database instances running on independent nodes. In RavenDB 4.0, each node is always a member of a cluster, and databases are distributed among those nodes. That sounds like a small distinction, but it completely reversed how we approach distributed work.

Let us consider three nodes that form a RavenDB cluster in RavenDB 3.x. Each database in RavenDB 3.x is an independent entity. You can setup replication between different databases and out of the cooperation of the different nodes and some client side help, we get robust high availability and failover. However, there is a lot of work that you need to do on all the nodes (setup master/master between all the nodes on each can grow very tedious). And while you get high availability for reads and writes, you don’t get that for other tasks in the database.

Let us see how this works in RavenDB 4.0, shall we? The first thing we need to do is to spin up 3 nodes.

image

As you can see, we have three nodes, and Node A has been selected as the leader.  To simplify things to ourselves, we just assign arbitrary letters to the nodes. That allow us to refer to them as Node A, Node B, etc. Instead of something like WIN-MC2B0FG64GR. We also expose this information directly in the browser.

image

Once the cluster has been created, we can create a database, and when we do that, we can either specify what the replication factor should be, or manually control what nodes this database will be on.

 

image

Creating this database will create it on both A and C, but it will do a bit more than that. Those aren’t independent databases that hooked together. This is actually the same database, running on two different nodes. I created the sample data on Node C, and this is what I see when I look on Node A.

image

We can see that the data (indexes and documents) has been replicated. Now, let us see how we can work with this database:

You might notice that this looks almost exactly like you would use RavenDB 3.x. And you are correct, but there are some important differences. Instead of specifying a single server url, you can now specify several. And the actual url we provided doesn’t make any sense at all. We are pointing it to Node B, running on port 8081. However, that node doesn’t have the Northwind database. That is another important change. We can now go to any node in the cluster and ask for the topology of any database, and we’ll get the current database topology to use.

That make it much simpler to work in a clustered environment. You can bring in additional nodes without having to update any configuration, and mix and match the topology of databases in the cluster freely.

Another aspect of this behavior is the notion of database tasks. Here are a few of them.

image

Those are tasks (looks like we need to update the icon for backup) that are defined at the database level, and they are spread over all the nodes in the database automatically. So if we defined an ETL task and a scheduled backup, we’ll typically see one node handling the backups and another handling the ETL. If there is a failure, the cluster will notice that and redistribute the work transparently.

We can also extend the database to additional nodes, and the cluster will setup the database on the new node, transfer all the data to it (by assigning a node to replicate all the data to the new node), wait until all the data and indexing is done and only then bring it up as a full fledged member of the database, available for failover and for handling all the routine tasks.

The idea is that you don’t work with each node independently, but the cluster as a whole. You can then define a database on the cluster, and the rest is managed for you. The topology, the tasks, failover and client integration, the works.

time to read 4 min | 652 words

imageA few weeks ago we started looking into what it would take to run RavenDB 4.0 over HTTPS.

Oh, not the actual mechanics of that, we had that covered a long time ago, and pretty much everything worked as expected. No, the problem that we set out to solve was whatever we could get RavenDB to Just Work over HTTPS without requiring the admin to jump through hops. Basically, what I really wanted was a way to just spin up the server and have it running on HTTPS by default.

That turned out to be a lot harder then I wished it would be.

HTTPS has two very distinct goals:

  • To encrypt communication between two parties.
  • To ensure that the site you visited is actually the site you thought you visited.

The first portion can be handled by generating the certificate yourself, and the communication between client & server would be encrypted. So far so good, but the second portion is probably more important. If my communication with ThisIsNotPayPal.com is encrypted, that doesn’t really help me all that much, I’m afraid.

Verifying who you are is a very important component of HTTPS, and that is something that we can’t just ignore. Well, technically speaking I guess that RavenDB could have installed a root CA into the system during installation, but the mere thought of doing that is giving me a pause, so I really don’t want to try and do that.

And without doing that, we can’t really support HTTPS. Remember that things like Let’s Encrypt won’t work here. RavenDB is often deployed on closed networks, and without having a publicly visible domain to run. My RavenDB is running on oren-pc.hrhinos.local, for example, and I think you’ll find that it is a bit hard to get a Let’s Encrypt certificate for this.

So we can’t just magically get a certificate and have it work.

While I wish there was a way to just have encryption over the wire, without validation of identity, that would be pretty pointless with such things as man in the middle attacks.

So what do we do in RavenDB 4.0 with regards to HTTPS?

We rely on the admin (shocking, I know). They can either generate a self signed certificate and trust it ( a matter of a few shell commands on any platform ) or use their organization’s certificate (either trusted internally or externally obtained). RavenDB doesn’t care about that, but if you provide a certificate, it will ensure that all communication are SSL encrypted.

The client API exposes a method that let you control certificate validation, which make it easier if you need to customize the authentication policy. On the server side, however, we take things differently. Instead of letting the user configure trust policies in certificates, we decided to ignore the issue completely. Or, to be rather more exact, to specify that RavenDB is going to lean on the operating system for such decisions. A simple scenario is an administrator that define a cluster of servers and generate a self signed certificate(s) for them to use. The administrator need to make sure that the certificate(s) in question are trusted by all nodes in the cluster. RavenDB will refuse to connect over HTTPS to an untrusted source.

Yes, I’m aware of all the horrible things that this can do (certificate expiration kills the system, for example), but we couldn’t think of any way were not doing this wouldn’t result in even worse situations.

RavenDB has support for encrypted databases, but we don’t allow them to be accessed from non secured connection, or to connect to non secure destinations. So the data is encrypted at rest and over the wire, and the admin is responsible to making sure that the certs are up to date and valid (or at least trusted by the machines in question).

time to read 3 min | 449 words

imageOne of very conscious choices we made in RavenDB 4.0 is to use threads. Not just run code that is multi threaded, but actively use the notion of a dedicated thread per task.

In fact, most of the work in RavenDB (beyond servicing requests) is done by a dedicated thread. Each index, for example, has its own thread, transactions are merged and pushed using (mostly) a single thread. Replication is using dedicated threads, and so does the cluster communication mechanism.

That runs contrary to advice that I have been told many times, that threads are expensive resource and that we should not hold up a thread if we can use async operations to give it up as soon as possible. And to a certain extent, I still very much believe it. Certainly all our request processing is using this manner. But as we got faster and faster we observed some really hard issues with using thread pools, since you can’t rely on them executing a particular request in a given time frame, and having a mix bag of operations in a thread pool is a mix for slowing the whole thing down.

Dedicated threads give us a lot of control and simplicity, from the ability to control the priority of certain tasks to the ability to debug and blame them properly.

For example, here is how a portion of the threads on a running RavenDB server look like in the debugger:

image

And done of our standard endpoints can give you thread timing for all the various tasks we do, so if there is an expensive thing going on, you can tell, and you can act accordingly.

All of this has also led to an architecture that is mostly about independent thread doing their job in isolation, which means that a lot of the backend code in RavenDB doesn’t have to worry about threading concerns. And debugging it is as simple as stepping through it.

That isn’t true for request processing threads, which are heavily async, so they are all doing mostly reads. Even when you are actually sending a write to RavenDB, what is actually happening is that we have the request thread parse the request, then queue it up for the transaction merging thread to actually process it. That means that there is usually little (if any) contention on locks throughout the system.

It is a far simpler model, and it has proven itself capable of making a very complex system understandable and maintainable for us.

time to read 2 min | 218 words

Prismatic-Sketched-Brain-Silhouette-300pxToday I had a really strange revelation. We had an issue that related to race conditions in a distributed group, and we could just not figure out what was going on from the tests.

Then we switch to a production mode, where each node was a separate process, and we debugged one of them. And it was easy, and it was obvious, and everything just worked.  That was a profound revelation.

What happened was that we build our system so we can service them in production, that includes the internal design and exposed instrumentation data that we have. When we were trying to figure things out from the tests, we were running everything in a single process, and the act of trying to debug a single thread would cause all other threads to stop. That would trigger cascade behaviors (since timeout would be fired, and the cluster would move into recovery mode).

With us debugging just a single node in the cluster, the rest of the cluster just thought it was slow, it didn’t have any impact, but allow us to observe the behavior of the system very easily. The solution was quite obvious once we got into that stage.

time to read 2 min | 203 words

A few weeks ago I talked about how we can keep secrets on Linux. I wasn’t really happy with the solution we came up with, namely a file that is ACLed so only the database user can access it, and I kept mulling that over in my head. We wanted something better, but we didn’t want to start taking dependencies on technologies that the user might not have.

What we came up with eventually was to externalize that decision. The only thing that we actually need is the single root key, so instead of keeping it on a file, we can let the admin provide a it for us (a script / executable / etc). What we’ll do is we’ll check if the user configured an executable to run, and if so, we’ll run that exec, send it over STDIN the data, and read the result back over STDOUT.  That free us from having to take dependencies and give the admin a great degree of freedom with how to deal with keeping secret in the way the organization is used to.

After figuring this out, I found out that Apache is already doing something very similar with SSL Pass Phrase Dialog.

time to read 4 min | 642 words

Edgar F. Codd formulated the relational model in 1969. Ten years later, Oracle 2.0 comes to the market. And Sybase SQL Server came out with its first version in 1984. By the early 90s, it was clear that relational database has pushed out the competition (such as navigational or object oriented databases) to the sidelines. It made sense, you could do a lot more with a relational database, and you could do it easier, usually faster and certainly in a more convenient manner.

Let us look at what environment those relational databases were written for. In 1979, you could buy the IBM's 3370 direct access storage device. It offered a stunning 571MB (read, megabytes) of storage for the mere cost of $35,100. For reference the yearly salary of a programmer at that time was $17,535. In other words, the cost of a single 571MB hard drive was as much as two full time developers, for an entire year.

In 1980, the first drives with more than 1 GB storage appeared. The IBM 3380, for example, was able to store a whopping 2.52 GB of information, the low end version cost, at the time, $97,650 and it was about as big as a washing machine. By 1986, the situation improved and purchasing a good internal hard drive with all of 20MB at merely $800. For reference, a good car at the time would cost you less than $7,000.

Skipping ahead again, by 1996 you could actually purchase a 2.83 GB drive for merely $2,900. A car at that time would cost you $12,371. I could go on, but I'm sure that you get the point by now. Storage used to be expensive. So expensive that it dominated pretty much every other concern that you can think of.

At the time of this writing, you can get a hard disk with 10 TB of storage for about $400 [1]. And a 1 TB SSD drive will cost you less than $300[2]. Those numbers give us about a quarter of a dollar (26 cents, to be exact) per GB for SSD drives, and less than 4 cents per GB for the hard disk.

Compare that to a price of $38,750 per gigabyte in 1980. Oh, except that we forgot about inflation, so the inflation adjusted price for a single GB was $114,451.63. Now, you will be right if you'll point out that this is very unfair comparison. I'm comparing consumer grade hardware to high end systems. Enterprise storage systems, the kind you actually run databases on tend to be a bit above that price range. We can compare the cost of storing information in the cloud, and based on current pricing it looks like storing a GB on Amazon S3 for 5 years (to compare with expected life time of a hard disk) will cost less than $1.5, with Azure being even cheaper.

The really interesting aspect of those numbers is the way they shaped the software written at that time period. It made a lot of sense to put a lot more on the user, not because you were lazy, but because it was the only way to do things. Most document databases, for example, are storing the document structure alongside the document itself (so property names are stored in each document. It would be utterly insane to try to do that in a system where hard disk space was so expensive. On the other hand, decisions such as “normalization is critical” were mostly driven by the necessity to reduce storage costs, and only transitioned later on to the “purity of data model” reasoning once the disk space cost became a non issue.

 


[1] ST10000VN0004 - 7200RPM with 256MB Cache

[2] The SDSSDHII-1T00-G25 - with great then 500 MB / sec read/write speeds and close to 100,000 IOPS

time to read 6 min | 1012 words

imageOn the right you can see how the new database creation dialog looks like, when you want to create a new encrypted database. I talked about the actual implementation of full database encryption before, but todays post’s focus is different.

I want to talk a out managing encrypted databases. As an admin tasked working with encrypted data, I need to not only manage the data in the database itself, but I also need to handle a lot more failure points when using encryption. The most obvious of them is that if you have an encrypted database in the first place, then the data you are protecting is very likely to be sensitive in nature.

That raise the immediate question of who can see that information. For that matter, are you allowed to see that information? RavenDB 4.0 has support for time limited credentials, so you register to get credentials in the system, and using whatever workflow you have the system generate a temporary API key for you that will be valid for a short time and then expire.

What about all the other things that an admin needs to do? The most obvious example is how do you handle backups, either routine or emergency ones. It is pretty obvious that if the database is encrypted, we also want the backups to be encrypted, but are they going to use the same key? How do you restore? What about moving the database from one machine to the other?

In the end, it all hangs on the notion of keys. When you create a new encrypted database, we’ll generate a key for you, and require that you confirm for us that you have persisted that information in some manner. You can print it, download it, etc. And you can see the key right there in plain text during the db creation. However, this is the last time that the database key will ever reside in plain text.

So what about this QR code, what is it doing there? Put simply, it is there to capture attention. It replicates the same information that you have in the key itself, obviously. But what for?

The idea is that users are often hurrying through the process, (the “Yes, dear!” mode) and we want to encourage them to stop of a second and think. The use of the QR code make it also much more likely that the admin will print and save the key in an offline manner, which is likely to be safer than most methods.

So this is how we encourage administrators to safely remember the encryption key. This is useful because that give the admin the ability to take a snapshot on one machine, and then recover it on another, where the encryption key is not available, or just move the hard disk between machines if the old one failed. It is actually quite common in cloud scenarios to have a machine that has an attached cloud storage, and if the machine fails, you just spin up a new machine and attach the storage to the new one.

We keep the encryption keys secret by utilizing system specific keys (either DPAPI or decryption key that only the specific user can access), so moving machines like that will require the admin to provide the encryption key so we can continue working.

The issue of backups is different. It is very common to have to store long term backups, and needing to restore them in a separate location / situation. At that point, we need the backup to be encrypted, but we don’t want it it use the same encryption key as the database itself. This is mostly about managing keys. If I’m managing multiple databases, I don’t want to record the encryption key for each as part of the restore process. That is opening us to a missed key and a useless backup that we can do nothing about.

Instead, when you setup backups (for encrypted databases it is mandatory, for regular databases, optional) we’ll give you the option to provide a public key that we’ll then use to encrypted the backup. That means that you can more safely store it in cloud scenarios, and regardless of how many databases you have, as long as you have the private key, you’ll be able to restore the backup.

Finally, we have one last topic to cover with regards to encryption, the overall system configuration. Each database can be encrypted, sure, but the management of the database (such as connection strings that it replicates to, API keys that it uses to store backups and a lot of other potentially sensitive information) is still stored in plain text. For that matter, even if the database shouldn’t be encrypted, you might still want to encrypted the full system configuration. That lead to somewhat of a chicken and egg problem.

On the one hand, we can’t encrypt the server configuration from the get go, because then the admin will not know what the key is, and they might need that if they need to move machines, etc. But once we started, we are using the server configuration, so we can’t just encrypt that on the fly. What we ended up using is a set of command line parameters, so if the admins wants to run encrypted server configuration, they can stop the server, run a command to encrypt the server configuration and setup the appropriate encryption key retrieval process (DPAPI, for example, under the right user).

That gives us the chance to make the user aware of the key and allow to save it in a secure location. All members in a cluster with an encrypted server configuration must also have encrypted server configuration, which prevents accidental leaks.

I think that this is about it, with regards to the operations details of managing encryption, Smile. Pretty sure that I missed something, but this post is getting long as it is.

Beautiful errors

time to read 2 min | 322 words

The following code in a recent PR caused it to be rejected, can you figure out why?

image

The error clearly states that what the error is, but it fails to provide crucial details. Namely, which files have been corrupted. If I’m seeing an error like this in my logs, I need to be able to figure out what happened, and not hope & guess.

I’m picking up on this particular change because I found myself tallying in my head the number of comments I make on PRs from our team, and quite a large portion of that involve these kind of changes. What I’m looking for with error handling is not just to do it properly and handle all edge cases, but to also properly report it so the person who will end up reading this error message (very likely years from now) will have a clue about what they are supposed to do now.

Sometimes we go to great lengths to ensure that this is the case. We have an entirely different server mode dedicated to handling catastrophic errors, so when you try to get into RavenDB, you’ll get a meaningful error page that will at least try to give you an idea about how to fix this issue, for example. The sad part is that it is very easy to have a single error sour up a really good experience, because it doesn’t provide you with the right information to fix it.

We spend a lot of time just crafting errors properly. They go to the log, they are sometimes blocking the UI (if the server cannot start), we have dedicated alert system that handles error and alert distribution across the cluster so an admin can get into any node and see all the stuff that they need to know about across the entire cluster.

time to read 1 min | 199 words

The reason for this post is simple, this code is so brilliantly simple that I just had to write about it.

On the face of it, it isn’t doing much that is interesting, but it is showing off something very critical. It is both obvious and easy to reason about. And if you don’t have local functions, trying to do something like that will require you to jump through several hops and pretty much always generate code that the compiler and any static analysis tool will consider suspect.

The fact that the reference to the local function can be added and then remove also means that we can do things like this:

A self cleaning delegate, which is only usually possible with code trickery that force you to capture a variable that you have set to null and then set to the value you are trying to use.

I know that this isn’t a really a major thing, but it make certain very specific scenarios so much easier, so it is just a joy to see. And yes, the impetus for this code was actually seeing it used in our code and going Wow!

time to read 1 min | 80 words

This code try very hard to ensure that the secret key provided to it is eradicated after it is properly saved.

This is because we try to reduce the attack vector for keeping the encryption key in memory.

However, there are at least two different ways that this code is failing in what it is trying to do. Can you find them?

For that matter, how much sense does it make to even attempt what it is doing?

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}