Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 420 words

We have been working on a big benchmark of RavenDB recently. The data size that we are working on is beyond the TB range and we are dealing with over a billion documents. Working with such data sizes can be frustrating, because it takes quite a bit of time for certain things to complete. Since I had downtime while I was waiting for the data to load, I reached to a new toy I just got, a Raspberry PI 400. Basically, a Raspberry Pi 4 that is embedded inside a keyboard. It is a pretty cool machine and an awesome thing to play around with:

image

Naturally, I had to try it out with RavenDB. We have had support on running on ARM devices for a long while now, and we have dome some performance work on the Raspberry PI 3. There are actually a whole bunch of customers that are using RavenDB in production on the Pi. These range from embedding RavenDB in industrial robots, using RavenDB to store traffic analysis data on vehicles and deploying Raspberry PI servers to manage fleets of IoT sensors in remote locations.

The Raspberry PI 4 is supposedly much more powerful and I got the model with 4GB of RAM to play around with. And since I already had a data set that hit the TB ranges lying around, I decided to see what we could do with both of those.

I scrounged an external hard disk that we had lying around that had sufficient capacity and started the import process. This is where we are after a few minutes:

image

A couple of things to notice about this. At this point the import process is running for about two and half minutes and imported about 4 million documents. I want to emphasize that this is running on an HDD (and a fairly old one at that). Currently I can feel its vibrations on the table, so we are definitely I/O limited there.

Once I’ll be done with the data load (which I expect to take a couple of days), we’ll be testing this with queries. Should be quite fun to compare the costs of this to a cloud instance. Given typical cloud machines, we can probably cover the costs of the PI in a few daysSmile.

time to read 6 min | 1093 words

I mentioned that we are currently hiring for a junior dev position and we have been absolutely swamped with candidates. Leaving aside the divorce lawyer that tried to apply to the position and the several accountants (I don’t really get it either) we typically get people with very little experience.

In fact, this position is explicitly open to people with no experience whatsoever. Given that most junior positions require a minimum of two years, I think that got us a lot of candidates.

The fact that we don’t require prior experience doesn’t meant that we don’t have prerequisites, of course. We are a database company and the fundamentals are important to us. A typical task in RavenDB involves a lot of taxes, from ACID compliance, distributed computing, strict performance requirements, visibility into the actions of the database, readability of the code, etc.

I talked before about the cost of a bad hire, and in the nearly a decade that passed since I wrote that post, I hasn’t changed my mind. I would rather end up with no one than hire someone that isn’t a match for our needs.

Our interview process is composed of a phone call, a few coding questions and then an in person interview. At this point, given that I have been doing that for over a decade, I think that I interviewed well over 5,000 people. A job interview stresses some people out a lot. Yesterday I had to listen to a candidate speak so fast that I could barely understand the words and I had to stop a candidate and tell them that they are currently in the 95% percentile of people I spoke to, so they wouldn’t freeze because of a flubbed question.

I twitted(anonymously) about the ups and down of the process and seem to have created quite a lot of noise. A typical phone call for a potential candidate takes about 15 – 30 minutes and is mostly there to serve as an explicit filter. If they don’t meet the minimum requirements that we have, there is no point in wasting either of our time.

One of the questions that I ask is: Build a phone book application that stores the data in memory and outputs the records in lexical order. This can stump some people, so we have an extra question to help. Instead of trying to output the data in lexical order, how would you ensure that you don’t have a duplicate phone number in such a system? Scanning through the entire list of records each time is obviously not the way to go. If they still can’t think of a way to do that the next hint is to think about O(1) and what data structure would fit this requirement. On the Twitter thread, quite a few people were up in arms about that.

Building a phone book is the kind of task that I remember doing in high school programming class as a teenager. Admittedly, that was in Pascal, but I just checked six different computer science degrees and for all of them, data structures was a compulsory course. Moreover, things like “what is the complexity of this operation” are things that we do multiple times a day here. We are building a database here, so operations on data is literally our bread and butter. But even for “normal” operations, that is crucial. A common example, we need to display some information to the user about their database. The information actually come from two sources internally. One is the database list which contains various metadata and one is the active database instance, which can give us the running stats such as the number of requests for this database in the past minute.

Let’s take a look at this code:

The complexity of this code is O(N^2). In other words, for ten databases, it would cost us a hundred. But for 250 databases it would cost 62,500 and for 500 it would be 250,000. Almost the same code, but without the needless cost:

This is neither theoretical nor rare, and it is part of every programming curriculum that you’ll find. Further more, I’m not even asking about the theoretical properties of various algorithms, or asking a candidate to compute the complexity of a particular piece of code. I’m presenting a fairly simple and common task (make sure that a piece of data in unique) and asking how to perform that efficiently. 

From a lot of the reactions, it seems that plenty of people believe that data structures aren’t part of the fundamentals and shouldn’t be something that is deeply embedded in the mindset of developers. To me, that is like saying that a carpenter shouldn’t be aware of the difference between a nail or a screw.

Rob has an interesting thread on the topic, which I wanted to address specifically:

It does not matter what language, actually. In JavaScript, you’ll not use a “new Hashtable()”, you’ll use an object, sure, but that is a meaningless detail. In fact, arrays implemented as hashes in JavaScript maintain most of their important properties, actually. O(1) access time by index being the key. If you want to go deeper, than in practice, the JS runtime will usually use an actual array if it detects that this is how you use the variable.

And I’m sorry, that is actually really important for many reasons. The underlying structure of our data has huge impact on what you can do with that data and how you can operate on it. That is why you have ArrayBuffer and friends in JavaScript now, because it matters, a lot.

And to the people whose 30 years experience never included ever needing to know those details, I have two things to say:

  • Either your code is full of O(N^2) traps (which is sadly common) or you know that, because lists vs. hash is not something that you can really get away with.
  • In this company, implementing basic data structures is literally part of day to day tasks. Over the years, we needed to get customize versions of arrays, lists, dictionaries, trees and many more. This isn’t pie in the sky stuff, that is Monday morning.
time to read 9 min | 1604 words

imageI was talking to a developer recently and had a really interesting discussion around the notion of consistency. For simplicity’s sake, let’s imagine that we are talking about a game and we need to deal with awarding achievements.

The story in question begins with a seemingly innocent business requirement:

We want to announce a unique achievement in the game. The first player to kill 10,000 rabbits will get a unique class-appropriate item.

Conceptually, that means that we need to write the following code:

The code is simple, easy and trivial. I wish we could end the post at this point, but as you can imagine, the situation is a bit more complex.

Consider the typical architecture of a game, we have the game world, which is composed of multiple servers, often located in different parts of the world. In this case, we can assume that we have three data centers, in separate continents.

image

Notice that we have a world first scenario here? That means that we need to synchronize the state across all of them in some manner, including when there are issues in the network, failures, etc.

For that matter, when we are talking about an achievement for a character, we can at least be sure that there is a single chain of events that we can follow, but what happens if we were to apply this achievement to a guild? In this case, multiple players may be competing to complete this achievement. If we allow to to also run on different servers (and regions), the situation now become quite complex.

Are there are technical ways to resolve this? Of course there are.

We can use distributed transactions to handle this, at first glance. However, that introduce an utterly unworkable spanner in the works (See what I did here Smile). Games care a lot about performance and latency, forcing us to run a globally distributed transaction on every rabbit kill is not a possible solution. There are stiff performance costs associated with it. For that matter, in a partial failure scenario, you still want to allow players to kill rabbits, even if you lost some servers in another continent. That lead to several additional scenarios that you need to cover:

  • What happened if a player killed rabbits but wasn’t able to record that using the world scope state? Do we also store that in a character / server level state? What happens if we already awarded the achievement and then find out that there are delayed updates in the mix?
  • What happens when two players kill their 10,000th rabbit at times T1 and T2 (where T2 > T1), but this happens near enough to one another that the write about the 2nd player is committed first?
    • Note that this can happen on the same server, on the same region or across different regions.
  • What happens when two players kill their 10,000th rabbit at the same instant?
    • Note that this can happen on the same server, on the same region or across different regions.
    • Note that this can actually happen at different times, but clock divergence will say it happened at the same time
  • What happens when a player kill their 10,000 rabbit, but we had a network hiccup and couldn’t record the action in time?

In fact, there are a multitude of issues that we have to deal with, if we accept the scenario as is. And the impact on the entire system because of the unstated requirements of a single achievement are huge.

That is likely not what the intended result is. We can do some minimal changes to the system and get pretty much what we want at a much reduced cost.  We start by giving up the implicit assumption that we have to award the achievement for the 10,000th within the same tick as the actual kill.  If we give up this requirement, it means that we have a far more relaxed environment to deal with.

We can say that we’ll process the achievements at the end of the next hour, for example. That gives us enough time to get updates from the rest of the system, settle the dust and avoid millisecond level decision making processes. In almost all businesses, there is no such thing as a race condition. The best example of that is “the check is in the mail”. The payment date for a check is not when you cashed it but when you posted it. My uncle used to go to the post office at 4:50 PM on a Friday to post checks. They would get the right time stamp, but would sit in the post office over the weekend before actually being delivered. That gave his 3 – 5 extra days before the check was cashed.

In the case of our game, giving us the time for things to settle make sense. It also make sense from a marketing and community perspective. World wide announcements don’t just happen, they can be scheduled for a particular time frame to maximize impact. Delaying the announcement of such an achievement make it a lot easier from a technical perspective but also give us more business level benefits to work with.

In many situation, we jump to assumptions about the kind of requirements that we have to meet, but usually there is a lot more flexibility. And discovering where you can get some slack can be of tremendous value.

Possible reaction from the business in this scenario can be:

  • We don’t actually care if this is exact. We want to give the achievement when this happens but it is okay if there is a little fudge factor and the “wrong” person won if there is a tiny amount of time difference.
  • It is actually fine if two players get the achievement when it happens at the “same” time.
  • Oh, I didn’t realize that this is so hard. Can we make it a server-wide achievement, then? Would that be easier?

Any of those responses will translate to a great simplification of what we have to deliver. In the first case, we can track the state of bunny killing without the use of distributed transactions and only apply a distributed transaction when we get to the 10,000th rabbit. That is going to be a rare event, so it is fine to get to pay a little extra there.

For that matter, a good question is to ask is how important the operation is. What happened when we have a failure in the distributed transaction when we record the 10,000th rabbit? I’m not talking about someone else getting there first, I”m talking bout a network hiccup or a faulty wire that cause the operation to fail. Do we retry? Do we output an error (and if so, to where?) or just ignore this? Do we need to have a secondary mechanism for checking for errors in the process?

The answer is that it depends, you can get into effectively infinite loop of trying to solve ever more unlikely scenarios. The question is what is the impact on the system at large and is it worth the cost?

For a game achievement, the answer is probably no (but then again, I’m not a gamer). People usually draw the line at money as the place where they’ll do their utmost to avoid issues.  This is a strange scenario, because money specifically has so many ways that you can run compensating actions as part of the normal workflow that you shouldn’t bend yourself into a pretzel to avoid that.

Let’s say that we are offering three unique mounts in our game, selling them for 9.99$. We obviously have to deal with race conditions here. There are only three mounts, but there are more users who want it and are willing to pay for the privilege. We are dealing with money here, so we can assume that we want to be careful about that, the question is, how careful?

We have three mounts, but more users. The payment process itself takes at least a few seconds, so how do you deal with it?

  • Throw the purchase attempts into a queue.
  • Pull the first three offers from the queue and attempt to charge them.
  • If charging failed, we mark the purchase attempt as errored and move on to the next item in the queue.
  • Once we successfully processed three purchases, we mark all the other purchase attempts as failed.

Simple, right?

What happens when a charge took longer than expected and a request timed out, but the charge actually happened? This can happen if you have SMS authentication in the process. So the card company will send a message to the card holder and they have to send an SMS back to approve the transaction. That can take long enough that the transaction will time out. But the user did authorize the transaction, funds were transferred, but you already sold the unique mount to someone else, with worse credit card security features.

What happens then? You can refund the money, provide another unique mount, etc.

The key here is that even for something that may consider critical, the number of failure modes is too high. Trying to handle them and ensuring a consistent world is usually too expensive. It is far better to have the hooks in place to handle failures and apply compensations. They are far rarer and much easier to deal with in isolation.

When you start seeing that something happens on a frequent basis, that is when you want to figure out a way to automate that failure mode.

time to read 1 min | 188 words

Just had to trouble shoot a really strange problem in a production system. The situation is pretty simple. At one point in time, a machine had an issue and shortly afterward became utterly unable to reach the outside world. I was luckily able to SSH into the machine, but any outgoing connection would fail.

Given that TCP worked (I was using it or SSH), how can that be? I can create a connection to the system and have bidirectional communication, but not initiate any calls to the outside world.

Strange doesn’t begin to cover how weird this is. The reason for the failure, by the way, was that the disk was absolutely full. The reason for the network outage was a mystery.

It took a while to figure out that the error was that the disk was full, which caused an error is systemd-resolved, which failed to setup DNS properly. I was also unable to make connections via IP address, which was strange, but the problem went away after the disk was cleared.

I have to say, network failure due to disk errors was not how I expected to start the day.

time to read 3 min | 418 words

We are hiring again (this time for Junior C# Dev positions in Israel). That means that I go through CVs (quite a few, actually).  I like going over the resumes directly, to get a feel for not just a particular candidate but what is, for lack of a better term, the state of the market.

This time, I noticed a much higher percentage of resumes with a GitHub repository link. Anytime that I see such a link, I go and look at what they have there. That is often really interesting. Then again, you run into things like this:

image

On the one hand, this is non production code, it is obviously a teaching project, which is awesome. On the other hand, I find such code painful to look at.

In the past, I would rate highly anyone that would show a GitHub account in the CV, since I could expect to see some of their projects there, usually unique ones. This time? I’m seeing a lot of basically homework assignments, and those aren’t really that interesting to review or look at. Especially since a lot of the candidates apparently had the same courses, so I saw the same 5 projects repeated over and over again.

In other words, just a GitHub account with some repositories are no longer that interesting or meaningful.

Another thing that I noticed was that a lot of those candidates had profiles with profile pictures like:

image

A small tip, if you expect people to visit your profile (and I assume you do, since you provided the link in the resume), it is worth it to put a (professional) picture of yourself there. The profiler readme on GitHub is also surprising attractive when looking at a candidate.

Another tip, if you see a position for a C# Junior Developer, it is acceptable to apply if you don’t have all the requirements, or if you exceed them. But if you are trying to find a new job as a lawyer specializing in family law, maybe don’t try to apply to a tech company.

And yes, I’m using this post as a want to vent while going over so many CVs.

Most CVs are dry, but one candidate just got bumped to the next stage based solely on the fact that in they had a “Making awesome pancakes” in the CV, which made me laugh.

time to read 5 min | 801 words

We got an interesting modeling question from a customer: “What is the optimal manner to represent time sensitive information in RavenDB?”

The initial thought was that they would use revisions and they asked about querying those. The issue is that this isn’t really the purpose for revisions, they are great if you want to see what the state of the system at a particular point in time, but not so good if your business logic has meaning over time.

The best scenario for temporal data that I’m aware of is payroll. You have a lot of data that make sense only in the context of the time if was relevant for. For example, consider an employee that is hired at a given salary level, then given raises over time. The data in this case is divided into several layers.

I’m going to use paper documents as the model here, because it makes it much easier to consider the modeling implications than when talking about JSON or class structures.

On the most basic model, we have the Payslip document, which represent the amount (work, deductions, taxes, etc) that was paid to an employee at a particular point in time. This is similar to this:

image

Once created, such a document will not change. It represent an action that happened in the past and is immutable. From this you can figure out taxes, overall payments, etc.

The Payslip is computed from the Timesheet document, which is similar to this one:

image

A Timehseet document is updated during the Payroll Period whenever an employee signs in or out. At the end of the Payroll Period, a manager will sign off on the Timesheet and approve it for payment. At this point, all the relevant business rules are run and the final Payslip is generated for each employee. Once the Timesheet is signed off and paid, it is no longer mutable and will not change. This make sense, since it represent something that already happened.

In some cases, you’ll have new information, such as an employee that worked, but didn’t report their hours. They will need to do so in a new Timesheet and a new Payslip will be generated.

Using the real world analogy, the Timesheet document is stored at the head office, and you cannot go and update that once it was submitted.

So far, we haven’t seen anything related to things that change over time. In fact, the fact that we have separate documents for Payslips and Timesheets means that we can ignore a lot of the complexity that you’ll usually have to deal with in temporal databases.

We can’t completely get away from it, however. We need to consider the employee’s Contract, however. Usually when we think about the employment contract we think about something like this:

image

The contract specify details such as the hourly rate, overtime payment, vacation time, etc. In payroll systems, contracts are actually more complex than that, because you have to take into account that they change.

For example, consider the following scenario:

  • 1996 – Hired as mailroom clerk – 4.75$ / hour
  • 1998 – Promoted to junior clerk – 5.25$ / hour
  • 1999 – Promoted to clerk – 5.40$ / hour
  • 2002 – Promoted to senior clerk – 6.20$ / hour

How do you handle something like that, in terms of modeling?

The answer depends quite heavily on how your business logic handles this. One way to handle this is to create revisions. Using the real world logic, we are talking about signing a new contract and expiring the old one. But in reality, that isn’t how things are done. You’ll usually just update the payment terms.

How does this looks like in terms of the data modeling when using RavenDB,however? Well, there are two options. We can represent the data as simple values, like so:

image

When the data changes, we update those values (which generates revisions for the old data). However, that isn’t usually ideal, because business logic usually want to access the past values. For example, your contract may change mid payroll period, so your hourly rate is different depending if the hours worked on Monday or Thursday.

In this case, you’ll want to represent the values changing in the model directly, like so:

image

In most cases, this is the best option for modeling data where the temporal aspects of the data needs to be directly exposed to the business logic.

time to read 2 min | 225 words

I have a consultant that did some work for me. While the majority of our people are working in Israel and Poland, we actually have people working for us all over the map.

The consultant submitted their invoice at the end of the month and we sent a wire transfer to the provided account. So far, pretty normal and business as usual. We do double verification of account details, to avoid common scams, by the way, so we know that the details we sent were correct. Except… the money never arrived.

When we inquired, it turned out that the money transfer was reversed. The reason why? This is the address that the consultant provided (not the actual address, mind, but has the same issue). Note that the

Mr. Great Consultant
1234 Cuba Avenue
Alta Vista, Ottawa, K1G 1L7
Canada

The wire transfer was flagged as potential international sanctions violation and refused.

That was… very strange.

It appears that someone saw Cuba in the address, decided that this is a problem and refuse the transfer.  I’m not sure if I would rather that this is the case of an over active Regex or a human not applying critical thinking.

We are now on week two of trying to resolve that with the bank and it is quite annoying.

Next port of call, buying Monero on the dark web… Smile.

time to read 3 min | 563 words

During benchmark run on a large dataset, I started to notice that longer benchmarks were showing decidedly worse numbers than short ones. In other words, a benchmark that is run for 1 minute is orders of magnitude higher latencies than a benchmark that is run for 30 seconds. And the longer the benchmark, the worst things off.

That raised a lot of red flags, and spawn a pretty serious investigation. We take performance very seriously and the observed behavior was that we were getting slower over time. We suspected a leak, high number of GC calls, or memory fragmentation. The scenario under test was a web application using the RavenDB API to talk to RavenDB. We run both the web application and the server under profilers and found a few hot spots, but nothing really major. There was no smoking gun.

Then we noticed that the load testing  machine was sitting there with 100% CPU. I initially thought that this is us generating too much load for the machine, but that wasn’t it. We are using wrk2, which is capable of handling million of requests per seconds.

We were generating the requests dynamically using a Lua script, and in one of the scenarios under test, we have code like this:

path = "/orders/user/" .. page * pageSize .. "/" .. pageSize .. "/?userId=" .. item.id .. "&deep=y"

That isn’t the most optimal way to do things, I’ll admit. We can do better by using something like table.concat(), but the problem was that regardless of how you build the string, this is supposed to be fairly cheap. The wrk2 project is using LuaJIT, which has a reputation as a really scripting system. I never really thought that this would be a problem. Sure, it is a little wasteful, but it isn’t too bad, a few string temporaries and maybe some realloc() calls, but nothing major.

Instead, this resulted in us getting far worse results over time. It took a while to actually figure out why, but the root cause is in the way LuaJIT handles string hashing.

a = lj_getu32(str);
h ^= lj_getu32(str+len-4); b = lj_getu32(str+(len>>1)-2); h ^= b; h -= lj_rol(b, 14); b += lj_getu32(str+(len>>2)-1);

Strings in Lua are interned, which means that there is just a single copy of a string per value. That means that hashing is important, but the way it does hashing is to take the first 4 bytes, the last 4 bytes and the 4 bytes in the middle and use that for a hash. And that is it.

If you have a bunch of strings where those 3 locations match… well, welcome to hash collisions. At which point, what is supposed to be a O(1) call becomes an O(N) call. And creating strings will turn the operations into an O(N^2) operation!

Here is the reproduction code:

Change the prefix to be an empty string for a major performance boost. The actual bug is well known (5 or 6 years), but it was only recently fixed and not on the version that wrk2 is using.

We had to toss out the entire benchmarking set and start over because of this.

We were generating requests with random data, so some of them would hit this problem hard, and some would avoid it by magic. I was not expecting to debug hash collision in Lua code while trying to get some performance numbers from overloading RavenDB, quite random, literally.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}