Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 411 words

The trigger for this post is a StackOverflow question that caught my eye.

Let us imagine that you have the following UI, and you need to implement the search function:

enter image description here 

For simplicity’s sake, I’ll assume that you have the following class:

And we need to implement this search, we want users to be able to search by the restaurant name, or its location or its cuisine, or all of the above, for that matter. A query such as”Rama Thai” or “coffee 48th st” should all give us results.

One way of doing that is to do something like this:

Of course, that would only find stuff that matches directly. It will find “Rama” or “Thai”, but “Rama Thai” would confuse it. We can make it better, somewhat, but doing a bit of work on the client side and changing the query, like so:

That would now find results for “Rama Thai”, right? But what about “Mr Korean” ? Consider a user who have no clue about the restaurant name, let alone how to spell it, but just remember enough pertinent information “it was Korean food and had a Mr in its name, on Fancy Ave”.

You can spend a lot of time trying to cater for those needs. Or you can stop thinking about the data you search as the same shape of your results and use this index:

Note that what we are doing here is picking from the restaurant document all the fields that we want to search on and plucking them into a single location, which we then mark as analyzed. This let RavenDB know that it is time to start cranking. It merge all of those details together and arrange them in such a way that the following query can be done:

And now we don’t do a field by field comparison, instead, we’ll apply the same analysis rules that we applied at indexing time to the query, after which we’ll be able to search the index. And now we have sufficient information not just to find this a restaurant named “Fancy Mr Korean” (which to my knowledge doesn’t exist), but to find the appropriate Korean restaurant in the appropriate street, pretty much for free.

Those kind of features can dramatically uplift your applications’ usability and attractiveness to users. “This sites gets me”. 

time to read 2 min | 362 words

I want to start by saying that this isn’t my idea, I read about it a few times, and I recently encountered it with sodium_malloc, so I decided to write my own implementation of the world’s most expensive memory allocator.

What this code does is pretty simple, and quite brutal. It allocates memory in such a fashion that absolutely guarantee that you can’t get away with a whole host of memory problems.

For example, if you try overwrite a buffer allocated by this method, you’ll immediately hit the guard page and die horribly (and predictably, in the place where the error actually happened, not a long way off). If you somehow write before the buffer, that will be detected on free if this is a small under write (which tend to be much rarer, by the way), or immediately if this is a big change.

What is more, once the memory is freed, it is poisoned, and can never be used again. This pretty much rely on us running on 64 bits with effectively unlimited virtual memory, and has the nasty side effect of turning a 50 bytes allocation to something requiring 12 KB. Having said that, as a debugging tool, this is invaluable.

And yes, I’m aware that windows already have that with the heap verifier. But as I’m using this in .NET code, I needed to write my own (this also pretty much work the same way with Linux, you just need to switch the API, but the functionality is the same).

This was built because we were chasing a memory corruption error, and I run this, but it pointed me to a totally different location than suspected. So it is either doing a very good job, or it found me another issue.

… investigating …

Or, as the case may be, we found a bug in the actual memory guard (we didn’t handle allocations of exact page size correctly, and they broke), but at least it broke consistently and was pretty easy to find once I looked in the right place Smile.

time to read 3 min | 428 words

Image result for encryption iconWith RavenDB 4.0, we are looking to strengthen our encryption capabilities. Right now RavenDB is capable of encrypting document data and the contents of indexes at rest. That is, if you look at the disk, the data is securely encrypted. However, in memory, we keep quite a bit of information in plain text (mostly in caches of various kinds), and the document metadata isn’t encrypted, so documents keys are visible.

With RavenDB 4.0 we are looking into making some stronger guarantees. That means that we want to keep all data encrypted on disk, and only decrypt it during transaction, after which it will immediately be encrypted back.

Now, encryption and security in general are pretty big fields, and I’m by no means an expert, so I thought that I would outline the initial goals of our research and see if you have anything to add.

  • All encryption / decryption operations are done on data that is aligned on 4KB boundary and is always in multiples of 4 KB. It would be extremely helpful if the encryption would not change the size of the data. Given that the data is always in 4KB increments, I don’t think that this is going to be an issue.
  • We can’t use managed API to do so. Out data is actually residing in unmanaged memory, so ideally we would need something like this:




  • I also need to do this be able to call this from C#, and it needs to run on Windows, Linux and hopefully Mac OS.
  • I’ve been looking at stuff like this page, trying to understand what it means and hoping that this is actually using best practices for safety.

Another problem is that just getting the encryption code right doesn’t help without managing all the rest of it properly. Selecting the appropriate algorithm and mode, making sure that the library we use is both well known and respected, etc. How do we distributed / deploy / update it over multiple platforms?

Any recommendations?

You can see some sample code that I have made here: https://gist.github.com/ayende/13b206b9d83e7aa126df77d6b12711f3

This is basically the sample OpenSSL translated to C# with a bit of P/Invoke. Note that this is meant for our own use, so we don't need padding since we always pass a buffer that is a multiple of 4KB. 

I'm assuming that since this is based on the example on the OpenSSL wiki, it is also a best practice sample. There is a chance that I am mistaken, however, which is why we have this post.

time to read 4 min | 645 words

I run into this post, which was pretty interesting to read. It compares a bunch of ways to index inside CouchDB, so I decided to see how RavenDB 4.0 compares.

I wrote the following code to generate the data inside RavenDB:

In the CouchDB post, this took… a while. With RavenDB, this took 7.7 seconds and the database size at the end was 48.06 MB. This is my laptop, 6th gen i7 with 16 GB RAM and SSD drive. The CouchDB tests were run over a similar machine, but with 8 GB RAM, but RavenDB didn’t get to use much memory at all throughout the benchmark.

It is currently sitting on around 205MB working set, and allocated a total of 70 MB managed memory and 19 MB of native memory.

I then created the following index:

image

This does pretty much the same as the indexes created in CouchDB. Note that in this case, this is running in process, so the nearest equivalent would be to Erlang native views.

This took 1.281 seconds to index 100,000 documents, giving us a total of 78,064 indexed documents per second.

The working during the indexing process grew to 312 MB.

But those were small documents, what about larger ones? In the linked post, there was another test done, this time with each index also having a 300KB text property. Note that this is a single property.

The math goes like this: 300KB * 100,000 documents = 30,000,000 KB = 29,296 MB = 28.6 GB.

Inserting those took 1 minute and 40 seconds. However, the working set for RavenDB peaked at around 500 MB, and the total size of the database at the end was  512.06 MB. It took me a while to figure out what was going on.

One of the features that we have with the blittable format is the fact that we can compress large string values. The large string property was basically lorem ipsum repeated 700 times, so it compressed real well. And the beauty of this is that unless we actually need to touch that property and do something to it, it can remain compressed.

Indexing that amount of data took 1.45 seconds, for 68,965 indexes documents per second.

My next test was to avoid creating a single large property and go with a lot of smaller properties.

This generates 100,000 documents in the 90KB – 900 KB range. Inserting this amount of data took a while longer. This is mostly because of the amount of data that we write in this scenario which is in the many GB range. In fact, this is what my machine looked like while the insert was going on:

image

Overall, the insert process took 8:01 minutes to complete. And the final DB size was around 12 GB.

Indexing that amount of information took a lot longer, and consume a few resources:

image

Time to index? Well, it was about twice as much as before, with RavenDB taking a total of 2.58 seconds to index 100,000 documents totaling 12 GB in size.

That comes to 38,759 indexes documents per second.

A large part of that is related to the way we are storing information in blittable format. We don’t need to do any processing to it, so we are drastically faster.

Comparing to the CouchDB results in the linked post, we have:

  RavenDB CouchDB
Small documents 78,064 19,821
Large documents 38,759   9,363

In other words, we are talking about being 4 times faster than the faster CouchDB option.

time to read 3 min | 428 words

imageFor the past two years or so, we have been working very hard on RavenDB 4.0. You have seen some of the posts in this blog detailing some of the work we have been doing. In RavenDB 4.0 we have put a tremendous amount of emphasis on performance and reducing the amount of work we are doing. We have been quite successful. Our metrics show a minimum of 10x improvement across the board, and often we are even faster.

I’m going to be posting a lot about RavenDB 4.0 in the next few weeks, to highlight some of the new stuff, this time in context, but for now, let me just say that I haven’t been this excited about releasing software since the 1.0 release of RavenDB.

You can download the new alpha here, we have pre-built binaries for Windows, Linux (Ubuntu) and Raspberry PI. Mac OS support is scheduled and will likely be in the next alpha.

The alpha version supports building databases, working with documents, subscriptions, changes, indexing (simple, full text and map/reduce). There have been significant improvements to indexing speed and how they work, memory utilization across the board is much lower and CPU utilization is much reduced.

We aren’t ready to do full blown benchmarks yet, but a sample scenario that has RavenDB 3.5 pegged at 100% with 45,000 requests per second has RavenDB 4.0 relaxing with 70% CPU with 153,000 requests per second.

We would like to get feedback on RavenDB 4.0, and so please download the software and give it a whirl.

You need to make sure that you use both server & client from the same version (older versions will not work with the new server). We actually have a couple of clients, we have the updated client for 4.0, which is pretty much the same as the old client, just updated to work with the new server, and we have a completely new client, which is built to provide much higher performance and apply the same lessons we learned while building the server to the client code. Over time, the new client is going to be the default one, but while we aren’t there yet, we want people to have access to it from the get go.

Please remember that this is pre-release alpha software, as such, bugs are expected, and should be reported.

time to read 4 min | 613 words

A feature is something that your application/service does. Usually we don’t give it a lot more thought, but I recently had an interesting discussion about the exact distinctions between a business feature and a technical feature.

Let us imagine that we are talking about an application that allow to send snail mail, we already seen it before. A user will call the API and then a few days later a physical letter will show up at your door. So far, it is pretty simple. The question is, what can you offer in addition to expand the business.

For example, we might offer:

  • Mail tracking – providing a way to ensure that the recipient actually got the letter.
  • Snail mail to email – getting a physical email, and having that sent to the customer.

Those two are obvious extensions to the core business, and from the point of view of the business, it is great. From a technical perspective, that is a whole lot of complexity. You need to integrate with FedEx to handle the mail tracking, and you need to setup some sort of an automated system that will sort the mail, scan it and upload it to the customer’s account.

The problem is that at this point, you don’t really know what kind of reaction those features are going to have. They are both non trivial and in some cases require major capital expenditure to implement and are pretty hard to properly size upfront.

So you split it. Instead of doing this as a single feature, you have a business feature and a technical feature. A business feature means that your business offers this service, building that requires research to show that we can actually offer that, check whatever there are legal ramifications (some mail can be sensitive, privacy concerns, etc), check what kind of pricing we can charge, etc.  The technical feature is actually implementing all of that.

But the key observation here is that you don’t actually do the technical implementation, at least not just yet. You do the work around the business end of the feature, and then you announce this feature availability. As in, right now you can track the snail mail, or right now you can get your mail scanned and uploaded. This is done with minimal technical work in the backend, and with the caveat that this still experimental and pricing might change.

This isn’t cheating, mind you. Once you announced this feature, you wait to see what kind of reaction we’ll have. One of the options is that users will really love this feature, and start immediately using it. In this case, you have a good problem, people are flocking to give you money. In the meantime, you have Joe and Samantha, from the local high school working for minimum wage in the afternoon to manually do the work. So you can complete the customer expectations, as you are now working to complete the technical side and automate the whole thing (firing Joe and Samantha along the way).

The key here is that you don’t have to do any major upfront investment, in development or in facilities, before you can have this feature. And most of the time, even if it is a major feature, the ramp up time is enough for you to have a pretty good idea about what you actually need to do. And in the meantime, you have a micro service architecture, it is just that the services aren’t called FedExTrackingService and ScanAndSortPhysicalMailService but Joe and Samantha.

In other words, you have mechanical Turk the feature until you can teach you system to properly play chess.

time to read 1 min | 172 words

I was asked recently why I’m “burning” my interview questions by posting them on the blog. That actually has several reasons:

  1. If a candidate reads my blog and is able to produce high quality code based on this, well… that is pretty much the job description right there.
  2. We just finish another recruitment round, and we aren’t planning another one for at least 4 – 6 months.
  3. The fact that I’m posting the answers to a specific question doesn’t mean that the the subject matter is closed.

For example, let us take this question & answer. Note that this is approachable because  there is just standard .NET code.

However, the code as posted contains a bug, it is a small one, and all unit tests are passing, but it result is slightly inefficient behavior. Exposing that to a unit test is relatively easy, but going from the failing test back to the root cause and then fixing it would be an interesting investigative technique and good show of skills.

time to read 1 min | 141 words

In a previous post, I showed how to get 10x performance boost by batching remote calls. The code included the following method:

This code has a subtle bug in it. Take a look and see if you can find it.

Yes, I’ll wait, honestly.

Go read the code and think about interesting ways to break it.

Okay, found it? Awesome, now let me explain anyway Smile.

This code suffer from the slow arrival syndrome. If we have a new request coming in every 100 ms, then the first request here will languish in the queue for over 25 seconds!

Luckily, the fix is simple:

We just need to make sure that we aren’t waiting for too long. In this case, we will wait a maximum of less than half a second for the messages to go over the wire.

time to read 2 min | 334 words

During the move of RavenDB to Linux we had to deal with the usual share of cross platform bugs. The common ones are things like file system case sensitivity, concatenating paths with \ separators, drive letters, etc. The usual stuff.

Note that the hairy stuff (different threading model, different I/O semantics, etc) I’m not counting, we already had dealt with them before.

The real problems had to do with the file systems, in particular, ext3 doesn’t support fallocate, which we kind of wanna use to allocate our disk space in advance (which can give the file system information so it can allocate all of that together). But that was pretty straightforward. ENOSUP is clear enough for me, and we could just use ext4, which works.

But this bug was special. To start with, it snuck in and lain there dormant for quite a while, and it was a true puzzle. Put simply, a certain database would work just fine as long as we didn’t restart it. On restart, some of its indexes would be in an error state.

It took a while to figure out what was going on. At first, we thought that there was some I/O error, but the issue turned out to be much simpler. We assumed that we would be reading indexes in order, in fact, our code output the index directories in such a way that they would be lexically sorted. That worked quite well, and had the nice benefit that on NTFS, we would always read the indexes in order.

But on Linux, there is no guaranteed order, and on one particular machine, we indeed hit the out of order issue. And in that case, we had an error raised because we were trying to create an index whose id was too low. Perfectly understandable, when you realize what is going on, and trivial to fix, but quite hard to figure out, because we kept assuming that we did something horrible to the I/O system.

time to read 5 min | 826 words

A few months ago I asked this question, and since then I’ve been asking it from some candidates. This is my 2nd tier question, and answering this correctly is going to give you quite a few brownie points.

The idea is to implement this interface:

But the only field that you can put in the implementation is a 32 KB byte array. See the original post for the full details.

Let us get rid of the easy stuff first, since the only value I can keep is a 32 KB byte array, saving and loading the values is trivial:

We can just save and load it directly, nothing much to do except validate that the size match on load.

Now, let us see what we need to do, here is an example of a trie:

Typically, you would implement that using nested dictionaries, but that isn’t going to work given this exercise limitations.

So let us consider a more primitive alternative. We need to store the following information about each node in the trie. We start by defining the following node:

This wouldn’t work for the limits we have, but we are going to be building this in pieces. So this is an important step to understand how we go about it.

We are going to store the data in the following format:

image

So on each level, we are able to jump to the next stage and check its values. Searching for the right value in the children array is going to be simply a matter of doing  binary search on the children array.

The full code listing is linked from the bottom of the post, if you want to see it in context.

And here it is when it is stored as an object tree. Note that each child store the full key, not just the part it owns, because it make it easier to visualize.

image

Let us look at how we are building this kind of trie. We’ll start with two pretty simple helper methods.

There isn’t much to really see here, FindDifference will find the first difference between two strings from a given offset and FindMatch does a standard binary search on the array. The only special thing here is the fact that we are returning the match position even if we failed, we needed that to be able to know where to put the next entry. This will be clear when we’ll look at the Insert method.

This is quite a bit of code, but it can neatly divide into three separate options. In the first case, we reach an empty section in the tree, so we can just create a new leaf and call it a day. This is why we use a ref variable here, because it allows us to mutate the given parameter, which can be either the root, or the array on a nested node.

If there are already values at this level, we check if we need to go into the next level (creating the new level if necessary).

If there isn’t a value at this level that start with the current prefix, we just add it to this level as a value.

Pretty simple, once you break it down. The fun part about this is that as written, this is actually safe for multi threaded use, so a single write can make modifications at the same time that multiple readers can read, without any need for synchronization.

Reading from the trie given this structure is pretty simple:

We simply use the same logic as before, but because we have to make no changes, this is much simple to work with.

The last bit that we still need to handle is the deletes. Here is how this is handled.

That is basically the inverse of Insert, but it need to handle patching up the trie on the way back up again.

And this is pretty much it, right?

Except… that this is not what I started with. Where is the byte array? We are allocating memory here like crazy and all we did was implement a pretty standard trie without the use of dictionaries.

What is the point?

Well, now that we have this implementation, we have a good understanding on what is actually required of us, and we can take this code and move it to the array, but that would be in the next post.

In the meantime, you can read the entire code in one go here.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}