Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,162
Privacy Policy · Terms
filter by tags archive
time to read 5 min | 835 words

imageAfter the last three parts, I think that I have a good high level understanding of what is going on in the Noise codebase, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

I love reading code, especially stuff that I’m not familiar with, such as a new platform or doing something novel. The Noise codebase is quite interesting, and it is a very early stage in the project. That said, when looking at what it is doing and comparing to what I’m doing in my day job, a few things popped up.

I couldn’t help but think that this is a very similar model to the default Azure DocumentDB. All the data is indexed, and the values are shredded as they go in. The model you use was very similar. The nice thing about this kind of model is that it gives you a lot of freedom and flexibility, but the downside is that if you need more, there aren’t any alternatives. Storing large JSON documents inside Noise is likely to cause problems, especially if you are typically working with the full document. If you are just throwing things into it and then pulling back pieces, especially if the documents you write are relatively small, this is likely to be much more suited.

Noise is making some clever usage of the LSM nature of RocksDB to do cleanups, and I really like the elegance of the indexing engine. The query model is also interesting, in that it gives you a lot of freedom to operate. Select the right fields, define the appropriate aggregation, etc. But queries are limited (no computation) and aggregation queries are going to run with cost of O(N) with respect to their results. Given that this is intended to be embedded, and apparently into low end devices, that might very well be the right decision.

A side affect of the shredding is that you need to reconstruct documents all the time, since they were broken into component parts. Things like scanning results by one field while returning another looks like they can generate O(N * log N) number of queries in RocksDB. I’m not sure how much that can scale with regards to queries. On the one hand, there are very clear layers between the pieces of the code, which is great, but that also have some built in issues with regards to how much performance you can squeeze out of the system. I don’t have a feeling for Rust code performance, nor do I think that the project is ready at this early stage for performance analysis, but it is a concern. Then again, I did just spent the past two years head down in a major performance work, so that might very well be the reason. And it is entirely possible that the code will perform significantly better when the project is done.

One last thing to note, when I tried building a replacement to Lucene one of the annoying thing to pop up was that it is really hard to replace Lucene efficiently and still be better in term of time / space / complexity. Noise in particular is storing all the data at least twice (once as terms to be searched, once as the actual document data). This make is very easy to do certain operations (such as update a document), but it also means that the cost of updating a document is now doubled. We first need to index the new document, then we need to re-index old document, compare between the two and then write the changes. I guess that it is possible that reducing the amount work to be written would help matters, but the cost of indexing is likely to dominate.

I would expect a Noise index to be significantly larger than the Lucene equivalent, because Lucene does a lot to reduce file size (in quite evil & nasty ways, to be fair).

Another thing that I don’t think that is good is the reliance on a query language in an embedded solution. A much better idea in my option would be to expose the raw documents out and let the client handle it. If it is running in the same memory space, there is no extra cost. This is because the moment that this open up a huge number of features / issues. I’m talking specifically about the notion of “return” here. The idea is really cool, you can project exactly what you want from the document, and that allows the engine to only pull the relevant data.

At any rate, this is really getting late now, so I’ll close with the note that Noise is a really cool codebase to read. Even if you don’t know Rust, going ahead and reading something non trivial is a really good exercise in learning new things. And I’m really interested in what Noise is going to do next.

time to read 5 min | 837 words

After the previous two parts, I think that I have a good high level understanding of what is going on in the Noise codebase, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

Now I’m reading through the JsonValue file. The first interesting thing here is:

image

Given all the code I have read so far, I think that this is a really elegant way to handle JSON, both in terms of how you can define such a recursive structure in 7 lines of code and in how this allow you to express relatively complex scenario quite nicely.

One thing that bugs me is that there are a lot of things like this:

image

This is done to have enough space for escaping. The problem is that this waste quite a bit of space. The last few blog posts I had were about 2KB of text each. Imagine doing a batch of a few thousands of them, just this types of calls can take up a lot of extra memory. Now, I’m used to managed memory, where this kind of thing hangs around and need to be collected. This code is in Rust, and it will clean itself up as soon as it is out of scope. But given that we hold on to it for a non trivial amount of time, I think that it is a concern. It might actually be cheaper to go over the string multiple times to figure out exactly how many escape chars we need.

One of the first optimizations we did when we wrote our own JSON parser was get rid of the costly escape char handling. It took literally more time than anything else, because we kept having to scan the string and break it over and over again.

Another issue that popped up is here:

image

A silent assumption here is that the order of properties in the two objects is the same. Two identical objects that have different property order will sort unequal. That is likely to be a surprise to the user.

The next file is parser.rs, which is over 1,300 lines of code (almost 20% of the entire codebase of Noise!), this is pretty typical parsing code, low level, nasty and quite tedious. I skimmed it briefly, but I don’t think that there is anything really important going on there. Its main purpose is to take a query string and produce a filter, next.

The next file is query.rs and I’m going to assume that this will lead me to also jump to filter.rs which I previous skipped.

The code here start using terms that I’m familiar with from Lucene. We’ll start where the money is, in Query::get_matches, which starts by building the query itself:

image

Basically, create a snapshot of the index, parse the query. We need to pass the snapshot to the parser so it can pass it on to the filters that it builds internally. The order of operations in this code also determines the required order of appearances of the clauses in the Noise query syntax.

The rest of the method goes over the query and build it, setting up ordering, aggregation and such. There is close cooperation between query.rs, filter.rs and snapshot.rs. The query is a high level concept of everything that must happen, filter is a much lower level, just finding the right stuff. On top of that the query layer boosting, scoring, ordering and aggregation. The filter uses the snapshot to access the stored information.

For example, the AllDocsFilter is using the snapshot.new_all_docs_iterator, which is defined as:

image

Basically, scan all the values with the prefix of S, which I think stands for sequence number.

More complex stuff can be seen in the Scorer, as a good example. We have terms prefixed with C, which I think stands for Count, which gives the number of times that a particular term appear in a particular position. And K is the number of times a particular property has shown up in general. This seems to be the basis of tf-idf calculation. Those are merged together using RocksDB merge command and the custom merge policy that we looked at in the index.rs.

At this point I looked at the clock and realized that this is now after midnight and that I got to the most interesting parts that I wanted to see, the rest are pretty familiar details, since a lot of this follows the Lucene construction.

I’m going to have another blog post about my thoughts about this project, which will show up tomorrow.

time to read 6 min | 1169 words

In my previous post, I started going over the Noise project code. We got to the indexer, but there is still a lot to do, as a reminder, I’m reviewing 48d808375e541eca383826ea928272976bce064d.

The indexer code handed off a lot of work to the shredder, so it is a good thing that this is my next target for review. At a glance, it looks like Noise is using rustc_serialize to do the actual JSON parsing. I’m guessing it make sense at this point, but given that RavenDB 4.0 had rolled its own JSON parser, and given that Noise is accepting the data as strings, I’m going to assume that JSON parsing costs are going to be high on any profiler output.

The first thing to see in the file is the shredder struct:

image

The existing_key_value_to_delete is very interesting, since it implies that this is how deletes & updates are handled.

It looks like this is working really closely with the key builder. The first method on the file is:

image

There are a whole bunch of stuff like that, with roughly the same pattern (add_bool_null_entries, add_stemmed_entries, etc). Because of the close cooperation of those two, I’m going to be reading them in parallel. The KeyBuilder struct looks like:

image

At this point, I’m guessing that what is going on is that the shredder goes over the JSON  document, and add the property path to the KeyBuilder. But given add_number_entries, I don’t know see how that would work. We are adding the document sequence number to the end of the key path, and store the value of the property in the value. That is… strange. The KeyBuilder data contains the property paths and the current position in the array (possibly nested), so that much is clear, its usage, not so much at this stage.

Because this is confusing, I’ll start from the shred function and go from there, to see what it is exactly that this is going on.

image

This takes the string, pass it to the JSON parser and start iterating over it. Okay, going over the code, that seems easier to follow now. This parse the JSON one token at at time. When it encounters a property, it add that to the KeyBuilder. So the KeyBuilder is basically the path into the current value in the document that we are parsing.

This bugs me, though:

image

I’m following what is going on, and it makes perfect sense. But all those special key prefixes that are scattered are not documented anywhere that I could see, and as I mentioned previously, single letter prefixes are hard on readability and I don’t think they add much, given that RocksDB is using prefix compression anyway.

There is already a PR to fix this, which is likely to be applied by the time you are reading this blog post, by the way. This change the prefixes to be constants, improving code readabily.

A lot of the functionality seems to be inside maybe_add_value, see the call sites:

image

It looks like all the numbers are recorded as floats (given that JSON use double precision IEEE floating point, I guess that make sense). You can see the different prefixes that are given for the values.

The tests at the bottom of the file completes the picture, and if I understand this properly, given the following JSON:

{ "name":"John", "age":31, "city":"New York", “hobbies”: [“reading”, “dogs”] }

The shredder is going to output the following entries, given a document sequence of 555:

  • W.age!f31#555,
  • W.hoobies$!dogs#555,1
  • W.hobbies$!reading#555,0
  • W.name!john#555,

I think that the code calls this key path, and the general format is (white space added:

W <property path> ! <value> # <document sequence> , <array path>

A critical observation is that an array is denoted as a $ in the property path, and there are as many indexes in the array path as there are $ signs in the property path.  I’m currently ignoring the full text search aspect of things, since it is not important for our purposes. With the understanding of what is being generated, let’s move to seeing what is actually going on.

Let us go back to merge_existing_doc, and see how that works. This is called after we shredded a document, so we have done all of the interesting work with regards to parsing the document and building the tokens, splitting the data, etc. At this point, we check if we already have this document, and if so, we load all the existing values and compare them to the current version. The nice thing is that we get to optimize. If a certain value didn’t change, we don’t need to do anything at all. On the other hand, if a value was there and is no longer in the current version, we need to delete it. That is effectively how updates are handled.

With deletes, we do the same, but in reverse, we reconstruct the document, find all the indexed values, and then we remove it. Finally, we have add_all_to_batch, which take all the in memory work that was done and apply that to the RocksDB batch, pretty much finishing all the work. I don’t like that some methods are holding into the state (shredded_key_values / existing_key_value_to_delete)  then flush it and some are manipulating the RocksDB batch directly, it seems like there should be either or kind of approach.

Coming back to the issue that I had with the compaction routine, we have this code, that add a number to the entries:

image

Now, if I’m reading things properly, if the value is zero (not uncommon by any means), that will result in 8 bytes of zeros being written as the value. Combine this with this:

image

And now I think that this will cause that value to be removed during compaction, resulting in data corruption. No idea if this is actually the case, since I’m just going over the code, and not running / testing it thoroughly.

It turns out that I’m not reading this properly (pun intended). I missed the negation on the first line of compaction_filter. The C & K values are the only things that will be removed during compaction, not the other way around, so no data corruption, just me remember that there is a reason I use “== false” instead of “!”, because I keep missing them.

time to read 11 min | 2097 words

I run across this post by Damien Katz, and was immediately intrigued. That has several reasons.  The last time I did a deep review of Damien’s code was about a decade ago, when I went over the CouchDB source code, in an attempt to learn Erlang. What end up happening was that I wrote RavenDB.

Note: This is pretty much my thoughts as I’m going through the code, it isn’t meant to be a guide or anything except a record of what went through my head as I read the code.

Another reason is that this is obviously a subject matter that is near & dear to my heart. Searching is something that we do all  the time, and I would dearly love to read some code that isn’t Lucene to do so. Finally, the code is written in Rust, which I have some passing interest at, so I consider this a win all around. You can find the project site here and the code is here. I’m reviewing 48d808375e541eca383826ea928272976bce064d.

Some things that are obvious from the get go, just by skimming through the docs. This is a small project, less than 10,000 lines of code. That is great, since it means that I can review it in one sitting. I also don’t really care that much for the parser and format that they do, I’m mostly interested in how they are storing and searching the data. The storage is done via RocksDB, which is interesting, because it impose certain limitations on the solution (in particular, single key space) so it is interesting how the data is stored.  Most importantly, I really want to see how this is handling updates, which is what I find to be the hardest problem for search impl.

As usual, I’m going to start reading code in file name order, without any relation to the current structure of the code, and jump around from there.

In aggregates.rs, we find this piece of code, which allows to run an aggregation. The init / action / extract are the set  of operations that happen in the aggregation process.

image

Right now I don’t know what JsonValue is, but it is interesting to read just from the signatures. The action function can get a reference to a mutable instance, an immutable instance and a reference to an immutable instance. I’m assuming that the mutable instance is the one that is being modified to hold the aggregated value.

The aggregation itself is implemented by defining an enum that has a get_fun_impls method that initialize the AggregateFunImpls based on the specific aggregation value required.

Sum is defined as two functions, sum_init and sum, without any final step. The actual implementation is:

image

That is fun to read, because it pretty simple, although the syntax in the if let line is really confusing to me (first time in the code, non Rust dev). Basically, it says that we’ll take the JsonValue instance, convert it from a JsonValue to a number instance, and then set it to the same variable name as the original one.

If I was writing it in C it would probably look like:

JsonValue* existing = …;
int* existing = extract_number_(existing);
existing += new;

That is leaving aside the fact that the “new” that is used here and the “new’ in  the function arguments are not the same (well, they are conceptually, but not the same type, etc). I find this confusing, but I know that this is fairly common in Rust code.

The rest of the file is pretty similar constructs for other aggregation operations, and the next is error.rs, which is just boilerplate code that Rust requires from anything. The next file is filters.rs, which seems really interesting, but it doesn’t make sense at the current stage of the code, because it refers to a lot of other things that I haven’t checked yet, so moving to the next one, index.rs. Hopefully that will start the juicy bits.

Indeed, the interesting bits start when we call open, where we open up a rocksdb instance and start playing around. There is some interesting error handling here:

 

image

In particular, note that if the error failed because the database didn’t exists, it is created, and a header version is added. I wholeheartedly approve, because such early planning will pay dividends down the road. It is important to note that assert statements in Rust are going to be there for release. Only debug_assert will be removed in release builds.

The index has the ability to create a snapshot, based on the RocksDB Snapshots. I wrote a blog post on how that works on LevelDB (which RocksDB is based on).

image

I do wonder about the unwrap call here. If I understand correctly, this would cause a panic if I’m access an instance that isn’t open. That might be bad if you have a query hitting the server during shutdown, I’m guessing. I assume that this is oversight because it make the code easier to consume.

The add method there is where stuff really start to happen.

image

This is quite interesting, because you pass it a json string, and a batch (a Noise concept that basically wraps a RocksDB batch).

Actually, thinking about this, there is something that might be trippy here. RocksDB allow concurrent batches, but while it has support for optimistic concurrency, I’m guessing (at this point it is a guess) that this isn’t used by Noise, and that can lead to severe problems if you are trying to add to the index from concurrent threads. This is important because parallel indexing is pretty much the first optimization step that anyone will try, and Lucene calls it out explicitly as a best practice.

imageThe json is parsed by a class called shredder, I’m assuming that this is the JSON parser used by Noise.

The actual behavior is quite interesting.

image

We create a new shredder, then set it on the JSON string. The “let … if let” syntax gave me a pause for a bit. It is elegant, in a crazy sort of way.

But the really interesting bit here is that this code says quite a lot about the way Noise works. First, we now know that Noise has stable document ids, since we can see it being used when we check if the document is already in the index. That means that there is update support, which is pretty cool. Although I do wonder why the values are merged back in. For deletion, or are they just creating a large piece?

If there is no id, Noise uses a guid. I would usually call out that this might be a problem (because down the line, given enough work, it will create a lot of compaction work around the files), but I’m suspecting that not specifying an id is going to be a pretty rare occurrence, so that isn’t important.

Finally, we add it to the batch:

image

That means tat the shredder is probably a very important piece in the system. I’ll continue to read the index code for now and see what I can figure out, before jumping to that code.

Next, we have delete, and the key part of that method is:

image

So we have delete support, and comparing that to the update process is very interesting. They are different, which means that updates aren’t just delete/insert. That is interesting.

Next we have gather_doc_fields, which gets a string document id. What happens from there is interesting. Let us assume that our document id is “users/1”.

Noise searches the RocksDB for a value named “Iusers/1” <- Note that I prefix, I assume to stand for id. That happens in fetch_seq. If there is such a key, its value would be a string representation of int64, which is parsed and returned. That seq is the internal sequence number, I’m assuming. And I’m guessing it is used for most operations. Let us assume that the seq number returned in 1234.

gather_doc_fields then do another RocksDB search for “V1234#”. This is done via a KeyBuilder, which Noise is using to generate keys based on the depth of the search, I’m guessing. Haven’t really looked into that yet.

The code to collect the keys for a document sent me into a bit of a wild goose chase, in the sense that I couldn’t figure out what was going on here:

image

RocksDB is a key/value database, it doesn’t support multiple values per key. Took me a while to figure out that this is a (I’m guessing inefficient) way to to clone the value from the database owned memory to memory controlled by the Noise.

The next method, flush, just update the internal data (version and last sequence number) and submit the write batch. That concludes it as far as I’m concerned, unless something really strange is going on, indexing in Noise must be done in a single threaded manner. Note that there isn’t anything wrong with that (this is what we ended up doing in RavenDB 4.0), but it is something that needs to be stated out load, because it is contrary to the expected practice. Then again, I haven’t seen anything on indexing speed or any sign of benchmarking yet, maybe the team didn’t get there at this point.

The last thing on this file that is of interest are the RocksDB management functions. The compaction_filter – Will keep any key that starts with K or C, but if the value doesn’t start with them, then the value is interpreted as int32, and if it is 0, it is discarded. This is interesting, because HDB (the header for the db) key is not one of the protected values, and so the reason it isn’t thrown out is that Noise is using little endian value here, and accidently the int64 value (1 in this case) is interpreted as non zero int32. I’m assuming that the same holds true for the rest of values, but it seems more like accidental behavior. There is also the fact that we already know that there are some values (sequence numbers) whose value is a stringified int64. That also will pass the “not zero int32”, but even if it works, it looks wrong.

image

So basically, we have a few more key prefixes, which have a certain meaning, yet unknown. But if they are those values, they should be compared using the KeyBuilder comparison, otherwise, just compare them as is. I don’t like the single letter prefixes. RocksDB uses prefix compression, so I don’t think that using more readable prefix would cost anything.

Finally, we have the sum_merge, which is called by RocksDB when you make a Merge call. This is an interesting approach to handling updates, and what this basically does in the case of Noise is that it allow to merge multiple calls to the same value and return the sum total of them. I’m assuming that the value used here is the ref count or some such. Given the interplay between sum_merge and compaction_filter.

That is it for the index.rs file. There are actually tests there for the index, which is interesting. I don’t like having tests in the same file, but I think it is common in Rust.

The only thing of note there is that we find some interesting information about the format of the keys:

The document:

image

Is translated into the following:

image

That is the job of the KeyBuilder, I think, but I’ll leave digging into that to another post.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}