Reviewing Noise Search Engine: Part II

time to read 6 min | 1169 words

In my previous post, I started going over the Noise project code. We got to the indexer, but there is still a lot to do. As a reminder, I’m reviewing commit 48d808375e541eca383826ea928272976bce064d.

The indexer code handed off a lot of work to the shredder, so it is a good thing that this is my next target for review. At a glance, it looks like Noise is using rustc_serialize to do the actual JSON parsing. I’m guessing that makes sense at this point, but given that RavenDB 4.0 rolled its own JSON parser, and given that Noise is accepting the data as strings, I’m going to assume that JSON parsing costs are going to be high on any profiler output.

The first thing to see in the file is the shredder struct:

[image: the Shredder struct]
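Roughly, and going only by the fields that come up later in this post, I picture the struct as something like the following sketch; the actual definition in the Noise repository will differ in names and types:

    use std::collections::HashMap;

    // A sketch based on the fields discussed in this post, not the actual Noise source.
    // KeyBuilder is only stubbed here; it is sketched separately below.
    struct KeyBuilder;

    struct Shredder {
        kb: KeyBuilder, // tracks the current property / array path
        doc_seq: u64,   // sequence number of the document being shredded (a guess)
        // entries generated for the new version of the document
        shredded_key_values: HashMap<String, Vec<u8>>,
        // entries from the previously indexed version that should be removed
        existing_key_value_to_delete: HashMap<String, Vec<u8>>,
    }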

The existing_key_value_to_delete is very interesting, since it implies that this is how deletes & updates are handled.

It looks like this is working really closely with the key builder. The first method in the file is:

[image: the first of the add_*_entries methods]

There is a whole bunch of methods like that, all with roughly the same pattern (add_bool_null_entries, add_stemmed_entries, etc.). Because of the close cooperation between those two, I’m going to be reading them in parallel. The KeyBuilder struct looks like:

[image: the KeyBuilder struct]
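Going by how it is described and used, my mental model of it is roughly this; again, a guess at the shape rather than the real definition:

    // A sketch of KeyBuilder based on this post's description, not the actual code.
    struct KeyBuilder {
        // the property path segments, e.g. ["hobbies$"]; a "$" marks an array
        keypath: Vec<String>,
        // one index per "$" in the keypath, tracking the position in each nested array
        arraypath: Vec<u64>,
    }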

At this point, I’m guessing that what is going on is that the shredder goes over the JSON document and adds the property path to the KeyBuilder. But given add_number_entries, I don’t see how that would work. We are adding the document sequence number to the end of the key path, and storing the value of the property as the value. That is… strange. The KeyBuilder data contains the property paths and the current position in the array (possibly nested), so that much is clear; its usage, not so much at this stage.

Because this is confusing, I’ll start from the shred function and go from there, to see what exactly is going on.

[image: the shred function]

This takes the string, passes it to the JSON parser and starts iterating over it. Okay, going over the code, it seems easier to follow now. This parses the JSON one token at a time. When it encounters a property, it adds it to the KeyBuilder. So the KeyBuilder is basically the path to the current value in the document that we are parsing.
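To make that flow concrete, here is a minimal sketch of the pattern, with a made-up JsonToken enum standing in for the real parser events; this is neither the rustc_serialize API nor the Noise code:

    // Made-up token type and a much simplified loop, for illustration only.
    enum JsonToken {
        ObjectStart,
        ObjectEnd,
        Key(String),
        StringValue(String),
    }

    // Walks the token stream, keeping the current property path in `keypath`
    // and emitting (path, value) pairs for every scalar value it sees.
    fn shred(tokens: impl Iterator<Item = JsonToken>, keypath: &mut Vec<String>) -> Vec<(String, String)> {
        let mut entries = Vec::new();
        for token in tokens {
            match token {
                // a property name extends the current key path
                JsonToken::Key(name) => keypath.push(name),
                // a value is recorded under the current path, then the segment is popped
                JsonToken::StringValue(value) => {
                    entries.push((keypath.join("."), value));
                    keypath.pop();
                }
                JsonToken::ObjectStart | JsonToken::ObjectEnd => {}
            }
        }
        entries
    }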

One thing in the actual code bugs me, though:

[image: key construction using single-letter prefixes]

I’m following what is going on, and it makes perfect sense. But all those special key prefixes scattered around are not documented anywhere that I could see, and as I mentioned previously, single-letter prefixes are hard on readability and I don’t think they add much, given that RocksDB is using prefix compression anyway.

There is already a PR to fix this, which is likely to be applied by the time you are reading this blog post, by the way. It changes the prefixes to constants, improving code readability.

A lot of the functionality seems to be inside maybe_add_value; see the call sites:

[image: the call sites of maybe_add_value]

It looks like all the numbers are recorded as floats (given that JSON uses double-precision IEEE floating point, I guess that makes sense). You can see the different prefixes that are given to the values.

The tests at the bottom of the file complete the picture, and if I understand this properly, given the following JSON:

{ "name":"John", "age":31, "city":"New York", “hobbies”: [“reading”, “dogs”] }

The shredder is going to output the following entries, given a document sequence of 555:

  • W.age!f31#555,
  • W.hobbies$!dogs#555,1
  • W.hobbies$!reading#555,0
  • W.name!john#555,

I think that the code calls this a key path, and the general format is (whitespace added):

W <property path> ! <value> # <document sequence> , <array path>
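To check my reading of that format, here is a tiny helper (entirely mine, not from the Noise code) that reproduces the entries listed above:

    // Illustrative only: builds a key in the shape described above; the names and
    // any escaping rules are my guesses, not Noise's.
    fn build_key(property_path: &[&str], value: &str, doc_seq: u64, array_path: &[u64]) -> String {
        let array = array_path
            .iter()
            .map(|i| i.to_string())
            .collect::<Vec<_>>()
            .join(",");
        format!("W.{}!{}#{},{}", property_path.join("."), value, doc_seq, array)
    }

    // build_key(&["age"], "f31", 555, &[])           => "W.age!f31#555,"
    // build_key(&["hobbies$"], "reading", 555, &[0]) => "W.hobbies$!reading#555,0"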

A critical observation is that an array is denoted as a $ in the property path, and there are as many indexes in the array path as there are $ signs in the property path. I’m currently ignoring the full-text search aspect of things, since it is not important for our purposes. With that understanding of what is being generated, let’s move on to seeing what is actually going on.

Let us go back to merge_existing_doc and see how that works. This is called after we have shredded a document, so we have done all of the interesting work of parsing the document, building the tokens, splitting the data, etc. At this point, we check if we already have this document, and if so, we load all the existing values and compare them to the current version. The nice thing is that we get to optimize: if a certain value didn’t change, we don’t need to do anything at all. On the other hand, if a value was there and is no longer in the current version, we need to delete it. That is effectively how updates are handled.
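In pseudo-Rust, the update handling I think I am seeing looks roughly like this (the names and signature are mine, not the actual merge_existing_doc):

    use std::collections::HashMap;

    // A sketch of the diffing described above, not the Noise implementation.
    fn merge_existing_doc(
        new_entries: &mut HashMap<String, Vec<u8>>, // what we just shredded
        existing_entries: HashMap<String, Vec<u8>>, // what is already in the index
        to_delete: &mut HashMap<String, Vec<u8>>,   // stale entries to remove later
    ) {
        for (key, old_value) in existing_entries {
            if new_entries.get(&key) == Some(&old_value) {
                // the value did not change, no need to write it again
                new_entries.remove(&key);
            } else if !new_entries.contains_key(&key) {
                // the key is gone from the new version, schedule it for deletion
                to_delete.insert(key, old_value);
            }
            // otherwise the key exists with a new value and stays in new_entries
        }
    }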

With deletes, we do the same, but in reverse: we reconstruct the document, find all the indexed values, and then remove them. Finally, we have add_all_to_batch, which takes all the in-memory work that was done and applies it to the RocksDB batch, pretty much finishing all the work. I don’t like that some methods hold on to state (shredded_key_values / existing_key_value_to_delete) and then flush it, while others manipulate the RocksDB batch directly; it seems like there should be one approach or the other.
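Sketching that last flush step, with a stand-in Batch trait because I don’t want to misquote the rocksdb crate’s API from memory:

    use std::collections::HashMap;

    // Stand-in for the RocksDB write batch; the real code talks to rocksdb directly.
    trait Batch {
        fn put(&mut self, key: &[u8], value: &[u8]);
        fn delete(&mut self, key: &[u8]);
    }

    // A sketch of add_all_to_batch: push the buffered work into the batch.
    fn add_all_to_batch<B: Batch>(
        batch: &mut B,
        shredded_key_values: &HashMap<String, Vec<u8>>,
        existing_key_value_to_delete: &HashMap<String, Vec<u8>>,
    ) {
        for (key, value) in shredded_key_values {
            batch.put(key.as_bytes(), value);
        }
        for key in existing_key_value_to_delete.keys() {
            batch.delete(key.as_bytes());
        }
    }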

Coming back to the issue that I had with the compaction routine, we have this code, which adds a number to the entries:

[image: the code that writes the number value]

Now, if I’m reading things properly, if the value is zero (not uncommon by any means), that will result in 8 bytes of zeros being written as the value. Combine this with this:

[image: the compaction_filter function]

And now I think that this will cause that value to be removed during compaction, resulting in data corruption. No idea if this is actually the case, since I’m just going over the code, and not running / testing it thoroughly.
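The first half of that concern is easy to confirm in isolation: a zero f64 really does encode to eight zero bytes (this little check is mine, using today’s std API, not code from Noise):

    fn main() {
        // 0.0 as a 64-bit IEEE 754 float is eight zero bytes
        let bytes = 0.0_f64.to_le_bytes();
        assert_eq!(bytes, [0u8; 8]);
        println!("{:?}", bytes); // [0, 0, 0, 0, 0, 0, 0, 0]
    }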

It turns out that I’m not reading this properly (pun intended). I missed the negation on the first line of compaction_filter. The C & K values are the only things that will be removed during compaction, not the other way around, so there is no data corruption, just a reminder that there is a reason I use “== false” instead of “!”: I keep missing them.
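To show the kind of thing I mean, here is a toy version of the two styles (my own example, not the actual compaction_filter):

    // Toy illustration of the readability point; per the post, only the 'C' and 'K'
    // entries are candidates for removal during compaction.
    fn compaction_keeps(key: &[u8]) -> bool {
        // easy to misread: the '!' means everything that is NOT a C/K entry is kept
        !(key.starts_with(b"C") || key.starts_with(b"K"))
    }

    fn compaction_keeps_explicit(key: &[u8]) -> bool {
        // same logic, but the negation is much harder to skim past
        (key.starts_with(b"C") || key.starts_with(b"K")) == false
    }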

More posts in "Reviewing Noise Search Engine" series:

  1. (20 Jun 2017) Summary
  2. (19 Jun 2017) Part III
  3. (16 Jun 2017) Part II
  4. (15 Jun 2017) Part I