Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 499 words

I run into this over twitter:

image

There were some suggestions there to go to meetups, find a mentor, etc. Those are important, but I consider them secondary to what you need to be a good developer.

My advice:

Write code, you'll likely write crap code, but write code, and a lot of it.

Read code, you'll not understand some, but try to.

The order matter.

The only way to be a good developer is to be a bad developer first. I have a drawer full of old hard disks that contain old code, some of it goes back over 20 years. I still remember being incredibly proud in writing a full BBS system in VBScript & ASP (classic!) that didn’t use a database but rather manipulated the HTML files on disk directly so you had what is effectively a static website that would self modify itself. The impressive thing was this was a single nested switch statement that went on for thousands of lines. I somehow managed to keep it all in my head enough to be able to actually complete the project.

It would never work in practice (I didn’t have any concept of “what happens if two requests happen at the same time”) and it was never deployed, but it was code that I wrote, and that thought me what works. More importantly, it told me what doesn’t work. That meant reading errors, figuring out how to find faults in my program, getting used to run <----> modify cycle, etc.

I wrote web systems, gesture recognition systems that would serve as hot keys in Windows, shell extensions and a lot of random stuff. Most of it was never meant to be anything, it was just a way for me to explore. The more I wrote, the more I knew what was going on.

At that point, reading other people’s code would have done nothing for me. I wasn’t at a level that I could grasp what other people were doing. It took a long time until I was ready to actually peek into other people’s code and actually be able to make sense of it. More to the point, it took a long time until I was able to actually learn something from that, rather then just go with a targeted “what do I need to make X work”.

Having other people there to help can be very useful, but it can also be a crutch. At least initially, you need to fall down a lot to figure things out. Mostly because people have very hard time telling you how they found the problem in your code. “It’s obvious that this is here” doesn’t give you much to learn from except possibly that you are stupid for missing the obvious. A lot of the advice that this tweet got is absolutely something that I can get behind, but I would put it significantly later in the process.

time to read 3 min | 411 words

imageWhen you setup a RavenDB server securely, it doesn’t require a valid X509 certificate to be able to connect to it.

You might want to read that statement a few times to realize what I’m saying. You don’t need a certificate to connect to a RavenDB service. But on the other hand, we use client certificate to authenticate connections. So which is it?

The problem is with the specific OSI layer that we are talking about. When using SSL / TLS you can require a client certificate on connection. That ensures that only connections with a client certificate can be processed by the server. This is almost what we want, but it has the sad effect of having really poor error reporting capabilities.

For example, let us assume that you are trying to connect to a secured RavenDB server without a certificate. If we just require client certificate, then the error would be raised a the SSL layer, giving you errors that look something like:

  • The connection has been closed
  • Could not establish trust relationship for the SSL/TLS

This can also happen if there is a firewall in the middle, if enough packets were dropped, if there is a solar storm or just because it is Monday.

Or, it can be because you don’t have a certificate, your certificate expired, it is not known or you are trying to access a resource you are not authorized to.

In all of those cases, the error you’ll receive if you require & validate the certificate at OSI Layer 6 is going, to put it plainly, suck. Just imagine trying to debug something like that in a protocol that by design is intended to be hard / impossible to eavesdrop to. Instead, we can allow access to the server without a certificate and enforce the required certificate at a higher level.

That means that when you connect to RavenDB and you don’t have a certificate, we’ll accept the connect and then either give you an error (if you are an API) that you can reason about (because it will tell you what is wrong) or you’ll be redirected into a nice error page (if you are using the browser). Alternatively, if you have a certificate and it is not valid for any number of reasons, RavenDB will tell you all about it.

That reduce the amount of hair loss in the case of errors significantly.

time to read 2 min | 387 words

imageI talked about the Merge Games in somewhat of a jest, but more seriously, there is a lot to worry about once you have long running branches. In our case, it isn’t so much that we have a lot of long running branches as we have a ton of changes that are happening in multiple branches in parallel and it sometimes can take a few weeks until the work is done and we can merge it all.

This put a lot o pressure on the code review part of the process. One of the things that I really like with GitHub is the PR / review processes, and it works great when you have a small commits / PRs. The problem is that when you are talking about large scope of work, you are left with few options for proper review.

One option is to get a PR with dozens of commits, and having to slog through each of them to understand what is going on. Another is to get a PR with a single commit, that contains a lot of changes. This means that you have to grasp the whole change in one shot. Either option is really hard, and can lead the reviewer to skim through the code.  That isn’t something that we want to do, instead, we really want to pay as much attention to the code as we did while writing it.

My process for handling this is to lean heavily on GitHub. What I do is create a PR very early in the process, sometimes immediately after the first commit in that branch. That gives me the ability to review things incrementally. Instead of having to deal with it all at once, I can review the changes as they come in. Whenever one of the developers push their commits, I’ll get a notification about be able to go over the details and comment on the spot.

That shortens the feedback cycle and remove a lot of the complexity from the review process. It also means that we can more easily note that one developer is doing something that is also being done by another team, so we can integrate the work earlier in the process.

time to read 2 min | 275 words

imageWe currently have four different teams that are working on large modifications to RavenDB.

Large modifications means that they are working on a feature for a relatively long time and very frequently need to do modifications to large swaths of the code base. Oh, and the common theme for all of them is that they are all big enough that it means that you cannot just merge back into the main branch. There are often a lot of failing tests or even uncompilable state during a long refactoring session.

The good thing is that we are pretty good about making sure that we merge from the main branch on a regular basis. The bad thing is that once we start merging those big changes, the other large refactoring are going to have to deal with a lot of changes happening very quickly.

Hence, the merge games. The fastest one of those team is able to hit the “can we put this in the main branch” point, the less work it is going to be for them. On the other hand, the slower you are in getting to that point, the more conflicts you are likely to run into and have to resolve.

I don’t think it would be a best seller book series and I doubt that I’ll get a movie deal from it, but in a certain select group of people, I think that this will be an amazingly fun game (as long as you aren’t the one left holding the shitty end of the merge conflicts).

time to read 3 min | 576 words

In the previous post I talked about what I wanted to get, and how I decided to use the GOLD parser to formally define the language grammar. However, I had no intention of actually generating the parser, I wanted to write it myself. This is because I have a lot of poor experience with the results of parser generators. They generate horrible code that manages to be both unreadable and force you to go into it frequently enough to be maddening.

In particular, I haven’t been able to find anything that would be good enough as a hand rolled parser in terms of performance, readability of the code and the quality of errors it will generate. The later might sound strange, but a key part of the usefulness of any language is the kind of errors that you get when you give it invalid output.

An important consideration is that I’m intending to accept input over HTTP, so I can assume that my input is already wrapper in a pretty System.String package, which saves a lot of the complications that you usually have to deal with if your input is a streaming medium. Initially I tried to go with a scanner / parser in the same place, but that ended up being a Bad Idea, it was too complex to handle. Instead, I split the responsibility to a scanner and parser classes.

The scanner is responsible for scanning the string and finding the next token. But I decided to make it reactive, so you’ll tell it what you expect, and it will see if the next bit matches. Scanners will typically scan the input and produce a stream of tokens. That works, but it means that you need a lot more state in the scanner, and it can be more complex. I decided to simply if as much as I possible could.

Here is how we find identifiers in the scanner:

You might notice that the code is pretty simply, it runs over the input string (_q) and check if the next token is an identifier based on our rules. That make the parser code easier to handle. Let us consider how we can parse a  FROM clause. The formal definition is:

<From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>

<From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>

<Simple Field> ::= Identifier | StringLiteral

If you are familiar with BNF, this is quite readable. Now, how do we parse this? Here is the actual parser code:

As you can see, we start by asking for a FROM token (the scanner is case insensitive, so we don’t need to worry about casing) then check if this is followed by INDEX term then we get actual identifier or string literal to use, and possible alias.

This code can parse the following:

  • from Users
  • from Users as user
  • from index “Users/ByActiveMarker” AS u

You can note that I’m taking full advantage of the possibly of asking the input several questions, because it make my code simpler overall. I’m also not doing any substring operations. Instead, I’m passing indexes into the overall query string that allow me to get the information without paying the price for allocating all those strings.

In this manner, we also get something that is pretty easy to work with, and we can compare it to the formal definition to guide us in the parsing. At the same time, we get code that is readable and has quite good performance.

time to read 4 min | 746 words

imageSome tasks are fun, they are self contained, easy to conceptualize (though not always easy to build) and challenging.  A few weeks ago I spent my weekend writing a parser, and it was a lot of fun.

I’ve been writing parsers for a very long time, the book came out in 2010 and I was playing with Boo since 2005. The ANTLR book was an interesting read and  it thought me a lot about how to approach text parsing.

However, you might have noticed that I shifted my thinking about a lot of design problems. In particular, performance, number of allocations, exceptions thrown during parsing, the readability of errors when getting invalid input, etc.

In particular, the Lucene query parser is a really good example of a really crappy one. It fail on pretty much all points above. I worked with a bunch of parser generators, and I never found one whose output was something that I could really like. They typically generate unreadable code and customization of their behavior are non obvious, to say the least.

Martin Fowler (only slightly out of context):

…it's hard to write a parser.

My most recent parser is the RavenDB JSON parser, you can see the progression of the ideas around that in this series of posts. That isn’t something that you’ll really want to read without a cup of coffee and some writing instruments to write notes.

Most non trivial parsers are composed of at least two pieces, a tokenizer and an builder. The tokenizer goes over the input, break it into tokens that the builder use to build the final format. The JSON scanner in RavenDB is called UnmanagedJsonParser and the builder is  BlittableJsonDocumentBuilder. Traditionally they would be called scanner and parser, for the roles they play, but we’ll leave the names as is because it doesn’t really matter. This code is not fun to go through, it has been through multiple performance reviews, each time making it uglier then before, but much faster.

JSON is also one of the simplest possible textual formats. The formal definition of JSON fits a post-it note. The JSON scanner I have for RavenDB is close to 900 lines of code and is only part of the parsing process.

But the parser I built over the weekend wasn’t for JSON. Instead, I wanted to play with a query language, so I naturally wanted something SQL like. And that is anything but trivial to do.

The first thing I needed was to actually sit down and figure out what the language is going to look like. In order to do that, you almost always want to use a BNF notation of some kind. This allow you to specify what your language should look like, not just as a few snippets in a notepad window but in a more structure manner.

More to the point, there are a lot of tools out there to use. I decided to use GOLD Parser, it was last updated in in 2012 and it shows, but it had the lowest friction of all the parser IDEs that I tried and it has great support for debugging and working with grammars. Why not use ANTLR, which is pretty much the default choice? Put simply, it is usually too much of a hassle to setup ANTLR properly and I didn’t want to get bogged down with the actual details of generating the parsers, I wanted to focus on the grammar.

I actually don’t know how to parse text using the GOLD Parser. It looks like it generate a binary file that you feed to some library that would do it for you, but I’m not sure and it doesn’t matter. What I care about is that I can develop a formal definition and debug it easily. I’m not actually going to use it to generate the parser.

Huh?! Why do all this work for no reason?

A formal definition of the language is incredibly helpful when you consider a new syntax, because you can verify that you aren’t creating holes and ambiguities in your language. It also give you a pretty clear guideline on how to implement the language.

I’m going to go into more details about the language itself and building a parser for it in my next post.

time to read 1 min | 129 words

During reviewing a PR I run into what seemed like a strange thing. Take a look at this change:

image

This came with its own exception class, and left me pretty confused. Why would I want to have something like that?

Here we have some error handling code that doesn’t seem to add any additional value. Everything in the error here can be deducted from the details of the exception that will be thrown if we did nothing.

The fact we throw a specialized exception might be meaningful, but looking at the code, this isn’t actually used for anything.

Like all code, error handling needs to justify itself, and this one doesn’t pass the bar.

time to read 4 min | 752 words

imageLooking back at this series, I have the strong feeling that I’m being unfair to Resin, I’m judging it using the same criteria I would use to judge our own production, highly optimized code. The projects have very different goals, maturity and environments. That said, I think that a lot of the comments I have to the project are at the implementation level. That is, they can be fixed (except maybe the analyzer / tokenizer pipeline) by simply optimizing one method at a a time. Even the architectural change with analyzing the text isn’t very big. What is important is that the code is quite clear, easy to follow and have a well defined structure. That means that it is actually possible to make this changes as the project mature.

And now that this is out of the way, let me cover some of the things that I would have done differently in the codebase. A lot of them are around I/O related. In particular, the usage of all those different files and the way this is done is decidedly non optimal. In particular, opening and closing of the files constantly, reading and seeking all over the place, etc. The actual design seems to be based around LSM, even if this isn’t state explicitly. And that have pretty good semantics already for writes, but reads currently are probably leaning very heavily on the file system cache, but that won’t work as the data grows beyond a certain scope.

When running on a Unix system, you also need to consider the fact that there is a limit to the number of open files you have, so smaller number of files are generally preferred. I would go with merging all those files into a single large one, similar to the compound format that Lucene uses.

Once that is done, I would also memory map the entire file to memory and use directly memory accesses to handle all I/O. This has several very important advantages. First, I’m being a lot more explicit about using the file system cache, and that would allow us to avoid a lot of system calls. Second, the data is already mostly structured as arrays, so it would be very natural to do so. This also avoid the need to manually buffer things in our own memory, which is always nice.

Next, there is the need to consider consistency checks. Resin as it stands now (I’m not sure if this is an explicit design decision) takes the position that it is not its job to ensure file consistency. Lucene make some attempt to ensure consistency, and usually fails at that horribly at the most inconvenient moments. Adding a hash to the file will allow to ensure that the data is okay, but it means  having to read the entire file when you open it, which is probably too expensive.

The other aspect that need attention is the data structure used. In particular LcrsTrie is a good idea to save space and might work well for in memory usage, but it isn’t a good choice for persistent data structures. B+Tree or SST are the common choices, and need to be evaluated for the job.

As part of this, and quite important, I would recommend getting a look at the full I/O status. That means:

  • How you write to disk?
  • How you update data?
  • Do you have write amplification (merges)?
  • Are you trying for consistency / ACID?
  • Can you explain how your data is persisted, using algorithms / approaches that are well known and trusted?

The latter is good both for external users (who can then reason about your system) and for yourself. If you are using LSM, you know that you have a set of problems (compactions, write amplifications) and solutions (auto optimize over time, etc) that is well known and you can make use of that. If you are using B+Trees, then the problems and solutions space are different, but there is even more information about them.

If you are using consistency, are you using WAL or append only? What are your consistency guarantees, etc?

Those are all questions that needs answers, and they have an impact on the design of the project as a whole.

And with this, this series is over. I have to say that I didn’t think that I would have so much to write about. It is a very interesting project.

time to read 4 min | 735 words

In the previous part, I looked at how indexing and queries are handled in Resin. This post is mostly about the pieces I haven’t talked about so far. We’ll start with the query parser and move to the trie.

Queries in Resin looks like this:

image

This is sort of looking like the Lucene syntax, but it looks like it keeps the same context until a new field comes along.

Range queries looks better, sort of:

image

I had a hard time figuring this one out, until I realized that this is not an XML tag in the middle.

The problem is that the Lucene query syntax kinda sucks. Actually, it sucks a lot. It is complex and ambiguous to parse and it is full of all those little things like the ~ over there that is not very obvious but is very important to the query. I would actually suggest something more like SQL. Sure, that wouldn’t be what you’ll put in the search box, but programmers will appreciate you for that.

Looking at the parser code, there aren’t any surprises there. It is using a hard rolled system using regex and split, which can be vastly improved. One thing to note is that because of the simplicity of the parser, it isn’t really able to process things like a search for a token with a colon in it, so it can’t process this query: 

url:http://ayende.com

Anyway, the query parser isn’t really the most important thing here. The core of Resin, and what I haven’t looked at so far at all is the trie…

LcrsTrie stands for Left child Right sibling, there is a good discussion on the reasons why you’ll want to use this here. At this point, I’m not really sure why the choice of Lcrs was used. In general, they are used to reduce space and simplify the representation, but I don’t think that this is a good idea for a persistent structure. I’ll look at that later. Right now I’m reading the code, and it is mostly pretty obvious code. But then you get to this:

image

This pattern of using IEnumerable to return a single value is something that I’ve seen in other places in the codebase, and I don’t really get it.

I like the use of the Levenshtein distance in fuzzy search, mostly because we don’t need to store a lot of data to get fuzzy search working. In particular, it looks like suggestion style queries are pretty easy, and would be much cheaper then it would be in Lucene.

Probably the core operation you always perform on a trie is the search, and the core of that in this case is the TryFindPath method:

There is nothing surprising in this code, but it is a pure in memory implementation, which is a very different environment then a persistent data structure.

The persistent data structure is actually the MappedTrieReader, so let us examine it. Looking at it, there is some reference to the notions of segments within the file, but I’m not seeing where it is used. This is what the “*.six” file is used for, it seems. I think that this might be related to merging, but I don’t really know.

And here is the reason for the IsWord design:

image

When using a single LcrsTrie, it isn’t needed. But when using a possibly segmented reader, we might have multiple results for the same word.

What is worrying here is that the same access pattern for the trie that is used in memory is also using in the persistent mode. That means that each time we need to traverse the trie, we’ll need to seek. Actually, it looks like that might only be needed when we aren’t on the right path, but that is actually pretty common, so there are going to be a lot of seeks.

That is enough for now, my next post will be more detailed analysis of the Resin I/O structure and what I would probably do instead.

time to read 5 min | 964 words

In the previous part, I looked at UpsertTransaction in Resin and speculated about how the queries work. In this one, I’m going to try to figure out how queries work. Our starting point is this:

image

We start from the index header, and we’ll traverse down from there. One one the first things that happen in the Collector is the creation of the DocHashReader, whose sole purpose is to… read a document hash. It is doing it like this:

The problem is that there is really no need to write all this code. It would be simple to use:

image

It does the exact same thing, but with a lot less work all around.

The core of the Collect method is:

image

For our purposes, we are running just a single query, so no need to worry about sub queries at this time. Looking at the Scan method, the first thing it does is to open the tri file. It looks like I missed a bunch of stuff there.

The field name hash is the one used in the key, not the name itself. That means that you aren’t limited to just stuff that is safe to use on the file system. There is also a “.six” file that I’ve not seen before, it is related to tries, and I’m skipping it for now because I want to have a separate post about them.

It is used like this:

image

The problem I have is that this means that the GetTreeReader will open a bunch of files, then immediately close them. That is going to be a lot of system calls that are being generated, which can probably be saved with some effort.

The really interesting bit is here:

This is where the magic happens. This is the core for actually searching over the tries and figuring out what values we actually have there.

The result of this is a List<(string Field, Word Word)>. And Word contains:

image

Reminder, the Postings is actually the list of all the documents that contain this value in this field, and the number of times that this value appear in the document.

The next method is GetPostings, which starts by reading them:

image

The problem I have here is that this method looks like it has been refactored half way. It can only return a single list, and again, there is the over use of Linq operations and their allocations.

As an aside on code formatting, in many places in the code so far, I have chosen to minify the code without changing its meaning, because there is such a high overhead to the differences. I’m doing this fairly automatically, because it help me read and understand. Here is a before and after example, which was drastic enough to make me realize I’m doing this.

Before

After

image image

Functionally, those two are doing the same, and I fine the after option much more readable.

The Sum method here is pretty horrible, in the sense that it has high complexity, luckily, it is never called with more then one list, so that cost is hidden.

A fun exercise would be to compute the actual complexity with real inputs. I just looked at it and went “this gotta be expensive” then figured out that the code only ever call it with a single list, so I skipped it.

After getting the posting, we need to score the query. This is where we see the usage of the document hash. They are used to go from the document id to check if the document has been deleted. The actual scoring is Tf-Idf, so pretty standard and not interesting here.

It does bugs me to see things like this:

image

Sorting can be very expensive, and I’m pretty sure that it is not actually needed here, and it would improve performance quite impressively to remove it.

Okay, now we are almost done with the query, all that remains is investigating this line:

image

The unbounded result set is annoying, but I gave up that fight, I’m afraid. Let us see what Reduce does. In complex queries, I expect that it would merge all the subqueries and do filtering / intersection/etc.

image

And it does just that, which is great. I do wonder if scoring the results could be pushed after the query reducing, because that would reduce the amount of work that needs to be done, but that is a small optimizations, probably.

Okay, that is enough for this post. I now have a pretty good understanding on how queries actually work. Next, I’m going to look at some other pieces of the code that I haven’t looked at, then focus of the trie.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}