Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 576 words

In the previous post I talked about what I wanted to get, and how I decided to use the GOLD parser to formally define the language grammar. However, I had no intention of actually generating the parser, I wanted to write it myself. This is because I have a lot of poor experience with the results of parser generators. They generate horrible code that manages to be both unreadable and force you to go into it frequently enough to be maddening.

In particular, I haven’t been able to find anything that would be good enough as a hand rolled parser in terms of performance, readability of the code and the quality of errors it will generate. The later might sound strange, but a key part of the usefulness of any language is the kind of errors that you get when you give it invalid output.

An important consideration is that I’m intending to accept input over HTTP, so I can assume that my input is already wrapper in a pretty System.String package, which saves a lot of the complications that you usually have to deal with if your input is a streaming medium. Initially I tried to go with a scanner / parser in the same place, but that ended up being a Bad Idea, it was too complex to handle. Instead, I split the responsibility to a scanner and parser classes.

The scanner is responsible for scanning the string and finding the next token. But I decided to make it reactive, so you’ll tell it what you expect, and it will see if the next bit matches. Scanners will typically scan the input and produce a stream of tokens. That works, but it means that you need a lot more state in the scanner, and it can be more complex. I decided to simply if as much as I possible could.

Here is how we find identifiers in the scanner:

You might notice that the code is pretty simply, it runs over the input string (_q) and check if the next token is an identifier based on our rules. That make the parser code easier to handle. Let us consider how we can parse a  FROM clause. The formal definition is:

<From> ::= 'FROM INDEX' <From Source> | 'FROM' <From Source>

<From Source> ::= <Simple Field> 'AS' <Simple Field> | <Simple Field>

<Simple Field> ::= Identifier | StringLiteral

If you are familiar with BNF, this is quite readable. Now, how do we parse this? Here is the actual parser code:

As you can see, we start by asking for a FROM token (the scanner is case insensitive, so we don’t need to worry about casing) then check if this is followed by INDEX term then we get actual identifier or string literal to use, and possible alias.

This code can parse the following:

  • from Users
  • from Users as user
  • from index “Users/ByActiveMarker” AS u

You can note that I’m taking full advantage of the possibly of asking the input several questions, because it make my code simpler overall. I’m also not doing any substring operations. Instead, I’m passing indexes into the overall query string that allow me to get the information without paying the price for allocating all those strings.

In this manner, we also get something that is pretty easy to work with, and we can compare it to the formal definition to guide us in the parsing. At the same time, we get code that is readable and has quite good performance.

time to read 4 min | 746 words

imageSome tasks are fun, they are self contained, easy to conceptualize (though not always easy to build) and challenging.  A few weeks ago I spent my weekend writing a parser, and it was a lot of fun.

I’ve been writing parsers for a very long time, the book came out in 2010 and I was playing with Boo since 2005. The ANTLR book was an interesting read and  it thought me a lot about how to approach text parsing.

However, you might have noticed that I shifted my thinking about a lot of design problems. In particular, performance, number of allocations, exceptions thrown during parsing, the readability of errors when getting invalid input, etc.

In particular, the Lucene query parser is a really good example of a really crappy one. It fail on pretty much all points above. I worked with a bunch of parser generators, and I never found one whose output was something that I could really like. They typically generate unreadable code and customization of their behavior are non obvious, to say the least.

Martin Fowler (only slightly out of context):

…it's hard to write a parser.

My most recent parser is the RavenDB JSON parser, you can see the progression of the ideas around that in this series of posts. That isn’t something that you’ll really want to read without a cup of coffee and some writing instruments to write notes.

Most non trivial parsers are composed of at least two pieces, a tokenizer and an builder. The tokenizer goes over the input, break it into tokens that the builder use to build the final format. The JSON scanner in RavenDB is called UnmanagedJsonParser and the builder is  BlittableJsonDocumentBuilder. Traditionally they would be called scanner and parser, for the roles they play, but we’ll leave the names as is because it doesn’t really matter. This code is not fun to go through, it has been through multiple performance reviews, each time making it uglier then before, but much faster.

JSON is also one of the simplest possible textual formats. The formal definition of JSON fits a post-it note. The JSON scanner I have for RavenDB is close to 900 lines of code and is only part of the parsing process.

But the parser I built over the weekend wasn’t for JSON. Instead, I wanted to play with a query language, so I naturally wanted something SQL like. And that is anything but trivial to do.

The first thing I needed was to actually sit down and figure out what the language is going to look like. In order to do that, you almost always want to use a BNF notation of some kind. This allow you to specify what your language should look like, not just as a few snippets in a notepad window but in a more structure manner.

More to the point, there are a lot of tools out there to use. I decided to use GOLD Parser, it was last updated in in 2012 and it shows, but it had the lowest friction of all the parser IDEs that I tried and it has great support for debugging and working with grammars. Why not use ANTLR, which is pretty much the default choice? Put simply, it is usually too much of a hassle to setup ANTLR properly and I didn’t want to get bogged down with the actual details of generating the parsers, I wanted to focus on the grammar.

I actually don’t know how to parse text using the GOLD Parser. It looks like it generate a binary file that you feed to some library that would do it for you, but I’m not sure and it doesn’t matter. What I care about is that I can develop a formal definition and debug it easily. I’m not actually going to use it to generate the parser.

Huh?! Why do all this work for no reason?

A formal definition of the language is incredibly helpful when you consider a new syntax, because you can verify that you aren’t creating holes and ambiguities in your language. It also give you a pretty clear guideline on how to implement the language.

I’m going to go into more details about the language itself and building a parser for it in my next post.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}