Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 3 min | 544 words

In the previous post I talked about the issues we had with wanting to run untrusted code and wanting to use Jurassic to do so. The problem is that when the IL code is generated, it is then JITed and run on the machine directly, we have no control over what it is doing. And that is a pretty scary thing to have. A simple mistake causing an infinite loop is going to take down a thread, which requires us to Abort it, which means that we are now in a funny state. I don’t like funny states in production systems.

So we were stuck, and then I realized that Jurrasic is actually generating IL for the scripts. If it generates the IL, maybe we can put something in the IL? Indeed we can, here is the relevant PR for Jurrasic code. Here is the most important piece of code here:

The key here is that we can provide Jurrasic with a delegate that will be called at specific locations in the code. Basically, we do that whenever you evaluate a loop condition or just before you jump using goto. That means that our code can be called, which means that we can then check whatever limits such as time / number of iterations / memory used have been exceeded and abort.

Now, the code is written in a pretty funny way. Instead of invoking the delegate we gave to the engine, we are actually embedding the invocation directly in each loop prolog. Why do we do that? Calling a delegate is something that has a fair amount of cost when you are doing that a lot. There is a bit of indirection and pointer chasing that you must do.

The code above, however, is actually statically embedding the call to our methods (providing the instance pointer if needed). If we are careful, we can take advantage of that. For example, if we mark our method with aggressive inlining, that will be taken into account. That means that the JavaScript code will turn into IL and then into machine code and our watchdog routine will be there as well. Here is an example of such a method:

We register the watchdog method with the engine. You’ll notice that the code smells funny. Why do we have a watchdog and then a stage 2 watchdog? The reason for that is that we want to inline the watchdog method, so we want it to be small. In the case above, here are the assembly instructions that will be added to a loop prolog.

These are 6 machine instructions that are going to have great branch prediction and in general be extremely lightweight. Once in a while we’ll do a more expensive check. For example, we can check the time or use GC.GetAllocatedBytesForCurrentThread() to see that we didn’t allocate too much. The stage 2 check is still expected to be fast, but by making it only rarely, it means that we can actually afford to do this in each and every place and we don’t have to worry about the performance of that.

This means that our road is opened to now move to Jurassic fully, since the major limiting factor, trusting the scripts, is now handled.

time to read 3 min | 537 words

RavenDB is a JSON document database, and the natural way to process such documents is with JavaScript. Indeed, there is quite a lot of usage of JS scripts inside RavenDB. They are used for ETL, Subscription filtering, patching, resolving conflicts, managing administration tasks and probably other stuff that I’m forgetting.

The key problem that we have here is that some of the places where we need to call out to JavaScript are called a lot. A good example would be a request to patch request on a query. The result can be a single document modified of 50 millions. If this is the later case, given our current performance profile, it turns out that the cost of evaluating JavaScript is killing our performance.

We are using Jint as the environment that runs our scripts. It works, it is easy to understand and debug and it is an interpreter. That means that it is more focused on correctness then performance. Over the years, we were able to extend it a bit to do all sort of evil things to our scripts ,but the most important thing is that it isn’t actually executing machine code directly, it is always Jint code running and handling everything.

Why is that important? Well, these scripts that we are talking about can be quite evil. I have seen anything from 300 KB of script (yes, that is 300 KB to modify a document that was considerably smaller) to just plain O(N^6) algorithms (document has 6 collections iterate on each of them while iterating on each of the others). These are just the complex ones. The evil ones do things like this:

We have extended Jint to count the number of operations are abort after a certain limit has passed as well as prevent stack overflow attacks. This means that it is much easier to deal with such things. They just gonna run for a while and then abort.

Of course , there is the perf issue of running an interpreter. We turned out eyes to Jurrasic, which took a very different approach, it generate IL on the fly and then execute it. That means that as far as we are concerned, most operations are going to end up running as fast as the rest of our code. Indeed, benchmarks show that Jurrasic is significantly faster (as in, order of magnitude or more). In our own scenario, we saw anything from simply doubling the speed to order of magnitude performance improvement.

However, that doesn’t help very much. The way Jurrasic works, code like the one above is going to hang or die. In fact, Jurassic documentation calls this out explicitly as an issue and recommend dedicating a thread or a thread pool for this and calling Thread.Abort if need. That is not acceptable for us. Unfortunately, trying to fix this in a smart way take us to the halting problem, and I think we already solved that mess.

This limiting issue was the reason why we kept using Jint for a long time. Today we finally had a breakthrough an were able to fix this issue. Here is the PR, but it tells only part of the story, the rest I’ll leave for tomorrow’s post.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 4 days from now
  3. What happens when a sparse file allocation fails? - 6 days from now
  4. NTFS has an emergency stash of disk space - 8 days from now
  5. Challenge: Giving file system developer ulcer - 11 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}