Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 366 words

I asked the following question, about code that uses AsyncLocal as well as async calls. Here is the code again:

This code prints False twice, and the question is why. I would expect the AsyncLocal value to remain set after the call to Start(), since that is obviously the point of AsyncLocal. It turns out that this isn’t the case.

AsyncLocal is good if you are trying to pass a value down to child tasks, but it won’t be applicable to other tasks that are called at the same level. In other words, it works for children, not sibling tasks. This is actually even more surprising in the code above, since we don’t do any awaits in the Start() method.
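Here is a minimal sketch of that children-versus-siblings behavior. It is not the listing from the original question; the names `Flag`, `Start`, and `ReadInChild` are made up for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalDemo
{
    // Hypothetical name, for illustration only.
    private static readonly AsyncLocal<bool> Flag = new AsyncLocal<bool>();

#pragma warning disable CS1998 // async method without an await, on purpose
    private static async Task Start()
    {
        Flag.Value = true; // set inside an async method
    }
#pragma warning restore CS1998

    private static async Task<bool> ReadInChild()
    {
        await Task.Yield(); // force a real asynchronous hop
        return Flag.Value;
    }

    // Sets the value in an async callee, then reads it back in the caller.
    public static async Task<bool> SetInCallee()
    {
        await Start();
        return Flag.Value;
    }

    // Sets the value in the caller, then reads it in an async callee.
    public static async Task<bool> SetInCaller()
    {
        Flag.Value = true;
        return await ReadInChild();
    }

    public static async Task Main()
    {
        Console.WriteLine(await SetInCallee()); // False: the change did not flow back up
        Console.WriteLine(await SetInCaller()); // True: the value flowed down to the child
    }
}
```

The value set by the caller is visible in the callee, but a value set by an async callee is gone by the time control returns to the caller.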

The question is why? Looking at the documentation, I couldn’t see any reason for that. Digging deeper into the source code, I figured out what is going on.

We can use SharpLab.io to lower the high level C# code to see what is actually going on here, which gives us the following code for the Start() method:

Note the call to the AsyncTaskMethodBuilder.Start() method, which ends up in AsyncMethodBuilderCore.Start(). There we have a bunch of interesting code. In particular, we remember the current thread’s execution context before we execute user code. After the code is done running, we restore it if needed.

That looks fine, but why would the execution context change here? It turns out that one of the few places that interact with it is AsyncLocal itself, whose value setter ends up in ExecutionContext.SetLocalValue. The way it works, each time you set an async local, it creates a new layer in the async stack. And when you exit an async call, it will reset the async stack to the place it was before the async call started.
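The save-and-restore step can be imitated with the public `ExecutionContext.Run` API, which runs a callback under a captured context and puts the previous context back afterwards. This is a sketch of the idea, not the code inside AsyncMethodBuilderCore, and `Flag` and `SetInsideRun` are made-up names.

```csharp
using System;
using System.Threading;

public static class RestoreDemo
{
    // Hypothetical name, for illustration only.
    private static readonly AsyncLocal<bool> Flag = new AsyncLocal<bool>();

    public static bool SetInsideRun()
    {
        // Capture() returns null only when context flow is suppressed;
        // this sketch assumes that is not the case.
        var captured = ExecutionContext.Capture();
        if (captured is null)
            throw new InvalidOperationException("Execution context flow is suppressed.");

        // Setting the async local replaces the current context with a new one,
        // and Run() swaps the previous context back in once the callback returns.
        ExecutionContext.Run(captured, _ => Flag.Value = true, null);

        return Flag.Value;
    }

    public static void Main()
    {
        Console.WriteLine(SetInsideRun()); // False
    }
}
```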

In other words, the local in the name AsyncLocal isn’t a match to ThreadLocal, but is more similar to a local variable, which goes out of scope on function exit.

This isn’t a new thing, and there are workarounds, but it was interesting enough that I decided to dig deep and understand what is actually going on.
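Two workarounds that are commonly used are sketched below. The post does not say which ones it has in mind, so treat these as illustrative. The first sets the value in a method that is not marked async, so no async method builder sits between the assignment and the caller. The second has the caller place a mutable holder in the AsyncLocal, and callees mutate the holder instead of replacing the value. All names are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncLocalWorkarounds
{
    private sealed class Holder
    {
        public bool Started;
    }

    private static readonly AsyncLocal<bool> Flag = new AsyncLocal<bool>();
    private static readonly AsyncLocal<Holder> Box = new AsyncLocal<Holder>();

    // Workaround 1: not marked async, so the assignment happens
    // directly in the caller's execution context.
    private static Task StartWithoutAsync()
    {
        Flag.Value = true;
        return Task.CompletedTask;
    }

    // Workaround 2: mutate the shared holder instead of assigning a new value.
    private static async Task StartWithHolder()
    {
        await Task.Yield();
        Box.Value.Started = true;
    }

    public static async Task<bool> ViaNonAsyncMethod()
    {
        await StartWithoutAsync();
        return Flag.Value;
    }

    public static async Task<bool> ViaHolder()
    {
        Box.Value = new Holder(); // the caller owns the holder
        await StartWithHolder();
        return Box.Value.Started;
    }

    public static async Task Main()
    {
        Console.WriteLine(await ViaNonAsyncMethod()); // True
        Console.WriteLine(await ViaHolder());         // True
    }
}
```

The holder approach trades isolation for visibility: every callee that inherits the context sees and can change the same object, so it needs the usual care around shared mutable state.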

time to read 2 min | 229 words

A user contacted us to tell us that RavenDB does not work in his environment. As you can imagine, we didn’t really like to hear that, so we looked deeper into the issue. The report included the actual error, which looked something like this:

{
    "Url": "/auth/",
    "Type": "Raven.Client.Exceptions.Routing.RouteNotFoundException",
    "Message": "There is no handler for path: GET /auth/",
    "Error": "Raven.Client.Exceptions.Routing.RouteNotFoundException: There is no handler for path: GET /auth/\n"
}

My reaction to that was… huh?!

That is a really strange error, since RavenDB does not have an “/auth/” endpoint. The problem isn’t with RavenDB, it is with something else.

In this case, the user ran RavenDB on port 8080 (which is the normal thing to do) and then tried to access RavenDB in the browser.

The problem was that they previously ran some other software, and that software had the following interaction:

* Connected to 127.0.0.1 port 8080
> GET / HTTP/1.1
> Host: 127.0.0.1
> User-Agent: curl/7.85.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Location: http://127.0.0.1:8080/auth/
< Content-Type: text/html; charset=UTF-8

In other words, it redirected the browser to the “/auth/” endpoint. It’s critical to understand that a 301 response means Moved Permanently, which means that the browser is allowed to cache it, and browsers do. In this case, the same endpoint was reused for a different software product, and the browser cache meant that we got strange results.
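One way to take the browser cache out of the picture is to request the root URL with a client that does not follow redirects, much like the curl transcript above. Here is a minimal sketch using `HttpClient`, assuming something is listening on 127.0.0.1:8080 (the address comes from the transcript; the class and method names are made up):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RedirectProbe
{
    // Formats the status and the Location header (if any) of a response.
    public static string Describe(HttpResponseMessage response)
    {
        var location = response.Headers.Location?.ToString() ?? "(none)";
        return $"{(int)response.StatusCode} {response.ReasonPhrase}, Location: {location}";
    }

    public static async Task Main()
    {
        // Do not follow redirects, so a 301 shows up as a 301
        // instead of being silently followed.
        using var handler = new HttpClientHandler { AllowAutoRedirect = false };
        using var client = new HttpClient(handler);

        using var response = await client.GetAsync("http://127.0.0.1:8080/");
        Console.WriteLine(Describe(response));
    }
}
```

If the server currently on the port answers without a redirect while the browser still lands on “/auth/”, the redirect is coming from the browser’s cache and not from the server.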

time to read 4 min | 749 words


Preconditions, postconditions, and invariants, oh my!

The old adage about Garbage In, Garbage Out is a really important aspect of our profession. If you try to do things that don’t make sense, the output will be nonsensical.

On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

~Charles Babbage – Inventor of the first computer

As you can see, the issue isn’t a new one. And there are many ways to deal with that. You should check your inputs, assume they are hostile, double check on every layer, etc.

Those are the principles of sound programming design, after all.

This post is about a different topic. When everything is running smoothly, you want to reject invalid operations and dangerous actions. The problem is when everything is hosed.

The concept of emergency operations is something that should be a core part of the design, because emergencies happen, and you don’t want to try to carve new paths in emergencies.

Let’s consider a scenario such as when the root certificate has expired, which means that there is no authentication. You cannot authenticate to the servers, because the auth certificate you use has also expired. You need to have physical access, but the data center won’t let you in, since you cannot authenticate.

Surely that is fiction, right? It happened last year to Facebook (a bad network routing configuration, not certs, but the same behavior).

An important aspect of good design is to consider what you’ll do in the really bad scenarios. How do you recover from such a scenario?

For complex systems, it’s very easy to get to the point where you have cross dependencies. For example, your auth service relies on the database cluster, which uses the auth service for authentication. If both services are down at the same time, you cannot bring them up.

Part of the design of good software is building the emergency paths. When the system breaks, do you have a well-defined operation that you can take to recover?

A great example of that is fire doors in buildings. They are usually alarmed and open to the outside world only, preventing their regular use. But in an emergency, they allow the quick evacuation of a building safely, instead of creating a chokepoint.

We recently got into a discussion internally about a particular feature in RavenDB (modifying the database topology). There are various operations that you shouldn’t be able to make, because they are dangerous. They are also the sort of things that allow you to recover from disaster. We ended up creating two endpoints for this feature. One that included checks and verification. The second one is an admin-only endpoint that is explicitly meant for the “I know what I mean” scenario.
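The shape of that split can be sketched in a few lines. This is not RavenDB’s API: the class, the method names, and the “never remove the last node” rule are all invented stand-ins, chosen only to show a checked path next to an admin-only path that skips the checks.

```csharp
using System;
using System.Collections.Generic;

public sealed class TopologySketch
{
    private readonly List<string> _nodes;

    public TopologySketch(IEnumerable<string> nodes)
    {
        _nodes = new List<string>(nodes);
    }

    public IReadOnlyList<string> Nodes => _nodes;

    // Regular path: validates the request and refuses dangerous changes.
    // The "last node" rule is an invented example of such a check.
    public void RemoveNode(string node)
    {
        if (!_nodes.Contains(node))
            throw new ArgumentException($"Unknown node '{node}'.", nameof(node));
        if (_nodes.Count == 1)
            throw new InvalidOperationException("Refusing to remove the last node.");
        _nodes.Remove(node);
    }

    // Emergency path: admin-only, and it deliberately skips the safety checks.
    public void ForceRemoveNode(string node, bool callerIsAdmin)
    {
        if (!callerIsAdmin)
            throw new UnauthorizedAccessException("Emergency operations are admin-only.");
        _nodes.Remove(node); // no validation: the operator takes responsibility
    }
}
```

Keeping the two paths as separate entry points, and not as a flag on the regular one, makes it harder to trigger the dangerous variant by accident.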

RavenDB actually has quite a few of those scenarios. For example, you can authenticate to RavenDB using a certificate, or if you have root access on the machine, you can use the OS authentication mechanism instead. We had scenarios where users lost their certificates and were able to use the alternative mechanism instead to recover.

Making sure to design those emergency pathways ahead of time means that you get to do that with a calm mind and consider more options. It also means that you get to verify that your emergency mechanism doesn’t hinder normal operations. For example, the alarmed fire door. Or in the case of RavenDB, relying on the operating system permissions as a backup if you are already running as a root user on the machine.

Having those procedures ahead of time, documented and verified, ends up being really important at crisis time. You don’t need to stumble in the dark or come up with new ways to do things on the fly. This is especially important since you cannot assume that the usual invariants are in place.

Note that this is something that is very easy to miss. After all, you spend a lot of time designing and building those features, never to use them (hopefully). The answer to that is that you also install sprinklers and fire alarms with the express hope & intent to never use them in practice.

The amusing part of this is that we call this: Making sure this is up to code.

You need to ensure that your product and code are up to code.
