Things we learned from production, part III–singleton thinking makes long queues

time to read 3 min | 471 words

One of the more interesting things that we had to learn in production was that we aren’t an only child. It is a bit more complex than that, and I am not explaining well, let me start at the beginning.

Usually, when we work on RavenDB, we work within the scope of a single database, all of our efforts are usually scoped to that. That means that when we worked on the multi database feature for RavenDB, we actually focused on the process of loading a single database up in the air. We considered how multiple databases will interact, and we made sure that they are isolated from one another, but that was about it.

In particular, as mentioned in the previous post, starting up and shutting down were done sequentially, on a per database basis. In order to prevent issues, we had a lock on the initialize database part of the process, so two requests to the same database will not result in the same database being loaded twice.

I mentioned that we were thinking on a single database mindset, right?

Can you guess what happened?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
    • Request for DB #1 – waiting for lock to release
  • DB initialized, lock released
  • All requests are now freed and can be processed.

What happen when we have multiple databases, however?

  • Request for DB #1 – lock acquired, starting up
    • Request for DB #1 – waiting for lock to release
    • Request for DB #2 – waiting for lock to release
    • Request for DB #3 – waiting for lock to release
  • DB initialized, lock released
  • Request for DB #2 – lock released, lock acquired, starting up
    • Request for DB #3 – waiting for lock to release

You guessed it, we actually had a global lock for starting (or disposing, for that matter) databases. That meant that a single db that took time to start would impact other databases.

More importantly, it would means that other requests, which were waiting for that database to load and then had to load their own database, had far less time to actually do the processing they needed. Which meant that they were far more likely to run into the request time limit and be aborted by IIS. Which left them in an inconsistent state. Which was a nightmare to figure out.

We resolved this issue by making sure that the lock is now handled only on the same database, and that we won’t lock forever, if after a while we still don’t have the db, we will error early and give you a 503 Service Unavailable error until the db is ready to rock.