Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive
time to read 4 min | 606 words

I mentioned in a pervious post that we have a code quality gateway to make sure that all logging statements are wrapped in an if statement, to reduce the number of allocations when the user has logging turned off. This is done because logging can be expensive, and is often turned off, so there is no point in paying a penalty for stuff that isn’t working.

There seems to be some confusion about why this is done. Let us assume that we have the following logging code:

void Debug(string msg)
{
	if(IsDebugEnabled)
		Console.WriteLine(msg);
}

void Debug(string format, params object[] args)
{
	if(IsDebugEnabled)
		Console.WriteLine(format, args);
}

void Debug(Func<string> generateMsg)
{
	if(IsDebugEnabled)
		Console.WriteLine(generateMsg());
}

Now, the obvious bad example would be to use:

Debug("Hello "+ user.Name);

Since that is going to allocate a new string, and this will happen regardless of whatever logging is enabled or not. On high frequency call sites, this can end up allocating a lot of useless stuff.

So we will move to this mode:

Debug("Hello {0}", user.Name);

And we saved the allocation, right? Except that this actually generate this code, do you see the allocation now?

Debug("Hello {0}", new[] { user.Name });

So let us introduce a better option, shall we? We’ll add a few common overloads without the use of params.

void Debug(string format, object p1);
void Debug(string format, object p1, object p2);
void Debug(string format, object p1, object p2, object p3);

And now we saved the allocation. Unless…

int requestNumber = ...;
Debug("Request # {0}", requestNumber);

Do you see the allocation now? We pass an int to a object, which require us to do boxing, which is an allocation Smile.

So let us try using the lambda method, this way nothing is executed!

Debug(() => return "Hello " + user.Name);

Except… this is actually translated to:

var loggerMsg = new LoggerMessage(user);
Func<string> func = new Func<string>(loggerMsg.Write);
Debug(func);

Here are all the allocations.

There is another issue with logging via lambdas, consider the following code:

void Index(JsonDocument[] docs)
{
	var batch = new IndexBatchStats();
	database.Stats.Add(batch);// long lived
	batch.Completed += () => database.Stats.IncrmentCompletedBatches(batch);
	Log(() => "Indexing " + docs.Length + " documents");
}

You might notice that we have two lambdas here. C# is optimizing the number of types generated, and will generally output a single type for all the lifted member in all the lambdas in the method. This means that we have:

void Index(JsonDocument[] docs)
{
	var batch = new IndexBatchStats();
	database.Stats.Add(batch);// long lived

	var args = new { database, batch, docs }; // all lifted members


	batch.Completed += (args) => args.database.Stats.IncrmentCompletedBatches(args.batch);
	Log((args) => "Indexing " + args.docs.Length + " documents");
}

As you can see, we have a long lived lambda, which we think is using only other long lived objects (database & batch), but is actually holding a reference to the docs, which are VERY large.

Except for the last issue, which require moving the logging to a separate method to avoid this optimization, all of the issues outlined above can be handled by explicitly calling IsDebugEnabled in the calling code.

And this is why we require it.

time to read 2 min | 319 words

I just merged a Pull Request from one of our guys. This is a pretty important piece of code, so it went through two rounds of code reviews before it actually hit the PR stage.

That was the point where the tests run (our test suite takes over an hour to run, so we run a limited test frequently, and leave the rest for later), and we got three failing tests. The funny thing was, it wasn’t a functional test that failed, it was the code quality gateways  tests.

The RavenDB team has grown quite a lot, and we are hiring again, and it is easy to lose knowledge of the “unimportant” things. Stuff that doesn’t bite you in the ass right now, for example. But will most assuredly will bite you in the ass (hard) at a later point in time.

In this case, an exception class was introduced, but it didn’t have the proper serialization. Now, we don’t actually use AppDomains all that often (at all, really), but our users sometimes do, and getting a “your exception could not be serialized” makes for an annoying debug session. Another couple of tests related to missing asserts on new configuration values. We want to make sure that we have a new configuration value is actually connected, parsed and working. And we actually have a test that would fail if you aren’t checking all the configuration properties.

There are quite a few such tests, from making sure that we don’t have void async methods to ensuring that the tests don’t leak connections or resources (which could harm other tests and complicate root cause analysis).

We also started to make use of code analyzers, for things like keeping all complex log statements in conditionals (to save allocations) to validating all async calls are using ConfigureAwait(false).

Those aren’t big things, but getting them right, all the time, give us a really cool product, and a maintainable codebase.

We are hiring

time to read 1 min | 84 words

It is that time again, and Hibernating Rhinos is hiring developers to work on our flagship product, RavenDB. We need passionate developers, and we don’t care if you are just starting out, or if you have a decade of experience.

If you want to join our team, drop us a line at jobs@ravendb.net, with a link to github profile page (or similar) with projects that you have worked on, as well as your CV.

This position is only available in Israel.

time to read 1 min | 120 words

So I was telling my wife about this issue, and she had two interesting comments.

When I told her that the customer was complaining about performance, she stopped me to figure out what I meant by that. When I told her that they had to wait for 20 minutes, her response was:

Oh, I thought you were talking about 1/16 of a millisecond here again. Well, if it is 20 minutes, why make a fuss about it? They should go have a cup of coffee or something.

The second observation was when I told her that we narrowed things down to suggestions taking up a lot of a CPU:

Oh, so you told them to go and buy some more RPM, right?

time to read 4 min | 661 words

A customer called with a major issue, on a customer machine, a particular operation took too long. In fact, it takes quite a bit more than too long. Instead of the few milliseconds or (at worst, seconds), the customer saw a value in the many minutes.

At first, we narrowed it down to an extreme load on the indexing engine. The customer had a small set of documents that were referenced using LoadDocument by large number of other documents. That meant that whenever those documents were updated, we would need to reindex all the referencing documents.

In their case, that was in the tens to hundreds of thousands of referencing documents in some cases. So an update to a single document could force re-indexing of quarter million documents. Except… that this wasn’t actually the case. What drove everyone crazy was that here was a reasonable, truthful and correct answer. And on one machine the exact same thing took 20 – 30 seconds, and on the customer machine the process took 20 minutes.

The customer also assured us that those documents that everyone referenced are very rarely, if at all, touched or modified, so that shouldn’t be the issue.

The machine with the problem was significantly more powerful from the one without the problem. This issue also started to occur recently, out of the blue. Tracing the resource utilization in the system showed moderate CPU usage, low I/O and memory consumption and nothing much really going on. We looked at the debug logs, and we couldn’t really figure out what it was doing. There were very large gaps in the log where nothing seems to be happening. % Time in GC was low, so that ruled out a long GC pause that would explain the gap in the logs.

This is in version 2.5, which predates all of our introspection efforts, so figuring out what was going on was pretty hard. I’ll have another post talking about that in this context later.

Eventually we gained access to the machine and was able to reproduce this, and take a few mini dumps along the way. Looking at the stack traces, we found this:

image

And now it all became clear. Suggestions in RavenDB is a really cool feature, which allows you to ask RavenDB to figure out what the user actually meant to ask. It is also extremely CPU intensive during indexing, which is really visible when you try to pump large number of documents through it. And it is a single threaded process.

Except that the customer wasn’t using Suggestions in their application…

So, what happened, in order to hit this issue the following things all needed to happen:

  • Suggestions to be enabled on the relevant index/indexes. Check while the customer wasn’t currently using it, they were using it in the past, and unfortunately that stuck.
  • A very large number of documents need to be indexed. Check – that happened when they updated one of the commonly referenced documents.
  • A commonly referenced document needed to be modified. Check – that happens when they started work for next year, which modified those rarely touched documents.

Now, why didn’t it manifest itself on the other machines? Simple, on those machines, they used the current version of the application, which didn’t use suggestions. On the machines that were problematic, they upgraded to the new version, so even though they weren’t using suggestions, that was still in affect, and still running.

According to a cursory check, those suggestions has been running there for over 6 months, and no one noticed, because you needed the confluence of all three aspects to actually get this issue.

Removing the suggestions once we know they were there was very easy, and the problem was resolved.

time to read 1 min | 77 words

On Nov 25, I’ll be speaking in Kiev Ukraine at the Build Stuff Encore community event about what it means to be a good developer, including what exactly does it take.

This is actually an expansion of a conversation I had recently with a bunch of people, and it is the distillation of what is currently going on with our hiring practices. In a single sentence, it is trying to the oh so elusive quality, passion.

time to read 3 min | 427 words

optimized-7mkgWriting books takes a lot of time, and quite a bit of effort. Which is why I was delighted when Itamar Syn-Hershko decided to write the RavenDB in Action book.

You can now get a 39% discount for the book (both physical and electronic versions) using the coupon code: 39ravendb

Itamar has worked at Hibernating Rhinos for several years, mostly dealing with Lucene integration, but really all over the place. Off the top of my head, Itamar is responsible for Spatial searches in RavenDB, he wrote the first periodic backup implementation as well as the “backup to cloud” functionality, implemented the server side document json (which supports cheap cloning and is key for some very interesting performance optimizations) and in general worked all over the RavenDB codebase.

In other words, this is a person who knows RavenDB quite well, and he has done an excellent in passing on that knowledge with the RavenDB in Action book. This book is likely to be the most up to date resource for getting started with RavenDB.

Itamar has covered working with RavenDB, how indexes work, and how to work with indexes (two distinctly different issues Smile) and most importantly, in my opinion, document based modeling.

Users bringing relational modeling to RavenDB is usually the most common reason they run into trouble. So document modeling is quite important, and Itamar did a good job in covering it. Both as an independent concept and by contrasting that with relational model and discussing the cons and pros as it relates to RavenDB.

And, naturally, the book also cover the fun parts of working with RavenDB.

  • Replication for high availability and load balancing
  • Sharding for scalability and increased throughput
  • Geo spatial queries
  • Full text, more like this and reporting queries
  • The advanced API options and how (and when) to use it.

There is even a full chapter talking about how you can extend RavenDB on both the client side and the server side.

Overall, I highly recommend the RavenDB in Action book, if you are using RavenDB, or even if you just want to learn about it, this is a great resource.

And remember that you can get a 39% discount for the book (both physical and electronic versions) using the coupon code: 39ravendb

time to read 3 min | 587 words

Every programmer knows that input validation is important for good application behavior. If you aren’t validating the input, you will get… interesting behavior, to say the least.

The problem is that what developers generally don’t consider is that the system configuration is also users’ input, and it should be treated with suspicion. It can be something as obvious as a connection string that is malformed, or just wrong. Or it can be that the user has put executionFrequency=“Monday 9AM” into the the field. Typically, at least, those are fairly easy to discover and handle. Although most systems have a fairly poor behavior when their configuration doesn’t match their expectations (you typically get some error, usually with very few details, and frequently the system won’t start, so you need to dig into the logs…).

Then there is the perfectly valid and legal configuration items, such as dataDirectory=”H:\Databases\Northwind”, when the H: drive is in accessible for some reason (SAN drive that was snatched away). Or listenPort=”1234” when the port is currently busy and cannot be used. Handling those gracefully is something that devs tend to forget, especially if this is an unattended application (service, for example).

The problem is that we tend to assume that the people modifying the system configuration actually knows what they are doing, because they are the system administrators. That is not always the case, unfortunately, and I’m not talking about just issues of wrong path or busy port.

In a recent case, we had a sys admin that called us with high memory usage when creating an index. As it turned out, they setup the minimum number of items to load from disk to 10,000. And they have large documents (in the MBs range). The problem was that this configuration meant that before we could index, we had to load 10,000 documents to memory (each of them about 1 MB in average, and only then could we start indexing. That means 10GB of documents to load, and then start indexing (which has its own memory costs). That resulted in pushing other stuff from memory, and in general slowed things down considerably, because each indexing batch had to be at least 10GB in size.

We also couldn’t reduce the memory usage by reducing the batch size (as would normally would be the case under memory pressure), because the minimum amount was set so high.

In another case, a customer was experiencing a high I/O write rate, when we investigated, it looked like this was because of a very high fan out rate in the indexes. There is a limit on the number of fan out entries per document, and it is there for a reason, but it is something that we allow the system admin to configure. They have disabled this limit and went on with very high fan out rates, with the predictable result of issues as a result of it.

So now the problem is what to do?

On the one hand, accepting administrator input about how to tune the system is pretty much required. On the other hand, to quote a recent ops guy I spoke to “How this works? I have no idea, I’m in operations Smile, I just need it to work“, referring to a service that his company wrote.

So the question is, how do you handle such scenario? And no, writing warnings to the log won’t do, no one reads that.

time to read 1 min | 113 words

I’m going to be in Oredev again this year. In fact, several members of the RavenDB core team are going to be in a booth there, and we are going to be talking about what it does, what we are doing with it and where it is going.

We area also going to be giving away a ton of stuff (I know, I have to lug it Smile).

I’ll also be speaking about system architecture with a non relational database, which present different trade offs and require a change in thinking to get the best out of your system.

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}