Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,546
|
Comments: 51,161
Privacy Policy · Terms
filter by tags archive

The role of logs

time to read 2 min | 344 words

You are probably familiar with logging, and log levels. In particular, the following scheme is very popular.

I have been using this scheme for pretty much every software project I had in over a decade. And it works. Which is good.

But I found that there are actually not really useful for our scenarios. In particular, we found that there are many cases where a component that logged something as a warn or error would generate a support ticket by eagle eyed admin, while it was just a valid part of the software behavior. The other side of that is that when we need to use the logs to find a problem, we never use fine grained logs, we just need it all.

This lead me to believe that we only actually need two and a half levels.

  • Informative – we are reporting so one of our support engineers or a team member can look at that and understand what the software is doing.
  • Operational – something deserves attention from the customer’s operations team.

There isn’t really anything in between those. Either this is something that we fully expect an operation team at the customer to look at, or this is something that the development and/or support engineers need to see.

But I mentioned two and a half, what is that half? Well, above operational, there is another half a level, which I like to call Pay Attention Now. This isn’t actually a log level, it is a way to attract someone’s attention to the log, or to some sort of operational alert. This is an SMS sent, or an email, etc.

But beyond the “something need attention now”, there is nothing else that is needed. The production logs are either for routine operations monitoring (in which case they should be relatively succinct and written with an eye to a reader who isn’t familiar with the internal working of RavenDB) or “what is going on here” mode, which is typically done by a team member who needs the most information possible.

time to read 3 min | 565 words

You might have noticed that we are doing a lot of work around performance. Some of this work can be done with just optimizing the way we are doing certain operations, but for some things, we cannot just optimize things, and a behavior change is also required.

In this case, we are talking about the authentication method for RavenDB. The current way it works goes something like this.

  • The user is making a request to a RavenDB server.
  • The server requires security, and ask the user to authenticate. For this case, we’ll sue API Keys.
  • The user & server will have a side channel discussion for authentication via API Keys, which will result in a token.
  • This token is sent as a header in all future requests.
  • The token is cryptographically signed, so the server can validate that it is valid.

So far, so good, but this does pose some issues.

To start with, we took a lot from OAuth, so that means that we assume that there are multiple entities in this. The user, the server and the authenticator, and the use of cryptographic signature is meant to ensure that the server can trust the token generated by the authenticator and served by the user.

However, in pretty much all cases, the server and the authenticator are the same. There are some special cases relating to replication and load balancing, but they aren’t relevant at this point, and we can work around them if need be.

And we’ll want to do that. The reason that this is problematic is very simple, right now, we need to cryptographically validate the token on every single request. And that is expensive. Of course, in some cases, it is actually meant to be expensive, that is why it is secure.

So we need to reduce this cost, and we can reduce that by saying that we can just keep a simple token. Conceptually, this now becomes:

  • The user is making a request to a RavenDB server.
  • The server requires security, and ask the user to authenticate. For this case, we’ll sue API Keys.
  • The user & server will have a side channel discussion for authentication via API Keys, which will result in a token.
  • This token is sent as a header in all future requests.
  • The token is just a unique id (guid), which is stored on the server memory.

Because the token is unique, and per server, we don’t need to do any crypto validation on the value. We can just check if the value is in our memory, and that would be it.

The first reaction we typically get is “but what about security? someone can re-use that guid to authenticate as someone else”. That assumes that you can sniff the conversation between client and server. If you can do that, then you are probably already don’t care about security, since you aren’t using HTTPS. But note that the same behavior occurs using the crypto token as well. If you manage to take that, you can present it to the server as your own token, and the server can’t tell the difference between a valid client and a malicious one.

The good part of this is that now the authentication part of the request is a dictionary lookup, instead of cryptographic signature validation. And the performance of authenticated requests is much higher.

time to read 4 min | 743 words

In a recent blog post, I talked about a problem we found in NLog with our usage scenario. I mentioned that we’ll probably have our own solution to fit our needs. Eric Smith replied with:

So I guess there is no chance of contributing to the NLog project to improve it instead of creating yet another logging framework?

And the short answer for that is that there is very little chance for that. Primarily because what we are doing is not really suitable for consumption elsewhere.

Let us consider something that we found to be very useful. The Metrics.NET is a great way to get various metrics, and we have been using it (and contributed to it) for several years. But when looking at our 4.0 work, we also looked at what it would take to support metrics, and we had to flat out reject Metrics.NET for our use. Why is that?

Primarily because it is doing too much. It is extremely flexible and can do quite a lot, but we don’t need it to do a lot, and we don’t want to pay for the stuff we don’t use. For example, Histograms. The problematic issue for us was here:

public void Update(long value, string userValue = null)
{
    this.last = new UserValueWrapper(value, userValue);
    this.reservoir.Update(value, userValue);
}

Which ends up calling:

public void Update(long value, string userValue = null)
{
    long c = this.count.Increment();
    if (c <= this.values.Length)
    {
        values[(int)c - 1] = new UserValueWrapper(value, userValue);
    }
    else
    {
        long r = NextLong(c);
        if (r < values.Length)
        {
            values[(int)r] = new UserValueWrapper(value, userValue);
        }
    }
}

So that means that we’ll have two allocations for each time we update the histogram, which can happen a lot.  We don’t want to pay that price.

Now, we can certainly modify Metrics.NET to meet our needs, in fact, we have done just that, and ended up about 6 files that we needed, after we cleaned them up according to our current requirements. Contributing them back wouldn’t make much sense, they are full of constraints that are meant to allow us to run a database server for months on end with predictable and reliable performance under very high load. They don’t make sense in many other scenarios, and it can be incredibly limiting.

For a logging framework, we can most certainly add another option to NLog to do what we consider the right thing, but at this point, that wouldn’t really make sense. We don’t use a lot of the features, and reduce the surface area would mean reducing complexities. We can target something very narrow and get very high performance (in both CPU utilization and memory costs) much more easily & cheaply than trying to convert an existing codebase.

This is not a new concept. There is a very real cost to making something reusable. And in our case, RavenDB doesn’t need frameworks internally, we have a very good grasp about what is going on, and we want to have full ownership of every bit that is executed.

time to read 1 min | 187 words

The final piece for this candidate is the code to actually search through the sorted array. As you'll recall, the candidate certainly made our computer pay for that sorting and merging. But we are fine now, we have manage to recover from all of that abuse, and we are ready to rock.

Here is how the search (over sorted array) was implemented.

public List<int> EmailSearcher(string email)
{
    List<int> answer = new List<int>();
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (emailsArray[i].STR.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer = emailsArray[i].ID;
            return answer;
        }
    }

    return answer;
}

And with that, I have nothing left to say.

time to read 3 min | 590 words

The task in question was to read a CSV file (which is large, but can fit in memory) and allow quick searches on the data by email to find the relevant users. Most candidates handle that with a dictionary, but since our candidate decided to flip things around, he also included an implementation of doing this with sorted arrays. That is actually quite nice, in theory. In practice, it ended up something like this:

funny, funny pictures, funny photos, hilarious, wtf, weird, bizarre, funny animals, 17 Bizarre Photoshop Hybrid Animals

In order to truly understand the code, I have to walk you through a  bunch of it.

It starts with the following fields:

List<Structure> emails = new List<Structure>();

Structure[] namesArray = new Structure[0];

Structure, by the way, is a helper class that just has a key and a list of ids.  The duplicate is strange, but whatever, let move on.

The class constructor is reading one line at a time and add an structure instance with the email and the line id to the emails list.

public void ToArray()
{
    emailsArray = this.emails.ToArray();
}

This code just copy emails to the array, we are not sure why yet, but then we have:

public void Sort()
{
    Array.Sort(emailsArray, delegate(Structure x, Structure y) { return x.STR.CompareTo(y.STR); });
}

So far, this is right in line with the "use a sorted array" method that the candidate talked about. There is a small problem here, because emails are allowed to be duplicated, but no fear, our candidate can solve that…

public void UnifyIdLists()
{
    for (int i = 0; i < emailsArray.Length; i++)
    {
        if (i + 1 == emailsArray.Length)
            break;
        if (emailsArray[i].STR.Equals(emailsArray[i + 1].STR))
        {
            emailsArray[i].ID.AddRange(emailsArray[i + 1].ID);
            emailsArray[i + 1] = null;
            List<Structure> temp = emailsArray.ToList<Structure>();
            temp.RemoveAll(item => item == null);
            emailsArray = temp.ToArray();
        }
    }
}

The intent of this code is to merge all identical email values into a single entry in the list.

Now, to put things in perspective, we are talking about a file that is going to be around the 500MB in size, and there are going to be about 3.5 million lines in it.

That means that the emailsArray alone is going to take about 25MB.

Another aspect to consider is that we are using dummy data in our test file. Do you want to guess how many duplicates there are going to be there? Each of which is generating two 25MB allocations and multiple passes over an array of 3.5 million items in size.

Oh, and for the hell of it, the code above doesn't even work. Consider the case when we have three duplicates…

time to read 3 min | 472 words

We typically don't look too deeply into the overall design on a candidate's code. Most coding tasks just aren't big enough to warrant it.

When we do, it is typically because the candidate has done something… peculiar. For example, we have had a candidate that send a submission with no less than 17 classes to manager reading a CSV file and populating a dictionary. At some point we knew that this candidate wasn't going to move forward, but it became a challenge, to try to outline how the data actually traveled through that particular maze. But as much as we love Mr. Goldberg, I'm afraid that dear Rubbie isn't going to be remembered for much longer.

Here is how this candidate code started:

static void Main(string[] args)
{
    //Call for parsing method
    HelperClass cr = new HelperClass();
    cr.CsvParser();
    cr.ToArray();
    cr.Sort();
    cr.UnifyIdLists();
    //ask the user to enter input
    string input = cr.InputReader();

I actually started explaining the issues in this code, but I run out of… pretty much everything.

I have a challenge for you, can you think of a single OOP principle that isn't being broken here?

For… fun, I guess, you can also look at the following code, which come just after the one above:

Start:
    //check input type
    switch (cr.InputTester(input))
    {
        case 0:
            //valid email structure, but the input is not found in the file
            List<int> emails = new List<int>();
            emails = cr.EmailSearcher(input);
            if (emails.Count == 0)
            {
                Console.WriteLine("The input you have entered has not been found in the file. \nPlease try again.");
                input = cr.InputReader();
                goto Start;
            }

I guess someone heard that loops are a performance bottleneck in most applications and decided to do away with them.

time to read 2 min | 362 words

The first question that ask in the coding task goes something like this: "Given a CSV file containing users' data, who is large, but can fully fit into memory, we want to be able to search by email address very quickly and get all the matching user ids. Optimize for fast queries over fast startup".

The intent is basically that you'll read the file and load that into a dictionary, then use that to perform queries.

This candidate has done just that, although things started being strange pretty much from the get go…

Dictionary<int, string> emails = new Dictionary<int, string>();

That seemed like a pretty strange way to set things up, but we have seen crazier. At this point I thought that they are probably storing the hash code of the email in the dictionary, and the string value is the concatenated list of all the matching ids.

The truth was… far more interesting. Here is the code for querying:

public Dictionary<int, int> EmailSearcher(string email)
{
    Dictionary<int, int> answer = new Dictionary<int, int>();
    int count = 0;
    foreach (var entry in emails)
    {
        if (entry.Value.ToString().Equals(email, StringComparison.OrdinalIgnoreCase))
        {
            answer.Add(count, entry.Key);
            count++;
        }

    }
    return answer;
}

This code actually have multiple levels of strangeness. The obvious one is that this is doing a linear search on the dictionary, but look at the return type as well…

The candidate was well aware that this code is not able to handle large amount of information, so the candidate sent us an alternative implementation. But I'll talk about that in my next post.

time to read 4 min | 615 words

You might have noticed that we are looking for more people again. Mostly because I gripe about some of the bad ones here every now and then.

One of my absolute requirements is that I want to read a candidate's code before hiring them. Some of them have made significant contributions to Open Source code, but a large number don't have any significant body of code that they can show. That is why we ask candidates to send us a coding test.

We had people flat out refuse to do the coding test, and some who thought that the questions were incredibly hard and unrealistic. We had people who sent in code that was so bad it caused migraines, we had people who sent in code that wouldn't compile, we  had people do stuff like read a 15TB file (16,492,674,416,640)! times. And just to clarify, that is factorial(16492674416640). I don't know what the actual value of this number is, but is it big.

The nice thing is that you can usually tell right away when the code is bad. We also play a game called "guess the candidate's background". We have about 89% success rate there*.

Spotting good code is much harder, because a really successful submission is going to be boring. It does what needs to be done, and that is it. Our most recent hire's code submission was so boring, we had to analyze it using our standard metrics to find issues (our standard metrics are for production code running in a high performance DB for months on end, not the same metrics we evaluate code submissions).

And then… then there is this candidate, whose code is so unique that I decided to dedicate a full week to explore it. The code looks good, documented, there are explanations that show what is going on and they are going in the right direction, on the surface of it.

And then there is the devil in the details. I have quite a lot to write about this particular submission, but I'll just start with the following:

//Find today's log file in the directory
public string LogFileFinder()
{
    string[] files = Directory.GetFiles(LoggingAggregation.Program.GlobalVar.path, "*.log");
    for (int i = 0; i < files.Length; i++)
    {
        int slash = files[i].LastIndexOf(@"\");
        int dot = files[i].LastIndexOf(".");
        string fileName = files[i].Substring(slash + 1, dot - slash - 1);
        string fileDate = DateTime.Now.ToString("yyyy-MM-dd");
        if (fileName.Equals(fileDate))
            return fileName + ".log";
    }

    return null;
}

Now, another way to write this code would be:

Path.Combine(LoggingAggregation.Program.GlobalVar.path, DateTime.Now.ToString("yyyy-MM-dd") +".log") 

I literally stared at this piece of code for several minutes, trying to figure out what is actually going on there. I wasn't able to.

As an aside, we sent the candidate a rejection notice, along with a few pointers to some of the more egregious things that were wrong in the code.  They know how to code, it is just that it goes sideway in the middle.

* The game call to guess the candidate's default language settings. That is, someone who write C# code, but has been brought up on C, so have a C style to the code.

time to read 4 min | 753 words

One of our clients had opened a support incident. Under certain conditions, RavenDB will issue so much I/O requests that it will effectively lock the I/O system. The condition was cold boot when you have many hundreds of databases on the server, and you force all of them to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and was done.

Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.

Then the customer called back, “ever since we applied the hotfix, everything became slower”. A code review of the changes by two different team members uninvolved in the original hotfix produced similar conclusion, the affected code was only on the db load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally has been unsuccessful.

Working with the customer, we managed to figure out that the problem was related to high CPU usage, which also ruled out the hotfix, since that would have been slowness because of a lock. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4GB / second.

That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.

But what was causing all of this angst?

As it turned out, during the previous support incidents, the log level has been changed to use Debug. And we output a lot to the log. Now, while that is pretty obvious, and easily resolved, we dug down deeper and found out exactly what was causing this.

To understand what is going on, you have to remember the current situation, we are on a busy server, hundreds of databases, and a log level set to debug. The logs were writing to the same disk as the databases, which happened to be on a remote SAN.  Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.

Now, RavenDB is configured to use async logging exactly for this purpose. So we’ll not pay for slow I/O in the logs. And NLog is a mature project, which already considered the case of flooding and potential memory leak as a result of too many log messages.

This is implemented in the following method:

public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    this.logEventInfoQueue.Dequeue();
                    break;
                // other options removed for brevity
            }
        }

        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}

The log is configured to use Discard on overflow, which is why I’m showing only this code.

I wasn’t aware that NLog just took a lock like that, but that isn’t what bothering me. What this code is doing is saying: “Okay, I run to the limit, let us discard the stuff that is in the queue the longest time”.

What this effectively did was to keep references to the log entries in memory, just long enough for them to hit Gen2, and then just discard them. In the meantime, any new entry would go into the queue, where it would mature into Gen2, but likely will be discarded as well.

There is a word for it, and it goes like this:

FUTURE POSTS

  1. Partial writes, IO_Uring and safety - about one day from now
  2. Configuration values & Escape hatches - 5 days from now
  3. What happens when a sparse file allocation fails? - 7 days from now
  4. NTFS has an emergency stash of disk space - 9 days from now
  5. Challenge: Giving file system developer ulcer - 12 days from now

And 4 more posts are pending...

There are posts all the way to Feb 17, 2025

RECENT SERIES

  1. Challenge (77):
    20 Jan 2025 - What does this code do?
  2. Answer (13):
    22 Jan 2025 - What does this code do?
  3. Production post-mortem (2):
    17 Jan 2025 - Inspecting ourselves to death
  4. Performance discovery (2):
    10 Jan 2025 - IOPS vs. IOPS
View all series

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}