How to lead a convoy to safety
I recently ran into a convoy situation in NH Prof. Under sustained heavy load (not a realistic scenario for NH Prof), something very annoying would happen.
Messages would stream in from the profiled application faster than NH Prof could process them.
The term that I use for this is Convoy. It is generally bad news. With NH Prof specifically, it meant that it would consume larger and larger amounts of memory, as messages waiting to be processed queued up faster than NH Prof could handle them.
NH Prof uses the following abstraction to handle queuing:
public interface IQueue<T>
{
    void Enqueue(T o);
    T Dequeue();
    bool IsEmpty { get; }
}
Now, there are a few things that we can do to avoid having a convoy. The simplest solution is to put a threshold on the queue and just start dropping messages once we reach it. NH Prof is actually designed to handle such things as an interrupted message stream, but I don’t think that this would be a nice thing to do.
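To make the threshold idea concrete, here is a minimal sketch of a dropping queue built on the IQueue<T> abstraction. The DroppingQueue<T> class, its threshold parameter and the Dropped counter are all hypothetical names made up for illustration; this is not NH Prof's actual code.

```csharp
using System.Collections.Generic;

public interface IQueue<T>
{
    void Enqueue(T o);
    T Dequeue();
    bool IsEmpty { get; }
}

// Hypothetical implementation for illustration only.
public class DroppingQueue<T> : IQueue<T>
{
    private readonly Queue<T> inner = new Queue<T>();
    private readonly object sync = new object();
    private readonly int threshold;

    public DroppingQueue(int threshold)
    {
        this.threshold = threshold;
    }

    // Number of messages discarded because the queue was full.
    public long Dropped { get; private set; }

    public void Enqueue(T o)
    {
        lock (sync)
        {
            if (inner.Count >= threshold)
            {
                Dropped++; // at the threshold: discard the incoming message
                return;
            }
            inner.Enqueue(o);
        }
    }

    public T Dequeue()
    {
        lock (sync)
        {
            return inner.Dequeue(); // throws InvalidOperationException when empty
        }
    }

    public bool IsEmpty
    {
        get { lock (sync) { return inner.Count == 0; } }
    }
}
```

Memory use is now capped at roughly threshold messages, at the price of silently losing whatever arrives while the consumer is behind. Exposing a counter such as Dropped at least lets the consumer know that the stream was interrupted.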
Another alternative would be to write everything to disk, so we don’t have memory pressure and can handle much larger queue sizes. The problem is, of course, that this requires something very subtle: T must now be serializable, and not just T, but everything that T references.
Oh, Joy!
This is one of the cases where just providing the abstraction is not going to be enough; providing an alternative implementation means having to touch a lot of other code as well.
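A rough sketch of what a disk-backed implementation could look like shows where the serialization requirement comes from. The DiskQueue<T> class and its one-file-per-message layout are assumptions made for brevity, and it uses System.Text.Json from modern .NET rather than whatever NH Prof would actually use.

```csharp
using System;
using System.IO;
using System.Text.Json;

public interface IQueue<T>
{
    void Enqueue(T o);
    T Dequeue();
    bool IsEmpty { get; }
}

// Hypothetical implementation: one JSON file per queued message.
public class DiskQueue<T> : IQueue<T>
{
    private readonly string directory;
    private readonly object sync = new object();
    private long head; // sequence number of the next message to read
    private long tail; // sequence number of the next message to write

    public DiskQueue(string directory)
    {
        this.directory = directory;
        Directory.CreateDirectory(directory);
    }

    private string PathFor(long sequence)
    {
        return Path.Combine(directory, sequence + ".json");
    }

    public void Enqueue(T o)
    {
        lock (sync)
        {
            // This is the subtle part: the serializer has to walk o
            // and everything that o references.
            File.WriteAllText(PathFor(tail), JsonSerializer.Serialize(o));
            tail++;
        }
    }

    public T Dequeue()
    {
        lock (sync)
        {
            if (head == tail)
                throw new InvalidOperationException("Queue is empty.");

            string path = PathFor(head);
            T item = JsonSerializer.Deserialize<T>(File.ReadAllText(path));
            File.Delete(path);
            head++;
            return item;
        }
    }

    public bool IsEmpty
    {
        get { lock (sync) { return head == tail; } }
    }
}
```

The queue itself is short. The cost is hidden in the Serialize call: any message type with private state, cyclic references or non-serializable members will not round-trip with the default serializer settings, so every type reachable from T has to be reviewed.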
Comments
Use an object database :)
"not a realistic scenario for NH Prof" <-- I think you overestimate your customers.
I can think of at least half a dozen pages in one web application I work on that take anything from 70-700 SQL/cache requests per hit (30-40 mapped classes, 500 tables, 30GB database). During this time NH Prof frequently becomes unresponsive, and often remains busy for a few secs after the session ended.
We know our code is not the best -- using domain models for building a report, automapper resolvers getting more details per item, recursive trees, leaning far too much on the cache etc. Even after lots of fetching/joins/caching tuning there is still lots of SELECT N+1.
So unfortunately overloading NH Prof is a very realistic scenario for us.
Maybe you should add an option of offline profiling - some small component would write all the trace information to a log and NH Prof would then be used to analyze that log? Live profiling is a problem in a production environment - if you have memory/performance problems and want to analyze them with a profiler, the profiler will add more load to the system and seriously worsen the situation.
My question would be...what questions regarding NH usage can NH Prof answer in a heavy load scenario that couldn't be answered when running the app under less heavy load?
In such a case it might be OK to have NH Prof "degrade" to processing only messages of severe importance until it catches up again...
Of course this falls down again if the application is so bitchy that all messages are severe...
Richard,
I am sorry, but we have different definitions for what sustained heavy load _means_. When I am talking about this I am talking about doing this for 30 minutes or so of non stop activity. That is rarely the case.
Anyway, I already have a branch where I am taking care of this, and I'll publish it sometime this week.
Rafal,
InitializeOfflineProfiling() - it is there. :-)
Frank,
The problem isn't with showing the information; the problem is in processing it fast enough.
I didn't think UI was the problem...so I gather that the queuing of messages is absolutely "dumb" in that all possible messages are gathered, while I thought that there might be some form of "pre-processing". I suppose that isn't really possible, though, since defining whether a message is "severe" or not probably involves quite a bit of knowledge (= processor time).
Otoh, how expensive is RAM these days? If you're profiling an app with such throughput I'd hope that people could spare a few dollars on a couple of GBs.
Frank,
It is possible that this would lead to an Out Of Memory Exception.
And in general it is better not to try walking that line.
Hm, I wonder if you could do a meta-analysis over a given number of messages knowing that some messages have been dropped. For example, if your profiler could run, say, 10 times on the same system with approximately the same load, you could average together the results, in a sense, to guarantee a stable conclusion. This would probably require some kind of ability to drop pseudo-random messages though, as you wouldn't be able to rely on just dropping when it starts to get overloaded - if you tried that, then you could very well be missing the exact thing which is causing the overload.
Differently, you could define certain messages (and that which they are dependent on) to be knowingly serializable, then only serialize those with a marker saying where they show up in the queue. This would probably end up creating a scheduling problem over the queue, though, so it's most likely not worth it.
Instead of dropping messages when you reach a threshold... why not simply block the host application, so it has to wait until it can write the next message to the queue?
Granted, this would reduce the performance of the host application, but if I want to debug/trace my application I would usually want to get all messages, even if it means that my application may run a bit slower while being traced...
Can you gain efficiency through batching? For instance, are you updating the screen on every update? With a slow resource such as a UI, file, or socket, batching can give you better throughput by merging updates and limiting the number of slow calls required.
For Example:
public void OnBatch(List<Update> updates)
{
    // Apply the merged updates with a single slow call (Update is a placeholder type).
}
This way updates are efficiently throttled and the Queue doesn't fall far behind.
Of course, this is something that Retlang does for you.
http://code.google.com/p/retlang/
Mike
Thomas,
One of the design goals is to have as little impact as possible on the profiled application.
Stopping the profiled application is not an option.
Mike,
You seem to be missing the point. It isn't the time to update the screen that is meaningful. It is the time to process the messages.
I'll have a separate post about it, but let us just say that the same problem exists with no UI as well