Digging into the CoreCLR: Some bashing on the cost of hashing
Note: This post was written by Federico.
Recently at CoreFX there has been a proposal to deal with the typical case of everyone writing their own hash combining logic. I really like that the framework pushes that kind of functionality, because hashing is one of those things that can backfire pretty badly unless you really know what you are doing (and even if you know, you can still step over the cliff). But this post is not about the issue itself; it showcases a few little details that happen in JIT land (and that we usually abuse, of course :)). This is the issue: https://github.com/dotnet/corefx/issues/8034#issuecomment-260733285
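To make the problem concrete, this is the kind of hand-rolled combining logic that shows up everywhere (a sketch; the Person type and the usual 17/31 constants are illustrative, not code from the issue):

```csharp
public class Person
{
    public string Name;
    public int Age;

    // The typical hand-rolled combiner: easy to write, easy to get
    // subtly wrong (poor mixing, forgotten null checks, overflow).
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (Name?.GetHashCode() ?? 0);
            hash = hash * 31 + Age.GetHashCode();
            return hash;
        }
    }
}
```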
Let me illustrate with some examples.
ValueTuple is a new type being introduced for some of the new tuples functionality in C# 7. Because hashing is important, of course they implemented the ability to combine hashes. Now, let's suppose that we take the actual hashing code that is used for ValueTuple and call it with constants.
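A minimal sketch of that setup (the constants and the method name are arbitrary):

```csharp
static int HashOfConstants()
{
    // Everything here is constant, so in principle an optimizing
    // compiler could fold the whole call into a precomputed value.
    return ValueTuple.Create(10, 20).GetHashCode();
}
```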
Under an optimizing compiler, chances are there shouldn't be any difference; but in reality, there is.
This is the actual machine code for ValueTuple:
So what can be seen here? First we are creating a struct on the stack, then we are calling the actual hash method.
Now compare it with the use of HashHelper.Combine, which for all intents and purposes could be the actual implementation of Hash.Combine:
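For reference, a plausible sketch of such a helper, along the lines of the rotate-and-xor combiner corefx used at the time (the HashHelper name and the exact mixing are assumptions here):

```csharp
public static class HashHelper
{
    public static int Combine(int h1, int h2)
    {
        unchecked
        {
            // Rotate h1 left by 5 bits, add it back, then mix in h2.
            // RyuJIT recognizes the shift pair and emits a single ROL.
            uint rol5 = ((uint)h1 << 5) | ((uint)h1 >> 27);
            return ((int)rol5 + h1) ^ h2;
        }
    }
}

// With constants, the JIT can fold this down to just loading the result.
static int HashOfConstantsDirect() => HashHelper.Combine(10, 20);
```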
I know!!! How cool is that???
But let’s not stop there... let’s use actual parameters:
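A sketch of that, reusing the illustrative helper from above:

```csharp
static int HashOfParameters()
{
    var rnd = new Random();
    int a = rnd.Next(), b = rnd.Next();

    // a and b are unknown at JIT time, so the combining code itself
    // must be emitted; but Combine is small enough to inline.
    return HashHelper.Combine(a, b);
}
```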
We use random values here to force the JIT to treat them as parameters, so it cannot eliminate the code and fold it yet again into a constant.
The good thing is that this is extremely stable. But let's compare it with the alternative:
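That is, in this sketch, the ValueTuple version of the same thing:

```csharp
static int HashOfParametersTuple()
{
    var rnd = new Random();
    int a = rnd.Next(), b = rnd.Next();

    // The tuple struct has to be constructed on the stack before
    // GetHashCode can be called on it.
    return ValueTuple.Create(a, b).GetHashCode();
}
```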
The question that you may be asking yourselves is: does that scale? So, let's go overboard...
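Something along these lines, nesting the combiner over several values (again, a sketch):

```csharp
static int HashOfMany()
{
    var rnd = new Random();
    int a = rnd.Next(), b = rnd.Next(), c = rnd.Next(),
        d = rnd.Next(), e = rnd.Next();

    // Five values combined pairwise; each Combine call is small
    // enough that the JIT can inline the whole chain.
    return HashHelper.Combine(
               HashHelper.Combine(
                   HashHelper.Combine(
                       HashHelper.Combine(a, b), c), d), e);
}
```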
And the result is pretty illustrative:
The takeaway of the analysis is simple: even if the holding type is a struct, it doesn't mean it is free :)
More posts in "Digging into the CoreCLR" series:
- (25 Nov 2016) Some bashing on the cost of hashing
- (12 Aug 2016) Exceptional costs, Part II
- (11 Aug 2016) Exceptional costs, Part I
- (10 Aug 2016) JIT Introduction
Comments
I've read this post twice and TBH I do not understand the point that the author is trying to make. Is the point that ValueTuple should not be used just for the purpose of generating hashes, because it is slower than a custom coded method? OK, I was not even expecting it to be faster than or the same as a custom coded hashing method.
I have also read this post twice, but I cannot understand it because I don't speak x86 assembly. But, you know what? I don't want to speak it either. I speak T-SQL, C#, C, C++, CIL, TypeScript, Pascal and many others... but I don't want to speak asm.
Dalibor, This relates to a common optimization pattern of using structs to allow the JIT to inline the code. The problem is that it isn't always able to do so, and in some cases there is a cost just in creating the struct and then making a method call.
The reason that the last image is awesome is that the JIT was able to optimize away multiple calls to this method so the whole thing is nice and tight without any method calls.
This is much faster than having to jump around, push things onto the stack, spill registers, etc.
Jesús, I agree that you don't want to speak it, but if you care about performance, you need to understand what is actually being executed, and what the impact of that is. In this case, the key takeaway from this post is to understand that while structs are cheap, they aren't free.
@ayende Oh! Is there anything free in computing? What do structs cost vs. classes? I would like to see some numbers instead of asm code.
Jesús, It is impossible to answer this.
For example, classes are allocated on the heap, structs on the stack. So passing structs around is expensive, but creating them is relatively cheap. Passing classes around is cheap, but they require garbage collection.
Generics with an interface constraint, when instantiated with structs, can have their methods inlined, while classes cannot, etc.
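A minimal sketch of that pattern (the IOperation and AddOne names are made up for illustration):

```csharp
public interface IOperation
{
    int Apply(int x);
}

public struct AddOne : IOperation
{
    public int Apply(int x) => x + 1;
}

public static class Runner
{
    // For a struct type argument the JIT compiles a specialized body,
    // so op.Apply can be devirtualized and inlined; with a class type
    // argument, the shared code keeps a virtual interface call.
    public static int Run<TOp>(TOp op, int value) where TOp : IOperation
    {
        return op.Apply(value);
    }
}
```

Calling Runner.Run(new AddOne(), 41) gives the JIT a specialized instantiation in which the interface call can disappear entirely.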
You learn something new every day. Can you provide a link or something regarding this pattern? This is the first time I've heard about it.
IMHO this is something that should be stated in the post. Just placing a bunch of MSIL and stating 'it is awesome' does not tell me much. I just don't know what is supposed to be awesome... the compiler inlining single-line methods... some other optimization which I miss because I'm not good at reading MSIL code?
Dalibor, The JIT inlining magic is complex and not really well documented, as far as I have seen. A lot of what I know is from observing its behavior, not formal docs. This is hard to do, because it is based on a lot of small optimizations, and they keep making it better.
In particular, see: https://blogs.msdn.microsoft.com/davidnotario/2004/11/01/jit-optimizations-inlining-ii/#comment-243 https://blogs.msdn.microsoft.com/vancem/2008/08/19/to-inline-or-not-to-inline-that-is-the-question/
And note that this isn't MSIL, but x86 ASM
I've looked at the links that you've provided but I do not see any reference to 'common optimization pattern of using structs to allow the JIT to inline the code'. Perhaps I'm missing something there.
Dalibor, It is about inlining, and it isn't something that you really need to know unless you are dealing with micro optimizations. When you do need to know that, you start figuring out what will trigger inlining so you can reap the benefits.
For more reference: https://en.wikipedia.org/wiki/Inline_function http://www.greenend.org.uk/rjk/tech/inline.html
I am aware of what inlining is. As far as I can tell, you are claiming that there is something special about using structure/value types which makes their methods somehow 'special' when it comes to inlining. From your post I assume that when instantiating a structure type, calling its method, and not using the instance again, the instance should be optimized away and just the inlined 'body' of the method should remain?
If this is so, then I'm simply asking what you are basing this assumption on. If this is something that you've personally observed, then I would encourage you to maybe make another post regarding this optimization technique so that people can get better acquainted with it. As it is, I've found no mention of this particular optimization after taking some time to google it.
@Dalibor The underlying problem is that the JIT is ever changing. Not long ago we had a piece of code that made a few copies of its parameters, worked with them, and then ended. See: https://github.com/dotnet/coreclr/issues/6014 . We could have written about it, but it was solved a few days later (it took a couple of months for the fix to be available, but that's fine). The more general point is that the opportunities for the JIT to optimize code exist, but you have to be aware of certain restrictions. The best way to know the current restrictions is to look over the area-CodeGen label at CoreCLR https://github.com/dotnet/coreclr/issues?utf8=%E2%9C%93&q=is%3Aissue%20label%3Aarea-CodeGen%20
@Jesus C# tuples are going to be a great addition, and they will be able to automatically optimize more than a few common patterns, but with great power comes great responsibility. Tuples are going to start stressing your stack allocations, and passing tuples around is going to be a major hog on your system. The point is that nothing is free; everything has a cost, and therefore you must know where those costs are. For those not wanting to go straight to the assembler, the root of those costs stays hidden. There are many sources of cost, which is probably why it is not a simple topic to put numbers on: you can easily build an example where structs are fast, then add a few methods and some logic, and your classes are way faster. Grow the size of your struct, and now those extra initialization instructions will hurt you in tight code. Those and many other behaviors make the problem of comparing them hard.
There are also other optimization opportunities that won't be accessible if you don't want to go deep down into the assembler; for example, in a commit yesterday we achieved a 40% improvement in LZ4 compression just by controlling the assembler we wrote. https://github.com/ravendb/ravendb/commit/3ef1d60a9f43b5daa38f7071d7819ce249a78f70 . Granted, this is not required for 99.9% of the projects out there (and that is just fine), but when you are doing high performance programming, you need to know how to deal with it. There are a lot of people who do want to speak C# fluently at this level, because they may end up needing to write this kind of code eventually.
Oren,
Thank you and very impressive.