The cost of counting: The silliest optimization…
For a lecture that one of the guys is giving, we started talking about how to explain a certain feature (distributed counters).
I suggested that we measure how fast a single thread can count, which resulted in the following code:
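Something along these lines, checking the Stopwatch on every iteration (a sketch; the exact code may have differed slightly):

```
using System;
using System.Diagnostics;

// Count as fast as possible for one second, consulting the
// Stopwatch on every single iteration.
var sw = Stopwatch.StartNew();
var count = 0;
while (sw.ElapsedMilliseconds < 1000)
{
    count++;
}
Console.WriteLine($"{count:N0} in {sw.Elapsed}");
```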
On my machine, this results in 25,943,321 in 00:00:01.0000060. That works, but the number seems very low. So we set out to see how fast we can count things, which is about the silliest thing ever. However, it is fun.
It is obvious that the cost here is the Stopwatch calls, so let's try reducing them. I changed the loop so it only checks the time every so often.
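A sketch of that change; the interval is an assumption, though the total reported below (205,840,000) is an exact multiple of 10,000, which fits a check every 10,000 iterations:

```
using System;
using System.Diagnostics;

// Same loop, but only consult the Stopwatch once every 10,000 iterations.
var sw = Stopwatch.StartNew();
var count = 0;
while (true)
{
    count++;
    if (count % 10_000 == 0 && sw.ElapsedMilliseconds >= 1000)
        break;
}
Console.WriteLine($"{count:N0} in {sw.Elapsed}");
```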
This one gives us 205,840,000 in 00:00:01.0010240, so that is much nicer. Can we do better? What if we changed how we check?
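The comments below refer to this third variant as thread-based, so presumably it looked something like this: a background task sleeps for a second and flips a flag, and the hot loop only does a volatile read instead of touching the Stopwatch (a sketch under that assumption):

```
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var done = false;
var sw = Stopwatch.StartNew();
// A background task flips the flag after one second, so the
// counting loop never has to read the clock at all.
Task.Factory.StartNew(() =>
{
    Thread.Sleep(1000);
    Volatile.Write(ref done, true);
});
var count = 0;
while (Volatile.Read(ref done) == false)
{
    count++;
}
Console.WriteLine($"{count:N0} in {sw.Elapsed}");
```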
This gives me 644,081,297 in 00:00:01.0005354, and I don't think that we can do better than this very easily.
Comments
You can do better, e.g., by removing the division from the 2nd variant and replacing it with bitwise operations:
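Presumably something like the following; the mask size is an inference (the totals reported below are exact multiples of 8,192), and dmitry_vk's actual code may have differed:

```
using System;
using System.Diagnostics;

// Check the clock only when the low 13 bits of the counter are zero,
// i.e. once every 8,192 iterations. A bitwise AND against a
// power-of-two mask avoids the division/modulo of the 2nd variant.
var sw = Stopwatch.StartNew();
var count = 0;
while (true)
{
    count++;
    if ((count & 8191) == 0 && sw.ElapsedMilliseconds >= 1000)
        break;
}
Console.WriteLine($"{count:N0} in {sw.Elapsed}");
```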
On my machine this is faster than the threaded solution by a couple of percent and significantly faster than the 2nd variant (with division).
Oops, that was on a debug build. In release mode, my variant is 3x faster than the threaded variant:
562282964 in 00:00:01.0012022 for 3rd (thread-based) variant
1610678272 in 00:00:01.0010230 for my variant
Even faster on my computer is the combination of dmitry_vk's version and the threaded version (I also changed done to a byte, which seems to give better precision, but I'm not 100% sure):

dmitry_vk: 1526071296 in 00:00:01.0010249
combined: 1675231232 in 00:00:01.0011874

So I guess unrolling the loop, to leverage the fact that CPUs can do more than one operation per clock and are bad at branching, would be cheating?
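A sketch of what that combination might look like; the byte flag and the mask size are assumptions based on the description above:

```
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

byte done = 0;
var sw = Stopwatch.StartNew();
Task.Factory.StartNew(() =>
{
    Thread.Sleep(1000);
    Volatile.Write(ref done, (byte)1); // byte flag instead of bool
});
var count = 0;
while (true)
{
    count++;
    // Only pay for the volatile read once every 8,192 iterations.
    if ((count & 8191) == 0 && Volatile.Read(ref done) == 1)
        break;
}
Console.WriteLine($"{count:N0} in {sw.Elapsed}");
```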
@Philippe, not really... It is a very valid optimization. We do that a lot :) https://github.com/ravendb/ravendb/blob/v4.0/src/Sparrow/Memory.cs#L247
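For what it's worth, unrolling a pure counter mostly collapses into a bigger add per branch, which is arguably why it feels like cheating. A fragment to illustrate (assumes the bool done flag and int count from the thread-based sketch above):

```
// Unrolled by 4: one volatile read and one branch per four counts.
// For a pure counter, the JIT can fold the increments into one add.
while (Volatile.Read(ref done) == false)
{
    count++;
    count++;
    count++;
    count++;
}
```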
This has the most throughput per second on my machine:

```
var max = 1000000000;
var sw = Stopwatch.StartNew();
for (var i = 0; i < max; i++) { } // count to a fixed max (the original snippet was truncated; the loop is assumed)
Console.WriteLine($"{sw.ElapsedMilliseconds} ms for {max} items = {max / sw.Elapsed.TotalSeconds} /second");
```
In VS Code, in debug mode: 2073 ms for 1000000000 items = 482392667.631452 /second
@Harold, never run in debug mode when measuring for optimization (the generated assembly code is highly inefficient). We have processes here (because of highly tuned routines) that can show up to a 4x performance difference between debug and release builds. In fact, non-optimized code usually runs faster in debug mode than highly optimized code does.
For the release build I get these results: 338 ms for 1000000000 items = 2958579881.6568 /second
That's ~6x better. And it's about the same as my clock speed of 3 GHz (not sure if that's a coincidence).
OK, one last comment, then I'll stop spamming. When I change var i = 0; to long i = 0; it runs at about half the speed (~1,800,000,000 /second).
I'm confused. The first version counted 25,943,321 in 00:00:01.0000060. The second version counted 205,840,000 in 00:00:01.0010240. The first version is better than the second version. So how is that "much nicer"?
Diego G. Fritz, higher is better: the second version has 10x the performance on your computer.
@Harold, the point is to stop the loop after ~1 sec.
Sorry, I am a fool. I did not read the statement well.
@Thomas, no one says stopping after one second is a requirement. The original requirement is: "I suggested that we measure how fast a single thread can count."
But maybe I'm being a smartass ;-)
Why oh why do you check the time inside the loop?
Even a volatile read across threads is wasteful and wrong; CPU caches don't like it.
Just run it a fixed few billion times, measure the time at the start and at the end, and then you have your speed.
Effectively, what you're doing is giving Usain Bolt a stopwatch. How the hell is he meant to focus on running while looking at the watch face? Silly! Just let Usain run his 100m at 100% speed, and you do the measuring.
Mihailik, if you read the title of the post, you might understand why they have to check the time. A performance counter is worth little if you cannot read it continuously...
Have you tried with ThreadPool.QueueUserWorkItem instead of Task.Factory.StartNew?
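For reference, that swap would look something like this (a sketch, assuming the same volatile done flag as in the thread-based variant):

```
using System.Threading;

var done = false;
// Queue the one-second timer on the thread pool directly,
// instead of going through the Task machinery.
ThreadPool.QueueUserWorkItem(_ =>
{
    Thread.Sleep(1000);
    Volatile.Write(ref done, true);
});
```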
Another approach, though certainly not recommended for 'real' code; I was curious to see how it fared. (It is slower than the fastest method above by about 20%, but as the seconds increase, the difference should decrease.)