NuGet Perf, Part VIII: Correcting a mistake and doing aggregations
I hope this is the last one, because I can never recall what is the next Latin number.
At any rate, it has been pointed out to me that I made an error in importing the data. I assumed that the DownloadCount field that I got from the NuGet API is the download count for the specific package version, but it appears that this is the total download count, across all versions of this package. The actual download number for a specific version is: VersionDownloadCount.
That changes things a bit, because the way NuGet sorts things is based on the total download count, not the download count for a specific version. The reason this complicates things is that we aren’t going to store the total download count in all the version documents. First, let us see the sort of query we need to write. In SQL, it would look like this:
select top 30 skip 30
    Id, PackageId, Created,
    (select sum(VersionDownloadCount) from Packages all where all.PackageId = p.PackageId) as TotalDownloadsCount
from Packages p
where IsPrerelease = 0
order by TotalDownloadsCount desc, Created
This is a much simplified version of the real query, and most probably something that you can’t actually write this simply in SQL. But it gets the point across.
Note that in order to process this query, the RDBMS would have to first aggregate all of the data (for each row, mind), then do the paging, then give you the results. Sure, you can keep a counter for all the downloads for a package, but considering the fact that downloads are highly parallel and happen all the time, you would end up waiting for writers to finish doing their update.
Instead, with RavenDB, we are going to use a map/reduce index and query on that.
This should be fairly simple to follow. In the map we go over all the packages, and output their package id, whether they are a pre-release, the download count for that specific version and the date it was created.
In the reduce, we group by the package id and whether it was pre-released or not (I am assuming that we usually don’t want to show the pre-release stuff there).
Finally, we sum up all of the individual package downloads and we output the oldest created date. Using all of that, we can now move to the next step, and actually query that:
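To make the steps above concrete, here is a minimal sketch of the same map, reduce and query in plain Python. This is not the RavenDB index definition and it does not use the RavenDB API. The PackageVersion class, the function names and the sample packages are all invented for illustration; only the field meanings (package id, pre-release flag, per-version download count, created date) come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PackageVersion:
    """Hypothetical stand-in for one package version document."""
    package_id: str
    version: str
    is_prerelease: bool
    version_download_count: int
    created: date


def map_package(doc):
    # Map: one entry per package version document.
    return {
        "PackageId": doc.package_id,
        "IsPrerelease": doc.is_prerelease,
        "DownloadCount": doc.version_download_count,
        "Created": doc.created,
    }


def reduce_entries(entries):
    # Reduce: group by (package id, pre-release flag), sum the downloads
    # and keep the oldest created date. The output has the same shape as
    # the input, so the result can be reduced again.
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry["PackageId"], entry["IsPrerelease"])].append(entry)
    return [
        {
            "PackageId": package_id,
            "IsPrerelease": is_prerelease,
            "DownloadCount": sum(e["DownloadCount"] for e in group),
            "Created": min(e["Created"] for e in group),
        }
        for (package_id, is_prerelease), group in groups.items()
    ]


def query(reduced, is_prerelease=False, skip=0, take=30):
    # Query: filter, order by total downloads (descending) then by
    # created date, and page over the already aggregated results.
    matches = [r for r in reduced if r["IsPrerelease"] == is_prerelease]
    matches.sort(key=lambda r: (-r["DownloadCount"], r["Created"]))
    return matches[skip:skip + take]


# Invented sample data, not real NuGet numbers.
docs = [
    PackageVersion("Alpha", "1.0", False, 100, date(2012, 1, 1)),
    PackageVersion("Alpha", "1.1", False, 50, date(2012, 3, 1)),
    PackageVersion("Alpha", "2.0-beta", True, 5, date(2012, 6, 1)),
    PackageVersion("Beta", "1.0", False, 400, date(2012, 2, 1)),
]

reduced = reduce_entries(map_package(d) for d in docs)
top = query(reduced)
for row in top:
    print(row["PackageId"], row["DownloadCount"], row["Created"])
```

The point of the sketch is the ordering of the work: the aggregation happens once, ahead of time, and the query only filters, sorts and pages over the aggregated entries.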
There is a small bug here, since I don’t see RavenDB in the results, but I guess I’ll have to wait until I get the updated data from NuGet.
Actually, that is not quite true. For pre-release software, we are pretty high up:
That explains much, RavenDB 1.2 is pretty awesome.
Comments
IX
The real question is when will RavenDB 1.2 become 'stable'? Or is it the Duke Nukem Forever version of RavenDB? :)
Hi Ayende,
a lot of people will for sure agree that they'll happily help you count in roman numbers if you continue this interesting series of posts we can indeed learn a lot from.
So as grega_g already posted:
IX X XI XII XIII XIV XV XVI XVII XVIII XIX XX
But you also could look at http://www.novaroma.org/via_romana/numbers.html which explains the numbers and has a handy converter on the right side :-)
Thanks for the entertaining and informative content so far
@Paul
Geez man that is uncalled for :)
When you query the NuGet feed, each result contains the DownloadCount aggregated across all package versions. For example, this query:
http://nuget.org/api/v2/Packages?$filter=Id%20eq%20'nuget.core'
How would you combine the map/reduce query with a search query to accomplish this same goal?
Chris, Wait for it, I have it in a future post.
Paul, We have been actively working on 1.2, you can get it _right now_. It hasn't even been 6 months, I don't think that the comparison is appropriate.
@Ayende, sorry, no offence intended, I know it's available on the pre-release channels. I'm just excited for it to come to the stable channel so I can start using the features.
While it has the 'unstable' or 'pre-release' tags, I'm hesitant to switch to it in case it causes my customer's computers to explode and I get blamed for using something clearly labelled 'unstable' (even though I know it's far more stable than most software out there).
Hey Ayende, any chance you can tag your series of posts with a per-series tag? Like tagging this series as "nuget-perf" or something. Right now most posts just have "raven" as a tag... that is so very useless for filtering. I want to see the list of posts in this series, and I can't really do that without manually scrolling through the recent post list.
I would love to be able to just click "nuget-perf" tag and get all the articles for easy reading.
Alexei, That is a great idea, I'll do so.