Awesome indexing with RavenDB
I am currently teaching a course in RavenDB, and as usual during a course, we keep doing a lot of work that pushes what we do with RavenDB. Usually because we try to come up with new scenarios on the fly and adapting to the questions from the students.
In this case, we were going over the map/reduce stack and we kept coming more and more complex example and how to handle them, and then we got to this scenario.
Given the following class structure:
1: public class Animal2: {
3: public string Name { get; set; }4: public string Species { get; set; }5: public string Breed { get; set; }6: }
Give me the count of all the species and all the breeds. That is pretty easy to do, right? In SQL, you would write it like this:
1: SELECT Species, Breed, Count(*) FROM Animals2: GROUP BY Species, Breed
And that is nice, but it still means that you have to do some work on the client side to merge things up to get the final result, since we want something like this:
- Dogs: 6
- German Shepherd: 3
- Labrador: 1
- Mixed: 2
- Cats: 3
- Street: 2
- Long Haired: 1
In RavenDB, we can express the whole thing in a simple succinct index:
1: public class Animals_Stats : AbstractIndexCreationTask<Animal, Animals_Stats.ReduceResult>2: {
3: public class ReduceResult4: {
5: public string Species { get; set; }6: public int Count { get; set; }7: public BreedStats[] Breeds { get; set; }8:
9: public class BreedStats10: {
11: public string Breed { get; set; }12: public int Count { get; set; }13: }
14: }
15:
16: public Animals_Stats()17: {
18: Map = animals =>19: from animal in animals20: select new21: {
22: animal.Species,
23: Count = 1,24: Breeds = new [] {new {animal.Breed, Count = 1}}25: };
26: Reduce = animals =>
27: from r in animals28: group r by r.Species29: into g30: select new31: {
32: Species = g.Key,33: Count = g.Sum(x => x.Count),34: Breeds = from breed in g.SelectMany(x => x.Breeds)35: group breed by breed.Breed36: into gb37: select new {Breed = gb.Key, Count = gb.Sum(x => x.Count)}38: };
39:
40: }
41: }
And the result of this beauty?
And that is quite pretty, even if I say so myself.
Comments
In SQL you could do similar things. This example works on Oracle 11:
CREATE TABLE Animal ( Name VARCHAR2(30), Species VARCHAR2(30), Breed VARCHAR2(30));
INSERT INTO Animal (Name, Species, Breed) VALUES ('Albert','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Berta','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Carla','Dog','German Shepherd'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Doris','Dog','Labrador'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Emil','Dog','Mixed'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Fritz','Dog','Mixed'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Gerd','Cat','Street'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Hans','Cat','Street'); INSERT INTO Animal (Name, Species, Breed) VALUES ('Ida','Cat','Long Haired');
SELECT Species, Breed, Count FROM (SELECT Species, 'Sum' AS Breed, 0 AS Detaillevel, COUNT() AS Count FROM Animal GROUP BY Species UNION SELECT Species, Breed, 1 AS Detaillevel, COUNT() AS Count FROM Animal GROUP BY Species, Breed ORDER BY 1, 3, 2);
Result: SPECIES BREED COUNT Cat Sum 3 Cat Long Haired 1 Cat Street 2 Dog Sum 6 Dog German Shepherd 3 Dog Labrador 1 Dog Mixed 2
Hope this format will look nicer:
CREATE TABLE Animal ( Name VARCHAR2(30),
INSERT INTO Animal (Name, Species, Breed) VALUES ('Albert','Dog','German Shepherd');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Berta','Dog','German Shepherd');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Carla','Dog','German Shepherd');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Doris','Dog','Labrador');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Emil','Dog','Mixed');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Fritz','Dog','Mixed');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Gerd','Cat','Street');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Hans','Cat','Street');
INSERT INTO Animal (Name, Species, Breed) VALUES ('Ida','Cat','Long Haired');
SELECT Species, Breed, Count FROM
(SELECT Species, 'Sum' AS Breed, 0 AS Detaillevel, COUNT(*) AS Count FROM Animal GROUP BY Species
UNION
SELECT Species, Breed, 1 AS Detaillevel, COUNT(*) AS Count FROM Animal GROUP BY Species, Breed
ORDER BY 1, 3, 2);
SPECIES BREED COUNT
Cat Sum 3
Cat Long Haired 1
Cat Street 2
Dog Sum 6
Dog German Shepherd 3
Dog Labrador 1
Dog Mixed 2
SELECT Species, Breed, Count(*) FROM Animals
GROUP BY Species, Breed
WITH ROLLUP
What about ROLLUP which was designed for exactly this purpose and is almost every SQL implementation?
SQL server has had it since at SQL 2000:
http://msdn.microsoft.com/en-us/library/ms189305(v=SQL.90).aspx
Doing this with rollup in SQL yields the same result - but there's an important difference: the result of the projection is materialized, so it's extremely fast to query. Of course, if your SQL engine supports materialized views, again, you can accomplish something similar - but I'll be you anything, RavenDB indexes will still beat the performance of materialized views in any SQL engine ;-)
select species, breed, count(1) from animals group by rollup(species, breed)
And that is pretty if I say so myself :)
Ajai
If only there were an in memory version of sql server to allow tests on top of this to execute quickly.
"but I'll be you anything, RavenDB indexes will still beat the performance of materialized views in any SQL engine ;-)"
I'll bet you a database engine that been around many more years than Raven has a good chance of winning that bet. Dumping row from a materialized view will be one of the fastest thing a SQL server can do.
"If only there were an in memory version of sql server to allow tests on top of this to execute quickly."
SQL server will host the pages of a materialized view in RAM for reads if it has enough. If your worried about write performance and not durability then put the SQL server db files on a RAMDISK:
http://www.sqltreeo.com/wp/sql-server-on-steroids-with-ram-disk/
"http://www.sqltreeo.com/wp/sql-server-on-steroids-with-ram-disk/"
That sounds suspiciously like effort. I'd rather have something xcopy deployable.
I'm all enthusiastic about NoSQL but we still don't have a true data processing language. The C# + map-reduce construct is not very elegant compared to SQL select with rollup.
Justin, There is a very important difference between the way RavenDB and SQL Server. RavenDB is going to compute the result once and then use that as the source for the query. SQL Server is going to have to go through the entire data set, on each query. Materialized views (indexed views in SQL Server) and quite limited - for example, they can't do Rollups.
"RavenDB is going to compute the result once and then use that as the source for the query."
Probably why everyone is talking about materialized views no?
In the original post you said: "And that is nice, but it still means that you have to do some work on the client side to merge things up to get the final result"
This has nothing to do with how often Raven or SQL computes the result, you said SQL simply could not compute the result on it own, which means one of two things, either you did not know ROLLUP existed or you are purposely are saying SQL the language can do less than it actually does.
Also the limitation of ROLLUP not being in am materialized view is a MS SQL Server limitation not a SQL/Relational limitation. Oracle and DB2 have no problem with ROLLUP in materialized view. Again this just shows that you simply don't know what is possible in the relational world or you are purposely ignoring the facts. For MS SQL ever heard of triggers or the job scheduler?
Oracle can even defer the refresh of of the materialized view for later if you want Raven DB like staleness.
Justin, Are you aware of my history with regard to RDBMS?
Yes, so which is it, did you not know about ROLLUP or purposely say SQL can't do it?
Justin, The only place where I talked about SQL was when I wanted to show the simple aggregation. I had nothing to say about the feasibility of this in SQL.
Still not answering the question.
You most definitely talked about the feasibility in SQL by saying work must be done on the client side while this was not necessary in Raven, simply untrue.
The whole article's premise was a contrast to SQL yet didn't mention the ROLLUP operator at all even though it is the textbook example of ROLLUP. You might as well have left off GROUP BY and COUNT and done ALL the work client side and declared SQL can't do aggregation but Raven can.
Adding WITH ROLLUP to the SQL statement is even simpler and more succinct than Raven if I do say so myself. 3 lines of SQL vs 40+ lines of C# map/reduce for the same result.
3 lines of SQL has my vote too on this one
+1
I like the idea about simple things expressed in 40 lines of code instead of 2-3lines :)
Yes, forget test-ability! This won't run on SQLITE so there goes quick test execution. Then again when was the last time you saw a DBA write a test for anything?
An application I work on, implements this really easily as part of an in-built query language - here's the one-liner in that case:
GROUP BY Species { GROUP BY Breed { COUNT } }
http://dev.enterprisetester.com/help/#Search_TQL_Aggregation
http://blog.bittercoder.com/2012/08/20/odata-tql-and-filtering/
We don't build an index, but the result sets are built on the fly against Lucene.Net indexes, so it's pretty fast (few milliseconds) to execute.
I think the ROLLUP/CUBE can work well for these scenarios too when using Sql Server (unfortunately I don't have the luxury of supporting just a single Database) - but I think as things get more complex, like faceted results, or including a list of entities matching a facet (top 10 / bottom 10 etc.) then the Sql Server solution becomes a lot more complex...
Comment preview