Building a recommendation engine in NHibernate
Well, it isn’t really a recommendation engine, it is a sample of one, and I strongly recommend not using it, but I am getting ahead of myself.
In the 6th episode of the TekPub’s NHibernate webcast, me & Rob worked on creating statistical queries with NHibernate. To be totally honest, the reason that we did that is to show off NHibernate’s querying capabilities, not so you would be able to make use of this in your applications. A recommendation engine is not something that you should run out of your OLTP store, so please take that under advisement.
The reason for this post is to explain in details how the final result works. Here is the NHibernate code:
var orderIDsContainingCurrentSku = DetachedCriteria.For<OrderItem>() .Add<OrderItem>(x=>x.Product.SKU==sku) .SetProjection(Projections.Property("Order.id")); var skusOfProductsAppearingInOrdersContainingCurrentSku = DetachedCriteria.For<OrderItem>() .SetProjection(Projections.GroupProperty("Product.id")) .AddOrder(NHibernate.Criterion.Order.Desc(Projections.Count("Order.id"))) .Add<OrderItem>(x=>x.Product.SKU!=sku) .Add(Subqueries.PropertyIn("Order.id", orderIDsContainingCurrentSku)) .SetMaxResults(15); var recommended = _session.CreateCriteria<Product>() .SetFetchMode<Product>(x => x.Descriptors, FetchMode.Join) .Add(Subqueries.PropertyIn("id", skusOfProductsAppearingInOrdersContainingCurrentSku)) .SetResultTransformer(Transformers.DistinctRootEntity) .List<Product>();
And here is the resulting SQL:
SELECT this_.SKU as SKU1_1_, this_.ProductName as ProductN2_1_1_, this_.BasePrice as BasePrice1_1_, this_.WeightInPounds as WeightIn4_1_1_, this_.DateAvailable as DateAvai5_1_1_, this_.EstimatedDelivery as Estimate6_1_1_, this_.AllowBackOrder as AllowBac7_1_1_, this_.IsTaxable as IsTaxable1_1_, this_.DefaultImageFile as DefaultI9_1_1_, this_.AmountOnHand as AmountO10_1_1_, this_.AllowPreOrder as AllowPr11_1_1_, this_.DeliveryMethodID as Deliver12_1_1_, this_.InventoryStatusID as Invento13_1_1_, descriptor2_.SKU as SKU3_, descriptor2_.DescriptorID as Descript1_3_, descriptor2_.DescriptorID as Descript1_4_0_, descriptor2_.Title as Title4_0_, descriptor2_.Body as Body4_0_ FROM Products this_ left outer join ProductDescriptors descriptor2_ on this_.SKU = descriptor2_.SKU WHERE this_.SKU in (SELECT top 15 this_0_.SKU as y0_ FROM OrderItems this_0_ WHERE not (this_0_.SKU = 'Binoculars2' /* @p0 */) and this_0_.OrderID in (SELECT this_0_0_.OrderID as y0_ FROM OrderItems this_0_0_ WHERE this_0_0_.SKU = 'Binoculars2' /* @p1 */) GROUP BY this_0_.SKU ORDER BY count(this_0_.OrderID) desc)
The problem is that both the NHibernate code and the SQL are pretty complicated, and mapping between the two might be pretty hard if you are not familiar with that. So let us take this in stages. First, let us understand the logic in the SQL itself. Most of the complexity happens in the where clause, so let us look at this in depth:
WHERE this_.SKU in (SELECT top 15 this_0_.SKU as y0_ FROM OrderItems this_0_ WHERE not (this_0_.SKU = 'Binoculars2' /* @p0 */) and this_0_.OrderID in (SELECT this_0_0_.OrderID as y0_ FROM OrderItems this_0_0_ WHERE this_0_0_.SKU = 'Binoculars2' /* @p1 */) GROUP BY this_0_.SKU ORDER BY count(this_0_.OrderID) desc)
What exactly is going on in here?
Let us look at the nested most select, this select the OrderId from an order that have a Binoculars item in it. That is then passed to the parent select, matching it to that OrderID and returning the SKU if it is not also a Binoculars.
What we are actually saying here is: Give me all the items from orders that contains Binoculars, except for Binoculars. The logic is simple, you are very likely to buy something that someone else also bought in an order together with stuff that you bought (complimentary products, another book in the same series, etc). Next, we have the order by, we use that to find the stuff that you are most likely to buy. By ordering the items based on the number of orders they appear in, we try to find the most popular items (hence, the stuff that you are likely to buy as well).
I think that this is fairly clear, and now that we have the logic of the statement, let us try to understand how that NHibernate code produced it.
The answer is actually very simple, NHibernate’s Criteria API is all about composability. We simply composed the query from all the tiny pieces. Let us look at each individual piece in detail:
var orderIDsContainingCurrentSku = DetachedCriteria.For<OrderItem>()
.Add<OrderItem>(x=>x.Product.SKU==sku)
.SetProjection(Projections.Property("Order.id"));
This query is using DetachedCriteria, this is a way to generate a query (or a sub query, as we will soon see), without having a direct reference to the session. This is mostly useful in cases like this, where you want to compose several queries into a single one.
In this case, it is pretty obvious what is going on, we ask NHibernate to select from OrderItems (by using the OrderItem entity), where the product sku is equal to the appropriate SKU (Binoculars2, in this case), and we don’t want to get the entity back, instead, we want only a single field (this is what SetProjection is for), the order id. Note that OrderItem mapping is quite interesting:
<class name="OrderItem" table="OrderItems" > <composite-id> <key-many-to-one name="Product" column="SKU"/> <key-many-to-one name="Order" column="OrderID"/> </composite-id> </class>
It is a class with a composite id, where each part of the PK is also a FK to a different table. With NHibernate, we map this using <key-many-to-one/> in a composite-id element.
When we want to query on that, we can either use the usual many-to-one approach, or, if we want to refer to a particular column, we use the “id” (all lower case) keyword. In other words, “Order.id” refers to the OrderItems.OrderID column, while “Product.id” or “Product.SKU” refers to the “OrderItems.SKU” column. I think that you can figure out what is going on now, this query generate the following SQL:
SELECT this_0_0_.OrderID as y0_ FROM OrderItems this_0_0_ WHERE this_0_0_.SKU = 'Binoculars2' /* @p1 */
And I think that can see the direct correlation between the NHibernate query and the generated SQL.
Next in line, and seemingly much more complicated, we have this:
var skusOfProductsAppearingInOrdersContainingCurrentSku = DetachedCriteria.For<OrderItem>() .SetProjection(Projections.GroupProperty("Product.id")) .AddOrder(NHibernate.Criterion.Order.Desc(Projections.Count("Order.id"))) .Add<OrderItem>(x=>x.Product.SKU!=sku) .Add(Subqueries.PropertyIn("Order.id", orderIDsContainingCurrentSku)) .SetMaxResults(15);
But it isn’t really complidated, let us look at this in details. The first line is already familiar for us, asking to select from OrderItems. Next, we use SetProjection again, to select just the “Product.id”, which is the OrderItems.SKU. Note that we are using something slightly different this time, where before we used Projections.Property, now we use Projections.GroupProperty. What is the difference between the two?
Projection.Property instructs NHibernate to put the matching column in the select clause, while Projection.GroupProperty instructs NHibernate to put the matching column in the select clause and in the group by clause. This is required because on the next line, we are using an aggregate function in the order by clause, aggregate functions must be used in conjunction with the appropriate group by clause. That line also specify that we are using a descending order on the count of the “Order.id” (which matches OrderItems.OrderID).
The following line is something we are already familiar with, we are adding a where clause to filter the current SKU. And now we get to the interesting part, we use a subquery to match the “Order.id” to the order ids containing the current SKU. Last, but not least, we limit the number of returned rows to 15. The resulting SQL is:
SELECT top 15 this_0_.SKU as y0_ FROM OrderItems this_0_ WHERE not (this_0_.SKU = 'Binoculars2' /* @p0 */) and this_0_.OrderID in (/* orderIDsContainingCurrentSku */) GROUP BY this_0_.SKU ORDER BY count(this_0_.OrderID) desc
I think that again, once we have gone over this in details, you will agree that there is a pretty simple mapping between the query and the resulting SQL.
Now, let us look at the actual query code, which make use of the previous two subqueries:
var recommended = _session.CreateCriteria<Product>()
.SetFetchMode<Product>(x => x.Descriptors, FetchMode.Join)
.Add(Subqueries.PropertyIn("id", skusOfProductsAppearingInOrdersContainingCurrentSku))
.SetResultTransformer(Transformers.DistinctRootEntity)
.List<Product>();
We create a standard criteria query, ask it to eager load the Descriptors, then we perform a subquery, matching the product “id” (when specified using all lower case it is an NHibernate keyword referencing the current entity’s PK column) to the skus of products that appear in orders contains the current sku. Because we used a join to eager load the descriptors, we need to specify SetResultTransofer so we will get only distinct root entities.
All in all, when you break it up to pieces like that, I don’t think that it is an overly complex process. The query we were trying to get to is by no means a simple one, but we didn’t have any additional complexity when trying to create it using NHibernate.
Comments
Does this really qualify as a "recommendation engine"? I say that because this seems like it's a common feature of a web app like the one in the screen cast. I would say in this case the NH query is actually more confusing (depending on experience of course) than the raw SQL.
So, If you don't recommend something like this is used in production, what would you recommendation someone using NH do in this case? I mean, this is a common need for an e-commerce app.
Bob,
As I said, what we did was show how we can use NHibernate to create moderately complex queries. That was the goal, not talking about the actual SQL.
Bob,
As for how I would build something like that?
Probably by utilizing an OLAP DB and querying from that.
Ayende,
OK, I guess my point is that if the scope of the app is small, an OLAP DB may be overkill. So, something like the NH query solution you posted may indeed be necessary in a real production environment.
Bob,
And if you like it, you can use it. But we wanted to be crystal clear that this isn't the general recommendation.
last year in a burst of energy i ported
http://taste.sourceforge.net/
to .net (the pre-Mahout version). There is sufficient extensibility to interface with external data stores, but i never got around to nHibernate integration. i plan to do so in a few months for a current project.
http://mymediaproject.codeplex.com/ seems like a full blown solution that i would use if the license was right (free only for non-commercial use, but my project will be MIT ish)
Well, I think you just recommended not using NHibernate for statistical queries. SQL is good enough for this... and the Criteria API is just ugly and difficult to understand
In my opinion OLAP would be way overkill for this. Why not just use HQL or a named SQL query? Criteria is ugly for non-trivial queries and hopefully a good LINQ provider for NHibernate will be ready soon.
I tend to agree with the others. In this case SQL would provide much more clarity of the intent of the query than the obfuscated use of the Criteria API. Most likely find I would have to use a profiler to look at the SQL output to reverse-engineer what's going on in this situation.
When it comes to complex groupings and sub queries I would really need a good reason not to use LINQ or custom sql.
Hi Ayende
I was working on something similar yesterday and encountered a bizarre behavior with projections which might be a bug in NH...
Could you shed some insight into the following behavior I noticed:
"GroupProperty across associations doesn't seem to work if no alias is specified in the detached criteria"
The domain is of Customer having many Orders and we already have the detached criterias nicely wrapped in "finders"
I wanted to use the existing Order finder DetachedCriteria.For <order()) and project Sum(order.Total) by order.Customer.Name:
Projections.GroupProperty("Customer.Name") does NOT work and throws NHibernate.QueryException "could not resolve property: Customer.Name of: Domain.Order"
But if I add an alias in the criteria:
query.CreateAlias("Customer", "c") and group using alias Projections.GroupProperty("c.Name") it WORKS!
Also strangely a projection on the Customer.Id property works fine without any alias...
Ajai
Gents I think you're missing the point. RE the Criteria API we did it that way because I asked him too - if we went the HQL route I would have just did as you mention and used SQL or an SP.
The main reason it's "ugly" is because we tried it in the first place. And OLAP system doesn't need to be backed by a full OLAP service - you can "star up" a set of flattened tables, index them, and run a query that way which we be a lot prettier and probably a super simple Criteria call. I'll pick up this point in the next screencast.
OLAP is never overkill if you find yourself needing to introspect data for patterns etc. This is your business and directly at the heart of what it is you're doing. I would agree that the systems out there (like SQL Integration) are pathetic when it comes to this stuff - but again you don't need a full system like that to build out OLAP.
Dmitry,
You should not run these types of queries against your OLTP data store.
Ajai,
Please use the NHibernate mailing list
http://groups.google.com/group/nhusers
Alternatively, you may choose our commercial support option:
http://nhprof.com/commercialsupport
Would you run full-featured analysis services for this feature? Rob is talking about creating flat database tables. How would they be populated, with triggers?
I do not see SQL struggling to much with the query because it only returns up to 15 records. That is assuming the tables are indexed properly.
Oren,
That depends on too many conditions to answer.
Depending on the traffic on the site, the size of the data, etc.
Probably I'll go with a separate DB to hold renormalized schema with some process to sync the OLTP and reporting DB.
What is this generic overload of Add <orderitem that takes an expression? Is this something new in the trunk, I don't think I have it in NH 2.1.2...
zvolkov,
That is an extension method from NH Lambda Extensions
Perhaps its the syntax that is foreign to me, but I kept stumbling on it. I am not saying I think its a good idea, but I can't wait to be able to do this sort of thing with LINQ. :-)
Comment preview