Building a recommendation engine in NHibernate

time to read 14 min | 2605 words

Well, it isn’t really a recommendation engine, it is a sample of one, and I strongly recommend not using it, but I am getting ahead of myself.

In the 6th episode of the TekPub’s NHibernate webcast, me & Rob worked on creating statistical queries with NHibernate. To be totally honest, the reason that we did that is to show off NHibernate’s querying capabilities, not so you would be able to make use of this in your applications. A recommendation engine is not something that you should run out of your OLTP store, so please take that under advisement.

The reason for this post is to explain in details how the final result works. Here is the NHibernate code:

var orderIDsContainingCurrentSku = DetachedCriteria.For<OrderItem>()
            .Add<OrderItem>(x=>x.Product.SKU==sku)
            .SetProjection(Projections.Property("Order.id"));

var skusOfProductsAppearingInOrdersContainingCurrentSku = DetachedCriteria.For<OrderItem>()
    .SetProjection(Projections.GroupProperty("Product.id"))
    .AddOrder(NHibernate.Criterion.Order.Desc(Projections.Count("Order.id")))
    .Add<OrderItem>(x=>x.Product.SKU!=sku)
    .Add(Subqueries.PropertyIn("Order.id", orderIDsContainingCurrentSku))
    .SetMaxResults(15);


var recommended = _session.CreateCriteria<Product>()
    .SetFetchMode<Product>(x => x.Descriptors, FetchMode.Join)
    .Add(Subqueries.PropertyIn("id", skusOfProductsAppearingInOrdersContainingCurrentSku))
    .SetResultTransformer(Transformers.DistinctRootEntity)
    .List<Product>();

And here is the resulting SQL:

SELECT this_.SKU                 as SKU1_1_,
       this_.ProductName         as ProductN2_1_1_,
       this_.BasePrice           as BasePrice1_1_,
       this_.WeightInPounds      as WeightIn4_1_1_,
       this_.DateAvailable       as DateAvai5_1_1_,
       this_.EstimatedDelivery   as Estimate6_1_1_,
       this_.AllowBackOrder      as AllowBac7_1_1_,
       this_.IsTaxable           as IsTaxable1_1_,
       this_.DefaultImageFile    as DefaultI9_1_1_,
       this_.AmountOnHand        as AmountO10_1_1_,
       this_.AllowPreOrder       as AllowPr11_1_1_,
       this_.DeliveryMethodID    as Deliver12_1_1_,
       this_.InventoryStatusID   as Invento13_1_1_,
       descriptor2_.SKU          as SKU3_,
       descriptor2_.DescriptorID as Descript1_3_,
       descriptor2_.DescriptorID as Descript1_4_0_,
       descriptor2_.Title        as Title4_0_,
       descriptor2_.Body         as Body4_0_
FROM   Products this_
       left outer join ProductDescriptors descriptor2_
         on this_.SKU = descriptor2_.SKU
WHERE  this_.SKU in (SELECT   top 15 this_0_.SKU as y0_
                     FROM     OrderItems this_0_
                     WHERE    not (this_0_.SKU = 'Binoculars2' /* @p0 */)
                              and this_0_.OrderID in (SELECT this_0_0_.OrderID as y0_
                                                      FROM   OrderItems this_0_0_
                                                      WHERE  this_0_0_.SKU = 'Binoculars2' /* @p1 */)
                     GROUP BY this_0_.SKU
                     ORDER BY count(this_0_.OrderID) desc)
 

The problem is that both the NHibernate code and the SQL are pretty complicated, and mapping between the two might be pretty hard if you are not familiar with that. So let us take this in stages. First, let us understand the logic in the SQL itself. Most of the complexity happens in the where clause, so let us look at this in depth:

WHERE  this_.SKU in (SELECT   top 15 this_0_.SKU as y0_
 FROM     OrderItems this_0_
 WHERE    not (this_0_.SKU = 'Binoculars2' /* @p0 */)
          and this_0_.OrderID in (SELECT this_0_0_.OrderID as y0_
                  FROM   OrderItems this_0_0_
                  WHERE  this_0_0_.SKU = 'Binoculars2' /* @p1 */)
 GROUP BY this_0_.SKU
 ORDER BY count(this_0_.OrderID) desc)

What exactly is going on in here?

Let us look at the nested most select, this select the OrderId from an order that have a Binoculars item in it. That is then passed to the parent select, matching it to that OrderID and returning the SKU if it is not also a Binoculars.

What we are actually saying here is: Give me all the items from orders that contains Binoculars, except for Binoculars. The logic is simple, you are very likely to buy something that someone else also bought in an order together with stuff that you bought (complimentary products, another book in the same series, etc). Next, we have the order by, we use that to find the stuff that you are most likely to buy. By ordering the items based on the number of orders they appear in, we try to find the most popular items (hence, the stuff that you are likely to buy as well).

I think that this is fairly clear, and now that we have the logic of the statement, let us try to understand how that NHibernate code produced it.

The answer is actually very simple, NHibernate’s Criteria API is all about composability. We simply composed the query from all the tiny pieces. Let us look at each individual piece in detail:

var orderIDsContainingCurrentSku = DetachedCriteria.For<OrderItem>()
            .Add<OrderItem>(x=>x.Product.SKU==sku)
            .SetProjection(Projections.Property("Order.id"));

This query is using DetachedCriteria, this is a way to generate a query (or a sub query, as we will soon see), without having a direct reference to the session. This is mostly useful in cases like this, where you want to compose several queries into a single one.

image In this case, it is pretty obvious what is going on, we ask NHibernate to select from OrderItems (by using the OrderItem entity), where the product sku is equal to the appropriate SKU (Binoculars2, in this case), and we don’t want to get the entity back, instead, we want only a single field (this is what SetProjection is for), the order id. Note that OrderItem mapping is quite interesting:

<class name="OrderItem" table="OrderItems" >
  <composite-id>
      <key-many-to-one name="Product" column="SKU"/>
      <key-many-to-one name="Order" column="OrderID"/>
  </composite-id>
</class>

It is a class with a composite id, where each part of the PK is also a FK to a different table. With NHibernate, we map this using <key-many-to-one/> in a composite-id element.

When we want to query on that, we can either use the usual many-to-one approach, or, if we want to refer to a particular column, we use the “id” (all lower case) keyword. In other words, “Order.id” refers to the OrderItems.OrderID column, while “Product.id” or “Product.SKU” refers to the “OrderItems.SKU” column. I think that you can figure out what is going on now, this query generate the following SQL:

SELECT this_0_0_.OrderID as y0_
FROM   OrderItems this_0_0_
WHERE  this_0_0_.SKU = 'Binoculars2' /* @p1 */

And I think that can see the direct correlation between the NHibernate query and the generated SQL.

Next in line, and seemingly much more complicated, we have this:

var skusOfProductsAppearingInOrdersContainingCurrentSku = DetachedCriteria.For<OrderItem>()
    .SetProjection(Projections.GroupProperty("Product.id"))
    .AddOrder(NHibernate.Criterion.Order.Desc(Projections.Count("Order.id")))
    .Add<OrderItem>(x=>x.Product.SKU!=sku)
    .Add(Subqueries.PropertyIn("Order.id", orderIDsContainingCurrentSku))
    .SetMaxResults(15);

But it isn’t really complidated, let us look at this in details. The first line is already familiar for us, asking to select from OrderItems. Next, we use SetProjection again, to select just the “Product.id”, which is the OrderItems.SKU. Note that we are using something slightly different this time, where before we used Projections.Property, now we use Projections.GroupProperty. What is the difference between the two?

Projection.Property instructs NHibernate to put the matching column in the select clause, while Projection.GroupProperty instructs NHibernate to put the matching column in the select clause and in the group by clause. This is required because on the next line, we are using an aggregate function in the order by clause, aggregate functions must be used in conjunction with the appropriate group by clause. That line also specify that we are using a descending order on the count of the “Order.id” (which matches OrderItems.OrderID).

The following line is something we are already familiar with, we are adding a where clause to filter the current SKU. And now we get to the interesting part, we use a subquery to match the “Order.id” to the order ids containing the current SKU. Last, but not least, we limit the number of returned rows to 15. The resulting SQL is:

SELECT   top 15 this_0_.SKU as y0_
FROM     OrderItems this_0_
WHERE    not (this_0_.SKU = 'Binoculars2' /* @p0 */)
      and this_0_.OrderID in (/* orderIDsContainingCurrentSku */)
GROUP BY this_0_.SKU
ORDER BY count(this_0_.OrderID) desc

I think that again, once we have gone over this in details, you will agree that there is a pretty simple mapping between the query and the resulting SQL.

Now, let us look at the actual query code, which make use of the previous two subqueries:

var recommended = _session.CreateCriteria<Product>()
    .SetFetchMode<Product>(x => x.Descriptors, FetchMode.Join)
    .Add(Subqueries.PropertyIn("id", skusOfProductsAppearingInOrdersContainingCurrentSku))
    .SetResultTransformer(Transformers.DistinctRootEntity)
    .List<Product>();

We create a standard criteria query, ask it to eager load the Descriptors, then we perform a subquery, matching the product “id” (when specified using all lower case it is an NHibernate keyword referencing the current entity’s PK column) to the skus of products that appear in orders contains the current sku. Because we used a join to eager load the descriptors, we need to specify SetResultTransofer so we will get only distinct root entities.

All in all, when you break it up to pieces like that, I don’t think that it is an overly complex process. The query we were trying to get to is by no means a simple one, but we didn’t have any additional complexity when trying to create it using NHibernate.