Bulk Data Operations Using NHibernate

If there is one thing that NHibernate (and most other O/RMs) are not very good at, it is bulk data operations. What does that mean? Well, let's take the simple case of wanting to delete all the records in a table. (I'm going to use the Active Record Blog Sample here, since it is much simpler.) Let's take this, for instance:

Post.DeleteAll();

This will delete all the rows from the Posts table. Really easy to write, but what does it actually mean? What goes on under the covers is something like this:

  1. Load all the posts.
  2. For each post:
     1. Resolve the relationships between the post and other objects.
     2. According to the cascade options specified, delete / update all related objects (possibly recursing into their children, etc.).
     3. Delete the post.

It is done this way because NHibernate takes charge of maintaining the integrity of the database (and because you don't want to get foreign key errors). In this case, it will make sure that all the comments for all the posts are deleted as well. In general, this is a Good Thing, since usually you don't deal with massive amounts of data that need to be modified at once.
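In raw NHibernate terms, the unoptimized flow looks roughly like this (a sketch only; the `sessionFactory` variable and a `Post` mapping with a cascading comments collection are assumptions):

```csharp
// Sketch of what a cascading bulk delete amounts to under the covers:
// load every post, then delete them one at a time so NHibernate can
// cascade each delete to the post's comments.
using (ISession session = sessionFactory.OpenSession())
using (ITransaction tx = session.BeginTransaction())
{
    IList<Post> posts = session.CreateCriteria(typeof(Post)).List<Post>();
    foreach (Post post in posts)
    {
        // Issues at least one DELETE per post, plus DELETEs for its
        // comments, as dictated by the cascade options in the mapping.
        session.Delete(post);
    }
    tx.Commit();
}
```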

This is simply horrible when you start to deal with many records that you want to change at once. Case in point: I load several thousand objects into the application, process them, and then delete some of them. There is no order in the way I load the objects, and there is no order in which they will be deleted.

I am operating in batches, and a single batch may delete 10,000 rows. Because of the behavior mentioned above, this results in 10,000 queries in the unoptimized version. I think you can guess where my bottleneck was. Now, there are quite a few ways I can mitigate it. I could push all the deletes into one session's Unit of Work and let NHibernate batch them appropriately, but even when they are batched, it is still 10,000 queries that we are talking about. Sure, each one is just "DELETE FROM TableName WHERE Id = @p0", but that is still way too many queries for something that should happen many times a second.
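As an aside, NHibernate's ADO.NET batching is what does that grouping; it is enabled with the `adonet.batch_size` configuration property (a configuration fragment; the surrounding session-factory element is omitted here):

```xml
<!-- Send up to 100 statements to SQL Server in a single round-trip -->
<property name="adonet.batch_size">100</property>
```

This cuts down on round-trips, but the database still has to execute each DELETE individually.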

Actually, since the work is being done on multiple threads, there are even more problems, since NHibernate's Session is not thread safe. Problem, isn't it?

NHibernate's documentation warns against doing this kind of stuff, for the reasons I have just listed. Here NHibernate chose the approach of safe & slow rather than fast & wrong. I agree with this approach wholeheartedly, but it is not doing me any good in this case.

Now, let us think about the problem first, and see what I want to happen. I want to be able to issue a delete statement like this:

    DELETE FROM TableName
    WHERE Id IN (@p0, @p1, @p2 ... @p9999)

I just tested it, and I can verify that the IN operator in SQL 2005 can have over 200,000 items in it, so I'm good to go, am I not? Not quite: the query size can get big fast, and it's not a terribly efficient way to do it. Besides, I don't want to appear on The Daily WTF.

Here is what I ended up with; it is a two part solution. One part, in the database, is composed of a table and a stored procedure:

    CREATE TABLE [dbo].[HelperTable_IdMarkedForDeletion](
        [Id] [int] NOT NULL PRIMARY KEY CLUSTERED
    )
    GO

    CREATE PROCEDURE [dbo].[ClearIncomingFromRowsMarkedForDeletion]
    AS
        BEGIN TRANSACTION

        DELETE FROM TableName
        WHERE Id IN (SELECT Id FROM HelperTable_IdMarkedForDeletion)

        TRUNCATE TABLE HelperTable_IdMarkedForDeletion

        COMMIT
    GO

This actually gives the Query Optimizer a chance to optimize the procedure and get an efficient result. Now, to the second part, which is the code. I just created this class:

    using System;
    using System.Data;
    using System.Data.SqlClient;

    public class BulkDeleter<PkType, ItemType>
    {
        DataTable table;
        string tempTable, procName;
        Func<ItemType, PkType> getPk;

        public BulkDeleter(string tempTableName, string procName, Func<ItemType, PkType> getPk)
        {
            this.tempTable = tempTableName;
            this.procName = procName;
            this.table = new DataTable(tempTable);
            this.table.Columns.Add("Id", typeof(PkType)).Unique = true;
            this.getPk = getPk;
        }

        public void RegisterForDeletion(ItemType item)
        {
            lock (table)
            {
                table.Rows.Add(getPk(item));
            }
        }

        public void PerformDelete(IDbConnection connection)
        {
            lock (table)
            {
                // Bulk copy the accumulated ids into the helper table
                using (SqlBulkCopy bulkCopy = new SqlBulkCopy((SqlConnection)connection))
                {
                    bulkCopy.DestinationTableName = tempTable;
                    bulkCopy.WriteToServer(table);
                }

                table.Rows.Clear();

                // Execute the stored procedure that deletes the marked
                // rows and truncates the helper table
                using (IDbCommand command = connection.CreateCommand())
                {
                    command.CommandText = procName;
                    command.CommandType = CommandType.StoredProcedure;
                    command.ExecuteNonQuery();
                }
            }
        }
    }

On construction, it gets the table and procedure names, and a delegate that will extract the primary key from the items that are registered for deletion. Internally, it maintains a DataTable of the primary keys; any time you register an object for deletion, it will store its primary key in the DataTable. When you decide to perform the delete, it will bulk copy the contents of the DataTable to the server, and then execute the stored procedure.
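Usage looks something like this (a sketch; the `Post` entity, `connectionString`, and the exact flush point are assumptions, but the table and procedure names match the ones created above):

```csharp
// One shared instance; RegisterForDeletion locks internally, so the
// worker threads can all use it safely.
BulkDeleter<int, Post> deleter = new BulkDeleter<int, Post>(
    "HelperTable_IdMarkedForDeletion",
    "ClearIncomingFromRowsMarkedForDeletion",
    post => post.Id);

// On any worker thread, as items are processed:
deleter.RegisterForDeletion(post);

// When the batch is done, flush everything in two round-trips:
// one bulk copy, one stored procedure call.
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    deleter.PerformDelete(connection);
}
```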

Bulk copying is very fast, and there is nothing like a stored procedure to handle bulk data operations. It is hard to measure the timing correctly, since the work was spread across multiple threads, but I know that I reduced the number of queries the database has to deal with from thousands to two.

As a note, if I were working on a project that required bulk data operations in more than just one or two cases, I would probably have created a custom persister for NHibernate, which would be able to make more assumptions about my model than NHibernate can. Right now this is a single scenario that I need to address, and this was the simplest thing that solved the problem (elegantly).