Abstracting the persistence medium isn’t going to let you switch persistence abstractions

time to read 5 min | 884 words

This came in as a comment for my post about encapsulating DALs:

Just for the sake of DAL - what if I want to persist my data in another place, different than a relational database? For instance, in a object-oriented database, or just serialize the entities and store them somewhere on the disk... What about then ? NHibernate cannot handle this situation, can it ? Wouldn't I be entitled to abstract my DAL so that it can support such case ?

The problem here is that it assume that such a translation is even possible.

Let us take the typical Posts & Comments scenario. In a relational model, I would model this using two tables. Now, let us say that I have the following DAL operations:

  • GetPost(id)
  • GetPostWithComments(id)
  • GetRecentPostsWithCommentCount(id)

The implementation of this should be pretty obvious on a relational system.

But when you move to a document database, for example, suddenly this “abstract” API becomes an issue.

How do you store comments? As separate documents, embedded in the post? If they are embedded in the post, GetPost will now have the comments, which isn’t what happens with the relational implementation. If they are separate documents, you complicated the implementation of GetPostWithComments, and GetRecentPostsWithCommentCount now have a SELECT N+1 problem.

Now, to be absolutely clear, you can certainly do that, there are fairly simple solutions for that, but this isn’t how you would build the system if you started from a document database standpoint.

What you would end up with in a document database design is probably just:

  • GetPost(id)
  • GetRecentPosts(id)

Things like methods for eagerly loading stuff just don’t happen there, because if you need to eagerly load stuff, they are embedded in the document.

Again, you can make this work, but you end up with a highly procedural API that is very unpleasant to work with. Consider the Membership Provider API as a good example of that. The API is a PITA to work with compared to APIs that don’t need to work to the lowest common denominator.

When talking about data stores, that usually means limit yourself to key/value store operations, and even then, there is stuff that leaks through. Let me show you an example that you might be familiar with:

image

The MembershipProvider is actually a good example of an API that try to fit several different storage systems. It has a huge advantage from the get go because it comes with two implementations for two different storage systems (RDBMS and ActiveDirectory).

Note the API that you have, not only is it highly procedural, but it can also assume nothing about its implementation. It makes for code that is highly coupled to this API and to this form of working with data in procedural fashion. This make it much harder to work with compared to code using a system that enable automatic change tracking and data store synchronization (pretty much every ORM).

That is one problem of such APIs, it greatly increase the cost of working with it. Another problem is that stuff still leaks through.

Take a look at the GetNumberOfUsersOnline method. That is a method that can only be answered efficiently on a relational database. The ActiveDirectoryMembershipProvider just didn’t implement this method, because it isn’t possible to answer that question from ActiveDirectory.

By that same token, you can’t implement FindUsersByEmail, FindUsersByName on a key/value store (well, you can maintain a reverse index, of course), nor can you implement GetAllUsers.

You can’t abstract the nature of the underlying data store that you use, it is going to leak through. The issues isn’t so much if you can make it work, you usually can, the issue is if you can make it work efficiently. Typical problems that show up is that you try to map concepts that simply do not exists in the underlying data store.

Relations is a good example of something that it is very easy to do in SQL but very hard to do using almost anything else. You can still do that, of course, but now your code is full of SELECT N+1 and you can waive goodbye to your performance. If you build an API for a document database and try to move it to a relational one, you are going to get hit with “OMG! It takes 50 queries to save this entity on a database, and 1 request to save it on a DocDb”, leaving aside the problem that you are also going to have problems reading it from a relational database (the API would assume that you give back the full object).

In short, abstractions only work as long as you can assume that what you are abstracting is roughly similar. Abstracting the type of the relational database you use is easy, because you are abstracting the same concept. But you can't abstract away different concepts. Don’t even try.