That No SQL Thing – Document Databases
A document database is, at its core, a key/value store with one major exception. Instead of storing just any blob, a document db requires that the data be stored in a format that the database can understand. The format can be XML, JSON, Binary JSON (MongoDB), or just about anything else, as long as the database can understand it.
Why is this such a big thing? Because when the database can understand the format of the data that you send it, it can do server side operations on that data. In most doc dbs, that means that we can run queries on the document data. The known format also means that it is much easier to write tooling for the database, since it is possible to display and edit the data.
I am going to use Raven as the example for this post. Raven is a JSON based, REST enabled document database. Documents in Raven use the JSON format, and each document contains both the actual data and additional metadata that you can use. Here is an example of a document:
{
  "name": "ayende",
  "email": "ayende@ayende.com",
  "projects": [
    "rhino mocks",
    "nhibernate",
    "rhino service bus",
    "raven db",
    "rhino persistent hash table",
    "rhino distributed hash table",
    "rhino etl",
    "rhino security",
    "rampaging rhinos"
  ]
}
We can PUT this document in the database, under the key ‘ayende’. We can also GET the document back by using the key ‘ayende’.
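To make the PUT/GET idea concrete, here is a minimal in-memory sketch in Python. It is not Raven's API: ToyDocumentStore, put and get are hypothetical names made up for this illustration. The only point it tries to show is the "one major exception" from above, namely that the store accepts a value only if it can parse it.

```python
import json


class ToyDocumentStore:
    """Hypothetical in-memory store: a key/value map that only accepts JSON."""

    def __init__(self):
        self._docs = {}

    def put(self, key, raw_json):
        # Unlike a plain key/value store, reject anything the store cannot parse.
        # json.loads raises json.JSONDecodeError (a ValueError) on bad input.
        document = json.loads(raw_json)
        if not isinstance(document, dict):
            raise ValueError("a document must be a JSON object")
        self._docs[key] = document

    def get(self, key):
        # Returns the parsed document, or None if the key is unknown.
        return self._docs.get(key)


store = ToyDocumentStore()
store.put("ayende", '{"name": "ayende", "projects": ["rhino mocks", "raven db"]}')
doc = store.get("ayende")
```

Because the store holds parsed documents rather than opaque bytes, everything that follows (indexes, queries, tooling) has something to work with.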
A document database is schema free; that is, you don’t have to define your schema ahead of time and adhere to it. It also allows us to store arbitrarily complex data. If I want to store trees, or collections, or dictionaries, that is quite easy. In fact, it is so natural that you don’t really think about it.
It does not, however, support relations. Each document is standalone. It can refer to other documents by storing their keys, but there is nothing to enforce referential integrity.
The major benefit of using a document database comes from the fact that while it has all the benefits of a key/value store, you aren’t limited to just querying by key. By storing information in a form that the database can understand, we can ask the server to do things for us, such as defining the following index:
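A small Python sketch of what "refer by key, with nothing enforced" looks like in practice. The documents live in a plain dict; the key names ("users/ayende", "projects/raven"), the "owner" field and the load_owner helper are all hypothetical and chosen only for the illustration.

```python
# Hypothetical documents held in a plain dict keyed by document key.
docs = {
    "users/ayende": {"name": "ayende"},
    "projects/raven": {"name": "raven db", "owner": "users/ayende"},
}


def load_owner(docs, project_key):
    """Follow the 'owner' reference by hand; returns None if it dangles."""
    owner_key = docs[project_key]["owner"]
    return docs.get(owner_key)


owner_before = load_owner(docs, "projects/raven")

# Nothing stops us from deleting the referenced document.
del docs["users/ayende"]
owner_after = load_owner(docs, "projects/raven")
```

After the delete, the project document still carries the key, and it is up to the application to cope with the missing owner.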
from doc in docs
from prj in doc.projects
select new { Project = prj, Name = doc.Name }
With that in place, we can now make queries on the stored documents:
GET /indexes/ProjectAndName?query=Project:raven
This will give us all the documents that include a project with the name raven.
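What the index does can be sketched in a few lines of Python. This is an illustration of the idea, not of how Raven implements it: build_index and query are hypothetical helpers, and splitting the Project field into lowercase words (so that the term "raven" matches the project "raven db") is an assumption made for the sketch.

```python
docs = {
    "ayende": {"name": "ayende", "projects": ["rhino mocks", "raven db"]},
    "other": {"name": "other", "projects": ["nhibernate"]},
}


def map_project_and_name(doc):
    # Mirrors the index definition: one entry per (project, name) pair.
    for project in doc["projects"]:
        yield {"Project": project, "Name": doc["name"]}


def build_index(docs):
    # Assumption: the Project field is split into lowercase words, so the
    # term "raven" matches the project "raven db".
    index = {}
    for key, doc in docs.items():
        for entry in map_project_and_name(doc):
            for term in entry["Project"].lower().split():
                index.setdefault(term, set()).add(key)
    return index


def query(index, term):
    # Look up the document keys stored under a single term.
    return sorted(index.get(term.lower(), set()))


index = build_index(docs)
matches = query(index, "raven")
```

The work of walking every document happens once, when the index is built; the query itself is a single lookup.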
We can take it a bit further, though, and ask the database to store aggregations for us:
// map
from doc in docs
from prj in doc.projects
select new { Count = 1, Project = prj }

// reduce
from result in results
group result by result.Project into g
select new { Project = g.Key, Count = g.Sum(x => x.Count) };
And now we can query for the number of users involved in the raven project:
GET /indexes/UsersCountByProject?query=Project:raven
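The same two steps, written out as a rough Python sketch. map_step and reduce_step are hypothetical names that mirror the two Linq queries above; the sample documents are made up for the illustration.

```python
from collections import defaultdict

docs = [
    {"name": "ayende", "projects": ["raven db", "rhino mocks"]},
    {"name": "other", "projects": ["raven db"]},
]


def map_step(doc):
    # One entry with Count = 1 per project a user is involved in.
    for project in doc["projects"]:
        yield {"Project": project, "Count": 1}


def reduce_step(results):
    # Group by Project and sum the counts, like the second Linq query.
    totals = defaultdict(int)
    for result in results:
        totals[result["Project"]] += result["Count"]
    return [{"Project": p, "Count": c} for p, c in totals.items()]


mapped = [entry for doc in docs for entry in map_step(doc)]
counts = {r["Project"]: r["Count"] for r in reduce_step(mapped)}
```

Note that the output of reduce_step has the same shape as its input, so in this sketch it can be applied again to already reduced results without changing the totals.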
In general, querying a document database falls into one of two camps: the results are either prepared ahead of time (Raven, CouchDB) or computed at query time (MongoDB).
In the first case, you define an indexing function (in Raven’s case, a Linq query; in CouchDB’s case, a JavaScript function) and the server runs it to prepare the results. Once the results are prepared, they can be served to the client with minimal computation.
CouchDB and Raven differ in how they update those indexes. Raven updates the index as soon as a document changes, and queries against an index never wait. A query may return a stale result (which is explicitly marked as such), but it returns immediately. With CouchDB, a view is updated when it is queried, which may lead to a long wait the first time a view is accessed if there were a lot of changes in the meantime. The CouchDB approach avoids doing extra work that may not be needed, but Raven’s approach means that you don’t have to handle potentially large delays at query time. CouchDB also lets you pass a parameter when querying a view to allow stale results, which seems to be the best of all worlds.
Note that in both CouchDB and Raven’s cases, indexes do not affect write speed, since in both cases the indexing is done in a background task.
MongoDB, on the other hand, allows ad-hoc querying, and relies on indexes defined on the document values to help it achieve reasonable performance when the data size grows large enough. MongoDB’s indexes behave in much the same way RDBMS indexes behave; that is, they are updated as part of the insert process, so a large number of indexes is going to affect write performance.
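The difference between "return immediately, possibly stale" and "bring the view up to date at query time" can be sketched with a toy Python class. Everything here is hypothetical (BackgroundIndex, run_indexing_pass, the wait_for_fresh flag); it models the two behaviours described above and is not the API of either database.

```python
class BackgroundIndex:
    """Toy index that is refreshed by a separate step, never by a write."""

    def __init__(self):
        self._docs = {}
        self._pending = []      # keys changed since the last indexing pass
        self._by_project = {}   # project -> set of document keys

    def put(self, key, doc):
        # Writes only record that indexing work is pending.
        # Simplification: entries from an older version of a document
        # are never removed from the index.
        self._docs[key] = doc
        self._pending.append(key)

    def run_indexing_pass(self):
        # Stands in for the background task.
        while self._pending:
            key = self._pending.pop()
            for project in self._docs[key]["projects"]:
                self._by_project.setdefault(project, set()).add(key)

    def query(self, project, wait_for_fresh=False):
        # wait_for_fresh=True models updating the view at query time;
        # False models returning immediately with a stale marker.
        if wait_for_fresh:
            self.run_indexing_pass()
        is_stale = bool(self._pending)
        return sorted(self._by_project.get(project, set())), is_stale


idx = BackgroundIndex()
idx.put("ayende", {"projects": ["raven db"]})
stale_result = idx.query("raven db")
fresh_result = idx.query("raven db", wait_for_fresh=True)
```

The first query comes back at once but is flagged as stale and misses the new document; the second one pays for the indexing pass before answering.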
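A toy Python sketch of why indexes that are maintained as part of the insert make writes more expensive. Collection, insert, find and the index_updates counter are hypothetical names; this models the general idea and not MongoDB's implementation.

```python
class Collection:
    """Toy collection whose indexes are maintained inside insert()."""

    def __init__(self, indexed_fields):
        self.docs = []
        self.indexes = {field: {} for field in indexed_fields}
        self.index_updates = 0  # counts index writes, to make the cost visible

    def insert(self, doc):
        position = len(self.docs)
        self.docs.append(doc)
        # Every index is updated before insert() returns.
        for field, index in self.indexes.items():
            if field in doc:
                index.setdefault(doc[field], []).append(position)
                self.index_updates += 1

    def find(self, field, value):
        if field in self.indexes:
            # Indexed lookup.
            return [self.docs[i] for i in self.indexes[field].get(value, [])]
        # Ad-hoc query without an index: scan every document.
        return [d for d in self.docs if d.get(field) == value]


people = Collection(["name", "email"])
people.insert({"name": "ayende", "email": "ayende@ayende.com"})
by_name = people.find("name", "ayende")

unindexed = Collection([])
unindexed.insert({"name": "ayende", "email": "ayende@ayende.com"})
by_scan = unindexed.find("name", "ayende")
```

Both collections answer the query, but the first one did two extra index writes during the insert, while the second one has to scan all documents at query time.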
Other document databases:
- ThruDB is the closest one that I could find, but it is basically a key/value store with manual Lucene indexing.
- Terrastore seems to be a full-fledged document database, but I can’t find much information about it. It requires Terracotta to work.
- Riak is another Erlang based doc db. It seems to be close to MongoDB in terms of operation, but that is about as much as I know about it. Querying Riak is done by writing a map/reduce query in JavaScript or Erlang. I think it is the originator of this.
Comments
For those of us that are still trying to wrap our heads around document databases, their uses as well as the paradigm in general, would you recommend some reading material?
For instance, I understand the relatively easy concept of serializing and deserializing POCOs to and from JSON (or some other format) for storage and retrieval. There's a ton of material about the easy parts of the paradigm. But I haven't been able to find a good reference for the harder concepts, such as the differences between querying data in an RDBMS and in a document database. Or, for that matter, the proper way data should be queried in document databases.
What material did you use to learn document databases?
Eric,
I read through the code base of CouchDB, went over all the documentation available for CouchDB, etc.
Nothing spectacular, I am afraid.
With the RESTful JSON access, it sounds like Raven is a good candidate for view model persistence. Since Raven takes care of the Lucene indexing through Linq views, it's that much easier to introduce searching into an application (searching as I have heard Udi describe it).
I'm still on the fence about whether a document DB should replace an RDBMS, but that could simply be my own resistance to change. I know everyone says "what about reporting". I think that is where an RDBMS comes into play. Store the domain and/or view data in the document DB and export data to an RDBMS for reporting/mining.
Jason,
A DocDB isn't really about replacing an RDBMS; those still have an important role to play.
I'll talk about use cases for DocDBs in a future post.
Can I ask few questions?
What is the underlying storage of Raven DB? Esent? If so, what about its performance?
Does Raven support distributed transactions/System.Transactions API?
I'd be happy to use it if it supported DTC, I'm thinking about transactional message queue & document operations, which is quite common pattern. Or maybe you have an idea how to implement transactional document & MQ operations without DTC?
Do you think this could be an excellent place to store temporal data, e.g. an invoice document? The more I search, the more I realize it's not a replacement for an RDBMS but can be an excellent add-on for storing temporal and denormalized data, which most systems seem to have (think invoice, think payment receipt, etc.). I think NoSQL should be renamed to PlusSQL or something similar :)
@Parag:
To my understanding, the initial meaning of 'NoSQL' was actually intended to be 'NOSQL' (standing for N.ot O.nly SQL) rather than the more often-used colloquialism 'NoSQL' taken to mean 'No' SQL at all.
This original meaning would be closer to the spirit of what you suggest ('PlusSQL') than the more common (mis)interpretation as being 'against' SQL in its entirety.
Sadly, I cannot seem to find an authoritative reference to corroborate my understanding in any 'official' sense, but I do believe this was its initial intent.
-Steve B.
Parag,
Yes, I think it would be ideally suited for that. It is a much more natural model for those sort of things than a RDBMS.
I can't agree that a document db is, at its core, a key/value store. A key value store is a document db, but not all document dbs are key value stores.
The definition of a key value store is that the data can only be accessed by key - this is fundamental to what they are. MongoDB, CouchDB and presumably RavenDB allow, as you state, more complex querying. Once you allow for secondary indexes, you are very far away from key/value.
My experience is more with MongoDB, so I can speak about it best. Sure it requires an _id, but that's an internal implementation detail...you can skip it and it'll generate one for you (I wouldn't be surprised if SQL server did the same thing when no PK was present). Also, the structure of a MongoDB document isn't keyed at all, the _id is part of the document, not the key of the document.
Karl,
For the most part, MongoDB is a drastically different beast than CouchDB or RavenDB.
You are correct, by the way, in that MongoDB's idea of how to deal with documents isn't really a key based approach.
I would say that the _id is much more than an implementation detail; this is how you address the document, and MongoDB has a lot of support to make this as fast as possible.
But yes, while you can do something like this:
db.people.findOne( { _id: ObjectId( "47cc67093475061e3d95369d" ) } )
That isn't really how I think about a key store.