Notes on RavenDB vs. PostgreSQL

time to read 8 min | 1528 words

A common question I field on many customer inquiries is comparing RavenDB to one relational database or another. Recently we got a whole spate of questions on RavenDB vs. PostgreSQL and I though that it might be easier to just post about it here rather than answer the question each time. Some of the differences are general, for all or most relational databases, but I’m also going to touch on specific features of PostgreSQL that matter for this comparison.

The aim of this post is to provide highlights to the differences between RavenDB and PostgreSQL, not to be an in depth comparison of the two.

PostgreSQL is a relational database, storing the data using tables, columns and rows. The tabular model is quite entrenched here, although PostgreSQL has the notion of JSON columns. 

RavenDB is a document database, which store JSON documents natively. These documents are schema-free and allow arbitrarily complex structure.

The first key point that distinguish these databases is with the overall mindset. RavenDB is meant to be a database for OLTP systems (business applications) and has been designed explicitly for this. PostgreSQL is trying to achieve both OLTP and OLAP scenarios and tends to place a lot more emphasis on the role of the administrator and operations teams. For example, PostgreSQL requires VACUUM, statistics refresh, manual index creation, etc. RavenDB, on the other hand, it design to run in a fully automated fashion. There isn’t any action that an administrator needs to take (or schedule) to ensure that RavenDB will run properly.

RavenDB is also capable of configuring itself dynamically, adjusting to the real world load it has based on feedback from the operational environment. For example, the more queries a particular index has, the more resources it will be granted by RavenDB. Another example is how RavenDB processes queries in general. Its query analyzer will run through the incoming queries and figure out what is the best set of indexes that you need to answer them. RavenDB will then go ahead and create these indexes on the fly. Such an action tends to be scary for users coming from relational databases, but RavenDB was designed upfront for specifically these scenarios. It is able to build the new indexes without adding too much load to the server and without taking any locks. Other tasks that are typically handled by the DBA, such as configuring the system, are handled dynamically by RavenDB based on actual operational behavior. RavenDB will also cleanup superfluous indexes and reduce the resources available for indexes that aren’t in common use. All of that without a DBA to perform acts of arcane magic.

Another major difference between the databases is the notion of schema. PostgreSQL requires you to define your schema upfront and adhere to that. The fact that you can use JSON at times to store data provides an important escape hatch, but while PostgreSQL allows most operations on JSON data (including indexing them), it is unable to collect statistics information on such data, leading to slower queries. RavenDB uses a schema-free model, documents are grouped into collections (similar to tables, but without the constraint of having the same schema), but have no fixed schema. Two documents at the same collection can have distinct structure.  Typical projects using JSON columns in PostgreSQL will tend to pull specific columns from the JSON to the table itself, to allow for better integration with the query engine. Nevertheless, PostgreSQL’s ability to handle both relational and document data gives it a lot of brownie points and enable a lot of sophisticated scenarios for users.

RavenDB, on the other hand, is a pure JSON database, which natively understand JSON. It means that the querying language is much nicer for querying that involve JSON and comparable for queries that don’t have a dynamic structure. In addition to being able to query the JSON data, RavenDB also allows you to run aggregation using Map/Reduce indexes. These are similar to materialized views in PostgreSQL, but unlike those, RavenDB is going to update the indexes automatically and incrementally. That means that you can query on large scale aggregation in microseconds, regardless of data sizes.

For complex queries, that touch on relationships between pieces of data, we have very different behaviors. If the relations inside PostgreSQL are stored as columns and using foreign keys, it is going to be efficient to deal with them. However, if the data is dynamic or complex, you’ll want to put it in a JSON column. At this point, the cost of joining relations skyrockets for most data sets. RavenDB, on the other hand, allow you to follow relationships between documents naturally, at indexing or querying time. For more complex relationships work, RavenDB also has graph querying which allow you to run complex queries on the shape of your data.

I mentioned before that RavenDB was designed explicitly for business applications, that means that it has a much better feature set around their use case. Consider the humble Customers page, which needs to show the Customers details, Recent Orders (and their total), Recent Support Calls, etc.

When querying PostgreSQL, you’ll need to make multiple queries to fetch this information. That means that you’ll have to deal with multiple network roundtrips, which in many cases can be the most expensive piece of actually querying the database. RavenDB, on the other hand, has the Lazy feature, which allow you to combine multiple separate queries into a single network roundtrip. This seemingly simple feature can have a massive impact on your overall performance.

A similar feature is related to the includes feature. It is very common when you load one piece of data that you want to get related information. With RavenDB, you can indicate that to the database engine, which will send you all the results in one shot. With a relational database, you can use a join (with the impact on the shape of the results, Cartesian products issue and possible performance impact) or issue multiple queries. Simple change, but significant improvement over the alternative.

RavenDB is a distributed database by default while PostgreSQL is a single node by default. There exists features and options (log shipping, logical replication, etc), which allow PostgreSQL to run as a cluster, but they tend to be non trivial to setup, configure and maintain. With RavenDB, even if you are running a single node, you are actually running a cluster. And when you have multiple nodes, it is trivial to join them into a cluster, from which point on, you can just manage everything as usual. Features such as multi-master, the ability for disconnected work and widely distributed clusters are native parts of RavenDB and integrate seamlessly, while they tend to be of the “some assembly required” in PostgreSQL.

The two databases are very different from one another and tend to be used for separate purposes. RavenDB is meant to be the application database, it excels in being the backend of OTLP systems and focus on that to the exclusion of all else. PostgreSQL tend to be more general, suitable for dynamic queries, reports and exploration as well as OLTP scenarios. It may not be a fair comparison, but I have literally built RavenDB specifically to be better than a relational database for the common needs of business applications, and ten years in, I think it still shows significant advantages in that area.

Finally, let’s talk about performance. RavenDB was designed based on failures in the relational model. I spent years as a database performance consultant, going from customer to customer fixing the same underlying issues. When RavenDB was designed, we took that to account. The paper OLTP – Through the Looking Glass, and What We Found There has some really interesting information. Including the issue of about 30% of a relational database performance is spent on locking.

RavenDB is using MVCC, just like PostgreSQL. Unlike PostgreSQL, RavenDB doesn’t need to deal with transaction id wraparound, VACUUM costs, etc. Instead, we maintain MVCC not on the row level, but on the page level. There is a lot less locks to manage and deal with and far less complexity internally. This means that read transactions are completely lock free and don’t have to do any coordination with write transactions. That has an impact on performance and RavenDB can routinely manage to achieve benchmark numbers on commodity hardware that are usually reserved for expensive benchmark machines.

One of our developers got a new machine recently and did some light benchmarking. Running in WSL (Ubuntu on Windows), RavenDB was able to exceed 115,000 writes / sec and 275,000 reads / sec. Hare the specs:

image

And let’s be honest, we weren’t really trying hard here, but we still got nice numbers. A lot of that is by designing how we interact internally to have a much simpler architecture and shape, and it shows. And the nice thing is that these advantages are cumulative. RavenDB is fast, but you also gain the benefits of the protocol allowing you to issue multiple queries in a single roundtrip, the ability to include additional results and dynamically adjusting to the operation environment.

It Just Works.