Database assisted denormalization – Oracle edition
I decided to take a chance (installing Oracle is a big leap :-) ) and see how things match in Oracle.
I decided to run the following query:
SELECT deptno, dname, loc, (SELECT COUNT(*) FROM emp WHERE emp.deptno = dept.deptno) AS empcount FROM dept WHERE deptno = 20
Please note that I run in on a database that had (total) maybe a 100 records, so the results may be skewed.
Like in the SQL Server case, we need to create an index on the FK column. I did so, after which I got:
Then I dropped that index and create a simple view:
CREATE VIEW depswithempcount AS SELECT deptno, dname, loc, (SELECT COUNT(*) FROM emp WHERE emp.deptno = dept.deptno) AS empcount FROM dept
Querying on top of that gives me the same query plan as before. Trying to create a materialized view out of this fails, because of the subquery expression, I’ll have to express the view in terms of joins, instead. Like this:
SELECT dept.deptno, dname, loc, COUNT(*) empcount FROM dept LEFT JOIN emp ON dept.deptno = emp.deptno WHERE dept.deptno = 20 GROUP BY dept.deptno, dname, loc
Interestingly enough, this is a different query plan than the subquery, with SQL Server, those two query exhibit identical query plans.
Now, to turn that into an materialized view.
CREATE materialized VIEW deptwithempcount AS SELECT dept.deptno, dname, loc, COUNT(*) empcount FROM dept left join emp ON dept.deptno = emp.deptno GROUP BY dept.deptno, dname, loc
And querying on this gives us very interesting results:
select * from deptwithempcount where deptno = 20
Unlike SQL Server, we can see that Oracle is reading everything from the view. But let us try one more thing, before we conclude this with a victory.
update emp set deptno = 10 where deptno = 20; select * from deptwithempcount where deptno = 20But now, when we re-run the materialized view query, we see the results as they were at the creation of the view.
There appears to be a set of options to control that, but the one that I want (RERESH FAST), which update the view as soon as data changes will not work with this query, since it consider it too complex. I didn’t investigate too deeply, but it seems that this is another dead end.
Comments
Probably one of the biggest deficiencies in the various RDBMS offerings out there is their inability to materialize complex views.
This is a hard problem to solve but could be improved and fits very well into the relational model.
If you went with a eventually consistent model with a background view generation like some NOSQL products, the the complex view update could be deferred, but you could get stale results.
You should try a trigger based solution with MSSQL or Oracle at least as a comparison.
Not sure what the aversion to triggers is, they are ideal for creating complex user defined "indexes" such as what you are doing with the Post Count.
Materialize views are essentially just declarative system managed triggers that update another table.
Ayende,
Put both tables in the same table cluster, so that finding the employees is not another disk seek. This gets close to document data bases.
"This is a hard problem to solve" It is not. It can be solved by using stacked indexed views (search on google to find a pdf from MS; alas not implemented).
I was appalled as I learned about oracles deferred materialized views. I believe there is no synchronous option which makes them worthless for many applications. What an unnecessary crippling.
On the other hand If you want a deferred update system then the automatic view matching in the query optimizer must come in very handy.
I'm with Justin - relational databases have an existing, proven, built-in mechanism to implement what you're trying to accomplish; they're called triggers.
On one occasion I remember fiddling around with indexed views for a week before giving up and implementing a trigger solution. Quickest, easiest, most sensible and most performant change I ever made to that system.
@Tobi: materialized/indexed views are by definition used in functionality which consumes large buckets of data and processes them for a lengthy period of time for readonly purposes. This alone makes it already not necessary to have the views be 'up to date' at all time, as that's only possible with a live query.
many people think data has to be 'non-stale' at any given moment, but they forget that only some data has to be 'non-stale'. All the rest can be consumed from for example materialized/indexed views. The post count is a good example for data which is totally acceptable to be 'stale': who's going to check whether user X has indeed 103 posts? no-one.
It therefore comes down to:
decide what data elements can be stale
make sure the non-stale data is fast and easy to retrieve and manipulate.
And with 'stale' I don't mean data which is a year old, but for example materialized views (or your own tables with duplicated denormalized data, same thing, if your db doesn't support materialized views) which are refreshed every 10 minutes. For reports about monthly sales etc., that's perfectly fine.
This also leads to the point where one could decide to use two distinct databases instead of 1: one for the non-stale data and one for the stale data, which is for example updated every night.
It's true that this means one has to think about what the data in a system really means at time T and time T+t. But IMHO that's a good thing, it makes one realize that a database is more than just a bucket to store bits in, and actually a machine which can make things really easy.
"many people think data has to be 'non-stale' at any given moment, but they forget that only some data has to be 'non-stale'. All the rest can be consumed from for example materialized/indexed views. The post count is a good example for data which is totally acceptable to be 'stale': who's going to check whether user X has indeed 103 posts? no-one."
Word!
@Frans Bouma: Materialized views are not only used for aggregation. You can also save tons of join with them. Example: A forum thread. At least the following tables contribute one row for each post: ForumPosts, Users, UserAvatarImages. Then the resultset has to be sorted by datetime. All of this can be materialized so that the forum page needs exactly one range seek. Very cheap, can do that 2k times per second per core.
Have you tried using count(1) instead of count(*) and seeing if it makes a difference to your execution time? Also if it is ok for the count to be a little stale you could store it as a normal numerical field that gets updated by a batch process (database job in sql server) thats runs every so often?
@ Justin
+1 for materialized / schema bound views, being "sytem shortcuts" for triggers.
Ultimately the DB needs to infer if ANY object insert / update / delete would alter the output of ANY of the materialized views (Which I guess are little more than tables anyway).
I suspect this is the reason they are fairly limited in complexity, otherwise they could cripple modification performance, but in a way that is completely non-obvious.
Whilst triggers too maybe less than obvious, at least they are tied to the table you find the bottleneck on.
Would be interesting to see the query plan for the insert / updates on tables that form part of the materialized views.
The query plan does multiple updates in sequence. For aggregation views it actually aggregates the changed rows for each different view.
Comment preview