Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 201 words

Let's assume that you have a table with a position column that you use for indexing or ordering, and you want to shuffle the ordering of the items there. You can do it with a single query like this one:

UPDATE Attributes
SET Position = CASE Position
                   WHEN @From THEN @To
                   ELSE Position + 1
               END
WHERE ReferalID = @Referal
  AND Position BETWEEN @To AND @From

In this case, it will move the row at the @From position to the @To position, and shift all the other rows accordingly.
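To make the behavior concrete, here is a self-contained sketch; the #Attributes temp table, the Name column, and the sample data are my own assumptions, modeled on the query above.

-- Move the row at position 5 up to position 2 for a single ReferalID
DECLARE @Referal int, @From int, @To int;
SELECT @Referal = 1, @From = 5, @To = 2;

CREATE TABLE #Attributes ( ReferalID int, Name varchar(10), Position int );
INSERT INTO #Attributes VALUES (1, 'A', 1);
INSERT INTO #Attributes VALUES (1, 'B', 2);
INSERT INTO #Attributes VALUES (1, 'C', 3);
INSERT INTO #Attributes VALUES (1, 'D', 4);
INSERT INTO #Attributes VALUES (1, 'E', 5);

UPDATE #Attributes
SET Position = CASE Position
                   WHEN @From THEN @To
                   ELSE Position + 1
               END
WHERE ReferalID = @Referal
  AND Position BETWEEN @To AND @From;

-- Expected result: A=1, E=2, B=3, C=4, D=5
SELECT Name, Position FROM #Attributes ORDER BY Position;

DROP TABLE #Attributes;

Note that, as written, the query handles moving a row up the list (@From > @To); moving a row down would need Position - 1 and the BETWEEN bounds swapped.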

The scary thing is that I learned this in the Daily WTF.

time to read 6 min | 1134 words

I need to find the reference / reasoning for the following argument:

A single query using a join will be more efficient in server resources than several queries to get the same information.

For instance, consider this query:

SELECT Users.Name,
       ProductLicenses.StartDate,
       ProductLicenses.EndDate
FROM Users
JOIN ProductLicenses ON Users.Id = ProductLicenses.UserId
WHERE Users.Id = @UserId

Versus this:

SELECT Users.Name
FROM Users
WHERE Users.Id = @UserId

And then this:

SELECT ProductLicenses.StartDate,
       ProductLicenses.EndDate
FROM ProductLicenses
WHERE ProductLicenses.UserId = @UserId

I know that the first one should be more efficient, but I can't find the documentation for it. Server round trips are one consideration (although they can be mitigated with multiple result sets), but the concern here is the performance of the server only.

Note: I'm talking here about 2-4 table joins, tops, properly indexed.
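Lacking a definitive reference, one way to settle it for a specific schema is simply to measure both forms; here is a minimal sketch (reusing the queries above, with a hypothetical @UserId value) based on SQL Server's statistics output:

DECLARE @UserId int;
SET @UserId = 1; -- hypothetical value

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Option 1: single joined query
SELECT Users.Name, ProductLicenses.StartDate, ProductLicenses.EndDate
FROM Users
JOIN ProductLicenses ON Users.Id = ProductLicenses.UserId
WHERE Users.Id = @UserId;

-- Option 2: two separate queries
SELECT Users.Name
FROM Users
WHERE Users.Id = @UserId;

SELECT ProductLicenses.StartDate, ProductLicenses.EndDate
FROM ProductLicenses
WHERE ProductLicenses.UserId = @UserId;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Comparing the logical reads and CPU time reported for each statement gives a server-side answer for that particular schema and data.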

On Toads

time to read 1 min | 192 words

I don't like Oracle's tools. Today I had a chance to work with Toad a little bit, and I found that the UI is more cluttered than I am used to, but it has several features that I want for SQL Server Management Studio:

  • Color highlighting for columns and tables
  • Auto-correction of form to from (this is huge, I make this mistake all the time).
  • I can CTRL+Click on a table / procedure, and I'll get the definition window of that object. This means that I can quickly browse between stored procedures, tables, etc. Very useful. I would really like this in SQL Management Studio.
  • You can format the PL/SQL right from Toad. This is way cool, and I really wish I had something for T-SQL as well.

Does anyone know of tools that can provide similar functionality for SQL Server Management Studio?

time to read 7 min | 1347 words

The following tidbit caused quite a bit of trouble, and the loss of a day for two people. I'm working on a legacy system right now, and part of this work is refactoring stuff without breaking the behavior. The process is quite big, but we managed quite well until we suddenly started to get wildly different results. We traced it a dozen times, and eventually we narrowed it down to code that looked a bit like this (PL/SQL):

declare
    cursor offline_sales_cursor is
        select product_id, customer_id, price
        from offline_sales;
begin
    for sale in offline_sales_cursor loop
        begin
            insert into sales (product_id, customer_id, price)
            values (sale.product_id, sale.customer_id, sale.price);
        exception
            when DUP_VAL_ON_INDEX then
                update sales
                   set price = sale.price
                 where product_id = sale.product_id
                   and customer_id = sale.customer_id;
        end;
    end loop;

    commit;
end;

For the purpose of the discussion, sales has a primary key on product_id and customer_id (yes, it's bad design, but it's not a real db) and foreign keys to the products and customers tables. Can you guess what the problem was?

It turns out that we had an invalid customer id in the offline_sales table, which meant that this would run for a little bit, hit a foreign key error, exit the loop, and then commit part of the changes! When I refactored this bit, I changed it to two statements, one for updating and one for inserting, so the failure didn't write anything. I looked at the code several times, and only caught this by accident (the code base has commits spread all over the place). Urgh! Urgh! Urgh! [Gnashing of teeth].
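For reference, here is a minimal sketch of what the two-statement version could look like (the table and column names come from the snippet above; the actual refactored code isn't shown here, so treat this as an illustration of the approach, not the real fix): one set-based update for existing rows and one set-based insert for the rest, with nothing committed until both succeed.

update sales s
   set price = (select o.price
                  from offline_sales o
                 where o.product_id = s.product_id
                   and o.customer_id = s.customer_id)
 where exists (select 1
                 from offline_sales o
                where o.product_id = s.product_id
                  and o.customer_id = s.customer_id);

insert into sales (product_id, customer_id, price)
select o.product_id, o.customer_id, o.price
  from offline_sales o
 where not exists (select 1
                     from sales s
                    where s.product_id = o.product_id
                      and s.customer_id = o.customer_id);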

time to read 8 min | 1468 words

Clemens has a new post about O/R Mappers, and this time he brings up several relevant points that I agree with :-). He quotes a comment (by Scott E) from the previous post:

I've hiked up the learning curve for Hibernate (the Java flavor) only to find that what time was saved in mapping basic CRUD functionality got eaten up by out-of-band custom data access (which always seems to be required) and tuning to get performance close to what it would have been with a more specialized, hand-coded DAL.

This comment seems to assume that the hand-coded DAL will not need to be tuned, which I find hard to believe. Yes, there are times when you'll need to tune the O/RM where you would not need to tune the DAL, but the reverse is true as well. In general, O/R Mappers will generate better code than you'll write (I can't be smart on the 54th search SP that I write to do something a bit more complex than the code generator can handle).

Defining and resolving associations is difficult. 1:N is hard, because you need to know what your N looks like. You don't want to dehydrate 10000 objects to find a value in one of them or to calculate a sum over a column....

No, it's not hard at all. You just need to define the association, and the O/RM will take care of the rest. Both examples that he gives can be handled efficiently by an O/RM. Finding specific values in an association:

select blog from Blog blog where blog.Id = @id and count(blog.Posts.Comments) > 50

Getting the sum of a property (not column, mind you):

select sum(c.Credit) from Customer c where c.Id = @id

Both of these statements work on the logical representation of the class, not on the physical schema in the database. The second statement traverses two 1:N relationships without affecting performance. [That said, this is probably not the way to do this; filtering on the collection is a better way, and just as efficient.]

1:N is so difficult because an object model is inherently about records, while SQL is about sets. N:M is harder.

I just wrote a post on using Active Record with many to many associations. Check it out, it's not hard at all. Not even doing searching / filtering / aggregation on it.

[in relation to in memory cache] What do you do if you happen to find an object you have already loaded as you resolve an 1:N association and realize that the object has meanwhile changed on disk?

Um, doesn't this belong to my concurrency strategy? Either I use optimistic concurrency and throw when the row was updated, or I handle it beforehand. In general, the in-memory cache is short lived; it has a lifetime of a single request, so this isn't a problem. If it is, I can specify that the request will bypass the first level cache.

Another question is what the scope of the object identity is. Per appdomain/process, per machine or even a central object server (hope not)?

In NHibernate, the scope is the session you're working on. A session is similar to a DbConnection in many ways, so it is usually a short lived object.

Transactions are hard...
... If you are loading and managing data as object-graphs, how do you manage transaction isolation? How do you identify the subtree that's being touched by a transaction? How do you manage rollbacks? What is a transaction, anyways?

A transaction is a kind of a poisonous snake that will give you ACID stomach, I believe.

Yes, transactions are hard. No, I don't give them up when I'm using O/RM. I really don't care about identifying the subtree that the transaction is working on, that is the job of the O/RM and the DB, not the application developer. Here is how I'll manage a transaction:

using (ITransaction trans = session.BeginTransaction(IsolationLevel.Snapshot))
{
    //do stuff
    trans.Commit();
}

This will rollback automatically on exception, of course, but I can do it manually if I really want as well. What is the problem with that again?

Changing the underlying data model is hard.

No, it's not. All your programming is done against the logical layout of your objects, not against the physical schema. You want to change the data model? Just change the mapping, and you are done. There are some cases where you can't do this without changing the code, but those would usually require code modifications anyway.

Reporting and data aggregation is hard.

Reporting is one thing that I'll try to do in SQL, because the tools to do it are great, and because I'll often be working on a de-normalized database, rather than my production one. But aggregation is easy. The example that Clemens gives is XPath vs. DOM manipulation. And he is correct, if I had to do this stuff manually, but I don't. I get the O/RM to do this stuff for me (in the database, where those actions belong).

One thing that I'm not sure that I understood is:

O/R is object->relational mapping.
LINQ is relational->object mapping.
LINQ acknowledges the relational nature of the vast majority of data, while O/R attempts to deny it.

I'm not sure which O/RM he is talking about, but it's not one that I have used. An O/RM that denies the relational nature of the data just isn't going to work well. It seems to me that Clemens had a bad experience with O/RMs (maybe the C++ one that he mentions?) and didn't look into the current offerings in the field with enough depth. I just don't see many of the problems that he mentions as real problems unless you misuse the tools that you have.

That said, O/RMs are indeed a leaky abstraction, and you need to understand what is happening under the covers, but GC is a leaky abstraction as well, and it still seems like a great productivity tool.

time to read 1 min | 80 words

Udi Dahan has a post about Clemens' anti-O/RM post. I really liked his conclusion, and the reasoning behind it.

Beyond that, I wanted to comment about knowing SQL and utilizing an O/RM to the fullest. If you don't know what is going on under the covers, you won't be able to diagnose and fix performance problems. So using an O/RM doesn't free you from understanding DB design and understanding how SQL works.

time to read 3 min | 415 words

Did you know that you can have transactions on DDL as well? I ran into this more or less by accident, and it is very cool. Check this out:

BEGIN TRANSACTION

    CREATE TABLE Foo( Num int )

    INSERT INTO Foo VALUES (1)
    INSERT INTO Foo VALUES (2)

    SELECT * FROM Foo

ROLLBACK

SELECT object_id(N'Foo') as FooId

The results of this query are:

Num
-----
1
2

FooId
------
NULL

The CREATE TABLE statement was rolled back! I usually don't approve of using DDL in the usual course of things, but what really impressed me was this query:

CREATE TABLE Foo( Num int )
INSERT INTO Foo VALUES (1)
INSERT INTO Foo VALUES (2)

BEGIN TRANSACTION

    TRUNCATE TABLE Foo

    SELECT COUNT(*) FooCount FROM Foo

ROLLBACK

SELECT COUNT(*) FooCount FROM Foo

DROP TABLE Foo

The results of this query are even more impressive to me:

FooCount
----------
0

FooCount
---------
2

I find it amazing, period.

time to read 2 min | 250 words

I was playing with SQL Server snapshots today, and they are very cool. The idea is that you can freeze your database at a point in time, and always refer back to it. The problem is that it costs you in terms of performance, and it may cost dearly.

The way database snapshots are implemented is copy-on-write to sparse files: when you create a snapshot, a sparse file is created (a file that is filled with zeros, which the OS knows not to save to disk, so you don't waste space). Any time a page is about to be written, it's copied to the sparse file first, and then written to disk.

This means that you get double the I/O if you have a snapshot, but also that you get to see your database as it was at the point you created the snapshot. It's very cool, and reverting to the snapshot is very fast compared to restoring from a backup.
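For reference, a minimal sketch of creating a snapshot and reverting to it (the database name, logical data file name, and path below are hypothetical; the NAME must match the logical name of the source database's data file):

-- Create the snapshot; the .ss file starts life as an empty sparse file
CREATE DATABASE MyDb_Snapshot ON
    ( NAME = MyDb_Data, FILENAME = 'C:\Snapshots\MyDb_Data.ss' )
AS SNAPSHOT OF MyDb;

-- Revert the source database to the point in time the snapshot was taken
RESTORE DATABASE MyDb FROM DATABASE_SNAPSHOT = 'MyDb_Snapshot';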

One problem that I ran into is that I didn't consider the cost of multiple snapshots. At one point I had seven snapshots of the same database, and I was wondering why operations were taking so long (I had multiplied the I/O cost {which was significant to start with} by 8!).

Either way, this is a cool thing just technologically speaking. I've looked into implementing this as a stream in .Net, but it seems that there isn't a built-in way to set a region in a file to zero data, and I don't want to use P/Invoke to call DeviceIoControl().

time to read 1 min | 130 words

Marko from Mindworks just posted a MsSql2005 Dialect for NHibernate. This means that NHibernate will be able to take advantage of the new paging feature in SQL Server 2005.

Considering that I work extensively with SQL Server 2005 and NHibernate, this is a welcome addition.

Update: Currently there is no way to get it online, and I can't find the message in the archives, so here is the file. I suppose that it will shortly be part of official NHibernate.

Update 2: Changed the file extension so you can download it from IIS.

Update 3: You can now find it here.



time to read 9 min | 1710 words

Clemens has a post about O/R Mappers. From his post:

Another argument I keep hearing is that O/R mapping yields a significant productivity boost. However, if that were the case and if using O/R mapping would shorten the average development cost in a departmental development project by – say – a quarter or more, O/R mapping would likely have taken over the world by now. It hasn't. And it's not that the idea is new. It’s been around for well more than a decade.

O/R Mappers are hard to do without framework services such as reflection. Code generation can support this for languages without reflection, but it is harder to do, and much less flexible. I think that this is one of the reasons that ORM is only now getting a lot of notice from people (now being relative; Hibernate is very strong in the Java space, and has been for quite a while). I believe that the reason ORM adoption is slow is simply because Microsoft's stance on the matter was "use Stored Procedures and Result Sets (COM) / DataSets (.Net)" for a long time, and now it is "wait for DLinq."

O/R Mappers certainly do not take the complexity away, I still need to be aware of it, but it doesn't get shoved in my face every time that I need to do something that touches the database. Understanding what is happening in all layers is important in this, as in all other things. The major benefit of using O/R Mappers in your application comes when you realize how flexible you suddenly are. You need to make a query that you have not thought of before? Just ask the O/R to do it for you, there is no context switch while you add an additional method to your DAL, write the stored procedure, test them both, and then return to your code. Instead, you just tell the O/R mapper to give the data you want. This is very powerful, and it is a huge productivity enhancement.

Clemens goes on to say that we should treat our database as a service, and apply the following principles to it:

  • Boundaries are explicit => Database access is explicit
  • Services avoid coupling (autonomy) => Database schema and in-process data representation are disjoint and mapped explicitly
  • Share schema not code => Query/Sproc result sets and Sproc inputs form data access schema (aliased result sets provide a degree of separation from phys. schema)

I already have an explicit boundary. There is my code, and there is the O/R library. The O/R library exposes a set of classes / methods that I can use to query the database. How it does that, I really don't care. It's pretty clear when I'm accessing the database and when I'm not. The fact that I don't have CustomersDAL or SqlConnection in sight doesn't mean that the boundary is any less real.

Coupling between the in-process data representation and the database schema is not something I worry about. Why? Because my O/R of choice (NHibernate, in case you missed it) can handle pretty much every schema that I throw at it, and do it more quickly (and likely more efficiently) than I could. I'm not tied to a particular schema, and I don't care if my database is changed. All I need to do is update the mappings, and I'm done with it.

I'm not sure what the difference between the second and third points is, since to me it seems like the same statement repeated with a different focus. I don't care about separation from the physical schema in the database layer, because I've already got that separation by using an O/R mapper.

Every class of data items (table) surround special considerations: read-only, read/write, insert-only; update frequency, currency and replicability; access authorization; business relevance; caching strategies; etc. 

Yes, indeed, I fully agree with Clemens on this one. But I just don't see the relevance of this to the O/R vs. custom DAL argument. I can define all of that with an ORM more easily than I could if I were writing my own DAL.

Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic.

I sort of agree with this, and then I sort of don't. What does Clemens suggest here? Write Stored Procedures and Custom DALs? For what purpose, exactly? The API that I'll get from a well designed DAL is likely to be very similar to the API I get now (for much less work and effort) by using an ORM.

One thing that Clemens said that is likely to cause problems is:

Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don't know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so.

Well, I know SQL, and I think that I know it pretty well. I am not a guru by any means, but I can use the tools that SQL gives me to get the correct result back. I still love O/R Mapping, not because I have no interest in SQL or RDBMS, but because they are grunt work. I enjoy working with SQL to solve complex problems, and I can make SQL Server cry and beg for mercy, but doing this for everyday work? Why on earth would I want to do that? I don't do memory management on my own anymore, and for much the same reasons. It's boring, it's easy to get wrong and it's just not relevant to what I'm trying to do most of the time.

I would like to end this with a story about explaining the cost of developing a new feature to a co-worker. (He didn't work on the project, by the way.) He just couldn't believe my estimates; the conversation went something like:

Me: Well, I guess this thingie will take us about a day.
Co-Worker: Are you sure? You haven't done anything in this area yet.
Me: Yes, we need to build all the UI from scratch, which is why it will take so long. It is mostly showing the data to the user, but there is some complex UI stuff we need to do here.
Co-Worker: But you don't have anything that get the data!
Me: So, it's just querying the database for it?
Co-Worker: But you haven't written anything to do it yet.

.. a couple of minutes go by where I just can't understand what he is talking about

Co-Worker: Look, you have not done this, and you didn't allocate any time for it, how can you keep your estimate?
Me: Like this [notepad] Container.Repository<EmployeeContract>().FindWhere(Expression.Eq("Employee",employee)) [/notepad], this will fetch us all the contracts for the employee, and we can then show them to the user, who can then do XYZ.
Co-Worker: But it doesn't get you Salary and Bonuses for this contract.
Me: They are loaded when we load the contract, we don't need to take care of it.

(EmployeeContract, Salary, and Bonus are fictional, of course, but the idea was that I needed to pull the data from 4-5 tables for this page.) The problem in this conversation was that I already had the infrastructure to pull the data from the database with great flexibility; I don't worry about how the database is structured, I let the O/R mapper take care of this.
