time to read 1 min | 168 words

I was asked a question today, and I'm not sure what the answer is. The issue is optimistic concurrency on the web. I have an edit page that allows a user to edit an object, and I want to throw a concurrency exception if a second user changed the object while the first user was editing it.

The issue is that when the form values come back after the user finishes editing, I no longer know which version of the object the user edited. It looks like I would need to store the object's version (most likely a SQL Server row version) in the form as well, then get the object from NHibernate, set the version property (and all the rest of the fields), and try to save, to see if there was a concurrency violation.
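Something along these lines is what I have in mind. To be clear, this is only a sketch: the Page class, its integer <version> property, and the form handling are made up for the example (a SQL Server row version would be mapped differently, but the idea is the same), and it uses a detached instance instead of re-loading the object:

using NHibernate;

// Sketch only: Page and its properties are hypothetical. The edit form posts back
// the Id and the Version value the user originally loaded, as hidden fields.
public void SavePage(ISession session, int id, int postedVersion, string title, string body)
{
    Page page = new Page();
    page.Id = id;
    page.Version = postedVersion; // the version the user actually edited
    page.Title = title;
    page.Body = body;

    // Update() re-attaches the detached instance; on Flush() NHibernate issues
    // UPDATE ... WHERE Id = ? AND Version = ? and throws StaleObjectStateException
    // if another user changed the row in the meantime.
    session.Update(page);
    session.Flush();
}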

Any other ideas that I have not thought of?

time to read 22 min | 4330 words

If there is one thing that NHibernate (and most other O/RMs) are not very good at, it is bulk data operations. What does that mean? Well, let's take the simple case of wanting to delete all the records in a table (I'm going to use the Active Record Blog Sample, since it is much simpler). Take this, for instance:

Post.DeleteAll();

This will delete all the rows from the Posts table. Really easy to write, but what does it actually do? What goes on under the covers is something like:

1. Load all the posts
2. For each post:
   1. Resolve the relationships between the post and other objects.
   2. According to the cascade options specified, delete / update all related objects (possibly recursing into their children, etc).
   3. Delete the post.

This is done this way because NHibernate takes charge of maintaining the integrity of the database (and because you don't want to get foreign key errors). In this case, it will make sure that all the comments for all the posts are deleted as well. In general, this is a Good Thing, since usually you don't deal with massive amounts of data that need to be modified at once.

This is simply horrible when you start to deal with many records that you want to change at once. Case in point: I load several thousand objects into the application, process them, and then delete some of them. There is no order to the way I load the objects, and there is no order in which they will be deleted.

I am operating in batches, and a single batch may delete 10,000 rows. Because of the behavior described above, this results in 10,000 queries in the unoptimized version. I think you can guess where my bottleneck was. Now, there are quite a few ways I can mitigate this. I could push all the updates into one Session (Unit of Work) and let NHibernate batch them appropriately, but even when batched, we are still talking about 10,000 queries. Sure, each one is just "DELETE FROM TableName WHERE Id = @p0", but that is still way too many queries for something that should happen many times a second.

Actually, since the work is being done on multiple threads, there are even more problems, since NHibernate's Session is not thread safe. A problem, isn't it?

NHibernate's documentation warns against doing this kind of stuff, for the reasons I have just listed. In this case, NHibernate chose the approach of safe & slow over fast & wrong. I agree with this approach wholeheartedly, but it is not doing me any good in this case.

Now, let us think about the problem first, and see what I want to happen. I want to be able to issue a delete statement like this:

DELETE FROM TableName
WHERE Id IN (@p0, @p1, @p2 ... @p9999)

I just tested it, and I can verify that the IN operator in SQL 2005 can take over 200,000 items, so I'm good to go, am I not? Not really: the query size gets big fast, and it's not a terribly efficient way to do this. Besides, I don't want to end up on The Daily WTF.

Here is what I ended up with; it is a two-part solution. One part, in the database, is composed of a table and a procedure:

CREATE TABLE [dbo].[HelperTable_IdMarkedForDeletion](
      [Id] [int] NOT NULL PRIMARY KEY CLUSTERED
)
GO

CREATE PROCEDURE [dbo].[ClearIncomingFromRowsMarkedForDeletion]
AS
      BEGIN TRANSACTION

      DELETE FROM TableName
      WHERE Id IN (SELECT Id FROM HelperTable_IdMarkedForDeletion)

      TRUNCATE TABLE HelperTable_IdMarkedForDeletion

      COMMIT
GO

This actually gives the Query Optimizer a chance to optimize the query and produce an efficient result. Now, to the second part, which is the code. I just created this class:

using System;
using System.Data;
using System.Data.SqlClient;

public class BulkDeleter<PkType, ItemType>
{
    DataTable table;
    string tempTable, procName;
    // Extracts the primary key from an item that is registered for deletion.
    Func<ItemType, PkType> getPk;

    public BulkDeleter(string tempTableName, string procName, Func<ItemType, PkType> getPk)
    {
        this.tempTable = tempTableName;
        this.procName = procName;
        this.table = new DataTable(tempTable);
        this.table.Columns.Add("Id", typeof(PkType)).Unique = true;
        this.getPk = getPk;
    }

    // Stores the item's primary key locally; nothing is sent to the database yet.
    public void RegisterForDeletion(ItemType item)
    {
        lock (table)
        {
            table.Rows.Add(getPk(item));
        }
    }

    // Bulk copies the accumulated keys into the helper table, then runs the
    // stored procedure that deletes the matching rows and clears the helper table.
    public void PerformDelete(IDbConnection connection)
    {
        lock (table)
        {
            using (SqlBulkCopy bulkCopy = new SqlBulkCopy((SqlConnection)connection))
            {
                bulkCopy.DestinationTableName = tempTable;
                bulkCopy.WriteToServer(table);
            }

            table.Rows.Clear();

            using (IDbCommand command = connection.CreateCommand())
            {
                command.CommandText = procName;
                command.CommandType = CommandType.StoredProcedure;
                command.ExecuteNonQuery();
            }
        }
    }
}

On construction, it gets the table and procedure names, and a delegate that extracts the primary key from the items that are registered for deletion. Internally, it manages a DataTable of primary keys; any time you register an object for deletion, it stores its primary key in the DataTable. When you decide to perform the delete, it bulk copies the content of the DataTable to the server, and then executes the stored procedure.
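To give a feel for how it is used, here is a rough usage sketch; the Post class, its integer Id, the postsToDelete list, and the connection string are all assumptions for the example:

// Hypothetical usage; the names here are made up.
BulkDeleter<int, Post> deleter = new BulkDeleter<int, Post>(
    "HelperTable_IdMarkedForDeletion",
    "ClearIncomingFromRowsMarkedForDeletion",
    delegate(Post post) { return post.Id; });

foreach (Post post in postsToDelete)
    deleter.RegisterForDeletion(post);

using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    deleter.PerformDelete(connection);
}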

Bulk copying is very fast, and there is nothing like a stored procedure for handling bulk data operations. It is hard to measure the timing correctly, since the work was spread across multiple threads, but I know that I reduced the number of queries the database has to deal with from thousands to two.

As a note, if I were working on a project that required bulk data operations in more than just one or two cases, I would probably have created a custom persister for NHibernate, which would be able to make more assumptions about my model than NHibernate can. Right now this is a single scenario that I need to address, and this was the simplest thing that solved the problem (elegantly).
    time to read 2 min | 374 words

There always seems to be a great debate with regard to what data type should be used as the primary key for tables. The debate is mostly centered on whether we should use identity fields or Guids.

Guids have the advantage that they are globally unique (duh!), so they are a good solution for cases where you need to merge data from several sources. Identity fields are easier to understand (quick, what is the difference between 8833075D-8861-4324-8814-421BD5F04C7D and AD7FF558-DFCB-4354-9D68-C1DD926BC22A?) and to debug, since you can easily see the difference between consecutive rows. They can cause problems in merge scenarios, though, since it is entirely possible that the identity columns from different machines will be identical.

Another, bigger problem with Guids is that they are random by nature, which means that using them as a clustered primary key has distinct disadvantages in high insert rate scenarios. The problem is that two consecutive rows will end up on two different pages (assuming an even distribution of the Guids), and in high insert scenarios the I/O this generates can become the most significant cost of the whole operation. SQL Server will constantly load pages into memory and discard them when it runs out of room in the cache.

In SQL Server 2005 there is a better solution for this, although I have not heard anyone talk about it. Instead of random Guids, you can use sequential Guids. What does this mean? SQL Server 2005 has a new function for creating Guids, called newsequentialid(). It generates sequential (per machine) Guids. You get vastly improved performance in high insert rate scenarios, since all your inserts are likely to go into a small set of pages, and you can still safely merge between multiple machines, since sequential Guids remain globally unique.

    Here are a couple that I generated: 13D3AFD4-CDCA-DA11-9389-000C6EFB7322 and 14D3AFD4-CDCA-DA11-9389-000C6EFB7322

One thing to note, though: don't use sequential Guids as a session id, since an attacker can then predict the next id that will be generated (obviously) and try to use it.

    time to read 3 min | 515 words

Enough bullshitting, let's start with the following scenario: we have an application that needs a database (a rare event, as you can imagine).
We have three tiers of users here: normal users can view data, editors can change data, and administrators can do whatever they want. It's a simple model, but it is enough for this example. Here are several security threats to the application, depending on the way it was implemented.
A hacker gains access to the database and is able to send any query he wants to the server.
If we were using the Stored Procedures (SP) approach, he doesn't have access to the tables, and can only call stored procedures, whose names he doesn't know.
If we were using the Parameterized Queries (PQ) approach, he has access to the tables (based on the user's role), but he doesn't know their names.

Next step for the hacker is to start guessing. In both approaches, he is likely to try to use the rich metadata capabilities of the server to find out what the database structure looks like. Because the database is well administered, the attempt will fail (in both cases), since the application user will not have access to tables such as sys.objects, etc.

    After realizing that the easy path is out, the hacker starts guessing stuff like:
    (SP): sp_AddEmployee, AddEmployee, AddEmp, sp_InsertEmployee, etc...
    (PQ): SELECT * FROM Employees, SELECT * FROM Emps, SELECT * FROM Ovdim (yuck!), SELECT * FROM Subordinates, etc...

(So far he is only limited by the user's permissions; we assume that the administrator is not going to check the logs over the weekend.)
He successfully manages to map the following items (using the errors he gets from the database) and their schemas:
(SP): AddEmployee, RemoveEmployee, UpdateEmployee, GetEmployee - stored procedures; only GetEmployee is available to the user he is running under
(PQ): Employees table - only SELECT permission.

Using PQ here gives the attacker easier access to the Employees table, since all he needs to do is SELECT * FROM Employees.
The SP approach requires a more complex query, but basically it's something like:
WHILE 1 = 1
BEGIN
    EXEC GetEmployee @current
    SET @current = @current + 1
END

Not that different, and it is likely that a matching ListEmployees procedure will exist (and that the attacker will have access to it).
Now our nefarious attacker has all the employee data, and he decides to bump all the secretaries' salaries to 150% of their current value.
Using PQ, he issues the following query:
UPDATE Employees SET Salary = Salary * 1.5 WHERE Title = 'Secretary'
Using SP, he has a lot more work to do, but it's basically the same thing, only procedurally:
-- loop over the employees fetched earlier
IF @title = 'Secretary'
    EXEC UpdateEmployee ...

SP: He gets an error because he doesn't have permission to execute the procedure.
PQ: He gets an error because he doesn't have permission to update the table.
     
     Repeat for inserts / deletes.  Different methods, same end result.
     
I don't see how using Stored Procedures leads to inherently better security in this case.

    time to read 2 min | 249 words

You do not have the resources to do it as effectively as the existing tools do, in most scenarios. There is a reason NHibernate has over 100,000 lines of code. The simplest O/RM just maps from columns to properties; that is repetitive work that can easily be done by a code generation tool.

    Then you realize that you need:

    • Transactions
    • Concurrency Control
    • Caching
    • Handling of associations between objects - This is a can of worms all by itself, there are quite a few of those, and you are likely to want to use all of them.

The above are the bare necessities for an O/RM or a Data Mapper layer. I have a problem with them because they are:

    1. Not trivial
    2. Solved problems

And we haven't even mentioned performance. I have a list of 25 bullet points of different (sometimes overlapping) features that I am using in NHibernate right now. None of them are esoteric by any means. It's not a weekend project, not if you are going to do it right.

    time to read 1 min | 157 words

    For every complex problem, there is a simple, elegant solution. It is also wrong.

Sahil is talking about the pitfalls of DLinq, how a seemingly minor change in the code created a drastically different query with drastically different performance implications. Just to note, this is not unique to DLinq; it's something that exists in NHibernate, or any other O/RM on the market.

O/RM is a leaky abstraction at best; I have said it before, and I will probably say it many times in the future. This means that you need to know what is going on under the hood. The good thing is that you don't have to be under the hood all the time; most of the time, you can program while being blissfully unaware of what is going on. Just make sure that whatever technology you use doesn't have the hood welded shut.

    time to read 3 min | 495 words

Eber Irigoyen had this to say about my post on SPs:

    ...the point is how much scalability, flexibility, security, etctirity the SP give you, remember the rule of thumb, always use the right tool for the job, use what will give you the most benefits in the long run

I completely disagree with this, if you haven't caught on so far. Here is my rebuttal:

    • Scalability:
      • There are no performance advantages to using SPs over parameterized SQL.
      • Scaling the database is much harder than scaling an application.
      • Note: I implemented a caching web service once that simply cached data for a certain period of time, saving the poor database quite a bit of work. Deploy a bunch of those, and you have seriously reduced the amount of work the database has to do.
    • Flexibility:
      • Stored Procedures are... well, procedures. This means that you have all the advantages and disadvantages of procedural programming. In general, I wouldn't call it very flexible. Changing a procedure's internals without breaking clients is possible within a certain set of changes, but major stuff is hard.
      • A search procedure should be flexible; you should be able to search by any combination of parameters. How easy is it to write such a procedure? (See the sketch after this list.)
    • Security:
      • Not applicable in situations where you don't give users direct access to the database, in my opinion. There are things to be said about defense in depth, but I don't think they should be carried to the point where you are forced to give up significant advantages while keeping the security of the system (using views / triggers instead of procedures, for instance, can result in a situation just as safe, and far easier for the programmers to work with effectively).
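To illustrate the search point, here is a sketch of what the parameterized SQL side can look like; the Employees table and the name / title filters are made up for the example. Doing the same inside a single stored procedure usually means a pile of optional parameters and IF/ELSE branches, or dynamic SQL inside the procedure.

using System.Data;
using System.Text;

// Sketch: search by any combination of criteria, built as parameterized SQL.
public static IDataReader SearchEmployees(IDbConnection connection, string name, string title)
{
    IDbCommand command = connection.CreateCommand();
    StringBuilder sql = new StringBuilder("SELECT * FROM Employees WHERE 1 = 1");

    if (name != null)
    {
        sql.Append(" AND Name LIKE @name");
        IDbDataParameter nameParam = command.CreateParameter();
        nameParam.ParameterName = "@name";
        nameParam.Value = "%" + name + "%";
        command.Parameters.Add(nameParam);
    }
    if (title != null)
    {
        sql.Append(" AND Title = @title");
        IDbDataParameter titleParam = command.CreateParameter();
        titleParam.ParameterName = "@title";
        titleParam.Value = title;
        command.Parameters.Add(titleParam);
    }

    command.CommandText = sql.ToString();
    return command.ExecuteReader();
}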
    time to read 2 min | 383 words

When I'm working with clients, I hear quite a few that are adamant about "Only access the database through Stored Procedures."

    They list security and performance as their top reasons, as well as the all important "we always did it this way."

I am on the completely opposite side; I want to hand over my database querying to an engine that can take a reasonable request and do something with it. For instance, transform it into a SQL statement and give me the results from the database without me having to write the SQL code. It is just as performant, because there isn't any difference between the caching for stored procedures and parameterized SQL. In fact, one can argue that the query plans for parameterized SQL can be optimized better than the ones for stored procedures, since the optimizer is free to recompile the plan based on the number of rows the query is expected to affect.

The issue of security doesn't bother me, because I'm not giving access to my database to any Joe User; I'm giving access to the application, which (presumably) I wrote and I trust. SQL Injection is not an issue, of course.

Considering that an evil developer could just as easily exploit the SP layer to do whatever they want in the database (assuming they aren't the ones writing the stored procedures, of course), I don't see any point in not trusting the application. If this is a database that serves several applications, I would argue that the database is not the correct place to put this abstraction, and that a web service would be much better.

I believe that this tendency to use stored procedures as the default mode of operation stems from the days when client/server systems were all the rage. In those cases, you often had to give the users access to the database, and a knowledgeable user would be able to go in and change whatever they wanted. In that scenario, SPs are the path of least resistance for securing the application.

These days, I'm not familiar with many classic client/server apps that talk directly to the database, and I would argue vehemently against this kind of architecture. Under those circumstances, I fail to see what the advantages of using stored procedures are.

    time to read 8 min | 1410 words

I was at the Israeli Security User Group meeting today and I met with Justin, who mentioned that his favorite tough question right now is: "Can a DDL Trigger in SQL Server 2005 prevent dropping a DDL Trigger?"

It turns out that it can, but there is a catch. A DDL trigger is not called if the trigger being dropped is the trigger itself. This means that to implement this functionality we need something like:

CREATE TRIGGER Prevent_Drop_DDL_Trigger
ON DATABASE
FOR DROP_TRIGGER
AS
      IF PATINDEX('%on database%', EVENTDATA().value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]','nvarchar(max)')) > 0
      BEGIN
            PRINT 'No deleting of DDL Triggers!'
            ROLLBACK
      END
GO

CREATE TRIGGER Prevent_Drop_DDL_Trigger_Second
ON DATABASE
FOR DROP_TRIGGER
AS
      IF PATINDEX('%on database%', EVENTDATA().value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]','nvarchar(max)')) > 0
      BEGIN
            PRINT 'No deleting of DDL Triggers!'
            ROLLBACK
      END
GO

    Now, let's try dropping one of the triggers:

drop trigger Prevent_Drop_DDL_Trigger on database

This will call Prevent_Drop_DDL_Trigger_Second, which will fail the query. Prevent_Drop_DDL_Trigger is not called for its own drop, by the way.

Be aware that this means that in order to remove these triggers you need to ALTER one of them, DROP the second one, and then DROP the first one. It's possible to get into a situation where the two triggers don't allow a DROP or ALTER on either one, which can potentially leave you with triggers that you cannot remove without dropping the database.

    I would classify this behavior as Evil (TM).

     

    time to read 3 min | 478 words

It is common for people who are starting to use an O/RM to get into a situation that they don't know how to solve using the O/RM (or aren't even aware that a solution is possible). The easiest example is loading just the Id and Description of a certain class without loading all the (possibly expensive) properties.

There are several ways to do this using most O/RMs, but they are usually not obvious to the beginner. This is usually a pain point, since they can easily see how they would do it using plain SQL, but they either don't know how to use the tool appropriately, or they just don't think it is possible using the O/RM. The immediate solution is to just use plain SQL (or an SP) to solve this one problem.

That is wrong, because it pokes big holes in the model you show the rest of the system. It means that the model most of the system is using doesn't match the model used in those special cases.

This is bad because you lose one of the key benefits of using an O/RM: abstracting the database. A good O/RM will always let you work against the object structure, not the data structure. This is important because it frees you from dealing with a potentially ugly schema and lets you work with a good model all the time.

In nearly all cases, you can use the options the O/RM gives you and get what you need without dropping to SQL. That is not to say that SQL isn't important; it is very important, because O/RMs tend to be leaky abstractions. It is just that you shouldn't reach for the familiar tool the moment you hit a problem without an obvious solution.
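For instance, with NHibernate an HQL projection will happily bring back just the columns you care about; the Post entity and its properties here are made up for the example, and session is assumed to be an open ISession:

// Sketch: load only Id and Title, without materializing full Post entities.
IList rows = session.CreateQuery("select p.Id, p.Title from Post p").List();
foreach (object[] row in rows)
{
    int id = (int)row[0];
    string title = (string)row[1];
    // build a lightweight lookup / DTO object from the two values
}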

What I'm more afraid of is that using SQL to solve the immediate problem will start to spread through the code, since "it is already the way we did it in XYZ". And that is a whole new can of worms. Letting this spread means that you effectively have two data access layers in a single application for the same model. I trust that I don't need to explain why this is bad. Just the synchronization between the two will be a pain, and considering the way the SQL access layer is likely to appear (growing from quick solutions to problems), I am willing to bet that it is not going to exhibit good design (at any level).

If you really need to use both, make sure that you investigate the options for solving the problem using the O/RM first, and then make sure that this solution goes through the front door with all the bells and whistles (design, tests, code).
