Transactions and concurrency
I just had a head-scratching, hair-pulling, what-the-hell-is-going-on bug. Basically, a classic producer / consumer issue. Each work item is composed of several sub-items, which should always be processed as a single unit (exactly one time). The problem was that I suddenly started to get duplicate processing of the same work item, but with different sub-items each time.
I went over the code with a fine-tooth comb and couldn't see anything wrong. I investigated the database, and everything was fine there as well. I knew I was in trouble when I considered going down to the SQL Server protocol level to check whether network problems were somehow causing the issue.
Here is a simplified version of the producer (and yes, I would never write this kind of code for production, test, or anything but a short and to-the-point demo). As you can see, it merely inserts batches of 500 records into a table.
private static void Producer()
{
    SqlConnection connection = new SqlConnection(connectionString);
    int count = 0;
    while (true)
    {
        connection.Open();
        SqlTransaction sqlTransaction = connection.BeginTransaction(isolationLevel);
        // insert a batch of 500 rows, all sharing the same count, in a single transaction
        for (int i = 0; i < 500; i++)
        {
            SqlCommand sqlCommand = connection.CreateCommand();
            sqlCommand.Transaction = sqlTransaction;
            sqlCommand.CommandText = "INSERT INTO t (Id) VALUES(@p1)";
            sqlCommand.Parameters.AddWithValue("@p1", count);
            sqlCommand.ExecuteNonQuery();
            sqlCommand.Dispose();
        }
        sqlTransaction.Commit();
        Console.WriteLine("Wrote 500 records with count " + count);
        count += 1;
        connection.Close();
    }
}
And here is the consumer, which reads from the table and ensures that it always sees complete batches of 500:
private static void Consumer()
{
    SqlConnection connection = new SqlConnection(connectionString);
    while (true)
    {
        connection.Open();
        SqlTransaction sqlTransaction = connection.BeginTransaction(isolationLevel);
        SqlCommand sqlCommand = connection.CreateCommand();
        sqlCommand.Transaction = sqlTransaction;
        sqlCommand.CommandText = "SELECT COUNT(*) FROM t GROUP BY id";
        SqlDataReader sqlDataReader = sqlCommand.ExecuteReader();
        if (sqlDataReader.RecordsAffected != -1)
            Console.WriteLine("Read: {0}", sqlDataReader.RecordsAffected);
        while (sqlDataReader.Read())
        {
            int count = sqlDataReader.GetInt32(0);
            Console.WriteLine("Count = {0}", count);
            // every id was inserted exactly 500 times in a single transaction,
            // so anything other than 500 means we saw a partial batch
            if (count != 500)
                Environment.Exit(1);
        }
        sqlDataReader.Dispose();
        sqlCommand.Dispose();
        sqlTransaction.Commit();
        connection.Close();
    }
}
Note that I have highlighted the isolationLevel in red in both. Here is the code that runs both methods:
private static void Main()
{
    Delete();
    new Thread(Producer).Start();
    new Thread(Consumer).Start();
}
If I set isolationLevel to ReadCommitted or RepeatableRead, this consistently fails within a couple of seconds: the consumer manages to do a partial read of the records that the producer inserted. If I set the isolationLevel to Serializable or Snapshot, it behaves as expected.
I may be missing something, but I would expect ReadCommitted to only allow the second transaction to read... well, committed rows. Just to point out, the real scenario doesn't involve aggregation, and I am seeing the same issue there. I suppose that the new records were committed (and thus made visible) after the query had already scanned that part of the table, and thus it missed some of the rows, while Snapshot and Serializable force either a static image all the way through or waiting until the end.
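As an aside, Snapshot isolation is off by default in SQL Server 2005 and has to be enabled at the database level before IsolationLevel.Snapshot can be requested. A quick sketch of doing that (the helper name and the database name are placeholders):
private static void EnableSnapshotIsolation()
{
    using (SqlConnection connection = new SqlConnection(connectionString))
    {
        connection.Open();
        SqlCommand sqlCommand = connection.CreateCommand();
        // 'MyDb' is a placeholder for the actual database name
        sqlCommand.CommandText = "ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON";
        sqlCommand.ExecuteNonQuery();
    }
}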
Any comments?
Comments
Oren, what is the actual failure?
Just not sure what you mean by 'fails in a couple of seconds'.
Cheers,
Stuart
Deadlock probably?
@Stuart,
I mean that it fails by giving a result that is not 500.
This means that the consumer is able to see only a partial result from the other transaction.
@Andrew,
There is no deadlock.
It is just that the second transaction is able to see partial results from the first transaction.
Read this article: http://sqlblogcasts.com/blogs/tonyrogerson/archive/2006/11/16/1345.aspx
I think this has your answer.
Cheers,
Stu
Interesting issue. Replicating your code, I get the same results, and the behavior seems clearly wrong. From subjective testing it appears to be some kind of race around the row/page locks being acquired and released between the two, as you can make it work cleanly by changing the Consumer statement to:
SELECT COUNT(*) FROM t WITH (TABLOCK) GROUP BY id
While it seems that technically the transaction is working correctly, this is a bit of a trap for the unwary regarding the consistency of the read results.
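Applied to the consumer above, that would mean something along these lines (just a sketch; TABLOCK takes a shared lock on the whole table for the duration of the statement, so the scan cannot interleave with a committing batch):
SqlCommand sqlCommand = connection.CreateCommand();
sqlCommand.Transaction = sqlTransaction;
// the TABLOCK hint makes the scan take a shared table lock, which blocks
// concurrent inserts until the statement completes
sqlCommand.CommandText = "SELECT COUNT(*) FROM t WITH (TABLOCK) GROUP BY id";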
@Stuart,
That was what I was thinking, but this is plain ugly.
I expected that either all the new rows or none of the new rows would be visible to the transaction.
This makes me wonder what is the point of read committed in this scenario.
@Jeremy,
I understand what is going on; it is basically scanning the table and finding newly committed rows, but skipping some of the rows committed in the same transaction.
It looks like exactly the wrong thing to do as far as I am concerned, certainly a very nasty trap.
I guess that is why Oracle devs are laughing at SQL Server, and why there is the new Snapshot level.
I haven't tried this yet, but as far as I remember, neither the Read Committed nor the Repeatable Read isolation level will prevent you from reading phantoms, which might be part of the issue.
However, I'm also not clear as to why the consumer is expecting to get a count of 500... since the select query groups by the ID. To be consistent, shouldn't the consumer also loop asking for the count of records for a specific id?
@Tomas,
This isn't a case of phantom rows, at least not as I understand them. There isn't any deletion done at any point in time.
The producer inserts 500 rows per transaction, the consumer reads the counts per id.
If there were true isolation, it would always see 500 records per id; the issue is that it manages to see a partial result and gets a count that is different from 500.
@Ayende: As I understand them, phantoms apply to both added as well as deleted rows. But I'll be the first one to say I'm not sure if that's the case here!
Does anybody actually know how the isolation levels work? Does the guy who wrote it even know?
@El,
Yes, they do.
That doesn't mean that it is easy to figure out.
Like most concurrent stuff, this is hard.
@Ayende,
Whilst I'm probably stating the obvious, it is interesting that you can change the semantics of the select using table hints (HOLDLOCK), so that whilst the INSERT is using ReadCommitted, the SELECT can use the consistency of the SERIALIZABLE locking semantics.
I did try this, and it never failed.
Also, the smaller the size of the insert transaction, the less frequently it fails.
Cheers,
Stuart
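For reference, the HOLDLOCK variant applied to the consumer query would look something like this (again, just a sketch):
// HOLDLOCK gives this statement serializable locking semantics,
// even though the surrounding transaction is only ReadCommitted
sqlCommand.CommandText = "SELECT COUNT(*) FROM t WITH (HOLDLOCK) GROUP BY id";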
SQL Server 2005 currently supports 6 different isolation level behaviours, and only 2 of them (Snapshot and Serializable) prevent phantom reads. About your code, are you sure the actual queries don't have any WHERE clause?
They definitely have a where clause.
I thought that phantom reads were for records that were deleted, not partially created.
What this shows is that some of the information is made available to the second transaction, but not all of it.
I took the following from the definition of the Phantom phenomenon in the SQL standard (actually I took it from SQL-99, but I recall it as being more or less the same in SQL-92):
P3 ("Phantom"): SQL-transaction T1 reads the set of rows N that satisfy some <search condition>.
SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the <search condition> used by SQL-transaction T1. If SQL-transaction T1 then repeats the initial read with the same <search condition>, it obtains a different collection of rows.
I don't have an issue with that; I have an issue with getting a partial view of the results of T2.
Incidentally, while looking for the SQL-92 standard, I found this very interesting paper from Microsoft Research:
ftp://ftp.research.microsoft.com/pub/tr/tr-95-51.pdf
It is the theoretical foundation for the Snapshot isolation level, the feature introduced in SQL Server 2005. Although they probably didn't invent it, it is quite amazing to discover that the time to market from their R&D awareness to the moment it was released in their flagship database product was more or less 10 years.
I just wrote a post about this issue:
http://ale.riolo.co.uk/2007/04/transaction-and-concurrency-on-sql.html
I found out that when using a table with a primary key (I used a surrogate key through an identity column), everything works as expected.
I have run into this issue on production, in a table that most definitely had a PK (identity).
An identity PK (clustered) basically sorts the table, so the scan would tend to be linear. The code I had in production used a WHERE clause that filtered by several columns, thus probably producing a non-linear scan.
Ayende,
Have you tried to run your code with the READ_COMMITTED_SNAPSHOT db option set to ON? This keeps a row version for rows that are affected by the statement in the transaction, and will present those versions of the rows instead of doing locking. I would imagine that this setting (which is OFF by default) would create consistent reads/writes while not impacting concurrency too much.
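For what it's worth, that option is a database-level setting; turning it on would look something like this (the database name is a placeholder, and WITH ROLLBACK IMMEDIATE is there because the option can't be changed while other connections to the database are active):
using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();
    SqlCommand sqlCommand = connection.CreateCommand();
    // 'MyDb' is a placeholder; ROLLBACK IMMEDIATE drops other open
    // connections so the option can be switched
    sqlCommand.CommandText =
        "ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE";
    sqlCommand.ExecuteNonQuery();
}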
Is it the same as SNAPSHOT isolation?
Semantically, they're the same. From what I understand, the SNAPSHOT isolation level is effective at the transaction level (i.e. the snapshot is created when the transaction starts), whereas READ COMMITTED with the snapshot option is at the statement level (i.e. the snapshot is created before the statement is executed).