Riddle me thisFile access & Multi threading
What would be the output of the following code?
var list = new List<Task>(); var fileStream = File.Open("test", FileMode.Create, FileAccess.ReadWrite, FileShare.None); for (int i = 0; i < 150; i++) { var index = i; var task = Task.Factory.StartNew(() => { var buffer = Encoding.ASCII.GetBytes(index + "\r\n"); fileStream.Write(buffer, 0, buffer.Length); }); list.Add(task); } Task.WaitAll(list.ToArray()); fileStream.Flush(true); fileStream.Dispose();
Please note that:
- What order the data is saved is not important.
- Preventing data corruption is important.
Can you guess?
More posts in "Riddle me this" series:
- (12 Oct 2010) File access & Multi threading
- (06 May 2010) String outputs
- (02 May 2010) Hard links & file moves
Comments
fileStream.Write(buffer, 0, buffer.Length) can be interrupted and you end up with corrupted data on disk ?
write to disk should be serialized ?
Index variable looks as immutable.
Also file flashed at the end of all tasks.
So the question is -
can something like MemoryStream (some underground stream wich are used for buffering) support simultaneous write access?
I don't know exactly but as I remember Console.Output allow writing from different threads.
So we probably receive mess of numbers.
I’m pretty sure the file position will get corrupted, meaning that the code will probably overwrite some data (although it may take far more than 150 iterations for the error to manifest), which the CLR may detect and then throw an IO Exception. However, I don’t see the point of potentially launching 150 threads (the thread pool will inject more threads when they block, which is the case with Write since all IO on windows is asynchronous). You should use continuations to serialize the access to the underlying stream and, if you think the cost of asynchronous IO justifies itself (it involves a kernel APC and more context switching), use TaskFactory.FromAsync to build the tasks.
You will occasionally get the incorrect sequence of binary bits written to the file because multiple threads are writing to the file stream at the same time.
To a computer or human interpreting the binary data in the resulting file as ASCII, the symptoms would be missing characters or incorrect characters.
Since the code block which is executed async is a closure which accesses the outer 'i' variable the code would probably never write the first numbers (see csharpindepth.com/Articles/Chapter5/Closures.aspx).
IMHO the output would probably be something like
148
148
149
149
150
150
150
... repeat 150 times
According to the MS docs you will likely get an exception because each thread does not have an exclusive lock on its handle:
"When a FileStream object does not have an exclusive hold on its handle, another thread could access the file handle concurrently and change the position of the operating system's file pointer that is associated with the file handle...
If an unexpected change in the handle position is detected in a call to the Write method, the contents of the buffer are discarded and an IOException is thrown."
Hmm..
looks like this local variable prevent such behavior
var index = i;
Only two things can happen: Either whole lines are saved or partial lines can occur. I believe what actually happens depends on the file system implementation. My guess is that whole lines will be preserved although this is not guaranteed.
Most speculations in the comments are wrong. No IOException will occur, the file position will not get corrupted and the closure issue has explicitly been avoided.
The default TaskFactory is configured with TaskScheduler,MaximumConcurrencyLevel == 1
My guess would be
1
150
149
148
...
2
@Lachlan
According to MSDN MaxConcurrencyLevel is actually Int32.MaxValue.
The whole purpose of System.Threading.Tasks is to manage concurrency, not to prevent it.
Does this blog have a comments feed?
Yes, the big issue here is that file access through a writer is not thread safe at all. Not just the file position pointer, but the actual write buffer as well - fileStream.Write() simply queues up the write and returns.
I'm guessing the output of your code is a file with 0-x things written into it, and an IOException.
BTW: There is a way to get a synchronized stream for a text writer "TextWriter.Synchronized()", but this requires using a TextWriter instead of a FileStream. It also has the downside of having the FCL be in charge of your performance if this is constantly called code.
You could have out of order writes - which you stated isn't a problem. This is pretty likely.
You could have "corrupted" entries - the acess through FileStream.Write isn't syncronous, it uses an internal buffer, and it uses native methods and passes in buffer offsets to do the write. It would be be somewhat easy to have a write "smash" the buffer and end up writing NULL instead of the characters of the correct string, or possibly even duplicate a string. This is fairly likely.
The documentation says IOExceptions could occurr with multiple-thread writing, but it looks like that only happens when the native handle is exposed. I'm not sure this could happen.
I just confirmed with reflector that filestream does not synchronize its buffer as I assumed. This means that the program will likely not complete because of a contract exception inside of the filestream.
You'll get garbled content in the files, but no exceptions. This is not platform/OS dependent because the problem is not with the file writes, but the FileStream internal buffer.
You are attempting to write a total of 640 bytes which is a lot less than the default 4K internal buffer for FileStream. So you will not actually write to the file until the Flush(). Thus you will not get any threading issues with accessing the file itself.
So the real question is, what happens with multiple threads updating the internal buffer? Reflector shows that both the actual change to the internal buffer, and the change to keep track of the write position are not thread-safe. Both of these could cause corruption of the writes. However they will not cause any exceptions, and so you will have something in the internal buffer to write to the file when Flush() is called, even though it could be corrupted.
Caveat: This is for .NET 2.0, not looked into other versions.
You're going to get some of the numbers being written over, and some will have invalid data (either 3, 4, or 5 bytes of it?) before them.
This code inside FileStream is the culprit:
Buffer.InternalBlockCopy(array, offset, this._buffer, this._writePos, byteCount);
this._writePos += byteCount;
If two of the tasks are executing this at almost the same time, one may copy over what the other has just copied to the buffer, but _writePos will still get increased twice (probably, at least).
Since MSDN help of msdn.microsoft.com/.../system.io.filestream.aspx ">FileStream tells
Any public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
it is not safe to use Write() method from different threads, so the above piece of code could actually corrupts the data. Actually with integer number, is less likely you encounter a corruption, but if you write longer string you can easily spot that writes are actually intermixed.
Actually, I think that the code will work and satisfy the requirements (i.e. noncorrupted, order-insensitive write) due to FileShare.None. All of the tasks will get scheduled on the thread pool, will run in an arbitrary order, but only one at a time will be able to write to the stream, effectively serializing them.
On second thought (and rereading the msdn doco on the method), I take that back, FileShare.None only applies to the act of the opening of the stream, not any actual IO done over it. My bad...
Methinks Nick is teh winnar.
err, I mean James & Nick. Sorry James I pulled a TL;DR on your answer :)
On my computer this gives unpredictable output with every run sometimes missing numbers or entering 0x00 in random places.
@Sam - to be honest, I'd have done that to myself! Nick's answer is definitely clearer :-)
IMO: The correct solution should be based on the separate writing thread which collects data from the other threads and owns the file handler and whole writing and error handling logic. Regards.
I am not sure what happens, but why would you not chose to use Paralles.For to achieve the same outcome?
@Manu Parallel.For is for compute-bound operations. Ayende's example uses IO-bound ones
Comment preview