Monday, May 14, 2007

Dev Source: Making Asynchronous File I/O Work in .NET: A Survival Guide


Jim Mischel

Picture two members of the Microsoft .NET Framework development team meeting in the break room.

Programmer 1: "Can you imagine an application developer trying to do an asynchronous write in a C# program?"

Programmer 2: "Ha! I wonder if he'll just code it up and assume that it's asynchronous, or if he'll notice that it blocks and spend hours trying to figure out why."

Programmer 1: "Either way it'd be a hoot to watch, wouldn't it?"

Programmer 2: "Yeah. I love messing with people's heads."

Okay, so perhaps the .NET Framework developers didn't purposely make the asynchronous file operations work synchronously. It's even possible that whoever wrote the asynchronous file I/O support didn't know that it sometimes doesn't work as expected. But that shortcoming had to be revealed during testing, right?

Apparently not, as the .NET Framework SDK documentation makes almost no mention of the possibility that an asynchronous file operation will block and complete synchronously. In addition, most asynchronous I/O samples in the SDK and on other Web sites show "asynchronous" operations that don't actually execute in the background.

In this article, I'll show you how the asynchronous file I/O routines are supposed to work, demonstrate that they often don't work as expected, explain why, and show you how to guarantee that your asynchronous file operations occur in the background without blocking the main execution thread.

A Refresher

To a processor capable of several billion operations per second, an I/O device that requires milliseconds for positioning and transfers data at a couple hundred megabits per second is slow, even glacial. Whenever your application writes data to or reads data from a disk or network file, the processor spends most of its time waiting for the I/O channel or servicing other tasks. In either case, your program's execution is blocked, waiting for the file operation to complete. Even with a fast hard drive that can write 10 MB or more per second, writing a 10 MB file can leave your application unresponsive for a full second. Try blocking your GUI application for a full second, and see how quickly your users complain about a clunky interface.

Another situation in which slow file operations are annoying is when you have to read from or write to multiple files. Imagine having to read two different large files when your application starts. Each file takes 15 seconds to load, which means that it takes 30 seconds for your application to load. It probably wouldn't surprise you to learn that you could load both files in a total of about 15 seconds if you could tell the computer to do two things at once. While it's waiting on the I/O channel for one of the files, it can be gathering data from the other file.

Operations that occur in the order that they're specified are said to occur synchronously. Most of the time, we want our programs to operate synchronously, because the program's logic depends on it. It would be unfortunate, for example, if the computer tried to use the result of a calculation before the result was computed.

Sometimes, though, we don't really care about the order of some intermediate steps, just so long as all of the steps are completed before we have to use any of the results. For example, think of the instructions to bake cookies:

1. Gather the ingredients
2. Preheat the oven to 450 degrees
3. Lightly grease a cookie tray
4. Mix the ingredients in a bowl
5. Put the cookie dough on the tray
6. Place tray in oven and cook until golden brown

It's pretty obvious here that it doesn't matter in what order the first three operations happen, or if they all happen concurrently. You just have to make sure that you gather the ingredients before you try to mix them, that the cookie tray is greased and the ingredients are mixed before you put the dough on it, and that the oven is preheated before you pop the tray in. If you have a helper, you can save time by having him preheat the oven and grease the cookie tray while you're gathering and mixing the ingredients.

What you have here, and in many computer programs, is small sets of operations that can happen in any order, and synchronization points where you ensure that the previous operations are complete. If you add the synchronization points to the cookie instructions, you can draw a flow chart that illustrates the tasks that can be performed concurrently.
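
The cookie instructions can be sketched in code, with a second thread standing in for the helper. This is only an illustration of a synchronization point; the class and method names (CookieSketch, Step, Bake) are made up for the example, and the "work" is just a log entry.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class CookieSketch
{
    // Records the order in which the steps actually ran (illustrative only).
    static readonly List<string> log = new List<string>();

    static void Step(string name)
    {
        lock (log) { log.Add(name); }
    }

    // The helper's share of the work, run on a second thread.
    static void HelperTasks()
    {
        Step("preheat oven");
        Step("grease tray");
    }

    public static List<string> Bake()
    {
        lock (log) { log.Clear(); }

        Thread helper = new Thread(HelperTasks);
        helper.Start();

        // Meanwhile, the main thread deals with the ingredients.
        Step("gather ingredients");
        Step("mix ingredients");

        // Synchronization point: the oven must be hot and the tray greased
        // before the dough goes anywhere.
        helper.Join();

        Step("put dough on tray");
        Step("bake");

        lock (log) { return new List<string>(log); }
    }
}
```

The first four steps can land in the log in any interleaving, but the last two always come after the Join call, which is the whole point of the synchronization point.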

Wouldn't it be nice if you could do the same thing with your file I/O? Just think if you could load seldom-used data in the background while the user is navigating through menus. That would be a much nicer user experience than staring at a splash screen while the data buffers are initialized.

Reading a File Asynchronously

The designers of the .NET Framework went out of their way to make asynchronous I/O easy to use. In its simplest form, writing a program to do asynchronous reads and writes is almost as easy as using standard, synchronous I/O.

Discounting the changes in how you open the file and the name of the function you call, the only real difference with asynchronous I/O is the additional requirement of a synchronization point. At some time you have to check the result of the operation, termed "harvesting the result."

For example, consider this code snippet that uses synchronous operations to read data from a file:

static void synchronousRead()
{
    byte[] data = new byte[BUFFER_SIZE];
    FileStream fs = new FileStream("readtest.dat", FileMode.Open,
        FileAccess.Read, FileShare.None);
    try
    {
        int bytesRead = fs.Read(data, 0, BUFFER_SIZE);
        Console.WriteLine("{0} bytes read", bytesRead);
    }
    finally
    {
        fs.Close();
    }
}

To do the same thing asynchronously, you have to pass a couple of other options to the FileStream constructor, call BeginRead rather than Read, and you have to create the synchronization point where you harvest the result. Let me show you the code first, and then I'll explain how it works.

static void asyncRead()
{
    byte[] data = new byte[BUFFER_SIZE];
    FileStream fs = new FileStream("readtest.dat", FileMode.Open,
        FileAccess.Read, FileShare.None, 1, true);
    try
    {
        // initiate an asynchronous read
        IAsyncResult ar = fs.BeginRead(data, 0, data.Length, null, null);
        // read is proceeding in the background.
        // You can do other processing here.
        // When you need to access the data that's been read, you
        // need to ensure that the read has completed, and then
        // harvest the result.

        // wait for the operation to complete
        ar.AsyncWaitHandle.WaitOne();

        // harvest the result
        int bytesRead = fs.EndRead(ar);

        Console.WriteLine("{0} bytes read", bytesRead);
    }
    finally
    {
        fs.Close();
    }
}

The first difference in the asynchronous read code is the way that you open the file. To perform asynchronous file I/O in .NET, you must create the FileStream with the useAsync flag set to true. The only way to set this flag is to call one of the two FileStream constructors that allow it to be specified. If you try to do asynchronous operations on files that were not opened with this flag set, the operations will proceed synchronously.
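
If you are unsure how a stream was opened, the FileStream.IsAsync property reports it. The sketch below is a minimal check, assuming a writable temp directory; OpenModeSketch and OpensAsync are names invented for the illustration, and the 4096-byte buffer size is arbitrary.

```csharp
using System;
using System.IO;

static class OpenModeSketch
{
    // Creates a scratch file, opens it with or without useAsync,
    // and returns what the stream reports about itself.
    public static bool OpensAsync(bool useAsync)
    {
        string path = Path.GetTempFileName();   // scratch file (assumption: temp dir is writable)
        try
        {
            using (FileStream fs = new FileStream(path, FileMode.Open,
                FileAccess.Read, FileShare.None, 4096, useAsync))
            {
                return fs.IsAsync;
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Keep in mind that IsAsync only tells you how the file was opened. It says nothing about whether a particular operation will actually run in the background.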

To initiate an asynchronous read, you call BeginRead, passing it the same parameters that you pass to Read, plus two others: a completion delegate and a state object. In this example, I've supplied null for both of those parameters. Later in the article, I describe how to use those two parameters, and provide an example.

Calling BeginRead causes the read operation to begin executing in a background thread. The main thread returns from the BeginRead call, and continues executing. You can perform any processing you like after BeginRead returns, except you can't access the buffer that is being filled by the read operation (the data buffer, in this example). Either reading or writing to that buffer while the read operation is running can corrupt the data being read into the buffer.

The value returned from BeginRead is an object that implements the IAsyncResult interface. This object provides information about the read operation, and is used both to determine when the operation is complete, and to harvest the result. The IsCompleted property tells you if the asynchronous operation has completed. The CompletedSynchronously property tells you if the operation was completed "immediately" during the BeginRead call, and the AsyncState property contains a reference to the state object that you passed to BeginRead. (I discuss CompletedSynchronously and AsyncState a little later.)

The whole point to asynchronous I/O is that you want the file operation to run in the background while you do other things. When it's time to work with the results of the I/O operation, you synchronize and continue on your way.

There are two points to synchronizing: determining if the operation completed, and harvesting the result. Whereas there's only one way to harvest the result, there are two ways that you can wait for the operation to complete: poll or wait.

The code in the example above waits for completion by calling the WaitOne method of the IAsyncResult object's AsyncWaitHandle property. This uses a Windows synchronization object to set an event and wait until the event is signaled before continuing. This is very processor-efficient because the waiting thread (your main thread in this case) consumes zero processor cycles while waiting. When the read is complete, it signals the wait handle and Windows transfers control to the waiting thread. This is the preferred way to do synchronization when you want to spawn an asynchronous read, perform some processing, and then wait for the read to complete before continuing.
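
Waiting on the handle does not have to be all-or-nothing. WaitHandle.WaitOne has an overload that takes a timeout and returns false if the handle was not signaled in time, so you can wait in slices. The following is a sketch only: WaitSketch and ReadWithTimedWaits are hypothetical names, the scratch file stands in for your real data file, and the 100 ms slice and 4096-byte buffer size are arbitrary.

```csharp
using System;
using System.IO;
using System.Threading;

static class WaitSketch
{
    // Reads up to 'size' bytes from a scratch file, waiting on the
    // handle in 100 ms slices. Returns the number of bytes read.
    public static int ReadWithTimedWaits(int size)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[size]);
            byte[] data = new byte[size];
            using (FileStream fs = new FileStream(path, FileMode.Open,
                FileAccess.Read, FileShare.None, 4096, true))
            {
                IAsyncResult ar = fs.BeginRead(data, 0, data.Length, null, null);

                // WaitOne(milliseconds, exitContext) returns false on timeout.
                while (!ar.AsyncWaitHandle.WaitOne(100, false))
                {
                    // Do a slice of foreground work here,
                    // but leave the data buffer alone.
                }

                // harvest the result
                return fs.EndRead(ar);
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

The thread still sleeps inside WaitOne rather than spinning, so this keeps most of the efficiency of a plain wait while giving the main thread a chance to do something between slices.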

The other way to wait for completion is to poll: periodically check the status of the IsCompleted flag, and continue processing until the flag is set. Here's a code snippet that does just that.

// wait for the operation to complete
while (!ar.IsCompleted)
{
    System.Threading.Thread.Sleep(10);
}

This is the recommended method of waiting for completion if you want to do some processing until the read is done. Perhaps you want to do some animation and pre-calculate some noncritical items while waiting for the read to be done. In the code example, the thread just sleeps for 10 milliseconds between each check of IsCompleted. You could just as well have coded some processing loop that takes some small time to complete.

The nice thing about the polling method is that it gives us a simple way to determine if the I/O operation is operating synchronously or asynchronously. If you add an output statement to the loop, you should see at least some output before the I/O operation completes, provided, of course, that the I/O was handled asynchronously. The code below modifies the example above slightly, checking the CompletedSynchronously flag and displaying a period each time through the polling loop.

static void asyncRead()
{
    byte[] data = new byte[BUFFER_SIZE];
    FileStream fs = new FileStream("e:\\readtest.dat", FileMode.Open,
        FileAccess.Read, FileShare.None, 1, true);
    try
    {
        // initiate an asynchronous read
        IAsyncResult ar = fs.BeginRead(data, 0, data.Length, null, null);
        if (ar.CompletedSynchronously)
        {
            Console.WriteLine("Operation completed synchronously.");
        }
        else
        {
            // Read is proceeding in the background.
            // wait for the operation to complete
            while (!ar.IsCompleted)
            {
                Console.Write('.');
                System.Threading.Thread.Sleep(10);
            }
            Console.WriteLine();
        }
        // harvest the result
        int bytesRead = fs.EndRead(ar);

        Console.WriteLine("{0} bytes read", bytesRead);
    }
    finally
    {
        fs.Close();
    }
}

If the operation proceeds in the background, you should see at least one period displayed on the screen. If you run this on a fast hard drive, especially if BUFFER_SIZE is small, one period is all that you'll see. Try reading from a USB flash drive or from a network drive with a buffer size of five megabytes. Then you should see a lot of periods racing across the screen.

Don't be surprised if, when you run the program twice against the same file, the second run takes only a fraction of the time of the first. Windows is very good about caching data (sometimes too good, as you'll see), so the second time through the data is already in memory and Windows only has to transfer it to your buffer. No device access required.

Sometimes, your program will "lock up" (stop responding) for a while after you call BeginRead, but when the function call returns, the CompletedSynchronously property of the returned IAsyncResult is not set. In that case, you'll see one period displayed on the screen, and then the operation is complete. This doesn't happen very often, if at all, during asynchronous reads, but it does happen often during asynchronous writes, as you'll see below.

Completion Callbacks and State Objects

In some cases, you want to execute some code immediately after the asynchronous I/O operation is complete, without waiting for the main thread to finish what it's doing. Or you don't want the main thread to be concerned with cleaning up after the read (calling EndRead and closing the stream). That's where the completion callback comes in: you can set things up so that the background thread that executes the asynchronous read transfers control to the completion callback method after the read is completed.

The IAsyncResult reference that was created when you called BeginRead is passed to the completion callback as its only parameter. A reference to the state object that you pass to BeginRead is stored in the IAsyncResult object. This functionality allows true "fire and forget" asynchronous I/O. Here's an example:

static private byte[] globalData = new byte[BUFFER_SIZE];
static void asyncReadWithCallback()
{
    FileStream fs = new FileStream("readtest.dat", FileMode.Open,
        FileAccess.Read, FileShare.None, 1, true);
    // initiate an asynchronous read
    IAsyncResult ar = fs.BeginRead(globalData, 0, BUFFER_SIZE,
        new AsyncCallback(readCallback), fs);
    if (ar.CompletedSynchronously)
    {
        Console.WriteLine("Operation completed synchronously.");
    }
    // The read has completed or it is proceeding in the background.
    // Either way, we don't care, as this main thread will not
    // be accessing the information, or we have some other way to
    // ensure that the read is finished before we access the data.

    // Finish whatever processing here ...

    // note that we don't close the stream because
    // it will be closed by the callback function
}

// readCallback is called when the asynchronous read completes.
// It harvests the result and closes the stream.
// You could use this routine to signal a manual event or do
// other processing that must occur when the read is complete.
static void readCallback(IAsyncResult ar)
{
    FileStream fs = (FileStream)ar.AsyncState;
    int bytesRead = fs.EndRead(ar);
    Console.WriteLine("{0} bytes read", bytesRead);
    fs.Close();
}

The call to BeginRead creates an AsyncCallback delegate that references the callback function, and passes the FileStream reference as the state object. The FileStream reference is stored in the IAsyncResult object, and is available in the AsyncState property. You can pass any type as the state object, and then cast the AsyncState property to that type from within the callback function. If you want to do any operations on the file stream, be sure to include the stream in whatever state object you create.
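
When the callback needs more than the stream, a small state class can carry everything along. The following is a sketch under assumptions: ReadState, StateObjectSketch, StartRead and OnReadComplete are hypothetical names, and the ManualResetEvent is one possible way for other code to learn that the buffer is ready, not the only one.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical state class: everything the callback needs travels together.
class ReadState
{
    public FileStream Stream;
    public byte[] Buffer;
    public int BytesRead;
    public ManualResetEvent Done = new ManualResetEvent(false);
}

static class StateObjectSketch
{
    public static ReadState StartRead(string path, int bufferSize)
    {
        ReadState state = new ReadState();
        state.Buffer = new byte[bufferSize];
        state.Stream = new FileStream(path, FileMode.Open,
            FileAccess.Read, FileShare.None, 4096, true);

        // The state object is fully populated before the read starts,
        // because the callback may run before BeginRead returns.
        state.Stream.BeginRead(state.Buffer, 0, bufferSize,
            new AsyncCallback(OnReadComplete), state);
        return state;
    }

    static void OnReadComplete(IAsyncResult ar)
    {
        ReadState state = (ReadState)ar.AsyncState;
        try
        {
            // harvest the result
            state.BytesRead = state.Stream.EndRead(ar);
        }
        finally
        {
            state.Stream.Close();
            state.Done.Set();   // the buffer is now safe to touch
        }
    }
}
```

A caller that eventually needs the data can wait on state.Done and then use state.Buffer and state.BytesRead. A caller that doesn't can simply drop the reference.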

The documentation for BeginRead, by the way, says "on Windows, all I/O operations smaller than 64 KB will complete synchronously for better performance." My testing with small reads shows that this is not true. I've tried many different buffer sizes from 2 bytes to 64 KB. Not one of them completed synchronously. At least, ar.CompletedSynchronously never was set. It's possible that the operation was blocking, but being reported as having completed asynchronously.

The .NET Framework's implementation of asynchronous reading is simple to use, and quite powerful. Asynchronous writes work in much the same way, at least in theory.

Asynchronous Writes

To write data asynchronously, you use the same type of code as for a read. The only real difference is that you call BeginWrite and EndWrite instead of BeginRead and EndRead, and the EndWrite method has no return value. Other than that, the code is identical. The I/O completion routine and the state object also operate exactly the same for writes as for reads. Below is the code for a simple asynchronous write.

static void asyncWrite()
{
    byte[] data = new byte[BUFFER_SIZE];
    FileStream fs = new FileStream("writetest.dat", FileMode.Create,
        FileAccess.Write, FileShare.None, 1, true);
    try
    {
        // initiate an asynchronous write
        Console.WriteLine("start write");
        IAsyncResult ar = fs.BeginWrite(data, 0,
            data.Length, null, null);
        if (ar.CompletedSynchronously)
        {
            Console.WriteLine("Operation completed synchronously.");
        }
        else
        {
            // write is proceeding in the background.
            // wait for the operation to complete
            while (!ar.IsCompleted)
            {
                Console.Write('.');
                System.Threading.Thread.Sleep(10);
            }
            Console.WriteLine();
        }
        // harvest the result
        fs.EndWrite(ar);
        Console.WriteLine("data written");
    }
    finally
    {
        fs.Close();
    }
}

I added the line that outputs "start write," because I want to illustrate that asynchronous writes don't always proceed asynchronously. If you use this code to output a file to the local computer's hard drive, you'll probably see one period displayed before the program exits. Local hard drives are so fast, and Windows is so good at buffering, that local file operations appear to be instantaneous, even when they aren't.

If you want a surprise, change the program so that it outputs to a network device or a USB flash drive — something slower than a local hard drive. If your system is anything like the others I've tested this program on, you will see the message "start write", and then the program will display nothing for a while. Finally, it will display one period and then exit. The program obviously blocks during the I/O operation, but the CompletedSynchronously property isn't set. To paraphrase that most famous line from Hamlet, "Something is rotten inside of Windows."
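
If counting periods feels too informal, you can put a stopwatch around the BeginWrite call itself. The sketch below is only a measuring aid, not part of the original test program: BlockingProbe and TimeBeginWrite are hypothetical names, and the path you pass in is whatever slow device you want to test.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class BlockingProbe
{
    // Returns the number of milliseconds spent inside the BeginWrite call.
    // A truly asynchronous write should return almost immediately.
    public static long TimeBeginWrite(string path, int size)
    {
        byte[] data = new byte[size];
        using (FileStream fs = new FileStream(path, FileMode.Create,
            FileAccess.Write, FileShare.None, 4096, true))
        {
            Stopwatch sw = Stopwatch.StartNew();
            IAsyncResult ar = fs.BeginWrite(data, 0, data.Length, null, null);
            sw.Stop();   // stop the clock the moment BeginWrite returns

            Console.WriteLine(
                "BeginWrite returned after {0} ms; CompletedSynchronously = {1}",
                sw.ElapsedMilliseconds, ar.CompletedSynchronously);

            // harvest the result (blocks until the write has finished)
            fs.EndWrite(ar);
            return sw.ElapsedMilliseconds;
        }
    }
}
```

A long elapsed time combined with CompletedSynchronously reporting false is the behavior described above: the call blocked, but the flag doesn't admit it.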

What They Don't Tell You

Neither the .NET Framework SDK documentation for BeginWrite nor the Windows API documentation for WriteFile (the API function that BeginWrite uses) makes much mention of the possibility that an asynchronous write might block. The documentation for BeginWrite does say that "the underlying operating system resources might allow access in only one of these modes." Still, if the OS didn't support asynchronous writes, you'd expect BeginWrite to return an IAsyncResult that has the CompletedSynchronously flag set. Right?

It took some digging, but I finally came across a Microsoft Knowledge Base article titled Asynchronous Disk I/O Appears as Synchronous on Windows NT, Windows 2000, and Windows XP. This article is chock full of good information. For example, under the heading "Set Up Asynchronous I/O," the article states:

Be careful when coding for asynchronous I/O because the system reserves the right to make an operation synchronous if it needs to. Therefore, it is best if you write the program to correctly handle an I/O operation that may be completed either synchronously or asynchronously.

Having programmed computers for 25 years, I understood and even expected that. It's interesting, though, that the SDK documentation doesn't mention it, nor do any of the many articles I've found about asynchronous I/O under .NET. That still didn't answer my question as to why asynchronous writes were blocking.

The article covers many reasons why your asynchronous file operations might appear synchronous. I found my answer to the problem under the heading "Extending a file." As the article states:

Another reason that I/O operations are completed synchronously is the operations themselves. On Windows NT, any write operation to a file that extends its length will be synchronous.

The article mentions Windows NT specifically, but I've seen the same behavior on Windows 2000 and Windows Server 2003. The rest of the section goes on to explain how to get around this limitation, and the security reasons for not attempting the workaround.

The problem is that the FileStream constructor I call is creating a new file (FileMode.Create), which means that any existing file is truncated to zero bytes so that it's just like a newly-created file. Writing any data to that file extends it, thereby causing the operation to proceed synchronously. I proved that this is the cause of the error by running my program once and then changing the open mode to FileMode.Open. When I ran the program again, it opened the existing file and wrote data to it from the beginning, overwriting the existing data. That operation proceeded asynchronously, as you would expect.

I don't know enough about Windows internals and the NTFS file system to understand why extending a file has to be a synchronous operation. It'd be interesting, although not terribly useful, to learn the reason, and I encourage any member of the Windows team to contact me with an explanation. Understanding why doesn't solve my problem, though. I want my asynchronous writes!

Alternate Feline De-furring Method

The problem here is that the operating system blocks the thread during a write that extends a file, and I don't want my thread blocked. That was the whole point of attempting an asynchronous write operation in the first place. Sometimes if you want something done write (er, right), you just have to do it yourself. Since BeginWrite won't guarantee asynchronous operation, I decided to do the I/O by creating and calling an asynchronous delegate. Doing so presents two primary advantages:

1. I/O is guaranteed to be asynchronous because it is executing on a background thread.
2. The code executed in the background can involve multiple reads and writes, and arbitrarily complex calculations interspersed between I/O operations.

Creating an asynchronous delegate isn't much more complicated than creating a file I/O completion callback. All you need to do is define the method that will execute on the background thread, define a delegate for it, instantiate a delegate, and execute. The code below replaces the asynchronous write code from the last example.

// Define the method type that will be called
private delegate void WriteDelegate(byte[] dataBuffer);

static void asyncWriteWithDelegate()
{
    byte[] data = new byte[BUFFER_SIZE];

    // create a new WriteDelegate
    WriteDelegate dlgt = new WriteDelegate(WriteTheData);

    // Invoke the delegate asynchronously.
    IAsyncResult ar = dlgt.BeginInvoke(data, null, null);

    // WriteTheData is now executing asynchronously
    // Continue with foreground processing here.
    while (!ar.IsCompleted)
    {
        Console.Write('.');
        System.Threading.Thread.Sleep(10);
    }
    Console.WriteLine();

    // harvest the result
    dlgt.EndInvoke(ar);
}

static void WriteTheData(byte[] dataBuffer)
{
    using (FileStream fs = new FileStream("e:\\writetest.dat", FileMode.Create,
        FileAccess.Write, FileShare.None))
    {
        Console.WriteLine("Begin write...");
        fs.Write(dataBuffer, 0, dataBuffer.Length);
        Console.WriteLine("Write complete");
    }
}

Notice the similarity between this code and the asynchronous write code that uses FileStream.BeginWrite. Granted, you have to define and create a delegate, but other than that, the code is almost identical. Calling the BeginInvoke method is similar to calling BeginWrite, and you harvest the result by calling EndInvoke in the same way that you call EndWrite. You even check for completion in the same way: by testing IAsyncResult.IsCompleted, or by waiting on the AsyncWaitHandle.

You can even pass a completion callback routine and state object to BeginInvoke in the same way that you can for BeginRead and BeginWrite. The completion callback is executed on the background thread after the delegate is done executing, and the state object is passed in the AsyncState property of the IAsyncResult object.

Note also that WriteTheData calls Write rather than BeginWrite to initiate the I/O. Since WriteTheData is executing on a background thread, it makes little sense to do the I/O asynchronously. This would hold true for reads, too, although I guess there are rare situations in which you'd want a background thread to do asynchronous I/O.

The second advantage of using asynchronous delegates rather than BeginRead or BeginWrite sometimes is overlooked. More often than not, you want your program's initialization code to do more than just read a file into a buffer — something that the built-in asynchronous functionality does quite well. Very often, you want to read the file in blocks (or records), and process those blocks to create internal data structures, or do some post-processing as you're writing data to a file. Perhaps you want to read or write an encrypted file. In those and many other common situations, the standard asynchronous I/O support supplied by FileStream isn't enough. Considering how easy it is to create and invoke an asynchronous delegate, I've found that it's easier to do all of my asynchronous I/O—even simple file reads into a buffer—through asynchronous delegates. Doing so guarantees that the operation will execute in the background, and also makes it easier for me to make the inevitable changes when the client requests different functionality.
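
To make the block-at-a-time case concrete, here is a sketch of a background job that reads a file in 4 KB blocks and processes each one. Note the swap: it queues the work with ThreadPool.QueueUserWorkItem rather than an asynchronous delegate, which is a different mechanism for the same idea of running ordinary synchronous I/O off the main thread. ChecksumJob and BlockReaderSketch are hypothetical names, and the byte sum merely stands in for real per-block processing.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical job object: file path in, byte sum out.
class ChecksumJob
{
    public string FilePath;
    public long Sum;
    public Exception Error;
    public ManualResetEvent Done = new ManualResetEvent(false);
}

static class BlockReaderSketch
{
    public static ChecksumJob Start(string path)
    {
        ChecksumJob job = new ChecksumJob();
        job.FilePath = path;
        ThreadPool.QueueUserWorkItem(new WaitCallback(Run), job);
        return job;
    }

    // Runs on a thread-pool thread, so plain synchronous reads are fine.
    static void Run(object state)
    {
        ChecksumJob job = (ChecksumJob)state;
        try
        {
            byte[] block = new byte[4096];
            using (FileStream fs = new FileStream(job.FilePath, FileMode.Open,
                FileAccess.Read, FileShare.Read))
            {
                int n;
                while ((n = fs.Read(block, 0, block.Length)) > 0)
                {
                    // Stand-in for real work: decrypt, parse records,
                    // build data structures, and so on.
                    for (int i = 0; i < n; i++)
                    {
                        job.Sum += block[i];
                    }
                }
            }
        }
        catch (Exception ex)
        {
            job.Error = ex;   // surface the failure to whoever waits on Done
        }
        finally
        {
            job.Done.Set();
        }
    }
}
```

The main thread calls Start, carries on with its own work, and waits on job.Done (or polls it) at its synchronization point, checking job.Error before using job.Sum.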

Asynchronous I/O to Other Devices

All classes that inherit from System.IO.Stream implement the BeginRead and BeginWrite methods. However, the default implementations of these methods in System.IO.Stream simply call Read and Write, making the operations synchronous. Only those classes that override BeginRead and BeginWrite actually support asynchronous I/O. Apart from FileStream, of the .NET Framework classes that inherit from System.IO.Stream, only System.Net.Sockets.NetworkStream includes full support for asynchronous operation. All others depend on the default implementation of BeginRead and BeginWrite, which means that all I/O operations on those streams will block the calling thread until the operation is completed.
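
If you want to know whether a particular stream class supplies its own BeginRead or falls back on the one in System.IO.Stream, reflection can tell you. This is a rough sketch with hypothetical names (StreamProbe, OverridesBeginRead, PlainStream). An override is no proof that the operation truly runs in the background; it only shows that the class does not rely on the default.

```csharp
using System;
using System.IO;
using System.Reflection;

static class StreamProbe
{
    static readonly Type[] BeginArgs = new Type[] {
        typeof(byte[]), typeof(int), typeof(int),
        typeof(AsyncCallback), typeof(object)
    };

    // True if streamType (or one of its bases below Stream)
    // declares its own BeginRead.
    public static bool OverridesBeginRead(Type streamType)
    {
        MethodInfo m = streamType.GetMethod("BeginRead", BeginArgs);
        return m != null && m.DeclaringType != typeof(Stream);
    }
}

// A minimal stream that overrides only the abstract members,
// so it inherits Stream's default BeginRead.
class PlainStream : Stream
{
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) { return 0; }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

Calling StreamProbe.OverridesBeginRead(typeof(PlainStream)) reports that the class uses the inherited method, while the same call with typeof(FileStream) reports an override. You can point it at any other stream type you're curious about.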

Copyright © 2005 Ziff Davis Media Inc. All Rights Reserved. Originally appearing in Dev Source.