Friday, January 4, 2008

Asynchronous File I/O

Introduction
Today I'm going to talk about asynchronous file I/O. The .NET platform gives developers a set of asynchronous methods that can improve performance in certain situations (file I/O, networking, queue management, etc.). This tutorial covers using asynchronous methods for basic file I/O to gain performance on large files.

What is Asynchronous?
Consider this situation. Let's assume you have 1000 files that you need to open, read into byte arrays, then write to a different location. Each file is well over 4 megabytes of data. What could you do?

First, you could use a single thread to process each file synchronously. This could take a loooong time.

You could use a new thread for each file, and process them all at the same time. On the surface, this seems like the best and most elegant method. But this has three huge problems. First, the memory overhead would be enormous. Having 1000 different files opened and read into memory at the same time would be quite drastic.

The second problem is more subtle. The OS (Windows 2000, NT, XP) manages how much time each thread gets. This slice of time is called the quantum. The OS divides processor time up evenly between all threads with like priority. So what is the problem? When the CPU switches from one thread to another, it has to perform "context switching". This is a fancy way of saying the CPU is saving the state of one thread and loading the state of another. Context switching happens every time the OS moves on to a different thread, and with a handful of threads it has a very minimal impact on the system. But when you start increasing the total number of threads (in this case by 1000), the total time spent context switching grows while each thread's share of the processor shrinks. By the time all the switching is paid for, much of the processor time is burned up and little is actually used for processing.

The last problem deals with the limitations of the OS itself. Each thread is by default allocated 1 MB of stack space. On 32-bit Windows (NT, 2000, XP), a process is limited to 2 GB of user address space by default. This means that a process can never have more than roughly 2000 threads (2 GB / 1 MB), and in practice fewer, since the stacks have to share that address space with everything else the program loads.

So, what to do? Well, you could use a threadpool object to handle the requests. This is actually exactly what asynchronous programming does, but it does it "behind the scenes" and makes life a lot easier for the programmer. So let's discuss asynchronous programming now. We are going to hit a bit of theory before jumping into code. Asynchronous methods do some interesting things that are not usually described in tutorials. I'd like to describe what is actually going on in the background, as it took me quite a while to scrape together the details.

Asynchronous methods use two very important things to gain performance: completion ports and a threadpool. Completion ports are a way to manage multiple requests at once in a non-blocking manner. A thread issues an I/O request against a handle associated with the completion port, the OS queues the request, and control returns to the caller right away so it can continue processing or start more work. When the OS is done with the I/O, it posts a completion notification to the port. At this point, an idle thread from the threadpool (preferably one that ran most recently, to reduce context switching) is taken and used to run the callback. The callback retrieves the data and exits, returning the thread to the threadpool.

This method allows us to make anywhere from several to a great many calls and have them processed in the most efficient manner. Here is some more background information, and the reason I said that only some tasks benefit from asynch programming. Suppose instead of opening a file and reading from it (where the disk access speed is the bottleneck), you were factoring large numbers. In this scenario, the bottleneck is the CPU, and no performance gain would come from asynch programming. Asynch programming works well when threads would otherwise spend a significant portion of their quantum doing little CPU work (because they are waiting for the disk to return, or for a network connection, etc.). In these situations, asynch programming is perfect because it can start many operations at once and process them as they finish.

Code
Finally! Let's hit some code. By now, if you are still reading, you might have gone into shock. There were some pretty gory details above, but fear not! Asynch programming is absurdly simple in .NET.

    Private Sub ReadFilesInC()
        ' make a reference to a directory
        Dim di As New IO.DirectoryInfo("c:\")
        Dim diar1 As IO.FileInfo() = di.GetFiles()
        Dim dra As IO.FileInfo
        'start an asynchronous read for each file in the specified directory
        For Each dra In diar1
            ReadData(dra.FullName)
        Next
    End Sub

    Public Sub ReadData(ByVal Path As String)
        '--This is the state object that will contain the file
        '---and the return data. This class is necessary because
        '---we don't want the return data overwritten by each
        '---read method (like it would if it was declared globally)
        Dim tempStateObj As New StateObj
        '--Open up the file we wish to read. The final True (useAsync) asks
        '---the OS for overlapped I/O; without it, BeginRead typically just
        '---runs a blocking read on another thread
        tempStateObj.file = New System.IO.FileStream(Path, IO.FileMode.Open, _
            IO.FileAccess.Read, IO.FileShare.Read, 4096, True)
        '--Redimension the return data array to the appropriate length.
        '---ReDim takes the upper bound, not the element count, hence the - 1
        ReDim tempStateObj.retData(CInt(tempStateObj.file.Length) - 1)
        '--Call the Asynch read method. Pass the byte array we wish to fill,
        '---the offset, the number of bytes to read, the address of the
        '---callback sub, and finally a state object for our own use.
        tempStateObj.file.BeginRead(tempStateObj.retData, 0, tempStateObj.retData.Length, _
            New AsyncCallback(AddressOf OnReadDone), tempStateObj)
        '--Control is immediately given back to the program. The BeginRead method
        '---does not block program execution like the Read method, so we are free to
        '---get back to processing.
        Debug.WriteLine("BeginRead done: " & Path)
        '--To show that this is a normal thread and not threadpooled:
        Debug.WriteLine(System.Threading.Thread.CurrentThread.IsThreadPoolThread)
    End Sub
    Public Sub OnReadDone(ByVal ar As IAsyncResult)
        '--This is the asynchronous callback delegate that is called when the BeginRead
        '---is done processing.
        '--The state object we passed to BeginRead comes back in ar.AsyncState.
        '---It is typed as Object and must be cast into something usable.
        '---We know we passed a StateObj class, so we cast it as such
        Dim state As StateObj = CType(ar.AsyncState, StateObj)
        '--From our state object, we must call EndRead(ar) to get the
        '---number of bytes read. Even if you do not wish to know the
        '---number of bytes, you must *always* call EndRead in the callback
        Dim bytesReceived As Int32 = state.file.EndRead(ar)
        '--If you don't wish to know the number of bytes read, do this:
        'state.file.EndRead(ar)
        '--Remember the source file name, then close the source file
        Dim fileName As String = IO.Path.GetFileName(state.file.Name)
        state.file.Close()
        '--Just to prove that this callback is running on a thread from the
        '---threadpool that the runtime manages:
        Debug.WriteLine(System.Threading.Thread.CurrentThread.IsThreadPoolThread)
        '--Open up a new file to write to, again asking for asynchronous I/O.
        '---Each source file gets its own output file so that concurrent
        '---callbacks do not fight over a single path
        state.file = New System.IO.FileStream(IO.Path.Combine("C:\Somewhere", fileName), _
            IO.FileMode.Create, IO.FileAccess.Write, IO.FileShare.None, 4096, True)
        '--Begin the Asynch write to the file, passing only the bytes actually read
        '---and the state object again
        state.file.BeginWrite(state.retData, 0, bytesReceived, New AsyncCallback(AddressOf OnWriteDone), state)
        '--At this point, the sub will terminate and the thread will go back to the
        '---internal threadpool. It is important not to do anything that will block or take
        '---large amounts of time in this sub. The internal threadpool has a limited
        '---number of threads (25 worker threads per processor by default in early
        '---.NET versions), so return the thread to the pool as quickly as possible.
    End Sub
    Public Sub OnWriteDone(ByVal ar As IAsyncResult)
        '--This is the asynchronous callback delegate that is called when the BeginWrite
        '---is done processing.
        '--The state object we passed to BeginWrite comes back in ar.AsyncState.
        '---It is typed as Object and must be cast into something usable.
        '---We know we passed a StateObj class, so we cast it as such
        Dim state As StateObj = CType(ar.AsyncState, StateObj)
        '--From our state object, we must call EndWrite(ar). You must
        '---*always* call EndWrite in the callback
        state.file.EndWrite(ar)
        state.file.Close()
        '--At this point, the sub will terminate and the thread will go back to the
        '---internal threadpool. It is important not to do anything that will block or take
        '---large amounts of time in this sub. The internal threadpool has a limited
        '---number of threads (25 worker threads per processor by default in early
        '---.NET versions), so return the thread to the pool as quickly as possible.
    End Sub
    '----------
    '--Our StateObj Class
    '----------
    Public Class StateObj
        Public file As System.IO.FileStream
        Public retData() As Byte
    End Class


Hopefully the comments are fairly self-explanatory. First, we call ReadFilesInC() to start reading every file in the root of the C:\ drive (but not subdirectories). This calls ReadData() for each file, which asynchronously starts reading it. ReadData() returns immediately after starting the read, unlike a similar function that uses Read() instead of BeginRead().


ReadData() uses a small public class called StateObj. This class contains the file and the return data array. This is important. Each file that you open asynchronously must have its own buffer to store data in. If you used a global variable, each file would overwrite the others, making quite a mess. Luckily, the asynch methods allow us to pass a state object that can be retrieved in the callback.


The callback delegate itself is also fairly self-explanatory. We take the AsyncState property of the IAsyncResult object and cast it back to our original StateObj class, then call EndRead(ar) to get the number of bytes read. Regardless of whether you use or store the number of bytes, you must always call EndRead() in the callback. If you don't, bad things (TM) could happen: EndRead() is where any error from the read is reported, and it is what lets the resources held by the operation be released.
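
To make the "always call EndRead" rule concrete, here is a minimal sketch of a callback that wraps EndRead() in Try/Catch/Finally, so a failed read is reported and the file is closed either way. The names ReadState, Source, Finished and CountBytes are made up for this illustration and are not part of the article's code, and the WaitOne() call is only there so the demo can print a result.

```vbnet
Imports System
Imports System.IO
Imports System.Threading

Module EndReadDemo
    ' Hypothetical state object for this sketch (not the article's StateObj).
    Private Class ReadState
        Public Source As FileStream
        Public Data As Byte()
        Public BytesRead As Integer = -1
        Public Finished As New ManualResetEvent(False)
    End Class

    Private Sub OnReadDone(ByVal ar As IAsyncResult)
        Dim state As ReadState = CType(ar.AsyncState, ReadState)
        Try
            ' EndRead returns the byte count and rethrows any error from the read.
            state.BytesRead = state.Source.EndRead(ar)
        Catch ex As IOException
            Console.WriteLine("Read failed: " & ex.Message)
        Finally
            ' Close the file and signal completion whether or not the read worked.
            state.Source.Close()
            state.Finished.Set()
        End Try
    End Sub

    ' Starts an asynchronous read of the whole file and reports how many bytes arrived.
    Public Function CountBytes(ByVal filePath As String) As Integer
        Dim state As New ReadState()
        state.Source = New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, True)
        state.Data = New Byte(CInt(state.Source.Length) - 1) {}
        state.Source.BeginRead(state.Data, 0, state.Data.Length, New AsyncCallback(AddressOf OnReadDone), state)
        ' Blocking here defeats the purpose of asynch I/O; it is for the demo only.
        state.Finished.WaitOne()
        Return state.BytesRead
    End Function

    Sub Main()
        Dim tempFile As String = Path.GetTempFileName()
        File.WriteAllBytes(tempFile, New Byte(999) {})
        Console.WriteLine("Bytes read: " & CountBytes(tempFile))
        File.Delete(tempFile)
    End Sub
End Module
```

In a real program you would drop the wait handle and let the callback kick off whatever comes next.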


The same procedure is used for writing. This is one of the great features of the .NET platform. The sockets have the same naming and usage as the file I/O, which has the same as memory streams, etc. You can take your knowledge from this and apply it to many other objects. Any method that has the prefix Begin... on it (BeginRead, BeginWrite, BeginAccept, etc.) will have a corresponding End... method (EndRead, EndWrite, EndAccept, etc.). Each such Begin.../End... pair forms the asynch methods of any object that has them.
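
As a small illustration of how uniform the Begin.../End... pairing is, here is a sketch that works against the base Stream class, so the same code accepts a FileStream, a MemoryStream or a NetworkStream. It also shows a second way to use the pair: pass no callback, hold on to the IAsyncResult yourself, and call EndRead() when you need the data. The helper name ReadBlock is hypothetical, and the demo uses a MemoryStream only because it needs no file on disk.

```vbnet
Imports System
Imports System.IO

Module BeginEndPairDemo
    ' Reads up to count bytes from any Stream using the Begin/End pair.
    Public Function ReadBlock(ByVal source As Stream, ByVal count As Integer) As Byte()
        Dim block(count - 1) As Byte
        ' No callback and no state object: we keep the IAsyncResult ourselves.
        Dim ar As IAsyncResult = source.BeginRead(block, 0, count, Nothing, Nothing)
        ' ... other work could be done here while the read is in flight ...
        ' EndRead blocks until the read has finished, then returns the byte count.
        Dim bytesRead As Integer = source.EndRead(ar)
        ' Trim the array in case fewer bytes were available than requested.
        ReDim Preserve block(bytesRead - 1)
        Return block
    End Function

    Sub Main()
        Dim source As New MemoryStream(New Byte() {1, 2, 3, 4, 5})
        Dim block As Byte() = ReadBlock(source, 3)
        Console.WriteLine("Got " & block.Length & " bytes")
    End Sub
End Module
```

The callback style used in the article scales better, since nothing sits waiting in EndRead(), but the pairing rule is the same in both styles: one End... call for every Begin... call.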


Ending

This was just a brief glance at asynch programming. It is a great tool that can drastically improve performance in many situations. It does raise some interesting programming hurdles and makes some things more complicated (such as requiring state objects), but it is well worth it. Just remember: always call your End... method.



Note on example

The above example would probably be quite a memory hog, as it is reading the entire file into memory at once. A better solution would be to use a smaller buffer (of perhaps 4-10 KB) and read that, then call the BeginWrite, then call another BeginRead to get the next block of data from the file, etc.
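
Here is a rough sketch of that block-by-block approach, assuming an 8 KB buffer: each finished read starts a write of just the bytes that arrived, and each finished write starts the next read, until EndRead() returns 0 at the end of the file. CopyState, BeginCopy and the callback names are invented for this illustration, and error handling is left out to keep it short.

```vbnet
Imports System
Imports System.IO
Imports System.Threading

Module ChunkedCopyDemo
    ' Hypothetical state object: one per file being copied.
    Private Class CopyState
        Public Source As FileStream
        Public Target As FileStream
        Public Block As Byte() = New Byte(8191) {}   ' 8 KB, reused for every read
        Public Done As New ManualResetEvent(False)
    End Class

    ' Starts the copy and returns immediately with a handle that is
    ' signalled once the last block has been written.
    Public Function BeginCopy(ByVal sourcePath As String, ByVal targetPath As String) As WaitHandle
        Dim state As New CopyState()
        ' The final True asks for overlapped (truly asynchronous) I/O.
        state.Source = New FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, True)
        state.Target = New FileStream(targetPath, FileMode.Create, FileAccess.Write, FileShare.None, 4096, True)
        state.Source.BeginRead(state.Block, 0, state.Block.Length, New AsyncCallback(AddressOf OnBlockRead), state)
        Return state.Done
    End Function

    Private Sub OnBlockRead(ByVal ar As IAsyncResult)
        Dim state As CopyState = CType(ar.AsyncState, CopyState)
        Dim bytesRead As Integer = state.Source.EndRead(ar)
        If bytesRead = 0 Then
            ' End of file: close both streams and signal completion.
            state.Source.Close()
            state.Target.Close()
            state.Done.Set()
        Else
            ' Write only the bytes that were actually read.
            state.Target.BeginWrite(state.Block, 0, bytesRead, New AsyncCallback(AddressOf OnBlockWritten), state)
        End If
    End Sub

    Private Sub OnBlockWritten(ByVal ar As IAsyncResult)
        Dim state As CopyState = CType(ar.AsyncState, CopyState)
        state.Target.EndWrite(ar)
        ' The block is free again, so fetch the next piece of the file.
        state.Source.BeginRead(state.Block, 0, state.Block.Length, New AsyncCallback(AddressOf OnBlockRead), state)
    End Sub

    Sub Main()
        Dim src As String = Path.GetTempFileName()
        Dim dst As String = src & ".copy"
        File.WriteAllBytes(src, New Byte(19999) {})   ' spans several 8 KB blocks
        BeginCopy(src, dst).WaitOne()                 ' waiting is for the demo only
        Console.WriteLine("Copied " & New FileInfo(dst).Length & " bytes")
    End Sub
End Module
```

With this layout the memory cost is one small block per open file instead of the whole file, at the price of more callbacks per file.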


As to the demo, it does exactly what the code posted does (with some modifications). Reading the article gives an overview of what the code does. It reads all files from the location passed to the ReadFiles function (specified by the text box txtInput), asynchronously reads them, then asynchronously writes them to the application directory. It is *not* recursive, so no fear of copying your entire computer to the directory. NOTE: If you know how to get the code to format with colors, please let me know.