CompletableFuture in I/O-bound applications

Hardware access takes a long time

Applying reactive programming to improve performance by utilizing all CPU cores became rather easy with Java 8 thanks to CompletableFuture. This is nice when an application is mostly CPU-bound, i.e. when it performs in-memory operations. But what happens in case the application is mostly I/O-bound, i.e. when it performs a lot of file processing or network access? Or what if it is CPU-bound, but a blocking call is needed nonetheless? Actually, CompletableFuture still is the answer!

The only difference is that using it becomes a bit more tricky, as there is no handy class that “magically” does some tricks behind the scenes; rather, one has to know a bit about the underlying hardware and the operating system’s internals to get things right.

To understand parallel computing in I/O-bound applications, one must understand how the hardware and the operating system handle I/O. With CPU-bound applications there is not much to know: A CPU has multiple cores, each core can serve one or multiple threads (aka “Hyperthreading” etc.), and CompletableFuture by default will utilize about as many threads as there are CPU cores. In-CPU and in-memory operations are lightning-fast and will never block. Or, to correct myself, they will neither block so long that a human might notice it, nor block for a nondeterministic amount of time. The latter is what people actually mean by the term “blocking”: The CPU has to wait, nobody knows how long in advance, so nothing can be done to prevent “losing” a CPU core.
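
As a quick check, these two lines print the number of cores and the actual parallelism of the pool that CompletableFuture uses by default:

// logical cores visible to the JVM (hyper-threads count as cores)
System.out.println(Runtime.getRuntime().availableProcessors());
// parallelism of the common ForkJoinPool behind CompletableFuture's *Async methods
// (by default one less than the number of cores, as the calling thread can join in)
System.out.println(ForkJoinPool.commonPool().getParallelism());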

For disks and network adapters, things work a bit differently. Due to the mechanical design of disk drives, these can natively handle only one single access at any time, and that access might take very, very long (several milliseconds, which in units of CPU frequency is an “endless” period). There is only one mechanical arm, which can point to only one physical location at any time. So without further “tricks” it does not make much sense to have more than one thread per disk, as there is nothing that could process the second request — its thread would simply be blocked. But, yes, there are such tricks. NCQ and I/O schedulers in the operating system, hypervisor, and other layers allow optimizing performance by interleaving outstanding requests. So in fact, more than one access can be scheduled at any time. But those queues cannot be endless, and the measurable level of optimization is rather limited. Once the optimization potential is exhausted, the thread will block again, so the real solution is to use queues instead of threads, as queue entries are much more lightweight.
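
A minimal sketch of that queue idea: a single worker thread in front of an (unbounded) queue accepts any number of outstanding requests, and each pending request costs only a small queue entry instead of a whole blocked thread (files and process are placeholders here):

ExecutorService disk = Executors.newSingleThreadExecutor();
for (Path file : files)             // hypothetical list of pending requests
  disk.submit(() -> process(file)); // queued cheaply, executed one at a time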

Don’t block the common pool!

As CompletableFuture by default is using one single, shared thread pool under the hood, blocking threads belonging to that pool is not a good idea: If all of them are blocked, the application will simply not respond anymore. This is what one really wants to prevent at all costs!
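
A small, intentionally bad sketch shows the effect: it occupies every common-pool thread with a blocking call, after which even the most trivial task on the common pool has to wait (blockingCall stands for any blocking operation):

// DON'T DO THIS: saturates the common pool with blocking calls
for (int i = 0; i < ForkJoinPool.commonPool().getParallelism(); i++)
  CompletableFuture.runAsync(App::blockingCall);
// this innocent task cannot even start until one of the blocked threads is freed:
CompletableFuture.runAsync(() -> System.out.println("starved"));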

The solution is to split pools into (at least) the common pool for CPU-bound actions and a pool for (at least potentially) blocking operations. Judging whether an operation could block is simple: If it is using I/O (i.e. any kind of hardware other than CPU and RAM), it is to be considered blocking. I assume any professional programmer is able to judge whether an operation executes solely in-CPU or in-memory.

Separate I/O pool

The pattern to split pools is described by the following example, which takes a file name from JavaFX, loads the file from disk, then sends the file content via the network, and reports back to JavaFX that the work is done. JavaFX is chosen simply to show that there might be even more special threads to take care of, besides hardware device access.

void demo() {
  CompletableFuture.supplyAsync(App::getFileName, JAVA_FX)        // ask the UI for the file name
                   .thenApplyAsync(App::readDiskBlocking, DISK_1) // blocking read on the disk's own thread
                   .thenAcceptAsync(App::writeLanBlocking, LAN_1) // blocking send on the network's own thread
                   .thenRunAsync(App::notifyUser, JAVA_FX);       // back to the UI thread
}

static Executor DISK_1 = Executors.newSingleThreadExecutor(r -> new Thread(r, "DISK_1"));

static Executor LAN_1 = Executors.newSingleThreadExecutor(r -> new Thread(r, "LAN_1"));

static Executor JAVA_FX = r -> {
  if (Platform.isFxApplicationThread())
    r.run();              // already on the FX thread, so run directly
  else
    Platform.runLater(r); // otherwise hop onto the FX thread
};

The ThreadFactory is provided simply to have thread names assigned in case a stack trace is to be examined later, and has nothing to do with the solution.

The executors are static because there won’t be more hardware just because one likes to start more application instances. So the number of threads is bound to the number of devices, not the number of application instances.

This example is even more sophisticated than strictly needed, as it correctly assumes that not all I/O goes through one single queue: In fact, disk and network can be accessed at the same time, so having different threads to access them makes sense. And there can be more than one disk or network adapter. So the key to parallel execution of I/O-bound tasks is having multiple hardware devices, and one pool per device (again, nobody wants all I/O threads to be blocked by one device, so a shared I/O pool should not be used).

But there are even more optimizations. In reality, one single thread per hardware device is often much too little. As said earlier, hardware can interleave multiple requests, so it makes sense to increase the pool size. Not only can a disk use NCQ, it can answer reads from the cache while the head is moving to the next write location. And a network card certainly can perform full-duplex operation, i.e. read and write at the same time. Also, a disk is not necessarily just one disk: Think of RAID systems with clever controllers, or SANs! In fact, there is no universally optimal number, so many professional applications allow the end user to fine-tune the level of parallelism. What to take from here is: One thread per device is just the lower limit. Some professional applications even have multiple thread pools per device, e.g. to tune the number of reader threads and the number of writer threads separately, as sketched below.
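
A sketch of such tuning, assuming hypothetical system properties (disk1.readers, disk1.writers) that the end user may set, and separate reader and writer pools for the same device:

static Executor DISK_1_READ = Executors.newFixedThreadPool(
    Integer.getInteger("disk1.readers", 2),                // user-tunable, default 2
    r -> new Thread(r, "DISK_1_READ"));                    // a counter would make names unique

static Executor DISK_1_WRITE = Executors.newFixedThreadPool(
    Integer.getInteger("disk1.writers", 1),                // user-tunable, default 1
    r -> new Thread(r, "DISK_1_WRITE"));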

Asynchronous I/O

But there are more ways to deal with hardware than just separating blocking from non-blocking pools. Modern operating systems are in fact able to deal with the problem of blocking calls themselves, and possibly in a much better (read: faster and more efficient) way than an application ever could, as the OS and the drivers know much better how a piece of hardware behaves and have less overhead per task than a Java program. So to get even more performance out of the same hardware, asynchronous I/O is the way to go.

At its core, it is like CompletableFuture on the OS level: It allows asking the OS to perform actions in the background, deal with all the described details on its own, and tell us when it is done. Ain’t that cool? And it has been around for many years in the NIO package! As NIO will not block, there is no more need for separate thread pools, but the code gets more complex (there is no free lunch):

void demo() {
  CompletableFuture.supplyAsync(App::getFileName, JAVA_FX)
                   .thenComposeAsync(App::readDiskNonBlocking) // merely initiates I/O, so no dedicated pool needed
                   .thenComposeAsync(App::writeLanNonBlocking)
                   .thenRunAsync(App::notifyUser, JAVA_FX);
}

static CompletableFuture<String> readDiskNonBlocking(Path fileName) {
  CompletableFuture<String> cf = new CompletableFuture<>();
  try {
    AsynchronousFileChannel ch = AsynchronousFileChannel.open(fileName);
    ByteBuffer b = ByteBuffer.allocate(...);
    ch.read(b, 0L, cf, new CompletionHandler<Integer, CompletableFuture<String>>() {
      @Override public void completed(Integer l, CompletableFuture<String> a) {
        a.complete(...);
      }

      @Override public void failed(Throwable t, final CompletableFuture<String> a) {
        a.completeExceptionally(t);
      }
    });
  } catch (IOException e) {
    cf.completeExceptionally(e); // report the synchronous failure through the same future
  }
  return cf;
}

static CompletableFuture<Void> writeLanNonBlocking(String fileContent) {
  CompletableFuture<Void> cf = new CompletableFuture<>();
  ...
  return cf;
}

To reduce code size in this article, the actual details are left out unless needed for understanding the key concept. Asynchronous socket channels are used in a similar fashion to asynchronous file channels.
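
For the sake of completeness, here is a sketch of what writeLanNonBlocking could look like on top of AsynchronousSocketChannel (host, port, and the single-shot write are assumptions of this sketch; a real implementation would loop until the buffer is fully drained and close the channel afterwards):

static CompletableFuture<Void> writeLanNonBlocking(String fileContent) {
  CompletableFuture<Void> cf = new CompletableFuture<>();
  try {
    AsynchronousSocketChannel ch = AsynchronousSocketChannel.open();
    ch.connect(new InetSocketAddress("example.com", 4711), cf, // hypothetical target
        new CompletionHandler<Void, CompletableFuture<Void>>() {
      @Override public void completed(Void v, CompletableFuture<Void> a) {
        ByteBuffer b = ByteBuffer.wrap(fileContent.getBytes(StandardCharsets.UTF_8));
        ch.write(b, a, new CompletionHandler<Integer, CompletableFuture<Void>>() {
          @Override public void completed(Integer n, CompletableFuture<Void> a) {
            a.complete(null); // simplification: assumes one write drained the buffer
          }
          @Override public void failed(Throwable t, CompletableFuture<Void> a) {
            a.completeExceptionally(t);
          }
        });
      }
      @Override public void failed(Throwable t, CompletableFuture<Void> a) {
        a.completeExceptionally(t);
      }
    });
  } catch (IOException e) {
    cf.completeExceptionally(e);
  }
  return cf;
}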

While this code apparently is more complex to write due to the needed read and write functions (which can be reused, certainly), it will not only perform faster and consume fewer resources, but also be easier to maintain, as there are no more pools to manage and tune.

Conclusion

Time-consuming I/O operations are an essential part of many applications, and it needs some tricks to integrate them with the reactive programming API provided by CompletableFuture. Nevertheless, with a bit of understanding of how the hardware works and how to manage threads, it is pretty straightforward to mix blocking operations with CompletableFuture’s internal thread pool without rendering the complete application unresponsive. Even more performance at less tuning cost can be gained by using NIO instead of blocking calls.
