CompletableFuture in I/O-bound applications

Hardware access takes a long time

Applying reactive programming to improve performance by utilizing all CPU cores became rather easy with Java 8 thanks to CompletableFuture. This works nicely when an application is mostly CPU-bound, i.e. when it performs in-memory operations. But what happens when the application is mostly I/O-bound, i.e. when it performs a lot of file processing or network access? Or when it is CPU-bound but a blocking call is needed nevertheless? CompletableFuture still is the answer!

The only difference is that using it becomes a bit more tricky, as there is no handy class that “magically” does some tricks behind the scenes; instead, one has to know a bit about the underlying hardware and the operating system’s internals to get things right.

To understand parallel computing in I/O-bound applications, one first has to understand how the hardware and the operating system handle I/O. With CPU-bound applications there is not much to know: A CPU has multiple cores, each core can serve one or more threads (“Hyper-Threading” etc.), and CompletableFuture by default will utilize as many queues and threads as there are CPU cores. In-CPU and in-memory operations are lightning-fast and will never block. Or, to be precise, they will neither block long enough for a human to notice, nor block for a non-deterministic amount of time. The latter is what people actually mean by the term “blocking”: The CPU has to wait, nobody knows in advance for how long, and so nothing can be done to prevent “losing” a CPU core.
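This default behavior is easy to verify: All the *Async variants that take no explicit Executor are dispatched to the common ForkJoinPool, which is sized after the machine’s core count. A minimal sketch (output values depend on the machine it runs on):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;

public class CommonPoolDemo {
    public static void main(String[] args) {
        // *Async methods without an explicit Executor run on the common
        // ForkJoinPool, whose parallelism is derived from the CPU core count.
        System.out.println("cores: " + Runtime.getRuntime().availableProcessors());
        System.out.println("parallelism: " + ForkJoinPool.getCommonPoolParallelism());

        // The task below therefore executes on a common-pool worker thread
        // (unless parallelism is below 2, in which case a fresh thread is used).
        CompletableFuture
            .supplyAsync(() -> Thread.currentThread().getName())
            .thenAccept(name -> System.out.println("ran on: " + name))
            .join();
    }
}
```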

For disks and network adapters, things work a bit differently. Due to the mechanical design of disk drives, these can natively handle only one single access at any time, and that access might take very, very long (several milliseconds, which in units of CPU frequency is an “endless” period). There is only one mechanical arm, which can point to only one physical location at a time. So without further “tricks” it makes little sense to have more than one thread per disk, as there is nothing that could process it; the thread would simply be blocked. But, yes, there are such tricks. NCQ and the I/O schedulers in the operating system, hypervisor, and other layers optimize performance by interleaving outstanding requests, so in fact more than one access can be scheduled at any time. But those queues are not endless, and the measurable level of optimization is rather limited. Once the optimization potential is exhausted, the thread will block again, so the real solution is to use queues instead of threads, as queues are much more lightweight.

Don’t block the common pool!

As CompletableFuture by default uses one single, shared thread pool under the hood, blocking threads belonging to that pool is not a good idea: If all of them are blocked, the application will simply not respond anymore. This is what one really wants to prevent at all costs!
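A deterministic miniature of the problem, using a single-thread pool as a stand-in for the common pool (the pool size and the sleep duration are arbitrary choices for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockedPoolDemo {
    public static void main(String[] args) {
        // A one-thread pool stands in for the common pool here, to make the
        // effect deterministic: one blocked worker stalls every other task.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        long start = System.nanoTime();

        CompletableFuture.runAsync(() -> {
            try { Thread.sleep(200); } // simulated blocking I/O call
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, pool);

        // This "fast" task cannot even start before the blocker above
        // releases the pool's only worker thread.
        long waitedMs = CompletableFuture
            .supplyAsync(() -> (System.nanoTime() - start) / 1_000_000, pool)
            .join();
        System.out.println("fast task started after " + waitedMs + " ms");
        pool.shutdown();
    }
}
```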

The solution is to split pools into (at least) the common pool for CPU-bound actions and a pool for (at least potentially) blocking operations. Judging whether an operation could block is simple: If it uses I/O (i.e. any kind of hardware other than CPU and RAM), it is to be considered blocking. I assume any professional programmer is able to judge whether an operation executes solely in-CPU or in-memory.

Separate I/O pool

The pattern to split pools is demonstrated by the following example, which takes a file name from JavaFX, loads the file from disk, then sends the file content via the network, and reports back to JavaFX that the work is done. JavaFX is chosen simply to show that there might be even more special threads to take care of besides hardware device access.

void demo() {
  CompletableFuture.supplyAsync(App::getFileName, JAVA_FX)
                   .thenApplyAsync(App::readDiskBlocking, DISK_1)
                   .thenAcceptAsync(App::writeLanBlocking, LAN_1)
                   .thenRunAsync(App::notifyUser, JAVA_FX);
}

static Executor DISK_1 = Executors.newSingleThreadExecutor(r -> new Thread(r, "DISK_1"));

static Executor LAN_1 = Executors.newSingleThreadExecutor(r -> new Thread(r, "LAN_1"));

static Executor JAVA_FX = r -> {
  if (Platform.isFxApplicationThread())
    r.run();
  else
    Platform.runLater(r);
};

The ThreadFactory is provided simply to have thread names assigned in case a stack trace has to be examined later; it has nothing to do with the solution itself.

The executors are static because there won’t be more hardware just because one starts more application instances. So the number of threads is bound to the number of devices, not to the number of application instances.

This example is even more sophisticated than strictly needed, as it correctly assumes that not all I/O goes through one single queue: In fact, disk and network can be accessed at the same time, so having different threads to access them makes sense. And there can be more than one disk or network adapter. So the key to parallel execution of I/O-bound tasks is having multiple hardware devices, and one pool per device (again, nobody wants all I/O threads to be blocked by one device, so a shared I/O pool should not be used).

But there are even more optimizations. In reality, one single thread per hardware device is often much too little. As said earlier, hardware can interleave multiple requests, so it makes sense to increase the pool size. Not only can a disk use NCQ, it can also answer reads from its cache while the hardware is spinning to the next write location. And a network card can certainly perform full-duplex operation, i.e. read and write at the same time. Also, a disk is not necessarily just one disk: Think of RAID systems with clever controllers, or SANs! In fact, there is no optimum number, so many professional applications allow the end user to fine-tune the level of parallelism. What to take from here is: One thread per device is just the lower limit. Some professional applications even have multiple thread pools per device, e.g. to tune the number of reader threads and the number of writer threads independently.
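Separate reader and writer pools for one disk could be sketched like this; the names and the pool sizes of 4 and 2 are purely illustrative placeholders, not recommendations (as said, the optimum depends on the device and is often left to end-user tuning):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class DiskPools {
    // Creates a named, fixed-size pool; the names only ease stack-trace reading.
    private static ExecutorService pool(String name, int size) {
        AtomicInteger n = new AtomicInteger();
        return Executors.newFixedThreadPool(size,
                r -> new Thread(r, name + "-" + n.incrementAndGet()));
    }

    // Hypothetical sizes: tune per device (NCQ depth, cache, RAID level, ...).
    static final ExecutorService DISK_1_READERS = pool("DISK_1-reader", 4);
    static final ExecutorService DISK_1_WRITERS = pool("DISK_1-writer", 2);

    public static void main(String[] args) {
        CompletableFuture
            .supplyAsync(() -> Thread.currentThread().getName(), DISK_1_READERS)
            .thenAcceptAsync(reader -> System.out.println(reader + " read, "
                    + Thread.currentThread().getName() + " wrote"), DISK_1_WRITERS)
            .join();
        DISK_1_READERS.shutdown();
        DISK_1_WRITERS.shutdown();
    }
}
```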

Asynchronous I/O

But there are more ways to deal with hardware than just separating blocking from non-blocking pools. Modern operating systems are in fact able to deal with the problem of blocking calls themselves, and possibly in a much better (read: faster and more efficient) way than an application ever could, as the OS and the drivers know much better how a piece of hardware behaves and have less overhead per task than a Java program. So to get even more performance out of the same hardware, asynchronous I/O is the way to go.

At its core, it is like CompletableFuture on the OS level: It allows us to ask the OS to perform actions in the background, deal with all the described details on its own, and tell us when it is done. Ain’t that cool? And it has been around for years in the NIO.2 package (since Java 7)! As NIO.2 will not block, there is no more need for separate thread pools, but the code gets more complex (there is no free lunch):

void demo() {
  CompletableFuture.supplyAsync(App::getFileName, JAVA_FX)
                   .thenComposeAsync(App::readDiskNonBlocking)
                   .thenComposeAsync(App::writeLanNonBlocking)
                   .thenRunAsync(App::notifyUser, JAVA_FX);
}

static CompletableFuture<String> readDiskNonBlocking(Path fileName) {
  CompletableFuture<String> cf = new CompletableFuture<>();
  try {
    AsynchronousFileChannel ch = AsynchronousFileChannel.open(fileName);
    ByteBuffer b = ByteBuffer.allocate(...);
    ch.read(b, 0L, cf, new CompletionHandler<Integer, CompletableFuture<String>>() {
      @Override public void completed(Integer l, CompletableFuture<String> a) {
        a.complete(...);
      }

      @Override public void failed(Throwable t, final CompletableFuture<String> a) {
        a.completeExceptionally(t);
      }
    });
  } catch (IOException e) {
    cf.completeExceptionally(e);
  }
  return cf;
}

static CompletableFuture<Void> writeLanNonBlocking(String fileContent) {
  CompletableFuture<Void> cf = new CompletableFuture<>();
  ...
  return cf;
}

To reduce code size in this article, the actual details are left out unless needed for understanding the key concept. Asynchronous socket channels are used in a similar fashion to asynchronous file channels.
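The elided writeLanNonBlocking follows the same pattern as readDiskNonBlocking. A sketch, where everything concrete (the channel parameter, the write loop, the loopback plumbing in main that makes it runnable end to end) is an assumption for illustration, not the original implementation:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class AsyncLanDemo {

    // Hypothetical sketch of writeLanNonBlocking: the channel parameter is an
    // assumption, as the article elides how the connection is obtained.
    static CompletableFuture<Void> writeLanNonBlocking(
            AsynchronousSocketChannel ch, String fileContent) {
        CompletableFuture<Void> cf = new CompletableFuture<>();
        ByteBuffer b = ByteBuffer.wrap(fileContent.getBytes(StandardCharsets.UTF_8));
        ch.write(b, cf, new CompletionHandler<Integer, CompletableFuture<Void>>() {
            @Override public void completed(Integer bytesWritten, CompletableFuture<Void> a) {
                if (b.hasRemaining())
                    ch.write(b, a, this); // keep writing until the buffer is drained
                else
                    a.complete(null);
            }
            @Override public void failed(Throwable t, CompletableFuture<Void> a) {
                a.completeExceptionally(t);
            }
        });
        return cf;
    }

    public static void main(String[] args) throws Exception {
        // Loopback plumbing just to make the sketch runnable end to end.
        AsynchronousServerSocketChannel server = AsynchronousServerSocketChannel
                .open().bind(new InetSocketAddress("127.0.0.1", 0));
        AsynchronousSocketChannel client = AsynchronousSocketChannel.open();
        client.connect(server.getLocalAddress()).get();
        AsynchronousSocketChannel peer = server.accept().get();

        writeLanNonBlocking(client, "hello").join();

        ByteBuffer in = ByteBuffer.allocate(16);
        peer.read(in).get();
        in.flip();
        System.out.println(StandardCharsets.UTF_8.decode(in));

        client.close();
        peer.close();
        server.close();
    }
}
```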

While this code apparently is more complex to write due to the needed read and write functions (which can be reused, certainly), it will not only perform faster and consume fewer resources, but also be easier to maintain, as there are no more pools to manage and tune.

Conclusion

Time-consuming I/O operations are an essential part of many applications and need some tricks to align them with the reactive programming API provided by CompletableFuture. Nevertheless, with a bit of understanding of how the hardware works and how to manage threads, it is pretty straightforward to mix blocking operations with CompletableFuture’s internal thread pool without rendering the complete application unresponsive. Even more performance at lower tuning cost can be gained by using NIO.2 instead of blocking calls.


Impressions from JavaForum Stuttgart 2015

Some impressions from my talk at JavaForum Stuttgart 2015 “JAX-RS 2.1 New Features”.

Markus KARG at #jsf2015


My first contribution to a W3C standard: The default-language() XPath / XQuery function

I felt honoured and happy when I was informed this week by XQuery co-inventor Jonathan ROBIE that my proposal https://www.w3.org/Bugs/Public/show_bug.cgi?id=28850 was accepted by the XPath/XQuery working group at the W3C!

As XSLT spec lead Michael KAY pointed out, it is rather unlikely that new functionality gets added while a new version of a standard is already in the Candidate Recommendation phase (as the spec’s v3.1 already is, http://www.w3.org/TR/xpath-functions-31/), and it is even more unlikely that a new proposal is accepted within few weeks. So I am totally happy that both happened in the case of my proposal: I filed it at the end of June, and it was agreed upon by the working group in mid-July, in the midst of the Candidate Recommendation phase. You may well say that I am really proud and happy to have achieved this.

So you can soon expect XSLT transformers like SAXON to respect the local language when applying XPath formatting functions like format-dateTime (http://www.w3.org/TR/xpath-functions-30/#func-format-dateTime). And thanks to the new default-language() function I have proposed to the W3C, you will even be able to write your own functions doing so.

For example, suppose you want to write an XPath function that formats phone numbers according to the locale in which a report is run: In Germany you would expect to get (00 49 12 34) 56 78 90, while an international format would look like +49 1234 567890. So how to get the current locale? The answer is what I have proposed and what now will become an international cross-vendor standard: default-language().

The default-language() function will return a string which is never null but always a valid language token (such as the format-dateTime() function would consume). It comes in handy in any kind of custom function that is language-dependent by nature, so writing functions like format-phone() will now be possible and pretty easy.


JVM Callbacks for hardware state events

Strange but true, a Java program cannot detect the situation where it is still running but the laptop is going to sleep because the user closed the lid. Neither can a full-blown Java application server like GlassFish detect a power loss event reported by the UPS and safely inform its clients or sibling cluster nodes that it will be away soon.

This is ridiculous, as the operating system knows all these events very well, and native APIs to inform applications about state changes have existed for decades. I wonder why nobody ever asked for such an API on the Java platform?

Anyway, if you share my impression that it would be beneficial for an application to detect such state changes, then please vote for my proposal on OpenJDK’s JIRA.


#jsf2015 slides online “JAX-RS 2.1 New Features”

You can find the slides of my “JAX-RS 2.1 New Features” presentation at JavaForum Stuttgart 2015 at Speaker Deck.


Contributed Binary Transmission for setObject(int, Object, int, int)

Cool, my latest contribution to PGJDBC on GitHub was accepted today, so the next release will provide better performance for the setObject(int, Object, int) and setObject(int, Object, int, int) methods, as the values are now transferred as binary values and not as String literals anymore! Contributing to such an agile project is really fun! :-)


Fluent APIs are 1% faster

Using fluent APIs without a guilty conscience

Fluent APIs are pretty common these days and make our lives so much easier and the code far better to read. Even the JDK came up with some first fluent APIs many years ago, and we have all become used to them in the meantime, like e.g. the well-known StringBuilder API.

Today I came up with the crazy idea of measuring whether using a fluent API instead of a non-fluent one has any impact on performance. So I hacked together a short benchmark using Oracle’s Java Microbenchmark Harness (JMH) and gave it a shot.

The result is that fluent APIs actually run about 1% faster than non-fluent APIs. Also, the binary code is a bit smaller, hence fewer bytes have to be transferred, stored, loaded, etc. This might not sound dramatic, but hey, you not only get it for free, you even get it at the “cost” of better readability and shorter code.

Certainly my benchmark is anything but flawless, and the conclusion is not really statistically firm, as I measured more code than just the change from non-fluent to fluent API, and as the statistical error is about the same size as the measured difference. But hey, does anybody really conclude from these flaws that using a fluent API (hence sparing code) would not have any positive effect? ;-)

If my target were massive performance improvements, this would certainly not be the #1 item to fix. For me the conclusion is that if there is a fluent API, I will simply use it by default, unless some really good reasons conflict with that, as it makes the source easier to read and the byte code shorter, has no negative side effects, and does not make the program slower.

Fluent API Benchmark Setup

Non-Fluent API

public class MyBenchmark {
  @Benchmark public void testMethod() {
    StringBuilder sb = new StringBuilder();
    sb.append("a");
    sb.append("b");
    sb.append("c");
  }
}

The appending part looks like this in byte code:

aload_1
ldc #4
invokevirtual #5
pop
aload_1
ldc #6
invokevirtual #5
pop
aload_1
ldc #7
invokevirtual #5
pop

The class file has a size of 652 bytes.

JMH says the benchmark measured 38,250,987.105 operations per second.

Fluent API

public class MyBenchmark {
  @Benchmark public void testMethod() {
    StringBuilder sb = new StringBuilder();
    sb.append("a").append("b").append("c");
  }
}

The interesting section is considerably shorter now in byte code, as no more aload_1 and pop instructions are found for the subsequent (fluent) invocations:

aload_1
ldc #4
invokevirtual #5
ldc #6
invokevirtual #5
ldc #7
invokevirtual #5

The class file has a size of 640 bytes, which is 12 bytes (1.84%) smaller.

JMH says this version yields 38,623,043.555 operations per second, which is 372,056.450 ops/s faster (0.973%).
