I have always been interested in concurrent programming through the use of threads. This inclination did not originate from a need to exploit hardware-level parallelism for performance reasons, but rather from my belief that threads establish a model of reasoning that provides a more convenient and natural way of encapsulating concurrency in the overall program flow. I cannot express the relief I felt when I first ported an asynchronous piece of networking code to PThreads. But my joy did not last long – context switching, memory, and cache miss costs, and more importantly the physical constraints that bound the number of threads, were waiting for me at the end of the tunnel. Luckily, it did not take much time to discover the notion of light-weight processes, which later introduced me to Protothreads. That being said, I should admit that I fell back to my single-threaded asynchronous event loop many times rather than maintaining a Protothreads-flavored PThreads spaghetti. I suppose every programmer has once dreamed about having light-weight threads that map to a multitude of physical CPU cores via some sort of mechanism in the background; the many similar efforts in the literature (Plan 9 rfork, Solaris LWP, NetBSD lwp_create) back this claim.
The heart of the problem lies in the question of mapping light-weight threads to physical processor cores, given that the operating system has no prior knowledge of what is going on at the application level. As an inevitable consequence of this problem, developers came up with the idea of providing application-level constructs to hint the underlying scheduling mechanism about the concurrency operating at the application level. Among many other attempts, coroutines and fibers are the two well-known examples that emanated from this research. Quoting from Wikipedia,
Fibers describe essentially the same concept as coroutines. The distinction, if there is any, is that coroutines are a language-level construct, a form of control flow, while fibers are a systems-level construct, viewed as threads that happen not to run concurrently. Priority is contentious; fibers may be viewed as an implementation of coroutines, or as a substrate on which to implement coroutines.
If I were to restate the given description from a more naive perspective targeting the Java Virtual Machine, coroutines are language-level constructs that form a way of cooperatively executing multiple tasks, whereas fibers are light-weight threads scheduled preemptively by the virtual machine. There is a set of coroutine implementations available for the JVM, but in this post I will stick with two particular fiber implementations: Akka from Typesafe and Quasar from Parallel Universe.
Last year I had plenty of time to play around with Akka during the Principles of Reactive Programming course. Claiming that Akka is just a toolkit to realize fibers on the JVM would be a fairly incomplete definition – it is much more than that: designed for resiliency from top to bottom, a gigantic toolbox of common messaging pattern shortcuts, adaptive load balancing, routing, partitioning, configuration-driven remoting, and so on, just to name a few. Additionally, one should note that it is battle-tested with real-world workloads, both as a standalone application and as a component of the Typesafe Platform.
On the other hand, Quasar is a relatively new player in this league. That being said, it is implemented as a compact library with – in my opinion – a fairly sufficient set of features, though it still has a long way to go to catch up with Akka in terms of out-of-the-box functionality. In addition, while Akka only provides actor abstractions at the programming language level, Quasar ships both actors and fibers. (Actually, Quasar actors are built on top of fibers.) Quoting from Quasar core developer Ron Pressler’s post:
Quasar and Akka provide completely different abstractions. Quasar provides fibers while Akka provides an asynchronous-programming framework based on callbacks. Both libraries then implement actors using their respective abstractions. […] fibers are lightweight threads, which means that their most important property – even before being light-weight – is being threads. The thread abstraction is a series of sequential computations with the ability to block – on IO or synchronization. Callbacks, on the other hand, are a completely different abstraction. Akka’s callbacks (and therefore actors), are not allowed to block, and therefore do not provide a thread abstraction; hence they’re not fibers.
Quasar’s main design goal was the desire to make the simple, familiar, threading abstraction (i.e., synchronous, blocking code) cheap, as kernel threads are expensive, and blocking a kernel thread carries a lot of overhead. Quasar is used to run servlets that serve hundreds of thousands of concurrent requests – instead of mere thousands – all without changing your existing Java code. We want developers to get the full performance benefits of asynchronous code – which Akka also offers – while keeping their APIs and synchronous programming model – which Akka certainly doesn’t. Quasar makes threading cheap, while Akka abandons the thread abstraction altogether. In that respect, Quasar is a lot more similar to Erlang and Go (which, of course, provided the inspiration to Quasar), which also use the thread abstraction but provide a lightweight thread implementation.
Akka’s design goals are different, and we think Quasar is far simpler to use than Akka, requires less mental overhead and far-less re-engineering, while at the same time being more feature-rich.
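To make the blocking-thread abstraction described above a bit more concrete, here is a minimal Quasar sketch of a fiber parking on a channel receive. This is not code from fiber-test; the class and variable names are mine, and it assumes the Quasar java agent is on the JVM command line (`-javaagent:quasar-core.jar`) so that suspendable methods get instrumented.

```java
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.channels.Channel;
import co.paralleluniverse.strands.channels.Channels;

public final class FiberBlockingSketch {

    public static void main(String[] args) throws Exception {
        // Mailbox size 0 yields a rendezvous-style channel: sender and receiver meet.
        final Channel<Integer> channel = Channels.newChannel(0);

        // The receiver looks like ordinary blocking code, but only the fiber is
        // parked; the underlying kernel thread stays free to run other fibers.
        Fiber<Void> receiver = new Fiber<Void>(() -> {
            Integer message = channel.receive();
            System.out.println("received " + message);
        }).start();

        Fiber<Void> sender = new Fiber<Void>(() -> {
            channel.send(42);
        }).start();

        receiver.join();
        sender.join();
    }
}
```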
In order to get an idea of how the two implementations compare to each other in terms of performance, I put together a small benchmark suite (fiber-test) that employs both Akka (2.3.5) and Quasar (0.6.0) for the well-known thread-ring problem. (A given number of threads are spawned and connected in a ring structure. Through this ring a message – an integer comprising 4 bytes – is circulated for a given number of message passings.) In addition, I included a version of the benchmark that uses plain Java Threads to establish a baseline.
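For reference, the ring looks roughly like the following when implemented with plain Java threads. This is a bare-bones sketch rather than the actual fiber-test code: the JMH harness is omitted, names are illustrative, and the message count is scaled down for a quick local run.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class ThreadRingSketch {

    public static void main(String[] args) throws InterruptedException {
        final int workerCount = 503;
        final int messagePassings = 1_000_000;  // the benchmark circulates the message 1e7 times

        // One single-slot inbox per worker; worker i forwards to worker (i + 1) % workerCount.
        @SuppressWarnings("unchecked")
        final BlockingQueue<Integer>[] inboxes = new BlockingQueue[workerCount];
        for (int i = 0; i < workerCount; i++) {
            inboxes[i] = new ArrayBlockingQueue<>(1);
        }

        final Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            final BlockingQueue<Integer> in = inboxes[i];
            final BlockingQueue<Integer> out = inboxes[(i + 1) % workerCount];
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        int hopsLeft = in.take();      // receive the message
                        out.put(hopsLeft - 1);         // forward the decremented counter
                        if (hopsLeft <= 0) {
                            return;                    // stop token forwarded, shut down
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        final long startNanos = System.nanoTime();
        inboxes[0].put(messagePassings);               // inject the message into the ring
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.printf("ring completed in %d ms%n",
                (System.nanoTime() - startNanos) / 1_000_000);
    }
}
```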
In fiber-test, you will observe that `benchmark.sh` employs `taskset` to bind the JVM to a given set of CPUs on the system. However, for a small number (1-2) of CPUs, Quasar freezes randomly and Java Threads become totally unresponsive. Hence, I dropped `taskset` for the figures presented here.
Tests are deployed on an Ubuntu GNU/Linux 14.04 (LTS) system running on a 6-physical-core Intel(R) Xeon(R) E5645 2.40GHz processor, where a 64-bit Java HotSpot VM (build 1.8.0_20-ea-b05) is used. JMH is set to run 5 warmup and 5 measurement iterations using 3 JVM forks. There are 503 threads in each benchmark and 1e7 message passings. Results are in milliseconds per benchmark.
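Roughly, the JMH setup described above boils down to annotations along the following lines. The benchmark mode and the class/method names are my assumptions for illustration, not necessarily what fiber-test uses.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// 5 warmup and 5 measurement iterations across 3 JVM forks, reported in milliseconds.
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(3)
public class ThreadRingJmhSketch {

    @Benchmark
    public void ring() throws InterruptedException {
        // Spawn 503 threads (or fibers/actors) connected in a ring and circulate
        // the message 1e7 times, as in the plain-thread sketch earlier.
    }
}
```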
| Fiber Impl. | Min. | Speedup (vs. Java Threads) | Avg. | Max. | Std. Dev. | 99.9% Confidence Interval |
|---|---|---|---|---|---|---|
| Quasar Fibers | 695.613 | 29x | 721.457 | 758.231 | 20.796 | [699.226, 743.689] |
| Akka Actors | 856.807 | 23x | 911.295 | 963.403 | 34.345 | [874.578, 948.012] |
| Quasar Actors | 1553.224 | 13x | 1606.718 | 1660.329 | 35.756 | [1568.492, 1644.943] |
| Java Threads | 15730.028 | 1x | 20709.233 | 33084.117 | 4338.423 | [16071.196, 25347.270] |
The results indicate that even in the worst case (which corresponds to Quasar Actors in the figures), the fiber implementations on average perform 13 times better than native Java threads. In addition, Akka Actors and Quasar Fibers push this figure up to 23x and 29x, respectively. All that being said, I could not explain the performance difference between Quasar Fibers and Quasar Actors, given that actors are implemented on top of fibers in Quasar. In order to investigate further, I profiled the benchmarks using `-agentlib:hprof=cpu=samples,interval=20,depth=3`. (See the HPROF documentation for further information on the parameters.)
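One way to attach the HPROF agent to the forked benchmark JVMs is through JMH's `@Fork` annotation, as sketched below; whether fiber-test passes the flag this way or directly on the java command line is my assumption, and the class and method names are illustrative.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Append the HPROF agent options to the JVM arguments of every benchmark fork.
@Fork(value = 3, jvmArgsAppend = "-agentlib:hprof=cpu=samples,interval=20,depth=3")
public class ProfiledRingSketch {

    @Benchmark
    public void ring() {
        // run one round of the ring benchmark here
    }
}
```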
| rank | self | accum | count | trace | method |
|---|---|---|---|---|---|
| **Quasar Fibers** | | | | | |
| 1 | 47.55% | 47.55% | 67286 | 300366 | co.paralleluniverse.fibers.Fiber.run1 |
| 2 | 47.48% | 95.04% | 67186 | 300364 | com.github.vy.fibertest.QuasarFiberRingBenchmark$InternalFiber.run |
| 3 | 3.37% | 98.40% | 4765 | 300372 | com.github.vy.fibertest.QuasarFiberRingBenchmark$InternalFiber.run |
| **Quasar Actors** | | | | | |
| 1 | 22.17% | 22.17% | 77787 | 300526 | co.paralleluniverse.fibers.Fiber.park1 |
| 2 | 18.53% | 40.69% | 65011 | 300508 | co.paralleluniverse.fibers.Fiber.run |
| 3 | 18.32% | 59.01% | 64292 | 300527 | co.paralleluniverse.strands.channels.SingleConsumerQueueChannel.receive |
| 4 | 18.27% | 77.28% | 64114 | 300517 | com.github.vy.fibertest.QuasarActorRingBenchmark$InternalActor.doRun |
| 5 | 17.39% | 94.67% | 61009 | 300528 | co.paralleluniverse.actors.ActorRunner.run |
| **Akka Actors** | | | | | |
| 1 | 64.09% | 64.09% | 5158 | 300307 | scala.concurrent.forkjoin.ForkJoinPool.runWorker |
| 2 | 12.51% | 76.60% | 1007 | 300298 | sun.misc.Unsafe.park |
| 3 | 10.69% | 87.29% | 860 | 300309 | sun.misc.Unsafe.unpark |
| 4 | 2.65% | 89.94% | 213 | 300300 | scala.concurrent.forkjoin.ForkJoinPool.scan |
| 5 | 1.12% | 91.05% | 90 | 300316 | akka.dispatch.Dispatcher.dispatch |
| **Java Threads** | | | | | |
| 1 | 94.33% | 94.33% | 10373 | 300057 | sun.misc.Unsafe.unpark |
| 2 | 5.59% | 99.93% | 615 | 300058 | sun.misc.Unsafe.park |
As anticipated, Java Threads were bitten by the `park`/`unpark` methods in `sun.misc.Unsafe`. Akka Actors and Quasar Fibers, on the other hand, perform some sort of interleaving between routines to avoid these calls, and that totally rocks. Unfortunately, it turns out that the channel mechanisms (e.g., `co.paralleluniverse.strands.channels.SingleConsumerQueueChannel`) in Quasar Actors somehow trigger the Quasar core to employ `park` calls between the message receivers, and that consumes the majority of the fiber runtime.
Observing that fiber implementations perform far better than standard Java Threads is nothing new. However, seeing Quasar Fibers deliver such impressive performance is not something I was expecting, considering that the project is in its very early stages. That being said, it appears that the channels in Quasar Actors have some room for improvement and might borrow a few ideas from their Akka counterparts. Overall, both fiber implementations show stable and noteworthy performance figures and stand as a solid replacement for standard Java Threads in a considerable number of use cases.