I have always been interested in concurrent programming through the use of threads. This inclination did not originate from a need to exploit hardware-level parallelism for performance reasons, but rather from my belief that threads establish a model of reasoning that provides a more convenient and natural way of encapsulating concurrency in the overall program flow. I cannot express the relief I felt when I first ported an asynchronous piece of networking code to PThreads. But my joy did not last long – context switching, memory, and cache miss costs, and more importantly the physical constraints that bound the number of threads, were waiting for me at the end of the tunnel. Luckily, it did not take much time to discover the notion of light-weight processes, which later introduced me to Protothreads. That being said, I should admit that I fell back to my single-threaded asynchronous event loop many times rather than maintaining a Protothreads-flavored PThreads spaghetti. I suppose every programmer has once dreamed about having light-weight threads that map to a multitude of physical CPU cores via some sort of mechanism in the background; the many similar efforts in the literature (Plan 9 rfork, Solaris LWP, NetBSD lwp_create) back this claim.
The heart of the problem lies in the question of mapping light-weight threads to physical processor cores, given that the operating system has no prior knowledge of what is going on at the application level. As an inevitable consequence of this problem, developers came up with the idea of providing application-level constructs to hint the underlying scheduling mechanism about the concurrency operating at the application level. Among many other attempts, coroutines and fibers are the two well-known examples that emanated from this research. Quoting from Wikipedia,
Fibers describe essentially the same concept as coroutines. The distinction, if there is any, is that coroutines are a language-level construct, a form of control flow, while fibers are a systems-level construct, viewed as threads that happen not to run concurrently. Priority is contentious; fibers may be viewed as an implementation of coroutines, or as a substrate on which to implement coroutines.
If I were to restate the given description from a more naive perspective targeting the Java Virtual Machine, coroutines are language-level constructs that form a way of cooperatively executing multiple tasks, whereas fibers are light-weight threads scheduled preemptively by the virtual machine. There is a set of coroutine implementations available for the JVM, but in this post I will stick with two particular fiber implementations: Akka from Typesafe and Quasar from Parallel Universe.
Last year I had plenty of time to play around with Akka during the Principles of Reactive Programming course. Claiming that Akka is just a toolkit to realize fibers on the JVM would be a fairly incomplete definition – it is much more than that: designed for resiliency from top to bottom, a gigantic toolbox of common messaging pattern shortcuts, adaptive load balancing, routing, partitioning, configuration-driven remoting, and so on, just to name a few. Additionally, one should note that it is battle-tested with real-world workloads, both as a standalone application and as a component of the Typesafe Platform.
On the other hand, Quasar is a relatively new player in this league. That being said, it is implemented as a compact library with – in my opinion – a fairly sufficient set of features, though it still has a long way to go to catch up with Akka in terms of out-of-the-box functionality. In addition, while Akka only provides actor abstractions at the programming language level, Quasar ships both actors and fibers. (Actually, Quasar actors are built on top of fibers.) Quoting from Quasar core developer Ron Pressler’s post:
Quasar and Akka provide completely different abstractions. Quasar provides fibers while Akka provides an asynchronous-programming framework based on callbacks. Both libraries then implement actors using their respective abstractions. […] fibers are lightweight threads, which means that their most important property – even before being light-weight – is being threads. The thread abstraction is a series of sequential computations with the ability to block – on IO or synchronization. Callbacks, on the other hand, are a completely different abstraction. Akka’s callbacks (and therefore actors), are not allowed to block, and therefore do not provide a thread abstraction; hence they’re not fibers.
Quasar’s main design goal was the desire to make the simple, familiar, threading abstraction (i.e., synchronous, blocking code) cheap, as kernel threads are expensive, and blocking a kernel thread carries a lot of overhead. Quasar is used to run servlets that serve hundreds of thousands of concurrent requests – instead of mere thousands – all without changing your existing Java code. We want developers to get the full performance benefits of asynchronous code – which Akka also offers – while keeping their APIs and synchronous programming model – which Akka certainly doesn’t. Quasar makes threading cheap, while Akka abandons the thread abstraction altogether. In that respect, Quasar is a lot more similar to Erlang and Go (which, of course, provided the inspiration to Quasar), which also use the thread abstraction but provide a lightweight thread implementation.
Akka’s design goals are different, and we think Quasar is far simpler to use than Akka, requires less mental overhead and far-less re-engineering, while at the same time being more feature-rich.
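To make the blocking-thread abstraction described above a bit more concrete, here is a minimal Quasar sketch of a fiber parking on a channel receive. This is not code from fiber-test; the class and variable names are mine, and it assumes the Quasar java agent is on the JVM command line (`-javaagent:quasar-core.jar`) so that suspendable methods get instrumented.

```java
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.channels.Channel;
import co.paralleluniverse.strands.channels.Channels;

public final class FiberBlockingSketch {

    public static void main(String[] args) throws Exception {
        // Mailbox size 0 yields a rendezvous-style channel: sender and receiver meet.
        final Channel<Integer> channel = Channels.newChannel(0);

        // The receiver looks like ordinary blocking code, but only the fiber is
        // parked; the underlying kernel thread stays free to run other fibers.
        Fiber<Void> receiver = new Fiber<Void>(() -> {
            Integer message = channel.receive();
            System.out.println("received " + message);
        }).start();

        Fiber<Void> sender = new Fiber<Void>(() -> {
            channel.send(42);
        }).start();

        receiver.join();
        sender.join();
    }
}
```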
In order to get an idea of how the two implementations compare to each other in terms of performance, I put together a small benchmark suite (fiber-test) that employs both Akka (2.3.5) and Quasar (0.6.0) for the well-known thread-ring problem. (A given number of threads are spawned and connected in a ring structure. Through this ring a message – an integer comprising 4 bytes – is circulated for a given number of message passings.) In addition, I included a version of the benchmark that uses plain Java Threads to establish a baseline.
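For reference, the ring looks roughly like the following when implemented with plain Java threads. This is a bare-bones sketch rather than the actual fiber-test code: the JMH harness is omitted, names are illustrative, and the message count is scaled down for a quick local run.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class ThreadRingSketch {

    public static void main(String[] args) throws InterruptedException {
        final int workerCount = 503;
        final int messagePassings = 1_000_000;  // the benchmark circulates the message 1e7 times

        // One single-slot inbox per worker; worker i forwards to worker (i + 1) % workerCount.
        @SuppressWarnings("unchecked")
        final BlockingQueue<Integer>[] inboxes = new BlockingQueue[workerCount];
        for (int i = 0; i < workerCount; i++) {
            inboxes[i] = new ArrayBlockingQueue<>(1);
        }

        final Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            final BlockingQueue<Integer> in = inboxes[i];
            final BlockingQueue<Integer> out = inboxes[(i + 1) % workerCount];
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        int hopsLeft = in.take();      // receive the message
                        out.put(hopsLeft - 1);         // forward the decremented counter
                        if (hopsLeft <= 0) {
                            return;                    // stop token forwarded, shut down
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        final long startNanos = System.nanoTime();
        inboxes[0].put(messagePassings);               // inject the message into the ring
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.printf("ring completed in %d ms%n",
                (System.nanoTime() - startNanos) / 1_000_000);
    }
}
```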
In fiber-test, you will observe that `benchmark.sh` employs `taskset` to bind the JVM to a given set of CPUs on the system. However, for a small number (1-2) of CPUs, Quasar freezes randomly and Java Threads become totally unresponsive. Hence, I dropped `taskset` for the figures presented here.
Tests are deployed on an Ubuntu GNU/Linux 14.04 (LTS) system running on a 6-physical-core Intel(R) Xeon(R) E5645 2.40GHz processor, where a 64-bit Java HotSpot VM (build 1.8.0_20-ea-b05) is used. JMH is set to run 5 warmup and 5 measurement iterations using 3 JVM forks. There are 503 threads in each benchmark and 1e7 message passings. Results are in milliseconds per benchmark.
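Roughly, the JMH setup described above boils down to annotations along the following lines. The benchmark mode and the class/method names are my assumptions for illustration, not necessarily what fiber-test uses.

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

// 5 warmup and 5 measurement iterations across 3 JVM forks, reported in milliseconds.
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(3)
public class ThreadRingJmhSketch {

    @Benchmark
    public void ring() throws InterruptedException {
        // Spawn 503 threads (or fibers/actors) connected in a ring and circulate
        // the message 1e7 times, as in the plain-thread sketch earlier.
    }
}
```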
| Fiber Impl. | Min. | Speedup (vs. Java Threads) | Avg. | Max. | Std. Dev. | 99.9% Confidence Interval |
|---|---|---|---|---|---|---|
| Quasar Fibers | 695.613 | 29x | 721.457 | 758.231 | 20.796 | [699.226, 743.689] |
| Akka Actors | 856.807 | 23x | 911.295 | 963.403 | 34.345 | [874.578, 948.012] |
| Quasar Actors | 1553.224 | 13x | 1606.718 | 1660.329 | 35.756 | [1568.492, 1644.943] |
| Java Threads | 15730.028 | 1x | 20709.233 | 33084.117 | 4338.423 | [16071.196, 25347.270] |
The results indicate that even in the worst case (which corresponds to Quasar Actors in the figures), the fiber implementations on average perform 13 times better than native Java threads. In addition, Akka Actors and Quasar Fibers push this figure up to 23x and 29x, respectively. All that being said, I could not explain the performance difference between Quasar Fibers and Quasar Actors, given that actors are implemented on top of fibers in Quasar. In order to investigate further, I profiled the benchmarks using `-agentlib:hprof=cpu=samples,interval=20,depth=3`. (See the HPROF documentation for further information on the parameters.)
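One way to attach the HPROF agent to the forked benchmark JVMs is through JMH's `@Fork` annotation, as sketched below; whether fiber-test passes the flag this way or directly on the java command line is my assumption, and the class and method names are illustrative.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;

// Append the HPROF agent options to the JVM arguments of every benchmark fork.
@Fork(value = 3, jvmArgsAppend = "-agentlib:hprof=cpu=samples,interval=20,depth=3")
public class ProfiledRingSketch {

    @Benchmark
    public void ring() {
        // run one round of the ring benchmark here
    }
}
```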
| rank | self | accum | count | trace | method |
|---|---|---|---|---|---|
| **Quasar Fibers** | | | | | |
| 1 | 47.55% | 47.55% | 67286 | 300366 | co.paralleluniverse.fibers.Fiber.run1 |
| 2 | 47.48% | 95.04% | 67186 | 300364 | com.github.vy.fibertest.QuasarFiberRingBenchmark$InternalFiber.run |
| 3 | 3.37% | 98.40% | 4765 | 300372 | com.github.vy.fibertest.QuasarFiberRingBenchmark$InternalFiber.run |
| **Quasar Actors** | | | | | |
| 1 | 22.17% | 22.17% | 77787 | 300526 | co.paralleluniverse.fibers.Fiber.park1 |
| 2 | 18.53% | 40.69% | 65011 | 300508 | co.paralleluniverse.fibers.Fiber.run |
| 3 | 18.32% | 59.01% | 64292 | 300527 | co.paralleluniverse.strands.channels.SingleConsumerQueueChannel.receive |
| 4 | 18.27% | 77.28% | 64114 | 300517 | com.github.vy.fibertest.QuasarActorRingBenchmark$InternalActor.doRun |
| 5 | 17.39% | 94.67% | 61009 | 300528 | co.paralleluniverse.actors.ActorRunner.run |
| **Akka Actors** | | | | | |
| 1 | 64.09% | 64.09% | 5158 | 300307 | scala.concurrent.forkjoin.ForkJoinPool.runWorker |
| 2 | 12.51% | 76.60% | 1007 | 300298 | sun.misc.Unsafe.park |
| 3 | 10.69% | 87.29% | 860 | 300309 | sun.misc.Unsafe.unpark |
| 4 | 2.65% | 89.94% | 213 | 300300 | scala.concurrent.forkjoin.ForkJoinPool.scan |
| 5 | 1.12% | 91.05% | 90 | 300316 | akka.dispatch.Dispatcher.dispatch |
| **Java Threads** | | | | | |
| 1 | 94.33% | 94.33% | 10373 | 300057 | sun.misc.Unsafe.unpark |
| 2 | 5.59% | 99.93% | 615 | 300058 | sun.misc.Unsafe.park |
As anticipated, Java Threads were bitten by the `park`/`unpark` methods in `sun.misc.Unsafe`. Akka Actors and Quasar Fibers, on the other hand, perform some sort of interleaving between routines to avoid these calls, and that totally rocks. Unfortunately, it turns out that the channel mechanisms (e.g., `co.paralleluniverse.strands.channels.SingleConsumerQueueChannel`) in Quasar Actors somehow trigger the Quasar core to employ `park` calls between the message receivers, and that consumes the majority of the fiber runtime.
Observing that fiber implementations perform far better than standard Java Threads is nothing new. However, seeing Quasar Fibers deliver such impressive performance is not something I was expecting, considering that the project is in its very early stages. That being said, it appears that the channels in Quasar Actors have some room for improvement and might borrow a few ideas from their Akka counterparts. Overall, both fiber implementations show stable and noteworthy performance figures and stand as a solid replacement for standard Java Threads in a considerable number of use cases.