<p><strong>The State of Lightweight Threads for the JVM</strong> by <a href="https://volkan.yazi.ci/">Volkan Yazıcı</a>, 2019-12-17</p>
<p>This week I took the stage at <a href="//bol.com">bol.com</a> with a presentation titled
<em>The State of Lightweight Threads for the JVM</em>. The abstract of the presentation
was as follows:</p>
<blockquote>
<p>Thanks to your constantly crashing desktop apps, you know about processes.
Microservice flavored web development taught us about threads too. Then came
callbacks, shortening the lifetime of millions of developers. Do you recall the day
you shed tears when Java shipped a Future that you can complete? I know you
haven’t forgotten how Reactive Streams occupied our agendas with RxJava, Reactor,
and infinite stack traces. Now newspapers are yelling about coroutines, fibers,
lightweight threads! We poor souls… In this presentation, I aim to give a
2,000ft overview of where we come from, where we are heading, and what
all this means for the Java Virtual Machine, aka JVM. Are you a Java
developer? Thread circus owner? Reactive ninja? Kotlin fan? COBOL enthusiast?
Algol pensioner? There is something I will tell you. (Spoiler: I will as well
talk about Project Loom, structured concurrency, and scoped variables.)</p>
</blockquote>
<p>The slides are available in <a href="loom-bol.pdf">PDF</a> and
<a href="loom-bol-org.odp">ODP</a> formats.</p>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/vJM88UwcN6ypDn" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no">
</iframe>
<p><strong>Elasticsearch Survival Guide for Developers</strong>, 2019-04-25</p>
<p>For some months, I have been jotting down notes on best practices that I wish I had
known when I first started developing applications running against Elasticsearch.
Even though the following collection tries to communicate certain ideas in Java,
I believe almost all of them apply to every other programming language with
no or only minor changes. I tried to avoid repeating content that has already been
covered in tutorials and the Elasticsearch documentation. The listed principles are
all derived from my personal point of view; I strove to share only the ones that I
can justify with either facts or experience.</p>
<p>Before proceeding further, I would like to thank <a href="https://discuss.elastic.co/t/review-request-for-elasticsearch-survival-guide-for-developers-blog-post/183411/2">David
Turner</a>
and <a href="https://twitter.com/wvl0">William Leese</a> for their valuable feedback. <a href="https://elastic.blog.csdn.net/article/details/105355260">A Chinese
translation of this article</a>
is kindly featured on CSDN by <a href="https://me.csdn.net/wojiushiwo987">Ming Yi Tian Xia</a>.</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li>
<a href="#mapping">Mapping</a>
<ul>
<li><a href="#nested-fields">Avoid <code>nested</code> fields</a></li>
<li><a href="#strict-mapping">Have a strict mapping</a></li>
<li><a href="#analyze">Don’t analyze fields of type <code>string</code> unless necessary</a></li>
</ul>
</li>
<li>
<a href="#settings">Settings</a>
<ul>
<li><a href="#oversharding">Avoid oversharding</a></li>
<li><a href="#tuning-merges">Unlearn every hack for tuning merges</a></li>
<li><a href="#memory">Pay attention to JVM memory settings</a></li>
</ul>
</li>
<li>
<a href="#querying">Querying</a>
<ul>
<li><a href="#cas">Compare-and-swap over <code>_version</code> field is poor man’s transactions</a></li>
<li><a href="#splitting-queries">Try splitting complex queries</a></li>
<li><a href="#numeric-types">Know your numeric types</a></li>
<li><a href="#transport-client">Don’t use Elasticsearch Transport/Node client in your Java application (and always use JSON over HTTP)</a></li>
<li><a href="#rest-client">Use the official Elasticsearch REST client in your Java application</a></li>
<li><a href="#cache-keys">Don’t use Elasticsearch query models to generate cache keys</a></li>
<li><a href="#http-caching">Don’t use HTTP caching for caching Elasticsearch responses</a></li>
<li><a href="#sliced-scrolls">Use sliced scrolls sorted on <code>_doc</code></a></li>
<li><a href="#get-by-id">Prefer <code>GET /index/type/{id}</code> over <code>POST /index/_search</code> for single document retrieval</a></li>
<li><a href="#size0-includes-excludes">Use <code>size: 0</code> and <code>includes</code>/<code>excludes</code> wisely</a></li>
<li><a href="#backpressure">Implement proper backpressure while querying</a></li>
<li><a href="#explicit-timeouts">Provide explicit timeouts in queries</a></li>
<li><a href="#blocking-io-threads">Don’t block the Elasticsearch client I/O threads (and know your threads)</a></li>
<li><a href="#json-templates">Don’t write Elasticsearch queries with JSON templates injecting variables</a></li>
<li><a href="#json-serializer">Prefer your own JSON serializer over the one provided by Elasticsearch clients</a></li>
</ul>
</li>
<li>
<a href="#strategy">Strategy</a>
<ul>
<li><a href="#latest-version">Always (try to) stick to the latest JVM and Elasticsearch versions</a></li>
<li><a href="#snapshots">Use Elasticsearch complete and partial snapshots for backups</a></li>
<li><a href="#performance-test-bed">Have a continuous performance test bed</a></li>
<li><a href="#aliases">Use aliases</a></li>
<li><a href="#synonyms">Avoid having big synonym collections</a></li>
<li><a href="#force-merge">Force merge and increase operation bandwidth before enabling replicas</a></li>
<li><a href="#metrics">Record application-level metrics</a></li>
<li><a href="#cpu">Invest in CPU!</a></li>
<li><a href="#plugins">Avoid writing custom Elasticsearch plugins</a></li>
</ul>
</li>
</ul>
<p><a name="mapping"></a></p>
<h1 id="mapping">Mapping</h1>
<p>Here I share Elasticsearch
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html">mapping</a>
related tips.</p>
<p><a name="nested-fields"></a></p>
<h2 id="avoid-nested-fields">Avoid <code>nested</code> fields</h2>
<ul>
<li>
<p>Under the hood, each Elasticsearch document corresponds to a Lucene document,
most of the time. This promise is broken for fields of type
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html"><code>nested</code></a>.
There, each nested object is stored as a separate Lucene document next to the
parent one. The impact of this is:</p>
<ul>
<li>Querying on <code>nested</code> fields is slower compared to fields in parent document</li>
<li>Retrieval of matching <code>nested</code> fields adds an additional slowdown</li>
<li>Once you update any field of a document containing <code>nested</code> fields,
independent of whether you updated a nested field or not, all the underlying
Lucene documents (parent and all its <code>nested</code> children) need to be marked as
deleted and rewritten. In addition to slowing down your updates, such an
operation also creates garbage to be cleaned up by segment merging later on.</li>
</ul>
</li>
<li>
<p>On certain occasions, you can flatten <code>nested</code> fields. For instance, given the
following document:</p>
<pre><code class="language-json"><span class="p">{</span>
<span class="nt">"attributes"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span><span class="nt">"key"</span><span class="p">:</span> <span class="s2">"color"</span><span class="p">,</span> <span class="nt">"val"</span><span class="p">:</span> <span class="s2">"green"</span><span class="p">},</span>
<span class="p">{</span><span class="nt">"key"</span><span class="p">:</span> <span class="s2">"color"</span><span class="p">,</span> <span class="nt">"val"</span><span class="p">:</span> <span class="s2">"blue"</span><span class="p">},</span>
<span class="p">{</span><span class="nt">"key"</span><span class="p">:</span> <span class="s2">"size"</span><span class="p">,</span> <span class="nt">"val"</span><span class="p">:</span> <span class="s2">"medium"</span><span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre>
<p>You can flatten it as follows:</p>
<pre><code class="language-json"><span class="p">{</span>
<span class="nt">"attributes"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"color"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"green"</span><span class="p">,</span> <span class="s2">"blue"</span><span class="p">],</span>
<span class="nt">"size"</span><span class="p">:</span> <span class="s2">"medium"</span>
<span class="p">}</span>
<span class="p">}</span></code></pre>
</li>
</ul>
<p><a name="strict-mapping"></a></p>
<h2 id="have-a-strict-mapping">Have a strict mapping</h2>
<p>Do you know how many times I witnessed a production failure due to a new field
first receiving a <code>long</code> value where the rest of the values are of type
<code>double</code>? After the first received <code>long</code>, Elasticsearch creates the field, sets
its type to <code>long</code>, and shares this mapping change with the rest of the nodes in
the cluster. Then the rest of the <code>double</code> values are simply rejected due to
type mismatch.</p>
<ul>
<li>Have a strict mapping to avoid surprises.</li>
<li>Don’t blacklist, just whitelist.</li>
<li>Avoid using <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html">dynamic templates</a>
– they are just gateway drugs.</li>
<li>Disable <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html#date-detection">date detection</a>,
which is on by default.</li>
</ul>
<p><a name="analyze"></a></p>
<h2 id="dont-analyze-fields-of-type-string-unless-necessary">Don’t analyze fields of type <code>string</code> unless necessary</h2>
<p>By default, a freshly inserted string field is assigned the type
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html"><code>text</code></a>,
which incurs an analysis cost. If you don’t need fuzzy matching and just want
filtering, use type <code>keyword</code> instead. This is a small amendment to the <a href="#strict-mapping">strict
mapping</a> bullet point.</p>
<p><a name="settings"></a></p>
<h1 id="settings">Settings</h1>
<p>Here I share Elasticsearch cluster
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-update-settings.html">settings</a>
related tips.</p>
<p><a name="oversharding"></a></p>
<h2 id="avoid-oversharding">Avoid oversharding</h2>
<p>One of the greatest strengths of Elasticsearch is sharding, that is, splitting
the data across multiple nodes to exploit parallelization. There are many myths
surrounding this subject. Recall that the number of shards of an index cannot be changed
once it is set. This makes oversharding a very common pitfall for newcomers.
Make sure you have done your homework (that is, RTFM, such as <a href="https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster">this
one</a>)
before making any decisions.</p>
<p><a name="tuning-merges"></a></p>
<h2 id="unlearn-every-hack-for-tuning-merges">Unlearn every hack for tuning merges</h2>
<p>Elasticsearch is in essence yet another distributed
<a href="http://lucene.apache.org/">Lucene</a> offering, just like
<a href="http://lucene.apache.org/solr/">Solr</a>. Under the hood, every Elasticsearch
document corresponds to a Lucene document, most of the time. (There are certain exceptions
to this rule like <code>nested</code> fields, though this generalization is pretty
accurate.) In Lucene, documents are stored in
<a href="https://lucene.apache.org/core/3_0_3/fileformats.html">segments</a>. Elasticsearch
in the background continuously maintains these Lucene segments by means of the
following two patterns:</p>
<ul>
<li>
<p>In Lucene, when you delete and/or update a document, the old one gets marked
as removed and a new one gets created. Elasticsearch keeps track of these dead
documents and compacts such segments that are highly polluted by rebuilding
them.</p>
</li>
<li>
<p>Newly added documents might yield segments of imbalanced sizes.
Elasticsearch might decide to merge these into bigger ones for optimization
purposes.</p>
</li>
</ul>
<p>This aforementioned compaction is referred to as <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/merge-process.html">segment
merges</a>
in Elasticsearch terminology. As you can guess, merges are highly disk I/O- and
CPU-bound operations. As a user, you would not want to have them ruining your
Elasticsearch query performance. As a matter of fact, you can avoid them completely in
certain circumstances: build the index once and don’t change it anymore. This
condition, though, might be difficult to meet in many application scenarios. Once
you start to insert new documents or update existing ones, segment merges become
an inevitable part of your life.</p>
<p>An ongoing segment merge can significantly damage the overall query performance
of the cluster. Do a random search on Google and you will find many people
looking for help to reduce the performance impact and many others sharing certain settings
that worked for them. Over the last years, there were two particular patterns I
observed in these cries for help: they have existed since the earliest Elasticsearch
versions (so everybody agrees that it used to hurt and is still hurting) and
the majority of the mentioned settings have become deprecated over time (or even worse,
removed completely). So my rules of thumb for tuning merges are as follows:</p>
<ol>
<li>
<p>Unlearn every hack you heard about tuning merges. It is an operation tightly
coupled with the internals of Elasticsearch and subject to change without
any backward compatibility guarantee. There is no secret knob to make
it run faster; it is like the garbage collector in the JVM, <code>VACUUM</code> in
PostgreSQL, etc.</p>
</li>
<li>
<p>Find the sweet spot for the
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html">translog</a>
flushes. Try relaxing <code>index.translog.sync_interval</code> and
<code>index.translog.flush_threshold_size</code> settings until you stop seeing
benefits for your usage pattern.</p>
</li>
<li>
<p>Adapt <code>index.refresh_interval</code> to your needs. Imagine you first bootstrap an
index and later on occasionally perform small updates on it. In such a case,
start with a loose (even disabled!) <code>refresh_interval</code> and make it tighter
after bootstrap.</p>
<p>Note that in recent versions, if you are indexing but not searching then
no refreshes will take place at all. Quoting from <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings">Dynamic Index
Settings</a>:</p>
<blockquote>
<p>If this [<code>index.refresh_interval</code>] setting is not explicitly set, shards
that haven’t seen search traffic for at least <code>index.search.idle.after</code>
seconds will not receive background refreshes until they receive a search
request.</p>
</blockquote>
</li>
</ol>
<p><a name="memory"></a></p>
<h2 id="pay-attention-to-jvm-memory-settings">Pay attention to JVM memory settings</h2>
<p>Elasticsearch can yield dramatically different performance characteristics depending on two
primary memory settings: JVM heap space and the amount of memory left to the
kernel page cache. I am not going to dive into these details here, because they
are pretty well
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html">documented</a>.
This is your reminder to not blindly set the Elasticsearch JVM heap size judging
from your past non-Elasticsearch JVM application experiences.</p>
<p><a name="querying"></a></p>
<h1 id="querying">Querying</h1>
<p>Below I collected tips that you can take advantage of (or better stick to) while
querying Elasticsearch.</p>
<p><a name="cas"></a></p>
<h2 id="compare-and-swap-over-version-field-is-poor-mans-transactions">Compare-and-swap over <code>_version</code> field is poor man’s transactions</h2>
<p>I believe you have already figured out that Elasticsearch doesn’t support
transactions. Though you can leverage the
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.0/docs-get.html#get-versioning"><code>_version</code></a>
field in a <a href="https://en.wikipedia.org/wiki/Compare-and-swap">CAS</a>-loop to provide
integrity at least on a single document basis. An example demonstration of this
trick can be summarized as follows:</p>
<pre><code class="language-java"><span class="n">String</span> <span class="n">id</span> <span class="o">=</span> <span class="s">"123"</span><span class="o">;</span>
<span class="k">for</span> <span class="o">(;;)</span> <span class="o">{</span>
<span class="n">EsDocument</span> <span class="n">prevDoc</span> <span class="o">=</span> <span class="n">store</span><span class="o">.</span><span class="na">findById</span><span class="o">(</span><span class="n">id</span><span class="o">);</span>
<span class="kt">long</span> <span class="n">prevVersion</span> <span class="o">=</span> <span class="n">prevDoc</span><span class="o">.</span><span class="na">getVersion</span><span class="o">();</span>
<span class="n">Object</span> <span class="n">prevSource</span> <span class="o">=</span> <span class="n">prevDoc</span><span class="o">.</span><span class="na">getSource</span><span class="o">();</span>
<span class="n">Object</span> <span class="n">nextSource</span> <span class="o">=</span> <span class="n">update</span><span class="o">(</span><span class="n">prevSource</span><span class="o">);</span>
<span class="kt">boolean</span> <span class="n">updated</span> <span class="o">=</span> <span class="n">store</span><span class="o">.</span><span class="na">updateByIdAndVersion</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">prevVersion</span><span class="o">,</span> <span class="n">nextSource</span><span class="o">);</span>
<span class="k">if</span> <span class="o">(</span><span class="n">updated</span><span class="o">)</span> <span class="o">{</span>
<span class="k">break</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>Clearly this trick doesn’t stretch to multiple indices or <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html">parent/child
relations</a>.</p>
<p><strong>Warning:</strong> Usage of <code>_version</code> within CAS-loops is deprecated due to <a href="https://github.com/elastic/elasticsearch/issues/19269">known
issues</a>. In Elasticsearch
version 6.6 and onwards, one should use the <code>_seq_no</code> and <code>_primary_term</code>
fields instead. The gist of this practice is still the same, though. See
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/6.6/optimistic-concurrency-control.html">Optimistic Concurrency
Control</a>
for details.</p>
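<p>As a rough illustration of the same loop expressed with <code>_seq_no</code> and
<code>_primary_term</code>, here is a minimal sketch against an in-memory stand-in.
<code>SeqNoCasSketch</code>, <code>Versioned</code>, <code>Change</code>, <code>InMemoryStore</code>, and
<code>updateIfMatch</code> are hypothetical names invented for this example; none of them
belongs to an Elasticsearch client. The retry cap is also an assumption of mine,
added so that a hot document cannot spin the loop forever.</p>

```java
final class SeqNoCasSketch {

    /** Hypothetical snapshot of a document together with its two concurrency counters. */
    static final class Versioned {
        final long seqNo;
        final long primaryTerm;
        final String source;

        Versioned(long seqNo, long primaryTerm, String source) {
            this.seqNo = seqNo;
            this.primaryTerm = primaryTerm;
            this.source = source;
        }
    }

    /** Hypothetical change applied to the previous source to obtain the next one. */
    interface Change {
        String apply(String prevSource);
    }

    /** In-memory stand-in for a single stored document; not an Elasticsearch client. */
    static final class InMemoryStore {
        private Versioned current;

        InMemoryStore(String source) {
            this.current = new Versioned(0, 1, source);
        }

        synchronized Versioned find() {
            return current;
        }

        /** Writes only if the caller has seen the latest counters. */
        synchronized boolean updateIfMatch(long seqNo, long primaryTerm, String nextSource) {
            if (current.seqNo != seqNo || current.primaryTerm != primaryTerm) {
                return false; // somebody else wrote in between
            }
            current = new Versioned(seqNo + 1, primaryTerm, nextSource);
            return true;
        }
    }

    /** Retries the read-modify-write cycle until it wins the race or runs out of attempts. */
    static boolean update(InMemoryStore store, Change change, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Versioned prev = store.find();
            String nextSource = change.apply(prev.source);
            if (store.updateIfMatch(prev.seqNo, prev.primaryTerm, nextSource)) {
                return true;
            }
        }
        return false;
    }
}
```

<p>Against a real cluster, the comparison inside <code>updateIfMatch</code> is what the server
performs when a write request carries the <code>if_seq_no</code> and <code>if_primary_term</code>
parameters; a rejected write then plays the role of the <code>false</code> return value here.</p>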
<p><a name="splitting-queries"></a></p>
<h2 id="try-splitting-complex-queries">Try splitting complex queries</h2>
<p>If you have complex queries with both, say, filter and aggregation components,
splitting these into multiple queries and executing them in parallel speeds up
querying in most cases. That is, in the first query, just get
the hits using the filter; then in the second query, just get the aggregations
without retrieving hits, that is, with <code>size: 0</code>.</p>
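<p>A minimal sketch of this split, using only <code>CompletableFuture</code> from the Java
standard library, could look as follows. <code>SplitQuerySketch</code>, <code>Query</code>, and
<code>SearchResult</code> are hypothetical names; how each <code>Query</code> actually talks to
Elasticsearch is deliberately left out. The final <code>join()</code> blocks the caller only
to keep the example short.</p>

```java
import java.util.concurrent.CompletableFuture;

final class SplitQuerySketch {

    /** Hypothetical stand-in for whatever issues one query and returns its raw response. */
    interface Query {
        String execute();
    }

    /** Holds both halves of the answer once the two queries have completed. */
    static final class SearchResult {
        final String hits;
        final String aggregations;

        SearchResult(String hits, String aggregations) {
            this.hits = hits;
            this.aggregations = aggregations;
        }
    }

    /** Runs the hits-only and the aggregations-only query concurrently and combines them. */
    static SearchResult search(Query hitsQuery, Query aggregationsQuery) {
        var hits = CompletableFuture.supplyAsync(hitsQuery::execute);
        var aggregations = CompletableFuture.supplyAsync(aggregationsQuery::execute);
        return hits.thenCombine(aggregations, SearchResult::new).join();
    }
}
```

<p>Here the first <code>Query</code> would carry the filter and the second one the aggregations
with <code>size: 0</code>. Whether the split pays off depends on the cluster and the query, so
it is worth measuring both variants.</p>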
<p><a name="numeric-types"></a></p>
<h2 id="know-your-numeric-types">Know your numeric types</h2>
<p>Many JSON parsers reach for various optimizations to provide efficient
read/write performance. For instance,
<a href="https://github.com/FasterXML/jackson">Jackson</a>, the de facto JSON parser in the
Java world, picks the primitive with the smallest memory footprint that can
store the number passed by JSON. Hence after reading a field value via Jackson,
you might end up getting an <code>int</code>, <code>long</code>, <code>double</code>, etc. Once you get, say, a
<code>double</code>, it is highly likely that you have already lost precision and/or will
lose precision while serializing it back to JSON again. To avoid such
surprises, prefer <code>BigDecimal</code> over <code>float</code> or <code>double</code>. Using <code>BigInteger</code> for
integral numbers is a safe bet too. (See
<a href="https://fasterxml.github.io/jackson-databind/javadoc/2.0.0/com/fasterxml/jackson/databind/DeserializationFeature.html#USE_BIG_DECIMAL_FOR_FLOATS">USE_BIG_DECIMAL_FOR_FLOATS</a>
and
<a href="https://fasterxml.github.io/jackson-databind/javadoc/2.0.0/com/fasterxml/jackson/databind/DeserializationFeature.html#USE_BIG_INTEGER_FOR_INTS">USE_BIG_INTEGER_FOR_INTS</a>
Jackson configurations for details.)</p>
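<p>The precision loss itself can be reproduced without any JSON library. The sketch
below round-trips number literals through <code>double</code>, <code>BigDecimal</code>, and
<code>BigInteger</code> using only the standard library; <code>NumericPrecisionSketch</code> and
its method names are made up for this example, and the literals are arbitrary.</p>

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class NumericPrecisionSketch {

    /** Round-trips a decimal literal through double, as a parser picking double would. */
    static String viaDouble(String literal) {
        return Double.toString(Double.parseDouble(literal));
    }

    /** Round-trips the same literal through BigDecimal, which keeps every digit. */
    static String viaBigDecimal(String literal) {
        return new BigDecimal(literal).toString();
    }

    /** Round-trips an integral literal through double and back to long. */
    static long integralViaDouble(String literal) {
        return (long) Double.parseDouble(literal);
    }

    /** Round-trips an integral literal through BigInteger, which is exact. */
    static String integralViaBigInteger(String literal) {
        return new BigInteger(literal).toString();
    }
}
```

<p>For instance, <code>viaDouble("0.1234567890123456789")</code> no longer returns the
original literal whereas <code>viaBigDecimal</code> does, and
<code>integralViaDouble("9007199254740993")</code> (that is, 2<sup>53</sup> + 1) returns
<code>9007199254740992</code>.</p>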
<p><a name="transport-client"></a></p>
<h2 id="dont-use-elasticsearch-transportnode-client-in-your-java-application-and-always-use-json-over-http">Don’t use Elasticsearch Transport/Node client in your Java application (and always use JSON over HTTP)</h2>
<p>Elasticsearch is written in Java and its query model classes implement custom
(de)serialization methods using Jackson’s <code>JsonGenerator</code> and <code>JsonParser</code>. This
way, thanks to Jackson, a model instance can be (de)serialized to both JSON
(text) and <a href="https://en.wikipedia.org/wiki/Smile_%28data_interchange_format%29">SMILE</a>
(binary) formats without breaking a sweat. Logically, Elasticsearch uses the
binary format for communication within the cluster due to performance reasons.
Using <code>JsonParser</code> for parsing SMILE has a slight caveat: A schema cannot always
be evolved in such a way that backwards compatibility is guaranteed. Indeed this
is not a problem for an Elasticsearch cluster; all the nodes (hopefully!) run the
same version. Using SMILE in your application, though, means that you might need
to shut down your application and, in parallel with the Elasticsearch upgrade, move
it to a newer version that uses the models of the new Elasticsearch you are about to deploy.</p>
<p>What about performance between JSON and SMILE? Even Elastic’s own data intensive
products such as Logstash and Kibana have replaced SMILE with JSON. It is highly
unlikely that your bottleneck while querying Elasticsearch would be
serialization. Furthermore, <a href="https://github.com/fabienrenaud/java-json-benchmark">Jackson excels at JSON serialization
efficiency</a>. Hence, to be
on the safe side, just stick to JSON over HTTP.</p>
<p><a name="rest-client"></a></p>
<h2 id="use-the-official-elasticsearch-rest-client-in-your-java-application">Use the official Elasticsearch REST client in your Java application</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/index.html">The official
driver</a>
will be maintained through the lifetime of the project. It will implement
community approved features from its competitors. Unless something goes terribly
wrong, the official client will either assimilate the others by simply including
their features or the unofficial clients will just rust away due to not being
able to keep up with the rate of new features being introduced in Elasticsearch
and finally disappear. That being said, the official REST client is a piece of
crap for two main reasons:</p>
<ol>
<li>It has a leaky abstraction over <a href="http://hc.apache.org/httpcomponents-client-ga/">Apache HTTP
Client</a> whose configurations
you need to deal with while tuning the client.</li>
<li>Do you recall <a href="#transport-client">the query model classes I mentioned above</a>?
I have some bad news for you: Model classes are entangled with the server
code and the REST client uses those classes. So what does this mean for you?
Well… Adding the REST client in your dependencies will drag the entire
Elasticsearch milkyway into your JAR Hell.</li>
</ol>
<p><a href="https://github.com/reactiverse/elasticsearch-client">Eclipse Vert.x works around this
entanglement</a> in yet
another entangled way. I doubt it will be granted a long life, though, given the
reasons I listed above.</p>
<p>In summary, the official REST client (unfortunately) is still your best bet.</p>
<p><a name="cache-keys"></a></p>
<h2 id="dont-use-elasticsearch-query-models-to-generate-cache-keys">Don’t use Elasticsearch query models to generate cache keys</h2>
<p>Or more specifically:</p>
<ul>
<li>
<p>[If you want your hash keys to be consistent between different JVM processes,
JVM upgrades, and JVM restarts,] don’t use <code>Object#hashCode()</code>.</p>
</li>
<li>
<p>The JSON generated by Elasticsearch query models for semantically identical
queries is not necessarily identical. It took us more than a year to figure this
out. Hence, don’t take the query model generated JSON as your source
of hash. (My guess is that somewhere in the <code>JsonGenerator</code>-based serializers,
they are probably iterating over a <code>java.util.Map</code> or <code>java.util.Set</code> whose
order for identical content varies under different conditions.)</p>
</li>
</ul>
<p>Ok. So what else is left? How should we do it?</p>
<ol>
<li>
<p>You query Elasticsearch due to a request you have just received, right? Does
that request have its own application level model? Good. There you go. Use
that one as a hash source.</p>
</li>
<li>
<p>Your application level request model is too complex to generate a proper hash
key? Doh! Ok. Don’t tell anyone that I told you this: Parse Elasticsearch
generated JSON using Jackson and let Jackson serialize it one more time back
to JSON, but this time <a href="https://fasterxml.github.io/jackson-databind/javadoc/2.0.0/com/fasterxml/jackson/databind/SerializationFeature.html#ORDER_MAP_ENTRIES_BY_KEYS">instruct Jackson to sort the
keys</a>.</p>
</li>
</ol>
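<p>To make the first option a bit more tangible, here is a minimal sketch that derives
a cache key from an application-level request instead of <code>Object#hashCode()</code> or
the generated query JSON. The request fields (<code>queryText</code>, <code>pageSize</code>,
<code>filters</code>), the class name, and the choice of SHA-256 are assumptions made for
this example. Only the standard library is used.</p>

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class CacheKeySketch {

    /**
     * Builds a cache key from hypothetical application-level request fields.
     * Filters are sorted first so that their order in the request does not matter.
     */
    static String cacheKey(String queryText, int pageSize, String... filters) {
        String[] sortedFilters = filters.clone();
        Arrays.sort(sortedFilters);
        StringBuilder canonical = new StringBuilder();
        append(canonical, queryText);
        append(canonical, Integer.toString(pageSize));
        for (String filter : sortedFilters) {
            append(canonical, filter);
        }
        return sha256Hex(canonical.toString());
    }

    /** Length-prefixes each part so that ("ab", "c") and ("a", "bc") cannot collide. */
    private static void append(StringBuilder canonical, String part) {
        canonical.append(part.length()).append(':').append(part).append(';');
    }

    /** Hashes the canonical form; the result is stable across JVM processes and restarts. */
    private static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

<p>The important part is the canonical form rather than the hash function: two requests
that mean the same thing must produce the same string before hashing.</p>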
<p><a name="http-caching"></a></p>
<h2 id="dont-use-http-caching-for-caching-elasticsearch-responses">Don’t use HTTP caching for caching Elasticsearch responses</h2>
<p>Many people fall into the trap of fronting their Elasticsearch cluster with an
HTTP cache such as <a href="http://varnish-cache.org/">Varnish</a> due to its convenience
and low barrier of entry. This seductive approach has certain shortcomings:</p>
<ul>
<li>
<p>When using Elasticsearch in production, it is highly likely you will end up
having multiple clusters due to various reasons: resiliency, experimentation
room, zero downtime upgrades, etc. Then,</p>
<ul>
<li>
<p>once you front each cluster with a dedicated HTTP cache, 99% of the cache
content will just be duplicated.</p>
</li>
<li>
<p>if you decide to use a single HTTP cache for all clusters, it is really
difficult to programmatically configure an HTTP cache to adapt to the needs of
the ever changing cluster states. How will you communicate the cluster load to let
the cache balance the traffic? How will you configure scheduled or manual
downtimes? How will you make the cache gradually migrate the traffic from one
cluster to another during maintenance windows?</p>
</li>
</ul>
</li>
<li>
<p>As mentioned above, HTTP caches are difficult to command programmatically.
When you need to manually evict one or more entries, it is not always as easy
as a <code>DELETE FROM cache WHERE keys IN (...)</code> query. And let me warn you, you
are gonna need that manual eviction sooner or later.</p>
</li>
</ul>
<p><a name="sliced-scrolls"></a></p>
<h2 id="use-sliced-scrolls-sorted-on-doc">Use sliced scrolls sorted on <code>_doc</code>
</h2>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html">Scrolls</a>
are the vehicle that Elasticsearch provides allowing you to scan its entire dataset for large
reads. They are functionally (and surprisingly, internally too) pretty much
similar to <a href="https://en.wikipedia.org/wiki/Cursor_(databases)">RDBMS cursors</a>.
Though most people don’t get them right in their first attempt. Here are some
basics:</p>
<ul>
<li>
<p>If you reach for scrolls, you are probably reading quite some data. It is
highly likely <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#sliced-scroll">slicing</a> will help you improve the read performance
significantly.</p>
</li>
<li>
<p>The order doesn’t matter in your reads? This is your lucky day! Just sort
on <code>_doc</code> field and there’s a +20% read speed up without any other changes.
(<code>_doc</code> is a pseudo field to let Elasticsearch use the order documents are
in on disk.)</p>
</li>
<li>
<p>The <code>scrollId</code> might (and does!) change between calls. Hence make sure you are
always scrolling using the most recently retrieved <code>scrollId</code>.</p>
</li>
<li>
<p>Reaching for <code>REINDEX</code>? Did you enable slicing there too? Good.</p>
</li>
</ul>
<p><a name="get-by-id"></a></p>
<h2 id="prefer-get-indextypeid-over-post-indexsearch-for-single-document-retrieval">Prefer <code>GET /index/type/{id}</code> over <code>POST /index/_search</code> for single document retrieval</h2>
<p>Elasticsearch uses different thread pools to handle <code>GET /index/type/{id}</code> and
<code>POST /index/_search</code> queries. Using <code>POST /index/_search</code> with payload <code>{"query":
{"match": {"_id": "123"}}}</code> (or something similar) occupies your
search-dedicated thread pool. Under heavy load, this will worsen both your
search and single document fetch performance. Simply stick to <code>GET
/index/type/{id}</code>.</p>
<p><a name="size0-includes-excludes"></a></p>
<h2 id="use-size-0-and-includesexcludes-wisely">Use <code>size: 0</code> and <code>includes</code>/<code>excludes</code> wisely</h2>
<p>This tip I guess applies to any storage engine of your preference: Avoid
retrieving (or even storing!) content unless necessary. I have witnessed
Elasticsearch showing dramatic performance differences before and after adding a
<code>size: 0</code> clause.</p>
<p><a name="backpressure"></a></p>
<h2 id="implement-proper-backpressure-while-querying">Implement proper backpressure while querying</h2>
<p>Yet another database-independent tip: There is no point in hammering your
database to the point of making it choke. Consider the following Java snippet
employing an imaginary database client:</p>
<pre><code class="language-java"><span class="c1">// Consumer is java.util.function.Consumer; dbClient is the imaginary client.</span>
<span class="kt">void</span> <span class="nf">findUserById</span><span class="o">(</span><span class="n">String</span> <span class="n">id</span><span class="o">,</span> <span class="n">Consumer</span><span class="o"><</span><span class="n">User</span><span class="o">></span> <span class="n">callback</span><span class="o">)</span> <span class="o">{</span>
<span class="n">dbClient</span><span class="o">.</span><span class="na">find</span><span class="o">(</span><span class="s">"user"</span><span class="o">)</span>
<span class="o">.</span><span class="na">filter</span><span class="o">(</span><span class="s">"id"</span><span class="o">,</span> <span class="n">id</span><span class="o">)</span>
<span class="o">.</span><span class="na">first</span><span class="o">(</span><span class="nl">callback:</span><span class="o">:</span><span class="n">accept</span><span class="o">);</span>
<span class="o">}</span></code></pre>
<p>Assuming <code>dbClient</code> is implemented in a non-blocking and asynchronous fashion –
that is, each request is enqueued to be delivered and each response handler is
enqueued to react on incoming payloads – what would happen if your database can
handle at most 1 req/sec while your application perfectly receives, handles, and
passes on 10 req/sec? Let me draw you a shallow probability tree depicting
consequences of such an incident.</p>
<ol>
<li>Your database gets more load than it can take. If your database has proper
backpressure mechanics, which most don’t possess, including Elasticsearch,</li>
</ol>
<ul>
<li>
<p>Then it will start choking and eventually throw up. This will get reflected
as query errors on the application side. If your application is equipped
with backpressure mechanics as well, it can kindly reflect this back to the
caller.</p>
</li>
<li>
<p>Otherwise,</p>
<ol>
<li>
<p>even the simplest database queries will start suffering due to heavy load.</p>
</li>
<li>
<p>the database process queue will overgrow.</p>
<ol>
<li>
<p>Excessive growth of the queue (that is, no backpressure mechanics in
place) will start stressing the process memory.</p>
</li>
<li>
<p>The requests that do make it from the queue to an executor thread
will most likely be stale by then. That is, the database will
be doing work that is of no use to anybody.</p>
</li>
</ol>
</li>
<li>
<p>The two points above, drawn from the overgrown process queue of the
database, also apply to the application.</p>
</li>
</ol>
</li>
</ul>
<p>Unfortunately, there is no silver bullet or step-by-step guide to implementing
backpressure for a particular application. In a way this makes sense: each
application has its own domain-specific requirements. That being said, I can
share my personal best practices:</p>
<ul>
<li>
<p>Use performance benchmarks of your application (You have performance
benchmarks, right?) to estimate an upper bound on the load under which your
application still delivers acceptable performance. Enforce this limit in your
application via a rate limiter. Please, please, please don’t block the carrier
thread using the rate limiter! Rather just communicate the backpressure to
your client, for instance, by means of an <a href="https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#5xx_Server_errors">HTTP 503 (SERVICE UNAVAILABLE)
status code</a>.</p>
</li>
<li>
<p>Avoid using thread pools with an unbounded task queue. Java programmers, all
your RxJava/Reactor schedulers are by default backed by a
<code>ScheduledThreadPoolExecutor</code> (the one and only <code>ScheduledExecutorService</code>
implementation in the Java standard library) and that internally uses an
unbounded task queue, unfortunately. See <a href="https://stackoverflow.com/q/7403764/1278899">this Stack Overflow
post</a> and <a href="http://cs.oswego.edu/pipermail/concurrency-interest/2019-April/016860.html">this
concurrency-interest
discussion</a>
on how to work around that.</p>
</li>
<li>
<p>If your application is a pipe between two resources (e.g., from a Pubsub queue
to an Elasticsearch cluster), make sure your producers react to consumers’
backpressure. That is, if the consumer latency starts increasing, you had better
start slowing down the producer.</p>
</li>
</ul>
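<p>The first two practices can be sketched together in plain Java. This is a
minimal sketch under stated assumptions, not a definitive implementation: the
class name <code>BoundedExecutorDemo</code>, the <code>handle</code> helper, the pool and queue
sizes, and the choice of 202 for accepted work are all made up for illustration.
A <code>ThreadPoolExecutor</code> with a bounded <code>ArrayBlockingQueue</code> and the
<code>AbortPolicy</code> rejects excess work instead of queueing it forever, and the
rejection is turned into a 503 without blocking the calling thread.</p>
<pre><code class="language-java">import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedExecutorDemo {

    // Hypothetical limits; derive the real ones from your own benchmarks.
    static final int WORKERS = 1;
    static final int QUEUE_CAPACITY = 1;

    // A pool with a bounded queue; AbortPolicy rejects instead of queueing forever.
    static ThreadPoolExecutor newBoundedExecutor() {
        return new ThreadPoolExecutor(
                WORKERS, WORKERS,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(QUEUE_CAPACITY),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Returns the status the caller should see: 202 if accepted, 503 if saturated.
    static int handle(ThreadPoolExecutor executor, Runnable work) {
        try {
            executor.execute(work);
            return 202;
        } catch (RejectedExecutionException e) {
            // Never block here; just tell the client to back off.
            return 503;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = newBoundedExecutor();
        CountDownLatch release = new CountDownLatch(1);
        Runnable slowQuery = () -> {
            try {
                release.await(); // stands in for a slow database call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int i = 0; i < 3; i++) {
            System.out.println("request " + i + " -> " + handle(executor, slowQuery));
        }
        release.countDown();
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.SECONDS);
    }
}</code></pre>
<p>With one worker and a queue of one, the first request occupies the worker, the
second waits in the queue, and the third is rejected right away. The caller gets
its 503 immediately rather than piling up behind a queue that only keeps growing.</p>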
<p><a name="explicit-timeouts"></a></p>
<h2 id="provide-explicit-timeouts-in-queries">Provide explicit timeouts in queries</h2>
<p>Almost all Elasticsearch API endpoints allow the user to specify a timeout - use
it. This will help both your application and your Elasticsearch cluster: spot
and shrug off unexpectedly long-running operations, save associated resources,
establish a stable SLA with no surprises, etc.</p>
<p><a name="blocking-io-threads"></a></p>
<h2 id="dont-block-the-elasticsearch-client-io-threads-and-know-your-threads">Don’t block the Elasticsearch client I/O threads (and know your threads)</h2>
<p>This tip is also database independent: Don’t block I/O threads of an external
resource for another external resource. Let me demonstrate this pitfall with a
snippet:</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">enrichUserById</span><span class="o">(</span><span class="n">String</span> <span class="n">id</span><span class="o">,</span> <span class="n">Function</span><span class="o"><</span><span class="n">EsDocument</span><span class="o">,</span> <span class="n">EsDocument</span><span class="o">></span> <span class="n">enricher</span><span class="o">,</span> <span class="n">Runnable</span> <span class="n">callback</span><span class="o">)</span> <span class="o">{</span>
<span class="n">esClient</span><span class="o">.</span><span class="na">findUserById</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">user</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">EsDocument</span> <span class="n">enrichedUser</span> <span class="o">=</span> <span class="n">enricher</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="n">user</span><span class="o">);</span>
<span class="n">esClient</span><span class="o">.</span><span class="na">saveUserById</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">enrichedUser</span><span class="o">,</span> <span class="n">callback</span><span class="o">);</span>
<span class="o">});</span>
<span class="o">}</span></code></pre>
<p>What is mostly unintentional here is that both <code>enricher.apply(user)</code> and
<code>callback.run()</code> will get executed in the Elasticsearch client I/O thread. Here
I see two common cases:</p>
<ul>
<li>
<p>If neither function incurs any other I/O calls (except the ones that are
again reaching for Elasticsearch) and this is the only place in the entire
application where you access Elasticsearch, then this is a good practice.
You repurpose the Elasticsearch client I/O thread for CPU-intensive
post-processing. Almost no thread context-switch costs. On a 4 core machine,
with 4 threads dedicated to Elasticsearch client I/O, you will get an almost
optimal performance given your usage pattern.</p>
</li>
<li>
<p>If either function internally performs other I/O calls and/or
there are multiple places in the application where the Elasticsearch client is
used for different purposes, then you are occupying the Elasticsearch client
I/O threads for something unrelated whereas these threads could have just been
serving yet another Elasticsearch request. In such cases, it is better to
employ task-specific thread pools and avoid exhausting the Elasticsearch client
I/O loop unnecessarily:</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kt">void</span> <span class="nf">enrichUserById</span><span class="o">(</span><span class="n">String</span> <span class="n">id</span><span class="o">,</span> <span class="n">Function</span><span class="o"><</span><span class="n">EsDocument</span><span class="o">,</span> <span class="n">EsDocument</span><span class="o">></span> <span class="n">enricher</span><span class="o">,</span> <span class="n">Runnable</span> <span class="n">callback</span><span class="o">)</span> <span class="o">{</span>
<span class="n">esClient</span><span class="o">.</span><span class="na">findUserById</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">user</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">computationExecutor</span><span class="o">.</span><span class="na">submit</span><span class="o">(()</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">EsDocument</span> <span class="n">enrichedUser</span> <span class="o">=</span> <span class="n">enricher</span><span class="o">.</span><span class="na">apply</span><span class="o">(</span><span class="n">user</span><span class="o">);</span>
<span class="n">esClient</span><span class="o">.</span><span class="na">saveUserById</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">enrichedUser</span><span class="o">,</span> <span class="o">()</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">computationExecutor</span><span class="o">.</span><span class="na">submit</span><span class="o">(</span><span class="n">callback</span><span class="o">);</span>
<span class="o">});</span>
<span class="o">});</span>
<span class="o">});</span>
<span class="o">}</span></code></pre>
</li>
</ul>
<p><a name="json-templates"></a></p>
<h2 id="dont-write-elasticsearch-queries-with-json-templates-injecting-variables">Don’t write Elasticsearch queries with JSON templates injecting variables</h2>
<p>Don’t ever do this:</p>
<pre><code class="language-json"><span class="p">{</span>
<span class="nt">"query"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"bool"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"filter"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"term"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"username"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"value"</span><span class="p">:</span> <span class="p">{</span><span class="err">{username</span><span class="p">}}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="err">}</span><span class="p">,</span>
<span class="p">{</span>
<span class="nt">"term"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"password"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"value"</span><span class="p">:</span> <span class="p">{</span><span class="err">{password</span><span class="p">}}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="err">}</span><span class="p">,</span>
<span class="p">]</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span></code></pre>
<p>You just fell into a half-century old trap: <a href="https://en.wikipedia.org/wiki/SQL_injection">SQL
injection</a>, adapted for
Elasticsearch. No matter how smart your character whitelisting and escaping
routines are, it is just a matter of time until someone passes a malicious <code>username</code>
and/or <code>password</code> input that would expose your entire dataset. I also personally
find it a pretty bad practice to render a curly brace (<code>{</code>, <code>}</code>) rich structured
text via a templating language (e.g., Mustache, Handlebars) which uses the very
same curly braces for its own directives.</p>
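<p>To make the trap tangible, here is a minimal sketch of what plain text
substitution does to such a template. Everything in it is an assumption made for
illustration: the class name <code>TemplateInjectionDemo</code>, the shortened
single-placeholder template, and the <code>render</code> helper standing in for a
templating engine.</p>
<pre><code class="language-java">public class TemplateInjectionDemo {

    // Hypothetical, shortened template with a single quoted placeholder.
    static final String TEMPLATE =
            "{\"query\":{\"bool\":{\"filter\":["
            + "{\"term\":{\"username\":{\"value\":\"{{username}}\"}}}"
            + "]}}}";

    // Naive rendering: plain text substitution, unaware of the JSON structure.
    static String render(String username) {
        return TEMPLATE.replace("{{username}}", username);
    }

    public static void main(String[] args) {
        // A benign input yields the query the author had in mind.
        System.out.println(render("volkan"));

        // This input closes the string and the term clause early,
        // then smuggles in a clause of its own.
        String malicious =
                "x\"}}},{\"match_all\":{}},{\"term\":{\"username\":{\"value\":\"y";
        System.out.println(render(malicious));
    }
}</code></pre>
<p>The second rendering is still well-formed JSON, but it now carries three filter
clauses instead of one. The shape of the query is decided by whoever typed the
username, not by the application.</p>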
<p>There are two safe approaches that I can recommend to generate dynamic queries:</p>
<ol>
<li>Use the query models provided by the Elasticsearch (official) client. (This
works for Java pretty well.)</li>
<li>Use a JSON library (e.g., <a href="https://github.com/FasterXML/jackson">Jackson</a>) to
build the JSON tree and serialize it to JSON.</li>
</ol>
<p><a name="json-serializer"></a></p>
<h2 id="prefer-your-own-json-serializer-over-the-one-provided-by-elasticsearch-clients">Prefer your own JSON serializer over the one provided by Elasticsearch clients</h2>
<p>Many Elasticsearch clients allow you to pass a generic JSON object and serialize
it to JSON before passing it over the wire. For instance, the official
Elasticsearch REST client for Java allows <code>java.util.Map<String, Object></code>
instances as source. Then it uses its own JSON serializer to translate these
models into JSON. While this works fine for vanilla Java, which most of the
time is sufficient to get the message across in tutorials, most real world
applications have more complex class structures that necessitate custom
serialization. For instance, speaking of the Java client, how does it serialize
Guava models? What about the new date and time classes introduced in Java 8?
What will happen to all your <code>@JsonProperty</code>-, <code>@JsonSerialize</code>-, etc.
annotated classes? Hence it is always a better practice to employ your own
serialization and pass a <code>byte[]</code> as source to the client. That will save you
from any nasty surprises.</p>
<p><a name="strategy"></a></p>
<h1 id="strategy">Strategy</h1>
<p>In this last section I collected convenient <em>strategic</em> (whatever that means)
practices which address concerns not covered above.</p>
<p><a name="latest-version"></a></p>
<h2 id="always-try-to-stick-to-the-latest-jvm-and-elasticsearch-versions">Always (try to) stick to the latest JVM and Elasticsearch versions</h2>
<p>Elasticsearch is a Java application. (Surprise, surprise!) Like every other Java
application it has its hot paths and garbage collection woes. Almost every new
JVM release will bring you more optimizations that you can take advantage of
without breaking a sweat. Note that due to the low-level performance hacks
exploited in Lucene, which is used internally by Elasticsearch, Lucene is sort
of fragile to JVM upgrades, particularly involving garbage collector changes.
Fortunately, Elasticsearch has an official page listing supported <a href="https://www.elastic.co/support/matrix#matrix_jvm">JVM
releases</a> and <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/_don_8217_t_touch_these_settings.html">garbage
collectors</a>.
Always check these pages out before attempting any JVM upgrades.</p>
<p>Elasticsearch upgrades are also a source of free performance gains. I have never
experienced a regression after Elasticsearch upgrades. That said, your mileage
may vary and this is why you should have proper integration tests in place.</p>
<p><a name="snapshots"></a></p>
<h2 id="use-elasticsearch-complete-and-partial-snapshots-for-backups">Use Elasticsearch complete and partial snapshots for backups</h2>
<p>Elasticsearch lets you easily take
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html">snapshots</a>
of an index either completely or partially against an existing snapshot.
Depending on your update patterns and index size, find the best combination for
your use case. That is, for instance, 1 complete snapshot at 00:00 and 3 partial
snapshots at 06:00, 12:00, and 18:00. It is also known to be a good practice to
have them stored at <a href="https://aws.amazon.com/s3/">AWS S3</a> or <a href="https://cloud.google.com/storage/">GCP
Storage</a>. There are
<a href="https://www.elastic.co/guide/en/elasticsearch/plugins/master/repository.html">plugins</a>
to facilitate these scenarios.</p>
<p>As in every backup solution, make sure you can restore them and practice this a
couple of times. In case of a post-failure restoration, you might need to
engineer your own replay mechanisms to reach the last stable state just before
the crash. Leveraging queues supporting custom retention periods (e.g., keep all
the messages received in the last 2 days) for this purpose is a practice
employed a lot.</p>
<p><a name="performance-test-bed"></a></p>
<h2 id="have-a-continuous-performance-test-bed">Have a continuous performance test bed</h2>
<p>Like any other database, Elasticsearch shows varying performance under different
conditions: index, document sizes; update, query patterns; index, cluster
settings; hardware, OS, JVM versions, etc. It is difficult to keep track of each
knob to observe its impact on overall performance. Make sure you have
(at least) daily performance tests to help you narrow down a recently introduced
change contributing to lower performance.</p>
<p>This utopian test bed is easier said than done. You will need to make sure that
the test environment has representative data of production, (preferably)
identical configuration to production, complete coverage of use cases, and
provides a clean slate (including the OS cache!) for each test to avoid
retaining the effects of a previous run. I know, quite a list. But it pays off.</p>
<p><a name="aliases"></a></p>
<h2 id="use-aliases">Use aliases</h2>
<p>Now I am gonna tell you something quite opinionated, though backed by
experience: Never query against indices, but rather against
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html">alias</a>es.
Aliases are pointers to actual indices. You can group one or more indices under
a single alias. Many Elasticsearch indices have an internal context attached to
the index name, such as, <code>events-20190515</code>. Now you have two choices in the
application code while querying against <code>events-*</code> indices:</p>
<ol>
<li>
<p>Determine the index name on the fly via a certain date pattern:
<code>events-YYYYMMDD</code>. This approach has two major drawbacks:</p>
<ul>
<li>
<p>The need to fall back to an index of a particular date requires the
entire code base to be engineered accordingly to support such an operation.</p>
</li>
<li>
<p>Putting all clock synchronization issues aside, at midnight, you need to
make sure that the next index is there.</p>
</li>
</ul>
</li>
<li>
<p>Create an <code>events</code> alias pointing to the <code>events-*</code> index you want the
application to use. The component responsible for the creation of the new
indices can atomically switch the alias to the new index. This approach will
bring two notable benefits:</p>
<ul>
<li>
<p>It doesn’t suffer from the drawbacks of the previous approach.</p>
</li>
<li>
<p>The application code is way simpler, as it just points to the <code>events</code>
alias everywhere.</p>
</li>
</ul>
</li>
</ol>
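<p>The difference between the two choices shows up even in a few lines of
application code. The following is only a sketch: the class name
<code>EventsIndexNaming</code>, the <code>indexNameFor</code> helper, and the
<code>EVENTS_ALIAS</code> constant are hypothetical names, and the date pattern is the
<code>events-YYYYMMDD</code> one used above.</p>
<pre><code class="language-java">import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class EventsIndexNaming {

    // Choice 1: derive the index name from the clock on every request.
    static String indexNameFor(LocalDate date) {
        // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20190515.
        return "events-" + date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Choice 2: a fixed alias; whoever creates the indices repoints it.
    static final String EVENTS_ALIAS = "events";

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2019, 5, 15);
        System.out.println(indexNameFor(today));             // events-20190515
        // Right after midnight the application already targets the next index,
        // whether or not it has been created yet.
        System.out.println(indexNameFor(today.plusDays(1))); // events-20190516
        System.out.println(EVENTS_ALIAS);                    // events
    }
}</code></pre>
<p>With the first choice every caller has to carry a date around and hope the
matching index exists. With the second, the name is a constant and the switch
to a new index happens outside the application.</p>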
<p><a name="synonyms"></a></p>
<h2 id="avoid-having-big-synonym-collections">Avoid having big synonym collections</h2>
<p>Elasticsearch supports both index- and query-time
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html">synonyms</a>.
These are powerful shotguns with a good track record of shooting their wielder in
the foot. No search engine is complete without synonyms, hence they have pretty
valid use cases. That said, the following implications need to be kept in mind while
employing them:</p>
<ul>
<li>
<p>Index-time synonyms increase the index size and create extra runtime
overhead.</p>
</li>
<li>
<p>Query-time synonyms don’t add to the index size but, as their name
implies, create extra runtime overhead.</p>
</li>
<li>
<p>Using synonyms, it is pretty easy to unintentionally break something while
trying to fix some other thing.</p>
</li>
</ul>
<p>Continuously monitor the impact of synonyms on the performance and try to write
tests for each synonym added.</p>
<p><a name="force-merge"></a></p>
<h2 id="force-merge-and-increase-operation-bandwidth-before-enabling-replicas">Force merge and increase operation bandwidth before enabling replicas</h2>
<p>A really common Elasticsearch use case is to periodically (once every couple of
hours) create an index. There is a really good
<a href="https://developers.soundcloud.com/blog/how-to-reindex-1-billion-documents-in-1-hour-at-soundcloud">SoundCloud</a>
article on how to achieve the optimal performance. Quoting from that
compilation, there are the following items that I particularly find a “must”.</p>
<ol>
<li>Always enable replicas after completing the indexing.</li>
<li>Before enabling replicas, make sure you have
<ul>
<li>shrunk the index size by <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html">forcing a
merge</a>
and</li>
<li>temporarily increased the replica transmission bandwidth until allocation
has finished, that is,
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/recovery.html"><code>indices.recovery.max_bytes_per_sec</code></a>.</li>
</ul>
</li>
</ol>
<p><a name="metrics"></a></p>
<h2 id="record-application-level-metrics">Record application-level metrics</h2>
<p>Kibana provides quite solid insights into Elasticsearch performance:
indexing, search latency and throughput, flush, merge operations, etc. Once you
enhance this view with extra JVM (GC pauses, heap size, etc.) and OS (CPU usage,
disk I/O, kernel caches, etc.) metrics, you will get a rock solid monitoring
dashboard. That said, this is not enough. Whether used by a single application or
several, Elasticsearch will get hit by various access patterns. Imagine a part of
your software trying to shovel off 10 million documents of not-so-important user
activity, while another component is trying to update user account details. If
you look at your shiny Elasticsearch metrics, everything will look fine
since 10 million document updates will make the statistical effect of user
account updates disappear. On the other hand, your users might not be that
happy with the latency they observe while they are trying to update their
accounts. Hence always expose extra application-level metrics for your
Elasticsearch queries. While Elasticsearch metrics provide sufficient indicators
for the overall cluster performance, they lack the domain-specific context.</p>
<p><a name="cpu"></a></p>
<h2 id="invest-in-cpu">Invest in CPU!</h2>
<p>I cannot emphasize this enough: <strong>invest in CPU!</strong> Now open your dashboards,
look at the metrics of your most recent Elasticsearch hammering session in
production, and tell me whether you are disk I/O, memory, or CPU bound. I am not
telling you to use a 32-core machine with an inferior SATA disk drive and 8 GB
of memory. Rather what I am talking about is this: get a decent machine with
sufficient memory and SSD (you might want to check out NVMe cards too depending
on your disk I/O), after that invest in CPU. Judging from my past experiences,
whether it is a write- or read-heavy load, CPU has always been our bottleneck.</p>
<p><a name="plugins"></a></p>
<h2 id="avoid-writing-custom-elasticsearch-plugins">Avoid writing custom Elasticsearch plugins</h2>
<p>Once I had the joy of pairing with a colleague to write
an Elasticsearch plugin that exposes synonyms over a REST endpoint. That allowed
us to query synonyms uploaded to an Elasticsearch cluster at runtime and
manipulate them on the fly. After 20 releases, by which point we were convinced that
there were no more concurrency bugs (actually, we were pretty convinced with
the previous 19 releases too), it worked like a charm. Though the real suffering
started when we tried to upgrade the Elasticsearch version supported by the plugin.
In a nutshell, avoid writing custom Elasticsearch plugins, because…</p>
<ul>
<li>
<p>Many Elasticsearch releases contain significant internal changes. It is highly
likely the public APIs you base your plugin on will be hit by backward incompatible
changes.</p>
</li>
<li>
<p>Getting concurrency right in an environment that you are not accustomed to can
be daunting.</p>
</li>
<li>
<p>You need to tailor your deployment procedure to ship the plugin every time and
everywhere it is needed. You cannot just use the vanilla Elasticsearch
artifacts as is anymore.</p>
</li>
<li>
<p>Since your application depends on the specific functionality provided by the
plugin, the Elasticsearch instances you run during integration tests also need
to incorporate the plugin. Hence you cannot use vanilla, say, Docker
images out of the box anymore.</p>
</li>
</ul>
tag:volkan.yazi.ci,2019-01-26://blog/post/2019/01/26/reactive-spring5/Notes on "Reactive Programming in Spring 5"2019-01-26T20:25:00Z2019-01-26T20:25:00Z
<blockquote>
<p><strong>TL;DR</strong> – Already have your hands dirty with Spring? Wondering
how reactive will play along with Spring and how they will shape
each other? Read this book. You will get crushed under the load of
information it tries to cover, but will have no regrets.</p>
</blockquote>
<p><img src="cover.jpg" alt="Reactive Programming in Spring 5"></p>
<p>I tend to avoid reading books whose subject is about a certain library
or tool, unless I think it will provide some insight into a certain
domain in addition to the software’s internals. And as you may guess,
many <a href="https://spring.io/">Spring</a>-related books do not pass this
test. Though <a href="https://www.packtpub.com/application-development/hands-reactive-programming-spring-5">Reactive Programming in Spring
5</a>
was an exception for the following reasons:</p>
<ul>
<li>
<p><a href="https://twitter.com/OlehDokuka">Oleh Dokuka</a>, one of the two
authors of the book, is a contributor to many Spring projects, in
particular to <a href="http://projectreactor.io/">Reactor</a>. He is further a
pretty active figure in the <a href="https://gitter.im/reactor/reactor">Reactor Gitter
channel</a> where he is serving a
fantastic public service. I personally am really interested in
hearing anything he would say about <a href="http://www.reactive-streams.org/">Reactive
Streams</a> in general.</p>
</li>
<li>
<p>Love it, hate it, but almost the entire Java train is fueled by
the Spring framework. In my day-to-day work, I am swamped with Spring
too. If you pay close attention, particularly after the recent
<a href="https://springoneplatform.io/2018/sessions">SpringOne 2018</a>, Spring
is moving full steam towards a reactive interface supported by all
its bells and whistles – <a href="http://rsocket.io/">RSocket</a>,
<a href="https://r2dbc.io/">R2DBC</a>, <a href="https://docs.spring.io/spring-framework/docs/5.0.0.BUILD-SNAPSHOT/spring-framework-reference/html/web-reactive.html">WebFlux and
WebClient</a>,
Reactor, etc. Though sometimes it is really difficult to see how the
entire machine is progressing given the excessive amount of
sub-projects. I expected the book to provide a 2,000 ft overview of
the roadmap of this beast.</p>
</li>
</ul>
<h1 id="overview">Overview</h1>
<p>If you are dealing with Spring while taking your daily dose of Java,
<strong>read this book!</strong> The book manages to deliver outstanding
coverage of both reactive streams and their integration into almost all of
the existing Spring components. In a nutshell, it lays out the
following subjects in order:</p>
<ul>
<li>fundamentals of reactive programming (observer pattern)</li>
<li>its historical development in the presence of RxJava, Reactive Streams, and Reactor</li>
<li>pull, push, and pull-push streaming</li>
<li>backpressure</li>
<li>theoretical foundations of reactive programming (Little’s and Amdahl’s laws)</li>
<li>WebFlux versus Web MVC</li>
<li>introduction to Reactor</li>
<li>reactive
<ul>
<li>Spring Data</li>
<li>Spring Cloud Stream</li>
<li>Spring Cloud Data Flow</li>
<li>Spring Boot</li>
</ul>
</li>
<li>RSocket</li>
<li>testing</li>
<li>deployment to cloud</li>
</ul>
<p>In conclusion, the authors did a splendid job of compiling exhaustive
content covering almost every aspect of reactive programming in Spring
5. They possess an in-depth understanding of the material they are
talking about and present it in a digestible form to the reader. While
this thick compilation has the potential of crushing the reader, it is
definitely a rewarding read.</p>
<h2 id="pros">Pros</h2>
<p>After finishing the book, a cover-to-cover marathon of 538 pages, I
felt like I knew everything there is to know about reactive
programming in Spring.</p>
<p><img src="i-know.jpg" alt="I know reactive Spring!"></p>
<p>The extent of the covered content was astonishingly huge. I knew about
RxJava, Reactive Streams, and Reactor individually, but not how they
relate to each other. The presented historical perspective filled in
the missing pieces of the puzzle for me. I really enjoyed how the
prose builds up by examining the problem, introducing the solution,
laying out caveats of certain approaches, and drawing real-world
comparisons. They repeated this pragmatic pattern for almost every
project (Reactor, WebFlux, Spring Data, Spring Cloud Stream, etc.)
they introduced.</p>
<p>I was not expecting in-depth Reactor content, though the authors
did a perfect job here given the book’s main subject is not just
Reactor.</p>
<p>The investigation of pull, push, and pull-push streaming, the stress on
backpressure, and the reinforcement of the content with some theory
(Little’s and Amdahl’s laws) were pretty enlightening to read. While
many would consider these subjects boring, I found them pretty
interesting and practical.</p>
<h2 id="cons">Cons</h2>
<p>The book is long to the point of becoming exhausting. Much of the material
(Vert.x, Ratpack integration, Reactive Streams TCK, transactions in
distributed messaging systems, etc.) could have been left out and then
the main juice would fit into ~300 pages. Even then, it is still
long. I would have split the book into two: <em>Project Reactor in
Action</em> and <em>Reactive Programming in Spring 5</em>. While reading, I
could already sense such a split in the material.</p>
<p>While 90% of the book is pretty objective, certain subjects were
presented with bias. Particularly, RSocket-vs-gRPC comparison and
using message brokers over system-to-system communication chapters
were really opinionated. (I further explained this problem in my notes
shared below.)</p>
<p>The presented content is always exercised using Apache Kafka and
MongoDB. I suppose this is due to the fact that these two are the only
tools that have almost perfect reactive coverage by the Spring family of
projects.</p>
<h1 id="notes">Notes</h1>
<p>Below I share my notes ranging from a grammar mistake to a code typo,
incorrect or ambiguous information to a thought provoking know-how,
practical tip, etc. I further classify them in 4 groups:
<span class="note-correction">corrections</span>,
<span class="note-improvement">improvements</span>,
<span class="note-question">questions</span>, and
<span class="note-other">other</span>.</p>
<ul>
<li>
<p><span class="note-correction">[p11]</span> Shared <code>processRequest()</code>
method is missing a <code>return</code> statement.</p>
</li>
<li>
<p><span class="note-correction">[p12]</span> As a coroutine/fiber
fanboy, I could not have passed through the following statement:</p>
<blockquote>
<p>In some languages, such as C#, Go, and Kotlin, the same code [a
simple inter-microservice communication over HTTP] might be
non-blocking when green threads are used. However, in pure Java,
we do not have such features yet. Consequently, the actual thread
will be blocked in such cases.</p>
</blockquote>
<p>I forgot how many times I needed to correct this misbelief,
but I will repeat it here one more time: For any language X
compiling to JVM bytecode, if a Java standard library call is
[thread] blocking, access to it in X is going to be blocking as
well. Kotlin coroutines are no exception to this. Further, every
convenience X provides, say coroutines, except syntactic sugar, has
already been at the disposal of Java via solid libraries like
<a href="https://github.com/nqzero/kilim">Kilim</a> (since 2006) and
<a href="https://github.com/puniverse/quasar">Quasar</a> (since 2013).</p>
<p>Additionally, the rules of the game will totally change in a couple
of years after the release of <a href="http://openjdk.java.net/projects/loom/">Project
Loom</a>.</p>
</li>
<li>
<p><span class="note-improvements">[p16]</span> In Diagram 1.4, what
is the difference between solid and dashed lines? I am surprised to
see that the image is advertising Apache Kafka while the rest of the
components are free from such specification.</p>
</li>
<li>
<p><span class="note-improvements">[p18]</span> In Diagram 1.5, texts
are not readable. (Unfortunately, many of the image texts are not
readable in the hardcopy. I don’t know about the e-book though.)</p>
</li>
<li>
<p><span class="note-correction">[p28]</span> <em>all cross-service
communication is non-blocking anymore.</em> → <em>… is not blocking
anymore.</em></p>
</li>
<li>
<p><span class="note-correction">[p46]</span> <em>the actual request
processing continues until <code>SseEnitter.complete()</code></em> →
<em>… until <code>SseEmitter.complete()</code></em></p>
</li>
<li>
<p><span class="note-improvement">[p46]</span> Justification of why
<code>@Async</code> is necessary here is missing.</p>
</li>
<li>
<p><span class="note-correction">[p59]</span> <em>The <code>count</code> operator is
pretty descriptive, it emits the only value …</em> → <em>… emits
only one value …</em></p>
</li>
<li>
<p><span class="note-question">[p62]</span> Chapter starts with</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">SearchEngine</span> <span class="o">{</span>
<span class="n">List</span><span class="o"><</span><span class="n">URL</span><span class="o">></span> <span class="nf">search</span><span class="o">(</span><span class="n">String</span> <span class="n">query</span><span class="o">,</span> <span class="kt">int</span> <span class="n">limit</span><span class="o">);</span>
<span class="o">}</span></code></pre>
<p>interface, improves it with</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">IterableSearchEngine</span> <span class="o">{</span>
<span class="n">Iterable</span><span class="o"><</span><span class="n">URL</span><span class="o">></span> <span class="nf">search</span><span class="o">(</span><span class="n">String</span> <span class="n">query</span><span class="o">,</span> <span class="kt">int</span> <span class="n">limit</span><span class="o">);</span>
<span class="o">}</span></code></pre>
<p>and</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">FutureSearchEngine</span> <span class="o">{</span>
<span class="n">CompletableFuture</span><span class="o"><</span><span class="n">List</span><span class="o"><</span><span class="n">URL</span><span class="o">>></span> <span class="nf">search</span><span class="o">(</span><span class="n">String</span> <span class="n">query</span><span class="o">,</span> <span class="kt">int</span> <span class="n">limit</span><span class="o">);</span>
<span class="o">}</span></code></pre>
<p>After piling up enough motivation, justifies the victory of
<code>Observable</code>s:</p>
<pre><code class="language-java"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">RxSearchEngine</span> <span class="o">{</span>
<span class="n">Observable</span><span class="o"><</span><span class="n">URL</span><span class="o">></span> <span class="nf">search</span><span class="o">(</span><span class="n">String</span> <span class="n">query</span><span class="o">,</span> <span class="kt">int</span> <span class="n">limit</span><span class="o">);</span>
<span class="o">}</span></code></pre>
<p>As a reader, let me ask the obvious question: Why not use a
<code>Stream<URL></code> return type instead?</p>
</li>
<li>
<p><span class="note-question">[p103]</span> In the following
explanation,</p>
<blockquote>
<p>Note that <code>ScheduledPublisher</code> is effectively an infinite stream
and the completion of the merged <code>Publisher</code> is ignored.</p>
</blockquote>
<p>What does <em>merged <code>Publisher</code></em> refer to? <code>NewsPreparationOperator</code>?</p>
</li>
<li>
<p><span class="note-correction">[p106]</span> <em>Unfortunately, building a
proper test suit</em> → <em>… test suite</em></p>
</li>
<li>
<p><span class="note-improvement">[p114]</span> Rather than focusing
this much on the TCK, which is most of the time more relevant for
library authors than for users, I wish the authors had spared some
pages on pitfalls and patterns of processors, which are the bread and
butter of reactive streams users and were sadly skipped in the
previous chapter.</p>
</li>
<li>
<p><span class="note-improvement">[p123-129]</span> I don’t think the
readers really care about reactive streams adjustments for Vert.x,
Ratpack or MongoDB reactive streams driver. I find these pages
irrelevant and distracting.</p>
</li>
<li>
<p><span class="note-improvement">[p133]</span> It is page 133 and the
reader is only now getting exposed to Project Reactor. Sort of too
late, I think.</p>
</li>
<li>
<p><span class="note-improvement">[p133]</span> <em>Project Reactor, the
most famous library in the reactive landscape, …</em> Really? Putting
aside that I favor Reactor over RxJava, the IT community that I know
of does not give me such an impression to back this claim up.</p>
</li>
<li>
<p><span class="note-correction">[p141]</span> <em>investigating its reach
API.</em> → <em>… its rich API</em></p>
</li>
<li>
<p><span class="note-improvement">[p149]</span> Speaking of <code>defer()</code>,
it allows retries, which is, I think, of utmost importance, mostly
overlooked by many, and aligns better with reactive streams semantic
compliance. Consider the following code:</p>
<pre><code class="language-java"><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">Mono</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="nf">f</span><span class="o">(</span><span class="n">Mono</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">mono</span><span class="o">)</span> <span class="o">{</span>
<span class="n">Stopwatch</span> <span class="n">stopwatch</span> <span class="o">=</span> <span class="n">Stopwatch</span><span class="o">.</span><span class="na">createUnstarted</span><span class="o">();</span>
<span class="k">return</span> <span class="n">mono</span>
<span class="o">.</span><span class="na">doOnSubscribe</span><span class="o">(</span><span class="n">ignored</span> <span class="o">-></span> <span class="n">stopwatch</span><span class="o">.</span><span class="na">start</span><span class="o">())</span>
<span class="o">.</span><span class="na">doOnTerminate</span><span class="o">(()</span> <span class="o">-></span> <span class="n">LOGGER</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">stopwatch</span><span class="o">));</span>
<span class="o">}</span></code></pre>
<p>versus the alternative below, where each subscription, and hence each
retry, gets its own <code>Stopwatch</code></p>
<pre><code class="language-java"><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">Mono</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="nf">f</span><span class="o">(</span><span class="n">Mono</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="n">mono</span><span class="o">)</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Mono</span><span class="o">.</span><span class="na">defer</span><span class="o">(()</span> <span class="o">-></span> <span class="o">{</span>
<span class="n">Stopwatch</span> <span class="n">stopwatch</span> <span class="o">=</span> <span class="n">Stopwatch</span><span class="o">.</span><span class="na">createUnstarted</span><span class="o">();</span>
<span class="k">return</span> <span class="n">mono</span>
<span class="o">.</span><span class="na">doOnSubscribe</span><span class="o">(</span><span class="n">ignored</span> <span class="o">-></span> <span class="n">stopwatch</span><span class="o">.</span><span class="na">start</span><span class="o">())</span>
<span class="o">.</span><span class="na">doOnTerminate</span><span class="o">(()</span> <span class="o">-></span> <span class="n">LOGGER</span><span class="o">.</span><span class="na">log</span><span class="o">(</span><span class="s">"{}"</span><span class="o">,</span> <span class="n">stopwatch</span><span class="o">));</span>
<span class="o">});</span>
<span class="o">}</span></code></pre>
<p>Also this is a good spot to talk about assembly-vs-runtime overhead
of operator chaining.</p>
</li>
<li>
<p><span class="note-correction">[p149]</span> The word wrapping of
paragraph starting with <em>Summing this up, Project reactor allows
…</em> is broken.</p>
</li>
<li>
<p><span class="note-correction">[p150]</span> <em>…, we may still request
<code>Long.MAX_VALUE).</code></em> → <em>request <code>Long.MAX_VALUE.</code></em></p>
</li>
<li>
<p><span class="note-improvement">[p154]</span> The first paragraph of
the page (<em>However, the described approach for defining subscription
…</em>) is an eye-opening gem. Given that many seasoned RS users have
fallen into this trap at least once, a couple of paragraphs detailing
these caveats would come in really handy.</p>
</li>
<li>
<p><span class="note-improvement">[p169]</span> Speaking of the
<code>Flux#flatMap()</code> operator, its <code>concurrency</code> parameter could have been
briefly mentioned here.</p>
</li>
<li>
<p><span class="note-question">[p171]</span> Given we are executing
<code>doOn...()</code> methods (<code>doOnNext()</code>, <code>doOnComplete()</code>,
<code>doOnSubscribe()</code>, etc.) for their side effects, what happens
when they throw an exception?</p>
</li>
<li>
<p><span class="note-improvement">[p173]</span> Provide a couple of
examples for <code>log()</code> operator, please! Many people, particularly
coming from RxJava, do not know of its existence.</p>
</li>
<li>
<p><span class="note-correction">[p174]</span> <em>However, this allows
forthe sending …</em> → <em>… allows for the sending …</em></p>
</li>
<li>
<p><span class="note-question">[p174]</span> In the <code>Flux.create()</code>
example, I am puzzled about how it handles backpressure.</p>
</li>
<li>
<p><span class="note-improvement">[p183]</span> I think <em>exponential
backoff</em> deserves an explanation here. What does it mean? Further, I
would have used seconds rather than milliseconds in the example,
since the latter does not expose the backoff periods clearly at
first sight.</p>
</li>
<li>
<p><span class="note-correction">[p188]</span> <em><code>Flux<Integer>
cachedSource = source.share();</code></em> Here and in the rest, all of the
occurrences of <code>cachedSource</code> should be replaced with <code>sharedSource</code>.</p>
</li>
<li>
<p><span class="note-improvement">[p191]</span> The term <em>assembly
phase</em> is only described in later chapters, hence using it here
is sort of confusing. Further, the given explanation for
<code>transform()</code> still has some room for improvement. Additionally, I
would prefer to see a “when to use <code>compose()</code>, when to use
<code>transform()</code>” guide.</p>
</li>
<li>
<p><span class="note-improvement">[p196-197]</span> There are 3 places
where <em>assemble-time</em> is used, though it was previously referred to as
<em>assembly-time</em>.</p>
</li>
<li>
<p><span class="note-correction">[p198]</span> <em>… passing
<code>Subscription</code> through ever <code>Subscriber</code>s …</em> → <em>… passing
<code>Subscription</code> through <code>Subscriber</code>s …</em></p>
</li>
<li>
<p><span class="note-other">[p200]</span> The last paragraph of p199
goes as follows:</p>
<blockquote>
<p>THe importance of understanding this phase is that during runtime
we may apply optimization that may reduce amount of signals
exchange. [“the amount of signals exchanged”?] For example, as we
are going to see in the next sections, we may reduce the number of
<code>Subscription#request</code> calls and improve, therefore, performance
of the stream.</p>
</blockquote>
<p>Then the following tip opens the p200:</p>
<blockquote>
<p>… the invocation of <code>Subscription#request</code> method causes a write
to the <code>volatile</code> field that holds demand. Such a write is an
expensive operation from computation perspective, so it is better
to avoid it if possible.</p>
</blockquote>
<p>Hrm… I thought the main overhead of <code>Subscription#request</code> calls
was unnecessary individual requests which could have been
batched. Compared to this, a write to a <code>volatile</code> does not really
sound like an overhead.</p>
</li>
<li>
<p><span class="note-improvement">[p200]</span> Diagram 4.8 is missing
some explanation.</p>
</li>
<li>
<p><span class="note-question">[p204]</span> In Diagram 4.11, how is
backpressure preserved given there is an external buffer employed
by <code>publishOn</code>? Who maintains this buffer? Is it the task queue of
the <code>ScheduledExecutorService</code> used under the hood?</p>
</li>
<li>
<p><span class="note-question">[p206]</span> Totally lost with the
following sentence: <em>Under the hood, <code>subscribeOn</code> executes the
subscription to the parent <code>Publisher</code> into <code>Runnable</code>, which is the
scheduler for a specified <code>Scheduler</code>.</em></p>
</li>
<li>
<p><span class="note-correction">[p207]</span> In the shared code
snippet, <code>.map()</code> and <code>.filter()</code> are missing <code>...</code> as input
arguments.</p>
</li>
<li>
<p><span class="note-question">[p208]</span> How does
<code>SingleScheduler</code> handle blocking functions spoiling time-sensitive
executions?</p>
</li>
<li>
<p><span class="note-improvement">[p209]</span> <em>… so the execution
is attache to …</em> → <em>… so the execution is attached to …</em></p>
</li>
<li>
<p><span class="note-improvement">[p209]</span> In the next page, it
has been admitted that the shared code snippet does not demonstrate
a good usage of <code>ThreadLocal</code>s. So why not share a proper use of
<code>ThreadLocal</code>s?</p>
</li>
<li>
<p><span class="note-correction">[p211]</span> I think <em>Such design
decision … have its own <code>Context</code>.</em> part of the last paragraph
needs a rewrite.</p>
</li>
<li>
<p><span class="note-correction">[p213]</span> <em>… bottom (with id
initial) has empty Context at all.</em> → <em>… bottom (with id
<code>initial</code>) has empty <code>Context</code> at all.</em></p>
</li>
<li>
<p><span class="note-correction">[p214]</span> <em>The preciding code show a
…</em> → <em>The preceding code shows a …</em></p>
</li>
<li>
<p><span class="note-correction">[p215]</span> <em>… we mentioned another
sample of such …</em> → <em>… we mentioned another example of such
…</em></p>
</li>
<li>
<p><span class="note-correction">[p217]</span> In Diagram 4.15, the box
at the bottom should be titled <code>subscribe()</code> instead of <code>flatMap()</code>.</p>
</li>
<li>
<p><span class="note-correction">[p225]</span> <em>Preview Online Code
Files</em> at the bottom looks to be a typo.</p>
</li>
<li>
<p><span class="note-question">[p230]</span> <em>… the WebFlux module
provides built-in backpressure support …</em> Cool! But how does it do
that?</p>
</li>
<li>
<p><span class="note-correction">[p254]</span> <em>… using the
<code>PasswordEncoder#matchs</code> …</em> → <em>… using the
<code>PasswordEncoder#matches</code> …</em></p>
</li>
<li>
<p><span class="note-question">[p254]</span> In the shared snippet,
which scheduler executes the <code>map(p ->
passwordEncoder.matches(...))</code> line? Netty I/O loop thread? If so
(or some other scheduler of similar importance), is it wise to perform
<code>passwordEncoder#matches</code> here?</p>
</li>
<li>
<p><span class="note-correction">[p255]</span> In the shared code
snippet, shouldn’t <code>flatMap(isMatched -> ...)</code> be replaced with
<code>map(isMatched -> ...)</code>?</p>
</li>
<li>
<p><span class="note-correction">[p278]</span> <em><code>else
if(resposne.statusCode() == EXPECTATION_FAILD) {</code></em> → <em><code>else
if(response.statusCode() == EXPECTATION_FAILED) {</code></em></p>
</li>
<li>
<p><span class="note-correction">[p267]</span> <em>… template has a
placeholder, <code>dataSource</code>, …</em> → <em>… template has a
placeholder, <code>playList</code>, …</em></p>
</li>
<li>
<p><span class="note-correction">[p273-274]</span> All appearances of
<em>Albom</em> and <em>albom</em> should be replaced with <em>Album</em> and <em>album</em>,
respectively.</p>
</li>
<li>
<p><span class="note-correction">[p278]</span> <em>As we can observe from
the preciding diagram, with an increase in parallelization, the
throughput of the system starts becoming slower and slower.</em>
Hrm… Actually the throughput keeps on increasing, though the gain
is slowing down. Does it mean <em>the increase in throughput</em>?</p>
</li>
<li>
<p><span class="note-improvement">[p281]</span> The definitions of
contention and coherence are left pretty ambiguous. A couple of
practical examples would come in really handy.</p>
</li>
<li>
<p><span class="note-correction">[p292]</span> Diagram 6.16 is missing the
legend. The caption can be reworked to fix this as follows: <em>WebFlux
(dash) versus WebMVC (plus) throughput…</em> This applies to the rest
of the diagrams with multiple lines in the following pages.</p>
</li>
<li>
<p><span class="note-improvement">[p301-303]</span> I would use
a bar chart for performance figures rather than a table.</p>
</li>
<li>
<p><span class="note-improvement">[p308]</span> Diagram 6.25 looks
like a <a href="https://en.wikipedia.org/wiki/Microsoft_Paint">mspaint.exe</a>
101 assignment submission.</p>
</li>
<li>
<p><span class="note-correction">[p311]</span> <em>… the system has four
central components …</em> → <em>… the system has three central
components …</em></p>
</li>
<li>
<p><span class="note-correction">[p349]</span> <em>… EclipseLink, Spring
Data JDBC, and Spring Data JDBC, …</em> → <em>… EclipseLink,
Spring Data JDBC, and Spring JDBC, …</em></p>
</li>
<li>
<p><span class="note-correction">[p356-357]</span> All occurrences of
<code>updatedBook...</code> should be replaced with <code>updateBook...</code>.</p>
</li>
<li>
<p><span class="note-improvement">[p379]</span> What is the point of
<em>Distributed transactions with the SAGA pattern</em> here?</p>
</li>
<li>
<p><span class="note-improvement">[p405]</span> In Diagram 8.1, there
is plenty of space that could have been used and the texts are so
small to the point of becoming unreadable.</p>
</li>
<li>
<p><span class="note-other">[p418]</span> This page concludes the
<em>Scaling up with Cloud Streams</em> chapter by announcing its triumph
over previously mentioned service-to-service architectures. I sadly
find this conclusion pretty biased and far from reflecting the
reality. There is no silver bullet for such problems and as has
always been the case, <em>it depends</em>. Further, almost all of the
listed so-called improvements to message broker-based solutions are
either as is or with similar approaches applicable to
service-to-service architectures as well. It is also surprising to
see no mention of the practical hurdles one needs to overcome while
scaling brokers. You can easily Google and find dozens of Kafka
horror stories. For a book that spent chapters on backpressure
problems, the question of how it is affected by the delegation of
the communication to a broker is totally ignored. Given how
objective and well-equipped the authors were throughout the entire
book, I am really amazed to see such a hype-oriented conclusion at
the end of this chapter.</p>
</li>
<li>
<p><span class="note-improvement">[p425]</span> In the shared code
snippet, <code>messagesStream</code> and <code>statisticStream</code> are neither
described nor defined.</p>
</li>
<li>
<p><span class="note-improvement">[p430]</span> For a book published
in October 2018, it is better not to limit the list of cloud
providers offering serverless deployments to AWS Lambda. To the
best of my knowledge, GCP, Azure, and many others provide similar
functionalities.</p>
</li>
<li>
<p><span class="note-correction">[p435]</span> <em><code>ClassLoad</code></em> →
<em><code>ClassLoader</code></em></p>
</li>
<li>
<p><span class="note-question">[p437]</span> It is noted that authors
escaped <code><</code> and <code>></code> characters in the URL they pass to <code>curl</code> with
<code>%3C</code> and <code>%3E</code>, respectively. Though, AFAIK, <code>curl</code> already does
that if you pass the URL in quotes. Doesn’t it?</p>
</li>
<li>
<p><span class="note-correction">[p440]</span> <em>… represent independent
ClassLoaders.</em> → <em>… represent independent <code>ClassLoader</code>s.</em></p>
</li>
<li>
<p><span class="note-correction">[p441]</span> <em><code>implements
Function<Payment, Payment></code></em> → <em><code>implements Function<Payment,
PaymentValidation></code></em></p>
</li>
<li>
<p><span class="note-correction">[p441]</span> <em><code>public Payment
apply(Payment payment) { ... }</code></em> → <em><code>public PaymentValidation
apply(Payment payment) { ... }</code></em></p>
</li>
<li>
<p><span class="note-correction">[p448]</span> A dashed arrow is missing
from Service B’s <code>request(1)</code> line to Service A’s <code>request(10)</code>
line.</p>
</li>
<li>
<p><span class="note-improvement">[p449]</span> Page contains a pretty
well wrapped comparison of Reactive Streams versus TCP for flow
control. Given HTTP/3 is almost ready to be served and is employing
UDP rather than TCP, a <em>Reactive Streams versus HTTP/3</em> comparison
would be really thought-provoking.</p>
</li>
<li>
<p><span class="note-correction">[p450]</span> <code>request(3)</code> under Service
A’s adapter is, I believe, mistakenly placed.</p>
</li>
<li>
<p><span class="note-other">[p454]</span> In the shared code snippet,
isn’t <code>Flux.interval(Duration.ofMillis(100))</code> too harsh for service
discovery?</p>
</li>
<li>
<p><span class="note-other">[p456]</span> I am aware of the fact that
the design of gRPC incurs <a href="https://github.com/grpc/grpc-java/issues/1549">serious flow control
problems</a>, though in
a section where you compare contenders like RSocket and gRPC, there
is no place for such a vague and ungrounded statement:</p>
<blockquote>
<p>Looking at the preciding code, we may get the feeling that gRPC,
along with asynchronous message passing, gives backpressure
control support as well. However, that part is a bit tricky.</p>
</blockquote>
<p>Is backpressure not handled at all, or what? One should not pass
over such a claim by hand waving.</p>
</li>
<li>
<p><span class="note-correction">[p457]</span> <em><code>public static class
TestApplication</code></em> → <em><code>public class TestApplication</code></em></p>
</li>
<li>
<p><span class="note-question">[p457]</span> The RSocket endpoint is
annotated with <code>@RequestManyMapping(value = "/stream1", mimeType =
"application/json")</code>. Given RSocket is a wire protocol, what is the
purpose of the provided path and MIME type here?</p>
</li>
<li>
<p><span class="note-other">[p461]</span> The summary repeats the
triumph of message brokers over service-to-service communication one
more time. Again, this is a really subjective statement. There are
many advantages and disadvantages to each approach. Announcing a
winner is a pretty dangerous generalization.</p>
</li>
<li>
<p><span class="note-improvement">[p471]</span> The shared
<code>StepVerifier.withVirtualTime()</code> trick works only if you do not use a
custom scheduler in <code>interval()</code> and <code>timeout()</code> calls.</p>
</li>
<li>
<p><span class="note-other">[p478]</span> Here authors share a hack to
work around the problem of Spring Boot 2.0 <code>WebClient</code> not providing
mocking support for outgoing HTTP interactions. I would rather
share this trick in a Stack Overflow post and stick to the
aforementioned WireMock approach in the book, because readers will
take what the book shows for granted.</p>
</li>
<li>
<p><span class="note-other">[p480]</span> <em>Testing
WebSocket</em>… Again, this is a hack rather than a solution. I would
not share it in the book.</p>
</li>
<li>
<p><span class="note-correction">[p491]</span> Diagram 10.1 is totally
irrelevant to the content of the page.</p>
</li>
<li>
<p><span class="note-improvement">[p492]</span> Some texts in Diagram
10.3 are not readable.</p>
</li>
<li>
<p><span class="note-correction">[p503]</span> <em>… memory usage by
regions, GC pauses, treads count …</em> → <em>… memory usage by
regions, GC pauses, thread counts …</em></p>
</li>
<li>
<p><span class="note-improvement">[p510]</span> It is wise to note
that Spring Boot Admin is not an official Spring project.</p>
</li>
<li>
<p><span class="note-improvement">[p514-515]</span> No mention of
either GCP or App Engine.</p>
</li>
</ul>
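<p>On the [p183] note above, here is a minimal sketch of what <em>exponential
backoff</em> means: the delay before each retry doubles, starting from an
initial value and capped at a maximum. This is only an illustration, not the
book's code or Reactor's implementation; the class and method names are
hypothetical, and seconds are used so that the growth is visible at first
sight.</p>

```java
import java.time.Duration;

public class ExponentialBackoff {

    /**
     * Delay to wait before the given retry attempt (1-based):
     * initial * 2^(attempt - 1), capped at max.
     */
    public static Duration delayFor(int attempt, Duration initial, Duration max) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        // Limit the exponent so that the factor itself cannot overflow a long.
        int shift = Math.min(attempt - 1, 30);
        long factor = 1L << shift;
        long initialMillis = initial.toMillis();
        long maxMillis = max.toMillis();
        // Compare by division first, so the multiplication below cannot overflow.
        if (initialMillis > maxMillis / factor) {
            return max;
        }
        return Duration.ofMillis(initialMillis * factor);
    }

    public static void main(String[] args) {
        Duration initial = Duration.ofSeconds(1);
        Duration max = Duration.ofSeconds(10);
        for (int attempt = 1; attempt <= 5; attempt++) {
            Duration delay = delayFor(attempt, initial, max);
            System.out.println("retry " + attempt + " after " + delay.getSeconds() + "s");
        }
    }
}
```

<p>With a 1s initial delay and a 10s cap, the first five retries wait 1s, 2s,
4s, 8s, and 10s, respectively.</p>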
<style type="text/css">
span.note-correction {
color: red;
}
span.note-improvement {
color: orange;
}
span.note-question {
color: green;
}
span.note-other {
color: silver;
}
</style>
tag:volkan.yazi.ci,2018-11-14://blog/post/2018/11/14/elasticsearch-primary-data-store/Using Elasticsearch as the Primary Data Store2018-11-14T07:54:00Z2018-11-14T07:54:00Z
<p>The biggest e-commerce company in the Netherlands and Belgium,
<a href="https://bol.com">bol.com</a>, set out on a 4-year journey to rethink and rebuild
their entire <a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL (Extract, Transform, Load)</a>
pipeline, which has been cooking up the data used by its search engine since
the dawn of time. This more than a decade old white-bearded giant, breathing
in the dungeons of shady Oracle PL/SQL hacks, was in a state of decay, causing
ever-increasing hiccups in production. A rewrite was inevitable. After
drafting many blueprints, we went for a Java service backed by <strong>Elasticsearch
as the primary storage!</strong> This idea brought shivers to even the most senior
Elasticsearch consultants hired, so to ease your mind I’ll walk you through
why we took such a radical approach and how we managed to escape our legacy.</p>
<p>Before diving into the details, let me share a 2,000ft overview of an
e-commerce search setup that will help you to gain a better understanding of
the subjects discussed onwards. Note that this simplification totally omits a
nebula of incorporated caching layers, systems orchestrating multiple search
clusters, queues with custom flush and replay functionalities, in-place
resiliency mechanisms, services maintaining deprecated search entities to
avoid getting ranked down by bots due to 404s, circuit breakers, throttlers,
load balancers, etc. But it is still accurate enough to convey the general
idea.</p>
<p><img src="overview.jpg" alt="Architecture Overview"></p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li>
<a href="#search">The Search</a>
<ul>
<li><a href="#what-is-search">What is search anyway?</a></li>
<li><a href="#who-is-using-search">Who/What is using search?</a></li>
<li><a href="#what-about-performance">What about performance?</a></li>
<li><a href="#how-volatile">How volatile is the content?</a></li>
</ul>
</li>
<li>
<a href="#etl">The ETL</a>
<ul>
<li><a href="#content-stream">Real-time Content Stream</a></li>
<li><a href="#configuration-stream">Configuration Stream</a></li>
</ul>
</li>
<li>
<a href="#operational-overview">Operational Overview</a>
<ul>
<li><a href="#configuration-mutations">Configuration Mutations</a></li>
<li><a href="#configuration-predicates">Configuration Predicates</a></li>
</ul>
</li>
<li><a href="#old-etl">The Old ETL</a></li>
<li>
<a href="#battle-of-storage-engines">The Battle of Storage Engines</a>
<ul>
<li><a href="#benchmark-setup">Benchmark Setup</a></li>
<li><a href="#benchmark-results">Benchmark Results</a></li>
</ul>
</li>
<li>
<a href="#new-etl">The New ETL</a>
<ul>
<li><a href="#primary-storage-elasticsearch">The Primary Storage: Elasticsearch</a></li>
<li><a href="#configuration-dsl-json-groovy">The Configuration DSL: JSON and Groovy</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#acknowledgements">Acknowledgements</a></li>
<li><a href="#faq">F.A.Q</a></li>
</ul>
<p><a name="search"></a></p>
<h1 id="the-search">The Search</h1>
<p><i>[Before going any further, I want to take this opportunity to align you on
what exactly I mean by <em>search</em>. I hope this will help you to better wrap
your mind around the ultimate consumer of ETL. That being said, feel free to
skip this section and directly jump to the ETL deep dive in the next
section.]</i></p>
<p>Many people tend to make the mistake of having a narrow view of search in
e-commerce and confining its use case to mere term scavenging in a
mountainous stack of product attributes. While this statement holds to a
certain extent, it resembles a cherry located at the tip of an iceberg. (In my
<a href="/blog/post/2018/02/17/varnishing-search-performance/">Varnishing Search Performance</a>
presentation, I tried to summarize how difficult it can get just to add a
caching layer between your search logic and backend.) There are books written,
university lectures offered, and computer science branches dedicated to the
matter. But let me try to briefly elaborate on this from an engineering
standpoint.</p>
<p><a name="what-is-search"></a></p>
<h2 id="what-is-search-anyway">What is search anyway?</h2>
<p>If I were to give a general, but far from complete, overview, search enables
one to</p>
<ul>
<li>
<p>search for a term in hundreds of product attributes, where <em>matching</em>
and <em>ranking</em> are curated with directly or indirectly available consumer
(are you a PS4 owner searching for the newest “Call of Duty”?) and
relevance (you probably meant a band by typing “The Doors”, which is
irrelevant for “Doors & Windows” department) contexts,</p>
</li>
<li>
<p>browse (basically a search without a term) in thousands of categories
with ranking mechanics similar to those used in the aforementioned search,</p>
</li>
<li>
<p>beam up directly to a certain product or category given the input matches
with certain patterns (EAN, ISBN, ISSN, etc.) or merchandising rules (any
syntactic and/or semantic combination of “wine glasses” should end the
flow in a particular department, etc.),</p>
</li>
<li>
<p>implicitly trigger multiple searches under the hood (e.g. narrowing
down to a lower category or widening up to a higher category, etc.)
to enhance the results,</p>
</li>
<li>
<p>and decorate every listing with faceting (you probably want to see
“Capacity” facet rather than “Shoe Size” while searching/browsing in
“Harddisks”) support.</p>
</li>
</ul>
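<p>To make the “beam up” bullet above a bit more concrete, here is a minimal
sketch of pattern-based identifier detection. It is an illustration under
stated assumptions, not the production logic: the class, enum, and method
names are hypothetical, only 13-digit EAN/ISBN codes and the ISSN shape are
covered, and the ISSN check digit is not verified.</p>

```java
import java.util.Optional;
import java.util.regex.Pattern;

public class DirectLookup {

    /** Identifier kinds recognized by this sketch (hypothetical, far from complete). */
    public enum Kind { ISBN13, EAN13, ISSN }

    private static final Pattern THIRTEEN_DIGITS = Pattern.compile("\\d{13}");

    // ISSN shape only: four digits, a hyphen, three digits, then a digit or X.
    private static final Pattern ISSN_SHAPE = Pattern.compile("\\d{4}-\\d{3}[\\dXx]");

    /**
     * Returns the identifier kind if the query looks like one.
     * An empty result means the query should go through the regular search flow.
     */
    public static Optional<Kind> classify(String query) {
        String q = query.trim();
        if (THIRTEEN_DIGITS.matcher(q).matches() && hasValidEan13CheckDigit(q)) {
            // ISBN-13 codes are EAN-13 codes carrying the 978 or 979 prefix.
            // (Simplification: 979-0 is used for sheet music, which is ignored here.)
            boolean isbn = q.startsWith("978") || q.startsWith("979");
            return Optional.of(isbn ? Kind.ISBN13 : Kind.EAN13);
        }
        if (ISSN_SHAPE.matcher(q).matches()) {
            return Optional.of(Kind.ISSN);
        }
        return Optional.empty();
    }

    /** EAN-13 checksum: weights 1, 3, 1, 3, ... over the first 12 digits. */
    public static boolean hasValidEan13CheckDigit(String digits) {
        int sum = 0;
        for (int i = 0; i < 12; i++) {
            int digit = digits.charAt(i) - '0';
            sum += (i % 2 == 0) ? digit : 3 * digit;
        }
        int expected = (10 - sum % 10) % 10;
        return expected == digits.charAt(12) - '0';
    }

    public static void main(String[] args) {
        System.out.println(classify("9780306406157"));
        System.out.println(classify("4006381333931"));
        System.out.println(classify("0317-8471"));
        System.out.println(classify("wine glasses"));
    }
}
```

<p>Validating the check digit before short-circuiting the flow avoids hijacking
queries that merely happen to consist of 13 digits; anything unrecognized
simply falls through to the regular search.</p>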
<p><a name="who-is-using-search"></a></p>
<h2 id="whowhat-is-using-search">Who/What is using search?</h2>
<p>This is a big debate. But I know a handful of certain consumers:</p>
<ul>
<li>
<p><strong>Customers:</strong> People who search and buy goods. They look harmless, until
one gets exposed to them on a <a href="https://en.wikipedia.org/wiki/Black_Friday_%28shopping%29">Black Friday</a>
where they work hand in hand in masses to <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">DDoS</a>
the entire infrastructure.</p>
</li>
<li>
<p><strong>Bots:</strong> They periodically (a couple of times a day at most, as of the
date of this writing) try to digest your entire catalog into their system
for two main purposes:</p>
<ul>
<li>Integrate the catalog into their own search engine (that is, Google),</li>
<li>Tune their pricing strategy (that is, competitors)</li>
</ul>
<p>The worst part of handling bot traffic is you cannot always throttle them
(for instance, Google takes into account website latencies for rankings) and
you need to make sure they do not harm the customer traffic. Food for
thought: Imagine your customers swarming your shop on Christmas Eve
while Google decides to spider your entire catalog with thousands of requests
per second.</p>
</li>
<li>
<p><strong>Partners:</strong> Your business partners can also scan your catalog
periodically to integrate into their own systems. (Fun fact: Some even
require a daily Excel export.) One can classify them as bots only
interested in a subset of the data.</p>
</li>
<li>
<p><strong>Internal services:</strong> Last time I counted, there were 20+ internal
services using search to enhance their results in addition to the
users I listed above. Their usage can constitute up to 50% of the
traffic.</p>
</li>
</ul>
<p>In the case of partners and internal services, one might ask why they
need the search data rather than directly accessing the raw product attributes
and offers. The answer is simple: They also use additional attributes (e.g.,
facets, categories) incorporated in the ETL pipeline. Hence, rather than
exposing the internal ETL system to them, it is more convenient to manage them
at the search gateway which is known to have battle-tested scalability and
resiliency measures.</p>
<p><a name="what-about-performance"></a></p>
<h2 id="what-about-performance">What about performance?</h2>
<p>As decades of experience in this domain show, making search 10ms faster
can yield millions of euros extra revenue depending on the scale of your
business. Unfortunately, this equation works the other way around as well.
Hence, you are always expected to perform under a certain latency and above a
certain throughput threshold.</p>
<p><a name="how-volatile"></a></p>
<h2 id="how-volatile-is-the-content">How volatile is the content?</h2>
<p>Very, very, very volatile! I cannot emphasize this enough and I believe this
is a crucial difference that sets e-commerce search apart from Google-like
search engines – recall the conflict between Google and Twitter for indexing
tweets. Maybe examples can help to convey the idea better:</p>
<ul>
<li>
<p>A product might have multiple offers (bol.com offer, partner offer, etc.)
featuring varying properties (pricing, deliverability, discounts, etc.)
where both offers and their properties are highly volatile. The offer
might run out of stock, the price might change, etc. While customer-facing
web pages are enhanced with the most recent data at runtime, the search index
might lag behind and provide an eventually consistent view. The volatility
in this context might range from seconds to months. At prime time, e.g. on
Valentine’s Day, you don’t want your search engine to return gift listings
that ran out of stock a couple of seconds ago.</p>
</li>
<li>
<p>Your manual (triggered by shop specialists) and automated (artificial
intelligence, machine learning driven) processes can alter the category
tree, add new facets, tune the exposure of existing facets, modify the
search behavior (e.g., flows triggered by merchandising rules), add context
sensitive (e.g. category-dependent) thesaurus entries, synonyms, introduce
new rankings, etc. These changes might necessitate the update of millions of
documents retroactively.</p>
</li>
</ul>
<p>This <em>volatility</em> debate will play a prominent role in deciding on the
architecture of the next ETL pipeline, which I will elaborate on in a minute.</p>
<p><a name="etl"></a></p>
<h1 id="the-etl">The ETL</h1>
<p>In the domain of e-commerce search,
<a href="https://en.wikipedia.org/wiki/Extract,_transform,_load">ETL</a> denotes the
pipeline where the input is a multitude of information sources (product
attributes, offers, discounts, rankings, facets, synonyms, thesaurus entries,
etc.) and the output is the
<a href="https://en.wikipedia.org/wiki/Denormalization">denormalized</a> input
constituting search-ready documents optimized for search query performance.
Wait a second! If an ETL pipeline merely serves optimization purposes,
doesn’t this sound like one can have search without it? Sorta… That
is indeed possible to a certain extent. If we put the details aside for
a moment, we can roughly compare the two approaches as follows:</p>
<table>
<thead>
<tr>
<th>Strategy</th>
<th>Advantages</th>
<th>Disadvantages</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Without ETL</strong></td>
<td>
Every change in the input sources takes immediate effect. (Hence, almost
zero index time cost.)
</td>
<td>
Latency and throughput suffer dramatically due to the join and
enrich operations required on input sources at query time.
</td>
</tr>
<tr>
<td><strong>With ETL</strong></td>
<td>
Since all potential data to satisfy search requests has already
been baked into the index, search necessitates the least amount of
effort to satisfy a request at query time.
</td>
<td>
Every change in the input sources will necessitate pre-processing
affecting a multitude of products ranging from a couple to millions.
</td>
</tr>
</tbody>
</table>
<p>Put another way, ETL is all about the trade-off between index- versus
query-time performance. In light of all this, and given that</p>
<ol>
<li>our existing ETL was functionally comprehensive enough,</li>
<li>query time performance of Elasticsearch had already been suffering due to
faceting, internally triggered queries, etc. to an extent that external caching
became a necessity,</li>
<li>and search latency has a big impact on the revenue,</li>
</ol>
<p>we took the thick ETL pipeline path.</p>
<p>But what is this ETL pipeline really? What does it literally do? In order to
answer these questions, let me draw your attention to the input sources going
into the ETL pipeline:</p>
<p><img src="etl.jpg" alt="ETL Input Sources"></p>
<p><i>[GPC stands for <a href="https://www.gs1.org/standards/gpc">Global Product Classification</a>,
which is the de facto commercial categorization of goods varying from a car to
a litre of milk.]</i></p>
<p>These two input sources, content and configuration, feature two totally
different execution patterns framing the functional requirements of the
potential ETL solutions and, hence, play a critical role in justifying
the plan we picked. Let’s examine them further:</p>
<p><a name="content-stream"></a></p>
<h2 id="real-time-content-stream">Real-time Content Stream</h2>
<p>Here the ETL pipeline listens to more than a dozen queues for updates
ranging from product attributes to offers, offer-specific discounts to
rankings, etc. all formatted in <a href="https://json.org/">JSON</a>. Fortunately, each
real-time content stream message triggers a single product update. Let me
exemplify this with a case: when <code>disk_capacity_bytes</code> attribute of a product
changes, we</p>
<ol>
<li>first fetch the relevant document from the storage,</li>
<li>update its <code>disk_capacity_bytes</code> attribute,</li>
<li>apply configuration(s) matching with the last state of the updated document,</li>
<li>and persist the obtained result back.</li>
</ol>
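<p>The four steps above can be sketched roughly as follows. This is only an
illustration under stated assumptions: <code>storage</code> is an in-memory
stand-in for the ETL storage, and <code>applyConfigurations</code>,
<code>handleAttributeUpdate</code>, and the message shape
(<code>productId</code>, <code>attribute</code>, <code>value</code>) are
hypothetical names, not the ones used in the actual pipeline.</p>
<pre><code class="language-javascript">// Hypothetical in-memory stand-in for the ETL storage: product id -> document.
const storage = new Map();

// Hypothetical configuration step. A real pipeline would look up the matching
// (predicate, mutation) pairs instead of hard-coding a single rule.
function applyConfigurations(doc) {
  if (doc.attrs.disk_capacity_bytes != null) {
    doc.disk_capacity_gigabytes = doc.attrs.disk_capacity_bytes / 1e9;
  }
  return doc;
}

// Handles one real-time content stream message touching a single product.
function handleAttributeUpdate(message) {
  // 1. Fetch the relevant document from the storage (or start a new one).
  const doc = storage.get(message.productId) || { id: message.productId, attrs: {} };
  // 2. Update the changed attribute.
  doc.attrs[message.attribute] = message.value;
  // 3. Apply the configuration(s) matching the last state of the document.
  const updated = applyConfigurations(doc);
  // 4. Persist the obtained result back.
  storage.set(message.productId, updated);
  return updated;
}</code></pre>
<p>For instance, calling <code>handleAttributeUpdate</code> with
<code>{ productId: "p1", attribute: "disk_capacity_bytes", value: 2e9 }</code>
stores a document that also carries the derived
<code>disk_capacity_gigabytes</code> field. Note that this sketch makes no
attempt at atomicity.</p>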
<p>There are some concerns that need to be addressed here:</p>
<ul>
<li>
<p>This is a pretty <em>CPU intensive</em> operation. Configurations, in essence, are
rules in the form of <code>(predicate, mutation)</code> pairs defined via
business-friendly screens by shop specialists. When an attribute of a
document gets updated, this change might be of interest to many
configurations which are determined by performing an inverse lookup on tens
of thousands of configuration predicates (e.g., <code>attrs.disk_capacity_bytes !=
null</code>) matching the last state of the document. Later on, the mutations
(e.g., <code>doc.disk_capacity_gigabytes = attrs.disk_capacity_bytes / 1e9</code>) of
the found configurations are executed to let them shape the document
according to their needs.</p>
<p>This innocent looking procedure sneakily introduces two critical issues
under the hood:</p>
<ol>
<li><em>How would you represent the configuration predicate such that you can
match them against the content?</em></li>
<li><em>How would you represent the configuration mutation such that you can
execute them against the content?</em></li>
</ol>
<p>And it goes without saying, both aforementioned concerns need to be
engineered efficiently. You are expected to repeat this procedure on each
message JSON of the real-time content stream where the traffic is in the
order of millions per day.</p>
<p>As a concrete configuration example consider the following: You have two
“Disk Capacity” facets defined by business: one for computers, one for
smart phones departments. The first one translates the
<code>disk_capacity_bytes</code> into a <code>disk_capacity_terabytes</code> attribute which is
defined to be exposed when <code>category == "computers"</code> and the second
translates into a <code>disk_capacity_gigabytes</code> attribute which is defined to
be exposed when <code>category == "smart phones"</code>. Here both configurations are
executed when the <code>attrs.disk_capacity_bytes != null</code> predicate holds.</p>
</li>
<li>
<p>This operation needs to be performed <em>atomically</em>. Two concurrent operations
touching the same product should not result in corrupt content.</p>
</li>
</ul>
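<p>To make the <code>(predicate, mutation)</code> pairs and the inverse lookup
more tangible, here is a minimal sketch of the two “Disk Capacity”
configurations. Everything in it is an assumption made for illustration: the
<code>reads</code> list (the attributes a predicate depends on), the index
keyed by attribute name, and the function names are hypothetical, and a
production-grade inverse lookup over tens of thousands of predicates would
need a far more elaborate structure.</p>
<pre><code class="language-javascript">// Hypothetical configurations as (predicate, mutation) pairs. The "reads"
// list is an assumption introduced here to make the inverse lookup possible.
const configurations = [
  {
    id: "disk-capacity-computers",
    reads: ["disk_capacity_bytes"],
    predicate: function (attrs) { return attrs.disk_capacity_bytes != null; },
    mutation: function (attrs, doc) {
      doc.disk_capacity_terabytes = attrs.disk_capacity_bytes / 1e12;
    }
  },
  {
    id: "disk-capacity-smart-phones",
    reads: ["disk_capacity_bytes"],
    predicate: function (attrs) { return attrs.disk_capacity_bytes != null; },
    mutation: function (attrs, doc) {
      doc.disk_capacity_gigabytes = attrs.disk_capacity_bytes / 1e9;
    }
  }
];

// Inverse index: attribute name -> configurations whose predicate reads it.
const configurationsByAttribute = new Map();
for (const configuration of configurations) {
  for (const attribute of configuration.reads) {
    if (!configurationsByAttribute.has(attribute)) {
      configurationsByAttribute.set(attribute, []);
    }
    configurationsByAttribute.get(attribute).push(configuration);
  }
}

// Runs only the configurations interested in the changed attribute and
// returns the ids of those whose predicate held.
function applyMatchingConfigurations(changedAttribute, attrs, doc) {
  const applied = [];
  const candidates = configurationsByAttribute.get(changedAttribute) || [];
  for (const configuration of candidates) {
    if (configuration.predicate(attrs)) {
      configuration.mutation(attrs, doc);
      applied.push(configuration.id);
    }
  }
  return applied;
}</code></pre>
<p>A change to an attribute that no predicate reads yields an empty candidate
list, so no predicate is evaluated at all; that is the whole point of the
inverse lookup.</p>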
<p><a name="configuration-stream"></a></p>
<h2 id="configuration-stream">Configuration Stream</h2>
<p>Configurations are the rules defined via business-friendly screens. There,
modifications made by shop specialists are published in snapshots when they
think the changes have grown into a stable state and are ready to be exposed
to the customer. Each published configuration snapshot ends up serving three
purposes:</p>
<ol>
<li>search gateway uses it to determine how to query the search index,</li>
<li>ETL pipeline uses it to process the real-time content stream,</li>
<li>and ETL pipeline <em>retroactively updates</em> the documents that are potentially
affected.</li>
</ol>
<p>While the first two are relatively cheap operations, the last one is the
elephant in the room! This is the first time in our beautiful tale described
so far that we need to propagate a change to millions of documents. Let me
further explain this in an example:</p>
<p>Let’s say the following category definition</p>
<pre><code class="language-javascript"><span class="k">if</span> <span class="p">(</span><span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">family_id</span> <span class="o">==</span> <span class="mi">1234</span> <span class="o">&&</span> <span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">chunk_id</span> <span class="o">==</span> <span class="mi">5678</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">doc</span><span class="p">.</span><span class="nx">category</span> <span class="o">=</span> <span class="s2">"books"</span>
<span class="p">}</span></code></pre>
<p>is modified as follows:</p>
<pre><code class="language-javascript"><span class="k">if</span> <span class="p">(</span><span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">family_id</span> <span class="o">==</span> <span class="mi">1234</span> <span class="o">&&</span> <span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">chunk_id</span> <span class="o">==</span> <span class="mh">0xDEADBEEF</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">doc</span><span class="p">.</span><span class="nx">category</span> <span class="o">=</span> <span class="s2">"AWESOME BOOKS"</span>
<span class="p">}</span></code></pre>
<p>Sir, you are in trouble! As the very ETL pipeline, what you are expected to
deliver is to</p>
<ol>
<li>find products that are matching with the old predicate,</li>
<li>revert the changes of the old configuration mutation by removing <code>books</code> from the <code>category</code> field,</li>
<li>find products that are matching with the new predicate,</li>
<li>and apply the changes of the new configuration mutation by adding <code>AWESOME BOOKS</code> to the <code>category</code> field.</li>
</ol>
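<p>A deliberately naive sketch of these four steps is given below. It assumes
<code>category</code> is a list of values, represents each rule as a
hypothetical <code>{ predicate, value }</code> object, and scans an in-memory
array where a real pipeline would query the ETL storage.</p>
<pre><code class="language-javascript">// Hypothetical rule representation: a predicate plus the category value it adds.
const oldRule = {
  predicate: function (attrs) {
    if (attrs.gpc.family_id !== 1234) return false;
    return attrs.gpc.chunk_id === 5678;
  },
  value: "books"
};
const newRule = {
  predicate: function (attrs) {
    if (attrs.gpc.family_id !== 1234) return false;
    return attrs.gpc.chunk_id === 0xDEADBEEF;
  },
  value: "AWESOME BOOKS"
};

// Propagates a rule change to every document; mutates the documents in place.
function applyRuleChange(documents, oldRule, newRule) {
  for (const doc of documents) {
    // Steps 1-2: revert the old mutation on documents matching the old predicate.
    // Naive: the value is removed no matter which rule contributed it.
    if (oldRule.predicate(doc.attrs)) {
      doc.category = doc.category.filter(function (value) {
        return value !== oldRule.value;
      });
    }
    // Steps 3-4: apply the new mutation on documents matching the new predicate.
    if (newRule.predicate(doc.attrs)) {
      if (!doc.category.includes(newRule.value)) {
        doc.category.push(newRule.value);
      }
    }
  }
  return documents;
}</code></pre>
<p>Given one document with <code>chunk_id</code> 5678 carrying
<code>books</code> and another with <code>chunk_id</code>
<code>0xDEADBEEF</code> and an empty <code>category</code>, the first ends up
without <code>books</code> and the second gains <code>AWESOME BOOKS</code>.</p>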
<p>This easier-said-than-done operation contains many implicit concerns:</p>
<ul>
<li>
<p>ETL needs to avoid removing <code>books</code> from the <code>category</code> field if there are
rules, other than the changed one, adding <code>books</code> to the very same <code>category</code>
field. There are two ways you can approach this:</p>
<ol>
<li>
<p>With every value added to a field, store meta information pointing
to the rules associated with that value. These back-tracking pointers
optimize the check whether a value can be removed or not, at the cost
of maintaining them in an ocean of values.</p>
</li>
<li>
<p>After removing every value, put the product back into the ETL pipeline
just like handling products in the real-time content stream. If there are
any rules, other than the changed one, adding <code>books</code> to the very same
<code>category</code> field, they will kick in. This simple approach comes at the
cost of CPU-intensive and unfortunately mostly redundant processing.</p>
</li>
</ol>
</li>
<li>
<p>Given that configuration predicates are allowed to access any field, how
would one represent a predicate and translate this into an ETL storage query
filter that performs well? (You would not want to scan the whole data set
for each predicate that is changed, right? Well… depends.)</p>
<p>Let’s first discuss the representation of predicates issue, which was also a
concern in the real-time content stream processing. Here you might first
fall into the trap of whitelisting the operators (<code>==</code>, <code>!=</code>, <code>></code>, <code>>=</code>,
<code><</code>, <code><=</code>, <code>~=</code>) and the content attributes (<code>attrs.gpc.family_id</code>,
<code>attrs.gpc.chunk_id</code>, <code>attrs.disk_capacity_bytes</code>, etc.) that are allowed in
configuration predicates. While whitelisting operators is fine, whitelisting
the content attributes implies that the ETL pipeline, the configuration
administration GUIs, etc. all need to know this whitelist, which strictly
depends on the structure of the real-time content stream messages. Whenever
the message structures change or you want to add a new attribute to this
whitelist (both happen a couple of times every year), you need to propagate
this to many components in your service milky way and perform a deploy
without downtime.</p>
<p>What about translating these predicate representations into efficient ETL
storage query filters? Let’s take the simplest approach: Represent each
attribute with a separate field. Then let me ask you the following
questions:</p>
<ol>
<li>
<p>If you would opt for using an RDBMS, you can represent attributes by
columns and create an index for each individual column. (Ouch!) Thanks to
the half-century battle-tested RDBMS literature, the database can easily
optimize and perform an index scan for the constructed queries:</p>
<pre><code class="language-sql"><span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">content</span>
<span class="k">WHERE</span> <span class="n">attrs_gpc_family_id</span> <span class="o">=</span> <span class="s1">'1234'</span>
<span class="k">AND</span> <span class="n">attrs_gpc_chunk_id</span> <span class="o">=</span> <span class="s1">'5678'</span></code></pre>
<p>That being said… What if you hit the maximum column count limit?
(Yes, we did!) Further, what about attributes that are lists of objects:</p>
<pre><code class="language-json"><span class="p">{</span>
<span class="nt">"authors"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"fname"</span><span class="p">:</span> <span class="s2">"Volkan"</span><span class="p">,</span>
<span class="nt">"lname"</span><span class="p">:</span> <span class="s2">"Yazici"</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"fname"</span><span class="p">:</span> <span class="s2">"Lourens"</span><span class="p">,</span>
<span class="nt">"lname"</span><span class="p">:</span> <span class="s2">"Heijs"</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre>
<p>You definitely cannot store these in a single column and still query each
individual component. Ok, then you can normalize the data as follows:</p>
<pre><code class="language-sql"><span class="k">SELECT</span> <span class="p">...</span>
<span class="k">FROM</span> <span class="n">content</span><span class="p">,</span>
<span class="n">attribute</span> <span class="k">AS</span> <span class="n">a1</span><span class="p">,</span>
<span class="n">attribute</span> <span class="k">AS</span> <span class="n">a2</span>
<span class="k">WHERE</span> <span class="n">a1</span><span class="p">.</span><span class="n">content_id</span> <span class="o">=</span> <span class="n">content</span><span class="p">.</span><span class="n">id</span> <span class="k">AND</span> <span class="n">a1</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s1">'gpc_family_id'</span> <span class="k">AND</span> <span class="n">a1</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="s1">'1234'</span>
<span class="k">AND</span> <span class="n">a2</span><span class="p">.</span><span class="n">content_id</span> <span class="o">=</span> <span class="n">content</span><span class="p">.</span><span class="n">id</span> <span class="k">AND</span> <span class="n">a2</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="s1">'gpc_chunk_id'</span> <span class="k">AND</span> <span class="n">a2</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="s1">'5678'</span></code></pre>
<p>So far so good. But… In a matter of months, you will need to start
partitioning tables and maybe even move certain partitions into separate
database instances to maintain the latency under a certain threshold.
(Yes, we did this as well!) But this never-ending database structure
optimization more and more feels like you are inventing your own
distributed database using a plain RDBMS. Does this really still need to
be this way in 2018?</p>
</li>
<li>
<p>If you would opt for using <a href="https://www.mongodb.com/">MongoDB</a>, like using
an RDBMS, you still need to create an explicit index on each (whitelisted)
field. For filters involving multiple fields (e.g., <code>attrs.gpc.family_id
== 1234 && attrs.gpc.chunk_id == 5678</code>), the MongoDB query optimizer can
repurpose individual field indices via <a href="https://docs.mongodb.com/manual/core/index-intersection/">index intersection</a>.
That being said, our experience with this feature has not been very
pleasant.</p>
<p>The issue where attributes might contain lists of objects is
<a href="https://docs.mongodb.com/manual/tutorial/query-array-of-documents/">not a problem for MongoDB</a>.</p>
</li>
<li>
<p>If you would opt for <a href="https://cloud.google.com/datastore">Google Cloud Datastore</a>,
you will need to create explicit indices for each potential filter
combination and order matters! Yes, you read that right! Let me exemplify
this bizarre situation. If you have configurations with the following
predicates:</p>
<ul>
<li><code>attrs.gpc.family_id == 1234</code></li>
<li><code>attrs.gpc.chunk_id == 5678</code></li>
<li><code>attrs.gpc.family_id == 1234 && attrs.gpc.chunk_id == 5678</code></li>
<li><code>attrs.gpc.chunk_id == 5678 && attrs.gpc.family_id == 1234</code></li>
</ul>
<p>you need to define 4 different indices! Ouch! This on its own was a
Datastore show stopper for us.</p>
</li>
<li>
<p>If you would opt for <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a>,
all fields are indexed by default and you can use them in any combination!
Yay! No need for whitelisting! And similar to MongoDB, Elasticsearch also
allows <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html">querying list of objects</a>,
you just need to declare them explicitly as <code>nested</code>. If you don’t even
want to worry about that, you can add a dynamic mapping template to make
each object nested by default. Following is the index mapping you can use
for that purpose:</p>
<pre><code class="language-json"><span class="p">{</span>
<span class="nt">"date_detection"</span><span class="p">:</span> <span class="kc">false</span><span class="p">,</span>
<span class="nt">"dynamic_templates"</span><span class="p">:</span> <span class="p">[</span>
<span class="p">{</span>
<span class="nt">"strings"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"match_mapping_type"</span><span class="p">:</span> <span class="s2">"string"</span><span class="p">,</span>
<span class="nt">"mapping"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"keyword"</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">"objects"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"match_mapping_type"</span><span class="p">:</span> <span class="s2">"object"</span><span class="p">,</span>
<span class="nt">"mapping"</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">"type"</span><span class="p">:</span> <span class="s2">"nested"</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span></code></pre>
<p>The above mapping also disables analyzing the fields of type <code>string</code>, since
we are not interested in performing fuzzy queries. Clearly, date detection
is disabled for similar reasons.</p>
<p>That being said, Elasticsearch is known to suffer from deteriorating
query performance over time when exposed to high update rates.</p>
</li>
</ol>
</li>
</ul>
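<p>Coming back to the first concern in the list above, the back-tracking
pointers approach can be sketched as a field that maps each value to the set
of rules that contributed it. The representation and the function names
(<code>addValue</code>, <code>removeValue</code>, <code>values</code>) are
assumptions made for illustration only.</p>
<pre><code class="language-javascript">// Hypothetical field representation: value -> set of rule ids that added it.
function addValue(field, value, ruleId) {
  if (!field.has(value)) {
    field.set(value, new Set());
  }
  field.get(value).add(ruleId);
}

// Drops the pointer of one rule. The value itself disappears only when no
// other rule still contributes it.
function removeValue(field, value, ruleId) {
  const rules = field.get(value);
  if (rules === undefined) return;
  rules.delete(ruleId);
  if (rules.size === 0) {
    field.delete(value);
  }
}

// Returns the values currently present in the field.
function values(field) {
  return Array.from(field.keys());
}</code></pre>
<p>If two rules add <code>books</code> to <code>category</code> and one of them
is changed, removing its pointer leaves <code>books</code> in place, at the
price of storing one set per value.</p>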
<p><a name="operational-overview"></a></p>
<h1 id="operational-overview">Operational Overview</h1>
<p>So far we examined the current ETL setup with concrete examples for several
cases. We broke down the system into its individual input sources and detailed
their implications on certain architectural decisions. Let’s wrap up these
mind-boggling details into operational abstractions:</p>
<p><img src="etl-abstraction.jpg" alt="The ETL: Operational Overview"></p>
<p>Given these operational abstractions, let me summarize the constraints the
configuration components (predicate and mutation) imply.</p>
<p><a name="configuration-mutations"></a></p>
<h2 id="configuration-mutations">Configuration Mutations</h2>
<p>If you recall, configuration mutations were simple document enhancement
instructions that I exemplified as follows:</p>
<pre><code class="language-javascript"><span class="nx">doc</span><span class="p">.</span><span class="nx">category</span> <span class="o">=</span> <span class="s2">"books"</span></code></pre>
<p>Here <code>doc</code> is a dictionary denoting the ETL’ed document source and mutation
“adds” <code>books</code> value to its <code>category</code> field. This (for simplification
purposes, JavaScript-employed) innocent looking expression can (and does!) go
to unintended extents:</p>
<pre><code class="language-javascript"><span class="k">if</span> <span class="p">(</span><span class="nx">attrs</span><span class="p">.</span><span class="nx">suitable_for_month</span> <span class="o"><=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">doc</span><span class="p">.</span><span class="nx">childhood_stage</span> <span class="o">=</span> <span class="s2">"newborn"</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="nx">attrs</span><span class="p">.</span><span class="nx">suitable_for_month</span> <span class="o"><=</span> <span class="mi">12</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">doc</span><span class="p">.</span><span class="nx">childhood_stage</span> <span class="o">=</span> <span class="s2">"infant"</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="nx">attrs</span><span class="p">.</span><span class="nx">suitable_for_month</span> <span class="o"><=</span> <span class="mi">48</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">doc</span><span class="p">.</span><span class="nx">childhood_stage</span> <span class="o">=</span> <span class="s2">"toddler"</span><span class="p">;</span>
<span class="p">}</span></code></pre>
<p>The choice of the mutation <a href="https://en.wikipedia.org/wiki/Domain-specific_language">DSL</a>
employed is expected to meet the following requirements:</p>
<ul>
<li>It “must” support JSON input and output for the real-time content stream.
(See step B4 in the figure.)</li>
<li>It “should” support ETL storage input and output for the configuration
snapshot stream. (See step A4 in the figure.)</li>
</ul>
<p>The reason the latter functionality is marked as optional is that the ETL
pipeline can also retrieve these documents in raw from the storage, convert
them to JSON, execute mutations, and persist them back again – assuming data
integrity is provided by other means, e.g., transactions, retries powered by
compare-and-swap operations, etc.</p>
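<p>The fallback path described above (retrieve the raw document, convert it to
JSON, execute the mutation, persist it back) could look like the sketch below
when integrity is provided by retries around a compare-and-swap. The versioned
in-memory <code>store</code> and the function names are hypothetical, and
<code>maxAttempts</code> is assumed to be a positive integer.</p>
<pre><code class="language-javascript">// Hypothetical versioned store: id -> { version, doc }.
const store = new Map();

// Writes only if the stored version still equals expectedVersion.
function compareAndSwap(id, expectedVersion, doc) {
  const current = store.get(id);
  const currentVersion = current === undefined ? 0 : current.version;
  if (currentVersion !== expectedVersion) return false;
  store.set(id, { version: expectedVersion + 1, doc: doc });
  return true;
}

// Reads the document, mutates a JSON copy, and persists it, retrying when
// another writer got there first. maxAttempts must be a positive integer.
function mutateWithRetries(id, mutation, maxAttempts) {
  for (let attempt = 1; attempt !== maxAttempts + 1; attempt += 1) {
    const current = store.get(id) || { version: 0, doc: {} };
    // Work on a copy so a failed attempt leaves the stored document untouched.
    const copy = JSON.parse(JSON.stringify(current.doc));
    mutation(copy);
    if (compareAndSwap(id, current.version, copy)) return copy;
  }
  throw new Error("gave up after " + maxAttempts + " attempts");
}</code></pre>
<p>In this single-threaded sketch the first attempt always succeeds; the retry
loop only pays off once several workers write to the same storage
concurrently.</p>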
<p><a name="configuration-predicates"></a></p>
<h2 id="configuration-predicates">Configuration Predicates</h2>
<p>Configuration predicates were simple conditions restricted to use a
whitelisted set of operators (<code>==</code>, <code>!=</code>, <code>></code>, <code>>=</code>, <code><</code>, <code><=</code>, <code>~=</code>)
supporting grouping:</p>
<pre><code class="language-javascript"><span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">family_id</span> <span class="o">==</span> <span class="mi">1234</span> <span class="o">&&</span> <span class="nx">attrs</span><span class="p">.</span><span class="nx">gpc</span><span class="p">.</span><span class="nx">chunk_id</span> <span class="o">==</span> <span class="mi">5678</span></code></pre>
<p>Similar to mutations, the choice of the predicate DSL used is expected to
meet the following requirements:</p>
<ul>
<li>It “must” support JSON input for the real-time content stream. (See step B2
in the figure.)</li>
<li>It “should” support ETL storage input for determining the affected documents
by the configuration snapshot delta. (See step A4 in the figure.)</li>
</ul>
<p>We relaxed the latter constraint since one can very well prefer to put the
entire stored document collection (Ouch!) back into the ETL pipeline, process
them, detect the changed ones, and persist the updates. This approach has
certain assumptions though:</p>
<ul>
<li>We don’t need to perform this too often. That is, the frequency of
configuration snapshots is relatively low, e.g., max. a couple of times a
day.</li>
<li>The snapshot deltas affect a significant percentage of the entire collection
to an extent that the advantage of finding and processing only the affected
documents diminishes.</li>
</ul>
<p>While you still need to make a back-of-the-envelope calculation on your cloud
bill for each approach, our years of statistics on ETL configuration
snapshots show that most of the time snapshot deltas affect at most 5% of
the entire collection and the average is less than 1% – thanks to the
incremental updates carried out by shop specialists. Hence, performing a
complete ETL a couple of times a day feels like overkill and hurts the
engineer within you.</p>
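<p>To illustrate the two predicate requirements listed earlier in this section,
the sketch below represents a predicate as plain data, evaluates it against a
JSON message, and translates the very same predicate into an Elasticsearch
query filter. The representation (<code>op</code>, <code>field</code>,
<code>value</code>, <code>operands</code>) and the function names are
assumptions, only equality, inequality, and conjunction are covered, and the
generated <code>term</code> filters assume plain object fields; fields mapped
as <code>nested</code> would additionally need to be wrapped in
<code>nested</code> queries.</p>
<pre><code class="language-javascript">// Hypothetical predicate representation as plain data.
const predicate = {
  op: "and",
  operands: [
    { op: "eq", field: "attrs.gpc.family_id", value: 1234 },
    { op: "eq", field: "attrs.gpc.chunk_id", value: 5678 }
  ]
};

// Resolves a dotted path such as "attrs.gpc.family_id" against a JSON object.
function resolve(json, path) {
  return path.split(".").reduce(function (node, key) {
    return node == null ? undefined : node[key];
  }, json);
}

// Evaluates the predicate against a real-time content stream message.
function evaluate(predicate, json) {
  switch (predicate.op) {
    case "and":
      return predicate.operands.every(function (operand) {
        return evaluate(operand, json);
      });
    case "eq":
      return resolve(json, predicate.field) === predicate.value;
    case "ne":
      return resolve(json, predicate.field) !== predicate.value;
    default:
      throw new Error("unsupported operator: " + predicate.op);
  }
}

// Translates the same predicate into an Elasticsearch query DSL filter.
function toElasticsearchQuery(predicate) {
  switch (predicate.op) {
    case "and":
      return { bool: { filter: predicate.operands.map(toElasticsearchQuery) } };
    case "eq":
      return { term: { [predicate.field]: predicate.value } };
    case "ne":
      return { bool: { must_not: [{ term: { [predicate.field]: predicate.value } }] } };
    default:
      throw new Error("unsupported operator: " + predicate.op);
  }
}</code></pre>
<p>Keeping predicates as data rather than as code is what makes both uses
possible: the same structure can be interpreted in memory or compiled into a
storage query.</p>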
<p><a name="old-etl"></a></p>
<h1 id="the-old-etl">The Old ETL</h1>
<p>The old ETL was a single Oracle database where the configurations were modeled
in PL/SQL. Since the configuration abstraction language was the very same
language the database uses itself, executing mutations and predicates was
effortless. Hail <a href="https://en.wikipedia.org/wiki/SQL_injection">SQL injection</a>
as a feature! Though this came with some notable costs:</p>
<ul>
<li>Using PL/SQL within the abstraction model created both functional and
financial vendor lock-in. The functional deficiency (insufficient
expressiveness, leakage of PL/SQL to irrelevant components) obstructed many
innovations over the years, and working around it became more and more
difficult as time passed. Additionally, it constituted a significant obstacle for migrating
the service to the cloud. Its financial aspect was negligible at the scale
of <a href="https://bol.com">bol.com</a>.</li>
<li>Rolling back changes of an updated configuration mutation is quite a
PL/SQL engineering endeavor to implement in practice. This difficulty,
spiced up with the insufficient logging, testing, debugging, profiling, etc.
utilities, held programmers back from taking this path. <em>Hence, there was a
12+ hours long complete ETL run every night for configuration snapshot
deltas.</em> This beast, tamed by a couple of experienced engineers, has a
reputation for frequent hiccups and for bugs that are really difficult to
find, reproduce, and debug, let alone fix!</li>
</ul>
<p>In its previous incarnation, the content attributes were stored in <code><id,
content_id, key, value></code> normalized form. This approach started to suffer from
efficiency aches in the hinges pulling the ETL’ed data to the search index.
Oracle consultants hired back then examined the usage and recommended going
with a denormalized structure where each attribute is stored as a column. In
addition to temporarily bandaging up the efficiency related wounds, this
allowed DBAs to let their imaginations go wild to map the attributes to
columns. Recall the attributes composed of objects I mentioned above? Special
characters were used to create such multi-value attributes, which was pretty
much (to put it mildly) unpleasant. But the killer bullet came in the form of
a six-inch punch referred to as <a href="https://stackoverflow.com/a/14722914/1278899">the maximum allowed column count
limit</a>. But isn’t engineering
all about <a href="https://youtu.be/D_Vg4uyYwEk">how hard you can get hit and keep moving
forward</a>? Yes, comrade! We thought so and used a
single binary XML column to store attributes, queried them using Oracle XPath
toolbox, escaped attribute values, finally concatenated them into SQL strings
that are eventually executed, and for sure crossed our fingers.</p>
<p>There are a couple of important details that I could not manage to cover in
the above war diary without spoiling the coherency. Let me drop them here in
no particular order:</p>
<ul>
<li>Task parallelization is pretty difficult in PL/SQL. We tried patching this
hole via internal Oracle AQs, but I am not really sure whether it improved
or worsened the state.</li>
<li>In a database procedure that is expected to run for 12+ hours, Murphy’s law
works flawlessly. Anything that can go wrong, did, does, and will go wrong.
We wisely(!) engineered the system to persist its state at certain check
points constituting retriable handles to invoke when you come in the morning
and see that the ETL crashed.</li>
<li>The number of moving components necessitated the use of <a href="https://www.cronacle.com/">a proprietary
scheduling tool supporting Oracle</a>. The schedule
was glued with <a href="https://www.gnu.org/software/bash/">bash</a> scripts, designed
in a proprietary development environment only available for Windows, and
rolled out on Oracle machines running GNU/Linux. Neither GNU/Linux- nor
Windows-using developers were fond of this situation.</li>
<li>Due to the high cost of a failing ETL, business also did not feel empowered
to change and/or commercially optimize it easily. This was a pretty
demotivating issue affecting both technical and business people who need to work
with it.</li>
</ul>
<p>Enough blaming the former engineer. We need to get our facts right. The
aforementioned PL/SQL giant was not rolled out in a day with a big bang. This
more than a decade old ETL pipeline was developed with all the best practices
and tooling available back then. The more you dive into its source code and
navigate through commits of features spanning years, the easier it becomes
to see what went wrong and where. You start to recognize the patterns
that necessitated exceptional handling of certain features, many of which, due
to backward compatibility with legacy systems that have already been
deprecated or replaced by newcomers, exploded the complexity to unintended
depths. Software development is a never-ending process, and the axioms you base your
initial architecture on become invalidated in the course of time due to
changing business needs. Aiming for infinite flexibility comes with an
engineering cost as well, which might very well fall short of justifying such
an expense. One should also include the massive burst of data volume and its
update frequency into this list. I personally think the old ETL pipeline and
its engineers did a fantastic job. The tool served its purpose for more than a
decade and harvested an immense amount of lessons for its successor. I would
be more than happy if we as a team can also manage to deliver such a long
living product.</p>
<p><a name="battle-of-storage-engines"></a></p>
<h1 id="the-battle-of-storage-engines">The Battle of Storage Engines</h1>
<p>Given our functional requirements, we evaluated a couple of different ETL
pipeline storage solutions which I <a href="#configuration-stream">hinted at earlier</a>.
Following is the feature matrix of each candidate:</p>
<table>
<thead>
<tr>
<th>Storage Solution</th>
<th>Distributed</th>
<th>Sharded</th>
<th>Required Indices</th>
<th>Integrity Measure</th>
</tr>
</thead>
<tbody>
<tr>
<td>PostgreSQL</td>
<td>No</td>
<td>No</td>
<td>One<sup>1</sup>
</td>
<td>Transactions</td>
</tr>
<tr>
<td>PostgreSQL (partitioned)</td>
<td>No</td>
<td>Yes<sup>2</sup>
</td>
<td>One<sup>1</sup>
</td>
<td>Transactions</td>
</tr>
<tr>
<td>MongoDB</td>
<td>Yes</td>
<td>Yes<sup>3</sup>
</td>
<td>Some<sup>4</sup>
</td>
<td>Compare-and-swap<sup>5</sup>
</td>
</tr>
<tr>
<td>Elasticsearch</td>
<td>Yes</td>
<td>Yes</td>
<td>None</td>
<td>Compare-and-swap<sup>6</sup>
</td>
</tr>
</tbody>
</table>
<p><sup>1</sup> PostgreSQL <code>jsonb</code> index covers all fields.<br>
<sup>2</sup> PostgreSQL partitioning is not sharding in the distributed sense, but it still serves a similar purpose.<br>
<sup>3</sup> MongoDB sharding requires <a href="https://docs.mongodb.com/manual/sharding/#shard-keys">manual configuration</a>.<br>
<sup>4</sup> MongoDB requires an explicit index for each whitelisted field allowed in ETL configuration predicates.<br>
<sup>5</sup> MongoDB <a href="https://docs.mongodb.com/manual/core/write-operations-atomicity/"><code>updateMany()</code> or <code>findAndModify()</code></a> can be leveraged for the desired integrity.<br>
<sup>6</sup> Elasticsearch <code>_version</code> field can be leveraged to implement a compare-and-swap loop.</p>
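<p>To make the compare-and-swap idea in the last footnote a bit more concrete, here is a minimal sketch of such a loop. It runs against a hypothetical in-memory <code>VersionedStore</code> rather than Elasticsearch; the class, its methods, and the convention that version 0 means "absent" are all assumptions made for illustration. A real implementation would pass the expected version along with the write request and retry on a version conflict.</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical in-memory stand-in for a store that versions every document.
class VersionedStore {

    /** Immutable document content paired with its version (cf. `_version`). */
    static final class Versioned {
        final long version;
        final String content;

        Versioned(long version, String content) {
            this.version = version;
            this.content = content;
        }
    }

    private final Map<String, Versioned> docs = new ConcurrentHashMap<>();

    Versioned get(String id) {
        return docs.get(id);
    }

    /** Writes only if the stored version still equals the expected one (0 = absent). */
    boolean putIfVersion(String id, long expectedVersion, String content) {
        Versioned next = new Versioned(expectedVersion + 1, content);
        if (expectedVersion == 0) {
            return docs.putIfAbsent(id, next) == null;
        }
        Versioned current = docs.get(id);
        return current != null
                && current.version == expectedVersion
                && docs.replace(id, current, next);
    }

    /** Compare-and-swap loop: read, mutate, and retry whenever a concurrent write wins. */
    long update(String id, UnaryOperator<String> mutation) {
        while (true) {
            Versioned current = docs.get(id);
            long version = current == null ? 0 : current.version;
            String content = current == null ? null : current.content;
            if (putIfVersion(id, version, mutation.apply(content))) {
                return version + 1;
            }
        }
    }
}
```

<p>The point of the loop is that a writer never overwrites a document it has not seen: a stale writer simply fails the version check, re-reads, and reapplies its mutation.</p>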
<p><a name="benchmark-setup"></a></p>
<h2 id="benchmark-setup">Benchmark Setup</h2>
<p>For the benchmark, we populated each store with 33 million JSON documents,
each weighing 2.5KB on average. One of the contrived fields in the
document is <code>search_rank</code>. Later on, a file consisting of 6 million distinct
<code>&lt;id, search_rank&gt;</code> pairs is streamed in batches of size 1000. For each batch,
we first fetch the old <code>search_rank</code>s associated with the <code>id</code>s and then bulk
update these with the new <code>search_rank</code>s. In this scenario, what we tried to
emulate is a bulk update triggered by a configuration snapshot delta, which
is the most demanding operation in the ETL pipeline in terms of storage performance.</p>
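<p>The fetch-then-update cycle can be sketched roughly as follows. This is not the benchmark code: the store is modelled as a plain in-memory map, and <code>BatchedRankUpdater</code>, <code>apply</code>, and <code>changedCount</code> are hypothetical names I picked for illustration.</p>

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the store is modelled as a plain id -> search_rank map.
class BatchedRankUpdater {

    private final Map<String, Integer> store;
    private final int batchSize;
    private long changedCount = 0;

    BatchedRankUpdater(Map<String, Integer> store, int batchSize) {
        this.store = store;
        this.batchSize = batchSize;
    }

    /** Number of documents whose rank actually differed from the fetched one. */
    long changedCount() {
        return changedCount;
    }

    /** Applies the new ranks batch by batch; returns the number of batches processed. */
    int apply(List<String> ids, List<Integer> newRanks) {
        int batchCount = 0;
        for (int from = 0; from < ids.size(); from += batchSize) {
            int to = Math.min(from + batchSize, ids.size());

            // Step 1: fetch the old ranks of the ids in this batch.
            Map<String, Integer> oldRanks = new HashMap<>();
            for (String id : ids.subList(from, to)) {
                oldRanks.put(id, store.get(id));
            }

            // Step 2: bulk update the batch with the new ranks.
            Map<String, Integer> updates = new HashMap<>();
            for (int i = from; i < to; i++) {
                String id = ids.get(i);
                Integer newRank = newRanks.get(i);
                if (!newRank.equals(oldRanks.get(id))) {
                    changedCount++;
                }
                updates.put(id, newRank);
            }
            store.putAll(updates);
            batchCount++;
        }
        return batchCount;
    }
}
```

<p>With 6 million pairs and a batch size of 1000, a loop like this performs 6000 fetches and 6000 bulk updates, which is what the benchmark spreads over concurrent workers.</p>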
<p>The test bed is a cluster composed of 6 dedicated machines with the following
specifications:</p>
<ul>
<li>
<strong>CPU</strong>: 16 core Intel Xeon E5-2620 v4 @ 2.10GHz</li>
<li>
<strong>Memory/Swap</strong>: 128GB/16GB</li>
<li>
<strong>Disk</strong>: 375GB (Intel P4800X Performance NVMe PCIe SSD)</li>
<li>
<strong>Kernel</strong>: 3.10.0-693.1.1.el7.x86_64</li>
</ul>
<p>We further configured each store as follows:</p>
<ul>
<li>
<p><strong>PostgreSQL</strong>: Just one PostgreSQL 9.6.10 instance containing a single
<code>&lt;id, content&gt;</code> table where <code>content</code> is of type <a href="https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING"><code>jsonb</code></a>.
The benchmark is configured to update only the <code>search_rank</code> attribute of the
<code>content</code> column.</p>
</li>
<li>
<p><strong>PostgreSQL (partitioned)</strong>: Same as above, but the <code>content</code> table is
partitioned into 10 tables.</p>
</li>
<li>
<p><strong>MongoDB</strong>: MongoDB 3.6 with the following configurations:</p>
<pre><code class="language-yaml"><span class="l-Scalar-Plain">systemLog.destination</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">file</span>
<span class="l-Scalar-Plain">systemLog.logAppend</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">true</span>
<span class="l-Scalar-Plain">processManagement.fork</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">true</span>
<span class="l-Scalar-Plain">storage.engine</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">wiredTiger</span>
<span class="l-Scalar-Plain">security.authorization</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">enabled</span>
<span class="l-Scalar-Plain">replication.oplogSizeMB</span><span class="p-Indicator">:</span> <span class="l-Scalar-Plain">9216</span></code></pre>
<p>Note that sharding is not enabled. (More on this later.)</p>
<p>Similar to the PostgreSQL setup, the benchmark is configured to update only the
<code>search_rank</code> attribute of documents.</p>
</li>
<li>
<p><strong>Elasticsearch</strong>: Elasticsearch 6.3.0 with the following JVM flags:</p>
<pre><code>-Xms30g
-Xmx30g
-Xss256k
-XX:NewRatio=3
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
</code></pre>
<p>Here the JVM heap size is set to 30G due to the
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html">compressed OOPs limitation</a>.</p>
<p>Unlike the PostgreSQL and MongoDB setups, where only the <code>search_rank</code>
attribute is updated, the Elasticsearch benchmark is configured to update the
entire document. While this overkill hammers Elasticsearch much harder than
the other stores (since Elasticsearch will create quite some garbage segments
waiting to be merged, and making every object nested worsens the case even
more), it is more strategically aligned with how we want to use it in
the future.</p>
</li>
</ul>
<p><a name="benchmark-results"></a></p>
<h2 id="benchmark-results">Benchmark Results</h2>
<p>Below you will see the results of the benchmark for only MongoDB and
Elasticsearch. The PostgreSQL results are omitted because, no matter what
kind of optimization we threw at it, the benchmark always took more than 2
hours, regardless of partitioning, whereas MongoDB and Elasticsearch took a
couple of minutes.</p>
<style>
.concurrency { text-align: center; }
.measurement { text-align: right; }
.per-batch .measurement { font-weight: bold; }
</style>
<table>
<thead>
<tr>
<th>Store</th>
<th>Conc.<sup>7</sup>
</th>
<th>Latency</th>
<th>Total (s)</th>
<th>Fetch<sup>8</sup> 75% (ms)</th>
<th>Fetch<sup>8</sup> 99% (ms)</th>
<th>Fetch<sup>8</sup> Max. (ms)</th>
<th>Update<sup>9</sup> 75% (ms)</th>
<th>Update<sup>9</sup> 99% (ms)</th>
<th>Update<sup>9</sup> Max. (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="6">MongoDB</td>
<td rowspan="2" class="concurrency">8</td>
<td>total</td>
<td class="measurement">518</td>
<td class="measurement">68</td>
<td class="measurement">999</td>
<td class="measurement">3380</td>
<td class="measurement">64</td>
<td class="measurement">2347</td>
<td class="measurement">4153</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">8</td>
<td class="measurement">125</td>
<td class="measurement">423</td>
<td class="measurement">8</td>
<td class="measurement">293</td>
<td class="measurement">519</td>
</tr>
<tr>
<td rowspan="2" class="concurrency">16</td>
<td>total</td>
<td class="measurement">526</td>
<td class="measurement">71</td>
<td class="measurement">3082</td>
<td class="measurement">7905</td>
<td class="measurement">68</td>
<td class="measurement">5564</td>
<td class="measurement">7955</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">4</td>
<td class="measurement">193</td>
<td class="measurement">494</td>
<td class="measurement">4</td>
<td class="measurement">348</td>
<td class="measurement">497</td>
</tr>
<tr>
<td rowspan="2" class="concurrency">32</td>
<td>total</td>
<td class="measurement">518</td>
<td class="measurement">61</td>
<td class="measurement">6668</td>
<td class="measurement">11465</td>
<td class="measurement">98</td>
<td class="measurement">10533</td>
<td class="measurement">13784</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">2</td>
<td class="measurement">208</td>
<td class="measurement">358</td>
<td class="measurement">3</td>
<td class="measurement">329</td>
<td class="measurement">431</td>
</tr>
<tr>
<td rowspan="6">Elasticsearch</td>
<td rowspan="2" class="concurrency">8</td>
<td>total</td>
<td class="measurement">251</td>
<td class="measurement">278</td>
<td class="measurement">423</td>
<td class="measurement">798</td>
<td class="measurement">94</td>
<td class="measurement">186</td>
<td class="measurement">412</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">35</td>
<td class="measurement">53</td>
<td class="measurement">100</td>
<td class="measurement">12</td>
<td class="measurement">23</td>
<td class="measurement">52</td>
</tr>
<tr>
<td rowspan="2" class="concurrency">16</td>
<td>total</td>
<td class="measurement">196</td>
<td class="measurement">478</td>
<td class="measurement">697</td>
<td class="measurement">1004</td>
<td class="measurement">141</td>
<td class="measurement">266</td>
<td class="measurement">410</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">30</td>
<td class="measurement">44</td>
<td class="measurement">63</td>
<td class="measurement">9</td>
<td class="measurement">17</td>
<td class="measurement">26</td>
</tr>
<tr>
<td rowspan="2" class="concurrency">32</td>
<td>total</td>
<td class="measurement">175</td>
<td class="measurement">951</td>
<td class="measurement">1368</td>
<td class="measurement">1515</td>
<td class="measurement">214</td>
<td class="measurement">331</td>
<td class="measurement">828</td>
</tr>
<tr class="per-batch">
<td colspan="2">per batch</td>
<td class="measurement">30</td>
<td class="measurement">43</td>
<td class="measurement">47</td>
<td class="measurement">7</td>
<td class="measurement">10</td>
<td class="measurement">26</td>
</tr>
</tbody>
</table>
<p><sup>7</sup> Number of concurrent batches.<br>
<sup>8</sup> Time it takes to fetch a batch.<br>
<sup>9</sup> Time it takes to update a batch.</p>
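<p>A note on reading the table: judging purely by the numbers, the "per batch" rows appear to be the "total" figures divided by the concurrency and rounded. The following sketch reproduces that arithmetic, along with the overall speed-up at 32 concurrent batches. The class and method names are hypothetical, and the exact rounding rule is an assumption on my part.</p>

```java
// Hypothetical helper reproducing the arithmetic behind the results table.
class BenchmarkMath {

    /** Total latency in milliseconds divided by the number of concurrent batches, rounded. */
    static long perBatch(long totalMillis, int concurrency) {
        return Math.round((double) totalMillis / concurrency);
    }

    /** How many times faster the second run is compared to the first. */
    static double speedup(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds;
    }
}
```

<p>For instance, <code>perBatch(278, 8)</code> gives the 35 ms reported for the Elasticsearch 75-percentile fetch at concurrency 8, and <code>speedup(518, 175)</code> puts Elasticsearch at roughly 3 times faster than MongoDB in total run time at concurrency 32.</p>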
<p>Let me share some observations from the results:</p>
<ul>
<li>
<p><strong>Increasing concurrency</strong> improves Elasticsearch performance (up to 32
concurrent batches) but does not have much effect on MongoDB.</p>
</li>
<li>
<p><strong>Elasticsearch rocked in performance</strong> even though it is hammered with
the update of the entire document whereas MongoDB is just trying to update a
single attribute. Using 32 concurrent batches, it took 175s and 518s for
Elasticsearch and MongoDB, respectively, to complete the benchmark.</p>
</li>
<li>
<p><strong>Elasticsearch yielded way more predictable performance</strong> figures compared
to MongoDB. Note the difference between 75- and 99-percentile figures.</p>
</li>
<li>
<p><strong>Elasticsearch segment merges</strong> were unexpectedly pretty stable during
the runs, whereas we were anticipating them to become the bottleneck due to
the high update rate. On top of that, compare-and-swap loops played over <code>_version</code> fields
allowed for the necessary data integrity without breaking a sweat.</p>
</li>
</ul>
<p>At the time of testing, we initially were not able to enable sharding in
MongoDB due to operational obstacles on our side. Still, the Elasticsearch results
were so promising, to the point of even shocking the hired Elasticsearch
consultants, that we decided to go with it, a technology with which we have years of production
experience. If we put the problem of whitelisted configuration
predicate fields aside (that is, the explicit indices required on what
can be queried), MongoDB could very well have been a viable option as well.</p>
<p>But, really, why does Elasticsearch have a reputation of not being recommended as a
primary data store? I think it all started when the official project website
years ago contained an explicit statement admitting that Elasticsearch is not
intended to be used as a primary data store. Once, as the very owner of the
project itself, you admit this fact, it is really difficult to convince people
the other way around – even if the situation might have been improved. Later
on, published <a href="https://jepsen.io/">Jepsen</a> (an effort to improve the safety of
distributed databases, queues, consensus systems, etc.) reports (<a href="https://aphyr.com/posts/317-call-me-maybe-elasticsearch">one in
2014-06-15 using Elasticsearch 1.1.0</a>
and the other <a href="https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0">one in 2015-04-27 using Elasticsearch
1.5.0</a>)
worsened the situation and this bad reputation disseminated over the web at
the speed of light. While this tornado was DDoS’ing the entire Hackernews,
Proggit, etc. blogosphere with endless discussions in the form of <i>“See? I
told ya so!”</i>, the Elasticsearch team put up an <a href="https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html">Elasticsearch Resiliency
Status</a>
page. There they started sharing (even up to today!) known resiliency
problems, including the ones found in Jepsen reports, converting them into
reproducible cases in <a href="https://github.com/elastic/elasticsearch/issues/">GitHub
issues</a>, and tackling them
one at a time. What else would qualify as a professional commitment if not
this one? Again, these were all back in early 2015. Our Elasticsearch
production deployments managed to return victorious from
every battle thrown at them. It did not always feel like a walk in
the park. We had our hard times, but we managed to overcome them and noted
down the experience in the book of lessons learnt. Let me share some common
practices from that collection:</p>
<ul>
<li>
<strong>Security</strong>: Elasticsearch does not provide any security
measures (encryption, etc.) out of the box. We do not use Elasticsearch to
store any sort of <a href="https://en.wikipedia.org/wiki/Personally_identifiable_information">PII</a>.</li>
<li>
<strong>Transactions</strong>: Elasticsearch does not have transaction support. We
work around this by performing compare-and-swap loops over the <code>_version</code>
field.</li>
<li>
<strong>Tooling</strong>: Elasticsearch tooling is… just a piece of crap. It doesn’t
have a proper development environment – you are stuck with running a
full-blown Kibana just to be able to use its arcane
<a href="https://www.elastic.co/guide/en/kibana/current/console-kibana.html">Console</a>.
Its Java client drags in the entire milky way of Elasticsearch artifacts
as a dependency which is a <a href="https://en.wikipedia.org/wiki/Java_Classloader#JAR_hell">JAR
Hell</a> time bomb
waiting to explode. Further, the recently introduced <a href="https://www.elastic.co/guide/en/elasticsearch/client/java-rest/master/java-rest-high.html">high-level REST
client</a>
leaks the Apache HTTP Client API models, etc. For the leaked models
and transitive dependencies, there is not much you can do – you just
learn to live with them. As for an IDE, you just keep a thick stack of HTTP
request recipes using your favorite HTTP client, e.g.,
<a href="https://curl.haxx.se/2">cURL</a>, <a href="https://www.getpostman.com/">Postman</a>,
<a href="https://httpie.org/">httpie</a>, etc.</li>
<li>
<strong>Documentation</strong>: Elasticsearch does not have documentation; <a href="https://www.postgresql.org/docs/">PostgreSQL
has documentation</a>, <a href="https://docs.mongodb.com/">MongoDB has
documentation</a>. What Elasticsearch has is <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html">a stack
of surface-scratching blog posts served in the form of a documentation-like
website</a>.
Elasticsearch also has an ocean of <a href="https://stackoverflow.com/questions/tagged/elasticsearch">Stack
Overflow</a>
and <a href="https://discuss.elastic.co/c/elasticsearch">forum</a> posts where you are
allowed to swim at your convenience. That being said, one needs to admit that
the situation is improving over time. (Yes, it was way worse!)</li>
<li>
<strong>Resiliency</strong>: Yes, Elasticsearch can crash, just like any other piece of
software. In order to address these emergencies, in addition to hot-standby
clusters, we take regular <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html">snapshots</a>
and persist the messages processed by the ETL pipeline to a separate storage
providing efficient write and bulk read operations, e.g., PostgreSQL, Google
BigQuery, etc. In case of need, we just restore from a snapshot and replay
the necessary set of messages to recover the lost state.</li>
</ul>
<p>Is Elasticsearch the perfect tool for the job at hand? Not really. But it is
the one closest to that. We also know how to deal with each other – just like
in any other relationship.</p>
<p><a name="new-etl"></a></p>
<h1 id="the-new-etl">The New ETL</h1>
<p>By taking into account the ETL pipeline concerns detailed in previous
chapters, we derived a list of basic foundations that we aim to deliver:</p>
<ol>
<li>The configuration DSL must be abstract enough to avoid <del>any</del> too
much vendor lock-in. One must be able to represent configurations in this
DSL such that applying them to a JSON document and/or the underlying storage unit
is just a matter of writing the necessary adapter classes.</li>
<li>The storage must allow the ETL pipeline to query the entire collection
using any possible filter combinations allowed by the configuration
predicate DSL. This is a crucial pillar in the design to enable real-time
processing of every message, from both the content and the configuration snapshot
streams, without necessitating an ETL run over the complete collection, which
used to be the case in the old ETL pipeline.</li>
</ol>
<p>Let me elaborate on how we addressed these deliverables.</p>
<p><a name="primary-storage-elasticsearch"></a></p>
<h2 id="the-primary-storage-elasticsearch">The Primary Storage: Elasticsearch</h2>
<p>The previous benchmark section already detailed the rationale behind employing
Elasticsearch as the primary storage. It is distributed and sharded by
default. It doesn’t require explicit indices on a whitelist of allowed
configuration predicate fields – every field is allowed to be queried by
default. It has no problems with querying fields containing a list of objects.
It provides sufficient leverage for data integrity via compare-and-swap loops
over <code>_version</code> fields. It is very efficient on bulk fetches and updates,
which was totally unexpected for us. Last, but not least, it is our bread and
butter in search and we have plenty of experience with it.</p>
<p><a name="configuration-dsl-json-groovy"></a></p>
<h2 id="the-configuration-dsl-json-and-groovy">The Configuration DSL: JSON and Groovy</h2>
<p>In the case of configuration DSL, we wanted to stop the plague of PL/SQL
leakage all around the code base. For this purpose, we decided to go with the
model depicted below.</p>
<p><img src="dsl.jpg" alt="The New Configuration DSL"></p>
<p>Here we replaced SQL WHERE clauses, which were used to represent configuration
predicates in the old ETL pipeline, with JSON describing the structure of
the predicate. This new predicate representation resembling the Elasticsearch
filters is translated to individual executors matching against either JSON
(coming from the real-time content stream) or the storage engine, that is,
Elasticsearch. Note that the way we represent the predicate is
independent of the medium (JSON, Elasticsearch, etc.) it is executed against, so much so
that we even implemented a MongoDB adapter at some point. An example
configuration predicate JSON is shown below:</p>
<pre><code class="language-json"><span class="p">{</span>
  <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"nested"</span><span class="p">,</span>
  <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"content"</span><span class="p">,</span> <span class="s2">"attribute"</span><span class="p">],</span>
  <span class="nt">"filter"</span><span class="p">:</span> <span class="p">{</span>
    <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"and"</span><span class="p">,</span>
    <span class="nt">"filters"</span><span class="p">:</span> <span class="p">[</span>
      <span class="p">{</span>
        <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"nested"</span><span class="p">,</span>
        <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"gpc"</span><span class="p">],</span>
        <span class="nt">"filter"</span><span class="p">:</span> <span class="p">{</span>
          <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"and"</span><span class="p">,</span>
          <span class="nt">"filters"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span>
              <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"equals"</span><span class="p">,</span>
              <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"family_id"</span><span class="p">],</span>
              <span class="nt">"value"</span><span class="p">:</span> <span class="s2">"1234"</span>
            <span class="p">},</span>
            <span class="p">{</span>
              <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"equals"</span><span class="p">,</span>
              <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"chunk_id"</span><span class="p">],</span>
              <span class="nt">"value"</span><span class="p">:</span> <span class="s2">"5678"</span>
            <span class="p">}</span>
          <span class="p">]</span>
        <span class="p">}</span>
      <span class="p">},</span>
      <span class="p">{</span>
        <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"nested"</span><span class="p">,</span>
        <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"authors"</span><span class="p">],</span>
        <span class="nt">"filter"</span><span class="p">:</span> <span class="p">{</span>
          <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"and"</span><span class="p">,</span>
          <span class="nt">"filters"</span><span class="p">:</span> <span class="p">[</span>
            <span class="p">{</span>
              <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"equals"</span><span class="p">,</span>
              <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"fname"</span><span class="p">],</span>
              <span class="nt">"value"</span><span class="p">:</span> <span class="s2">"Volkan"</span>
            <span class="p">},</span>
            <span class="p">{</span>
              <span class="nt">"type"</span><span class="p">:</span> <span class="s2">"equals"</span><span class="p">,</span>
              <span class="nt">"path"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"lname"</span><span class="p">],</span>
              <span class="nt">"value"</span><span class="p">:</span> <span class="s2">"Yazici"</span>
            <span class="p">}</span>
          <span class="p">]</span>
        <span class="p">}</span>
      <span class="p">}</span>
    <span class="p">]</span>
  <span class="p">}</span>
<span class="p">}</span></code></pre>
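<p>To give an impression of what an executor matching such a predicate against in-memory JSON could look like, here is a minimal sketch. It is not our actual adapter: the <code>PredicateMatcher</code> class is hypothetical, JSON is modelled as plain <code>java.util.Map</code> and <code>java.util.List</code> instances, and I assume that <code>nested</code> succeeds when the object under <code>path</code>, or at least one element if it is a list, satisfies the inner filter.</p>

```java
import java.util.List;
import java.util.Map;

// Hypothetical executor evaluating the predicate DSL against in-memory JSON.
class PredicateMatcher {

    /** Returns true if the given JSON-like value satisfies the predicate. */
    @SuppressWarnings("unchecked")
    static boolean matches(Map<String, Object> predicate, Object json) {
        String type = (String) predicate.get("type");
        switch (type) {
            case "and":
                // Every sub-filter must hold.
                for (Map<String, Object> filter : (List<Map<String, Object>>) predicate.get("filters")) {
                    if (!matches(filter, json)) {
                        return false;
                    }
                }
                return true;
            case "equals":
                Object actual = resolve(json, (List<String>) predicate.get("path"));
                return predicate.get("value").equals(actual);
            case "nested":
                Object target = resolve(json, (List<String>) predicate.get("path"));
                Map<String, Object> inner = (Map<String, Object>) predicate.get("filter");
                if (target instanceof List) {
                    // Assumption: one matching element in the list is enough.
                    for (Object element : (List<Object>) target) {
                        if (matches(inner, element)) {
                            return true;
                        }
                    }
                    return false;
                }
                return target != null && matches(inner, target);
            default:
                throw new IllegalArgumentException("unknown predicate type: " + type);
        }
    }

    /** Walks the path through nested maps; returns null when a step is missing. */
    static Object resolve(Object json, List<String> path) {
        Object current = json;
        for (String key : path) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```

<p>An adapter for a storage engine would walk the same tree, but emit a query in the engine's own filter language instead of evaluating it on the spot.</p>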
<p>As depicted above, we split the configuration mutation model into two
abstractions: <em>extension</em> and <em>functional extension</em>. An extension is the
simplest form of mutation that generally applies to more than 90% of the
available configurations. It is basically a JSON object that, upon execution,
is merged into the original source. A simple example is as
follows:</p>
<pre><code class="language-json"><span class="p">{</span>
  <span class="nt">"category"</span><span class="p">:</span> <span class="s2">"books"</span>
<span class="p">}</span></code></pre>
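<p>Merging an extension into a source can be sketched as below. The <code>ExtensionMerger</code> class is hypothetical, and I assume a recursive merge in which nested objects are combined and, on conflict, the value coming from the extension wins; a shallow merge would be an equally valid reading of the description above.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of applying an extension to a JSON-like source.
class ExtensionMerger {

    /** Returns a new map containing the source with the extension merged into it. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> source, Map<String, Object> extension) {
        Map<String, Object> merged = new LinkedHashMap<>(source);
        for (Map.Entry<String, Object> entry : extension.entrySet()) {
            Object existing = merged.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                // Both sides are objects: merge them recursively.
                merged.put(entry.getKey(),
                        merge((Map<String, Object>) existing, (Map<String, Object>) incoming));
            } else {
                // Otherwise the extension value wins.
                merged.put(entry.getKey(), incoming);
            }
        }
        return merged;
    }
}
```

<p>Returning a fresh map rather than mutating the source keeps the original document intact, which helps when the same source needs to be run through several configurations.</p>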
<p>Functional extensions are built to address complex configuration mutations.
There we employed <a href="http://www.groovy-lang.org/">Groovy</a> after experimenting
with some other candidates, e.g., JavaScript (<a href="https://www.oracle.com/technetwork/articles/java/jf14-nashorn-2126515.html">Nashorn</a>, which is <a href="http://openjdk.java.net/jeps/335">planned to be
dropped</a>), Python
(<a href="http://www.jython.org/">Jython</a>), Ruby (<a href="https://www.jruby.org/">JRuby</a>),
etc. The main drivers for us to pick Groovy are as follows:</p>
<ul>
<li>It supports direct access to Java data structures (e.g., <code>java.util.Map</code>)
without any intermediate translations, hence has no problems processing
thousands of mutations on a single core.</li>
<li>It is widely adopted to an extent that in the future we might opt for
running it against the storage engine.</li>
<li>Its runtime performance is on par with the rest of the candidates.</li>
</ul>
<p>That being said, the choice of Groovy creates a JVM vendor lock-in for the
ETL pipeline, though we do not anticipate this to be a problem for at least
the coming decade.</p>
<p>A sample functional extension is given below.</p>
<pre><code class="language-groovy"><span class="kd">static</span> <span class="n">Map</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Object</span><span class="o">&gt;</span> <span class="n">extend</span><span class="o">(</span><span class="n">Map</span><span class="o">&lt;</span><span class="n">String</span><span class="o">,</span> <span class="n">Object</span><span class="o">&gt;</span> <span class="n">source</span><span class="o">)</span> <span class="o">{</span>
    <span class="kt">def</span> <span class="n">diskCapacityBytes</span> <span class="o">=</span> <span class="o">(</span><span class="kt">long</span><span class="o">)</span> <span class="n">source</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="s2">"disk_capacity_bytes"</span><span class="o">)</span>
    <span class="kt">def</span> <span class="n">diskCapacityGigabytes</span> <span class="o">=</span> <span class="n">diskCapacityBytes</span> <span class="o">/</span> <span class="mi">1</span><span class="n">e9</span>
    <span class="n">source</span><span class="o">.</span><span class="na">put</span><span class="o">(</span><span class="s2">"disk_capacity_gigabytes"</span><span class="o">,</span> <span class="n">diskCapacityGigabytes</span><span class="o">)</span>
    <span class="k">return</span> <span class="n">source</span>
<span class="o">}</span></code></pre>
<p><a name="conclusion"></a></p>
<h1 id="conclusion">Conclusion</h1>
<p>Implementing an e-commerce search engine is a tough business. The part of the
iceberg under the water level – that is, the ETL pipeline – is no less tough.
In this post, I tried to share the lessons we piled up in the
implementation and maintenance of our decade-old ETL pipeline and how we
cultivated these to come up with something new. I attempted to explain how the
choice of the configuration DSL and the primary storage engine has the
utmost impact on the rest of the components of the architecture.
Elasticsearch has already been serving us pretty well in the search gateway.
Taking a step further and employing it in the ETL was a substantially
unconventional idea that gave the shivers to every engineer involved in the
decision. But the careful consideration and evaluation of potential candidates
paid off: It worked! So when you visit <a href="https://bol.com">bol.com</a> next time,
you will know that the Elasticsearch in the ETL pipeline – in addition to
many other Elasticsearch-using services involved – cooked that warm page for
you seconds ago.</p>
<p><a name="acknowledgements"></a></p>
<h1 id="acknowledgements">Acknowledgements</h1>
<p>I would like to thank <a href="https://twitter.com/bbuharali">Berkay Buharalı</a>,
Lourens Heijs, <a href="https://twitter.com/wvl0">William Leese</a>, <a href="https://almer.tigelaar.net/">Almer S.
Tigelaar</a>, Leon Widdershoven, and <a href="https://twitter.com/maurice_zeijen">Maurice
Zeijen</a> for their valuable feedback in
bringing the post to its final form.</p>
<p><a name="faq"></a></p>
<h1 id="faq">F.A.Q.</h1>
<p>Here I will try to answer certain questions I received via
<a href="https://news.ycombinator.com/item?id=18568922">Hackernews</a> or e-mail.</p>
<h2 id="did-you-try-tuning-the-postgresql-optimization-knobs">Did you try tuning the PostgreSQL optimization knobs?</h2>
<p>bol.com has plenty of databases supported by an army of skilled DBAs. During
benchmarks, we collaborated with our PostgreSQL experts to continuously tune
the necessary knobs to get the best performance given our data size and access
patterns. Hence, it wasn’t a tune once, run once operation, but rather a
continuous effort to determine an optimal configuration.</p>
<h2 id="how-do-you-calculate-searchrank">How do you calculate <code>search_rank</code>?</h2>
<p>For the benchmark purposes, we employed a deprecated signal (that is,
<code>search_rank</code>) that we used to score the matched documents. In the new search
gateway, that approach is replaced with a multitude of context-dependent
signals combined at runtime. How the computation of ranking
signals works is out of the scope of this post. But in a nutshell, it is an
in-house machine learning algorithm harvesting historical user interaction logs.
Handling of sudden or seasonal trends? That is a whole different game.</p>
<h2 id="have-you-ever-presented-this-material-in-a-conference-setting">Have you ever presented this material in a conference setting?</h2>
<p>Yes. I managed to squeeze the entire content into a 50-minute talk and
had the opportunity to share it with people at</p>
<ul>
<li>
<a href="https://cncv.io/">CNCML Vienna, 2019</a> (<a href="cncml-vienna-2019.pdf">PDF</a>,
<a href="cncml-vienna-2019-org.odp">ODP</a>)</li>
<li>
<a href="https://www.meetup.com/Elastic-NL/events/265086114/">ElasticNL</a>
(<a href="elasticnl-20191008.pdf">PDF</a>, <a href="elasticnl-20191008-org.odp">ODP</a>)</li>
</ul>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/rWY6gjv31CPFz3" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no">
</iframe>
tag:volkan.yazi.ci,2018-02-17://blog/post/2018/02/17/varnishing-search-performance/Varnishing Search Performance2018-02-17T20:42:00Z2018-02-17T20:42:00Z
<p>This week <a href="http://bol.com">bol.com</a> hosted an <a href="https://www.meetup.com/Elastic-NL/">Elastic User Group
NL</a> meetup titled <a href="https://www.meetup.com/Elastic-NL/events/247114723/">bol.com: Changing the
(search) engine of a racecar going 300 km/h</a>.
The abstract of the meetup was as follows:</p>
<blockquote>
<p>Almost 2 years ago bol.com decided to move towards an
Elasticsearch-powered search engine. But how do you approach such a
project? Who do you involve and what do you need to (not) do? The
engineers at bol.com would like to share their experiences about this
migration, in 4 short talks.</p>
</blockquote>
<p>And among those 4 short talks, I took the stage with <em>Varnishing Search Performance</em>.</p>
<blockquote>
<p>Searching is <em>peanuts</em>. You setup your Elasticsearch cluster (or better
find a SaaS partner) and start shooting your search queries against it.
Well… Not really. If we put the biblical data ingestion story aside, it
won’t take long to realize that even moderately complicated queries can
become a bottleneck for those aiming for &lt;50ms query performance.
Combine a couple of aggregations, double that for facets of range type,
add your grandpa’s boosting factors to the scoring, and there you go;
now you are a search query performance bottleneck owner too! Maybe I am
exaggerating a bit. Why not just start throwing some caches in front of
it? Hrm… We actually thought of that and did so. Though it brought a
mountain of problems along with itself, and there goes my story.</p>
</blockquote>
<p>The slides are available in <a href="varnishing-search-performance.pdf">PDF</a> and
<a href="varnishing-search-performance-org.odp">ODP</a> formats.</p>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/4h5JWHH25nHGa4" width="476" height="400" frameborder="0" marginwidth="0" marginheight="0" scrolling="no">
</iframe>
tag:volkan.yazi.ci,2018-02-09://blog/post/2018/02/09/netty-in-action/Notes on "Netty in Action"2018-02-09T20:34:00Z2018-02-09T20:34:00Z
<p>Those who had the privilege to read my <a href="/blog/post/2017/04/18/inter-service-comm/">frustration chronicles on intra-microservice communication</a> would easily
recall me pointing my finger at the Java Platform SE guys for not shipping a
proper HTTP client. There my fury went as far as calling it one of the
closest candidates for the billion dollar mistake. Unfortunately screaming out
loud in a blog post does not give much of a relief, because it doesn’t take
more than a month for me to find myself in precisely the same technical
mudpot. Indeed, a couple of months after I wrote that post, I was chasing
yet another performance problem in one of our aggregation services. In
essence, each incoming HTTP request is served by aggregating multiple sources
collected again over HTTP. This simple fairy tale architecture gets
slaughtered on production by 200 Tomcat threads intertwined with Rx
computation and I/O threads resting in the shades of a dozen other thread
pools dedicated to so-called-asynchronous HTTP clients for aggregated remote
services. And I saved the best for last: there were leaking <code>TIME_WAIT</code>
sockets.</p>
<p>All of a sudden the question occurred to me like the roar of rolling boulders
down a steep hill in a far distance: What is the lowest level that I can plumb
a networking application in Java without dealing with protocol intricacies?
Put another way, is there a foundational abstraction exposing both the lowest
(channel with I/O streams) and highest (HTTP headers and body) levels that are
in reach? I rode both Java OIO and NIO (that is, old- and new-I/O) horses in
the past and fell off enough to learn it the hard way that they are definitely
not feasible options in this case. The first attempt at searching for a cure
on Google introduces you to <a href="http://netty.io/">Netty</a>. If you dig long enough,
you also stumble upon <a href="http://mina.apache.org/">Apache Mina</a>. Netty is
popular enough in the Java world that it is highly likely you are an indirect
consumer of it, unless you are already directly using it. I was aware of its
presence like dark matter in every single network application that I wrote,
though I had never considered using it directly. Checking the Netty website
after dealing with the crippled network applications at hand sparked an
enlightenment within me: <em>Hey! I can purpose this to implement some sort of
RPC mechanism using Protocol Buffers in HTTP 2.0 request payloads!</em> Though
further investigation wipes the dust from the footsteps of giants who had
followed the same path: Google (<a href="https://grpc.io/">gRPC</a>), Facebook
(<a href="https://github.com/facebook/nifty">Nifty</a>), Twitter
(<a href="https://twitter.github.io/finagle/">Finagle</a>), etc. This finding, while
crushing my initial excitement, later gave way to the confidence of
having it confirmed that I am on the right path.</p>
<p>I have always heard good things about both Netty and its community. I have
already been sneakily following the
<a href="http://normanmaurer.me/presentations/">presentations</a> and <a href="https://twitter.com/normanmaurer">Twitter
updates</a> of <a href="http://normanmaurer.me/">Norman
Maurer</a>, the Netty shepherd as of this writing. Though what
triggered my deep dive into Netty was the following tweet:</p>
<blockquote class="twitter-tweet" data-lang="en">
<p lang="en" dir="ltr">Challenge accepted! First step is done. Next: Cover to cover study. <a href="https://t.co/Gnfhbi6Ko0">pic.twitter.com/Gnfhbi6Ko0</a></p>— Volkan Yazıcı (@yazicivo) <a href="https://twitter.com/yazicivo/status/954366672751689728?ref_src=twsrc%5Etfw">January 19, 2018</a>
</blockquote>
<p>Norman Maurer has always been kind and encouraging to new contributors. So my
plan is to turn this into a mutually beneficial relationship: I can contribute and
get tutored while doing so.</p>
<h1 id="netty-in-action">Netty in Action</h1>
<p><a href="https://www.manning.com/books/netty-in-action">The book</a> (2016 press date) is
definitely a must read for anyone planning to use Netty. It lays out Netty
fundamentals like channels, handlers, encoders, etc. in detail. That being
said, I have got the impression that the content is mostly curated for
beginners. For instance, dozens of pages (and an appendix) are spent (wasted?)
on a Maven crash course, not to mention the space wasted by the Maven command
outputs shared. This felt a little bit disappointing considering the existing
audience of Netty in general. Who would really read a book about Netty? You
have probably had your time with OIO/NIO primitives or client/server
frameworks in the market. You certainly don’t want to use yet another library
that promises to make all your problems disappear. So I don’t think you
qualify as a novice in this battle anymore, and you are indeed in search
of a scalpel rather than a swiss army knife. Nevertheless, I still think the
book eventually managed to succeed in finding a balance between going too deep
and just scratching the surface.</p>
<h2 id="things-that-are-well-done">Things that are well done</h2>
<ul>
<li>
<p>I really enjoyed the presented <strong>historical perspective</strong> on the development
of Java platforms’ networking facilities and Netty itself. Found it quite
valuable and wanted to read more and more!</p>
</li>
<li>
<p>Emphasis on <strong><code>ByteBuf</code></strong> was really handy. Later on I learnt that there are
people using Netty just for its sound <code>ByteBuf</code> implementation.</p>
</li>
<li>
<p>Almost every single conscious decision within the shared <strong>code snippets are
explained in detail</strong>. While this felt like quite some noise in the
beginning, later on it turned out to be really helpful – especially while
manually updating <code>ByteBuf</code> reference counts.</p>
</li>
<li>
<p>Presented <strong>case studies</strong> were quite interesting to read and inspiring too.</p>
</li>
</ul>
<h2 id="things-that-could-have-been-improved">Things that could have been improved</h2>
<ul>
<li>
<p>I had big hopes to read about how to implement an HTTP client with
<strong>connection pool</strong> support. I find this feature indispensable in a
networking application, and it is often not used wisely. Though there wasn’t a
single section mentioning connection pooling of any sort.</p>
</li>
<li>
<p>As someone who had studied <a href="http://normanmaurer.me/presentations/">Norman Maurer’s
presentations</a>, I was expecting to
see waaaay more practical tips about <strong>GC considerations</strong>, updating <strong>socket
options</strong> (<code>TCP_NODELAY</code>, <code>SO_SNDBUF</code>, <code>SO_RCVBUF</code>, <code>SO_BACKLOG</code>, etc.),
mitigating <strong><code>TIME_WAIT</code></strong> socket problems, and Netty best practices. Maybe
adding this content would have doubled the size of the book, though I still
think a book on Netty is incomplete without such practical tips.</p>
</li>
<li>
<p>Many inbound requests trigger multiple I/O operations in a typical network
application. It is crucial to not let these operations block a running thread,
which Netty is well aware of and hence ships a fully-fledged <code>EventExecutor</code>
abstraction. This crucial detail is mentioned in many places within the book,
though none gave a concrete example. Such a common thing could have been
demonstrated by an example.</p>
</li>
</ul>
<h1 id="notes">Notes</h1>
<p>I always take notes while reading a book. Let it be a grammar mistake, code
typo, incorrect or ambiguous information, thought provoking know-how,
practical tip, etc. You name it. Here I will share them in page order. I will
further classify my notes in 4 groups: <span class="note-mistake">mistakes</span>,
<span class="note-improvement">improvements</span>,
<span class="note-question">questions</span>, and
<span class="note-other">other</span>.</p>
<ul>
<li>
<p><span class="note-question">[p19, Listing 2.1]</span> Why did we use
<code>ctx.writeAndFlush(Unpooled.EMPTY_BUFFER)</code> rather than just calling
<code>ctx.flush()</code>?</p>
</li>
<li>
<p><span class="note-mistake">[p21, Listing 2.2]</span> Typo in
<code>throws Exceptio3n</code>.</p>
</li>
<li>
<p><span class="note-improvement">[p49, Section 4.3.1]</span>
The listed items</p>
<blockquote>
<ul>
<li>A new <code>Channel</code> was accepted and is ready.</li>
<li>A <code>Channel</code> connection …</li>
</ul>
</blockquote>
<p>are an identical repetition of Table 4.3.</p>
</li>
<li>
<p><span class="note-improvement">[p60]</span> <code>CompositeByteBuf</code> has the
following remark:</p>
<blockquote>
<p>Note that Netty optimizes socket I/O operations that employ
<code>CompositeByteBuf</code>, eliminating whenever possible the performance and
memory usage penalties that are incurred with JDK’s buffer implementation.
This optimization takes place in Netty’s core code and is therefore not
exposed, but you should be aware of its impact.</p>
</blockquote>
<p>Interesting. Good to know. I should be aware of <em>its impact</em>. But how can
I measure and relate this impact? Maybe I am just nitpicking, though I would
love to hear a little bit more.</p>
</li>
<li>
<p><span class="note-question">[p77, Table 6.3]</span>
<code>channelWritabilityChanged()</code> method of <code>ChannelInboundHandler</code>… How come
an inbound channel can have a writability notion? I would have expected an
inbound channel to be just readable.</p>
</li>
<li>
<p><span class="note-improvement">[p78, Section 6.1.4]</span> Starts with
some really intriguing paragraph:</p>
<blockquote>
<p>A powerful capability of <code>ChannelOutboundHandler</code> is to defer an operation
or event on demand, which allows for sophisticated approaches to request
handling. If writing to the remote peer is suspended, for example, you can
defer flush operations and resume them later.</p>
</blockquote>
<p>Though it ends here. No more explanations, not even a single example, etc.
A total mystery.</p>
</li>
<li>
<p><span class="note-question">[p79, Table 6.4]</span> <code>read()</code> method of a
<code>ChannelOutboundHandler</code>… Similar to <code>ChannelInboundHandler#channelWritabilityChanged()</code>,
how come an outbound channel can have a read method? What are we reading
that is supposed to be already originating from us and destined to a remote
peer?</p>
</li>
<li>
<p><span class="note-improvement">[p79, Section 6.1.4]</span>
It goes as follows:</p>
<blockquote>
<p><strong><code>ChannelPromise</code> vs. <code>ChannelFuture</code></strong> Most of the methods in
<code>ChannelOutboundHandler</code> take a <code>ChannelPromise</code> argument to be notified
when the operation completes. <code>ChannelPromise</code> is a subinterface of
<code>ChannelFuture</code> that defines the writable methods, such as <code>setSuccess()</code>
or <code>setFailure()</code>, thus making <code>ChannelFuture</code> immutable.</p>
</blockquote>
<p>Ok, but why? I know the difference between a <code>Future</code> and a <code>Promise</code>, though
I still cannot see the necessity for outbound handlers to employ <code>Promise</code>
instead of a <code>Future</code>.</p>
</li>
<li>
<p><span class="note-question">[p84, Listing 6.5]</span> While adding handlers
to a pipeline, what happens in the case of a name conflict?</p>
</li>
<li>
<p><span class="note-improvement">[p84]</span> A remark is dropped on the
<strong><code>ChannelHandler</code> execution and blocking</strong> subject. Just in time! Though
it misses a demonstration.</p>
</li>
<li>
<p><span class="note-question">[p86, Listing 6.9]</span> Again a <code>read()</code>
method for the outbound operations of a <code>ChannelPipeline</code>. I am really
puzzled on the notion of reading from an outbound channel.</p>
</li>
<li>
<p><span class="note-question">[p94, Listing 6.13]</span> What happens when
a <code>ChannelFuture</code> completes before adding a listener to it?</p>
</li>
<li>
<p><span class="note-mistake">[p95, Section 6.5]</span> Last paragraph goes
like this:</p>
<blockquote>
<p>The next chapter will focus on Netty’s codec abstraction, which makes
writing protocol encoders and decoders much easier than using the
underlying <code>ChannelHandler</code> implementations directly.</p>
</blockquote>
<p>Though next chapter focuses on <code>EventLoop</code> and threading model.</p>
</li>
<li>
<p><span class="note-question">[p102, Listing 7.3]</span> Speaking of
scheduling <code>Runnable</code>s to a channel’s event loop, what if channel gets
closed before triggering the scheduled tasks?</p>
</li>
<li>
<p><span class="note-improvement">[p103]</span> Page starts with the
following last paragraph:</p>
<blockquote>
<p>These examples illustrate the performance gain that can be achieved
by taking advantage of Netty’s scheduling capabilities.</p>
</blockquote>
<p>Really? Netty’s scheduling capabilities are shown by using each function in
isolation. Though I still don’t have a clue on how these capabilities can be
purposed for a performance gain. This is a <strong>common problem throughout the
book</strong>: The innocent flashy statement hangs in the air, waiting for a
demonstration that shares some insight distilled by experience.</p>
</li>
<li>
<p><span class="note-mistake">[p104, Figure 7.4]</span> The caption of figure
is as follows:</p>
<blockquote>
<p><code>EventLoop</code> allocation for non-blocking transports (such as NIO and AIO)</p>
</blockquote>
<p>AIO? Looks like a typo.</p>
</li>
<li>
<p><span class="note-mistake">[p107]</span> Chapter starts with the following
opening paragraph:</p>
<blockquote>
<p>Having studied <code>ChannelPipeline</code>s, <code>ChannelHandler</code>s, and codec classes in
depth, …</p>
</blockquote>
<p>Nope. Nothing has been mentioned so far about codec classes.</p>
</li>
<li>
<p><span class="note-improvement">[p112]</span> It is explained that, in the
context of <code>Bootstrap</code>, <code>bind()</code> and <code>connect()</code> can throw
<code>IllegalStateException</code> if some combination of <code>group()</code>, <code>channel()</code>,
<code>channelHandler()</code>, and/or <code>handler()</code> method calls is missing. Similarly,
calling <code>attr()</code> after <code>bind()</code> has no effect. I personally find such
abstractions poorly designed. I would rather have used the <a href="https://immutables.github.io/immutable.html#staged-builder">staged builder
pattern</a> and
avoided such intricacies at compile-time.</p>
</li>
<li>
<p><span class="note-mistake">[p117, Listing 8.6]</span> The 2nd argument to
<code>Bootstrap#group()</code> looks like a typo.</p>
</li>
<li>
<p><span class="note-improvement">[p120]</span> Check this end of chapter
summary out:</p>
<blockquote>
<p>In this chapter you learned how to bootstrap Netty server and client
applications, including those that use connectionless protocols. We
covered a number of special cases, including bootstrapping client channels
in server applications and using a <code>ChannelInitializer</code> to handle the
installation of multiple <code>ChannelHandler</code>s during bootstrapping. You saw
how to specify configuration options on channels and how to attach
information to a channel using attributes. Finally, you learned how to
shut down an application gracefully to release all resources in an
orderly fashion.</p>
<p>In the next chapter we’ll examine the tools Netty provides to help you
test your <code>ChannelHandler</code> implementations.</p>
</blockquote>
<p>I have always found such summaries useless, since they are a repetition of
the chapter introduction, and hence a waste of space. Rather just
give the crucial takeaways, preferably in a form digestible at a glance.
For instance, <em>use <code>EventLoopGroup.shutdownGracefully()</code></em>, etc.</p>
</li>
<li>
<p><span class="note-improvement">[p121]</span> I suppose <em>Unit Testing</em>
chapter used to come after <em>Codecs</em> in previous prints and the authors
have moved it to an earlier stage to establish a certain coherence in
the introductory chapters. Though, reading <em>Codecs</em> reveals that there
is close to 70% overlap in content, which feels like a poorly structured
flow. I see the value in authors’ attempt, though there is quite some
room for improvement via tuning the break down of chapters.</p>
</li>
<li>
<p><span class="note-mistake">[p124, Section 9.2.1]</span>
<code>ByteToMessageDecoder</code> is used before it is explained. (See my remark above.)</p>
</li>
<li>
<p><span class="note-improvement">[p127]</span> The following bullets</p>
<blockquote>
<p>Here are the steps executed in the code:</p>
<ol>
<li>Writes negative 4-byte integers to a new <code>ByteBuf</code>.</li>
<li>Creates an <code>EmbeddedChannel</code> …</li>
</ol>
</blockquote>
<p>are a repetition of the descriptions available in Listing 9.4.</p>
</li>
<li>
<p><span class="note-mistake">[p138, Listing 10.3]</span> Comma missing
after <code>Integer msg</code>.</p>
</li>
<li>
<p><span class="note-question">[p141]</span> Why do
<code>MessageToMessage{Encoder,Decoder}</code> classes not have an output type,
but just <code>Object</code>? How do you ensure type safety while chaining them
along a pipeline?</p>
</li>
<li>
<p><span class="note-mistake">[p142, Listing 10.6]</span> Comma missing
after <code>Integer msg</code>.</p>
</li>
<li>
<p><span class="note-mistake">[p145, Listing 10.7]</span> Constructor of
<code>MyWebSocketFrame</code> is named incorrectly.</p>
</li>
<li>
<p><span class="note-improvement">[p151, Section 11.2]</span> I think
<em>Building Netty HTTP/HTTPS applications</em> deserves its own chapter. And
a very important subject is missing: connection pooling.</p>
</li>
<li>
<p><span class="note-question">[p157, Listing 11.6]</span> While building
the WebSocket pipeline, which handler addresses ping/pong frames?</p>
</li>
<li>
<p><span class="note-mistake">[p159, Table 11.4]</span> The first sentence
in the description of <code>WriteTimeoutHandler</code> is identical to the one in
<code>ReadTimeoutHandler</code>. Supposedly a copy-paste side-effect.</p>
</li>
<li>
<p><span class="note-mistake">[p171]</span> Check out the first paragraph:</p>
<blockquote>
<p>WebSocket is an advanced network protocol that has been developed to
improve the performance and responsiveness of web applications. We’ll
explore Netty’s support for <em>each of them</em> by writing a sample
application.</p>
</blockquote>
<p>Each of them? Who are they?</p>
</li>
<li>
<p><span class="note-mistake">[p177]</span> <em>The call to <code>retain()</code> is
needed because after <code>channelRead()</code> …</em> → <em>The call to <code>retain()</code>
is needed because after <code>channelRead0()</code> …</em></p>
</li>
<li>
<p><span class="note-improvement">[p178, Table 12.1]</span> Identical to
Table 11.3.</p>
</li>
<li>
<p><span class="note-mistake">[p181, Figure 12.3]</span>
<code>ChunkedWriteHandler</code> is missing.</p>
</li>
<li>
<p><span class="note-question">[p183, Listing 12.4]</span> There the shutdown
of the chat server is realized via <code>Runtime.getRuntime().addShutdownHook()</code>.
Is this a recommended practice?</p>
</li>
<li>
<p><span class="note-mistake">[p189]</span> <em>Figure 14.1 presents a high-level
view of the …</em> → <em>Figure 13.1</em></p>
</li>
<li>
<p><span class="note-mistake">[p189]</span> <em>Listing 14.1 shows the details
of this simple POJO.</em> → <em>Listing 13.1</em></p>
</li>
<li>
<p><span class="note-improvement">[p190, Listing 13.1]</span> <code>received</code>
field is not used at all. Could be removed to increase clarity.
Interestingly, the field is not even encoded.</p>
</li>
<li>
<p><span class="note-mistake">[p191, Table 13.1]</span>
<code>extendsDefaultAddressedEnvelope</code> → <code>extends DefaultAddressedEnvelope</code></p>
</li>
<li>
<p><span class="note-mistake">[p191]</span> <em>Figure 14.2 shows the
broadcasting of three log …</em> → <em>Figure 13.2</em></p>
</li>
<li>
<p><span class="note-mistake">[p192]</span> <em>Figure 14.3 represents
a high-level view of the …</em> → <em>Figure 13.3</em></p>
</li>
<li>
<p><span class="note-improvement">[p192, Listing 13.2]</span> A <code>byte[] file</code>
and <code>byte[] msg</code> pair is encoded as follows:</p>
<pre><code class="language-java"><span class="n">buf</span><span class="o">.</span><span class="na">writeBytes</span><span class="o">(</span><span class="n">file</span><span class="o">);</span>
<span class="n">buf</span><span class="o">.</span><span class="na">writeBytes</span><span class="o">(</span><span class="n">LogEvent</span><span class="o">.</span><span class="na">SEPARATOR</span><span class="o">);</span>
<span class="n">buf</span><span class="o">.</span><span class="na">writeBytes</span><span class="o">(</span><span class="n">msg</span><span class="o">);</span></code></pre>
<p>Later on each entry is read back by splitting at <code>LogEvent.SEPARATOR</code>. What
if <code>file</code> contains <code>LogEvent.SEPARATOR</code>? I think this is a bad encoding
practice. I would rather do:</p>
<pre><code class="language-java"><span class="n">buf</span><span class="o">.</span><span class="na">writeInt</span><span class="o">(</span><span class="n">file</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
<span class="n">buf</span><span class="o">.</span><span class="na">writeBytes</span><span class="o">(</span><span class="n">file</span><span class="o">);</span>
<span class="n">buf</span><span class="o">.</span><span class="na">writeInt</span><span class="o">(</span><span class="n">msg</span><span class="o">.</span><span class="na">length</span><span class="o">);</span>
<span class="n">buf</span><span class="o">.</span><span class="na">writeBytes</span><span class="o">(</span><span class="n">msg</span><span class="o">);</span></code></pre>
</li>
<li>
<p><span class="note-question">[p194, Listing 13.3]</span> Is there a
constant for <code>255.255.255.255</code> broadcast address?</p>
</li>
<li>
<p><span class="note-mistake">[p195]</span> <em>Figure 14.4 depicts the
<code>ChannelPipeline</code> of the <code>LogEventonitor</code> …</em> → <em>Figure 13.4</em></p>
</li>
<li>
<p><span class="note-improvement">[p196]</span> Check this out:</p>
<blockquote>
<p>The <code>LogEventHandler</code> prints the <code>LogEvent</code>s in an easy-to-read
format that consists of the following:</p>
<ul>
<li>The received timestamp in milliseconds.</li>
</ul>
</blockquote>
<p>Really? I did not know epoch timestamps were <em>easy-to-read</em>. Maybe for some
definition of easy-to-read.</p>
</li>
<li>
<p><span class="note-mistake">[p195]</span> <em>Now we need to install our
handlers in the <code>ChannelPipeline</code>, as seen in figure 14.4.</em> →
<em>Figure 13.4</em></p>
</li>
<li>
<p><span class="note-mistake">[p205]</span> <em>Approach A, optimistic and
apparently simpler (figure 15.1)</em> → <em>figure 14.1</em></p>
</li>
<li>
<p><span class="note-improvement">[p206]</span> Half of the page is spent
justifying Droplr’s preference for approach B (safe and complex) over
approach A (optimistic and simpler). Call me an idiot, but I am not sold
on the arguments that approach A is less safe.</p>
</li>
<li>
<p><span class="note-mistake">[p207]</span> Type of <code>pipelineFactory</code>
is missing.</p>
</li>
<li>
<p><span class="note-improvement">[p210]</span> There is a bullet for
tuning JVM. This on its own could have been a really interesting chapter
of this book.</p>
</li>
<li>
<p><span class="note-other">[p213]</span> Firebase is indeed implementing
TCP-over-long-polling. I wonder if there exist any Java libraries that
implement user-level TCP over a certain channel abstraction.</p>
</li>
<li>
<p><span class="note-mistake">[p214]</span> <em>Figure 15.4 demonstrates
how the Firebase long-polling …</em> → <em>Figure 14.4</em></p>
</li>
<li>
<p><span class="note-mistake">[p215]</span> <em>Figure 15.5 illustrates
how Netty lets Firebase respond to …</em> → <em>Figure 14.5</em></p>
</li>
<li>
<p><span class="note-mistake">[p216]</span> <em>… can start as soon as
byes come in off the wire.</em> → <em>bytes</em></p>
</li>
<li>
<p><span class="note-mistake">[p217, Listing 14.3]</span> Last parenthesis
is missing:</p>
<pre><code class="language-scala"><span class="n">rxBytes</span> <span class="o">+=</span> <span class="n">buf</span><span class="o">.</span><span class="n">readableBytes</span><span class="o">(</span>
<span class="n">tryFlush</span><span class="o">(</span><span class="n">ctx</span><span class="o">)</span></code></pre>
</li>
<li>
<p><span class="note-improvement">[p217, Listing 14.3]</span> 70% of the
intro was about implementing a control flow over long polling, though the
shared code snippet is about something totally different and almost irrelevant.</p>
</li>
<li>
<p><span class="note-mistake">[p223]</span> <em>In referring to figure 15.1,
note that two paths …</em> → <em>figure 14.6</em></p>
</li>
<li>
<p><span class="note-mistake">[p229]</span> <em>This request/execution flow is
shown in figure 16.1.</em> → <em>figure 15.1</em></p>
</li>
<li>
<p><span class="note-mistake">[p230]</span> <em>Figure 16.2 shows how pipelined
requests are handled …</em> → <em>Figure 15.2</em></p>
</li>
<li>
<p><span class="note-mistake">[p230]</span> <em>…, in the required order. See
figure 16.3.</em> → <em>figure 15.3</em></p>
</li>
<li>
<p><span class="note-mistake">[p232]</span> <em>That simple flow (show in figure
16.4) works…</em> → <em>figure 15.4</em></p>
</li>
<li>
<p><span class="note-improvement">[p232]</span> <em>The client call is dispatched
to the Swift library, …</em> What is the Swift library? It was not explained anywhere.</p>
</li>
<li>
<p><span class="note-mistake">[p232]</span> <em>This is the flow shown in figure
16.5.</em> → <em>figure 15.5</em></p>
</li>
<li>
<p><span class="note-other">[p234]</span> This is a really interesting piece:</p>
<blockquote>
<p>Before <a href="https://github.com/facebook/nifty">Nifty</a>, many of our major Java
services at Facebook used an older, custom NIO-based Thrift server
implementation that works similarly to Nifty. That implementation is an
older codebase that had more time to mature, but because its asynchronous
I/O handling code was built from scratch, and because Nifty is built on
the solid foundation of Netty’s asynchronous I/O framework, it has had
many fewer problems.</p>
<p>One of our custom message queuing services had been built using the older
framework, and it started to suffer from a kind of socket leak. A lot of
connections were sitting around in <code>CLOSE_WAIT</code> state, meaning the server
had received a notification that the client had closed the socket, but the
server never reciprocated by making its own call to close the socket. This
left the sockets in a kind of <code>CLOSE_WAIT</code> limbo.</p>
<p>The problem happened very slowly; across the entire pool of machines
handling this service, there might be millions of requests per second,
but usually only one socket on one server would enter this state in an
hour. It wasn’t an urgent issue because it took a long time before a
server needed a restart at that rate, but it also complicated tracking
down the cause. Extensive digging through the code didn’t help much
either: initially several places looked suspicious, but everything
ultimately checked out and we didn’t locate the problem.</p>
</blockquote>
</li>
<li>
<p><span class="note-mistake">[p238]</span> <em>Figure 16.6 shows the
relationship between …</em> → <em>figure 15.6</em></p>
</li>
<li>
<p><span class="note-improvement">[p239, Listing 15.2]</span> All presented
Scala code in this chapter is over-complicated and the complexity does not
serve any purpose except wasting space and increasing cognitive load. For
instance, why does <code>ChannelConnector</code> extend
<code>(SocketAddress => Future[Transport[In, Out]])</code> rather than just being a
simple method?</p>
</li>
<li>
<p><span class="note-improvement">[p239]</span> <em>This factory is provided a
<code>ChannelPipelineFactory</code>, which is …</em> What is <em>this factory</em>?</p>
</li>
</ul>
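<p>To make the staged builder remark in the [p112] note concrete, here is a
minimal sketch in plain Java. Everything in it is hypothetical:
<code>ClientSettings</code>, its stage interfaces, and the string-typed fields are
stand-ins invented for illustration and are not part of Netty. The idea is
that each stage exposes only the next required call, so a missing
<code>group()</code> or <code>channel()</code> becomes a compile error rather than an
<code>IllegalStateException</code> at runtime.</p>
<pre><code class="language-java">// Hypothetical staged builder; none of these names come from Netty.
final class ClientSettings {

    // Each stage exposes only the next required call.
    interface GroupStage { ChannelStage group(String group); }
    interface ChannelStage { HandlerStage channel(String channel); }
    interface HandlerStage { FinalStage handler(String handler); }
    interface FinalStage { ClientSettings build(); }

    final String group;
    final String channel;
    final String handler;

    private ClientSettings(String group, String channel, String handler) {
        this.group = group;
        this.channel = channel;
        this.handler = handler;
    }

    // Entry point: the only thing a caller can do first is set the group.
    static GroupStage builder() {
        return new Stages();
    }

    // A single mutable object implements every stage; callers only ever
    // see it through the narrow interface of the current stage.
    private static final class Stages
            implements GroupStage, ChannelStage, HandlerStage, FinalStage {

        private String group;
        private String channel;
        private String handler;

        @Override public ChannelStage group(String group) { this.group = group; return this; }
        @Override public HandlerStage channel(String channel) { this.channel = channel; return this; }
        @Override public FinalStage handler(String handler) { this.handler = handler; return this; }
        @Override public ClientSettings build() { return new ClientSettings(group, channel, handler); }
    }
}</code></pre>
<p>With this shape,
<code>ClientSettings.builder().group("nio").channel("socket").handler("echo").build()</code>
compiles, whereas <code>ClientSettings.builder().build()</code> does not, since
<code>GroupStage</code> has no <code>build()</code> method. The price is one extra
interface per mandatory step.</p>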
<style type="text/css">
span.note-mistake {
color: red;
}
span.note-improvement {
color: orange;
}
span.note-question {
color: green;
}
span.note-other {
color: silver;
}
</style>
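<p>The length-prefixed encoding suggested in the [p192, Listing 13.2] note can
be sketched end to end as follows. This is an illustration under assumptions:
it uses the JDK’s <code>java.nio.ByteBuffer</code> as a stand-in for Netty’s
<code>ByteBuf</code> so that it runs without any dependency, and
<code>LengthPrefixedCodec</code> is a made-up name. Since every field is preceded
by its length, the decoder never searches for a separator, so the payload may
contain any byte sequence.</p>
<pre><code class="language-java">import java.nio.ByteBuffer;

// Hypothetical helper; ByteBuffer stands in for Netty's ByteBuf here.
final class LengthPrefixedCodec {

    // Layout: [int file length][file bytes][int msg length][msg bytes]
    static byte[] encode(byte[] file, byte[] msg) {
        ByteBuffer buf = ByteBuffer.allocate(
                Integer.BYTES + file.length + Integer.BYTES + msg.length);
        buf.putInt(file.length).put(file);
        buf.putInt(msg.length).put(msg);
        return buf.array();
    }

    // Returns a two-element array: the file bytes, then the msg bytes.
    static byte[][] decode(byte[] encoded) {
        ByteBuffer buf = ByteBuffer.wrap(encoded);
        byte[] file = new byte[buf.getInt()];
        buf.get(file);
        byte[] msg = new byte[buf.getInt()];
        buf.get(msg);
        return new byte[][] {file, msg};
    }
}</code></pre>
<p>A file name such as <code>a:b.log</code> round-trips intact through
<code>decode(encode(file, msg))</code>, no matter which bytes the separator
happens to consist of. A production decoder would additionally validate the
length fields against the remaining bytes before allocating.</p>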
<h1 id="conclusion">Conclusion</h1>
<p>In summary, <a href="https://www.manning.com/books/netty-in-action">Netty in Action</a>
is a book that I would recommend to everyone who wants to learn more about
Netty to use it in their applications. Almost the entire set of fundamental
Netty abstractions is covered in detail. The content is bliss for novice
users in the networking domain. Though this in return might make the book
uninteresting for people who have already got their hands pretty dirty with
the networking facilities available in the Java Platform. That being said, the
presented historical perspective and shared case studies are still pretty
attractive even for the most advanced users.</p>
<p>I don’t know much about the 2<sup>nd</sup> author of the book, Marvin Allen
Wolfthal. Though, the 1<sup>st</sup> author, Norman Maurer, is a pretty well-known
figure in the F/OSS ecosystem. If he manages to transfer more juice from his
experience and presentations to the book, I will definitely buy the
2<sup>nd</sup> print of the book too!</p>
tag:volkan.yazi.ci,2017-04-18://blog/post/2017/04/18/inter-service-comm/Inter-Microservice Communication Fatigue2017-04-18T17:22:00Z2017-04-18T17:22:00Z
<p>Let me get this straight: <strong>Every single line of code that needs to
communicate with a remote microservice is the most bizarre, annoying, sad, and
hopeless experience in my daily coding routine.</strong> And the worst is: Most of
the time it’s <em>my client</em> code communicating with <em>my services</em>, so there is no
one else to blame that would soothe my anger. But I did not end up here out of
the blue.</p>
<p>In my freshman years, I was given the responsibility of further development of
a microservice, where both the server and its driver (API models, HTTP client,
etc.) were written in Scala. Because Scala was still a cool buzzword back then
and the team wanted to experiment with it. (No, this won’t be a Scala FUD
post.) It was using an in-house built HTTP client, which is more or less yet
another buggy wrapper over an ancient version of <a href="https://github.com/ning/async-http-client">Ning
async-http-client</a>. I implemented a
(yes, another!) thin wrapper over it to expose the HTTP response models as
<code>scala.concurrent.Future</code>s, so we can compose them via Scala’s
for-comprehensions. (It did not take long for me to figure out that exposing
the API in Scala was one of the worst possible design decisions one could have
made in an ecosystem dominated by Java consumers.)</p>
<p>Later on as a team we adopted another critical microservice comprising a giant
Java client with Spring fizz and buzz, caching, Guava immutables all under the
hood with insanely strict <code>checkNotNull</code>/<code>checkArgument</code>-powered model
validation, etc. This comet created its own fan clubs. There are two types of
people in the company who are consuming this service:</p>
<ol>
<li>
<p>ones that bite the bullet and use the gigantic driver we
provide (say hello to a truck load of artifacts in your
not-sufficiently-sucking dependency hell) or</p>
</li>
<li>
<p>ones that prefer to implement their own HTTP driver hosting an empire of
bugs by manually building/parsing query request/response models formatted
in JSON/XML/Protobuf.</p>
</li>
</ol>
<p>Later on I said enough is enough! Let’s stick to a standard: JEE HTTP client,
that is, Jersey JAX-RS Client with Jackson cream on it. I still needed to
create all the API models and verify them myself every time. It was bearable
to some extent. But here comes the perfect storm: JAX-RS Client 2.0 (which
supports proper HTTP connection pooling with sanely configurable
socket+connection timeout support, which weren’t sanely available in 1.x)
requires <code>javax.ws.rs-api</code> 2.x, which is binary incompatible with 1.x, which
is used by 80% of microservices in the ecosystem. So in practice no other
microservice will be able to use my driver without its developer losing half
of their hair.</p>
<p>Later on I kept repeating “enough is enough”! Let’s use <a href="https://github.com/AsyncHttpClient/async-http-client">Google’s
async-http-client</a>. It
is pluggable all around the place: the HTTP connector (Apache HC, etc.),
marshaller (Jackson, Gson, etc.). The project is more or less undocumented.
But thanks to an army of Android users, there are plenty of blog posts and
source code to discover your own personal bugs, so you can keep on blog
posting about it. Anyway… It worked. I still need to motivate myself to dive
into the source code to comprehend how it works under the hood, but it worked.</p>
<p>Today… When I need to talk to one of these services I need to pick a lousy,
juicy, smelly sh*t of my preference:</p>
<ul>
<li>
<p>Inject the entire Scala milky way into your 1-file Java microservice, which
could have been delivered as a 5MB fat-JAR before Scala space-warping it
into 50MB. And don’t forget to pat your IDE on the back every time it needs
to auto-complete a Scala class. Oh, by the way, have you ever tried
accessing a <code>scala.Option</code> from Java? Boy! It is fun! I hope your service
consumers think likewise.</p>
</li>
<li>
<p>Let the giant Java driver bring all its feature-rich functionality together
with its cousins, its nephews, its uncle, its grandma, its grandpa, its
friends from the school, its ex, and of course with spring-core. All you
wanted is to make a <code>GET</code> to <code>/v1/user/<id></code>, but now you have the entire
Pivotal art gallery decorating your <code>mvn dependency:tree</code> output on the
wall.</p>
</li>
<li>
<p>You can of course use maven-shade-plugin to shade and relocate the
entire <code>javax.ws.rs-api</code>, Jersey dependencies, together with the entire
universe. I know you can do that.</p>
</li>
<li>
<p>Browse to Google’s <code>async-http-client</code> webpage and try to find the page that
explains how to make a simple fscking <code>GET</code> request.</p>
</li>
<li>
<p>Embrace the old Ning client wrapper, welcome bugs (the first time I needed
to use it I found out that <code>setHeaders()</code> wasn’t working as expected), stick
to JAX-RS 1.x and an ancient Netty version, which causes a JAR Hell with any
recent library, e.g., Elasticsearch. (Please refer back to
maven-shade-plugin item.)</p>
</li>
</ul>
<p>I can hear you shouting about compile-time generated HTTP clients based on
Swagger or WADL specs. But weren’t we just cursing WSDL and trying to run away
from it? <a href="https://square.github.io/retrofit/">Retrofit</a>?
<a href="https://twitter.github.io/finagle/">Finagle</a>? <a href="http://www.grpc.io/">gRPC</a>? I
bet it is a matter of time until you end up needing to consume two clients
which have transitive dependencies on two binary incompatible versions of
Retrofit/Finagle/gRPC. You can blame the Java class loader mechanism. But that
doesn’t make the problem fade away. Oh! I was just about to forget! Wait until
I migrate to <code>rx.Completable</code> from <code>rx.Single<Void></code>, which I migrated from
<code>rx.Observable<Void></code>.</p>
<p>I am exhausted and demotivated to write yet another single line of code that
needs to communicate with a remote microservice and which could have been a
simple fucking RPC. I don’t have a solution for the mud ball in my hands. Even
if I had one, I am not sure whether it would survive a couple of years or not.
But in the back of my head, I keep on cursing the Java Platform SE guys: How
difficult could it be to come up with a proper pluggable HTTP client? Compared
to <code>NPE</code>, Java’s HTTP client is not <em>the</em> billion dollar mistake, but a really
close one.</p>
tag:volkan.yazi.ci,2017-04-18://blog/post/2017/10/20/hazelcast-guice/Guice Integration in Hazelcast2017-04-18T17:22:00Z2017-04-18T17:22:00Z
<p>On many occasions I find the distributed <code>ExecutorService</code> of Hazelcast
(aka <code>IExecutorService</code>) pretty convenient for turning a set of nodes into a
tamed cluster waiting for orders. You just submit either a <code>Runnable</code> or a
<code>Callable<T></code> and Hazelcast takes care of the rest – executing the task on
remote members, delivering the response(s) back, etc. Note, though, that
since the task and its response are delivered over the wire, it is no
surprise that they both need to be <code>Serializable</code>.</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">com.hazelcast.core.Hazelcast</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.HazelcastInstance</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.IExecutorService</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.Member</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.MultiExecutionCallback</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.Serializable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.Map</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.concurrent.Callable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.concurrent.CompletableFuture</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.concurrent.TimeUnit</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">enum</span> <span class="n">HzGuiceDemo</span> <span class="o">{;</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">ProcessorCountTask</span> <span class="kd">implements</span> <span class="n">Serializable</span><span class="o">,</span> <span class="n">Callable</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="n">Integer</span> <span class="nf">call</span><span class="o">()</span> <span class="o">{</span>
<span class="k">return</span> <span class="n">Runtime</span><span class="o">.</span><span class="na">getRuntime</span><span class="o">().</span><span class="na">availableProcessors</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Throwable</span> <span class="o">{</span>
<span class="n">HazelcastInstance</span> <span class="n">hzInstance</span> <span class="o">=</span> <span class="n">Hazelcast</span><span class="o">.</span><span class="na">newHazelcastInstance</span><span class="o">();</span>
<span class="n">IExecutorService</span> <span class="n">hzExecutorService</span> <span class="o">=</span> <span class="n">hzInstance</span><span class="o">.</span><span class="na">getExecutorService</span><span class="o">(</span><span class="s">"ballpark"</span><span class="o">);</span>
<span class="n">CompletableFuture</span><span class="o"><</span><span class="n">Integer</span><span class="o">></span> <span class="n">totalProcessorCountFuture</span> <span class="o">=</span> <span class="k">new</span> <span class="n">CompletableFuture</span><span class="o"><>();</span>
<span class="n">hzExecutorService</span><span class="o">.</span><span class="na">submitToAllMembers</span><span class="o">(</span>
<span class="k">new</span> <span class="nf">ProcessorCountTask</span><span class="o">(),</span>
<span class="k">new</span> <span class="nf">MultiExecutionCallback</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">onResponse</span><span class="o">(</span><span class="n">Member</span> <span class="n">member</span><span class="o">,</span> <span class="n">Object</span> <span class="n">value</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Ignored.</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">onComplete</span><span class="o">(</span><span class="n">Map</span><span class="o"><</span><span class="n">Member</span><span class="o">,</span> <span class="n">Object</span><span class="o">></span> <span class="n">values</span><span class="o">)</span> <span class="o">{</span>
<span class="kt">int</span> <span class="n">totalProcessorCount</span> <span class="o">=</span> <span class="n">values</span>
<span class="o">.</span><span class="na">values</span><span class="o">()</span>
<span class="o">.</span><span class="na">stream</span><span class="o">()</span>
<span class="o">.</span><span class="na">mapToInt</span><span class="o">(</span><span class="n">object</span> <span class="o">-></span> <span class="o">(</span><span class="kt">int</span><span class="o">)</span> <span class="n">object</span><span class="o">)</span>
<span class="o">.</span><span class="na">sum</span><span class="o">();</span>
<span class="n">totalProcessorCountFuture</span><span class="o">.</span><span class="na">complete</span><span class="o">(</span><span class="n">totalProcessorCount</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">});</span>
<span class="kt">int</span> <span class="n">totalProcessorCount</span> <span class="o">=</span> <span class="n">totalProcessorCountFuture</span><span class="o">.</span><span class="na">get</span><span class="o">(</span><span class="mi">10</span><span class="o">,</span> <span class="n">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">);</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">format</span><span class="o">(</span><span class="s">"there are %d processors in total%n"</span><span class="o">,</span> <span class="n">totalProcessorCount</span><span class="o">);</span>
<span class="n">hzInstance</span><span class="o">.</span><span class="na">shutdown</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>Unfortunately, many of our tasks are not as isolated from the rest of the
application state (i.e., <em>stateless</em>) as the <code>ProcessorCountTask</code> given above.
Most of the time the functional requirements necessitate access to the remote
node state that is available through beans provided by the underlying
dependency injection framework. Consider the following stateful <code>PizzaService</code>
that is responsible for cooking pizzas for its users.</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">javax.inject.Singleton</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">static</span> <span class="n">com</span><span class="o">.</span><span class="na">google</span><span class="o">.</span><span class="na">common</span><span class="o">.</span><span class="na">base</span><span class="o">.</span><span class="na">Preconditions</span><span class="o">.</span><span class="na">checkArgument</span><span class="o">;</span>
<span class="nd">@Singleton</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">PizzaService</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">volatile</span> <span class="kt">int</span> <span class="n">totalPizzaCount</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">synchronized</span> <span class="kt">int</span> <span class="nf">cook</span><span class="o">(</span><span class="kt">int</span> <span class="n">amount</span><span class="o">)</span> <span class="o">{</span>
<span class="n">checkArgument</span><span class="o">(</span><span class="n">amount</span> <span class="o">></span> <span class="mi">0</span><span class="o">,</span> <span class="s">"expecting: amount > 0, found: %s"</span><span class="o">,</span> <span class="n">amount</span><span class="o">);</span>
<span class="n">totalPizzaCount</span> <span class="o">+=</span> <span class="n">amount</span><span class="o">;</span>
<span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">format</span><span class="o">(</span><span class="s">"🍕 cooking %d pizza(s)%n"</span><span class="o">,</span> <span class="n">amount</span><span class="o">);</span>
<span class="c1">// Return the running total, since the method is declared to return an int.</span>
<span class="k">return</span> <span class="n">totalPizzaCount</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>We further have a task class to remotely command <code>PizzaService</code> to cook:</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">java.io.Serializable</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.inject.Inject</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">PizzaCookTask</span> <span class="kd">implements</span> <span class="n">Serializable</span><span class="o">,</span> <span class="n">Runnable</span> <span class="o">{</span>
<span class="nd">@Inject</span>
<span class="kd">private</span> <span class="n">PizzaService</span> <span class="n">pizzaService</span><span class="o">;</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="kt">int</span> <span class="n">amount</span><span class="o">;</span>
<span class="kd">public</span> <span class="nf">PizzaCookTask</span><span class="o">(</span><span class="kt">int</span> <span class="n">amount</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">amount</span> <span class="o">=</span> <span class="n">amount</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
<span class="n">pizzaService</span><span class="o">.</span><span class="na">cook</span><span class="o">(</span><span class="n">amount</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>A naive approach to run this task on an <code>IExecutorService</code> would result in the
following code:</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">com.hazelcast.core.Hazelcast</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.HazelcastInstance</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.IExecutorService</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.concurrent.CompletableFuture</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.util.concurrent.TimeUnit</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">enum</span> <span class="n">HzGuiceDemo</span> <span class="o">{;</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span><span class="n">String</span><span class="o">[]</span> <span class="n">args</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">Throwable</span> <span class="o">{</span>
<span class="n">HazelcastInstance</span> <span class="n">hzInstance</span> <span class="o">=</span> <span class="n">Hazelcast</span><span class="o">.</span><span class="na">newHazelcastInstance</span><span class="o">();</span>
<span class="n">IExecutorService</span> <span class="n">hzExecutorService</span> <span class="o">=</span> <span class="n">hzInstance</span><span class="o">.</span><span class="na">getExecutorService</span><span class="o">(</span><span class="s">"ballpark"</span><span class="o">);</span>
<span class="n">hzExecutorService</span><span class="o">.</span><span class="na">executeOnAllMembers</span><span class="o">(</span><span class="k">new</span> <span class="nf">PizzaCookTask</span><span class="o">(</span><span class="mi">1</span><span class="o">));</span>
<span class="n">hzInstance</span><span class="o">.</span><span class="na">shutdown</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>which fails with a sweet <code>NullPointerException</code> as follows:</p>
<pre><code>Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at com.vlkan.hzguicedemo.HzGuiceDemo.main(HzGuiceDemo.java:??)
Caused by: java.util.concurrent.ExecutionException: java.lang.NullPointerException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.hazelcast.executor.DistributedExecutorService$CallableProcessor.run(DistributedExecutorService.java:189)
at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:186)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
Caused by: java.lang.NullPointerException
at com.vlkan.hzguicedemo.HzGuiceDemo$PizzaCookTask.call(HzGuiceDemo.java:??)
at com.vlkan.hzguicedemo.HzGuiceDemo$PizzaCookTask.call(HzGuiceDemo.java:??)
</code></pre>
<p>What is really happening here is that Hazelcast does not have a crystal ball
to guess the dependency injection framework you are using to process the
<code>@Inject</code>-annotated fields of the <code>PizzaCookTask</code>. Hazelcast does have
something else, though:
<a href="http://docs.hazelcast.org/docs/2.3/manual/html/ch14s02.html">ManagedContext</a>.
In a nutshell, <code>ManagedContext</code> provides means to intercept class
instantiation at deserialization. We can leverage this functionality to come
up with a <code>ManagedContext</code> implementation that bakes Guice dependency
injection into the Hazelcast class instantiation process.</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">com.google.inject.Injector</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.hazelcast.core.ManagedContext</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.inject.Inject</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.inject.Singleton</span><span class="o">;</span>
<span class="nd">@Singleton</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">HazelcastGuiceManagedContext</span> <span class="kd">implements</span> <span class="n">ManagedContext</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">final</span> <span class="n">Injector</span> <span class="n">injector</span><span class="o">;</span>
<span class="nd">@Inject</span>
<span class="kd">public</span> <span class="nf">HazelcastGuiceManagedContext</span><span class="o">(</span><span class="n">Injector</span> <span class="n">injector</span><span class="o">)</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="na">injector</span> <span class="o">=</span> <span class="n">injector</span><span class="o">;</span>
<span class="o">}</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="n">Object</span> <span class="nf">initialize</span><span class="o">(</span><span class="n">Object</span> <span class="n">instance</span><span class="o">)</span> <span class="o">{</span>
<span class="n">injector</span><span class="o">.</span><span class="na">injectMembers</span><span class="o">(</span><span class="n">instance</span><span class="o">);</span>
<span class="k">return</span> <span class="n">instance</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>Next, all you need to do is use this <code>ManagedContext</code> while creating your
<code>HazelcastInstance</code>:</p>
<pre><code class="language-java"><span class="n">Injector</span> <span class="n">injector</span> <span class="o">=</span> <span class="n">Guice</span><span class="o">.</span><span class="na">createInjector</span><span class="o">();</span>
<span class="n">HazelcastGuiceManagedContext</span> <span class="n">guiceManagedContext</span> <span class="o">=</span> <span class="n">injector</span><span class="o">.</span><span class="na">getInstance</span><span class="o">(</span><span class="n">HazelcastGuiceManagedContext</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="n">Config</span> <span class="n">hzConfig</span> <span class="o">=</span> <span class="k">new</span> <span class="nf">Config</span><span class="o">();</span>
<span class="n">hzConfig</span><span class="o">.</span><span class="na">setManagedContext</span><span class="o">(</span><span class="n">guiceManagedContext</span><span class="o">);</span>
<span class="n">HazelcastInstance</span> <span class="n">hzInstance</span> <span class="o">=</span> <span class="n">Hazelcast</span><span class="o">.</span><span class="na">newHazelcastInstance</span><span class="o">(</span><span class="n">hzConfig</span><span class="o">);</span></code></pre>
<p>While I have provided an example for Guice, this method is applicable to any
dependency injection framework that provides an equivalent to
<code>Injector#injectMembers()</code> of Guice. Needless to say, Spring folks are
already covered by <code>SpringManagedContext</code> shipped with Hazelcast.</p>
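<p>To see the mechanism in isolation, here is a minimal sketch that needs neither
Hazelcast nor Guice on the classpath. Everything in it is a hypothetical
stand-in: <code>@Wire</code> plays the role of <code>@Inject</code>, <code>TinyInjector</code> plays the role
of the Guice <code>Injector</code>, and <code>roundTrip()</code> simulates the wire with plain Java
serialization. It only illustrates the idea of running an initialization hook
right after deserialization; it is not how Hazelcast implements it.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class ManagedContextSketch {

    /** Hypothetical stand-in for javax.inject.Inject. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Wire {}

    /** Node-local state that must never travel over the wire. */
    public static class PizzaService {
        private int totalPizzaCount = 0;
        public synchronized int cook(int amount) {
            totalPizzaCount += amount;
            return totalPizzaCount;
        }
        public synchronized int getTotalPizzaCount() {
            return totalPizzaCount;
        }
    }

    public static class PizzaCookTask implements Serializable, Runnable {
        private static final long serialVersionUID = 1L;
        // Marked transient in this sketch to make explicit that the service is not serialized.
        @Wire
        private transient PizzaService pizzaService;
        private final int amount;
        public PizzaCookTask(int amount) {
            this.amount = amount;
        }
        @Override
        public void run() {
            pizzaService.cook(amount);
        }
    }

    /** Hypothetical stand-in for a DI container: maps a type to a singleton. */
    public static class TinyInjector {
        private final Map<Class<?>, Object> bindings = new HashMap<>();
        public <T> void bind(Class<T> type, T instance) {
            bindings.put(type, instance);
        }
        /** Same shape as ManagedContext#initialize(Object): fill the annotated fields. */
        public Object initialize(Object instance) {
            for (Field field : instance.getClass().getDeclaredFields()) {
                if (field.isAnnotationPresent(Wire.class)) {
                    try {
                        field.setAccessible(true);
                        field.set(instance, bindings.get(field.getType()));
                    } catch (IllegalAccessException error) {
                        throw new IllegalStateException("injection failed", error);
                    }
                }
            }
            return instance;
        }
    }

    /** Simulates the wire: serialize the object, then deserialize a fresh copy. */
    public static Object roundTrip(Serializable object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException error) {
            throw new IllegalStateException("round trip failed", error);
        }
    }

    public static void main(String[] args) {
        PizzaService service = new PizzaService();
        TinyInjector injector = new TinyInjector();
        injector.bind(PizzaService.class, service);

        // Without the hook, the deserialized task has nobody to fill in its service.
        Runnable naive = (Runnable) roundTrip(new PizzaCookTask(2));
        try {
            naive.run();
        } catch (NullPointerException error) {
            System.out.println("without the hook: NullPointerException");
        }

        // With the hook, the receiving side wires the node-local service in first.
        Runnable managed = (Runnable) injector.initialize(roundTrip(new PizzaCookTask(2)));
        managed.run();
        System.out.println("with the hook, pizzas cooked: " + service.getTotalPizzaCount());
    }
}
```

<p>The point of the design is that the task carries only its own data (<code>amount</code>)
over the wire, whereas the node-local collaborators are attached on the
receiving side.</p>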
tag:volkan.yazi.ci,2016-10-04://blog/post/2016/10/04/coders-at-work/Notes on "Coders at Work"2016-10-04T18:40:00Z2016-10-04T18:40:00Z
<p>There is nothing like thinking about work while you are on vacation. And that
was indeed what I did: reading <a href="http://www.codersatwork.com/">Coders at Work: Reflections on the Craft of
Programming</a> on a tango-themed Buenos Aires
trip.</p>
<p>I had already met Peter Seibel through his well-known, splendid work:
<a href="http://www.gigamonkeys.com/book/">Practical Common Lisp</a>. He definitely has a
knack for transforming technically challenging problems into a pedagogically
digestible <a href="https://en.wikipedia.org/wiki/Nootropic">nootropic</a>. This uncanny
ability was vividly on display in Coders at Work as well. Motivated by the
thought-provoking, exquisite content, I felt an urge to keep a record of the
impressions it left on me.</p>
<h1 id="on-the-content">On the Content</h1>
<p>I totally enjoyed the book and read it cover to cover. Nevertheless, I believe
the following subtleties could have been addressed in a better way.</p>
<ul>
<li>
<p>A majority of the interviewed <em>coders</em> are not actively <em>coding</em> any more. I
find this a little bit at odds with the title of the book. While the
content still tells a great deal about the historical progress of
programming and programmers, I find the detachment of the interviewees from
modern computing slightly unexpected.</p>
</li>
<li>
<p>Given the fact that the book examines events dating back more than
half a century, I sometimes found myself lost in the time context. Additional
footnotes to resolve these kinds of ambiguities would have been useful.</p>
</li>
</ul>
<h1 id="highligts">Highlights</h1>
<p>Below I collected my personal highlights on certain statements that are shared
by certain interviewees.</p>
<ul>
<li>
<p>In general, <em>coders</em> do not practice software testing extensively. Further,
I had the impression that they do not read many programming books
either. This is sometimes attributed to the lack of a necessary set of
fundamental books in the early stages of their careers.</p>
</li>
<li>
<p>Among the entire deck, I find Joshua Bloch, Bernie Cosell, and Donald Knuth
the ones with the most sensible and down-to-earth statements.</p>
</li>
<li>
<p>A notable subset of the interviewees were dragged into computing not by a
deliberate decision, but by choosing yet another career path that was
available to them. (For instance, Fran Allen took a Fortran instructor
position at IBM in order to pay off her school loans and be able to continue
pursuing her career as a math teacher.)</p>
</li>
<li>
<p>None believes that Knuth’s <a href="https://en.wikipedia.org/wiki/The_Art_of_Computer_Programming">The Art of Computer
Programming</a>
is a must-read for programmers; nevertheless, they acknowledge that it is
good to have it at hand for reference.</p>
</li>
<li>
<p>Except Knuth himself, nobody practices literate programming. (I am astounded
to observe how Seibel insists on asking every single interviewee this
nonsense question, which delivers no practical value at all.)</p>
</li>
<li>
<p>The majority agrees that programming is a far more complicated and challenging
occupation than it used to be.</p>
</li>
<li>
<p>More than half think that good writing skills are a big plus (some even
call them a necessity) for programming.</p>
</li>
<li>
<p><code>printf</code> is the clear winner as the debugging tool of preference among
interviewees.</p>
</li>
</ul>
<h1 id="quotes">Quotes</h1>
<p>Below you can find some snippets that I find worth mentioning from the book.</p>
<h2 id="jamie-zawinski">Jamie Zawinski</h2>
<p>I wish I had known this when I was in high school. Could not agree
more.</p>
<blockquote>
<p><strong>Zawinski:</strong> When you’re in high school, everyone tells you, “There’s a lot
of repetitive bullshit and standardized tests; it’ll all be better once
you’re in college.” And then you get to your first year of college and
they’re like, “Oh, no – it gets better when you’re in grad school.” So it’s
just same shit, different day – I couldn’t take it. [p5]</p>
</blockquote>
<p>His comments on C++, which are shared by many other interviewees throughout
the book:</p>
<blockquote>
<p><strong>Zawinski:</strong> … when you’re programming C++ no one can ever agree on which
ten percent of the language is safe to use. [p20]</p>
</blockquote>
<p>The sad truth about F/OSS:</p>
<blockquote>
<p><strong>Seibel:</strong> Isn’t it exactly this thing – someone comes along and says, “I
can’t understand this stuff. I’ll just rewrite it” – that leads to the
endless rewriting you bemoan in open-source development?</p>
<p><strong>Zawinski:</strong> Yeah. But there’s also another aspect of that which is,
efficiency aside, it’s just more fun to write your own code than to figure
out someone else’s. So it is easy to understand why that happens. But the
whole Linux/GNOME side of things is straddling this line between someone’s
hobby and a product. Is this a research project where we’re deciding what
desktops should look like and we’re experimenting? Or are we competing with
Macintosh? Which is it? Hard to do both. [p23]</p>
</blockquote>
<h2 id="brad-fitzpatrick">Brad Fitzpatrick</h2>
<p>His thoughts on finishing a project, which I sadly share as well:</p>
<blockquote>
<p><strong>Fitzpatrick:</strong> The projects that I never finish … it’s because I did the
hard part and I learned what I wanted to learn and I never got around to
doing the boring stuff. [p20]</p>
</blockquote>
<p>He is also poisoned by LWN, Reddit, etc.</p>
<blockquote>
<p><strong>Fitzpatrick:</strong> I like working alone but I just bounce all over the place
when I do. On a plane I’ll bring extra laptop batteries and I have a whole
development environment with local web servers and I’ll be in a web browser,
testing stuff. But I’ll still be hitting new tabs, and typing “reddit” or
“lwn” – sites I read. Autocomplete and hit Enter, and then – error
message. I’ll do this multiple times within a minute. Holy fuck! Do I do
this at work? Am I reading web sites this often that I don’t even think about
it? It’s scary. I had a friend, who had some iptables rule, that on
connection to a certain IP address between certain hours of the day would
redirect to a “You should be working” page. I haven’t got around to doing
that, but I need to do something like it, probably. [p73]</p>
</blockquote>
<h2 id="douglas-crockford">Douglas Crockford</h2>
<p>Why is programming difficult?</p>
<blockquote>
<p><strong>Crockford:</strong> Part of what makes programming difficult is most of the time
we’re doing stuff we’ve never done before. [p110]</p>
</blockquote>
<p>He talks about his preferred way for interviewing job candidates, which is also
shared by other coders in the book.</p>
<blockquote>
<p><strong>Crockford:</strong> The approach I’ve taken now is to do a code reading. I invite
the candidate to bring in a piece of code he’s really proud of and walk us
through it. [p129]</p>
</blockquote>
<h2 id="brendan-eich">Brendan Eich</h2>
<p>Nothing noteworthy, you may guess why.</p>
<h2 id="joshua-bloch">Joshua Bloch</h2>
<p>Is Java off in the weeds?</p>
<blockquote>
<p><strong>Seibel:</strong> … is Java off in the weeds a little bit? Is it getting more
complex faster than it’s getting better?</p>
<p><strong>Bloch:</strong> That’s a very difficult question. In particular, the Java 5
changes added far more complexity than we ever intended. I had no
understanding of just how much complexity generics and, in particular,
wildcards were going to add to the language. I have to give credit where credit is
due – Graham Hamilton did understand this at the time and I didn’t.</p>
<p>The funny thing is, he fought against it for years, trying to keep generics
out of the language. But the notion of variance – the idea behind wildcards
– came into fashion during the years when generics were successfully being
kept out of Java. If they had gone in earlier, without variance, we might
have had a simpler, more tractable language today.</p>
<p>That said, there are real benefits to wildcards. There’s a fundamental
impedance mismatch between subtyping and generics, and wildcards go a long
way towards rectifying the mismatch. But at a significant cost in terms of
complexity. There are some people who believe that declaration-site, as
opposed to use-site, variance is a better solution, but I’m not so sure.</p>
<p>The jury is basically still out on anything that hasn’t been tested by a
huge quantity of programmers under real-world conditions. Often languages
only succeed in some niche and people say, “Oh, they’re great and it’s such
a pity they didn’t become the successful language in the world.” But often
there are reasons they didn’t. Hopefully some language that does use
declaration-site variance, like Scala or C# 4.0, will answer this question
once and for all. [p191]</p>
</blockquote>
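<p>For readers who have not bumped into it, the mismatch Bloch refers to can be
reproduced in a few lines. The following is a minimal sketch of mine, not
something from the book; the names <code>VarianceSketch</code>, <code>sum</code>, and <code>fill</code> are
made up for illustration. It shows Java’s use-site variance: the wildcard is
written where the type is <em>used</em>, in each method signature.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VarianceSketch {

    // Producer: we only read Numbers out, so a List of any Number subtype is accepted.
    public static double sum(List<? extends Number> numbers) {
        double total = 0;
        for (Number number : numbers) {
            total += number.doubleValue();
        }
        return total;
    }

    // Consumer: we only write Integers in, so a List of any Integer supertype is accepted.
    public static void fill(List<? super Integer> sink, int count) {
        for (int i = 1; i <= count; i++) {
            sink.add(i);
        }
    }

    public static void main(String[] args) {
        List<Integer> integers = Arrays.asList(1, 2, 3);
        // The subtyping intuition "an Integer is a Number" does not carry over to lists:
        // List<Number> numbers = integers;   // does not compile, generics are invariant
        System.out.println(sum(integers));    // accepted thanks to "? extends Number"

        List<Number> numbers = new ArrayList<>();
        fill(numbers, 3);                     // accepted thanks to "? super Integer"
        System.out.println(sum(numbers));
    }
}
```

<p>With declaration-site variance, the annotation would instead sit once on the
type declaration itself, as in Scala’s <code>List[+A]</code>, rather than being repeated in
every signature that uses the type.</p>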
<p>On “obviously no deficiencies” versus “no obvious deficiencies”:</p>
<blockquote>
<p><strong>Bloch:</strong> There’s a brilliant quote by Tony Hoare in his Turing Award
speech about how there are two ways to design a system: “One way is to make
it so simple that there are <em>obviously</em> no deficiencies and the other way is
to make it so complicated that there are no <em>obvious</em> deficiencies.”</p>
<p>The paragraph that follows is equally brilliant, though it isn’t as
well-known: “The first method is far more difficult. It demands the same
skill, devotion, insight, and even inspiration as the discovery of the
simple physical laws which underlie the complex phenomena of nature. It also
requires a willingness to accept objectives which are limited by physical,
logical, and technological constraints, and to accept a compromise when
conflicting objectives cannot be met. No committee will ever do this until
it is too late.” [p197]</p>
</blockquote>
<p>Smart people and programming:</p>
<blockquote>
<p><strong>Seibel:</strong> Speaking of writing intricate code, I’ve noticed that people who
are too smart, in a certain dimension anyway, make the worst code. Because
they can actually fit the whole thing in their head they can write these
great reams of spaghetti code.</p>
<p><strong>Bloch:</strong> I agree with you that people who are both smart enough to cope
with enormous complexity and lack empathy with the rest of us may fall prey
to that. They think, “I can understand this and I can use it, so it has to
be good.” [p202]</p>
<p>…</p>
<p>There’s this problem, which is, programming is so much of an intellectual
meritocracy and often these people are the smartest people in the
organization; therefore they figure they should be allowed to make all the
decisions. But merely the fact that they’re the smartest people in the
organization doesn’t mean they should be making all the decisions, because
intelligence is not a scalar quantity; it’s a vector quantity. And if you
lack empathy or emotional intelligence, then you shouldn’t be designing APIs
or GUIs or languages. [p203]</p>
</blockquote>
<h2 id="joe-armstrong">Joe Armstrong</h2>
<p>On paralysis of choice:</p>
<blockquote>
<p><strong>Armstrong:</strong> The funny thing is, thinking back, I don’t think all these
modern gizmos actually make you any more productive. Hierarchical file
systems – how do they make you more productive? Most of software
development goes on in your head anyway. I think having worked with that
simpler system imposes a kind of disciplined way of thinking. If you haven’t
got a directory system and you have to put all the files in one directory,
you have to be fairly disciplined. If you haven’t got a revision control
system, you have to be fairly disciplined. Given that you apply that
discipline to what you’re doing it doesn’t seem to me to be any better to
have hierarchical file systems and revision control. They don’t solve the
fundamental problem of solving your problem. They probably make it easier
for groups of people to work together. For individuals I don’t see any
difference.</p>
<p>Also, I think today we’re kind of overburdened by choice. I mean, I just had
Fortran. I don’t think we even had shell scripts. We just had batch files so
you could run things, a compiler, and Fortran. And assembler possibly, if
you really needed it. So there wasn’t this agony of choice. Being a young
programmer today must be awful – you can choose 20 different programming
languages, dozens of frameworks and operating systems and you’re paralyzed by
choice. There was no paralysis of choice then. You just start doing it
because the decision as to which language and things is just made – there’s
no thinking about what you should do, you just go and do it. [p210]</p>
</blockquote>
<h2 id="simon-peyton-jones">Simon Peyton Jones</h2>
<p>Testing an API at Microsoft:</p>
<blockquote>
<p><strong>Peyton Jones:</strong> Well, they also do some interesting work on testing APIs.
Steven Clarke and his colleagues at Redmond have made systematic attempts to
watch programmers, given a new API, talk through what they’re trying to do.
And they get the people who designed the API to sit behind a glass screen
and watch them.</p>
<p>And the guys sitting there behind the glass screen say, “No, no, don’t do
that! That’s not the right way!” But it’s soundproof. That turns out often
to be very instructive. They go and change their API. [p253]</p>
</blockquote>
<h2 id="peter-norvig">Peter Norvig</h2>
<p>On the traditional master and apprentice approach:</p>
<blockquote>
<p><strong>Norvig:</strong> But I think part of the reasons why you had master and
apprentice is because the materials were rarer. When you were doing
goldsmithing, there’s only so much gold. Or when the surgeon’s operating,
there’s only one heart, and so you want the best person on that and you want
the other guys just helping. With coding, it’s not like that. You’ve got
plenty of terminals. You’ve got plenty of keyboards. You don’t have to
ration it. [p295]</p>
</blockquote>
<p>Why programming is not an art, but a craft:</p>
<blockquote>
<p><strong>Seibel:</strong> As a programmer, do you consider yourself a scientist, an
engineer, an artist, or a craftsman?</p>
<p><strong>Norvig:</strong> Well, I know when you compare the various titles of books and so
on, I always thought the “craft” was the right answer. So I thought art was
a little pretentious because the purpose of art is to be beautiful or to
have an emotional contact or emotional impact, and I don’t feel like that’s
anything that I try to do. Certainly I want programs to be pretty in some
ways, and sometimes I feel like I spend too much time doing that. I’ve been
in a position where I’ve had the luxury to say, “Gee, I have time to go back
and pretty this up a little bit.” And places where I’ve been able to write
for a publication, you spend more time doing that than you would if it was
just for your own professional growth.</p>
<p>But I don’t think of that as art. I think <em>craft</em> is really the right word
for it. You can make a chair, and it’s good looking, but it’s mostly
functional – it’s a chair. [p319]</p>
</blockquote>
<h2 id="guy-steele">Guy Steele</h2>
<p>On the difficulty of getting a program right:</p>
<blockquote>
<p><strong>Steele:</strong> I’ll give you another example – suppose I were to tell my smart
computer, “OK, I’ve got this address book and I want the addresses to always
be in sorted order,” and it responds by throwing away everything but the
first entry. Now the address book is sorted. But that’s not what you wanted.
It turns out that just specifying something as simple as “a list is in
sorted order and I haven’t lost any of the data and nothing has been
duplicated” is actually a fairly tricky specification to write. [p361]</p>
</blockquote>
<h2 id="dan-ingalls">Dan Ingalls</h2>
<p>The interview was a nice read, though I could not find anything particularly
worth sharing. Nevertheless, along the way Seibel mentions something that I
had never heard before:</p>
<blockquote>
<p><strong>Seibel:</strong> Alan Kay has said that both Lisp and Smalltalk have the problem
that they’re so good they eat their children. If you had known Lisp, then
Smalltalk would have been the first eaten child. [p378]</p>
</blockquote>
<h2 id="l-peter-deutsch">L Peter Deutsch</h2>
<p>On getting data structures right:</p>
<blockquote>
<p><strong>Deutsch:</strong> … if you get the data structures and their invariants right,
most of the code will just kind of write itself. [p420]</p>
</blockquote>
<p>Conceptualization of software and memory pointers:</p>
<blockquote>
<p><strong>Deutsch:</strong> … I don’t look around and see anything that looks like an
address or a pointer. We have objects; we don’t have these weird things that
computer scientists misname “objects.”</p>
<p><strong>Seibel:</strong> To say nothing of the scale. Two to the 64th of anything is a
lot, and things happening billions of times a second is fast.</p>
<p><strong>Deutsch:</strong> But that doesn’t bother us here in the real world. You know
Avogadro’s number, right? Ten to the 23rd? So, we’re looking here around at
a world that has incredible numbers of little things all clumped together
and happening at the same time. It doesn’t bother us because the world is
such that you don’t have to understand this table at a subatomic level. The
physical properties of matter are such that 99.9 percent of the time you can
understand it in aggregate. And everything you have to know about it, you
can understand from dealing with it in aggregate. To a great extent, that is
not true in the world of software.</p>
<p>People keep trying to do modularization structures for software. And the
state of that art has been improving over time, but it’s still, in my
opinion, very far away from the ease with which we look around and see
things that have, whatever it is, 10 to the 23rd atoms in them, and it
doesn’t even faze us.</p>
<p>Software is a discipline of detail, and that is a deep, horrendous
fundamental problem with software. Until we understand how to conceptualize
and organize software in a way that we don’t have to think about how every
little piece interacts with every other piece, things are not going to get a
whole lot better. And we’re very far from being there. [p424]</p>
</blockquote>
<h2 id="ken-thompson">Ken Thompson</h2>
<p>On teaching:</p>
<blockquote>
<p><strong>Thompson:</strong> … I love the teaching: the hard work of a first class, the
fun of the second class. Then the misery of the third. [p455]</p>
</blockquote>
<p>What I am supposed to do and what I am actually doing:</p>
<blockquote>
<p><strong>Thompson:</strong> We were supposed to be doing basic research but there was some
basic research we should be doing and some basic research we shouldn’t be
doing. And just coming out of the ashes of MULTICS, operating systems was
one of those basic research things we shouldn’t be doing. Because we tried
it, it didn’t work, it was a huge failure, it was expensive; let’s drop it.
So I kind of expected that for what I was doing I was going to eventually
get fired. I didn’t. [p458]</p>
</blockquote>
<p>Code rots:</p>
<blockquote>
<p><strong>Thompson:</strong> Code by itself almost rots and it’s gotta be rewritten. Even
when nothing has changed, for some reason it rots. [p460]</p>
</blockquote>
<p>10 percent of the work:</p>
<blockquote>
<p><strong>Thompson:</strong> NELIAC was a system-programming version of Algol 58.</p>
<p><strong>Seibel:</strong> Was Bliss also from that era?</p>
<p><strong>Thompson:</strong> Bliss I think was after. And their emphasis was trying to
compile well. I think it was pretty clear from the beginning that you
shouldn’t kill yourself compiling well. You should do well but not really
good. And the reason is that in the time it takes you to go from well to
really good, Moore’s law has already surpassed you. You can pick up 10
percent but while you’re picking up that 10 percent, computers have gotten
twice as fast and maybe with some other stuff that matters more for
optimization, like caches. I think it’s largely a waste of time to do really
well. It’s really hard; you generate as many bugs as you fix. You should
stop, not take that extra 100 percent of time to do 10 percent of the work.
[p462]</p>
</blockquote>
<p>Writing an OS to test a file system:</p>
<blockquote>
<p><strong>Seibel:</strong> So you basically wrote an OS so you’d have a better environment
to test your file system.</p>
<p><strong>Thompson:</strong> Yes. Halfway through there that I realized it was a real
time-sharing system. I was writing the shell to drive the file system. And then I
was writing a couple other programs that drove the file system. And right
about there I said, “All I need is an editor and I’ve got an operating
system.” [p465]</p>
</blockquote>
<p>Economics of deciding on introducing a bug:</p>
<blockquote>
<p><strong>Thompson:</strong> Certainly every time I’ve written one of these non-compare
subroutine calls, strcpy and stuff like that, I know that I’m writing a bug.
And I somehow take the economic decision of whether the bug is worth the
extra arguments. [p468]</p>
</blockquote>
<p>On testing:</p>
<blockquote>
<p><strong>Thompson:</strong> … Mostly just regression tests.</p>
<p><strong>Seibel:</strong> By things that are harder to test, you mean things like device
drivers or networking protocols?</p>
<p><strong>Thompson:</strong> Well, they’re run all the time when you’re actually running an
operating system.</p>
<p><strong>Seibel:</strong> So you figure you’ll shake the bugs out that way?</p>
<p><strong>Thompson:</strong> Oh, absolutely. I mean, what’s better as a test of an
operating system than people beating on it? [p469]</p>
</blockquote>
<p>Code at Google:</p>
<blockquote>
<p><strong>Thompson:</strong> I guess way more than 50 percent of the code is the what-if
kind. [p473]</p>
</blockquote>
<p>On literate programming:</p>
<blockquote>
<p><strong>Seibel:</strong> When I interviewed him, Knuth said the key to technical writing
is to say everything twice in complementary ways. So I think he sees that as
a feature of literate programming, not a bug.</p>
<p><strong>Thompson:</strong> Well if you have two ways, one of them is real: what the
machine executes. [p477]</p>
</blockquote>
<h2 id="fran-allen">Fran Allen</h2>
<p>What makes a program beautiful?</p>
<blockquote>
<p><strong>Allen:</strong> That it is a simple straightforward solution to a problem; that
has some intrinsic structure and obviousness about it that isn’t obvious
from the problem itself. [p489]</p>
</blockquote>
<h2 id="bernie-cosell">Bernie Cosell</h2>
<p>Should we teach Knuth to students?</p>
<blockquote>
<p><strong>Cosell:</strong> I would not teach students Knuth per se for two reasons. First,
it’s got all this mathematical stuff where he’s not just trying to present
the algorithms but to derive whether they’re good or bad. I’m not sure you
need that. I understand a little bit of it and I’m not sure I need any of
it. But getting a feel for what’s fast and what’s slow and when, that’s an
important thing to do even if you don’t know how much faster or how much
slower.</p>
<p>The second problem is once students get sensitive to that, they get too
clever by half. They start optimizing little parts of the program because,
“This is the ideal place to do an AB unbalanced 2-3 double reverse backward
pointer cube thing and I always wanted to write one of those.” So they spend
a week or two tuning an obscure part of a program that doesn’t need
anything, which is now more complicated and didn’t make the program any
better. So they need a tempered understanding that there are all these
algorithms, how they work, and how to apply them. It’s really more of a case
of how to pick the right one for the job you’re trying to do as opposed to
knowing that this one is an order n-cubed plus three and this one is just
order n-squared times four. [p527]</p>
</blockquote>
<p>Writing programs and learning how to program:</p>
<blockquote>
<p><strong>Cosell:</strong> The binary bits are what computers want and the text file is for
me. I would get people – bright, really good people, right out of college,
tops of their classes – on one of my projects. And they would know all
about programming and I would give them some piece of the project to work
on. And we would start crossing swords at our project-review meetings. They
would say, “Why are you complaining about the fact that I have my global
variables here, that I’m not doing this, that you don’t like the way the
subroutines are laid out? The program works.”</p>
<p>They’d be stunned when I tell them, “I don’t care that the program works.
The fact that you’re working here at all means that I expect you to be able
to write programs that work. Writing programs that work is a skilled craft
and you’re good at it. Now, you have to learn how to program.” [p543]</p>
</blockquote>
<p>Convictions:</p>
<blockquote>
<p><strong>Cosell:</strong> I had two convictions, which actually served me well: that
programs ought to make sense and there are very, very few inherently hard
problems. [p549]</p>
</blockquote>
<p>How long is it going to take you to put this change in?</p>
<blockquote>
<p><strong>Cosell:</strong> So when they ask, “How long is it going to take you to put this
change in?” you have three answers. The first is the absolute shortest way,
changing the one line of code. The second answer is how long it would be
using my simple rule of rewriting the subroutine as if you were not going to
make that mistake. Then the third answer is how long if you fix that bug if
you were actually writing this subroutine in the better version of the
program. [p550]</p>
</blockquote>
<p>Artistry in programming:</p>
<blockquote>
<p><strong>Cosell:</strong> Part of what I call the artistry of the computer program is how
easy it is for future people to be able to change it without breaking it.
[p555]</p>
</blockquote>
<p>Difficulty of programming and C:</p>
<blockquote>
<p><strong>Cosell:</strong> … programmers just can’t be careful enough. They don’t see all
the places. And C makes too many places. Too scary for me, and I guess it’s
fair to say I’ve programmed C only about five years less than Ken has. We’re
not in the same league, but I have a long track record with C and know how
difficult it is and I think C is a big part of the problem. [p559]</p>
</blockquote>
<p>75 million run-of-the-mill programmers and Java:</p>
<blockquote>
<p><strong>Cosell:</strong> When I first messed with Java – this was when it was little
baby language, of course – I said, “Oh, this is just another one of those
languages to help not-so-good programmers go down the straight and narrow by
restricting what they can do.” But maybe we’ve come to a point where that’s
the right thing. Maybe the world has gotten so dangerous you can’t have a
good, flexible language that one percent or two percent of the programmers
will use to make great art because the world is now populated with 75
million run-of-the-mill programmers building these incredibly complicated
applications and they need more help than that. So maybe Java’s the right
thing. I don’t know. [p560]</p>
</blockquote>
<p>Not-so-good programmers and C:</p>
<blockquote>
<p><strong>Cosell:</strong> I don’t want to say that C has outlived its usefulness, but I
think it was used by too many good programmers so that now not-good-enough
programmers are using it to build applications and the bottom line is
they’re not good enough and they can’t. Maybe C is the perfect language for
really good systems programmers, but unfortunately not-so-good systems and
applications programmers are using it and they shouldn’t be. [p560]</p>
</blockquote>
<h2 id="donald-knuth">Donald Knuth</h2>
<p>Teaching a class, writing a book, and programming:</p>
<blockquote>
<p><strong>Knuth:</strong> I could teach classes full-time and write a book full-time but
software required so much attention to detail. It filled that much of my
brain to the exclusion of other stuff. So it gave me a special admiration
for people who do large software projects – I would never have guessed it
without having been faced with that myself. [p572]</p>
</blockquote>
<p>Why isn’t everybody a super programmer and super writer?</p>
<blockquote>
<p><strong>Knuth:</strong> Now, why hasn’t this spread over the whole world and why isn’t
everybody doing it? I’m not sure who it was who hit the nail on the head –
I think it was Jon Bentley. Simplified it is like this: only two percent of
the world’s population is born to be super programmers. And only two percent
of the population is born to be super writers. And Knuth is expecting
everybody to be both. [p574]</p>
</blockquote>
<p>Use of pointers in C:</p>
<blockquote>
<p><strong>Knuth:</strong> To me one of the most important revolutions in programming
languages was the use of pointers in the C language. When you have
nontrivial data structures, you often need one part of the structure to
point to another part, and people played around with different ways to put
that into a higher-level language. Tony Hoare, for example, had a pretty
nice clean system but the thing that the C language added – which at first
I thought was a big mistake and then it turned out I loved it – was that
when x is a pointer and then you say, x + 1, that doesn’t mean one more
byte after x but it means one more node after x, depending on what x points
to: if it points to a big node, x + 1 jumps by a large amount; if x points
to a small thing, x + 1 just moves
a little. That, to me, is one of the most amazing improvements in notation.
[p585]</p>
</blockquote>
<p>I did not know about Knuth’s <em>change files</em>, though they strike me as
inconvenient overkill:</p>
<blockquote>
<p><strong>Knuth:</strong> I had written TeX and Metafont and people started asking for it.
And they had 200 or 300 combinations of programming language and operating
system and computer, so I wanted to make it easy to adapt my code to
anybody’s system. So we came up with the solution that I would write a
master program that worked at Stanford and then there was this add-on called
a change file which could customize it to anybody else’s machine.</p>
<p>A change file is a very simple thing. It consists of a bunch of little blobs
of changes. Each change starts out with a few lines of code. You match until
you find the first line in the master file that agrees with the first line
of your change. When you get to the end of the part of the change that was
supposed to match the master file, then comes the part which says, “Replace
that by these lines instead.” [p586]</p>
<p>The extreme example of this was when TeX was adapted to Unicode. They had a
change file maybe 10 times as long as the master program. In other words,
they changed from an 8-bit program to a 16-bit program but instead of going
through and redoing my master program, they were so into change files that
they just wrote their whole draft of what they called Omega as change files,
as a million lines of change files to TeX’s 20,000 lines of code or
something. So that’s the extreme. [p587]</p>
</blockquote>
<p>Is programming fun any more?</p>
<blockquote>
<p><strong>Knuth:</strong> So there’s that change and then there’s the change that I’m
really worried about: that the way a lot of programming goes today isn’t any
fun because it’s just plugging in magic incantations – combine somebody
else’s software and start it up. It doesn’t have much creativity. I’m
worried that it’s becoming too boring because you don’t have a chance to do
anything much new. [p594]</p>
</blockquote>
<p>Code reading:</p>
<blockquote>
<p><strong>Knuth:</strong> … don’t only read the people who code like you. [p601]</p>
</blockquote>
tag:volkan.yazi.ci,2016-08-12://blog/post/2016/08/12/hotspot-heapdump-threadump/Programmatically Taking Heap and Thread Dumps in HotSpot2016-08-12T17:53:00Z2016-08-12T17:53:00Z
<p>While heap and thread dumps are one click away with the modern JVM toolset,
in many cases access restrictions in the deployment environment render these
options unusable. Hence, you might end up exposing this functionality some
other way, such as through an internal REST interface. This implies a new
nasty obstacle: you need to know how to take heap and thread dumps
programmatically in a Java application. Unfortunately, as of this writing
there is no standard interface for accessing this functionality from within
the VM. But if you are only concerned about HotSpot, then you are in luck!</p>
<h1 id="heap-dumps">Heap Dumps</h1>
<p>For heap dumps, once you get your hands on a
<a href="https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/HotSpotDiagnosticMXBean.html">HotSpotDiagnosticMXBean</a>,
you are good to go. It already exposes a
<a href="https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/HotSpotDiagnosticMXBean.html#dumpHeap-java.lang.String-boolean-">dumpHeap()</a>
method ready to be used.</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">com.sun.management.HotSpotDiagnosticMXBean</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">javax.management.MBeanServer</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.File</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.IOException</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span><span class="o">;</span>

<span class="kd">public</span> <span class="kd">enum</span> <span class="n">HotSpotHeapDumps</span> <span class="o">{;</span>

    <span class="c1">// Object name under which HotSpot registers its diagnostic MXBean.</span>
    <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">String</span> <span class="n">HOT_SPOT_DIAGNOSTIC_MX_BEAN_NAME</span> <span class="o">=</span>
            <span class="s">"com.sun.management:type=HotSpotDiagnostic"</span><span class="o">;</span>

    <span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">HotSpotDiagnosticMXBean</span> <span class="n">HOT_SPOT_DIAGNOSTIC_MX_BEAN</span> <span class="o">=</span>
            <span class="n">getHotspotDiagnosticMxBean</span><span class="o">();</span>

    <span class="kd">private</span> <span class="kd">static</span> <span class="n">HotSpotDiagnosticMXBean</span> <span class="nf">getHotspotDiagnosticMxBean</span><span class="o">()</span> <span class="o">{</span>
        <span class="n">MBeanServer</span> <span class="n">server</span> <span class="o">=</span> <span class="n">ManagementFactory</span><span class="o">.</span><span class="na">getPlatformMBeanServer</span><span class="o">();</span>
        <span class="k">try</span> <span class="o">{</span>
            <span class="k">return</span> <span class="n">ManagementFactory</span><span class="o">.</span><span class="na">newPlatformMXBeanProxy</span><span class="o">(</span>
                    <span class="n">server</span><span class="o">,</span> <span class="n">HOT_SPOT_DIAGNOSTIC_MX_BEAN_NAME</span><span class="o">,</span> <span class="n">HotSpotDiagnosticMXBean</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
        <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">IOException</span> <span class="n">error</span><span class="o">)</span> <span class="o">{</span>
            <span class="k">throw</span> <span class="k">new</span> <span class="nf">RuntimeException</span><span class="o">(</span><span class="s">"failed getting Hotspot Diagnostic MX bean"</span><span class="o">,</span> <span class="n">error</span><span class="o">);</span>
        <span class="o">}</span>
    <span class="o">}</span>

    <span class="c1">// Static, since this constant-less enum can never be instantiated.</span>
    <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">create</span><span class="o">(</span><span class="n">File</span> <span class="n">file</span><span class="o">,</span> <span class="kt">boolean</span> <span class="n">live</span><span class="o">)</span> <span class="kd">throws</span> <span class="n">IOException</span> <span class="o">{</span>
        <span class="n">HOT_SPOT_DIAGNOSTIC_MX_BEAN</span><span class="o">.</span><span class="na">dumpHeap</span><span class="o">(</span><span class="n">file</span><span class="o">.</span><span class="na">getAbsolutePath</span><span class="o">(),</span> <span class="n">live</span><span class="o">);</span>
    <span class="o">}</span>

<span class="o">}</span></code></pre>
<p>The second argument of <code>dumpHeap</code>, <code>live</code>, controls whether only live
objects are dumped, that is, objects that are reachable from others.</p>
<p>Note that many real-world Java applications occupy quite some memory. As a
result, the created heap dumps generally end up consuming a significant
amount of disk space. You need to come up with your own custom clean-up
mechanism to tackle this problem. (For instance, in a JAX-RS resource, you can
employ a custom <code>MessageBodyWriter</code> to delete the file after writing the
entire file to the output stream.)</p>
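<p>As a rough illustration of such a clean-up mechanism outside of JAX-RS, the
following sketch dumps the heap into a temporary file, copies it to an
arbitrary <code>OutputStream</code>, and deletes the file no matter what happens. The
<code>TransientHeapDumpSketch</code> class and its <code>writeTo()</code> method are
hypothetical names made up for this example, and the sketch assumes a HotSpot
JVM running Java 7 or later:</p>
<pre><code class="language-java">import com.sun.management.HotSpotDiagnosticMXBean;

import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of any library.
public final class TransientHeapDumpSketch {

    private TransientHeapDumpSketch() {}

    /**
     * Dumps the heap to a temporary file, copies the file to the given
     * stream, and removes the temporary file and directory afterwards.
     * Returns the number of bytes copied.
     */
    public static long writeTo(OutputStream output, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap() fails if the target file already exists, hence a
        // not-yet-existing file in a fresh temporary directory is used.
        Path directory = Files.createTempDirectory("heap-dump-");
        Path file = directory.resolve("heap.hprof");
        try {
            bean.dumpHeap(file.toAbsolutePath().toString(), live);
            return Files.copy(file, output);
        } finally {
            // Clean up even if dumping or copying has failed.
            Files.deleteIfExists(file);
            Files.deleteIfExists(directory);
        }
    }

}</code></pre>
<p>The <code>.hprof</code> suffix is a deliberate choice here: as far as I know, newer
JDKs reject output file names that do not end with it.</p>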
<h1 id="thread-dumps">Thread Dumps</h1>
<p>At first sight, thread dumps just contain simple plain text data.</p>
<pre><code>2016-08-12 18:40:46
Full thread dump OpenJDK 64-Bit Server VM (25.76-b198 mixed mode):
"RMI TCP Connection(266)-127.0.0.1" #24884 daemon prio=9 os_prio=0 tid=0x00007f9474010000 nid=0x2cee runnable [0x00007f941571b000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x00000005c086e8b0> (a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:550)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:683)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$83/628845041.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- <0x00000005c0489198> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"JobScheduler FJ pool 0/4" #24883 daemon prio=6 os_prio=0 tid=0x00007f946415d800 nid=0x2ced waiting on condition [0x00007f94093d2000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005d8a5f9e0> (a jsr166e.ForkJoinPool)
at jsr166e.ForkJoinPool.awaitWork(ForkJoinPool.java:1756)
at jsr166e.ForkJoinPool.scan(ForkJoinPool.java:1694)
at jsr166e.ForkJoinPool.runWorker(ForkJoinPool.java:1642)
at jsr166e.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:108)
Locked ownable synchronizers:
- None
</code></pre>
<p>Unfortunately, thread dumps do not have a standard syntax. While there are
various ways to produce this output, thread dump analysis software does not
play well with all of them. For instance, <a href="https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=2245aa39-fa5c-4475-b891-14c205f7333c">IBM Thread and Monitor Dump Analyzer for
Java</a>
cannot parse thread dumps created by VisualVM using JMX. At the end of the
day, I always ended up falling back to a HotSpot thread dump.</p>
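<p>For comparison, the JMX route can be sketched with the standard
<code>ThreadMXBean</code>. The <code>JmxThreadDumpSketch</code> class name is made up for
this example. The output is assembled from <code>ThreadInfo.toString()</code>, which,
as far as I know, abbreviates deep stack traces and does not follow the
HotSpot format shown above, so treat it as an illustration of the alternative
rather than a replacement:</p>
<pre><code class="language-java">import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of any library.
public final class JmxThreadDumpSketch {

    private JmxThreadDumpSketch() {}

    public static String create() {
        ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
        // Request lock details only if the VM is able to provide them.
        boolean lockedMonitors = threadMxBean.isObjectMonitorUsageSupported();
        boolean lockedSynchronizers = threadMxBean.isSynchronizerUsageSupported();
        StringBuilder dump = new StringBuilder();
        for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(lockedMonitors, lockedSynchronizers)) {
            dump.append(threadInfo);
        }
        return dump.toString();
    }

}</code></pre>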
<p>The <code>tools.jar</code> shipped with JDKs (>=1.6) provides the magical
<a href="http://www.docjar.com/docs/api/sun/tools/attach/HotSpotVirtualMachine.html">HotSpotVirtualMachine</a>
class containing our saviour, the <code>remoteDataDump()</code> method. First, add the
following lines to your <code>pom.xml</code>:</p>
<pre><code class="language-xml"><span class="nt">&lt;dependencyManagement&gt;</span>
    <span class="nt">&lt;dependencies&gt;</span>
        <span class="nt">&lt;dependency&gt;</span>
            <span class="nt">&lt;groupId&gt;</span>com.sun<span class="nt">&lt;/groupId&gt;</span>
            <span class="nt">&lt;artifactId&gt;</span>tools<span class="nt">&lt;/artifactId&gt;</span>
            <span class="nt">&lt;version&gt;</span>${java.version}<span class="nt">&lt;/version&gt;</span>
            <span class="nt">&lt;scope&gt;</span>system<span class="nt">&lt;/scope&gt;</span>
            <span class="nt">&lt;systemPath&gt;</span>${tools.jar}<span class="nt">&lt;/systemPath&gt;</span>
        <span class="nt">&lt;/dependency&gt;</span>
    <span class="nt">&lt;/dependencies&gt;</span>
<span class="nt">&lt;/dependencyManagement&gt;</span>
<span class="nt">&lt;profiles&gt;</span>
    <span class="c">&lt;!-- tools.jar path for GNU/Linux and Windows --&gt;</span>
    <span class="nt">&lt;profile&gt;</span>
        <span class="nt">&lt;id&gt;</span>default-tools.jar<span class="nt">&lt;/id&gt;</span>
        <span class="nt">&lt;activation&gt;</span>
            <span class="nt">&lt;file&gt;</span>
                <span class="nt">&lt;exists&gt;</span>${java.home}/../lib/tools.jar<span class="nt">&lt;/exists&gt;</span>
            <span class="nt">&lt;/file&gt;</span>
        <span class="nt">&lt;/activation&gt;</span>
        <span class="nt">&lt;properties&gt;</span>
            <span class="nt">&lt;tools.jar&gt;</span>${java.home}/../lib/tools.jar<span class="nt">&lt;/tools.jar&gt;</span>
        <span class="nt">&lt;/properties&gt;</span>
    <span class="nt">&lt;/profile&gt;</span>
    <span class="c">&lt;!-- tools.jar path for OSX --&gt;</span>
    <span class="nt">&lt;profile&gt;</span>
        <span class="nt">&lt;id&gt;</span>default-tools.jar-mac<span class="nt">&lt;/id&gt;</span>
        <span class="nt">&lt;activation&gt;</span>
            <span class="nt">&lt;file&gt;</span>
                <span class="nt">&lt;exists&gt;</span>${java.home}/../Classes/classes.jar<span class="nt">&lt;/exists&gt;</span>
            <span class="nt">&lt;/file&gt;</span>
        <span class="nt">&lt;/activation&gt;</span>
        <span class="nt">&lt;properties&gt;</span>
            <span class="nt">&lt;tools.jar&gt;</span>${java.home}/../Classes/classes.jar<span class="nt">&lt;/tools.jar&gt;</span>
        <span class="nt">&lt;/properties&gt;</span>
    <span class="nt">&lt;/profile&gt;</span>
<span class="nt">&lt;/profiles&gt;</span></code></pre>
<p>Then the rest is a matter of accessing the <code>HotSpotVirtualMachine</code> class:</p>
<pre><code class="language-java"><span class="kn">import</span> <span class="nn">com.google.common.io.ByteStreams</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.sun.management.HotSpotDiagnosticMXBean</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.sun.tools.attach.AttachNotSupportedException</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">com.sun.tools.attach.VirtualMachine</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">sun.tools.attach.HotSpotVirtualMachine</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.IOException</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.io.InputStream</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">enum</span> <span class="n">HotSpotThreadDumps</span> <span class="o">{;</span>
<span class="kd">public</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">create</span><span class="o">()</span> <span class="kd">throws</span> <span class="n">AttachNotSupportedException</span><span class="o">,</span> <span class="n">IOException</span> <span class="o">{</span>
<span class="c1">// Get the PID of the current JVM process.</span>
<span class="n">String</span> <span class="n">selfName</span> <span class="o">=</span> <span class="n">ManagementFactory</span><span class="o">.</span><span class="na">getRuntimeMXBean</span><span class="o">().</span><span class="na">getName</span><span class="o">();</span>
<span class="n">String</span> <span class="n">selfPid</span> <span class="o">=</span> <span class="n">selfName</span><span class="o">.</span><span class="na">substring</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="n">selfName</span><span class="o">.</span><span class="na">indexOf</span><span class="o">(</span><span class="sc">'@'</span><span class="o">));</span>
<span class="c1">// Attach to the VM.</span>
<span class="n">VirtualMachine</span> <span class="n">vm</span> <span class="o">=</span> <span class="n">VirtualMachine</span><span class="o">.</span><span class="na">attach</span><span class="o">(</span><span class="n">selfPid</span><span class="o">);</span>
<span class="n">HotSpotVirtualMachine</span> <span class="n">hotSpotVm</span> <span class="o">=</span> <span class="o">(</span><span class="n">HotSpotVirtualMachine</span><span class="o">)</span> <span class="n">vm</span><span class="o">;</span>
<span class="c1">// Request a thread dump.</span>
<span class="k">try</span> <span class="o">(</span><span class="n">InputStream</span> <span class="n">inputStream</span> <span class="o">=</span> <span class="n">hotSpotVm</span><span class="o">.</span><span class="na">remoteDataDump</span><span class="o">())</span> <span class="o">{</span>
<span class="kt">byte</span><span class="o">[]</span> <span class="n">bytes</span> <span class="o">=</span> <span class="n">ByteStreams</span><span class="o">.</span><span class="na">toByteArray</span><span class="o">(</span><span class="n">inputStream</span><span class="o">);</span>
<span class="k">return</span> <span class="k">new</span> <span class="nf">String</span><span class="o">(</span><span class="n">bytes</span><span class="o">);</span>
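<span class="o">}</span> <span class="k">finally</span> <span class="o">{</span>
<span class="c1">// Detach from the VM once the dump has been read.</span>
<span class="n">vm</span><span class="o">.</span><span class="na">detach</span><span class="o">();</span>
<span class="o">}</span>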
<span class="o">}</span>
<span class="o">}</span></code></pre>
<p>You finished writing this code, clicked the Run button of the IDE, and it
worked like a charm. This got you so excited that you wanted to add this
functionality to your JEE service! Or better: turn this into a JAR, pass it
to your client’s machine, and watch them take their part in the joy of
thread-dump-oriented debugging! And this is what you get in return:</p>
<pre><code>java.lang.NoClassDefFoundError: com/sun/tools/attach/AttachNotSupportedException
</code></pre>
<p>Which indicates that you did not pay attention to my words: <em><code>tools.jar</code> is
shipped with JDKs.</em> Neither your flashy JEE application server nor your
client’s machine has a JDK; they only have a JRE. Rings a bell? Yes, you can
indeed add <code>tools.jar</code> to the final WAR/JAR of your project:</p>
<pre><code class="language-xml"><span class="nt"><build></span>
<span class="nt"><plugins></span>
<span class="c"><!-- copy tools.jar from JAVA_HOME --></span>
<span class="nt"><plugin></span>
<span class="nt"><groupId></span>org.apache.maven.plugins<span class="nt"></groupId></span>
<span class="nt"><artifactId></span>maven-dependency-plugin<span class="nt"></artifactId></span>
<span class="nt"><executions></span>
<span class="nt"><execution></span>
<span class="nt"><id></span>copy-system-dependencies<span class="nt"></id></span>
<span class="nt"><phase></span>prepare-package<span class="nt"></phase></span>
<span class="nt"><goals></span>
<span class="nt"><goal></span>copy-dependencies<span class="nt"></goal></span>
<span class="nt"></goals></span>
<span class="nt"><configuration></span>
<span class="nt"><outputDirectory></span>${project.build.directory}/${project.build.finalName}/WEB-INF/lib<span class="nt"></outputDirectory></span>
<span class="nt"><includeScope></span>system<span class="nt"></includeScope></span>
<span class="nt"></configuration></span>
<span class="nt"></execution></span>
<span class="nt"></executions></span>
<span class="nt"></plugin></span>
<span class="nt"></plugins></span>
<span class="nt"></build></span></code></pre>
<p>Note that this approach incorporates a JDK-specific JAR into your application
and assumes that the application will run on a HotSpot VM. But unfortunately
this is the only way that I know of to produce a thread dump that works with
thread dump analysis software. If you don’t have such a need and just want a
crude JMX-generated thread dump, check out
<a href="https://java.net/projects/visualvm/sources/svn/content/branches/release134/visualvm/jmx/src/com/sun/tools/visualvm/jmx/impl/JmxSupport.java">JmxSupport.java</a>
shipped with VisualVM.</p>
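<p>For the crude variant, a minimal sketch using only the standard
<code>java.lang.management</code> API could look as follows. The class name
<code>JmxThreadDumps</code> is made up for illustration and this is not the
VisualVM code. Note also that <code>ThreadInfo#toString()</code> prints only a
limited number of stack frames per thread, so the output is not in the format
that thread dump analysis software expects:</p>
<pre><code class="language-java">import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the JDK or VisualVM.
public final class JmxThreadDumps {

    private JmxThreadDumps() {}

    public static String create() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        // Request lock details only if the running VM reports support for them.
        boolean lockedMonitors = threadMXBean.isObjectMonitorUsageSupported();
        boolean lockedSynchronizers = threadMXBean.isSynchronizerUsageSupported();
        ThreadInfo[] threadInfos =
                threadMXBean.dumpAllThreads(lockedMonitors, lockedSynchronizers);
        StringBuilder builder = new StringBuilder();
        for (ThreadInfo threadInfo : threadInfos) {
            // ThreadInfo#toString() renders the thread name, state, lock info,
            // and a truncated stack trace.
            builder.append(threadInfo);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(create());
    }
}</code></pre>
<p>Since this relies on nothing but the platform MXBeans, it should run on a
plain JRE without <code>tools.jar</code>.</p>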