Concurrent Sequencing


A few weeks ago one of the users of the Disruptor posted some worrying benchmarks:

ThreePublisherToOneProcessorSequencedThroughputTest
run 0: BlockingQueue=645,161 Disruptor=1,772 ops/sec
run 1: BlockingQueue=1,250,000 Disruptor=20,000,000 ops/sec
run 2: BlockingQueue=1,250,000 Disruptor=56 ops/sec

It appears that under heavy contention, with fewer available cores than busy threads, the Disruptor can perform terribly. After a bit of investigation I managed to isolate the problem. One of the most complex parts of the Disruptor is the multi-threaded claim strategy. It is the only place in the Disruptor where – out of necessity – we break the single-writer principle.

The approach that we used was very simple. Each thread claims a slot in the ring buffer using AtomicLong.incrementAndGet(), which ensures that each claim will return a unique, sequential value. The complexity arrives when multiple threads try to publish their sequences. We require that all events placed in the ring buffer are made available to event processors in a strictly sequential order. To ensure this behaviour we have a method called serialisePublishing(). Our simple implementation would have the publishing thread busy spin until the last published sequence (the cursor) is one less than the value being published.
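
As a rough sketch (the class and field names here are illustrative, not the actual Disruptor source), the original approach looks something like this:

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch of the original approach, not the actual Disruptor code.
    public class SpinningSequencer
    {
        private final AtomicLong claimSequence = new AtomicLong(-1);
        private final AtomicLong cursor = new AtomicLong(-1);

        // Each claim returns a unique, strictly ascending sequence.
        public long next()
        {
            return claimSequence.incrementAndGet();
        }

        // Busy spin until all earlier sequences have been published, then
        // move the cursor on to make this event visible to event processors.
        public void serialisePublishing(final long sequence)
        {
            while (cursor.get() != sequence - 1)
            {
                // busy spin - with more publishers than cores this loop can
                // starve the thread that should be updating the cursor
            }
            cursor.set(sequence);
        }
    }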

This works because each published sequence is unique and strictly ascending. For example, if one thread wants to publish the value 8, it will spin until the cursor reaches 7. Because no other thread will be trying to publish the value 8, it can make progress, ensuring the sequential publishing behaviour in the process. However, this busy spin loop causes problems when there are more threads than cores. The threads that need to wait for prior sequences to be published can starve out the thread that should be updating the cursor. This leads to the unpredictable results shown above.

We need a better solution. In an ideal world there would be a Java API that compiled down to the Intel MONITOR/MWAIT instructions; unfortunately they are limited to Ring 0, so they require a little kernel assistance to be useful. Another option (also unavailable from Java) would be the Intel PAUSE instruction, which could be used in the middle of the spin loop. One of the problems with busy loops on modern processors is that, in order to keep the pipeline full, the CPU may speculatively execute the condition at the top of the loop, causing an unnecessarily high number of instructions to fill the CPU pipeline. This can starve other logical threads of CPU resources. The PAUSE instruction on hyper-threaded Intel processors can improve this situation.

Java has neither of those, so we need to go back and address the shape of the serialisePublishing method. For the redesign I drew some inspiration from Cliff Click's non-blocking hash map. There are 2 aspects of his design that are very interesting:

  • No locks
  • No CAS retries

While the first is obvious, the second is trickier. Anyone familiar with CAS operations will have seen the traditional loop-until-success approach to handling concurrent updates. For example, the incrementAndGet method inside the AtomicLong in Oracle's JVM uses a similar loop. It could look something like this1:
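
    import java.util.concurrent.atomic.AtomicLong;

    // A sketch of the classic compare-and-swap retry loop; the JDK source is
    // structured a little differently (see footnote 1) but behaves the same.
    public class RetryingCounter
    {
        private final AtomicLong value = new AtomicLong(0);

        public long incrementAndGet()
        {
            long current;
            long next;
            do
            {
                current = value.get();
                next = current + 1;
            }
            while (!value.compareAndSet(current, next));   // retry until the CAS wins
            return next;
        }
    }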

While there are no locks here, it is not necessarily wait-free. It is theoretically possible, if a number of threads are trying to increment the value, for one (or more) of the threads to get stuck, unable to make progress, because other threads are constantly winning the race to the atomic operation. For an algorithm to be wait-free, all calls must complete within a fixed number of operations. One way to get closer to a truly wait-free algorithm is to design it such that the failure of a CAS operation is a signal to exit rather than to retry the operation. Cliff Click's approach was to model the algorithm as a state machine, where all states are valid and a transition between states is typically a CAS operation. E.g. imagine a state machine with 3 states {A, B, C} and 2 transitions {A->B, B->C}. If an instance of the state machine is in state A and 2 threads try to apply the transition A->B, only one will succeed. For the thread whose CAS operation fails, retrying the operation makes no sense: the instance has already transitioned to state B. In fact, the failure of the CAS operation is an indication that the instance is already in the desired state. The thread can exit if B is the desired state, or try to apply the B->C transition if that is what's required.
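
As a toy illustration of that idea (the enum and method names below are made up for this post, they are not taken from the hash map code):

    import java.util.concurrent.atomic.AtomicReference;

    // Toy model of "CAS failure as a signal to move on", not real hash map code.
    public class ThreeStateMachine
    {
        enum State { A, B, C }

        private final AtomicReference<State> state = new AtomicReference<>(State.A);

        // Try to apply A->B. If the CAS fails another thread has already made
        // the transition, so there is nothing left for this thread to do.
        public void ensureAtLeastB()
        {
            state.compareAndSet(State.A, State.B);
        }

        // Reach C by applying each transition at most once. A failed CAS at
        // either step just means another thread got there first - no retries.
        public void ensureC()
        {
            state.compareAndSet(State.A, State.B);
            state.compareAndSet(State.B, State.C);
        }
    }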

How does this apply to our concurrent sequencing problem? We could allow threads to continue to make progress while waiting for other threads to catch up by maintaining a list of sequences that are pending publication. If a thread tries to publish a sequence that is more than 1 higher than the current cursor (i.e. it would need to wait for another thread to publish its sequence), it can place that sequence into the pending list and return. The thread that is currently running behind will publish its own sequence, then check the pending list and publish those sequences before exiting.

To represent this as a state machine we would have 3 states {unpublished, pending, published} and 2 transitions {unpublished->pending, pending->published}. In recognition of the fact that computing resources are finite, we have a guard condition on the unpublished->pending transition, i.e. a limit on the number of sequences we allow in the pending state. Because each sequence is unique, the unpublished->pending transition does not require a CAS operation. The pending list is represented as an AtomicLongArray and the transition is a simple AtomicLongArray.set() where the index is the sequence modulo the size of the pending list2. The final transition, pending->published, is where the CAS operation comes in. The thread will first try to publish its own sequence number. If that succeeds, the thread will try to publish the next value from the pending list. If the CAS fails, the thread leaves the method: the failure means that the value is already published or will be by some other thread.
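
Pulling those pieces together, a simplified sketch of the new shape might look like the following (the names, the pending list size, and the guard implementation are illustrative; the real MultiThreadedClaimStrategy deals with more edge cases):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicLongArray;

    // Simplified sketch of the pending-list idea; not the exact Disruptor code.
    public class PendingListSequencer
    {
        private static final int PENDING_SIZE = 1024;            // must be a power of 2
        private static final int PENDING_MASK = PENDING_SIZE - 1;

        private final AtomicLong claimSequence = new AtomicLong(-1);
        private final AtomicLong cursor = new AtomicLong(-1);
        private final AtomicLongArray pendingPublication = new AtomicLongArray(PENDING_SIZE);

        public long next()
        {
            return claimSequence.incrementAndGet();
        }

        public void serialisePublishing(final long sequence)
        {
            // Guard on unpublished->pending: don't run so far ahead of the
            // cursor that the pending list would wrap.
            while (sequence - cursor.get() > PENDING_SIZE)
            {
                // busy spin - only hit when this publisher is far ahead
            }

            // unpublished->pending: no CAS needed because each sequence is
            // unique and therefore owns its slot in the pending list.
            pendingPublication.set((int) (sequence & PENDING_MASK), sequence);

            // pending->published: only succeeds if the cursor is one behind us.
            // A failure means another thread will (or already did) publish it.
            if (!cursor.compareAndSet(sequence - 1, sequence))
            {
                return;
            }

            // We were the thread running behind, so drain any sequences that
            // other threads left pending immediately after ours.
            long next = sequence + 1;
            while (pendingPublication.get((int) (next & PENDING_MASK)) == next)
            {
                if (!cursor.compareAndSet(next - 1, next))
                {
                    return;   // some other thread has taken over publishing
                }
                next++;
            }
        }
    }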

Running the multi-publisher performance test on my 2-core laptop (where at least 4 cores would normally be required):

ThreePublisherToOneProcessorSequencedThroughputTest
run 0: BlockingQueue=5,832,264 Disruptor=6,399,590 ops/sec
run 1: BlockingQueue=5,521,506 Disruptor=6,470,816 ops/sec
run 2: BlockingQueue=5,373,743 Disruptor=6,931,928 ops/sec

Sanity restored.

This update will be included in the 2.8 release of the Disruptor as the default implementation of the MultiThreadedClaimStrategy.
The old implementation will still be available as MultiThreadedLowContentionClaimStrategy. Those who have plenty of cores, and whose publishers aren't often contended, may find the old implementation faster, which it should be, as it is simpler and requires fewer memory barriers. I'm going to continue to revise and work on this code. While improved, it is not truly wait-free: it is possible for one of the threads to get stuck doing all of the publishing.

1 The actual AtomicLong.getAndIncrement() uses a slightly different loop structure, but the semantics are the same.
2 Actually it's a mask of the sequence and the pending list size minus 1. This is equivalent when the size of the pending list is a power of 2.
