We were tracing a problem with occasional latency spikes on one of our systems the other day. The cause was rather unusual.
The write load was not high: a few megabytes per second at peak, on an otherwise completely idle disk. We checked for the usual hardware issues: cache batteries, failing disks and other programs writing to that disk, but came up empty. Fairly regularly we saw a write latency of up to 40ms. Occasionally the spikes came more frequently and clustered together, so it looked like there was bimodal behaviour.
Our investigation showed that writes into the page cache were stalling while allocating buffer_head structures. Here’s what we found out.
A bit about the workload. The workload is a “normal” write: no fancy async I/O, no direct I/O or synchronous flags flying around, just a straight-up write() syscall. The writes are sequential, and we never read the pages back once they have been written.
It is latency sensitive, however: we expect write() to complete within a few microseconds, essentially the overhead of a system call on the architecture and kernel in question, given that all it does is stuff the data into a page in the page cache for Linux to write out at its convenience later. The speed of the disks doesn’t really factor into this.
The write load is not continuous. We’re writing 220 bytes every 200 microseconds or so, and we aggregate the results into buckets and report the worst-case latency. This does have the potential to suffer from “co-ordinated omission™ (Gil Tene, 2012)”, which we make a vague attempt at mitigating by aggregating over many samples and reporting the worst case.
The machine has plenty of free memory (20GB) and plenty of free cores (32 in total). Most of the time the write does indeed complete in a few tens of microseconds. Every now and then, though, it gets stuck for tens of milliseconds.
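As a sketch of the shape of the measurement: the helper names, file path and error handling below are ours, not the production code, which also sleeps ~200 microseconds between writes and buckets the results.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Perform `count` plain 220-byte write() calls and return the worst-case
 * latency seen, in microseconds. Returns (uint64_t)-1 if the file cannot
 * be opened. */
uint64_t worst_write_us(const char *path, int count)
{
    char buf[220];
    memset(buf, 'x', sizeof buf);
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return (uint64_t)-1;
    uint64_t worst = 0;
    for (int i = 0; i < count; i++) {
        uint64_t t0 = now_us();
        if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
            break;
        uint64_t elapsed = now_us() - t0;
        if (elapsed > worst)
            worst = elapsed;
    }
    close(fd);
    return worst;
}
```

On a healthy run virtually every write lands in the single-digit-microsecond range; it is the occasional outlier that the worst-case reporting is there to catch.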
Transparent hugepages, and hugepages generally, are disabled, so the pauses are nothing to do with those.
We were pretty sure from the measurements in the app and the subsequent microbenchmark we made that the problem was an occasional latency spike during a write. This was visible on several machines, with different specs, and the hardware seemed to check out anyway. It had to be something else.
Two observations first pointed us in the direction of the page cache:
- If you cleared the page cache completely by dropping the caches (echo 3 > /proc/sys/vm/drop_caches), the problem was alleviated.
- The problem appeared to be file system independent. We saw the same problems with xfs as we did with a variety of extfs versions, irrespective of data journalling option.
As background, it is helpful to understand how zone memory reclaim works on a NUMA node.
High watermarks and memory reclaim
Memory in Linux on NUMA machines is divided into nodes. On a recent Intel rack-mount server – for example one using Sandy Bridge processors – the resources local to a particular socket form a node: the memory and CPUs attached to socket 0 are node0, and the memory and CPUs attached to socket 1 are node1.
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 65490 MB
node 0 free: 1009 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 65536 MB
node 1 free: 678 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
Each node has one or more zones. Thankfully we’re on a 64-bit architecture and kernel, so we don’t need to mess about with highmem zones; we just have three zones to think about: DMA, DMA32 and Normal.
On this system we have them laid out like this:
$ more /proc/zoneinfo | grep zone
Node 0, zone DMA
Node 0, zone DMA32
Node 0, zone Normal
Node 1, zone Normal
So all of node1 is given over to the Normal zone, whereas node0 contains three zones. The DMA zones are reserved for driver direct memory access.
Each zone in turn has three thresholds; in ascending order they are pages_min, pages_low and pages_high. We’re also interested in pages_present, which tells us the size of each zone. The Normal zone in node0 is smaller than in node1 on our machine, as pages have been taken by the other two zones.
pages_min, pages_low and pages_high are all derived from a single kernel tunable, min_free_kbytes, which tells the kernel how much memory to keep on the free list. From this it determines, on a pro rata basis, the size of each threshold in each zone. Looking at node 1, zone Normal on our test machine:
$ grep -A7 'Node 1, zone Normal' /proc/zoneinfo
Node 1, zone Normal
pages free 174188
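The pro rata arithmetic can be sketched as follows. In 2.6.32’s __setup_per_zone_wmarks() each zone gets a share of min_free_kbytes proportional to its size, and the low and high watermarks sit 25% and 50% above pages_min respectively. The function and field names below are our own:

```c
/* Hedged sketch of how the kernel derives per-zone watermarks from
 * min_free_kbytes (see __setup_per_zone_wmarks in mm/page_alloc.c,
 * 2.6.32). All values are in pages. */
struct watermarks {
    unsigned long min, low, high;
};

struct watermarks zone_watermarks(unsigned long min_free_pages,
                                  unsigned long zone_pages,
                                  unsigned long total_pages)
{
    struct watermarks w;
    /* This zone's pro rata share of the global minimum. */
    unsigned long tmp = (unsigned long)((unsigned long long)min_free_pages
                                        * zone_pages / total_pages);
    w.min  = tmp;
    w.low  = tmp + (tmp >> 2); /* pages_min + 25% */
    w.high = tmp + (tmp >> 1); /* pages_min + 50% */
    return w;
}
```

So a zone holding half of the machine’s pages gets half of the global minimum as its pages_min, with pages_low and pages_high stepped up from there.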
Linux holds onto pretty much anything it can in memory. Free memory is wasted memory. However at some point when you need memory for what you want to do right now, the kernel will need to reclaim pages from whatever is using them. This normally happens asynchronously via a process called kswapd.
kswapd gets woken up when pages_free crosses the pages_low threshold for its node. Each node has its own kswapd, which runs through memory reclaiming pages on that node. There are plenty of resources on the web that describe how this happens in detail; most are somewhat old and focus on the 2.4 kernel, but the principles are the same. kswapd stops when pages_free passes pages_high again.
pages_min is of particular interest for low-latency apps. If memory pressure is high enough that pages_free crosses the pages_min threshold (the lowest one), then whichever process called alloc_pages() inside its syscall will synchronously call the same reclaim functions as kswapd – this is known as the direct reclaim path. This is particularly bad: your process pauses in an unbounded fashion while the kernel frees memory to satisfy your request. We never want to be here.
The last tunable to concern us is zone_reclaim_mode. This is the NUMA knob, and it affects how memory reclaim is done in the Normal zone, as that is the only zone present on more than one node. If zone_reclaim_mode is 0 then memory is treated as one big pool: an allocation made from socket 0 would normally be filled from node0, but if node0 is full and there is plenty of memory in node1, it will be filled from node1 instead. If zone_reclaim_mode is set to 1 then each node is treated more as an island: an allocation made on node0 will be filled from node0’s Normal zone, potentially waking kswapd to add pages to the free list.
So to recap;
- Memory is divided by cpu socket (node) and by use (zone).
- The min_free_kbytes tunable drives 3 different thresholds in each zone.
- Going below pages_min causes unbounded pauses under heavy memory pressure.
- A per node kswapd will be woken up when free memory drops below pages_low and work to free memory until pages_high is reached.
- Setting zone_reclaim_mode to 1 keeps allocations local to a node, avoiding the NUMA access cost.
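To watch the free-list headroom ourselves, we scraped pages_free minus pages_low out of /proc/zoneinfo. A minimal sketch of that parsing, following the 2.6.32 zoneinfo layout (the helper name is our own; it takes a FILE* so it can be pointed at /proc/zoneinfo or at test data):

```c
#include <stdio.h>
#include <string.h>

/* Report pages_free - pages_low for a given node and zone by scanning
 * /proc/zoneinfo-style text. Returns 1 on success with *headroom filled
 * in, 0 if the node/zone was not found. */
int zone_headroom(FILE *f, int node, const char *zone, long *headroom)
{
    char line[256], header[64];
    long free_pages = -1, low_pages = -1;
    int in_zone = 0;

    snprintf(header, sizeof header, "Node %d, zone", node);
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Node ", 5) == 0) {
            /* Entering a new "Node N, zone XXX" section. */
            in_zone = strncmp(line, header, strlen(header)) == 0
                   && strstr(line, zone) != NULL;
        } else if (in_zone) {
            long v;
            if (sscanf(line, " pages free %ld", &v) == 1)
                free_pages = v;
            else if (sscanf(line, " low %ld", &v) == 1)
                low_pages = v;
        }
    }
    if (free_pages < 0 || low_pages < 0)
        return 0;
    *headroom = free_pages - low_pages;
    return 1;
}
```

Polling this once a second per node is what produced the memory lines on the plots below: when the headroom for a zone approaches zero, kswapd is about to wake, and direct reclaim is not far behind.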
Colliding with buffer_head
Everything in this program is single threaded, and we happened to be running on socket 0/node0. When the problem occurred, inspection of /proc/zoneinfo showed that pages_free for the Normal zone on that node was always close to pages_low, while the Normal zone on the other node had far more headroom.
On with the observations. These traces were taken with our test program pinned to socket 1, so it allocates from node1’s Normal zone by default.
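The pinning itself can be done with numactl or taskset; the equivalent in code, assuming the CPU-to-node layout from the numactl output above (node1 owns the odd-numbered CPUs), looks roughly like this:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU; pid 0 means "this thread".
 * With the layout shown earlier, pinning to an odd-numbered CPU keeps
 * us on socket 1, and first-touch allocation then favours node1's
 * memory. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

This keeps the scheduler from migrating the writer between sockets mid-run, which would otherwise muddy which node the allocations land on.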
For this plot, the two memory lines show the difference between the pages_free and pages_low for each memory node. The latency line shows the worst case latency in that bucket. The X axis is time and the Y axis doubles as both pages for memory and microseconds for latency.
You want to be looking at the red line. Basically, whenever we’re using memory on the other node, latency goes high and stays high. If the latency shows 40ms, it means that within a bucket of several thousand writes there was at least one write of 40ms; conversely, if it shows 40 microseconds, the worst case in that bucket of writes was 40 microseconds.
With zone_reclaim_mode set to 0 you can see that as the free pages for node1 reaches the pages_low threshold, the memory allocator takes memory from node0’s Zone normal.
- The linux allocator works the way the documentation says it does. Good.
- Latency goes bad for an extended period when we’re using page cache on node0 while pinned to node1.
- Once kswapd clears memory on both nodes, we go back to baseline spiky behaviour.
Now let’s see how it looks with zone_reclaim_mode set to 1.
Now, when we reach the pages_low threshold for node1, we fill the allocation from node1’s freelist and initiate reclaim on node1. Node0’s free memory is unaffected.
We’re not the only user of memory on the system, and there are some interesting dips in pages_free on node0 when we free up memory on node1. This is unexpected, and I’m not sure what’s going on there; I’d have expected the memory to increase if anything, as the kswapds are woken up. It does appear to be transitory, however, as it’s straight down and up. My current suspicion is that it’s to do with the type of memory that needs to be freed. A run with zone_reclaim_mode set to 3, telling zone_reclaim and friends to be a bit more eager, would confirm or deny that (http://lxr.free-electrons.com/ident?v=2.6.32;i=zone_reclaim).
- Latency spikes are still present but smaller. They peak in the 10-15ms range. This is still rubbish.
- There’s no period of time when latency is elevated, and we remain in ‘baseline’ spiky behaviour all the time.
Using systemtap to instrument the call path of vfs_write(), we can see that the time is lost under alloc_buffer_head():
Top Latency in this bucket: 40571
0 java(3238): -> vfs_write
3 java(3238): -> do_sync_write
6 java(3238): -> kmem_cache_alloc
8 java(3238): <- kmem_cache_alloc (took 3 us)
11 java(3238): -> __find_get_block
13 java(3238): <- __find_get_block (took 2 us)
16 java(3238): -> kmem_cache_alloc
18 java(3238): <- kmem_cache_alloc (took 3 us)
21 java(3238): -> add_to_page_cache_locked
23 java(3238): <- add_to_page_cache_locked (took 3 us)
27 java(3238): -> block_write_begin
29 java(3238): -> block_write_begin
31 java(3238): -> __block_prepare_write
33 java(3238): -> __block_prepare_write
35 java(3238): -> create_empty_buffers
37 java(3238): -> alloc_page_buffers
40 java(3238): -> alloc_buffer_head
42 java(3238): -> kmem_cache_alloc
44 java(3238): -> cache_alloc_refill
40520 java(3238): -> cache_grow
40528 java(3238): <- cache_grow (took 9 us)
40530 java(3238): <- cache_alloc_refill (took 40487 us)
40533 java(3238): <- kmem_cache_alloc (took 40491 us)
40535 java(3238): <- alloc_buffer_head (took 40496 us)
40537 java(3238): -> set_bh_page
40539 java(3238): <- set_bh_page (took 2 us)
40541 java(3238): <- alloc_page_buffers (took 40504 us)
40543 java(3238): <- create_empty_buffers (took 40509 us)
40547 java(3238): -> unmap_underlying_metadata
40549 java(3238): <- unmap_underlying_metadata (took 2 us)
40551 java(3238): <- __block_prepare_write (took 40520 us)
40553 java(3238): <- __block_prepare_write (took 40522 us)
40555 java(3238): <- block_write_begin (took 40529 us)
40557 java(3238): <- block_write_begin (took 40530 us)
40560 java(3238): -> block_write_end
40563 java(3238): <- block_write_end (took 3 us)
40565 java(3238): <- do_sync_write (took 40563 us)
40568 java(3238): <- vfs_write (took 40568 us)
Hello, cache_alloc_refill(). The source can be seen here: http://lxr.free-electrons.com/source/mm/slab.c?v=2.6.32#L2922.
There are several things in that routine that lend themselves to contention: a couple of spin locks and a nice big for loop.
What seems to be the case is that when we run out of memory on the local node and need to go and grab memory from another node, that’s when things start getting expensive.
Certainly at several points, for example in get_page_from_freelist() (http://lxr.free-electrons.com/ident?v=2.6.32;i=get_page_from_freelist), there is a loop-over-zones-until-the-request-is-filled approach. If an element of that is expensive, or contends on a lock – for example if kswapd fires up at the same time, having been woken by a zone dipping below its pages_low threshold – then we could expect this to be an expensive path.
I’d love to have the time to run this all the way down to the root cause. However, we’re on an old kernel (2.6.32) so chances are even if we find a bug or optimisation, it’ll already have been fixed upstream. The joys of using enterprise operating systems. I must fix that at some point.
To focus on the fixes/workarounds: we can turn on zone_reclaim_mode. That helps, but we still have far too many spikes over 1ms. That’s not acceptable.
The alternative we came up with is to avoid the uncertain cost of allocating from the buffer_head slab cache or the page cache entirely.
buffer_head is now just a metadata cache for the page cache. So if we preallocate the file we are writing into from another thread, and then prefetch the pages we are interested in into the page cache in advance, the precursor thread absorbs all of the variable-latency hits for us.
When we come to write on the latency sensitive thread, we’ll not need to drop into alloc_buffer_head() as the precursor thread will have already taken the hit for us, and we can sail through the system call on the fast path.
This comes in two parts. The first is to preallocate a bunch of files that we’ll later write to; for example by using ‘dd if=/dev/zero of=write_file.dat bs=1024k count=1024’ to write out a zero-filled gigabyte file. The second is to prefetch each file into RAM on the node/socket we’re working on just before it’s needed. This is the part that ensures the page cache and the associated buffer_head structures are populated before the latency-sensitive code gets to them.
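A minimal sketch of the two steps; the helper names and the 4K chunk size are ours, and the real precursor thread is also pinned to the target node, which is omitted here:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Step 1: preallocate by writing zeroes through the page cache, so the
 * file's blocks exist on disk ahead of time (the dd command above does
 * the same job). */
int preallocate_file(const char *path, size_t bytes)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    char block[4096];
    memset(block, 0, sizeof block);
    size_t written = 0;
    while (written < bytes) {
        size_t chunk = bytes - written;
        if (chunk > sizeof block)
            chunk = sizeof block;
        if (write(fd, block, chunk) != (ssize_t)chunk) {
            close(fd);
            return -1;
        }
        written += chunk;
    }
    close(fd);
    return 0;
}

/* Step 2: just before the file is needed, read it back from the thread
 * running on the right node/socket, so its pages are resident in the
 * page cache and the reads, not the latency-sensitive writes, pay any
 * allocation cost. posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED) is an
 * alternative way to hint the same thing. */
int prefetch_file(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    char block[4096];
    while (read(fd, block, sizeof block) > 0)
        ;
    close(fd);
    return 0;
}
```

Run from a precursor thread, with preallocation done well in advance and the prefetch done shortly before the writer reaches each file.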
The results show a flat line.
This is where the scale fails to do justice to the latency. Again, pages and microseconds on the Y axis.
So here’s a better chart of just the latency.
The latency is pretty consistent at 20 microseconds or less with occasional spikes to 70 micros.
Over a longer run (tens of hours) there are still spikes over a millisecond, but they’re the rare exception, not the rule. We’ll settle for that for now. Some day perhaps we’ll need to come back and understand this a bit more. Or maybe we’ll have fixed it a different way by then.