Tweet this quote
Skip to content
header background

LMAX Exchange blog - FX industry thought leadership

All the latest business and technology views and insights on the FX industry from LMAX Exchange management and staff

header background

Looking for heap distress

​​​​​Recently at LMAX Exchange we’ve had a couple of services suffer from memory leaks.

In both cases, we noticed the problem much later than we’d like. One stricken application started to apply backpressure on (really rather important) upstream components. Another caused an account related action to be intermittently unavailable.

What we’d like to be able to do is detect when any of our java services is in acute heap distress as soon as it enters that state. We already have monitoring around causes of backpressure; we want to spot our issue a little earlier on in the causality graph (that’s two ridiculous names you’ve invented just in one paragraph. Enough! – Ed).

Looking for heap distress

Handily, each service already publishes its heap usage and gc times to our home-grown monitoring app. Even better, our alerting application already streams live events from that monitor into some alerting rules, so we have some existing scaffolding to build on top of.

We reworked the way the existing streaming sensors in our alerting app worked until we had something like the following:


void addStreamingSensor(
    final String keyPattern,
    final Consumer<MonitoringEvent> consumer,
    final FaultSensor faultSensor);

interface FaultSensor {
    Collection<Fault> sense();
}

class MonitoringEvent {
    public final long epochMillis;
    public final String key;
    public final double value;

    public MonitoringEvent(long epochMillis, String key, double value) {
        this.epochMillis = epochMillis;
        this.key = key;
        this.value = value;
    }
}

class Fault {
    public final String description;
    public final String url;

    public Fault(final String description, final String url) {
        this.description = description;
        this.url = url;
    }
}

This construction will set things up so that monitoring events matching our key pattern are delivered to our consumer.

Side note: we ended up with the following signature, mostly because using this little known java feature always makes me smile:

    <T> extends Consumer<MonitoringEvent> & FaultSensor> void addStreamingSensor(final String keyPattern, final T t);

Occasionally, a different thread will call sense to see if any faults are currently present.

In our case, the keys we have available to us (per application, per host) are ‘heap.used’, ‘heap.max’ (both in bytes) and ‘gc.pause’. Can we reliably detect heap distress from such a stream of events?

Bad ideas we considered

Percentage of wall clock time spent in GC

We could work out what percentage of time an application is spending in GC, and alert if it goes above a threshold.

This isn’t too bad an idea. Unfortunately, that metric is published at most once per second per application, and it lists the maximum pause time seen in that second, not the sum. We could fix this, but we’d have to wait for a software release to get feedback.

Persistent memory usage above a threshold percentage

One thing we see in both of our failing applications is that they see consistent heap usage somewhere near to their max value. We could keep track of how many observations we see above a particular usage threshold, and once that observation count reaches a large enough number, alert.

This would detect errors like the ones that we would see, but it would also detect slow growing heaps in well behaved applications. Alert code that produces more false alarms than real ones is, from our experience, worse than no alerting at all.

    @Test
    public void slowlyRisingUsageIsNotAFault()
    {
        useHealthyHeapThresholdOf(90, Percent);

        reportHeapMax("hostOne", "serviceOne", 100.0);
        reportHeapUsage("hostOne", "serviceOne", 65.0);
        reportHeapUsage("hostOne", "serviceOne", 75.0);
        reportHeapUsage("hostOne", "serviceOne", 85.0);
        reportHeapUsage("hostOne", "serviceOne", 95.0);

        assertFaultFree();
    }

Alerting if an inferred GC reclaims less than a certain percentage of memory

We can keep hold of the last seen usage, and when a new value arrives, check if it is smaller. If it is, we can infer that a GC has occurred. If the amount of memory freed by that GC is too small, we could alert.

Closer still, but can still generate some false positives. A collection can be triggered even when percentage heap usage is low; at that point it might be impossible to reclaim the required amount.

    @Test
    public void punyAmountsFreedByGcAreNotFaultsIfEnoughHeapIsStillFree()
    {
        useHealthyHeapThresholdOf(90, Percent);

        reportHeapMax("hostOne", "serviceOne", 100.0);

        reportHeapUsage("hostOne", "serviceOne", 85.0);
        reportHeapUsage("hostOne", "serviceOne", 86.0);
        reportHeapUsage("hostOne", "serviceOne", 87.0);
        reportHeapUsage("hostOne", "serviceOne", 85.0);

        assertFaultFree();
    }

Our eventual solution

Alert if, after an inferred GC, more than x% of the heap is still in use.

Still not perfect, but just about good enough. Why? Well, we see multiple ‘heap.used’ metrics published across a GC cycle, some of which may still be above the warning usage threshold. Those intermediate values could cause false positives.

In practice, we found that we call ‘currentFaults’ infrequently enough to not be a problem. If we did, it would be the work of a couple of minutes to create a wrapper for our detector that only reported faults that had been present for multiple seconds.

    @Test
    public void afterAnInferredGcFaultIsReportedUntilHeapDropsBelowThreshold()
    {
        useHealthyHeapThresholdOf(90, Percent);

        reportHeapMax("hostOne", "serviceOne", 100.0);
        reportHeapUsage("hostOne", "serviceOne", 85.0);
        reportHeapUsage("hostOne", "serviceOne", 95.0);
        reportHeapUsage("hostOne", "serviceOne", 93.0);

        assertFaulty("hostOne", "serviceOne");

        reportHeapUsage("hostOne", "serviceOne", 90.1);

        assertFaulty("hostOne", "serviceOne");

        reportHeapUsage("hostOne", "serviceOne", 89.0);

        assertFaultFree();
    }

Bugs we totally missed

Some applications run on multiple hosts in the same environment

This one was a bit silly.

We run our applications in six or seven different ‘environments’ that we consider important enough to monitor. Each environment might have application ‘x’ deployed on hosts ‘h1’ and ‘h2’. We managed to not test this at all in our spike implementation and immediately generated large numbers of false positives. Whoops.

    @Test
    public void servicesOnDifferentHostsDoNotInterfereWithEachOther()
    {
        useHealthyHeapThresholdOf(90, Percent);

        reportHeapMax("hostOne", "serviceOne", 100.0);
        reportHeapMax("hostTwo", "serviceOne", 100.0);

        reportHeapUsage("hostOne", "serviceOne", 85.0);
        reportHeapUsage("hostOne", "serviceOne", 95.0);
        reportHeapUsage("hostTwo", "serviceOne", 93.0);

        assertFaultFree();
    }

More mysterious false positives

Having now written enough tests to convince ourselves that we weren’t totally inept, we launched the alerting app locally. We turned the warning threshold down to 10% to see if it spotted anything, and immediately saw some alerts. Clicking each alert (the ‘url’ feature of a fault is used to allow a user to see the data upon which the decision to alert was raised) led us to the following chart.

Looking for heap distress

The data underlying the chart looks like this:

epoch millis ,heap usage (bytes)
1494497832745,1.477746688E9
1494497833564,1.479913472E9
1494497833745,1.479913472E9
1494497834564,1.48424704E9
1494497834745,1.48424704E9
1494497835564,1.488580608E9
1494497835745,1.488580608E9
1494497836564,1.492914176E9
1494497836745,1.492914176E9
1494497837564,1.49508096E9
1494497837745,1.49508096E9
1494497838564,1.49508096E9
1494497838745,1.49508096E9
1494497839564,1.497247744E9
1494497839745,1.497247744E9
1494497840564,1.497247744E9

This was a bit odd. We added some judicious println at the point where we raised the alert, and found we had received the following:

Raising alert: heap dropped from 1.49508096E9 to 1.492914176E9

At this point, a light went on in my head, and we went back to log not only the values we were seeing, but also the timestamps of the monitoring events.

Raising alert: heap dropped from 1.49508096E9 (at 1494497837564) to 1.492914176E9 (at 1494497836745)

Aha. We appear to have gone back in time. How is this happening, though?

Well, we deliberately use UDP to transmit metrics from our applications to our monitoring. We’re OK with occasionally losing some metrics, we’re definitely not happy if being unable to send them prevents our app from actually running. This is a potentially controversial choice, let’s agree to not discuss it here :-).

When metrics arrive at the monitoring app, it has two jobs. One, to squirrel the data into a time series database on disk. Two, to transmit the data to appropriately subscribed websocket clients.

The chart and raw data above are both served, post hoc, via the on disk store. The websocket feed is not. The UDP packets of monitoring data were arriving out of order, and it was truthfully forwarding those metrics to our alerting in the same way.

We could probably fix that with some judiciously applied buffering, but given that each event we see is timestamped, we can simply drop events that are timestamped earlier than the last value we saw for a given service/host pair.

    @Test
    public void doNotReportFaultsBasedOnOutOfOrderData()
    {
        useHealthyHeapThresholdOf(90, Percent);

        reportHeapMax("hostOne", "serviceOne", 100.0);
        
        reportHeapUsage("hostOne", "serviceOne", 85.0);
        currentTime += 10;
        reportHeapUsage("hostOne", "serviceOne", 95.0);
        currentTime -= 5;
        reportHeapUsage("hostOne", "serviceOne", 93.0);

        assertFaultFree();
    }

Moral(s) of the story

Even the simplest thing needs more thinking about than you might normally expect! We got things wrong a lot, but it didn’t matter, because:

  • we had a very testable interface, so we could improve quickly
  • we could run the whole app locally to tell us what tests we missed

Any opinions, news, research, analyses, prices or other information ("information") contained on this Blog, constitutes marketing communication and it has not been prepared in accordance with legal requirements designed to promote the independence of investment research. Further, the information contained within this Blog does not contain (and should not be construed as containing) investment advice or an investment recommendation, or an offer of, or solicitation for, a transaction in any financial instrument. LMAX Exchange has not verified the accuracy or basis-in-fact of any claim or statement made by any third parties as comments for every Blog entry.

LMAX Exchange will not accept liability for any loss or damage, including without limitation to, any loss of profit, which may arise directly or indirectly from use of or reliance on such information. No representation or warranty is given as to the accuracy or completeness of the above information. While the produced information was obtained from sources deemed to be reliable, LMAX Exchange does not provide any guarantees about the reliability of such sources. Consequently any person acting on it does so entirely at his or her own risk. It is not a place to slander, use unacceptable language or to promote LMAX Exchange or any other FX, Spread Betting and CFD provider and any such postings, excessive or unjust comments and attacks will not be allowed and will be removed from the site immediately.

LMAX Exchange will clearly identify and mark any content it publishes or that is approved by LMAX Exchange.

FX and CFDs are leveraged products that can result in losses exceeding your deposit. They are not suitable for everyone so please ensure you fully understand the risks involved. The information on this website is not directed at residents of the United States of America, Australia (we will only deal with Australian clients who are "wholesale clients" as defined under the Corporations Act 2001), Canada (although we may deal with Canadian residents who meet the "Permitted Client" criteria), Singapore or any other jurisdiction where FX trading and/or CFD trading is restricted or prohibited by local laws or regulations.

LMAX Limited operates a multilateral trading facility. LMAX Limited is authorised and regulated by the Financial Conduct Authority (firm registration number 509778) and is a company registered in England and Wales (number 6505809). Our registered address is Yellow Building, 1A Nicholas Road, London, W11 4AN.

Sign up for Global FX Insights, the daily market commentary from LMAX Exchange

Thank you
for subscribing to the Global FX Insights newsletter

Thank you
you have already subscribed to the newsletter

Error
sorry there was a problem, please try again later

Your information will not be distributed or shared with third parties