Using Nagios to make things good

LMAX Exchange

At LMAX Exchange, Nagios is one of our essential tools for monitoring and verifying the operation of our systems. We use it for three distinct purposes:

  • Alerting when things break.
  • Recording trends so that we can predict when problems will occur and then mitigate them.
  • Verifying the overall structure of our environments.

Things have broken

Alerting on failures is perhaps the most common use of Nagios. These checks need to run often, perhaps every few seconds. Take a web server as an example; some of the tests we might want to run are:

  • Does the server respond on ports 80 and 443?
  • Is Apache running on the server?
  • Are all the network interfaces up?
  • Are all the fans working?
  • Has a disk failed?
  • Are there unexpected users logged in?

Some of these tests use the normal Nagios checks; others we write ourselves.
One feature these tests have in common is that they return a binary result: pass or fail.
Another is that we want to know as soon as they fail, so depending on the test we send an email or SMS alert.
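
As an illustration of such a pass/fail test, here is a minimal sketch of a custom check for the first item in the list above. It is not our production code; the function names are made up for this example. The only fixed convention it relies on is the Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a one-line status message on standard output.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports, probe=port_is_open):
    """Binary check: OK if every port answers, CRITICAL otherwise.

    `probe` is a hypothetical hook so the logic can be tried without a network.
    Returns (exit_code, message).
    """
    closed = [p for p in ports if not probe(host, p)]
    if closed:
        return CRITICAL, "CRITICAL - no response on port(s) " + ", ".join(map(str, closed))
    return OK, "OK - responding on port(s) " + ", ".join(map(str, ports))
```

A real plugin would print the message and exit with the code, for example `code, message = check_ports("localhost", [80, 443])` followed by `print(message)` and `sys.exit(code)`. Nagios then decides from the exit code whether to send the email or SMS.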

Things will break

Some tests check for trends, and here the graphing feature can be useful. We use these for resources that we can do something about, e.g. filesystem storage, memory, CPU and network utilisation.

For these checks we want to know before the resource becomes an issue. Typically we set a threshold and get an alert when it is reached; we then make a judgement call as to what to do, e.g. assign more disk space to the filesystem or more cores to the VM. Some trends require more planning, so for those we want to be alerted sooner.
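
A threshold check of this kind might look like the sketch below. The 80% and 90% defaults and the function names are assumptions for the example, not the values we use. The part of the message after the `|` follows the Nagios performance data format (`label=value[UOM];warn;crit`), which is what graphing tools typically read to draw the trend.

```python
import shutil

# Standard Nagios plugin exit codes used here.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(percent_used, warn=80.0, crit=90.0):
    """Map a utilisation percentage to an exit code using two thresholds."""
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def check_filesystem(path="/", warn=80.0, crit=90.0, usage=shutil.disk_usage):
    """Return (exit_code, message) for how full the filesystem at `path` is.

    `usage` is a hypothetical hook so the check can be tried with fixed numbers.
    """
    total, used, _free = usage(path)
    percent = 100.0 * used / total
    code = classify(percent, warn, crit)
    # Text after "|" is performance data: label=value[UOM];warn;crit
    message = (
        f"DISK {LABELS[code]} - {path} is {percent:.1f}% full"
        f" | used_pct={percent:.1f}%;{warn:g};{crit:g}"
    )
    return code, message
```

For a resource that takes longer to plan for, the same check can simply be given a lower `warn` value so that the alert arrives sooner.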

We might break things

The third kind of monitoring is for things that are incorrectly configured, might cause us annoyance in the future, or are not as we expect them to be. Some examples are the versions of certain firmware, which network card is plugged into which switch, or the size of a disk volume.

The faults that cause these tests to fail have normally happened because someone has changed something. We usually run most of these tests once a week, but if we make a change we trigger them manually straight away, so that any problem is found and fixed before the scheduled run raises an alert. They are also useful when a new system is built: we run the checks to verify that it has been built correctly, and then rectify the configuration until all tests are green. The same applies after work has been completed in a rack: we run all the tests to verify that nothing was disturbed. There will be a more in-depth blog post on this next month from Luke.
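
The idea behind such a structural check can be sketched as a comparison between what we expect and what is actually found. The setting names and values in the usage note are invented for illustration; how the actual values are collected (firmware tools, switch queries and so on) is left out.

```python
# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def find_mismatches(expected, actual):
    """Return one description per setting whose actual value differs."""
    problems = []
    for key in sorted(expected):
        found = actual.get(key, "<missing>")
        if found != expected[key]:
            problems.append(f"{key}: expected {expected[key]}, found {found}")
    return problems


def check_configuration(expected, actual):
    """Return (exit_code, message): OK only if every expected setting matches."""
    problems = find_mismatches(expected, actual)
    if problems:
        return CRITICAL, "CRITICAL - " + "; ".join(problems)
    return OK, f"OK - {len(expected)} setting(s) as expected"
```

For example, with a hypothetical `expected = {"nic_firmware": "1.2.3", "eth0_switch": "sw-a"}`, a host reporting `{"nic_firmware": "1.2.3", "eth0_switch": "sw-b"}` would fail with a message naming `eth0_switch`, which is the kind of result one would want after work in a rack.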

In Summary

Nagios can be used for much more than up/down polling. At LMAX Exchange, Nagios helps us maintain a consistent environment through trend analysis and configuration anomaly detection.
