LMAX Exchange blog - FX industry thought leadership

All the latest business and technology views and insights on the FX industry from LMAX Exchange management and staff

Test Driven Infrastructure – Validating Layer 1 Networking with Nagios

Previously we’ve talked about how we use Nagios / Icinga for three broad types of monitoring at LMAX Exchange: alerting, metrics, and validation. The difference between our definitions of alerting and validation is a fine one; it has more to do with the importance of the state of the thing we are checking and the frequency with which we check it. An example of what I consider an “Alert” is whether or not Apache is running on a web server. The version of Apache, however, is something I might “Validate” with Nagios, but I wouldn’t bother checking it every few minutes, and if there was a discrepancy I wouldn’t react as fast as if the entire Apache service was down. It’s a loose distinction, but a distinction nonetheless.

The vast majority of our network infrastructure is implemented physically in a data centre by a human being. Someone has to go and plug in all those cables, and there’s usually some form of symmetry, uniformity and standard to how we patch things that gives Engineers like me warm fuzzy feelings. Over many years of building our Exchange platforms we’ve found that going back to correct physical work costs a lot of time, so we like to get it right the first time or be told very quickly if something is not where it’s expected to be. Thus enters our Test Driven Networking Infrastructure: our approach uses Nagios / Icinga as the validation tool, Puppet as the configuration and deployment engine, LLDP as the protocol on top of which everything runs, and Patch Manager as the source of truth.

Validating Network Patching

I’ve written about our Networking Puppet module before and how we use it to separate our logical network design from its physical implementation. The same Puppet Networking module also defines the monitoring and validation for our network interfaces. Specifically, this is defined inside the Puppet class networking::monitoring::interface, which has a hard dependency on the LMAX Exchange internal Nagios module, which unfortunately is not Open Source at this time (and would need one long blog post of its own to explain).

So since you can’t see the code I’ll skip over all the implementation and go straight to the result. Here is what our Puppet Networking module gives us in terms of alerts:

[Screenshot: the Nagios alerts defined by the Puppet Networking module]

Pretty self-explanatory. Here’s the end result of our networking infrastructure validation, with server names and switch names obfuscated:

[Screenshot: passing network validation checks, with server and switch names obfuscated]

However, a green “everything-is-ok” screenshot is probably not a helpful example of why this is so useful, so here are some examples of failing checks from our build and test environments:

[Screenshots: failing validation checks from the build and test environments]

To summarise the above, our validation fails when:

  • we think an interface should be patched somewhere but it’s not up or configured
  • an interface is patched in to something other than what it should be
  • an interface is up (and maybe patched in to something) but not in our source of truth

Next I’ll describe how the Nagios check works. Combined with a specific provisioning process which I describe below, the above checks give us Test Driven Infrastructure that helps us quickly correct physical patching errors.

How The Nagios Check Actually Works

The idea behind the check is for the Nagios server to first retrieve what the server says the LLDP neighbour of each interface is, then compare this with its own source of truth and raise an appropriate OK, WARNING or CRITICAL check result.
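
To make the comparison concrete, here is a minimal Python sketch of that decision. It is not the real check script: the function name, the tuple shape for a neighbour, and above all the choice of which failure maps to WARNING and which to CRITICAL are assumptions made for illustration. Only the numeric exit codes are the standard Nagios plugin ones.

```python
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# (remote device name, remote port) -- a hypothetical shape for this sketch.
Neighbour = Tuple[str, str]


def validate_interface(
    name: str,
    link_up: bool,
    observed: Optional[Neighbour],
    expected: Optional[Neighbour],
) -> Tuple[int, str]:
    """Compare what LLDP reports for one interface with the source of truth.

    The severity chosen for each failure is an assumption, not taken
    from the real check.
    """
    if expected is None:
        if link_up:
            # Up (and maybe patched) but unknown to the source of truth.
            return WARNING, f"{name} is up but not in the source of truth"
        return OK, f"{name} is down and not expected to be patched"

    if not link_up or observed is None:
        # We think it should be patched somewhere, but it is not up.
        return CRITICAL, f"{name} should be patched to {expected} but has no neighbour"

    if observed != expected:
        # Patched in to something other than what it should be.
        return CRITICAL, f"{name} is patched to {observed}, expected {expected}"

    return OK, f"{name} is patched to {expected}"
```

Calling validate_interface("em1", True, ("switch02", "1/31"), ("switch01", "1/31")) returns the CRITICAL code along with a message naming both the observed and the expected neighbour, which is the kind of output an engineer in the data centre needs to fix the patch.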

Nagios knows what interfaces to check for because Puppet describes every interface to monitor. Nagios makes an SNMP call to the server, getting back CSV output that looks like this:

em1,yes,switch01,1/31,10,Brocade ICX6450-48
em2,yes,switch02,1/31,10,Brocade ICX6450-48

The fields are:

  1. interface name
  2. link
  3. remote LLDP device name
  4. remote LLDP device port
  5. VLAN
  6. remote LLDP device model
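
As an illustration of how the Nagios side might consume that output, here is a small Python sketch that turns the CSV into named records. The LldpRow type and parse_lldp_csv function are hypothetical names, and treating any link value other than “yes” as down is an assumption; the field order is the one listed above.

```python
import csv
from typing import List, NamedTuple


class LldpRow(NamedTuple):
    interface: str
    link: bool
    remote_device: str
    remote_port: str
    vlan: str          # kept as text, since it may be empty
    remote_model: str


def parse_lldp_csv(text: str) -> List[LldpRow]:
    """Parse the six-field CSV returned by the server into LldpRow records."""
    rows = []
    for fields in csv.reader(text.splitlines()):
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        iface, link, device, port, vlan, model = (f.strip() for f in fields)
        # Assumption: anything other than "yes" means the link is down.
        rows.append(LldpRow(iface, link.lower() == "yes", device, port, vlan, model))
    return rows


sample = (
    "em1,yes,switch01,1/31,10,Brocade ICX6450-48\n"
    "em2,yes,switch02,1/31,10,Brocade ICX6450-48\n"
)
for row in parse_lldp_csv(sample):
    print(row.interface, row.remote_device, row.remote_port)
```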

A version of this script is up on GitHub here. It contains a lot of conditional logic to handle the LLDP information for different vendor hardware. For example, certain Brocade switches don’t mention the word “Brocade”, so we infer the vendor from the MAC address. Different switches also use different fields for the same information, and the script parses the right field based on the model of the remote side, e.g. Brocades and Linux kernels put the Port ID in the “descr” field but other devices put it in the “id” field.
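
The field selection can be pictured roughly as follows. This Python sketch is not the script on GitHub: the dictionary shape for a neighbour, the remote_port function name and the substring matching on the model are all assumptions; only the “descr” versus “id” rule comes from the description above.

```python
def remote_port(neighbour: dict, model: str) -> str:
    """Pick the field that holds the remote Port ID for this kind of device.

    `neighbour` is a hypothetical dict of the LLDP port fields reported
    for one interface, keyed by "descr" and "id".
    """
    # Assumption: the vendor can be spotted as a substring of the model.
    uses_descr = any(marker in model.lower() for marker in ("brocade", "linux"))
    field = "descr" if uses_descr else "id"
    return neighbour.get(field, "")
```

Keeping the vendor quirks in one small function like this means the comparison against the source of truth never has to know which vendor is on the other end of the cable.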

The Nagios check cross-references this data against its own records, the “source of truth” file, which looks like this:

The Nagios check script has some smarts built in to handle logical implementations that don’t model well in Patch Manager. One of the complexities is stacked switches. The LLDP information from the server will describe a stacked switch port as something like “3/0/10”, where 3 is the Stack ID. In Patch Manager it would get confusing if we labelled every device in a stack the same, so instead we name them switch1-3 where the “-3” indicates the stack number. The Nagios script looks for and parses this as Stack ID.
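
A sketch of that stack handling in Python might look like the following. The function names are hypothetical, and two things are assumed rather than taken from the real script: that LLDP reports the stack under the base name (“switch1”), and that a three-part port such as “3/0/10” is what marks a stacked switch.

```python
import re
from typing import Optional, Tuple

# Matches a trailing "-<number>", e.g. "switch1-3". Caveat: a hostname that
# legitimately ends in "-<digits>" would be misread as a stack member.
_STACK_SUFFIX = re.compile(r"^(?P<base>.+)-(?P<stack>\d+)$")


def split_stack_name(device: str) -> Tuple[str, Optional[int]]:
    """Split a Patch Manager name like "switch1-3" into ("switch1", 3)."""
    match = _STACK_SUFFIX.match(device)
    if match is None:
        return device, None
    return match.group("base"), int(match.group("stack"))


def lldp_stack_id(port: str) -> Optional[int]:
    """Return the Stack ID from a port like "3/0/10", or None if not stacked."""
    parts = port.split("/")
    if len(parts) == 3 and parts[0].isdigit():
        return int(parts[0])
    return None


def same_switch(lldp_device: str, lldp_port: str, planned_device: str) -> bool:
    """True if the LLDP neighbour is the device named in the source of truth."""
    base, stack = split_stack_name(planned_device)
    if stack is None:
        return lldp_device == planned_device
    return lldp_device == base and lldp_stack_id(lldp_port) == stack
```

With these assumptions, a server reporting neighbour “switch1” on port “3/0/10” matches the Patch Manager device “switch1-3”, while the same neighbour on port “2/0/10” does not.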

Our TDI Workflow

The Nagios checks are the critical part of a much larger workflow which gives us Test Driven Infrastructure when we provision new machines. The workflow roughly follows the steps below, and I go into each step in more detail in the following sections:

  1. Physical design is done in Patch Manager, including placement in the rack and patching connections
  2. Connections are exported from Patch Manager into a format that our Nagios servers can parse easily
  3. Logical design is done in Puppet – Roles are assigned and necessary data is put in Hiera
  4. Hardware is physically racked and the management patches are put in first
  5. Server is kickstarted and does its first Puppet run, Nagios updates itself and begins to run checks against the new server
  6. Engineers use the Nagios checks as their test results, fixing any issues

As you might have deduced already the workflow is not perfectly optimised; the “tests” (Nagios checks) come from Puppet, so you need a machine to be installed before you get any test output. Also we need at least some patching done in order to kickstart the servers before we can get feedback on any of the other patching.

Physical Design in Patch Manager

We use Patch Manager’s Software-As-A-Service solution to model our physical infrastructure in our data centres. It is our source of truth for what’s in our racks and what connections are between devices. Here’s an example of a connection (well, two connections really) going from Gb1 in a server, through a top of rack patch panel, and into a switch:
[Screenshot: a Patch Manager connection from Gb1 on a server, through a top of rack patch panel, into a switch]

Exporting Patch Manager Connections

Having all our Nagios servers continually reach out to the Patch Manager API in order to search for connections is wasteful, considering that day to day the data in Patch Manager doesn’t change much. Instead we export the connections from Patch Manager and at the same time filter out any intermediate patch panels or devices we don’t care about, because we only want to know about both ends of the connection. Each Nagios server has a copy of the “patchplan.txt” file, which is an easy to parse CSV that looks like this:


Logical Design In Puppet

As part of creating the new server in Puppet, the networking configuration is defined and modelled in line with what has been planned in Patch Manager. So for example if a Dell server has its first two on board NICs connected to management switches in Patch Manager, somewhere in Puppet a bonded interface will be defined with NICs em1 and em2 as slaves (which are the default on board NIC names on a Dell Server).
How we model our logical network design in Puppet is covered in much more detail here.

Hardware is Physically Racked

Obviously someone needs to go to the data centre and rack the hardware. If it’s a large build it can take several days, or weeks if the time we can work in the data centre is restricted (to weekends only, for example). We try to prioritise the patching for management first so we’re able to kickstart machines as quickly as possible.

Kickstarts and Puppet Runs

Once a new server has done its first Puppet run and its catalog is compiled, a set of Exported Puppet Resources that describe Nagios checks for this server are available for collection. The Puppet runs on our Nagios servers will collect all these resources, turn them into the relevant Nagios configuration files and begin running these service checks.

Make the Red and Yellow go Green

Since this is a newly built server it’s expected that a lot of the validation style Nagios checks will fail, especially if only the management networks are patched but our Puppet code and Patch Manager are expecting other NICs to be connected. Our engineers use the Nagios check results for the new server as the feedback for our Test Driven Infrastructure approach to provisioning new servers: make the tests pass (make the red and yellow go green) and the server is ready for production.
