Adding Latency and Limiting Bandwidth

Some aspects of Linux have a reputation for being hard. Traffic
control via queueing disciplines for bandwidth management, for example.
Even the title is enough to strike fear into the heart of a seasoned
system admin.

Which is a pity really, as the things outlined in chapter 9
of the LARTC HOWTO are very useful in practice.  The problem is that the
documentation is very descriptive – which is good once you know roughly
what you’re doing – but it has quite a steep learning curve if you
don’t. In fact it’s pretty vertical if you don’t already know quite a
lot about networking. A few more worked examples would help, over and
above those in the cookbook.

Instead, like most people in a rush, I have relied on bashing together
snippets of code from random blogs to make /sbin/tc do what I want, without really understanding what is going on.

This time, when presented with a problem for which this is the exact
tool, I found I needed to dive deeper, and actually understand it, as
none of the precanned recipes worked. It was a case of “if all else
fails try the manual”.

So, now that I think I’ve got a vague handle on what is going on, I’m
documenting what I ended up doing, because I’m sure I will need a worked
example when I come back to this in the future. If it’s useful to you
too, so much the better.

The Problem

We need to test the loading speed of our web page and trading
platform under a set of network conditions that approximate the
following;

  1. Local LAN, unrestricted
  2. “Europe”, 20ms round trip latency, limit of 512kbit/sec in and out
  3. “SE Asia”, 330ms round trip latency, limit of 128kbit/sec in and out.

In practice that’s quite generous, particularly in the case of the
South East Asia profile. There was no way I was getting 128kbit on the
wifi in Shakey’s on Rizal Boulevard in Dumaguete earlier this month,
and that was better than the hotel wifi.

The Solution

Background

We have Selenium running the tests via WebDriver/RemoteDriver against two
Windows virtual machines, one running Chrome and one running IE. They
run on a Linux host system, and can see a load balancer behind which lies
one of our performance test environments. We need to add latency and
bandwidth restrictions to their connections, effectively putting them
into each of the traffic classes above depending on which test our CI
system asks them to run.

The load balancer has been set up with three virtual servers, all listening on the same IP address but different ports.

  1. Local:  9090
  2. Europe: 9092
  3. SE Asia: 9091

Each virtual server has the same webserver pool behind it, so they’re
all identical from the point of view of the load balancer, but we’ll use
the destination port to switch the traffic between the different sets
of latency and bandwidth restriction we need to simulate the different
customer locations.

The Linux virtual machine host has the guests’ vnet network devices
attached to a bridge. In turn the bridge is attached to the network via
a bonded interface, in our case bond0.30.

To make this work for both machines, we’ll apply the traffic management on the bond0.30 side of the bridge.

ASCII art diagram of that;

    IE Windows VM ----- vnet0 \                             / eth0
                               host bridge 30 -- bond0.30 -<
    Chrome Windows VM - vnet1 /                             \ eth1

Qdiscs and Classes

There are three creatures we’re dealing with here;

  • qdisc – a Queueing Discipline. These are the active things we’re
    going to use to control how the traffic is managed. qdiscs can be
    classless or classful; we’re going to use a classful qdisc called htb
    (Hierarchy Token Bucket).
  • classes – we’ll use these to separate the traffic into its constituent flows and to apply different constraints to each flow.
  • filters – much as with iptables, these allow us to specify which traffic ends up in which class.

Chapter 9 says that you can only shape transmitted traffic. That is
not 100% accurate, as we can do things to inbound traffic too, but our
options there are very limited.

So, looking at the default qdiscs, classes and filters;

[root@vm01 ~]# tc -s qdisc show dev bond0.30     
qdisc pfifo_fast 0: bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 47844819829 bytes 140593932 pkt (dropped 0, overlimits 0 requeues 22)
 rate 0bit 0pps backlog 0b 0p requeues 22
[root@vm01 ~]# tc -s class show dev bond0.30
[root@vm01 ~]# tc -s filter show dev bond0.30
[root@vm01 ~]#

The “-s” option shows the statistics. So, by default, we have a queue discipline called pfifo_fast, which just passes traffic straight through.

Each device has a default root which we use to build upon. We can
also attach handles to classes and qdiscs to allow us to relate each
part to the others and build up chains to process the packet stream.
“root” is shorthand for a handle of 1:0, or the top of the tree. 

One of the most useful pages I found is here; http://luxik.cdi.cz/~devik/qos/htb/manual/userg.htm

Things worth repeating from that link are;

  • the tc tool (not only HTB) uses shortcuts to denote units of rate: kbps means kilobytes per second and kbit means kilobits per second
  • Note: In general (not just for HTB but for all qdiscs and
    classes in tc), handles are written x:y where x is an integer
    identifying a qdisc and y is an integer identifying a class belonging to
    that qdisc. The handle for a qdisc must have zero for its y value and
    the handle for a class must have a non-zero value for its y value. The
    “1:” above is treated as “1:0”

The whole page is worth reading carefully.
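
As a quick crib of those two points, here is how the notation reads in the commands that follow (illustrative comments only);

# handle notation:
#   "1:"   is shorthand for "1:0"  - a qdisc handle (the y part is zero)
#   "1:10"                         - a class belonging to qdisc 1:
#   "11:"  is shorthand for "11:0" - another qdisc, hung further down the tree
#
# rate units:
#   rate 512kbit  - 512 kilobits per second
#   rate 512kbps  - 512 kilobytes per second, i.e. eight times as much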

The Design

[Diagram: the tree of qdiscs, classes and filters used in this design]

The pentagons are filters, the circles represent qdiscs, and the
rectangles are classes. One important point is that this diagram in no
way implies flow. That is hard to grasp at first, and I had problems
understanding the comments in section 9.5.2.1 “How filters are used to
classify traffic” – particularly;

“You should *not* imagine the kernel to be at the apex of
the tree and the network below, that is just not the case. Packets get
enqueued and dequeued at the root qdisc, which is the only thing the
kernel talks to.”


The way I squared it in the end was to think of the diagram as an order of application for traffic flowing through the root qdisc.

So in the above we have the root qdisc, which is an instance of the
HTB qdisc. From it hang the classes we set up to handle the
three different classes of traffic. We use htb to limit the outbound
bandwidth of each class (1:10, 1:11, 1:12). When we define the
root qdisc we specify that class 1:10 will be the default class for the
bulk of the traffic, which we don’t want to delay.

Setting up the root qdisc;

INTERFACE=bond0.30
tc qdisc add dev $INTERFACE root handle 1:0 htb default 10

“root” is a synonym for handle 1:0.  $INTERFACE is defined
in the shell script to make porting from machine to machine easier.
This installs the htb qdisc on the root of our bond interface, and
tells it that by default all traffic should be put into the class
1:10.

Now we add a class for each type of traffic, along with the
bandwidth limit we want to enforce on it.

# default class
tc class add dev $INTERFACE parent 1:0 classid 1:10 htb rate 1024mbit

# "europe" traffic class - outbound bandwidth limit
tc class add dev $INTERFACE parent 1:0 classid 1:11 htb rate 512kbit

# "se asia" traffic class - outbound bandwidth limit
tc class add dev $INTERFACE parent 1:0 classid 1:12 htb rate 128kbit

 We now attach the network emulator qdisc, netem, which we will use to introduce latency into each of the classes;

# network emulation - add latency.
tc qdisc add dev $INTERFACE parent 1:11 handle 11:0 netem delay 20ms 5ms 25% distribution normal
tc qdisc add dev $INTERFACE parent 1:12 handle 12:0 netem delay 330ms 10ms 25% distribution normal

This attaches the emulator instances to their parent classes, with
handles whose major number (11, 12) matches the parent class’s minor
number, for ease of tracing. The netem parameters break down as follows.

  • delay 20ms – the base added delay; pretty self-explanatory.
  • 5ms – jitter on that latency, to give a bit of variation.
  • 25% – how much the variation in each packet’s latency depends on that of its predecessor.
  • distribution normal – how the variation is distributed.

The netem module is described completely here:  http://www.linuxfoundation.org/collaborate/workgroups/networking/netem

One thing that could be improved here is that we’re adding all
the latency on the outbound leg. Ideally we’d add 165ms on the way
out and 165ms on the way back for the SE Asia traffic (and 10ms each
way for the EU traffic). Doing that means applying latency to the
transmit side of the interfaces carrying each direction: in our case,
165ms on both of the vnet interfaces as well as on the bond0.30
interface. However that is tricky to do simply, as the virtual machine
interface names may change when the guests are rebooted. Done this way
we end up with much the same result for far less faffing about.
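
For reference, and purely as a hypothetical sketch of that rejected approach, splitting the SE Asia delay would look roughly like this (note it would also delay all traffic heading into the guests, not just the SE Asia class);

# NOT what we actually do - delay the return leg on each guest's vnet device...
tc qdisc add dev vnet0 root netem delay 165ms
tc qdisc add dev vnet1 root netem delay 165ms
# ...and halve the delay on the bond0.30 netem for class 1:12 to 165ms to match.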

Now all we need to do is add the filters that classify the packets into their classes;

SEASIAIP=172.16.10.10
SEASIAPORT=9091
EUIP=172.16.10.10
EUPORT=9092

# filter packets into appropriate traffic classes.
tc filter add dev $INTERFACE protocol ip parent 1:0 prio 1 \
  u32 match ip dst $SEASIAIP match ip dport $SEASIAPORT 0xffff flowid 1:12
tc filter add dev $INTERFACE protocol ip parent 1:0 prio 1 \
  u32 match ip dst $EUIP match ip dport $EUPORT 0xffff flowid 1:11

The action is mainly in the second line of each command, where we
match the target IP of the load balancer and the ports we’ve set up. The
flowid is the handle of the class the matched traffic should be placed
in. We don’t need to set up a filter for the “normal” traffic, as it is
covered by the “default 10” part of the original htb root qdisc declaration.
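
To confirm the filters and classes are in place, and to watch packets being counted into the right class as tests run, the same show commands we used earlier do the job;

# inspect what is now attached to the interface.
tc -s qdisc show dev $INTERFACE
tc -s class show dev $INTERFACE
tc -s filter show dev $INTERFACE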

And that takes care of the outbound traffic shaping and latency.

We now need to handle inbound.

For this we use the special ingress qdisc. There’s very little we can
actually do with it: it has no classes, and all you can really
do is attach a filter to it. Usefully, we can use the “police” keyword
in that filter to restrict the inbound flow by dropping packets. It’s
not exact, but it’s good enough for our purposes.

# inbound qdisc.
tc qdisc add dev $INTERFACE handle ffff: ingress

# attach a policer for "se asia" class.
tc filter add dev $INTERFACE protocol ip parent ffff: prio 1 \
  u32 match ip src $SEASIAIP match ip sport $SEASIAPORT 0xffff \
  police rate 128kbit burst 10k drop flowid :1

# attach a policer for "europe" traffic class.
tc filter add dev $INTERFACE protocol ip parent ffff: prio 1 \
  u32 match ip src $EUIP match ip sport $EUPORT 0xffff \
  police rate 512kbit burst 10k drop flowid :2

The handle ffff: is a synonym for the inbound traffic root. All you
can do is attach the ingress qdisc to it as shown. To be frank I’ve not
dived into exactly how the burst keyword affects things. Essentially
these filter rules are the same as the ones we used on the outbound
side, except that we now match the source IPs and ports rather than the
destination ones. Then, rather than steering packets into a class with
flowid, we use police to instruct the kernel to drop packets from each
of our load balancer ports if they exceed the stated rates.
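
At this point a quick sanity check is worth doing, from the Linux host or one of the guests (anything whose traffic to the load balancer crosses bond0.30). A rough sketch, assuming curl is available and using the addresses above;

# time a request through each virtual server - the shaped ports should show
# the extra round trip latency and the capped download rate.
time curl -so /dev/null http://172.16.10.10:9090/   # local, unshaped baseline
time curl -so /dev/null http://172.16.10.10:9092/   # "europe"
time curl -so /dev/null http://172.16.10.10:9091/   # "se asia"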

Cleanup

To clean up after all of this, it’s sufficient to remove the root
and ingress qdiscs. Removing the top of the tree removes all the other
configuration.

# remove any existing ingress qdisc.
tc qdisc del dev $INTERFACE ingress
# remove any existing egress qdiscs
tc qdisc del dev $INTERFACE root

That also cleans up all the classes and filters.
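
If the cleanup may run before anything has been set up (as it will in an init script’s stop action), it’s worth silencing the errors tc prints when asked to delete something that isn’t there. A small tweak along these lines;

# tolerate a clean interface - ignore errors if the qdiscs don't exist yet.
tc qdisc del dev $INTERFACE ingress 2>/dev/null || true
tc qdisc del dev $INTERFACE root 2>/dev/null || true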

Conclusion

There’s an init script that encapsulates all of the above which can be downloaded from here.

[root@vm01 ~]# chkconfig latency on
[root@vm01 ~]# /etc/init.d/latency      
Usage: /etc/init.d/latency {start|stop|restart|condrestart|status}
[root@vm01 ~]# /etc/init.d/latency start
[root@vm01 ~]# /etc/init.d/latency stop
[root@vm01 ~]# /etc/init.d/latency status
 Active Queue Disciplines for bond0.30

 Active Queueing Classes for bond0.30

 Active Traffic Control Filters for bond0.30
[root@vm01 ~]#
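
For the curious, here is a minimal sketch of the shape of such a script, built purely from the commands above (hypothetical, and rather less polished than the real, downloadable one);

#!/bin/bash
# /etc/init.d/latency - traffic shaping for the performance test hosts (sketch).
INTERFACE=bond0.30
SEASIAIP=172.16.10.10
SEASIAPORT=9091
EUIP=172.16.10.10
EUPORT=9092

start() {
    # egress: htb classes for bandwidth, netem for latency, filters to classify.
    tc qdisc add dev $INTERFACE root handle 1:0 htb default 10
    tc class add dev $INTERFACE parent 1:0 classid 1:10 htb rate 1024mbit
    tc class add dev $INTERFACE parent 1:0 classid 1:11 htb rate 512kbit
    tc class add dev $INTERFACE parent 1:0 classid 1:12 htb rate 128kbit
    tc qdisc add dev $INTERFACE parent 1:11 handle 11:0 netem delay 20ms 5ms 25% distribution normal
    tc qdisc add dev $INTERFACE parent 1:12 handle 12:0 netem delay 330ms 10ms 25% distribution normal
    tc filter add dev $INTERFACE protocol ip parent 1:0 prio 1 \
        u32 match ip dst $SEASIAIP match ip dport $SEASIAPORT 0xffff flowid 1:12
    tc filter add dev $INTERFACE protocol ip parent 1:0 prio 1 \
        u32 match ip dst $EUIP match ip dport $EUPORT 0xffff flowid 1:11
    # ingress: police the inbound rate per load balancer port.
    tc qdisc add dev $INTERFACE handle ffff: ingress
    tc filter add dev $INTERFACE protocol ip parent ffff: prio 1 \
        u32 match ip src $SEASIAIP match ip sport $SEASIAPORT 0xffff \
        police rate 128kbit burst 10k drop flowid :1
    tc filter add dev $INTERFACE protocol ip parent ffff: prio 1 \
        u32 match ip src $EUIP match ip sport $EUPORT 0xffff \
        police rate 512kbit burst 10k drop flowid :2
}

stop() {
    tc qdisc del dev $INTERFACE ingress 2>/dev/null || true
    tc qdisc del dev $INTERFACE root 2>/dev/null || true
}

status() {
    echo " Active Queue Disciplines for $INTERFACE"
    tc -s qdisc show dev $INTERFACE
    echo " Active Queueing Classes for $INTERFACE"
    tc -s class show dev $INTERFACE
    echo " Active Traffic Control Filters for $INTERFACE"
    tc -s filter show dev $INTERFACE
}

case "$1" in
    start)               start ;;
    stop)                stop ;;
    restart|condrestart) stop; start ;;
    status)              status ;;
    *)                   echo "Usage: $0 {start|stop|restart|condrestart|status}"; exit 1 ;;
esac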

And that’s it.

This mainly suits a static configuration, as is the case with our
load balancer and continuous integration environment. However, for web
development use this approach lacks flexibility, particularly if you
don’t have root access. For our developers I looked at ipdelay but
eventually settled on Charles, which was adequate for our purposes.

 HTH.
