We ran several tests to measure latency and throughput of ØMQ test executables built with different compilers and different optimisation levels. The goal was to find out whether compiler-level optimisation yields any measurable performance improvement for ØMQ applications.
The tests were performed on two identical boxes:
- P4 HT 3GHz
- 2GB RAM
- Debian GNU/Linux 4.0
- Kernel 22.214.171.124
- 1Gb Ethernet NIC (Broadcom BCM5751 1000Base-T PCI-Express)
- gcc version 4.1.2 20061115
- Intel C Compiler for applications, version 10.1, build 20080312
The boxes were connected through a Linksys SR2024 1000Base-T switch.
For measuring latency and throughput we used the perf framework.
Latency was measured using the raw latency scenario.
Throughput was measured using the raw density scenario.
Each test was compiled using gcc with the -O0, -O1, -O2, -O3 and -Os optimisation levels, and using icc with -O0, -O1, -O2, -O3 and profile-guided optimisation. When presenting the results we have chosen to display the best and the worst optimisation level for gcc and the best and the worst optimisation level for icc.
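The set of builds described above can be enumerated with a short script. This is only a sketch: the source file name `latency_test.c` and the output naming scheme are hypothetical, and profile-guided optimisation is left out because it needs a separate instrument, run and rebuild cycle.

```python
# Optimisation levels named in the text. Profile-guided optimisation for icc
# is deliberately omitted: it needs more than one compiler pass.
GCC_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
ICC_LEVELS = ["-O0", "-O1", "-O2", "-O3"]


def build_matrix(source="latency_test.c"):
    """Return one compiler command (as an argument list) per configuration.

    The source file name and the output naming scheme are hypothetical.
    """
    stem = source.rsplit(".", 1)[0]
    commands = []
    for compiler, levels in (("gcc", GCC_LEVELS), ("icc", ICC_LEVELS)):
        for level in levels:
            output = f"{stem}.{compiler}{level}"  # e.g. latency_test.gcc-O2
            commands.append([compiler, level, "-o", output, source])
    return commands


if __name__ == "__main__":
    for command in build_matrix():
        print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` to perform the actual build, after which the resulting binaries would be benchmarked one by one.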
The following graph shows the latency measured for messages of different sizes:
As can be seen, a considerable latency improvement is achieved only for messages shorter than 16 bytes. For larger messages the improvement is negligible (at most 5 µs). Given that messages shorter than 16 bytes are rare in practice, we can conclude that the impact of compiler and optimisation level on latency is quite low.
This graph shows throughput in megabits per second for messages of different sizes:
From a message size of 256 bytes upwards, optimisation becomes irrelevant because the bottleneck is the 1Gb Ethernet rather than CPU power. Throughput for very small messages (a few bytes) is too low for any differences to be visible on the graph. For message sizes of 16-128 bytes, however, optimisation does seem to have an effect. Specifically, for 64-byte messages the throughput of the best configuration (icc with -O2) is 34.4% higher than that of the worst configuration (gcc without optimisation): 732.6Mb/s vs. 544.9Mb/s.
This result is particularly interesting for market data distribution, as the average size of a stock quote falls precisely into the 16-128 byte range. Increasing throughput by almost 35% just by choosing the right compiler and the right optimisation level is a possibility that shouldn't be ignored.
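The 34.4% figure can be reproduced from the two throughput values quoted above. The helper name `relative_improvement` is ours, introduced purely for illustration:

```python
def relative_improvement(best, worst):
    """Improvement of `best` over `worst`, as a percentage of `worst`."""
    return (best - worst) / worst * 100.0


# Throughput in Mb/s for 64-byte messages, taken from the text:
# icc -O2 (best) against gcc without optimisation (worst).
print(f"{relative_improvement(732.6, 544.9):.1f}%")  # prints 34.4%
```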
To see the impact of optimisation on small messages, we should look at throughput in messages per second rather than in megabits per second. Have a look at the following graph:
So, for example, throughput for 32-byte messages (think of a FIX/FAST feed) increases from 1.33 million messages a second (gcc with no optimisation) to 1.76 million messages a second (icc with -O2), an improvement of some 32%.
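To relate the two throughput metrics, a message rate can be converted into a payload bit rate. The sketch below assumes that only payload bytes are counted and that 1Mb means 10^6 bits; the function name is illustrative.

```python
def payload_mbps(messages_per_second, message_size_bytes):
    """Payload throughput in megabits per second (1 Mb = 10**6 bits).

    Counts message payload only; protocol headers are ignored.
    """
    return messages_per_second * message_size_bytes * 8 / 1e6


# 32-byte messages at the two rates quoted in the text.
for rate in (1.33e6, 1.76e6):
    print(f"{rate:.0f} msgs/s -> {payload_mbps(rate, 32):.2f} Mb/s")
```

Under these assumptions the two rates correspond to roughly 340Mb/s and 450Mb/s, so neither configuration comes close to saturating the link at this message size.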
Throughput measured in megabits per second makes performance differences for large messages more visible than those for small messages, while throughput measured in messages per second favours small messages in the opposite way, so it is worth looking at a more neutral metric as well. The following graph shows density, the average time needed to process a single message.
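To get a feel for the numbers, density can be approximated as the reciprocal of the sustained message rate. This is an assumption made for illustration; the perf framework may derive the value differently.

```python
def density_us(messages_per_second):
    """Average time per message in microseconds.

    Assumes density is simply 1 / message rate, which may not match
    how the perf framework computes it.
    """
    return 1e6 / messages_per_second


# Message rates for 32-byte messages quoted earlier in the text.
for label, rate in (("gcc, no optimisation", 1.33e6), ("icc -O2", 1.76e6)):
    print(f"{label}: {density_us(rate):.2f} us per message")
```

With the 32-byte figures this gives about 0.75 µs per message for the unoptimised gcc build and about 0.57 µs for icc with -O2.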
The results show that for messages larger than 128 bytes, compiler optimisation provides no significant benefit, because the 1Gb Ethernet becomes the bottleneck. It would be interesting to repeat the test on 10Gb Ethernet.
As for latency, optimisation can improve it by about 5 microseconds. That is not much, but it may still matter for ultra-low-latency applications.
As for throughput, the largest improvement is seen for messages of up to 128 bytes. For reasonably sized messages (stock quotes) the improvement can reach almost 35%.
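The network ceiling can be estimated with simple arithmetic. The sketch below assumes a nominal line rate of 10^9 bits per second and, by default, ignores Ethernet, IP and transport headers, so the real ceiling is lower. The `overhead_bytes` parameter is a hypothetical knob for accounting for that overhead.

```python
LINK_BPS = 1_000_000_000  # nominal 1Gb Ethernet line rate, bits per second


def max_messages_per_second(size_bytes, overhead_bytes=0):
    """Upper bound on message rate imposed by the link alone.

    `overhead_bytes` is a hypothetical per-message allowance for protocol
    headers; with the default of 0 only payload is counted.
    """
    return LINK_BPS / ((size_bytes + overhead_bytes) * 8)


for size in (64, 128, 256, 1024):
    print(f"{size:5d} bytes: at most {max_messages_per_second(size):,.0f} msgs/s")
```

Once a build can produce messages faster than this bound for a given size, further CPU-side optimisation cannot raise the throughput seen on the wire.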