The goal of these tests is to give users an overall impression of the performance characteristics of ØMQ/0.3 in terms of latency, throughput, scalability etc. They can also be thought of as a check that the new version of the software hasn't lost the performance levels offered by the preceding versions.
The testing environment consisted of two identical Linux boxes, each with two quad-core processors (Intel E5450, 3 GHz) and 32 GB of memory. They were interconnected over 1Gb Ethernet using Broadcom NetXtreme II BCM5708 1000Base-SX network interface cards, with a 1Gb Ethernet switch in the middle.
The end-to-end latency we've measured for 1-byte messages is 40 microseconds. Of that, some 25 microseconds are attributable to the networking stack and 15 microseconds to ØMQ itself. With a 10GbE network exhibiting a TCP latency of 15 microseconds, we would expect an overall latency of 30 microseconds. We are also working on delivering ØMQ on alternative networking stacks - bypassing the OS kernel and speaking directly to the hardware - which should bring the latency under 20 microseconds.
It is often desirable to know how latency behaves at different message throughputs. Is it substantially higher when a million messages a second are passed, as opposed to 50,000 messages a second? To answer this, the thr-lat scenario of the perf performance measuring framework was used. The following graph shows the relationship for messages 6 bytes long:
The graph ends at 1.4M messages a second. At higher message rates the time to process a single message is so low (less than 714 nanoseconds) that the time measurement in the test is no longer negligible; rather, it becomes as significant a throughput bottleneck as ØMQ itself. The results for higher message rates would thus be heavily distorted, so we opted not to include them.
The smooth black line shows the maximal possible capacity of 1Gb Ethernet. It can be seen that ØMQ's throughput is mostly limited by network bandwidth, except for very small messages where the CPU power limit kicks in. In those cases, the maximal throughput varies between 3 and 3.5 million messages a second.
Throughput for raw TCP/IP is shown on the graph (black points) as well, to give the reader something to compare the results against.
An alternative way to present the throughput results is what we call density. This metric specifies how much time is needed to process a single message:
This graph makes the two main limiting factors of any messaging system clearly visible. For small messages, CPU power is the bottleneck: for messages from 1 up to 32 bytes long, the time to process them doesn't vary with their size. For large messages, network bandwidth becomes the bottleneck: for messages larger than 64 bytes, the size of the message determines the maximal throughput. (Please note that density is graphed on a logarithmic scale.)
Unfortunately, we had just one 16-core box to test the scaling on. Therefore we ran the scaling test in two configurations: first with message senders on two 8-core boxes and the message receiver on the 16-core box, then with message receivers on the 8-core boxes and the message sender on the 16-core box.
The test consisted of N independent message streams running in parallel. Each stream used 2 CPU cores on the sending box (application thread + I/O thread) and 2 cores on the receiving box (I/O thread + application thread).
We found that scaling improved significantly compared to ØMQ/0.2. Although it is not strictly linear, it yields a solid throughput increase for each additional message stream.
The throughput peaked at 9,500,000 8-byte messages a second (7,000,000 when the sender was located on the 16-core box) with 6 message streams (12 cores used on each side of the test). We weren't able to test scaling with more message streams as we didn't have machines with a sufficient number of cores at hand.
Comment: As you may see, the throughput for 2 cores (a single stream of messages) is somewhat lower than in the throughput test. The reason is that the scaling test was run in an environment specifically tuned for real-time behaviour. Making a system real-time means you experience fewer latency peaks (none in the ideal case); however, you pay for it with decreased overall system performance.
ØMQ is intended to be as thin as possible, so that it can run even on platforms with limited memory. The low memory footprint is also important for optimising the usage of the L1i cache. When the code is small enough, the processor can hold the entire messaging code in the L1i cache, avoiding slow accesses to physical memory to fetch new chunks of code.
The shared library itself is approximately 350 kB on the Linux platform. The portion of the library holding the actual code is 80 kB long. Still, a lot of the code is inlined, so it is actually kept in inline functions in header files rather than in the library proper.
To check the actual memory usage in a non-stressed environment (in a stressed environment most of the memory would be used to hold message queues), we ran the chat example bundled with the ØMQ package and had a look at the prompt application, which does nothing but send messages, and the display application, which does nothing but receive them. The following table shows the memory usage as reported by the top utility:
||Application||Virtual memory||Resident memory||Resident code||
Note the values in the "Resident code" column: the system has to hold just 2-3 pages of code in physical memory to be able to send and/or receive messages. Now that's what we call an ultralight messaging layer!
Although a lot of functionality was added in ØMQ/0.3, we've been able to increase throughput from 2.6M messages a second for ØMQ/0.2 to over 3.0M messages a second, peaking at 3.4M messages a second.
The latency overhead over the underlying transport got a bit worse compared to ØMQ/0.2, increasing from 12 microseconds to approximately 15 microseconds.
Scaling to multiple cores has improved significantly. While ØMQ/0.2 scaled nicely with 2 parallel message streams, somewhat worse with 3 streams and not particularly well with 4 streams, ØMQ/0.3 scales nicely up to 6 parallel streams (12 CPU cores on each side of the connection).