Historically, benchmarks allowed comparison of compiler effectiveness and CPU performance. The first popular benchmark, SIEVE, simply calculated prime numbers and was published in January 1983 in BYTE Magazine. Later, Dhrystone and Whetstone became popular. Dhrystone focuses on integer and string operations, whereas Whetstone primarily uses floating-point arithmetic. Today’s compiler technology can calculate many of the internal operations of these benchmarks at compile time, so the CPU performance they indicate may be misleading. CoreMark uses randomly generated data as input, making it impossible for compilers to pre-compute parts of the benchmark at compile time. While CoreMark is more difficult to “defeat” [1] than Dhrystone, clever compiler writers can still improve the benchmark result by crafting optimizations that are aimed at the coding constructs of the benchmark. From a compiler writer’s perspective, this is like shooting fish in a barrel!
The Experiment: Improve ARM Compiler Performance using CoreMark
We began our experiment by analyzing the CoreMark coding structure and the functions the benchmark performs. CoreMark consists of a linked list data structure which is scanned at runtime. Control is then offloaded to a loop-controlled state machine or matrix manipulation routine, depending on the linked list data value. After analysis, we identified techniques to deal with the state machine and opportunities for more aggressive loop unrolling.
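To make the loop unrolling mentioned above concrete, here is a minimal hand-written sketch in C. It is not CoreMark source and not actual ARM Compiler output; the names `LEN`, `sum_rolled`, and `sum_unrolled` are hypothetical, and the trip count is assumed to be a compile-time constant that is a multiple of four.

```c
/* Hypothetical fixed trip count; assumed to be a multiple of 4. */
enum { LEN = 16 };

/* Rolled loop: one addition plus one compare-and-branch per element. */
static int sum_rolled(const int *v)
{
    int sum = 0;
    for (int i = 0; i < LEN; i++)
        sum += v[i];
    return sum;
}

/* Unrolled by four: same result with a quarter of the loop branches,
 * at the cost of a larger loop body (more code). */
static int sum_unrolled(const int *v)
{
    int sum = 0;
    for (int i = 0; i < LEN; i += 4) {
        sum += v[i];
        sum += v[i + 1];
        sum += v[i + 2];
        sum += v[i + 3];
    }
    return sum;
}
```

Both functions return the same value for any array of `LEN` elements. The unrolled form trades extra instructions for fewer branches, which is the kind of speed-for-size exchange discussed later in this post.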
Based on the structure of the switch statement, the next state is known at the end of each case. This makes it possible to eliminate the switch dispatch by having each case branch directly to its successor.
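The idea of branching directly from one case to the next can be sketched by hand in C. The following is a hypothetical illustration only: it is neither CoreMark code nor what the ARM Compiler emits, and the state names and the functions `count_numbers_switch` and `count_numbers_threaded` are invented for this example. The first function is a conventional loop-controlled state machine; the second shows the same logic with the switch removed and each state jumping straight to its known successor.

```c
/* Hypothetical state machine: counts runs of decimal digits in a string. */
enum state { ST_SCAN, ST_IN_NUMBER, ST_DONE };

/* Conventional form: every transition goes back through the switch. */
static int count_numbers_switch(const char *s)
{
    enum state st = ST_SCAN;
    int count = 0;

    while (st != ST_DONE) {
        switch (st) {
        case ST_SCAN:
            if (*s == '\0') {
                st = ST_DONE;
            } else if (*s >= '0' && *s <= '9') {
                count++;               /* first digit of a new number */
                s++;
                st = ST_IN_NUMBER;
            } else {
                s++;                   /* skip non-digit, stay in ST_SCAN */
            }
            break;
        case ST_IN_NUMBER:
            if (*s >= '0' && *s <= '9')
                s++;                   /* consume the rest of the number */
            else
                st = ST_SCAN;
            break;
        default:
            st = ST_DONE;
            break;
        }
    }
    return count;
}

/* Same logic with the dispatch removed: each state branches directly
 * to the state that the switch version would have selected next. */
static int count_numbers_threaded(const char *s)
{
    int count = 0;

scan:
    if (*s == '\0')
        goto done;
    if (*s >= '0' && *s <= '9') {
        count++;
        s++;
        goto in_number;
    }
    s++;
    goto scan;

in_number:
    if (*s >= '0' && *s <= '9') {
        s++;
        goto in_number;
    }
    goto scan;

done:
    return count;
}
```

For an input such as `"12 abc 345 x6"`, both versions count three numbers. The second version avoids re-evaluating the state variable on every transition, but duplicating branch targets in this way tends to increase code size.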
The Results: ARM Compiler shows significant CoreMark Improvements
Applying the techniques mentioned above resulted in a dramatic improvement in CoreMark benchmark performance. While the result was impressive, there was the unwelcome side effect of a 13% code size increase when enabling the switch statement and aggressive loop optimizations. Given that the ARM Compiler targets embedded developers, this is generally not an acceptable outcome for our customers.
Why Code Size Matters
As mentioned above, the performance improvement techniques come with an unwelcome impact on code size. Generally speaking, ARM’s powerful 32-bit microcontrollers have plenty of horsepower for embedded workloads, but memory capacity is often limited in embedded applications for cost reasons. Although MCU prices have fallen dramatically, moving to a microcontroller with more on-chip Flash memory can increase your BOM cost drastically. For example, when purchasing 1,500 pieces of a popular ARM Cortex-M3 microcontroller, the unit price of the 128KB Flash variant is $4.88, whereas the 64KB Flash variant is just $2.80. That is a difference of $2.08 per unit, or $3,120 across the order. A compiler that is primarily tuned for performance can therefore significantly increase the overall project cost.
System cost is not the only reason why code size is important in modern embedded processors. Compact code increases the number of instructions that can fit into cache, potentially improving performance through more efficient cache usage and, perhaps more importantly, potentially reducing overall power consumption.
The ARM Compiler team has always focused on both performance and code density, resulting in a well-tuned compiler that balances execution speed and code size. To complement compact code generation, we created MicroLib, a size-optimized library for ARM-based embedded applications. When compared to a standard C library, MicroLib provides significant code size advantages. For example, the 13% code size increase mentioned above turns into a net 23% code size reduction when using MicroLib.
For users who care more about performance than code size, the CoreMark improvements mentioned above will prove welcome. For example, code consisting of finite state machines implemented using switch statements, or of well-formed [2] for loops and while loops, could see significant improvement for those constructs. Testing the new optimizations with our standard benchmark suite, which consists of over 60 applications targeting a wide range of embedded use cases, showed a substantial improvement on only some of them. This was not surprising, as CoreMark is a small piece of software, and targeted compiler optimizations aimed at a limited set of code constructs may not scale across broader code bases.
CoreMark performance can be significantly increased by applying compiler optimizations specifically targeting the constructs of the benchmark code. Comparing CoreMark scores between compilers can give an indication of which compiler fares better on CoreMark-like code, but a higher score may or may not translate into better real-world embedded application performance and, even worse, the optimizations behind it could introduce unwanted code bloat. The code size penalty can be mitigated by using a compact library, such as MicroLib. As always, the best approach is to evaluate compilers of interest on your own code, taking into account the impact on code size when aggressive performance optimizations are used.
For those who are interested in evaluating the ARM Compiler improvements referenced in this blog, download Keil MDK-ARM v4.70 or ARM DS-5 v5.14 (available March 2013). The release notes provide details on how to use the new optimizations. I encourage you to try it and give me your feedback.
[1] Dhrystone can be compromised by the compiler by pre-computing values at compile time or optimizing away timed portions of the code.
[2] Idiomatic loops with known constant upper and lower bounds, loops with an unknown upper bound, and loops containing a small number of C statements.
Dan Owens, Compiler Product Manager, ARM. After earning a Bachelor of Science in Electrical Engineering in 1994, Dan has held various Engineering, Sales, and Marketing positions within the electronics industry. He joined ARM in 2009 as a Product Manager and has been responsible for the ARM Compiler, RealView Development Suite (RVDS), and Development Studio 5 (DS-5). Dan is currently the Product Manager of the ARM Compiler – the industry standard compilation tools for the ARM architecture.