NEON is a wide SIMD data processing architecture extension introduced in the ARMv7 architecture. It performs “Packed SIMD” processing and can be used to optimize multimedia codecs, 2D/3D graphics libraries and other data processing applications. NEON has proven very popular in both open-source projects and proprietary applications. The WebM Multimedia project and Android’s Skia library are good examples of software libraries that use NEON instructions.
Windows RT also uses NEON for optimization. The Microsoft Visual C++ compiler supports NEON intrinsics with an implementation close to that of the ARM RVCT 4.1 compiler. You get access to the NEON intrinsics by including the arm_neon.h header file, just as you would for Linux/Android development. Refer to MSDN for more details. The SIMD C++ math library (DirectXMath.h) is implemented using NEON intrinsics and serves as a good reference.
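To give a feel for what the intrinsics look like, here is a minimal sketch that adds two float arrays four lanes at a time. The helper name `add_floats` and the `SKETCH_HAVE_NEON` macro are my own illustrative choices, not part of any library. The sketch assumes `_M_ARM` (Visual C++) or `__ARM_NEON` (GCC/Clang) signals that NEON is available, and falls back to plain C++ elsewhere so it builds on any machine.

```cpp
#include <cstddef>

// Assumption: _M_ARM (Visual C++ targeting ARM) or __ARM_NEON (GCC/Clang)
// means NEON intrinsics are available. SKETCH_HAVE_NEON is a made-up macro.
#if defined(_M_ARM) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n).
void add_floats(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if SKETCH_HAVE_NEON
    // Process four floats per iteration in 128-bit Q registers.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);      // load 4 floats from b
        vst1q_f32(out + i, vaddq_f32(va, vb));  // lane-wise add, then store
    }
#endif
    // Scalar loop handles the tail (or everything when NEON is absent).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The same source compiles for Windows RT and for Linux/Android on ARM, since the header name and intrinsic names are shared.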
As an example, I decided to port the HelloNEON program from the Android NDK to see how easy it is to use NEON intrinsics on Windows RT. The HelloNEON program offers several benefits. It is small and nicely written, so it is easy to understand and modify if needed. It also offers both C and NEON implementations, so I can easily show the benefit of NEON optimization.
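Pairing a plain C routine with a NEON routine that computes the same thing is what makes a side-by-side comparison possible. The following is a simplified sketch of that pattern using a 16-bit multiply-accumulate kernel. It is not the HelloNEON code; the function names and the `SKETCH_HAVE_NEON` macro are illustrative assumptions.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(_M_ARM) || defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Reference implementation: sum of a[i] * b[i], accumulated in 32 bits.
std::int32_t dot_s16_c(const std::int16_t* a, const std::int16_t* b, std::size_t n)
{
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += static_cast<std::int32_t>(a[i]) * b[i];
    }
    return sum;
}

// Same result, four lanes at a time when NEON is available.
std::int32_t dot_s16_neon(const std::int16_t* a, const std::int16_t* b, std::size_t n)
{
    std::size_t i = 0;
    std::int32_t sum = 0;
#if SKETCH_HAVE_NEON
    int32x4_t acc = vdupq_n_s32(0);        // four 32-bit partial sums
    for (; i + 4 <= n; i += 4) {
        int16x4_t va = vld1_s16(a + i);    // load 4 x int16
        int16x4_t vb = vld1_s16(b + i);
        acc = vmlal_s16(acc, va, vb);      // widening multiply-accumulate
    }
    // Fold the four partial sums into one scalar.
    sum = vgetq_lane_s32(acc, 0) + vgetq_lane_s32(acc, 1)
        + vgetq_lane_s32(acc, 2) + vgetq_lane_s32(acc, 3);
#endif
    // Scalar tail (or the whole array when NEON is absent).
    for (; i < n; ++i) {
        sum += static_cast<std::int32_t>(a[i]) * b[i];
    }
    return sum;
}
```

Keeping the C version around also gives you a correctness check: both functions should return the same value for the same input.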
Rewriting the timestamp function:
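A minimal sketch of what such a rewrite could look like, assuming the Android version read a POSIX clock and that `QueryPerformanceCounter` is used as the Windows replacement. The function name `now_ms` is illustrative, and the non-Windows branch is kept only so the two variants can be compared side by side.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

// Hypothetical millisecond timer; returns a monotonically increasing value.
double now_ms()
{
#ifdef _WIN32
    LARGE_INTEGER freq;
    LARGE_INTEGER count;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&count);    // current tick count
    return 1000.0 * static_cast<double>(count.QuadPart)
                  / static_cast<double>(freq.QuadPart);
#else
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); // POSIX monotonic clock
    return 1000.0 * static_cast<double>(ts.tv_sec)
         + static_cast<double>(ts.tv_nsec) / 1e6;
#endif
}
```

Calling `now_ms()` before and after the routine under test and subtracting the two values gives the elapsed time in milliseconds.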
Wrap the main routines as WinRT component:
Once the coding is done, you have to specify the platform to be ‘ARM’ and the build configuration to be ‘Release’. You also have to set up remote debugging for running the program on your Windows RT device. In my case, I tested it on my Surface RT tablet.
The result is great: after normalizing, the NEON version is about twice as fast as the C version.
So, without any hardware change, I am able to get a 100% improvement with NEON optimization over the original C implementation. The result will vary depending upon your functions or algorithms, but the benefit is clear. It is also worth noting that this is a single-threaded implementation. The Surface RT device uses the NVIDIA® Tegra® T30 chip, which has a quad-core ARM Cortex™-A9 MPCore CPU. If your function or algorithm can be readily parallelized into independent processing blocks, a multi-threaded implementation will give you even further optimization.
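As a rough sketch of that idea, the code below splits an array into contiguous blocks and hands each block to its own thread. It uses standard C++ `std::thread` as a platform-neutral stand-in; on Windows RT you might prefer something like the PPL's `concurrency::parallel_for` instead. `process_block` and `process_parallel` are hypothetical names, and the doubling kernel is only a placeholder for a real NEON-optimized routine.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-block kernel: doubles each sample in [begin, end).
// A real port would call its NEON-optimized routine here.
void process_block(std::int16_t* data, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i) {
        data[i] = static_cast<std::int16_t>(data[i] * 2);
    }
}

// Splits [0, n) into up to num_threads contiguous, non-overlapping blocks
// and processes each block on its own thread.
void process_parallel(std::int16_t* data, std::size_t n, unsigned num_threads)
{
    if (num_threads == 0) {
        num_threads = 1;
    }
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = t * chunk;
        if (begin >= n) {
            break;  // fewer elements than threads
        }
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back(process_block, data, begin, end);
    }
    for (std::thread& w : workers) {
        w.join();  // wait for every block to finish
    }
}
```

Because the blocks do not overlap, the threads never write to the same element and no locking is needed. On a quad-core part, four threads is the natural starting point.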
With NEON intrinsics support in the Microsoft Visual C++ compiler, using NEON to speed up your Windows RT application is as easy as including the relevant header file and setting the right compiler options. With so many applications benefiting from NEON optimization, yours should too. For more information on NEON, check out the ARM online infocenter, where you can also find the NEON programming reference guide.
Alan Chuang, Client Computing Engineering Specialist, ARM, has many years of software development experience in networking, communication, embedded systems and web technology. He is fascinated with networking and how various network technologies have transformed our lives. Web technology is simply a prime example of that. His current work within ARM focuses mostly on the client-computing software ecosystem, including Linux, Android and lately Windows RT. With mobile computing becoming the norm for the foreseeable future, he certainly wants to be part of the action for this next stage of network evolution.