In the information era, with mobile devices increasingly used to communicate and access information, web browsers are the central tool for navigating the vast amount of content spread across the world-wide data network known as the Internet. Over the last decade, the visualization capabilities of web browsers have been greatly enhanced by the increase in processing power of general-purpose CPUs and graphics accelerators. Most mobile platforms include a general-purpose SIMD engine, such as ARM NEON, which can be used to process multimedia formats efficiently and enhance the user experience, with speedups of up to 4x as discussed in this article.
Speeding up SVG filters with the ARM NEON instruction set
SVG filters are powerful graphical operations which can be used to enhance the visual appearance of common graphical primitives (texts, boxes, circles) with effects such as lighting and shadow casting just to name a few.
Filters consist of filter primitives. Each primitive performs an atomic operation, and the results can be combined to create amazing graphical effects. Some primitives are simple enough to be handled by the underlying graphical subsystem (moving an image is an example) while others require software rendering support. The latter are both the most appealing and the most computationally intensive. In this article we share some of the experience we gained while implementing ARM NEON based SVG filters.
Speeding up the lighting filter
The lighting filter produces a shining effect by treating the alpha channel as a height map. It is well suited to the ARM NEON instruction set since both the increased computation power and the larger register size can be used in multiple ways.
The NEON instruction set allows multiplication of four floating point numbers simultaneously, which is especially useful for fast normalized dot product calculation. A (0,0,0)->(x,y,z) vector can be efficiently represented in a NEON register as follows:
The first three single-precision floating point numbers contain the x, y and z coordinates, and the fourth contains the length (which is redundant information but is useful for optimization purposes). The normalized dot product of two vectors can be obtained as (x1*x2+y1*y2+z1*z2)/(length1*length2).
All multiplications in this formula can be done by a single NEON VMUL multiplication instruction! This example emphasizes the importance of efficient data layout which can further improve the efficiency of SIMD instruction sets.
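The layout described above can be modelled in a few lines of Python. This is only a scalar sketch of the idea, not the WebKit code: the names pack, lanewise_mul and normalized_dot are hypothetical, and a tuple of four floats stands in for one 128-bit NEON register.

```python
import math

def pack(x, y, z):
    """Return (x, y, z, length): the four lanes of one modelled register."""
    return (x, y, z, math.sqrt(x * x + y * y + z * z))

def lanewise_mul(a, b):
    """Model of a single four-lane multiply: all four products at once."""
    return tuple(p * q for p, q in zip(a, b))

def normalized_dot(a, b):
    # One lane-wise multiply yields x1*x2, y1*y2, z1*z2 and length1*length2.
    px, py, pz, plen = lanewise_mul(a, b)
    # Assumes neither vector has zero length.
    return (px + py + pz) / plen

a = pack(1.0, 0.0, 0.0)
b = pack(1.0, 1.0, 0.0)
print(normalized_dot(a, b))  # cosine of the 45 degree angle between a and b
```

Storing the length in the otherwise unused fourth lane is what lets the denominator come out of the same multiply as the three numerator terms.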
NEON registers can also be used to hold temporary data, which reduces the number of memory reads and writes. In the lighting filter, the normal vector is calculated from the alpha values of the 3x3 pixel matrix centered on the current pixel. Since the pixels are processed from left to right, the center and right columns become the next left and center columns, respectively. This shift can be done by a VEXT NEON instruction, as the three u16 alpha values representing the current row are stored in the upper 6 bytes of a D register, and only the new right column needs to be loaded from memory.
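The sliding window can be sketched in scalar Python as follows. The function names shift_in and gradients_row are hypothetical, a 3-tuple stands in for the three alpha values kept in a D register, and the Sobel-style weights are an assumption made for illustration rather than a copy of the WebKit implementation. The sketch assumes an interior row y and an image at least three pixels wide.

```python
def shift_in(row3, new_value):
    """Drop the leftmost value and append the new one (the role VEXT plays)."""
    return row3[1:] + (new_value,)

def gradients_row(alpha, y):
    """Return (gx, gy) for every interior pixel of row y of the alpha map."""
    rows = (alpha[y - 1], alpha[y], alpha[y + 1])
    # Initial window: columns 0..2 of the three rows around y.
    top, mid, bot = (tuple(r[0:3]) for r in rows)
    width = len(alpha[0])
    out = []
    for x in range(1, width - 1):
        # Right column minus left column, and bottom row minus top row.
        gx = (top[2] + 2 * mid[2] + bot[2]) - (top[0] + 2 * mid[0] + bot[0])
        gy = (bot[0] + 2 * bot[1] + bot[2]) - (top[0] + 2 * top[1] + top[2])
        out.append((gx, gy))
        if x + 2 < width:
            # Only the new right column is read; the rest is reused.
            top = shift_in(top, rows[0][x + 2])
            mid = shift_in(mid, rows[1][x + 2])
            bot = shift_in(bot, rows[2][x + 2])
    return out

ramp = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
print(gradients_row(ramp, 1))
```

Each step reads three new values instead of nine, which is the saving the register-resident window is meant to provide.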
The NEON instruction set has other features which are very useful with lighting filters such as the fast conversion between multiple integer and floating point numbers, and the efficient clamping of the light strength to the 0-1 range by VMAX and VMIN instructions.
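As a small illustration of the clamping step, the sketch below clamps light strengths to the 0-1 range with a max followed by a min and then converts them to 8-bit channel values. The function name light_to_bytes, the scaling by 255 and the round-to-nearest step are assumptions for this example; on NEON the same operations would be applied to several lanes at once.

```python
def light_to_bytes(strengths):
    """Clamp each light strength to [0, 1] and convert it to an integer in 0..255."""
    result = []
    for s in strengths:
        clamped = min(max(s, 0.0), 1.0)           # max then min, as with VMAX and VMIN
        result.append(int(clamped * 255.0 + 0.5)) # float to integer conversion
    return result

print(light_to_bytes([-0.5, 0.0, 0.5, 2.0]))
```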
Using the optimizations above, the hand-written NEON assembly version of the lighting filter runs 4 times faster on an ARM Cortex-A9 CPU than its C++ counterpart. The implementation supports both diffuse and specular lighting filters with ambient, point and spot light sources.
Speeding up the Gaussian blur
The Gaussian blur filter has somewhat less potential to use the extra processing power provided by NEON instructions since the blurring effect is quite simple. The value of the current pixel is simply replaced by the average of its neighbouring pixels.
The average calculation must be applied to each row first, then to each column. This sequence is repeated three times, for a total of six runs. Repeating the simple average in this way closely approximates a true Gaussian blur.
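The six runs can be sketched on a single-channel image in plain Python. The function names are hypothetical, and treating samples outside the image as zero is an assumption made to keep the example short.

```python
def box_blur_1d(values, radius):
    """Average each value with its neighbours within radius; outside counts as 0."""
    n = len(values)
    size = 2 * radius + 1
    return [sum(values[max(0, i - radius):min(n, i + radius + 1)]) / size
            for i in range(n)]

def blur_rows(image, radius):
    return [box_blur_1d(row, radius) for row in image]

def transpose(image):
    return [list(col) for col in zip(*image)]

def approx_gaussian(image, radius, repeats=3):
    """Row pass followed by column pass, repeated; six runs when repeats is 3."""
    for _ in range(repeats):
        image = blur_rows(image, radius)                        # horizontal run
        image = transpose(blur_rows(transpose(image), radius))  # vertical run
    return image

# A single bright pixel spreads into a bell-shaped spot.
img = [[0.0] * 9 for _ in range(9)]
img[4][4] = 1.0
blurred = approx_gaussian(img, 1)
print([round(v, 4) for v in blurred[4]])
```

With radius 1, three passes turn a single pixel into the 1D weights 1, 3, 6, 7, 6, 3, 1 (divided by 27) along each axis, which already has the bell shape of a Gaussian.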
The average must be calculated for all four (red, green, blue, and alpha) channels. All operations on the four channels, including memory transfers and arithmetic operations, can be parallelized using appropriate NEON instructions (e.g., VADD, VLDR, VMUL). The NEON-based algorithm is 4 times faster than the original algorithm, which processes each channel one by one.
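The sketch below shows what handling all four channels together looks like for one row. Each pixel is an (r, g, b, a) tuple standing in for four register lanes, and add4 and sub4 model a single lane-wise add or subtract. The running-sum formulation, the integer division and the function names are assumptions for this example and are not taken from the WebKit source; pixels outside the row are treated as transparent black.

```python
def add4(a, b):
    """Lane-wise add of two RGBA values (one vector add in the NEON version)."""
    return tuple(p + q for p, q in zip(a, b))

def sub4(a, b):
    """Lane-wise subtract of two RGBA values."""
    return tuple(p - q for p, q in zip(a, b))

def box_blur_rgba_row(pixels, radius):
    """Box-average one row of (r, g, b, a) pixels with a running sum."""
    n = len(pixels)
    size = 2 * radius + 1
    acc = (0, 0, 0, 0)
    for p in pixels[:radius + 1]:        # window for the first pixel
        acc = add4(acc, p)
    out = []
    for i in range(n):
        out.append(tuple(c // size for c in acc))
        if i + radius + 1 < n:           # pixel entering the window
            acc = add4(acc, pixels[i + radius + 1])
        if i - radius >= 0:              # pixel leaving the window
            acc = sub4(acc, pixels[i - radius])
    return out

row = [(30, 60, 90, 255)] * 5
print(box_blur_rgba_row(row, 1))
```

A per-channel implementation would run the inner loop four times; here every add, subtract and divide touches all four channels in one step.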
In our experience, the ARM NEON instruction set can considerably speed up computation-intensive algorithms in which the same operation is executed on multiple data items of the same type.
NEON registers can also be used to store temporary data in order to reduce the number of memory transfer operations.
All this work is open source and is available as part of the official WebKit trunk.