Using ARM NEON to accelerate Scalable Vector Graphics in WebKit by up to 4x

Introduction
In an era of increasing reliance on mobile devices to communicate and access information, web browsers are the central component for navigating the vast amount of information available, as they fetch and visualize content spread across the Internet. Over the last decade, the visualization capabilities of web browsers have been greatly enhanced by the increase in processing power of general-purpose CPUs and graphics accelerators. Most mobile platforms include a general-purpose SIMD engine, such as ARM NEON, which can be used to process multimedia formats efficiently and help enhance the user experience, with up to a 4x improvement as discussed in this article.

Background
The web browser group at the University of Szeged, Hungary, has been actively working on the WebKit browser engine since 2008 in cooperation with ARM Ltd. and industrial partners. Over the last three years we have completed several performance improvements, including accelerating the JavaScript engine, Scalable Vector Graphics (SVG) pixel manipulations and the CSS engine. A number of these improvements were also able to exploit the Symmetric Multiprocessing (SMP) capabilities of recent ARM CPUs efficiently. Memory footprint and space requirements have also been a key area of focus throughout this work with WebKit.

Speeding up SVG filters with the ARM NEON instruction set
SVG filters are powerful graphical operations which can be used to enhance the visual appearance of common graphical primitives (texts, boxes, circles) with effects such as lighting and shadow casting just to name a few.

Filters consist of filter primitives. Each primitive performs an atomic operation, and the results can be combined to create striking graphical effects. Some primitives are simple enough to be handled by the underlying graphical subsystem (moving an image is one example), while others require software rendering support. The latter are both the most appealing and the most computationally intensive. In this article we share some of the experience we gained while implementing ARM NEON based SVG filters.


Speeding up the lighting filter

The lighting filter produces a shining effect by treating the alpha channel as a height map. It is well suited to the ARM NEON instruction set, since both the increased computation power and the larger register size can be exploited in multiple ways.

The NEON instruction set allows multiplication of four floating point numbers simultaneously, which is especially useful for fast normalized dot product calculation. A (0,0,0)->(x,y,z) vector can be efficiently represented in a NEON register as follows:

[Figure: one NEON register holding four single-precision lanes: x, y, z, length]


The first three single-precision floating point numbers contain the x, y and z coordinates, and the fourth contains the length (which is redundant information but is useful for optimization purposes). The normalized dot product of two vectors can be obtained as (x1*x2+y1*y2+z1*z2)/(length1*length2).

All multiplications in this formula can be done by a single NEON VMUL multiplication instruction! This example emphasizes the importance of efficient data layout which can further improve the efficiency of SIMD instruction sets.
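To make the data layout concrete, here is a minimal sketch in portable C rather than NEON assembly. It is not the WebKit implementation: the `Vec4` type and the function names are hypothetical, and the plain loop only models what a single four-lane multiply would do. The cached length in lane 3 is assumed to have been computed once, when the vector was built.

```c
/* Hypothetical model of one 128-bit NEON register:
 * lanes 0..2 hold x, y, z and lane 3 caches the vector length. */
typedef struct { float lane[4]; } Vec4;

/* Lane-wise multiply: the scalar equivalent of one four-lane VMUL. */
static Vec4 vec4_mul(Vec4 a, Vec4 b)
{
    Vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Normalized dot product (cosine of the angle between a and b).
 * Assumes both vectors are non-zero, so the length product is not 0. */
static float normalized_dot(Vec4 a, Vec4 b)
{
    /* p = { x1*x2, y1*y2, z1*z2, length1*length2 } */
    Vec4 p = vec4_mul(a, b);
    return (p.lane[0] + p.lane[1] + p.lane[2]) / p.lane[3];
}
```

Because the length travels in the otherwise unused fourth lane, the denominator comes out of the same multiply as the three numerator terms; only the final additions and the division remain.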


NEON registers can also be used to hold temporary data, which can reduce the number of memory reads and writes. For the lighting filter, the normal vector is calculated from the alpha values of the 3x3 pixel matrix centered on the current pixel. Since the alpha values are processed from left to right, the center and right columns become the next left and center columns, respectively. This shift can be done by a VEXT NEON instruction, as the three u16 alpha values representing the current row are stored in the upper 6 bytes of a D register, and only the new right column needs to be loaded from memory.
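The sliding window can be sketched in portable C for a single row. This is an illustration under stated assumptions, not the WebKit code: `D16`, `vext16_by1` and `row_gradient` are hypothetical names, the struct merely models a D register viewed as four u16 lanes, and the horizontal difference computed here stands in for the full normal calculation. A real filter would keep one such register for each of the three rows.

```c
#include <stddef.h>
#include <stdint.h>

/* Models a 64-bit D register viewed as four u16 lanes. */
typedef struct { uint16_t lane[4]; } D16;

/* Scalar model of VEXT.16 d, a, b, #1:
 * lanes 1..3 of a followed by lane 0 of b. */
static D16 vext16_by1(D16 a, D16 b)
{
    D16 r = { { a.lane[1], a.lane[2], a.lane[3], b.lane[0] } };
    return r;
}

/* Hypothetical helper: for every interior pixel x of one row, store
 * alpha[x + 1] - alpha[x - 1] in out[x]. out[0] and out[width - 1]
 * are left untouched. */
static void row_gradient(const uint8_t *alpha, size_t width, int *out)
{
    if (width < 3)
        return;

    /* Lanes 1..3 (the upper 6 bytes) hold left, center, right. */
    D16 win = { { 0, alpha[0], alpha[1], alpha[2] } };

    for (size_t x = 1; ; ++x) {
        out[x] = (int)win.lane[3] - (int)win.lane[1];
        if (x + 2 >= width)
            break;
        /* Only the new right-hand value is read from memory. */
        D16 incoming = { { alpha[x + 2], 0, 0, 0 } };
        win = vext16_by1(win, incoming);
    }
}
```

Each step reads one new alpha value per row instead of three, because the center and right values are carried over in the register.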


The NEON instruction set has other features which are very useful with lighting filters such as the fast conversion between multiple integer and floating point numbers, and the efficient clamping of the light strength to the 0-1 range by VMAX and VMIN instructions.
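The clamp-and-convert step might look like the following portable C sketch. The type and function names are hypothetical, and scaling the clamped strength by 255 to obtain a channel value is an assumption made for illustration; the article does not say how the light strength is mapped to pixel values. Each loop models one four-lane instruction.

```c
#include <stdint.h>

/* Models four single-precision lanes and four u32 lanes of a NEON register. */
typedef struct { float lane[4]; } F32x4;
typedef struct { uint32_t lane[4]; } U32x4;

/* Clamp every lane to the 0-1 range: a lane-wise maximum with 0
 * (what VMAX does) followed by a lane-wise minimum with 1 (VMIN). */
static F32x4 clamp01(F32x4 v)
{
    F32x4 r;
    for (int i = 0; i < 4; ++i) {
        float m = v.lane[i] > 0.0f ? v.lane[i] : 0.0f;
        r.lane[i] = m < 1.0f ? m : 1.0f;
    }
    return r;
}

/* Convert four clamped strengths to integers in 0..255 at once.
 * The C cast truncates toward zero. The input must already be
 * clamped, since casting a negative float to uint32_t is undefined. */
static U32x4 to_channel(F32x4 v)
{
    U32x4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (uint32_t)(v.lane[i] * 255.0f);
    return r;
}
```

Clamping with a maximum and a minimum avoids per-lane branches, which is what makes it a good fit for SIMD.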

Using the optimizations above, the hand-written NEON-optimized assembly lighting filter is able to run 4 times faster on an ARM Cortex-A9 CPU compared to its C++ counterpart implementation. The implementation supports both diffuse and specular lighting filters with ambient, point and spot light sources.

Speeding up the Gaussian blur
The Gaussian blur filter has somewhat less potential to use the extra processing power provided by NEON instructions since the blurring effect is quite simple. The value of the current pixel is simply replaced by the average of its neighbouring pixels.

The average calculation must be applied to each row first, then to each column. This sequence is repeated three times (a total of six passes).

The average must be calculated for all four channels (red, green, blue and alpha). All operations on the four channels, including memory transfers and arithmetic operations, can be parallelized using appropriate NEON instructions (e.g., VADD, VLDR, VMUL). The NEON-based algorithm is 4 times faster than the original algorithm, which processes each channel one by one.
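A single horizontal pass can be sketched in portable C as follows. This is a simplified illustration, not the WebKit code: the function name and the `radius` parameter are hypothetical, pixels outside the row are assumed to count as transparent black, and the inner four-iteration loops stand in for one vector operation across the R, G, B and A lanes.

```c
#include <stddef.h>
#include <stdint.h>

/* One horizontal box-blur pass over a row of RGBA pixels (4 bytes each).
 * radius is the number of neighbours on each side included in the average.
 * Pixels outside the row are treated as zero (an assumption). */
static void box_blur_row(const uint8_t *src, uint8_t *dst,
                         size_t width, size_t radius)
{
    const uint32_t window = (uint32_t)(2 * radius + 1);
    uint32_t sum[4] = { 0, 0, 0, 0 };   /* models four u32 lanes */

    /* Prime the running sum with pixels 0 .. radius - 1. */
    for (size_t x = 0; x < radius && x < width; ++x)
        for (int c = 0; c < 4; ++c)
            sum[c] += src[4 * x + c];

    for (size_t x = 0; x < width; ++x) {
        /* Add the pixel entering the window on the right. */
        if (x + radius < width)
            for (int c = 0; c < 4; ++c)
                sum[c] += src[4 * (x + radius) + c];

        /* Average all four channels in one step. */
        for (int c = 0; c < 4; ++c)
            dst[4 * x + c] = (uint8_t)(sum[c] / window);

        /* Subtract the pixel leaving the window on the left. */
        if (x >= radius)
            for (int c = 0; c < 4; ++c)
                sum[c] -= src[4 * (x - radius) + c];
    }
}
```

Following the description above, a full filter would apply this pass to every row, then the equivalent pass to every column, and repeat the pair three times. In a vectorized version the division would likely be replaced by a multiplication with a precomputed reciprocal, though that detail is a guess.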

Conclusion
In our experience, the ARM NEON instruction set can considerably speed up computation-intensive algorithms in which the same operation is executed on multiple data items of the same type.

NEON registers can also be used to store temporary data in order to reduce the number of memory transfer operations.

All this work is open source, and can be accessed as part of the official WebKit trunk, HERE.


Guest Blogger:
Zoltan Herczeg is a Senior Developer at the Software Engineering Department of the University of Szeged, Hungary. He is an accepted contributor to several open source projects, including the WebKit browser engine (reviewer status) and the Perl Compatible Regular Expressions (PCRE) library (committer status), and is the maintainer of XEEMU, a cycle-accurate ARM instruction simulator. He holds an MSc in Computer Science.

