ARM Community: Software Enablement - ARM Community

Design West (ESC) Day 1: Optimizing Your Software on ARM

It was a full first day at the ESC Summit of Design West. I spent most of the time doing what I love to do best: talking with ARM Partners. Most of the conversations focused on how engineers can achieve better software on their ARM processor-based SoCs.

Andy Frame had a few meetings in the morning, so he passed the baton (microphone) to me so that I could share some of the insights with the ARMFlix followers. I’m looking forward to Day 2 and speaking with more of the ARM Partners at ESC (check out our handy map). Don’t miss the ARM Connected Community Theatre at ...

Ne10: A New Open Source Library to Accelerate your Applications with NEON

Over the past three years we have seen explosive growth in the use of the NEON™ SIMD engine by many of our software partners in the open-source community. The engine itself, defined as part of the ARM® Architecture, Version 7 (ARMv7), has shown itself to be extremely flexible and able to accelerate everything from video codecs such as VP8 to elements of the emerging HTML5 standard, including <svg> and <canvas> filters. From an application developer's viewpoint, all of this acceleration takes place behind the scenes in upstream open source projects that are harvested to build the latest and greatest open source operating systems and frameworks such as Android™ and Qt. While it is good to kno...

Using ARM NEON to accelerate Scalable Vector Graphics in webkit by up to 4x

Introduction
In the information era, with its increased use of mobile devices to communicate and access information, web browsers are the central component for navigating the vast amount of information available, as they are able to fetch and visualize content spread across the world-wide data network known as the Internet. Over the last decade, the visualization capabilities of web browsers have been greatly enhanced by the increase in processing power of general-purpose CPUs and graphics accelerators. Most mobile platforms include a general-purpose SIMD engine, such as ARM NEON, which can be used to process multimedia formats efficiently and help enhance the user experience, with up to a 4x improvement as discussed in this article.

Background
The web browser group at the ...

Coding for NEON - Part 5: Rearranging Vectors

This article describes the instructions provided by NEON for rearranging data within vectors. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers, Part 3: Matrix Multiplication and Part 4: Shifting Left and Right.

Introduction

When writing code for NEON, you may find that sometimes, the data in your registers are not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.

This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.
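
As a rough illustration of what a permutation does, here is a minimal scalar sketch in C. It is not NEON code and does not correspond to any particular NEON instruction; the function name permute_u8 and its parameters are hypothetical, and only the single-source case is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a permutation: each output element is selected from the
 * input by an index, so out[i] = in[index[i]].
 * permute_u8 and its parameters are illustrative names, not a NEON API. */
void permute_u8(const uint8_t *in, const uint8_t *index, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* An out-of-range index produces 0; this is a choice made for the sketch. */
        out[i] = (index[i] < n) ? in[index[i]] : 0;
    }
}
```

With the index vector {3, 2, 1, 0}, for example, this reverses a four-element vector.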

Before we begin

Before you dive into using the permutation instructions provided by NEON, consider whether you rea...

Top 2011 ARM Software blogs: Android, NEON, RISC vs CISC & Assembly

2011 was a busy year for developing software on ARM and the activity is reflected in the page views of top Software Enablement blogs. The topics included managing caches, Android (multiple), NEON (multiple), Memory Access Ordering, RISC vs CISC architectures (multiple), and optimizing assembly code (listed by popularity below). In addition, the Software Enablement Community pages (Linux, Solution Center for Android, RTOS, Microsoft, etc) were some of the highest referenced pages on the ARM site. Please let us know if you have more ideas for easing your software development on the ...

x264 on ARM: Bringing a wider application of video conferencing (Part 3)

In part one and part two of this blog series, we introduced the video conferencing use case requirements and performed tuning of x264 for optimal tradeoff between bit rate, frame rate and video quality… In this part, we will test and analyze encode performance for optimal execution on the target ARM platform.

1 Test results on the target ARM platform

Using the results from the previous step, we test the options “default”, “--preset ultrafast”, “--preset superfast” and “--preset veryfast” with our optimal settings on the ARM platform and evaluate the performance against our use case requirements for video conferencing.

1). List of Combinations (settings tested)
Attached Image


2). Result

Attached Image
Attached Image
Attached Image


3). Conclusion
According to the above information, we can conclude:
Attached Image


2 Summary of the optimal settings
According to the test results above, we might conclude: when rc-lookahead is set to 1, the bit rate is the minimum and the...

x264 on ARM: Bringing a wider application of video conferencing (Part 1)

Video is increasingly becoming an important and essential part of consumer electronics. Video-centric features like augmented reality and video conferencing provide enhanced visual user interaction. Such features are now expected across a wide variety of application segments. In the embedded world, intensive video compression is typically done using standard DSPs or specialized hardware accelerators, as they can provide both the specialized functionality and the high level of performance required. However, ARM processors with NEON™ technology can now be as capable of compressing video as some dedicated hardware, and do so with greater power efficiency.

H.264/MPEG-4 AVC (Advanced Video Coding) is currently one of the most commonly used formats for the recording, compression, and distribution of video content. The H.264 video format has a very broad application range that covers all forms of digital compressed video from low bit-rate Internet streaming applications to HDTV broadcast. With the use of H.264, bit rate savings of 50% or more ar...

Optimizing DirectFB with ARM NEON

DirectFB (Direct Frame Buffer) is a graphics library that is widely used in embedded systems, especially in the home market. More and more applications and libraries choose DirectFB as a backend, such as Cairo, GDK, Qt, V8, X11 and WebKit. ARM NEON technology is well suited to 2D acceleration. In this blog, I’ll describe how to optimize DirectFB using NEON.

1. Introduction
1.1 DirectFB Introduction
DirectFB (Direct Frame Buffer) is a thin library that provides hardware graphics acceleration, input device handling and abstraction, and an integrated windowing system with support for translucent windows and multiple display layers. It is free software licensed under the terms of the GNU Lesser General Public License (LGPL). Graphics features provided by DirectFB include Rectangle Filling/Drawing; Triangle Filling/Drawing; Line Drawing; Blit; Alpha Blending (texture alpha, alpha modulation); Porter/Duff; Colorizing; Source Color Ke...

ARM Fundamentals: Introduction to understanding ARM processors

Finding one's way through references to ARM processors is not always obvious.
This article is the first of a series on ARM fundamentals that will introduce various topics to help you get more familiar with the ARM architecture. It aims at helping you to better understand ARM processors, starting with explaining how they are named, and then showing how knowing your processor matters by introducing a few of their recent features.

If you are curious about what is in your pretty electronic device or are a developer willing to understand how to start getting the best out of your processor, you may find some useful information here. The second part of the article may be technically a bit more challenging than the first, but don't worry! The few code samples are only concrete examples used to illustrate the explanations. The specific details are not necessary to understand the global picture.

The first step is to understand how ARM processors are referenced: it certainly sounds nice, but what is this "dual Cortex-A9, based on ARMv7" in your super-phone?


Processor fami...

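Coding for NEON - Part 4: Shifting Left and Right

This article covers the shift operations that NEON provides and shows how to use them to convert image data between common color depths. Articles already published in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers, and Part 3: Matrix Multiplication.

Shifting Vectors

Shifts on NEON closely resemble the shifts you may have used in scalar ARM code: the bits of each vector element are moved left or right, and any bits pushed past the left or right edge of an element are dropped; they cannot move into a neighbouring element.

The shift amount can be given either as a literal encoded in the instruction or as an additional shift vector. With a shift vector, the shift applied to each element of the input vector depends on the value of the corresponding element of the shift vector. The elements of the shift vector are treated as signed values, so left shifts, right shifts and zero shifts can all occur, element by element.

...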

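10 Tips for the Android NDK

With the Android NDK (Native Development Kit) exposing many new devices and new capabilities, we can now take full advantage of these ARM devices. Listed below are a few quick tips that we hope you will find helpful.

1 - Focus on the Target

The latest devices are generally ARMv7, which means they can use v7 builds and features. The latest version of the NDK adds support for ARMv7 and NEON code, which allows key loops and media operations to be optimized far beyond what other approaches can achieve. The NDK provides a small static library that helps you identify the options available at runtime. For examples of how to use these features, see, in the samples directory of the NDK, ...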

ARM technology software newbie? Try the Cortex A-Series Programmer's Guide

The ARM architecture has been used for many years in mobile phones and electronic devices, but it is only relatively recently that the architecture has diversified into being used in laptops, tablets and smartphones. There are now many companies that have adopted the ARM architecture as the basis for their next world-beating technology product. This is great, but the problem is that if you are new to the ARM architecture and want to start writing programs for an ARM processor, where do you start? What document do you need to read first before you dive into the library of technical information that is available on the ARM InfoCenter?

My choice would be the recently released Cortex A-Series Programmer's Guide. This guide provides a gentle in...

Google's V8 on ARM: Five Times Better

The modern web is built primarily from three technologies: HTML, CSS and JavaScript. It is JavaScript that drives the interactive web; slow JavaScript means slow web pages. So today, a huge amount of effort is being put into improving the performance of JavaScript, giving us access to powerful web applications, with features from your desktop, but available wherever you are.


Web applications like Gmail, Google Maps and Google Docs use JavaScript extensively, and the user experience is greatly improved on systems with fast, efficient JavaScript engines. In 2008, this motivated Google to create the V8 JavaScript engine project.


V8 is now, on modern benchmarks, the fastest JavaScript engine available. Rather than interpreting JavaScript as the old engines used to do, V8 uses a Just-In-Time compiler to produce and execute native instructions tailored to the processor on which it is running. The generated instructions are cached, avoiding the overhead of repeated code generation, and deleted when no longer needed.


V8 is now the core technology used in a number of important applications. It is the JavaScript engine used in Google's super-f...

Microsoft’s Windows Embedded Compact 7 Continues Support of ARM

As the majority of Microsoft Windows Embedded Compact business is already based on ARM Processors in products such as the Ford Sync, the announcement of Windows Embedded Compact 7 continues Microsoft’s support of the ARM architecture.

Windows Embedded Compact 7 will include full support for ARM
Windows Embedded Compact 7 will include full support for the ARMv7A architecture, including support for ARM’s NEON SIMD instruction set in addition to symmetric multiprocessing (SMP). Building on ARM and Microsoft’s 14-year relationship, this is Microsoft’s ...

Valgrind 3.6.0 for ARM-Linux

Version 3.6.0 of Valgrind was released a couple of weeks ago. Probably the largest change in this release is the addition of support for Linux running on ARM.

Valgrind is a GPL'd framework for building simulation based debugging and profiling tools, plus a set of "standard" tools. The best known of these is Memcheck, a memory error detector, but in fact it is only one of eight tools in the standard distribution: two memory checkers, two thread checkers, two performance profilers and two space profilers.

You can download the sources from www.valgrind.org. Alternatively, you may be able to get pre-built packages via your Linux distro, or via Linaro, although note that the 3.6.0 upstream release post-dates pre-built packages. 3.6.0 is known to work on Ubuntu 10.04 and 10.10 on ARM, and on the Nokia N900 running Maemo 5.

Also available online is full documentation. For those impatient to get going, the ...

Cortex-A15 to A5: Software compatibility from Superphone to Feature phone

It was always about the code (and where it would be used!)

When I was a software developer I would often find that the project team I was in would try to guess how many devices the code would eventually run on. So at the launch of the Cortex-A15 last week one of the main points that hit home for me was just how wide the spectrum of power and performance points the Cortex-A family of processors could cover - from feature phone to superphone, tablet to DTV, home server to web server etc. This means that a developer could now find their software running across a huge range of devices in the future.

So is it the same software?

Absolutely. Cortex-A15 is based on the same ARMv7A architecture that the other Cortex-A processors use, therefore allowing the exact same application code to run on all of them, from a ...

Coding for NEON - Part 4: Shifting Left and Right

This article introduces the shifting operations provided by NEON, and shows how they can be used to convert image data between commonly used color depths. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers and Part 3: Matrix Multiplication.

Shifting Vectors

A shift on NEON is very similar to shifts you may have used in scalar ARM code. The shift moves the bits in each element of a vector left or right. Bits that fall off the left or right of each element are discarded; they are not shifted to adjacent elements.

The amount to shift can be specified with a literal encoded in the instruction, or with an additional shift vector. When using a shift vector, the shift applied to each element of the input vector depends on the value of the corresponding element in the shift vector. The elements in the shift vector are treated as signed values, so left, right and zero shifts are possible, on a per-element basis.
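
The shift-vector behaviour described above can be modelled in plain C. This is a scalar sketch, not NEON code: the function name shift_by_vector_u16 is hypothetical, 16-bit unsigned elements are an assumption made for the example, and the handling of shift counts of 16 or more is a simplification chosen for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of shifting each element by the matching element of a signed
 * shift vector: positive counts shift left, negative counts shift right, and
 * zero leaves the element unchanged.
 * shift_by_vector_u16 is an illustrative name, not a NEON API. */
void shift_by_vector_u16(const uint16_t *in, const int8_t *shift,
                         uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int s = shift[i];
        if (s >= 16 || s <= -16) {
            out[i] = 0;                                /* every bit falls off the element */
        } else if (s >= 0) {
            out[i] = (uint16_t)((uint32_t)in[i] << s); /* bits leaving the top are discarded */
        } else {
            out[i] = (uint16_t)(in[i] >> -s);          /* bits leaving the bottom are discarded */
        }
    }
}
```

Because the result is truncated back to 16 bits per element, bits shifted out of one element never appear in its neighbour.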

Attached Image

A right shift operating on a vector of signed elements, indicated by the type attached to the instruct...

Coding for NEON - Part 3: Matrix Multiplication

We have seen how to load and store data with NEON, and how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing – multiplying matrices.

Matrices

In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order – this is the format used by OpenGL-ES.

Algorithm

We start by examining the matrix multiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using NEON instructions.

Attached Image

Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.
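
The column-by-scalar structure described above can be written out as a scalar C sketch. This is not the article's NEON implementation; the function name mat4_mul_colmajor is hypothetical, and the output matrix is assumed not to overlap either input.

```c
/* C = A * B for 4x4 matrices stored in column-major order, so the element in
 * row r, column c lives at index c * 4 + r.
 * Each result column is built by scaling the columns of A by single values
 * taken from the matching column of B and accumulating the results.
 * mat4_mul_colmajor is an illustrative name; c must not alias a or b. */
void mat4_mul_colmajor(const float *a, const float *b, float *c)
{
    for (int col = 0; col < 4; col++) {
        for (int row = 0; row < 4; row++) {
            c[col * 4 + row] = 0.0f;
        }
        for (int k = 0; k < 4; k++) {
            float scale = b[col * 4 + k];      /* B(k, col): a single value */
            for (int row = 0; row < 4; row++) {
                /* column k of A, scaled and accumulated into the result column */
                c[col * 4 + row] += a[k * 4 + row] * scale;
            }
        }
    }
}
```

The innermost loop works on four contiguous values of one column, which is the part that maps naturally onto a four-lane vector multiply-accumulate.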

...

10 Android NDK Tips

With new devices and new capabilities being exposed by the Android NDK (Native Development Kit) it is now possible to really get the best out of these ARM based devices. Here are a few quick tips to help that along.

1 - Stay on Target

The newest devices are generally ARMv7, meaning that it can pay to use v7 builds and features. The latest version of the NDK adds support for ARMv7 and NEON code, allowing key loops and media operations to be optimized far beyond what would otherwise be possible. The NDK provides a small static library that will allow you to identify what options you have at runtime. For examples of how to use these features, look at the hello-neon example project in the samples directory of the NDK.

The older devices are v6, but the NDK does not specifically support it, leaving you with the choice of building safely for v5TE or building for v6 and taking the risk that there may be v5TE devices out there. If you need every iota of speed, and know what hardware you are targeting, then it may be worth building for v6. The newest devices, supporting Android 2.0 and up, seem generally to be ARMv7 based, although yo...

Computex: Windows Embedded Compact 7 Highlights Investment in ARM

Yesterday at Computex, the Microsoft Windows Embedded team announced the availability of the latest version of Windows Embedded CE, officially known as Windows Embedded Compact 7. The release is a Community Technology Preview (CTP), which is a fancy way to say public beta. The CTP can be downloaded from the Microsoft website.

Windows Embedded Compact 7 includes a list of cool features to help OEMs develop smart, connected, service-oriented devices with custom user interfaces. But if you take a closer look at the code, you’ll notice an engineering investment and significant improvement: Compact 7 now includes support for more ARM architectures including ARMv7, ARMv7 NEON™ and SMP.

The added ARM architectures provide OEMs working with Windows Embedded competitive performance in the segments proliferated by ARM and our ARM Partners – ...

Support for VP8 and WebM on ARM

It continues to be an exciting time for the development of web technologies on the ARM architecture, allowing the Internet to reach the maximum number of devices. Today sees an advancement in video for the web with the WebM project, announced at Google I/O 2010 (Google’s annual developer conference). A key part of this announcement was the contribution of the VP8 video codec, royalty-free, by Google.

So why is this good for ARM and our Partners? Well ultimately the delivery of the full web drives the development of great devices, and video in particular makes up an ever increasing proportion of data being consumed: in other words consumers want video, and an efficiently designed, open video codec helps.

There is already a huge amount of video being delivered on the Internet: Cisco’s Visual Networking ...

Coding for NEON - Part 2: Dealing With Leftovers

In the first post on NEON about loads and stores we looked at transferring data between the NEON processing unit and memory. In this post, we deal with an often encountered problem: input data that is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array - what is the best way to do this on NEON?

Leftovers

Using NEON typically involves operating on vectors of data from four to sixteen elements in length. Frequently, you will find that your array is not a multiple of that length, and you have to process those leftover elements separately.

For example, you want to load, process and store eight elements per iteration using NEON, but your array is 21 elements long. The first two iterations go well, but for the third, there are only five elements remaining to be processed. What do you do?
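
To make the arithmetic concrete, here is a scalar C sketch of the situation: whole blocks of eight are processed first, then whatever is left over is handled one element at a time. This is only one straightforward option and is not claimed to be the fastest. The names LANES, leftover_count and add_one_u8 are hypothetical, and the inner loop merely stands in for a single vector operation.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* elements handled per iteration in this sketch */

/* Number of elements left over after all full blocks have been processed. */
size_t leftover_count(size_t n)
{
    return n % LANES;
}

/* Adds 1 to every element: full blocks first, then the leftovers singly.
 * add_one_u8 is an illustrative name; the block loop stands in for one
 * load / process / store step on an eight-element vector. */
void add_one_u8(uint8_t *data, size_t n)
{
    size_t full_blocks = n / LANES;
    size_t i = 0;

    for (size_t blk = 0; blk < full_blocks; blk++) {
        for (size_t lane = 0; lane < LANES; lane++) {
            data[i + lane] = (uint8_t)(data[i + lane] + 1);
        }
        i += LANES;
    }

    /* Leftover elements that do not fill a whole block. */
    for (; i < n; i++) {
        data[i] = (uint8_t)(data[i] + 1);
    }
}
```

For an array of 21 elements this gives two full blocks (16 elements) and leftover_count(21) == 5 remaining elements.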

Fixing Up

There are three ways to handle these leftovers. The methods vary in requirements, performance, and code size. They are listed below in order, with the fastest approach first.

Larger Arrays

If you can change the size of the arrays that you are processing, increase the length of the array to the next multiple of the vector size using padding elements. This allows you to read and write beyond the end of your data without corrupting ad...

Coding for NEON - Part 1: Loads and Stores

ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, speech and image processing.

This is the first part of a series of posts on how to write SIMD code for NEON using assembly language. The series will cover getting started with NEON, using it efficiently, and later, hints and tips for more experienced coders. We will begin by looking at memory operations, and how to use the flexible load and store with permute instructions.

An Example

We will start with a concrete example. You have a 24-bit RGB image, where the pixels are arranged in memory as R, G, B, R, G, B... You want to perform a simple image processing operation, like switching the red and blue channels. How can you do this efficiently using NEON?

Using a load that pulls RGB data linearly from memory into registers makes the red/blue swap awkward.

Code to swap channels based on this input is not going to be elegant – masks, shifting, combining. It is unlikely to be efficient.
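
For reference, the operation itself is simple when written one pixel at a time. The scalar C sketch below is not NEON code; it is a plain baseline that could serve as a correctness check for a vectorized version. The function name swap_red_blue is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Swaps the red and blue channels of a packed 24-bit RGB image in place.
 * Pixels are laid out in memory as R, G, B, R, G, B, ...
 * swap_red_blue is an illustrative name for a scalar reference version. */
void swap_red_blue(uint8_t *rgb, size_t pixel_count)
{
    for (size_t p = 0; p < pixel_count; p++) {
        uint8_t *px = rgb + 3 * p;  /* start of pixel p */
        uint8_t red = px[0];
        px[0] = px[2];              /* blue moves into the red slot */
        px[2] = red;                /* red moves into the blue slot */
    }
}
```

The difficulty for SIMD is that each channel recurs every third byte, so a linear load leaves red, green and blue values mixed together within one register.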

NEON provides structure load and store instructions to help in these situations. They pull in data from memory and simultaneously separate valu...

All company and product names appearing in the ARM Blogs are trademarks and/or registered trademarks of ARM Limited per ARM’s official trademark list. All other product or service names mentioned herein are the trademarks of their respective owners.