ARM Community: Software Enablement - ARM Community

Page Colouring on ARMv6 (and a bit on ARMv7)

Page colouring is a technique for allocating pages for an MMU such that the pages exist in the cache in a particular order. The technique is sometimes used as an optimization (and is not specific to ARM), but as a result of the cache architecture some ARMv6 processors actually require that the allocator uses some page colouring. Some ARMv7 processors also have related (though much less severe) restrictions. This article will explain why the cache architecture imposes this restriction, and what it means in practice.

Note that this restriction only very rarely needs to be considered outside of the physical memory allocator in the kernel (or other privileged code). Typical user-space code probably won't have to deal with this directly, though understanding page colouring can help to explain why some mmap calls work on ARMv7 but fail on ARMv6, for example.

The restriction stems from the fact that many ARMv6 processors use VIPT caches. VIPT means "virtually indexed, physically tagged". If you're not familiar with cache terminology, that probably won't mean a lot, but I will try to explain by way of example.
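As a rough illustration of the idea, the following Python sketch computes a page's "colour" for a VIPT cache. It assumes 4KB pages and treats the number of colours as the cache way size (cache size divided by associativity) divided by the page size. The cache geometries in the usage note and all function names are hypothetical, chosen only for illustration, and do not describe any particular processor.

```python
PAGE_SIZE = 4096  # assumed 4KB pages


def num_colours(cache_size, ways, page_size=PAGE_SIZE):
    """Number of page colours: how many pages fit in one cache way."""
    way_size = cache_size // ways
    return max(1, way_size // page_size)


def page_colour(address, colours, page_size=PAGE_SIZE):
    """Colour of the page that contains 'address'."""
    return (address // page_size) % colours


def same_colour(virt_a, virt_b, colours):
    """True if two virtual addresses fall in pages of the same colour.

    In this simplified model, two virtual mappings of one physical page
    index the same cache lines only when their colours match.
    """
    return page_colour(virt_a, colours) == page_colour(virt_b, colours)
```

Under these assumptions, a hypothetical 32KB 4-way cache has 8KB ways and therefore two colours, while a 16KB 4-way cache has only one colour, so no colouring constraint arises in this model.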

In general, ARMv7 is not affected by ARMv6's page colouring restrictions. However, ARMv7 can have VIPT ...

Design West (ESC) Day 1: Optimizing Your Software on ARM

It was a full first day at the ESC Summit of Design West. I spent most of the time doing what I love to do best: talking with ARM Partners. Most of the conversations focused on how engineers can achieve better software on their ARM processor-based SoCs.

Andy Frame had a few meetings in the morning, so he passed the baton (microphone) to me so that I could share some of the insights with the ARMFlix followers. I’m looking forward to Day 2 and speaking with more of the ARM Partners at ESC (check out our handy map). Don’t miss the ARM Connected Community Theatre at ...

Ne10: A New Open Source Library to Accelerate your Applications with NEON

Over the past three years we have seen explosive growth in the use of the NEON™ SIMD engine by many of our software partners in the open-source community. The engine itself, defined as part of the ARM® Architecture, Version 7 (ARMv7), has shown itself to be extremely flexible and able to accelerate everything from video codecs such as VP8 to elements of the emerging HTML5 standard, including <svg> and <canvas> filters. From an application developer's viewpoint, all of this acceleration takes place behind the scenes in upstream open source projects that are harvested to build the latest and greatest open source operating systems and frameworks such as Android™ and QT. While it is good to kno...

Using ARM NEON to accelerate Scalable Vector Graphics in webkit by up to 4x

Introduction
In the information era, with its increased use of mobile devices to communicate and access information, web browsers are the central component for navigating the vast amount of information available, as they can fetch and visualize content spread across the world-wide data network known as the Internet. Over the last decade, the visualization capabilities of web browsers have been greatly enhanced by the increase in processing power of general-purpose CPUs and graphics accelerators. Most mobile platforms include a general-purpose SIMD engine, such as ARM NEON, which can be used to process multimedia formats efficiently and help enhance the user experience, with up to a 4x improvement as discussed in this article.

Background
The web browser group at the ...

Coding for NEON - Part 5: Rearranging Vectors

This article describes the instructions provided by NEON for rearranging data within vectors. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers, Part 3: Matrix Multiplication and Part 4: Shifting Left and Right.

Introduction

When writing code for NEON, you may find that sometimes, the data in your registers are not quite in the correct format for your algorithm. You may need to rearrange the elements in your vectors so that subsequent arithmetic can add the correct parts together, or perhaps the data passed to your function is in a strange format, and must be reordered before your speedy SIMD code can handle it.

This reordering operation is called a permutation. Permutation instructions rearrange individual elements, selected from single or multiple registers, to form a new vector.
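To illustrate what a permutation means here, the sketch below models vectors as plain Python lists and builds a new vector by selecting elements from one or more sources. It is only a conceptual model: `permute` and the selector format are hypothetical names invented for this example, and they do not correspond to any specific NEON instruction.

```python
def permute(sources, selectors):
    """Build a new vector by picking elements from the source vectors.

    Each selector is a (source_index, element_index) pair.
    """
    return [sources[src][elem] for src, elem in selectors]


a = [0, 1, 2, 3]
b = [10, 11, 12, 13]

# Reverse the elements of a single vector.
reversed_a = permute([a], [(0, 3), (0, 2), (0, 1), (0, 0)])

# Interleave the low halves of two vectors.
interleaved = permute([a, b], [(0, 0), (1, 0), (0, 1), (1, 1)])
```

Here `reversed_a` is `[3, 2, 1, 0]` and `interleaved` is `[0, 10, 1, 11]`: one result draws on a single register, the other on two.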

Before we begin

Before you dive into using the permutation instructions provided by NEON, consider whether you rea...

Setting Up Android Mobile Phone to Use ARM Streamline for Profiling

This article describes how to set up your Android phone to run ARM Streamline Performance Analyzer.

ARM Streamline Performance Analyzer is a system-wide visualizer and profiler for targets running ARM Linux or Android native applications and libraries. Combining an ARM Linux kernel module, target daemon, and a graphical user interface, it transforms system trace and sampling data into reports that present the data in both visual and statistical forms.

Streamline supports Cortex™-A8, Cortex-A9, ...

Developing Top Performing Graphics Applications for Android Made Easy

The new DS-5 Community Edition brings CPU and GPU statistics together to speed up Android games and applications

Game Developers Conference (GDC), San Francisco - These are very special days for Android application developers targeting ARM processor-based devices.


On March 2nd the version of the ARM® Development Studio 5 (DS-5™) toolchain dedicated to Android native application developers, the DS-5 Community Edition (CE), was selected as a finalist for the Eclipse Community Awards in the ...

Annotating ARM Streamline Profiles of Mozilla Browsers from JavaScript

I have recently been using the ARM Streamline profiler to study the behaviour of Mozilla Mobile Firefox (code-named Fennec) on Android. Streamline is a graphical profiling tool that is provided with ARM's DS-5™ development tool suite. Some time ago, whilst investigating a Fennec performance regression bug using Streamline, I had noticed some unexpected activity on the browser's main process. However, it was not clear whether the activity was some periodic event unrelated to the benchmark (perhaps related to garbage collection) or something triggered by the benchmark itself. The graphical timeline view in Streamline is very good for identifying areas that might benefit from optimization, and for highlighting anomalies and bottlenecks. Howe...

Top 2011 ARM Software blogs: Android, NEON, RISC vs CISC & Assembly

2011 was a busy year for developing software on ARM and the activity is reflected in the page views of top Software Enablement blogs. The topics included managing caches, Android (multiple), NEON (multiple), Memory Access Ordering, RISC vs CISC architectures (multiple), and optimizing assembly code (listed by popularity below). In addition, the Software Enablement Community pages (Linux, Solution Center for Android, RTOS, Microsoft, etc) were some of the highest referenced pages on the ARM site. Please let us know if you have more ideas for easing your software development on the ...

Oracle's Java SE server compiler now on ARM

Last month Oracle shipped 2 sets of Java SE for Embedded releases for ARM: 7 Update 2 and 6 Update 30. Java SE for Embedded 7u2 is a key release for the ARM Community as it includes the first offering of Oracle's server JIT (Just-In-Time) bytecode compiler for ARM. The server compiler, a highly optimizing JIT compiler used to produce Oracle's record-setting Java SE benchmarks, is now available on ARMv7.

Some quick background on Oracle's JIT compilers - there are 2 compilers for Java SE: client and server. The client compiler is a fast start-up, lightly optimizing compiler. It's better suited for smaller footprint systems and those running applications that require fast start-up such as GUI apps. The server compiler is targeted for long-running applications where throughput is most important. It produces highly-optimized code but incurs a start-up cost in achieving that. At JavaOne 2011 in San Francisco, we shared information on the client and server compilers in a joint ...

Branch and Call Sequences Explained

In this post, I will explain the various branch and call instructions available in the ARM and Thumb instruction sets, and why the variants exist. Finally, I will provide a JavaScript tool that can help you find a typical branch sequence matching your requirements.

What Does a Branch Do?

A branch, quite simply, is a break in the sequential flow of instructions that the processor is executing. Some other architectures call them jumps, but they're essentially the same thing. The following is a trivial, and hopefully familiar example of a branch:

entry_point:
    mov     r0, #0      @ Set r0 to 0.
    b       target      @ Jump forward to 'target'.
    mov     r0, #1      @ Set r0 to 1.
target:
    ...                 @ At this point, r0 holds the value 0.
    ...                 @ The second mov instruction did not execute.
Example of branch execution.

...

x264 on ARM: Bringing a wider application of video conferencing (Part 3)

In part one and part two of this blog series, we introduced the video conferencing use case requirements and performed tuning of x264 for optimal tradeoff between bit rate, frame rate and video quality… In this part, we will test and analyze encode performance for optimal execution on the target ARM platform.

1 Test results on the target ARM platform

Using the results from the previous step, we test the options “default”, “--preset ultrafast”, “--preset superfast” and “--preset veryfast” with our optimal settings on the ARM platform and evaluate the performance against our use case requirements for video conferencing.

1). List of Combinations (settings tested)
Attached Image


2). Result

Attached Image
Attached Image
Attached Image


3). Conclusion
According to the above information, we can conclude:
Attached Image


2 Summary of the optimal settings
According to the test results above, we might conclude: when rc-lookahead is set to 1, the bit rate is the minimum and the...

x264 on ARM: Bringing a wider application of video conferencing (Part 2)

In part one of this series, we introduced the video encode requirement and target development environment. In this part, we start by examining benchmarking results for H.264 encoding using x264 on a development host rather than the ARM target. This enables us to perform some initial tuning of the encode settings for our target use case of video conferencing.

1 Benchmarking

For video conferencing, we want low bit rate, but at the same time we must maintain at least a 15 fps frame rate so that video quality is acceptable. So we focus on balancing bit rate, frame rate and video quality when investigating the performance of x264 encoder. In x264, there are many options for enabling and tuning the performance of the encoder. So we test the various combinations and evaluate resultant bit rate and video quality. Since relative bit rate and video quality are facets of the encode algorithm and independent of the hardware platform, we first test x264 using our desktop development host.

1.1 Bit rate and video quality testing
1) Description
We test the different combinations of options that x264 supports. We focus on bit rate, frame rate and video quality. Video quality is measured by PSNR (Peak Signal-to-Noise Ratio). PSNR is most commonly used as a measure of q...
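As a rough illustration of the metric, PSNR is conventionally computed as 10·log10(MAX² / MSE), where MSE is the mean squared error between the reference and decoded samples and MAX is the largest possible sample value (255 for 8-bit video). The Python sketch below assumes samples are given as flat lists of integers. The function name and data layout are assumptions made for this example and are not part of x264.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the decoded samples are closer to the reference, which is why it serves as a simple objective proxy for video quality when comparing encoder settings.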

x264 on ARM: Bringing a wider application of video conferencing (Part 1)

Video is increasingly becoming an important and essential part of consumer electronics. Video centric features like augmented reality and video conferencing provide enhanced visual user interaction. Such features are now expected across a wide variety of application segments. In the embedded world, intensive video compression is typically done using standard DSP’s or specialized hardware accelerators, as they can provide both the specialized functionality and the high level of performance required. However, now ARM processors with NEON™ technology can be as capable of compressing video as some dedicated hardware, and do so with greater power efficiency.

H.264/MPEG-4 AVC (Advanced Video Coding) is currently one of the most commonly used formats for the recording, compression, and distribution of video content. The H.264 video format has a very broad application range that covers all forms of digital compressed video from low bit-rate Internet streaming applications to HDTV broadcast. With the use of H.264, bit rate savings of 50% or more ar...

ARM DS-5 Community Edition: Enabling the Android Developer Community

The need for quality professional tools for cross-platform development when battling obscure software bugs and performance issues cannot be overstated. With the ARM® Development Studio 5 (DS-5™) Community Edition (CE) we deliver some of the professional development capabilities of the DS-5 toolkit to the Android developer community. In this blog we’ll explore a few of the debug features of DS-5 CE and look at how it enables application development on Android.

Is DS-5 Community Edition for me?
The purpose of DS-5 CE is to bring the power of DS-5 tools to small development firms (with 10 or fewer employees) and individuals who publish applications for Android. This edition of DS-5 tools supports debug of native C/C++ libraries included in Android applications, on real devices and emulators. DS-5 CE also includes a basic version of the ...

Software Debuggers: What next?

Great leaps in human knowledge are linked to advancements in tools. For software engineers the debugger is a cornerstone tool to create reliable products. We engage in a running battle fighting both the problem and the tool because each debug situation is unique.

This is a major challenge for debugger designers, i.e. how do we build a debugger with a balance of flexibility and simplicity with enough usefulness? With the rate of hardware change increasing – along with improvements in Operating Systems and Programming Languages, debugging and optimizing software becomes paramount i.e. improving performance, code-size and energy-efficiency.

As we transition towards more hardware parallelism, the pressure grows to keep the time-to-market at least constant, which in turn means new debugging strategies. From my perspective it is always good to draw from the current technology landscape and consider how we can use these technologies in the future. For instance, Cover Flow UI could be used to improve multi-processor viewing, Microsoft Kinect to navigate the debug UI, and furthermore expert systems running in the cloud to help a...

Optimizing DirectFB with ARM NEON

DirectFB (Direct Frame Buffer) is a graphics library that is widely used in embedded systems, especially in the home market. More and more applications and libraries, such as Cairo, GDK, Qt, V8, X11 and Webkit, choose DirectFB as a backend. ARM NEON technology is well suited to 2D acceleration. In this blog, I’ll describe how to optimize DirectFB using NEON.

1. Introduction
1.1 DirectFB Introduction
DirectFB (Direct Frame Buffer) is a thin library that provides hardware graphics acceleration, input device handling and abstraction, and an integrated windowing system with support for translucent windows and multiple display layers. It is free software licensed under the terms of the GNU Lesser General Public License (LGPL). Graphics features provided by DirectFB include Rectangle Filling/Drawing; Triangle Filling/Drawing; Line Drawing; Blit; Alpha Blending (texture alpha, alpha modulation); Porter/Duff; Colorizing; Source Color Ke...

ARM Fundamentals: Introduction to understanding ARM processors

Finding one's way through references to ARM processors is not always obvious.
This article is the first of a series on ARM fundamentals that will introduce various topics to help you get more familiar with the ARM architecture. It aims at helping you to better understand ARM processors, starting with explaining how they are named, and then showing how knowing your processor matters by introducing a few of their recent features.

If you are curious about what is in your pretty electronic device or are a developer willing to understand how to start getting the best out of your processor, you may find some useful information here. The second part of the article may be technically a bit more challenging than the first, but don't worry! The few code samples are only concrete examples used to illustrate the explanations. The specific details are not necessary to understand the global picture.

The first step is to understand how ARM processors are referenced: it certainly sounds nice, but what is this "dual Cortex-A9, based on ARMv7" in your super-phone?


Processor fami...

Dawn of Energy Efficient High Performance Computing at SC11

I’ve been attending Supercomputer Conferences (SC) since 2005. Supercomputers are more commonly called High Performance Computers or just HPC. HPCs, by design, stretch the boundaries of hardware and software technologies. Each generation attempts to attain the next big thing in performance in order to better understand and solve very complex scientific, medical, engineering, and environmental challenges (to name but a few) that would otherwise not be possible. The performance level for the current generation of high-end HPC computers is called Petascale (10^15 FLOPS) and the next generation performance level is called Exascale (10^18 FLOPS).

Exascale will have the new burden of operating within a strict power budget. Taking today’s technologies as an extreme example, it can be seen that instead of requiring one power station per Petascale HPC, there are going to be two power stations required per Exascale HPC - this is obviously impractical.

The HPC industry has moved from being proud of the Megawatts of consumption, to adding “green” in the name. At SC2...

Solving the Challenge of Software Complexity for Today’s Embedded Developer

Just launched today: the Embedded Software Store! As mentioned in my previous post, here is our solution for addressing the challenges of embedded software development.

Management typically has four approaches to addressing the challenges of software complexity.

1) Increase work hours
2) Outsource work
3) Add headcount
4) Raise efficiency

We will examine each of these approaches considering the trade-offs with cost, time to market and scalable engineering resources.

The first 3 solutions are the most obvious approaches. But there are limitations to these approaches and one major setback - cost implications.

Increase work hours: This is easily the most popular management approach. This can be effective if the duration is short, but looking at the magnitu...

Memory access ordering part 3 - memory access ordering in the ARM Architecture

In my previous posts, I have introduced the concept of memory access ordering and discussed barriers and their implementation in the Linux kernel. I chose to do it in this order because I wanted to start by communicating the underlying concepts before I went into detail about what the ARM architecture does about memory ordering. This post goes into the juicy bits of what this actually means and how this is handled in the ARM architecture.

Two separate concepts are relevant to memory access ordering in the ARM architecture — memory types and shareability domains. These progressively made their explicit entry into the ARM architecture in versions 6 and 7, implemented by the ARM11 and Cortex family of processors respectively.

Enter the abstract

When describing many of the concepts mentioned in thi...

ARM TechCon: Shall We Talk Android?

Are you working on Android and looking for information about Android Partners, Developers, Android Market, and more? If so, welcome to ARM® TechCon™ 2011, where members of the ARM ecosystem and the ARM Connected Community®, including several of the 175+ members of the Solution Center for Android program, come together to share their vision for the future of technology and discuss the opportunities and challenges associated with today’s connected devices. The event will take place on Oct. 25-27 and there are lots of Android activities going on! Here are some more specifics:

Android sessions @Software & Systems Design Conference

When: Wednesday, October 26 & Thursday, October 27
Where: Santa Clara Convention Center
...

Advances in technology create new problems for today’s embedded developers

Moore’s Law states that “the number of transistors that can be placed inexpensively on a semiconductor integrated circuit (IC) doubles approximately every two years”. For embedded developers, this means that advances in process technology have yielded increased processor bandwidth and higher memory densities as process geometries shrink.

These trends have created significant challenges for embedded developers, as they now have more capability to work with and hence can create more complex products. This increased design intricacy has created an environment where software is poised to grow significantly, as witnessed in the automotive and smart metering markets where the processing bandwidth has leaped from three to five DMIPS to over 150 DMIPS and the memory requirements for software have increased by up to 40x over the past 20 years.

Software engineers are facing many challenges these days including:

1) Complexity: As consumer demand for simplified interfaces and high performance low power electronic and home entertainment devices increases, the complexity of the software expands exponentially. In addition the integration of a wide array of hard...

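Positioning, Cooperation, Sharing - Thundersoft (中科创达)'s Android Red Ocean Strategy

Introduction: Thundersoft is a provider of core Android technology and complete solutions. By offering complete Android solutions and services, it helps OEM customers bring high-quality products to market quickly. Thundersoft has extensive experience in low-level Android system technology, middleware and application development, integration and services, and has a unique advantage in the supply chain for mobile internet devices such as smartphones and tablets. Thundersoft...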

10 ways to give your customers the DS-5 experience

ARM Development Studio 5 (DS-5) is the software development tool that sets the standard for ARM with its optimising compiler, its extensible and easy to use debugger, and its unique analysis tool, Streamline. But DS-5 is not just for the ARM IP: if you're the designer of an ARM based SoC, an operating system that supports ARM or have productivity tools that support ARM, you can join the DS-5 ecosystem to make sure that your customers also get the DS-5 experience. Here are 10 ways you can do it.

1. Add DS-5 debug support to your ARM based SoC. The DS-5 debugger has a target database that is extensible by you,...

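Linaro Series, Part 2: Getting Involved with Linaro

In the previous article, we shared Linaro's history, current status and organizational structure. In this article, we look at how companies and individual developers can take part in Linaro's development work.

This is the second article in the Linaro series and contains the following sections:

1. Linaro members
2. Linaro Partner Program (LPE)
3. Linaro community users
4. The official Linaro website
5. Links to Linaro community resources

System-on-chip (SoC) vendors should pay particular attention to the first section of this article (Linaro members).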

...

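Linaro Series, Part 1: An Introduction to Linaro

This is the first article in the Linaro series. It mainly covers Linaro's history, current status and organizational structure.
This article contains the following sections:

1. The origins of Linaro
2. An overview of Linaro
3. The Linaro Technical Steering Committee (TSC)
4. Linaro...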

Debugging Airplay Android applications on the PandaBoard

This blog describes some of the techniques available to deploy and debug native Android applications on target. My main focus for setting about on this debug session was to debug an application generated by the Airplay SDK (see footnote#), a cross-platform development SDK that primarily uses host debug tools like Visual Studio and Xcode to debug a version of the mobile application built for the desktop. Using workflows that combine the powerful and easy to use DS-5 Debugger with Airplay SDK, developers can gain the convenience and flexibility which they greatly need when debugging code on target.

I’m going to start from the point of view that you already have the Android SDK and NDK installed since Google’s documentation goes through that process with some step-by-step instructions. I als...

Analysis of Airplay Android apps using ARM Streamline

On how many occasions have you engineered an application that works perfectly well on the Android SDK emulator, only to find that the performance is not quite the same when testing on a real device? My job in the ARM® Development Studio 5 (DS-5™) applications engineering team is to figure out ways in which the tools we produce can deliver quantifiable value for application developers by solving problems they face day to day.

I happened to see a partner presentation on Airplay SDK (see footnote#), which is an Ideaworks product that makes cross platform application development easy. Airplay supports Android as a target OS for the SDK. This blog entry is about my attempt to explore a sensible, easy way of using Airplay and ...

Debug & performance analysis of Linaro images with ARM Development Studio-5

ARM Development Studio 5 (DS-5™) provides a user friendly interface for debugging Linux applications running on ARM platforms. Also built into DS-5 is ARM Streamline, a powerful profiling tool that allows us to measure the performance of Linux applications running on ARM Linux.

The PandaBoard is a compact mobile platform built around the Texas Instruments OMAP 4430 processor. With a dual-core Cortex-A9 processor at its heart, it is ideal for running ARM-Linux, and has good connectivity.

In this article we will go through the steps required to set up Linux on the PandaBoard using files supplied from ...

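Google's V8 Deployed on ARM: Performance Improved Fivefold

The modern web is built primarily on three technologies: HTML, CSS and JavaScript. JavaScript drives the interactive web, so slow JavaScript execution means slow web pages. A great deal of work is therefore going into improving JavaScript performance, giving us desktop-class features and easy access to powerful web applications wherever we are.

Web applications such as Gmail, Google Maps and Google Docs all use JavaScript extensively, and fast, efficient JavaScript engines greatly improve the user experience on all kinds of systems. It was this that, in 2008...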

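Android 2.3 (Gingerbread) NDK: Now Even Closer to Pure Native Development

With the recent release of Gingerbread, and with daily activations of ARM-based Android devices reaching 300,000 (see James Bruce's blog), developers have an unprecedented opportunity. The tools available to developers are also better than before. The Gingerbread update brings a strong focus on how to support developers in creating high-quality content for a fast-growing consumer market.

Android has long provided a Native Development Kit (NDK) alongside the Software Development Kit (SDK). The NDK supports C/ARM...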

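How to Bring Android into the Internet-Connected Digital Home? Part 4

In Part 1, we shared the trends and characteristics of digital home software platforms.
In Part 2, we summarized the five major technical challenges of porting Android to TV and set-top box platforms, and focused on Challenge 1: 2D/3D graphics performance and user interaction models suited to the TV experience.

In Part 3, we explored further challenges, including "a rich multimedia audio and video experience suited to large screens", "how to integrate digital TV functionality" and "how to encourage application developers to build Android applications suited to TV".
In this part, we will look at the challenges of content protection and system security, as well as And...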

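Coding for NEON - Part 4: Shifting Left and Right

This article introduces the shifting operations provided by NEON, and shows how they can be used to convert image data between commonly used colour depths. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers and Part 3: Matrix Multiplication.

Shifting Vectors

A shift on NEON is very similar to the shifts you may have used in scalar ARM code. The bits of each vector element are shifted left or right, and bits that fall off the left or right of each element are discarded; they are not shifted into adjacent elements.

The amount to shift can be specified with a literal encoded in the instruction, or with an additional shift vector. When a shift vector is used, the shift applied to each element of the input vector depends on the value of the corresponding element in the shift vector. The elements of the shift vector are treated as signed values, so left, right and zero shifts are all possible, on a per-element basis.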
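The per-element behaviour described above can be sketched in Python. This is only a conceptual model under stated assumptions: lanes are unsigned 8-bit values held in a list, a positive shift count means a left shift, a negative count means a right shift, and shift counts stay within the element width. The name `shift_lanes` is hypothetical and is not a NEON instruction or intrinsic.

```python
def shift_lanes(values, shifts, bits=8):
    """Shift each lane by the signed amount in the matching lane of 'shifts'.

    Bits shifted out of a lane are discarded and never reach a neighbour.
    """
    mask = (1 << bits) - 1
    result = []
    for value, shift in zip(values, shifts):
        if shift >= 0:
            # Left shift: drop any bits that overflow the lane width.
            result.append((value << shift) & mask)
        else:
            # Negative count: shift right by the magnitude.
            result.append((value & mask) >> -shift)
    return result


lanes = shift_lanes([0x81, 0x81, 0x81], [1, -1, 0])
```

With the inputs shown, `lanes` is `[0x02, 0x40, 0x81]`: the top bit of the first lane and the bottom bit of the second lane are lost rather than carried into adjacent lanes, and the third lane is unchanged.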

...

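10 Tips for the Android NDK

With many new devices available and new features exposed by the Android NDK (Native Development Kit), we can now take full advantage of these ARM devices. Here are some quick tips that I hope you will find helpful.

1 - Focus on the target

The latest devices are generally ARMv7, which means they can use the v7 instruction set and features. The latest version of the NDK adds support for ARMv7 and NEON code, which allows key loops and media operations to be optimized far beyond what other approaches can achieve. The NDK provides a small static library that helps you identify the available options at run time. For examples of how to use these features, see the following in the NDK samples directory: ...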

Using the ARM Profiler with the Cadence Virtual System Platform

One of the most common requests from software engineers running software on a Virtual Platform is to be able to profile the executing software. In this document I will describe how to use the ARM Profiler included in RVDS Professional with the Fast Models from ARM that are commonly used to create SystemC Virtual Platforms with the Cadence Virtual System Platform (VSP). As you can see from the introduction, a combination of products works together to enable non-intrusive profiling of embedded software, but at times the how-to details of profiling can be a bit of a mystery, and it may not be obvious where to look for them. In fact, a recent VSP user tried to set up profiling by himself, was not successful, and wasn’t really sure where to look. As a result I created the following information, which I’m sure will be valuable to many other readers.

...

Porting Linux made easy with DS-5

Here at ARM, a colleague recently wanted to port Linux to a prototype of a new high-performance Cortex-A9 based platform. To develop and debug this port, he needed to be able to set breakpoints, view registers, view memory, single-step at source level, and so on, in fact all the normal facilities provided by a debugger, but he wanted to do these both before the MMU is enabled (with a physical memory map), and after the MMU is enabled (with a virtual memory map).

The DS-5 Debugger has a slick Debug Configuration dialog in Eclipse that makes it easy to configure a debugging session to a target. Predefined debug configuration types include “Bare Metal Debug”, “Linux Application Debug”, and “Linux Kernel and/or Device Driver Debug”. The latter is the topic of this blog. This debug configuration type is primarily designed for post-MMU debug to provide full kernel awareness, but also has some extra features that allow it to be used for pre-MMU debug too. This makes it possible to debug the Linux kernel, all the way from its entry point, t...

From Zero to Boot: Porting Android to your ARM platform

This article describes how to get Android running on your favourite ARM-based System on Chip (SoC) board. We run through the overall procedure and point out potential pitfalls and other things that you may encounter.

Since the Android software stack was primarily designed around the ARM Architecture, there are not many things that need amending to get it to work on another ARM platform.

We assume that your workstation has Ubuntu (10.10 or later) Operating System installed, and that you have already followed the instructions found at [1] to be ready to build Android sources. These instructions have been tested with Ubuntu 10.10, but they should be compatible with other GNU/Linux OSes.

Terminology

For the purposes of this document, we use the following terms.

Mainline kernel ...

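How to Bring Android into the Internet-Connected Digital Home? Part 2

In the previous article, we discussed the present and future of digital TV and set-top box software architecture, and shared the future trends and characteristics of digital home software platforms (http://bit.ly/jCvlNs). In this article, we will explore why Android can become one of the choices for the future digital home software platform, and how we can port Android, originally tailored for handheld devices, to TV and set-top box platforms.

1. The first question we need to answer is:
Why Android?
Why can Android become a strong contender for the future digital home software platform?
Let's first look at Android's own natural advantages:
Android is a complete software solution for consumer electronics devices. It includes: ...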

ARM technology software newbie? Try the Cortex A-Series Programmer's Guide

The ARM architecture has been used for many years in mobile phones and electronic devices, but it is only relatively recently that the architecture has diversified into being used in laptops, tablets and smartphones. There are now many companies that have adopted the ARM architecture as the basis for their next world-beating technology product. This is great, but the problem is that if you are new to the ARM architecture and want to start writing programs for an ARM processor, where do you start? What document do you need to read first before you dive into the library of technical information that is available on the ARM InfoCenter?

My choice would be the recently released Cortex A-Series Programmer's Guide. This guide provides a gentle in...

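How to Bring Android into the Internet-Connected Digital Home? Part 1

As an excellent open-source software solution, Android has extended its reach from the mobile phone market to tablets, and even into the digital home, typified by digital TVs and set-top boxes. Android was originally tailored for mobile handsets; its default supported resolutions, colour depth, multimedia playback architecture, user interaction model and 2D/3D graphics performance are not suited to home applications such as digital TVs and set-top boxes.
Porting Android to a digital TV or set-top box therefore requires extensive customization and modification of Android. These changes touch every layer of the Android software architecture. I will use four blog posts to describe, in turn, how to port standard Android to digital TV or set-top box platforms.

Before we begin our discussion, let's briefly look at the present and future of digital TV and set-top box software.
The current state of digital TV and set-top box software
1 Differentiated software architectures

At present, because digital TV and set-top box software architectures use different operating systems, ...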

CoreSight delivers debug and trace for tomorrow’s systems

Debugging and optimizing software is always a demanding job. As today’s systems become more complex the task of providing a suitable debug and trace solution becomes increasingly challenging – yet it’s more important than ever. Without comprehensive insight into the system’s behaviour debug and optimization is incredibly difficult. Years ago ARM developed debug logic, known as embedded ICE, to provide developers with a way to gain access to the heart of their system and see how the software and hardware interact:

Attached Image

This was the beginning of the CoreSight debug architecture that has gone on to provide a comprehensive range of debug & trace solutions that are present throughout a broad range of today’s platforms. Things have changed dramatically and even a relatively simple system of today has a broad range of debug & trace components:

Attached Image

Designing and delivering debug and trace solutions for complex systems is about to get a whole lot easier. In the past CoreSight has supplied design kits where a range of debug and trace components are supplied pre-configured and tailored for specific processors. At ...

Linaro Second Engineering Cycle Highlights

As we come to the end of our second engineering cycle, I thought it would be interesting to highlight 4 of the initiatives happening in Linaro that I believe are having the biggest impact on how we are demonstrating Linaro delivering on its initial mandate.

Linaro Evaluation Builds (LEBs): We’ve had an almost universally positive reaction to the initiative we started this year – to deliver evaluation builds of popular OSS distributions on our Members’ hardware. Our initial targets are Android and Ubuntu. The LEBs provide an integration point for Linaro Working Group developments, delivered on a set of reference platforms for the relevant OS. LEBs were created to make it easier for companies producing distributions or vertically integrated open source stacks to adopt Linaro software, reduce time to market for our Members through streamlined integration and validation of our Landing Team efforts, and mediate the flow of innovation between Linaro and their engineering teams ...

Google's V8 on ARM: Five Times Better

Attached Image
The modern web is built primarily from three technologies: HTML, CSS and JavaScript. It is JavaScript that drives the interactive web; slow JavaScript means slow web pages. So today, a huge amount of effort is being put into improving the performance of JavaScript, giving us access to powerful web applications, with features from your desktop, but available wherever you are.


Web applications like Gmail, Google Maps and Google Docs use JavaScript extensively, and the user experience is greatly improved on systems with fast, efficient JavaScript engines. In 2008, this motivated Google to create the V8 JavaScript engine project.


V8 is now, on modern benchmarks, the fastest JavaScript engine available. Rather than interpreting JavaScript as the old engines used to do, V8 uses a Just-In-Time compiler to produce and execute native instructions tailored to the processor on which it is running. The generated instructions are cached, avoiding the overhead of repeated code generation, and deleted when no longer needed.


V8 is now the core technology used in a number of important applications. It is the JavaScript engine used in Google's super-f...

Get the best from ARM debug tools: Stack frames & instruction trace

This blog covers the use of two powerful debugging techniques – stack frames and instruction trace – to debug random or timing-related bugs on ARM processor-based targets.

Timing-related and random bugs are a common nightmare for software developers. Any consistent, replicable defect can be easily debugged by stepping through the code until the execution branches to an unexpected path. However, when bugs are random or timing-dependent you could spend your life stepping through the code without ever reaching the error condition at “the right time”.

The typical approach to dealing with these problems involves instrumenting the code. The idea is simple: you add printf statements to the path of code you think the processor is executing, and each of those statements provides some information about the state of the software at that point. For example, you can print the value of program variables over time.

This approach often works, but it tends to be time consuming (and let’s face it, quite annoying). The reasons are many, and include:
You do not want to rebuild your software every time that you decide you need an extra printf statement. Building software takes time, a lot of it if the software is large enough.
It may take hours to track down the execution path of your application and instrument the software to give you the information you need.
This method affects the replicability of the pr...

Memory access ordering part 2 - barriers and the Linux kernel

My previous post provided an introduction to the concept of memory access ordering. It did not, however, provide any solution to the problem, or necessarily specify where such ordering can be significant.

Now, not all software developers need to be deeply aware of memory access ordering or barriers. Unless your code interacts directly with hardware, interacts directly with code executing on other cores or directly loads or generates instructions to be executed, things will mostly Just Work. If your interaction with hardware is completely through a device driver (meaning: no device control registers mapped directly into your application), then it is the responsibility of the driver to enforce ordering. If your communication with software running on a different core makes use of a multithreading API, for example using Pthreads or Java threads, then it is the responsibility of that API to enforce ordering. If your program executes on an operating system that implements demand paging, then clearly it is the responsibility of the operating system to enforce ordering of such operations.
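When one of those responsibilities does fall to your own code, for example when two threads hand data to each other without going through a threading API, the ordering has to be stated explicitly. The following is a minimal sketch of that idea using C11 atomics rather than the Linux kernel's own barrier primitives; the names payload, ready, publish and try_consume are invented for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* ordinary, non-atomic data */
static atomic_bool ready = false;   /* flag that publishes the payload */

/* Writer: store the data, then set the flag with release ordering so the
 * payload store cannot become visible after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader: load the flag with acquire ordering. If the flag is set, the
 * payload written before the matching release store is visible here. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

One thread would call publish() and another would poll try_consume(). The release/acquire pair leaves it to the compiler to emit whatever barrier instructions the target requires, which is the portable way to get the effect without writing those instructions by hand.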

However, if you are writing device drivers, implementing your own thread-communications or creating a JIT compiler, then not being aware of the proper use of barriers can lead to unexpected and difficult to diagnose problems. Where your program requires a ...

How to run LAMP and Drupal on a PandaBoard in seven simple steps

This tutorial explains how to have a LAMP server running Drupal on a PandaBoard. These instructions will apply to any other Cortex-A platform with few or no changes.

The growing variety of inexpensive, easy-to-use ARMv7-based devices, like PandaBoard, opens the door to combining ARM's energy-efficient, small-form-factor performance with server software. The availability of the Ubuntu Linux distribution for ARM makes this a really simple task. The possible applications are many: domestic server, small business server, hobbyist experimentation, web development.

It can also be used to gain experience and become more prepared for the arrival of large-scale, ARM-based server systems.

One of the most well-known server software stacks is LAMP. There are many varieties of LAMP, but the most common is the combination of Linux, Apache, MySQL and PHP.

Drupal is a versatile, well-known, open source platform running on top of this stack. The software is a generic Content Management System, used as the basis for many sites, from blogs to community forums to government web pages. Notable examples are The White House, Ubuntu, FastCompany, ...

Getting Started with Android Development

This blog is aimed at getting you started quickly in the world of Android development. I've included links to some golden tutorials and programs I found useful. I've also included solutions to the annoying time-wasting problems I encountered when first starting out. I hope they will be helpful and will save you from the frustration I went through.

Setting up your Development environment

To get up and running quickly, follow the instructions from the Android developers site to set up your software development environment. If you don’t have an Android phone, don’t worry, the SDK contains an Android emulator.

I definitely recommend using the Android development tool plug-in for Eclipse. For someone like me who makes lots of common mistakes, such as missing imports of libraries and not putting the right parameters into a method, Eclipse is great because it easily spots common errors and offers quick fixes. There are many versions of Eclipse and it can be confusing which one you need to download. I use Eclipse Gali...

Memory access ordering - an introduction

I recently gave a presentation at the Embedded Linux Conference Europe 2010 called Software implications of high-performance memory systems. This title was my sneaky (and fairly successful) way to get people to attend a presentation really about memory access (re)ordering and barriers. I would now like to follow that up with a few posts on the topic. In this post, I will introduce a few concepts and explain the reasons behind them. In future posts, I will follow up with some practical examples.

The Sequential Execution Model

In the Good Old Days, computer programs behaved in practice pretty much the way you might instinctively expect them to from looking at the source code:
Things happened in the way specified in the program.
Things happened in the order specified in the program.
Things happened the number of times specified in the program (no more, no less).
Things happened one at a time.


In modern computer architecture, this nostalgic fantasy is sometimes referred to as the Sequential Execution Model. In order for existing programs and programming models to remain functional, even the most extreme modern processors will attempt to preserve the illusion of Sequential Execution from within the executing program. However, underneath your feet...

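UEFI: A New Opportunity for Preboot Firmware on ARM Systems, Part 2

Last time I introduced UEFI and its history. Now I will explore its advantages, particularly on ARM systems. I will also describe the organizational structure of the UEFI Forum in more detail.

Advantages
Although existing ARM preboot firmware is not bound by the constraints of BIOS, adopting the UEFI standard still offers ARM preboot firmware many advantages. OEMs and ODMs are always trying to reduce development costs, and code sharing is one way to achieve that in the preboot firmware space.

With both ARM and x86 focused on the computing continuum, UEFI enables code sharing not only among ARM products or among x86 products, but also between products built on different processor architectures. Products can share peripheral devices (network, SATA, USB controllers and so on) as well as many design feature sets.

Figure 2 shows a port from x86 to ARM in which 99.42% of the code did not need to change.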

Attached Image
...

UEFI – A New Opportunity for Preboot Firmware on ARM-based Systems, Part 2

Previously I introduced UEFI and its history. Now I will get into its benefits, especially for use on ARM-based systems, and further explain the organization of the UEFI Forum.

Advantages
Even though existing ARM preboot firmware does not have the BIOS limitations, there are many advantages for ARM preboot firmware in standardizing on UEFI. OEMs and ODMs are always looking to reduce development costs. Code sharing among products is one way to achieve that.
With ARM and x86 both in the computing continuum, UEFI not only enables code sharing among ARM products or among x86 products, it also enables code sharing across processor architectures. Products may share many of the peripheral devices (Network, SATA, USB controllers, etc.) and feature sets across the designs.
Figure 2 shows an ARM port where 99.42% of the lines of code do not need to change from an x86 port.
Attached Image
Figure 2: Lines Added/Change...

Microsoft Windows Embedded Compact 7 Announces Another Investment in ARM

Embedded World is well underway and the show is living up to its tagline of “It’s a Smarter World”. With many ARM Partner announcements, ARM continues to be the cornerstone that OEMs use to build energy-efficient devices in key embedded market segments, such as automotive, consumer, industrial and home.

Windows Compact 7 launched with ARMv7 architecture support
Yesterday Microsoft announced Windows Embedded Compact 7. Compact 7 is the latest offering of their embedded operating system – formerly CE. F...

Semihosting: a life-saver during SoC and board bring-up

In this blog you will find information about semihosting, an implementation of the C library that uses a JTAG debugger to interface with the outside world. Semihosting is very useful for board bring-up, as it works on any ARM processor and it only requires a JTAG connection: you do not need any working peripherals or drivers in order to use it.

Semihosting is a feature of ARM software development tools that has been available for many years and has proved its usefulness, but unfortunately it is often poorly understood. I have used semihosting on many occasions, and it has been extremely useful, so I will share my experience with you.

A bit of background…
When you write a simple “hello world” application on your PC you expect a character string to be printed in a window in your debugger or on your PC console. However, when you have a brand new ARM-based development board coming to the lab and you write your first “hello world” application for it, what do you expect it to do?

At this stage of development, the “hello world” application is “bare metal”. By this I mean that the ARM target is not running an operating system providing peripheral drivers to your application. Therefore, an out-of-the-box bare metal application does not have the soft...

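UEFI: A New Opportunity for Preboot Firmware on ARM Systems

ARM processors already dominate the smartphone market and are increasingly becoming mainstream across the embedded space. More recently, ARM processors have also moved into servers, in pursuit of the computing continuum.

Historically, however, ARM systems have had no preboot firmware standard. As a result, each design has its own distinct model that is tightly coupled to the operating system being booted. This traditional approach means that firmware developers must maintain completely different codebases, even when the systems may use the same peripheral devices (network, SATA interfaces, USB controllers and so on) and the same overall design feature sets. Traditional ARM designs rely on boot packages such as UBoot, Redboot, or proprietary software.

Developing and producing these products efficiently enough to meet time-to-market demands has become a challenge. Some form of converged firmware infrastructure is needed to maximize code reuse, so that these products can reach the market faster with limited engineering resources, while at the same time adding...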

UEFI – A New Opportunity for Preboot Firmware on ARM-based Systems

ARM processors have been predominant in the smartphone market and are becoming increasingly mainstream in the overall embedded space. More recently, ARM processors are targeting servers as well, pursuing the computing continuum with solutions.

However, historically ARM systems did not have a preboot firmware standard. This led each design to have its own distinct firmware model that is tightly coupled to the operating system being booted. This traditional approach means the firmware developers would need to maintain completely different codebases even though the systems may use many of the same types of peripheral devices (Network, SATA, USB controllers, etc.) and feature sets across the designs. Generations of ARM cores relied on boot packages such as UBoot, Redboot, or proprietary software.

How to efficiently develop and ship these products and meet time to market demands becomes a challenge. Some form of converged firmware infrastructure is necessary to maximize proper code re...

An introduction to ARM Development Studio 5 (DS-5)

A couple of weeks before Christmas, ARM released v5.3 of its new software development suite, DS-5. DS-5 is a new product, introduced to the market last year, but it builds on 20 years of software development tools from ARM. I have been personally involved in this development since inception, when we decided to embrace open source frameworks and build around Eclipse, and I’m very proud of what we have achieved. We’ve created a great new development tool chain with very broad applicability, helping to make it even easier to develop for ARM based platforms, and enabling collaboration with our partners and the ecosystem. In this short article, I'll describe what I mean by all this.

Firstly, at the heart of the ARM tools is comprehensive support for the ARM device itself. The tools are used here at ARM during the development and validation of the ARM architecture and ARM CPU, and are designed to make the best use of the features provided by the CPU and associated debug and trace capabilities with technology such as ...

RISC versus CISC Wars in the PostPC Eras - Part 2

In my first blog, we examined the historical context of the instruction set battles of ARM and x86, covering the RISC-CISC Wars in the PrePC Era and the PC Era. This blog covers Round 3, the PostPC Era [1].

Round 3: RISC vs. CISC in the PostPC Era
The importance of maintaining the sequential programming model combined with the increasingly abundant number of transistors from Moore’s Law led, in my view, to wretched excess in computer design. Measured by performance per transistor or by performance per watt, the designs of the late 1990s and early 2000s were some of the least efficient microprocessors ever built. This lavishness was acceptable for PCs, where binary compatibility was paramount and cost and battery life were less important, but performance was delivered more by brute force than by elegance.

However, these excessive designs are not a good match to the smartphones and tablets of the PostPC era. RISC dominates thes...

RISC versus CISC Wars in the PrePC and PC Eras - Part 1

This two-part blog gives a historical perspective on the ARM vs. 80x86 instruction set competition for three eras: PrePC (late 1970s/early 1980s), PC (mid 1980s to mid 2000s), and PostPC (late 2000s onward).

Round 1: The Beginning of Reduced vs. Complex Instruction Set Computers
The first round of the RISC-CISC Wars started 30 years ago with the publication of “The Case for the Reduced Instruction Set Computer” [1] and the companion piece “Comments on ‘The Case for the Reduced Instruction Set Computer’” [2]. We argued then that an instruction set made up of simple or reduced instructions using easy-to-decode instruction formats and lots of registers was a better match to integrated circuits and compiler technology than the instruction sets of the 1970s that featured complex instructions and formats. Our counterexamples were the Digital VAX-11/780, the Intel iAPX-432, and the Intel 8086 architectures, which we labeled Complex Instruction Set Computers (CISC).

I recently found an old set of hand-drawn slides from 1981, one of which shows the simple instructions and formats of the Berkeley RISC architecture.

Attached Image
...

Android 2.3 (Gingerbread) NDK now close to pure Native Development

With the recent release of Gingerbread and the number of daily Android on ARM activations up to 300,000 (see James Bruce’s blog), the opportunity for developers has never been better. The tools developers have to work with have also never been better. The Gingerbread update brings a strong focus on enabling the developer to create premium content for a rapidly growing consumer market.

Android has long included a Native Development Kit (NDK) alongside its Software Development Kit (SDK). The NDK enables the creation of native functions in C and/or ARM assembly code. These functions can then be called by Java applications via the Java Native Interface (JNI). A principal software engineer at ARM, under the pen name ARM_DaveB, has written a ...

Valgrind 3.6.0 for ARM-Linux

Version 3.6.0 of Valgrind was released a couple of weeks ago. Probably the largest change in this release is the addition of support for Linux running on ARM.

Valgrind is a GPL'd framework for building simulation based debugging and profiling tools, plus a set of "standard" tools. The best known of these is Memcheck, a memory error detector, but in fact it is only one of eight tools in the standard distribution: two memory checkers, two thread checkers, two performance profilers and two space profilers.

You can download the sources from www.valgrind.org. Alternatively, you may be able to get pre-built packages via your Linux distro, or via Linaro, although note that the 3.6.0 upstream release post-dates pre-built packages. 3.6.0 is known to work on Ubuntu 10.04 and 10.10 on ARM, and on the Nokia N900 running Maemo 5.

Also available online is full documentation. For those impatient to get going, the ...

Travels of ARM Rubik’s Cube Lego Speedcuber: 0 to Solved in 15 seconds

So I can’t turn down a challenge. How was I going to improve the speed of the solver? Lights, camera, action! Yes I did it. I improved the Rubik Speedcuber from 25 to 15 seconds. And it only took travel, lights, camera, image analysis, native Android coding, a new phone and food. Intrigued? Check out the new demo at the ARM Technology Conference (ARM Techcon) in Santa Clara, California on Thursday Nov 11.

Challenge: How much faster can you make the Speedcuber?
Some people consider me to be a perfectionist. My view on this is just that I like to do things to the best of my ability and am always ready to accept a challenge. So when Ian Pilkington, Applied Systems Engineering Manager at ARM, asked, “How much faster can you make your Speedcuber?” I just had to try…

For those of you who are new to my blogs, you should be aware that I have a passion for LEGO, Rubik’s Cubes and software programming (on ...

Oracle’s Java SE Embedded for ARM Multicore at Techcon

Just in time for the ARM® Technology Conference (Techcon) - last week Oracle released Java SE Embedded 6u21 with support for ARM. Java SE 6u21e syncs with the latest release of Java SE 6u21 for desktop and servers allowing developers to deploy on their ARM embedded device the same full Java SE version as found on their PC.

One of the key, new features of this release is multi-core support for ARMv7. The multi-core functionality of Java SE such as background JIT compilation and parallel garbage collection is now available for the growing use of ARM multi-core systems in embedded.

Java SE 6u21e release offers the following:
latest features and fixes of standard SE 6u21
multi-core support for ARMv7
up to 20% performance improvements on ARM
headless support for ARMv5 soft-float and ARMv6/v7 hard-float
headful support for ARMv7
optimizations for embedded including small footprint, memory savings, power conservation
Stop by ...

Wealth of knowledge found at ARM Techcon: Linux, Android & development tools

The 2010 ARM Technology Conference (Techcon) is taking place in Santa Clara next week. A large number of companies will be presenting their solutions to support development and optimization of products based on ARM technology, and open source will be discussed in many of these with projects like Linux, Android and development tools. For instance, many of these solutions are using open source to leverage earlier work that ARM has done with the open source community, contributing CPU and architecture support to the upstream Linux kernel and GNU compilation tools ahead of partner silicon platforms being available. One of the most recent illustrations is the contribution of Cortex-A15 CPU support to the Linux kernel as the processor was announced. Linux kernel and GNU development tools are key building blocks to support the development of solutions such as Android, ...

Going Maverick - Ubuntu 10.10 for ARM

Wow it's that time again; our 4th release of Ubuntu on ARM is upon us. In the past we have provided a Freescale iMX51 image, a Marvell Dove image and a TI OMAP 3 image for Beagle Boards. This cycle we will be releasing images for Marvell Dove and the Texas Instruments (TI) OMAP series of processors, both OMAP 3 and OMAP 4. Until now we have always provided a "live image" just like the x86 CDs; that is, you could test Ubuntu and then choose to install it to your storage media. Well, for the OMAP series of development boards this did not make sense, so we have introduced a pre-installed image format that we are using...

Condition Codes 3: Conditional Execution in Thumb-2

Thumb-2 can make use of the same conditional execution features that the ARM instruction set provides. For conditionally executing one or two instructions, this mechanism can provide code-size and performance benefits over the (more conventional) conditional branching mechanism.

I noted at the end of the last post in this series that this mechanism is not directly available to Thumb. Instead, Thumb-2 has an instruction — it — which can provide the same functionality as ARM conditional execution. In this article, I will describe the it instruction, and I will also explain a few caveats of condition-setting instructions in Thumb-2. Note that the it instruction is only available to Thumb-2, and so most of this article will not be relevant to the old Thumb instruction set 1.

...

Cortex-A15 to A5: Software compatibility from Superphone to Feature phone

It was always about the code (and where it would be used!)

When I was a software developer I would often find that the project team I was in would try to guess how many devices the code would eventually run on. So at the launch of the Cortex-A15 last week one of the main points that hit home for me was just how wide the spectrum of power and performance points the Cortex-A family of processors could cover - from feature phone to superphone, tablet to DTV, home server to web server etc. This means that a developer could now find their software running across a huge range of devices in the future.

So is it the same software?

Absolutely. Cortex-A15 is based on the same ARMv7-A architecture that the other Cortex-A processors use, therefore allowing the exact same application code to run on all of them, from a ...

Using DS-5 with Gumstix Overo

DS-5 Application Edition can be used to debug a Linux application running on pretty much any ARM Linux target, with a network connection, not just the BeagleBoard that is used in the examples. Ronan, a colleague of mine, saw the cute Gumstix Overo COM (Computer-on-Module) and convinced me I needed to get one and give it a try with DS-5.

Attached Image

The tiny Gumstix Overo next to 50p to show a size comparison


First I ordered the Gumstix Overo Water, but any of the Overo models (Earth, Air, Fire) will probably work the same for my purposes here. I also ordered a Gumstix Tobi so that I can easily hook it to Ethernet and/or USB.

The Gumstix developers website has great getting started material. There seem to be at least two other useful Gumstix websites as well: www.gumstix.com, and ...

Coding for NEON - Part 4: Shifting Left and Right

This article introduces the shifting operations provided by NEON, and shows how they can be used to convert image data between commonly used color depths. Previous articles in this series: Part 1: Loads and Stores, Part 2: Dealing with Leftovers and Part 3: Matrix Multiplication.

Shifting Vectors

A shift on NEON is very similar to shifts you may have used in scalar ARM code. The shift moves the bits in each element of a vector left or right. Bits that fall off the left or right of each element are discarded; they are not shifted to adjacent elements.

The amount to shift can be specified with a literal encoded in the instruction, or with an additional shift vector. When using a shift vector, the shift applied to each element of the input vector depends on the value of the corresponding element in the shift vector. The elements in the shift vector are treated as signed values, so left, right and zero shifts are possible, on a per-element basis.
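The per-element behaviour described above can be modelled in plain C. This is a simplified scalar sketch, not NEON intrinsics or the exact instruction semantics: it assumes unsigned 16-bit elements and an 8-bit signed shift amount per element, and the function names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift one unsigned 16-bit element by a signed amount: positive shifts
 * left, negative shifts right, zero leaves the element unchanged.
 * Bits shifted out of the element are discarded. */
static uint16_t shift_lane_u16(uint16_t value, int8_t shift)
{
    if (shift >= 16 || shift <= -16)
        return 0;                          /* every bit falls off the end */
    if (shift >= 0)
        return (uint16_t)((uint32_t)value << shift);
    return (uint16_t)(value >> -shift);
}

/* Apply a per-element shift vector to an input vector of `lanes` elements. */
void shift_vector_u16(uint16_t *dst, const uint16_t *src,
                      const int8_t *shifts, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = shift_lane_u16(src[i], shifts[i]);
}
```

Each element is handled in isolation, so a bit shifted out of one element never appears in its neighbour, which is the property that distinguishes a vector shift from shifting the whole register as one wide value.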

Attached Image

A right shift operating on a vector of signed elements, indicated by the type attached to the instruct...

Detecting Overflow from MUL

Detecting Overflow from Arithmetic Operations

I discussed in a previous blog post that it is possible to set some condition flags based on the result of an arithmetic operation. Consider the following code:

    adds    r0, r0, r1
    bvs     some_address

The above code adds r1 to r0, then branches somewhere if a (signed) overflow was detected. This technique is used frequently in JIT-compilers for dynamic languages. In such contexts, the type and size of a variable is often not known when the code is compiled, so the JIT-compiler will test for overflow, and then fall back to a slower implementation in the case where a signed 32-bit integer cannot represent the result of the required operation. This is the approach taken by Mozilla's TraceMonkey JavaScript engine, for example.
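For readers who think in C rather than assembly, the same check-then-fall-back pattern might look like the sketch below. It is only an illustration of the idea: the function name add_checked_i32 is invented, and the overflow test is done by widening to 64 bits rather than by reading a processor flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Add two signed 32-bit integers. Returns true and stores the sum if it
 * fits in 32 bits; returns false (the "bvs" path) if it overflowed. */
bool add_checked_i32(int32_t a, int32_t b, int32_t *sum)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX || wide < INT32_MIN)
        return false;                         /* caller takes the slower path */
    *sum = (int32_t)wide;
    return true;
}
```

A caller would try the fast integer path first and switch to a wider or floating-point representation only when the function returns false.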

Setting the Flags with mul

Those familiar with ARM's mul instruction may realize that although it can take the s suffix to upda...

Condition Codes 2: Conditional Execution

Revisiting if/else in Assembly

In my previous post ("Condition Codes 1"), I explained that some instructions can set some global condition codes, and that these codes can be used to conditionally execute code. I gave some examples of usage. One such example was an assembly implementation of C's if/else construct:

    cmp     r0, #20
    bhi     do_something_else
do_something:
    @ This code runs if (r0 <= 20).
    b       continue
do_something_else:
    @ This code runs if (r0 > 20).
continue:
    @ Other code.

The example is valid, and will work on any ARM core. However, is this an efficient solution if you only need to execute one or two instructions in each case? Consider the following C code:

    if (a >= 10) {
        a = 10;
    } else {
        a = a + 1;
    }

It should be clear that the code increments a unless it has hit or exceeded a limit of 10, in which case it is set to 10. Mapping this onto our if/else ...

How to Load Constants in Assembly for ARM Architecture

ARM is a 32-bit CPU architecture where every instruction is 32 bits long. Any constants which are part of an instruction must be encoded within the 32 bits of the given instruction and this naturally limits the range of constants that can be represented in one instruction. This post will show you how we can deal with these limitations and how the latest revision of the ARM architecture (ARMv7) provides a simple and efficient solution.

Most arithmetic and logical ARM instructions accept 3 parameters:
The destination: always a register.
Operand 1: always a register.
Operand 2: a register, an immediate constant value or a shifted register.
We'll cover shifted registers in a future post. For now, we're only interested in the constants. Examples of such instructions are:
CODE
    add    r0, r1, r2    @ r0 = r1 + r2
    sub    r0, r1, #3    @ r0 = r1 - 3

An Operand 2 immediate must obey the following rule to fit in the instruction: an 8-bit value rotated right by an even number of bits between 0 and 30 (inclusive). This allows for constants such as 0xFF (0xFF rotated right by 0), 0xFF00 (0xFF rotated right by 24) or 0xF000000F (0xFF rotated right by 4).
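The rule is easy to check mechanically. The sketch below tests whether a 32-bit constant fits the form just described by undoing each permitted rotation and seeing whether an 8-bit value remains. The function names are invented for illustration; this is a check on the encoding rule only, not a description of what any particular assembler does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by r bits (0 <= r < 32). */
static uint32_t rotate_left32(uint32_t v, unsigned r)
{
    return (v << r) | (v >> ((32u - r) & 31u));
}

/* Return true if `value` can be written as an 8-bit constant rotated right
 * by an even amount between 0 and 30, so that it fits an Operand 2
 * immediate. */
bool is_operand2_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot <= 30; rot += 2) {
        /* Rotating left by `rot` undoes a rotate-right by `rot`. */
        if (rotate_left32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

With this check, 0xFF, 0xFF00 and 0xF000000F are all accepted, while 0x101 is rejected because its two set bits are nine positions apart, and 0x1FE is rejected because it would need an odd rotation.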

Operand 2 immediates are also valid immed...

Condition Codes 1: Condition Flags and Codes

Every practical general-purpose computing architecture has a mechanism of conditionally executing some code. Such mechanisms are used to implement the if construct in C, for example, in addition to several other cases that are less obvious.

ARM, like many other architectures, implements conditional execution using a set of flags which store state information about a previous operation. I intend, in this post, to shed some light on the operation of these flags. Of course, the Architecture Reference Manual is the definitive source of information, so if you need to know about a specific corner-case that I do not cover here, that is where you need to look.
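As a rough mental model, the flags can be pictured as four booleans that a compare operation fills in from the result of a subtraction. The C sketch below models that for 32-bit operands; the struct and function names are invented for illustration, and the Architecture Reference Manual remains the authority on the exact behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four condition flags, as left behind by a flag-setting operation. */
struct flags {
    bool n;   /* Negative: bit 31 of the result is set */
    bool z;   /* Zero: the result is zero */
    bool c;   /* Carry: for a subtraction, set when no borrow occurred */
    bool v;   /* oVerflow: the signed result does not fit in 32 bits */
};

/* Model a compare of a with b: compute a - b and keep only the flags. */
struct flags compare(uint32_t a, uint32_t b)
{
    uint32_t result = a - b;
    struct flags f;
    f.n = (result >> 31) != 0;
    f.z = (result == 0);
    f.c = (a >= b);
    f.v = (((a ^ b) & (a ^ result)) >> 31) != 0;
    return f;
}

/* Condition codes are tests over the flags; two simple ones: */
bool cond_eq(struct flags f) { return f.z; }    /* "equal" */
bool cond_ne(struct flags f) { return !f.z; }   /* "not equal" */
```

The point of the model is that the comparison and the decision are separate steps: one operation records the flags, and a later conditional instruction only inspects them.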

A Realistic Example

Consider a simple fragment of C code:

    for (i = 10; i != 0; i--) {
        do_something();
    }

A compiler might implement that structure as follows:

    mov     r4, #10
loop_label:
    bl      do_something
    sub     r4, r4, #1
    cmp     r4, #0
    bne     loop_label

The last two instructions are of particular interest. The cmp (compare) instruction compares r4 with 0, and the bne instruction is simply a b (branch) instruction that executes if the result of the cmp instruction was "not equal". The code works because cmp sets some global f...

Coding for NEON - Part 3: Matrix Multiplication

We have seen how to load and store data with NEON, and how to handle the leftovers resulting from vector processing. Let us move on to doing some useful data processing – multiplying matrices.

Matrices

In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order – this is the format used by OpenGL-ES.

Algorithm

We start by examining the matrix multiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using NEON instructions.

Attached Image

Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.
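Before vectorizing, it can help to see the same column-at-a-time structure in scalar C. The sketch below is a plain reference version, not NEON code: it assumes single-precision elements and column-major storage, and the function name is invented for illustration.

```c
#include <stddef.h>

/* Multiply two 4x4 matrices stored in column-major order, where element
 * (row, col) lives at index col * 4 + row.
 * Computes result = a * b; result must not alias a or b. */
void matrix_multiply_4x4(float *result, const float *a, const float *b)
{
    for (size_t col = 0; col < 4; col++) {
        /* Start each result column at zero. */
        float column[4] = {0.0f, 0.0f, 0.0f, 0.0f};

        /* Scale column k of a by element (k, col) of b and accumulate. */
        for (size_t k = 0; k < 4; k++) {
            float scale = b[col * 4 + k];
            for (size_t row = 0; row < 4; row++)
                column[row] += a[k * 4 + row] * scale;
        }

        for (size_t row = 0; row < 4; row++)
            result[col * 4 + row] = column[row];
    }
}
```

The innermost loop, which multiplies four consecutive elements by one scalar and accumulates them, is the part that maps naturally onto a single vector operation.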

...

10 Android NDK Tips

With new devices and new capabilities being exposed by the Android NDK (Native Development Kit) it is now possible to really get the best out of these ARM based devices. Here are a few quick tips to help that along.

1 - Stay on Target

The newest devices are generally ARMv7, meaning that it can pay to use v7 builds and features. The latest version of the NDK adds support for ARMv7 and NEON code, allowing key loops and media operations to be optimized far beyond what would otherwise be possible. The NDK provides a small static library that will allow you to identify what options you have at runtime. For examples of how to use these features, look at the hello-neon example project in the samples directory of the NDK.

The older devices are v6, but the NDK does not specifically support it, leaving you with the choice of building safely for v5TE or building for v6 and taking the risk that there may be v5TE devices out there. If you need every iota of speed, and know what hardware you are targeting, then it may be worth building for v6. The newest devices, supporting Android 2.0 and up, seem generally to be ARMv7 based, although yo...

Computex: Windows Embedded Compact 7 Highlights Investment in ARM

Yesterday at Computex, the Microsoft Windows Embedded team announced the availability of the latest version of Windows Embedded CE – officially known as Windows Embedded Compact 7. The release is a Community Technology Preview (CTP), which is a fancy way to say public beta. The CTP can be downloaded from the Microsoft website.

Windows Embedded Compact 7 includes a list of cool features to help OEMs develop smart, connected, service-oriented devices with custom user interfaces. But if you take a closer look at the code, you’ll notice an engineering investment and significant improvement – Compact 7 now includes support for more ARM architectures including ARMv7, ARMv7 NEON™ and SMP.

The added ARM architectures provide OEMs working with Windows Embedded competitive performance in the segments proliferated by ARM and our ARM Partners – ...

Android Phones, tablets, TVs… oh my!

I’ve written before about the proliferation of Android as a consumer device platform beyond its humble origins as a handset OS, but I’m continually amazed at the pace of this innovation from consumer electronics companies crafting new and savvy products from Android. I am at Computex this week and there are numerous products on display that fall into this category.

I won’t catalog the litany of devices here; I’m sure you’ll get enough of that via the ARMFlix YouTube channel or your favorite consumer device blog. Instead, I want to talk about why I think Android is able to adapt at such a breakneck pace. While a case can be made for any number of reasons, fundamentally, I believe there are two overwhelming factors. They are: (1) the architecture and versatility of the Android software stack and (2) the size of the IP and services ecosystem that has rapidly ...

Support for VP8 and WebM on ARM

It continues to be an exciting time for the development of web technologies on the ARM architecture, allowing the Internet to reach the maximum number of devices. Today sees an advancement in video for the web with the WebM project, which has been announced at Google I/O 2010 (Google’s annual developer conference). A key part of this announcement was Google’s contribution of the VP8 video codec, free of royalties.

So why is this good for ARM and our Partners? Well ultimately the delivery of the full web drives the development of great devices, and video in particular makes up an ever increasing proportion of data being consumed: in other words consumers want video, and an efficiently designed, open video codec helps.

There is already a huge amount of video being delivered on the Internet: Cisco’s Visual Networking ...

Coding for NEON - Part 2: Dealing With Leftovers

In the first post on NEON about loads and stores we looked at transferring data between the NEON processing unit and memory. In this post, we deal with an often encountered problem: input data that is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array - what is the best way to do this on NEON?

Leftovers

Using NEON typically involves operating on vectors of data from four to sixteen elements in length. Frequently, you will find that your array is not a multiple of that length, and you have to process those leftover elements separately.

For example, you want to load, process and store eight elements per iteration using NEON, but your array is 21 elements long. The first two iterations go well, but for the third, there are only five elements remaining to be processed. What do you do?
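The arithmetic behind this example can be sketched in a few lines of C. This is only an illustration of how the array splits into full vector iterations and leftover elements; the names `vector_split` and `split_for_vectors` are hypothetical and do not come from the post.

```c
#include <assert.h>
#include <stddef.h>

/* How an array divides into whole vector iterations plus a tail. */
struct vector_split {
    size_t full_iterations; /* iterations that process a whole vector */
    size_t leftovers;       /* elements remaining after those iterations */
};

/* Hypothetical helper: lanes is the number of elements per iteration. */
struct vector_split split_for_vectors(size_t element_count, size_t lanes)
{
    struct vector_split split;

    assert(lanes > 0);
    split.full_iterations = element_count / lanes;
    split.leftovers = element_count % lanes;
    return split;
}
```

For the case in the text, `split_for_vectors(21, 8)` gives two full iterations and five leftover elements.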

Fixing Up

There are three ways to handle these leftovers. The methods vary in requirements, performance, and code size. They are listed below in order, with the fastest approach first.

Larger Arrays

If you can change the size of the arrays that you are processing, increase the length of the array to the next multiple of the vector size using padding elements. This allows you to read and write beyond the end of your data without corrupting ad...

How do you make Java Fast? Answer: Go down the pub! Part 2

In January 2009 Ed set about the task of rewriting the interpreter. Java byte codes are quite a compact code for representing programs, but the Virtual Machine they target has a stack architecture rather than a register architecture. This makes the Java VM somewhat at odds with modern-day processor architectures such as ARM, which is register based. Ed’s approach was to use a Peephole Optimizer to spot common byte code sequences that loaded items onto the Java Virtual Machine stack, manipulated them and stored them back into memory. These complete sequences could then be executed in optimized ARM assembler. Having done this process longhand for the first version of the optimized interpreter, it became clear that this repetitive task could be eased with the creation of a notation to describe the sequences and how they related to the ARM assembler. (Ironically, I did the same thing some 20 years ago for a different VM.) A tool could then be used to automatically generate the template interpreter from the notation. The tool proved to be extremely useful and was naturally processor agnostic, so last year we contributed it back into open source along with the optimized interpreter it generated.

Ed’s optimized interpreter for OpenJDK increased the performance by a factor of around 4X for a couple of the classic Java benchmarks: Embedded Caffeine Mark and EEMBC Grinderbench ...

Locks, SWPs and two Smoking Barriers (Part 2)

In the last article, I explained how to modify SWP code to make use of compiler intrinsics. Using intrinsics hides the underlying detail needed to use the load and store exclusive instructions (LDREX and STREX) and the use of memory barriers. In this article I look at implementing atomic memory accesses in assembler.
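As a rough idea of what lock code looks like once the exclusive accesses and barriers are hidden from view, here is a minimal sketch using C11 atomics. This is a substitute technique chosen for portability, not the intrinsics used in the earlier article, and the `simple_lock` names are hypothetical. How the compiler implements the atomic operations on a given ARM target is up to the toolchain.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical test-and-set lock built on a C11 atomic_flag. */
struct simple_lock {
    atomic_flag held;
};

void simple_lock_init(struct simple_lock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* Returns true if the lock was free and is now held by the caller. */
bool simple_lock_try_acquire(struct simple_lock *lock)
{
    /* Acquire ordering keeps later accesses from moving above the lock. */
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

void simple_lock_acquire(struct simple_lock *lock)
{
    while (!simple_lock_try_acquire(lock)) {
        /* Spin until the current holder releases the lock. */
    }
}

void simple_lock_release(struct simple_lock *lock)
{
    /* Release ordering keeps earlier accesses from moving below the unlock. */
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

The acquire and release orderings stand in for the explicit barriers that hand-written assembler has to place itself.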

In order to describe memory barriers, what they are and how they should be used, I need to describe two types of memory model, strongly ordered and weakly ordered. The strongly-ordered model is very natural for programmers. In this model, the order that a program writes data to memory is the order in which the data is observed being written into memory, that is, other programs sharing the data will "see" the same ordering regardless of the CPU that they are executing on.

For example, if a CPU writes a new X then writes a new Y, all other CPUs that subsequently read Y then read X will access either the new Y and the new X, the old Y and the new X, or the old Y and the old X. However, because the order of the writes is strongly ordered as write X first then write Y, no CPU will access the new Y and the old X.
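The claim can be checked by brute force. The sketch below is a single-threaded model, not real multiprocessor code: it replays every interleaving of a writer (write X, then write Y) and a reader (read Y, then read X) while keeping each side's program order, which is what the strongly-ordered model guarantees. The names `replay` and `count_new_y_old_x` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the reader observed: the new or the old value of Y and of X. */
struct observation {
    bool new_y;
    bool new_x;
};

/*
 * Replay one interleaving. schedule holds four steps, two 'W' and two 'R':
 * 'W' runs the writer's next step (write X, then write Y),
 * 'R' runs the reader's next step (read Y, then read X).
 */
struct observation replay(const char *schedule)
{
    int x = 0, y = 0; /* 0 = old value, 1 = new value */
    int writer_step = 0, reader_step = 0;
    struct observation seen = { false, false };

    for (size_t i = 0; i < 4; i++) {
        assert(schedule[i] == 'W' || schedule[i] == 'R');
        if (schedule[i] == 'W') {
            if (writer_step++ == 0) x = 1; else y = 1;
        } else {
            if (reader_step++ == 0) seen.new_y = (y == 1);
            else seen.new_x = (x == 1);
        }
    }
    return seen;
}

/* Count interleavings in which the reader sees new Y but old X. */
int count_new_y_old_x(void)
{
    static const char *const schedules[] = {
        "WWRR", "WRWR", "WRRW", "RWWR", "RWRW", "RRWW"
    };
    int forbidden = 0;

    for (size_t i = 0; i < sizeof schedules / sizeof schedules[0]; i++) {
        struct observation seen = replay(schedules[i]);
        if (seen.new_y && !seen.new_x) forbidden++;
    }
    return forbidden;
}
```

Across the six possible interleavings only the three permitted outcomes appear, and `count_new_y_old_x()` returns zero.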

Modern CPUs, such as ARM, optimize memory acce...

How do you make Java fast? Answer: Go down the pub!

It all started back in 2008. I’d been looking at what the Software Bill-of-Materials would be for an ARM-based Netbook. I’m a great fan of JEOS (Just-Enough-OS) to support the end user’s software needs, but even taking a JEOS approach, the list of software that we had to enable was quite daunting. Back then, the Cloud as a platform for desktop apps like word processing hadn’t quite taken shape. I had converted my family over to Google Docs, but I wasn’t sure if the rest of the world would be quite as ready to make that move when ARM-based devices became available. Open Office was quite a popular office suite in the Western world; however, in Asia a small company called Haansoft (now Hancom, Inc.) was making headway with an office suite called ThinkFree Office that was small, lightweight and could run across multiple device form factors. The one minor problem was that ThinkFree Office was writt...

Why is Open Source Important?

Sitting in the airport at the end of a week’s business trip to the US, I reflected back on the week. It turned out that my colleague on this trip has an even worse sense of direction than I do… potentially disastrous, especially when you’re driving between airports, hotels and meetings in cities that you’ve never visited. This is where Google Maps becomes utterly indispensable. Installed on my Nokia E71, it makes use of the built-in GPS and the 3G and EDGE networks to provide a running view of where we are, driving or walking. Without it we wouldn’t have found the wonderful Boulderado hotel or the Boulder Bookstore with its impressive converted ballroom. Actually, we’d probably still be driving around somewhere near Dallas.

Life changing and mind boggling as the online, always connected life of a sometime digital nomad is, w...

Android inspiring innovation for the home at CCBN

Android is fast becoming a ubiquitous solution for connected devices. We’ve all seen the successful handsets, like the Motorola Droid and the Nexus One. We are now also starting to see large-screen Android tablets and netbook devices, and a few lucky folks have even seen the Android-powered washing machine and microwave oven showcased at CES 2010! The merits of Android in these far-reaching peripheral categories are endlessly debatable. One thing is clear: the connectivity, application and content frameworks and low-power technologies pioneered by the mobile industry and delivered through the Android platform on ARM are equally relevant across a wider range of product categories, and none more so than the home marke...

Locks, SWPs and two Smoking Barriers

Before ARMv6, the main synchronisation mechanism was the SWP instruction. SWP has two aspects. In a uniprocessor system, it prevents the read and write operations from being interrupted between them. In a multiprocessor system, it ensures that multiple masters will do the locking. For multiprocessor systems with complex memory hierarchies and long memory latencies, SWP creates performance bottlenecks.

SWP was replaced in the ARMv6 architecture by exclusive loads and stores (LDREX and STREX). These work on the principle of a monitor existing for the location in memory. This effectively tags the memory with the identity of the agent(s) trying to access it. In a spinlock implementation, an exclusive load reads data from the memory, tagging it with its identifier. A small number of instructions later, it uses an exclusive store to write data to memory, but this only works if the tag is still valid, and the tag will only be valid if some other ag...

Coding for NEON - Part 1: Load and Stores

ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, speech and image processing.

This is the first part of a series of posts on how to write SIMD code for NEON using assembly language. The series will cover getting started with NEON, using it efficiently, and later, hints and tips for more experienced coders. We will begin by looking at memory operations, and how to use the flexible load and store with permute instructions.

An Example

We will start with a concrete example. You have a 24-bit RGB image, where the pixels are arranged in memory as R, G, B, R, G, B... You want to perform a simple image processing operation, like switching the red and blue channels. How can you do this efficiently using NEON?

Using a load that pulls RGB data linearly from memory into registers makes the red/blue swap awkward.

Code to swap channels based on this input is not going to be elegant – masks, shifting, combining. It is unlikely to be efficient.

NEON provides structure load and store instructions to help in these situations. They pull in data from memory and simultaneously separate valu...
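To make the idea concrete, here is a hedged sketch of the red/blue swap in C rather than in the assembly the series uses. On builds where the compiler defines `__ARM_NEON`, it uses the `vld3_u8` and `vst3_u8` intrinsics from `arm_neon.h`, which load and store three-element structures with de-interleaving; elsewhere, and for any pixels left over, it falls back to a scalar loop. The function name `swap_red_blue` and the choice of eight pixels per iteration are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Swap the red and blue channels of a packed 24-bit RGB image in place. */
void swap_red_blue(uint8_t *rgb, size_t pixel_count)
{
    size_t i = 0;

#if defined(__ARM_NEON)
    /* Eight pixels per iteration: the structure load separates the
     * interleaved bytes into one register each for R, G and B. */
    for (; i + 8 <= pixel_count; i += 8) {
        uint8x8x3_t px = vld3_u8(rgb + 3 * i);
        uint8x8_t red = px.val[0];
        px.val[0] = px.val[2];
        px.val[2] = red;
        vst3_u8(rgb + 3 * i, px); /* re-interleave on the way out */
    }
#endif

    /* Scalar path: remaining pixels, or all pixels without NEON. */
    for (; i < pixel_count; i++) {
        uint8_t red = rgb[3 * i];
        rgb[3 * i] = rgb[3 * i + 2];
        rgb[3 * i + 2] = red;
    }
}
```

Once the channels sit in separate registers, the swap is just a matter of exchanging two of them, with no masks or shifts.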

Caches and Self-Modifying Code

Ideally, caches act as some magic make-it-go-faster logic, sitting between your processor core (or cores) and your memory bank. Whilst it can be beneficial to consider specific cache features when writing some performance-critical code, it is usually advisable to keep only general cache behaviour in mind. However, there are cases where the cache behaviour must be considered in order to get the result that you want, and self-modifying code is an excellent example.

Cached ARM architectures have separate caches for data and instruction accesses; these are called the D-cache and the I-cache, respectively. For this reason, the ARM architecture is often considered to be a Modified Harvard Architecture, though I must admit that with most real processors existing somewhere between Harvard and von Neumann architectures, I do not find that label particularly useful. There are a few benefits of this design, but the one I have seen discussed most often is that with two interfaces to the CPU, the core can load an instruction and some data at the same time.

Whilst employing this Harvard-style memory interface is useful for performance, it does have its own drawbacks. The typical drawback of a pure Harvard architecture is that instruction memory is not directly accessible from the same address space as data memory, though this restriction does not apply to ...

"Hello World" in Assembly

Assembly language can be fairly daunting, even for experienced software engineers. The lists of strange instructions and squiggles can be hard to read at the best of times; indeed, that is why we use languages such as C, where the compiler worries about such things so you don't have to. However, understanding the instruction set of your processor can make C-level optimizations easier to spot and implement, and will help you to gain an understanding of what your program is really doing. In addition, it can enable you to create some finely-tuned code for specific tasks that are hard to implement in C. If nothing else, it's fun!

This post aims to provide a simple introduction to ARM assembly language. The code will be presented in such a way that you can understand what's going on without having to understand the nuances and specifics of each instruction. Future posts will explain the mechanisms in more detail.

Tools

In order to actually do anything interesting, you'll need an ARM device and a suitable tool-chain. If you have a reasonably powerful device with a desktop-like operating system (such as Ubuntu), you can work directly on the board; this is native development. On Ubuntu, you can use the built-in ...

Hello World! SW Development, Optimization and Partnership on ARM

ARM is hiring. OK, so that got some people’s attention and confused others – actually we are hiring, and in particular software developers. What can often come as a surprise to people is that as well as having a team of people that plan and work alongside some of our different software partners, ARM has a software engineering group that works on key bits of software, particularly on Cortex-A8 and Cortex-A9 projects. In fact it's very likely that some of the code running your mobile phone was developed by some of the ARM team.

The team covers a wide range of software projects that include: Web and web runtime optimization, for example JavaScript JIT optimization work (on projects such as Tamarin, Webkit and Squirrelfish NitroExtreme), and OpenJDK optimization work; Operating System development – including Android, Linux kernel hacking and a ...

All company and product names appearing in the ARM Blogs are trademarks and/or registered trademarks of ARM Limited per ARM’s official trademark list. All other product or service names mentioned herein are the trademarks of their respective owners.