John Carmack, interviewed in Ars Technica, said graphics processors such as the just-launched ARM Mali-T604 must address more than 32 bits-worth (4 Gbytes) of memory. Mobile downloadable apps are currently limited to 2 Gbytes, which cramps his style: on the desktop they are already much bigger. Consumer devices ship today with more than 512 MB of DRAM and 16 GB of Flash, and Moore's law tells us they will cross the 4 Gbyte limit just after the first Mali-T604-enabled devices start shipping. Game engine programmers also want OpenCL for game physics and game AI. He said:
"... I'm very interested in when the transition to 64-bit addresses is going to come in the mobile space. One of my pet project directions is enabling GPU mapping of resources from static files on there, and we will be bumping into the 32-bit address space limit on that. Before we know it, it's right around the corner that we're going to be trying to map more than 4GB of memory data on mobile devices."
Why you need 64-bit addresses and integers
Carmack was famously right a decade ago when he said we were going to need 64-bit pixels, and he’s right today when he says we will need 64-bit addresses. (He agrees with me, so he must be right.) There is no reason why battery-powered devices should be constrained by arbitrary limitations: consumers expect the same high-quality visual experience across all their devices. Modern apps (including, but not restricted to, games) are built in a 64-bit world for desktops and consoles, and currently limited versions of those apps have to be built for mobiles and other devices. Even the Nintendo 64, launched in 1996, had a 64-bit CPU; more sensibly, desktop CPUs and graphics cards have been able to address more than 32 bits-worth of memory for many years now.
Our latest GPU, the Mali-T604, was designed from the start for a new world of modern graphics and compute APIs like Khronos OpenCL, and for modern hardware requirements including larger address spaces. It is the first released member of the Midgard architecture family. Just like the ARM Cortex-A15, it has its own MMU, and it uses the same page table formats as the Cortex-A15. The MMU and page tables present external 40-bit addresses and are ready for a fully 64-bit world.
You could have horrid segmented addresses, but to manipulate pointers efficiently you need 64-bit integer arithmetic in your GPU. Yes, you can do these things on a 32-bit architecture (you can do 64-bit arithmetic on an 8-bit micro), but there comes a point when it is more efficient to support the required numerical operations natively in the architecture, and it is so much easier for developers.
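To make that cost concrete, here is a minimal sketch in C of a 64-bit pointer-plus-offset addition done the hard way, using only 32-bit operations, next to the native version. It is an illustration only, not Mali code; the names u64_pair, add64_emulated, add64_native, split64 and join64 are hypothetical helpers made up for this example.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, as a 32-bit-only ALU sees it.
   (Hypothetical type for this illustration.) */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Add two 64-bit values using only 32-bit operations:
   add the low words, detect the carry, then add the high words. */
u64_pair add64_emulated(u64_pair a, u64_pair b) {
    u64_pair r;
    r.lo = a.lo + b.lo;                 /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);     /* 1 if the low word overflowed */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* The same operation where 64-bit integers are native: one add. */
uint64_t add64_native(uint64_t a, uint64_t b) {
    return a + b;
}

/* Conversion helpers so the two versions can be compared. */
u64_pair split64(uint64_t v) {
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t join64(u64_pair p) {
    return ((uint64_t)p.hi << 32) | p.lo;
}
```

For example, adding an offset of 0x2000 to a base address of 0xFFFFF000 gives 0x100001000. Both versions above agree on that, but the emulated one needs an add, a compare and a second add, and a pointer truncated to 32 bits would silently wrap to 0x1000.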
Why you need 64-bit floats
For today's OpenGL ES content in the embedded world, you don't need anything more than FP32 (32-bit floating-point) for vertex shaders and FP16 for fragment shaders (just as Mali-200/300/400MP provide). The Midgard architecture looks forward to a new era for embedded where the focus is not to keep a separate embedded world, but to bring established high-end desktop/tethered APIs and features into the embedded world (and vice versa). We will also need backwards-compatibility with the desktop. In this new era, where development time matters, and where a lot of content is authored for both the embedded and desktop worlds, there will be applications that require FP64 to produce correct results, so Mali-T604 has native FP64 support. What we’ve found, and what our Partners are telling us, is that you don’t need FP64 often, but when you do, you need it very badly, and you need it to have good performance, even if FP32 is faster.
Of course, living with low-precision float or even fixed-point arithmetic has a long and honourable tradition in graphics and gaming, and it can be fun if you have time to do it. After all, graphics programmers lived with 16-bit depth buffers for a long time, and wasn’t that fun? Graphics programmers will be playing games like this for many years to come, because most hardware (both desktop and embedded) is going to give better performance with lower-precision data types, and approximation goes a long way when you are just creating images to be viewed by the human eye. However, as anyone who has wrestled a floating-point precision problem to the mat knows, there are times when going to double precision can save you weeks of development time. As the market for mobile graphics applications grows, we think development time will become increasingly important.
Outside the world of graphics, things are different. “Almost correct” is not what you want when you are doing general computing, as anyone who has ever had bugs in their floating-point accuracy knows to their cost. Desktop cards have had FP64 capabilities for a few years now with APIs like CUDA, and developers are used to relying on it. A lot of that code (OK, maybe not the earthquake simulation code) is coming our way, and precision is going to be required for algorithmic correctness. Double-precision IEEE 754-2008 is the default for algorithm development in the desktop and scientific communities for very good reasons: it makes everything easier. We believe that many of the use-cases for GPU computing in consumer devices will start out in the desktop and scientific computing worlds. That code will port easily to mobile devices with GPUs like Mali-T604 that provide native double-precision support. It will arrive late and buggy on devices that don’t, if it can be ported at all...
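A minimal sketch in C of the kind of “almost correct” result in question, assuming IEEE 754 single and double precision with round-to-nearest (the usual case on current CPUs). A running count kept in FP32 stops advancing once it reaches 2^24 = 16,777,216, because the next integer cannot be represented in a 24-bit significand; the same loop in FP64 carries on. The function names count_up_f32 and count_up_f64 are made up for this example.

```c
#include <stdint.h>

/* Count to n by repeatedly adding 1.0 in single precision.
   Once the total reaches 2^24, total + 1.0f rounds back to the
   same value, so the count silently stops growing. */
float count_up_f32(uint32_t n) {
    float total = 0.0f;
    for (uint32_t i = 0; i < n; ++i) {
        total += 1.0f;
    }
    return total;
}

/* The same loop in double precision. Integers up to 2^53 are
   exactly representable, so the count stays exact here. */
double count_up_f64(uint32_t n) {
    double total = 0.0;
    for (uint32_t i = 0; i < n; ++i) {
        total += 1.0;
    }
    return total;
}
```

Under those assumptions, asking both functions to count to 20,000,000 gives 16,777,216 in single precision and 20,000,000 in double precision. Nothing crashes and nothing warns; the single-precision answer is simply wrong, which is exactly why this class of bug costs so much development time.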
Even in the realm of image processing (a happy hunting ground for fixed-point and low-precision maths tricks), there are things you just can’t do without 64-bit arithmetic. One example is computing summed area tables, which can be used for everything from face recognition to producing a bokeh effect (some interesting examples here).
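To see where the bits go, here is a minimal sketch in C of a summed area table built with 64-bit entries and, for comparison, with 32-bit entries. Each entry holds the sum of every pixel above and to the left of it, so the bottom-right entry is the sum of the whole image and grows with the image area. This is an illustrative CPU version, not a GPU implementation; sat_build_u64, sat_build_u32 and sat_compare are hypothetical names, and the 16-bit pixel format is an assumption chosen for the example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Summed area table with 64-bit entries:
   sat[y*w + x] = sum of img over the rectangle (0,0)..(x,y) inclusive. */
void sat_build_u64(const uint16_t *img, size_t w, size_t h, uint64_t *sat) {
    for (size_t y = 0; y < h; ++y) {
        uint64_t row_sum = 0;                 /* prefix sum along this row */
        for (size_t x = 0; x < w; ++x) {
            row_sum += img[y * w + x];
            uint64_t above = (y > 0) ? sat[(y - 1) * w + x] : 0;
            sat[y * w + x] = row_sum + above;
        }
    }
}

/* The same table with 32-bit entries: sums wrap modulo 2^32. */
void sat_build_u32(const uint16_t *img, size_t w, size_t h, uint32_t *sat) {
    for (size_t y = 0; y < h; ++y) {
        uint32_t row_sum = 0;
        for (size_t x = 0; x < w; ++x) {
            row_sum += img[y * w + x];
            uint32_t above = (y > 0) ? sat[(y - 1) * w + x] : 0;
            sat[y * w + x] = row_sum + above;
        }
    }
}

/* Fill a w*h image (w, h > 0) with one pixel value, build both tables,
   and report the bottom-right entry of each.
   Returns 0 on success, -1 if allocation fails. */
int sat_compare(size_t w, size_t h, uint16_t value,
                uint64_t *total64, uint32_t *total32) {
    size_t n = w * h;
    uint16_t *img = malloc(n * sizeof *img);
    uint64_t *s64 = malloc(n * sizeof *s64);
    uint32_t *s32 = malloc(n * sizeof *s32);
    int status = (img && s64 && s32) ? 0 : -1;
    if (status == 0) {
        for (size_t i = 0; i < n; ++i) {
            img[i] = value;
        }
        sat_build_u64(img, w, h, s64);
        sat_build_u32(img, w, h, s32);
        *total64 = s64[n - 1];
        *total32 = s32[n - 1];
    }
    free(img);
    free(s64);
    free(s32);
    return status;
}
```

With a 512 x 512 image of 16-bit pixels all set to 65535, the true total is 2^34 - 2^18 = 17,179,607,040. The 64-bit table holds that exactly; the 32-bit table wraps around and reports 4,294,705,152 instead. A table of squared pixel values (used for variance) hits the limit even sooner.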
We have blogged before about the use-cases for OpenCL and GPU computing in the energy-conscious world, and we're convinced that double-precision has a major role to play here. We really couldn't imagine producing a new architecture without it.
Why you need full profile OpenCL
For those platforms that need OpenCL, in addition to floating-point precision (the number of bits natively supported), you also need accuracy in the library functions and in the specified arithmetic operations. In addition to the double precision extension, we provide full profile OpenCL (as well as embedded profile). Embedded profile might sound attractive, but all the mainstream developers are producing full profile implementations, not embedded. (Count "full profile" search hits as opposed to "embedded profile", for fun). If you don't have the precision and the accuracy, OpenCL code may not give you the correct results on your platform...
You don't just need double-precision and accuracy: there is a whole bunch of other requirements. In addition to the superior precision requirements of full profile OpenCL, there are also the 32-bit and 64-bit atomic operations. Without them, there is no practical way to order output from threads in different workgroups. Atomics can do a lot of things that the built-in barrier primitive cannot. You also need them implemented extremely efficiently, as they are on the Mali-T600 family, or else you will end up wasting time on synchronisation.
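As a rough sketch of what that looks like in practice, the C code below uses C11 atomics as a stand-in for an atomic increment on a counter in GPU global memory. Each call to append_value models one work-item that has a result to write: it reserves a unique output slot with a single atomic add, so no two work-items write to the same place, whichever workgroup they belong to. The append_buffer type and both function names are hypothetical, and compact_even drives the calls sequentially purely to keep the example self-contained.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared output buffer with an atomic fill counter.
   (Hypothetical type for this illustration.) */
typedef struct {
    atomic_uint count;      /* number of slots handed out so far */
    uint32_t   *items;      /* output storage */
    uint32_t    capacity;   /* number of elements in items */
} append_buffer;

/* Reserve one slot and store a value in it.
   Returns the slot index, or -1 if the buffer is full. */
int append_value(append_buffer *buf, uint32_t value) {
    unsigned slot = atomic_fetch_add(&buf->count, 1u);  /* unique per caller */
    if (slot >= buf->capacity) {
        return -1;
    }
    buf->items[slot] = value;
    return (int)slot;
}

/* Keep only the even inputs (stream compaction). On a GPU each loop
   iteration would be a separate work-item; here they run one by one.
   Returns the number of values actually stored. */
unsigned compact_even(const uint32_t *in, unsigned n, append_buffer *out) {
    for (unsigned i = 0; i < n; ++i) {
        if (in[i] % 2u == 0u) {
            append_value(out, in[i]);
        }
    }
    unsigned reserved = atomic_load(&out->count);
    return reserved < out->capacity ? reserved : out->capacity;
}
```

A barrier cannot do this job: it only synchronises work-items within one workgroup, whereas the atomic counter hands out slots correctly across all of them. When run in parallel the output order depends on scheduling, but every slot is written exactly once.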
You need atomics to manipulate data structures
If you want to use GPGPU to manipulate data structures other than just flat arrays, atomics are a critically important feature. Some examples of what can be done with atomic operations that cannot be done without them:
- Histogram construction. This is useful for a number of image processing algorithms. Histograms also make it possible to do efficient Radix-Sort on the GPU. Without atomics, the fastest possible GPU-based sort is bitonic-sort, which is much slower and consumes a huge amount more bandwidth.
- Irregular Z buffers; NVIDIA used something like this in a demo at SIGGRAPH 2008 to produce aliasing-free shadows.
- At Eurographics 2010, ATI showed a demo of a new depth-of-field algorithm that used atomics to solve longstanding problems with existing depth-of-field algorithms (in particular with respect to blurred near-field objects).
- Atomics make it possible to construct algorithms on the GPU that rely on incremental memory allocations; we use this in Mali-T604 as part of our driver stack. ATI has a demo that uses pixel-shader atomics to construct per-pixel linked lists for subsequent sorting, enabling order-independent transparency effects.
- Atomics make it possible to construct certain data structures with lockless algorithms, such as hash tables, and to implement mutexes for data structures that cannot be made fully lock-free.
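Two of the patterns in the list above can be sketched in a few lines. The C code below again uses C11 atomics as a stand-in for GPU atomics and is an illustration under our own assumptions, not code from any of the demos mentioned or from the Mali driver. histogram_add models one work-item contributing one pixel to a shared histogram. list_push combines incremental allocation (an atomic bump counter over a node pool) with an atomic exchange on the list head, which is the basic shape of a per-pixel linked list. All type and function names, NUM_BINS and LIST_END are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_BINS 256
#define LIST_END 0xFFFFFFFFu   /* marks the end of a list */

/* One work-item adds one 8-bit pixel to a shared histogram.
   The atomic add makes concurrent updates to the same bin safe. */
void histogram_add(atomic_uint bins[NUM_BINS], uint8_t pixel) {
    atomic_fetch_add(&bins[pixel], 1u);
}

/* A list node stored in a preallocated pool and linked by index. */
typedef struct {
    uint32_t payload;
    uint32_t next;          /* index of the next node, or LIST_END */
} node;

typedef struct {
    atomic_uint next_free;  /* bump allocator: next unused pool index */
    node       *nodes;
    uint32_t    size;       /* number of nodes in the pool */
} node_pool;

/* Allocate a node and push it on the front of the list at *head.
   Returns 0 on success, -1 if the pool is exhausted. */
int list_push(node_pool *pool, atomic_uint *head, uint32_t payload) {
    uint32_t idx = atomic_fetch_add(&pool->next_free, 1u);  /* allocate */
    if (idx >= pool->size) {
        return -1;
    }
    pool->nodes[idx].payload = payload;
    /* Swap ourselves in as the new head; the old head becomes our next. */
    pool->nodes[idx].next = atomic_exchange(head, idx);
    return 0;
}
```

One design point worth noting: in this sketch a node's next field is written just after the head is swapped, so the lists are only safe to traverse once all pushes have finished. That matches the build-then-sort structure described above, where the lists are constructed in one pass and consumed in a later one.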
A fully-featured, full profile OpenCL implementation is a serious work tool for computation. A simple (but still conformant) embedded profile implementation without the precision and without the features above is, by comparison, just a toy. Does your next-generation GPU have too many restrictions or are you ready for the future?
Jem is an ARM Fellow and likes to think of himself as "The Godfather" to technical talent in ARM. After spending some time in his youth writing software for satellites and traffic-lights among other fascinating things, Jem spotted the technical inflection point of the mobile industry: graphics, video and other visual computing. As VP of technology in the Media Processing Division of ARM, Jem is busy with a lot of projects involving the future of cool ARM technology, which will revolutionise how people experience and interact with digital devices.
All company and product names appearing in the ARM Blogs are trademarks and/or registered trademarks of ARM Limited per ARM’s official trademark list. All other product or service names mentioned herein are the trademarks of their respective owners.