We love to talk about “raising the bar” in graphics performance. The idea is that GPU performance competition is like a high jump, where you are constantly trying to clear a higher bar than you did last time; but when it comes to power and memory bandwidth, it’s more like a limbo contest. There, the goal is to keep lowering the bar, trying to keep the same level of performance while using ever smaller amounts of energy. In fact, it turns out that these goals are equivalent; in both cases, what you’re really trying to do is maximize efficiency. Want higher performance? Then you have to reduce energy consumption. Want lower power? Then you’ve just enabled higher performance.
It’s all about power
Mobile GPU design, in particular, is all about reducing energy consumption. This has been true for quite a while, but the reasons for it are changing. We used to worry about it because we were concerned about battery life. We still are, because nobody likes having to recharge their mobile devices in the middle of the day; and certainly we’d love it if some breakthrough in battery technology gave us ten times the energy storage capacity we have now. But even if that were to happen – not likely – it wouldn’t solve our problem, because nowadays the problem isn’t battery life. It’s heat.
Modern high-end applications processors, without exception, are thermally limited. Given enough work to do and permission to run as fast as they can, they can do so much work that they overheat their packages and destroy themselves. To keep that from happening, they contain lots of smart system hardware and software that forces them to slow down when they start to get too hot. Think about that for a minute; if your performance is limited, not by having too few transistors, or too little compute capability, or insufficient access to memory, but simply by the amount of energy you can use, then the only way to increase your performance is to reduce your energy consumption. And the most important performance metric to optimize for is not pixels or (God forbid!) triangles per second, but nanojoules per pixel (nJ/p).
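To make the nanojoules-per-pixel metric concrete, here is a back-of-envelope sketch. The panel resolution and frame rate are my illustrative assumptions, not figures from this post:

```python
# Back-of-envelope: what does a ~1 W GPU power budget allow per pixel?
# Illustrative assumptions: 1080p panel refreshed at 60 fps.
power_budget_w = 1.0            # one watt = one joule per second
pixels_per_frame = 1920 * 1080
frames_per_second = 60

pixels_per_second = pixels_per_frame * frames_per_second
joules_per_pixel = power_budget_w / pixels_per_second
print(f"{joules_per_pixel * 1e9:.1f} nJ per pixel")  # ~8.0 nJ/pixel
```

Every joule the GPU spends fetching memory, shading, or blending has to come out of that handful of nanojoules.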
Thinking like a GPU designer
To get a feel for the kind of reasoning this leads to, let’s look at one aspect of power optimization: reducing memory bandwidth. Since we’re playing engineer, we’ll do this using numbers; but don’t worry, we won’t need higher math and physics – simple arithmetic will do. We’ll start with some simple facts:
- Power is just a rate of energy consumption. Energy is measured in joules; one watt of power is one joule per second.
- Speaking very loosely, the power budget for a mobile GPU is about one watt (sometimes less).
- Every time the GPU reads or writes one byte of memory, it consumes about 150 picojoules (pJ), or millionths of a millionth of a joule. (Memory geeks, this is for 2x32 LPDDR2 and includes everything from the memory controller out, under a whole boatload of assumptions. It’s only a ballpark figure, so don’t take it too seriously. But it’s enough to get us started.)
The first question we have to ask is, does memory bandwidth use enough power to be worth worrying about? Here comes our first numerical argument: the kind of memory system we’re talking about can transfer something like 4 to 8 GB (gigabytes) of data per second. Multiply that by 150 pJ per byte, and we get 0.6 to 1.2 watts. In other words, memory bandwidth can eat up our entire power budget. So the answer to our question is yes, memory bandwidth does matter; in fact it’s critical.
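The arithmetic above is simple enough to sketch directly (using the same ballpark 150 pJ/byte figure):

```python
# Sanity check: does memory traffic matter at a ~1 W GPU power budget?
ENERGY_PER_BYTE_J = 150e-12     # ~150 pJ per byte transferred (ballpark)

for gb_per_s in (4, 8):         # plausible bandwidth range for this memory
    watts = gb_per_s * 1e9 * ENERGY_PER_BYTE_J
    print(f"{gb_per_s} GB/s -> {watts:.1f} W")
# 4 GB/s -> 0.6 W
# 8 GB/s -> 1.2 W  (more than the whole budget)
```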
The Tile Game
In the SIGGRAPH talk, I went on to talk about tile-based rendering. This is a way of organizing the graphics pipeline so that the color, depth, and stencil sample buffers stay on-chip. Tile-based rendering greatly reduces memory bandwidth usage, especially if the application is using multi-sampled antialiasing (MSAA), which requires multiple color, depth, and stencil samples for every pixel. We use it in all of the ARM® Mali™ GPUs, and it’s also used (with variations) in the Qualcomm Adreno™ and Imagination PowerVR™ cores. Our version works like this:
The GPU divides the output image into small rectangles called tiles, and maintains to-do lists of things that need to be drawn into each tile. When the application asks the GPU to draw a triangle, it doesn’t actually do it; it just figures out which tiles contain pixels that the triangle might cover, and adds the triangle to those tiles’ to-do lists. When it’s time to draw the pixels, the GPU processes the tiles one at a time. For each tile, it reads the to-do list and draws all of its triangles in order; but since the tile is small, it can do this into a special on-chip memory called the tile buffer. When all the triangles have been drawn, it does what we call a resolve: it filters the color samples to produce one color per pixel, and writes the pixel colors into the external frame buffer. The color, depth, and stencil samples aren’t needed any more (usually), so the GPU just forgets about them and goes on to the next tile. Figure 1 shows what it looks like in pictures.
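The binning step can be sketched in a few lines. This is a toy model (the tile size, function names, and conservative bounding-box test are mine, not a description of the Mali hardware):

```python
# Toy sketch of tile binning: each triangle joins the to-do list of every
# tile that its screen-space bounding box overlaps. (Real hardware can be
# less conservative, but a bounding-box test is the simplest safe choice.)
TILE = 16  # tile size in pixels (illustrative)

def bin_triangle(tri, tile_lists, screen_w, screen_h):
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    # Clamp the bounding box to the screen, then convert to tile indices.
    x0, x1 = max(0, min(xs)) // TILE, min(screen_w - 1, max(xs)) // TILE
    y0, y1 = max(0, min(ys)) // TILE, min(screen_h - 1, max(ys)) // TILE
    for ty in range(y0, y1 + 1):
        for tx in range(x0, x1 + 1):
            tile_lists.setdefault((tx, ty), []).append(tri)

tile_lists = {}
bin_triangle([(2, 2), (30, 4), (10, 28)], tile_lists, 64, 64)
print(sorted(tile_lists))  # the four tiles (0,0)..(1,1) get the triangle
```

Rasterization then walks the tiles one at a time, replaying each list into the on-chip tile buffer.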
Figure 1: Tile-based rendering. Triangles submitted for drawing are written into per-tile to-do lists in system memory. When the pixels are needed, the rasterizer reads the to-do list for each tile and renders it into the on-chip multisample (MS) depth (Z) and color (C) buffers. When a tile is finished, it is resolved to obtain pixel colors, which are written into the off-chip framebuffer. In this figure, the GPU has just finished rendering tile 9 to the internal, multisampled tile buffers, and writing the resolved image to the external frame buffer.
Reducing texture bandwidth
Figure 1 shows us that tile-based rendering puts most of the heavy data traffic – specifically, traffic into and out of the multisampled Z and color buffers – inside the GPU, in on-chip memory. The fattest arrow that still crosses the bus into system memory is texture data. The first law of optimization is, “work on the stuff that’s hurting you the most” – so reducing texture bandwidth is the next thing we need to worry about.
This, of course, is exactly what motivated our work on Adaptive Scalable Texture Compression (ASTC). We’ve written several blogs about ASTC, so I won’t repeat the whole story here; for a great introduction to how it works, read Sean Ellis’s excellent blog based on our HPG paper. The latest development on the ASTC front is that the Khronos Group has adopted a subset of ASTC as a Khronos-ratified OpenGL and OpenGL ES extension. We’ve announced plans to support the extension in the newly announced Mali-T624 and Mali-T678 GPUs, and several other GPU providers have expressed similar intentions. Since we’ve agreed to license the patents royalty-free under the terms of the Khronos members’ agreement, we expect that ASTC will be available on all OpenGL ES platforms within a few years.
The exciting thing about ASTC from a developer’s point of view is that it allows almost any texture you can imagine to be compressed. The formats in common use today (S3TC, PVRTC, ETC1, RGTC) offer only a limited number of bit rates, and a few choices of number of color components. ASTC offers just about any bit rate you could want, with any number of color components you like, in your choice of standard (8-bit) or HDR (float), all at a quality that is matched only by still-exotic high-end formats such as BPTC. This means that, for the first time, you can think about compressing all of the textures used by your application. There are no ‘holes’ in the coverage; no matter what your pixel format or quality requirements are, ASTC has a format to match. [FOOTNOTE: OK, there’s an exception to every rule. ASTC doesn’t have a way to compress integer textures, which are a new feature in OpenGL ES 3.0. Give us time.]
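Where does that range of bit rates come from? Every ASTC block occupies 128 bits regardless of its footprint, so the bit rate is just 128 divided by the number of pixels covered. A quick sketch for a few of the 2D footprints (the full set runs from 4x4 up to 12x12):

```python
# ASTC bit rates: every block is 128 bits; the footprint sets the rate.
BLOCK_BITS = 128
for w, h in [(4, 4), (6, 6), (8, 8), (10, 10), (12, 12)]:
    print(f"{w}x{h}: {BLOCK_BITS / (w * h):.2f} bits/pixel")
# 4x4:   8.00 bits/pixel
# 12x12: 0.89 bits/pixel
```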
We expect it’ll take a little time for developers to get a feel for working with ASTC, and in particular to learn what kind of use cases demand what kinds of bit rates. For developers who want to get a head start, we’ve released an evaluation codec package, including source code. We hope you’ll find it interesting.
Making tiling even better
Looking back at Figure 1, we’ve used tile-based rendering to eliminate external traffic to the multisample buffers, and introduced ASTC to shrink texture-fetch traffic as much as we can. The biggest arrow remaining is tile writeback, where we write resolved color samples from the on-chip tile buffer to the framebuffer in external memory. As screens get bigger, this step becomes more and more important – and screens, trust me, are going to get ridiculously big. Can we do something about tile writeback?
During the design of the Midgard GPU architecture, we spent a lot of time looking at application behavior, looking for opportunities to reduce power or improve performance. One thing we noticed is that surprisingly often, the resolved pixels we write out to memory are exactly the same as the pixels we wrote during the preceding frame. That is, the part of the image corresponding to the tile hasn’t changed. The architects found this annoying; the GPU was burning energy to write data to memory, when that data was already there. Clearly, if we could detect situations where a tile hadn’t changed, we could skip writing it, and reduce power consumption.
Now, it’s not a surprise that a lot of pixels don’t change when the GPU is, say, compositing a web page or a window system. But we found that you get significant numbers of redundant tile writes even in modern FPS games, where you’d think the whole screen would be changing constantly; and you even get them during video playback. Obviously you don’t save a lot on that kind of content, but you pretty much always save enough to make it worth doing. So, we decided to attack the problem in Midgard, by adding a feature we call transaction elimination.
Introducing transaction elimination
OK, it’s not the coolest name in the world, but the technology itself is simple and elegant. Every time the GPU resolves a tile-full of color samples, it computes a signature or checksum – a short bit string that depends sensitively on every pixel in the resolved buffer. It writes each signature into a list associated with the output color buffer. The next time it renders to that buffer, after resolving each tile, it compares the new signature to the old one. If the signature hasn’t changed, it skips writing out the tile, because the probability that the pixels have changed is one in, well, a very, very, very large number.
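Here is a toy model of the idea. This is only a sketch: the real hardware computes a dedicated signature during resolve, not a software hash, and all the names below are mine:

```python
# Toy model of transaction elimination: per-tile signatures from the
# previous frame gate the writeback of resolved tiles.
import hashlib

def resolve_frame(tiles, prev_sigs, framebuffer):
    """tiles: {tile_id: resolved pixel bytes}. Returns tiles written."""
    writes = 0
    for tile_id, pixels in tiles.items():
        sig = hashlib.sha1(pixels).digest()
        if prev_sigs.get(tile_id) != sig:   # mismatch: must write back
            framebuffer[tile_id] = pixels
            writes += 1
        prev_sigs[tile_id] = sig            # match: writeback skipped
    return writes

fb, sigs = {}, {}
frame1 = {0: b"sky", 1: b"car", 2: b"hud"}
print(resolve_frame(frame1, sigs, fb))        # 3: first frame writes all
frame2 = {0: b"sky", 1: b"car2", 2: b"hud"}   # only the car tile changed
print(resolve_frame(frame2, sigs, fb))        # 1: two writes eliminated
```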
Figure 2 illustrates the idea. For tiles where we have a (green) signature match, we can skip writing the tile; this happens (in this hypothetical case) for the skybox, parts of the heads-up display, and parts of the car. Where we have a (red) mismatch, we have to write the tile to memory.
Theory meets practice
Based on our design studies, we expected that transaction elimination would help a lot for browsing and GUI compositing, but only modestly for games. Now that we have access to partner silicon for the Mali-T604, however, we’ve been able to study its behavior in real applications, running on a real OS. It turns out it works better than we thought, for two reasons. First, display resolutions have once again grown faster than we predicted; and second, the kinds of games people are playing aren’t the kind we were expecting.
Saving the planet, one Angry Bird™ at a time
Currently, the most popular mobile game on the planet, by a wide margin, is Rovio’s Angry Birds. It is played a lot, according to its creators: about 200 million minutes per day worldwide. Statistically you’ve almost certainly played it, so I don’t need to tell you that its style is friendly to transaction elimination. But to help you visualize just how friendly it is, here are several images (Figures 3, 4, and 5). I’ve painted a red overlay on the tiles where we have a signature mismatch (and therefore have to write the tile to memory). As you can see, when we’re aiming the slingshot, there’s very little motion and only a handful of tiles need to be written. When we launch the bird, the whole screen pans and a lot of tiles change, but we still end up skipping almost 50% of tile writes. Finally, when the bird hits, the scrolling slows down and then stops, and the number of active tiles trails off.
Figure 3: Aiming. Transaction elimination is able to suppress 96% of tile writes
Figure 4: Bird in flight. Here there is a lot of background motion, but we are still able to eliminate about half of tile writes
Figure 5: Settling. As the physics engine converges, more and more of the scene becomes static and stops requiring tiles to be written to memory
So how much does this help?
To put numbers on the value of transaction elimination, we captured a couple of thousand frames of the OpenGL ES commands issued by Angry Birds “Seasons” during a playing session. We then ran the commands on a prototype high-end Android™ tablet with Mali-T604 silicon, first with transaction elimination disabled, and then with it enabled. We used the built-in debug protocols to read back the internal performance counters. We found that over the sequence, about 75% of tile writebacks were eliminated. Total GPU bandwidth was cut nearly in half, from 6.5 MB/frame to 3.4 MB/frame.
To put that into perspective: if every Angry Birds player on the planet were using Mali silicon at a resolution of 1368x760 and assuming a bandwidth cost of 150 pJ per byte, the technology would be saving about 3.8 kW continuous power world-wide. That’s enough to run several single-family houses, 24x7. It’s equivalent to about five horsepower, so it’s more than the max output of a Vespa S 50 motor scooter, or my old Sears lawnmower. But it’s more fun to think of it in terms of energy. Again, assuming every Angry Birds player were using the technology, transaction elimination would save 34 megawatt-hours of energy per year. If you’re interested in saving the planet, that’s 20 barrels of oil, which would yield 8.7 metric tons of carbon dioxide; if you’re more the Duke Nukem type, it’s approximately the energy released from exploding about 16.3 metric tons of dynamite. It’s a lot of energy!
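For the skeptics, here is roughly how that estimate falls out of the numbers above. The 60 fps figure and the averaging of play time into concurrent players are my assumptions; everything is ballpark:

```python
# Rough reconstruction of the worldwide-savings estimate.
# Assumptions (mine): 60 fps, play time spread evenly over the day.
saved_bytes_per_frame = (6.5 - 3.4) * 1e6   # MB/frame of writes eliminated
energy_per_byte = 150e-12                   # J/byte (same ballpark figure)
fps = 60
watts_per_player = saved_bytes_per_frame * energy_per_byte * fps

minutes_per_day = 200e6                     # worldwide Angry Birds play time
avg_players = minutes_per_day / (24 * 60)   # average concurrent players

total_kw = avg_players * watts_per_player / 1e3
print(f"~{total_kw:.1f} kW continuous")              # ~3.9 kW
print(f"~{total_kw * 24 * 365 / 1e3:.0f} MWh/year")  # ~34 MWh
```

Each player saves only about 28 milliwatts; it’s the sheer number of players that turns that into kilowatts.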
I hope you’ve enjoyed this little dive into GPU design and energy-think. Deepest thanks to Rovio for giving me permission to use the Angry Birds images, for writing a game that is so awesomely well suited to transaction elimination, and (of course) for several hundred hours of my life which I will never, ever get back…
Got questions? Just like to argue? Drop me a line…
Tom Olson is Director of Graphics Research at ARM. After a couple of years as a musician (which he doesn't talk about), and a couple more designing digital logic for satellites, he earned a PhD and became a computer vision researcher. Around 2001 he saw the coming tidal wave of demand for graphics on mobile devices, and switched his research area to graphics. He spends his working days thinking about what ARM GPUs will be used for in 2013 and beyond. In his spare time, he chairs the Khronos OpenGL ES Working Group.