Inverse trigonometric functions GPU optimization for AMD GCN architecture

Version : 2.0 – Living blog – First version was 01 December 2014

First advice anybody have regarding inverse trigonometric functions (acos, asin, atan) is “do not use it”. And this is a good advise. It is often possible to get rid of all the trigonometric functions with trigonometric identities [1][2]. However with the growing complexity of lighting models I see more and more usage of them. One of the main use case I met is when I deal with solid angle calculation and area lights. So if we need to manipulate such functions, better to be aware of their cost. This post is about knowing the cost of GPU inverse trigonometric function and providing optimize version for the AMD GCN architecture (PS4, XBone, PC – AMD). For this post I have use the PS4 shader compiler (v2.00) for the analysis.

AMD GCN architecture basics

There is already plenty of good information available on the web about the AMD GCN architecture, so I will not repeat them here [3][4][5]. I recommend to read the excellent talk of Michal Drobot about “Low level optimization for GCN” [3] as I will mainly follow its vocabulary.

Main basics:
– instruction are classify into vector instructions v_ and scalar instruction s_. The scalar instruction can be coarsely consider as free as they are executed in parallel.
– instruction are full rate or quater rate, i.e this is equivalent to say there is instruction which are 4x slower than other. Full rate (FR): mul, mad, add, sub, and, or, bit shift… Quater rate(QR): transcendental instruction like rcp, sqrt, rsqrt, cos, sin, log, exp…
– macro instructions can expand to several instructions: tan, acos, asin, atan, pow, sign, length…
– there is free modifier: saturate, abs, negate, mul2, mul4, mul8, div2, div4…
– dynamic branching can be considering having cost >= 16 FR.
– VGPR count are more important than instruction count

Cost of inverse trigonometric function

How expensive is an inverse trigonometric function ?

On PS4 I get these numbers using the following code and isolating the instructions related to the function itself

float val;

float4 main() : S_TARGET_OUTPUT0
{
    float res = acos(val);
    return float4(res, res, res, res);
}

acos: 48 FR (40 FR, 2 QR), 2 DB, 12 VGPR
asin: 48 FR (40 FR, 2 QR), 2 DB, 1 scalar instruction, 12 VGPR
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

I do not report the asm listing of these functions to avoid to overcharge the post with useless code.
The number 48 for acos is the equivalent full rate cost of the sum of full rate and quarter rate instructions.
To be fair, the PS4 implementation of acos/asin use a dynamic if to select between negative and positive value but the code in both branch is identical and only differ by the sign. So in reality the runtime cost is rather half of this, like 24FR + 1 DB. Still it bloat the shader code and cause increase of VGPR.
GPU compiler have a generic and accurate implementation of inverse trigonometric functions.  But we are free to use what we desire and are not force to use the GPU compiler version as they are only macro and not hardware instruction. I have decided to also provide the cost of the Cg reference implementation of these functions [6]. So the following cost are from explicit implementation of the function with the code provide in the Cg documentation:

acos: 19 FR(14 FR, 1 QR), 4 VGPR
asin: 18 FR(13 FR, 1 QR), 4 VGPR, 1 scalar instruction
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

The difference between Cg version and macro version come from the “dynamic if” which is no more present in the Cg version. I also suppose that the Cg version is less accurate (Documentation said absolute error <= 6.7e-5). Still regarding these numbers, we get that inverse trigonometric function are expensive. As a game developer we know which accuracy and which range of values we need to support and we can tune functions to fit our need and reduce the cost.

Read more of this post