# Inverse trigonometric functions GPU optimization for AMD GCN architecture

Version : 2.0 – Living blog – First version was 01 December 2014

The first advice anybody gives regarding inverse trigonometric functions (acos, asin, atan) is “do not use them”. And this is good advice. It is often possible to get rid of all the inverse trigonometric functions with trigonometric identities. However, with the growing complexity of lighting models, I see more and more usage of them. The main use case I have met is solid angle calculation with area lights. So if we need to manipulate such functions, better to be aware of their cost. This post is about knowing the cost of GPU inverse trigonometric functions and providing optimized versions for the AMD GCN architecture (PS4, Xbox One, AMD PC). For this post I have used the PS4 shader compiler (v2.00) for the analysis.

## AMD GCN architecture basics

There is already plenty of good information available on the web about the AMD GCN architecture, so I will not repeat it here. I recommend reading Michal Drobot’s excellent talk “Low Level Optimization for GCN”, as I will mainly follow its vocabulary.

Main basics:
– Instructions are classified into vector instructions (v_ prefix) and scalar instructions (s_ prefix). Scalar instructions can be coarsely considered free, as they are executed in parallel.
– Instructions are full rate or quarter rate, i.e. some instructions are 4x slower than others. Full rate (FR): mul, mad, add, sub, and, or, bit shift… Quarter rate (QR): transcendental instructions like rcp, sqrt, rsqrt, cos, sin, log, exp…
– Macro instructions expand to several instructions: tan, acos, asin, atan, pow, sign, length…
– There are free modifiers: saturate, abs, negate, mul2, mul4, mul8, div2, div4…
– Dynamic branching can be considered as having a cost >= 16 FR.
– VGPR count is more important than instruction count.

## Cost of inverse trigonometric function

How expensive is an inverse trigonometric function?

On PS4 I get these numbers using the following code and isolating the instructions related to the function itself:

```
float val;

float4 main() : S_TARGET_OUTPUT0
{
    float res = acos(val);
    return float4(res, res, res, res);
}
```

acos: 48 FR (40 FR, 2 QR), 2 DB, 12 VGPR
asin: 48 FR (40 FR, 2 QR), 2 DB, 1 scalar instruction, 12 VGPR
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

I do not report the asm listings of these functions, to avoid overloading the post with useless code.
The number 48 for acos is the equivalent full rate cost of the sum of the full rate and quarter rate instructions.
To be fair, the PS4 implementation of acos/asin uses a dynamic branch to select between negative and positive values, but the code in both branches is identical and only differs by the sign. So in reality the runtime cost is rather half of this, around 24 FR + 1 DB. Still, it bloats the shader code and increases VGPR usage.
GPU compilers have a generic and accurate implementation of inverse trigonometric functions. But we are free to use whatever we want and are not forced to use the GPU compiler version, as these are only macros and not hardware instructions. I have decided to also provide the cost of the Cg reference implementation of these functions. The following costs are from explicit implementations of the functions with the code provided in the Cg documentation:

acos: 19 FR(14 FR, 1 QR), 4 VGPR
asin: 18 FR(13 FR, 1 QR), 4 VGPR, 1 scalar instruction
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

The difference between the Cg version and the macro version comes from the dynamic branch, which is no longer present in the Cg version. I also suppose that the Cg version is less accurate (the documentation says absolute error <= 6.7e-5). Still, looking at these numbers, we see that inverse trigonometric functions are expensive. As game developers we know which accuracy and which range of values we need to support, so we can tune the functions to fit our needs and reduce the cost.

I won’t dig into the details of all the possible approximations of acos, asin or atan because others, like Robin Green’s documents, have already done it well. I will only focus on polynomial minimax approximation, which seems to be the most efficient method for the GCN architecture. People have often heard about Taylor series to approximate trigonometric functions, but polynomial minimax approximations are way better. Polynomial minimax approximations are typically generated with a variant of the Remez algorithm and are polynomials that minimize the maximum error between the polynomial and the function. The minimax approach tries to minimize either the absolute error abs(f(x) – approx(x)) or the relative error abs(1 – approx(x) / f(x)). I recommend reading Sam Hocevar’s nice article about why the Remez algorithm is better than Taylor expansion.
Various tools such as Maple and Mathematica have built-in functionality for minimax approximation, and there are free tools like Sam Hocevar’s framework “lolremez” to play with a C++ implementation.

In this post I use Mathematica, and I provide the Mathematica code at the end of the post. Mathematica provides the MiniMaxApproximation and GeneralMiniMaxApproximation functions. MiniMaxApproximation minimizes the relative error; GeneralMiniMaxApproximation allows minimizing either the relative error or the absolute error. Here is example usage:

```
(* GeneralMiniMaxApproximation[{x, y, (g)}, {t, {t0, t1}, m, k}, xvar, (opts)]
   The error to be minimized is (y[x[t]] - approx[x[t]])/g[t].
   Using g(t) = y(t) and x(t) = t gives the relative error;
   setting g(t) = 1 minimizes the absolute error. *)

(* Approximate Exp on interval [0, 1] with a polynomial of degree 3,
   minimizing the absolute error *)
GeneralMiniMaxApproximation[{t, Exp[t], 1}, {t, {0, 1}, 3, 0}, x][[2, 1]]
0.999455 + 1.0166 x + 0.421703 x^2 + 0.279976 x^3

(* Approximate Exp on interval [0, 1] with a polynomial of degree 3,
   minimizing the relative error *)
GeneralMiniMaxApproximation[{t, Exp[t], Exp[t]}, {t, {0, 1}, 3, 0}, x][[2, 1]]
0.999678 + 1.01217 x + 0.434183 x^2 + 0.271371 x^3

(* Approximate Exp on interval [0, 1] with a polynomial of degree 3,
   minimizing the relative error *)
MiniMaxApproximation[Exp[x], {x, {0, 1}, 3, 0}][[2, 1]]
0.999678 + 1.01217 x + 0.434183 x^2 + 0.271371 x^3
```

However, during my tests on inverse trigonometric functions, MiniMaxApproximation and the absolute error version of GeneralMiniMaxApproximation were providing very similar results. It appears that the relative error is correctly achieved, but it is possible to get a better absolute error by playing a bit with the input range provided to the function. For example the default maximum absolute error on input range [0, 1] for a polynomial minimax approximation of degree 3 with GeneralMiniMaxApproximation is 6.7×10^-5, whereas it is 4.55×10^-5 with input range [0, 0.8776]. So in this post I use MiniMaxApproximation on range [0, 1] to get the maximum relative error, and the absolute version of GeneralMiniMaxApproximation, where I tweak the input range, to get the maximum absolute error. Should we use relative or absolute error? It depends on what we are looking for. As we focus mainly on visuals, I think relative error is a better choice, but I provide results for both in this post.

## Optimizing inverse trigonometric function

Let’s first look at the shape of the inverse trigonometric functions:

acos takes input in the range [-1, 1] and outputs in the range [0, PI]
asin takes input in the range [-1, 1] and outputs in the range [-PI/2, PI/2]
atan takes input in the range [-infinity, infinity] and outputs in the range [-PI/2, PI/2] (input range limited to [-10, 10] on the graph below).
Recall that an odd function satisfies f(-x) = -f(x) and an even function f(-x) = f(x).
This is interesting because an odd function can be approximated with an odd polynomial (a polynomial that only uses odd powers of x, such as x, x^3, x^5).

Some trigonometric identities:

acos(-x) = PI - acos(x)
asin(-x) = -asin(x)
asin(x) + acos(x) = PI/2
atan(-x) = -atan(x)
atan(x) = PI/2 - atan(1/x) for x > 0, and atan(x) = -PI/2 - atan(1/x) for x < 0
atan(x) = PI/4 + atan((x - 1) / (x + 1)) for x > -1, and atan(x) = -PI/4 - atan((x + 1) / (x - 1)) for x < 1

With the help of these identities we can find a polynomial minimax approximation on a smaller range of values for acos, asin and atan, then simply flip or bias the result to get the full range of the function. This is called range reduction:

acos is approximated on the input range [0, 1], then we use the identity acos(-x) = PI - acos(x) to get the [-1, 0] range.
asin is approximated on the input range [0, 1], then we use the identity asin(-x) = -asin(x) to get the [-1, 0] range. asin can also be implemented based on acos with the identity asin(x) = PI/2 - acos(x).
atan is approximated on the input range [0, 1], then we use the identity atan(x) = PI/2 - atan(1/x) to get the [1, infinity] range. The last step is to negate the result to get [-infinity, 0] with the identity atan(-x) = -atan(x).
Alternatively, atan can be approximated on the input range [-1, 1], then we use the identity atan(x) = PI/4 + atan((x - 1) / (x + 1)) to get the [0, infinity] range. The last step is to negate the result to get [-infinity, 0] with the identity atan(-x) = -atan(x).

Remark: if you do not use the full input range for your use case, you can save several instructions because you don’t need to undo the range reduction step. For example, if you know that your atan inputs are always positive, you don’t need to perform the final negation based on the input sign.

The minimax approximation can produce a polynomial of any order. But our goal is to find the optimal polynomial for the AMD GCN architecture with just enough accuracy for our use case. How do we know which accuracy we require? The simplest method is to check the required accuracy visually for your use case and spot when artifacts or divergence appear. For this post I have taken a diffuse sphere area light case I want to optimize (the code is extracted from the course notes of the Siggraph 2014 talk “Moving Frostbite to PBR”, Listing 10):

```
float illuminanceSphereOrDisk(float cosTheta, float sinSigmaSqr)
{
    float sinTheta = sqrt(1.0f - cosTheta * cosTheta);

    float illuminance = 0.0f;
    if (cosTheta * cosTheta > sinSigmaSqr)
    {
        illuminance = FB_PI * sinSigmaSqr * saturate(cosTheta);
    }
    else
    {
        float x = sqrt(1.0f / sinSigmaSqr - 1.0f);
        float y = -x * (cosTheta / sinTheta);
        float sinThetaSqrtY = sinTheta * sqrt(1.0f - y * y);
        illuminance = (cosTheta * acos(y) - x * sinThetaSqrtY) *
                      sinSigmaSqr + atan(sinThetaSqrtY / x);
    }

    return max(illuminance, 0.0f);
}
```

The code contains one atan and one acos.

I have written a framework in Mathematica to help optimize the inverse trigonometric functions by quickly testing various polynomial degrees for the approximation with various options. The Mathematica file is provided at the end of the post. The following sections provide more details on the optimization process for each function.

## Optimizing acos

MiniMaxApproximation is only able to approximate a function f with a polynomial p:
f(x) ~= p(x)
But sometimes the function we want to approximate can’t be represented easily with only a polynomial. This is the case for acos. The curve on the [0, 1] range of the diagram in the previous section looks more like sqrt(1-x), see the graph below. David H. Eberly in his GPGPU book provides some information about why sqrt(1-x) is the function to use. It is possible to introduce this function as part of our approximation:
acos(x) = sqrt(1-x) * p(x)

The polynomial minimax approximation provides a set of coefficients for p(x). In the following I will present different coefficient sets for various polynomial degrees, with the associated maximum absolute and relative errors. The GPU code and the cost are the same for a given degree. Below is the optimized GPU code for the GCN architecture and the AMD GCN ISA for the first 3 degrees. In all the inverse trigonometric code I rely on the ternary operator ?: mapping to a cmp and a cndmask, which are 2 FR; check the assembly to be sure your GPU compiler does it.

```
#define PI 3.141593f
#define HALF_PI 1.570796f
// p(x) coefficient
#define C0 ...
#define C1 ...
#define C2 ...
...
```
```
// polynomial degree 1
// input [-1, 1] and output [0, PI]
float ACos(float inX)
{
    float x   = abs(inX);
    float res = C1 * x + C0; // p(x)
    res *= sqrt(1.0f - x);

    return (inX >= 0) ? res : PI - res; // Undo range reduction
}

// ISA GCN
// 4 VGPR, 12 FR (8 FR, 1 QR), 1 scalar
v_mov_b32     v0, #C0
v_mov_b32     v1, #C1
v_sqrt_f32    v1, v1
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #PI
v_cmp_ge_f32  vcc, s0, 0
```
```
// polynomial degree 2
// input [-1, 1] and output [0, PI]
float ACos(float inX)
{
    float x   = abs(inX);
    float res = (C2 * x + C1) * x + C0; // p(x)
    res *= sqrt(1.0f - x);

    return (inX >= 0) ? res : PI - res; // Undo range reduction
}

// ISA GCN
// 4 VGPR, 14 FR (10 FR, 1 QR), 1 scalar
v_mov_b32     v0, #C1
v_mov_b32     v1, #C2
v_mov_b32     v1, #C0
v_sqrt_f32    v0, v0
v_mul_f32     v2, v1, v0
s_mov_b32     s1, #PI
v_cmp_ge_f32  vcc, s0, 0
```
```
// polynomial degree 3
// input [-1, 1] and output [0, PI]
float ACos(float inX)
{
    float x   = abs(inX);
    float res = ((C3 * x + C2) * x + C1) * x + C0; // p(x)
    res *= sqrt(1.0f - x);

    return (inX >= 0) ? res : PI - res; // Undo range reduction
}

// ISA GCN
// 4 VGPR, 16 FR (12 FR, 1 QR), 1 scalar
v_mov_b32     v0, #0x3d9c664d    // 0.076367
v_mov_b32     v1, #0xbca8a647    // -0.020587
v_mov_b32     v1, #0xbe59c843    // -0.212678
v_mov_b32     v0, #0x3fc90e56    // 1.570750
v_sqrt_f32    v1, v1
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #0x40490fdc    // 3.141593
v_cmp_ge_f32  vcc, s0, 0
```

Remark about the GCN ISA code: as you can see, a mad with two constants can’t be executed in a single instruction. This means that we need to store a constant in a register before using it in a mad, which is why we have a lot of mov instructions, and also why each new degree adds 2 FR (one v_mad and one v_mov). Sadly, our approach is full of such cases, and it is difficult to generate immediate numbers for the coefficients, like 2, 4 or 0.5… For more details about the constraints of mad, see the great talk of Emil Persson. The abs is free, and the sqrt costs 4 FR but provides better accuracy and is cheaper than increasing the polynomial degree by a large amount.

Now let’s see how we get the coefficients. We look for an approximation of p(x) as:
acos(x) / sqrt(1-x) = p(x)

For acos, the optimization process is then:
– Call MiniMaxApproximation with acos(x) / sqrt(1-x) on range [0, 1] to minimize the relative error
– Call GeneralMiniMaxApproximation with acos(x) / sqrt(1-x) on a tweaked range [0, #] to minimize the absolute error
– Undo the range reduction to recreate the whole curve on [-1, 1]
– Write the Horner form of both polynomials to generate mad instructions for AMD GCN
– Plot the acos function and the approximation
– Plot the absolute/relative error
– Retrieve the maximum absolute/relative error
– If satisfying, test in game

Here is a part of the Mathematica code doing that:

```
(* ArcCos minimax approximation *)
(* Range can't be [0-1] because of singularity *)
acosApproxPositiveRelative[x_] = Sqrt[1 - x]*
  MiniMaxApproximation[ArcCos[x]/Sqrt[1 - x], {x, {0, 0.99999}, 3, 0}][[2, 1]]
acosApproxPositiveAbsolute[x_] = Sqrt[1 - x]*
  GeneralMiniMaxApproximation[{t, ArcCos[t]/Sqrt[1 - t], 1},
    {t, {0, 0.8776}, 3, 0}, x][[2, 1]]
(* Undo range reduction to retrieve whole curve *)
acosApproxRelative[x_] = If[x >= 0,
  acosApproxPositiveRelative[x], \[Pi] - acosApproxPositiveRelative[Abs[x]]]
acosApproxAbsolute[x_] = If[x >= 0,
  acosApproxPositiveAbsolute[x], \[Pi] - acosApproxPositiveAbsolute[Abs[x]]];
(* Compare approx *)
Plot[{ArcCos[x], acosApproxRelative[x],
  acosApproxAbsolute[x]}, {x, -1, 1}, PlotRange -> {0, \[Pi]}]
Plot[{ArcCos[x], acosApproxRelative[x]}, {x, -1, 1},
  PlotRange -> {0, \[Pi]}]
```

I run all these steps for polynomial degrees 1 to 3 (the 3 in the code “{x, {0, 0.99999}, 3, 0}” means 3rd degree); the goal is to use the smallest degree fitting our use case. The lower the degree, the better the performance but the lower the accuracy. Here is a table summing up the results. Bold indicates whether the row favors maximum relative or absolute error. The name PolyX means polynomial of degree X:

| Name  | Cost (FR) | Input       | Max abs err | Max rel err | Bounds | C0      | C1        | C2        | C3         |
|-------|-----------|-------------|-------------|-------------|--------|---------|-----------|-----------|------------|
| Poly1 | 12        | [0, 0.9999] | 6.1 x 10^-3 | 3.9 x 10^-3 | No     | 1.56467 | -0.155972 |           |            |
| Poly1 | 12        | [0, 0.713]  | 3.5 x 10^-3 | 1.1 x 10^-2 | No     | 1.56734 | -0.16882  |           |            |
| Poly2 | 14        | [0, 0.9999] | 6.1 x 10^-4 | 3.9 x 10^-4 | No     | 1.57018 | -0.201877 | 0.0464619 |            |
| Poly2 | 14        | [0, 0.822]  | 3.7 x 10^-4 | 1.7 x 10^-3 | No     | 1.57042 | -0.205107 | 0.0512521 |            |
| Poly3 | 16        | [0, 0.9999] | 7.1 x 10^-5 | 4.5 x 10^-5 | No     | 1.57073 | -0.212053 | 0.0740935 | -0.0186166 |
| Poly3 | 16        | [0, 0.8776] | 4.6 x 10^-5 | 2.6 x 10^-4 | No     | 1.57075 | -0.21271  | 0.0764532 | -0.0206453 |

The bounds column will be explained in the next section. In practice I found that the maximum relative error ACos approximation of degree 1 (first row) is sufficient for the diffuse sphere area light test case. On the screenshot, the first row is a sphere lit by a 1m sphere area light, the second row a sphere lit by a 10cm sphere area light; only diffuse lighting is shown. The acos call of the original function has been replaced by the ACos polynomial of degree 1. Differences are extremely subtle (hard to see at this screenshot quality).

The approximations described above don’t respect the values at the input boundaries [0, 1], which should be acos(0) = PI/2 (or HALF_PI) and acos(1) = 0. For example, the maximum absolute approximation of degree 3 gives 1.57075 at 0. Whether this matters depends on your usage. Forcing the value to be HALF_PI without modifying the other coefficients can have a large impact on the maximum error, as shown on the graph below. For the degree 3 approximation, the absolute error almost doubles.

A way to deal with boundary values is to define the approximation as:
p(x) = A + x (B - A) + x (x - 1) q(x)
The first part ensures that p(0) = A, the second that p(1) = B, and the last part is a new polynomial chosen specially to be 0 at both 0 and 1, so as not to perturb the p(0) and p(1) values (thanks to Sam Hocevar for explaining this method to me).

From previous section we have:
p(x) = acos(x) / sqrt(1-x)
Thus:
q(x) = (-A – x (B – A) + acos(x) / sqrt(1-x)) / (x (x – 1))

We want A = p(0) = HALF_PI and B = p(1) = sqrt(2) (this is the limit when x tends to 1). The final equation is then:
q(x) = (-HALF_PI – x (sqrt(2) – HALF_PI) + acos(x) / sqrt(1-x)) / (x (x – 1))

Translated to Mathematica this gives us:

```
acosApproxPositiveRelative[x_] = Sqrt[1 - x] *
  Expand[1.570796 + x (Sqrt[2] - 1.570796) + x * (x - 1) *
    MiniMaxApproximation[(-1.570796 - x*(Sqrt[2] - 1.570796) +
      ArcCos[x]/Sqrt[1 - x])/(x*(x - 1)), {x, {0.001, 0.99999}, 1, 0}][[2, 1]]]
```

I directly use the value 1.570796 because this is what we use in the code for HALF_PI. Unlike previously, I need to play with the input range for both the maximum relative and absolute error to get the best results. Graphing the functions better highlights the impact of taking the bounds into account: the maximum relative and absolute errors can be higher due to the added constraint of respecting the bounds, but the function now behaves correctly at the bounds. David H. Eberly in his GPGPU book discusses an implementation to get polynomial minimax approximations. His approach takes into account the boundary values and minimizes the relative error. He has gathered all the coefficients calculated with his method in a freely readable header file, for all inverse trigonometric functions and various degrees. The coefficients he gets are similar to the ones I get with the same maximum relative error. As his method works for all degrees (the method I present here works only for degree >= 3 in Mathematica), I will add his results to mine. Here is a summary of all the results for ACos:

| Name | Cost (FR) | Input          | Max abs err | Max rel err | Bounds | C0       | C1         | C2        | C3         |
|------|-----------|----------------|-------------|-------------|--------|----------|------------|-----------|------------|
| P1   | 12        | [0, 0.9999]    | 6.1 x 10^-3 | 3.9 x 10^-3 | No     | 1.56467  | -0.155972  |           |            |
| P1   | 12        | [0, 0.713]     | 3.5 x 10^-3 | 1.1 x 10^-2 | No     | 1.56734  | -0.16882   |           |            |
| P1   | 12        | Eberly         | 9.0 x 10^-3 | 7.8 x 10^-3 | Yes    | 1.570796 | -0.1565827 |           |            |
| P2   | 14        | [0, 0.9999]    | 6.1 x 10^-4 | 3.9 x 10^-4 | No     | 1.57018  | -0.201877  | 0.0464619 |            |
| P2   | 14        | [0, 0.822]     | 3.7 x 10^-4 | 1.7 x 10^-3 | No     | 1.57042  | -0.205107  | 0.0512521 |            |
| P2   | 14        | Eberly         | 8.2 x 10^-4 | 6.3 x 10^-4 | Yes    | 1.570796 | -0.203471  | 0.0468878 |            |
| P3   | 16        | [0, 0.9999]    | 7.1 x 10^-5 | 4.5 x 10^-5 | No     | 1.57073  | -0.212053  | 0.0740935 | -0.0186166 |
| P3   | 16        | [0, 0.8776]    | 4.6 x 10^-5 | 2.6 x 10^-4 | No     | 1.57075  | -0.21271   | 0.0764532 | -0.0206453 |
| P3   | 16        | Eberly         | 8.8 x 10^-5 | 6.5 x 10^-5 | Yes    | 1.570796 | -0.2125329 | 0.0747737 | -0.0188236 |
| P3   | 16        | [0.158, 0.694] | 6.2 x 10^-5 | 1.0 x 10^-4 | Yes    | 1.570796 | -0.212936  | 0.0761885 | -0.0198346 |

As we can see, taking the bounds into account has an impact on the precision; whether correct values at the bounds are required depends on usage.
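The boundary constraint can also be verified directly from the coefficients: for a bounds = Yes row, C0 must equal HALF_PI (so the approximation returns exactly PI/2 at 0) and the coefficients must sum to sqrt(2) (the limit of acos(x)/sqrt(1-x) at 1). A check with Eberly's degree-3 row (my addition):

```c
#include <math.h>

/* Eberly's degree-3 ACos coefficients (bounds = Yes row of the table). */
static const double eberly_p3[4] = { 1.570796, -0.2125329, 0.0747737, -0.0188236 };

/* p(0) is just C0, and should be HALF_PI. */
double eberly_p_at_0(void)
{
    return eberly_p3[0];
}

/* p(1) is the sum of the coefficients, and should be sqrt(2). */
double eberly_p_at_1(void)
{
    return eberly_p3[0] + eberly_p3[1] + eberly_p3[2] + eberly_p3[3];
}
```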

## Optimizing asin

The process to optimize asin is exactly the same as for acos. However, this time we also need to introduce a bias, because we will use the relation asin(x) = PI/2 - acos(x) for the minimax approximation.
asin(x) = PI/2 - acos(x) = PI/2 - sqrt(1-x) * p(x)
(PI/2 - asin(x)) / sqrt(1-x) = p(x)

The GPU code is almost identical to the ACos one and has the same cost, so I only show the polynomial degree 3 version:

```
// polynomial degree 3
// input [-1, 1] and output [-PI/2, PI/2]
float ASin(float inX)
{
    float x   = abs(inX);
    float res = ((C3 * x + C2) * x + C1) * x + C0; // p(x)
    res       = HALF_PI - sqrt(1.0f - x) * res;

    return (inX >= 0) ? res : -res; // Undo range reduction
}

// or more optimal, folding the bias into a single mad:

float ASin(float inX)
{
    float x   = abs(inX);
    float res = ((-C3 * x + -C2) * x + -C1) * x + -C0; // -p(x)
    res       = res * sqrt(1.0f - x) + HALF_PI;

    return (inX >= 0) ? res : -res; // Undo range reduction
}

// ISA GCN
// 4 VGPR, 16 FR (12 FR, 1 QR)
v_mov_b32     v0, #-C3
v_mov_b32     v1, #-C2
v_mov_b32     v1, #-C1
v_mov_b32     v0, #-C0
v_sqrt_f32    v1, v1
v_mul_f32     v0, v0, v1
v_cmp_ge_f32  vcc, s0, 0
```

asin can also be defined directly based on acos:

```
// input [-1, 1] and output [-PI/2, PI/2]
float ASin(float x)
{
    return HALF_PI - ACos(x);
}
```

This version costs an extra 1 FR for the sub. However, depending on the context, it can be way more efficient. If you call both ACos and ASin in the same function with the same input, it is better to implement ASin as dependent on ACos: the compiler is then able to do more aggressive optimization. But it is best to do this manually when required and use the first implementation otherwise. As always, know what you are doing.

For Mathematica, only the first step changes:
– Call MiniMaxApproximation with (PI/2 - asin(x)) / sqrt(1-x) on range [0, 1]

This produces the same coefficients as ACos, with the same maximum error, except that for the GPU code we negate the coefficients.

Remark: like for ACos, we can take the bounds into account. We might expect a gain with the ASin version because HALF_PI is present twice, so we could hope to save one v_mov instruction. But it is not the case, as one of the HALF_PI uses an immediate add, so the v_mov remains.

`v_add_f32     v1, #0x3fc90fd8, v1    //1.570796`

## Optimizing atan

For atan a polynomial is sufficient. Here I will present a few methods at the same time. We have atan(0) = 0, so there is no constant term. We may perform the approximation as
f(x) = x * p(x)
f(x) / x = p(x)
However, as atan is an odd function, better accuracy can be obtained with an odd polynomial. We require a change of variable to handle the odd polynomial (see the lolremez documentation for more information, “Remez tutorial 3/5: changing variables for simpler polynomials”):
f(x) = x * p(x^2)
f(sqrt(x)) / sqrt(x) = p(x)

Lastly, a minimax approximation based on an odd polynomial will work on the range [-1, 1], allowing us to exploit another atan identity, as explained in the introduction:
atan(x) = PI/4 + atan((x - 1) / (x + 1)) for x >= -1
resulting in cheaper code.

The optimized GPU code for atan is:

```
#define PI 3.141593f
#define HALF_PI 1.570796f
#define QUATER_PI 0.785398f
// p(x) coefficient
#define C0 ...
#define C1 ...
#define C2 ...
...

// Common function, ATanPos is implemented below
// input [-infinity, infinity] and output [-PI/2, PI/2]
float ATan(float x)
{
    float t0 = ATanPos(abs(x));
    return (x < 0.0f) ? -t0 : t0; // undo range reduction
}
```
```
// polynomial degree 2
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x < 1.0f) ? x : 1.0f / x;
    float t1 = (C2 * t0 + C1) * t0; // p(x)
    return (x < 1.0f) ? t1 : HALF_PI - t1; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 14 FR (10 FR, 1 QR), 2 scalar
v_rcp_f32     v0, abs(s0)
v_cmp_gt_f32  vcc, 1.0, abs(s0)
v_mov_b32     v1, s0
v_mov_b32     v1, #C1
s_mov_b32     s1, #C2
v_mac_f32     v1, s1, v0
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #HALF_PI
v_cmp_lt_f32  vcc, s0, 0
```
```
// odd polynomial degree 3
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x < 1.0f) ? x : 1.0f / x;
    float t1 = (C3 * t0 * t0 * t0) + C1 * t0; // p(x)
    return (x < 1.0f) ? t1 : HALF_PI - t1; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 14 FR (10 FR, 1 QR), 2 scalar
v_cmp_gt_f32  vcc, 1.0, abs(s0)
v_rcp_f32     v0, abs(s0)
v_mov_b32     v1, s0
v_mul_f32     v1, #C2, v0
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #HALF_PI
v_cmp_lt_f32  vcc, s0, 0
```
```
// odd polynomial degree 3 alternate
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x - 1.0f) / (x + 1.0f);
    float t1 = (C3 * t0 * t0 * t0) + C1 * t0; // p(x)
    return QUATER_PI + t1; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 12 FR (8 FR, 1 QR)
v_rcp_f32     v1, v1
v_mul_f32     v0, v0, v1
v_mul_f32     v1, #C3, v0
v_cmp_lt_f32  vcc, s0, 0
```
```
// polynomial degree 3
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x < 1.0f) ? x : 1.0f / x;
    float t1 = ((C3 * t0 + C2) * t0 + C1) * t0; // p(x)
    return (x < 1.0f) ? t1 : HALF_PI - t1; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 15 FR (11 FR, 1 QR), 2 scalar
v_rcp_f32     v0, abs(s0)
v_cmp_gt_f32  vcc, 1.0, abs(s0)
v_mov_b32     v1, s0
v_mov_b32     v1, #C2
s_mov_b32     s1, #C3
v_mac_f32     v1, s1, v0
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #HALF_PI
v_cmp_lt_f32  vcc, s0, 0
```
```
// odd polynomial degree 5
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x < 1.0f) ? x : 1.0f / x;
    float t1 = t0 * t0;
    float poly = C5;
    poly = C3 + poly * t1;
    poly = C1 + poly * t1;
    poly = poly * t0;
    return (x < 1.0f) ? poly : HALF_PI - poly; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 16 FR (12 FR, 1 QR), 2 scalar
v_rcp_f32     v0, abs(s0)
v_cmp_gt_f32  vcc, 1.0, abs(s0)
v_mov_b32     v1, s0
v_mul_f32     v1, v0, v0
v_mov_b32     v2, #C3
s_mov_b32     s1, #C5
v_mac_f32     v2, s1, v1
v_mul_f32     v2, v0, v1
s_mov_b32     s1, #HALF_PI
v_cmp_lt_f32  vcc, s0, 0
```
```
// odd polynomial degree 5 alternate
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x - 1.0f) / (x + 1.0f);
    float t1 = t0 * t0;
    float poly = C5;
    poly = C3 + poly * t1;
    poly = C1 + poly * t1;
    poly = poly * t0;

    return QUATER_PI + poly; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 14 FR (10 FR, 1 QR), 2 scalar
v_rcp_f32     v1, v1
v_mul_f32     v0, v0, v1
v_mul_f32     v1, v0, v0
v_mov_b32     v2, #C3
s_mov_b32     s1, #C5
v_mac_f32     v2, s1, v1
v_cmp_lt_f32  vcc, s0, 0
```

I present 3 different implementations (polynomial, odd polynomial and odd polynomial alternate) because they have different maximum absolute and relative errors.
Note: if you know that your atan input is positive (as is the case for the diffuse sphere area light), you can save 2 FR by avoiding the final negation in the ATan call.

The process to approximate atan is similar to the other functions; only the first step changes:
– Call MiniMaxApproximation with atan(x) / x for the polynomial, or atan(sqrt(x)) / sqrt(x) for the odd polynomial, on range [0, 1]

The Mathematica code is:

```
(* Atan minimax approximation *)
(* Range can't be [0-1] because of singularity *)
atanApproxPositive01Relative[x_] =
  x*MiniMaxApproximation[ArcTan[x]/x, {x, {0.0001, 1}, 1, 0}][[2, 1]]
atanApproxPositiveRelative[x_] =
  If[x <= 1, atanApproxPositive01Relative[x^inputExponent], (\[Pi]/2) -
    atanApproxPositive01Relative[(1/x)^inputExponent]]

(* or *)

atanApproxPositive01Relative[x_] =
  Sqrt[x]*MiniMaxApproximation[ArcTan[Sqrt[x]]/Sqrt[x], {x, {0.0001, 1}, 1, 0}][[2, 1]]
atanApproxPositiveRelative[x_] =
  If[x <= 1, atanApproxPositive01Relative[x^inputExponent], (\[Pi]/2) -
    atanApproxPositive01Relative[(1/x)^inputExponent]]

(* or *)

atanApproxPositive01Relative[x_] =
  MiniMaxApproximation[ArcTan[Sqrt[x]]/Sqrt[x], {x, {0.0001, 1}, 1, 0}][[2, 1]]
atanApproxPositiveRelative[x_] =
  0.785398 + ((x - 1)/(x + 1))*atanApproxPositive01Relative[((x - 1)/(x + 1))^2];

(* Then, use an identity to retrieve the whole curve *)
atanApproxRelative[x_] =
  If[x >= 0, atanApproxPositiveRelative[x], -atanApproxPositiveRelative[Abs[x]]];
```

Here is a summary of all the results for ATan. The name PX means polynomial of degree X, OPX means odd polynomial of degree X, and “/ A” is the alternate version. The alternate version uses the same coefficient set and has identical error to OPX. Bold indicates whether the row favors maximum relative or absolute error. Like with acos, David H. Eberly provides coefficients for odd polynomials respecting the boundary atan(1) = PI/4 (or QUATER_PI); I included them in the table:

| Name    | Cost (FR) | Input       | Max abs err | Max rel err | Bounds | C1       | C2         | C3         | C5        |
|---------|-----------|-------------|-------------|-------------|--------|----------|------------|------------|-----------|
| P2      | 14        | [0.0001, 1] | 1.6 x 10^-2 | 2.0 x 10^-2 | No     | 1.01991  | -0.218891  |            |           |
| P2      | 14        | [0.33, 1]   | 3.8 x 10^-3 | 5.9 x 10^-2 | No     | 1.05863  | -0.269408  |            |           |
| OP3 / A | 14 / 12   | [0.0001, 1] | 1.0 x 10^-2 | 1.3 x 10^-2 | No     | 0.987305 |            | -0.211868  |           |
| OP3 / A | 14 / 12   | [0.234, 1]  | 5.4 x 10^-3 | 2.9 x 10^-2 | No     | 0.970592 |            | -0.190604  |           |
| OP3 / A | 14 / 12   | Eberly      | 1.6 x 10^-2 | 2.6 x 10^-2 | Yes    | 1        |            | -0.214602  |           |
| P3      | 15        | [0.0001, 1] | 4.2 x 10^-3 | 5.4 x 10^-3 | No     | 1.00536  | -0.0889206 | -0.135249  |           |
| P3      | 15        | [0.278, 1]  | 1.6 x 10^-3 | 3.1 x 10^-2 | No     | 1.0307   | -0.17239   | -0.0745631 |           |
| OP5 / A | 16 / 14   | [0.0001, 1] | 1.2 x 10^-3 | 1.6 x 10^-3 | No     | 0.998422 |            | -0.301029  | 0.0892423 |
| OP5 / A | 16 / 14   | [0.148, 1]  | 7.2 x 10^-4 | 5.2 x 10^-3 | No     | 0.994792 |            | -0.287946  | 0.079271  |
| OP5 / A | 16 / 14   | Eberly      | 1.4 x 10^-3 | 3.0 x 10^-3 | Yes    | 1        |            | -0.301895  | 0.0872929 |

Note that odd polynomials are really more accurate starting at degree 5; at degree 3 the benefit is mitigated.

Remark: the alternate version has the same maximum error, but the error is spread differently from the normal version. The error curve looks stretched and swapped, meaning that the accuracy for a given value is not the same for both functions. This may be important depending on where the precision matters.
Another remark: to respect atan(1) = PI/4, the sum of all the coefficients must be PI/4. For example with Eberly’s coefficients: 1 - 0.214602 = 0.785398, which is QUATER_PI. However, the first coefficient does not have to be 1 to respect atan(0) = 0.

Another interesting property: it is possible to make the P3 version work on the [-1, 1] range, and thus use the cheap atan identity, with the following GPU code:

```
// polynomial degree 3 alternate
// Tuned for positive input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x - 1.0f) / (x + 1.0f);
    float poly = (C3 * t0 * t0 + C2 * abs(t0) + C1) * t0;
    return QUATER_PI + poly; // undo range reduction
}

// ISA GCN for ATan
// 4 VGPR, 14 FR (10 FR, 1 QR), 1 scalar
v_rcp_f32     v1, v1
v_mul_f32     v0, v0, v1
v_mul_f32     v1, v0, v0
s_mov_b32     s1, #C3
v_mul_f32     v2, abs(v0), s1
v_cmp_lt_f32  vcc, s0, 0
```

This version uses the same coefficients and has the same stretched/swapped error as P3. But it is still less accurate than OP5 A for the same cost.

## Discussion about ATan in practice

So, are we done ? Just take the best accuracy in the tab for a given cost and that’s all ? It is more complex than that. As I said in introduction, always check visually the approximation for your use case. Here is the result of some of the atan approximation above with my diffuse sphere area light: The first row is a sphere lit by a 1m sphere area light, the second row is a sphere lit by a 10cm sphere area light. The atan call of the original function has been replace by ATanPos. The second column with the polynomial of degree 2 use the version which minimize maximum absolute error and exhibit banding (not really visible on the screenshot for the 1m but it is really visible in reality). The third column with odd polynomial of degree 3 which minimize maximum absolute error is totally black instead of having a gradient for 10cm light, otherwise the approximation is good. The fourth column with odd polynomial of degree 3 using Eberly’s coefficients which minimize maximum relative error match well the 10cm light and the 1m have some discrepancy not really visible on the screenshot.

After inspection, the banding region corresponds to input values around 0.1 to the atan function, and it seems very sensitive to the accuracy of the relative error. Here are the absolute and relative errors of OP3 with my coefficients and with Eberly's. You can see how the absolute error is better with my coefficients while the relative errors have a similar maximum. But the distribution around 0.1 is in favor of Eberly, which makes it behave more correctly for my area light test case.

Here is a similar test with more accurate polynomials. The visual artifacts seen previously are still present on the 10cm light. However, for the 1m light all approximations are good. Once again, only Eberly's version produces a pleasant result, and you can't really tell the difference from the original version. The P3 and OP3 versions minimize maximum absolute error. I also include the Unreal Engine 4 (v4.5) ATanPos approximation, see below. Looking at the error graph of the OP5 minimizing absolute error against Eberly: again the error distribution around 0.1 is in favor of Eberly, and even the OP3 Eberly version beats this OP5 around 0.1, hence the banding seen on the screenshot. To sum up: know your use case and input, and choose the best approximation for it. For my area light case, OP3 Eberly is the best despite the numbers.

Remark: about the UE4 version. It is almost equivalent to P3 minimizing maximum relative error and is generated with the following Mathematica code:

`MiniMaxApproximation[ArcTan[x], {x, {0.001, 1}, 3, 0}][[2, 1]]`

And here is the associated GPU code

```
// 15 FR (11 FR + 1 QR), 2 scalar
// Max error 0.0039
// Only valid for x >= 0
MaterialFloat AtanFast( MaterialFloat x )
{
    // Minimax 3 approximation
    MaterialFloat3 A = x < 1 ?
        MaterialFloat3( x, 0, 1 ) : MaterialFloat3( 1/x, 0.5 * PI, -1 );
    return A.y + A.z *
        ( ( ( -0.130234 * A.x - 0.0954105 ) * A.x + 1.00712 ) * A.x - 0.00001203333 );
}
```
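Transcribed to Python, the UE4 approximation can be compared against the reference atan (a sketch; the branch is written out explicitly):

```python
import math

def atan_fast_ue4(x):
    # UE4's AtanFast transcribed from the HLSL above; valid for x >= 0
    if x < 1.0:
        a_x, a_y, a_z = x, 0.0, 1.0
    else:
        a_x, a_y, a_z = 1.0 / x, 0.5 * math.pi, -1.0  # atan(x) = PI/2 - atan(1/x)
    return a_y + a_z * (((-0.130234 * a_x - 0.0954105) * a_x + 1.00712) * a_x
                        - 0.00001203333)

max_err = max(abs(atan_fast_ue4(i / 100.0) - math.atan(i / 100.0))
              for i in range(0, 10001))  # x in [0, 100]
assert max_err < 0.005  # consistent with the commented max error of 0.0039
```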

The Unreal Engine 4 version costs 2 extra FR (it is an AtanPos version) compared to the positive polynomial version of degree 3. This is due to the way I reordered the GCN instructions to be more efficient.

Remark 2: Michal Drobot has published a useful GPU fast math lib , with some inverse trigonometric approximations based on the same kind of approximation method. Note that the atanFast4 he provides is the polynomial of degree 3 alternate I provide here, but without the undo of the range reduction, thus in the range [-1, 1]. Also it is actually 7 FR, not 12 FR as said in the comment, and the stated maximum error is wrong too.

Remark 3: atan2 can be implemented with the atan identity used for the odd polynomial alternate.

```
// odd polynomial degree 3
float ATan2Pos(float x, float y)
{
    float t0 = (x - y) / (x + y);
    float t1 = (C3 * t0 * t0 * t0) + C1 * t0; // p(x)
    return QUATER_PI + t1; // undo range reduction
}
```
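This can be cross-checked against a reference atan2. A Python sketch; the C1/C3 values below are a commonly quoted odd degree 3 minimax fit of atan on [-1, 1], used here for illustration only (the post's actual coefficients may differ slightly):

```python
import math

# Illustrative coefficients: a common odd degree 3 minimax fit of atan on [-1, 1]
C1 = 0.97239411
C3 = -0.19194795
QUATER_PI = math.pi / 4

def atan2_pos(x, y):
    # valid for x > 0, y > 0; output in [0, PI/2]
    t0 = (x - y) / (x + y)            # range reduction onto [-1, 1]
    t1 = C3 * t0 * t0 * t0 + C1 * t0  # p(t0)
    return QUATER_PI + t1             # undo range reduction

# math.atan2(x, y) returns atan(x / y) for x, y > 0
samples = [(1.0, 1.0), (0.1, 1.0), (3.0, 0.5), (2.0, 7.0)]
max_err = max(abs(atan2_pos(x, y) - math.atan2(x, y)) for x, y in samples)
assert max_err < 0.01
```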

## Floating point representation

The generated minimax polynomial coefficients are typically computed using high-precision arithmetic. It is well known that simply rounding those coefficients to machine precision leads to suboptimal accuracy in the resulting implementation; see the lolremez documentation, "Remez tutorial 5/5: additional tips" , for an example. That means a simple epsilon change in a coefficient, due to rounding to float, can increase the absolute error. Mathematica works with machine precision by default, which is good for us, but it is double, not float. I haven't been able to find a way to force Mathematica to work with float representation. However, I have done some tests comparing the accuracy of the coefficients rounded to float against double for the GPU inverse trigonometric functions of this post, and the error increase is too small to matter. All coefficients I provide here are representable in float and do not change the max error significantly enough to require special treatment.
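Here is a quick way to check this on the CPU, a Python sketch using Eberly's OP5 coefficients from this post and `struct` to round through float32 (since the GPU works in float):

```python
import math
import struct

def to_f32(x):
    # round a double to the nearest float32 and back
    return struct.unpack('f', struct.pack('f', x))[0]

def atan_op5(x, c3, c5):
    # odd degree 5 polynomial for atan on [0, 1]: x + c3*x^3 + c5*x^5
    t1 = x * x
    return ((c5 * t1 + c3) * t1 + 1.0) * x

def max_err(c3, c5):
    return max(abs(atan_op5(i / 1000.0, c3, c5) - math.atan(i / 1000.0))
               for i in range(1001))

err_double = max_err(-0.301895, 0.0872929)
err_float = max_err(to_f32(-0.301895), to_f32(0.0872929))
# rounding these coefficients to float barely changes the max error
assert abs(err_double - err_float) < 1e-6
```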

In case you are interested in minimizing with floating-point accuracy, the lolremez documentation  provides an example. Brisebarre et al. also describe a method to do floating-point minimax approximation , which is implemented in the Sollya software.

## Conclusion

Here are the Mathematica files including all the code and diagrams of this post, one PDF: GPUInverseTrigoForGCN_v3, and the notebook: GPUInverseTrigoForGCN_nb_v3 (right click, save, rename ".pdf" to ".nb" as wordpress doesn't support zip files).

To conclude this long post, we have seen that we have plenty of control over the accuracy and performance of the approximations of inverse trigonometric functions. But keep in mind that the final code must be a win on the GPU compared to the original code, else it is useless. We have observed that we should not rely on the macro instructions like acos/asin/atan provided by the GPU compiler because they are too generalist; we must provide our own for our use case. The second lesson is that we should always check visually if an approximation is good. Numbers don't detect edge cases.

So which approximations to use? It depends on context, as seen in my area light use case. I provide many approximations in this post with various costs; the goal is to take the cheapest approximation fitting your use case. Remember that you can have multiple implementations of the function in your code. But let's try a simple answer. If you want a fast and reliable implementation for the general case in a game, I would use the ones provided by Eberly . I think it is good to have the approximation matching the original function at the interval boundaries (0, 1 and -1), and better to minimize relative error when only visuals matter. To sum up this simple case:

```
#define PI 3.141593f
#define HALF_PI 1.570796f

// max absolute error 9.0x10^-3
// Eberly's polynomial degree 1 - respect bounds
// 4 VGPR, 12 FR (8 FR, 1 QR), 1 scalar
// input [-1, 1] and output [0, PI]
float ACos(float inX)
{
    float x = abs(inX);
    float res = -0.156583f * x + HALF_PI;
    res *= sqrt(1.0f - x);
    return (inX >= 0) ? res : PI - res;
}

// Same cost as ACos + 1 FR
// Same error
// input [-1, 1] and output [-PI/2, PI/2]
float ASin(float x)
{
    return HALF_PI - ACos(x);
}

// max absolute error 1.3x10^-3
// Eberly's odd polynomial degree 5 - respect bounds
// 4 VGPR, 14 FR (10 FR, 1 QR), 2 scalar
// input [0, infinity] and output [0, PI/2]
float ATanPos(float x)
{
    float t0 = (x < 1.0f) ? x : 1.0f / x;
    float t1 = t0 * t0;
    float poly = 0.0872929f;
    poly = -0.301895f + poly * t1;
    poly = 1.0f + poly * t1;
    poly = poly * t0;
    return (x < 1.0f) ? poly : HALF_PI - poly;
}

// 4 VGPR, 16 FR (12 FR, 1 QR), 2 scalar
// input [-infinity, infinity] and output [-PI/2, PI/2]
float ATan(float x)
{
    float t0 = ATanPos(abs(x));
    return (x < 0.0f) ? -t0 : t0;
}
```
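The stated maximum errors can be cross-checked on the CPU. A Python transcription of the two core functions, sampling the input range:

```python
import math

HALF_PI = math.pi / 2

def acos_approx(in_x):
    # Eberly's polynomial degree 1, respects bounds; input [-1, 1]
    x = abs(in_x)
    res = -0.156583 * x + HALF_PI
    res *= math.sqrt(1.0 - x)
    return res if in_x >= 0 else math.pi - res

def atan_pos_approx(x):
    # Eberly's odd polynomial degree 5; input [0, infinity]
    t0 = x if x < 1.0 else 1.0 / x
    t1 = t0 * t0
    poly = ((0.0872929 * t1 - 0.301895) * t1 + 1.0) * t0
    return poly if x < 1.0 else HALF_PI - poly

acos_err = max(abs(acos_approx(i / 1000.0) - math.acos(i / 1000.0))
               for i in range(-1000, 1001))
atan_err = max(abs(atan_pos_approx(i / 100.0) - math.atan(i / 100.0))
               for i in range(0, 1001))
assert acos_err < 1e-2  # stated max absolute error: 9.0x10^-3
assert atan_err < 2e-3  # stated max absolute error: 1.3x10^-3
```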

Compare this to the original cost of the inverse trigonometric functions (reminder):

The PS4 macro version:
acos: 48 FR (40 FR, 2 QR), 2 DB, 12 VGPR
asin: 48 FR (40 FR, 2 QR), 2 DB, 1 scalar instruction, 12 VGPR
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

Or, to be fair, compared to the Cg reference implementation:
acos: 19 FR (14 FR, 1 QR), 4 VGPR
asin: 18 FR (13 FR, 1 QR), 4 VGPR, 1 scalar instruction
atan: 23 FR (19 FR, 1 QR), 2 scalar, 8 VGPR

This is a win. Not using the default macro instructions reduces VGPR usage in all cases. Remember to only use the function required for your case: ATanPos saves 2 FR compared to ATan. Also keep in mind that if you make multiple calls to the same inverse trigonometric function, the compiler can factor the code a bit. For example, the solid angle of a rectangle light requiring 4 acos calls is 42 FR (26 FR, 4 QR) and not 48 FR.
For my diffuse sphere area light, I use Eberly's ATan above and the degree 1 ACos version of this post minimizing absolute error, and it works very well.

To conclude, the approach described in this post can be applied to any function; inverse trigonometric functions are just a subset of what you can do. See  for various examples.

## References

– Quilez, "Avoiding trigonometry", http://iquilezles.org/www/articles/noacos/noacos.htm
– Drobot, "Low level optimization for GCN", http://michaldrobot.com/2014/05/12/low-level-optimizations-for-gcn-digital-dragons-2014-slides/
– Persson, "Low level shader optimization for next gen and DX11", http://www.humus.name/index.php?page=Articles&ID=9
– Mah, "The AMD GCN architecture, a crash course", http://fr.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
– Cg documentation, http://http.developer.nvidia.com/Cg/acos.html (also asin.html, atan.html, atan2.html)
– Green, "Faster math functions", http://www.ue.eti.pg.gda.pl/~wrona/lab_dsp/cw04/fun_aprox.pdf
– Hocevar, "Better function approximations: Taylor vs. Remez", http://lolengine.net/blog/2011/12/21/better-function-approximations
– Wikipedia, "Remez algorithm", http://en.wikipedia.org/wiki/Remez_algorithm
– Hocevar, "LolRemez", http://lolengine.net/wiki/oss/lolremez
– Eberly, "GPGPU Programming for Games and Science", http://www.amazon.co.uk/GPGPU-Programming-Games-Science-Eberly/dp/1466595353
– Sollya software, http://sollya.gforge.inria.fr/
– Brisebarre, Muller, Tisserand, "Computing Machine-Efficient Polynomial Approximations", hal.inria.fr/inria-00119513/fr/
– Eberly, "GteConstants.h", http://www.geometrictools.com/GTEngine/Include/GteConstants.h
