# Spherical Gaussian approximation for Blinn-Phong, Phong and Fresnel

Spherical Gaussian approximation for lighting calculation is not new to the graphic community but its use has  recently been adopted by the game developer community .
Spherical Gaussian (SG) is a type of spherical radial basis function (SRBF)  which can be used to approximate spherical lobes with Gaussian-like function.
This post will describe how SG can be use to approximate Blinn-Phong lighting, Phong lighting and Fresnel.

Why care about SG approximation ?
In the context of realtime rendering for games, the SG approximation allows to save a few instructions when performing lighting calculations. For modern graphics cards, saving few ALU in a shader is not always beneficial, but for older hardware like one can find in the PS3, every GPU cycle counts. This is less true for XBOX360 GPU, which performs better with arithmetic instructions, though this optimization can still be beneficial. It can also be used to schedule instructions in a different (low loaded) pipe on SPUs to increase performance .

Part of the work presented here credits to Matthew Jones (from Criterion Games) .

## Different spherical radial basis function

The post talk about SG, but it is good to know that there is several different types of SRBF found in graphics papers.
For completeness, I will talk quickly about SG and von Mises-Fisher (vMF) as this can be confusing.

SG definition can be found in : $G(v; p,\lambda,\mu)=\mu e^{\lambda(v.p -1)}$ where p ∈ 𝕊2 is the lobe axis, λ ∈ (0,+∞) is the lobe sharpness,
and μ ∈ ℝ is the lobe amplitude (μ ∈ ℝ3 for RGB color).

vMF is detailed in : $\gamma(n.\mu;\theta) = \frac{k}{4\pi \sinh(k)} e^{k(n.\mu)}$ with inverse width k and central direction μ. vMFs are normalized to integrate
to 1. Note: the notation from the paper  doesn’t match the notation from above.
The paper presents an approximation of this formula for k > 2 (which is almost always the case) : $\gamma(n.\mu;\theta)\approx \frac{k}{2\pi} e^{-k(1-n.\mu)}$

As you can see, the formulation stays the same for a lobe sharpness > 2, the difference lies only in the normalization constant for vMF, which we will omit. In the following, I will only refer to the SG function.

## Approximate Blinn-Phong with SG

An approximation of the Blinn-Phong model with SG is provided in the supplementary material of : $D(h)=(h.n)^k \approx e^{-k(1-(h.n))}$

Here is the comparison of the accuracy of the approximation for low specular power (<10), medium (25) and high( > 50)    As shown above, the approximation is accurate for medium and high specular values. From my test, above 15 is accurate. For really low specular power ( < 8) however, the result is a little poor but it still acceptable for a game.

Let’s analyze the performance. Translating the expression from above into code results to

float dotHN = saturate(dot(H,N));
pow(dotHN, K) = exp(-K*(1-dotHN)) A power function is generally implemented on the GPU as

pow(a, b) = exp(log(a) * b)

with exp (the base-e exponential) and log (the base-e logarithm). Exp/log instructions are often not implemented in hardware,  and are replaced by both the base-2 exponential and the base-2 logarithm.

exp(a) = exp2(a/log(2))
log(a) = log(2) * log2(a)

So

pow(a,b) = exp2(log(2)/log(2) * log2(a) * b) = exp2(log2(a)*b)

which is 3 ALU instructions: LOG2, MUL, EXP2

Going back to our SG approximation

pow(dotHN, K) = exp(-K*(1-dotHN) = exp2(-(K/log(2))(1-dotHN))

Refactoring to be more GPU friendly

float A=K/log(2)
pow(dotHN, K)=  exp2(A * dotHN - A)

which is 3 ALU instructions  : MUL, MAD, EXP2
Here we have no gain on instruction count. The goal is to be able to process the K/log(2) offline (on the CPU) or “burned-in” a previous computation in order to save the MUL instruction.

A good starting point would be to factor in the constant multiplication at decompression time of the glossiness factor. In our game we store glossiness in a texture as [0..1] and decompress it to a range of [2..2048] (see Adopting a physically based shading model)

SpecularPower = exp2(10 * gloss + 1)

Which is a MAD and a EXP2. With SG approximation we get

ModifiedSpecularPower = 1/log(2) * exp2(10 * gloss + 1)
= exp2(10 * gloss + 1 + log2(1/log(2)))

By precomputing the constant (the GPU compiler will do it anyway), we still have a MAD and a EXP2

#define Log2Of1OnLn2_Plus1    1.528766
ModifiedSpecularPower = exp2(10 * gloss + Log2Of1OnLn2_Plus1);
SpecularLighting = exp2(ModifiedSpecularPower * dotHN - ModifiedSpecularPower);

If you use a normalization term for energy conservation (see Adopting a physically based shading model), it needs to take into account the ModifiedSpecularPower

(SpecularPower + 2)/8 = SpecularPower * 0.125 + 0.25
= ModifiedSpecularPower * Log(2) * 0.125 + 0.25
= ModifiedSpecularPower  * 0.08664 + 0.25

Which ends up as a single MAD. The complete code is

#define LN2DIV8               0.08664
#define Log2Of1OnLn2_Plus1    1.528766
float SphericalGaussianApprox(float CosX, float ModifiedSpecularPower)
{
return exp2(ModifiedSpecularPower* CosX - ModifiedSpecularPower);
}
ModifiedSpecularPower = exp2(10 * gloss + Log2Of1OnLn2_Plus1);
SpecularLighting = (LN2DIV8 * ModifiedSpecularPower + 0.25) * SphericalGaussianApprox(DotNH, ModifiedSpecularPower);

Compared to a standard power implementation, we effectively save 1 instruction.

Better approximation

SG approximation is rather poor for low specular power. Matthew Jones    proposes to add a constant of one in the equation for better accuracy, based on the standard Fisher concentration factor given in the literature for the vMF function. He advertises that optimal solution lie between 0 and 1. $D(h)=(h.n)^k \approx e^{(k+x)((h.n)-1)}$

Therefore, I used Mathematica to find a best-fit for x. Here are two graphs showing the best-fit result for x on a small range,  as well as for a high range of SpecularPower on the horizontal axis. The Mathematica file is provided at the end of this blog post.  Graphs show us that the best-fit solution will converge as we increase SpecularPower to a value of x=0.75.

However, tests show that for x=0 or x = 1, we already get accurate results for medium and high specular. We can therefore focus on approximating x only for low SpecularPower range, as this is the range which causes trouble. Specular powers of 1 and 2 perform badly (you can see figure in Mathematica file), and will not be taken into account in the process. For SG approximation, it is best to only approximate SpecularPower > 2.

Taking the mean of x solution for SpecularPower ranging from 3 to 15 we get x=0.775. The following expression could be a good candidate for more accurate SG approximation $D(h)=(h.n)^k \approx e^{(k+0.775)((h.n)-1)}$

Let’s compare

The graph shows that the new formulation (with x=0.775) is a better match for the power curve, but the accuracy is not distributed equally and x = 0 performs better at a higher peak. For SpecularPowers of 10 and above, the difference is subtle. From a performance point of view:

pow(dotHN, K) = exp((K+0.775)*(dotHN-1)) = exp2((K+0.775)/Log(2)*(dotHN-1)

The problem here compare to previous formulation is the added 0.775/log(2). It will add a ADD instruction which can’t be hidden in previous computation (but it can be precomputed offline if gloss factor is a know value).
So we get ADD, MAD, EXP2 which is 3 instructions like our original power function. However, the LOG2 has been replaced by a ADD which benefits from better scheduling on PS3, so this still an improvement…

– A exponential function is non-zero everywhere compared to a power function which can reach zero

exp(1000 * 0 - 1000) != 0
pow(0, 1000) == 0

In practice, this is not an issue because you (should) multiply your specular BRDF by saturate(dot(N,L)), allowing zero at the boundaries of the hemisphere.

– With a power function implemented as exp2(log2(a) * b),  for a = 0 we get exp2(-inf * 0) = exp2(NaN) = NaN.
With an SG approximation, we never get a NaN.

– Other BRDFs can be represented with SG-like distributions, such as Ward or Beckmann 

## Approximate Phong with SG

In the previous section, we approximated a cosine lobe for the Blinn-Phong shading model. Similarly, this approximation can be implemented for the Phong shading model. In this case, with multiple lights, we can get a significant improvement.

The Phong lighting model require the angle between R and V where R is the reflected light around normal or for more efficiency, R and L with R the reflected view vector around normal.

pow(dot(R,L) = exp2(ModifiedSpecularPower * dot(R,L) - ModifiedSpecularPower);

We refactor this code as

float4 R2 = float4(ModifiedSpecularPower * R, -ModifiedSpecularPower);
SpecularLighting = exp2(dot(R2, float4(L, 1)));

Each added light will be represented by a DOT4 and an EXP2.

float4 R2 = float4(ModifiedSpecularPower * R, -ModifiedSpecularPower);
for (int LightIdx = 0, LightIdx < NumLight; ++LightIdx)
{
SpecularLighting = exp2(dot( R2, float4(L[LightIdx], 1) ) );
(...)
}

## Approximate Fresnel with SG

SG can also be used to improve the Schlick  approximation of the Fresnel term.

float dotEH = saturate(dot(E, H))
FresnelTerm = SpecularColor + (1.0f - SpecularColor) * pow(1 - dotEH, 5);

Let’s focus on the second part

pow(1 - dotEH, 5)

Which is a SUB, LOG2, MUL, EXP2
We will apply the SG approximation to this expression with the added constant from the first section. In this case, we have a know the SpecularPower value in order to not add any additional instructions. We can also choose a constant x as the best fit solution for SpecularPower 5 which is x = 0.788 (see Mathematica file at end of the post for details of this result).

pow(1 - dotEH, 5) = exp2((5 + 0.788)/Log(2) * ((1-dotEH) - 1)) = exp2(-8.35 * dotEH)

Which is MUL, EXP2. We save 50% instructions.
Here is a comparison graph It looks pretty good for a 50% gain. However in practice, the subtle difference can become visible, especially with small angles. Taking an angle of 0, with the original Fresnel we get 0 for Fresnel value, but for SG approximation we get a value of exp2(-8.35) = 0,003 which in the context of HDR light and linear space lighting can become a visible artifact. It is possible to get a better approximation of Fresnel by using a 2nd-order polynomial instead of a simple constant for the value inside the exp2.

pow(1 - dotEH, 5) = exp2(x * (dotEH * dotEH) + y * dotEH))

Fitting x and y in Mathematica, we get:

pow(1 - dotEH, 5) = exp2(-5.55473 * (dotEH * dotEH)- 6.98316 * dotEH))
= exp2((-5.55473 * EdotH - 6.98316) * EdotH)

We refactor the expression using Horner’s form since it is more GPU friendly (because it is composed of MADDs), the last line resulting in 3 instruction MAD, MUL, EXP2. This allows us to save one additional instruction.

Graph comparison This solution seems accurate enough. At an angle of 0, we get exp2(-12.53789)^=0.000168, which is 17 times smaller than previous approximation. We will still see artifacts in the extreme case, but the behavior is quite improved over the original implementation. In the end, the choice of an approximation depends on your tolerance to error.
Last, as Christian Schüler remember in the comment, it should be note that the exponentiation to 5 can simply be done with SUB, MUL, MUL, MUL instructions : First ^2, second ^4, third ^5 by multiplying with the original value. Even if we conserve the same number of instruction as the original form, this improve scheduling.

Edit 2015: Be careful, this optimization was valid on old NVidia graphic card like those found on PS3, on AMD GCN thing are different and using only mul is faster.

## Conclusion

This post provided an application of a SG approximation, where the goal was to save a few ALU instructions. Again, for modern GPUs, this is not necessarily beneficial, but for older mainstream hardware, such as the PS3 and XBOX360 GPUs, it is a quite useful tool to have in your pocket.

The Mathematica file SphericalGaussianApproximation_blog contains all the figures and results presented in this post. Feel free to try it out and play with it. Any feedback is welcomed.

I would like to thanks Colin Barré-Brisebois for reviewing the post and correcting my english 🙂

## Reference

 Kaplanyan, “CryENGINE 3: Reaching the Speed of Light” http://advances.realtimerendering.com/s2010/index.html
 Coffin, “SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3″ http://www.slideshare.net/DICEStudio/spubased-deferred-shading-in-battlefield-3-for-playstation-3
 Barré-Brisebois, “Approximating Translucency Revisited – With “Simplified” Spherical Gaussian Exponentiation”, http://colinbarrebrisebois.com/2012/04/09/approximating-translucency-revisited-with-simplified-spherical-gaussian/
 Wang, Ren, Gong, Snyder, Guo, “All-Frequency Rendering with Dynamic, Spatially-Varying Reflectance” http://research.microsoft.com/en-us/um/people/johnsny/papers/sg.pdf and http://research.microsoft.com/en-us/um/people/jpwang/paper_stuffs/sg_supp.pdf
 Iwasaki, Furuya, Dobashi, Nishita, “Real-time Rendering of Dynamic Scenes under All-frequency Lighting using Integral Spherical Gaussian” http://www.wakayama-u.ac.jp/~iwasaki/project/ibl/eg2012.pdf
 de Rousiers, Bousseau, Subr, Holzschuch, Ramamoorthi, “Real-Time Rough Refraction” http://graphics.berkeley.edu/papers/DeRousiers-RRR-2011-02/DeRousiers-RRR-2011-02.pdf
 Han, Sun, Ramamoorthi, Grinspun, “Frequency Domain Normal Map Filtering” http://www.cs.columbia.edu/cg/normalmap/index.html
 Tsai, Shih, “All-frequency precomputed radiance transfer using spherical radial basis functions and clustered tensor approximation” http://www.yzugraphics.cse.yzu.edu.tw/research/papers/TOG_2006/Paper.pdf
 Personal communication with Matthew Jones of Criterion games

### 24 Responses to Spherical Gaussian approximation for Blinn-Phong, Phong and Fresnel

1. Timothy Lottes says:

Does GPU precision of log2 and exp2 have a visible effect on these approximations?

2. seblagarde says:

I am not sure of your question.

If you talk about the GPU precision for the hardware implementation of log2/exp2 you know better than me :), but as pow is implemented with log2/exp2 (at least on PS3/XBOX360) this is the same result, or maybe slightly better due to the fact that we have only one exp2 involve.

If you talk about half/float precision, pow and these approximations have same problem.
For high specular power (like 1024) I get banding artifact when I use half everywhere, both with pow and SG. To fix this, I use

half SphericalGaussianApprox(float CosX, half ModifiedRoughness)
{
return exp2(ModifiedRoughness * CosX – ModifiedRoughness);
}

return ((LN2 * Glossiness+ 2) / 8) * (float)SphericalGaussianApprox(DotNH, Glossiness);

However, when using the flag “–fastprecision” for PS3 GPU compiler, it still optimize to half and I get banding.
For low specular, no visible issue as for Fresnel approximation.

Is it what you ask ?

• Timothy Lottes says:

Yeah, PS3 is what I was asking about, since I’m now very rusty there. Read-back of GPU results of those approximations would be interesting. BTW, on newer GPUs (at least for sure on NVIDIA) pow is still based on hardware log2 and exp2, also no half and a change to fused multiply add (better intermediate precision).

3. Brian Karis says:

Thank you very much for writing this. I tried using your previous spherical gaussian for fresnel and did not get good results for the reasons you list here. Dielectrics looked much brighter than they should. Plotting the relative error showed the reason. I haven’t tried your new approximation in a shader yet but the graphs look much better. Here is a comparison of the new approximation vs the old of the relative error for a 4% specular reflecting material:

I have on the other hand liked the results I’ve gotten for the spherical gaussian approximation for the specular distribution. I use the constant of 1.

• seblagarde says:

Thank you for your feedback. I use the new Fresnel approximation in our game and it is ok.

4. Christian Schüler says:

Hallo Sebastien
The compiler should be able to factor the integer power pow(1 – dotEH, 5).
You should get MUL, MUL, MUL (first ^2, second ^4, third ^5 by multiplying with the original value). That’s the reason I use Schlick with a power of 4 instead of 5, which saves one MUL.
If your compiler doesnt do this, complain.

• seblagarde says:

Hi Christian,

Good point! Right, the MUL way is fast and schedule well.
However, I can complain about the PS3 cg compiler as it generate (always ?) the LOG2, MUL, EXP2 code, so it should be done by hand . I should add that on PS3 using power of 4 result in LOG2*4 (the factor 4 is merge in the LOG2 instruction) and EXP2, which is 2 instructions.
Still I found the power 4 to be a bad approximation and I prefer the more accurate exp2(-8.35 * dotEH).

• seblagarde says:

I wasn’t fully complete on the last comment.
The MUL way result in 4 instructions : SUB, MUL,MUL,MUL, I forget the initial SUB. So this perform better than the original formula as MUL schedule better, but it still 4 instructions. Other approximation still helpful depends on the error tolerance.

5. robingreen says:

Why Gaussians? Why not the Gauss-Weierstrass kernel:

Gt(x) = 1/(2 π t)^1/2 exp(-x^2 / 2t)

or the Abel-Poisson kernel:

P(θ) = 1 − t^2 / 1 − 2t cos θ + t^2

I really don’t get this thing with Gaussians.

• seblagarde says:

Because I was not knowing the other form :).

Also keep in mind that this post was relevant for PS3 GPU because it was saving one instruction (For Fresnel term). With current GCN architecture on console GPU this is pointless as exp is quarter rate.
And current game use GGX and not Blinn-Phong anymore

6. robingreen says:

Any chance of some relative error graphs? The size of error as a percentage of the signal magnitude often makes really clear which approximation is best.

• seblagarde says:

I don’t think these optimizations have any interest today (see my previous comment). But agree with you that error analysis should have been more complete 🙂

7. Pingback: 实时渲染中的菲涅尔效应 – gleam

8. Leo Meng says:

Is this an optimization for mobile?

• seblagarde says:

Hey. This was an approximation for PS3 Nvidia gpu architecture (which was the limiting target at the time). It is not an optimization on AMD GCN. Regarding mobile, I have no knowledge in this area so can’t answer. But I bet the large range of GPU available for mobile mean the answer is: it depends.

9. SebMestre says:

Hello! I know this is really old but it still is among the top results in a google search. there is a mistake in the math: log(a) is not log2(a*log(2)), it is log2(a)*log(2) or log2(a)/log2(e). It would be neat if that was fixed; It threw me off for a while.

Thank you!

• seblagarde says:

Good catch, fixed. Thanks for the report :), Was a typo doesn’t change the derivation.