One important feature of Direct3D 11 and OpenGL 4, which I have seen very little written about, is the addition of a new `precise`

qualifier. It’s no surprise there’s some confusion about the purpose and motivation for this new keyword. The Direct3D documentation is vague about it:

precise affects all results that contribute to the variable’s outcome by preventing the compiler from doing unsafe optimizations. For instance, to improve performance the compiler ingores the possibility of NaN and INF results for floating point variables from constant and stream inputs in order to do several optimizations. By using the precise keyword, these optimizations will be prohibited for all calculations affecting the result of the variable.

Besides the typos (ingores?), it’s not very clear what “unsafe optimizations” actually means, since only a single example is provided. On the other side, the OpenGL specification is a lot more “precise”:

The qualifier “precise” will ensure that operations contributing to a

variable’s value are performed in the order and with the precision

specified in the source code. Order of evaluation is determined by

operator precedence and parentheses, as described in Section 5.

Expressions must be evaluated with a precision consistent with the

operation; for example, multiplying two “float” values must produce a

single value with “float” precision. This effectively prohibits the

arbitrary use of fused multiply-add operations if the intermediate

multiply result is kept at a higher precision.

OpenGL makes it clear that precise causes operations to be performed in the exact order prescribed by the language rules. It also highlights that the precision of the operations must also be exactly as prescribed and not greater, which rules out the use of fused multiply-add if implemented according to the IEEE 754-2008 specification, to which most modern GPUs conform.

Today we take watertight rasterization for granted. This is something that most of us don’t even have to think about, since the fixed function hardware takes care of it automatically.

On the other side, when using tessellation, triangle vertices are generated by programable shaders that run independently for adjacent patches, which usually have arbitrary parametric orientations. Obtaining watertightness of tessellated patches requires a lot of care to guarantee that vertices shared between adjacent patches are evaluated consistently. As opposed to rasterization, in this case the responsibility to obtain watertightness lies in the hands of the programmer.

In my first experiments it became clear that this was not only hard to achieve, it was simply impossible. Fortunately, at NVIDIA I had access to the whole software stack, so by using a custom Cg compiler, running with developer drivers, and turning some knobs, I could prevent unwanted optimizations from happening and confirm that it was indeed possible to obtain watertight results. On the other hand, this solution was not very efficient, what we needed was to have fine grained control over the way the compiler turned our expressions into code; preventing optimizations only in the situations where it mattered.

Among all the solutions proposed, the addition of a precise qualifier and an fma (**f**used **m**ultiply-**a**dd) intrinsic is the solution that caught on.

To understand how the solution works, let’s look at some simple code that evaluates a cubic curve:

float3 cubic(float x, float3 p0, float3 p1, float3 p2, float3 p3) { float B0 = x * x * x; float B1 = (1-x) * x * x; float B2 = (1-x) * (1-x) * x; float B3 = (1-x) * (1-x) * (1-x); return p0 * B0 + p1 * B1 + p2 * B2 + p3 * B3; }

In most situations that would work fine. However, when doing tessellation you need to be more careful. If you have two cubic surfaces adjacent to each other, but their parametric orientations do not agree, and we use that function to evaluate the surface on the boundary, then the resulting positions will also not agree.

That is because the following condition is not true:

cubic(x, a, b, c, d) == cubic(1-x, d, c, b, a);

That is, the evaluation of the curve is not symmetric. The result of evaluating the curve at `x`

is not the same if you change the order of the control points and evaluate it at `1-x`

.

It might seem counter-intuitive, but the way the compiler translates those expressions into a sequence of instructions does not guarantee symmetry. The computation itself must be expressed independent of parametric orientation.

Simply adding the precise qualifier to the result does not really solve our problems:

precisefloat3 cubic(float x, float3 p0, float3 p1, float3 p2, float3 p3) { float B0 = x * x * x; float B1 = (1-x) * x * x; float B2 = (1-x) * (1-x) * x; float B3 = (1-x) * (1-x) * (1-x); return p0 * B0 + p1 * B1 + p2 * B2 + p3 * B3; }

Note the order in which the operations are performed:

(((p0 * B0) + p1 * B1) + p2 * B2) + p3 * B3

Writing the expression with parenthesis to emphasize the evaluation order makes it clear that the expression does not have a symmetric structure, so it still doesn’t meet our requisite. What we should do then is to rewrite the expression as follows:

((p0 * B0) + p1 * B1) + ((p3 * B3) + p2 * B2)

It looks like now it should not matter in what direction the curve is evaluated, the result should be the same… but that’s not yet the case. If you look at the basis functions you will notice another problem:

float B0 = x * x * x; float B1 = (1-x) * x * x; float B2 = (1-x) * (1-x) * x; float B3 = (1-x) * (1-x) * (1-x);

`B1`

and `B2`

are not symmetric either. In order to guarantee that `B1(x) == B2(1-x)`

, they should be written as follows:

float B0 = x * x * x; float B1 = x * x * (1-x); float B2 = (1-x) * (1-x) * x; float B3 = (1-x) * (1-x) * (1-x);

That is getting closer to being correct, but there are still some problems. The precise qualifier only affects the expressions on the right hand side of the precise variable. That is, the above code is equivalent to:

float3 cubic(float x, float3 p0, float3 p1, float3 p2, float3 p3) { float B0 = x * x * x; float B1 = x * x * (1-x); float B2 = (1-x) * (1-x) * x; float B3 = (1-x) * (1-x) * (1-x);precisefloat3 p = ((p0 * B0) + p1 * B1) + ((p3 * B3) + p2 * B2); return p; }

So, the compiler is free to mess with our carefully arranged expressions. Instead we should declare all of them as precise:

precisefloat3 cubic(float x, float3 p0, float3 p1, float3 p2, float3 p3) {precisefloat B0 = x * x * x;precisefloat B1 = x * x * (1-x);precisefloat B2 = (1-x) * (1-x) * x;precisefloat B3 = (1-x) * (1-x) * (1-x); return ((p0 * B0) + p1 * B1) + ((p3 * B3) + p2 * B2); }

OK, this code is finally watertight! The only problem now is that the resulting code is much more expensive than our original implementation.

In both cases the evaluation of the Bezier basis requires about 7 scalar instructions. However, the evaluation of the curve used 12 instructions, while now it requires 21 instructions and one extra register. This is where we can use the fma intrinsic. We can use fma and still obtain symmetric results, but we have to tell the compiler explicitly when is it possible to do so. In this case, the resulting expression is as follows:

fma(p1, vec3(B1), (p0 * B0)) + fma(p2, vec3(B2), (p3 * B3))

This preserves the symmetry of the computation and only requires 15 instructions, much better!

In order to think more easily about the cost of the different implementations I use the following notation:

* --- > --- > --- > = 4*3 = 12 * -+- * -+- * -+- * = 7*3 = 21 * --- > -+- < --- * = 5*3 = 15

Here the stars (`*`

) represent mul instructions, the crosses (`+`

) represent add instructions, and the arrows (`><`

) represent fma instructions. The dashes (`-`

) simply describe the dependencies between instructions and the order of evaluation in a loose way.

Take note, because in the next article I’ll be using this notation more extensively.

## 1 Comment