NVIDIA Texture Tools 2.0.6

Yesterday I released another revision of the stable branch of the NVIDIA Texture Tools. I recommend that everybody upgrade, since it fixes several bugs and artifacts, and improves compatibility with current and future CUDA drivers. Starting with this release I’m planning to provide a verbose description of the changes in each release. So, here it goes:

In a multi-GPU environment NVTT will now use the fastest GPU available. This is quite useful on systems that have an embedded and a discrete GPU. You certainly don’t want to use the embedded one while the discrete GPU is idle.

Using recent CUDA runtimes with older CUDA drivers usually causes problems. To avoid that, NVTT now determines the version of the available CUDA driver and makes sure it’s compatible with the CUDA runtime that was used to compile it.

The NVTT shutdown code did not destroy the CUDA context properly, which caused problems the second time you tried to create a context in the same thread. The behavior on 2.0 drivers was to reuse the context already created, but CUDA 2.1 produces an error instead. This is fixed now.

Note, however, that you are not allowed to create multiple compressors in the same thread simultaneously. This is a limitation of the CUDA runtime API, which transparently creates one CUDA context per thread. In the future I’m planning to use the driver API directly, which should remove this restriction.

There have been issues reported with the CUDA compressor not producing correct output when processing 1×1 mipmaps. I was able to reproduce the problem on old drivers, but it seems to be fixed now. In any case, I realized that using the entire GPU to compress a 1×1 mipmap did not make any sense, so now I compress small mipmaps on the CPU instead. That is slightly faster and avoids the aforementioned problem.

At the request of the Wolfire guys, NVTT now scales images with alpha by weighting the color channels by the alpha component. While a simple workaround to obtain the right results was to use premultiplied alpha, there was no reason not to do the right thing in all cases.
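The idea can be sketched as follows (my own illustration, not NVTT’s actual code; the `Pixel` struct and function name are made up): when downsampling, each output color is the alpha-weighted average of the input colors, while the alpha channel itself is averaged normally.

```cpp
#include <cassert>
#include <cmath>

struct Pixel { float r, g, b, a; };

// Reduce a 2x2 block to one pixel. Colors are weighted by alpha, so
// fully transparent texels cannot bleed their color into the result;
// the alpha channel itself is averaged normally.
Pixel downsampleAlphaWeighted(const Pixel p[4])
{
    float r = 0, g = 0, b = 0, a = 0, wsum = 0;
    for (int i = 0; i < 4; i++) {
        r += p[i].a * p[i].r;
        g += p[i].a * p[i].g;
        b += p[i].a * p[i].b;
        a += p[i].a;
        wsum += p[i].a;
    }
    Pixel out = { 0, 0, 0, a / 4 };
    if (wsum > 0) { // fully transparent blocks keep a black color
        out.r = r / wsum;
        out.g = g / wsum;
        out.b = b / wsum;
    }
    return out;
}
```

With premultiplied alpha the weighting happens implicitly, which is why that workaround produced the right results.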

As mentioned in my previous post, I’ve updated the single color compression tables to take into account the possible hardware decoding errors.

There was also a bug in the decompression of some RGB DDS files that is now fixed.

Progress on the 2.1 release is slow, but steady. The target features are almost complete, but I’ve been cleaning things up, removing a lot of cruft, and simplifying the code, in order to make it more accessible and encourage more contributions. I’m also working on a lower-level API that should be more flexible and easier to maintain. More on that in a future post.

GPU DXT Decompression

Even though the DXT compression format is widely adopted, most hardware implementations use slightly different decompression algorithms.

The problem is that there never was a specification that described the preferred implementation. The DXT format was first documented by S3 in US patent 5956431. However, in order to cover all possible implementations the description in the patent is very generic. It does not specify when or whether the colors are bit expanded, or how the palette entries are interpolated. In one of the proposed embodiments they approximate the 1/3 and 2/3 terms as 3/8 and 5/8, which would have been a good idea, if everybody had chosen the same approximation. However, different implementations approximated the expensive division in different ways.
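To make the ambiguity concrete, here is a sketch of one possible decoder (my own illustration, not any particular vendor’s implementation; the function names are made up): the two 16-bit RGB565 endpoints are bit-expanded to 8 bits, and the two intermediate palette entries are interpolated, either with the ideal 1/3 and 2/3 weights or with the cheaper 3/8 and 5/8 approximation mentioned in the patent.

```cpp
#include <cassert>

// Expand a 5- or 6-bit channel to 8 bits by bit replication.
static int expand(int v, int bits)
{
    return (v << (8 - bits)) | (v >> (2 * bits - 8));
}

// Build the 4-entry DXT1 palette from two RGB565 endpoints. 'approx'
// selects the cheap 3/8-5/8 interpolation instead of the ideal 1/3-2/3.
void dxtPalette(unsigned short c0, unsigned short c1, int palette[4][3], bool approx)
{
    int e0[3] = { expand(c0 >> 11, 5), expand((c0 >> 5) & 63, 6), expand(c0 & 31, 5) };
    int e1[3] = { expand(c1 >> 11, 5), expand((c1 >> 5) & 63, 6), expand(c1 & 31, 5) };
    for (int i = 0; i < 3; i++) {
        palette[0][i] = e0[i];
        palette[1][i] = e1[i];
        if (approx) {
            palette[2][i] = (5 * e0[i] + 3 * e1[i]) / 8;
            palette[3][i] = (3 * e0[i] + 5 * e1[i]) / 8;
        } else {
            palette[2][i] = (2 * e0[i] + e1[i]) / 3;
            palette[3][i] = (e0[i] + 2 * e1[i]) / 3;
        }
    }
}
```

For black and white endpoints the ideal weights give intermediate values of 85 and 170, while the 3/8-5/8 approximation gives 95 and 159; small differences like these are exactly what makes bit-exact decompression across vendors impossible.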


Ham and Beans

I’ve been thinking about writing about cooking in this blog for a while. Not just to share what I’ve learned, because, well, I don’t have much to share, but to get other people’s ideas and suggestions. While I may cook better than the average American and may be able to impress my wife and friends, I’m not really such a good cook. I know how to follow recipes, and I’m starting to get some sense of what works and what does not, but I have a lot to learn, which is actually fascinating; there’s a whole world of flavors out there and exploring it is a very rewarding experience.

It took me a while to get used to cooking in the US. Spanish cooking is not very sophisticated, but it’s based on seasonal dishes and the use of high quality ingredients available locally. Not the kind of thing you find in the US at your nearest supermarket.

This is all changing very quickly, but many of the traditions still remain. I for one grew up without paying too much attention to food. I just thought it was fuel for my body. It was only when I moved to the US that I started to appreciate what I used to have and now missed.

The Mediterranean diet is mostly composed of olive oil, bread, fresh fruits, fish and vegetables. However, I grew up in a family that runs a pork farm and a small meat production and distribution business. So, I had a larger share of meat than the average. Despite that, meat was most often used as a condiment or in small dishes, except on special occasions.

[missing image – Iberian pigs in the Dehesa]

Our pork is unlike any other. Most of it is free-range and acorn-fed, which gives its meat a very distinct flavor. It is generally destined for the production of the finest quality deli meats: cured ham, loins, and sausages.

As far as I know it’s not possible to find anything like that in the US. Exports of Iberian ham are legal now, but the FDA requires installations to be periodically inspected and approved. Small farmers cannot really afford to comply with all the requirements of the FDA, so the only ham that you can usually find in the US is from larger and lower quality producers.

To add insult to injury, the low quality ham that is available here is sold at outrageous prices, as if it were a high quality product! If you are not discouraged yet, you can find some of it at the Spanish Table, or at online retailers such as La Tienda.

I would not recommend any of those options. Instead, what I do is to bring some of our ham with us every time we travel to Spain. This is actually illegal, but is usually safe. However, once we tried to bring an entire leg and, much to our dismay, it was confiscated by the border officers. Since then, I only bring small quantities that are hardly noticeable on the scanner.

Rather than eating all of it right away, I generally save it for months, waiting for the right opportunity to enjoy it. Last weekend that opportunity presented itself in the form of fresh fava beans:

Fava beans are delicious eaten raw, or simply sautéed in olive oil with red onions. However, a little bit of Iberian Ham turns that delicious dish into a delicacy.

The preparation is trivial:

  • Remove the beans from their pods. It’s possible to also remove the beans from the shells by cooking them in boiling water for a short period of time and cooling them quickly in cold water. I personally don’t bother with that and leave the beans in their shells.
  • Pour a small amount of olive oil on a sauté pan on high heat. Chop the onions and the ham, and cook until flagrant.
  • Then add the beans, season to taste with salt and freshly ground pepper, and cook lightly for a few minutes while stirring often.

Let me know if you like posts like this, and if so, in future posts I’ll continue rambling about my quest to adapt the traditional Spanish recipes to the resources available in the US.


Posted 18/5/2009 at 4:04 am | Permalink
Habas con jamón…. Yummy!

Did I tell you I bought a bread machine? I plan to make sourdough bread as soon as I’m able to grow the culture.

Posted 18/5/2009 at 8:07 am | Permalink
Yeah, fantastic post, more!

Posted 18/5/2009 at 2:54 pm | Permalink
¡Habas con jamón! Delicious!
Well done! I have the same problem finding cured ham in England, and I also follow the same procedure to bring some here, although I haven’t tried yet to bring a whole leg :D

Chad Austin
Posted 18/5/2009 at 4:46 pm | Permalink
I love all food, so more!

Posted 18/5/2009 at 7:02 pm | Permalink
Thanks for all the feedback! I think that next I’ll share my technique to make a basic pâté.

Posted 19/5/2009 at 8:47 pm | Permalink
Great post, thanks! BTW, I like your typo “cook until flagrant”, it makes the meal seem much more exotic :) If you ever find a source for good quality hams please post it!

Optimal Grid Rendering

This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn’t good enough. I’m sure many people don’t know this either, so here it goes.

Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object.

It’s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.

That’s enough for the motivation. Imagine we have a 4×4 grid. The first two rows would look like this:

* - * - * - * - *
| / | / | / | / |
* - * - * - * - *
| / | / | / | / |
* - * - * - * - *

With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:

7 - 5 - 3 - 1 - *
| / | / | / | / |
6 - 4 - 2 - 0 - *
| / | / | / | / |
* - * - * - * - *

And after the next two triangles:

* - 7 - 5 - 3 - 1
| / | / | / | / |
* - 6 - 4 - 2 - 0
| / | / | / | / |
* - * - * - * - *

Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles, two of the vertices that were previously in the cache need to be loaded again:

* - * - * - 7 - 5
| / | / | / | / |
3 - 1 - * - 6 - 4
| / | / | / | / |
2 - 0 - * - * - *

Instead of using the straightforward traversal, it’s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.
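As a quick illustration of the first option (my own sketch), a Morton traversal just interleaves the bits of the quad coordinates and visits the quads sorted by the resulting index:

```cpp
#include <cassert>

// Interleave the bits of x and y to get the Morton (Z-order) index of
// a grid cell. Visiting cells sorted by this index keeps nearby cells,
// and therefore shared vertices, close together in the index stream.
unsigned morton2d(unsigned x, unsigned y)
{
    unsigned m = 0;
    for (int i = 0; i < 16; i++) {
        m |= ((x >> i) & 1u) << (2 * i);
        m |= ((y >> i) & 1u) << (2 * i + 1);
    }
    return m;
}
```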

All these options are better than doing nothing, but still produce results that are far from optimal. In the table below you can see the results obtained for a 16×16 grid with a FIFO cache with 20 entries:

Method           ACMR    ATVR
Scanline         1.062   1.882
NVTriStrip       0.818   1.450
Morton           0.719   1.273
K-Cache-Reorder  0.711   1.260
Hilbert          0.699   1.239
Forsyth          0.666   1.180
Tipsy            0.658   1.166
Optimal          0.564   1.000

Note that I’m using my own implementation of all of these methods, so the results obtained with the code from the original authors might differ slightly.

The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles.

When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:

4 - 3 - 2 - 1 - 0
| / | / | / | / |
* - * - * - * - *
|   |   |   |   |
* - * - * - * - *

That can be done by issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:

* - 7 - 6 - 5 - 4
| / | / | / | / |
3 - 2 - 1 - 0 - *
| / | / | / | / |
* - * - * - * - *

In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2.

The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them. The following code accomplishes that:

// Appends indices for the (x0,y0)-(x1,y1) section of the grid to the
// global index buffer 'indices'.
void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)
{
    if (x1 - x0 + 1 < cacheSize)
    {
        if (2 * (x1 - x0) + 1 > cacheSize)
        {
            // Prefetch the first row of vertices with degenerate triangles.
            for (int x = x0; x < x1; x++)
            {
                indices.push_back(x + 0);
                indices.push_back(x + 0);
                indices.push_back(x + 1);
            }
        }

        for (int y = y0; y < y1; y++)
        {
            for (int x = x0; x < x1; x++)
            {
                indices.push_back((width + 1) * (y + 0) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 0) + (x + 1));

                indices.push_back((width + 1) * (y + 0) + (x + 1));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 1));
            }
        }
    }
    else
    {
        // The rows don't fit in the cache: split the grid into vertical
        // sections and process each of them separately.
        int xm = x0 + cacheSize - 2;
        gridGen(x0, xm, y0, y1, width, cacheSize);
        gridGen(xm, x1, y0, y1, width, cacheSize);
    }
}

This may not be the optimal grid partition, but the method still performs quite well in those cases. Here are the results for a cache with 16 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.775  1.374
K-Cache-Reorder  0.766  1.356
Hilbert          0.754  1.336
Morton           0.750  1.329
Tipsy            0.711  1.260
Forsyth          0.699  1.239
Optimal          0.598  1.059

And for a cache with only 12 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.875  1.550
Forsyth          0.859  1.522
K-Cache-Reorder  0.842  1.491
Morton           0.812  1.439
Hilbert          0.797  1.412
Tipsy            0.758  1.343
Optimal          0.600  1.062

In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would be interesting to take some of these observations into account in a general mesh optimization algorithm.


I think that the Average Cache Miss Ratio (ACMR) is not the best way of measuring the efficiency of the vertex cache under different mesh optimization algorithms. The ACMR is basically the number of vertex transforms divided by the number of primitives, and greatly depends on the topology of the mesh. For a triangle mesh the theoretical optimal is 0.5, but if the mesh has boundaries, or disconnected triangles, then the optimal value is much higher. So, you have to compare the ACMR against the vertex to triangle ratio of the mesh.

I think that a better way of measuring how well a mesh optimization algorithm or a cache implementation performs is the average transform to vertex ratio (ATVR), that is, the number of vertex transforms divided by the number of vertices. In the optimal case it’s always 1, independently of the mesh and the primitive type. I think it provides a much more intuitive idea of the cost of rendering a mesh. For example, if the transform to vertex ratio is 2, it means that each vertex is transformed an average of 2 times.

Here’s another example: In his article about Vertex Cache Optimisation, Tom Forsyth has to clarify that the results for the GrannyRocks mesh (between 0.797 and 0.765) are actually quite good, because the best ACMR possible is 0.732. That clarification would not be necessary when using the metric that I propose. Instead, you would say that the transform to vertex ratio is between 1.09 and 1.05, and it’s instantly obvious that it’s a good result.
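Numbers like these are easy to check with a small FIFO cache simulator (a sketch with my own helper names; the gridGen code above is adapted to take the index buffer as a parameter, and, as in the tables, degenerate triangles are excluded from the primitive count). For a 16×16 grid and a 20-entry cache, every one of the 289 vertices is transformed exactly once, so the ATVR is 1.0 and the ACMR is 289/512 ≈ 0.564:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <deque>
#include <vector>

// FIFO post-transform cache simulator: count the vertex transforms
// (cache misses) for an index stream. Hits do not modify the cache.
int countTransforms(const std::vector<int>& indices, size_t cacheSize)
{
    std::deque<int> cache;
    int transforms = 0;
    for (int idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            transforms++;
            cache.push_back(idx);
            if (cache.size() > cacheSize) cache.pop_front();
        }
    }
    return transforms;
}

// Grid index generation from the post, with the index buffer passed
// explicitly. The prefetch assumes y0 == 0, as in the original code.
void gridGen(std::vector<int>& indices, int x0, int x1, int y0, int y1,
             int width, int cacheSize)
{
    if (x1 - x0 + 1 < cacheSize) {
        if (2 * (x1 - x0) + 1 > cacheSize) {
            // Prefetch the first row with degenerate triangles.
            for (int x = x0; x < x1; x++) {
                indices.push_back(x + 0);
                indices.push_back(x + 0);
                indices.push_back(x + 1);
            }
        }
        for (int y = y0; y < y1; y++) {
            for (int x = x0; x < x1; x++) {
                indices.push_back((width + 1) * (y + 0) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 0) + (x + 1));

                indices.push_back((width + 1) * (y + 0) + (x + 1));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 1));
            }
        }
    } else {
        int xm = x0 + cacheSize - 2;
        gridGen(indices, x0, xm, y0, y1, width, cacheSize);
        gridGen(indices, xm, x1, y0, y1, width, cacheSize);
    }
}
```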

Real-Time Creased Approximate Subdivision Surfaces

Denis finally put his I3D paper online!

Denis was my intern a couple of summers ago. We worked together on the implementation of an emulation framework to prototype various tessellation algorithms. Last summer he wanted to try something different, so I recommended that he join a game studio, and he finally went to work at Valve.

Having spent so much time and effort working on approximate subdivision surfaces and talking about them to developers, it’s very pleasing to see these techniques being adopted in real applications.

Adding support for subdivision surfaces in a game engine requires significant changes in the production pipeline. Not only do tools need to be updated, but production processes need to change, and artists need to be educated to adopt and learn the new tools.

I think this paper will be very valuable in that regard. It describes Valve’s experience integrating ACC into the Source engine: how the ACC scheme had to be extended to support creases in order to achieve the visuals they were aiming for, the many problems that arose, and, more importantly, the problems that could not be solved and required artist intervention.

10 Fun Things to do with Tessellation

Hardware tessellation is probably the most notable feature of Direct3D11.

Direct3D11 was announced at the last Gamefest and a technical preview was released in the November 2008 DirectX SDK. Hardware implementations are expected to be available this year.

[missing image]

Direct3D11 extends the Direct3D10 pipeline with three new stages: Two programmable shader stages (the Hull and Domain Shaders), and a fixed function stage (the Tessellator). More details can be found here and here.

Rendering of Catmull-Clark subdivision surfaces is often mentioned as the primary application for the tessellation pipeline, but there are many other interesting uses that have not received that much attention.

I thought it would be interesting to take a closer look at those other applications, and submitted a proposal to do that at GDC’09. However, it seems that the organizers do not think tessellation is as interesting as I do, or they didn’t like my proposal, or maybe it’s just that they know I’m a lousy speaker. I will never know, because the gracious feedback of the GDC review committee can be represented by a single boolean.

In any case, here’s a brief overview of the 10 fun things that I was planning to talk about. I don’t get very deep into the technical details, but in future posts I may describe some of these applications more thoroughly. Please, leave your comments if there’s something you would like to learn more about.


Curved PN Triangles is a triangle interpolation scheme that operates directly on triangle meshes whose vertices are composed of positions and normals (PN stands for Point-Normal).

[missing image – PN Triangles]

It’s an interesting way of improving visual quality that offers a simple migration path, since assets do not need to be heavily modified.

The PN Triangle evaluation consists of two steps: First, for every triangle of the input mesh a triangular cubic patch is derived solely from the vertex positions and normals; no adjacency information is required. Then, the resulting patch is subdivided or tessellated for rendering.

The resulting surface is smoother than the polygonal surface, but does not have tangent continuity in general, and that results in shading discontinuities. To hide these discontinuities normals are interpolated independently using either linear or quadratic interpolation. These normals are not the true surface normals, but they provide a smooth appearance to the surface.

This two-step evaluation maps very well to the Direct3D11 tessellation pipeline. The evaluation of the control points can be performed in the Hull Shader, the fixed function tessellator can produce a tessellation pattern in the triangle domain, and the actual surface can be evaluated for each of the tessellated vertices in the Domain Shader.
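To make the two steps concrete, here’s a rough sketch of both (my own illustration following the Vlachos et al. construction; the names and types are made up). pnBuild corresponds to the Hull Shader work, pnEval to the Domain Shader work:

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 operator+(V3 a, V3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static V3 operator-(V3 a, V3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static V3 operator*(float s, V3 a) { return { s * a.x, s * a.y, s * a.z }; }
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct PNPatch { V3 b300, b030, b003, b210, b120, b021, b012, b102, b201, b111; };

// Step 1 (per input triangle): derive the 10 control points of a cubic
// triangular patch from the corner positions and normals only; no
// adjacency information is needed.
PNPatch pnBuild(V3 p1, V3 p2, V3 p3, V3 n1, V3 n2, V3 n3)
{
    // Edge point: the 1/3 point of the edge, pulled onto the tangent
    // plane of the nearest corner.
    auto edge = [](V3 p, V3 q, V3 n) {
        return (1.0f / 3) * (2 * p + q - dot(q - p, n) * n);
    };

    PNPatch c;
    c.b300 = p1; c.b030 = p2; c.b003 = p3;
    c.b210 = edge(p1, p2, n1); c.b120 = edge(p2, p1, n2);
    c.b021 = edge(p2, p3, n2); c.b012 = edge(p3, p2, n3);
    c.b102 = edge(p3, p1, n3); c.b201 = edge(p1, p3, n1);

    // Center point: average of the edge points, pushed away from the
    // triangle centroid by half that distance.
    V3 e = (1.0f / 6) * (c.b210 + c.b120 + c.b021 + c.b012 + c.b102 + c.b201);
    V3 v = (1.0f / 3) * (p1 + p2 + p3);
    c.b111 = e + 0.5f * (e - v);
    return c;
}

// Step 2 (per tessellated vertex): evaluate the cubic Bezier triangle
// at barycentric coordinates (u, v, w).
V3 pnEval(const PNPatch& c, float u, float v)
{
    float w = 1 - u - v;
    return u * u * u * c.b300 + v * v * v * c.b030 + w * w * w * c.b003
         + 3 * u * u * v * c.b210 + 3 * u * v * v * c.b120
         + 3 * v * v * w * c.b021 + 3 * v * w * w * c.b012
         + 3 * u * w * w * c.b102 + 3 * u * u * w * c.b201
         + 6 * u * v * w * c.b111;
}
```

A useful sanity check is that a flat triangle whose normals all match the plane normal produces a flat patch.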

[missing image – Scalar Tagged PN Triangles]

In order to support sharp edges, a rim of small triangles is added along the edges. That increases the number of patches, and it’s not entirely clear how to properly texture map them. Scalar Tagged PN-Triangles solve that problem in a more elegant way by tagging each crease vertex with three scalars that act as shape controllers and modify the construction of the surface control points. However, this representation does not support crease corners.

Silhouette Refinement

When tessellation is enabled, the only supported primitive type is the patch primitive. In Direct3D11 a patch is an abstract primitive with an arbitrary number of vertices. You can use patches to represent traditional primitives (i.e., a triangle is just a patch with 3 vertices), but this also enables you to represent other input primitives with arbitrary topology and additional connectivity information.

An interesting extension of PN-Triangle tessellation is to augment the input triangles with the neighboring vertices in order to perform silhouette refinement.

With this additional information it’s possible to compute tessellation factors in the Hull Shader based on whether an edge is on the silhouette or in the interior of the mesh. Then the fixed function tessellator uses these edge tessellation factors to produce a semi-regular tessellation pattern, and the Domain Shader evaluates the surface at the tessellated vertices.

Phong Tessellation

[missing image – phong tessellation]

Phong Tessellation is a geometric version of Phong interpolation, but applied to vertex positions instead of normals.

First, points are interpolated linearly over each triangle using its barycentric coordinates, then the points are projected onto the planes defined by the corner position and normal, and finally the result of the three projections is interpolated again.

This procedure produces a smooth surface comparable to PN Triangles, but its evaluation is much cheaper, since no additional control points need to be computed.
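The whole procedure fits in a few lines (a sketch with made-up names; this is the plain version, without the blending factor that the paper uses to control the deformation):

```cpp
#include <cassert>
#include <cmath>

struct V3 { float x, y, z; };
static V3 operator+(V3 a, V3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
static V3 operator-(V3 a, V3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static V3 operator*(float s, V3 a) { return { s * a.x, s * a.y, s * a.z }; }
static float dot(V3 a, V3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Project q onto the plane defined by corner position p and normal n.
static V3 project(V3 q, V3 p, V3 n) { return q - dot(q - p, n) * n; }

// Phong tessellation: linear interpolation, three tangent-plane
// projections, and a final barycentric blend of the projections.
V3 phongTessellate(V3 p0, V3 p1, V3 p2, V3 n0, V3 n1, V3 n2, float u, float v)
{
    float w = 1 - u - v;
    V3 q = u * p0 + v * p1 + w * p2;
    return u * project(q, p0, n0)
         + v * project(q, p1, n1)
         + w * project(q, p2, n2);
}
```

Note that, unlike PN Triangles, there is no per-patch setup at all; everything happens per tessellated vertex.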

Bezier Surfaces

Curved surfaces are not only useful for characters, but also for level geometry and objects.

[missing image – Quake 3 Arena]

id Software introduced the use of quadratic Bezier patches for architectural geometry in Quake 3 Arena and has been using them ever since.

Climax Brighton’s Moto GP used cubic Bezier patches to model the motorcycles.

Bezier patches can be evaluated very efficiently, because they don’t need any information about the surrounding mesh. As these games show, tessellation hardware is not required to render these surfaces. However, hardware tessellation will allow doing it much more efficiently, and will facilitate the use of these and more complex surfaces.
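As a sketch of why the evaluation is so cheap (my own illustration): a bicubic patch needs nothing but its own 16 control points, and de Casteljau’s algorithm evaluates it with a handful of lerps.

```cpp
#include <cassert>
#include <cmath>

struct P3 { float x, y, z; };
static P3 lerp(P3 a, P3 b, float t)
{
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y), a.z + t * (b.z - a.z) };
}

// Evaluate one cubic Bezier curve with de Casteljau's algorithm.
P3 bezier3(const P3 c[4], float t)
{
    P3 a = lerp(c[0], c[1], t), b = lerp(c[1], c[2], t), d = lerp(c[2], c[3], t);
    P3 e = lerp(a, b, t), f = lerp(b, d, t);
    return lerp(e, f, t);
}

// Evaluate a bicubic Bezier patch: reduce each row of the 4x4 control
// grid to a point at parameter u, then evaluate the resulting column
// curve at v. No adjacency information is needed.
P3 bezierPatch(const P3 cp[4][4], float u, float v)
{
    P3 col[4];
    for (int i = 0; i < 4; i++) col[i] = bezier3(cp[i], u);
    return bezier3(col, v);
}
```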

Approximation to Subdivision Surfaces

Rendering of approximated Catmull-Clark subdivision surfaces is probably the most anticipated application of hardware accelerated tessellation. Several approximation methods exist.

[missing images]

Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches is the most popular one. This approximation constructs a geometry patch and a pair of tangent patches for each quadrilateral face of the control mesh. The geometry patch approximates the shape and silhouette, but does not provide tangent continuity. A smooth normal field is constructed using two additional tangent patches. The approximation supports boundaries and has also been extended to support creases in Real-Time Creased Approximate Subdivision Surfaces.

GPU Smoothing of Quad Meshes proposes an alternative approximation using piecewise quartic triangular patches that have tangent continuity and do not require additional tangent patches to provide a smooth appearance. In Fast Parallel Construction of Smooth Surfaces from Meshes with Tri/Quad/Pent Facets the same approach is extended to approximate triangular and pentagonal faces.

[missing image]
(c) Kenneth Scott, id Software

Gregory patches are a more compact representation that also provides a very similar approximation, but only support quad and triangle control faces.

The availability of sculpting tools like ZBrush and Mudbox makes it possible to create highly detailed meshes. Displaced subdivision surfaces provide a compact and efficient representation for these meshes.

Rendering Geometry Images

Another approach to render highly detailed surfaces is to use geometry images. While geometry images can be rendered very efficiently, their video memory requirements are generally higher than displacement maps due to the lack of high precision texture compression formats. Traditional animation algorithms are not possible with this representation, and view dependent tessellation level evaluation is complicated, because geometry information is not directly available at the Hull Shader stage. However, geometry images may be the fastest approach to render small static objects at fixed tessellation levels.

Terrain Rendering

Terrain rendering is one of the most obvious applications for tessellation. The flexibility of the tessellation pipeline enables the use of sophisticated algorithms to evaluate the level of refinement of the terrain patches, and frees you from having to worry about many of the implementation details.

[missing image – Saga of Ryzom]

It’s also possible to extend traditional terrain engines with arbitrary topologies. Some MMORPGs are already doing that to create richer environments.

For example Saga of Ryzom, a game that is based on the Nevrax engine, uses cubic patches to model the terrain, which enables them to create impressive cliffs and overhangs.

[missing image – Saga of Ryzom]

Tessellation should make it possible to combine regular heightfields with caves, cliffs, arches, and other interesting rock formations.

I think that ZBrush or Mudbox would be excellent tools to create natural looking rugged terrain.

Hair Rendering

Efficient hair rendering is one of the most interesting applications of the Direct3D11 tessellation pipeline. In addition to triangular and quad patches the fixed function tessellator can also generate lines, which are very useful for applications like hair and fur rendering.

[missing image – nalu]

The algorithm described in Hair Animation and Rendering in the Nalu Demo maps very well to the tessellation pipeline.

As shown in Real-Time Rendering of Realistic Hair, the use of the hardware tessellation pipeline makes it very easy to simulate and render realistic hair with high geometric complexity in real-time.

That’s possible because the simulation is performed on only a few hundred guide hairs, which are expanded by the tessellator into thousands of hair strands.

Rendering Panoramas

Another application of tessellation is to perform arbitrary non-linear projections, which is useful, for example, to create real-time panoramas.

Since graphics hardware relies on homogeneous linear interpolation for rasterization, arbitrary projections and deformations at the vertex level result in errors unless the surface is sufficiently refined.

[missing image – panquake]

The traditional image-based approach is to render the scene to a cube map and then perform an arbitrary projection of the cube map to screen space, relying on the texture hardware to do the sampling and interpolation. This was the approach taken in Fisheye Quake and Pan Quake.

While that works well, it requires rendering the scene to the 6 cube faces, and sometimes results in oversampling or undersampling of some areas of the scene.

[missing image – panorama]

Dynamic Mesh Refinement on GPU using Geometry Shaders proposes the use of the geometry shader to dynamically refine surfaces to prevent linear interpolation artifacts. However, the Geometry Shader operates sequentially and is not well suited for this task. On the other hand, the dynamic mesh refinement algorithm maps well to the Direct3D11 tessellation pipeline.

Rendering of 2D curved shapes

While GPUs can render simple polygons, they are not able to automatically handle complex concave and curved polygons with overlaps and self-intersections without prior triangulation and tessellation.

[missing image – svg tiger]

The Direct3D11 tessellation pipeline is not designed to perform triangulation. However, there’s a well known method to render arbitrary polygons using the stencil buffer that can be used in this case. This method was first described in the OpenGL Red Book, but was recently popularized by its implementation in the Qt graphics library.

It’s possible to combine this technique with hardware tessellation to render curved tessellated shapes without the need of expensive CPU tessellation and triangulation algorithms.


Posted 11/1/2009 at 10:31 am | Permalink
Excellent post Ignacio! Great inspiration for future geometry pipelines & usages in games

Posted 11/1/2009 at 11:36 am | Permalink
Do you think you could do Reyes-style motion blur?

Similar to the Rendering Panoramas method that you mention – tesselate triangles until they are near 1 pixel size and then extrude them based on their velocity (assume the engine is passing a velocity vector per vertex or something) ?

Posted 11/1/2009 at 1:27 pm | Permalink
> Do you think you could do Reyes-style motion blur?

I think that tessellation does not help much in this setting. There’s no tessellation mode that creates multiple copies of the same triangle. However, you can do that in the geometry shader. The problem is how to combine these copies. You can use some sort of order independent transparency, which might actually be possible in Direct3D11, since pixel shaders support arbitrary gather and scatter operations.

In a Reyes-style renderer, shading is performed in object space. Fast moving objects do not need to be shaded exactly, so using tessellation you can dynamically change the shading frequency based on this and other factors.

ren canjiang
Posted 7/2/2009 at 3:55 am | Permalink
thank you for so interesting topics. i’m using opengl for demos, i hope it can catch up with dx11?-?

sonali joshi
Posted 21/6/2009 at 4:17 am | Permalink
it was just ok

Posted 24/7/2009 at 10:40 am | Permalink
Well done :) Keep it up!

Posted 12/3/2010 at 9:39 pm | Permalink
You’re an excellent speaker, and have an interesting article. It’s a shame you didn’t get to speak! =(

Ownership-based Zippering

Since my Gamefest talk I’ve received numerous questions about the ownership-based zippering algorithm that I proposed. So, I’ll try to explain it in more detail. See my previous article on watertight texture sampling for more background info.

In the averaging method we would have to store the texture coordinate of every patch that contributes to a shared feature. Edges are shared by only two patches, but corners can be shared by many patches. By defining the ownership of the shared features (corners and edges), we only have to store the texture coordinates of the patch that owns the corresponding feature.

So, we have:

  • 4 texture coordinates for the interior (4).
  • 2 texture coordinates for each edge (8).
  • 1 texture coordinate for each corner (4).

Therefore, the total number of texture coordinates per patch is: 4+8+4 = 16

Deciding what patch owns a certain edge or corner is done as a pre-process, so that the patch texture coordinates can be computed in advance. The way I store these texture coordinates is as follows:

[missing picture]

Each vertex has:

  • 1 interior texture coordinate (index 0).
  • 1 edge texture coordinate for each of the two edges (indices 1 and 2).
  • 1 corner texture coordinate (index 3).

On the interior, we interpolate the interior texture coordinates bilinearly:

float2 tc = bar.x * texCoord[0][0] +
            bar.y * texCoord[1][0] +
            bar.z * texCoord[2][0] +
            bar.w * texCoord[3][0];

where bar stands for the barycentric coordinates:

bar.x = (    uv.x) * (    uv.y);
bar.y = (1 - uv.x) * (    uv.y);
bar.z = (1 - uv.x) * (1 - uv.y);
bar.w = (    uv.x) * (1 - uv.y);

On the edges we interpolate the edge texture coordinates linearly:

if (uv.y == 1) tc = texCoord[0][1] * bar.x + texCoord[1][2] * bar.y;
if (uv.y == 0) tc = texCoord[2][1] * bar.z + texCoord[3][2] * bar.w;
if (uv.x == 1) tc = texCoord[3][1] * bar.w + texCoord[0][2] * bar.x;
if (uv.x == 0) tc = texCoord[1][1] * bar.y + texCoord[2][2] * bar.z;

And at the corners we simply select the appropriate corner texture coordinate:

if (bar.x == 1) tc = texCoord[0][3];
if (bar.y == 1) tc = texCoord[1][3];
if (bar.z == 1) tc = texCoord[2][3];
if (bar.w == 1) tc = texCoord[3][3];

The same thing can be done more efficiently using a single bilinear interpolation preceded by some predicated assignments:

// Interior
float2 t0 = texCoord[0][0];
float2 t1 = texCoord[1][0];
float2 t2 = texCoord[2][0];
float2 t3 = texCoord[3][0];

// Edges
if (uv.y == 1) { t0 = texCoord[0][1]; t1 = texCoord[1][2]; }
if (uv.y == 0) { t2 = texCoord[2][1]; t3 = texCoord[3][2]; }
if (uv.x == 1) { t3 = texCoord[3][1]; t0 = texCoord[0][2]; }
if (uv.x == 0) { t1 = texCoord[1][1]; t2 = texCoord[2][2]; }

// Corners
if (bar.x == 1) t0 = texCoord[0][3];
if (bar.y == 1) t1 = texCoord[1][3];
if (bar.z == 1) t2 = texCoord[2][3];
if (bar.w == 1) t3 = texCoord[3][3];

float2 tc = bar.x * t0 + bar.y * t1 + bar.z * t2 + bar.w * t3;

And finally, the predicated assignments can be simplified and replaced by an index calculation as I proposed in my previous article:

// Compute texture coordinate indices (0: interior, 1,2: edges, 3: corner)
int idx0 = 2 * (uv.x == 1) + (uv.y == 1);
int idx1 = 2 * (uv.y == 1) + (uv.x == 0);
int idx2 = 2 * (uv.x == 0) + (uv.y == 0);
int idx3 = 2 * (uv.y == 0) + (uv.x == 1);

float2 tc = bar.x * texCoord[0][idx0] +
            bar.y * texCoord[1][idx1] +
            bar.z * texCoord[2][idx2] +
            bar.w * texCoord[3][idx3];
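To convince yourself that the index calculation really selects the same texture coordinates as the predicated assignments, here is a small Python sketch (illustrative only; each texture coordinate is modeled by a `(vertex, index)` tag) that compares the nonzero-weight terms both versions would blend, over a grid of parameter values:

```python
# Compare the predicated-assignment version with the index-calculation
# version for a quad patch. texCoord[v][i] is modeled as the tag (v, i).
def weights(u, v):
    return (u * v, (1 - u) * v, (1 - u) * (1 - v), u * (1 - v))

def terms_predicated(u, v, texCoord):
    bar = weights(u, v)
    t = [texCoord[i][0] for i in range(4)]                     # interior
    if v == 1: t[0] = texCoord[0][1]; t[1] = texCoord[1][2]    # edges
    if v == 0: t[2] = texCoord[2][1]; t[3] = texCoord[3][2]
    if u == 1: t[3] = texCoord[3][1]; t[0] = texCoord[0][2]
    if u == 0: t[1] = texCoord[1][1]; t[2] = texCoord[2][2]
    for i in range(4):                                         # corners
        if bar[i] == 1: t[i] = texCoord[i][3]
    return [(b, x) for b, x in zip(bar, t) if b != 0]

def terms_indexed(u, v, texCoord):
    bar = weights(u, v)
    idx = [2 * (u == 1) + (v == 1),
           2 * (v == 1) + (u == 0),
           2 * (u == 0) + (v == 0),
           2 * (v == 0) + (u == 1)]
    return [(bar[i], texCoord[i][idx[i]]) for i in range(4) if bar[i] != 0]

texCoord = [[(v, i) for i in range(4)] for v in range(4)]
for u in (0.0, 0.25, 0.5, 1.0):
    for v in (0.0, 0.5, 1.0):
        assert terms_predicated(u, v, texCoord) == terms_indexed(u, v, texCoord)
```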

The same idea also applies to triangles:

// Interior
float2 t0 = texCoord[0][0];
float2 t1 = texCoord[1][0];
float2 t2 = texCoord[2][0];

// Edges
if (bar.x == 0) { t1 = texCoord[1][1]; t2 = texCoord[2][2]; }
if (bar.y == 0) { t2 = texCoord[2][1]; t0 = texCoord[0][2]; }
if (bar.z == 0) { t0 = texCoord[0][1]; t1 = texCoord[1][2]; }

// Corners
if (bar.x == 1) t0 = texCoord[0][3];
if (bar.y == 1) t1 = texCoord[1][3];
if (bar.z == 1) t2 = texCoord[2][3];

float2 tc = bar.x * t0 + bar.y * t1 + bar.z * t2;

And the resulting code can be optimized the same way:

int idx0 = 2 * (bar.y == 0) + (bar.z == 0);
int idx1 = 2 * (bar.z == 0) + (bar.x == 0);
int idx2 = 2 * (bar.x == 0) + (bar.y == 0);

float2 tc = bar.x * texCoord[0][idx0] +
            bar.y * texCoord[1][idx1] +
            bar.z * texCoord[2][idx2];
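The same equivalence check works for triangles (again an illustrative Python sketch with `(vertex, index)` tags). The index expressions must follow the convention established by the predicated block above: each vertex uses edge index 1 for the edge leaving it and edge index 2 for the edge arriving at it:

```python
# Compare the predicated and indexed versions for a triangle patch,
# where bar is a barycentric triple summing to 1.
def terms_predicated(bar, texCoord):
    t = [texCoord[i][0] for i in range(3)]                           # interior
    if bar[0] == 0: t[1] = texCoord[1][1]; t[2] = texCoord[2][2]     # edges
    if bar[1] == 0: t[2] = texCoord[2][1]; t[0] = texCoord[0][2]
    if bar[2] == 0: t[0] = texCoord[0][1]; t[1] = texCoord[1][2]
    for i in range(3):                                               # corners
        if bar[i] == 1: t[i] = texCoord[i][3]
    return [(b, x) for b, x in zip(bar, t) if b != 0]

def terms_indexed(bar, texCoord):
    idx = [2 * (bar[1] == 0) + (bar[2] == 0),
           2 * (bar[2] == 0) + (bar[0] == 0),
           2 * (bar[0] == 0) + (bar[1] == 0)]
    return [(bar[i], texCoord[i][idx[i]]) for i in range(3) if bar[i] != 0]

texCoord = [[(v, i) for i in range(4)] for v in range(3)]
for bar in [(1.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.5, 0.5, 0.0),
            (0.5, 0.0, 0.5), (0.25, 0.25, 0.5), (0.0, 0.0, 1.0)]:
    assert terms_predicated(bar, texCoord) == terms_indexed(bar, texCoord)
```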

Approximate Subdivision Shading

Subdivision Shading is a new approach to compute normal fields of subdivision surfaces that was presented at SIGGRAPH Asia 2008.

Subdivision Shading

It’s a very simple idea that provides surprisingly good results. The idea is to interpolate the subdivision surface normals using the same procedure used for positions. The resulting normal field is not the actual surface normal, but looks smooth and doesn’t exhibit some of the artifacts characteristic of subdivision surfaces at the extraordinary vertices.

The main disadvantage is that it looks too smooth compared to the real surface normal, but I’m not sure that’s necessarily bad. To avoid that problem the paper suggests blending the surface normals and the interpolated vertex normals so that the interpolated normals are used only in the proximity of extraordinary vertices.

The same idea can also be applied to the Approximate Catmull-Clark Subdivision Surfaces (Bezier ACC) proposed by Loop and Schaefer. Instead of constructing the normal from the cross product of the tangent patches, the normal can be interpolated directly using the same approximation used to evaluate positions. The resulting surface has G1 discontinuities around extraordinary vertices in both the geometry and the normal field. However, I haven’t been able to notice any artifact due to that in any of our test models.

This approach is quite efficient. It requires the same number of control points as Bezier ACC, but only one half of the stencil weights, because positions and normals are evaluated exactly the same way. The evaluation of the surface normal itself is also more efficient; evaluating a single 4×4 Bezier patch is faster than evaluating two 3×4 Bezier patches and computing a cross product.

However, the nicest thing about this scheme is that it facilitates watertightness when using displacement maps.

At Gamefest 2008 I mentioned that in order to achieve watertight surfaces when using displacement maps it was necessary to:

a) Sample textures in a watertight way.
b) Construct a watertight normal field.

[[acc tangents]] (right aligned)

In order to obtain a watertight normal field, adjacent patches need to compute the normals along their edges consistently.

The approach proposed in the ACC paper produces a smooth and continuous normal field, but the normals at extraordinary vertices and along the edges that surround them are not consistent. Patches around the extraordinary vertices have tangents that lie on the same plane, but their cross product does not yield exactly the same normal.

There are several ways to solve that problem, but all of them are too complicated to cover in this post. On the other hand, the normal interpolation approach does not suffer from that problem and provides a much simpler solution.

Watertight Texture Sampling

One of the problems when implementing displacement mapping is dealing with texture seams.

Texture seams are discontinuities in the parameterization. These discontinuities are unavoidable unless the mesh has the topology of a disk; in general, meshes need to be partitioned into charts that are then parameterized independently.

[missing image]

This problem is not new. When using color and normal maps, texture seams usually cause color and lighting discontinuities. Artists typically deal with these artifacts by placing the seams in areas of the model where they are less visible. However, when using displacement maps, the seams cause holes in the mesh, which produce far more visible artifacts that are much harder to hide.

A Simple Solution

A simple solution to this problem is to manually alter the geometry along the seams to make it self intersect. While that does not result in watertight surfaces, the resulting holes are not visible. The main problem of this approach is that it requires artist intervention, creates open meshes, and only works for opaque surfaces; it would be nice to have more robust solutions.

Seamless Parameterization Methods

Fortunately this problem has been studied extensively and seamless parameterization methods have been developed. These are automatic parameterizations in which the chart texels are properly aligned across boundaries.

Displacement maps are generally not painted using traditional image editing applications, but created in specialized sculpting tools (such as ZBrush and Mudbox) or generated procedurally in attribute capture tools like xNormal. That means that it’s possible to store displacements using automatic parameterizations, since the artist does not need to edit the texture manually.

The most straightforward seamless parameterization method is the one used by ZBrush, which is very similar to the ones proposed in Meshed Atlases for Real-Time Procedural Solid Texturing. ZBrush maps every face of the mesh to a quad in texture space, so that all edges are either vertical or horizontal, and have the same length.
This method is very easy to implement, but has several problems:

  • It introduces a large number of seams in the mesh, which increases the number of vertex transforms.
  • It does not make efficient use of the texture space, because all faces, independently of their area, are mapped to quads of the same size.

ZBrush provides an option to group faces into larger charts while preserving the edge length and orientation, which helps reduce the number of vertex transforms. It also has an option to scale the charts in proportion to their surface area, but that breaks the seamlessness.

[missing image]

In order to alleviate these problems, another solution is to map rectangular charts (instead of single faces) to texture-space quads.

That was first proposed in Seamless Texture Atlases, where rectangular charts are created by first clustering polygons into arbitrary sided charts and then partitioning them into quadrilaterals using a Catmull-Clark-inspired scheme:

[missing image]

An entirely different approach is described in Spectral Surface Quadrangulation. In this paper a quadrangulation is constructed from the optimized Morse-Smale complex of the natural harmonics of the surface. The main limitation of this approach is that it only handles closed surfaces, and requires manual selection of the eigenfunction to produce charts of the desired size. Spectral Quadrangulation with Orientation and Alignment Control solves these problems and also adds support for explicit constraints.

[missing image]

These two methods create quadrangulations without T-junctions. That seems like a nice property, but Rectangular Multi-Chart Geometry Images shows that it’s possible to remove that constraint and still achieve smooth parameterizations with less distortion.

The most interesting method is probably the one described in Designing Quadrangulations with Discrete Harmonic Forms. This method is even more flexible; instead of defining charts before the parameterization, it introduces singularity points in the mesh and computes the parameterization globally. Then the mesh can be cut connecting the singularity points arbitrarily.

Other methods try to achieve continuity between patches using constraints and parameterizing adjacent patches simultaneously. That’s for example the case of Matchmaker: Constructing Constrained Texture Maps, but it only minimizes the discontinuities and does not fully remove them.

For more information about parameterization methods these two surveys are a great resource:

Watertightness and Floating Point Precision

Modern hardware uses a floating point representation to interpolate texture coordinates. However, with programmable tessellation hardware as specified by Direct3D 11, interpolation is performed explicitly in the domain shader (or in the vertex shader when using instanced tessellation on pre-Direct3D 11 GPUs). This is done to enable the use of higher order interpolation, but it also allows the use of fixed point interpolation.

[missing image]

That’s important, because the use of floating point for interpolation causes some problems. Floating point values have more precision closer to the origin than farther from it. As a result, interpolation of texture coordinates along an edge closer to the origin will produce a different set of samples than interpolation along an edge that is farther from it. This is exactly what happens on texture seams and will result in small cracks in the mesh even when using a seamless parameterization.

If that seems surprising, then I’d recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic.
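A small Python sketch illustrates the effect (illustrative only, not hardware-accurate): interpolating identical parametric fractions along two edges that span the same texture-space distance, one near the origin and one farther away, does not produce identical offsets from the edge start:

```python
# Two seam edges that are "the same" in texture space, but one lives near
# the origin and the other near u = 0.75.
def lerp(a, b, t):
    return a + t * (b - a)

near = [lerp(0.0, 0.25, k / 10) for k in range(11)]
far = [lerp(0.75, 1.0, k / 10) for k in range(11)]

offsets_near = [u - 0.0 for u in near]
offsets_far = [u - 0.75 for u in far]

# In exact arithmetic these would match; in floating point they do not,
# so the two sides of a seam can end up sampling different positions.
assert offsets_near != offsets_far
```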

It’s also important to note that a seamless parameterization and fixed point interpolation are not sufficient conditions to achieve watertightness. It’s also necessary to generate the displacement maps appropriately so that the values of the texels along the chart boundaries match exactly along both sides of the seams.

Zippering Methods

A different approach is to introduce a triangle strip connecting the vertices along the seam. These strips can be generated with the same tessellation unit used to generate the patches, by setting the tessellation level equal to 1 in one of the parametric directions. This solves the problem nicely, but requires rendering more patches, and introduces additional triangles that are almost degenerate.

[missing image]

Another interesting solution is the zippering method proposed in the Multi-Chart Geometry Images paper. The idea that the paper proposes is to sample the displacement (or the geometry image) on both sides of the seam and to use the average of the two samples.

The main problem with this approach is that it requires two texture samples along the seams, which means you either have to take two samples in all cases, or use branching to take the extra sample on seam vertices only.

The averaging method does not work for corners. Along the edges there are only two possible displacement values, one for each side of the seam. However, on corners there are more than two. Storing an arbitrary number of texture coordinates, and taking an arbitrary number of texture samples would be too expensive. A simple solution is to snap the corner texture coordinates to the nearest texel, and make sure that the displacement value for that vertex is the same for all patches that meet at that corner.
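The corner snapping can be sketched as follows (illustrative Python; I'm assuming texel centers at half-integer coordinates, the usual convention on modern hardware, and `size` is the texture resolution):

```python
# Snap a corner texture coordinate to the center of the texel that contains
# it, so that all patches meeting at the corner fetch exactly the same
# displacement value regardless of which side of the seam they are on.
def snap_to_texel_center(u, size):
    # Texel i covers [i/size, (i+1)/size); its center is (i + 0.5) / size.
    # Clamp so that u == 1.0 maps to the last texel instead of overflowing.
    i = min(int(u * size), size - 1)
    return (i + 0.5) / size

assert snap_to_texel_center(0.26, 4) == 0.375   # texel 1 of 4
assert snap_to_texel_center(1.0, 4) == 0.875    # clamped to texel 3
```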

A cheaper solution that requires only a single texture sample, and that handles corners more gracefully, is to define patch ownership of the seams. This is what I proposed at Gamefest and SIGGRAPH last year.

By designating the patch that owns every edge and corner, all patches can agree on what texture coordinate to use when sampling the displacement at those locations. That means that for every edge and every corner we need to store the texture coordinates of the feature’s owner. That’s a total of 4 texture coordinates per vertex (16 for quads and 12 for triangles).

At runtime, only a single texture sample is needed; the corresponding texture coordinate can be selected with a simple calculation:

// Compute texture coordinate indices (0: interior, 1,2: edges, 3: corner)
int idx0 = 2 * (uv.x == 1) + (uv.y == 1);
int idx1 = 2 * (uv.y == 1) + (uv.x == 0);
int idx2 = 2 * (uv.x == 0) + (uv.y == 0);
int idx3 = 2 * (uv.y == 0) + (uv.x == 1);

// Barycentric interpolation of texture coordinates
float2 tc = bar.x * texCoord[0][idx0] +
            bar.y * texCoord[1][idx1] +
            bar.z * texCoord[2][idx2] +
            bar.w * texCoord[3][idx3];

Partition of Unity

The zippering methods produce watertight results independently of the parameterization. However, if the parameterization is not seamless, or if the features of the displacement map are different on each side of the seam, the result will show sharp discontinuities: not holes, but undesirable creases along the seams.

[missing image]

These problems can be avoided using a seamless parameterization and generating the displacement maps making sure that the displacements match along the seams. However, another solution is to use a partition of unity.

A partition of unity is a method to combine multiple texture parameterizations to produce smooth transitions between them. The idea is to define transition regions around the seams, so that on those regions both parameterizations are used to sample the texture and the results are blended smoothly.

The zippering methods described before are just a special case of a partition of unity in which the blend function is the unit step function.
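As a sketch (illustrative Python, with scalar stand-ins for the two displacement samples), blending the two parameterizations with a smooth weight generalizes zippering, while a unit-step weight degenerates to picking one side exclusively:

```python
def step(x):
    # Unit step: 0 below the threshold, 1 at or above it.
    return 1.0 if x >= 0.5 else 0.0

def smoothstep(x):
    # C1-continuous ramp on [0, 1].
    x = min(max(x, 0.0), 1.0)
    return x * x * (3 - 2 * x)

def blend(sample_a, sample_b, w, f):
    # Partition of unity: the weights f(w) and 1 - f(w) always sum to 1.
    return (1 - f(w)) * sample_a + f(w) * sample_b

# With the unit step, one side's sample is used exclusively (zippering).
assert blend(2.0, 6.0, 0.3, step) == 2.0
assert blend(2.0, 6.0, 0.7, step) == 6.0

# With smoothstep, the transition region mixes the two samples smoothly.
assert blend(2.0, 6.0, 0.5, smoothstep) == 4.0
```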


There are many different solutions to achieve watertightness when sampling displacement maps. I personally prefer the zippering methods, since they don’t impose any restriction on the parameterization of the mesh, and work with arbitrary displacement maps. They are easy to implement, and do not add much overhead to the shaders, even though they increase the number of texture coordinates. Note that even when using zippering methods to guarantee watertightness, the use of seamless (or nearly seamless) parameterizations is still useful, because they eliminate any visible crease or discontinuity along the seams.

Posted 1/1/2009 at 12:51 pm | Permalink
Very awesome post !!

It’s a shame that texture coordinates are true floats and not something like 16.16 fixed point. It basically forces you to use zippering. If you had fixed point you could make quadrilateral charts and put the corners exactly on pixels and make the edges the same length and I believe that would work.

Posted 1/1/2009 at 4:07 pm | Permalink
There’s actually nothing stopping you from representing texture coordinates using fixed point. The floating point precision problems only matter in the domain shader, where two instances of the same vertex at opposite sides of the seam must sample the same value from the texture. In the domain shader interpolation is done explicitly, and the math can be done in fixed point.

However, the problem is a bit more complicated than what I’ve described in the article. Hardware bilinear interpolation of texels is not symmetric. Sampling at 0.7 between A and B is not the same as sampling at 0.3 between B and A. Nearest filtering is not symmetric either, the result of sampling at 0.5 is undefined.
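The asymmetry is easy to reproduce even in plain floating point (illustrative Python; actual hardware filtering typically quantizes the weights further, to low-precision fixed point). The same seam point sits at t = 0.7 in one patch's parameterization and at t = 0.3 in the reversed one:

```python
# "Sampling at t between A and B", computed from each side of the seam.
def lerp(a, b, t):
    return a * (1 - t) + b * t

A, B = 1.0, 0.0

forward = lerp(A, B, 0.7)    # one side parameterizes the edge as A -> B
backward = lerp(B, A, 0.3)   # the other side parameterizes it as B -> A

# In exact arithmetic these are identical; in floating point they are not.
assert forward != backward
```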

For this reason mapping the faces to equal-sized texture-space quads does not solve the problem entirely. It would also be necessary to make sure that seam edge vectors have the same sense.

All that is doable, but the zippering method is so much simpler and more robust that I think it is preferable in most cases.

Posted 1/1/2009 at 4:29 pm | Permalink
BTW, all that becomes even more complicated if you take mipmapping into account.

Posted 2/1/2009 at 5:29 pm | Permalink
Urg I wrote something but didn’t answer the challenge and when I hit back my comment was gone. >:|

Posted 2/1/2009 at 7:16 pm | Permalink
Ouch, that sucks. Posts that fail the challenge are discarded and do not go to the moderation queue, so I have no way of recovering it…