Optimal Grid Rendering

This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn’t good enough. I’m sure many people don’t know this either, so here it goes.

Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object.

It’s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.

That’s enough for the motivation. Imagine we have a 4×4 grid. The first two rows would look like this:

* - * - * - * - *
| / | / | / | / |
* - * - * - * - *
| / | / | / | / |
* - * - * - * - *

With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:

7 - 5 - 3 - 1 - *
| / | / | / | / |
6 - 4 - 2 - 0 - *
| / | / | / | / |
* - * - * - * - *

And after the next two triangles:

* - 7 - 5 - 3 - 1
| / | / | / | / |
* - 6 - 4 - 2 - 0
| / | / | / | / |
* - * - * - * - *

Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles two of the vertices that were previously in the cache need to be loaded again:

* - * - * - 7 - 5
| / | / | / | / |
3 - 1 - * - 6 - 4
| / | / | / | / |
2 - 0 - * - * - *

Instead of using the straightforward traversal, it’s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.

All these options are better than not doing anything, but still produce results that are far from the optimal. In the table below you can see the results obtained for a 16×16 grid and with a FIFO cache with 20 entries:

Method          ACMR    ATVR
Scanline        1.062   1.882
NVTriStrip      0.818   1.450
Morton          0.719   1.273
K-Cache-Reorder	0.711   1.260
Hilbert         0.699   1.239
Forsyth         0.666   1.180
Tipsy           0.658   1.166
Optimal         0.564   1.000

Note that I’m using my own implementation for all of these methods. So, the results with the code from the original author might differ slightly.

The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles.

When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:

4 - 3 - 2 - 1 - 0
| / | / | / | / |
* - * - * - * - *
|   |   |   |   |
* - * - * - * - *

That can be done issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:

* - 7 - 6 - 5 - 4
| / | / | / | / |
3 - 2 - 1 - 0 - *
| / | / | / | / |
* - * - * - * - *

In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2.

The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them. The following code accomplishes that:

void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)
{
    if (x1 - x0 + 1 < cacheSize)
    {
        if (2 * (x1 - x0) + 1 > cacheSize)
        {
            for (int x = x0; x < x1; x++)
            {
                indices.push_back(x + 0);
                indices.push_back(x + 0);
                indices.push_back(x + 1);
            }
        }

        for (int y = y0; y < y1; y++)
        {
            for (int x = x0; x < x1; x++)
            {
                indices.push_back((width + 1) * (y + 0) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 0) + (x + 1));

                indices.push_back((width + 1) * (y + 0) + (x + 1));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 1));
            }
        }
    }
    else
    {
        int xm = x0 + cacheSize - 2;
        gridGen(x0, xm, y0, y1, width, cacheSize);
        gridGen(xm, x1, y0, y1, width, cacheSize);
    }
}

This may not be the most optimal grid partition, but the method still performs pretty well in those cases. Here are the results for a cache with 16 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.775  1.374
K-Cache-Reorder  0.766  1.356
Hilbert          0.754  1.336
Morton           0.750  1.329
Tipsy            0.711  1.260
Forsyth          0.699  1.239
Optimal          0.598  1.059

And for a cache with only 12 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.875  1.550
Forsyth          0.859  1.522
K-Cache-Reorder  0.807  1.491
Morton           0.812  1.439
Hilbert          0.797  1.412
Tipsy            0.758  1.343
Optimal          0.600  1.062

In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would interesting to take into account some of these observations in a general mesh optimization algorithm.

This entry was posted in Graphics. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

9 Comments

  1. cb
    Posted February 2, 2009 at 8:28 am | Permalink

    Good stuff. I guess this is because it’s an unusual cache that doesn’t actually LRU – it doesn’t promote vertices when you use ones existing in the cache.

    One thing I like to see is what’s the behavior when you optimize for the wrong cache size. eg. if you optimize for 8-member cache then run on 16-member cache, how does it compare to Forsyth?

    • Posted February 2, 2009 at 9:19 am | Permalink

      If I optimize for 8, but run it on a 16 entry cache I get:

      Optimal ATVR: 1.125
      Forsyth’s ATVR: 1.433

      While if the cache has only 8-entries:

      Optimal ATVR: 1.125
      Forsyth’s ATVR: 1.547

      The proposed method wins in both cases, but increases in the size of the actual cache do not improve the result. It’s possible that at some extreme configurations this method might lose.

  2. Thatcher
    Posted March 23, 2009 at 7:29 pm | Permalink

    Ha, that’s cool.

  3. alpha2
    Posted May 26, 2010 at 2:50 am | Permalink

    Nice, article. Is there any way to know the cache size for specific gpu?

  4. Remage
    Posted April 9, 2011 at 12:45 am | Permalink

    alpha2: D3DQUERYTYPE_VCACHE (late reply, but I was looking for the same info…)

  5. SLeo
    Posted August 30, 2011 at 10:04 am | Permalink

    Seems this code generates triangle list. Do you have code for triangle strip?

    • Posted September 1, 2011 at 8:12 am | Permalink

      No, but it should be trivial to generate a triangle strip instead. That is left as an exercise. Feel free to share your implementation!

  6. SLeo
    Posted September 4, 2011 at 5:34 am | Permalink

    Ok, gridGen(0, mWidth-1, 0, mHeight-1, mWidth, vcache_size);

    void GridVertexBufferRenderable::gridGen(int x0, int x1, int y0, int y1, int stride, int cacheSize)
    {
        if (x1 - x0 + 2 = cacheSize)
            {
                *mIndexesFill++ = x0;
                for (int x = x0; x <= x1; x++)
                {
                    *mIndexesFill++ = x;
                }
            }
    
            for (int y = y0; y < y1; y++)
            {
    #ifdef RESTART_INDEX
    			*mIndexesFill++ = RESTART_INDEX;
    #else
    			*mIndexesFill++ = stride * (y + 0) + x1;
    			*mIndexesFill++ = stride * (y + 0) + x0;
    #endif
    
                for (int x = x0; x <= x1; x++)
                {
                    *mIndexesFill++ = stride * (y + 0) + x;
                    *mIndexesFill++ = stride * (y + 1) + x;
                }
            }
        }
        else
        {
            int xm = x0 + cacheSize - 2;
            gridGen(x0, xm, y0, y1, stride, cacheSize);
            gridGen(xm, x1, y0, y1, stride, cacheSize);
        }
    }
    

    • Posted September 5, 2011 at 7:57 am | Permalink

      Looks good, but note that when prewarming the cache you should use degenerate triangles, and in your code, only the first triangle is degenerate.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Subscribe without commenting