{"id":34,"date":"2009-02-02T23:48:35","date_gmt":"2009-02-03T07:48:35","guid":{"rendered":"http:\/\/www.ludicon.com.php5-14.dfw1-1.websitetestlink.com\/castano\/blog\/?p=34"},"modified":"2019-01-17T23:43:33","modified_gmt":"2019-01-18T07:43:33","slug":"optimal-grid-rendering","status":"publish","type":"post","link":"http:\/\/www.ludicon.com\/castano\/blog\/2009\/02\/optimal-grid-rendering\/","title":{"rendered":"Optimal Grid Rendering"},"content":{"rendered":"<p>This is a trick that I learned from <a href=\"http:\/\/www.linkedin.com\/pub\/matthias-wloka\/0\/376\/b86\">Matthias Wloka<\/a> during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn\u2019t good enough. I\u2019m sure many people don\u2019t know this either, so here it goes.<\/p>\n<p>Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object.<\/p>\n<p>It\u2019s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in <a href=\"http:\/\/iparla.labri.fr\/publications\/2005\/BS05\/\">Generic Mesh Refinement on GPU<\/a> and at NVIDIA we also have examples that show how to do that in <a href=\"http:\/\/developer.download.nvidia.com\/SDK\/10.5\/opengl\/samples.html#instanced_tessellation\">OpenGL<\/a> and <a href=\"http:\/\/developer.download.nvidia.com\/SDK\/10.5\/direct3d\/samples.html#InstancedTessellation\">Direct3D<\/a>.<\/p>\n<p>That\u2019s enough for the motivation. Imagine we have a 4\u00d74 grid. 
The first two rows would look like this:<\/p>\n<pre>\r\n* - * - * - * - *\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n<\/pre>\n<p>With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:<\/p>\n<pre>\r\n7 - 5 - 3 - 1 - *\r\n| \/ | \/ | \/ | \/ |\r\n6 - 4 - 2 - 0 - *\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n<\/pre>\n<p>And after the next two triangles:<\/p>\n<pre>\r\n* - 7 - 5 - 3 - 1\r\n| \/ | \/ | \/ | \/ |\r\n* - 6 - 4 - 2 - 0\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n<\/pre>\n<p>Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles, two of the vertices that were previously in the cache need to be loaded again:<\/p>\n<pre>\r\n* - * - * - 7 - 5\r\n| \/ | \/ | \/ | \/ |\r\n3 - 1 - * - 6 - 4\r\n| \/ | \/ | \/ | \/ |\r\n2 - 0 - * - * - *\r\n<\/pre>\n<p>Instead of using the straightforward traversal, it\u2019s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.<\/p>\n<p>All these options are better than not doing anything, but they still produce results that are far from optimal. The table below shows the results obtained for a 16\u00d716 grid with a FIFO cache with 20 entries:<\/p>\n<pre>\r\nMethod          ACMR    ATVR\r\nScanline        1.062   1.882\r\nNVTriStrip      0.818   1.450\r\nMorton          0.719   1.273\r\nK-Cache-Reorder 0.711   1.260\r\nHilbert         0.699   1.239\r\nForsyth         0.666   1.180\r\nTipsy           0.658   1.166\r\nOptimal         0.564   1.000\r\n<\/pre>\n<p>Note that I\u2019m using my own implementation for all of these methods. 
Results obtained with the original authors\u2019 code might therefore differ slightly.<\/p>\n<p>The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles.<\/p>\n<p>When traversing triangles in scanline order, the cache interleaves vertices from the first and second rows. However, we can avoid that by prefetching the first row of vertices:<\/p>\n<pre>\r\n4 - 3 - 2 - 1 - 0\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n|   |   |   |   |\r\n* - * - * - * - *\r\n<\/pre>\n<p>That can be done by issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding the triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:<\/p>\n<pre>\r\n* - 7 - 6 - 5 - 4\r\n| \/ | \/ | \/ | \/ |\r\n3 - 2 - 1 - 0 - *\r\n| \/ | \/ | \/ | \/ |\r\n* - * - * - * - *\r\n<\/pre>\n<p>In general, the minimum cache size needed to render a W*W grid without transforming any vertex multiple times is W+2.<\/p>\n<p>The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small, you also have to split the grid into smaller sections and apply this method to each of them. 
The following code accomplishes that:<\/p>\n<pre>\r\nvoid gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)\r\n{\r\n    if (x1 - x0 + 1 < cacheSize)\r\n    {\r\n        \/\/ A row of vertices fits in the cache. If two rows do not fit,\r\n        \/\/ prefetch the top row using degenerate triangles. These indices\r\n        \/\/ assume y0 == 0, which holds because we only split along x.\r\n        if (2 * (x1 - x0) + 1 > cacheSize)\r\n        {\r\n            for (int x = x0; x < x1; x++)\r\n            {\r\n                indices.push_back(x + 0);\r\n                indices.push_back(x + 0);\r\n                indices.push_back(x + 1);\r\n            }\r\n        }\r\n\r\n        \/\/ Emit the two triangles of each quad in scanline order.\r\n        for (int y = y0; y < y1; y++)\r\n        {\r\n            for (int x = x0; x < x1; x++)\r\n            {\r\n                indices.push_back((width + 1) * (y + 0) + (x + 0));\r\n                indices.push_back((width + 1) * (y + 1) + (x + 0));\r\n                indices.push_back((width + 1) * (y + 0) + (x + 1));\r\n\r\n                indices.push_back((width + 1) * (y + 0) + (x + 1));\r\n                indices.push_back((width + 1) * (y + 1) + (x + 0));\r\n                indices.push_back((width + 1) * (y + 1) + (x + 1));\r\n            }\r\n        }\r\n    }\r\n    else\r\n    {\r\n        \/\/ The rows are too wide for the cache: split the grid into\r\n        \/\/ vertical strips and generate each strip recursively.\r\n        int xm = x0 + cacheSize - 2;\r\n        gridGen(x0, xm, y0, y1, width, cacheSize);\r\n        gridGen(xm, x1, y0, y1, width, cacheSize);\r\n    }\r\n}\r\n<\/pre>\n<p>This may not be the optimal grid partition, but the method still performs pretty well in those cases. 
Here are the results for a cache with 16 entries:<\/p>\n<pre>\r\nMethod           ACMR   ATVR\r\nScanline         1.062  1.882\r\nNVTriStrip       0.775  1.374\r\nK-Cache-Reorder  0.766  1.356\r\nHilbert          0.754  1.336\r\nMorton           0.750  1.329\r\nTipsy            0.711  1.260\r\nForsyth          0.699  1.239\r\nOptimal          0.598  1.059\r\n<\/pre>\n<p>And for a cache with only 12 entries:<\/p>\n<pre>\r\nMethod           ACMR   ATVR\r\nScanline         1.062  1.882\r\nNVTriStrip       0.875  1.550\r\nForsyth          0.859  1.522\r\nMorton           0.812  1.439\r\nK-Cache-Reorder  0.807  1.430\r\nHilbert          0.797  1.412\r\nTipsy            0.758  1.343\r\nOptimal          0.600  1.062\r\n<\/pre>\n<p>In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would be interesting to take some of these observations into account in a general mesh optimization algorithm.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn\u2019t good enough. I\u2019m sure many people don\u2019t know this either, so here it goes. 
Rendering regular grids&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[],"class_list":["post-34","post","type-post","status-publish","format-standard","hentry","category-coding"],"_links":{"self":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/34","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/comments?post=34"}],"version-history":[{"count":8,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/34\/revisions"}],"predecessor-version":[{"id":1074,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/34\/revisions\/1074"}],"wp:attachment":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/media?parent=34"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/categories?post=34"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/tags?post=34"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}