Salmorejo means food in summer

I don’t know what’s going on with me lately. Mariana and Nachito are in Brazil visiting Mariana’s family, so that leaves me with plenty of free time. I finally got an epidural injection, so physically I’m feeling great. In the past I would have taken this time to code like crazy and get lots of things done, but things are going much slower than I was hoping. Maybe it’s the heat of the summer, or that now that my body feels fine, I need some time to enjoy it and relax.

Anyhow, today it’s time for some more food blogging, and now that we are in the midst of summer there’s nothing better than Salmorejo. As soon as tomato season starts, this is something that I have for lunch or dinner almost every day. It’s easy to cook, it’s fast, it’s refreshing, and it’s filling.

Continue reading →

Efficient Substitutes for Subdivision Surfaces

Our SIGGRAPH course has been accepted and is finally up on the SIGGRAPH website! Here’s the brief description of the course:

An overview of the most recent theoretical results and their implementations on the current and next-generation GPUs, and a demonstration of applications in the gaming and movie industry.

We will bring together participants with different backgrounds: ISVs from the game and movie industry, IHVs, and academia. Each will provide their own point of view on the topic and describe their own experience with these methods. I think that will be very valuable for understanding the weaknesses and strengths of these techniques, and will hopefully encourage other developers to adopt them.

NVTT quote

I love this quote from Ivan-Assen Ivanov about the NVIDIA Texture Tools:

We switched to CUDA nvcompress in our pipeline and totally forgot
about it. Until yesterday, when an artist had to work on a 7600 for a
while, and raised hell about how slow the export has become.


Standing Desk 2

So, I finally bought the Ikea standing desk and I’ve been using it full time during this week.

As many predicted, my feet hurt, and at the end of the day I feel extremely tired, almost as if I had been hiking, which is actually quite a nice feeling. For once I now get to sleep before midnight and wake up early in the morning. I’ve realized I have to do stretching exercises to prevent cramps during the night. I suppose my body will end up getting used to it.

My doctor has finally prescribed me some stronger drugs. I’m now taking Prednisone for the inflammation and Vicodin for the pain. During the first few days the Vicodin eliminated the pain entirely; after weeks of constant pain with some acute episodes, that was quite a relief. I felt energized and euphoric; suddenly I could walk normally, I could even run, jump, speed up my bike, and get some adrenaline flowing!

Unfortunately the body builds tolerance fairly quickly, so I had to reduce the dose. However, I think the inflammation is finally receding; the pain is still there, but it’s just a mere discomfort. I think the results of the physiotherapy are also starting to show. The S shape of my spine has been corrected; it feels much straighter now. My abs are stronger than ever. I’m not sure that has been of any help with the pain, but at least it will hopefully prevent more damage in the future.

NVIDIA Texture Tools 2.0.6

Yesterday I released another revision of the stable branch of the NVIDIA Texture Tools. I recommend that everybody upgrade, since it fixes some bugs and artifacts, and improves compatibility with current and future CUDA drivers. Starting with this release I’m planning to provide a verbose description of the changes in each release. So, here it goes:

In a multi-GPU environment NVTT will now use the fastest GPU available. This is quite useful on systems that have an embedded and a discrete GPU. You certainly don’t want to use the embedded one while the discrete GPU is idle.

Using recent CUDA runtimes with older CUDA drivers usually causes problems. To avoid that, NVTT now determines the version of the available CUDA driver and makes sure it’s compatible with the CUDA runtime that was used to compile it.

The NVTT shutdown code did not destroy the CUDA context properly, which caused problems the second time you tried to create a context in the same thread. The behavior on 2.0 drivers was to reuse the context already created, but CUDA 2.1 produces an error instead. This is fixed now.

Note however, that you are not allowed to create multiple compressors in the same thread simultaneously. This is a limitation in the CUDA runtime API that transparently creates a CUDA context per thread. In the future I’m planning to use the driver API directly, which should remove this restriction.

There have been reports of the CUDA compressor not producing correct output when processing 1×1 mipmaps. I was able to reproduce the problem on old drivers, but it seems to be fixed now. In any case, I realized that using the entire GPU to compress a 1×1 mipmap did not make any sense, so now I’m compressing small mipmaps on the CPU instead. That is slightly faster and avoids the aforementioned problem.

At the request of the Wolfire guys, NVTT now scales images with alpha by weighting the color channels by the alpha component. While a simple workaround to obtain the right results was to use premultiplied alpha, there was no reason not to do the right thing in all cases.
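As an illustration of that change, here’s a minimal sketch of alpha-weighted downsampling. The names and the simplified box filter are mine, not the actual NVTT source:

```cpp
#include <cassert>

// Sketch of alpha-weighted downsampling: when averaging a group of
// pixels, weight each color channel by that pixel's alpha so that
// fully transparent texels do not bleed into the result. The alpha
// channel itself is reduced with a plain average.
struct Pixel { float r, g, b, a; };

Pixel downsample(const Pixel *src, int count)
{
    float r = 0, g = 0, b = 0, a = 0, wsum = 0;
    for (int i = 0; i < count; i++) {
        float w = src[i].a; // alpha is the weight
        r += src[i].r * w;
        g += src[i].g * w;
        b += src[i].b * w;
        a += src[i].a;
        wsum += w;
    }
    if (wsum > 0) { r /= wsum; g /= wsum; b /= wsum; }
    a /= count;
    return Pixel{ r, g, b, a };
}
```

With premultiplied alpha the weighting falls out naturally, which is why premultiplication was a valid workaround before this change.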

As mentioned in my previous post, I’ve updated the single color compression tables to take into account the possible hardware decoding errors.

There was also a bug in the decompression of some RGB DDS files that is now fixed.

Progress on the 2.1 release is slow but steady. The target features are almost complete, but I’ve been cleaning things up, removing a lot of cruft, and simplifying the code in order to make it more accessible and encourage more contributions. I’m also working on a lower-level API that should be more flexible and easier to maintain. More on that in a future post.

GPU DXT Decompression

Even though the DXT compression format is widely adopted, most hardware implementations use slightly different decompression algorithms.

The problem is that there never was a specification that described the preferred implementation. The DXT format was first documented by S3 in US patent 5956431. However, in order to cover all possible implementations, the description in the patent is very generic. It does not specify when or whether the colors are bit-expanded, or how the palette entries are interpolated. In one of the proposed embodiments they approximate the 1/3 and 2/3 terms as 3/8 and 5/8, which would have been a good idea if everybody had chosen the same approximation. However, different implementations approximated the expensive division in different ways.
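To make the ambiguity concrete, here’s a small sketch comparing the ideal 1/3-2/3 interpolation with the 3/8-5/8 approximation for one palette entry, after bit-expanding a 5-bit channel. The helper names are mine, not from the patent or any spec:

```cpp
#include <cassert>

// Expand a 5-bit color channel to 8 bits by replicating the top bits.
int expand5(int c) { return (c << 3) | (c >> 2); }

// Third palette entry (2/3 c0 + 1/3 c1), ideal integer interpolation.
int lerpIdeal(int c0, int c1) { return (2 * c0 + c1) / 3; }

// Same entry with the patent's approximation: 2/3 -> 5/8, 1/3 -> 3/8.
int lerpS3(int c0, int c1) { return (5 * c0 + 3 * c1) / 8; }
```

The two variants differ by only a couple of units, which is precisely the kind of small decoding discrepancy that varies from one piece of hardware to another.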

Continue reading →

Ham and Beans

I’ve been thinking about writing about cooking in this blog for a while. Not just to share what I’ve learned, because, well, I don’t have much to share, but to get other people’s ideas and suggestions. While I may cook better than the average American and may be able to impress my wife and friends, I’m not really such a good cook. I know how to follow recipes, and I’m starting to get some sense of what works and what does not, but I have a lot to learn, which is actually fascinating; there’s a whole world of flavors out there and exploring it is a very rewarding experience.

It took me a while to get used to cooking in the US. Spanish cooking is not very sophisticated, but it’s based on seasonal dishes and the use of high-quality ingredients available locally. Not the kind of thing you find in the US at your nearest supermarket.

This is all changing very quickly, but many of the traditions still remain. I for one grew up without paying too much attention to food. I just thought it was fuel for my body. It was only when I moved to the US that I started to appreciate what I used to have and now missed.

The Mediterranean diet is mostly composed of olive oil, bread, fresh fruit, fish, and vegetables. However, I grew up in a family that runs a pork farm and a small meat production and distribution business. So, I had a larger share of meat than average. Despite that, meat was most often used as a condiment or in small dishes, except on special occasions.

[missing image – Iberian pigs in the Dehesa]

Our pork is unlike any other. Most of it is free-range and acorn-fed, which gives the meat a very distinct flavor. It is generally destined for the production of the finest-quality deli meats: cured ham, loins, and sausages.

As far as I know, it’s not possible to find anything like that in the US. Exports of Iberian ham are legal now, but the FDA requires production facilities to be periodically inspected and approved. Small farmers cannot really afford to comply with all the FDA’s requirements, so the only ham you can usually find in the US is from larger, lower-quality producers.

To add insult to injury, the low-quality ham that is available here is sold at outrageous prices, as if it were a high-quality product! If you are not discouraged yet, you can find some of it at the Spanish Table, or at online retailers such as La Tienda.

I would not recommend any of those options. Instead, what I do is bring some of our ham with us every time we travel to Spain. This is actually illegal, but it’s usually safe. However, once we tried to bring an entire leg and, much to our dismay, it was confiscated by the border officers. Since then, I only bring small quantities that are hardly noticeable on the scanner.

Rather than eating all of it right away, I generally save it for months, waiting for the right opportunity to enjoy it. Last weekend that opportunity presented itself in the form of fresh fava beans:

Fava beans are delicious eaten raw, or simply sautéed in olive oil with red onions. However, a little bit of Iberian Ham turns that delicious dish into a delicacy.

The preparation is trivial:

  • Remove the beans from their pods. It’s possible to also remove the beans from the shells by cooking them in boiling water for a short period of time and cooling them quickly in cold water. I personally don’t bother with that and leave the beans in their shells.
  • Pour a small amount of olive oil on a sauté pan on high heat. Chop the onions and the ham, and cook until flagrant.
  • Then add the beans, season to taste with salt and freshly ground pepper, and cook lightly for a few minutes while stirring often.

Let me know if you like posts like this, and if so, in future posts I’ll continue rambling about my quest to adapt the traditional Spanish recipes to the resources available in the US.


Posted 18/5/2009 at 4:04 am | Permalink
Habas con jamón…. Yummy!

Did I tell you I bought a bread machine? I plan to make sourdough bread as soon as I’m able to grow the culture.

Posted 18/5/2009 at 8:07 am | Permalink
Yeah, fantastic post, more!

Posted 18/5/2009 at 2:54 pm | Permalink
¡Habas con jamón! ¡Rico, rico!
Well done! I have the same problem finding cured ham in England, and I also follow the same procedure to bring some here, although I haven’t tried yet to bring a whole leg :D

Chad Austin
Posted 18/5/2009 at 4:46 pm | Permalink
I love all food, so more!

Posted 18/5/2009 at 7:02 pm | Permalink
Thanks for all the feedback! I think that next I’ll share my technique to make a basic pâté.

Posted 19/5/2009 at 8:47 pm | Permalink
Great post, thanks! BTW, I like your typo “cook until flagrant”, it makes the meal seem much more exotic :) If you ever find a source for good quality hams please post it!

Optimal Grid Rendering

This is a trick that I learned from Matthias Wloka during my job interview at NVIDIA. I thought I had a good understanding of the behavior of the vertex post-transform cache, but embarrassingly it turned out it wasn’t good enough. I’m sure many people don’t know this either, so here it goes.

Rendering regular grids of triangles is common enough to make it worth spending some time thinking about how to do it most efficiently. They are used to render terrains, water effects, curved surfaces, and in general any regularly tessellated object.

It’s possible to simulate the native hardware tessellation by rendering a single grid multiple times, and the fastest way of doing that is using instancing. That idea was first proposed in Generic Mesh Refinement on GPU and at NVIDIA we also have examples that show how to do that in OpenGL and Direct3D.

That’s enough for the motivation. Imagine we have a 4×4 grid. The first two rows would look like this:

* - * - * - * - *
| / | / | / | / |
* - * - * - * - *
| / | / | / | / |
* - * - * - * - *

With a vertex cache with 8 entries, the location of the vertices after rendering the first 6 triangles of the first row should be as follows:

7 - 5 - 3 - 1 - *
| / | / | / | / |
6 - 4 - 2 - 0 - *
| / | / | / | / |
* - * - * - * - *

And after the next two triangles:

* - 7 - 5 - 3 - 1
| / | / | / | / |
* - 6 - 4 - 2 - 0
| / | / | / | / |
* - * - * - * - *

Notice that the first two vertices are no longer in the cache. As we proceed to the next two triangles, two of the vertices that were previously in the cache need to be loaded again:

* - * - * - 7 - 5
| / | / | / | / |
3 - 1 - * - 6 - 4
| / | / | / | / |
2 - 0 - * - * - *

Instead of using the straightforward traversal, it’s possible to traverse the triangles in Morton or Hilbert order, which are known to have better cache behavior. Another possibility is to feed the triangles to any of the standard mesh optimization algorithms.

All these options are better than doing nothing, but they still produce results that are far from optimal. In the table below you can see the results obtained for a 16×16 grid with a FIFO cache with 20 entries:

Method          ACMR    ATVR
Scanline        1.062   1.882
NVTriStrip      0.818   1.450
Morton          0.719   1.273
K-Cache-Reorder 0.711   1.260
Hilbert         0.699   1.239
Forsyth         0.666   1.180
Tipsy           0.658   1.166
Optimal         0.564   1.000

Note that I’m using my own implementation of all of these methods, so the results with the code from the original authors might differ slightly.

The most important observation is that, for every row of triangles, the only vertices that are reused are the vertices that are at the bottom of the triangles, and these are the vertices that we would like to have in the cache when rendering the next row of triangles.

When traversing triangles in scanline order the cache interleaves vertices from the first and second row. However, we can avoid that by prefetching the first row of vertices:

4 - 3 - 2 - 1 - 0
| / | / | / | / |
* - * - * - * - *
|   |   |   |   |
* - * - * - * - *

That can be done by issuing degenerate triangles. Once the first row of vertices is in the cache, you can continue adding triangles in scanline order. The cool thing now is that the vertices that leave the cache are always vertices that are not going to be used anymore:

* - 7 - 6 - 5 - 4
| / | / | / | / |
3 - 2 - 1 - 0 - *
| / | / | / | / |
* - * - * - * - *

In general, the minimum cache size to render a W*W grid without transforming any vertex multiple times is W+2.

The degenerate triangles have a small overhead, so you also want to avoid them when the cache is sufficiently large to store two rows of vertices. When the cache is too small you also have to split the grid into smaller sections and apply this method to each of them. The following code accomplishes that:

// Appends the indices for the grid section [x0,x1] x [y0,y1] to a
// global index buffer 'indices' (e.g. a std::vector<int>).
void gridGen(int x0, int x1, int y0, int y1, int width, int cacheSize)
{
    if (x1 - x0 + 1 < cacheSize)
    {
        if (2 * (x1 - x0) + 1 > cacheSize)
        {
            // Prefetch the first row of vertices with degenerate triangles.
            for (int x = x0; x < x1; x++)
            {
                indices.push_back(x + 0);
                indices.push_back(x + 0);
                indices.push_back(x + 1);
            }
        }

        for (int y = y0; y < y1; y++)
        {
            for (int x = x0; x < x1; x++)
            {
                indices.push_back((width + 1) * (y + 0) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 0) + (x + 1));

                indices.push_back((width + 1) * (y + 0) + (x + 1));
                indices.push_back((width + 1) * (y + 1) + (x + 0));
                indices.push_back((width + 1) * (y + 1) + (x + 1));
            }
        }
    }
    else
    {
        // The row does not fit in the cache: split the grid in two
        // sections and process each of them recursively.
        int xm = x0 + cacheSize - 2;
        gridGen(x0, xm, y0, y1, width, cacheSize);
        gridGen(xm, x1, y0, y1, width, cacheSize);
    }
}

This may not be the optimal grid partition, but the method still performs pretty well in those cases. Here are the results for a cache with 16 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.775  1.374
K-Cache-Reorder  0.766  1.356
Hilbert          0.754  1.336
Morton           0.750  1.329
Tipsy            0.711  1.260
Forsyth          0.699  1.239
Optimal          0.598  1.059

And for a cache with only 12 entries:

Method           ACMR   ATVR
Scanline         1.062  1.882
NVTriStrip       0.875  1.550
Forsyth          0.859  1.522
K-Cache-Reorder  0.807  1.491
Morton           0.812  1.439
Hilbert          0.797  1.412
Tipsy            0.758  1.343
Optimal          0.600  1.062

In all cases, the proposed algorithm is significantly faster than the other approaches. In the future it would be interesting to take some of these observations into account in a general mesh optimization algorithm.


I think that the Average Cache Miss Ratio (ACMR) is not the best way of measuring the efficiency of the vertex cache under different mesh optimization algorithms. The ACMR is basically the number of vertex transforms divided by the number of primitives, and greatly depends on the topology of the mesh. For a triangle mesh the theoretical optimal is 0.5, but if the mesh has boundaries, or disconnected triangles, then the optimal value is much higher. So, you have to compare the ACMR against the vertex to triangle ratio of the mesh.

I think that a better way of measuring how well a mesh optimization algorithm or a cache implementation performs is the average transform to vertex ratio (ATVR), that is, the number of vertex transforms divided by the number of vertices. In the optimal case it’s always 1, independently of the mesh and the primitive type. I think it provides a much more intuitive idea of the cost of rendering a mesh. For example, if the transform to vertex ratio is 2, it means that each vertex is transformed an average of 2 times.

Here’s another example: In his article about Vertex Cache Optimisation, Tom Forsyth has to clarify that the results for the GrannyRocks mesh (between 0.797 and 0.765) are actually quite good, because the best ACMR possible is 0.732. That clarification would not be necessary when using the metric that I propose. Instead, you would say that the transform to vertex ratio is between 1.09 and 1.05, and it’s instantly obvious that it’s a good result.
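To make the metrics concrete, here’s a small FIFO vertex-cache simulator, a sketch of the kind of tool used to produce the tables above rather than the exact code. Divide the transform count by the number of triangles to get the ACMR, or by the number of unique vertices to get the ATVR:

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <vector>

// Count vertex transforms for an indexed triangle list rendered
// through a FIFO post-transform cache of the given size.
int countTransforms(const std::vector<int> &indices, int cacheSize)
{
    std::deque<int> cache;
    int transforms = 0;
    for (int idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            transforms++; // cache miss: the vertex has to be transformed
            cache.push_back(idx);
            if ((int)cache.size() > cacheSize)
                cache.pop_front(); // FIFO eviction of the oldest entry
        }
    }
    return transforms;
}
```

For two triangles sharing an edge, the simulator reports 4 transforms: an ACMR of 4/2 = 2.0 that looks alarming, but an ATVR of 4/4 = 1.0 that immediately tells you each vertex was transformed exactly once.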

Real-Time Creased Approximate Subdivision Surfaces

Denis finally put his I3D paper online!

Denis was my intern a couple of summers ago. We worked together on the implementation of an emulation framework to prototype various tessellation algorithms. Last summer he wanted to try something different; I recommended that he join a game studio, and he ended up working at Valve.

Having spent so much time and effort working on approximate subdivision surfaces and talking about them to developers, it’s very pleasing to see these techniques being adopted in real applications.

Adding support for subdivision surfaces in a game engine requires significant changes in the production pipeline. Not only tools need to be updated, but production processes need to change, and artists need to be educated to adopt and learn the new tools.

I think this paper will be very valuable in that regard. It describes Valve’s experience integrating ACC into the Source engine: how the scheme had to be extended to support creases in order to achieve the visuals they were after, the many problems that arose, and, more importantly, the problems that could not be solved and required artist intervention.