{"id":1468,"date":"2025-02-20T01:20:15","date_gmt":"2025-02-20T09:20:15","guid":{"rendered":"https:\/\/www.ludicon.com\/castano\/blog\/?p=1468"},"modified":"2025-09-24T20:13:06","modified_gmt":"2025-09-25T04:13:06","slug":"gpu-texture-compression-everywhere","status":"publish","type":"post","link":"https:\/\/www.ludicon.com\/castano\/blog\/2025\/02\/gpu-texture-compression-everywhere\/","title":{"rendered":"GPU Texture Compression Everywhere"},"content":{"rendered":"\n<p>When I joined NVIDIA in 2005, one of my main goals was to work on texture and mesh processing tools. The NVIDIA Texture Tools were widely used, and Cem Cebenoyan, Sim Dietrich and Clint Brewer had been doing interesting work on mesh processing and optimization (nvtristrip, nvmeshmender). That was exactly the kind of work I wanted to be involved in.<\/p>\n\n\n\n<p>However, the priorities of the tools team were different, and I ended up working on FX Composer instead. I wasn\u2019t particularly excited about that, so in 2006, I switched to the Developer Technology group.<\/p>\n\n\n\n<p>At the time, NVIDIA and ATI were competing for dominance in the GPU market. While we had a solid market share, our real goal was to grow the overall market. Expanding the \u201cpie\u201d rather than just our slice of it. If you imagine a gamer with a fixed budget, we wanted him to allocate more of that budget to the GPU rather than the CPU. One way to achieve this was by encouraging developers to shift workloads from the CPU to the GPU.<\/p>\n\n\n\n<p>This push was part of the broader GPGPU movement. CUDA had just been released, but it had no integration with graphics APIs, and compute shaders didn\u2019t exist yet. One of the workloads that caught our attention was GPU texture compression. 
Under the pretext of harnessing the GPU, I found my way back to working on texture compression.<\/p>\n\n\n\n<p>Another idea gaining traction at the time was runtime texture compression.<\/p>\n\n\n\n<p>In April 2004, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Farbrausch\">Farbrausch<\/a> released <a href=\"https:\/\/en.wikipedia.org\/wiki\/.kkrieger\">.kkrieger<\/a>, a first-person shooter that packed all its content into just 96 KB by using procedural generation for levels, models, and textures, but it wasn&#8217;t until late into the development of <a href=\"https:\/\/www.pouet.net\/prod.php?which=30244\" data-type=\"link\" data-id=\"https:\/\/www.pouet.net\/prod.php?which=30244\">fr-041: debris<\/a> in 2007 that they started using runtime DXT compression to reduce GPU memory usage and improve performance.<\/p>\n\n\n\n<p>Around the same time, Allegorithmic was developing ProFX, the predecessor to Substance Designer, a middleware for real-time procedural texturing. ProFX also included a fast DXT encoder, allowing procedural textures to be converted into GPU-friendly formats at load time.<\/p>\n\n\n\n<p><a href=\"https:\/\/sjbrown.co.uk\/\">Simon Brown<\/a> was working on PlayStation Home, Sony\u2019s 3D social virtual world where players could create and customize their avatars. 
To support this, he wrote a fast DXT encoder optimized for the PS3\u2019s SPUs, demonstrating the potential of offloading texture compression to parallel processors.<\/p>\n\n\n\n<p>John Carmack had been talking about the megatexture technology for a while, but in 2006, when <a href=\"https:\/\/mrelusive.com\/\" data-type=\"link\" data-id=\"https:\/\/mrelusive.com\/\">Jan Paul van Waveren<\/a> published the details of their <a href=\"https:\/\/mrelusive.com\/publications\/papers\/Real-Time-Dxt-Compression.pdf\">Real-Time DXT Compression<\/a> implementation on the Intel Software Network, we at NVIDIA saw a potential problem: if <em>Rage<\/em> ended up CPU-limited, it could push gamers toward CPU upgrades rather than GPUs. That made real-time texture compression on the GPU a strategic priority for us.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>The first problem we encountered was that in Direct3D 10, updating a DXT texture without transferring data back to the CPU wasn\u2019t possible. Fortunately, OpenGL had recently introduced pixel buffer objects (PBOs). 
While not designed for this purpose, PBOs provided a workaround: we could render each block to an integer render target, copy the contents to a PBO, and then transfer the compressed data from the PBO to a DXT-compressed texture.<\/p>\n\n\n\n<p>This is the technique that I employed in the examples that accompanied the papers that Jan Paul and I co-authored and that can still be found online:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/developer.download.nvidia.com\/SDK\/10\/opengl\/samples.html#compress_YCoCgDXT\">Compress YCoCg-DXT<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/developer.download.nvidia.com\/SDK\/10\/opengl\/samples.html#compress_NormalDXT\">Compress Normal-DXT<\/a><\/li>\n<\/ul>\n\n\n\n<p>These examples allowed us to demonstrate that GPU texture compression was possible and show its potential.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Direct3D 10<\/h2>\n\n\n\n<p>I then pushed for changes in graphics APIs to better support this use case. The first API to enable direct copies from uncompressed to compressed textures was Direct3D 10.1, which relaxed the restrictions in the <a href=\"https:\/\/learn.microsoft.com\/en-us\/windows\/desktop\/api\/D3D10\/nf-d3d10-id3d10device-copyresource\"><code>CopyResource<\/code><\/a> and <a href=\"https:\/\/learn.microsoft.com\/en-us\/windows\/desktop\/api\/D3D10\/nf-d3d10-id3d10device-copysubresourceregion\"><code>CopySubresourceRegion<\/code><\/a> APIs to allow copies between prestructured-typed textures and block-compressed textures of the same bit width.<\/p>\n\n\n\n<p>For more details see: <a href=\"https:\/\/learn.microsoft.com\/en-us\/windows\/win32\/direct3d10\/d3d10-graphics-programming-guide-resources-block-compression#format-conversion-using-direct3d-101\">Format Conversion Using Direct3D 10.1<\/a><\/p>\n\n\n\n<p>This functionality was also available on consoles at the time through low-level APIs provided by hardware vendors. 
In this <a href=\"https:\/\/www.gdcvault.com\/play\/1012254\/Texture-compression-in-real-time\">GDC talk<\/a>, Jason Tranchida from Volition discusses their experience implementing real-time DXT compression using these APIs and Direct3D 10.1.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OpenGL<\/h2>\n\n\n\n<p>A year later, this feature arrived in OpenGL with <a href=\"https:\/\/github.com\/KhronosGroup\/OpenGL-Registry\/blob\/main\/extensions\/NV\/NV_copy_image.txt\">NV_copy_image<\/a>. However, it wasn\u2019t until 2012, six years after its introduction in Direct3D, that it became more widely available via <a href=\"https:\/\/github.com\/KhronosGroup\/OpenGL-Registry\/blob\/main\/extensions\/ARB\/ARB_copy_image.txt\">ARB_copy_image<\/a> and OpenGL 4.3.<\/p>\n\n\n\n<p>In OpenGL, the equivalent of D3D\u2019s <code>CopySubresourceRegion<\/code> is <a href=\"https:\/\/registry.khronos.org\/OpenGL-Refpages\/gl4\/html\/glCopyImageSubData.xhtml\"><code>glCopyImageSubData<\/code><\/a>. Like its Direct3D 10 counterpart, it enables copying data from an uncompressed texture to a block-compressed texture, provided the formats are compatible, which generally means that the texel size matches the block size.<\/p>\n\n\n\n<p>This feature could have facilitated the use of GPU texture compression in id Tech 5, enabling <em>Rage<\/em> to offload more work from the CPU to the GPU. However, <em>Rage<\/em> did not employ GPU texture compression at launch. 
Instead, it continued relying on the CPU-based approach described in Jan Paul&#8217;s paper.<\/p>\n\n\n\n<p>At that point I was already working on The Witness, so it was Evan Hart from NVIDIA\u2019s Developer Technology team who revisited this problem and implemented GPU texture compression for id Tech 5, building on the foundations I had put in place, and addressing all the practical challenges that arise when integrating real-time GPU compression into a production engine.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">OpenGL ES<\/h2>\n\n\n\n<p>I didn\u2019t pay much attention to these features again until I started developing <a href=\"https:\/\/ludicon.com\/spark\/\" data-type=\"link\" data-id=\"https:\/\/ludicon.com\/spark\/\">Spark<\/a>. Since my focus was on mobile, I was particularly interested in OpenGL ES, which appeared to support the same functionality through the <a href=\"https:\/\/registry.khronos.org\/OpenGL\/extensions\/EXT\/EXT_copy_image.txt\">GL_EXT_copy_image<\/a> extension and had become a core feature in OpenGL ES 3.2.<\/p>\n\n\n\n<p>While testing this feature, I encountered an unexpected limitation: OpenGL ES does not support the <code>rg32ui<\/code> output image layout. Instead, to output 64-bit blocks, you have to use <code>rgba16ui<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>layout(rgba16ui) uniform restrict writeonly highp uimage2D dst;<\/code><\/pre>\n\n\n\n<p>This requires packing the 64-bit <code>uvec2<\/code> block manually before passing it to <code>imageStore<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>uvec4 pack_uvec2(uvec2 v) {\n    return uvec4(v.x &amp; 0xFFFFu, v.x &gt;&gt; 16, v.y &amp; 0xFFFFu, v.y &gt;&gt; 16);\n}\n...\nimageStore(dst, uv, pack_uvec2(block));<\/code><\/pre>\n\n\n\n<p>That said, in a world where compute shaders are ubiquitous, there\u2019s now a much simpler way to accomplish the same task. 
By using a compute shader to write compressed data into a temporary buffer, we can bind that buffer to the <code>GL_PIXEL_UNPACK_BUFFER<\/code> target and source data directly from it using <a href=\"https:\/\/registry.khronos.org\/OpenGL-Refpages\/gl4\/html\/glCompressedTexSubImage2D.xhtml\"><code>glCompressedTexSubImage2D<\/code><\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>glBindBuffer(GL_PIXEL_UNPACK_BUFFER, tmp_buffer);\nglBindTexture(GL_TEXTURE_2D, dst_texture);\nglCompressedTexSubImage2D(GL_TEXTURE_2D, dst_level, dst_x, dst_y, width, height, \n    gl_format, tmp_buffer_size, (const void*)tmp_buffer_offset);<\/code><\/pre>\n\n\n\n<p>This approach turned out to work well on all devices and to run with similar performance across them.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Vulkan<\/h2>\n\n\n\n<p>Like OpenGL, Vulkan supports copying data from uncompressed to block-compressed textures. This functionality has been available since Vulkan 1.0 through the <a href=\"https:\/\/registry.khronos.org\/vulkan\/specs\/1.3-extensions\/man\/html\/vkCmdCopyImage.html\"><code>vkCmdCopyImage<\/code><\/a> API.<\/p>\n\n\n\n<p>So imagine my surprise when I tested it and found that most devices didn&#8217;t implement it correctly. The following screenshots show the results on Adreno, PowerVR, and Mali devices. 
Only Mali produced the correct output!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"581\" src=\"https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage-700x581.png\" alt=\"\" class=\"wp-image-1469\" srcset=\"https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage-700x581.png 700w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage-267x222.png 267w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage-768x638.png 768w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage-800x664.png 800w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-CmdCopyImage.png 1439w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<p>At first, I thought my efforts to bring real-time texture compression to Vulkan on Android were doomed. But there was still some hope. I discovered that the <a href=\"https:\/\/registry.khronos.org\/vulkan\/specs\/1.3-extensions\/man\/html\/VK_KHR_maintenance2.html\">KHR_maintenance2<\/a> extension allowed the creation of uncompressed image views of compressed images, eliminating the need for a copy altogether. This feature was later promoted to core in Vulkan 1.1, which happened to be our minimum requirement, so it seemed like a promising solution.<\/p>\n\n\n\n<p>Getting this to work, however, was far from straightforward. The documentation was sparse, and initial attempts led to validation errors, but after some effort, I managed to get it running correctly on PC. My optimism was short-lived, though. When I tested on mobile, the results were disastrous. 
The image below shows the output from two Adreno devices (left) and a PowerVR device (right):<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"700\" height=\"516\" src=\"https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-700x516.png\" alt=\"\" class=\"wp-image-1470\" srcset=\"https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-700x516.png 700w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-267x197.png 267w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-768x567.png 768w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-1536x1133.png 1536w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-800x590.png 800w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView-80x60.png 80w, https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2025\/02\/Android-Vulkan-RenderToBlockTexelView.png 1628w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/figure>\n\n\n\n<p>By this point, I had been working on Spark for almost 6 months, and it felt like all that work might have been for nothing. And this wasn\u2019t the only problem. Under Vulkan most of my codecs produced incorrect output, failed to compile, or outright crashed the device. Only the simplest kernels worked correctly, and even they ran at much lower performance than their OpenGL ES equivalents.<\/p>\n\n\n\n<p>Thankfully, I was stubborn and persevered. I spent the next few months narrowing down every issue, sending repro cases to IHVs, and trying every possible workaround. That effort paid off. 
A few months later, all Spark codecs were running correctly and performing as well as, or better than, their GLES counterparts.<\/p>\n\n\n\n<p>The root cause of many of these issues was my initial reliance on fragment shaders. At the time, the version of the Hype engine that I was using lacked support for compute shaders, so my first integration experiments ran the codecs using fragment shaders. This meant I couldn\u2019t output compressed blocks to a buffer like in OpenGL, but instead, I had to render to a texture, requiring <code>vkCmdCopyImage<\/code>.<\/p>\n\n\n\n<p>However, Vulkan also allows updating block-compressed textures using <a href=\"https:\/\/registry.khronos.org\/vulkan\/specs\/1.3-extensions\/man\/html\/vkCmdCopyBufferToImage.html\"><code>vkCmdCopyBufferToImage<\/code><\/a>. In fact, this is the same command that you would use to upload block-compressed textures, which means this code path was heavily tested across vendors, and unsurprisingly, it worked flawlessly.<\/p>\n\n\n\n<p>It also turned out that the texture corruption issues when using block-texel views were associated with the use of fragment shaders, in particular, with the use of <code>VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT<\/code>. Replacing that flag with <code>VK_IMAGE_USAGE_STORAGE_BIT<\/code> resolved the problems entirely.<\/p>\n\n\n\n<p>Getting this to work correctly was challenging, and the difficulty was compounded by the fact that I never knew whether I was doing something wrong and the validation layer was not catching my errors, or whether the hardware or the software was at fault. 
So, I&#8217;m gonna explain this procedure in a bit more detail:<\/p>\n\n\n\n<p>To create block-texel views you first have to create an image with the following flags:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>EXTENDED_USAGE<\/strong>: This allows creating a block-compressed image with the <code>VK_IMAGE_USAGE_STORAGE_BIT<\/code> flag, even if that usage isn\u2019t normally supported for compressed formats, as long as it is removed from the block-compressed views.<\/li>\n\n\n\n<li><strong>MUTABLE_FORMAT<\/strong>: This allows us to create image views with different, but compatible formats.<\/li>\n\n\n\n<li><strong>BLOCK_TEXEL_VIEW_COMPATIBLE<\/strong>: This extends the list of compatible formats to include uncompressed formats where the texel matches the block size.<\/li>\n<\/ul>\n\n\n\n<p>Here&#8217;s an example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>VkFormat compressed_format = VK_FORMAT_ASTC_4x4_UNORM_BLOCK;\nVkFormat uncompressed_format = VK_FORMAT_R32G32B32A32_UINT;\n\nVkImageCreateInfo image_info = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };\nimage_info.imageType = VK_IMAGE_TYPE_2D;\nimage_info.format = compressed_format;\nimage_info.extent = { w, h, 1 };\nimage_info.mipLevels = 1;\nimage_info.arrayLayers = 1;\nimage_info.samples = VK_SAMPLE_COUNT_1_BIT;\nimage_info.tiling = VK_IMAGE_TILING_OPTIMAL;\nimage_info.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;\n\n\/\/ Note, we create the compressed image with the *STORAGE* usage flag. 
This is only allowed thanks to the EXTENDED_USAGE image flag.\nimage_info.usage = \n    VK_IMAGE_USAGE_SAMPLED_BIT | \n    VK_IMAGE_USAGE_TRANSFER_DST_BIT | \n    VK_IMAGE_USAGE_STORAGE_BIT;\n\n\/\/ Provide the required flags:\nimage_info.flags = \n    VK_IMAGE_CREATE_EXTENDED_USAGE_BIT | \n    VK_IMAGE_CREATE_BLOCK_TEXEL_VIEW_COMPATIBLE_BIT | \n    VK_IMAGE_CREATE_MUTABLE_FORMAT_BIT;<\/code><\/pre>\n\n\n\n<p>After this you would allocate, create the image, and upload the data as you normally would.<\/p>\n\n\n\n<p>To create a view for sampling the texture in the shader, you use the compressed format, but for that to succeed you have to explicitly remove the <code>VK_IMAGE_USAGE_STORAGE_BIT<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>VkImageViewCreateInfo sample_view_info = {};\nsample_view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;\nsample_view_info.image = image;\nsample_view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;\nsample_view_info.format = compressed_format;\nsample_view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\nsample_view_info.subresourceRange.levelCount = 1;\nsample_view_info.subresourceRange.layerCount = 1;\n\n\/\/ Remove the STORAGE usage flag from this view.\nVkImageViewUsageCreateInfo sample_view_usage_info = {};\nsample_view_usage_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_USAGE_CREATE_INFO;\nsample_view_usage_info.usage = image_info.usage &amp; ~VK_IMAGE_USAGE_STORAGE_BIT;\nsample_view_info.pNext = &amp;sample_view_usage_info;\n\nVkImageView sample_view = VK_NULL_HANDLE;\nvkCreateImageView(device, &amp;sample_view_info, nullptr, &amp;sample_view);<\/code><\/pre>\n\n\n\n<p>And to create a view to use the texture as storage in the compute shader, you use the uncompressed format. 
For maximum compatibility you should also remove the <code>VK_IMAGE_USAGE_SAMPLED_BIT<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Create texel view for storage:\nVkImageViewCreateInfo store_view_info = {};\nstore_view_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_CREATE_INFO;\nstore_view_info.image = image;\nstore_view_info.viewType = VK_IMAGE_VIEW_TYPE_2D;\nstore_view_info.format = uncompressed_format;\nstore_view_info.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;\nstore_view_info.subresourceRange.levelCount = 1;\nstore_view_info.subresourceRange.layerCount = 1;\n\n\/\/ Remove the SAMPLED usage flag from this view.\nVkImageViewUsageCreateInfo store_view_usage_info = {};\nstore_view_usage_info.sType = VK_STRUCTURE_TYPE_IMAGE_VIEW_USAGE_CREATE_INFO;\nstore_view_usage_info.usage = image_info.usage &amp; ~VK_IMAGE_USAGE_SAMPLED_BIT;\nstore_view_info.pNext = &amp;store_view_usage_info;\n\nVkImageView store_view = VK_NULL_HANDLE;\nVK_CHECK(vkCreateImageView(device, &amp;store_view_info, nullptr, &amp;store_view));<\/code><\/pre>\n\n\n\n<p>Vulkan also provides the <code>KHR_image_format_list<\/code> extension, which was promoted to core in Vulkan 1.2. This allows you to provide a list of compatible formats at creation time. 
Using this extension is not strictly required, but is recommended as it can be an optimization on some devices.<\/p>\n\n\n\n<p>Here&#8217;s a usage example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>VkImageFormatListCreateInfo format_list_info = {\n    VK_STRUCTURE_TYPE_IMAGE_FORMAT_LIST_CREATE_INFO_KHR \n};\n\nVkFormat view_formats&#91;2];\nif (physical_device_properties.apiVersion &gt;= VK_API_VERSION_1_2 || supported_extensions.KHR_image_format_list)\n{\n    view_formats&#91;0] = compressed_format;\n    view_formats&#91;1] = uncompressed_format;\n    format_list_info.viewFormatCount = 2;\n    format_list_info.pViewFormats = view_formats;\n    image_info.pNext = &amp;format_list_info;\n}<\/code><\/pre>\n\n\n\n<p>You can use this same procedure to create render target textures with the <code>VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT<\/code> flag and then remove it from the corresponding compressed views for sampling. However, as noted earlier, this is known to be broken on many devices. For simplicity, I strongly recommend running the codecs in a compute shader instead.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Direct3D 12<\/h2>\n\n\n\n<p>The Direct3D 10.1 <code>CopyResource<\/code> APIs are available in both Direct3D 11 and Direct3D 12, but like Vulkan 1.1, Direct3D 12.1 also offers the possibility of creating an uncompressed Unordered Access View (UAV) of a block-compressed texture. To use this functionality you first need to check that the device supports the <code>RelaxedFormatCastingSupported<\/code> feature:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>D3D12_FEATURE_DATA_D3D12_OPTIONS12 feature_options12 = {};\nhr = device-&gt;CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS12,\n                                 &amp;feature_options12, sizeof(feature_options12));\nif (SUCCEEDED(hr)) {\n    supports_relaxed_format_casting = feature_options12.RelaxedFormatCastingSupported;\n}<\/code><\/pre>\n\n\n\n<p>Compared to Vulkan this is refreshingly easy. 
The only additional requirement is to use the <code>CreateCommittedResource3<\/code> method in <code>ID3D12Device10<\/code> and provide the list of compatible formats up front:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DXGI_FORMAT format_list&#91;2] = {\n    DXGI_FORMAT_BC7_UNORM, DXGI_FORMAT_R32G32B32A32_UINT\n};\nhr = device10-&gt;CreateCommittedResource3(\n    &amp;heap_properties,\n    D3D12_HEAP_FLAG_NONE,\n    &amp;texture_desc,\n    D3D12_BARRIER_LAYOUT_COMMON,\n    nullptr,\n    nullptr,\n    countof(format_list),\n    format_list,\n    IID_PPV_ARGS(bc_texture));<\/code><\/pre>\n\n\n\n<p>After that you simply create the shader resource views and unordered access view using the formats specified in the list and it just works!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Metal<\/h2>\n\n\n\n<p>Unlike Vulkan and Direct3D 12, Metal does not support copying data directly from uncompressed to compressed textures. The only available method is to use a compute shader to write the compressed data to a buffer and then use a blit encoder to transfer the buffer contents to the texture:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ makeBlitCommandEncoder() returns an optional.\nlet blitEncoder = commandBuffer.makeBlitCommandEncoder()!\n\nblitEncoder.copy(\n    from: buffer, \n    sourceOffset: bufferOffset, \n    sourceBytesPerRow: bufferRowLength, \n    sourceBytesPerImage: 0, \n    sourceSize: MTLSize(width:width, height:height, depth:1), \n    to: outputTexture, \n    destinationSlice: 0, \n    destinationLevel: 0, \n    destinationOrigin: MTLOrigin(x: 0, y: 0, z: 0))\n\nblitEncoder.endEncoding()<\/code><\/pre>\n\n\n\n<p>If, for any reason, the encoder outputs compressed data to an uncompressed texture, two additional copies are required: one from the texture to a buffer and another from the buffer to the compressed texture. 
The recommended approach is to have the encoder write compressed blocks directly to a buffer, so that only a single copy is required.<\/p>\n\n\n\n<p>It&#8217;s a bit disappointing that Metal lacks the resource casting capabilities available in Vulkan and Direct3D 12. However, iOS devices generally perform exceptionally well compared to Android, so in practice, they don\u2019t fall behind despite this limitation. The main challenge is maintaining an efficient cross-API abstraction that achieves optimal performance across platforms, because that requires slightly different code paths and shader variations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n\n\n\n<p>Getting this to work everywhere was a rollercoaster. If I had been doing this as part of a regular job, it would have been a fun ride, but since I was investing my own resources, it was nerve-wracking. For a long time, it wasn\u2019t even clear whether I\u2019d be able to make Spark work reliably on a sizeable subset of the devices I was targeting.<\/p>\n\n\n\n<p>Initially, I chose to keep these details private, sharing them only with clients to assist their integration efforts. However, I\u2019ve come to realize that many developers don\u2019t fully grasp the immense amount of work that has gone into ensuring Spark runs well across platforms, or the risks I\u2019ve taken to get there.<\/p>\n\n\n\n<p>Hopefully, this not only helps others facing similar challenges but also provides a better appreciation for the effort behind Spark.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When I joined NVIDIA in 2005, one of my main goals was to work on texture and mesh processing tools. The NVIDIA Texture Tools were widely used, and Cem Cebenoyan, Sim Dietrich and Clint Brewer had been doing interesting work on mesh processing and optimization (nvtristrip, nvmeshmender). 
That was exactly the kind of work I&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,22],"tags":[],"class_list":["post-1468","post","type-post","status-publish","format-standard","hentry","category-coding","category-spark"],"_links":{"self":[{"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1468","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/comments?post=1468"}],"version-history":[{"count":17,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1468\/revisions"}],"predecessor-version":[{"id":1491,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1468\/revisions\/1491"}],"wp:attachment":[{"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/media?parent=1468"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/categories?post=1468"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/tags?post=1468"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}