{"id":1768,"date":"2026-04-10T18:17:40","date_gmt":"2026-04-11T02:17:40","guid":{"rendered":"https:\/\/www.ludicon.com\/castano\/blog\/?p=1768"},"modified":"2026-04-10T18:17:42","modified_gmt":"2026-04-11T02:17:42","slug":"writing-to-compressed-textures-in-metal","status":"publish","type":"post","link":"http:\/\/www.ludicon.com\/castano\/blog\/2026\/04\/writing-to-compressed-textures-in-metal\/","title":{"rendered":"Writing to Compressed Textures in Metal"},"content":{"rendered":"\n<p>About a year ago I wrote <a href=\"https:\/\/www.ludicon.com\/castano\/blog\/2025\/02\/gpu-texture-compression-everywhere\/#metal\">GPU Texture Compression Everywhere<\/a>, a post in which, among other things, I lamented that Metal did not have support for writing to compressed textures.<\/p>\n\n\n\n<p>Unlike Vulkan or D3D12, Metal doesn&#8217;t support resource casting. There&#8217;s no way to write to a compressed texture through an uncompressed view. The only way we can do that is by using a blit operation, so we need to output our results to a temporary buffer and then copy the contents of the buffer to the texture. This requires a temporary memory allocation that needs to be managed, and if the buffer is reused for multiple uploads, hazard tracking may add some synchronization overhead.<\/p>\n\n\n\n<p>I requested this feature from Apple when I started working on Spark, more than 3 years ago. 
Since then not much progress has been made, and support is still not available, so I decided it was time to take matters into my own hands.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>The workaround I came up with is to emulate resource casting by using a Metal heap to allocate the two resources at the same memory location:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Determine block compressed texture size:\nMTLTextureDescriptor* bc_desc = &#91;MTLTextureDescriptor\n    texture2DDescriptorWithPixelFormat:MTLPixelFormatBC7_RGBAUnorm\n    width:input.w height:input.h mipmapped:false];    \nbc_desc.storageMode = MTLStorageModePrivate;\nbc_desc.usage = MTLTextureUsageShaderRead;\n\nMTLSizeAndAlign sa = &#91;device heapTextureSizeAndAlignWithDescriptor:bc_desc];\n\n\/\/ Allocate heap:\nMTLHeapDescriptor* hd = &#91;&#91;MTLHeapDescriptor alloc] init];\nhd.size               = sa.size;\nhd.storageMode        = MTLStorageModePrivate;\nhd.hazardTrackingMode = MTLHazardTrackingModeUntracked;\nhd.type               = MTLHeapTypePlacement;\nid&lt;MTLHeap&gt; heap = &#91;device newHeapWithDescriptor:hd];\n\n\/\/ Allocate texture and buffer at the same location:\nid&lt;MTLTexture&gt; bc_tex = &#91;heap newTextureWithDescriptor:bc_desc offset:0];\nid&lt;MTLBuffer&gt;  bc_buf = &#91;heap newBufferWithLength:sa.size\n    options:MTLResourceStorageModePrivate offset:0];<\/code><\/pre>\n\n\n\n<p>We then write to the buffer and read from the texture. 
With fences to synchronize access, we can ensure writes are visible before sampling:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Invoke Spark kernel\nid&lt;MTLComputeCommandEncoder&gt; enc = &#91;cb computeCommandEncoder];\n&#91;enc useResource:bc_buf usage:MTLResourceUsageWrite];\n\/\/ \u2026 dispatch Spark kernel \u2026\n&#91;enc updateFence:fence];\n&#91;enc endEncoding];\n\n\/\/ Sample the aliased texture from compute or graphics.\n\/\/ enc2 is a later encoder (creation not shown).\n&#91;enc2 waitForFence:fence];\n&#91;enc2 useResource:bc_tex usage:MTLResourceUsageSample];\n\/\/ \u2026 dispatch that samples bc_tex \u2026\n&#91;enc2 endEncoding];<\/code><\/pre>\n\n\n\n<p>However, for this to work, the codec must take texture swizzling into account when it reads the input texture and writes to the output buffer. For an overview of how texture tiling works, I recommend Fabian&#8217;s post: <a href=\"https:\/\/fgiesen.wordpress.com\/2011\/01\/17\/texture-tiling-and-swizzling\/\">Texture Tiling and Swizzling<\/a><\/p>\n\n\n\n<p>To determine the tiling order on Apple silicon, I wrote blocks with increasing color values and read back the buffer to observe the storage order. On the M4 (the machine I&#8217;m writing this on), the GPU uses the following layout for BC blocks: the texture is divided into macro tiles (32\u00d732 blocks, or equivalently 128\u00d7128 pixels) arranged in row-major order. 
Within each macro tile, blocks are stored in X-first Morton order, meaning the x bits occupy the even bit positions of the index.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"630\" src=\"https:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card.jpg\" alt=\"\" class=\"wp-image-1770\" srcset=\"http:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card.jpg 1200w, http:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card-267x140.jpg 267w, http:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card-700x368.jpg 700w, http:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card-768x403.jpg 768w, http:\/\/www.ludicon.com\/castano\/blog\/wp-content\/uploads\/2026\/04\/metal-alias-card-800x420.jpg 800w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n\n\n\n<p>There are two ways to write a texture encoder that produces swizzled output: read 4&#215;4 pixel blocks linearly and scatter the writes, or gather the reads and write compressed blocks linearly. The first approach needs to compute the linear address for a 2D block coordinate; the second needs the 2D coordinate of a linear address.<\/p>\n\n\n\n<p>For that we need to compute Morton codes efficiently, and Fabian also has an article covering that subject: <a href=\"https:\/\/fgiesen.wordpress.com\/2009\/12\/13\/decoding-morton-codes\/\">Decoding Morton Codes<\/a><\/p>\n\n\n\n<p>In our case we only need to interleave the lower 5 bits of each coordinate to index within a macro tile. We can do this efficiently by packing both coordinates into a single word, x in bits 0\u20134 and y in bits 16\u201320, and then spreading or compacting both halves in parallel through the same shift-and-mask chain. 
This lets us encode or decode both coordinates in a single pass:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ X-first Morton encode two 5-bit coordinates packed in one word.\n\/\/ Pack x in bits 0..4 and y in bits 16..20, spread both halves in parallel\n\/\/ through the same shift-mask chain, then merge the two 10-bit results.\ninline uint morton10(uint x, uint y) {\n    uint v = (x &amp; 0x1Fu) | ((y &amp; 0x1Fu) &lt;&lt; 16);\n    v = (v | (v &lt;&lt; 4)) &amp; 0x0F0F0F0Fu;\n    v = (v | (v &lt;&lt; 2)) &amp; 0x33333333u;\n    v = (v | (v &lt;&lt; 1)) &amp; 0x55555555u;\n    v &amp;= 0x03FF03FFu;\n    return (v | (v &gt;&gt; 15)) &amp; 0x3FFu;\n}\n\n\/\/ Inverse: extract (x, y) from a 10-bit X-first Morton index.\n\/\/ Place even bits (x) in bits 0..4 and odd bits (y) in bits 16..20,\n\/\/ then compact both halves in parallel.\ninline uint2 morton10_inv(uint m) {\n    uint v = (m &amp; 0x155u) | ((m &amp; 0x2AAu) &lt;&lt; 15);\n    v = (v | (v &gt;&gt; 1)) &amp; 0x33333333u;\n    v = (v | (v &gt;&gt; 2)) &amp; 0x0F0F0F0Fu;\n    v = (v | (v &gt;&gt; 4)) &amp; 0x001F001Fu;\n    return uint2(v &amp; 0x1Fu, (v &gt;&gt; 16) &amp; 0x1Fu);\n}<\/code><\/pre>\n\n\n\n<p>For the version that writes blocks in swizzled order, we use a 2D kernel. Threads form a 2D grid over the (bx, by) block coordinates. 
Each thread reads its 4\u00d74 input block in scan order and writes the encoded BC7 block at the swizzled buffer offset.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>constant uint MT      = 32u;       \/\/ macro-tile size in blocks\n\n\/\/ (bx, by) to linear slot index in the heap buffer.\n\/\/ `bw` is the number of blocks across the texture.\ninline uint bc_block_slot(uint bx, uint by, uint bw) {\n    uint mt_grid_w = (bw + MT - 1u) \/ MT;\n    return ((by &gt;&gt; 5) * mt_grid_w + (bx &gt;&gt; 5)) * MT * MT + morton10(bx, by);\n}\n\nkernel void encode_writes_swizzled(\n    texture2d&lt;half, access::read&gt; src &#91;&#91;texture(0)]],\n    device uint4* output_blocks       &#91;&#91;buffer(1)]],\n    ushort2 gid                       &#91;&#91;thread_position_in_grid]])\n{\n    uint bw = (src.get_width()  + 3u) \/ 4u;\n    uint bh = (src.get_height() + 3u) \/ 4u;\n    if (gid.x &gt;= bw || gid.y &gt;= bh) return;\n\n    half3 rgb&#91;16];\n    read_rgb_block(src, gid * 4, rgb);\n\n    uint slot = bc_block_slot(uint(gid.x), uint(gid.y), bw);\n    output_blocks&#91;slot] = spark_encode_bc7_rgb(rgb, quality);\n}<\/code><\/pre>\n\n\n\n<p>For the version that reads blocks in swizzled order, we use a 1D kernel. Threads form a 1D grid over the linear slot indices in the buffer. Each thread reverses the swizzle to find the (bx, by) block coordinates it owns, reads that 4\u00d74 block from the input texture, then writes the encoded block at its own linear slot.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ Linear slot to (bx, by). 
Inverse of bc_block_slot.\ninline uint2 bc_slot_to_block(uint slot, uint bw) {\n    uint mt_grid_w = (bw + MT - 1u) \/ MT;\n    uint mt_index  = slot \/ (MT * MT); \/\/ 1024 blocks per macro-tile\n    uint2 local    = morton10_inv(slot - mt_index * MT * MT);\n    uint mt_y = mt_index \/ mt_grid_w;\n    uint mt_x = mt_index - mt_y * mt_grid_w;\n    return uint2(mt_x * MT + local.x, mt_y * MT + local.y);\n}\n\nkernel void encode_reads_swizzled(\n    texture2d&lt;half, access::read&gt; src &#91;&#91;texture(0)]],\n    device uint4* output_blocks       &#91;&#91;buffer(1)]],\n    uint tid                          &#91;&#91;thread_position_in_grid]])\n{\n    uint bw = (src.get_width()  + 3u) \/ 4u;\n    uint bh = (src.get_height() + 3u) \/ 4u;\n    uint mt_grid_w = (bw + MT - 1u) \/ MT;\n    uint mt_grid_h = (bh + MT - 1u) \/ MT;\n    uint slot_count = mt_grid_w * mt_grid_h * MT * MT;\n    if (tid &gt;= slot_count) return;\n\n    uint2 b = bc_slot_to_block(tid, bw);\n    if (b.x &gt;= bw || b.y &gt;= bh) return;  \/\/ padding inside the last macro-tile\n\n    half3 rgb&#91;16];\n    read_rgb_block(src, ushort2(b * 4u), rgb);\n\n    output_blocks&#91;tid] = spark_encode_bc7_rgb(rgb, quality);\n}<\/code><\/pre>\n\n\n\n<p>I measured the performance of both approaches and compared them with the standard method using a temporary buffer and a blit. I ran each method 100 times on an M4 at quality level 2 and report the best and average times for a 1024\u00d71024 texture:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>method<\/th><th>best<\/th><th>average<\/th><\/tr><\/thead><tbody><tr><td>writes_swizzled<\/td><td>0.020 ms<\/td><td>0.036 ms<\/td><\/tr><tr><td>reads_swizzled<\/td><td>0.021 ms<\/td><td>0.038 ms<\/td><\/tr><tr><td>linear + blit<\/td><td>0.025 ms<\/td><td>0.043 ms<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The two swizzled variants are closely tied. 
On smaller textures <code>writes_swizzled<\/code> tends to win, on larger textures <code>reads_swizzled<\/code> pulls ahead, but the difference is negligible either way.<\/p>\n\n\n\n<p>Both outperform the blit approach by around 20%, an improvement that you don&#8217;t want to overlook. At quality level 1 the kernel is purely bandwidth limited and the difference is even more pronounced, approximately 28% faster:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>method<\/th><th>best<\/th><th>avg<\/th><\/tr><\/thead><tbody><tr><td>writes_swizzled<\/td><td>0.013 ms<\/td><td>0.026 ms<\/td><\/tr><tr><td>reads_swizzled<\/td><td>0.014 ms<\/td><td>0.026 ms<\/td><\/tr><tr><td>linear + blit<\/td><td>0.018 ms<\/td><td>0.036 ms<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>On mobile the differences are more muted, but still worthwhile at around 17% improvement.<\/p>\n\n\n\n<p><strong>iPhone 16 (A18):<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>method<\/th><th>best<\/th><th>avg<\/th><\/tr><\/thead><tbody><tr><td>writes_swizzled<\/td><td>0.168 ms<\/td><td>0.257 ms<\/td><\/tr><tr><td>reads_swizzled<\/td><td>0.170 ms<\/td><td>0.245 ms<\/td><\/tr><tr><td>linear + blit<\/td><td>0.201 ms<\/td><td>0.304 ms<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p><strong>iPhone 8 (A11):<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>method<\/th><th>best<\/th><th>avg<\/th><\/tr><\/thead><tbody><tr><td>writes_swizzled<\/td><td>0.578 ms<\/td><td>0.606 ms<\/td><\/tr><tr><td>reads_swizzled<\/td><td>0.573 ms<\/td><td>0.593 ms<\/td><\/tr><tr><td>linear + blit<\/td><td>0.689 ms<\/td><td>0.721 ms<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Despite the benefits, this technique is somewhat risky. 
There&#8217;s no guarantee that future devices will use the same tiling format; in fact, some of the iOS devices I&#8217;ve tested use a different Morton ordering.<\/p>\n\n\n\n<p>It&#8217;s also interesting to note that in Vulkan, some (most?) vendors disable tiling when writes to compressed textures are enabled, resulting in textures that are in a block-linear format. To avoid that, they would have to modify every shader that writes to those textures in order to apply the swizzling, which would require dynamic shader compilation.<\/p>\n\n\n\n<p>When sampled, block-linear textures have worse caching performance than swizzled textures. I haven&#8217;t measured the impact, and IHVs have dismissed this concern, but I think it&#8217;s worth taking a closer look at this in the future and implementing a similar approach in Vulkan.<\/p>\n\n\n\n<p>The drawback is that you would need slightly different code paths depending on the target device, but hey, that&#8217;s why you want to license <a href=\"https:\/\/ludicon.com\/spark\/\">Spark<\/a>, rather than rolling your own.<\/p>\n\n\n\n<p>The upcoming Spark SDK will include a Metal example demonstrating this technique in more detail, with support for non-power-of-two textures, different block-compression formats, and devices as old as iPhone 6 (A8).<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>About a year ago I wrote GPU Texture Compression Everywhere, a post in which, among other things, I lamented that Metal did not have support for writing to compressed textures. Unlike Vulkan or D3D12, Metal doesn&#8217;t support resource casting. There&#8217;s no way to write to a compressed texture through an uncompressed view. 
The only way&#8230;<\/p>\n","protected":false},"author":1,"featured_media":1770,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1768","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1768","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/comments?post=1768"}],"version-history":[{"count":5,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1768\/revisions"}],"predecessor-version":[{"id":1774,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/posts\/1768\/revisions\/1774"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/media\/1770"}],"wp:attachment":[{"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/media?parent=1768"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/categories?post=1768"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.ludicon.com\/castano\/blog\/wp-json\/wp\/v2\/tags?post=1768"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}