D3D12 Root Descriptor Alignment: Mind the Gap

Here is a little known and unintuitive alignment pitfall in D3D12 descriptor binding. In the wrong circumstances it can cause a large performance degradation on NVIDIA GPUs.

Let’s start with a simple, contrived shader copying 4x4 matrices: 64 bytes per thread.

ByteAddressBuffer Input : register(t0);
RWByteAddressBuffer Output : register(u0);

void cs_main(uint dtid : SV_DispatchThreadID)
{
    uint offset = dtid.x * 16 * 4;
    Output.Store<float4x4>(offset, Input.Load<float4x4>(offset));
}

Now let’s look at a Nsight capture of this shader in two different binding configurations:

Nsight Capture of Root Descriptor Alignment Issue

Over 2x slower. The only difference is the descriptor binding configuration, which at first seems extremely odd – here is a short investigation.

A root descriptor isn’t a descriptor heap entry in the traditional sense. It is simply bound as a GPU virtual address set via methods such as SetComputeRootShaderResourceView and SetComputeRootUnorderedAccessView. This method of binding has documented limitations, mostly prohibiting buffers that require format conversion. However, nothing in the user-facing D3D12 docs explains how these binding methods could affect performance so drastically. The shader will have one less indirection fetching the descriptor in the root descriptor case, but this does not explain a performance delta of this magnitude.

The likely explanation is the shader compiler’s buffer alignment assumptions. D3D12_RAW_UAV_SRV_BYTE_ALIGNMENT is defined as 16, so it is natural to expect a raw ByteAddressBuffer SRV/UAV to carry a 16-byte alignment guarantee. In practice, that guarantee is available when creating normal raw SRV/UAV descriptors, and the debug layer will complain if the raw descriptor does not satisfy it.

The key detail is described not in any official documentation, but in a proposal to fix the issue itself!

The primary limitation occurs with root descriptor buffer views, which are constrained to 4-byte alignment in the current specification. When applications choose root descriptors over descriptor tables for performance or resource binding reasons, shader compilers must conservatively assume this worst-case alignment scenario. This conservative assumption prevents optimizations that depend on higher alignment guarantees, even when the application has allocated and bound buffers with stronger alignment properties.

So the distinction here is that a ByteAddressBuffer bound as a root descriptor does not expose the same 16-byte base alignment guarantee to the shader compiler. The proposal itself is aimed at solving this issue with user annotations.

Note: there is a separate but related issue regarding indexing the buffer itself: the shader compiler needs to prove the dynamic index provided is aligned. In our case, since we multiply dtid.x by 16 * 4, it is provably aligned.

To see why that matters on NVIDIA, here is the same access pattern ported to CUDA. This is not proof of the D3D12 behavior by itself; it is a way to show how NVIDIA code generation changes when the compiler can or cannot rely on 16-byte alignment. We can force the struct to be 16-byte aligned and inspect the assembly on Compiler Explorer.

// Switch between statically defined 16-byte alignment
// and natural 4-byte alignment.
#if defined(ALIGNED) 
    #define ALIGN alignas(16)
#else
    #define ALIGN
#endif

struct ALIGN Matrix44
{
    float m[16];
};

__global__ void copy_matrix44(Matrix44* a, Matrix44* b, int n) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    a[tid] = b[tid];
}

Things become very obvious reading the PTX in Compiler Explorer. The unaligned version issues 16 single-float load/store instructions (ld.global.f32 / st.global.f32), whereas the aligned version issues 4 float4 load/store instructions (ld.global.v4.f32 / st.global.v4.f32).

The access pattern in the example shader makes the performance difference especially visible. On NVIDIA GPUs, a warp’s global memory accesses are coalesced per instruction into the minimal 32-byte transactions needed to service the active lanes. In this shader, each lane copies one 64-byte matrix, so a single field of each matrix is 64 bytes apart across lanes. Each lane therefore touches a separate 32-byte sector for a given load instruction.

Breakdown of memory requests per instruction, per warp

Load widthUseful bytesSectors touchedSector bytes requestedSector Util
float32 x 4 B = 128 B3232 * 32 B = 1024 B12.5%
float432 x 16 B = 512 B3232 * 32 B = 1024 B50%

With 16-byte float4 loads, each load makes much more efficient use of each sector. For the scalar case, we fetch a 32-byte sector and use only 4 bytes from it. For the float4 case, we use 16 bytes from each sector and issue one-quarter as many load/store instructions overall.

In conclusion, this is not a recommendation to completely avoid root descriptors: they have benefits for CPU binding cost and shader-side descriptor indirection. However, for alignment-sensitive shaders using ByteAddressBuffer, it is worth profiling the impact of binding them as root descriptors.

Thanks for reading!

Additional References: