Optimizing copy of null descriptors in D3D12

While more and more engines have moved to bindless texturing, you might have a codebase that still uses descriptors for texture binding. It might be that you need to maintain compatibility with older APIs, hardware, or simply do not have enough time to update all your code to use bindless. And even if you are using bindless texturing, you may still have to copy some descriptors for your samplers.

A common approach with D3D12 is to use a single root signature that has a fixed number of slots for all your shaders. This makes shader resource view management very similar to what you would do with D3D11. The main issue, however, is that if your root signature allows up to, let’s say, 64 SRVs (Shader Resource Views) but your shader only uses 2, you will have to copy 62 null descriptors for “nothing”. This is because, by default, you are not allowed to have invalid descriptors in your descriptor table.

Naive filling of heap descriptors

One may find code that looks like this:

const int MaxSRVsInTable = 64;

const int firstSlot = heap.AllocateSlots(MaxSRVsInTable);
for (int i = 0; i < MaxSRVsInTable; i++)
{
	// If we have a SRV to bind, copy its descriptor
	if (const SomeViewDataStructure* viewData = viewsPtrArray[i])
	{
		devicePTr->CopyDescriptorsSimple(1,
			heap.GetCPUSlotHandle(firstSlot + i),
			viewData->GetCPUDescriptor(),
			D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
	}
	else // Copy a null descriptor
	{
		devicePTr->CopyDescriptorsSimple(1,
			heap.GetCPUSlotHandle(firstSlot + i),
			NullSRVDescriptorCPU,
			D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
	}
}

This is quite inefficient as you have to copy each descriptor one by one! In one of the engines I worked on, almost half of the render thread time was spent on such descriptor copies. That’s a lot.

Did you notice how we always pass 1 to the NumDescriptors of CopyDescriptorsSimple? The first thing you can do to save a lot of time is to copy multiple descriptors at once. But how can we do this since we do not know which slots might be bound by the shader? Maybe our 2 SRVs are using slots 5 and 62. Figuring out what ranges we need to copy the null descriptors to might be as expensive as copying them one by one.

First optimization

Luckily, the DirectX Shader Compiler will allocate registers so that resources you will be using are all packed in the first slots, and holes only really appear if you have conditional access to resources in your shader.

Based on this, we can optimize for one case: empty slots that are after the last referenced slot.

Figure: Drawcalls descriptors padding

You can usually determine the last referenced slot in the shader easily, either from your engine or using reflection (preferably stored in your shader assets) with ID3D12FunctionReflection::GetResourceBindingDesc.

Then, instead of only creating a single null descriptor at the startup of your program, create a contiguous array of MaxSRVsInTable null descriptors. This enables you to call CopyDescriptorsSimple only once for the remaining slots, saving a lot of CPU time!

for (int i = 0; i < nbSrvSlotUsedByShader; i++)
{
	// If we have a SRV to bind, copy its descriptor
	if (const SomeViewDataStructure* viewData = viewsPtrArray[i])
	{
		devicePTr->CopyDescriptorsSimple(1,
			heap.GetCPUSlotHandle(firstSlot + i),
			viewData->GetCPUDescriptor(),
			D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
	}
	else // Copy a null descriptor
	{
		devicePTr->CopyDescriptorsSimple(1,
			heap.GetCPUSlotHandle(firstSlot + i),
			NullSRVDescriptorCPU,
			D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
	}
}
// Copy the remaining null descriptors in a single API call, way, way faster!
if(const int remainingSlots = MaxSRVsInTable - nbSrvSlotUsedByShader)
{
	devicePTr->CopyDescriptorsSimple(remainingSlots,
		heap.GetCPUSlotHandle(firstSlot + nbSrvSlotUsedByShader),
		NullSRVDescriptorsArray,
		D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
}

By applying this optimization on SRVs, CBVs, UAVs and sampler descriptors, we observed a CPU gain of almost 2ms on our rendering thread, for a scene of ~2100 drawcalls!
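To make the saving concrete, the sketch below counts copy calls for both strategies. It does not use D3D12 at all: MockHeap, FillNaive and FillPadded are hypothetical stand-ins, a descriptor is modeled as an int, and 0 plays the role of the null descriptor. It only illustrates the call pattern and says nothing about real timings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t MaxSRVsInTable = 64;

// Stand-in for a descriptor heap (NOT D3D12 API): each slot holds an int id,
// 0 represents a null descriptor and -1 an uninitialized slot.
struct MockHeap
{
    std::vector<int> slots;
    int copyCalls = 0; // number of simulated CopyDescriptorsSimple calls

    explicit MockHeap(std::size_t size) : slots(size, -1) {}

    // Same shape as CopyDescriptorsSimple: one call copies `count`
    // contiguous descriptors starting at `src` into slots[dst...].
    void Copy(std::size_t count, std::size_t dst, const int* src)
    {
        assert(dst + count <= slots.size());
        for (std::size_t i = 0; i < count; ++i)
            slots[dst + i] = src[i];
        ++copyCalls;
    }
};

// Naive version: one call per slot, whether a view is bound or not.
int FillNaive(MockHeap& heap, const std::vector<int>& views)
{
    const int nullDescriptor = 0;
    for (std::size_t i = 0; i < MaxSRVsInTable; ++i)
    {
        const bool bound = i < views.size() && views[i] != 0;
        heap.Copy(1, i, bound ? &views[i] : &nullDescriptor);
    }
    return heap.copyCalls;
}

// Padded version: one call per slot referenced by the shader, then a single
// call for the whole tail, sourced from a contiguous array of nulls.
int FillPadded(MockHeap& heap, const std::vector<int>& views, std::size_t usedByShader)
{
    assert(usedByShader <= MaxSRVsInTable);
    static const std::vector<int> nullArray(MaxSRVsInTable, 0);
    for (std::size_t i = 0; i < usedByShader; ++i)
    {
        const bool bound = i < views.size() && views[i] != 0;
        heap.Copy(1, i, bound ? &views[i] : nullArray.data());
    }
    if (const std::size_t remaining = MaxSRVsInTable - usedByShader)
        heap.Copy(remaining, usedByShader, nullArray.data());
    return heap.copyCalls;
}
```

With two bound views and a shader that references two slots, the naive fill issues 64 calls while the padded fill issues 3, and both leave the table in the same state.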

⚠ Descriptors actually used by the shader need to be of the correct type unless the resource binding tier is appropriate. This means you cannot use a null descriptor of a Texture2D in place of a Texture1D, or a UAV instead of an SRV.

This simple optimization earned us a few precious milliseconds, but can we do better?

Hardware tiers

It happens that we can! While our resource binding code is now way faster than before, D3D12 allows for even greater optimizations here.

According to the documentation:

In summary, to create a null descriptor, pass null for the pResource parameter when creating the view with methods such as CreateShaderResourceView. For the view description parameter pDesc, set a configuration that would work if the resource was not null (otherwise a crash may occur on some hardware).

On Tier1 hardware (see Hardware Tiers), all descriptors that are bound (via descriptor tables) must be initialized, either as real descriptors or null descriptors, even if not accessed by the hardware, otherwise behaviour is undefined.

On Tier2 hardware, this applies to bound CBV and UAV descriptors, but not to SRV descriptors.

On Tier3 hardware, there’s no restriction on this, provided that uninitialized descriptors are never accessed.

This means that depending on the hardware tier, you can even skip the remaining null descriptors copy! You can (must) check the hardware tier before enabling the next optimization. In your renderer startup code, you can call CheckFeatureSupport for that.

// Assume the most restrictive tier unless the query succeeds
D3D12_RESOURCE_BINDING_TIER resourceBindingTier = D3D12_RESOURCE_BINDING_TIER_1;
D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
if (SUCCEEDED(lDevice->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
{
	resourceBindingTier = options.ResourceBindingTier;
}

You may want to enforce a minimum tier of 2 instead of checking it dynamically. You might even already require it implicitly because of other tier requirements. A support matrix is available on the corresponding Wikipedia page.
No GPU launched after 2015 and no CPU launched after 2018 is limited to Tier 1.

The code becomes:

// Copy the remaining null descriptors, unless the hardware tier lets us skip them
if (resourceBindingTier < D3D12_RESOURCE_BINDING_TIER_2)
{
	if (const int remainingSlots = MaxSRVsInTable - nbSrvSlotUsedByShader)
	{
		devicePTr->CopyDescriptorsSimple(remainingSlots,
			heap.GetCPUSlotHandle(firstSlot + nbSrvSlotUsedByShader),
			NullSRVDescriptorsArray,
			D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
	}
}

Note that tier 2 is required for this optimization with SRVs and samplers, and tier 3 for UAVs and CBVs.
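Since the rule differs per descriptor kind, it can help to centralize it in one predicate. The following is a minimal sketch of the rule as stated above, not an official API: BindingTier, DescriptorKind and CanSkipNullPadding are hypothetical names, and the enum values simply mirror D3D12_RESOURCE_BINDING_TIER_1 to _3 so the example compiles without the Windows SDK.

```cpp
#include <cassert>

// Local stand-ins; the numeric values match D3D12_RESOURCE_BINDING_TIER_1..3.
enum class BindingTier { Tier1 = 1, Tier2 = 2, Tier3 = 3 };
enum class DescriptorKind { SRV, CBV, UAV, Sampler };

// Returns true when the unused tail of a descriptor table may be left
// uninitialized, following the rule given in the text:
// tier 2 for SRVs and samplers, tier 3 for CBVs and UAVs.
bool CanSkipNullPadding(BindingTier tier, DescriptorKind kind)
{
    switch (kind)
    {
    case DescriptorKind::SRV:
    case DescriptorKind::Sampler:
        return tier >= BindingTier::Tier2;
    case DescriptorKind::CBV:
    case DescriptorKind::UAV:
        return tier >= BindingTier::Tier3;
    }
    return false; // unknown kind: stay on the safe side and pad
}
```

In a real renderer you would evaluate this once at startup for each descriptor kind and store the results, rather than branching on the tier for every drawcall.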

In our test case (~2100 draw calls), we earned an additional 1ms of CPU time on our render thread by skipping the copies altogether, for a total of 3ms compared to the naive version.

Retrofitting the optimization for lower tiers

We saw that we can have huge gains by skipping the copy of null descriptors, but this is only possible on higher tiers. What if, instead of simply skipping the copy of the padding descriptors, we reused this memory for the next drawcall?

For that, we simply reserve the maximum number of descriptors but only allocate those used by the shader. We cannot allocate only what is needed without reserving, because the heap must still have enough space for all the descriptors of the drawcall's table. Then, before submitting the command list, pad the last drawcall of each heap with null descriptors (or not, if the hardware tier is high enough) instead of padding every single drawcall.

Figure: Overlapping drawcalls descriptors

That way, not only do we save some CPU time, but also memory and more importantly, bandwidth!
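The reserve-then-allocate idea can be sketched with a simple linear allocator. This is an illustration under assumptions, not the engine's actual code: OverlappingTableAllocator and its methods are hypothetical, and the heap is modeled as an array of ints where 0 is a null descriptor and -1 an uninitialized slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t MaxSRVsInTable = 64;

// Hypothetical linear allocator over one descriptor heap (NOT D3D12 API).
class OverlappingTableAllocator
{
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit OverlappingTableAllocator(std::size_t capacity) : m_slots(capacity, -1) {}

    // Requires room for a full table, but only advances the cursor past the
    // slots the shader references, so the next table overlaps the padding area.
    // Returns the first slot of the table, or npos if the heap is full.
    std::size_t AllocateTable(const std::vector<int>& usedDescriptors)
    {
        assert(usedDescriptors.size() <= MaxSRVsInTable);
        if (m_cursor + MaxSRVsInTable > m_slots.size())
            return npos; // a real renderer would switch to another heap here
        const std::size_t first = m_cursor;
        for (std::size_t i = 0; i < usedDescriptors.size(); ++i)
            m_slots[first + i] = usedDescriptors[i];
        m_cursor += usedDescriptors.size();
        m_lastFirst = first;
        m_hasTable = true;
        return first;
    }

    // Call once before submission when the tier requires initialized slots:
    // fills the tail of the last table with null descriptors.
    void PadLastTable()
    {
        if (!m_hasTable)
            return;
        const std::size_t end = m_lastFirst + MaxSRVsInTable;
        for (std::size_t i = m_cursor; i < end; ++i)
            m_slots[i] = 0;
        m_cursor = end;
    }

    std::size_t SlotsConsumed() const { return m_cursor; }
    const std::vector<int>& Slots() const { return m_slots; }

private:
    std::vector<int> m_slots;
    std::size_t m_cursor = 0;    // next free slot
    std::size_t m_lastFirst = 0; // first slot of the most recent table
    bool m_hasTable = false;
};
```

Two draws using 2 and 3 descriptors consume 66 slots after the final padding, compared with 128 when every table is padded to MaxSRVsInTable. Every slot covered by either table ends up holding either a real or a null descriptor.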

Common pitfall

You must ensure that the shaders will not access the “padding” slots. While you can most of the time assume this is the case (as long as the sampling is done in a branch and the branch is never taken), there is a case where this is not true.

If you compile your shaders with the -all-resources-bound parameter (which is recommended for performance! See this article by NVIDIA and blog post from Microsoft), then the compiler will always assume that all the referenced resources are bound. This means that nbSrvSlotUsedByShader must be retrieved from the shader itself (for example through reflection); you cannot simply skip the null descriptors for whichever slots happen to be unbound. For example, if your shader uses 5 SRVs, but the 4th and 5th ones are optional, you still need to bind null descriptors for those 2 slots, but not for the remaining MaxSRVsInTable - 5 slots.
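As a rough sketch of how nbSrvSlotUsedByShader could be derived from reflection data: the SrvBinding struct below is a hypothetical mirror of the BindPoint and BindCount fields of D3D12_SHADER_INPUT_BIND_DESC, used so the example compiles without the Windows SDK. It assumes the list was already filtered to SRV bindings, and it ignores register spaces and unbounded arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical mirror of the two D3D12_SHADER_INPUT_BIND_DESC fields used here.
struct SrvBinding
{
    unsigned bindPoint; // first register (tN)
    unsigned bindCount; // number of contiguous registers
};

// Returns one past the highest SRV register referenced by the shader.
// Every slot below this value needs a real or null descriptor;
// slots at or above it are pure padding.
unsigned ComputeSrvSlotsUsedByShader(const std::vector<SrvBinding>& bindings)
{
    unsigned used = 0;
    for (const SrvBinding& b : bindings)
        used = std::max(used, b.bindPoint + b.bindCount);
    return used;
}
```

For the shader from the example above, with five SRVs at t0 to t4, this returns 5 whether or not the optional textures are bound at runtime, which is the behaviour you want with -all-resources-bound.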

This is why I highly recommend using the DirectX debug layer and even turning on GPU Based Validation as the debug layer alone might not be able to determine if a descriptor is accessed or not.

Note: Timings are given for a machine with an Intel Core i7-9700K CPU @ 3.60GHz.

Clément GRÉGOIRE

Performance & Optimization Expert
