keywords: DirectX 12, Direct3D 12, D3D12, Vulkan, Asynchronous Compute, Async Compute

Overview

Quoted from digitaltrends.com:
Another significant change in DirectX 12 is parallel compute. DirectX 11 handles serial operations, which means there’s a single queue of operations that execute in order. Parallel compute opens up the option for developers to make multiple calls at the same time, vastly improving the efficiency of operations.

Similarly, DirectX 12 opens up the possibility of asynchronous operations. This is similar to parallel compute, but they’re not the same thing. Asynchronous compute allows your hardware to continue operations without waiting for another operation to complete. For example, your CPU can execute an instruction to fetch textures from memory and move on to executing another function (like AI for a character) without waiting for that memory instruction to finish. Avoiding these waits can shave small amounts of latency in thousands of places, making your games run much faster overall.
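The CPU example in the quote above can be sketched in a few lines of C++. This is only a CPU-side analogy using std::async, not Direct3D 12 code; load_texture_bytes, update_ai, FrameResult and run_frame are hypothetical names made up for illustration, and the 20 ms sleep merely stands in for memory latency.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for a slow fetch of texture data from memory or disk.
std::vector<unsigned char> load_texture_bytes(std::size_t size) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20)); // simulated latency
    return std::vector<unsigned char>(size, 0x7F);
}

// Hypothetical stand-in for unrelated CPU work, such as character AI.
int update_ai(int ticks) {
    int state = 0;
    for (int i = 1; i <= ticks; ++i) state += i;
    return state;
}

struct FrameResult {
    std::size_t texture_bytes;
    int ai_state;
};

// Issue the fetch, keep working, and block only when the data is needed.
FrameResult run_frame() {
    std::future<std::vector<unsigned char>> pending =
        std::async(std::launch::async, load_texture_bytes, std::size_t{1024});

    int ai_state = update_ai(100); // runs while the fetch is still in flight

    std::vector<unsigned char> texture = pending.get(); // synchronization point
    return FrameResult{texture.size(), ai_state};
}
```

The point of the sketch is the ordering: the wait happens at pending.get(), after the independent work has already been done, rather than immediately after the fetch is issued.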

AMD Cards Are Better than NVIDIA’s at Async Compute

Quoted from reddit.com:
Firstly, the topic at hand is hard to understand because people have confused two similar concepts. What AMD is better at is Asynchronous Shaders, which is different from Asynchronous Compute. With Async Shaders the GPU is not doing two things at once, and is therefore not working in parallel. What does this mean? AMD’s GPUs have die space dedicated to ACEs; these handle the switching between tasks at a hardware level, which is called context switching.

Context switching means interrupting one running thread, storing its associated data in a cache, and working on another thread. AMD (GCN) can do a quick context switch because it has a dedicated buffer (I forgot what they call it), so it can afford to run concurrent loads on a single CU (64 ALUs, or rather four 16-wide vector units).

NVIDIA cannot do this; instead they use dynamic load balancing (as opposed to static load balancing on Maxwell) to repartition the SMs between compute and graphics when necessary. This is my understanding at least. Async compute is simply a concept; the fact that AMD’s implementation operates at the CU level does not make it better or worse, simply more suited to the architecture.

For the record, GCN can’t do graphics and compute in parallel within one CU. I found this particularly strange when the issue exploded initially. Lots of people, well-informed people, mistakenly claimed it is done in parallel. That is absurd. There’s a context switch; it’s fast, but it’s a context switch. So it is not parallel. It is concurrent. Any parallelism would stem from work done on different CUs, exactly like on Maxwell and Pascal.

TLDR:

  • Async Compute = paradigm
  • Async Shaders = AMD implementation
  • AMD (GCN): concurrent graphics + compute within ONE CU
  • NVIDIA (CUDA): concurrent graphics + compute within one GPC (5 SMs)
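The concurrent-versus-parallel distinction drawn in the quote can be made concrete with a toy timing model. This is a sketch under loud assumptions: costs are abstract time units, one execution unit alternates between the two streams one unit at a time, and context switches are free. The names Timeline, run_concurrent_on_one_unit and run_parallel_on_two_units are invented for illustration and do not model any real GPU scheduler.

```cpp
#include <algorithm>
#include <cassert>

// Completion times in abstract time units (toy model, not hardware-accurate).
struct Timeline {
    int graphics_done;
    int compute_done;
    int total;
};

// One execution unit alternating between the two streams, one time unit per
// turn: both streams make progress (concurrent), but never at the same instant.
Timeline run_concurrent_on_one_unit(int graphics_cost, int compute_cost) {
    int t = 0, g = graphics_cost, c = compute_cost;
    Timeline out{0, 0, 0};
    bool turn_graphics = true;
    while (g > 0 || c > 0) {
        if ((turn_graphics && g > 0) || c == 0) {
            --g; ++t;                          // one unit of graphics work
            if (g == 0) out.graphics_done = t;
        } else {
            --c; ++t;                          // one unit of compute work
            if (c == 0) out.compute_done = t;
        }
        turn_graphics = !turn_graphics;        // assumed zero-cost context switch
    }
    out.total = t;
    return out;
}

// Two execution units, one stream each: the streams truly overlap (parallel).
Timeline run_parallel_on_two_units(int graphics_cost, int compute_cost) {
    return Timeline{graphics_cost, compute_cost,
                    std::max(graphics_cost, compute_cost)};
}
```

With a graphics cost of 4 and a compute cost of 2, the single shared unit finishes at time 6 (the sum), while two separate units finish at time 4 (the maximum). The model deliberately leaves out the reason interleaving helps on real hardware, namely that one workload can use execution slots the other leaves idle.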

Quoted from reddit.com:
Very much is, along with mesh shaders and other tech. It’s why the GTX 10 series cards have been falling behind GPUs that they handily beat before.

In Cyberpunk 2077 for example the GTX 1080 Ti is being beaten by a 2060 Super and 5700XT, something that would have been hard to imagine back during the DX11 days.

Hardware Unboxed revisited the 1080 Ti back in 2021 and it was barely ahead of the 5700XT. I believe if they do a re-review today it might be behind even the 2060 Super.

Documents

Advanced Graphics Tech: “Async Compute: Deep Dive” & “Raster Ordered Views and Conservative Rasterization”
https://www.gdcvault.com/play/1024385/Advanced-Graphics-Tech-Async-Compute

D3D Async compute for physics
Flex is a particle-based simulation library designed for real-time applications.
https://github.com/NVIDIAGameWorks/FleX

Using asynchronous compute on Arm Mali GPUs: A practical sample
https://community.arm.com/arm-community-blogs/b/graphics-gaming-and-vr-blog/posts/using-asynchronous-compute-on-arm-mali-gpus

Moving To Vulkan Asynchronous Compute

Leveraging Asynchronous Queues for Concurrent Execution
https://gpuopen.com/learn/concurrent-execution-asynchronous-queues/

Optimizing the Graphics Pipeline with Compute, GDC 2016
https://www.slideshare.net/slideshow/optimizing-the-graphics-pipeline-with-compute-gdc-2016/59747720

Examples

This sample demonstrates the use of asynchronous compute shaders (multi-engine) to simulate an n-body gravity system.
https://github.com/microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12nBodyGravity

nBody DirectX® 12 Sample (asynchronous compute version): slightly modified by AMD
https://github.com/GPUOpen-LibrariesAndSDKs/nBodyD3D12/tree/master/Samples/D3D12nBodyGravity

Vulkan timeline semaphore + async compute performance sample
https://github.com/nvpro-samples/vk_timeline_semaphore
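For readers new to timeline semaphores, the idea is a 64-bit counter that only increases: one side signals a value when its work is done, and the other side waits until the counter has reached that value. The following is a CPU-side model of those semantics using a mutex and a condition variable. It is not Vulkan code and is not taken from the sample above; TimelineModel and simulate_frames are hypothetical names, and in the real API the corresponding operations are submitted to queues or issued through calls such as vkSignalSemaphore and vkWaitSemaphores.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// CPU-side model of a timeline semaphore: a counter that only moves forward.
class TimelineModel {
public:
    // Raise the counter to `value` and wake waiters. This model silently
    // ignores non-increasing values; that leniency is an assumption of the sketch.
    void signal(std::uint64_t value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (value > counter_) counter_ = value;
        }
        cv_.notify_all();
    }
    // Block until the counter has reached at least `value`.
    void wait(std::uint64_t value) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return counter_ >= value; });
    }
    std::uint64_t value() {
        std::lock_guard<std::mutex> lock(mutex_);
        return counter_;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::uint64_t counter_ = 0;
};

// Hypothetical frame loop: a "compute" thread writes the result for frame N
// and signals N; the "graphics" side waits for N before reading that result.
int simulate_frames(int frame_count) {
    TimelineModel timeline;
    std::vector<int> simulated(frame_count + 1, 0); // one slot per frame
    std::thread compute([&] {
        for (int frame = 1; frame <= frame_count; ++frame) {
            simulated[frame] = frame * 10;          // stand-in for simulation output
            timeline.signal(static_cast<std::uint64_t>(frame));
        }
    });
    int rendered_sum = 0;
    for (int frame = 1; frame <= frame_count; ++frame) {
        timeline.wait(static_cast<std::uint64_t>(frame));
        rendered_sum += simulated[frame];           // safe: frame N is complete
    }
    compute.join();
    return rendered_sum;
}
```

Because the compute side can run ahead and signal several frames before the graphics side consumes the first one, a single counter replaces the per-frame pool of binary semaphores and fences that would otherwise be needed.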

Vulkan implementation of a particle rendering system using async compute.
https://github.com/Chainsawkitten/AsyncCompute

