keywords: Graphics, Vulkan Subgroup, D3D12 Warp, Wave Intrinsics, Wavefronts, SIMD, GPU Scalarization

Overview
  • Vulkan/OpenCL calls it a Subgroup
  • D3D12 calls it a Wave
  • Nvidia calls it a Warp
  • AMD calls it a Wavefront
Documents

Wave Programming in D3D12 and Vulkan - GDC2017
https://gpuopen.com/wp-content/uploads/2017/07/GDC2017-Wave-Programming-D3D12-Vulkan.pdf

Surfing the Wave(front)s with Radeon GPU Profiler
https://gpuopen.com/presentations/2019/Surfing_the_Wavefronts.pdf

Occupancy explained
https://gpuopen.com/learn/occupancy-explained/

SIMD in the GPU world
https://www.rastergrid.com/blog/gpu-tech/2022/02/simd-in-the-gpu-world/

Subgroups - 2018 Vulkan Devday
https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf

Wave Intrinsics
https://github.com/Microsoft/DirectXShaderCompiler/wiki/Wave-Intrinsics

Unlocking GPU Intrinsics in HLSL
https://developer.nvidia.com/blog/unlocking-gpu-intrinsics-in-hlsl/

Wave Intrinsics - Intel® Arc™ A-series Graphics Gaming API

HLSL 6.0 wave operations
https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_feature_data_d3d12_options1

INTRO TO GPU SCALARIZATION – PART 1 (Recommended)
https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-1/
INTRO TO GPU SCALARIZATION – PART 2 -SCALARIZE ALL THE LIGHTS
https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-2-scalarize-all-the-lights/

Blogs

Visualizing GL_NV_shader_sm_builtins
https://wunkolo.github.io/post/2020/02/visualizing-gl_nv_shader_sm_builtins/

Forums

Quote from gamedev.net:
None of the things you’ve quoted are contractory – the first quote says that a wavefront is 64 threads, not that a wavefront is 1 thread.

A SIMD unit can have up to 10 wavefronts in flight at once. Each wavefront contains 64 threads. Hence a SIMD unit can have up to 640 threads in flight at once (in multiples of 64).

The scheduler will take the pixels/vertices that need to be processed, allocate one thread per pixel/vertex, and then tries to group up to 64 threads together into a wavefront. That bundle of threads is then given to a SIMD, which runs the shader code.

The number of wavefronts that ‘fit’ into a SIMD depends on the complexity of the shader code. For simple shaders, you can squeeze 10 wavefronts at a time into a SIMD, but for complex shaders you may only be able to fit one or two wavefronts into a SIMD.

This is because different shaders require different numbers of temporary registers, which are stored in the SIMD’s register array. Say the SIMD has 1000 registers in total – if a shader uses 100 or less, then you can fit 10 (or more) “instances” of that shader into the register array. If a shader uses 500 temporary registers, then only two “instances” of that shader will fit into the SIMD - so the SIMD will only accept two concurrent wavefronts.

Each “register” actually contains 64 floats – which is why this calculation is done for wavefronts and not threads. One register is used by a wavefront to store a value for each of it’s threads.

What’s a “wavefront” in the context of real-time rendering?
https://stackoverflow.com/a/70249817/1645289

Clustered shading - why deferred?
https://www.gamedev.net/forums/topic/683544-clustered-shading-why-deferred/

What are screen space derivatives and when would I use them?
https://gamedev.stackexchange.com/a/130933/117871


Teachers open the door, but you must enter by yourself. -Chinese Proverbs