Compute Shader
In this tutorial, we learn how to use the low-level ComputeKernel API to create, compile, and dispatch compute shaders manually, rather than using the functional API covered in the basics section. Whilst this is typically unnecessary, it can be a useful tool, especially when converting code bases that already contain a significant number of hand-written compute kernels.
We start by importing slangpy and numpy.
[1]:
import slangpy as spy
import numpy as np
Next, we create a Device instance. This object is used for creating and managing resources on the GPU.
[2]:
device = spy.Device()
Most objects in slangpy display useful information when printed:
[3]:
print(device)
Device(
  type = d3d12,
  adapter_name = "NVIDIA GeForce RTX 4090",
  adapter_luid = 992d0100000000000000000000000000,
  enable_debug_layers = false,
  enable_cuda_interop = false,
  enable_print = false,
  enable_hot_reload = true,
  enable_compilation_reports = false,
  supported_shader_model = sm_6_7,
  shader_cache_enabled = false,
  shader_cache_path = ""
)
At a glance, we can see which underlying graphics API is in use, whether debug layers are enabled, which shader model is supported, and so on.
Next, we write a simple Slang compute kernel that adds two floating-point arrays. We mark our shader entry point using the [shader("compute")] attribute. This allows the Slang compiler to find the entry point by name.
// compute_shader.slang
[shader("compute")]
[numthreads(32, 1, 1)]
void main(
    uint tid: SV_DispatchThreadID,
    uniform uint N,
    StructuredBuffer<float> a,
    StructuredBuffer<float> b,
    RWStructuredBuffer<float> c
)
{
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
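To make the kernel's per-thread behaviour concrete, here is a minimal CPU-side sketch in Python that mirrors it. The function name `add_kernel_reference` is hypothetical and not part of slangpy; it simply loops over thread ids and applies the same `tid < N` guard, so threads past the end of the arrays do nothing.

```python
import numpy as np

def add_kernel_reference(thread_count, N, a, b, c):
    """CPU emulation of the shader's main(): one loop iteration per GPU thread."""
    for tid in range(thread_count):
        # Same bounds guard as the shader: surplus threads do no work.
        if tid < N:
            c[tid] = a[tid] + b[tid]
    return c

a = np.array([0.0, 0.5, 1.0], dtype=np.float32)
b = np.array([1.0, 0.5, 0.0], dtype=np.float32)
c = np.zeros(3, dtype=np.float32)

# Launch 32 "threads" (one full thread group) for only 3 elements.
add_kernel_reference(thread_count=32, N=3, a=a, b=b, c=c)
print(c)
```

Without the guard, the 29 surplus iterations would index past the end of the three-element arrays, which is the CPU analogue of an out-of-bounds buffer access on the GPU.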
We can load the shader program using the low-level API with Device.load_program, passing in the shader module name and the entry point name. Once the program is loaded, we can create a new compute kernel from it.
[4]:
program = device.load_program(module_name="compute_shader.slang", entry_point_names=["main"])
kernel = device.create_compute_kernel(program=program)
Next, we create the buffers to pass to our compute shader. Buffers a and b are used as inputs only, while buffer c is used as the output. We create all three as structured buffers, using the kernel's reflection data to determine the size of each element. Buffers a and b are initialized with linear sequences using numpy.linspace. Buffer c is not initialized, but we have to set its usage to spy.BufferUsage.unordered_access in order to allow GPU-side writes.
[5]:
buffer_a = device.create_buffer(
    element_count=1024,
    resource_type_layout=kernel.reflection.main.a,
    usage=spy.BufferUsage.shader_resource,
    data=np.linspace(0, 1, 1024, dtype=np.float32),
)
buffer_b = device.create_buffer(
    element_count=1024,
    resource_type_layout=kernel.reflection.main.b,
    usage=spy.BufferUsage.shader_resource,
    data=np.linspace(1, 0, 1024, dtype=np.float32),
)
buffer_c = device.create_buffer(
    element_count=1024,
    resource_type_layout=kernel.reflection.main.c,
    usage=spy.BufferUsage.unordered_access,
)
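As a rough sanity check on the sizes involved, the following sketch uses only NumPy. It assumes that each element of a StructuredBuffer<float> occupies 4 bytes (one 32-bit float), which is what the reflection data would be expected to report; the variable names are illustrative and not part of slangpy.

```python
import numpy as np

ELEMENT_COUNT = 1024

# Host-side data matching what is uploaded to buffers a and b.
data_a = np.linspace(0, 1, ELEMENT_COUNT, dtype=np.float32)
data_b = np.linspace(1, 0, ELEMENT_COUNT, dtype=np.float32)

# Assumption: one float element occupies 4 bytes in the structured buffer.
element_size = np.dtype(np.float32).itemsize
expected_bytes = ELEMENT_COUNT * element_size

print(element_size, expected_bytes, data_a.nbytes)
```

If the initial data and the element count disagree about the total size, that is a likely place for buffer creation to go wrong, so it can be worth checking on the host first.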
We can now dispatch the compute kernel. We first specify the number of threads to run using thread_count=[1024, 1, 1]. This is automatically converted to a number of thread groups based on the thread group size specified in the shader ([numthreads(32, 1, 1)]). We pass the entry point parameters as additional keyword arguments.
[6]:
kernel.dispatch(thread_count=[1024, 1, 1], N=1024, a=buffer_a, b=buffer_b, c=buffer_c)
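The conversion from a thread count to a number of thread groups can be sketched as a ceiling division per dimension. This illustrates the arithmetic under that assumption and is not slangpy's actual implementation; `thread_group_count` is a hypothetical helper.

```python
def thread_group_count(thread_count, group_size):
    """Round up per dimension so every requested thread is covered."""
    return [(t + g - 1) // g for t, g in zip(thread_count, group_size)]

# The dispatch above: 1024 threads with 32 threads per group.
groups = thread_group_count([1024, 1, 1], [32, 1, 1])
print(groups)

# A count that is not a multiple of the group size rounds up, which is
# why the shader guards its work with `if (tid < N)`.
ragged = thread_group_count([1000, 1, 1], [32, 1, 1])
print(ragged)
```

In the second case, 32 groups of 32 threads launch 1024 threads for 1000 elements, so the last 24 threads rely on the bounds check to avoid touching memory past the end of the buffers.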
After the dispatch, we can read back the contents of the c buffer to a numpy array and print it.
[7]:
data = buffer_c.to_numpy().view(np.float32)
print(data)
assert np.all(data == 1.0)
[1. 1. 1. ... 1. 1. 1.]
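The `.view(np.float32)` call reinterprets the bytes of the returned array as 32-bit floats without copying them. A CPU-only sketch of the same idea follows. It assumes the readback is a flat array of raw bytes (the dtype actually returned by `to_numpy()` may differ) and uses `tobytes` and `np.frombuffer` as a stand-in for the GPU round trip.

```python
import numpy as np

a = np.linspace(0, 1, 1024, dtype=np.float32)
b = np.linspace(1, 0, 1024, dtype=np.float32)

# Stand-in for the GPU result: compute the sum on the CPU and flatten it to raw bytes.
raw = np.frombuffer((a + b).tobytes(), dtype=np.uint8)

# Reinterpret every 4 bytes as one float32, as in the readback above.
data = raw.view(np.float32)

print(raw.shape, data.shape)
assert np.allclose(data, 1.0)
```

Because the two input ramps run in opposite directions, every element sums to 1, which is why the tutorial's output is an array of ones.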