Compute Shader
In this tutorial, we learn how to run simple compute shaders.
We start by importing slangpy and numpy.
[1]:
import slangpy as spy
import numpy as np
Next, we create a Device instance. This object is used for creating and managing resources on the GPU.
[2]:
device = spy.Device()
Most objects in slangpy will display useful information when printed:
[3]:
print(device)
Device(
type = d3d12,
adapter_name = "NVIDIA GeForce RTX 4090",
adapter_luid = 00000000000000000000000000000000,
enable_debug_layers = false,
supported_shader_model = sm_6_7,
shader_cache_enabled = false,
shader_cache_path = ""
)
At a glance we can see which underlying graphics API is being used, whether debug layers are enabled, the default shader model, and so on.
Next, we write a simple slang compute kernel that adds two floating point arrays. We mark our shader entry point using the [shader("compute")] attribute, which allows the slang compiler to find the entry point by name.
// compute_shader.slang
[shader("compute")]
[numthreads(32, 1, 1)]
void main(
uint tid: SV_DispatchThreadID,
uniform uint N,
StructuredBuffer<float> a,
StructuredBuffer<float> b,
RWStructuredBuffer<float> c
)
{
if (tid < N)
c[tid] = a[tid] + b[tid];
}
We can load the shader program using load_program, passing in the shader module name and the entry point name. Once the program is loaded, we can create a new compute kernel from it.
[4]:
program = device.load_program(module_name="compute_shader.slang", entry_point_names=["main"])
kernel = device.create_compute_kernel(program=program)
We continue by creating buffers to pass to our compute shader. Buffers a and b are used as inputs only, while buffer c is used as an output. We create all three as structured buffers, using the kernel's reflection data to determine the size of each buffer element. Buffers a and b are initialized with linear sequences using numpy.linspace. Buffer c is not initialized, but we have to set its usage to spy.BufferUsage.unordered_access to allow GPU-side writes.
[5]:
buffer_a = device.create_buffer(
element_count=1024,
struct_type=kernel.reflection.main.a,
usage=spy.BufferUsage.shader_resource,
data=np.linspace(0, 1, 1024, dtype=np.float32),
)
buffer_b = device.create_buffer(
element_count=1024,
struct_type=kernel.reflection.main.b,
usage=spy.BufferUsage.shader_resource,
data=np.linspace(1, 0, 1024, dtype=np.float32),
)
buffer_c = device.create_buffer(
element_count=1024,
struct_type=kernel.reflection.main.c,
usage=spy.BufferUsage.unordered_access,
)
We can now dispatch the compute kernel. We first specify the number of threads to run using thread_count=[1024, 1, 1]. This is automatically converted to a number of thread groups based on the thread group size specified in the shader ([numthreads(32, 1, 1)]). We pass the entry point parameters as additional keyword arguments.
[6]:
kernel.dispatch(thread_count=[1024, 1, 1], N=1024, a=buffer_a, b=buffer_b, c=buffer_c)
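The thread-count to thread-group conversion can be sketched in plain Python. This is a hypothetical helper for illustration, not part of the slangpy API; the key point is the rounding up, so that every requested thread is covered even when the count is not a multiple of the group size.

```python
import math

def thread_groups(thread_count, group_size):
    # Ceiling division per axis: round up so every requested
    # thread is covered by a full thread group.
    return [math.ceil(tc / gs) for tc, gs in zip(thread_count, group_size)]

# 1024 threads with a group size of 32 -> 32 groups on the x axis.
print(thread_groups([1024, 1, 1], [32, 1, 1]))  # [32, 1, 1]
```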
After the dispatch, we can read back the contents of the c buffer into a numpy array and print it.
[7]:
data = buffer_c.to_numpy().view(np.float32)
print(data)
assert np.all(data == 1.0)
[1. 1. 1. ... 1. 1. 1.]
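The all-ones result follows directly from how the inputs were built: element i of a is i/1023 and element i of b is 1 - i/1023, so each pair sums to 1. A quick CPU-side sketch with numpy, mirroring the buffer initialization above, confirms this without touching the GPU:

```python
import numpy as np

# Same sequences used to initialize buffer_a and buffer_b.
a = np.linspace(0, 1, 1024, dtype=np.float32)
b = np.linspace(1, 0, 1024, dtype=np.float32)

# CPU reference for the kernel: c[i] = a[i] + b[i].
c = a + b
print(c[:3], c[-3:])
```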