Compute Shader

In this tutorial, we learn how to run simple compute shaders.

We start by importing slangpy and numpy.

[1]:
import slangpy as spy
import numpy as np

Next, we create a Device instance. This object is used for creating and managing resources on the GPU.

[2]:
device = spy.Device()

Most objects in slangpy display useful information when printed:

[3]:
print(device)
Device(
  type = d3d12,
  adapter_name = "NVIDIA GeForce RTX 4090",
  adapter_luid = 00000000000000000000000000000000,
  enable_debug_layers = false,
  supported_shader_model = sm_6_7,
  shader_cache_enabled = false,
  shader_cache_path = ""
)

At a glance we can see which underlying graphics API is being used, whether debug layers are enabled, the default shader model, and so on.

Next, we write a simple slang compute kernel that adds two floating-point arrays. We mark our shader entry point using the [shader("compute")] attribute. This allows the slang compiler to find the entry point by name.

// compute_shader.slang

[shader("compute")]
[numthreads(32, 1, 1)]
void main(
    uint tid: SV_DispatchThreadID,
    uniform uint N,
    StructuredBuffer<float> a,
    StructuredBuffer<float> b,
    RWStructuredBuffer<float> c
)
{
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
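Before dispatching anything, it can help to see what the kernel computes. The snippet below is a plain numpy stand-in for the shader's logic (not part of the tutorial's GPU code); the slicing mirrors the `if (tid < N)` bounds check, which matters because the dispatch rounds the thread count up to a multiple of the group size.

```python
import numpy as np

# CPU reference for the kernel: c[tid] = a[tid] + b[tid] for tid < N.
def add_arrays_reference(a: np.ndarray, b: np.ndarray, N: int) -> np.ndarray:
    c = np.zeros(N, dtype=np.float32)
    c[:N] = a[:N] + b[:N]  # mirrors the shader's `if (tid < N)` guard
    return c

a = np.arange(8, dtype=np.float32)
b = np.full(8, 2.0, dtype=np.float32)
print(add_arrays_reference(a, b, 8))  # elementwise sum: 2, 3, ..., 9
```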

We can load the shader program using load_program, passing in the shader module name and the entry point name. Once we have the program loaded, we can create a new compute kernel using the loaded program.

[4]:
program = device.load_program(module_name="compute_shader.slang", entry_point_names=["main"])
kernel = device.create_compute_kernel(program=program)

Next, we create the buffers to pass to our compute shader. Buffers a and b are used as inputs only, while buffer c is used as an output. We create all three as structured buffers, using the kernel's reflection data to determine the size of each element in the buffer. Buffers a and b are initialized with linear sequences using numpy.linspace. Buffer c is not initialized, but we have to set its usage to spy.BufferUsage.unordered_access in order to allow GPU-side writes.

[5]:
buffer_a = device.create_buffer(
    element_count=1024,
    struct_type=kernel.reflection.main.a,
    usage=spy.BufferUsage.shader_resource,
    data=np.linspace(0, 1, 1024, dtype=np.float32),
)
buffer_b = device.create_buffer(
    element_count=1024,
    struct_type=kernel.reflection.main.b,
    usage=spy.BufferUsage.shader_resource,
    data=np.linspace(1, 0, 1024, dtype=np.float32),
)
buffer_c = device.create_buffer(
    element_count=1024,
    struct_type=kernel.reflection.main.c,
    usage=spy.BufferUsage.unordered_access,
)

We can now dispatch the compute kernel. We first specify the number of threads to run using thread_count=[1024, 1, 1]. This is automatically converted to a number of thread groups based on the thread group size specified in the shader ([numthreads(32, 1, 1)]). We pass the entry point parameters as additional keyword arguments.

[6]:
kernel.dispatch(thread_count=[1024, 1, 1], N=1024, a=buffer_a, b=buffer_b, c=buffer_c)
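The thread-count-to-group-count conversion that dispatch performs is a ceiling division by the group size. A small sketch of that arithmetic (an illustration, not slangpy's actual implementation):

```python
# Number of thread groups launched per axis, given a requested thread
# count and the shader's [numthreads(...)] group size.
def group_count(thread_count, group_size):
    return [(t + g - 1) // g for t, g in zip(thread_count, group_size)]

print(group_count([1024, 1, 1], [32, 1, 1]))  # 1024 / 32 = exactly 32 groups
print(group_count([1000, 1, 1], [32, 1, 1]))  # rounds up: 32 groups, last one partial
```

The rounding up is why the shader needs the `if (tid < N)` guard: the last group may contain threads past the end of the data.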

After the dispatch, we can read back the contents of the c buffer to a numpy array and print it.

[7]:
data = buffer_c.to_numpy().view(np.float32)
print(data)
assert np.all(data == 1.0)
[1. 1. 1. ... 1. 1. 1.]
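The all-ones output follows directly from the choice of inputs: the two linspace ramps are mirror images of each other, so each pair of elements sums to 1.0 (up to float32 rounding). This is easy to confirm on the CPU:

```python
import numpy as np

# The two input ramps from the buffer setup above.
a = np.linspace(0, 1, 1024, dtype=np.float32)
b = np.linspace(1, 0, 1024, dtype=np.float32)

# a[i] rises as b[i] falls by the same amount, so every pair sums to ~1.0.
assert np.allclose(a + b, 1.0)
```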

See also