Tensors In Slang

Introduction

SlangPy provides a comprehensive set of tensor types for use within Slang shaders. These types let you work with multi-dimensional arrays on the GPU with different access patterns and capabilities. All tensor types are generic over an element type T and a number of dimensions D.

Tensor Type Overview

Commonly Used Types

The four most commonly used tensor types are:

Tensor<T, D> - Read-only tensor for loading data
RWTensor<T, D> - Read-write tensor for loading and storing data
DiffTensor<T, D> - Read-only differentiable tensor (for forward/backward passes)
WDiffTensor<T, D> - Write-only differentiable tensor (for forward/backward passes)

These cover the majority of use cases for compute kernels and differentiable programming.

Complete Type List

For specialized scenarios, additional tensor types are available:

Non-differentiable tensors:

Tensor<T, D> - Read-only
WTensor<T, D> - Write-only
RWTensor<T, D> - Read-write
AtomicTensor<T, D> - Read-write with atomic operations (requires T : IAtomicAddable)

Differentiable tensors:

DiffTensor<T, D> - Read-only with writable atomic output gradients
WDiffTensor<T, D> - Write-only with readable input gradients
RWDiffTensor<T, D> - Read-write with both gradients

Primal-only differentiable tensors (store only primal values, no separate gradient buffers):

PrimalTensor<T, D> - Read-only primal tensor
WPrimalTensor<T, D> - Write-only primal tensor
RWPrimalTensor<T, D> - Read-write primal tensor

Note

What are PrimalTensor types for?

PrimalTensor types exist primarily as an internal mechanism to allow passing tensors without gradients to IDiffTensor interfaces.

When you write a function accepting IDiffTensor<T, D> and call it with a Python tensor that has no gradients attached, SlangPy binds a PrimalTensor instead of a DiffTensor. This avoids errors from trying to bind null gradient buffers and eliminates unnecessary binding overhead.

Once upcoming Slang auto-diff improvements land, the need for separate IDiffTensor interfaces will be removed, and PrimalTensor types will no longer be necessary. At that point, the only difference between DiffTensor and Tensor will be that the former has gradient storage and the latter does not.

Interface Types

For maximum flexibility when writing reusable functions, SlangPy provides interface types that abstract over the concrete tensor implementations. Using interfaces is strongly recommended for function parameters because:

The generated kernel can choose the most efficient storage type (e.g., PrimalTensor vs DiffTensor)
The same function can work for both forward and backward differentiation passes
Code is more generic and reusable, and will be compatible with future tensor types added to SlangPy.

Available tensor interfaces:

ITensor<T, D> - Read-only tensor interface
IWTensor<T, D> - Write-only tensor interface
IRWTensor<T, D> - Read-write tensor interface
IDiffTensor<T, D> - Read-only differentiable tensor interface
IWDiffTensor<T, D> - Write-only differentiable tensor interface
IRWDiffTensor<T, D> - Read-write differentiable tensor interface

// Good: Uses interface types - works with any compatible tensor implementation
void process_data(int2 idx, ITensor<float, 2> input, IRWTensor<float, 2> output)
{
    float value = input[idx];
    output[idx] = value * 2.0;
}

// Less flexible: Requires specific tensor types
void process_data_concrete(int2 idx, Tensor<float, 2> input, RWTensor<float, 2> output)
{
    float value = input[idx];
    output[idx] = value * 2.0;
}

When SlangPy generates a kernel that calls a function accepting interface types, it automatically selects the appropriate concrete type based on the Python tensor’s properties (read-only, writable, differentiable, etc.).
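The selection rule described in this document can be written out as a small lookup. The Python sketch below is a reading aid only: `select_concrete_type` and its parameters are hypothetical names, not part of the SlangPy API, and the branch for non-differentiable interfaces is an assumption (the document only spells out the PrimalTensor versus DiffTensor case).

```python
# Hypothetical helper: rebuilds the documented type names from three properties.
# This is NOT SlangPy's implementation, only a summary of the naming scheme.

ACCESS_PREFIX = {"read": "", "write": "W", "read_write": "RW"}


def select_concrete_type(access: str, wants_diff_interface: bool, has_grads: bool) -> str:
    """Return the documented concrete type name for one function argument."""
    if not wants_diff_interface:
        # Assumption: ITensor / IWTensor / IRWTensor parameters map to the
        # plain (non-differentiable) family.
        family = "Tensor"
    elif has_grads:
        # Gradients attached: primal and gradient buffers are both bound.
        family = "DiffTensor"
    else:
        # No gradients attached: primal-only storage avoids null gradient buffers.
        family = "PrimalTensor"
    return ACCESS_PREFIX[access] + family


print(select_concrete_type("read", True, False))        # PrimalTensor
print(select_concrete_type("write", True, True))        # WDiffTensor
print(select_concrete_type("read_write", False, False)) # RWTensor
```

The access prefix and the storage family vary independently, which is why a function written against an interface can be bound to any of the nine concrete types.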

Currently, the main use case for the concrete tensor types is direct access to the gradient buffers for custom operations, such as manually written backward passes, because these buffers are only exposed by the concrete DiffTensor types.

Tensor Operations

Accessing Elements

Tensors support multiple ways to access elements:

Subscript operator:

void example(int2 idx, ITensor<float, 2> input, IRWTensor<float, 2> output)
{
    // Read using subscript with array indices. Arrays use [row, col] order,
    // so the vector components are reversed (idx.x is the column, idx.y the row)
    int[2] arr_idx = {idx.y, idx.x};
    float value = input[arr_idx];

    // Read using subscript with vector indices
    int2 vec_idx = int2(idx.x, idx.y);
    value = input[vec_idx];

    // Read using subscript with variadic indices, in (row, col) order
    value = input[idx.y, idx.x];

    // Write using subscript; all three forms address the same element
    output[arr_idx] = value * 2.0;
    output[vec_idx] = value * 2.0;
    output[idx.y, idx.x] = value * 2.0;
}

load/store methods:

void example_loadstore(int2 idx, ITensor<float, 2> input, IRWTensor<float, 2> output)
{
    // Load using array indices. Arrays use [row, col] order,
    // so the vector components are reversed (idx.x is the column, idx.y the row)
    int[2] arr_idx = {idx.y, idx.x};
    float value = input.load(arr_idx);

    // Load using vector indices
    int2 vec_idx = int2(idx.x, idx.y);
    value = input.load(vec_idx);

    // Load using variadic indices, in (row, col) order
    value = input.load(idx.y, idx.x);

    // Store; all three forms address the same element
    output.store(arr_idx, value * 2.0);
    output.store(vec_idx, value * 2.0);
    output.store(idx.y, idx.x, value * 2.0);
}

Warning

Index Convention Differences

As noted in Index Representation Conventions, there is an important difference between array and vector indexing:

Array indices: int[2] - Follow tensor dimension order (e.g., [row, col] for 2D)
Variadic indices: Multiple integer arguments in order (e.g., (row, col) for 2D)
Vector indices: int2 - Reverse order with x component indexing the rightmost dimension (e.g., (col, row) for 2D)

void index_demo(ITensor<float, 2> tensor)
{
    // These access the SAME element at row=3, col=5:
    int[2] arr_idx = {3, 5};        // Array: [row, col]
    int2 vec_idx = int2(5, 3);      // Vector: (col, row) - note reversed order!

    float value1 = tensor.load(arr_idx);
    float value2 = tensor.load(vec_idx);  // Same as value1
}
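The same convention can be checked on the CPU with a short Python sketch. It assumes a 2D tensor stored as nested lists indexed [row][col]; the helper names are illustrative and do not exist in SlangPy.

```python
def vector_to_array_index(vec_idx):
    """Convert a vector-style index (x, y, ...) to array order [..., row, col]."""
    return tuple(reversed(vec_idx))


def load_array(tensor, arr_idx):
    """Load using array order: (row, col)."""
    row, col = arr_idx
    return tensor[row][col]


def load_vector(tensor, vec_idx):
    """Load using vector order: (x, y) = (col, row)."""
    return load_array(tensor, vector_to_array_index(vec_idx))


# 4 rows x 6 columns; each element encodes its own position as row * 10 + col
tensor = [[row * 10 + col for col in range(6)] for row in range(4)]

# Array index (row=3, col=5) and vector index (x=5, y=3) name the same element
print(load_array(tensor, (3, 5)))   # 35
print(load_vector(tensor, (5, 3)))  # 35
```

Using a non-square tensor in such checks is deliberate: with a square tensor, a swapped index still lands in bounds and the mistake is easy to miss.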

Tensor Properties

All tensors expose a shape property to query dimensions:

void check_dimensions(ITensor<float, 3> tensor)
{
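    uint[3] dims = tensor.shape;
    // shape follows tensor dimension order, so the rightmost
    // (fastest-varying) dimension comes last
    uint depth = dims[0];
    uint height = dims[1];
    uint width = dims[2];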

    // Use dimensions in computations
    if (width > 100 && height > 100) {
        // ...
    }
}

Working with Structs

Tensors can store any user-defined struct type, so they can be used in place of a StructuredBuffer from classical GPU programming:

struct Particle
{
    float3 position;
    float3 velocity;
    float mass;
};

void update_particles(int idx, ITensor<Particle, 1> particles_in, IRWTensor<Particle, 1> particles_out)
{
    // Load entire struct
    Particle p = particles_in[idx];

    // Update fields
    p.position += p.velocity * 0.016;  // 60 FPS timestep

    // Store back
    particles_out[idx] = p;
}

Note that when a differentiable tensor type is used, the user-defined struct must implement both the IDifferentiable and IAtomicAddable interfaces. If gradient accumulation is not needed (e.g., the tensor is write-only, so its gradients are only read), the two atomicAdd functions required by IAtomicAddable can be left as no-ops, but they must be present.

Differentiable Operations

Differentiable tensor types (DiffTensor, WDiffTensor, RWDiffTensor) support automatic differentiation. Operations on these tensors can be differentiated by Slang’s auto-diff system:

[Differentiable]
float compute_loss(int idx, IDiffTensor<float, 1> predictions, IDiffTensor<float, 1> targets)
{
    float pred = predictions[idx];
    float target = targets[idx];

    // Mean squared error
    float diff = pred - target;
    return diff * diff;
}

When this function is called with Tensor arguments from Python, SlangPy can automatically generate both forward and backward passes. See autodiff for more details on automatic differentiation and Differentiable Tensors for using differentiable tensors.
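To make the backward pass concrete, the gradients of the squared error can be derived by hand: for loss = (pred - target)^2, d(loss)/d(pred) = 2 (pred - target) and d(loss)/d(target) = -2 (pred - target). The Python sketch below is a CPU reference with hypothetical function names, not SlangPy code; it compares these formulas against a central finite difference.

```python
def squared_error(pred: float, target: float) -> float:
    """Primal computation, mirroring compute_loss for a single element."""
    diff = pred - target
    return diff * diff


def squared_error_grads(pred: float, target: float, d_loss: float = 1.0):
    """Hand-derived gradients with respect to (pred, target), scaled by d_loss."""
    diff = pred - target
    return (2.0 * diff * d_loss, -2.0 * diff * d_loss)


def central_difference(f, x: float, eps: float = 1e-6) -> float:
    """Numerical derivative of a one-argument function, used as a sanity check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


pred, target = 3.0, 1.0
d_pred, d_target = squared_error_grads(pred, target)
print(d_pred, d_target)  # 4.0 -4.0

# The numerical estimates should agree with the analytic gradients
numeric_d_pred = central_difference(lambda p: squared_error(p, target), pred)
numeric_d_target = central_difference(lambda t: squared_error(pred, t), target)
assert abs(numeric_d_pred - d_pred) < 1e-4
assert abs(numeric_d_target - d_target) < 1e-4
```

A reference like this is handy for validating the values that end up in the gradient buffers after a backward dispatch.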

Atomic Operations

The AtomicTensor type supports atomic operations, and is typically used by SlangPy for thread-safe accumulation of gradients:

void accumulate_gradients(int idx, Tensor<float, 1> local_grads, AtomicTensor<float, 1> global_grads)
{
    float local_grad = local_grads[idx];

    // Atomic add - safe for concurrent writes from multiple threads
    global_grads.add(idx, local_grad);
}

Both DiffTensor and RWDiffTensor use AtomicTensor internally for gradient accumulation in backward passes.
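Accumulation matters because several threads may contribute a gradient to the same element. The following Python sketch is a sequential simulation under that assumption; it does not model real GPU races, and the function names are illustrative only. It contrasts a plain store, where one contribution survives, with an accumulating add, which is the result an atomic add guarantees regardless of thread order.

```python
def scatter_overwrite(size, targets, values):
    """Plain store: a later write to the same slot replaces the earlier one."""
    out = [0.0] * size
    for target, value in zip(targets, values):
        out[target] = value
    return out


def scatter_accumulate(size, targets, values):
    """Accumulating add: every contribution to a slot is kept."""
    out = [0.0] * size
    for target, value in zip(targets, values):
        out[target] += value
    return out


# Three simulated threads; the first two both write to slot 0
targets = [0, 0, 1]
values = [1.0, 2.0, 3.0]

print(scatter_overwrite(2, targets, values))   # [2.0, 3.0]
print(scatter_accumulate(2, targets, values))  # [3.0, 3.0]
```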

Examples

Element-wise Operations

Whilst it would be unusual to write an element-wise operation manually (since SlangPy automatically vectorizes function calls), this example shows a classical element-wise scaling operation:

void scale_values(int2 idx, ITensor<float, 2> input, IRWTensor<float, 2> output, float scale)
{
    output[idx] = input[idx] * scale;
}

Called from Python with:

input = spy.Tensor.from_numpy(device, data)
output = spy.Tensor.empty(device, shape=input.shape, dtype=float)
module.scale_values(spy.grid(shape=input.shape), input, output, scale=2.0)

The most common reason to utilize tensor types in this way is when upgrading an old code base that already operates on the deprecated Slang TensorView and DiffTensorView types, which required explicit element-wise kernels.

Neighborhood Operations

Accessing neighboring elements (e.g., convolution, blur) currently requires access to the full tensor:

void blur_3x3(int2 idx, ITensor<float, 2> input, IRWTensor<float, 2> output)
{
    float sum = 0.0;
    int count = 0;

    // 3x3 neighborhood
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int2 neighbor = idx + int2(dx, dy);

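            // Check bounds: x indexes the rightmost dimension (shape[1]),
            // y indexes the leftmost dimension (shape[0])
            if (neighbor.x >= 0 && neighbor.x < int(input.shape[1]) &&
                neighbor.y >= 0 && neighbor.y < int(input.shape[0])) {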
                sum += input[neighbor];
                count++;
            }
        }
    }

    output[idx] = sum / float(count);
}

Work is in progress to support this pattern with a tile abstraction to allow more efficient shared memory usage.
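When validating a neighborhood kernel, a slow CPU reference is useful. The Python sketch below reimplements the 3x3 box blur on nested lists indexed [row][col]; `blur_3x3_reference` is a hypothetical helper, and it assumes the same edge handling as the shader (out-of-bounds neighbors are skipped and the average is taken over the remaining count).

```python
def blur_3x3_reference(image):
    """3x3 box blur over a list of rows; edge pixels average fewer neighbors."""
    rows = len(image)
    cols = len(image[0])
    out = [[0.0] * cols for _ in range(rows)]

    for row in range(rows):
        for col in range(cols):
            total = 0.0
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    r, c = row + dy, col + dx
                    # Rows are bounded by the first dimension, columns by the second
                    if 0 <= r < rows and 0 <= c < cols:
                        total += image[r][c]
                        count += 1
            out[row][col] = total / count
    return out


# Non-square input (2 rows x 3 columns) so swapped bounds would be detected
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
result = blur_3x3_reference(image)
print(result[0][0], result[0][1], result[1][2])  # 3.0 3.5 4.0
```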

Reduction Operations

Summing or finding max/min across a dimension:

void sum_rows(int row, ITensor<float, 2> input, IRWTensor<float, 1> output)
{
    uint width = input.shape[1];
    float sum = 0.0;

    for (uint col = 0; col < width; col++) {
        sum += input.load(row, col);
    }

    output[row] = sum;
}

Called from Python:

input = spy.Tensor.from_numpy(device, data_2d)  # Shape (100, 200)
output = spy.Tensor.empty(device, shape=(100,), dtype=float)
module.sum_rows(spy.grid(shape=(100,)), input, output)
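The same one-thread-per-row pattern covers max and min by swapping the combining step and the starting value. As a rough CPU reference, assuming the tensor is given as a list of rows, the hypothetical helper below performs the reduction with a pluggable operator:

```python
def reduce_rows(data, combine, initial):
    """Reduce each row to one value; each outer iteration mirrors one GPU thread."""
    out = []
    for row in data:
        acc = initial
        for value in row:
            acc = combine(acc, value)
        out.append(acc)
    return out


data_2d = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]

print(reduce_rows(data_2d, lambda a, b: a + b, 0.0))  # [6.0, 15.0]
print(reduce_rows(data_2d, max, float("-inf")))       # [3.0, 6.0]
print(reduce_rows(data_2d, min, float("inf")))        # [1.0, 4.0]
```

The starting value must be the identity of the operator (0 for sum, negative infinity for max, positive infinity for min) so that it never affects the result.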

Differentiable Functions

Functions that work in both forward and backward passes:

[Differentiable]
void apply_activation(int idx, IDiffTensor<float, 1> input, IWDiffTensor<float, 1> output)
{
    float x = input[idx];

    // ReLU activation
    output[idx] = max(0.0, x);
}

Called from Python for the forward and backward passes:

input = spy.Tensor.from_numpy(device, data).with_grads()
output = spy.Tensor.empty(device, shape=input.shape, dtype=float).with_grads()

# Forward pass
module.apply_activation(spy.grid(shape=input.shape), input, output)

# ... compute loss and set output gradients ...

# Backward pass
module.apply_activation.bwds(spy.grid(shape=input.shape), input, output)
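For reference, the values expected in the input gradients after the backward pass can be worked out by hand: ReLU passes the output gradient through where x > 0 and blocks it where x < 0. The Python sketch below is a CPU reference with hypothetical names; treating the gradient at exactly x == 0 as 0 is an assumption, and the choice made by Slang's auto-diff at that point may differ, so the example avoids it.

```python
def relu(x: float) -> float:
    """Forward pass for one element, matching max(0.0, x) in the shader."""
    return max(0.0, x)


def relu_backward(x: float, grad_out: float) -> float:
    """Gradient with respect to x, given the gradient of the output."""
    # Assumption: the gradient at x == 0 is taken to be 0
    return grad_out if x > 0.0 else 0.0


inputs = [-2.0, 0.5, 3.0]
output_grads = [1.0, 1.0, 2.0]

outputs = [relu(x) for x in inputs]
input_grads = [relu_backward(x, g) for x, g in zip(inputs, output_grads)]

print(outputs)      # [0.0, 0.5, 3.0]
print(input_grads)  # [0.0, 1.0, 2.0]
```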

Generic Slang Functions Taking Generic Tensors

A normalization function that works for any floating-point tensor:

void normalize_tensor<T: __BuiltinFloatingPointType>(int idx, ITensor<T, 1> input, IRWTensor<T, 1> output)
{
    T value = input[idx];
    output[idx] = value / T(255.0);
}

When called from Python, SlangPy selects the appropriate concrete tensor types based on the properties of the passed tensors. In this case, if tensors with a float16 element type were passed, T would resolve to half, and SlangPy would generate a kernel that binds concrete types such as Tensor<half, 1> and RWTensor<half, 1> to the ITensor<half, 1> and IRWTensor<half, 1> parameters.

Internals

Underlying Storage

Tensors are implemented on top of GPU buffer resources:

Tensor and read-only variants use StructuredBuffer<T>, or ImmutablePtr<T> in CUDA
RWTensor and read-write variants use RWStructuredBuffer<T> or Ptr<T> in CUDA
DiffTensor types wrap both primal and gradient buffers
AtomicTensor uses RWByteAddressBuffer or Ptr<T> in CUDA

Each tensor stores:

Buffer reference (_data)
Shape array (_shape)
Stride array (_strides)
Offset (_offset)

Memory Layout

Tensors use row-major layout by default (rightmost dimension has smallest stride).

For a 3D tensor of shape [D0, D1, D2], the strides are:

stride[0] = D1 * D2
stride[1] = D2
stride[2] = 1

The linear index for element [i, j, k] is computed as:

int linear_idx = i * stride[0] + j * stride[1] + k * stride[2] + offset;
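The stride and offset arithmetic above generalizes to any number of dimensions. The Python sketch below is an illustrative model, not SlangPy's implementation: `TensorView` and `row_major_strides` are hypothetical names, and the fields simply mirror the stored data, shape, strides, and offset described earlier.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Strides in elements; the rightmost dimension has stride 1."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return tuple(strides)


@dataclass
class TensorView:
    data: List[float]         # flat backing buffer
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]
    offset: int = 0           # element offset of index (0, 0, ...) in the buffer

    def linear_index(self, idx: Sequence[int]) -> int:
        """Dot product of index and strides, plus the base offset."""
        return self.offset + sum(i * s for i, s in zip(idx, self.strides))

    def load(self, idx: Sequence[int]) -> float:
        return self.data[self.linear_index(idx)]


shape = (2, 3, 4)
strides = row_major_strides(shape)
print(strides)  # (12, 4, 1)

view = TensorView(data=[float(n) for n in range(24)], shape=shape, strides=strides)
print(view.linear_index((1, 2, 3)))  # 23
print(view.load((1, 2, 3)))          # 23.0
```

Keeping strides and offset separate from the shape is what lets a tensor describe a sub-region or reordered view of a buffer without copying data.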

Summary

SlangPy’s tensor types provide:

Multiple access modes: Read-only, write-only, read-write, atomic
Automatic differentiation: Differentiable tensor variants for AD
Flexible interfaces: Generic functions work with any tensor implementation
Multiple indexing styles: Subscripts, load/store, variadic indices
Multi-dimensional support: 1D, 2D, 3D, and higher-dimensional tensors
Struct element types: Not limited to primitive types

For details on using tensors from Python, see Tensors In Python. For information on automatic differentiation with tensors, see Differentiable Tensors.