As much as I wanted to let it go and jump into the CUDA kernel implementation, I couldn’t. That nagging voice kept saying: let’s see why the Triton kernel drops dead.
Now let’s get into it. Zooming in, I noticed that the drops occur at exact powers of 2.

Looking at the code, we can see why it happens:
```python
BLOCK_SIZE = triton.next_power_of_2(n_cols)
```

We’re good up to each power of two; then we get a drop. Within a band between consecutive powers of two, performance keeps improving until we hit the next power of two. That’s because early in the band we are only utilizing part of the block; the rest is masked away, lowering the effective bandwidth.
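To see how much of the block goes to waste, here’s a back-of-the-envelope sketch (plain Python; the row sizes are arbitrary picks around a power-of-two boundary):

```python
import triton

# Utilization = n_cols / BLOCK_SIZE. Just past a power of two the block
# doubles, so nearly half the elements are masked off and wasted.
for n_cols in (1024, 1025, 1536, 2047, 2048, 2049):
    block = triton.next_power_of_2(n_cols)
    print(f"n_cols={n_cols:5d}  BLOCK_SIZE={block:5d}  utilization={n_cols / block:.1%}")
```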
I tried different ways to fix the issue but wasn’t successful. I may get back to it after writing the CUDA version.
Update
After implementing the CUDA version and learning a bit about the parallel work we can do, I had another stab at it.
The following is better than our original kernel. The red line (triton_softmax2) is the new kernel.

I enabled autotuning on some of the parameters and got the following:

It’s much better. If we combine it with the initial Triton kernel, which is much better at lower row sizes, we easily surpass or match torch for row sizes up to 29,440.
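A combined version could be a thin dispatcher like the sketch below, where `triton_softmax` stands for the initial kernel and the cutoff is a hypothetical value you’d tune per GPU:

```python
def softmax_combined(x, cutoff=4096):
    # Hypothetical cutoff: below it the initial single-block kernel wins,
    # above it the tiled two-pass kernel (triton_softmax2) takes over.
    n_rows, n_cols = x.shape
    return triton_softmax(x) if n_cols <= cutoff else triton_softmax2(x)
```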
Here is the improved Triton softmax kernel so far:
```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 512, 'VEC': 2}, num_warps=4, num_stages=1),
        triton.Config({'BLOCK_SIZE': 1024, 'VEC': 2}, num_warps=4, num_stages=2),
        triton.Config({'BLOCK_SIZE': 1024, 'VEC': 4}, num_warps=4, num_stages=1),
        triton.Config({'BLOCK_SIZE': 2048, 'VEC': 4}, num_warps=8, num_stages=2),
        triton.Config({'BLOCK_SIZE': 512, 'VEC': 4}, num_warps=8, num_stages=1),
        # wide rows
        triton.Config({'BLOCK_SIZE': 4096, 'VEC': 4}, num_warps=8, num_stages=2),
        triton.Config({'BLOCK_SIZE': 8192, 'VEC': 4}, num_warps=8, num_stages=2),
        triton.Config({'BLOCK_SIZE': 8192, 'VEC': 4}, num_warps=16, num_stages=2),
        triton.Config({'BLOCK_SIZE': 16384, 'VEC': 4}, num_warps=8, num_stages=3),
        triton.Config({'BLOCK_SIZE': 4096, 'VEC': 4}, num_warps=16, num_stages=3),
    ],
    key=['n_cols'],
)
@triton.jit
def softmax_rowwise(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_cols: tl.int32,
    BLOCK_SIZE: tl.constexpr,  # elements per program, across the whole row, processed in tiles
    VEC: tl.constexpr,         # per-thread vector width, use 4 when alignment allows
):
    pid = tl.program_id(0)
    row_in = in_ptr + pid * in_row_stride
    row_out = out_ptr + pid * out_row_stride

    # Online pass 1: compute row max and normalizer without storing intermediates
    m = tl.full((), -float('inf'), tl.float32)  # running max
    s = tl.zeros((), tl.float32)                # running sum of exp(x - m)

    # Tile over columns in steps of BLOCK_SIZE * VEC
    tile_span = BLOCK_SIZE * VEC
    n_tiles = (n_cols + tile_span - 1) // tile_span  # runtime integer division
    for tile_start in range(0, n_tiles):
        cols = tile_start * tile_span + tl.arange(0, BLOCK_SIZE * VEC)
        mask = cols < n_cols
        ptrs = row_in + cols

        # Help the compiler vectorize. If your rows are contiguous and 16-byte
        # aligned this will trigger float4-style ld/st.
        tl.multiple_of(cols, VEC)  # per-thread lane is contiguous
        # tl.assume_aligned(ptrs, 16)  # encourage 128-bit memory ops

        x = tl.load(ptrs, mask=mask, other=-float('inf'))

        # local max for this tile
        tile_max = tl.max(x, axis=0)
        m_new = tl.maximum(m, tile_max)

        # rescale the running sum to the new max, then add this tile's contribution
        s = s * tl.exp(m - m_new) + tl.sum(tl.exp(x - m_new), axis=0)
        m = m_new

    # Pass 2: write normalized output
    inv_s = 1.0 / s
    for tile_start in range(0, n_tiles):
        cols = tile_start * tile_span + tl.arange(0, BLOCK_SIZE * VEC)
        mask = cols < n_cols
        ptrs = row_in + cols

        tl.multiple_of(cols, VEC)
        # tl.assume_aligned(ptrs, 16)

        x = tl.load(ptrs, mask=mask, other=-float('inf'))
        y = tl.exp(x - m) * inv_s
        tl.store(row_out + cols, y, mask=mask)


def triton_softmax2(x):
    import torch
    if not x.is_cuda:
        x = x.cuda(non_blocking=True)
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    grid = (n_rows,)
    softmax_rowwise[grid](
        out_ptr=out,
        in_ptr=x,
        in_row_stride=x.stride(0),
        out_row_stride=out.stride(0),
        n_cols=n_cols,
    )
    return out
```

As much as I would like to work more on the optimization, I’m going to throw in the towel. Maybe I will get back to these later.
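If you pick the kernel up, a quick sanity check against torch.softmax is worth running first. A minimal sketch, assuming a CUDA device and a float32 input:

```python
import torch

x = torch.randn(256, 12_000, device='cuda')
out = triton_softmax2(x)
ref = torch.softmax(x, dim=1)

# Expect agreement to within normal float32 rounding noise.
print(torch.allclose(out, ref, atol=1e-5), (out - ref).abs().max().item())
```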