Instead of diving right into it, let me give a short intro, just to grasp why fused softmax is important.
A model's output is an array of unbounded real numbers we call logits, like the following:

$$x = [2.0, 1.0, 0.1]$$
The larger the logit, the stronger the model's preference for that token. If we want the model to act greedily, we can just apply an argmax over the array and take the most preferred token. But in reality things are different: both inference and training need probabilities as output. That's where softmax comes into the picture.
So:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Then:

$$\text{softmax}(x) \approx [0.659, 0.242, 0.099]$$
Now they look like probabilities — they’re positive, sum to 1, and the largest score (2.0) has the highest probability.
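As a quick sanity check, here is the same thing in PyTorch (a minimal sketch using the illustrative logits above):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])

# Greedy decoding: just pick the index of the largest logit.
print(torch.argmax(logits))         # tensor(0)

# Softmax turns the logits into a proper probability distribution.
probs = torch.softmax(logits, dim=-1)
print(probs)                        # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())                  # ~1.0
```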
Stable softmax flavor
That's good, but it might not be numerically stable. Take, for example, logits like:

$$x = [1000.0, 1001.0, 1002.0]$$

Computing $e^{1000}$ would overflow, because it translates to an astronomically large number, far beyond what float32 can represent.

To avoid this, we use the numerically stable version:

$$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

Here, $\max(x) = 1002$, so we subtract it:

$$z = x - \max(x) = [-2.0, -1.0, 0.0]$$

Every exponent is now at most 0, so nothing overflows, and the final result is the following probabilities, which are pretty stable:

$$\text{softmax}(x) \approx [0.090, 0.245, 0.665]$$
Love it, huh?
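Here is a tiny snippet to see both flavors side by side (using the same illustrative values as above):

```python
import torch

x = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive softmax: exp() of large logits overflows to inf, giving nan probabilities.
naive = torch.exp(x) / torch.exp(x).sum()
print(naive)    # tensor([nan, nan, nan])

# Stable softmax: subtract the row max first, so every exponent is <= 0.
z = x - x.max()
stable = torch.exp(z) / torch.exp(z).sum()
print(stable)   # tensor([0.0900, 0.2447, 0.6652])
```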
Naive implementation
Now let's do a naive implementation and count the array reads and writes.
My suggestion is to open up a notebook and try it yourself, using the stable softmax formula above. When you're done, come back here or to the Triton tutorials page to compare and fix.
You'll need this knowledge later; we will use it to implement the Triton version and, after that, the CUDA version.
Ok, this is my version, which is slightly different from the Triton tutorial version:
```python
import torch

def naive_softmax(x):
    M, N = x.shape
    # read MN elements, write M elements
    max_mem = torch.max(x, dim=-1).values
    # read MN + M elements, write MN elements
    z = x - max_mem.unsqueeze(-1).expand(M, N)
    # read MN elements, write MN elements
    z = torch.exp(z)
    # read MN elements, write M elements
    row_sum = torch.sum(z, -1).unsqueeze(-1).expand(M, N)
    # read MN + M elements, write MN elements
    result = z / row_sum
    # Total: Read: 5MN + 2M; Write: 3MN + 2M
    return result
```

I've also annotated it with the number of reads and writes.
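A quick way to convince yourself it's correct is to compare against torch.softmax (a minimal check with a random matrix; the shape is arbitrary):

```python
x = torch.randn(4, 128)
print(torch.allclose(naive_softmax(x), torch.softmax(x, dim=-1), atol=1e-6))  # should print True
```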
Triton Implementation
Now that we have this, do you think we can implement the Triton version without getting much help from the Triton tutorial?
TBH, I'm not sure, but it's worth trying. With some help from Gemini AI, I got to this:
```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride,
                   n_rows, n_cols, BLOCK_SIZE: tl.constexpr):
    # tl.device_print("BLOCK_SIZE", BLOCK_SIZE)
    pid = tl.program_id(0)
    row_start_ptr = input_ptr + pid * input_row_stride
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets

    # Load the row data into a block. Use a mask for rows shorter than BLOCK_SIZE.
    mask = col_offsets < n_cols
    row = tl.load(input_ptrs, mask=mask, other=-float('inf'))

    row_max = tl.max(row, axis=0)
    numerator = tl.exp(row - row_max)
    denominator = tl.sum(numerator, axis=0)
    output = numerator / denominator

    # Store the result back to global memory.
    output_ptrs = output_ptr + pid * output_row_stride + col_offsets
    tl.store(output_ptrs, output, mask=mask)


def triton_softmax(x: torch.Tensor):
    if not x.is_cuda:
        x = x.cuda()

    n_rows, n_cols = x.shape
    output = torch.empty_like(x)
    grid = (n_rows,)

    softmax_kernel[grid](
        output_ptr=output,
        input_ptr=x,
        input_row_stride=x.stride(0),
        output_row_stride=output.stride(0),
        n_rows=n_rows,
        n_cols=n_cols,
        BLOCK_SIZE=triton.next_power_of_2(n_cols),
    )

    return output
```

You can see it's pretty naive: we launch one program per row, and each program loads its whole row into a single block.
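Before benchmarking, a quick correctness check against torch.softmax (the shape here is just an arbitrary, non-power-of-2 example):

```python
x = torch.randn(1823, 781, device="cuda")
y_triton = triton_softmax(x)
y_torch = torch.softmax(x, dim=-1)
print(torch.allclose(y_triton, y_torch))  # should print True
```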
Benchmark
I've added the Triton tutorial's kernel (let's call it BTK, for Better Triton Kernel) to see how I compare against a more complex kernel.
As I'm using an RTX 5060, BTK was hitting OOM (out of memory) pretty fast, so I lowered num_stages to 1.
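For reference, the harness follows the usual triton.testing.perf_report pattern. Here is a minimal sketch of that setup (the shapes, x_vals, and provider names are illustrative placeholders, and the BTK line is omitted):

```python
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["N"],                       # vocab-size axis
        x_vals=[128 * i for i in range(2, 100)],
        line_arg="provider",
        line_vals=["naive", "mine", "torch"],
        line_names=["Naive torch", "My Triton kernel", "torch.softmax"],
        ylabel="GB/s",
        plot_name="softmax-performance",
        args={"M": 4096},                    # batch dimension held fixed
    )
)
def benchmark(M, N, provider):
    x = torch.randn(M, N, device="cuda", dtype=torch.float32)
    if provider == "naive":
        ms = triton.testing.do_bench(lambda: naive_softmax(x))
    elif provider == "mine":
        ms = triton.testing.do_bench(lambda: triton_softmax(x))
    else:
        ms = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
    # Effective bandwidth, assuming each element is read once and written once in float32.
    return 2 * x.numel() * x.element_size() * 1e-9 / (ms * 1e-3)


benchmark.run(show_plots=True, print_data=True)
```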
Now, let’s see the result:

The good news is that my Triton kernel beat all the others from the beginning up to a certain N. The bad news is that I don't know of any model with a vocab_size that small these days. For context, in real-world applications M is the batch size and N is the vocab size, and modern models like DeepSeek R1 have vocab sizes of roughly 128K. But let's say I would have stood a chance at some point in the history of LLMs.
The BTK dies sometime after my kernel is gone. But damn, Torch goes strong.
For the next article, I'm not sure: should I put some time into optimizing the current kernel, or just move on to CUDA?