
Vector Addition::P4::Optimizing

I’m sure you remember from the last post that our CUDA vector addition kernel was behind the Triton and Torch versions. Well, that shows the beauty of how the Triton compiler automatically optimizes its kernels. Here is our kernel’s performance on the RTX 5060:

(chart: baseline kernel performance on the RTX 5060)

First things first, let’s play with the number of threads per block. After all, that’s the only number we have.

#include <ATen/ATen.h>           // ATen: PyTorch tensor library (like numpy)
#include <c10/cuda/CUDAGuard.h>  // Ensures we're on the right GPU device
#include <c10/cuda/CUDAStream.h> // Manages CUDA execution streams
#include <cuda_runtime.h>        // Core CUDA functionality
#include <torch/extension.h>     // Magic glue between C++ and Python

extern "C" void launch_add_kernel(const float *x, const float *y, float *out,
                                  int n, int blocks, int threads,
                                  cudaStream_t stream);

at::Tensor add_cuda(at::Tensor x, at::Tensor y) {
  TORCH_CHECK(x.sizes() == y.sizes(), "x and y must have the same shape");

  auto out = torch::empty_like(x);
  int64_t n_elements = x.numel();

  const int threads = 64; // Threads per block (multiple of 32: warp friendly!)
  const int blocks = (int)((n_elements + threads - 1) / threads);

  const float *x_ptr = x.data_ptr<float>();
  const float *y_ptr = y.data_ptr<float>();
  float *out_ptr = out.data_ptr<float>();

  cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
  launch_add_kernel(x_ptr, y_ptr, out_ptr, (int)n_elements,
                    blocks, threads, stream);
  return out;
}

/* PYBIND11_MODULE: This makes the function callable from Python.
 * The 11 in the name means C++11 at minimum.
 * Funny enough, there are no other numbers.
 *
 * TORCH_EXTENSION_NAME: Automatically set by the PyTorch build system.
 */
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("add_cuda", &add_cuda, "Vector add (CUDA)");
}
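
For completeness, launch_add_kernel is defined on the CUDA side. Here is a minimal sketch of what that launcher can look like, assuming the scalar kernel from the earlier parts of the series (the file name and layout are my assumption; the real code is in the repo):

// vector_add_kernel.cu (sketch)
#include <cuda_runtime.h>

// Scalar baseline: one thread adds one element.
__global__ void add_kernel(const float *x, const float *y, float *out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = x[idx] + y[idx];
  }
}

extern "C" void launch_add_kernel(const float *x, const float *y, float *out,
                                  int n, int blocks, int threads,
                                  cudaStream_t stream) {
  // Launch on the stream handed over from PyTorch so ordering is preserved.
  add_kernel<<<blocks, threads, 0, stream>>>(x, y, out, n);
}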

With this, we do observe some performance improvement, and we’re closer to the Triton and Torch versions.

(chart: performance after tuning the thread count)

Vectorized Memory Access and Warp-Level Efficiency

Looking at the RTX 5060 specs, we can see the memory interface width is 128 bits. The current kernel loads one float at a time, which is 4 bytes, or 32 bits, per load. We are effectively underutilizing that width, leaving 75% of it empty. Now, what if we load 4 floats (4 × 32 = 128 bits) at a time to saturate the full 128 bits?

Brilliant idea, huh? NVIDIA has some built-in types for packing multiple floats:

Type     Elements   Total Size   Alignment Requirement
float1   1          4 bytes      4 bytes
float2   2          8 bytes      8 bytes
float3   3          12 bytes     4 bytes
float4   4          16 bytes     16 bytes
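
That alignment column matters for the reinterpret_cast trick we’re about to use: reading a float4 through a pointer that isn’t 16-byte aligned is undefined behavior. PyTorch’s CUDA allocator hands out pointers aligned far beyond 16 bytes, so in practice we’re fine, but a defensive host-side check could look like the sketch below (my addition, not part of the original extension; is_float4_aligned is a hypothetical helper):

#include <cstdint>
#include <cuda_runtime.h> // for float4

// Hypothetical helper: true if a pointer can safely be viewed as a float4*.
static bool is_float4_aligned(const void *p) {
  static_assert(sizeof(float4) == 16 && alignof(float4) == 16,
                "float4 is a 16-byte type with 16-byte alignment");
  return reinterpret_cast<std::uintptr_t>(p) % alignof(float4) == 0;
}

In add_cuda we could TORCH_CHECK this for x, y, and out before choosing the vectorized kernel.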

We’re going to use float4. Here is the new kernel:

/**
 * Vectorized CUDA kernel for element-wise vector addition using float4.
 *
 * Each thread processes 4 consecutive floats at once (16 bytes total).
 * This improves memory bandwidth utilization and reduces the number
 * of global memory transactions compared to the scalar version.
 */
__global__ void add_kernel_vec4(const float *__restrict__ x,
                                const float *__restrict__ y,
                                float *__restrict__ out,
                                int n) {
  // Compute the global index for this thread.
  // Each thread handles 4 elements, so multiply by 4.
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 4;

  // Ensure we stay within array bounds.
  // "idx + 3 < n" because each thread touches elements [idx, idx+1, idx+2, idx+3].
  if (idx + 3 < n) {
    // Reinterpret the float* as a float4*.
    // This allows us to load 4 floats (16 bytes) in a single transaction.
    float4 a = reinterpret_cast<const float4*>(x)[idx / 4];
    float4 b = reinterpret_cast<const float4*>(y)[idx / 4];

    // Perform element-wise addition on the 4 packed floats.
    float4 c;
    c.x = a.x + b.x;
    c.y = a.y + b.y;
    c.z = a.z + b.z;
    c.w = a.w + b.w;

    // Store the result back to global memory as a single 16-byte store.
    reinterpret_cast<float4*>(out)[idx / 4] = c;
  }
}
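
One caveat before we move on: as written, the guard idx + 3 < n silently skips the last n % 4 elements whenever n isn’t a multiple of 4. If your sizes can be arbitrary, a simple fix is a scalar fallback for the final partial chunk. Here is a sketch of that variant (my addition, not the kernel used for the charts); on the host it also needs the chunk count rounded up, e.g. blocks = ((n + 3) / 4 + threads - 1) / threads:

// Sketch: same vectorized kernel, plus a scalar tail for the last partial chunk.
__global__ void add_kernel_vec4_tail(const float *__restrict__ x,
                                     const float *__restrict__ y,
                                     float *__restrict__ out,
                                     int n) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
  if (idx + 3 < n) {
    // Fast path: full 16-byte load/add/store, exactly as above.
    float4 a = reinterpret_cast<const float4*>(x)[idx / 4];
    float4 b = reinterpret_cast<const float4*>(y)[idx / 4];
    float4 c = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    reinterpret_cast<float4*>(out)[idx / 4] = c;
  } else {
    // Slow path: only the thread whose chunk is partial has work here
    // (1-3 remaining elements); threads past the end fall through.
    for (int i = idx; i < n; ++i) {
      out[i] = x[i] + y[i];
    }
  }
}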

Now we should change the vector_add.cpp file to accommodate the new kernel:

int threads = 64;
int blocks = (n_elements/4 + threads - 1) / threads;

This closes the gap we saw in the middle of the chart. We didn’t see this gap at smaller array sizes because the kernel launch was the bottleneck there.

(chart: performance with vectorized float4 loads)

Adding a grid-stride loop

We’re pretty close; there are just some gaps left in the middle. (I was trying this on a different RTX 5060, so the graph looks slightly different.)

In the previous version, we went from one thread per float to one thread per 4 floats and observed a good performance gain. With the new approach, each thread loops over the array with a grid-sized stride, so a fixed number of threads covers the whole array in a single launch.

As ChatGPT puts it, here is the difference. The first version:

Time ───────────────────────────────────────────────▶
Thread 0: [LOAD x0,y0]──(wait 400 cycles)──[ADD]──[STORE]
Thread 1: [LOAD x1,y1]──(wait 400 cycles)──[ADD]──[STORE]
Thread 2: [LOAD x2,y2]──(wait 400 cycles)──[ADD]──[STORE]
...
Warp Scheduler: switches between warps, but most are waiting on memory

The grid-stride loop version:

Time ─────────────────────────────────────────────────────────▶
Thread 0: [LOAD x0,y0][LOAD x4,y4][LOAD x8,y8] ... (pipeline full)
↑ while one load waits, next loads issued
Thread 1: [LOAD x1,y1][LOAD x5,y5][LOAD x9,y9] ...
Thread 2: [LOAD x2,y2][LOAD x6,y6][LOAD x10,y10] ...
...
Warp Scheduler: always finds ready warps (no idle gaps)

Here is the .cu code:

/**
 * Vectorized grid-stride CUDA kernel for element-wise vector addition.
 *
 * Each thread now processes multiple float4 chunks (16 bytes each)
 * in a grid-stride loop. This improves memory-latency hiding
 * and SM utilization, especially for mid-sized arrays.
 */
__global__ void add_kernel_vec4_looped(const float *__restrict__ x,
                                       const float *__restrict__ y,
                                       float *__restrict__ out,
                                       int n) {
  // Thread's global linear index
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  // Total number of threads across the grid
  int stride = blockDim.x * gridDim.x;

  // Grid-stride loop: each iteration handles one float4 (4 floats)
  for (int i = tid * 4; i < n; i += stride * 4) {
    if (i + 3 < n) {
      // Vectorized 16-byte load
      float4 a = reinterpret_cast<const float4*>(x)[i / 4];
      float4 b = reinterpret_cast<const float4*>(y)[i / 4];

      // Compute
      float4 c;
      c.x = a.x + b.x;
      c.y = a.y + b.y;
      c.z = a.z + b.z;
      c.w = a.w + b.w;

      // Vectorized 16-byte store
      reinterpret_cast<float4*>(out)[i / 4] = c;
    }
  }
}
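
On the host side, the grid-stride loop also changes how many blocks we need: instead of one thread per float4 chunk, a few blocks per SM are enough, since each thread keeps looping until the array is covered. A minimal sketch of that launch configuration inside add_cuda (the cap of 4 blocks per SM and the hard-coded device 0 are my assumptions, not values from the post):

int threads = 64;

// Enough blocks to cover every float4 chunk in one pass...
int64_t chunks = (n_elements + 3) / 4;
int blocks_to_cover = (int)((chunks + threads - 1) / threads);

// ...but capped at a few blocks per SM; the grid-stride loop picks up the rest.
int num_sms = 0;
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, /*device=*/0);
int blocks = std::min(blocks_to_cover, num_sms * 4); // needs <algorithm>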

(chart: performance with the grid-stride loop)

Now we are ready to conclude vector addition and move on to the next kernel. This journey was longer than I imagined, but it was totally worth the pain. Try to replicate the work, or even improve on it yourself. I’ve shared the code under the source-code folder in the site’s repo.

Good luck! I’m excited for the next kernel series: fused softmax.


