Vector Addition::P2::CUDA Kernel

Now that we have implemented the Triton kernel (better to say we copied it) and investigated it, it’s time for a CUDA version. Juicy stuff.

Do you remember how we had add_kernel with the @triton.jit decorator? Here, that turns into a file of its own: vector_add_kernel.cu.

#include <cuda_runtime.h>

/*
 * __global__ = a CUDA keyword that tells the compiler:
 * "This function runs on the GPU and can be called from the CPU!"
 *
 * Think of this as a tiny worker that gets copied thousands of times across
 * the GPU. Each copy (thread) handles ONE element of the arrays.
 *
 * Params:
 * - x, y: Input arrays (read-only, hence the const keyword)
 * - out:  Output array (Obvious!)
 * - n:    Size of our arrays (number of elements), used for masking
 */
__global__ void add_kernel(const float *x, const float *y, float *out, int n) {
  /* CUDA's magic formula, converting thread coordinates to an array index
   *
   * Imagine your GPU as a 1D grid of blocks:
   * - blockIdx.x  = Which "block" am I in? (like which neighborhood)
   * - blockDim.x  = How many threads per block? (like houses per neighborhood)
   * - threadIdx.x = Which thread am I within my block?
   *
   * Example: if blockDim.x = 256 and I'm thread 5 in block 2:
   *   idx = 2 * 256 + 5 = 517
   * meaning I'm responsible for processing element 517 of the array.
   */
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  /* Bounds checking: the GPU's safety net.
   * Make sure we never touch memory that doesn't belong to our arrays.
   */
  if (idx < n) {
    // Obvious?!
    out[idx] = x[idx] + y[idx];
  }
}
/*
 * Kernel launcher: the bridge between CPU and GPU!
 *
 * extern "C" prevents C++ name mangling, making this function callable from C
 * code or from other languages like Python.
 *
 * This function asks the GPU to execute the kernel and returns immediately
 * (asynchronous execution).
 */
extern "C" void launch_add_kernel(const float *x, const float *y, float *out,
                                  int n, int blocks, int threads,
                                  cudaStream_t stream) {
  /*
   * The kernel launch: CUDA's special syntax
   *   <<<blocks, threads, 0, stream>>>
   * This is called the execution configuration.
   *
   * - blocks:  how many blocks to launch
   * - threads: how many threads per block
   * - 0:       dynamic shared memory size (we don't need any)
   * - stream:  CUDA stream for async execution (like a "lane" on the GPU)
   *
   * The total number of threads launched is blocks * threads.
   * Choose threads to be a multiple of 32 (the warp size);
   * common choices are 128, 256, 512, 1024.
   */
  add_kernel<<<blocks, threads, 0, stream>>>(x, y, out, n);

  // This function returns immediately; if you'd like to wait, use
  // cudaStreamSynchronize(stream) or cudaDeviceSynchronize().
}

I really tried to explain the concepts as well as possible through the comments, but one concept still remains: the warp.

A warp is a group of 32 consecutive threads that execute instructions together in lockstep. Think of a marching band whose musicians must all play the same note (run the same instruction) at the same time.

But why warps? Because GPUs execute threads in SIMD fashion (Single Instruction, Multiple Data; NVIDIA's own term for this model is SIMT, Single Instruction, Multiple Threads), and the warp is the minimum unit of execution the hardware schedules.
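To make that concrete, here is a tiny back-of-the-envelope sketch (plain Python, independent of the extension we're building) of how the 1024 threads of one block, the block size we'll use below, get carved into warps:

# Illustrative arithmetic only -- warp size is fixed at 32 on NVIDIA GPUs.
WARP_SIZE = 32
threads_per_block = 1024

warps_per_block = threads_per_block // WARP_SIZE  # 1024 / 32 = 32 warps per block
for tid in (0, 31, 32, 517, 1023):
    warp_id = tid // WARP_SIZE  # which warp within the block
    lane_id = tid % WARP_SIZE   # position within that warp
    print(f"thread {tid:4d} -> warp {warp_id:2d}, lane {lane_id:2d}")

All 32 threads of a warp march through add_kernel together; the if (idx < n) mask simply makes the out-of-range lanes sit out the store.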

Now let’s go to the next file, the caller:

#include <ATen/ATen.h>           // ATen: PyTorch's tensor library (like NumPy)
#include <c10/cuda/CUDAGuard.h>  // Ensures we're on the right GPU device
#include <c10/cuda/CUDAStream.h> // Manages CUDA execution streams
#include <cuda_runtime.h>        // Core CUDA functionality
#include <torch/extension.h>     // Magic glue between C++ and Python

/*
 * This tells the C++ compiler: "Hey, there's a function called launch_add_kernel
 * defined somewhere else (in our .cu file). Trust me, it exists!"
 */
extern "C" void launch_add_kernel(const float *x, const float *y, float *out,
                                  int n, int blocks, int threads,
                                  cudaStream_t stream);

at::Tensor add_cuda(at::Tensor x, at::Tensor y) {
  TORCH_CHECK(x.sizes() == y.sizes(), "x and y must have the same shape");
  // Make sure we run on the device x lives on (this is why we include CUDAGuard.h)
  c10::cuda::CUDAGuard device_guard(x.device());

  // Same as Triton: pre-allocate the output memory.
  // torch::zeros_like(x) is the alternative, but slower.
  auto out = torch::empty_like(x);

  // Did you notice it's the same interface as in Python? Check the previous post.
  int64_t n_elements = x.numel();
  const int threads = 1024; // Threads per block (multiple of 32: warp friendly!)
  const int blocks = (int)((n_elements + threads - 1) / threads);

  // data_ptr<T>() extracts the actual memory address;
  // CUDA kernels need raw memory pointers.
  const float *x_ptr = x.data_ptr<float>();
  const float *y_ptr = y.data_ptr<float>();
  float *out_ptr = out.data_ptr<float>();

  cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
  launch_add_kernel(x_ptr, y_ptr, out_ptr, (int)n_elements,
                    blocks, threads, stream);
  return out;
}

/* PYBIND11_MODULE: this makes the function callable from Python.
 * The 11 in the name means C++11 at minimum;
 * funny enough, there are no other numbers.
 *
 * TORCH_EXTENSION_NAME: automatically set by PyTorch's build system.
 */
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("add_cuda", &add_cuda, "Vector add (CUDA)");
}
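One detail worth a second look is the blocks line: (n_elements + threads - 1) / threads is just integer ceiling division, so we always launch enough blocks to cover every element even when n_elements isn't a multiple of the block size. A quick sketch (illustrative Python only, using the same numbers as the driver script below):

# Launch-configuration arithmetic only; nothing here touches the GPU.
n_elements = 98_432
threads = 1024

blocks = (n_elements + threads - 1) // threads  # ceiling division -> 97 blocks
total_threads = blocks * threads                # 97 * 1024 = 99,328 threads
idle_threads = total_threads - n_elements       # 896 threads fail the idx < n check

print(blocks, total_threads, idle_threads)      # 97 99328 896

Those 896 spare threads in the last block are exactly why the kernel needs its if (idx < n) mask.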

You have made it this far, and only a little bit remains; let’s execute it.

import os

import torch
from torch.utils.cpp_extension import load

this_dir = os.path.dirname(__file__)

# JIT-compile the extension: nvcc builds the .cu file, a C++ compiler builds
# the .cpp file, and the result is loaded as a Python module.
ext = load(
    name="vector_add_ext",
    sources=[
        os.path.join(this_dir, "vector_add.cpp"),
        os.path.join(this_dir, "vector_add_kernel.cu"),
    ],
    verbose=True,
)

def add(x: torch.Tensor, y: torch.Tensor):
    return ext.add_cuda(x, y)

device = torch.device("cuda:0")
torch.manual_seed(0)
size = 98_432
x = torch.rand(size, device=device, dtype=torch.float32)
y = torch.rand(size, device=device, dtype=torch.float32)
out_cuda = add(x, y)
torch.cuda.synchronize()
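One thing the driver above doesn't do, and which I'd recommend while developing, is verify the result against PyTorch's built-in addition. A minimal check (mine, not part of the original script) could look like this:

# Sanity check: compare the custom kernel against torch's own elementwise add.
out_torch = x + y
max_diff = (out_cuda - out_torch).abs().max().item()
print(f"max difference vs torch: {max_diff}")  # expected to be 0.0 for this float32 add
assert torch.allclose(out_cuda, out_torch)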

Aaah, I’m tired, going to call it a night!

Hope you enjoyed it. In the next one, we will do a hell of a lot of benchmarking! Let’s see how these kernels compare.

