After implementing the 2D image kernel, it's time to move on to matmul.
I suggest going ahead and implementing your own CUDA version of matmul. This one is going to be hard, so you'll probably need some help from ChatGPT.
Here is my version:
__global__ void matmul_kernel_cache(float *out, const float *a, const float *b, int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float sA[TILE_WIDTH][TILE_WIDTH], sB[TILE_WIDTH][TILE_WIDTH];

    int numTiles = (K + TILE_WIDTH - 1) / TILE_WIDTH;
    float pVal = 0;
    for (int t = 0; t < numTiles; t++) {
        // Load one tile of A and one tile of B into shared memory,
        // padding with zeros at the matrix edges.
        int A_row = row;
        int A_col = t * TILE_WIDTH + threadIdx.x;
        if (A_row < M && A_col < K)
            sA[threadIdx.y][threadIdx.x] = a[A_row * K + A_col];
        else
            sA[threadIdx.y][threadIdx.x] = 0;

        int B_row = t * TILE_WIDTH + threadIdx.y;
        int B_col = col;
        if (B_row < K && B_col < N)
            sB[threadIdx.y][threadIdx.x] = b[B_row * N + B_col];
        else
            sB[threadIdx.y][threadIdx.x] = 0;

        __syncthreads();

        for (int j = 0; j < TILE_WIDTH; j++)
            pVal += sA[threadIdx.y][j] * sB[j][threadIdx.x];
        __syncthreads();
    }

    // Only in-bounds threads write a result. Every thread still reaches the
    // __syncthreads() calls above, which avoids barrier divergence when M or N
    // is not a multiple of TILE_WIDTH.
    if (row < M && col < N)
        out[row * N + col] = pVal;
}

This kernel is a good start. But if you remember, we have a way to optimize it: lowering the memory bottleneck by transferring data four floats (16 bytes) at a time with float4.
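In case float4 is unfamiliar: it is CUDA's built-in vector type that packs four floats into 16 bytes, so one aligned load moves four consecutive values in a single 128-bit transaction. Here is a tiny standalone sketch of the idea (the kernel name is made up, purely for illustration):

__global__ void copy_vec4(float4 *dst, const float4 *src, int n4)
{
    // n4 = number of float4 elements, i.e. the total float count divided by 4.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
        dst[i] = src[i]; // one 128-bit load and one 128-bit store per thread
}

With that in mind, here is the float4 version of the matmul kernel: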
__global__ void matmul_kernel_cache_coalescing(float *out, const float4 *a, const float4 *b_T, int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // K is processed in chunks of 4 floats (one float4 per chunk). Reinterpreting
    // a and b_T as float4 assumes K is a multiple of 4, so that every row starts
    // on a 16-byte boundary.
    int K4 = (K + 3) / 4;
    int numTiles = (K4 + TILE_WIDTH - 1) / TILE_WIDTH;

    __shared__ float4 sA[TILE_WIDTH][TILE_WIDTH], sB[TILE_WIDTH][TILE_WIDTH];
    float pVal = 0;
    for (int t = 0; t < numTiles; t++) {
        int a_kchunk = t * TILE_WIDTH + threadIdx.x;
        int b_kchunk = t * TILE_WIDTH + threadIdx.y;

        if (row < M && a_kchunk < K4)
            sA[threadIdx.y][threadIdx.x] = a[row * K4 + a_kchunk];
        else
            sA[threadIdx.y][threadIdx.x] = make_float4(0.f, 0.f, 0.f, 0.f);

        if (col < N && b_kchunk < K4)
            sB[threadIdx.y][threadIdx.x] = b_T[col * K4 + b_kchunk];
        else
            sB[threadIdx.y][threadIdx.x] = make_float4(0.f, 0.f, 0.f, 0.f);

        __syncthreads();

        for (int j = 0; j < TILE_WIDTH; j++) {
            float4 va = sA[threadIdx.y][j];
            float4 vb = sB[j][threadIdx.x];
            pVal += va.x * vb.x + va.y * vb.y + va.z * vb.z + va.w * vb.w;
        }
        __syncthreads();
    }

    if (row < M && col < N)
        out[row * N + col] = pVal;
}

You've probably noticed that the access pattern for B is different: we're passing in the transposed version. Why? Because with float4 we always read four consecutive floats, and that doesn't work for B, which is accessed column-wise in a matmul. Transposing B turns those column accesses into row accesses, so four consecutive k values sit next to each other in memory.
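To make the layout argument concrete (assuming the usual row-major storage; these two helpers exist only for illustration):

// In row-major B (K x N), element B[k][col] lives at b[k * N + col], so the four
// values needed for one float4 (k, k+1, k+2, k+3 in the same column) are N floats
// apart and cannot be fetched with a single vector load.
__host__ __device__ inline int idx_B(int k, int col, int N)  { return k * N + col; }

// In row-major B_T (N x K), element B_T[col][k] lives at b_T[col * K + k], so those
// same four values are adjacent and form exactly one float4.
__host__ __device__ inline int idx_BT(int col, int k, int K) { return col * K + k; }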
Here is the host-side code that transposes B, checks its alignment, and reinterprets it as float4:
auto bT = b.transpose(0, 1).contiguous();
TORCH_CHECK(reinterpret_cast<uintptr_t>(bT.data_ptr<float>()) % 16 == 0, "Tensor B is not 16-byte aligned");
const float4 *b_T_ptr = reinterpret_cast<const float4 *>(bT.data_ptr<float>());

You might think that's a bullseye and the performance is now going to be great. I have to break it to you: no. Performance is still bad, just slightly better. After the years of tuning that have gone into library matmul implementations, casually writing an "optimized" kernel won't cut it. You can check the performance in the following image:
[Image: performance comparison of the matmul kernels]
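If you want to run this end to end, here is a rough sketch of a host-side wrapper for the float4 kernel, written in the PyTorch C++ extension style used above. The wrapper name, the TILE_WIDTH value of 16, and the requirement that K be a multiple of 4 are my assumptions, not something fixed by the kernels themselves:

#include <torch/extension.h>

#define TILE_WIDTH 16  // assumed tile size; must match the value used by the kernels

// Hypothetical wrapper; assumes matmul_kernel_cache_coalescing is defined in the same .cu file.
torch::Tensor matmul_coalesced(torch::Tensor a, torch::Tensor b) {
    TORCH_CHECK(a.is_cuda() && b.is_cuda(), "inputs must be CUDA tensors");
    TORCH_CHECK(a.dtype() == torch::kFloat32 && b.dtype() == torch::kFloat32, "inputs must be float32");
    a = a.contiguous();
    int M = static_cast<int>(a.size(0));
    int K = static_cast<int>(a.size(1));
    int N = static_cast<int>(b.size(1));
    TORCH_CHECK(b.size(0) == K, "inner dimensions must match");
    TORCH_CHECK(K % 4 == 0, "K must be a multiple of 4 for the float4 reinterpret");

    // Transpose B so the k dimension is contiguous, then view both inputs as float4.
    auto bT = b.transpose(0, 1).contiguous();
    TORCH_CHECK(reinterpret_cast<uintptr_t>(bT.data_ptr<float>()) % 16 == 0, "Tensor B is not 16-byte aligned");
    auto out = torch::empty({M, N}, a.options());
    const float4 *a_ptr  = reinterpret_cast<const float4 *>(a.data_ptr<float>());
    const float4 *bT_ptr = reinterpret_cast<const float4 *>(bT.data_ptr<float>());

    // One thread per output element, TILE_WIDTH x TILE_WIDTH threads per block.
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid((N + TILE_WIDTH - 1) / TILE_WIDTH, (M + TILE_WIDTH - 1) / TILE_WIDTH);
    matmul_kernel_cache_coalescing<<<grid, block>>>(out.data_ptr<float>(), a_ptr, bT_ptr, M, N, K);
    return out;
}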
I may get back to this subject later, but for now I'd rather keep moving forward.