Writing an LLM compiler from scratch: PyTorch to CUDA in 5,000 lines of Python
Posted by NoVibeCoding@reddit | LocalLLaMA | View on Reddit | 7 comments
Hey r/LocalLLaMA,
I wanted to come up with a simple overview of the modern ML compiler stack, essentially what happens between model.generate() and the GPU executing a kernel. However, the stack is brutal to read. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. Then there's XLA, MLIR, Halide, and Mojo.
Instead, I decided to take a different approach and build one from scratch: just pure Python and raw CUDA. Take a small model (Qwen2.5-7B, TinyLlama) and compile it into a sequence of CUDA kernels. The goal isn't to beat Triton today, but to create a hackable compiler that doesn't require a PhD to modify, or at least one that makes the stack easier to follow.
The final performance is about 50-90% of the production stack (PyTorch Eager and torch.compile).
I built it in a principled way, with a layered pipeline and concerns clearly separated:
- Torch IR — captured FX graph (rmsnorm, linear, softmax, ...)
- Tensor IR — every op decomposed into Elementwise / Reduction / IndexMap
- Loop IR — a kernel written as a loop nest fused with other kernels
- Tile IR — a kernel scheduled onto the GPU (threads, blocks, shared memory)
- Kernel IR — schedule materialized into hardware primitives
- CUDA — emitted source ready for nvcc
Tensor IR is introduced so that other frontends (ONNX, JAX) can be supported alongside PyTorch in the future. Loop fusion handles long pointwise and reduction chains. The lowering stages introduce optimizations such as tiled matmul, smem staging, and double-buffering.
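To make the first two stages concrete, here is a minimal sketch of what they might look like. This is not the actual deplodock code; the node classes and field names are illustrative, and only torch/torch.fx are real APIs:

# Illustrative sketch of the Torch IR -> Tensor IR stages (not the real deplodock IR).
from dataclasses import dataclass
import torch
import torch.fx as fx

@dataclass
class Elementwise:      # one output element per input element (mul, add, rsqrt, ...)
    op: str
    inputs: list

@dataclass
class Reduction:        # collapse an axis into an accumulator (sum, max, ...)
    op: str
    axis: int
    inputs: list
# (the third node kind, IndexMap, would cover reshapes/gathers)

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        var = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight

# Torch IR: capture the FX graph (pow -> mean -> add -> rsqrt -> mul -> mul).
traced = fx.symbolic_trace(RMSNorm(2048))
print(traced.graph)

# Tensor IR: the same computation decomposed into Elementwise / Reduction nodes,
# which is what the loop-fusion pass can later stitch into a single loop nest.
x = "x"
sq      = Elementwise("mul", [x, x])
sum_sq  = Reduction("sum", axis=-1, inputs=[sq])
inv_rms = Elementwise("rsqrt", [Elementwise("add", [Elementwise("div", [sum_sq, "dim"]), "eps"])])
out     = Elementwise("mul", [Elementwise("mul", [x, inv_rms]), "weight"])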
Each stage can be inspected and debugged independently (repository link). No GPU needed:
deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
Benchmarking:
deplodock run --bench --profile -c "torch.nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
End-to-end compilation:
deplodock compile Qwen/Qwen2.5-7B
The generated CUDA kernel for RMSNorm looks like this:
extern "C" __global__
__launch_bounds__(256) void k_rms_norm_reduce(const float* x, const float* p_weight, float* rms_norm) {
float in0 = 2048.0f;
float in1 = 1e-06f;
{
int a1 = blockIdx.x;
int a0 = threadIdx.x;
float acc0 = 0.0f;
__syncthreads();
__shared__ float x_smem[2048];
for (int x_smem_flat = a0; x_smem_flat < 2048; x_smem_flat += 256) {
{
unsigned int _smem_addr = __cvta_generic_to_shared(&x_smem[x_smem_flat]);
asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
:: "r"(_smem_addr), "l"(&x[a1 * 2048 + x_smem_flat])
: "memory");
}
}
asm volatile("cp.async.commit_group;\n" ::: "memory");
asm volatile("cp.async.wait_group 0;\n" ::: "memory");
__syncthreads();
__shared__ float p_weight_smem[2048];
for (int p_weight_smem_flat = a0; p_weight_smem_flat < 2048; p_weight_smem_flat += 256) {
{
unsigned int _smem_addr = __cvta_generic_to_shared(&p_weight_smem[p_weight_smem_flat]);
asm volatile("cp.async.ca.shared.global [%0], [%1], 4;\n"
:: "r"(_smem_addr), "l"(&p_weight[p_weight_smem_flat])
: "memory");
}
}
asm volatile("cp.async.commit_group;\n" ::: "memory");
asm volatile("cp.async.wait_group 0;\n" ::: "memory");
__syncthreads();
for (int a2 = a0; a2 < 2048; a2 += 256) {
float in2 = x_smem[a2];
float v0 = in2 * in2;
acc0 += v0;
}
__shared__ float acc0_smem[256];
acc0_smem[a0] = acc0;
__syncthreads();
for (int s = 128; s > 0; s >>= 1) {
if (a0 < s) {
acc0_smem[a0] = acc0_smem[a0] + acc0_smem[a0 + s];
}
__syncthreads();
}
__syncthreads();
float acc0_b = acc0_smem[0];
float v1 = acc0_b / in0;
float v2 = v1 + in1;
float v3 = rsqrtf(v2);
for (int a3 = a0; a3 < 2048; a3 += 256) {
float in3 = x_smem[a3];
float in4 = p_weight_smem[a3];
float v4 = in3 * v3;
float v5 = v4 * in4;
rms_norm[a1 * 2048 + a3] = v5;
}
}
}
therealdivs1210@reddit
Thank you, that's a great way to learn about this stuff!
NoVibeCoding@reddit (OP)
Happy to hear it helped!
c-cul@reddit
very poor generated code
for example, why is there a __syncthreads at the start? also, why use inline PTX instead of pipeline primitives like __pipeline_memcpy_async and friends?
NoVibeCoding@reddit (OP)
It's defensive. The codegen rule is "every staged load is preceded by a barrier" because in double-buffered kernels (matmul over tiled K) the previous iteration's consumers must finish before producers overwrite the buffer. The same rule is used to inject barriers in both the double-buffered and single-buffered cases; we sometimes end up with a no-op.
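Roughly, the pass amounts to something like this (a simplified sketch, not the actual code; the statement kinds are illustrative):

from dataclasses import dataclass

@dataclass
class Stmt:
    kind: str            # "smem_stage", "compute", "barrier", ...
    payload: str = ""

def inject_barriers(stmts):
    # Rule: every staged (shared-memory) load is preceded by a barrier.
    # Double-buffered case: stops producers from overwriting a tile the
    # previous iteration is still reading. Single-buffered case: the first
    # barrier degenerates into the no-op you see at the top of the kernel.
    out = []
    for s in stmts:
        if s.kind == "smem_stage":
            out.append(Stmt("barrier"))   # emitted as __syncthreads() later
        out.append(s)
    return out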
Inline PTX is easier for codegen: there's no need to keep track of a stateful cuda::pipeline object or pull in additional headers.
In general, this article is about Torch -> Tile IR. I mention that the codegen overview will be in part 2. I might clean it up by then.
c-cul@reddit
those were the obvious flaws, but there are more
you could load/store 128-bit values instead of just 32-bit
you could employ a warp reduce for acc0_smem
I am not an expert in LLM-specific algos, but it looks like the current implementation is strictly memory bound and achieves only 10-15% of SoL
NoVibeCoding@reddit (OP)
I know. Generating efficient kernels is hard. There are plenty of kernels to generate, and each of them prefers a different strategy. That's why all production stacks use codegen sparingly. I explicitly call this out in the beginning and postpone the codegen overview until Part 2. Part 1 is about PyTorch tracing, operator decomposition, loop fusion, etc.
c-cul@reddit
yeah, that's why tvm/xla are so fat - they're able to do some sophisticated optimizations
ok, will wait for part 2 - good luck