An overview of modern LLM compiler stack: writing an interactive and hackable compiler
Posted by NoVibeCoding@reddit | LocalLLaMA | View on Reddit | 9 comments
Hey r/LocalLLaMA,
Production ML compiler stack is brutal: TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. XLA, MLIR, Halide, Mojo.
It is, arguably, the most important piece of modern compute infrastructure, and there is little information available on core concepts. To fill the gap, I built a small ML compiler from scratch: pure Python and raw CUDA, no library use. It takes a small transformer (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. On RTX 5090, the autotuned stack lands at a geomean of 0.96× vs. the PyTorch production stack, with 32 of 84 kernel shapes beating PyTorch hand-optimized kernels (max 5.6× speedup).
After a month of work, the three-part series is finally finished:
Part 1 walks an RMSNorm layer end-to-end through the upper half of the pipeline: - Torch IR — captured FX graph (rmsnorm, linear, softmax, ...) - Tensor IR — every op decomposed into Elementwise / Reduction / IndexMap - Loop IR — a kernel written as a loop nest fused with other kernels - Tile IR — a kernel scheduled onto the GPU (threads, blocks, shared memory) - Kernel IR — schedule materialized into hardware primitives - CUDA — emitted source ready for nvcc
For example, a PyTorch expression walked through IR levels:
torch.relu(torch.matmul(x + bias, w)) # x: (16, 64), bias: (64,), w: (64, 16)
Torch IR:
bias_bc = bias[j] -> (16, 64) float32
add = add(x, bias_bc) -> (16, 64) float32
matmul = matmul(add, w, has_bias=False) -> (16, 16) float32
relu = relu(matmul) -> (16, 16) float32
Tensor IR:
bias_bc = bias[j] -> (16, 64) float32
w_bc = w[j, k] -> (16, 64, 16) float32
add = add(x, bias_bc) -> (16, 64) float32
add_bc = add[i, j] -> (16, 64, 16) float32
prod = multiply(add_bc, w_bc) -> (16, 64, 16) float32
red = sum(prod, axis=-2) -> (16, 1, 16) float32
matmul = red[i, na, j] -> (16, 16) float32
relu = relu(matmul) -> (16, 16) float32
Loop IR
=== merged_relu -> relu ===
for a0 in 0..16: # free (M)
for a1 in 0..16: # free (N)
for a2 in 0..64: # reduce (K)
in0 = load bias[a2]
in1 = load x[a0, a2]
in2 = load w[a2, a1]
v0 = add(in1, in0) # prologue (inside reduce)
v1 = multiply(v0, in2)
acc0 <- add(acc0, v1)
v2 = relu(acc0) # epilogue (outside reduce)
merged_relu[a0, a1] = v2
Part 2 explains the lower half: how a loop nest becomes a GPU schedule. Sixteen mechanical Tile-IR passes to split computations into blocks, map to threads, stage inputs into smem, etc. Each pass is one diff in the CLI. It mimics the sequence of optimizations a CUDA engineer would make.
For example, a pass that stages inputs into the smem:
deplodock compile \
-c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
--ir tile -vv \
| awk '/^>>> t:007/,/^<<< t:007/'
>>> t:007_stage_inputs
@@ matched at rms_norm (in-place) @@
@@ -2,6 +2,7 @@
v0 = reciprocal(2048)
Tile(axes=(a0:256=THREAD, a1:32=BLOCK)):
+ x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2))
StridedLoop(a2 = a0; < 2048; += 256): # reduce
- in2 = load x[0, a1, a2]
+ in2 = load x_smem[a2]
v1 = multiply(in2, in2)
acc0 <- add(acc0, v1)
@@ -11,5 +12,5 @@
v4 = rsqrt(v3)
StridedLoop(a2 = a0; < 2048; += 256): # free
- in3 = load x[0, a1, a2]
+ in3 = load x_smem[a2]
in4 = load p_weight[a2]
v5 = multiply(in3, v4)
<<< t:007_stage_inputs
Part 3 finishes the series with autotuning. Every parameter in part 2 (block size, register tile, K-chunk, whether to stage, whether to double-buffer, etc.) was hand-picked using heuristics. Those heuristics worked on the shapes I fit them to, but fell over elsewhere. Part 3 replaces heuristics with a search loop: SP-MCTS over the cross-product of rule parameters.
The whole pipeline is one CLI:
# inspect any IR stage
deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda
# bench end-to-end
deplodock run --bench -c "torch.nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"
# autotune on the live GPU
deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v
# full model
deplodock compile Qwen/Qwen2.5-7B
Each part is self-contained enough that you can skip ahead if you only care about one layer: - Part 1. IR Hierarchy — From PyTorch to Emitted CUDA - Part 2. Tile IR — Scheduling Loops onto a GPU - Part 3. Autotuning — A Search Loop Over Tile-IR Rewrites - Repo: https://github.com/cloudrift-ai/deplodock
FedericoBruzzone@reddit
This is really what is needed nowadays. Thank you for sharing!
FirmSignificance1725@reddit
You might find IREE interesting as well:
https://github.com/iree-org/iree
NoVibeCoding@reddit (OP)
IREE is a great project. Though it falls into the same bucket as MLIR, Inductor, XLA, etc. Too complex to hack and experiment with. Difficult to learn core concepts.
FirmSignificance1725@reddit
Yeah, valid point, especially as it has a heavy dependency on MLIR. Took me awhile to get comfortable with it. I think it’s really cool to provide a more readable compiler for learning purposes though
Definitely difficult for someone to jump into this space and learn from ground up if they don’t have a lot of time and, typically, monetary motivations that make the complexity worth it
BitGreen1270@reddit
Man it blows my mind that people are able to get down to low level basics and build stuff like this. I don't even understand the high-level stuff. Your work on this is amazing.
NoVibeCoding@reddit (OP)
Thanks. It is based on past experience, though. Figuring this out top to bottom is a significant investment indeed.
un_passant@reddit
Interesting !
I would love to know what you think of https://github.com/tinygrad/tinygrad .
NoVibeCoding@reddit (OP)
Tinygrad is great, and the project's framing is largely inspired by it. The focus is different. Tinygrad is a full training framework. Originally, it was aiming to demystify autograd. Nowadays, it has the compiler backend as well. However, the compiler components of TinyGrad are not well-documented. The UOp-based approach doesn't follow a classic IR stack with responsibilities clearly separated between layers, which makes it hard to follow. I can't speak to its performance, but it isn't a good pedagogical reference as an ML compiler.
PixelSage-001@reddit
This is exactly what the ecosystem needs right now. The current abstraction layers are way too rigid for any serious optimization. Being able to hack the compiler directly opens up so many possibilities for edge computing and local deployments.