An overview of modern LLM compiler stack: writing an interactive and hackable compiler

Posted by NoVibeCoding@reddit | LocalLLaMA | View on Reddit | 9 comments

Hey r/LocalLLaMA,

Production ML compiler stack is brutal: TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. XLA, MLIR, Halide, Mojo.

It is, arguably, the most important piece of modern compute infrastructure, and there is little information available on core concepts. To fill the gap, I built a small ML compiler from scratch: pure Python and raw CUDA, no library use. It takes a small transformer (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs. On RTX 5090, the autotuned stack lands at a geomean of 0.96× vs. the PyTorch production stack, with 32 of 84 kernel shapes beating PyTorch hand-optimized kernels (max 5.6× speedup).

After a month of work, the three-part series is finally finished:

Part 1 walks an RMSNorm layer end-to-end through the upper half of the pipeline: - Torch IR — captured FX graph (rmsnorm, linear, softmax, ...) - Tensor IR — every op decomposed into Elementwise / Reduction / IndexMap - Loop IR — a kernel written as a loop nest fused with other kernels - Tile IR — a kernel scheduled onto the GPU (threads, blocks, shared memory) - Kernel IR — schedule materialized into hardware primitives - CUDA — emitted source ready for nvcc

For example, a PyTorch expression walked through IR levels:

torch.relu(torch.matmul(x + bias, w))   # x: (16, 64), bias: (64,), w: (64, 16)

Torch IR:

bias_bc =  bias[j]                         -> (16, 64) float32
add     =  add(x, bias_bc)                 -> (16, 64) float32
matmul  =  matmul(add, w, has_bias=False)  -> (16, 16) float32
relu    =  relu(matmul)                    -> (16, 16) float32

Tensor IR:

bias_bc  =  bias[j]                 -> (16, 64) float32
w_bc     =  w[j, k]                 -> (16, 64, 16) float32
add      =  add(x, bias_bc)         -> (16, 64) float32
add_bc   =  add[i, j]               -> (16, 64, 16) float32
prod     =  multiply(add_bc, w_bc)  -> (16, 64, 16) float32
red      =  sum(prod, axis=-2)      -> (16, 1, 16) float32
matmul   =  red[i, na, j]           -> (16, 16) float32
relu     =  relu(matmul)            -> (16, 16) float32

Loop IR

=== merged_relu -> relu ===
for a0 in 0..16:  # free (M)
    for a1 in 0..16:  # free (N)
        for a2 in 0..64:  # reduce (K)
            in0 = load bias[a2]
            in1 = load x[a0, a2]
            in2 = load w[a2, a1]
            v0 = add(in1, in0)      # prologue (inside reduce)
            v1 = multiply(v0, in2)
            acc0 <- add(acc0, v1)
        v2 = relu(acc0)             # epilogue (outside reduce)
        merged_relu[a0, a1] = v2

Part 2 explains the lower half: how a loop nest becomes a GPU schedule. Sixteen mechanical Tile-IR passes to split computations into blocks, map to threads, stage inputs into smem, etc. Each pass is one diff in the CLI. It mimics the sequence of optimizations a CUDA engineer would make.

For example, a pass that stages inputs into the smem:

deplodock compile \
  -c "torch.nn.RMSNorm(2048)(torch.randn(1,32,2048))" \
  --ir tile -vv \
  | awk '/^>>> t:007/,/^<<< t:007/'
>>> t:007_stage_inputs
@@ matched at rms_norm (in-place) @@
@@ -2,6 +2,7 @@
   v0 = reciprocal(2048)
   Tile(axes=(a0:256=THREAD, a1:32=BLOCK)):
+      x_smem = Stage(x, origin=(0, a1, 0), slab=(a2:2048@2))
       StridedLoop(a2 = a0; < 2048; += 256):  # reduce
-          in2 = load x[0, a1, a2]
+          in2 = load x_smem[a2]
           v1 = multiply(in2, in2)
           acc0 <- add(acc0, v1)
@@ -11,5 +12,5 @@
       v4 = rsqrt(v3)
       StridedLoop(a2 = a0; < 2048; += 256):  # free
-          in3 = load x[0, a1, a2]
+          in3 = load x_smem[a2]
           in4 = load p_weight[a2]
           v5 = multiply(in3, v4)
<<< t:007_stage_inputs

Part 3 finishes the series with autotuning. Every parameter in part 2 (block size, register tile, K-chunk, whether to stage, whether to double-buffer, etc.) was hand-picked using heuristics. Those heuristics worked on the shapes I fit them to, but fell over elsewhere. Part 3 replaces heuristics with a search loop: SP-MCTS over the cross-product of rule parameters.

The whole pipeline is one CLI:

# inspect any IR stage
deplodock compile -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" --ir tensor|loop|tile|kernel|cuda

# bench end-to-end
deplodock run --bench -c "torch.nn.Softmax(dim=-1)(torch.randn(1,28,2048,2048))"

# autotune on the live GPU
deplodock tune -c "nn.RMSNorm(2048)(torch.randn(1,32,2048))" -v

# full model
deplodock compile Qwen/Qwen2.5-7B

Each part is self-contained enough that you can skip ahead if you only care about one layer: - Part 1. IR Hierarchy — From PyTorch to Emitted CUDA - Part 2. Tile IR — Scheduling Loops onto a GPU - Part 3. Autotuning — A Search Loop Over Tile-IR Rewrites - Repo: https://github.com/cloudrift-ai/deplodock