LLMs are terrible at Sudoku
Posted by cov_id19@reddit | LocalLLaMA | 4 comments
LLMs don’t fail Sudoku because it’s hard.
They fail because one mistake kills the entire solution —
and they have no way to recover.
So they just… hallucinate a valid-looking grid and hope.
Wrapped the same model in a recursive loop:
generate → execute → fix → repeat
Suddenly:
- errors get caught
- state gets updated
- solution converges
Fine-tuning won't fix this, and neither will a bigger model. Simply not forcing the model to be right in one shot makes the difference.
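The generate → execute → fix → repeat loop can be sketched in a few lines. This is a minimal illustration, not minRLM's actual implementation: `ask_model` is a hypothetical stand-in for a real LLM call, stubbed here to return a buggy snippet first and a corrected one after seeing the traceback.

```python
# Minimal sketch of a generate -> execute -> fix -> repeat loop.
import traceback

def ask_model(task, feedback=None):
    # Stub: a real implementation would prompt an LLM with the task
    # plus any error feedback from the previous attempt.
    if feedback is None:
        return "result = 1 / 0  # first attempt: buggy"
    return "result = 6 * 7  # second attempt: fixed after seeing the error"

def repl_loop(task, max_turns=5):
    feedback = None
    for _ in range(max_turns):
        code = ask_model(task, feedback)       # generate
        scope = {}
        try:
            exec(code, scope)                  # execute
            return scope["result"]             # converged
        except Exception:
            feedback = traceback.format_exc()  # fix: feed the error back
    return None                                # gave up after max_turns

print(repl_loop("compute 6 * 7"))  # 42
```

The point is the error path: a vanilla one-shot call would have returned the first (broken) answer, while the loop catches the exception, updates state, and converges on the second turn.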
The Benchmark:
4.25M Sudoku puzzles rated by backtrack difficulty. No context - the puzzle is the task. Vanilla LLMs can't solve Sudoku by pattern-matching tokens - the constraint propagation is too deep. The REPL turns this into what it actually is: a search problem solvable by code. ↗ sapientinc/sudoku-extreme
"Solve this Sudoku puzzle. Each row, column, and 3x3 box must contain digits 1-9 exactly once."
2 _ _ | _ 9 _ | 8 _ _
_ _ _ | _ 8 _ | 1 _ _
9 _ 8 | 2 _ _ | 5 6 _
------+-------+------
5 _ _ | _ _ 2 | _ _ _
_ 9 _ | _ _ _ | _ 4 _
_ _ 3 | _ _ 9 | _ _ 7
------+-------+------
7 _ _ | _ _ 3 | _ 5 _
_ _ 5 | 6 7 _ | _ _ _
_ 6 _ | 4 _ _ | _ _ 1
Raw puzzle: 2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1
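The raw string encodes the grid row-major, with `.` for blanks. A quick parsing sketch (separate from minRLM's own code below) that re-renders it in the boxed format above:

```python
# Parse the 81-char row-major puzzle string and pretty-print it.
puzzle = "2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1"
assert len(puzzle) == 81

rows = [puzzle[i*9:(i+1)*9] for i in range(9)]
for i, row in enumerate(rows):
    cells = [c if c != '.' else '_' for c in row]
    # Separate the three 3x3 box columns with pipes
    print(" ".join(cells[0:3]) + " | " + " ".join(cells[3:6]) + " | " + " ".join(cells[6:9]))
    if i in (2, 5):
        print("------+-------+------")
```

The first printed line is `2 _ _ | _ 9 _ | 8 _ _`, matching the grid shown above.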
minRLM's code
# REASONING: Parse the Sudoku grid and solve with constraint propagation + backtracking.
puzzle = "2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1"
grid = [list(puzzle[i*9:(i+1)*9]) for i in range(9)]

def candidates(r, c, g):
    # Digits not yet used in this cell's row, column, or 3x3 box
    used = set(g[r][j] for j in range(9) if g[r][j] != '.')
    used |= set(g[i][c] for i in range(9) if g[i][c] != '.')
    br, bc = (r//3)*3, (c//3)*3
    used |= set(g[i][j] for i in range(br, br+3) for j in range(bc, bc+3) if g[i][j] != '.')
    return set("123456789") - used

def solve(g):
    # Pick the empty cell with the fewest candidates (MRV heuristic)
    best, best_cands = None, None
    for r in range(9):
        for c in range(9):
            if g[r][c] == '.':
                cands = candidates(r, c, g)
                if not cands:
                    return None  # dead end: backtrack
                if best is None or len(cands) < len(best_cands):
                    best, best_cands = (r, c), cands
    if best is None:
        return g  # no empty cells left: solved
    r, c = best
    for v in sorted(best_cands):
        ng = [row[:] for row in g]
        ng[r][c] = v
        res = solve(ng)  # recurse on the tentative assignment
        if res is not None:
            return res
    return None

sol = solve(grid)
FINAL("".join("".join(row) for row in sol))
Result: 251396874436785192978214563547132986192867345683549217714923658825671439369458721 ✓ 5,615 tokens, 5.3s. Constraint propagation + backtracking, executes in milliseconds.
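The claimed result is easy to check independently. A small verifier (not part of minRLM, just a sanity check) that confirms every row, column, and 3x3 box contains the digits 1-9 exactly once:

```python
# Independent check of a solved Sudoku string against the rules.
def is_valid_solution(s):
    g = [s[i*9:(i+1)*9] for i in range(9)]
    digits = set("123456789")
    rows_ok = all(set(row) == digits for row in g)
    cols_ok = all(set(g[r][c] for r in range(9)) == digits for c in range(9))
    boxes_ok = all(
        set(g[br+i][bc+j] for i in range(3) for j in range(3)) == digits
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok

solution = "251396874436785192978214563547132986192867345683549217714923658825671439369458721"
print(is_valid_solution(solution))  # True
```

This is exactly the check a vanilla model's output fails: its 81-digit strings look plausible but violate the row/column/box constraints.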
| Runner | Accuracy | Avg Tokens | Notes |
|---|---|---|---|
| minRLM | 80% | 7,598 | Writes a complete solver |
| Official RLM | 70% | 30,175 | Also writes code, 4x more tokens |
| Vanilla | 0% | 269 | Outputs 81 wrong digits |
GPT-5.4-mini, 20 runs per runner.
StupidScaredSquirrel@reddit
I fail to see how this is an RLM implementation and not just using an LLM to write a small program for a specific problem
cov_id19@reddit (OP)
Sudoku full results: https://avilum.github.io/minrlm/recursive-language-model.html#sudoku
cov_id19@reddit (OP)
I ran a benchmark across 13 tasks, on models of different sizes. The full results and my RLM implementation can be found here: https://avilum.github.io/minrlm/
cov_id19@reddit (OP)
Some reasoning tasks are not meant to be done in plaintext. For many tasks, reasoning through code recursively is all it takes.
Try it yourself with any local / remote model
https://github.com/avilum/minrlm?tab=readme-ov-file#try-it-in-10-seconds
Sudoku is unsolvable by token prediction alone - the constraint propagation is too deep for pattern matching. Vanilla outputs confident-looking 81-digit strings that violate basic Sudoku rules. The REPL turns it into what it actually is: a search problem. minRLM writes a backtracking solver and runs it.