LLMs are terrible at Sudoku
Posted by cov_id19@reddit | LocalLLaMA | 4 comments
LLMs don’t fail Sudoku because it’s hard.
They fail because one mistake kills the entire solution —
and they have no way to recover.
So they just… hallucinate a valid-looking grid and hope.
Wrapped the same model in a recursive loop:
generate → execute → fix → repeat
Suddenly:
- errors get caught
- state gets updated
- solution converges
Fine-tuning won't fix this, and neither will a bigger model. Simply not forcing the model to be right in one shot makes the difference.
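The generate → execute → fix → repeat loop can be sketched in a few lines. This is a minimal illustration, not minRLM's actual implementation: `ask_model` is a hypothetical stand-in for a real LLM call, stubbed here to return a buggy snippet first and a corrected one after seeing the traceback.

```python
# Minimal sketch of a generate -> execute -> fix -> repeat loop.
import traceback

def ask_model(task, feedback=None):
    # Stub: a real implementation would prompt an LLM with the task
    # plus any error feedback from the previous attempt.
    if feedback is None:
        return "result = 1 / 0  # first attempt: buggy"
    return "result = 6 * 7  # second attempt: fixed after seeing the error"

def repl_loop(task, max_turns=5):
    feedback = None
    for _ in range(max_turns):
        code = ask_model(task, feedback)       # generate
        scope = {}
        try:
            exec(code, scope)                  # execute
            return scope["result"]             # converged
        except Exception:
            feedback = traceback.format_exc()  # fix: feed the error back
    return None                                # gave up after max_turns

print(repl_loop("compute 6 * 7"))  # 42
```

The point is the error path: a vanilla one-shot call would have returned the first (broken) answer, while the loop catches the exception, updates state, and converges on the second turn.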
The Benchmark:
4.25M Sudoku puzzles rated by backtrack difficulty. No context - the puzzle is the task. Vanilla LLMs can't solve Sudoku by pattern-matching tokens - the constraint propagation is too deep. The REPL turns this into what it actually is: a search problem solvable by code. ↗ sapientinc/sudoku-extreme
"Solve this Sudoku puzzle. Each row, column, and 3x3 box must contain digits 1-9 exactly once."
2 _ _ | _ 9 _ | 8 _ _
_ _ _ | _ 8 _ | 1 _ _
9 _ 8 | 2 _ _ | 5 6 _
------+-------+------
5 _ _ | _ _ 2 | _ _ _
_ 9 _ | _ _ _ | _ 4 _
_ _ 3 | _ _ 9 | _ _ 7
------+-------+------
7 _ _ | _ _ 3 | _ 5 _
_ _ 5 | 6 7 _ | _ _ _
_ 6 _ | 4 _ _ | _ _ 1
Raw puzzle: 2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1
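The raw string encodes the grid row-major, with `.` for blanks. A quick parsing sketch (separate from minRLM's own code below) that re-renders it in the boxed format above:

```python
# Parse the 81-char row-major puzzle string and pretty-print it.
puzzle = "2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1"
assert len(puzzle) == 81

rows = [puzzle[i*9:(i+1)*9] for i in range(9)]
for i, row in enumerate(rows):
    cells = [c if c != '.' else '_' for c in row]
    # Separate the three 3x3 box columns with pipes
    print(" ".join(cells[0:3]) + " | " + " ".join(cells[3:6]) + " | " + " ".join(cells[6:9]))
    if i in (2, 5):
        print("------+-------+------")
```

The first printed line is `2 _ _ | _ 9 _ | 8 _ _`, matching the grid shown above.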
minRLM's code
# REASONING: Parse the Sudoku grid and solve with constraint propagation + backtracking.
puzzle = "2...9.8......8.1..9.82..56.5....2....9.....4...3..9..77....3.5...567.....6.4....1"
grid = [list(puzzle[i*9:(i+1)*9]) for i in range(9)]

def candidates(r, c, g):
    # Digits not yet used in this cell's row, column, or 3x3 box
    used = set(g[r][j] for j in range(9) if g[r][j] != '.')
    used |= set(g[i][c] for i in range(9) if g[i][c] != '.')
    br, bc = (r//3)*3, (c//3)*3
    used |= set(g[i][j] for i in range(br, br+3) for j in range(bc, bc+3) if g[i][j] != '.')
    return set("123456789") - used

def solve(g):
    # Pick the empty cell with the fewest candidates (MRV heuristic)
    best, best_cands = None, None
    for r in range(9):
        for c in range(9):
            if g[r][c] == '.':
                cands = candidates(r, c, g)
                if not cands:
                    return None  # dead end: backtrack
                if best is None or len(cands) < len(best_cands):
                    best, best_cands = (r, c), cands
    if best is None:
        return g  # no empty cells left: solved
    r, c = best
    for v in sorted(best_cands):
        ng = [row[:] for row in g]
        ng[r][c] = v
        res = solve(ng)  # recurse on the tentative assignment
        if res is not None:
            return res
    return None

sol = solve(grid)
FINAL("".join("".join(row) for row in sol))
Result: 251396874436785192978214563547132986192867345683549217714923658825671439369458721 ✓ 5,615 tokens, 5.3s. Constraint propagation + backtracking, executes in milliseconds.
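The claimed result is easy to check independently. A small verifier (not part of minRLM, just a sanity check) that confirms every row, column, and 3x3 box contains the digits 1-9 exactly once:

```python
# Independent check of a solved Sudoku string against the rules.
def is_valid_solution(s):
    g = [s[i*9:(i+1)*9] for i in range(9)]
    digits = set("123456789")
    rows_ok = all(set(row) == digits for row in g)
    cols_ok = all(set(g[r][c] for r in range(9)) == digits for c in range(9))
    boxes_ok = all(
        set(g[br+i][bc+j] for i in range(3) for j in range(3)) == digits
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok

solution = "251396874436785192978214563547132986192867345683549217714923658825671439369458721"
print(is_valid_solution(solution))  # True
```

This is exactly the check a vanilla model's output fails: its 81-digit strings look plausible but violate the row/column/box constraints.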
| Runner | Accuracy | Avg Tokens | Notes |
|---|---|---|---|
| minRLM | 80% | 7,598 | Writes a complete solver |
| Official RLM | 70% | 30,175 | Also writes code, 4x more tokens |
| Vanilla | 0% | 269 | Outputs 81 wrong digits |
GPT-5.4-mini, 20 runs per runner.
StupidScaredSquirrel@reddit
I fail to see how this is an RLM implementation and not just using an LLM to write a small program for a specific problem
cov_id19@reddit (OP)
Sudoku full results: https://avilum.github.io/minrlm/recursive-language-model.html#sudoku
cov_id19@reddit (OP)
I ran a benchmark across 13 tasks, on models of different sizes. The full results and my RLM implementation can be found here: https://avilum.github.io/minrlm/
cov_id19@reddit (OP)
Some reasoning tasks are not meant to be done in plaintext. For many tasks, reasoning through code recursively is all it takes.
Try it yourself with any local / remote model
https://github.com/avilum/minrlm?tab=readme-ov-file#try-it-in-10-seconds
Sudoku is unsolvable by token prediction alone - the constraint propagation is too deep for pattern matching. Vanilla outputs confident-looking 81-digit strings that violate basic Sudoku rules. The REPL turns it into what it actually is: a search problem. minRLM writes a backtracking solver and runs it.