I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued...
Posted by nickl@reddit | LocalLLaMA | 62 comments
Last week I asked for some feedback about what extra models I should test. I've added them all and now the benchmark is available at https://sql-benchmark.nicklothian.com/
I didn't say a lot at the time about what the agent does, but in simple terms it takes an English query like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and turns it into SQL that it tests against a set of database tables.
It gets to see the query results and can modify the SQL to fix issues, but with a limit on the number of debugging rounds it gets.
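That generate-run-debug loop can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function names and the stub "model" are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

def text_to_sql_agent(question, conn, generate_sql, max_rounds=3):
    """Hypothetical sketch of the loop: ask the model for SQL, run it,
    and feed any error back for a limited number of debugging rounds."""
    feedback = None
    for _ in range(max_rounds):
        sql = generate_sql(question, feedback)  # LLM call (stubbed below)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows                    # success: working query
        except sqlite3.Error as e:
            feedback = f"SQL failed: {e}"       # give the model the error to fix
    return sql, None                            # out of debugging rounds

# Tiny demo with a stub "model" that fixes its query on the second try
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product TEXT, revenue REAL)")
conn.execute("INSERT INTO orders VALUES ('widget', 10.0)")

attempts = iter(["SELECT revnue FROM orders",    # typo'd column, will fail
                 "SELECT revenue FROM orders"])  # corrected on round two
sql, rows = text_to_sql_agent("total revenue", conn,
                              lambda q, fb: next(attempts))
print(rows)  # [(10.0,)]
```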
The benchmark is deliberately short (25 questions) and fast to run (well under 5 minutes for most models), so you can try different configurations, but it is tough enough to separate the best models from the rest.
I added the ability to run it yourself against your own server (thanks to the WASM version of Llama.cpp).
A few of the things I found interesting:
- The best open models are kimi-k2.5, Qwen 3.5 397B-A17B and Qwen 3.5 27B (!)
- NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
- Mimo v2 Flash is a gem of a model
I'd love to see some scores people get, as well as what I should change for v2!
nicoloboschi@reddit
This is a solid benchmark, especially focusing on text-to-SQL. If you want to augment your agent with memory capabilities, you might find some useful inspiration in the Hindsight architecture. https://github.com/vectorize-io/hindsight
rm-rf-rm@reddit
Can you please address these 2 critical questions:
nickl@reddit (OP)
Yes this is correct.
Correct.
The justification here is that it's actually very hard to get the first row correct and other ones wrong. Here is a typical trace:
When I was first building it I was using very small models. Passing the full result set blew out the context very quickly.
I have noticed some models seem to think that the labeling means only one row is being returned. I'm not entirely sure what the best option here is but open to ideas.
I'd be interested if you have examples of any cases where this scoring gives the wrong result.
> is the result for a single test based on just 1 pass? is there any checking for stability/instability i.e. each question asked 3-5 times to see if it passes every time?
No, but it's also a valid point.
rm-rf-rm@reddit
I think the onus is on you to demonstrate that scoring with the full result vs first row is equivalent before settling for the latter as good enough.
nickl@reddit (OP)
Oh - yes, I did check it and I didn't find any cases where it was failing. But I have run more models through it so it's not impossible.
Thinking about it though, the scoring and the feedback to the LLM are actually two separate concerns so yes I should be able to do the scoring based on the whole resultset fairly easily.
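Separating the two concerns could look something like this hypothetical sketch (function names and the order-insensitive comparison are assumptions, not the benchmark's actual code): score on the full result set, but only echo a truncated slice back into the model's context.

```python
def score(model_rows, canonical_rows):
    """Score on the full result set (order-insensitive here for simplicity)."""
    return sorted(model_rows) == sorted(canonical_rows)

def feedback_for_llm(model_rows, max_rows=1):
    """Keep the agent's context small: only echo the first row(s) back."""
    return model_rows[:max_rows]

canonical = [("bikes", 120.0), ("helmets", 40.0)]
got = [("helmets", 40.0), ("bikes", 120.0)]

assert score(got, canonical)                        # full-resultset check
assert feedback_for_llm(got) == [("helmets", 40.0)] # context stays tiny
```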
rm-rf-rm@reddit
Thanks! Yes I was confused as to why scoring would have anything to do with context
nickl@reddit (OP)
Well, it sort of does, because if there are cases where rows beyond the first are wrong, that needs to be put into the context.
But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.
Tormeister@reddit
A few local tests, ran 2 or 3 times:
Qwen3.5 27B Q5_K_S: 20/25, keeps failing/crashing ~5 tests (hard & medium)
Qwen3.5 27B Q6_K: 22/25, Q2 error, Q9 and Q21 fail
Qwen3.5 40B (custom expanded) Q4_K_M: 23/25, Q9 and Q21 fail
All using q8_0 kv + Hadamard transform
nickl@reddit (OP)
Wow your 40B model does well!
abkibaarnsit@reddit
This is amazing. One small request: is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas? I want to compare the SQLs.
nickl@reddit (OP)
> is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas

I did a quick look for something easy but didn't find anything. Didn't look too hard though, and I agree it is needed.
abkibaarnsit@reddit
Raising a PR
nicholas_the_furious@reddit
Yo! I love this and I am using your benchmark. Something I noticed, though, is you are controlling the temperature when you are passing in the calls. I know that in general low temp = better for tool calling, however I want to ask you to allow the model/provider to set their own sampling parameters!
Many models now need a temp higher than 0 or .1, especially reasoning models, in order to produce the best results. That may be why your reasoning versions of the smaller qwen models did worse than the non-thinking versions. I am running a q8_0 version of the Nemotron 2 Cascade model and it is surpassing what you had scored on the 120B Nemotron for this task. But I did notice the hardcoded .1 temp in there. I would like to be able to use the suggested temp provided by Nvidia (1.0).
nickl@reddit (OP)
These are all great points. Raised an issue for myself, thanks
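The suggestion above could be as simple as a per-model lookup with a caller override. This is a hypothetical sketch (the model ID, the `RECOMMENDED` table, and the 0.1 fallback are assumptions for illustration):

```python
# Fall back to a model's vendor-recommended sampling parameters
# instead of one hardcoded temperature for every model.
RECOMMENDED = {
    # NVIDIA suggests temperature 1.0 for this model (per the comment above)
    "nvidia/nemotron-cascade-2-30b-a3b": {"temperature": 1.0},
}

def sampling_params(model, override=None):
    """Use the caller's override if given, else the vendor suggestion,
    else a conservative default for tool calling."""
    if override is not None:
        return override
    return RECOMMENDED.get(model, {"temperature": 0.1})

assert sampling_params("nvidia/nemotron-cascade-2-30b-a3b") == {"temperature": 1.0}
assert sampling_params("unknown/model") == {"temperature": 0.1}
```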
nicholas_the_furious@reddit
The outstanding competitor was actually Apriel 1.6 15B. It got 20/25. Nailed everything below hard.
nickl@reddit (OP)
Some might find this chart useful too.
LowMental5202@reddit
How come gpt-oss 20B is so high up? I tested it yesterday and got over 100 t/s on my 3090 with a relatively big context window.
IrisColt@reddit
I still recommend it to colleagues, by the way.
nickl@reddit (OP)
I don't have a good explanation. The free version scored much better than the non-free version too. It's very odd!
SpiritualWindow3855@reddit
You didn't pin the providers on OR?
nickl@reddit (OP)
I didn't know that you can do that!
Kahvana@reddit
Didn't expect mistral small 2603 to score that high, neat!
Embarrassed_Elk_4733@reddit
OK, everything is working. I linked local LM Studio to my VPS in new-api and tried the new-api endpoint, e.g. https://newapi.xxxxx.top/v1/chat/completions. The benchmark is running!
Here is Qwen3.5 35B A3B Q4_K_M on a 4070 Ti Super with 16GB VRAM. Speed is 58 tokens/s. Very nice work from the Qwen team and nickl!!
jld1532@reddit
Do we have a trusted source to download cascade on LM Studio?
nicholas_the_furious@reddit
https://huggingface.co/bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF
grumd@reddit
Omg this is so good, I want to run all my local models on this benchmark. I can't get it to run with my llama.cpp server though. I tried pointing it at http://localhost:8080/v1/messages and at http://localhost:8080/v1/completions; I can see a request coming in to POST /v1/messages in the logs, but then I just see "Model failed to produce a tool call in 3 consecutive attempts" and there's no response from the model. Maybe something is wrong with my setup, because the same models work fine with OpenCode / Claude Code.
nickl@reddit (OP)
Yep. Small models often fail to do reliable tool calls.
Grammar mode helps sometimes, but that isn't available in the web version. The write-up has more details.
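One common mitigation, sketched here as an assumption about how a fallback could work (not the benchmark's actual code): when a small model emits the tool call as plain JSON text instead of a structured `tool_calls` field, try to scrape it out of the reply.

```python
import json
import re

def extract_tool_call(text):
    """Fallback parser for models that print a tool call as plain JSON
    text instead of using the structured tool-calling field.
    Returns the parsed call dict, or None if nothing usable is found."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # grab the outermost braces
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if "name" in call else None

reply = 'Sure! {"name": "run_sql", "arguments": {"query": "SELECT 1"}}'
print(extract_tool_call(reply)["name"])  # run_sql
```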
chooseyouravatar@reddit
Love your site, thanks for your work. Talking about small models, it was fun to test Jan v1 (a 4b model, 2.5GB quantized) next to 30B and 9B models. It didn't perform badly (factually way better than Jan v2 8b)!
nickl@reddit (OP)
That's a great score. I've never heard of that model or JanHQ.
I see that it is designed for coding, and I think in general those models work best for this benchmark.
chooseyouravatar@reddit
Yes, it's an instruct model (Q4_K_S quant here). It's one of my all-time favorites for agent/tooling stuff. Giving it 240 seconds to perform instead of 120, it succeeds in answering one hard question, scoring 15/25 in about the same total time (770 seconds). Smart little model.
Technical-Earth-3254@reddit
My local Qwen 3.5 27B Opus Distil in q4km scored 21/25, I'm very impressed.
Killawatts13@reddit
Mind posting which model? My main issue is small context windows. Using q8 KV helps but still bottlenecks me at around 80k context. I have a 4090 with 64GB RAM.
Technical-Earth-3254@reddit
I was using Bartowski's quant, if that's what you're looking for. With 24GB of VRAM we are limited to ~60k context when running on Windows. I have around 2GB of VRAM in use before loading the model. Running on a setup without Windows and a GUI would improve things quite a bit.
tmvr@reddit
I'm using the bartowski Q4_K_L with a 24GB RTX4090 on Windows and I can go up to 88K (90112) with the default KV and up to 156K (159744) with KV at q8_0. Windows and the apps are using 1.1GB VRAM.
Comfortable_Ebb7015@reddit
If you have a CPU with onboard graphics, you can plug the screen into the motherboard and free those 2GB, then use GPU passthrough to use the GPU when you need it!
Technical-Earth-3254@reddit
I'm well aware of that, but I don't have an iGPU in any of my systems sadly.
nickl@reddit (OP)
That's a great score. What TPS did it get? (It shows on mouse-over.)
Technical-Earth-3254@reddit
Around 20 tps on a 3090 at 400W, fully loaded in VRAM with ~60k context.
duridsukar@reddit
The gap between benchmark performance and production performance is the thing that keeps coming up in my work.
I run agents across a real estate operation — data retrieval, lead analysis, intake. What I've found is that SQL-style structured queries actually perform more reliably than natural language chains when the schema is well-defined. The model choice mattered less than the prompt architecture and schema documentation.
What kind of error patterns came up most when the query was complex? Hallucinated column names or wrong joins?
nickl@reddit (OP)
Tool calling is the biggest failure. It's just unreliable for small models as the context gets longer.
After that, I haven't done deep analysis, but things like hallucinated column names or not quoting names with spaces in them seem common.
Adorable_Weakness_39@reddit
Qwen 3.5-27B is the goat. You can run it on a RTX 3090 at 40 tok/s. Everyone should be using it on their own hardware
LienniTa@reddit
hey, what settings?
Adorable_Weakness_39@reddit
llama.cpp, 99 gpu layers, 240k context, q8 kv_cache.
The thing I've been working on auto-configures the llama.cpp server and downloads/detects the model based on your hardware: https://github.com/L-Forster/open-jet . I'm looking for feedback for future development.
It's faster than ollama and is easier to set up than llama.cpp.
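For reference, the settings above map roughly onto a plain llama.cpp server invocation like this (the model filename is a placeholder for whatever quant you downloaded):

```shell
# All layers offloaded to the GPU, ~240k context, q8_0 KV cache
# (model path is a placeholder)
llama-server -m ./Qwen3.5-27B.gguf -ngl 99 -c 245760 \
  --cache-type-k q8_0 --cache-type-v q8_0
```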
nickl@reddit (OP)
Yes, it's amazing.
I'm very excited about Nemotron-Cascade-2-30B-A3B. A 3-bit quant got within 2 points of Qwen 3.5-27B, but some of the missed scores were timeouts on my GTX 1070 with 8GB!
RedParaglider@reddit
I think it's crazy that it beat Qwen3 Coder Next and the 122B model. The 122B was what really surprised me.
themaskbehindtheman@reddit
What's the quant and context you run with?
tmvr@reddit
Awesome site and benchmark setup!
Qwen3.5 27B Q4_K_L from bartowski gives the same 23/25 result you got with the same Q9 and Q21 failing.
SeaDisk6624@reddit
Can you test Qwen 3.5 397B FP4 on OpenRouter please? It would be really interesting to see how it compares. Thanks!
nickl@reddit (OP)
Is there a FP4 version on OpenRouter?
If you have an OpenRouter key you can actually run it yourself.
SeaDisk6624@reddit
Yes there is. I can't run it myself currently.
Evening_Ad6637@reddit
Great work OP!
I’ve tested GLM-4.5-Air and GLM-4.7
GLM-4.7: 23/25 (failed Q9 and Q21), $0.07, 143s
GLM-4.5-Air: 19/25 (can't remember the rest unfortunately)
rm-rf-rm@reddit
Is the full DB schema injected into the prompt? I couldn't find information on this.
nickl@reddit (OP)
No, just the tables for the question.
If you click on a cell in a heatmap it shows you the exact trace for that question.
MLDataScientist@reddit
Do you have any SQL fine-tuned small models (<=9B) to test this benchmark with? I think even Qwen3.5 4B with SQL data fine-tuning might reach 90%+.
nickl@reddit (OP)
Yes that's my plan.
I'd like to do a fine-tune of a 0.8B model so it can run in-browser and actually be useful.
But very happy to try other models if they exist already!
You might have missed it but if you have llama.cpp/LMStudio/whatever you can run the benchmark yourself against any models you have locally
rm-rf-rm@reddit
Yes, there are a bunch! They used to be posted on here every other week. Please add them; it would be a very interesting result.
Front_Eagle739@reddit
Those input tokens/s scores can't be right, can they? Does Opus 4.6 really only do prefill at ~500 tokens/sec over the API?
kiwibonga@reddit
Yep. A 27B model and a small stack of relevant text files can beat Opus at any task.
1337_mk3@reddit
27b is no surprise that model is wildddddd
MD_Reptile@reddit
I'd sure like to be able to run kimi k2.5 locally but reqs are too crazy for the good quants - I'd really love a good lighter quant to use on reasonable hardware that somehow is just as smart! What's the closest thing you've found that doesn't require a data center in your basement?
nickl@reddit (OP)
Qwen 3.5-27B is probably the best for most people to self-host. I'm optimistic about Nemotron-Cascade-2-30B-A3B because it runs (slowly) on my 8GB 1070, so I expect it will perform much better for most people who invest in some decent hardware.