I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued...
Posted by nickl@reddit | LocalLLaMA | 62 comments
Last week I asked for some feedback about what extra models I should test. I've added them all and now the benchmark is available at https://sql-benchmark.nicklothian.com/
I didn't say a lot at the time about what the agent does, but in simple terms it takes an English query like "Show order lines, revenue, units sold, revenue per unit (total revenue ÷ total units sold), average list price per product in the subcategory, gross profit, and margin percentage for each product subcategory" and turns it into SQL that it tests against a set of database tables.
It gets to see the query results and can modify the SQL to fix issues, but with a limit on the number of debugging rounds it gets.
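That generate-run-debug loop can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function names and the stub "model" are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

def text_to_sql_agent(question, conn, generate_sql, max_rounds=3):
    """Hypothetical sketch of the loop: ask the model for SQL, run it,
    and feed any error back for a limited number of debugging rounds."""
    feedback = None
    for _ in range(max_rounds):
        sql = generate_sql(question, feedback)  # LLM call (stubbed below)
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows                    # success: working query
        except sqlite3.Error as e:
            feedback = f"SQL failed: {e}"       # give the model the error to fix
    return sql, None                            # out of debugging rounds

# Tiny demo with a stub "model" that fixes its query on the second try
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product TEXT, revenue REAL)")
conn.execute("INSERT INTO orders VALUES ('widget', 10.0)")

attempts = iter(["SELECT revnue FROM orders",    # typo'd column, will fail
                 "SELECT revenue FROM orders"])  # corrected on round two
sql, rows = text_to_sql_agent("total revenue", conn,
                              lambda q, fb: next(attempts))
print(rows)  # [(10.0,)]
```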
The benchmark is deliberately short (25 questions) and fast to run (well under 5 minutes for most models), so you can try different configurations, but it is tough enough to separate the best models from the rest.
I added the ability to run it yourself against your own server (thanks to the WASM version of Llama.cpp).
A few of the things I found interesting:
- The best open models are kimi-k2.5, Qwen 3.5 397B-A17B and Qwen 3.5 27B (!)
- NVIDIA Nemotron-Cascade-2-30B-A3B outscores Qwen 3.5-35B-A3B and matches Codex 5.3
- Mimo v2 Flash is a gem of a model
I'd love to see some scores people get, as well as what I should change for v2!
nicoloboschi@reddit
This is a solid benchmark, especially focusing on text-to-SQL. If you want to augment your agent with memory capabilities, you might find some useful inspiration in the Hindsight architecture. https://github.com/vectorize-io/hindsight
rm-rf-rm@reddit
Can you please address these 2 critical questions:
nickl@reddit (OP)
Yes this is correct.
Correct.
The justification here is that it's actually very hard to get the first row correct and other ones wrong. Here is a typical trace:
When I was first building it I was using very small models. Passing the full result set blew out the context very quickly.
I have noticed some models seem to think that the labeling means only one row is being returned. I'm not entirely sure what the best option here is but open to ideas.
I'd be interested if you have examples of any cases where this scoring gives the wrong result.
> is the result for a single test based on just 1 pass? is there any checking for stability/instability i.e. each question asked 3-5 times to see if it passes every time?
No, but it's also a valid point.
rm-rf-rm@reddit
I think the onus is on you to demonstrate that scoring with the full result vs first row is equivalent before settling for the latter as good enough.
nickl@reddit (OP)
Oh - yes, I did check it and I didn't find any cases where it was failing. But I have run more models through it so it's not impossible.
Thinking about it though, the scoring and the feedback to the LLM are actually two separate concerns so yes I should be able to do the scoring based on the whole resultset fairly easily.
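Separating the two concerns could look something like this hypothetical sketch (function names and the order-insensitive comparison are assumptions, not the benchmark's actual code): score on the full result set, but only echo a truncated slice back into the model's context.

```python
def score(model_rows, canonical_rows):
    """Score on the full result set (order-insensitive here for simplicity)."""
    return sorted(model_rows) == sorted(canonical_rows)

def feedback_for_llm(model_rows, max_rows=1):
    """Keep the agent's context small: only echo the first row(s) back."""
    return model_rows[:max_rows]

canonical = [("bikes", 120.0), ("helmets", 40.0)]
got = [("helmets", 40.0), ("bikes", 120.0)]

assert score(got, canonical)                        # full-resultset check
assert feedback_for_llm(got) == [("helmets", 40.0)] # context stays tiny
```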
rm-rf-rm@reddit
Thanks! Yes I was confused as to why scoring would have anything to do with context
nickl@reddit (OP)
Well, it sort of does, because if there are cases where rows beyond the first are wrong, that needs to be put into the context.
But I think I'm going to have to make improvements to the context management for v2 anyway so it's probably doable.
Tormeister@reddit
A few local tests, ran 2 or 3 times:
Qwen3.5 27B Q5_K_S: 20/25, keeps failing/crashing ~5 tests (hard & medium)
Qwen3.5 27B Q6_K: 22/25, Q2 error, Q9 and Q21 fail
Qwen3.5 40B (custom expanded) Q4_K_M: 23/25, Q9 and Q21 fail
All using q8_0 kv + Hadamard transform
nickl@reddit (OP)
Wow your 40B model does well!
abkibaarnsit@reddit
This is amazing. One small request: is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas? I want to compare the SQLs.
nickl@reddit (OP)
> is it possible to add a SQL formatter to the Model SQL and Canonical SQL text areas

I did a quick look for something easy but didn't find anything. Didn't look too hard though, and I agree it is needed.
abkibaarnsit@reddit
Raising a PR
nicholas_the_furious@reddit
Yo! I love this and I am using your benchmark. Something I noticed, though, is you are controlling the temperature when you are passing in the calls. I know that in general low temp = better for tool calling, however I want to ask you to allow the model/provider to set their own sampling parameters!
Many models now need a temp higher than 0 or .1, especially reasoning models, in order to produce the best results. That may be why your reasoning versions of the smaller qwen models did worse than the non-thinking versions. I am running a q8_0 version of the Nemotron 2 Cascade model and it is surpassing what you had scored on the 120B Nemotron for this task. But I did notice the hardcoded .1 temp in there. I would like to be able to use the suggested temp provided by Nvidia (1.0).
nickl@reddit (OP)
These are all great points. Raised an issue for myself, thanks
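The suggestion above could be as simple as a per-model lookup with a caller override. This is a hypothetical sketch (the model ID, the `RECOMMENDED` table, and the 0.1 fallback are assumptions for illustration):

```python
# Fall back to a model's vendor-recommended sampling parameters
# instead of one hardcoded temperature for every model.
RECOMMENDED = {
    # NVIDIA suggests temperature 1.0 for this model (per the comment above)
    "nvidia/nemotron-cascade-2-30b-a3b": {"temperature": 1.0},
}

def sampling_params(model, override=None):
    """Use the caller's override if given, else the vendor suggestion,
    else a conservative default for tool calling."""
    if override is not None:
        return override
    return RECOMMENDED.get(model, {"temperature": 0.1})

assert sampling_params("nvidia/nemotron-cascade-2-30b-a3b") == {"temperature": 1.0}
assert sampling_params("unknown/model") == {"temperature": 0.1}
```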
nicholas_the_furious@reddit
The outstanding competitor was actually Apriel 1.6 15B. It got 20/25. Nailed everything below hard.
nickl@reddit (OP)
Some might find this chart useful too.
LowMental5202@reddit
How come gpt-oss 20B is so high up? I tested it yesterday and got over 100 t/s on my 3090 with a relatively big context window.
IrisColt@reddit
I still recommend it to colleagues, by the way.
nickl@reddit (OP)
I don't have a good explanation. The free version scored much better than the non-free version too. It's very odd!
SpiritualWindow3855@reddit
You didn't pin the providers on OR?
nickl@reddit (OP)
I didn't know that you can do that!
Kahvana@reddit
Didn't expect mistral small 2603 to score that high, neat!
Embarrassed_Elk_4733@reddit
OK, everything is working. I linked local LM Studio to my VPS in new-api and tried the new-api endpoint, e.g. https://newapi.xxxxx.top/v1/chat/completions. The benchmark is running!
Here is Qwen3.5 35B A3B Q4_K_M on a 4070 Ti Super with 16GB VRAM. Speed is 58 tokens/s. Very nice work from the Qwen team and nickl!!
jld1532@reddit
Do we have a trusted source to download cascade on LM Studio?
nicholas_the_furious@reddit
https://huggingface.co/bartowski/nvidia_Nemotron-Cascade-2-30B-A3B-GGUF
grumd@reddit
Omg this is so good, I want to run all my local models on this benchmark. I can't get it to run with my llama.cpp server though. I tried pointing it at http://localhost:8080/v1/messages and at http://localhost:8080/v1/completions; I can see a request coming in to POST /v1/messages in the logs, but then I just see "Model failed to produce a tool call in 3 consecutive attempts" and there's no response from the model. Maybe something is wrong with my setup, because the same models work fine with OpenCode / Claude Code.
nickl@reddit (OP)
Yep. Small models often fail to do reliable tool calls.
Grammar mode helps sometimes, but that isn't available in the web version. The write-up has more details.
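One common mitigation, sketched here as an assumption about how a fallback could work (not the benchmark's actual code): when a small model emits the tool call as plain JSON text instead of a structured `tool_calls` field, try to scrape it out of the reply.

```python
import json
import re

def extract_tool_call(text):
    """Fallback parser for models that print a tool call as plain JSON
    text instead of using the structured tool-calling field.
    Returns the parsed call dict, or None if nothing usable is found."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # grab the outermost braces
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if "name" in call else None

reply = 'Sure! {"name": "run_sql", "arguments": {"query": "SELECT 1"}}'
print(extract_tool_call(reply)["name"])  # run_sql
```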
chooseyouravatar@reddit
Love your site, thanks for your work. Talking about small models, it was fun to test Jan v1 (a 4b model, 2.5GB quantized) next to 30B and 9B models. It didn't perform badly (factually way better than Jan v2 8b)!
nickl@reddit (OP)
That's a great score. I've never heard of that model or JanHQ.
I see that it is designed for coding, and I think in general those models work best for this benchmark.
chooseyouravatar@reddit
Yes, it's an instruct model (Q4_K_S quant here). It's one of my all-time favorites for agent/tooling stuff. Giving it 240 seconds to perform instead of 120, it succeeds in answering one hard question, scoring 15/25 in about the same total time (770 seconds). Smart little model.
Technical-Earth-3254@reddit
My local Qwen 3.5 27B Opus Distil in q4km scored 21/25, I'm very impressed.
Killawatts13@reddit
Mind posting which model? My main issue is small context windows. Using q8 KV helps but still bottlenecks me at around 80k context. I have a 4090 with 64GB RAM.
Technical-Earth-3254@reddit
I was using Bartowski's quant, if that's what you're looking for. With 24GB of VRAM we are limited to ~60k context when running on Windows. I have around 2GB of VRAM in use before loading the model. Running on a setup without Windows and a GUI would improve things quite a bit.
tmvr@reddit
I'm using the bartowski Q4_K_L with a 24GB RTX4090 on Windows and I can go up to 88K (90112) with the default KV and up to 156K (159744) with KV at q8_0. Windows and the apps are using 1.1GB VRAM.
Comfortable_Ebb7015@reddit
If you have a CPU with onboard graphics, you can plug the screen into the motherboard and free those 2GB, then use GPU passthrough to use the GPU when you need it!
Technical-Earth-3254@reddit
I'm well aware of that, but I don't have an iGPU in any of my systems sadly.
nickl@reddit (OP)
That's a great score. What TPS did it get? (It shows on mouse-over.)
Technical-Earth-3254@reddit
Around 20 tps on a 3090 at 400W, fully loaded in VRAM with ~60k context.
duridsukar@reddit
The gap between benchmark performance and production performance is the thing that keeps coming up in my work.
I run agents across a real estate operation — data retrieval, lead analysis, intake. What I've found is that SQL-style structured queries actually perform more reliably than natural language chains when the schema is well-defined. The model choice mattered less than the prompt architecture and schema documentation.
What kind of error patterns came up most when the query was complex? Hallucinated column names or wrong joins?
nickl@reddit (OP)
Tool calling is the biggest failure. It's just unreliable for small models as the context gets longer.
After that, I haven't done deep analysis, but things like hallucinated column names or not quoting names with spaces in them seem common.
Adorable_Weakness_39@reddit
Qwen 3.5-27B is the goat. You can run it on a RTX 3090 at 40 tok/s. Everyone should be using it on their own hardware
LienniTa@reddit
hey, what settings?
Adorable_Weakness_39@reddit
llama.cpp, 99 gpu layers, 240k context, q8 kv_cache.
The thing I've been working on auto-configures the llama.cpp server and downloads/detects the model based on your hardware: https://github.com/L-Forster/open-jet . I'm looking for feedback for future development.
It's faster than ollama and is easier to set up than llama.cpp.
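For reference, the settings above map roughly onto a plain llama.cpp server invocation like this (the model filename is a placeholder for whatever quant you downloaded):

```shell
# All layers offloaded to the GPU, ~240k context, q8_0 KV cache
# (model path is a placeholder)
llama-server -m ./Qwen3.5-27B.gguf -ngl 99 -c 245760 \
  --cache-type-k q8_0 --cache-type-v q8_0
```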
nickl@reddit (OP)
Yes, it's amazing.
I'm very excited about Nemotron-Cascade-2-30B-A3B. A 3-bit quant got within 2 points of Qwen 3.5-27B, but some of the missed scores were timeouts on my GTX 1070 with 8GB!
RedParaglider@reddit
I think it's crazy that it beat Qwen3 Coder Next and the 122B model. The 122B was what really surprised me.
themaskbehindtheman@reddit
What's the quant and context you run with?
tmvr@reddit
Awesome site and benchmark setup!
Qwen3.5 27B Q4_K_L from bartowski gives the same 23/25 result you got with the same Q9 and Q21 failing.
SeaDisk6624@reddit
Can you test Qwen 3.5 397B FP4 on OpenRouter please? It would be really interesting to see how it compares. Thanks!
nickl@reddit (OP)
Is there a FP4 version on OpenRouter?
If you have an OpenRouter key you can actually run it yourself.
SeaDisk6624@reddit
Yes there is. I can't run it myself currently.
Evening_Ad6637@reddit
Great work OP!
I’ve tested GLM-4.5-Air and GLM-4.7
GLM-4.7: 23/25 (failed Q9 and Q21), $0.07, 143s
GLM-4.5-Air: 19/25 (can't remember the rest unfortunately)
rm-rf-rm@reddit
Is the full DB schema injected into the prompt? I couldn't find information on this.
nickl@reddit (OP)
No, just the tables for the question.
If you click on a cell in a heatmap it shows you the exact trace for that question.
MLDataScientist@reddit
Do you have any SQL fine-tuned small models (<=9B) to test this benchmark with? I think even Qwen3.5 4B with SQL data fine-tuning might reach 90%+.
nickl@reddit (OP)
Yes that's my plan.
I'd like to do a fine-tune of a 0.8B model so it can run in-browser and actually be useful.
But very happy to try other models if they exist already!
You might have missed it but if you have llama.cpp/LMStudio/whatever you can run the benchmark yourself against any models you have locally
rm-rf-rm@reddit
Yes, there are a bunch! They used to be posted on here every other week. Please add them; it would be a very interesting result.
Front_Eagle739@reddit
Those input tokens/s scores can't be right, can they? Does Opus 4.6 really only do prefill at ~500 tokens/sec over the API?
kiwibonga@reddit
Yep. A 27B model and a small stack of relevant text files can beat Opus at any task.
1337_mk3@reddit
27b is no surprise that model is wildddddd
MD_Reptile@reddit
I'd sure like to be able to run kimi k2.5 locally but reqs are too crazy for the good quants - I'd really love a good lighter quant to use on reasonable hardware that somehow is just as smart! What's the closest thing you've found that doesn't require a data center in your basement?
nickl@reddit (OP)
Qwen 3.5-27B is probably the best for most people to self-host. I'm optimistic about Nemotron-Cascade-2-30B-A3B because it runs (slowly) on my 8GB 1070, so I expect it will perform much better for most people who invest in some decent hardware.