Fine-tuned Qwen3 SLMs (0.6-8B) beat frontier LLMs on narrow tasks
Posted by Jolly-Gazelle-6060@reddit | LocalLLaMA | 82 comments
We spent a while putting together a systematic comparison of small distilled Qwen3 models (0.6B to 8B) against frontier APIs — GPT-5 nano/mini/5.2, Gemini 2.5 Flash Lite/Flash, Claude Haiku 4.5/Sonnet 4.6/Opus 4.6, Grok 4.1 Fast/Grok 4 — across 9 datasets spanning classification, function calling, QA, and open-book QA.
All distilled models were trained using open-weight teachers only (no frontier API outputs in the training loop), in some cases with as few as 50 examples. Inference runs on vLLM on a single H100.
**The results that surprised us most:**
* **Smart Home function calling**: Qwen3-0.6B — yes, the 0.6B — hits 98.7% vs Gemini Flash at 92.0%. Some of that gap is the strict eval penalizing reasonable alternative interpretations, but still.
* **Text2SQL**: Qwen3-4B distilled gets 98.0% vs Claude Haiku at 98.7% and GPT-5 nano at 96.0%. Cost per million requests: \~$3 vs $378 and $24 respectively.
* **Classification** (Banking77, E-commerce, TREC): basically solved. Distilled models land within 0–1.5pp of the best frontier option.
* **Where frontier still wins**: HotpotQA (open-ended reasoning + world knowledge), where the distilled model scores 92.0% vs Claude Haiku's 98.0%. This is the task type where distillation has the clearest trade-off.
Overall, distilled models match or beat the best mid-tier frontier model (sub-$1/MTok input) on 6/9 tasks, and effectively tie on a 7th.
**Throughput/latency** (Text2SQL, Qwen3-4B on H100):
* 222 RPS sustained
* p50: 390ms | p95: 640ms | p99: 870ms
* 7.6 GiB VRAM (BF16, no quantization)
* FP8 gave +15% throughput, −44% VRAM, no measurable accuracy loss in brief experiments
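For anyone who wants to sanity-check the cost figure, here is a minimal sketch of the arithmetic, assuming the GPU is kept saturated at the sustained RPS and using the $2.40/hr H100 price from the methodology notes. The function and variable names are made up for illustration, and the FP8 numbers are just projections from the +15% / −44% deltas above, not separate measurements.

```python
def cost_per_million_requests(gpu_usd_per_hour: float, sustained_rps: float) -> float:
    """Self-hosted cost per 1M requests for a GPU kept busy at sustained_rps."""
    requests_per_hour = sustained_rps * 3600
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


H100_USD_PER_HOUR = 2.40   # price quoted in the methodology notes
BF16_RPS = 222             # sustained RPS reported for Qwen3-4B on Text2SQL
BF16_VRAM_GIB = 7.6        # reported BF16 footprint

# Projections from the reported FP8 deltas (+15% throughput, -44% VRAM).
FP8_RPS = BF16_RPS * 1.15
FP8_VRAM_GIB = BF16_VRAM_GIB * (1 - 0.44)

bf16_cost = cost_per_million_requests(H100_USD_PER_HOUR, BF16_RPS)
fp8_cost = cost_per_million_requests(H100_USD_PER_HOUR, FP8_RPS)

print(f"BF16: ${bf16_cost:.2f} per 1M requests")
print(f"FP8 (projected): ${fp8_cost:.2f} per 1M requests, ~{FP8_VRAM_GIB:.1f} GiB VRAM")
```

The BF16 result comes out at roughly $3 per million requests, which lines up with the Text2SQL cost quoted above. Real-world cost will be higher whenever the GPU sits below the sustained rate.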
**Methodology notes** (since I know this sub cares):
* Same test sets, same prompts, same eval criteria for all models
* Frontier models were run 3× per dataset (we report mean ± std); distilled models were run at temp=0
* Eval: exact match for classification, tool\_call\_equivalence (JSON comparison with default-parameter normalization) for function calling, and Claude Sonnet 4.6 as LLM judge for generation tasks
* Cost calc: frontier = measured token usage × published pricing (Feb 2026); distilled = H100 at $2.40/hr ÷ sustained RPS
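To make the function-calling metric above more concrete, here is a rough sketch of what a JSON comparison with default-parameter normalization could look like. This is not the code from the repo: the `TOOL_DEFAULTS` table, the `set_light` tool, and the `name`/`arguments` call shape are all hypothetical, chosen only to illustrate the idea that an omitted default and an explicitly passed default should count as the same call.

```python
import json
from typing import Any

# Hypothetical schema defaults: tool name -> {parameter: default value}.
TOOL_DEFAULTS: dict[str, dict[str, Any]] = {
    "set_light": {"brightness": 100, "color": "white"},
}


def normalize(call: dict[str, Any]) -> dict[str, Any]:
    """Fill in omitted default parameters so implicit and explicit defaults match."""
    args = call.get("arguments", {})
    if isinstance(args, str):  # some APIs return arguments as a JSON string
        args = json.loads(args)
    merged = {**TOOL_DEFAULTS.get(call["name"], {}), **args}
    return {"name": call["name"], "arguments": merged}


def tool_calls_equivalent(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Compare two tool calls after normalization; key order does not matter."""
    return normalize(predicted) == normalize(expected)


pred = {"name": "set_light", "arguments": {"room": "kitchen"}}
gold = {
    "name": "set_light",
    "arguments": '{"room": "kitchen", "brightness": 100, "color": "white"}',
}
wrong = {"name": "set_light", "arguments": {"room": "bedroom"}}

print(tool_calls_equivalent(pred, gold))   # defaults omitted vs. spelled out
print(tool_calls_equivalent(wrong, gold))  # different room
```

A comparison like this is still strict: a call that is a reasonable alternative reading of the request but differs in any non-default argument counts as a miss.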
**Practical takeaway on when to distill vs. call an API:**
* Distill when you have structured tasks, well-defined schemas, high volume, or data sovereignty needs
* Use a frontier API when you need broad world knowledge or freeform generation, or when volume is low enough that the cost doesn't matter
* Best of both worlds: route between the two
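As a toy illustration of the routing idea, the sketch below sends structured, well-defined tasks to a distilled model and everything else to a frontier API. Everything here is an assumption for illustration: the `Request` type, the task labels, and the `needs_world_knowledge` flag are hypothetical, and a real router would likely also look at confidence or volume.

```python
from dataclasses import dataclass

# Hypothetical task labels for workloads the distilled models handled well.
DISTILLED_TASKS = {"classification", "function_calling", "text2sql"}


@dataclass
class Request:
    task: str
    prompt: str
    needs_world_knowledge: bool = False


def choose_backend(req: Request) -> str:
    """Return "distilled" for structured tasks, "frontier" otherwise."""
    if req.task in DISTILLED_TASKS and not req.needs_world_knowledge:
        return "distilled"
    return "frontier"


print(choose_backend(Request("text2sql", "List all orders from 2024")))
print(choose_backend(Request("open_qa", "Who founded the company that ...")))
```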
Everything is open source (code, models, data, eval scripts):
* **GitHub**: [https://github.com/distil-labs/inference-efficiency-benchmarks/](https://github.com/distil-labs/inference-efficiency-benchmarks/)
* **Blog with full charts**: [https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay](https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay)
Happy to dig into methodology, specific dataset results, or the distillation setup if anyone has questions.