Fine-tuned Qwen3 SLMs (0.6-8B) beat frontier LLMs on narrow tasks

Posted by Jolly-Gazelle-6060@reddit | LocalLLaMA | 82 comments

We spent a while putting together a systematic comparison of small distilled Qwen3 models (0.6B to 8B) against frontier APIs (GPT-5 nano/mini/5.2, Gemini 2.5 Flash Lite/Flash, Claude Haiku 4.5/Sonnet 4.6/Opus 4.6, Grok 4.1 Fast/Grok 4) across 9 datasets spanning classification, function calling, QA, and open-book QA.

All distilled models were trained using open-weight teachers only (no frontier API outputs in the training loop), with as few as 50 examples. Inference is vLLM on a single H100.

**The results that surprised us most:**

* **Smart Home function calling**: Qwen3-0.6B (yes, the 0.6B) hits 98.7% vs Gemini Flash at 92.0%. Some of that gap is the strict eval penalizing reasonable alternative interpretations, but still.
* **Text2SQL**: Qwen3-4B distilled gets 98.0% vs Claude Haiku at 98.7% and GPT-5 nano at 96.0%. Cost per million requests: ~$3 vs $378 and $24 respectively.
* **Classification** (Banking77, E-commerce, TREC): basically solved. Distilled models land within 0–1.5pp of the best frontier option.
* **Where frontier still wins**: HotpotQA (open-ended reasoning + world knowledge), 92.0% vs Haiku's 98.0%. This is the task type where distillation has the clearest trade-off.

Overall, distilled models match or beat the best mid-tier frontier model (sub-$1/MTok input) on 6/9 tasks, and effectively tie on a 7th.

**Throughput/latency** (Text2SQL, Qwen3-4B on H100):

* 222 RPS sustained
* p50: 390ms | p95: 640ms | p99: 870ms
* 7.6 GiB VRAM (BF16, no quantization)
* FP8 gave +15% throughput, −44% VRAM, no measurable accuracy loss in brief experiments

**Methodology notes** (since I know this sub cares):

* Same test sets, same prompts, same eval criteria for all models
* Frontier models run 3× per dataset (reporting mean ± std), distilled at temp=0
* Eval: exact-match for classification, tool_call_equivalence (JSON comparison w/ default param normalization) for function calling, Claude Sonnet 4.6 as LLM-judge for generation tasks
* Cost calc: frontier = measured token usage × published pricing (Feb 2026); distilled = H100 at $2.40/hr ÷ sustained RPS

**Practical takeaway on when to distill vs. call an API:**

* Distill when you have structured tasks, well-defined schemas, high volume, or data sovereignty needs
* Frontier API when you need broad world knowledge, freeform generation, or volume is low enough that the cost doesn't matter
* Best of both worlds: route between the two

Everything is open source (code, models, data, eval scripts):

**GitHub**: [https://github.com/distil-labs/inference-efficiency-benchmarks/](https://github.com/distil-labs/inference-efficiency-benchmarks/)

**Blog with full charts**: [https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay](https://www.distillabs.ai/blog/the-10x-inference-tax-you-dont-have-to-pay)

Happy to dig into methodology, specific dataset results, or the distillation setup if anyone has questions.
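The distilled-model cost figure follows from the formula in the methodology notes (H100 at $2.40/hr ÷ sustained RPS). Here is a minimal sketch of that arithmetic using the 222 RPS Text2SQL number from the post. The function name is illustrative and not taken from the benchmark repo, and the calculation assumes the GPU is kept fully busy.

```python
def cost_per_million_requests(gpu_usd_per_hour: float, sustained_rps: float) -> float:
    """Self-hosted cost per 1M requests, assuming the GPU is fully utilized."""
    requests_per_hour = sustained_rps * 3600
    return gpu_usd_per_hour / requests_per_hour * 1_000_000


# Figures quoted in the post: H100 at $2.40/hr, 222 RPS sustained (Text2SQL, Qwen3-4B)
distilled = cost_per_million_requests(gpu_usd_per_hour=2.40, sustained_rps=222)
print(f"distilled: ${distilled:.2f} per 1M requests")  # about $3.00

# Ratios against the frontier costs quoted in the post ($378 and $24 per 1M requests)
for name, frontier_cost in [("Claude Haiku", 378.0), ("GPT-5 nano", 24.0)]:
    print(f"{name}: {frontier_cost / distilled:.0f}x the distilled cost")
```

At lower utilization the per-request cost rises proportionally, which is why the post recommends frontier APIs when volume is low.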


[-]

Glittering-Call8746@reddit

I'm not sure if this has been answered: how do you do the routing to the Qwen3 models (0.6B to 8B)?

[-]

gabucz@reddit

What do you mean by routing?

[-]

Glittering-Call8746@reddit

My bad, I thought there was a router to the different SLMs.

[-]

ThiccStorms@reddit

SLMs ftw

[-]

Western_Objective209@reddit

I work on healthcare AI systems. These kinds of posts make management yell at us to use SLMs, but when we actually try using them in real systems they are fucking useless.

[-]

Jolly-Gazelle-6060@reddit (OP)

They can be if you don't specialize them. We are very much aware that fine-tuning can be a lot of effort and overhead; that was one of the core reasons for building distil labs.

[-]

Western_Objective209@reddit

have you tried deploying real systems beyond benchmarking?

[-]

Jolly-Gazelle-6060@reddit (OP)

Yes

[-]

Western_Objective209@reddit

Okay, I have not seen performance remain high on these SLMs when moving from benchmarks to real data. You can say "well, you need to do a better job fine-tuning them" and IDK, maybe, but I don't see any real-world products from billion-dollar labs using SLMs, I don't see them working when I use them, and you seem to have a financial interest in them working, so there are some inconsistencies here.

[-]

Additional_Wish_3619@reddit

There's a lot of work being done to maintain this performance, but I think this is a deeper problem. Even with LLMs, I see that benchmarks don't translate well to the real world. I think that testing SLMs in a real-world scenario is almost better than going with benchmarks. A model needs to showcase its capabilities through real-world tasks rather than on a predefined set of tasks. Hard problem to solve though.

[-]

Western_Objective209@reddit

The main problem I see is that in the healthcare example listed, PII redaction, anything less than 100% is considered bad. I have systems processing hundreds of millions of claims a year, and if our PII redaction had 94% accuracy, that would mean millions of violations. We'd be fined into oblivion and probably have to shut down.
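To put rough numbers on that: the comment says "hundreds of millions of claims a year", so the sketch below assumes 100 million as an illustrative lower bound and treats every redaction failure as one violation, which is a simplification.

```python
def expected_failures(items_per_year: int, accuracy: float) -> int:
    """Expected number of items handled incorrectly per year at a given accuracy."""
    return round(items_per_year * (1.0 - accuracy))


CLAIMS_PER_YEAR = 100_000_000  # assumption: lower bound of "hundreds of millions"

for accuracy in (0.94, 0.99, 0.999):
    failures = expected_failures(CLAIMS_PER_YEAR, accuracy)
    print(f"{accuracy:.1%} accuracy -> {failures:,} failed redactions per year")
```

Even at 99.9% the absolute count stays in the six figures at this volume, which is the point being made about benchmark percentages.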

[-]

Additional_Wish_3619@reddit

Oh yeah, I can agree with you there. Healthcare is one of those horrifically hard examples, where I would even say to stay the hell away from SLMs. Way too unpredictable, and it adds LOTS of unnecessary risk. What systems currently do this PII redaction? I assume AI (LLMs or SLMs) does not traditionally do this.

[-]

Western_Objective209@reddit

Looking at the examples in the dataset, they are likely notes, and nothing would be perfect here; you would just train the note taker not to write PII in the text box. That said, I did make my own LLM system based on the repo (https://www.reddit.com/r/LocalLLaMA/comments/1rozrmn/comment/o9okb3s/), and Haiku 4.5 with 40 examples was basically perfect, with just some formatting errors. OP used LLM-as-judge to come up with these scores, so if I use LLM-as-judge instead of exact match like I did in my example, Haiku 4.5 performs like this:

```
Official Benchmark Results (LLM-as-judge, generic prompt)
┌─────────────────────┬───────┬───────┬───────┬───────┐
│ Model               │ Run 1 │ Run 2 │ Run 3 │ Avg   │
├─────────────────────┼───────┼───────┼───────┼───────┤
│ Haiku 4.5           │ 84.2% │ 82.7% │ 85.7% │ 84.2% │
│ Sonnet 4.6          │ 90.2% │ 89.5% │ 88.0% │ 89.2% │
│ Opus 4.6            │ 91.0% │ 88.7% │ 87.2% │ 89.0% │
│ Fine-tuned Qwen3-4B │ -     │ -     │ -     │ 94.0% │
└─────────────────────┴───────┴───────┴───────┴───────┘

Our Experiment (Haiku 4.5 + 40 few-shot examples)
┌─────────────────────────────────────┬─────────────────┐
│ Scoring Method                      │ Score           │
├─────────────────────────────────────┼─────────────────┤
│ Exact string match (raw)            │ 85.0% (113/133) │
│ Exact match + regex post-processing │ 88.0% (117/133) │
│ Opus-as-judge (raw output)          │ 94.0% (125/133) │
└─────────────────────────────────────┴─────────────────┘

Key takeaway: the official benchmark runs Haiku with a generic prompt and gets ~84%. By crafting a
task-specific prompt with all 40 training examples as few-shot demonstrations, we boosted Haiku's
LLM-judge score to 94.0%, matching the fine-tuned Qwen3-4B exactly.

The gap between the frontier models in the official benchmark (84-89%) and our 40-example Haiku (94%)
is entirely attributable to prompt engineering: same model, same test set, same judge methodology,
just a better prompt.

❯ okay and would Haiku 4.5 with a post processing step fix the errors?

⏺ Let me check exactly what each of the 8 failures was to see which ones the regex post-processing would fix.

⏺ Here are the 8 failures broken down by fixability:

Fixable with existing fix_last4_tokens regex (2)
┌─────┬──────────────────────────────────┬──────────────────────────────────┐
│ ID  │ Error                            │ Fix                              │
├─────┼──────────────────────────────────┼──────────────────────────────────┤
│ #15 │ IBAN_LAST4:0456 → should be 3456 │ Regex recomputes from raw digits │
│ #45 │ IBAN_LAST4:1643 → should be 4300 │ Regex recomputes from raw digits │
└─────┴──────────────────────────────────┴──────────────────────────────────┘

Fixable with additional simple regex (3)
┌──────┬──────────────────────────────────────────────┬──────────────────────────┐
│ ID   │ Error                                        │ Fix                      │
├──────┼──────────────────────────────────────────────┼──────────────────────────┤
│ #105 │ Extra "IBAN " prefix before [IBAN_LAST4:...] │ Strip IBAN before token  │
│ #121 │ Same - extra "IBAN " prefix                  │ Same strip               │
│ #92  │ Unicode curly quote ’ vs ASCII '             │ Normalize unicode quotes │
└──────┴──────────────────────────────────────────────┴──────────────────────────┘

Not fixable with post-processing (2)
┌─────┬──────────────────────────────────────────┬───────────────────────────────────────────────┐
│ ID  │ Error                                    │ Type                                          │
├─────┼──────────────────────────────────────────┼───────────────────────────────────────────────┤
│ #81 │ Absorbed "NIN" label into redacted value │ Semantic - model misunderstood keep-vs-redact │
│ #90 │ Absorbed "SSN" label into redacted value │ Semantic - same issue                         │
└─────┴──────────────────────────────────────────┴───────────────────────────────────────────────┘

Dataset bug (1)
┌─────┬─────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────┐
│ ID  │ Error                               │ Reality                                                                             │
├─────┼─────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤
│ #58 │ Judge says 0154 wrong, expects 9014 │ Haiku is correct - IBAN PT50...9015 4, last 4 digits = 0154. Ground truth is wrong. │
└─────┴─────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘

Summary

With a full post-processing pipeline (existing last-4 fix + IBAN prefix strip + unicode normalization):
- 130/133 GOOD (97.7%): only 2 genuine semantic errors + 1 dataset bug
- 131/133 (98.5%) if you correct the dataset bug where Haiku actually got it right

The only real failures are #81 and #90, both the same pattern: Haiku absorbs the label word
("SSN"/"NIN") into the redacted value instead of keeping it in the text. That's a prompt-fixable
issue: adding an explicit example of that pattern to the few-shot set would likely resolve it.
```

So realistically, Haiku 4.5 can actually get to 100% with just some prompt engineering, which is pretty good. That's similar to the results my lab gets on our real prototypes.
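The two "additional simple regex" fixes named in the transcript (stripping a stray "IBAN " before the token, and normalizing curly quotes) might look roughly like this. It is a sketch, not the code from that session: the function name and the regex are hypothetical, and only the `[IBAN_LAST4:...]` token format is taken from the output above.

```python
import re

# Map curly quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

# A literal "IBAN " immediately before an [IBAN_LAST4:...] token (hypothetical pattern).
IBAN_PREFIX = re.compile(r"\bIBAN\s+(?=\[IBAN_LAST4:)")


def postprocess(model_output: str) -> str:
    """Apply formatting-only fixes; the redacted values themselves are left untouched."""
    text = model_output.translate(QUOTE_MAP)
    return IBAN_PREFIX.sub("", text)


raw = "Patient\u2019s refund goes to IBAN [IBAN_LAST4:3456]."
print(postprocess(raw))
```

Fixes like these only address formatting mismatches against the reference. The label-absorption errors (#81, #90) change what gets redacted, so they cannot be repaired this way.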

[-]

Jolly-Gazelle-6060@reddit (OP)

Having spoken to hundreds of companies building agentic systems, those who are deploying advanced agents at scale use SLMs in some shape or form (once the design is stable and they are at a point where optimizing the different nodes makes sense). Many of them achieved great results via fine-tuning, both with distil labs and without (a lot of data curation required, though).

[-]

Western_Objective209@reddit

I mean, your conversations as an SLM vendor are going to be heavily biased. So I figured I'd give it a shot the way I would do this, and noticed this: https://github.com/distil-labs/inference-efficiency-benchmarks/blob/b478b90e030bb2bccf7c85bafa4782757f626dd7/question-answering/pii-redaction-healthcare/data/train.jsonl#L5

Your training data has an error in it: the IBAN digits are 9300762011623852957 → last 4 = 2957, but the label says 5297. There are similar errors in the test data. Still, after spending just 15-20 min running the benchmarks myself with Claude Code:

```
⏺ Big improvement. Here's the full comparison:

┌────────────┬──────────┬─────────────────┬───────────────────┐
│ Model      │ Examples │ Raw             │ + Post-Processing │
├────────────┼──────────┼─────────────────┼───────────────────┤
│ Haiku 4.5  │ 5        │ 103/133 (77.4%) │ 106/133 (79.7%)   │
│ Haiku 4.5  │ 40       │ 115/133 (86.5%) │ 117/133 (88.0%)   │
│ Sonnet 4.6 │ 5        │ 116/133 (87.2%) │ 116/133 (87.2%)   │
│ Opus 4.6   │ 5        │ 115/133 (86.5%) │ 115/133 (86.5%)   │
└────────────┴──────────┴─────────────────┴───────────────────┘

Haiku with 40 examples + post-processing now beats both Sonnet and Opus with 5 examples. The SSN
label absorption issue is almost entirely fixed (down from 8 failures to 1).

The remaining 16 failures are:
- ~8 unicode quote mismatches (curly vs straight): encoding artifact, not real errors
- 2 IBAN "word kept": IBAN [IBAN_LAST4:...] instead of [IBAN_LAST4:...]
- 1 SSN label: still one holdout
- 1 NIN label absorbed
- 1 dataset bug (#58)
- 2-3 other minor formatting

Discounting the unicode mismatches, Haiku with 40 examples is effectively at ~125/133 (94%), which
matches the fine-tuned SLM's benchmark score. And that's the cheapest model with prompt caching.
```

So I'm basically getting the same results as the SLM fine-tune with Haiku 4.5, more examples, and prompt caching. At a large enough scale I guess the SLM is cheaper, but Haiku 4.5 with batch processing is going to be pretty damn cheap.
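A label error like the one flagged in this comment can be caught mechanically. Here is a minimal sketch, assuming the raw IBAN string and the labelled last-4 value are already available as separate fields. The helper name is hypothetical, and parsing `train.jsonl` is left out because its schema is not shown here.

```python
def last4_matches(raw_iban: str, labelled_last4: str) -> bool:
    """True if the label equals the last four digits of the raw IBAN."""
    digits = "".join(ch for ch in raw_iban if ch.isdigit())
    return digits[-4:] == labelled_last4


# The example from the comment: the digits end in 2957, but the label says 5297.
print(last4_matches("9300762011623852957", "5297"))  # False
print(last4_matches("9300762011623852957", "2957"))  # True
```

Stripping non-digits first means spaced or prefixed IBANs are handled the same way as a bare digit string.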

[-]

Additional_Wish_3619@reddit

100% agree, they can be useful! Although, as you mentioned, there's lots of overhead in fine-tuning to get what you want. You also need a good system to get the full potential out of a bunch of specialized agents, no? Does Distil Labs just cover the model portion, or does it also give insights into how to integrate these into agent systems appropriately?

[-]

iamMess@reddit

Where is the Healthcare QA dataset from?

[-]

vladlearns@reddit

same question🙋‍♂️

[-]

party-horse@reddit

This is a healthcare PII redaction dataset that we created as part of a demo model for PII redaction. You can find the original model in [https://github.com/distil-labs/Distil-PII](https://github.com/distil-labs/Distil-PII)

[-]

last_llm_standing@reddit

Does anyone know how the fine-tuning was done? They only specify a distill CLI.

[-]

party-horse@reddit

Fine tuning was done with distil labs platform.

[-]

last_llm_standing@reddit

No, I meant: do you have any idea what approach they might have taken?

[-]

chodemunch6969@reddit

I think this is a strategy that deserves more community mindshare. The throughput and lightness of the model make it really compelling, both for inference and training.

The way I see it, something like this makes sense as part of a journey: using large frontier models with simple prompts -> extracting common workflows into specialized prompts driving specific tools so that it can be done agentically -> baking some of those tools into a smaller fine-tuned model. That means you can still have a bigger model driving the agentic behavior, but it knows how to fan out to smaller, more performant, fine-tuned models when it should.

The hard part of doing this with a large model has always been the fine-tuning. Training is prohibitive for large models, of course, but also for some of the "medium" models that are popular in the local space (qwen3.5 35ba3b, 27b, glm4.7 30b flash, etc). But seeing Qwen 0.8b + LFM perform so well compared to previous models in the same parameter weight class makes me think the strategy might have a lot more legs today than it did just 3 weeks ago.

One concrete use case, in my opinion, is agentic coding. For example, I notice that some of the nuts-and-bolts tool calls (file searching, file edit, etc) are done pretty decently by said medium-sized models, but they're pretty slow, wasteful, and often failure-prone. I think it'd be fascinating to do fine-tunes for some of these specific tools, run them in an agentic harness (opencode for me), and see how much it lifts both speed and accuracy on real-world tasks.
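The fan-out idea could be sketched as a small dispatch table: narrow tasks go to a specialized model, and everything else falls back to the larger one. Everything here is hypothetical (the task names, the handler functions, the stubbed model calls). A real setup would call an inference server instead of returning strings.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def small_file_search(request: str) -> str:
    # Stub standing in for a fine-tuned small model served locally.
    return f"[small:file_search] {request}"


def small_file_edit(request: str) -> str:
    # Stub standing in for a second specialized small model.
    return f"[small:file_edit] {request}"


def large_general(request: str) -> str:
    # Stub standing in for the larger model that drives the agent.
    return f"[large:general] {request}"


# Tasks that have a specialized model; anything not listed uses the large model.
SPECIALISTS: Dict[str, Handler] = {
    "file_search": small_file_search,
    "file_edit": small_file_edit,
}


def route(task: str, request: str) -> str:
    """Dispatch to a specialist when one exists, otherwise fall back."""
    handler = SPECIALISTS.get(task, large_general)
    return handler(request)


print(route("file_search", "find usages of parse_config"))
print(route("plan_refactor", "split the billing module"))
```

Keeping the fallback explicit means a missing specialist degrades to the slower general model rather than failing.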

[-]

mantafloppy@reddit

Strange way of writing "What happens when you train a small model on the benchmark."

[-]

ninefourtwo@reddit

no one else is talking about overfitting

[-]

LeoTechAI_UK@reddit

claude code hooks are the way to go here. the PreToolUse hook fires before any tool execution so you can validate/block commands. check the docs under settings > hooks.

[-]

Effective-Clerk-5309@reddit

We have been trying to get an SLM that helps with automation, basically NL to actions that are then executed by framework specific objects

[-]

Jolly-Gazelle-6060@reddit (OP)

Great use case for a specialized function calling model (perfect SLM application btw)

[-]

Effective-Clerk-5309@reddit

Do any SLMs exist today that can fulfill this kind of application usecase?

[-]

Jolly-Gazelle-6060@reddit (OP)

The beauty of it is that you can simply create one with minimal data

[-]

AurumDaemonHD@reddit

If latentMAS can be applied to an agent graph of specialized LoRAs/fine-tunes on the same base model, that would be something. There was a post where someone built an avp protocol, and it's close. There is also radix attention. The problem is that inference engines need to support this much better.

[-]

Innomen@reddit

This is how the brain does it. Right tool for the job with an orchestration layer.

[-]

ElkTop6108@reddit

This is a really important result that I think gets undersold. The key insight isn't just "small models can be good" - it's that **task-specific evaluation is fundamentally different from general benchmarking**. A few observations:

**The 50-example fine-tuning threshold is remarkable.** This basically means for any well-defined production task, you can take a 0.6B-4B model, invest a few hours collecting 50-100 labeled examples, and potentially match or beat a frontier model that costs 100-1000x more per inference. The ROI math on that is insane for production workloads where you know your exact task shape.

**The evaluation methodology matters more than the models.** The fact that you noted Gemini's accuracy drops partly due to "strict eval penalizing reasonable alternative interpretations" is a huge underappreciated point. A lot of LLM evaluation is broken because people use exact-match or simplistic metrics on tasks where multiple valid outputs exist. Semantic evaluation (does the answer mean the right thing?) vs syntactic evaluation (does the answer match the reference string?) gives you wildly different rankings.

**This doesn't generalize to open-ended tasks, and that's fine.** The value here is specifically for narrow, well-defined production tasks like function calling, classification, and structured extraction. For open-ended generation (creative writing, complex reasoning, ambiguous questions), frontier models still dominate. But most production AI systems are actually a pipeline of narrow tasks, not one big open-ended prompt.

**Deployment cost implications are massive.** Running a 0.6B model on a single GPU vs calling GPT-5 per request - at scale that's the difference between $0.001/request and $0.05/request. For high-volume production systems (email classification, form extraction, intent routing), this makes self-hosted viable even for smaller teams.

Did you measure hallucination rates at all? Curious whether the fine-tuned small models are more or less likely to confidently produce wrong answers compared to the frontier models, especially on edge cases outside the training distribution.
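The post describes its function-calling metric as `tool_call_equivalence` (JSON comparison with default param normalization). The sketch below is one possible reading of that description, not the repo's implementation: the call schema (`name`/`arguments`) and the `DEFAULTS` table are assumptions. It shows how a semantic comparison can accept a call that an exact string match would reject.

```python
import json
from typing import Any, Dict

# Hypothetical per-tool defaults; a real eval would derive these from the tool schema.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "set_light": {"brightness": 100, "room": "living_room"},
}


def normalize(call: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in omitted parameters with defaults so equivalent calls compare equal."""
    args = {**DEFAULTS.get(call["name"], {}), **call.get("arguments", {})}
    return {"name": call["name"], "arguments": args}


def tool_calls_equivalent(predicted_json: str, reference_json: str) -> bool:
    try:
        predicted = json.loads(predicted_json)
        reference = json.loads(reference_json)
        return normalize(predicted) == normalize(reference)
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return False  # malformed output counts as a miss


pred = '{"name": "set_light", "arguments": {"room": "kitchen"}}'
ref = '{"arguments": {"brightness": 100, "room": "kitchen"}, "name": "set_light"}'
print(pred == ref)                       # False: the strings differ
print(tool_calls_equivalent(pred, ref))  # True: same call after normalization
```

Key order and omitted defaults no longer affect the score, but a different `room` value still counts as wrong, so this stays a strict metric.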

[-]

pgrijpink@reddit

What do you define as a narrow task? For example is coding in python narrow enough (I presume not)? But what about datascience with pandas?

[-]

Odd-Ordinary-5922@reddit

you can see what he defines as narrow tasks on the leaderboard bro

[-]

DeProgrammer99@reddit

"Text2SQL question answering" is super broad, though. It's probably questions that require maybe a few joins and no subqueries or window functions...

[-]

party-horse@reddit

In the current dataset we have up to a few CSV tables and SQLite-compatible SQL. You can find the details in [https://github.com/distil-labs/distil-text2sql](https://github.com/distil-labs/distil-text2sql)

[-]

DeProgrammer99@reddit

Ooh, nice, you trained a few models for LLM-as-a-judge, even. I might try those to evaluate my LLM evaluation tool, which has a judge-judging pipeline.

[-]

DeProgrammer99@reddit

DST time change fatigue. That was not a model *trained to be* a good judge, but a metric of the text-to-SQL model's performance as estimated by another LLM. 😅

[-]

pgrijpink@reddit

Not the question bro. I’m aware there’s a leaderboard posted, but the framework they’re describing in the blog is one where you distill small models on narrow tasks using a larger teacher model. My question is: when is a task narrow enough to get such good results - for application in different tasks.

[-]

maciejgryka@reddit

There isn't one strict definition, it takes some experimentation per-problem to check if a given small model is likely to support a given task well. "Coding in Python" is almost certainly too broad. Something like "turn plain-language tasks into Pandas calls" would probably work well (we did something very similar with git here https://www.distillabs.ai/blog/gitara-how-we-trained-a-3b-function-calling-git-agent-for-local-use).

[-]

NotaDevAI@reddit

This is an amazing result! Thanks for sharing!!

[-]

jslominski@reddit

Do you reckon Qwen3.5 will improve this even more or it won't matter at this stage/benchmark saturation and model size?

[-]

Jolly-Gazelle-6060@reddit (OP)

Good question! We're actually working on a separate benchmark of Qwen3.5's "adaptability", stay tuned.

[-]

Additional_Ad_7718@reddit

This gives me deja vu from the llama era. I think specialized models are promising and there's still a lot of low hanging fruit.

[-]

m98789@reddit

Can you repro with open-source training like Unsloth or TRL? No one wants to use a proprietary “distil labs” product, which also makes this feel like an ad.

[-]

Jolly-Gazelle-6060@reddit (OP)

We are just sharing the results of our benchmarks. Whether you curate the data yourself and fine tune an SLM using Unsloth or distill a model using distil labs - it's completely up to you. We're just making the case for small models, I hope you can appreciate that.

[-]

Budulai343@reddit

This is genuinely awesome

[-]

charmander_cha@reddit

Could anyone suggest a creative use for this? Would it be possible to get something useful out of these models by using them in opencode?

[-]

maciejgryka@reddit

Of the ones shown, the docstring one is probably the only candidate, but honestly, "coding" is too broad a task for small models at the moment.

[-]

snmnky9490@reddit

These seem like they would be more suited to like "embedded" systems or where some kind of agent-like system needs to run on tiny hardware doing constrained tasks. The kind of broad coding where one needs to keep a large codebase in mind while planning out development would need much higher active parameter counts, but the 27B dense seems outrageously good for its total size

[-]

charmander_cha@reddit

Thanks to both of you

[-]

ortegaalfredo@reddit

I find that fine-tuning a model rarely increases its capabilities meaningfully and most likely decreases them. Fine-tuning is useful for affecting output format or adding some information, but anything you do with fine-tuning you can also do via prompting.

[-]

Yu2sama@reddit

Roleplay fine-tunes are the only place I don't think this is true. (Also, some Heretic fine-tunes showcase more intelligence, but that is not the norm.)

[-]

grumd@reddit

I don't see how that can be true in theory. Model weights include a lot of general world knowledge. Removing unneeded world knowledge and replacing it with specialized knowledge will make the model better at your specific task. Imo such fine-tuning will always be viable.

[-]

maciejgryka@reddit

I can't say it's not possible to get the kinds of results we show here with prompting alone, but we were not able to beat fine-tuned models with just prompts. Maybe you can! The data is there in the git repo if you want to take a shot - I would be pretty impressed if you manage :) Also, in some scenarios it's important to keep the input tokens to a minimum to lower latency and compute as much as possible, which is often in conflict with more elaborate prompting. Again, not impossible to overcome, there are just trade-offs worth paying attention to.

[-]

ortegaalfredo@reddit

Prompting has other disadvantages too, like injections, prompt revealing, etc.

[-]

Jolly-Gazelle-6060@reddit (OP)

We actually compared performance of prompt engineering vs fine-tuning using distillation here (Seed Student vs Trained Student): [https://www.distillabs.ai/blog/distil-labs-benchmarking-the-platform](https://www.distillabs.ai/blog/distil-labs-benchmarking-the-platform) Additionally, such a model is usually more robust (performs better on data further away than the examples provided in the in-context learning prompt). Might be a good idea for another benchmark actually...

[-]

zipzag@reddit

As a daily user of 122B 8 bit, I find that it's not even close to Gemini Flash in real-world use. Turns out that talented people distilling SOTA and benchmaxing does produce useful small models, but the AGI isn't included in the final result.

[-]

grumd@reddit

You mean Qwen 3.5 122B? It's a moe model with 3B active parameters, it's not going to be stronger than Gemini. Qwen3.5-397B-A17B however...

[-]

party-horse@reddit

> distilling SOTA and benchmaxing

That's not exactly what we did: the models were trained on synthetic data created by large models, but the test sets were always kept separate. We also did not use the test set to pick hyperparameters etc.

> AGI isn't included in the final result.

That's true. The models are task-specific by design, so no artificial general intelligence. Our point is that you don't need AGI for most common tasks, and you don't need to pay the tax for a massive model to translate text to SQL or redact PII.

[-]

zipzag@reddit

I wasn't referencing your particular project. It does seem that your type of use case has a good track record of success.

[-]

openSourcerer9000@reddit

Great to see results like this. It's pretty clear these simple tasks are better suited to smaller models, which can be run locally in a for loop or in batch. The docstring example is pretty impressive for an 8B model, as that takes more reasoning: 10% error rate vs 6% for GPT-5.2. Still, I think there's definitely a limit to what 8B-and-under models can achieve. I would be more interested in seeing what small-scale QLoRA training for 30B to 120B models does compared with the big boys, on more high-level tasks like some specialized coding domain.

Also, I'm not seeing the distillation workflow or training setup in that repo. How many synthetic samples were used for each task? Was it a full fine-tune or LoRA?

I would encourage y'all to publish a paper to put some more weight behind these numbers. I'm investing in learning this stuff as a professional with the bet that smaller local models can outperform in specialized domains, but I'm honestly not sure if that's true. In my own experience, learning about other fields helps improve results in my own, since knowledge seems to be an interconnected graph. It would be great to see some research down that trajectory.

If I'm right, I'll have a good career but the trillion-dollar data centers will tank the economy. If the coin flips the other way, the data center bets will pan out but we'll all be out of work. More objective research on model distillation could weight that coin toss more one way or another.

[-]

TopTippityTop@reddit

What do you use to fine-tune?

[-]

Conscious_Ad_9070@reddit

Pretty interesting! Will you guys release different variants (0.8b, 2b, 3b...) for each task?

[-]

maciejgryka@reddit

Is there something specific you'd like to see? It's unlikely we're going to run the "full matrix" because it takes a bunch of compute (and honestly, more importantly, time & attention). We tried to cover the basics to show the trends, but if you have a specific question I might be able to whip something up. One general observation: Smart Home and Banking77 are pretty well-defined problems, so using a 0.6B model as a base gives good results. On the other hand, Docstring is much more open, so we went with an 8B model instead; smaller ones struggle.

[-]

Conscious_Ad_9070@reddit

What about quantization? 4B BF16 is still a bit heavy for laptops. For now I'm more interested in the git assistant. I'm having a look at your Gitara project, looks quite useful!

[-]

maciejgryka@reddit

Ah, we need to run more experiments there. I did convert the text2sql model to FP8 using `llm-compressor`, and it cut memory by ~half and improved inference speed by 14%. We're working on more in-depth benchmarks of quantization, though, so stay tuned!
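As a back-of-the-envelope check on why FP8 roughly halves memory: weights dominate, and FP8 stores one byte per parameter instead of two. The sketch assumes a nominal 4 billion parameters and counts weights only (no KV cache, activations, or runtime overhead), so it will not match measured VRAM exactly.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NOMINAL_PARAMS = 4e9  # assumption: "4B" taken at face value

for dtype in ("bf16", "fp8"):
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.2f} GiB")
```

The BF16 estimate (about 7.45 GiB) is close to the 7.6 GiB quoted in the post. The measured FP8 saving there was −44% rather than the ideal 50%, presumably because not every tensor ends up stored in FP8.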

[-]

Plenty_Extent_9047@reddit

Could you share the setup: was it LoRA or FFT, hyperparams, etc.? Thank you!

[-]

fourthwaiv@reddit

How are you developing the labeled datasets for fine-tuning?

[-]

maciejgryka@reddit

Short answer is we put a lot of effort into our synthetic data generation pipeline. Some more details in the post here [https://www.distillabs.ai/blog/small-expert-agents-from-10-examples](https://www.distillabs.ai/blog/small-expert-agents-from-10-examples)

[-]

Senior_Hamster_58@reddit

Cool, but show leakage checks and real baselines.

[-]

Jolly-Gazelle-6060@reddit (OP)

the data is in the github repos & caveats are mentioned in the article, so you can reproduce the results yourself sir

[-]

mckirkus@reddit

I've been wondering about this. You could build a mixture of experts using a few fine tuned OSS models. If they're this small they may be able to run on the CPU. MOA? Mixture of agents?

[-]

Jolly-Gazelle-6060@reddit (OP)

nice one, we actually call it MAS (multi-agent systems) and we have seen it implemented by privacy-focused companies where hardware is limited

[-]

letsgoiowa@reddit

Excellent. I can envision the future as a series of highly specialized SLMs called by an orchestrator with gigantic, $5/query models used only for truly enormous strategic and world knowledge tasks. These SLMs can totally run on smartphones so we can easily have a reality where people simply don't need cloud services for a lot of the device management tasks.

[-]

maciejgryka@reddit

We (obviously) agree and Nvidia does too [https://research.nvidia.com/labs/lpr/slm-agents/](https://research.nvidia.com/labs/lpr/slm-agents/) :)

[-]

Effective-Drawer9152@reddit

I have one use case where the model needs to generate JSON, but with some spatial knowledge, like creating a diagram using JSON (think Paint), with coordinates and all. Sonnet is too costly, and I am thinking of fine-tuning some Qwen models. I'd like to know your opinion on this.

[-]

maciejgryka@reddit

We haven't tried a use case like this yet, it's worth a shot. If you sign up at [https://www.distillabs.ai/](https://www.distillabs.ai/) you'll get a couple free training credits and I'd be happy to help out on the community Slack if you run into any trouble.

[-]

Xi-tzu@reddit

Where do i find this smart home model?

[-]

maciejgryka@reddit

Correct, for others the Huggingface collection with all the models is here [https://huggingface.co/collections/distil-labs/distil-efficiency-benchmarks](https://huggingface.co/collections/distil-labs/distil-efficiency-benchmarks)