I taught my 1B to follow instructions. It got worse at following instructions...
Posted by GPUburnout@reddit | LocalLLaMA | 11 comments
Same SFT recipe (SlimOrca 50K, LoRA r=16, 1 epoch). Three models trained from scratch at 1B, 2B, and 3B parameters. IFEval before and after:
| Model | Base | After SFT | Delta |
|---|---|---|---|
| 1B | 20.50 | 14.75 | -5.75 |
| 2B | 21.94 | 17.03 | -4.91 |
| 3B | 23.14 | 25.18 | +2.04 |
OK so SFT is supposed to teach instruction-following. Thing is, the 1B actually unlearned it. The 2B was slightly less bad. The 3B finally read the room.
Setups were slightly different: 3B used lr=5e-5, the others used 2e-4. So maybe it's capacity, maybe it's the gentler LR. I'll re-run the 2B at 5e-5 to find out.
Before I burn the compute:
- Anyone else seen IFEval regress after SFT on small models?
- Is this a known thing I missed?
- Best guess on mechanism?
Receipts available if anyone wants to dig in.
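For anyone eyeballing how small the LoRA update actually is here: each adapted weight of shape (d, k) gets low-rank factors adding r*(d+k) trainable params. A quick back-of-envelope sketch (the hidden size, layer count, and which projections get adapters are assumed typical values, not my exact config):

```python
# Rough LoRA trainable-parameter count for r=16 on a ~1B model.
# Dims below are assumed 1B-ish values, not the exact architecture.

def lora_params(shapes, r):
    """Each adapted (d, k) matrix gets B (d x r) and A (r x k): r*(d+k) params."""
    return sum(r * (d + k) for d, k in shapes)

hidden, n_layers = 2048, 24                  # assumed dims
# assume q/k/v/o projections adapted per layer (square hidden x hidden)
shapes = [(hidden, hidden)] * 4 * n_layers
trainable = lora_params(shapes, r=16)
total = 1_000_000_000                        # nominal 1B base params
print(f"{trainable:,} trainable ({100 * trainable / total:.2f}% of base)")
```

Under those assumptions it's well under 1% of the base weights being touched, which is part of why the forgetting result surprised me.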
Otherwise_Economy576@reddit
capacity is the obvious-but-correct answer. smaller models don't have the headroom to absorb SFT without overwriting prior abilities. and ifeval specifically measures diverse format-constraint compliance, while slimorca is mostly gpt-4-style chat. so you might be replacing whatever weak instruction-following emerged from pretraining with narrower chat-format following. the 3B has capacity to learn both, the 1B has to pick.
the LR difference probably matters too. 2e-4 on a 1B with lora r=16 is fairly aggressive. you might be overshooting the soft-adaptation lora is best at and getting closer to actual weight modification, which is where forgetting kicks in. re-run at 5e-5 is the right next move, and longer warmup if your scheduler doesn't already have it.
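to make the warmup point concrete: linear warmup is just a ramp multiplied onto the base LR, so stretching it keeps early updates proportionally smaller. a minimal sketch (warmup_steps/total_steps are illustrative, and the post-warmup linear decay is one common choice, not necessarily OP's scheduler):

```python
def lr_at(step, base_lr=2e-4, warmup_steps=100, total_steps=3000):
    """Linear warmup to base_lr, then linear decay to 0 (one common scheme)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # linear decay after warmup completes
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - frac)

# stretching warmup from 100 to 400 steps makes step-50 updates 4x smaller
print(lr_at(50), lr_at(50, warmup_steps=400))
```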
GPUburnout@reddit (OP)
Interesting. Thanks.
On the IFEval vs Orca point: I think Orca biases toward "let me explain why..." responses, which fight constraint compliance directly. Is that what you're getting at? That makes the trade-off concrete: the 1B substitutes one for the other because it can't hold both representations, but the 3B can.
longer warmup is a good call. I have linear warmup over the first 100 steps. I can bump it to 300-500 for the 2B re-run. Did you arrive at proper warmup length experimentally, or is there a general rule of thumb you follow?
llama-impersonator@reddit
there are better instruct tuning datasets to try, like sonnet-orca, tulu3 sft, magpie from llama3-405b, even hermes. for instruct tuning, honestly sheer quantity of data brings a lot of virtue, so you should train with a much larger dataset. i'd also avoid lora for instruct tuning, tbh
alphatrad@reddit
Training is hard and sometimes you get it right and sometimes you don't. I had this experience when trying to fine-tune Qwen3-Coder https://huggingface.co/1337Hero/qwen3-coder-30b-a3b-codemonkey-GGUF
Ended up making it dumber because my dataset just wasn't good enough.
cmndr_spanky@reddit
So confused, why aren’t you telling us what model you’re using as a base model?
Try freezing layers and only training the last few transformer blocks. Try increasing LR.
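A framework-agnostic sketch of that freezing idea: select which params stay trainable by the block index in their name, and freeze everything else. The `model.layers.N.` naming follows HF-style models and is an assumption here; adjust the pattern to whatever your architecture actually uses:

```python
import re

def trainable_names(param_names, n_layers, train_last=4):
    """Keep only params in the last `train_last` blocks (plus non-layer params)."""
    keep = []
    for name in param_names:
        m = re.search(r"\blayers\.(\d+)\.", name)
        if m is None:
            keep.append(name)  # embeddings / lm_head etc. -- adjust to taste
        elif int(m.group(1)) >= n_layers - train_last:
            keep.append(name)
    return keep

names = [f"model.layers.{i}.self_attn.q_proj.weight" for i in range(24)]
print(trainable_names(names, n_layers=24, train_last=4))
# only blocks 20-23 survive; in torch you'd set requires_grad=False on the rest
```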
natt08@reddit
Tell me you’re not an MLE without telling me you’re not an MLE. You didn’t "fine-tune" or “train” these models, you gave them a digital lobotomy. A 2 x 10^{-4} learning rate on a 1B model isn't a setup, it’s a flamethrower. Small models have zero spare capacity, so when you slap a massive, uncurated dataset like SlimOrca on them with that much gradient pressure, you aren't teaching them instructions, you're forcing them to overwrite their foundational logic. The 3B model didn't read the room; it was just the only one you didn't accidentally set on fire, because you used a sane 5 x 10^{-5} learning rate. Stop treating SFT like a microwave recipe. If you want a positive delta on tiny models, you need to stop burning compute and start doing actual data engineering: slash your dataset size by 90%, curate for quality, and use a learning rate that doesn't nuke the weights into the sun.
Blaze6181@reddit
Careful with SFT. It can break the model to conform, rather than tweak the model.
triynizzles1@reddit
I am a little bit out of the loop, but perhaps you experienced some overfitting on the smaller models. I stopped using "1 epoch" as the measure of how long to train. Now I mostly follow the divergence of the training and validation lines on my graph.
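A minimal sketch of that stopping rule: flag the run once val loss has risen over a recent window while train loss kept falling (the window size and loss values here are arbitrary illustrations):

```python
def diverging(train_losses, val_losses, window=3):
    """True if train loss still falls but val loss rose over the last `window` evals."""
    if len(val_losses) < window + 1:
        return False  # not enough history yet
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising

train = [2.1, 1.8, 1.5, 1.3, 1.1, 0.9]
val   = [2.2, 2.0, 1.9, 1.9, 2.0, 2.1]
print(diverging(train, val))  # overfitting signal: stop here
```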
Qwen3_6_27b_UD_Q4XL@reddit
You would need at least 7b to properly fine-tune. Mine on 4B failed miserably.
nuclearbananana@reddit
Literally the whole point of tiny models is to fine-tune them. I've successfully fine-tuned a Google 350M model before.
Hefty_Wolverine_553@reddit
That dataset is super old, and the model you're training is probably very new. Their training methods are probably leagues better than you randomly slapping together a subpar dataset and using a basic SFT method. If you're doing SFT on a base model, your training setup is probably broken in some way.