Fine-tuning with small batch sizes and gradient accumulation performs poorly if you use Transformers (TRL)!
Posted by TheKaitchup@reddit | LocalLLaMA | 36 comments
If you use Hugging Face libraries to fine-tune your LLM (TRL and transformers), fine-tuning with a small batch size and gradient accumulation significantly underperforms!
Here are some experiments with Llama 3.2 and SmolLM-135M.
batch_size=1 and gradient_accumulation_steps=32 is much worse than batch_size=32 and gradient_accumulation_steps=1, even though they are mathematically equivalent.
I could also confirm it with Qwen2.5. The precision of the model's parameters doesn't matter: it happens with both bf16 and fp32 weights. I opened an issue in the TRL repo several days ago, but not much has happened since.
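For context, a minimal sketch (not the exact script from the post) of the two configurations being compared, roughly following the TRL quickstart pattern. The model and dataset names are placeholders, and exact SFTConfig/SFTTrainer arguments vary between TRL versions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder SFT dataset

def make_trainer(batch_size, grad_accum):
    args = SFTConfig(
        output_dir=f"out_bs{batch_size}_ga{grad_accum}",
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=grad_accum,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    )
    return SFTTrainer(model="meta-llama/Llama-3.2-1B", args=args, train_dataset=dataset)

# Both runs see an effective batch of 32 examples per optimizer step,
# so in theory their learning curves should match:
trainer_accum = make_trainer(batch_size=1, grad_accum=32)   # much worse in practice
trainer_plain = make_trainer(batch_size=32, grad_accum=1)
```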
dahara111@reddit
I ported your Colab to unsloth and tested it, but I noticed one thing.
https://colab.research.google.com/drive/1flNtn9RXzezGCScYv2cP-9MB9YKZvtpa?usp=sharing
The evaluation might be measured more accurately if you use the same value for per_device_eval_batch_size in both runs.
I'm using the free version of Colab, so the GPU is different, but the result is as shown in the image below.
In this case, there is a difference in the training loss, but there is not much difference in the validation loss, so perhaps there is not a big difference in the model itself?
danielhanchen@reddit
I managed to fix it! https://www.reddit.com/r/LocalLLaMA/comments/1g4ego7/llm_training_bug_fixes_gradient_accumulation_was/
dahara111@reddit
great work. Thank you.
danielhanchen@reddit
:)
TheKaitchup@reddit (OP)
Interesting! I didn't try with Unsloth, but since Unsloth is built on top of Transformers, it makes sense to see a similar behavior.
I suppose you have done it with QLoRA/LoRA? I would expect a smaller difference in the learning curves indeed, compared to my configuration which was a full fine-tuning.
As for per_device_eval_batch_size, its value shouldn't have any impact. It's like an inference batch size. No gradients are computed during validation.
xadiant@reddit
IIRC gradient accumulation and batch size technically have some difference. However, this data is super interesting. Daniel from Unsloth is a frequent lurker here, I would love to hear his opinion about why this could be happening.
danielhanchen@reddit
Wrote up a fix at https://www.reddit.com/r/LocalLLaMA/comments/1g4ego7/llm_training_bug_fixes_gradient_accumulation_was/
xadiant@reddit
Incredible! So it really was a bug... and it's been around for at least 3 years. Great job as always, literally a free quality improvement.
danielhanchen@reddit
Hi - actually I saw this somewhere, and interestingly I can repro it. ga * bsz should "technically" be equivalent mathematically. The interesting part is that mixed precision accumulates gradients in float32, so it's not a precision issue.
I'll need to do more investigation. Good find, u/TheKaitchup!
dahara111@reddit
Amazing Daniel!
danielhanchen@reddit
:)
xadiant@reddit
Thank you for the qualified input! So, if this is indeed a bug, many models (especially those fine-tuned with limited resources) could be underperforming, albeit slightly? 🤔
I'm not an engineer or coder, but when I find errors in my profession, my OCD sense tingles and I tend to find even more related issues. I believe there have been cases where something fundamental was wrong and went unnoticed for years, just because it's so ironically obvious.
danielhanchen@reddit
:) Will do a large investigation and report back! I'm mostly going to bank on there being some subtle bug, but unsure yet
FullOf_Bad_Ideas@reddit
That's very interesting. I always treated them as equal when it comes to end results; that's how it was "marketed": the determining factor of your end-result quality being just your global batch size, not the per-device batch size or gradient accumulation steps. I hope this can be fixed, otherwise people with little VRAM will always be at a disadvantage compared to those who have more VRAM to spare.
As another commenter pointed out, it would be good for /u/danielhanchen to be aware of this.
danielhanchen@reddit
Fix here: https://www.reddit.com/r/LocalLLaMA/comments/1g4ego7/llm_training_bug_fixes_gradient_accumulation_was/
danielhanchen@reddit
Hi hi! Interestingly, mathematically speaking, ga * bsz should be equivalent - I can also repro this. Since it's mixed precision, the gradient accumulator is in float32, so it shouldn't lose precision. I'm going to assume there's a subtle bug somewhere - or maybe I'm downplaying precision issues.
I'll have to do more research on my end!
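As a toy illustration (not taken from the linked fix post), here is one way "average the loss per micro-batch, then accumulate" can drift from the true full-batch mean when micro-batches contain different numbers of non-padded target tokens:

```python
# Plain Python, no Transformers needed.
token_losses_per_microbatch = [
    [2.0, 2.0, 2.0, 2.0],  # micro-batch with 4 target tokens
    [1.0],                 # micro-batch with 1 target token
]

# What a single batch of both examples computes: one mean over all 5 tokens.
all_tokens = [l for mb in token_losses_per_microbatch for l in mb]
full_batch_loss = sum(all_tokens) / len(all_tokens)                       # 9.0 / 5 = 1.8

# What naive accumulation computes: mean per micro-batch, then mean of means.
per_mb_means = [sum(mb) / len(mb) for mb in token_losses_per_microbatch]  # [2.0, 1.0]
accumulated_loss = sum(per_mb_means) / len(per_mb_means)                  # 1.5

print(full_batch_loss, accumulated_loss)  # 1.8 vs 1.5 -> different gradients
```

With equal-length micro-batches the two quantities coincide, so this only shows up when sequence lengths vary.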
caphohotain@reddit
I thought it was well-known that bsz != ac?
A long time ago, when I first used Oobabooga, I saw something like that stated in its training tab.
TheKaitchup@reddit (OP)
Well, it shouldn't.
Gradient accumulation shouldn't have any impact. This is the theory. If you take vanilla implementations, like https://github.com/karpathy/nanoGPT, you will obtain the same results with and without ac.
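For reference, a sketch of what such a vanilla accumulation loop looks like, in the spirit of nanoGPT (fixed-length blocks, so every micro-batch contributes the same number of tokens); `model`, `optimizer`, and `get_batch` are assumed to already exist:

```python
import torch

grad_accum_steps = 32

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = get_batch("train")              # fixed-shape (micro_bsz, block_size) tensors
    logits, loss = model(x, y)             # mean cross-entropy over the micro-batch
    (loss / grad_accum_steps).backward()   # scale so accumulated grads match the big-batch mean
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```

Because every micro-batch here has exactly the same number of tokens, averaging the scaled micro-batch losses really is the same as one big batch.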
caphohotain@reddit
I just checked the Training Pro extension in Oobabooga, it says:
True Batch Size - Specifies how many text blocks per step will be trained. The higher value, the better the concept of training will be, but it requires more GPU memory and it reduces speed.
Gradient Accumulation Steps - Virtually multiplies the Batch Size by averaging the learning over more than one step. VRAM friendly. Evens out loss fluctuations but can also degrade training fidelity.
Apparently the author of this extension knows something.
TheLocalDrummer@reddit
I mean... don't large batch sizes do the same thing?
This is also an effect of increasing batch size, afaik.
blepcoin@reddit
No one is arguing about what batch size does; they're arguing that gradient accumulation steps are NOT equivalent to it, even though they're advertised as such. If we could all just ramp the batch size up to 999, we could throw gradient accumulation out the window. Unfortunately, that's not reality.
FullOf_Bad_Ideas@reddit
Exactly. A bigger batch size means gradient descent towards a more generalized solution, which could also be described as "degraded training fidelity".
/u/caphohotain You can read that sentence in various ways, but most likely what the author meant is that, technically, you don't optimize as precisely for any given sample with bigger batch sizes. That's true even if 1 bs * 16 ga gives you the same quality as 16 bs * 1 ga, so I don't think the author was aware of this bug.
anommm@reddit
If you use batch_size=1, you won't have any pad tokens in the input, but with a higher batch size your inputs will be padded. Have you tried padding all your inputs to the maximum input length? It doesn't make sense for a real experiment, but it would let you use exactly the same data for every configuration.
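If anyone wants to try that, a rough sketch of forcing fixed-length padding with the tokenizer (model name and max_length are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding="max_length",   # pad to max_length instead of to the longest item in the batch
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 512]) regardless of batch composition
```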
nero10579@reddit
Very interesting findings
Billy462@reddit
I really wish there were an up-to-date, end-to-end guide on how to actually fine-tune modern models. I have a lot of experience with end-to-end trained classifiers (like image recognition models) and could probably take a stab at a Qwen fine-tune, BUT I really don't know what current best practices are.
So this post says I want a bs > 32. Another post recently said to combine the base model, the instruct model, and your own fine-tune weight-wise after training. There will be tons of other doodads like that.
whata_wonderful_day@reddit
I'm dealing with the same thing. Although it doesn't cover the latest models, I've found the Hugging Face alignment handbook very useful. They've got the code + recipes for the Zephyr models: https://github.com/huggingface/alignment-handbook
The SimPO paper is built on it and reproduces a lot of alignment techniques, so it's a good place to look for hyperparameters.
TheKaitchup@reddit (OP)
Yes! HF's alignment handbook is an excellent resource! I definitely recommend it.
TheKaitchup@reddit (OP)
I think defining general best practices is quite difficult. It depends on too many things: model, hardware, budget, and dataset, mainly.
For instance, "combine the base model, instruct and your own fine tune" is indeed very likely to yield good results but I wouldn't recommend it unless you are 100% sure that your own fine-tune is good, which brings us back to your question.
_w0n@reddit
Maybe this paper can help: https://arxiv.org/pdf/2408.13296 I'm currently reading it myself :)
blepcoin@reddit
What do bs=1 gas=X look like, for X=[1,2,4,8]? I wonder if this is a gas problem or a bs&gas problem.
TheKaitchup@reddit (OP)
If we do that, we modify the total training batch size. The learning curves would be widely different, with a different number of samples per training step: X=8 would be the best, while X=1 would learn almost nothing or diverge.
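For reference, the quantity being held constant in the original comparison is the effective batch size; a quick sketch of how it changes if only gradient_accumulation_steps is varied (single GPU assumed):

```python
per_device_batch_size = 1
num_gpus = 1
for ga in [1, 2, 4, 8]:
    effective_batch = per_device_batch_size * ga * num_gpus
    print(f"bs=1, ga={ga} -> effective batch size {effective_batch}")
# Comparing these fairly would also require scaling the number of optimizer steps
# (and possibly the learning rate) so that each run sees the same data.
```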
blepcoin@reddit
Can't you just scale them and train on the same number of samples?
dahara111@reddit
Thank you.
I'm a bit curious. If it's easy to test, I'd like to know if the same trend occurs when 'flash_attention_2' is turned off.
TheKaitchup@reddit (OP)
The trend is the same when FlashAttention is turned off; it also shows up with float32 weights (which requires turning off FA2).
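For anyone reproducing that run, a rough sketch of loading the model in float32 without FlashAttention-2 (the model name is a placeholder; FA2 only supports fp16/bf16, so the attention implementation has to change as well):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float32,
    attn_implementation="eager",   # or "sdpa"; anything but "flash_attention_2"
)
```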
indrasmirror@reddit
Thanks, I was wondering why my fine-tuning wasn't performing as well as I felt it should have. Gonna adjust those parameters now.
ListenProfessional47@reddit
Thanks for sharing this! Very helpful.