Fine-tuning with small batch sizes and gradient accumulation performs poorly if you use Transformers (TRL)!

Posted by TheKaitchup@reddit | LocalLLaMA | View on Reddit | 36 comments

If you use the Hugging Face libraries (TRL and transformers) to fine-tune your LLM, training with a small batch size and gradient accumulation significantly underperforms!

Here are some experiments with Llama 3.2 and SmolLM-135M.

batch_size=1 with gradient_accumulation_steps=32 performs much worse than batch_size=32 with gradient_accumulation_steps=1, even though they are mathematically equivalent.
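
For reference, here is roughly what the two setups look like with TRL's SFTTrainer (a minimal sketch: the dataset, model name, and output directories are just placeholders, and the exact SFTConfig/SFTTrainer arguments vary between TRL versions):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

# Setup A: large batch, no accumulation
config_a = SFTConfig(
    output_dir="llama-3.2-bs32",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)

# Setup B: small batch with accumulation -- same effective batch size of 32
config_b = SFTConfig(
    output_dir="llama-3.2-bs1-ga32",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",  # or SmolLM-135M, Qwen2.5, ...
    args=config_b,                    # swap config_a / config_b to compare runs
    train_dataset=dataset,
)
trainer.train()
```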

I could also confirm it with Qwen2.5. The precision of the model's parameters doesn't matter: it happens with both bf16 and fp32 weights. I opened an issue in the TRL repo several days ago, but nothing much has happened since.
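
To be clear about what "mathematically equivalent" means here: with a mean-reduced loss and equal-sized micro-batches, scaling each micro-batch loss by the number of accumulation steps and accumulating gradients should reproduce the full-batch gradient exactly, up to floating-point error. A toy sanity check in plain PyTorch, outside TRL (the tiny Linear model and MSE loss are just illustrative):

```python
import torch

torch.manual_seed(0)
model_a = torch.nn.Linear(16, 1)
model_b = torch.nn.Linear(16, 1)
model_b.load_state_dict(model_a.state_dict())  # start from identical weights

x = torch.randn(32, 16)
y = torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()  # mean-reduced loss

# "batch_size=32, gradient_accumulation_steps=1": one backward over the full batch
loss_fn(model_a(x), y).backward()

# "batch_size=1, gradient_accumulation_steps=32": accumulate scaled micro-batch gradients
for i in range(32):
    micro_loss = loss_fn(model_b(x[i:i+1]), y[i:i+1]) / 32  # divide by accumulation steps
    micro_loss.backward()

print(torch.allclose(model_a.weight.grad, model_b.weight.grad, atol=1e-6))  # expected: True
```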