How is everyone doing DPO on Gemma 3 using Unsloth/TRL?
Posted by CartographerFun4221@reddit | LocalLLaMA | 5 comments
I'm running around in circles trying to battle TRL picking up on Gemma 3's multimodality and expecting images in the DPO dataset, even though I'm doing text only. I set vision to off, yet it still expects the image tags to be present, and having them present but empty doesn't work either.
Is there an easy way to DPO on just text with Gemma 3? I'd hate to lose two stages of SFT progress on this; I chose it specifically for its strong Urdu abilities (its tokenizer is roughly twice as efficient for Nastaliq as Llama 3.1's).
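For anyone hitting the same wall, here is a rough, untested sketch of one possible text-only setup: load only the language-model stack and hand TRL the tokenizer rather than the multimodal processor, so the DPO collator never takes the image path. The checkpoint name, data file, and hyperparameters below are placeholders, and whether this works depends on your transformers/TRL versions.

```python
# Hypothetical text-only DPO setup for Gemma 3 -- a sketch, not a verified fix.
from datasets import load_dataset
from transformers import AutoTokenizer, Gemma3ForCausalLM
from trl import DPOConfig, DPOTrainer

model_id = "google/gemma-3-4b-it"  # placeholder; whether this checkpoint loads cleanly
                                   # as Gemma3ForCausalLM depends on the transformers version
model = Gemma3ForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Plain-text preference data: "prompt", "chosen", "rejected" string columns, no image fields.
dataset = load_dataset("json", data_files="urdu_dpo.jsonl", split="train")  # placeholder file

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # TRL builds a frozen reference model when this is None
    args=DPOConfig(output_dir="gemma3-dpo", beta=0.1, per_device_train_batch_size=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # pass the tokenizer, not AutoProcessor, to stay on the text path
)
trainer.train()
```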
FullOf_Bad_Ideas@reddit
Can you try LlamaFactory? It has always worked well for me for preference fine-tuning.
CartographerFun4221@reddit (OP)
I haven't tried it yet. Can I easily run the jobs on AWS SageMaker instead of a local GPU? If so, I'll jump on this ASAP; I have AWS credits and I'm GPU-poor (an RTX 2070 Super 8GB laptop card).
FullOf_Bad_Ideas@reddit
I've never used SageMaker, but it appears to be supported: https://github.com/aws-samples/Easy_Fintune_LLM_using_SageMaker_with_LLama_Factory
If it's a small LoRA fine-tune, and DPO runs are often short, it's literally something like $2 in rented H100 time.
SlowFail2433@reddit
Not sure about TRL.
Generally it's better to write your own training loop, as the open-source training loops are often not that great.
Also, does it have to be DPO? Ever since DeepSeek, GRPO-style training has become more popular.
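For anyone weighing the custom-loop suggestion above: the core of DPO itself is only a few lines, and most of the rest is plumbing for batching and log-prob computation. A sketch of the standard objective (Rafailov et al., 2023), assuming you already have per-sequence summed log-probabilities from the policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit rewards: how much the policy has moved away from the reference
    # on the chosen vs. rejected completions.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()
```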
CartographerFun4221@reddit (OP)
I should've been clearer: I'm training on AWS SageMaker (yay, free credits) and I'm already somewhat used to Unsloth, so going from a local GPU to AWS in a few lines is a godsend. But I'm really interested in what you mean by custom training loops. As a relative noob to this field who has only done a few SFT fine-tunes, would you still recommend going down the custom route (LLM-augmented, of course)? Are the benefits really worth it? If so, I'll definitely look into it. Do you know of any resources I can use to go from 0 to 100 on this?
Also, it doesn't really have to be DPO. I thought training on Anthropic's RLHF dataset would've been nice, and I was going to do GRPO last anyway. Do you recommend skipping DPO?
I'm fine-tuning Gemma 3 4B to be the best English-Urdu LLM. I've translated and transliterated many datasets (into Arabic-script Urdu and Roman Urdu), done SFT on them, and produced decent results, but of course the model is still rough around the edges and needs that final pass to reliably respond the way I want (e.g. asking it "hi how are you" makes it say who it is, but then it finishes with a pseudo training example from the dataset, sort of). If I can get good results going straight to GRPO, I'd prefer that. Trying to hack around TRL right now is making me run out of disk space (injecting pixels into the image tags to get around the trainer's errors).
Thanks for your time
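If the plan is to skip straight to GRPO, one convenient property is that the reward functions can be plain Python. A hypothetical example of the kind of reward one could pass to TRL's GRPOTrainer for the "reliably respond in Urdu" goal, assuming string completions (the exact callable signature depends on the TRL version and dataset format):

```python
# Hypothetical GRPO reward: score completions by how much of the text is Arabic-script,
# so replies that drift into English get a lower reward. A sketch, not a tuned reward.
import re

URDU_RE = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")  # Arabic + Arabic Supplement blocks

def urdu_script_reward(completions, **kwargs):
    """Return one score per completion: the fraction of letters in Arabic script."""
    rewards = []
    for text in completions:
        letters = [c for c in text if c.isalpha()]
        if not letters:
            rewards.append(0.0)
            continue
        urdu_frac = sum(1 for c in letters if URDU_RE.match(c)) / len(letters)
        rewards.append(urdu_frac)
    return rewards
```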