You can now fine-tune Gemma 4 locally with 8GB VRAM + Bug Fixes
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 107 comments
Hey guys, you can now fine-tune Gemma 4 E2B and E4B in our free Unsloth notebooks! You need 8GB VRAM to train Gemma-4-E2B locally. Unsloth trains Gemma 4 ~1.5x faster with ~50% less VRAM than FA2 setups: https://github.com/unslothai/unsloth
We also found and fixed some bugs for Gemma 4 training:
1. Grad accumulation no longer causes losses to explode - before you might see losses of 300 to 400 when they should be 10 to 15 - Unsloth has this fixed.
2. Index Error for 26B and 31B for inference - this will fail inference for 26B and 31B when using transformers - we fixed it.
3. `use_cache=False` produced gibberish for E2B and E4B - see https://github.com/huggingface/transformers/issues/45242
4. float16 audio - the -1e9 fill value overflows on float16.
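A guess at the mechanics of the float16 audio bug, since the post doesn't spell it out: -1e9 is a common additive masking fill value, and it sits outside float16's finite range (about ±65504), so it collapses to -inf and can poison downstream softmaxes. A minimal sketch of the usual clamp-style fix (illustrative only, not Unsloth's actual patch):

```python
FP16_MIN = -65504.0  # most negative finite float16 value

def safe_mask_fill(fill: float = -1e9, use_fp16: bool = True) -> float:
    """Clamp an additive mask fill value so it stays finite in float16.

    -1e9 overflows to -inf in float16; clamping to the dtype's finite
    minimum keeps masked positions effectively -inf for softmax purposes
    without producing actual infinities.
    """
    if use_fp16 and fill < FP16_MIN:
        return FP16_MIN
    return fill

print(safe_mask_fill(-1e9))                  # clamped to -65504.0
print(safe_mask_fill(-1e9, use_fp16=False))  # left as -1e9 in fp32
```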
You can also train 26B-A4B and 31B or train via a UI with Unsloth Studio. Studio and the notebooks work for Vision, Text, Audio and inference.
For Bug Fix details and tips and tricks, read our blog/guide: https://unsloth.ai/docs/models/gemma-4/train
Free Colab Notebooks:
| E4B + E2B (Studio web UI) | E4B (Vision + Text) | E4B (Audio) | E2B (Run + Text) |
|---|---|---|---|
Thanks guys!
TechySpecky@reddit
I am an MLE but a bit out of the loop on what we define as fine-tuning with LLMs. Are fine-tunes solely aimed at slightly different output styles, or can you add information / continue the pretraining process somehow without complete model collapse?
If I have a different specialized domain is it possible to fine-tune models for that domain?
danielhanchen@reddit (OP)
Yes you can do all what you mentioned! We talk about all use cases here: https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
Flamenverfer@reddit
Hi, noob here. Do LoRA and QLoRA let you expand the knowledge base at all, or is it just a simpler shift in model behavior like talking styles and structured outputs? Or would you need full fine-tuning to increase a model's expertise?
No_Tart3750@reddit
For example, if you're a coder specializing in backend, you can generate a dataset and retrain this model. The great thing is that it runs on mobile phones; it's very good. I'm actually using an H200 to train it on 400 thousand samples on full-stack development topics.
boatbomber@reddit
LoRA absolutely allows you to store new information: https://arxiv.org/abs/2603.01097
Sufficient_Prune3897@reddit
In theory it's just influencing old knowledge. That said, these models are trained on so much that the difference isn't really there. In general I would decide between a LoRA and a full fine-tune based on the size of my dataset. Anything even close to 10k samples and I full fine-tune.
tavirabon@reddit
If LLMs train like DiTs, you can definitely add new domain knowledge with loras, but overfitting becomes a substantial issue so you need to scale your dataset, training time and rank accordingly. From dozens of samples to thousands at least. I would personally avoid loraplus etc learning acceleration methods as well.
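For anyone following the LoRA sub-thread, the core mechanic is tiny: LoRA freezes the base weight W and learns a low-rank update delta_W = B @ A, so new associations can be stored entirely in the adapter even though W never changes. A toy sketch with a 2x2 weight and a rank-1 adapter (names like `B_up`/`A_down` are illustrative, not any library's API):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustration matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Frozen base weight (2x2) and a rank-1 LoRA update: delta_W = B_up @ A_down
W = [[1.0, 0.0],
     [0.0, 1.0]]
B_up   = [[1.0], [2.0]]   # 2x1, trainable
A_down = [[0.5, 0.5]]     # 1x2, trainable
scale  = 1.0              # plays the role of lora_alpha / rank in real setups

delta_W = matmul(B_up, A_down)
W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta_W)]
print(W_eff)  # [[1.5, 0.5], [1.0, 2.0]]
```

With rank r, the adapter holds only r * (d_in + d_out) parameters per layer, which is why the VRAM cost is so much lower than full fine-tuning.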
Though I imagine the gap is less apparent for text where entire data sources and complex data filtering pipelines don't leave deliberate holes (metaphorically and literally)
IrisColt@reddit
RemindMe! 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-04-09 19:02:22 UTC to remind you of this link
IrisColt@reddit
That's a good question...
TechySpecky@reddit
Thank you for being so helpful. I see a lot of advertising around low VRAM, but let's say I can rent a couple of B200s: will I still benefit, and just be able to tune larger models?
NatureOutrageous8664@reddit
Larger and faster, since you can up the batch size quite a bit with more VRAM.
danielhanchen@reddit (OP)
Yes definitely! Unsloth works on all hardware - consumer and data-center!
EuphoricAnimator@reddit
That’s really cool about making Gemma 4 more accessible for fine-tuning. I’ve been playing with local models for a while now, mostly on a Mac Studio M4 Max with 128GB of RAM, and it’s a totally different experience than just hitting an API. I regularly run Qwen 3.5, Gemma 4, and a bunch of stuff from Ollama.
For quick tasks (brainstorming, simple writing assistance), Ollama models are fantastic. I can get surprisingly good results from a 7B model running at around 30 tokens/sec on my machine. But if I'm doing anything more complex, like longer-form content creation or trying to get a really specific style, the 70B models are where it's at. Even then, though, they aren't quite as consistently "smart" as the best cloud options like GPT-4. My M4 Max can handle a quantized 70B model, but inference is slower, maybe 8-10 tokens/sec, and the quality can dip noticeably.
The big win with local is control and privacy, obviously. Plus, no ongoing costs. I’ve spent more time experimenting and understanding how these models work because I’m wrestling with quantization, prompt formatting, and VRAM limitations. Cloud is still better if I need absolute reliability and top-tier performance, but local is where the fun is right now.
Fine-tuning is the next frontier, and lowering the VRAM requirement to 8GB for Gemma 4 is a big step. I've been hesitant to dive too deep because the initial requirements felt prohibitive, so I'm definitely going to check this out. It’s exciting to see the community pushing the boundaries of what's possible on consumer hardware.
FBIFreezeNow@reddit
Thanks so much for this! One question I have is that you mentioned the thinking enabled to use gemma4’s thinking template, but the examples do not show how to actually make use of them. The user asks “what is 2+2” and I do see the “think” special token but it would be great if you could actually have a Colab link or inline example on your doc that includes the QLoRA with thinking on, with some instructions on how we can use the dataset that has reasoning? Thanks!
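While waiting for an official example, one common way to train on reasoning data is to put the chain-of-thought inside the model turn, wrapped in whatever thinking delimiters the model's chat template defines. A hedged sketch, assuming hypothetical `<think>`/`</think>` markers and Gemma-style turn tags; check your tokenizer's actual chat template before relying on any of these strings:

```python
def format_reasoning_example(question, reasoning, answer,
                             think_open="<think>", think_close="</think>"):
    """Merge reasoning + final answer into a single model turn.

    Turn tags follow Gemma-style chat templates; the thinking delimiters
    are placeholders -- use whatever your tokenizer's template defines.
    """
    return (
        f"<start_of_turn>user\n{question}<end_of_turn>\n"
        f"<start_of_turn>model\n"
        f"{think_open}{reasoning}{think_close}\n{answer}<end_of_turn>\n"
    )

sample = format_reasoning_example("What is 2+2?", "2 plus 2 equals 4.", "4")
print(sample)
```

In practice you'd map this over your dataset and pass the resulting strings (or `tokenizer.apply_chat_template` output) to the trainer.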
hamir_s@reddit
Hey, thanks for the amazing models! I have been using the MLX Gemma 4 31B 8-bit model and it works great with mlx_lm. But how can I use it for image parsing? Trying to use it with mlx_vlm, it seems like it cannot find the vision layers.
yoracale@reddit
Thanks for using it! Atm vision isn't supported, as it's done through mlx-lm, and they more or less strip out the vision tower even if it's there in the weights.
hamir_s@reddit
Ah I see! That effectively decreases the model size too?
yoracale@reddit
Yes that's correct
Dry-Hovercraft9191@reddit
Any chance of training with 6GB VRAM? Or should I try the Colab free tier?
yoracale@reddit
Unfortunately the optimizations are nearly maxed out for the model. Yes, the Colab notebooks are in the articles we linked.
MaruluVR@reddit
Is unsloth studio just for fine tuning or can you also do continued pretraining?
danielhanchen@reddit (OP)
You can do continued pretraining - we just haven't exposed it very easily within the UI haha - it should be up today
MaruluVR@reddit
Thank you. I have 400GB worth of books; would those get loaded and unloaded dynamically as they're needed for training, or would I need enough RAM to hold them?
de4dee@reddit
those databases are probably already in the model
TechySpecky@reddit
yes, but if you want the model to focus more on a specific domain (e.g. let's say you had 500,000 pages of text related to the manufacturing of light bulbs), would feeding all that into the model as pretraining not improve the model's performance on light-bulb-related queries?
de4dee@reddit
sure that would improve
MaruluVR@reddit
They aren't public though, all commercial
danielhanchen@reddit (OP)
Oh my, 400GB haha is a LOT! You'll be spending more time formatting and tokenizing that than training :(
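On the "do I need enough RAM to hold it" question: the usual answer is no, you stream the corpus from disk. A generic sketch of lazy chunked reading (my own illustration; real pipelines would tokenize and pack each chunk, and Hugging Face `datasets` offers `streaming=True` for the same idea):

```python
import os

def stream_text_chunks(root, chunk_chars=4096):
    """Lazily yield fixed-size text chunks from every file under `root`.

    Only one chunk is held in memory at a time, so corpus size is bounded
    by disk, not RAM.
    """
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                while True:
                    chunk = f.read(chunk_chars)
                    if not chunk:
                        break
                    yield chunk
```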
de4dee@reddit
yes you can do CPT many times with unsloth
ustas007@reddit
You just gave me an idea. I have a problem recognizing fabric in one of my projects; all the models mix up the patterns and colors. I've already collected 27,000 examples over a few months and was frustrated because the results are bad. But you gave me one more idea to try. Thx.
Enthu-Cutlet-1337@reddit
8GB trains E2B, but batch size/seq length bite first; 26B/31B are LoRA-only territory on consumer cards.
Pwc9Z@reddit
Finetuning 26/31B on a single 3090 is probably an absolute no-go, I suppose?
danielhanchen@reddit (OP)
31B is possible most likely - last I checked it used 22GB so 24GB could work but yes it might OOM on longer sequence lengths
Jack_5515@reddit
Would it be possible for you to provide an example configuration for the 22GB VRAM usage?
I tried finetuning Gemma 4 31B on my RTX 5090 using Unsloth Studio, which does start, however each step slows down, and then it OOMs (after around 8 steps). VRAM was as free as could be on Windows (so about 31.4 GB).
I tried finetuning on vision data - does this drastically increase the VRAM usage (or does Windows need more)? (Sidenote: Considering issue #4859, I already locally merged in one of the PR’s fixing the issue)
I already tried (what I assume are) conservative settings, with a sequence length of 5120, a batch size of 1 (and then gradient accumulation steps of 8), lora rank and alpha 8, not training the vision layers. All other settings remained the default set in the UI.
In Unsloth Studio I selected 4bit QLoRA and unsloth/gemma-4-31B-unsloth-bnb-4bit as model.
Quiet-Owl9220@reddit
Noob question, but is excess RAM not useful for training?
I would also like to ask anyone who knows, what is the state of finetuning models with consumer AMD GPU (24gb VRAM)?
danielhanchen@reddit (OP)
It's super useful! Unsloth gradient checkpointing uses your RAM as offloaded memory for gradients!
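The idea behind gradient checkpointing, simplified (this is a conceptual sketch, not Unsloth's implementation): instead of keeping every layer's activation alive for the backward pass, keep only every k-th one and recompute the rest from the nearest checkpoint, trading compute for memory. Unsloth's twist is parking those kept tensors in CPU RAM.

```python
def forward_with_checkpoints(layers, x, every=2):
    """Run layers, saving only every `every`-th layer input as a checkpoint."""
    saved = {}
    for i, layer in enumerate(layers):
        if i % every == 0:
            saved[i] = x   # in Unsloth this tensor would be offloaded to CPU RAM
        x = layer(x)
    return x, saved

def recompute_input(layers, saved, i):
    """Rebuild the input to layer i from the nearest earlier checkpoint."""
    start = max(k for k in saved if k <= i)
    x = saved[start]
    for j in range(start, i):
        x = layers[j](x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 3, lambda v: v * 4]
out, saved = forward_with_checkpoints(layers, 1)
print(out)                                 # ((1+1)*2+3)*4 = 28
print(recompute_input(layers, saved, 3))   # input to layer 3, recomputed = 7
```

Here only 2 of 4 activations are stored; backward would call `recompute_input` for the missing ones.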
Pro-Row-335@reddit
31B QLoRA works with 22GB and 26B-A4B LoRA needs >40GB
https://unsloth.ai/docs/models/gemma-4/train
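A rough back-of-envelope for where numbers like "31B QLoRA in 22GB" come from (my own approximation, not Unsloth's accounting): 4-bit weights cost about 0.55 bytes/param once quantization block overhead is included, plus a few GB for LoRA adapters, optimizer state, and short-context activations.

```python
def rough_qlora_vram_gb(n_params_billion, overhead_gb=4.0):
    """Very rough QLoRA VRAM estimate.

    ~0.55 bytes/param for 4-bit weights (including quantization block
    overhead) plus a flat budget for adapters, optimizer state, and
    short-context activations. Treat the result as +/- several GB.
    """
    weights_gb = n_params_billion * 0.55
    return weights_gb + overhead_gb

print(round(rough_qlora_vram_gb(31), 1))  # ~21.1 GB, near the quoted 22GB
print(round(rough_qlora_vram_gb(8), 1))   # ~8.4 GB for an ~8B model
```

Longer sequence lengths and larger batches blow past the flat overhead quickly, which is why the same model can OOM on a card that "should" fit it.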
guiopen@reddit
Can't we use Qlora for moes?
danielhanchen@reddit (OP)
Not yet sadly - our kernels aren't yet optimal for QLoRA fusing for MoEs :(
danielhanchen@reddit (OP)
Yep! :)
JohnMason6504@reddit
The grad accumulation blowup to 300-400 is the classic mixed precision loss scaling drift. At small per-step batch the loss scale clamp fires before the accumulated grads land, so the step collapses or explodes depending on which side of the scale window the first micro-batch opens at. The use_cache=False bit on E2B and E4B is the other half of the same lesson because frozen base plus adapter training changes which tensors are live across the prefill boundary and the KV cache assumptions from inference time quietly no longer hold. Nice to see both fixed under one release.
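Whatever the low-level cause, the reported symptom is easy to reproduce arithmetically: with variable-length sequences, averaging each micro-batch's mean loss is not the same as the true token-weighted mean, and accumulating unnormalized sums inflates the number to the 300-400 range. A toy illustration (my numbers, purely illustrative):

```python
# (sum of token losses, number of tokens) for each micro-batch in one step
micro_batches = [(20.0, 10), (300.0, 100), (4.0, 2)]

# Correct: one global token-weighted average across the accumulated step
total_loss = sum(s for s, _ in micro_batches)
total_tokens = sum(n for _, n in micro_batches)
correct = total_loss / total_tokens            # 324 / 112 ~ 2.89

# Pitfall 1: averaging per-micro-batch means over-weights short sequences
mean_of_means = sum(s / n for s, n in micro_batches) / len(micro_batches)  # ~ 2.33

# Pitfall 2: accumulating unnormalized sums makes the reported loss explode
unnormalized = total_loss                      # 324.0 -- the "loss of 300-400" symptom

print(correct, mean_of_means, unnormalized)
```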
sunychoudhary@reddit
That’s actually a pretty meaningful threshold.
Once local fine-tuning drops into “normal hardware” territory, the barrier shifts from compute to: data quality, eval discipline and knowing what you’re actually trying to improve.
A lot more people can experiment now, but that doesn’t automatically mean better models.
vr_fanboy@reddit
I was training Gemma 4 E4B in the past hours and checkpoints were performing horribly. I had use_cache=False; exited to try again now. Is there an ETA for a new pip version?
yoracale@reddit
Yes we released a new pip release earlier today which includes all the fixes
rickyrickyatx@reddit
Can you use a Mac to train by any chance?
yoracale@reddit
MLX training is coming this month but you can run models on your MacOS with Unsloth studio already! See: https://unsloth.ai/docs/new/studio#quickstart
Zestyclose_Yak_3174@reddit
Wondering how much VRAM is needed to finetune the dense 31B on unsloth
yoracale@reddit
22GB for QLoRA, 80GB probably for LoRA; we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
Zestyclose_Yak_3174@reddit
Thanks
ryebrye@reddit
I'm running CPT right now on 31B dense with an 8k context window for the docs on an H200 (140GB vram) - it oom'ed with 2x8 but it's running fine with 1x16 - but I'm training LoRA with bf16
Final_Ad_8913@reddit
Regarding “2. Index Error for 26B and 31B for inference - this will fail inference for 26B and 31B when using transformers - we fixed it.”
Do you mind sharing the fix? There haven’t been releases since the original gemma4 announcement (v0.1.35-beta). When will the fix make it to an official release? Thanks!
iniziolab@reddit
people forgot about Gemma for a while; Gemma 4 brought the heat back for sure, as any new model with new info does.
DrBearJ3w@reddit
Is it possible on AMD RDNA3?
danielhanchen@reddit (OP)
We're working on it - hopefully this week!
CheatCodesOfLife@reddit
I don't suppose you know if it works on an AMD Mi50 🤞
yoracale@reddit
It does, we have a guide for it: https://unsloth.ai/docs/get-started/install/amd
JournalistMore7545@reddit
I feel a bit overwhelmed by the race of new stuff coming out, but the token usage of paid LLMs is more painful, so I'll just ask, and I really want honest feedback:
for the people testing it now, what's still the most annoying part?
yoracale@reddit
Many people say tool-calling is broken in a lot of apps; however, if you use Unsloth Studio, tool-calling automatically heals and is very good :)
https://i.redd.it/lvw7h85hmvtg1.gif
Jeidoz@reddit
I am relatively new to some AI topics. Can someone tell me when and why I might be interested in fine-tuning a model? Examples/use cases are welcome!
yoracale@reddit
We have a complete guide for fine-tuning including examples etc here: https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
PerfectLaw5776@reddit
In your... case, one would be to fine-tune Gemma 4 or another model with annoying safety filters and/or little adult knowledge to output more NSFW text for use in e.g. RP or translation.
Other methods to break censorship exist, but finetuning it can help it learn how to talk in more adult spaces at the same time.
Lolologist@reddit
Suppose for a moment I am an MLE (because I am) that more often than not needs to build classification models (binary, multilabel, multi-class). Can I do that with models like these in a way that, say, is easy enough to train to replace a ModernBERT or equivalent?
yoracale@reddit
Absolutely, but it might not be as easy as it sounds. You can shove your data into Unsloth Studio for synthetic data gen, but experimentation is still key. We also support ModernBERT, yes.
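One common pattern for using a generative LLM as a classifier, rather than free-form generation: score only the candidate label tokens at the final position and take the argmax. A sketch with stand-in values (in a real setup the token ids come from the tokenizer and the logits from the model's forward pass; everything below is hypothetical):

```python
def classify_from_logits(logits, label_token_ids):
    """Pick the label whose first token has the highest logit.

    `logits` maps token id -> score; `label_token_ids` maps label name
    to the token id of its first token. Both are stand-ins here.
    """
    return max(label_token_ids, key=lambda label: logits[label_token_ids[label]])

label_token_ids = {"positive": 101, "negative": 202}  # hypothetical ids
logits = {101: 3.2, 202: -1.5}                        # hypothetical scores
print(classify_from_logits(logits, label_token_ids))  # "positive"
```

This sidesteps parsing free-form output and makes the LLM behave like a fixed-label classifier, though a small encoder like ModernBERT is usually cheaper if you have enough labeled data.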
Im_Still_Here12@reddit
If I'm only using Gemma for security image analysis (e.g. prompts asking the LLM "What is the person doing?" and "What is their intent?"), is there any reason to train?
yoracale@reddit
If you want much more accurate answers, vision is actually one of the most popular ways to fine-tune it. E.g. RL with robots is very much a focus right now.
placesforfudge@reddit
I tried using Unsloth Studio but it's refusing to use my dedicated GPU. I tried multiple browsers and ensured my laptop is on high performance mode, hardware acceleration is on, and efficiency mode is off. When I use LM Studio it uses my dGPU and works smooth as butter.
yoracale@reddit
Would you happen to be using the Docker image? Could you make an issue and provide specs if possible so we can help debug? Apologies for the issue and thanks so much for trying it out! We're releasing a desktop app soon, so hopefully that'll solve some issues.
celsowm@reddit
Does it support full fine-tuning too?
yoracale@reddit
Yes of course, we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
celsowm@reddit
Thanks, and an honest question: why Unsloth for full fine-tuning and not TRL?
Thistlemanizzle@reddit
Anyone have ideas on what to fine tune for? I am struggling to come up with a use case that would justify learning and experimenting.
Imaginary-Unit-3267@reddit
Why are you looking for an excuse to do something you don't actually need to do? If finetuning would benefit you personally, you'd already know how.
Legitimate-Pumpkin@reddit
What comes to my mind mostly is a speaking style. Maybe so it can speak like a character you like, or if you prefer something more serious, train it on your mail so it can suggest emails or even send them directly, if you are "brave" enough. Also some marketing stuff?
Maybe it’s trainable to code in a specific way or to know specific knowledge? I’m also a bit lost here, although sounds very good 🙃
Thistlemanizzle@reddit
Yeah, I guess I could feed it my responses to various emails and texts.
Eh.
UnknownLesson@reddit
I think the Colab links are broken.
Thank you for what you're doing :)
yoracale@reddit
Hello, I think it's because you're on old Reddit, which doesn't render the links properly. You can find all the notebooks in the guide: https://unsloth.ai/docs/models/gemma-4/train
Qwen30bEnjoyer@reddit
What would the VRAM requirements be for the 31b dense model? 130gb VRAM?
yoracale@reddit
22GB for QLoRA, 80GB probably for LoRA; we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
After_Dark@reddit
The studio tab seems disabled on my install on a mac (M4 Pro, 48GB of memory, should be more than enough for some E2B training I'd think). Can't tell if this is a bug, my fault, or training just isn't supported on mac and it isn't documented anywhere
riceinmybelly@reddit
Not finished yet, look at their website
After_Dark@reddit
I searched the whole thing and couldn't find anything stating that you can't train on mac, and the "How to Fine-Tune Gemma 4" guide explicitly mentions installing on mac
yoracale@reddit
Hey, apologies. We wrote that you can run models via the UI on Mac, but didn't mention in the article that MLX training doesn't work yet; it's coming this month. We did write it here, but that's in other articles: https://unsloth.ai/docs/new/studio#quickstart
qnixsynapse@reddit
Does it work on my horrible Intel Arc 8GB VRAM GPU?
danielhanchen@reddit (OP)
Inference yes but training :(
CheatCodesOfLife@reddit
Haha I'd be careful saying "yes" to that without testing it. The Arc really is a "horrible" experience with random numerical precision issues depending on the model, etc
limericknation@reddit
I'm rockin' Gemma 4 E2B on my MacBook Neo presently. This is great stuff.
FrostyDwarf24@reddit
Does this mean gemma E4B will fit in my 5070ti?
Sufficient_Prune3897@reddit
In theory yes; in practice their numbers are with stupidly low context sizes, which take up more space than the model.
danielhanchen@reddit (OP)
Yes! The free Colab notebook for E4B uses way under 16GB VRAM!
m98789@reddit
How about GRPO?
yoracale@reddit
Works, but you need to set fast_inference = False. We'll announce support for that probably next week.
guiopen@reddit
Why does E2B use 8GB VRAM? I remember we could fine-tune Qwen3 4B with much less.
danielhanchen@reddit (OP)
E2B is actually a 7-8B model! :)
Round_Document6821@reddit
The Gemma series is the coolest architecture-wise imo. Cool bug fixes as always from Unsloth!
danielhanchen@reddit (OP)
Thanks! Ye Gemma-4 is pretty cool!
Thatisverytrue54321@reddit
Is it too noobish of a question to ask how structured your training data had to be?
danielhanchen@reddit (OP)
Unsloth Studio also has a data designer so it can make data from unstructured data!
But we also have an auto AI Assist feature which will auto format your dataset!
Middle_Bullfrog_6173@reddit
At least the studio notebook is still throwing an error when trying to train these models. Something about the gemma4 model type.
danielhanchen@reddit (OP)
Hmm, I re-checked Unsloth Studio in Colab.
I'll check all the others as well.
Middle_Bullfrog_6173@reddit
Thanks, that's exactly what I'm running. Maybe my transformers install was unnecessary and it was a random error. It did say something about gemma4 model type and transformers in the log when it errored.
danielhanchen@reddit (OP)
Checking now sorry
Middle_Bullfrog_6173@reddit
Not sure if it was the actual issue, but trying again with an explicit transformers 5 install step allowed training to start.
Pristine_Pick823@reddit
Is multi-GPU support any closer nowadays?
danielhanchen@reddit (OP)
DDP and model sharding works out of the box!
RandumbRedditor1000@reddit
I can't wait for the finetunes of this. It has a lot of potential to be a great base model.
danielhanchen@reddit (OP)
Yes excited for Gemma finetunes!