You can now fine-tune Gemma 4 locally with 8GB VRAM + Bug Fixes
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 107 comments
Hey guys, you can now fine-tune Gemma 4 E2B and E4B in our free Unsloth notebooks! You need 8GB VRAM to train Gemma-4-E2B locally. Unsloth trains Gemma 4 ~1.5x faster with ~50% less VRAM than FA2 setups: https://github.com/unslothai/unsloth
We also found and fixed some bugs for Gemma 4 training:
1. Grad accumulation no longer causes losses to explode - before you might see losses of 300 to 400 when they should be 10 to 15 - Unsloth has this fixed.
2. Index Error for 26B and 31B for inference - this will fail inference for 26B and 31B when using transformers - we fixed it.
3. `use_cache=False` produced gibberish for E2B and E4B - see https://github.com/huggingface/transformers/issues/45242
4. float16 audio - the -1e9 fill value overflows on float16.
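A guess at the mechanics of the float16 audio bug, since the post doesn't spell it out: -1e9 is a common additive masking fill value, and it sits outside float16's finite range (about ±65504), so it collapses to -inf and can poison downstream softmaxes. A minimal sketch of the usual clamp-style fix (illustrative only, not Unsloth's actual patch):

```python
FP16_MIN = -65504.0  # most negative finite float16 value

def safe_mask_fill(fill: float = -1e9, use_fp16: bool = True) -> float:
    """Clamp an additive mask fill value so it stays finite in float16.

    -1e9 overflows to -inf in float16; clamping to the dtype's finite
    minimum keeps masked positions effectively -inf for softmax purposes
    without producing actual infinities.
    """
    if use_fp16 and fill < FP16_MIN:
        return FP16_MIN
    return fill

print(safe_mask_fill(-1e9))                  # clamped to -65504.0
print(safe_mask_fill(-1e9, use_fp16=False))  # left as -1e9 in fp32
```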
You can also train 26B-A4B and 31B or train via a UI with Unsloth Studio. Studio and the notebooks work for Vision, Text, Audio and inference.
For Bug Fix details and tips and tricks, read our blog/guide: https://unsloth.ai/docs/models/gemma-4/train
Free Colab Notebooks:
| E4B + E2B (Studio web UI) | E4B (Vision + Text) | E4B (Audio) | E2B (Run + Text) |
|---|---|---|---|
Thanks guys!
TechySpecky@reddit
I am an MLE but a bit out of the loop on what we define as fine-tuning with LLMs. Are fine-tunes solely aimed at slightly different output styles, or can you add information / continue the pretraining process somehow without complete model collapse?
If I have a different specialized domain is it possible to fine-tune models for that domain?
danielhanchen@reddit (OP)
Yes you can do all what you mentioned! We talk about all use cases here: https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
Flamenverfer@reddit
Hi, noob here. Do LoRA and QLoRA let you expand the knowledge base at all, or is it just a simpler shift in model behavior like talking styles and structured outputs? Or would you need full fine-tuning to increase a model's expertise?
No_Tart3750@reddit
For example, if you're a coder specializing in backend, you can generate a dataset and retrain this model. The great thing is that it runs on mobile phones; it's very good. I'm actually using an H200 to train it on 400 thousand samples on full-stack development topics.
boatbomber@reddit
LoRA absolutely allows you to store new information: https://arxiv.org/abs/2603.01097
Sufficient_Prune3897@reddit
In theory it's just influencing old knowledge. That said, these models are trained on so much that the difference isn't really there. In general I would decide between a LoRA and a full fine-tune based on the size of my dataset. Anything even close to 10k samples and I full fine-tune.
tavirabon@reddit
If LLMs train like DiTs, you can definitely add new domain knowledge with loras, but overfitting becomes a substantial issue so you need to scale your dataset, training time and rank accordingly. From dozens of samples to thousands at least. I would personally avoid loraplus etc learning acceleration methods as well.
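For anyone following the LoRA sub-thread, the core mechanic is tiny: LoRA freezes the base weight W and learns a low-rank update delta_W = B @ A, so new associations can be stored entirely in the adapter even though W never changes. A toy sketch with a 2x2 weight and a rank-1 adapter (names like `B_up`/`A_down` are illustrative, not any library's API):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustration matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Frozen base weight (2x2) and a rank-1 LoRA update: delta_W = B_up @ A_down
W = [[1.0, 0.0],
     [0.0, 1.0]]
B_up   = [[1.0], [2.0]]   # 2x1, trainable
A_down = [[0.5, 0.5]]     # 1x2, trainable
scale  = 1.0              # plays the role of lora_alpha / rank in real setups

delta_W = matmul(B_up, A_down)
W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta_W)]
print(W_eff)  # [[1.5, 0.5], [1.0, 2.0]]
```

With rank r, the adapter holds only r * (d_in + d_out) parameters per layer, which is why the VRAM cost is so much lower than full fine-tuning.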
Though I imagine the gap is less apparent for text where entire data sources and complex data filtering pipelines don't leave deliberate holes (metaphorically and literally)
IrisColt@reddit
RemindMe! 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2026-04-09 19:02:22 UTC to remind you of this link
IrisColt@reddit
That's a good question...
TechySpecky@reddit
Thank you for being so helpful. I see a lot of advertising around low VRAM, but let's say I can rent a couple of B200s: will I still benefit, and just be able to tune larger models?
NatureOutrageous8664@reddit
Larger and faster, since you can up the batch size quite a bit with more VRAM.
danielhanchen@reddit (OP)
Yes definitely! Unsloth works on all hardware - consumer and data-center!
EuphoricAnimator@reddit
That’s really cool about making Gemma 4 more accessible for fine-tuning. I’ve been playing with local models for a while now, mostly on a Mac Studio M4 Max with 128GB of RAM, and it’s a totally different experience than just hitting an API. I regularly run Qwen 3.5, Gemma 4, and a bunch of stuff from Ollama.
For quick tasks (brainstorming, simple writing assistance), Ollama models are fantastic. I can get surprisingly good results from a 7B model running at around 30 tokens/sec on my machine. But if I'm doing anything more complex, like longer-form content creation or trying to get a really specific style, the 70B models are where it's at. Even then, though, they aren't quite as consistently "smart" as the best cloud options like GPT-4. My M4 Max can handle a quantized 70B model, but inference is slower, maybe 8-10 tokens/sec, and the quality can dip noticeably.
The big win with local is control and privacy, obviously. Plus, no ongoing costs. I’ve spent more time experimenting and understanding how these models work because I’m wrestling with quantization, prompt formatting, and VRAM limitations. Cloud is still better if I need absolute reliability and top-tier performance, but local is where the fun is right now.
Fine-tuning is the next frontier, and lowering the VRAM requirement to 8GB for Gemma 4 is a big step. I've been hesitant to dive too deep because the initial requirements felt prohibitive, so I'm definitely going to check this out. It’s exciting to see the community pushing the boundaries of what's possible on consumer hardware.
FBIFreezeNow@reddit
Thanks so much for this! One question I have is that you mentioned the thinking enabled to use gemma4’s thinking template, but the examples do not show how to actually make use of them. The user asks “what is 2+2” and I do see the “think” special token but it would be great if you could actually have a Colab link or inline example on your doc that includes the QLoRA with thinking on, with some instructions on how we can use the dataset that has reasoning? Thanks!
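While waiting for an official example, one common way to train on reasoning data is to put the chain-of-thought inside the model turn, wrapped in whatever thinking delimiters the model's chat template defines. A hedged sketch, assuming hypothetical `<think>`/`</think>` markers and Gemma-style turn tags; check your tokenizer's actual chat template before relying on any of these strings:

```python
def format_reasoning_example(question, reasoning, answer,
                             think_open="<think>", think_close="</think>"):
    """Merge reasoning + final answer into a single model turn.

    Turn tags follow Gemma-style chat templates; the thinking delimiters
    are placeholders -- use whatever your tokenizer's template defines.
    """
    return (
        f"<start_of_turn>user\n{question}<end_of_turn>\n"
        f"<start_of_turn>model\n"
        f"{think_open}{reasoning}{think_close}\n{answer}<end_of_turn>\n"
    )

sample = format_reasoning_example("What is 2+2?", "2 plus 2 equals 4.", "4")
print(sample)
```

In practice you'd map this over your dataset and pass the resulting strings (or `tokenizer.apply_chat_template` output) to the trainer.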
hamir_s@reddit
Hey, thanks for the amazing models! I have been using the MLX Gemma 4 31B 8-bit model and it works great with mlx_lm. But how can I use it for image parsing? Trying to use it with mlx_vlm, it seems like it cannot find the vision layers.
yoracale@reddit
Thanks for using it! Atm vision isn't supported, as it's done through mlx-lm, and they more or less strip out the vision tower even if it's there in the weights.
hamir_s@reddit
Ah I see! That effectively decreases the model size too?
yoracale@reddit
Yes that's correct
Dry-Hovercraft9191@reddit
Any chance of training with 6GB VRAM? Or should I try the Colab free tier?
yoracale@reddit
Unfortunately the optimizations are nearly maxed out for the model. Yes, the Colab notebooks are in the articles we linked.
MaruluVR@reddit
Is unsloth studio just for fine tuning or can you also do continued pretraining?
danielhanchen@reddit (OP)
You can do continued pretraining - we just haven't exposed it very easily within the UI haha - it should be up today
MaruluVR@reddit
Thank you. I have 400GB worth of books; would those get loaded and unloaded dynamically as they're needed for training, or would I need enough RAM to hold them?
de4dee@reddit
those databases are probably already in the model
TechySpecky@reddit
yes, but if you want the model to focus more on a specific domain (e.g. let's say you had 500,000 pages of text related to the manufacturing of light bulbs), would feeding all that into the model as pretraining not improve the model's performance on light-bulb-related queries?
de4dee@reddit
sure that would improve
MaruluVR@reddit
They aren't public though, all commercial
danielhanchen@reddit (OP)
Oh my, 400GB haha is a LOT! You'll be spending more time formatting and tokenizing that than training :(
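On the "do I need enough RAM to hold it" question: the usual answer is no, you stream the corpus from disk. A generic sketch of lazy chunked reading (my own illustration; real pipelines would tokenize and pack each chunk, and Hugging Face `datasets` offers `streaming=True` for the same idea):

```python
import os

def stream_text_chunks(root, chunk_chars=4096):
    """Lazily yield fixed-size text chunks from every file under `root`.

    Only one chunk is held in memory at a time, so corpus size is bounded
    by disk, not RAM.
    """
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                while True:
                    chunk = f.read(chunk_chars)
                    if not chunk:
                        break
                    yield chunk
```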
de4dee@reddit
yes you can do CPT many times with unsloth
ustas007@reddit
You just gave me an idea. I have a problem recognizing fabric in one of my projects; all the models mix up the patterns and colors. I've already collected 27,000 examples over a few months and was frustrated because the results are bad. But you gave me one more idea to try. Thx.
Enthu-Cutlet-1337@reddit
8GB trains E2B, but batch size/seq length bite first; 26B/31B are LoRA-only territory on consumer cards.
Pwc9Z@reddit
Finetuning 26/31B on a single 3090 is probably an absolute no-go, I suppose?
danielhanchen@reddit (OP)
31B is possible most likely - last I checked it used 22GB so 24GB could work but yes it might OOM on longer sequence lengths
Jack_5515@reddit
Would it be possible for you to provide an example configuration for the 22GB VRAM usage?
I tried finetuning Gemma 4 31B on my RTX 5090 using Unsloth Studio, which does start, however each step slows down, and then it OOMs (after around 8 steps). VRAM was as free as could be on Windows (so about 31.4 GB).
I tried finetuning on vision data - does this drastically increase the VRAM usage (or does Windows need more)? (Sidenote: Considering issue #4859, I already locally merged in one of the PR’s fixing the issue)
I already tried (what I assume are) conservative settings, with a sequence length of 5120, a batch size of 1 (and then gradient accumulation steps of 8), lora rank and alpha 8, not training the vision layers. All other settings remained the default set in the UI.
In Unsloth Studio I selected 4bit QLoRA and unsloth/gemma-4-31B-unsloth-bnb-4bit as model.
Quiet-Owl9220@reddit
Noob question, but is excess RAM not useful for training?
I would also like to ask anyone who knows, what is the state of finetuning models with consumer AMD GPU (24gb VRAM)?
danielhanchen@reddit (OP)
It's super useful! Unsloth gradient checkpointing uses your RAM as offloaded memory for gradients!
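The idea behind gradient checkpointing, simplified (this is a conceptual sketch, not Unsloth's implementation): instead of keeping every layer's activation alive for the backward pass, keep only every k-th one and recompute the rest from the nearest checkpoint, trading compute for memory. Unsloth's twist is parking those kept tensors in CPU RAM.

```python
def forward_with_checkpoints(layers, x, every=2):
    """Run layers, saving only every `every`-th layer input as a checkpoint."""
    saved = {}
    for i, layer in enumerate(layers):
        if i % every == 0:
            saved[i] = x   # in Unsloth this tensor would be offloaded to CPU RAM
        x = layer(x)
    return x, saved

def recompute_input(layers, saved, i):
    """Rebuild the input to layer i from the nearest earlier checkpoint."""
    start = max(k for k in saved if k <= i)
    x = saved[start]
    for j in range(start, i):
        x = layers[j](x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 3, lambda v: v * 4]
out, saved = forward_with_checkpoints(layers, 1)
print(out)                                 # ((1+1)*2+3)*4 = 28
print(recompute_input(layers, saved, 3))   # input to layer 3, recomputed = 7
```

Here only 2 of 4 activations are stored; backward would call `recompute_input` for the missing ones.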
Pro-Row-335@reddit
31B QLoRA works with 22GB and 26B-A4B LoRA needs >40GB
https://unsloth.ai/docs/models/gemma-4/train
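A rough back-of-envelope for where numbers like "31B QLoRA in 22GB" come from (my own approximation, not Unsloth's accounting): 4-bit weights cost about 0.55 bytes/param once quantization block overhead is included, plus a few GB for LoRA adapters, optimizer state, and short-context activations.

```python
def rough_qlora_vram_gb(n_params_billion, overhead_gb=4.0):
    """Very rough QLoRA VRAM estimate.

    ~0.55 bytes/param for 4-bit weights (including quantization block
    overhead) plus a flat budget for adapters, optimizer state, and
    short-context activations. Treat the result as +/- several GB.
    """
    weights_gb = n_params_billion * 0.55
    return weights_gb + overhead_gb

print(round(rough_qlora_vram_gb(31), 1))  # ~21.1 GB, near the quoted 22GB
print(round(rough_qlora_vram_gb(8), 1))   # ~8.4 GB for an ~8B model
```

Longer sequence lengths and larger batches blow past the flat overhead quickly, which is why the same model can OOM on a card that "should" fit it.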
guiopen@reddit
Can't we use Qlora for moes?
danielhanchen@reddit (OP)
Not yet sadly - our kernels aren't yet optimal for QLoRA fusing for MoEs :(
danielhanchen@reddit (OP)
Yep! :)
JohnMason6504@reddit
The grad accumulation blowup to 300-400 is the classic mixed precision loss scaling drift. At small per-step batch the loss scale clamp fires before the accumulated grads land, so the step collapses or explodes depending on which side of the scale window the first micro-batch opens at. The use_cache=False bit on E2B and E4B is the other half of the same lesson because frozen base plus adapter training changes which tensors are live across the prefill boundary and the KV cache assumptions from inference time quietly no longer hold. Nice to see both fixed under one release.
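Whatever the low-level cause, the reported symptom is easy to reproduce arithmetically: with variable-length sequences, averaging each micro-batch's mean loss is not the same as the true token-weighted mean, and accumulating unnormalized sums inflates the number to the 300-400 range. A toy illustration (my numbers, purely illustrative):

```python
# (sum of token losses, number of tokens) for each micro-batch in one step
micro_batches = [(20.0, 10), (300.0, 100), (4.0, 2)]

# Correct: one global token-weighted average across the accumulated step
total_loss = sum(s for s, _ in micro_batches)
total_tokens = sum(n for _, n in micro_batches)
correct = total_loss / total_tokens            # 324 / 112 ~ 2.89

# Pitfall 1: averaging per-micro-batch means over-weights short sequences
mean_of_means = sum(s / n for s, n in micro_batches) / len(micro_batches)  # ~ 2.33

# Pitfall 2: accumulating unnormalized sums makes the reported loss explode
unnormalized = total_loss                      # 324.0 -- the "loss of 300-400" symptom

print(correct, mean_of_means, unnormalized)
```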
sunychoudhary@reddit
That’s actually a pretty meaningful threshold.
Once local fine-tuning drops into “normal hardware” territory, the barrier shifts from compute to: data quality, eval discipline and knowing what you’re actually trying to improve.
A lot more people can experiment now, but that doesn’t automatically mean better models.
vr_fanboy@reddit
I was training Gemma 4 E4B in the past hours and checkpoints were performing horribly. I had use_cache=False; exited to try again now. Is there an ETA for a new pip version?
yoracale@reddit
Yes we released a new pip release earlier today which includes all the fixes
rickyrickyatx@reddit
Can you use a Mac to train by any chance?
yoracale@reddit
MLX training is coming this month but you can run models on your MacOS with Unsloth studio already! See: https://unsloth.ai/docs/new/studio#quickstart
Zestyclose_Yak_3174@reddit
Wondering how much VRAM is needed to finetune the dense 31B on unsloth
yoracale@reddit
22GB for QLoRA, 80GB probably for LoRA; we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
Zestyclose_Yak_3174@reddit
Thanks
ryebrye@reddit
I'm running CPT right now on 31B dense with an 8k context window for the docs on an H200 (140GB vram) - it oom'ed with 2x8 but it's running fine with 1x16 - but I'm training LoRA with bf16
Final_Ad_8913@reddit
Regarding “2. Index Error for 26B and 31B for inference - this will fail inference for 26B and 31B when using transformers - we fixed it.”
Do you mind sharing the fix? There haven’t been releases since the original gemma4 announcement (v0.1.35-beta). When will the fix make it to an official release? Thanks!
iniziolab@reddit
people forgot about Gemma for a while; Gemma 4 brought the heat back for sure, as any new model with new info does.
DrBearJ3w@reddit
Is it possible on AMD RDNA3?
danielhanchen@reddit (OP)
We're working on it - hopefully this week!
CheatCodesOfLife@reddit
I don't suppose you know if it works on an AMD Mi50 🤞
yoracale@reddit
It does, we have a guide for it: https://unsloth.ai/docs/get-started/install/amd
JournalistMore7545@reddit
I feel a bit overwhelmed by the race of new stuff coming out, but the token usage of paid LLMs is more painful, so I'll just ask, and I really want honest feedback:
for the people testing it now, what's still the most annoying part?
yoracale@reddit
Many people say tool-calling is broken in a lot of apps; however, if you use Unsloth Studio, tool-calling automatically heals and is very good :)
https://i.redd.it/lvw7h85hmvtg1.gif
Jeidoz@reddit
I am relatively new to some AI topics. Can someone tell me when and why I might be interested in fine-tuning a model? Examples/use cases are welcome!
yoracale@reddit
We have a complete guide for fine-tuning including examples etc here: https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
PerfectLaw5776@reddit
In your... case, one would be to fine-tune Gemma 4 or another model with annoying safety filters and/or little adult knowledge to output more NSFW text for use in e.g. RP or translation.
Other methods to break censorship exist, but finetuning it can help it learn how to talk in more adult spaces at the same time.
Lolologist@reddit
Suppose for a moment I am an MLE (because I am) that more often than not needs to build classification models (binary, multilabel, multi-class). Can I do that with models like these in a way that, say, is easy enough to train to replace a ModernBERT or equivalent?
yoracale@reddit
Absolutely, but it might not be as easy as it sounds. You can shove your data into Unsloth Studio for synthetic data gen, but experimentation is still key. We also support ModernBERT, yes.
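One common pattern for using a generative LLM as a classifier, rather than free-form generation: score only the candidate label tokens at the final position and take the argmax. A sketch with stand-in values (in a real setup the token ids come from the tokenizer and the logits from the model's forward pass; everything below is hypothetical):

```python
def classify_from_logits(logits, label_token_ids):
    """Pick the label whose first token has the highest logit.

    `logits` maps token id -> score; `label_token_ids` maps label name
    to the token id of its first token. Both are stand-ins here.
    """
    return max(label_token_ids, key=lambda label: logits[label_token_ids[label]])

label_token_ids = {"positive": 101, "negative": 202}  # hypothetical ids
logits = {101: 3.2, 202: -1.5}                        # hypothetical scores
print(classify_from_logits(logits, label_token_ids))  # "positive"
```

This sidesteps parsing free-form output and makes the LLM behave like a fixed-label classifier, though a small encoder like ModernBERT is usually cheaper if you have enough labeled data.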
Im_Still_Here12@reddit
If I'm only using Gemma for security image analysis (e.g. prompts asking the LLM "What is the person doing?" and "What is their intent?"), is there any reason to train?
yoracale@reddit
If you want much more accurate answers, vision is actually one of the most popular ways to fine-tune it. E.g. RL with robots is very much a focus right now.
placesforfudge@reddit
I tried using Unsloth Studio but it's refusing to use my dedicated GPU. I tried multiple browsers and ensured my laptop is on high performance mode, hardware acceleration is on, and efficiency mode is off. When I use LM Studio it uses my dGPU and works smooth as butter.
yoracale@reddit
Would you happen to be using the Docker image? Could you make an issue and provide specs if possible so we can help debug? Apologies for the issue and thanks so much for trying it out! We're releasing a desktop app soon, so hopefully that'll solve some issues.
celsowm@reddit
Does it support full fine-tuning too?
yoracale@reddit
Yes of course, we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
celsowm@reddit
Thanks, and an honest question: why Unsloth for full fine-tuning and not TRL?
Thistlemanizzle@reddit
Anyone have ideas on what to fine tune for? I am struggling to come up with a use case that would justify learning and experimenting.
Imaginary-Unit-3267@reddit
Why are you looking for an excuse to do something you don't actually need to do? If finetuning would benefit you personally, you'd already know how.
Legitimate-Pumpkin@reddit
What comes to my mind mostly is a speaking style. Maybe so it can speak like a character you like, or if you prefer something more serious, train it on your mail so it can suggest emails or even send them directly, if you are "brave" enough. Also some marketing stuff?
Maybe it’s trainable to code in a specific way or to know specific knowledge? I’m also a bit lost here, although sounds very good 🙃
Thistlemanizzle@reddit
Yeah, I guess I could feed it my responses to various emails and texts.
Eh.
UnknownLesson@reddit
I think the Colab links are broken.
Thank you for what you're doing :)
yoracale@reddit
Hello, I think it's because you're on old Reddit, which doesn't render the links properly. You can find all the notebooks in the guide: https://unsloth.ai/docs/models/gemma-4/train
Qwen30bEnjoyer@reddit
What would the VRAM requirements be for the 31b dense model? 130gb VRAM?
yoracale@reddit
22GB for QLoRA, 80GB probably for LoRA; we wrote it in the guide: https://unsloth.ai/docs/models/gemma-4/train
After_Dark@reddit
The studio tab seems disabled on my install on a mac (M4 Pro, 48GB of memory, should be more than enough for some E2B training I'd think). Can't tell if this is a bug, my fault, or training just isn't supported on mac and it isn't documented anywhere
riceinmybelly@reddit
Not finished yet, look at their website
After_Dark@reddit
I searched the whole thing and couldn't find anything stating that you can't train on mac, and the "How to Fine-Tune Gemma 4" guide explicitly mentions installing on mac
yoracale@reddit
Hey, apologies. We wrote that you can run models via the UI on Mac, but didn't mention in the article that MLX training doesn't work yet; it's coming this month. We did write it here, but that's in other articles: https://unsloth.ai/docs/new/studio#quickstart
qnixsynapse@reddit
Does it work on my horrible Intel Arc 8GB VRAM GPU?
danielhanchen@reddit (OP)
Inference yes but training :(
CheatCodesOfLife@reddit
Haha I'd be careful saying "yes" to that without testing it. The Arc really is a "horrible" experience with random numerical precision issues depending on the model, etc
limericknation@reddit
I'm rockin' Gemma 4 E2B on my MacBook Neo presently. This is great stuff.
FrostyDwarf24@reddit
Does this mean gemma E4B will fit in my 5070ti?
Sufficient_Prune3897@reddit
In theory yes; in practice their numbers are with stupidly low context sizes, which take up more space than the model.
danielhanchen@reddit (OP)
Yes! The free Colab notebook for E4B uses way under 16GB VRAM!
m98789@reddit
How about GRPO?
yoracale@reddit
Works, but you need to set fast_inference = False. We'll announce support for that probably next week.
guiopen@reddit
Why does E2B use 8GB VRAM? I remember we could fine-tune Qwen3 4B with much less.
danielhanchen@reddit (OP)
E2B is actually a 7-8B model! :)
Round_Document6821@reddit
The Gemma series is the coolest architecture-wise imo. Cool bug fixes as always from Unsloth!
danielhanchen@reddit (OP)
Thanks! Ye Gemma-4 is pretty cool!
Thatisverytrue54321@reddit
Is it too noobish of a question to ask how structured your training data had to be?
danielhanchen@reddit (OP)
Unsloth Studio also has a data designer so it can make data from unstructured data!
But we also have an auto AI Assist feature which will auto format your dataset!
Middle_Bullfrog_6173@reddit
At least the studio notebook is still throwing an error when trying to train these models. Something about the gemma4 model type.
danielhanchen@reddit (OP)
Hmm, I re-checked Unsloth Studio in Colab.
I'll check all the others as well.
Middle_Bullfrog_6173@reddit
Thanks, that's exactly what I'm running. Maybe my transformers install was unnecessary and it was a random error. It did say something about gemma4 model type and transformers in the log when it errored.
danielhanchen@reddit (OP)
Checking now sorry
Middle_Bullfrog_6173@reddit
Not sure if it was the actual issue, but trying again with an explicit transformers 5 install step allowed training to start.
Pristine_Pick823@reddit
Is multi-GPU support any closer nowadays?
danielhanchen@reddit (OP)
DDP and model sharding works out of the box!
RandumbRedditor1000@reddit
I can't wait for the finetunes of this. It has a lot of potential to be a great base model.
danielhanchen@reddit (OP)
Yes excited for Gemma finetunes!