Gemma 26B A4B failing to write even simple .py files - escape characters causing parse errors?
Posted by No_Reference_7678@reddit | LocalLLaMA | View on Reddit | 13 comments
Just tried running Gemma 26B A4B and I'm running into some weird issues. It's failing to write even simple Python files, and the escape character handling seems broken. Getting tons of parse errors.
Anyone else experienced this with Gemma models? Or is this specific to my setup?
**Specs:**
- GPU: RTX 4060 8GB
- Model: Gemma 26B A4B
**Run command:**
./build/bin/llama-server -m ./models/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf --fit-ctx 64000 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
Compared to Qwen3.5-35B-A3B which I've been running smoothly, Gemma's code generation just feels off. Wondering if I should switch back or if there's a config tweak I'm missing.
(Still kicking myself for not pulling the trigger on the 4060 Ti 16GB. I thought I wouldn't need the extra VRAM; then AI happened.)
sleepingsysadmin@reddit
The root problem here is really GPU specs. You only have 8GB, so you quantize so much that the model's accuracy drops quite a bit.
We all made this mistake with hardware. I went to 32GB of VRAM thinking that's good enough. It never is. Now I want a 5090 or a Pro 6000. You always want more.
If it were me, I'd look at Qwen3.5 9B. It'll fit better and still is GPT120b smart.
Also start saving $100/paycheque because in about 1-2 years the DDR6 era hits and that's when you want to upgrade.
TheMasterOogway@reddit
there is nothing wrong with a Q4 quant lol
DinoAmino@reddit
No, not at all as long as you don't care about accuracy lol. And if that's the case then nothing wrong with quantizing cache either lol
sleepingsysadmin@reddit
Q4_K_M is probably around 90% accuracy. I only roll Unsloth UD at Q4 because it's far less punishing.
Then you add quantization of each cache and you're below 80% accuracy.
If he then fills up much of that 64,000-token context, he's going to be around 70% accuracy.
Then ask it to do a highly precise task like coding and it's going to be close to 60%.
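As rough napkin math, treating each degradation as an independent multiplicative penalty (the individual factors below are illustrative guesses matching the percentages above, not measurements):

```python
q4_quant = 0.90   # claimed Q4_K_M accuracy vs. full precision
kv_cache = 0.88   # assumed penalty for q8_0 K/V cache quantization
long_ctx = 0.88   # assumed penalty for filling a long context

print(round(q4_quant * kv_cache, 2))             # below 80%
print(round(q4_quant * kv_cache * long_ctx, 2))  # around 70%
```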
egomarker@reddit
Let's start with checking your llama.cpp version.
Do you chat with the model, or are you using some agentic software?
No_Reference_7678@reddit (OP)
Latest build. I am building my own nimble agentic harness for local models...
Downloading the latest .gguf might fix it.
egomarker@reddit
The usual suspect when an LLM has trouble with escaping on custom harnesses is that the "read" tool's output doesn't correspond to what the "write" tool expects.
Usual offenders are, for example, double or triple JSON serializations in the "read" tool, which produce monsters like \\\\\\"\\\\\\n, while the "write" tool does only one or two deserializations, leaving some escape symbols in. Some LLMs can work around it, some can't.
So check your logs and figure out what exactly the LLM is sending and what is written to the file. Also check inputs from your tools all the way from "what you send" to "what exactly the LLM is getting", and make sure you are not forcing some weird escaping style on the LLM that is incorrectly interpreted by your "write" tools.
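A minimal sketch of that failure mode (the strings here are hypothetical, standing in for a harness's tool-call payloads): a "read" tool that JSON-serializes file content twice, while the "write" side only deserializes once.

```python
import json

# Code the model is supposed to round-trip; note the literal \n escape.
source = 'print("hello\\nworld")'

once = json.dumps(source)    # correct: one serialization for the tool call
twice = json.dumps(once)     # bug: an accidental second serialization

# One deserialization recovers the original only in the correct case.
assert json.loads(once) == source
assert json.loads(twice) != source               # escape symbols left in
assert json.loads(json.loads(twice)) == source   # needs two passes

print(twice)  # the kind of doubled-backslash monster the model ends up seeing
```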
No_Reference_7678@reddit (OP)
This is exactly where Gemma is failing... Qwen usually comes up with solutions, but even after 4 loops Gemma couldn't figure it out.
milkipedia@reddit
And do you have thinking enabled?
Paradigmind@reddit
And did you try turning it off and back on again?
gnnr25@reddit
Redownload the gguf, they just updated again.
ambient_temp_xeno@reddit
A few problems I can see: the Unsloth quant, and KV cache quantization.
--top-p 0.95 --temp 1.0 --top-k 64 --min-p 0.0 are the correct sampler settings. llama.cpp defaults to min-p 0.05, which is wrong for this model.
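Putting those suggestions together, the launch command might look something like this: the OP's original invocation with the KV cache quantization flags dropped and the sampler settings added (a sketch; flag availability depends on your llama.cpp build):

```shell
./build/bin/llama-server \
  -m ./models/gemma-4-26B-A4B-it-UD-Q4_K_M.gguf \
  --fit-ctx 64000 --flash-attn on \
  --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0
```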
TheMasterOogway@reddit
Don't know about the parsing issues, but with 8GB VRAM try offloading the experts to RAM like this:
--n-gpu-layers 99 --n-cpu-moe 30
It should run much faster.