I finally found the best 5070 Ti + 32GB RAM GGUF model
Posted by FrozenFishEnjoyer@reddit | LocalLLaMA | 5 comments
It's the Gemma 4 26B A4B IQ4_NL.
My llama.cpp command is:
llama-server.exe -m "gemma-4-26B-A4B-it-UD-IQ4_NL.gguf" -ngl 999 -fa on -c 65536 -ctk q8_0 -ctv q8_0 --batch-size 1024 --ubatch-size 512 --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --no-warmup --port 8080 --host 0.0.0.0 --chat-template-kwargs "{\"enable_thinking\":true}" --perf
In essence, these are just the recommended settings from Google, but it has served me damn well as a co-assistant to Claude Code in VS Code.
I gave it tests, and it scores around 6.5/10. It reads my guide.md, follows it, reads files, and more. Its main issue is that it can't get past the intricacies of packages; in other words, it can't connect files to each other with full accuracy.
But that's it for its issues. Everything else has been great, since it has a large context size and is fast (just under 100 tokens per second). This is one of the few models that has passed the carwash test in my testing.
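For anyone who wants to poke at it once the server is up: llama-server exposes an OpenAI-compatible API, so a minimal chat request against the command above (same port, prompt is just an example) looks like this:

```shell
# Minimal chat request to the llama-server started above.
# /v1/chat/completions is llama-server's OpenAI-compatible route;
# the prompt and sampler values here are just illustrative.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize guide.md in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95
  }'
```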
a-babaka@reddit
What tasks are you using the LLM for? Does qwen3.5 35b work worse on them? At least you can expect more context there.
iamapizza@reddit
The IQ4s seem to be smaller than the Q4s, why is that?
jacek2023@reddit
You can experiment with a more quantized KV cache (to use less memory); check this: https://www.reddit.com/r/LocalLLaMA/comments/1sf61n2/kvcache_support_attention_rotation_for/
FrozenFishEnjoyer@reddit (OP)
Yes, but I prefer the q8_0 for better accuracy
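For a rough sense of what that trade-off costs in memory, here's a back-of-the-envelope KV-cache estimator. The per-element sizes come from the GGUF block formats (q8_0 stores 32 elements in 34 bytes, q4_0 in 18 bytes); the model dimensions in the example are placeholders, not Gemma's actual architecture:

```python
# Rough KV-cache size estimator for llama.cpp-style caches.
# Bytes-per-element values reflect GGUF block formats; the model
# dimensions used below are illustrative, NOT the real Gemma config.

BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # GGUF q8_0: 34-byte block per 32 elements
    "q4_0": 18 / 32,  # GGUF q4_0: 18-byte block per 32 elements
}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, cache_type):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return per_token * ctx_len

# Hypothetical 48-layer model, 8 KV heads of dim 128, 65536-token context:
for ctype in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(48, 8, 128, 65536, ctype) / 2**30
    print(f"{ctype}: {gib:.2f} GiB")
```

With those placeholder dimensions, q8_0 roughly halves the f16 cache and q4_0 roughly quarters it, which is the memory headroom jacek2023's suggestion is pointing at.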
SaltResident9310@reddit
And here I am waiting for 1-bit quants so that I can run good dense models on my lowly laptop.