I finally found the best 5070 Ti + 32GB RAM GGUF model
Posted by FrozenFishEnjoyer@reddit | LocalLLaMA | 5 comments
It's the Gemma 4 26B A4B IQ4_NL.
My llama.cpp command is:
llama-server.exe -m "gemma-4-26B-A4B-it-UD-IQ4_NL.gguf" -ngl 999 -fa on -c 65536 -ctk q8_0 -ctv q8_0 --batch-size 1024 --ubatch-size 512 --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --no-warmup --port 8080 --host 0.0.0.0 --chat-template-kwargs "{\"enable_thinking\":true}" --perf
In essence, these are just the recommended settings from Google, but it has served me damn well as a co-assistant to Claude Code in VS Code.
I gave it tests, and it scores around 6.5/10. It reads my guide.md, follows it, reads files, and more. Its main issue is that it can't get past the intricacies of packages; in other words, it can't connect files to each other with full accuracy.
But that's it for its issues. Everything else has been great, since it has a large context size and is fast (just under 100 tokens per second). This is one of the few models that has passed the carwash test in my testing.
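For anyone who wants to poke at it once the server is up: llama-server exposes an OpenAI-compatible API, so a minimal chat request against the command above (same port, prompt is just an example) looks like this:

```shell
# Minimal chat request to the llama-server started above.
# /v1/chat/completions is llama-server's OpenAI-compatible route;
# the prompt and sampler values here are just illustrative.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize guide.md in one sentence."}],
    "temperature": 1.0,
    "top_p": 0.95
  }'
```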
a-babaka@reddit
What tasks are you using the LLM for? Does qwen3.5 35b work worse on them? At least you can expect more context there.
iamapizza@reddit
The IQ4s seem to be smaller than the Q4s, why is that?
jacek2023@reddit
You can experiment with a more quantized KV cache (to use less memory); check this: https://www.reddit.com/r/LocalLLaMA/comments/1sf61n2/kvcache_support_attention_rotation_for/
FrozenFishEnjoyer@reddit (OP)
Yes, but I prefer the q8_0 for better accuracy
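For a rough sense of what that trade-off costs in memory, here's a back-of-the-envelope KV-cache estimator. The per-element sizes come from the GGUF block formats (q8_0 stores 32 elements in 34 bytes, q4_0 in 18 bytes); the model dimensions in the example are placeholders, not Gemma's actual architecture:

```python
# Rough KV-cache size estimator for llama.cpp-style caches.
# Bytes-per-element values reflect GGUF block formats; the model
# dimensions used below are illustrative, NOT the real Gemma config.

BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # GGUF q8_0: 34-byte block per 32 elements
    "q4_0": 18 / 32,  # GGUF q4_0: 18-byte block per 32 elements
}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, cache_type):
    # K and V each hold n_layers * n_kv_heads * head_dim values per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]
    return per_token * ctx_len

# Hypothetical 48-layer model, 8 KV heads of dim 128, 65536-token context:
for ctype in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(48, 8, 128, 65536, ctype) / 2**30
    print(f"{ctype}: {gib:.2f} GiB")
```

With those placeholder dimensions, q8_0 roughly halves the f16 cache and q4_0 roughly quarters it, which is the memory headroom jacek2023's suggestion is pointing at.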
SaltResident9310@reddit
And here I am waiting for 1-bit quants so that I can run good dense models on my lowly laptop.