Any local LLM for a mid GPU
Posted by kellyjames436@reddit | LocalLLaMA | View on Reddit | 18 comments
Hey, recently tried Gemma4:9b and Qwen3.5:9b running on my RTX 4060 laptop with 16GB RAM, but they're so slow and annoying.
Is there any local LLM for coding tasks that can work smoothly on my machine?
jacek2023@reddit
it's not mid, it's a potato
kellyjames436@reddit (OP)
Unfortunately it is
pmttyji@reddit
Gemma-4-26B-A4B & Qwen3.5-35B-A3B. Both are MoE, so faster than dense models of similar size. Q4 (IQ4_XS) is better as you have only 8GB VRAM.
kellyjames436@reddit (OP)
Thank you. What do you recommend as an agent with those? I've been struggling with OpenClaw recently; I also tried Claude Code and it seems to need some configuration to use tools.
dabxdabx@reddit
hey, what are your use cases for the agent?
yes-im-hiring-2025@reddit
Have you tried doing a few optimization fixes first? 9B is elite for local use, generally performant as well.
Surprised to see you say you had subpar experience.
Check these optimizations out:
There's also more experimental stuff around turbo quant and spec prefill, but I haven't had time to try it myself, so idk how much of a perf boost they provide. After a point everything is diminishing returns, though.
kellyjames436@reddit (OP)
I'll try llama.cpp with 9b models and see what happens. My use case is specifically coding and tool calling.
Afraid-Pilot-9052@reddit
for a 4060 with 16gb ram you're gonna want to stay in the 3-4b parameter range for smooth performance, or use heavily quantized versions of the bigger models. try qwen2.5-coder:7b-q4 or deepseek-coder-v2-lite, both run way better at those quant levels. also make sure you're offloading fully to gpu and not splitting across cpu/gpu, that's usually what kills speed. if you want something that handles the whole setup without messing with configs, i've been using OpenClaw Desktop which has a setup wizard that auto-detects your hardware and picks the right model settings.
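A back-of-envelope way to sanity-check the "stay small or quantize hard" advice above (my own rough rule of thumb, not an official formula): weight memory is roughly parameter count × bits per weight / 8, plus some overhead for KV cache and runtime buffers.

```python
# Rough VRAM estimate for a quantized model (back-of-envelope only:
# real GGUF files mix quant types and runtimes add their own overhead).
def fits_in_vram(params_b, bits_per_weight, vram_gb, overhead_gb=1.5):
    """params_b: parameter count in billions; overhead_gb covers KV cache/buffers."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb + overhead_gb <= vram_gb

# 7B at Q4 (~4.5 bits/weight) on an 8 GB card:
print(fits_in_vram(7, 4.5, 8))   # True: ~3.9 GB of weights + overhead fits
# 9B at Q8 (8 bits/weight) on the same card:
print(fits_in_vram(9, 8, 8))     # False: ~9 GB of weights alone already overflows
```

This is why a 7B Q4 stays fully on the GPU while a 9B at a higher quant spills to CPU and crawls.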
kellyjames436@reddit (OP)
I've installed OpenClaw with Ollama. When I sent a hello message to the AI, I got an error saying I don't have enough system RAM. I'm also unsure whether those small models can handle heavy coding tasks or not.
Eelroots@reddit
I've got the same struggle with 12GB VRAM - most of the models I see around are sized for 16GB. It would be damn nice if Hugging Face would also publish the approximate memory size.
kellyjames436@reddit (OP)
Since you struggle with 12GB of VRAM, that means 8GB isn't enough to run an AI agent locally.
Afraid-Pilot-9052@reddit
maybe a minimum GPU is needed?
NotArticuno@reddit
I agree with the suggestion of qwen2.5-coder:7b-q4!
I haven't tried any deepseek model but I'm curious to.
kellyjames436@reddit (OP)
Does that q4 mean there are only 4B parameters active?
NotArticuno@reddit
No, it has to do with the precision of the numbers used during calculation. It's like 4-bit vs 8-bit, etc. Here, read this chat; I literally forget the difference in these things every time I re-learn it, I guess because I never actually apply it irl.
https://chatgpt.com/s/t_69d54eb876e481918783aea889d462f9
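To make the precision point concrete, here's a toy sketch of what "Q4"-style quantization does: store each weight as a small integer plus a shared scale instead of a 16/32-bit float. This is simple symmetric round-to-nearest, far cruder than real GGUF schemes, just for intuition.

```python
# Toy 4-bit quantization: each weight becomes an integer in [-7, 7]
# plus one shared float scale. Real llama.cpp quants (Q4_K, IQ4_XS, ...)
# are block-wise and more sophisticated; this only shows the core idea.
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7  # map the range onto signed 4-bit
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.53, 0.91, -0.07]
q, s = quantize_4bit(w)
approx = dequantize(q, s)
print(q)       # small integers, 4 bits each instead of 32
print(approx)  # close to the originals, but not exact - that's the quality cost
```

So q4 trades a little accuracy for a model file roughly a quarter the size of full precision, which is why it fits on smaller GPUs.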
kellyjames436@reddit (OP)
There are so many numbers and letters there; I'd have to learn it from scratch to understand what each one represents.
hejwoqpdlxn@reddit
The 9B models you tried don't fit in 8GB VRAM, so they spill into system RAM, which is why it feels so slow. Your 16GB is system RAM, not VRAM; those are separate pools, and inference speed is mostly determined by the GPU number. For coding on a 4060 laptop I'd go with Qwen2.5-Coder 7B Q4; it fits cleanly in 8GB and is genuinely solid for real coding tasks.
If you want snappier responses, the 3B version is roughly 2x faster and still handles most day-to-day stuff fine. 7B is enough for writing functions, debugging, and boilerplate. Where it starts to struggle is when you're throwing huge codebases at it or doing complex multi-file reasoning. For normal coding work it's fine. Also, maybe ditch OpenClaw and just use Ollama directly.
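The "3B is roughly 2x faster" claim follows from decode being mostly memory-bandwidth bound: tokens/sec is about bandwidth divided by the bytes of weights read per token. A sketch, assuming the ~256 GB/s figure commonly quoted for a laptop RTX 4060 (check your exact SKU; real speeds land below this ceiling):

```python
# Decode-speed ceiling from memory bandwidth. The 256 GB/s bandwidth and
# the bandwidth-bound assumption are my own rough inputs, not measurements.
def est_tokens_per_sec(params_b, bits_per_weight, bandwidth_gbps=256):
    model_gb = params_b * bits_per_weight / 8  # bytes of weights read per token
    return bandwidth_gbps / model_gb

print(round(est_tokens_per_sec(7, 4.5)))  # 7B at Q4
print(round(est_tokens_per_sec(3, 4.5)))  # 3B at Q4: proportionally faster
```

The ratio is just 7/3, so the smaller model's speedup comes straight from reading fewer bytes per token, not from any cleverness in the runtime.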
kellyjames436@reddit (OP)
The OpenClaw agent puts a heavy load on system specs; I tried it and it didn't work for me. I'll try those recommendations, thank you.