Any local LLM for a mid GPU
Posted by kellyjames436@reddit | LocalLLaMA | View on Reddit | 18 comments
Hey, recently tried Gemma4:9b and Qwen3.5:9b running on my RTX 4060 laptop with 16GB RAM, but they're so slow and annoying.
Is there any local LLM for coding tasks that can work smoothly on my machine?
jacek2023@reddit
it's not mid, it's a potato
kellyjames436@reddit (OP)
Unfortunately it is
pmttyji@reddit
Gemma-4-26B-A4B & Qwen3.5-35B-A3B. Both are MoE, so faster than dense models of similar size. Q4 (IQ4_XS) is better as you have only 8GB VRAM.
kellyjames436@reddit (OP)
Thank you. What do you recommend as an agent with those? I've been struggling with OpenClaw recently; I also tried Claude Code and it seems to need some configuration to use tools.
dabxdabx@reddit
hey, what are your use cases for the agent?
yes-im-hiring-2025@reddit
Have you tried doing a few optimization fixes first? 9B is elite for local use, generally performant as well.
Surprised to see you say you had subpar experience.
Check these optimizations out:
There's also more experimental stuff around turbo quant and spec prefill, but I haven't had time to try it myself, so idk how much of a perf boost they provide. After a point everything is diminishing returns, though.
kellyjames436@reddit (OP)
I'll try llama.cpp with 9b models and see what happens. My use case is specifically coding and tool calling.
Afraid-Pilot-9052@reddit
for a 4060 with 16gb ram you're gonna want to stay in the 3-4b parameter range for smooth performance, or use heavily quantized versions of the bigger models. try qwen2.5-coder:7b-q4 or deepseek-coder-v2-lite, both run way better at those quant levels. also make sure you're offloading fully to gpu and not splitting across cpu/gpu, that's usually what kills speed. if you want something that handles the whole setup without messing with configs, i've been using OpenClaw Desktop which has a setup wizard that auto-detects your hardware and picks the right model settings.
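A back-of-envelope way to sanity-check the "stay small or quantize hard" advice above (my own rough rule of thumb, not an official formula): weight memory is roughly parameter count × bits per weight / 8, plus some overhead for KV cache and runtime buffers.

```python
# Rough VRAM estimate for a quantized model (back-of-envelope only:
# real GGUF files mix quant types and runtimes add their own overhead).
def fits_in_vram(params_b, bits_per_weight, vram_gb, overhead_gb=1.5):
    """params_b: parameter count in billions; overhead_gb covers KV cache/buffers."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb + overhead_gb <= vram_gb

# 7B at Q4 (~4.5 bits/weight) on an 8 GB card:
print(fits_in_vram(7, 4.5, 8))   # True: ~3.9 GB of weights + overhead fits
# 9B at Q8 (8 bits/weight) on the same card:
print(fits_in_vram(9, 8, 8))     # False: ~9 GB of weights alone already overflows
```

This is why a 7B Q4 stays fully on the GPU while a 9B at a higher quant spills to CPU and crawls.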
kellyjames436@reddit (OP)
I've installed OpenClaw with Ollama. When I sent a hello message to the AI, I got an error saying I don't have enough system RAM. I'm also unsure whether those small models can handle heavy coding tasks or not.
Eelroots@reddit
I've got the same struggle with 12GB VRAM - most of the models I see around are sized for 16GB. It would be damn nice if Hugging Face would also publish the approximate memory size.
kellyjames436@reddit (OP)
Since you struggle with 12GB of VRAM, that means 8GB isn't enough to run an AI agent locally.
Afraid-Pilot-9052@reddit
maybe a minimum GPU is needed?
NotArticuno@reddit
I agree with the suggestion of qwen2.5-coder:7b-q4!
I haven't tried any deepseek model but I'm curious to.
kellyjames436@reddit (OP)
Does that q4 mean there are only 4B parameters active?
NotArticuno@reddit
No, it has to do with the precision of the numbers used during calculation. It's like 4-bit vs 8-bit, etc. Here, read this chat; I literally forget the difference in these things every time I re-learn it, I guess because I never actually apply it irl.
https://chatgpt.com/s/t_69d54eb876e481918783aea889d462f9
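To make the precision point concrete, here's a toy sketch of what "Q4"-style quantization does: store each weight as a small integer plus a shared scale instead of a 16/32-bit float. This is simple symmetric round-to-nearest, far cruder than real GGUF schemes, just for intuition.

```python
# Toy 4-bit quantization: each weight becomes an integer in [-7, 7]
# plus one shared float scale. Real llama.cpp quants (Q4_K, IQ4_XS, ...)
# are block-wise and more sophisticated; this only shows the core idea.
def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7  # map the range onto signed 4-bit
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.53, 0.91, -0.07]
q, s = quantize_4bit(w)
approx = dequantize(q, s)
print(q)       # small integers, 4 bits each instead of 32
print(approx)  # close to the originals, but not exact - that's the quality cost
```

So q4 trades a little accuracy for a model file roughly a quarter the size of full precision, which is why it fits on smaller GPUs.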
kellyjames436@reddit (OP)
There are so many numbers and letters there; I'd have to learn it from scratch to understand what each one represents.
hejwoqpdlxn@reddit
The 9B models you tried don't fit in 8GB VRAM, so they spill into system RAM, which is why it feels so slow. Your 16GB is system RAM, not VRAM; those are separate pools, and inference speed is mostly determined by the GPU number. For coding on a 4060 laptop I'd go with Qwen2.5-Coder 7B Q4; it fits cleanly in 8GB and is genuinely solid for real coding tasks.
If you want snappier responses, the 3B version is roughly 2x faster and still handles most day-to-day stuff fine. 7B is enough for writing functions, debugging, and boilerplate. Where it starts to struggle is when you're throwing huge codebases at it or doing complex multi-file reasoning. For normal coding work it's fine. Also, maybe ditch OpenClaw and just use Ollama directly.
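The "3B is roughly 2x faster" claim follows from decode being mostly memory-bandwidth bound: tokens/sec is about bandwidth divided by the bytes of weights read per token. A sketch, assuming the ~256 GB/s figure commonly quoted for a laptop RTX 4060 (check your exact SKU; real speeds land below this ceiling):

```python
# Decode-speed ceiling from memory bandwidth. The 256 GB/s bandwidth and
# the bandwidth-bound assumption are my own rough inputs, not measurements.
def est_tokens_per_sec(params_b, bits_per_weight, bandwidth_gbps=256):
    model_gb = params_b * bits_per_weight / 8  # bytes of weights read per token
    return bandwidth_gbps / model_gb

print(round(est_tokens_per_sec(7, 4.5)))  # 7B at Q4
print(round(est_tokens_per_sec(3, 4.5)))  # 3B at Q4: proportionally faster
```

The ratio is just 7/3, so the smaller model's speedup comes straight from reading fewer bytes per token, not from any cleverness in the runtime.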
kellyjames436@reddit (OP)
The OpenClaw agent puts a heavy load on system specs; I tried it and it didn't work for me. I'll try those recommendations, thank you.