Best Plan/Act models for 30 GB VRAM / 64 GB RAM
Posted by PreparationTrue9138@reddit | LocalLLaMA | View on Reddit | 9 comments
Hi, I have a Dell G15 with 64 GB RAM, an RTX 3060 6 GB, plus an eGPU with an RTX 3090 24 GB.
What model will be the best for Planning?
I think Gemma 4 26B and Qwen3.5 35B are good for build/act mode because they are very fast (around 100 t/s), but I need more intelligence for plan mode.
What would be better?
I want to try some Qwen models like Qwen3 Coder Next or Qwen3.5 122B.
My main use case is Compose Multiplatform development.
What do you think?
Odd-Ordinary-5922@reddit
run gemma 4 31b
sk1kn1ght@reddit
Gemma 4 IT on the 3090 + Gemma 2b on the 3060 for spec decoding. Fast inference
PreparationTrue9138@reddit (OP)
Hi, can you please explain what spec decoding is and what kind of speed boost we're talking about?
sk1kn1ght@reddit
You use your big model as you normally would, and you also load a small model (Gemma 2B at Q4 or even lower; play around for your case). The small model predicts tokens that the big model then accepts or discards. On average it's around a 20 t/s boost for dense models; MoE models don't get much advantage from it. Importantly, the two models need to be from the same family, otherwise translation between vocabularies happens, which diminishes your returns.
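The accept/discard loop described above can be sketched in a few lines. This is a toy greedy-verification illustration of the idea, not how any real inference stack implements it: `draft_model` and `target_model` are deterministic stand-ins, and `k` is the number of tokens the draft proposes per verification round.

```python
# Toy "models": each maps a context (list of tokens) to a next token.
# Deterministic stand-ins so the speculative loop logic is clear.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context):
    # Small model: cheap to call, but sometimes guesses wrong.
    return VOCAB[(len(context) * 2) % len(VOCAB)]

def target_model(context):
    # Big model: expensive, defines the output we actually want.
    return VOCAB[len(context) % len(VOCAB)]

def speculative_decode(n_tokens, k=4):
    """Generate n_tokens. Each round, the draft proposes k tokens;
    the target verifies the batch in one pass, accepting the prefix
    that matches its own greedy choices and substituting its token
    at the first mismatch."""
    out = []
    target_calls = 0
    while len(out) < n_tokens:
        # Draft proposes k tokens cheaply.
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target verifies the whole batch (counted as one "call").
        target_calls += 1
        for t in proposed:
            if len(out) >= n_tokens:
                break
            expected = target_model(out)
            if t == expected:
                out.append(t)         # draft guessed right: free token
            else:
                out.append(expected)  # mismatch: take target's token, stop round
                break
    return out, target_calls
```

The output is identical to running the big model alone token by token; the speed-up comes from the big model verifying several draft tokens per call instead of generating one at a time.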
PreparationTrue9138@reddit (OP)
Thanks, I tried that. I think because of the eGPU bottleneck it doesn't work as expected, or I did something wrong; it slowed down both prompt processing (pp) and token generation (tg) speed.
tmvr@reddit
You have enough memory to try all of them so just do it. Opinions of other people or official benchmarks are meaningless if you already have a concrete use case.
CtrlAltDesolate@reddit
Qwen 3 coder next definitely seems better than Gemma 4 26b currently.
Gemma will typically get things done faster when it's able to, but it seems to run into way more things it can't figure out, and it overthinks or randomly breaks stuff it shouldn't be touching, regardless of your system prompt or settings.
I do like it for the initial framework and UI design, but Qwen's definitely the go-to for the complex functionality stage.
Plastic-Stress-6468@reddit
Rule of thumb: if it's not fully GPU-offloaded, it's going to run like arse. Since you're using an eGPU dock, you have even higher latency and lower bandwidth than native PCIe, and having model weights and KV cache crawling over native PCIe is already painfully slow.
Look for model + context combinations that fit just under 24 GB, the 3090's VRAM pool.
Neither Qwen3 Coder Next nor Qwen3.5 122B will fit comfortably in 24 GB at reasonable quants; most people will say the lowest you should go is Q4, and at Q3 and below it gets sketchy. Your 3090 can fit Qwen3 Coder Next at IQ1 with about 2 GB spare for context, and can't fit Qwen3.5 122B at all, even at the lowest quant.
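The fitting math here is simple arithmetic you can sanity-check yourself. A rough sketch, where the bits-per-weight values are ballpark averages for common quant levels (not exact GGUF figures) and the flat overhead allowance is an assumption:

```python
def model_vram_gb(n_params_b, bits_per_weight, overhead_gb=1.5):
    """Rough VRAM needed: weights alone plus a flat allowance for
    runtime context/KV cache. All figures are ballpark estimates."""
    weights_gb = n_params_b * bits_per_weight / 8  # 1B params @ 8 bpw ~= 1 GB
    return weights_gb + overhead_gb

# Does a 122B dense model fit in 24 GB at various quant levels?
for bits in (4.5, 3.5, 2.0, 1.6):  # roughly Q4_K_M, Q3_K_M, Q2, IQ1 averages
    need = model_vram_gb(122, bits)
    print(f"{bits:>4} bpw -> ~{need:.0f} GB needed, fits in 24 GB: {need <= 24}")
```

Even at ~1.6 bpw a 122B model needs roughly 24–26 GB before any context, which is why it doesn't fit on a single 3090 at any quant.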
I'd look to cloud models if you need more intelligence for planning.
BigYoSpeck@reddit
I've gotten the best quality code of any local model from Gemma 4 31b
But others will say the same about Qwen3.5 27B, so the best thing to do is to trial both.