Best coding model on RTX 3060
Posted by solimaotheelephant3@reddit | LocalLLaMA | 4 comments
Wondering what’s the best coding model that can fit on an RTX 3060 (12GB). Has anyone been able to do something useful with it?
Also wondering about the best setup (vLLM? llama.cpp?) and quantization.
Thanks a lot, this community is great.
ea_man@reddit
Qwen3.6 35B A3B with MTP: https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF
The smaller the quant (e.g. IQ3), the faster the speed and the lower the "quality".
The more context you want, the more KV cache quant you need: e.g. at 20k ctx you may get by with q_4, at 120K you want q8_0 q5_0.
For coding you want MTP enabled with n=1-3, depending on how much ctx length you want to keep, since its cost multiplies with ctx length. For creative chat just use n=1 or none.
Single user / task -> llama.cpp
Multiuser -> vLLM, but you don't have the VRAM for that
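To put rough numbers on the KV cache point, here is a minimal sketch of the usual size estimate (context length × layers × KV heads × head dim × bytes per element, for K and V). The bytes-per-element values follow the ggml block layouts (32 values per block plus scale data). The model shape in `SHAPE` is a made-up placeholder, not the real Qwen3.6 config: read the actual values from the GGUF metadata, and count only the layers that actually keep a KV cache.

```python
# Bytes per stored element for common llama.cpp cache types.
# Quantized types store blocks of 32 values plus per-block scale data.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_0": 22 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, type_k="f16", type_v="f16"):
    """Approximate KV cache size in GiB for one sequence of n_ctx tokens."""
    elems_per_token = n_layers * n_kv_heads * head_dim  # per K and per V
    bytes_per_token = elems_per_token * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])
    return n_ctx * bytes_per_token / 2**30


# HYPOTHETICAL model shape, for illustration only.
SHAPE = dict(n_layers=40, n_kv_heads=4, head_dim=128)

for n_ctx, type_k, type_v in [
    (20_000, "f16", "f16"),
    (20_000, "q4_0", "q4_0"),
    (120_000, "f16", "f16"),
    (120_000, "q8_0", "q5_0"),
]:
    size = kv_cache_gib(n_ctx, type_k=type_k, type_v=type_v, **SHAPE)
    print(f"ctx={n_ctx:>7} K={type_k:<5} V={type_v:<5} -> {size:.2f} GiB")
```

Whatever the real shape is, the cache grows linearly with context, so the context you can afford on 12GB depends directly on the cache type you pick.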
WiseVanilla2743@reddit
Brother, how much RAM do you have? Assuming you have 32GB, Qwen3.6 35b a3b and Gemma4 26b a4b are the best intelligence MoE models you can run. For the best mix of speed and intelligence you should try qwen3.5 9b.
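A quick back-of-the-envelope check on why RAM matters here: weight size is roughly parameters × bits per weight / 8. The sketch below uses rough bits-per-weight figures that are my own approximations (real GGUF files vary by quant recipe), so treat the output as an order-of-magnitude guide only.

```python
def weights_gib(n_params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30


# ROUGH, assumed bits-per-weight values; actual files differ.
ROUGH_BPW = {"Q4_K_M": 4.8, "Q3_K_M": 3.9}

VRAM_GIB = 12  # RTX 3060

for name, params_b in [("35B MoE", 35), ("26B MoE", 26), ("9B dense", 9)]:
    for quant, bpw in ROUGH_BPW.items():
        size = weights_gib(params_b, bpw)
        where = "fits in VRAM" if size < VRAM_GIB else "needs offload to system RAM"
        print(f"{name:<9} {quant:<7} ~{size:5.1f} GiB -> {where}")
```

Under these assumptions the 35B and 26B models do not fit in 12GB even at ~4 bits, so part of the weights has to sit in system RAM. That stays usable with MoE models because only a few billion parameters are active per token, while a 9B model fits on the card outright.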
solimaotheelephant3@reddit (OP)
RTX 3060 is 12GB
SimShelby@reddit
Qwen3.5 35B A4B UD Q4_K_M from Unsloth + turbo quant, context 200k.
I am getting 40/45 tps and 300 pp. I have 32GB RAM and 16GB VRAM.
You can lower the context to match your VRAM,
or try Q3_K_M.
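If you want a starting point for "lower the context to match your VRAM", a minimal sketch: take whatever VRAM is left after the weights you keep on the GPU, hold back some headroom, and divide by the KV cache bytes per token. All inputs here are hypothetical numbers; `kv_bytes_per_token` depends on the model and cache type, and `reserve_gib` is just a guess for compute buffers.

```python
def max_context(free_vram_gib, kv_bytes_per_token, reserve_gib=1.0):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM."""
    usable_bytes = max(free_vram_gib - reserve_gib, 0.0) * 2**30
    return int(usable_bytes // kv_bytes_per_token)


# HYPOTHETICAL example: 4 GiB left after weights, 40 KiB of KV cache per token.
print(max_context(free_vram_gib=4.0, kv_bytes_per_token=40 * 1024))
```

Use the result as a first guess for the context size, then adjust based on what the loader actually reports.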