Best coding model on RTX 3060

Posted by solimaotheelephant3@reddit | LocalLLaMA | 4 comments

Wondering what’s the best coding model that can fit on an RTX 3060 (12GB). Has anyone been able to do something useful with it?

Also wondering about the best setup (vLLM? llama.cpp?) and quantization.

Thanks a lot, this community is great


ea_man@reddit

QWEN3.6 35B A3B with MTP, https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF

The smaller the quant (e.g. IQ3), the faster the speed and the lower the "quality".

The more context you want, the more KV cache quant you need; e.g. at 20k ctx you may get by with q_4, at 120K you want q8_0 / q5_0.
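To put rough numbers on how the KV cache grows with context, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension are made-up placeholders, not the real values for any Qwen model (hybrid-attention models in particular may keep a cache for only some layers). The bits-per-value figures follow from the ggml block layouts, e.g. q8_0 stores 32 values in 34 bytes.

```python
# Bits per cached value for common llama.cpp cache types.
# ggml packs 32 values per block plus a 2-byte scale
# (q5_0 adds 4 more bytes for the fifth bit of each value).
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_gib(n_ctx, cache_type, n_layers=40, n_kv_heads=4, head_dim=128):
    """Estimate KV cache size in GiB. Architecture defaults are hypothetical."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * BITS_PER_VALUE[cache_type] / 8 / 1024**3


for n_ctx in (20_000, 120_000):
    for cache_type in BITS_PER_VALUE:
        size = kv_cache_gib(n_ctx, cache_type)
        print(f"{n_ctx:>7} ctx, {cache_type:>4}: {size:.2f} GiB")
```

The cache scales linearly with context, so whatever the real architecture numbers are, going from 20k to 120K costs 6x the memory at the same cache type.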

For coding you want MTP enabled with n=1-3, depending on how much ctx length you want to keep, since its cost multiplies with ctx length. For creative chat just do n=1 or none.

Single user / task -> llama.cpp

Multiuser -> vLLM, but you don't have the VRAM for that


Brother, how much RAM do you have? Assuming you have 32GB, Qwen3.6 35B A3B and Gemma4 26B A4B are the most intelligent MoE models you can run. For the best mix of speed and intelligence, you should try Qwen3.5 9B.


solimaotheelephant3@reddit (OP)

RTX 3060 is 12GB


SimShelby@reddit

Qwen3.5 35B A4B UD Q4_K_M from Unsloth + turbo quant, 200k context

I am getting 40-45 tps and 300 pp. I have 32GB RAM and 16GB VRAM.

You can lower the context to match your VRAM

or try Q3_K_M.
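A rough way to pick that context size: subtract whatever the weights and compute buffers occupy on the GPU from total VRAM, then divide the remainder by the per-token KV cache cost. The sketch below assumes a standard attention cache. Every number in the example call (GPU-resident weight size, overhead, layer and head counts) is a hypothetical placeholder to swap for values from your own model and llama.cpp logs.

```python
def max_context(vram_gib, weights_on_gpu_gib, overhead_gib, bits_per_value,
                n_layers, n_kv_heads, head_dim):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM."""
    free_bytes = (vram_gib - weights_on_gpu_gib - overhead_gib) * 1024**3
    if free_bytes <= 0:
        return 0  # nothing left for the cache at all
    # K and V each store n_kv_heads * head_dim values per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value / 8
    return int(free_bytes // bytes_per_token)


# Hypothetical 12 GiB card: 8 GiB of weights kept on the GPU, 1 GiB of
# overhead, q8_0 cache (8.5 bits per value), placeholder architecture.
print(max_context(12, 8, 1, 8.5, n_layers=40, n_kv_heads=4, head_dim=128))
```

With an MoE model, part of the weights can stay in system RAM, so `weights_on_gpu_gib` is whatever portion you actually offload to the GPU rather than the full file size.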