5060ti and 64gb ram - what is my best option for local coding?
Posted by bonesoftheancients@reddit | LocalLLaMA | View on Reddit | 13 comments
compiled llama.cpp forks for turboquant and rotorquant and now trying models - what are the best models for local coding that will run on my setup (at a usable speed)? and what should i realistically expect (after using gemini and claude online for coding)?
NeverForget2023@reddit
I'm trying to figure this out right now myself. Similar setup: 7800x3D, 64 GB DDR5 6000, 4070 Ti Super.
Giving these a try (all unsloth):
gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf
Qwen3-Coder-Next-UD-Q4_K_XL.gguf
Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf
Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf
Running lmeval (mbpp and humaneval_instruct tasks) against each.
Also trying https://github.com/k-koehler/gguf-tensor-overrider to fit as many of the important tensors on the GPU as possible. It doesn't seem to support Gemma4, though: the params it spits out try to put just about everything on the GPU and llama.cpp coredumps. So for Gemma4 I'm just letting llama.cpp do the layer fit automatically.
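For anyone curious what that tool produces: it generates --override-tensor (-ot) arguments for llama.cpp. A hand-written equivalent that keeps the big MoE expert tensors in system RAM looks roughly like this - the model path and the tensor-name regex here are illustrative, check your model's actual tensor names with gguf-dump first:

```shell
# Sketch: run llama-server with all layers nominally on GPU (-ngl 99)
# but override the large MoE expert tensors to stay on CPU/system RAM.
# Model path is hypothetical; the regex must match your model's tensors.
llama-server \
  -m Qwen3-Coder-30B-A3B-Instruct-UD-Q8_K_XL.gguf \
  -ngl 99 \
  --override-tensor "blk\..*_exps\.=CPU" \
  -c 32768
```

The idea is that expert weights are huge but only a few are active per token, so they tolerate the slower system-RAM path much better than attention tensors do.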
Qwen3 Coder Next finished last night in 3,169.5 seconds, mbpp score 0.784, humaneval score 0.939.
I'll keep this Google sheet updated as I get results: https://docs.google.com/spreadsheets/d/1Icn01bywinr3UG1iF25c54wG6ohlwZ1xgc3b5BgkJEs/edit?usp=sharing
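For reference, a sketch of the kind of lm-evaluation-harness run being described, assuming the model is already being served by llama-server and using lm_eval's gguf backend to point at it (URL/port are illustrative, and flags reflect the harness as I understand it):

```shell
# Sketch: score a locally served GGUF model on the same two tasks.
# Assumes llama-server is listening on port 8080. mbpp/humaneval
# execute generated code, so the harness requires explicit opt-in.
lm_eval \
  --model gguf \
  --model_args base_url=http://localhost:8080 \
  --tasks mbpp,humaneval_instruct \
  --confirm_run_unsafe_code
```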
bonesoftheancients@reddit (OP)
thanks - this is great. have you considered the qwen 3.6 TQ3_4s model? it's pretty fast but no idea how good it is for coding
NeverForget2023@reddit
If you are trying to get it all to fit into VRAM, then Q3 might not be enough - you still need some headroom for the KV cache. But if you are ok w/ overflowing to system RAM, might as well go w/ Q4+ and just pick how much total RAM use you are ok with. I'm mid-run and htop is showing about 44 GB system RAM used (26 GB for llama.cpp) and nvidia-smi just under 15 GB VRAM used with UD-Q8_K_XL.
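The arithmetic behind that kind of budgeting can be sketched in a few lines. The bits-per-weight figures and the example layer/head counts below are rough assumptions of mine for illustration, not numbers from any real model config:

```python
# Rough sketch: estimate GGUF weight size + KV cache size to plan
# a GPU/CPU split. Bits-per-weight values are approximate averages
# I'm assuming for common llama.cpp quant types.
BPW = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_S": 3.5}

def model_gib(n_params_b: float, quant: str) -> float:
    """Approximate weight size in GiB for n_params_b billion params."""
    return n_params_b * 1e9 * BPW[quant] / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int, bytes_per_elem: int = 2) -> float:
    """K and V caches (fp16 by default): 2 * layers * kv_heads * head_dim * ctx."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

if __name__ == "__main__":
    # Hypothetical 30B model at Q4_K_M plus a 32k-token fp16 KV cache.
    print(f"weights ~{model_gib(30, 'Q4_K_M'):.1f} GiB")
    print(f"kv cache ~{kv_cache_gib(48, 8, 128, 32768):.2f} GiB")
```

Even at Q4 a 30B-class model plus cache lands well over 16 GB, which is why some spill into system RAM is usually unavoidable on these cards.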
bonesoftheancients@reddit (OP)
not sure i understand - was asking about the TQ3 turboquant weights (can be used with the llama.cpp TQ fork)
NeverForget2023@reddit
Which one are you looking at? This one? https://huggingface.co/YTan2000/Qwen3.6-35B-A3B-TQ3_4S
bonesoftheancients@reddit (OP)
yes this one - i think you need to compile a fork of llama.cpp that includes support for TQ3 weights
NeverForget2023@reddit
Sorry! I didn't understand. Just learning about turboquant now.
pand5461@reddit
Qwen3.5-120b-a10b might actually run at quants like iq3_s or iq4_xxs if you have the 16 GB GPU version. I ran iq3_s using ik_llama with a 4060 8 GB, but that needs heavy CPU offloading and runs at only 6-7 tok/s. With 16 GB VRAM it might be enough to offload only the MoE experts to CPU. Qwen3-coder-next is also great and should be pretty fast (runs at 24 tok/s on my PC).
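A sketch of what "only MoE offloaded to CPU" looks like as an invocation, using the newer llama.cpp shorthand as I understand it (the model filename is hypothetical, and ik_llama users would use the equivalent --override-tensor regex instead):

```shell
# Sketch: full GPU offload for attention/dense layers, MoE expert
# tensors kept on CPU. --n-cpu-moe N moves the experts of the first
# N layers to CPU; an oversized N effectively means "all of them".
llama-server \
  -m Qwen3.5-120b-a10b-IQ3_S.gguf \
  -ngl 99 \
  --n-cpu-moe 999 \
  -c 16384
```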
Frizzy-MacDrizzle@reddit
Some of the instruct models do well at one-shot Python. You can pull Hugging Face models directly too!
Most-Trainer-8876@reddit
Try the Qwen 3.6 35B A3B model. Perfect for local coding! Your setup can handle the full context, i.e. 256K.
bonesoftheancients@reddit (OP)
thanks
tmvr@reddit
Qwen3.6 35B A3B
Qwen3.5 35B A3B
Qwen3 Coder 30B A3B
Try these at Q4_K_M or better, loading the experts into system RAM (use the -fit parameter in llama.cpp).
bonesoftheancients@reddit (OP)
thanks