Wanna try the best coding model with my rtx 3090, not sure where to start. I believe Qwen3.5-27B-UD-Q4_K_XL would be the best? If so, should I use ollama with it?
Posted by dreamer_2142@reddit | LocalLLaMA | 18 comments
I've already searched, but the information gets updated every week, so it's really hard to get an answer. I really hope some of you guys can give me some tips. And can I use an agent with it to enhance the code? Love to hear your setup.
Thanks!
b1231227@reddit
https://huggingface.co/DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF
I recommend this model. I'm currently using the IQ3_M quant to modify the llama.cpp code, and it handles automated operation quite well.
dreamer_2142@reddit (OP)
If I want to use it with Ollama, I need a "Modelfile", I assume? Where can I get a template to make one for this model?
vick2djax@reddit
You're gonna get half the speed and half the performance in Ollama. Just save yourself the confusion. I started with Ollama and wasted so much time. It's garbage.
Anbeeld@reddit
Qwen 3.6, not 3.5
LirGames@reddit
Forget ollama, use llama.cpp or llama-swap (which uses llama.cpp anyway). Unsloth Q4_K_XL is perfectly fine. You can run it with 80K context if you keep the vision part active on the GPU, or you can offload it to RAM (or disable it) and easily go up to 96K context with a Q8 KV cache.
If you don't understand any of this, just drop the message into Gemini/Claude and ask for help setting everything up (Docker highly recommended); they'll figure it out.
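For reference, a launch command along those lines looks roughly like this (the model filename is a placeholder, and exact flag spellings vary between llama.cpp builds, so check llama-server --help):

    # all layers on the 3090, ~96K context, Q8 KV cache
    # (flash attention is required for the quantized V cache)
    llama-server -m Qwen3.5-27B-UD-Q4_K_XL.gguf -ngl 99 -c 98304 -fa \
      --cache-type-k q8_0 --cache-type-v q8_0 --port 8080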
dreamer_2142@reddit (OP)
Thanks a lot, will do.
kosnarf@reddit
Check out llama-swap. It's been performing much better than ollama.
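If you try it, the whole setup is one small YAML file, roughly like this (model name and path are placeholders; the llama-swap README has the exact schema):

    models:
      "qwen-coder":
        # llama-swap fills in ${PORT} and starts/stops the server on demand
        cmd: >
          llama-server --port ${PORT}
          -m /models/Qwen3.5-27B-UD-Q4_K_XL.gguf -ngl 99 -c 32768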
sine120@reddit
Skip Ollama, just learn to build llama.cpp. 27B Q4 is a good pick. Use llama-server and hook it up to opencode or the Pi coding agent. Opencode if you just want something that works, Pi if you want to speed up prompt processing.
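The build is only a few commands on a machine with CUDA installed, something like:

    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON       # CUDA backend for the 3090
    cmake --build build --config Release -j
    ./build/bin/llama-server --help     # binaries end up in build/bin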
T0nd3@reddit
Good starting point. A few things to sharpen:
On the model: Qwen3-32B at Q4_K_M fits in your 24GB (around 19-20GB loaded) and is arguably the best coding model you can run locally right now. The "UD" unsloth quants are generally high quality — if you see Qwen3-32B-UD-Q4_K_XL that's a solid pick. If you want headroom for longer contexts, Qwen3-30B-A3B (the MoE variant) uses less VRAM at similar quality.
On Ollama: Yes, start with Ollama. It handles model management cleanly and exposes an OpenAI-compatible API, which is important for the agent step. One thing to set: the default context window is 2048 tokens, which is far too small for coding tasks.
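If you stay on Ollama, the usual fix is a two-line Modelfile (the FROM path is whatever GGUF you downloaded):

    FROM ./Qwen3.5-27B-UD-Q4_K_XL.gguf
    PARAMETER num_ctx 32768

Then ollama create qwen-coder -f Modelfile and ollama run qwen-coder.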
On agents: Two setups worth trying:
aider: aider --model ollama/qwen3:32b
Continue (the VS Code extension), pointed at Ollama's OpenAI-compatible API.
For Qwen3 specifically, enable thinking mode in Continue's system prompt or via /think in Ollama; it noticeably improves code quality on harder tasks.
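Wiring Continue to Ollama is a few lines of config, roughly this in the older config.json schema (newer versions use config.yaml, so check Continue's docs):

    {
      "models": [
        { "title": "Qwen3 32B local", "provider": "ollama", "model": "qwen3:32b" }
      ]
    }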
Then-Topic8766@reddit
Damn bots! You will need Qwen 3.6 (27B or 35B-A3B).
dreamer_2142@reddit (OP)
I've tried to use the downloaded model "Qwen3.5-27B-UD-Q4_K_XL" with Ollama, but it gives unrelated answers to my questions. I assume I need to download a specific model from their library and not just any model I find on Hugging Face?
Then-Topic8766@reddit
Try to use llama.cpp directly. Much better experience than ollama.
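Recent builds can even pull a GGUF straight from Hugging Face, and llama-server uses the chat template baked into the GGUF, which often fixes exactly that "unrelated answers" problem (the repo/quant tag below is just an example):

    llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q4_K_XL --port 8080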
dreamer_2142@reddit (OP)
Ok, thanks. Any quant you'd recommend for 24GB? And a recommendation for an AI agent?
Then-Topic8766@reddit
Install llama.cpp. Depending on the context size you want, choose a quant. I think Q4_K_XL should work with a 3090. Try both 27B and 35B-A3B. The first is smarter, but the second is faster. And with the second you can offload to RAM and get a bigger quant.
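For the RAM offload on the A3B model, newer llama.cpp builds have a dedicated flag for keeping MoE expert weights on the CPU (filename is a placeholder; older builds use -ot tensor-override regexes instead):

    # everything on the GPU except the expert tensors of the first 16 layers
    llama-server -m Qwen3.6-35B-A3B-Q4_K_XL.gguf -ngl 99 --n-cpu-moe 16 -c 32768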
sagiroth@reddit
Bot account, don't trust this one. Recommending old setups.
dreamer_2142@reddit (OP)
Almost fell for it. Thanks, but where can I find the club 3090 on GitHub?
sagiroth@reddit
https://github.com/noonghunna/club-3090
dreamer_2142@reddit (OP)
Thanks m8!