What's the consensus on superior local models for code generation? Is my setup competitive?
Posted by warpanomaly@reddit | LocalLLaMA | 30 comments
I'm trying as hard as I can to get a local setup somewhere in the ballpark of proprietary LLMs for code generation. My computer is running an Intel(R) Core(TM) Ultra 7 265K (3.90 GHz) with 128 GB of DDR5 RAM and an Nvidia GeForce RTX 5090 that has 32 GB of GDDR7 video memory. Even with this high-end enthusiast hardware, I can't get my local LLMs to come close to Claude Code or ChatGPT Codex. I know that I'll never get local code generation as good as the major industry players running gigantic power-grid-altering data centers, but it seems like I should be able to get better results than I'm getting.
My first attempt was deepseek-coder-v2:236b. Long story short, I couldn't get it working. As soon as I started talking about my failed attempts to use Deepseek, lots of people told me to switch to GLM-4.7-Flash-GGUF:Q6_K_XL or MiniMax-M2.1-GGUF:Q4_K_XL. I started using GLM-4.7-Flash-GGUF:Q6_K_XL with pretty good results. It was actually generating usable code.
This was a few months ago. I know it hasn't been that long but it seems like AI is really exploding lately. I've been seeing people get crazy results for art via tools like ComfyUI and Automatic1111. Also, I think Deepseek just unveiled a new model. Idk if it's available to the public yet, but I have to ask, is there a better model for local code generation than GLM-4.7-Flash-GGUF:Q6_K_XL? Is running it from the command line with .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-size 32000 --n-gpu-layers 99 and then connecting it to VSCodium with Continue still the best way to do what I'm trying to do?
P.S. I bought my Nvidia 5090 thinking it was the best piece of equipment for running AI locally. Should I get one of those Nvidia DGX Sparks or one of the competitors?
Tormeister@reddit
Quality: Qwen 3.6 27B fully on GPU, using MTP (to get optimal context length you will have to squeeze everything out of VRAM, including window managers / display server)
Speed + usable quality: Qwen 3.6 35B A3B, fully on GPU, same story
Meanwhile all that CPU and RAM are sitting unused. When they release Qwen 3.6 122B A10B you will be able to switch to it for slightly better quality.
You can also try small MiniMax M2.7 quants, but I'm unsure if they will beat Qwen 3.6 27B quality- and speed-wise at that level of quantization.
ReferenceOwn287@reddit
Your rig is good, I don't see much point in upgrading. I've been using Qwen 3.6 27B on my RTX 5090 ever since it came out. It's truly a game changer for local AI coding, although Claude Code Opus and Codex models are no doubt better.
So here's what I'm doing: when I know the task is well defined and straightforward enough, I give it to Qwen to implement (and save on token cost), and if the task is complex and needs more exploration or architecture decisions, I switch to Cursor, for which I have a subscription.
MK_L@reddit
I'm curious where you're finding the limitations, if you could share.
I'm in the same boat, 5090. I drop really large .md files on it and it implements everything I've planned. It's not one-shotting everything, but since it's free to run I haven't paid as much attention to the limitations... that's why I'm interested in your take on it.
Also, I tell all bots, local included, to keep a handoff.md documenting everything they do. When I switch to another bot, it reads that first and the transition is seamless.
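If it helps, here's roughly what my handoff.md looks like (the section names and contents are made up just for illustration, tweak to taste):

# handoff.md - read this first
## Current task
Notification scheduler, step 3 of 5 from the spec.
## Done so far
- Added the dedupe check in scheduler.ts
## Next up
- Wire up the retry queue
## Gotchas
- Tests assume UTC; don't run them with a local timezone set.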
ReferenceOwn287@reddit
Sure, here's a couple of examples where I had to switch to Claude Opus to solve it. I was building an app a couple of months back called Plan Parrot, and there was a bug where it would send multiple notifications every once in a while (instead of one). Local models and Cursor's composer just could not figure out what was causing it after several iterations; I switched to Opus and it was able to debug it in an hour. Today's latest example: I am putting together a local agent on a tiny model for quick responses, something that can call a bunch of Linux tools without the user having to figure them out, and the tool calls were unreliable. I switched to Opus, asked it to analyze the issue, and it found several flaws in the architecture itself. A lot of it made sense when it explained it; it's churning out the code while I'm typing this, so I hope it works.
MK_L@reddit
Thanks, that paints a good picture of the limitations and where I should switch instead of fighting the model.
I mostly use Claude for writing engines or kernels. I also check all of my architecture with Claude as well, but it tends to want to over-engineer, so I take the polish it offers and reject a lot of the complexity it wants to add.
michaelsoft__binbows@reddit
You have quite a nice rig. Now is the time. I just hopped back on local here too with my 5090. I get 190 tok/s from Qwen3.6 27B, which is unfathomably fast. Using vLLM with 5 tokens of MTP.
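For reference, my launch command looks roughly like this; fair warning that the repo name is a guess and the speculative-config spelling (especially the "mtp" method name) differs between vLLM versions, so check it against your install:

vllm serve Qwen/Qwen3.6-27B --max-model-len 131072 --speculative-config '{"method": "mtp", "num_speculative_tokens": 5}'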
Djagatahel@reddit
27B, are you sure? 190 tok/s is way higher than anything I've read so far.
warpanomaly@reddit (OP)
Oh nice! Thanks for the insights! Yeah I think I'm going to try running Qwen 3.6 27B. That seems to be the consensus.
Right_Weird9850@reddit
With AWQ4?
BelgianDramaLlama86@reddit
You are indeed very much out of the loop my man... Qwen3.5 35B-A3B already blew GLM-4.7-Flash out of the water for code generation, and now there's Qwen3.6, which is even better. Considering your hardware, your best bet is probably Qwen3.6 27B on just your 5090, although you can also run Qwen3.5 122B-A10B. The Qwen3.6 version of that should be out soon. Considering that for code generation the 27B and 122B models were very close before, I'd use the 27B now, but you might consider switching to Qwen3.6 122B when it comes out (might be this coming week, might be the week after... they seem to be staging releases one per week).
warpanomaly@reddit (OP)
This is great! Thank you so much for the information! Should I run the GGUF model of Qwen3.6 27B? And if so, should I just use this command: .\llama-server.exe -hf unsloth/Qwen3.6-27B-GGUF --alias "Qwen3.6" --host 127.0.0.1 --port 10000 --ctx-size 32000 --n-gpu-layers 99? Or what is the optimal way to run it for my hardware?
BelgianDramaLlama86@reddit
I only have experience running GGUFs myself, so take that into consideration, but I think I would, yeah. Unsloth Q6_K_XL is 25.6GB, so you could even fit that with a healthy amount of context. Qwen3.5/3.6 are VERY efficient with context size, so you can run far more than 32k as well and still stay within VRAM. 256k context at Q8 is 2720MB (just checked my own setup), and Qwen seems to tolerate KV cache at Q8 well since attention rotation is a thing, but if you want it at FP16 then you can fit a slightly smaller context or use a slightly lower quant.
I don't know about the launch command though; wouldn't that just download the 'default' Q4? I'd specify the specific one you want... Also make sure you set the parameters that Qwen recommends, so temperature 0.6, top-k 20, top-p 0.8 (for precise coding with reasoning on).
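Something like this is roughly what I'd run, as a sketch rather than gospel: the quant tag and cache/context numbers are my guesses for a 5090, and exact flag spellings can differ between llama.cpp builds (quantized KV cache also needs flash attention enabled):

.\llama-server.exe -hf unsloth/Qwen3.6-27B-GGUF:Q6_K_XL --alias "Qwen3.6-27B" --host 127.0.0.1 --port 10000 --ctx-size 131072 --n-gpu-layers 99 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.6 --top-k 20 --top-p 0.8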
warpanomaly@reddit (OP)
Awesome thanks!
Dui999@reddit
Totally this. Another thing OP might try is whatever quants of MiniMax 2.7 and Deepseek V4 Flash run on his hardware; they could be very interesting.
BelgianDramaLlama86@reddit
Deepseek V4 Flash is 284B though, you don't think that'll be a little much for his setup?
Dui999@reddit
Yes, it could be, although people are still seeing what happens at certain quantisation levels with it. At Q2-Q3 it's actually between 80 and 100 GB. But for sure we need more time to evaluate.
-dysangel-@reddit
Minimax 2.7 is worth trying for sure.
Deepseek V4 is in preview and I've not had stable results with it yet. Looking forward to trying the full thing when they've finished cooking it.
alphatrad@reddit
I wanted to layer something on that others didn't mention.
Qwen is extremely good. But you need to reframe your thinking. You're halfway there. You know they won't be as good as Claude Code... but actually they are as good as Claude Code was in February 2025.
Qwen3.6 is scoring a bit higher than Sonnet 3.7 and a few points shy of Sonnet 4.0.
I mentioned this in another reply, because the other problem is all the agents and tools that have come out in the last year.
Well, the new open models are trained to do 2026 agentic work and tool calls. But their intelligence is Feb 2025 Claude Code launch day.
This is the disconnect.
How did you work with these tools last year? How much more guidance were you giving them? I know I spent more time guiding and hand holding and pointing it at the stuff I wanted changed.
Today with Claude I can just tell it to go and be vague and it one shots stuff.
Qwen3.6 can do all the same tool calls and stuff Opus 4.7 can do today.
But it's still back at Sonnet 4 intelligence. That's the disconnect, and it makes you disappointed: why can't this thing be as good as X? Because when it starts running in OpenCode or something like Pi, it seems pretty amazing.
Until you ask Codex to check the work.
So the trap I find myself falling into is that I'll use Claude or Codex, then switch over to Qwen, treat it the same, and get annoyed.
Gotta remember it requires a different approach to my workflow and prompting.
One thing that helps: having Claude Opus write out a spec, having it break the spec into steps, and then feeding them one at a time to Qwen.
Been having a lot of success with what I call the hybrid model of saving tokens.
Claude writes the spec & validates. Saves tokens.
Qwen does all the editing and tool calls and writing. Tokens are free here.
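To make that concrete, here's a rough sketch of the hand-off step. llama-server exposes an OpenAI-compatible endpoint, so each spec step is just one chat call; the port, model alias, and "no extra scope" system prompt are just how I'd set it up, not anything official:

curl http://127.0.0.1:10000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen3.6", "messages": [{"role": "system", "content": "Implement exactly the step you are given. No extra scope."}, {"role": "user", "content": "<paste step 1 of the Opus-written spec here>"}]}'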
Sorry for the rambling.
warpanomaly@reddit (OP)
This is good insight! Thanks!
samandiriel@reddit
You don't mention it and I couldn't tell from context, but the OS makes a difference here too. If you're running this on Windows, I'd advise making the switch to Linux. We've been getting particularly good results with CachyOS for our setup.
BebeVentreHumide@reddit
I feel you OP.
Because subscriptions are piling up (Gemini Pro, Cursor, Suno, etc.) I thought it'd be fun to spend money I shouldn't be spending on a PC dedicated to AI stuff; ~$8,000 CAD, pure madness.
I started setting things up a week ago and I'm FAR from being done: ComfyUI, LM Studio, Ace-Step, Open WebUI, downloading models, Docker containers, etc. They're pretty much all half-working because I get fed up trying different fixes when dependencies are not compatible with one another and whatnot. So I just skip to the next tool, rinse, and repeat.
I've been spending the last few hours trying to get Qwen3.6-35B-A3B-UD-Q4_K_M (and some other variants) working in Continue in VS Code to generate a simple web page as a test, and all I get are a bunch of "Continue tried to create" errors in the chat (agent mode). When it finally generates code, at some point it stops and I have to tell it to keep going from there. Then I have to take each piece of code and copy-paste it into the file manually because the agent won't do it. And that's on an RTX 5090 ($5700!!). I feel tools like Cursor are way way way ahead. It's getting harder and harder to justify spending that kind of money to go the open source route and failing miserably like I do.
MK_L@reddit
I'm no expert in this, but I just had to set up another PC and I went back to Cline instead of Continue. They're both set up and working, but I was having better results with Cline...
Actually... an uninteresting side note: I just started coding my own coding agent extension for VS Code. I like the Codex interface the most, so I'm looking to mirror that.
warpanomaly@reddit (OP)
Yeah exactly! I wish the gap would get a little bit smaller
jake_that_dude@reddit
Yeah, the big win is getting the whole 27B-class model plus KV cache resident on one 5090. Once you start spilling cache, the speed gap gets ugly fast. I'd start with Qwen3.6-27B or Gemma4-31B before chasing bigger stuff.
warpanomaly@reddit (OP)
Okay awesome thanks! I'm guessing qwen3.6-27b is small enough that I don't have to use a GGUF model? Or should I use the unsloth GGUF version?
MalabaristaEnFuego@reddit
Qwen 3 Coder 30B and GPT OSS 20B for ripping out fast first-draft code. Qwen 3.5 and 3.6 27B are also both great. Gemma 4 and Devstral-Small-2 are also decent for their speed.
RogerRamjet999@reddit
If you've got the funds to spare, one of the nicest cards right now is the RTX 6000 Pro 96GB. They were about $7K a few months ago, but the RAM shortage has unfortunately bumped that to $9K now. It would let you run 120B models, though.
akira3weet@reddit
My opinions:
1. The sweet spot for a 5090 is still a 30-40B-class dense model, such as Qwen3.6-27B or Gemma4-31B, which are considered the latest best. For an Nvidia GPU you really want to fit the entire model weights + KV cache in VRAM to get good speed (a quick way to check this is shown after the list).
2. Jumping to the next model class, 100B or even 400B, is a giant leap. No single consumer GPU can fit it, and unified memory is not that great for speed. It's really hard to get usable speed for coding running them locally. Imagine 100 lines of code is 1000-2000 tokens, and you wait about 1-2 minutes for it to write at 20 t/s, and that's only 100 lines.
3. The best way to use local models is to make them read instead of write, i.e., use them to research and plan. You don't need to worry about token cost, so you're free to take as many turns as you want to ask questions or revise the plan. The model often reads at 2000 t/s, and unlike coding, 20 t/s is still faster than you can read. So that's a perfect fit.
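For point 1, a quick sanity check: watch VRAM while the model loads and while it generates at full context, and make sure you never hit the 32 GB ceiling (how close you can get before spilling depends on driver/OS overhead):

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2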
effyouspez@reddit
Not OP, but what about an RTX 4090?
Aotrx@reddit
You just need to wait a few months. Intelligence density is increasing every day. I think next year we will have LLMs comparable to Opus 4.6 and GPT 5.4 that can run on an RTX 5090 at usable inference speeds.