Qwen3 Coder Next as first "usable" coding model < 60 GB for me
Posted by Chromix_@reddit | LocalLLaMA | View on Reddit | 222 comments
I've tried lots of "small" models < 60 GB in the past. GLM 4.5 Air, GLM 4.7 Flash, GPT OSS 20B and 120B, Magistral, Devstral, Apriel Thinker, previous Qwen coders, Seed OSS, QwQ, DeepCoder, DeepSeekCoder, etc. So what's different with Qwen3 Coder Next in OpenCode or in Roo Code with VSCodium?
- Speed: The reasoning models would often, though not always, produce rather good results. However, now and then they'd enter reasoning loops despite correct sampling settings, leading to no results at all in a large overnight run. Aside from that, the sometimes extensive reasoning takes quite some time across the multiple steps that OpenCode or Roo induce, slowing down interactive work a lot. Q3CN, on the other hand, is an instruct MoE model: it has no internal thinking loops and is relatively quick at generating tokens.
- Quality: Other models occasionally botched the tool calls of the harness. This one seems to work reliably. Also I finally have the impression that this can handle a moderately complex codebase with a custom client & server, different programming languages, protobuf, and some quirks. It provided good answers to extreme multi-hop questions and made reliable full-stack changes. Well, almost. On Roo Code it was sometimes a bit lazy and needed a reminder to really go deep to achieve correct results. Other models often got lost.
- Context size: Coding on larger projects needs context. Most models with standard attention eat all your VRAM for breakfast. With Q3CN having 100k+ context is easy. A few other models also supported that already, yet there were drawbacks in the first two mentioned points.
I run the model this way:
set GGML_CUDA_GRAPH_OPT=1
llama-server -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -ngl 99 -fa on -c 120000 --n-cpu-moe 29 --temp 0 --cache-ram 0
This works well with 24 GB VRAM and 64 GB system RAM when there's (almost) nothing else on the GPU. Yields about 180 TPS prompt processing and 30 TPS generation speed for me.
temp 0? Yes, works well for instruct for me, no higher-temp "creativity" needed. It prevents the very occasional issue of the model outputting an unlikely (and incorrect) token when coding.

--cache-ram 0? The cache was supposed to be fast (30 ms), but I saw 3-second query/update times after each request. So I didn't investigate further and disabled it, as it's only one long conversation history in a single slot anyway.

GGML_CUDA_GRAPH_OPT? Experimental option to get more TPS. Usually works, yet breaks processing with some models.
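Once llama-server is up, the OpenAI-compatible endpoint can be sanity-checked with a request like the one below. This is a sketch: the model alias, host, and port are placeholders that depend on your setup, and the curl invocation is printed as a dry run rather than executed.

```shell
# Request body with temperature 0, matching the server-side --temp 0 choice.
BODY='{"model":"qwen3-coder-next","messages":[{"role":"user","content":"hi"}],"temperature":0}'
# Verify the payload is valid JSON before sending anything.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"
# Dry run: print the invocation; remove the leading echo to actually send it.
echo curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" -d "$BODY"
```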
OpenCode vs. Roo Code:
Both solved things with the model, yet with OpenCode I've seen slightly more correct answers and solutions. But: Roo asks by default about every single thing, even harmless things like running a syntax check via command line. This can be configured with an easy permission list to not stop the automated flow that often. OpenCode on the other hand just permits everything by default in code mode. One time it encountered an issue, uninstalled and reinstalled packages in an attempt of solving it, removed files and drove itself into a corner by breaking the dev environment. Too autonomous in trying to "get things done", which doesn't work well on bleeding edge stuff that's not in the training set. Permissions can of course also be configured, but the default is "YOLO".
Aside from that: Despite running with only a locally hosted model, and having disabled update checks and news downloads, OpenCode (Desktop version) tries to contact a whole lot of IPs on start-up.
Most-Trainer-8876@reddit
How is it possible for you to load this model with `-ngl 99`
I get this error
Chromix_@reddit (OP)
`--n-cpu-moe` lets the experts stay in system RAM. But it's easier to simply not specify `-ngl` these days. Use `--fit-target 256` or so and you should get good enough results.
Most-Trainer-8876@reddit
Thanks, I was able to solve this problem by simply using the `--fit on` flag, no need for guessing CPU-MoE or GPU layers. I am getting about 120 TPS for prompt processing and 20 TPS for generation. Prompt processing is honestly too slow... :-(
Is there any way to improve that?
Chromix_@reddit (OP)
PP indeed looks way too slow, while TG seems OK. Check whether your VRAM maybe spilled into shared memory. Test increasing batch & ubatch size, as discussed here and elsewhere, to speed up PP. Best to run llama-bench for systematic testing.
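A systematic batch-size sweep with llama-bench could look like the sketch below; the model path and token counts are illustrative, and the invocations are printed as a dry run since the binary and model file aren't assumed present here.

```shell
MODEL=Qwen3-Coder-Next-UD-Q4_K_XL.gguf
# One llama-bench invocation per batch size: -b/-ub set batch and micro-batch,
# -p/-n set the prompt and generation token counts for the benchmark run.
CMDS=$(for B in 512 1024 2048 4096; do
  echo "llama-bench -m $MODEL -fa 1 -b $B -ub $B -p 8192 -n 128"
done)
echo "$CMDS"
```

Comparing the reported PP t/s across the four runs shows where the sweet spot for your GPU lies.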
TargetIndependent435@reddit
Which of its quant(s) can I use considering a RTX3090+32GB DDR5, planning on using Claude Code with my codebase.
andrewmobbs@reddit
I've also found Qwen3-Coder-Next to be incredible, replacing gpt-oss-120b as my standard local coding model (on a 16GB VRAM, 64GB DDR5 system).
I found it worth the VRAM to increase `--ubatch-size` and `--batch-size` to 4096, which tripled prompt processing speed. Without that, the prompt processing was dominating query time for any agentic coding where the agents were dragging in large amounts of context. Having to offload another layer or two to system RAM didn't seem to hurt the eval performance nearly as much as that helped the processing.
STUDBOO@reddit
pls make a video or step by step instructions, I am using LM studio
Chromix_@reddit (OP)
Setting it that high gives me 2.5x more prompt processing speed, that's quite a lot. Yet the usage was mostly dominated by inference time for me, and this drops it to 75% due to less offloaded layers. With batch 2048 it's still 83% and 2x more PP speed. Context compaction speed is notably impacted by inference time (generating 20k tokens), so I prefer having as much of the model as possible on the GPU, as my usage is rarely impacted by having to re-process lots of data.
BrightRestaurant5401@reddit
wait, are you offloading layers linearly? Qwen3-Coder-Next is a MoE, so I think it's better to offload up and down and maybe even gate?
Chromix_@reddit (OP)
I used to toy around with regex to find the optimal offload, but these days --fit usually works nicely, even for MoE models.
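The two approaches can be sketched side by side. The `-ot` tensor-override regex below is illustrative (the layer range and model path are assumptions, not a tuned configuration), and both commands are printed as a dry run rather than executed:

```shell
MODEL=Qwen3-Coder-Next-UD-Q4_K_XL.gguf
# Old way: pin the expert FFN tensors of layers 20-39 to CPU via a regex.
OLD="llama-server -m $MODEL -ngl 99 -fa on -ot blk\.(2[0-9]|3[0-9])\.ffn_.*_exps.*=CPU"
# New way: let llama.cpp place tensors automatically to fit available VRAM.
NEW="llama-server -m $MODEL -fa on --fit on"
printf '%s\n%s\n' "$OLD" "$NEW"
```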
inphaser@reddit
How does it perform compared to minimax 2.5 REAP?
spadak@reddit
how did you set it up ? I was under impression that you need much more of VRAM, I have rtx 5070 ti and 96GB of DDR5 and would love to be able to use it locally, I'm on windows
JustSayin_thatuknow@reddit
Try Linux, my friend!
genpfault@reddit
In the tok/s sense, or quality-of-output sense?
-dysangel-@reddit
thanks for that - I remember playing around with these values a long time ago and seeing they didn't improve inference speed - but didn't realise they could make such a dramatic difference to prompt processing. That is a very big deal
inphaser@reddit
Idk what I'm doing wrong. I just managed to try it in llama.cpp after a SYCL bug was fixed and made it into the Docker image.
But the results are just unusable. I mean, what are we supposed to do with these results like below?
Chromix_@reddit (OP)
That looks broken, but in a special way. It looks like your prompt isn't being sent to the model. These "free form" results are what you get when you run inference without specifying a prompt.
Try it via CLI to see if you get better results:
llama-cli -m Qwen3-Coder-Next-IQ4_XS.gguf -fa on -c 4096 --temp 0 -p "hi"
inphaser@reddit
Thanks i just tried, it looks the same:
$ docker run -it --rm --name llama.cpp --network=host --device /dev/dri -v $MODEL_DIR:/models ghcr.io/ggml-org/llama.cpp:light-intel -m /models/Qwen3-Coder-Next-IQ4_XS.gguf -ngl 99 -np 1 -c 32768
load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
Loading model...
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
(warning repeated many times while loading)
build : b8172-a8b192b6e
model : Qwen3-Coder-Next-IQ4_XS.gguf
modalities : text
available commands: /exit or Ctrl+C to stop or exit, /regen to regenerate the last response, /clear to clear the chat history, /read to add a text file
> hi
Type-SupportOpenXMLDimensionedOpen
[ Prompt: 1.1 t/s | Generation: 1.9 t/s ]
> can you write hello world in c?
) {} {} {} {
[ Prompt: 3.4 t/s | Generation: 2.0 t/s ]
Chromix_@reddit (OP)
Hmmm, you could download or create a CPU-only build of llama.cpp, without any SYCL functionality integrated. If it works with that version then it's a SYCL bug that you could create an issue for on GitHub. If it still doesn't work then download another quant from some other repo to see if your existing file might be corrupted or simply a bad quant. If it's still broken then... uh.. good luck.
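Configuring such a CPU-only build is roughly two cmake calls, run inside a llama.cpp checkout; they are printed as a dry run here since no source tree is assumed:

```shell
# Disable the GPU backends explicitly so only the CPU path is compiled.
CFG="cmake -B build-cpu -DGGML_SYCL=OFF -DGGML_CUDA=OFF"
BUILD="cmake --build build-cpu --config Release -j"
printf '%s\n%s\n' "$CFG" "$BUILD"
```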
inphaser@reddit
Thanks! Indeed CPU works (and also runs at 5 t/s vs. 2 for SYCL)
Chromix_@reddit (OP)
Please make sure to report that as a SYCL issue with all your details then, so that it can get fixed (and you'll get faster speeds)
BrianJThomas@reddit
Couldn't do any tool calls successfully for me in opencode and I gave up.
benevbright@reddit
have you tried unsloth version?
Practical-Bed3933@reddit
I get "I don't have access to a listdir tool in my available tools." with Ollama on Mac. Did you resolve it u/BrianJThomas ?
Chromix_@reddit (OP)
Latest updated model quant (at least Q4), latest llama.cpp, latest opencode? There were issues 5 days ago that have been solved since then. I have not seen a single failed tool call since then.
wisepal_app@reddit
i get "invalid [tool=write, error=Invalid input for tool write: JSON parsing failed:" error with opencode. i am using latest llama.cpp with cuda and unsloth ud Q4_K_XL GGUF quant. Any idea what could be the problem?
Chromix_@reddit (OP)
Not really without further details. You get invalid JSON - like, for example, the model messing up parentheses, which is something that it occasionally did before the fixes. If you can get the actual JSON output from OpenCode that's not accepted, then that would tell what's broken.
wisepal_app@reddit
Sorry this is the full message: "invalid [tool=write, error=Invalid input for tool write: JSON parsing failed: Text: {"content":"# Tokenizer Package\n__version__ = "1.0.0"","filePath":"C:\Users\Hp Studio\Desktop\Tokenizer\init.py","filePath"C:\Users\Hp Studio\Desktop\Tokenizer\models\init.py"}. Error message: JSON Parse error: Expected ':' before value in object property definition]"
Chromix_@reddit (OP)
Hmm, that's strange. The model didn't escape any JSON data at all, and it failed at the file path, writing "filePath"C:\Users instead of "filePath":"C:\Users
That looks broken, it shouldn't happen on temperature 0.
Try adding something like "Ensure correctly escaped JSON strings for tool calls" to your prompt and see if it does anything. Yet that second error looks like broken inference or model.
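The escaping part of the failure is easy to reproduce locally. In the sketch below, the unescaped Windows path from the error message fails JSON parsing, while the escaped variant passes:

```shell
# Unescaped backslashes produce invalid JSON escape sequences (\U, \H, ...).
BAD='{"filePath":"C:\Users\Hp Studio\Desktop\Tokenizer\init.py"}'
# Doubling the backslashes yields valid JSON.
GOOD='{"filePath":"C:\\Users\\Hp Studio\\Desktop\\Tokenizer\\init.py"}'
echo "$BAD"  | python3 -m json.tool > /dev/null 2>&1 && echo "bad: parsed"  || echo "bad: rejected"
echo "$GOOD" | python3 -m json.tool > /dev/null 2>&1 && echo "good: parsed" || echo "good: rejected"
# prints "bad: rejected" then "good: parsed"
```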
SatoshiNotMe@reddit
Itβs also usable in Claude Code via llama-server, set up instructions here:
https://github.com/pchalasani/claude-code-tools/blob/main/docs/local-llm-setup.md
On my M1 Max MacBook 64 GB I get a decent 20 tok/s generation speed and around 180 tok/s prompt processing
benevbright@reddit
super slow for me. M2 Max 64gb
SatoshiNotMe@reddit
much worse than 20 tok/s with CC ?
benevbright@reddit
made video with Claude Code. super slow: https://youtu.be/ok3tNaWfq2Y?si=tQgWgHxaZnxu02PW
AcePilot01@reddit
is claude local?
wanderer_4004@reddit
Same hardware here - M1 Max MacBook 64 GB. With MLX I get 41 tok/s TG and 360 tok/s PP. However, the MLX server is not as good as llama.cpp at KV caching, and especially at branching. It also occasionally seems to leak memory. I'm using Qwen Code and am quite happy with it.
Consumerbot37427@reddit
Also on Apple Silicon w/ Max. I have had lots of issues with MLX, I might stop bothering with them and just stick with GGUFs. Waiting for prefill is so frustrating, and seeing log messages about "failed to trim x tokens, clearing cache instead" drove me nuts.
I had been doing successful coding with Mistral Vibe/Devstral Small, but the context management issue plus the release of Qwen3 Coder Next inspired me to try out Claude Code with LM Studio serving the Anthropic API, and it seems amazing! It seems to be much better at caching prefill and managing context, so not only do I get more tokens per second from a MoE model, the biggest bonus is how much less time is spent waiting for the context/prefill. Loving it!
wanderer_4004@reddit
Actually I have now a workflow that works really well for me with Qwen Code and mlx-server. The important thing is to compress the context after each bug fix or feature. Then I say just 'hi' to load the new compressed context again and the system is ready for immediate answers. The important part was to limit context size in the settings.json.
I tested yesterday evening the latest llama.cpp and unsloth Qwen3-Coder-Next-MXFP4_MOE.gguf with Qwen Code and it has trouble with tool calling. Maybe the MXFP4 gguf is no good...
Consumerbot37427@reddit
Man, I've had bad luck with the unsloth quants, too. I've got 96GB so I can run Q6, but dropped to Q4 so I could get 200k tokens of context. Maybe try an official quant?
Haven't tried Qwen Code yet. Vibe was the first CLI coding tool I tried, then Claude Code. And there's OpenCode... Too many options, and it all moves so fast, I hate committing and investing too much in learning a tool.
That sounds fantastic!
wanderer_4004@reddit
Yes, the same here. What I like about a fully local toolchain is that I get a much better feeling for token usage. Because ultimately Anthropic is in the token-selling business, and obviously they want to hook each and every developer on using as many tokens as possible.
Their goal is to convince employers that what $100 spent on a developer achieves, the same can be achieved with $90 spent on tokens, thus a gain of 10%. That's the endgame. Or it would be, if LocalLLaMA did not exist.
What I like about Gemini CLI and Qwen CLI (which is based on Gemini CLI) is that those companies are not primarily in the token-selling business. China especially is, due to export restrictions, heavily interested in efficient token use - at least for now.
txgsync@reddit
Yep, this behavior led me to write my own implementation of an MLX server with "slots" like llama.cpp has, so more than one thing can happen at a time. FLOPS/byte goes up!
Inferencer and LM Studio both now support this too. If you use Gas Town for parallel agentic coding this dramatically speeds things up for your Polecats. Qwen3-Coder-Next is promising on Mac with parallel agentic harnesses. But I have to test it a bit harder.
wanderer_4004@reddit
I have started to look into the kv cache. Especially saving to disk and loading from disk and also making it more resilient against branching and interruptions. But no real code yet. Unless you want to commercialise it, just put it out somewhere on Github...
txgsync@reddit
My dumb little bash scripts with llama.cpp don't yet deserve publication :). But I think we could reduce benchmark time on M4 Max for lalmbench from over an hour to just a few minutes: https://github.com/txgsync/lalmbench
crantob@reddit
Your writeup is helpful to my stage of ignorance. Thank you.
Chromix_@reddit (OP)
Claude Code uses a whole lot of tokens for the system prompt though, before any code is processed at all. OpenCode and Roo used less last time I checked. Still, maybe the results are better? I haven't tested Claude CLI with local models so far.
Purple-Programmer-7@reddit
Opencode > Claude code. It's okay that people don't listen though.
arcanemachined@reddit
No kidding. The quality of the interfaces is like night and day.
Claude Code feels like I'm in a vibe-coded fever dream. I'm sure that OpenCode is written with LLM-assisted code, but the interface feels so much more coherent to me, it's not even funny.
cleverusernametry@reddit
Roo has a very large system prompt as well, no? I'm guessing opencode is the same deal.
Chromix_@reddit (OP)
Roo is about 9K tokens and OpenCode 11K.
msrdatha@reddit
Initially I was testing both CC and opencode, but then Claude started the drama of limiting other agents and tools on API usage etc. This made me think: maybe CC will not be good for local AI; the moment they feel it's gaining traction, we would suddenly be hit with some artificially introduced limitations. So I left CC for good and continued with opencode and Kilo.
SatoshiNotMe@reddit
There seems to be a lot of confusion about this: Anthropic has nothing against using the Claude Code harness with other LLMs. They even have a guide for this:
https://code.claude.com/docs/en/llm-gateway
However what Anthropic specifically is allergic to is when other apps or coding agents try to leverage the all-you-can-eat buffet subscriptions (pro/max) to avoid API costs.
SatoshiNotMe@reddit
Yes CC has a sys prompt of at least 20K tokens. On my M1 Max MacBook the only interesting LLMs with good-enough generation speed are the Qwen variants such as 30B-A3B and the new coder-next. GLM-4.7-flash has been bad at around 10 tok/s.
XiRw@reddit
Why don't you use their website at this point if you are going non-local with Claude instead of tunneling through an API?
SatoshiNotMe@reddit
I mainly wanted to use the 30B local models for sensitive document work, so can't use an API, and needed it to run on my Mac. I really wouldn't use 30B models for serious coding; for that I just use my Max sub.
XiRw@reddit
Ah okay that makes sense then
benevbright@reddit
Agreed. I made live demo video with qwen3-coder-next on my 64GB Mac Studio. https://youtu.be/ok3tNaWfq2Y?si=tQgWgHxaZnxu02PW it would be great to get some feedback.
AcePilot01@reddit
Hey, I'm back lol. I was having issues with my Ollama (idk why, but it like lost my models even though I had them), something weird with Docker probably, so I deleted them. But I am having trouble figuring out how to get this working.
For one, where did you get the GGUF? Two, I did see it was 160 GB lmfao. So how did you install this? I was going to use Open WebUI probably, unless it would be impossible that way.
Chromix_@reddit (OP)
I've used Qwen3-Coder-Next-UD-Q4_K_XL.gguf in this test. You can get that or any other quant that fits your VRAM on HF.
AcePilot01@reddit
yeah I am using that one now too, since we both only had 24gb vram
what speeds are you getting? I am curious, how can you "benchmark" these? Obv different asks have it give a different reply that sometimes is more or less tokens per second.
Chromix_@reddit (OP)
Prompt processing between 150 and 600 TPS, inference around 30 TPS - all depending on the used options. Check the llama-bench documentation on how to run tests yourself, as well as other comments in this thread for more numbers.
AcePilot01@reddit
ok were those numbers from bench or real world asks?
what context you using?
AcePilot01@reddit
You know, I just noticed, how do you do ngl 99? 99 layers should be larger than the gpu right?
FairAlternative8300@reddit
Pro tip for Windows users with 16GB VRAM like the 5070 Ti: the `--n-cpu-moe` flag is the magic sauce here. It offloads the MoE expert layers to CPU while keeping attention on GPU, so you get decent 20+ tok/s generation without needing a 5090.
With 96GB DDR5 you should be golden. Try something like:
`llama-server.exe -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -ngl 99 -fa on -c 80000 --n-cpu-moe 28`
Start with fewer CPU-MOE layers and increase until it fits in VRAM. Flash attention (`-fa on`) helps a lot with the 16GB constraint too.
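Following that tip, stepping `--n-cpu-moe` and watching VRAM usage at each step is one way to home in on the right value. The sketch below uses illustrative layer counts and prints the invocations as a dry run:

```shell
MODEL=Qwen3-Coder-Next-UD-Q4_K_XL.gguf
# Start low and increase --n-cpu-moe until the model fits in VRAM without
# spilling; each step moves one more expert layer off the GPU.
CMDS=$(for N in 24 26 28 30; do
  echo "llama-server -m $MODEL -ngl 99 -fa on -c 80000 --n-cpu-moe $N"
done)
echo "$CMDS"
```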
simracerman@reddit
Good reference. I will give opencode and this model a try.
At what context size did you notice it lost the edge?
Chromix_@reddit (OP)
So far I haven't observed any issue where it did something it "should have known better" about, given relevant information in its context, but I "only" used it up to 120k. Of course its long-context handling is far from perfect, yet it seems good enough in practice for now. Kimi Linear should be better in that aspect (not coding though), but I haven't tested it yet.
zoyer2@reddit
you mean 48B? I've tested it and sadly it's not so good, Qwen3 coder next is a lot better
Terminator857@reddit
failed for me on a simple test. Asked to list recent files in directory tree. Worked. Then asked to show dates and human readable file sizes. Went into a loop. Opencode q8. Latest build of llama-server. strix-halo.
Chromix_@reddit (OP)
That was surprisingly interesting. When testing with Roo it listed the files right away, same as in your test with OpenCode. Then after asking about dates & sizes it started asking me back, not just once like it sometimes does, but forever in a loop. Powershell or cmd, how to format the output, exclude .git, only files or also directories, what date format, what size format, sort order, hidden files, and then it kept going into a loop asking about individual directories again and again. That indeed seems to be broken for some reason.
CSEliot@reddit
Perhaps when it comes to metadata, that's where the issue is? Is this on Linux or Windows?
Chromix_@reddit (OP)
The point is: It didn't even try to read the data, it just kept asking about what to include and how to output it, without issuing a single command (except for the generic directory list, directly supported by the harness, independent of the OS).
CSEliot@reddit
Is a unix command like "ls" typically how opencode agents get file info?
AccomplishedLeg527@reddit
I am running it on an 8 GB VRAM 3070 Ti laptop with 32 GB RAM :). First attempt was using accelerate offloading to disk (as the 80 GB safetensors won't fit in RAM), and I got 1 token per 255 seconds. Then I wrote custom offloading and reached 1 token per second (!!! 255x speedup !!!). On a desktop 3070 with a full PCIe bus it should be 2x faster, plus 2x faster because of the desktop GPU, and RAID 0 (2 NVMe SSDs) can give 3-5x on loading expert weights. With more VRAM (12-16 GB), more weights can be cached on CUDA (right now I get a 55% cache hit rate using only 3 GB VRAM). In total, 8-16 GB mid-range desktop cards can run it at 3 to 10 tokens per second with only 32 GB RAM. If someone is interested I can share how I did this. Or should I patent it?
Chromix_@reddit (OP)
So, you're running the 80GB Q8 quant on a system with 40 GB (V)RAM in total. Your SSD isn't reading the remaining 40+ GB once per second, but it also doesn't need to, since it's a MoE.
There was a posting here a while ago where someone compiled some stats on the predictability of expert selection per token and got things 80%+ correct IIRC. With that approach one can (pre)load the required experts to maximize generation speed without having to wait that much for the comparatively slow SSD. Maybe you did something similar? Or is it just pinning the shared expert(s) and other parts that are needed for each token into (V)RAM?
AccomplishedLeg527@reddit
The most frequent expert indexes are cached in VRAM for each layer; with only 3 GB free VRAM I get a 43-55% cache hit rate. For RAM I have 2 options: one uses mmap to speed up loading, or without mmap RAM is not needed at all (maybe up to a few MB for transfers). The model without experts uses only 4.6 GB VRAM (+ we need some memory for context).
DOAMOD@reddit
https://i.redd.it/j3rbhpq2zfig1.gif
For me, it's been a bit disappointing in some tests, and also in a coding problem where the solution wasn't very helpful. It doesn't seem very intelligent. I suppose it will be good for other types of coding tasks like databases, etc. I had high expectations.
chickN00dle@reddit
spinning fishies can't be anything other than a success
joking
AcePilot01@reddit
I forgot to ask, is this the 160gb version?
Chromix_@reddit (OP)
Qwen3 Coder Next is 160 GB in the distributed base version, yet the quantized GGUFs in the 50 to 60 GB range work quite well.
AcePilot01@reddit
I kind of figured, since the 160 GB wouldn't fit lol. Wasn't sure if (since it was a bunch of tensor files) maybe it worked differently lol.
I did try downloading a GGUF version and setting it up with llama.cpp, but never could get it to work, unfortunately.
Gimme_Doi@reddit
thanks
UmpireBorn3719@reddit
I use RTX 5090, AMD 9900X, RAM 64GB, MXFP4
Result: Prefill around 1500 tps, generation around 50 tps
slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 241664, batch.n_tokens = 4096, progress = 0.991353
slot update_slots: id 2 | task 3 | n_tokens = 241664, memory_seq_rm [241664, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 243260, batch.n_tokens = 1596, progress = 0.997900
slot update_slots: id 2 | task 3 | n_tokens = 243260, memory_seq_rm [243260, end)
slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 243772, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id 2 | task 3 | prompt done, n_tokens = 243772, batch.n_tokens = 512
slot init_sampler: id 2 | task 3 | init sampler, took 18.45 ms, tokens: text = 243772, total = 243772
slot update_slots: id 2 | task 3 | created context checkpoint 1 of 32 (pos_min = 243259, pos_max = 243259, size = 75.376 MiB)
slot print_timing: id 2 | task 3 |
prompt eval time = 170074.62 ms / 243772 tokens ( 0.70 ms per token, 1433.32 tokens per second)
eval time = 4125.05 ms / 182 tokens ( 22.67 ms per token, 44.12 tokens per second)
total time = 174199.66 ms / 243954 tokens
slot release: id 2 | task 3 | stop processing: n_tokens = 243953, truncated = 0
srv update_slots: all slots are idle
Savantskie1@reddit
I'm currently using vs code insiders, I can't use cli coding tools. So can you check to see if this model will work with that? I use LM Studio, I don't care if llama.cpp is faster, I won't use it so don't suggest it please.
Chromix_@reddit (OP)
Roo Code is a VSCode plugin that you can use with any OpenAI-compatible API, like the one LM Studio provides. Out of interest: is there a specific reason to stick with LM Studio if it's only used as an API endpoint for an IDE (or IDE plugin)? The difference can be very large, as another commenter found out.
Savantskie1@reddit
I don't care about speed, I care about ease of use and being able to load and unload a model without needing to spawn a separate instance of the model runner. That's just a waste of resources.
Chromix_@reddit (OP)
llama-server support for loading / switching models via API was added a few months ago. In terms of ease of use you'd indeed need to use something like OpenWebUI with llama-server if the standard functionality isn't sufficient for you. Ease of use is probably also why lots of people use ollama.
Savantskie1@reddit
I may revisit it if it's truly easier to use, but I don't chase numbers. So I think I'll be fine using lm studio for now.
Hot_Turnip_3309@reddit
Ok so I have everything up to date and downloaded multiple GGUFs... tool calling does NOT work
Chromix_@reddit (OP)
I know, it's always annoying to have that "it seems to work for everyone else, but not for me" case. Maybe go through the support / ticket process with your inference engine and agent harness and collect the necessary information; maybe the logits could be interesting for the tool call as well. Maybe there is some inference error left that happens to strike mostly in your specific use case.
ai_tinkerer_29@reddit
This resonates with my experience too. I've been bouncing between different models for coding work and the MoE architecture really does make a difference for speed without sacrificing too much quality.
Quick question: How does the tool-calling reliability compare to something like DeepSeek-V3 or QwQ in your experience? I've had issues with some models hallucinating tool calls or breaking the JSON format mid-stream.
Also, curious about your OpenCode vs Roo Code comparison. The "YOLO permissions" thing in OpenCode is exactly why I've been hesitant. Did you end up configuring stricter permissions, or just stick with Roo for production work?
Appreciate the detailed write-up on the llama-server flags too. The GGML_CUDA_GRAPH_OPT tip is gold; didn't know about that one.
Chromix_@reddit (OP)
I didn't compare to DeepSeek-V3 as it's not in the same weight class. QwQ is old; it's a good model, but tool calling wasn't trained as extensively back then.
The permissions issue was more an "allow/deny by default" thing, combined with OpenCode really trying hard to get things working, even when it made no sense. I went for stricter permissions combined with safe utility scripts to execute.
AcePilot01@reddit
Are your comparisons of Opencode and roo code compared to Qwen3 coder next, or am I missing something? or are those agents what you USE this model with?
Chromix_@reddit (OP)
You cannot compare "OpenCode" to "Qwen3", because OpenCode is a harness for using LLMs, and Qwen3 is a LLM. My post is about using both OpenCode as well as Roo Code with Qwen3 Coder Next (Q3CN).
You can also use OpenWebUI with Q3CN, but it doesn't give you any agentic coding functionality like OpenCode or Roo. You could paste in code though.
No, Roo Code is a plugin for VSCode (an IDE), so if you install it you have agentic coding in an IDE. Of course you could also rewire the Copilot that's forced into VSCode for local LLMs. OpenCode is less of an IDE, but more a vibe-coding tool.
AcePilot01@reddit
OH ok, when I went to Open code's site they seemed to indicate it was a subscription/online thing. Not local.
Chromix_@reddit (OP)
Quite a few offer some easy online services - no local setup required, yes. Although there are quite often fully local options available.
AcePilot01@reddit
I might copy your settings there, cus I also have a 4090 and 64 GB of RAM lol
Chromix_@reddit (OP)
You'll need to ensure you have sufficient free VRAM to achieve similar numbers - or tweak the `--n-cpu-moe` parameter a bit.
AcePilot01@reddit
didn't you claim to have the same vram? lmfao
Chromix_@reddit (OP)
Oh, Linux is fine. It's mostly that users on Windows with a single GPU sometimes have so many additional processes occupying their VRAM that they don't have the full capacity left for LLMs, which is why exact offload numbers would lead to exceeding the available capacity and thus to slowdowns.
fadedsmile87@reddit
I have an RTX 5090 + 96GB of RAM. I'm using the Q8_0 quant of Qwen3-Coder-Next with ~100k context window with Cline. It's magnificent. It's a very capable coding agent. The downside of using that big a quant is the tokens per second: I'm getting 8-9 tokens/s for the first 10k tokens, then it drops to around 6 t/s at 50k full context.
Chromix_@reddit (OP)
That's surprisingly slow, especially given that you have a RTX 5090. You should be getting at least half the speed that I'm getting with a Q4. Did you try with my way of running it (of course with manually adjusted ncmoe to almost fill the VRAM)?
fadedsmile87@reddit
I have 2x 48GB DDR5 mem sticks. 6000 MT/s (down from 6400 for stability)
i9-14900K
I'm using the default settings in LM Studio.
context: 96k
offloading 15/48 layers onto GPU (LM Studio estimates 28.23GB on GPU, 90.23GB on RAM)
Chromix_@reddit (OP)
Ah, just two modules then, so that should be fine. You could try the latest llama.cpp as a comparison, and play around with manual CPU masks. The E-cores and the default thread placement had a tendency to slow things down a lot in the past. You could also try the same Q4 that I used: if your TPS matches mine or is lower, then there's likely something you can improve.
fadedsmile87@reddit
I downloaded the Q4_K_M variant (48GB size). I tested it and got 14 t/s for a 3k token output.
You're right. Something must be off in my settings if you're getting twice as that with a less powerful GPU and less VRAM. I'm not very familiar with llama.cpp. I'm a simple user lol.
Chromix_@reddit (OP)
With LM Studio I guess? Well, try llama.cpp then. Download the latest release, open a cmd, start with the exact two lines that I posted (don't forget about the graph opt), check the speed, then use this to see if something improves:
llama-server -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128
SillypieSarah@reddit
my llama.cpp can't find the model directory.. i wanna use the q6 i already have that's split into two files, but idk how @~@
fadedsmile87@reddit
What is this sorcery?!
I got 40 t/s on Q4 variant and 27 t/s on the Q8 variant. How is it possible that LM Studio is doing such a bad job at utilizing my GPU?
This is amazing! And I thought I'd have to upgrade to RTX 6000 Pro to get fast speeds lol
Thank you!
By the way, are there any tradeoffs with your settings? Does it hurt quality?
Anarchaotic@reddit
I don't understand how you're getting 40 t/s - I have the exact same specs as you but I'm only seeing 10 t/s token generation and 45 t/s prompt processing.
In LM Studio (not llama-server), what were your settings and the tokens/s you got from that same exact model?
fadedsmile87@reddit
See nasone32's response. He helped me achieve the same performance in LM Studio as I did using llama server.
In the new LM Studio versions, there's an option called "number of layers for which to force MoE weights onto CPU". Instead of partially offloading layers to the GPU, offload all of them, and set that option to the difference between the total layer count and the number of layers you previously offloaded.
This should speed things up a lot.
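As a back-of-envelope for the setting described above, here's a minimal sketch using this thread's example numbers (a 48-layer model where 15 layers were previously offloaded); adjust both values for your own model and VRAM:

```shell
# Derive the "force MoE weights onto CPU" count from an old
# partial-offload setup (numbers from this thread, not universal).
total_layers=48        # push all layers to the GPU via the offload slider
old_gpu_layers=15      # layers you previously managed to offload
moe_on_cpu=$((total_layers - old_gpu_layers))
echo "GPU offload: ${total_layers}, MoE weights on CPU: ${moe_on_cpu}"
```

With these inputs that yields 33, matching the value used later in this thread.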
Anarchaotic@reddit
Hm, I tried the same exact prompt in llama.cpp and in LM Studio, but couldn't get past 25 t/s response time. Maybe something else is bottlenecking me.
nasone32@reddit
You can achieve the same in LM Studio. You don't have to partially offload layers: set the GPU offload slider to 48 (all the way to the right, all layers on GPU), then set "Number of layers for which to force MoE on CPU" to the smallest number that still fits. That does the same thing llama.cpp does with those settings.
fadedsmile87@reddit
Wow, you are absolutely correct! I just tested it.
Instead of 15/48 layers in the GPU offload setting, I set it to 48/48 and put 33 layers for "number of layers for which to force MoE weights onto CPU".
This is awesome! I like LM Studio UX better than llama.cpp anyway haha
tmvr@reddit
The --fit and --fit-ctx parameters do the heavy lifting. They put everything important into VRAM (dense layers, KV cache, context) and then deal with the sparse expert layers: whatever fits goes into VRAM, the rest goes into system RAM. And of course -fa on makes sure that your memory usage for the context doesn't go through the roof.
Chromix_@reddit (OP)
Congrats, you just got the performance equivalent of an additional $2000 of hardware for free. No trade-offs, no drawbacks, just unused PC capacity that you're now using.
Well, you might now want to get OpenWebUI or similar to connect to your llama-server if you want a richer UI than the one llama-server provides.
fragment_me@reddit
That's expected. I get 17-18 tok/s with a 5090 and DDR4 using UD Q6_K_XL with
.\llama-server.exe -m Qwen3-Coder-Next-UD-Q6_K_XL.gguf `
-ot "\.(19|[2-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" `
--no-mmap --jinja --threads 12 `
--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --ctx-size 128000 -kvu `
--temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 `
--host 127.0.0.1 --parallel 4 --batch-size 512
fadedsmile87@reddit
I was using LM Studio.
Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128
Now I'm getting 27 t/s on the Q8_0 quant :-)
fragment_me@reddit
You should try Q8 KV cache, data shows it's pretty much the same.
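For reference, KV cache quantization is just two flags on llama-server. A sketch based on the OP's command from this thread (model path, context size and offload count taken from there; the memory saving vs. f16 KV is approximate, roughly half):

```shell
# Sketch: same setup as the OP, but with the KV cache quantized to Q8_0
# to reduce the context's memory footprint.
llama-server -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -ngl 99 -fa on -c 120000 \
  --n-cpu-moe 29 --temp 0 --cache-ram 0 \
  --cache-type-k q8_0 --cache-type-v q8_0
```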
blackhawk00001@reddit
Same setup here, 96GB/5090/7900x/windows/VS code IDE with kilo code extension.
Try using llama.cpp, below are the commands that I'm using to get 30 t/s with Q4_K_M and 20 t/s with Q8. The Q8 is slower but solved a problem in one pass that the Q4 could not figure out. Supposedly it's much faster on vulkan at this time but I haven't tried yet.
.\llama-server.exe -m "D:\llm_models\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf" --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 -fa on --fit on -c 131072 --no-mmap --host
fadedsmile87@reddit
I was using LM Studio.
Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128
Now I'm getting 27 t/s on the Q8_0 quant :-)
blackhawk00001@reddit
Are you deploying your models on Linux or Windows? I tried your settings but had to cancel because prompt processing became much slower. Output was still 20 t/s for me.
I noticed that your startup commands resulted in all of the model stored in RAM where mine was split between RAM and VRAM.
I'll try mixing settings once I can research what they all are.
fadedsmile87@reddit
not sure what you mean by "startup commands resulted in all of the model stored in RAM". My GPU shows 31.1/31.5 GB usage and my RAM is 92.2/95.7 GB in Windows Task Manager -> Performance.
I'm using Windows.
I made another test now.
prompt eval is 122 t/s (a 2.5k token prompt)
output was 26.17 t/s (an additional 3k token output)
blackhawk00001@reddit
Which server distribution are you using? I'm now getting the same as you using vulkan server. My settings caused an out of memory error while loading on vulkan.
Hopefully the llama.cpp distribution for CUDA is optimized soon.
fadedsmile87@reddit
I downloaded and installed the latest release (b7972).
And chose these:
blackhawk00001@reddit
I did some more testing and fed the logs back into the CUDA Q4 to summarize results. I found that prompt processing speed matters more to me than the small gain in generation speed, especially when loading documents and files from the workspace. CUDA is still faster than Vulkan, but not by much at the moment for Q8. CUDA Q4 is many times faster than Vulkan Q4, though I might have configured something wrong since it has different startup attributes. Most interesting: CUDA produced many more prompt tokens than Vulkan, reducing the effect of the faster processing. I wonder if that affects accuracy. If you're getting those results with Q8 on CUDA, I'd be curious how many tokens it handles in the prompt and response. I tested each by setting it to architect mode, asking what it would take to change the background of my home page, and letting it plan and then make the change.
blackhawk00001@reddit
Something about storing experts on the gpu and everything else on the cpu. I'm still learning so might not explain it well. For comparison, I'm sitting at 67/95.1GB RAM and 30.9/31.5GB GPU used, 1GB shared GPU memory. I'd need to reload with your settings but I had similar RAM usage but my Shared GPU memory was higher so there might have been extra swapping going on.
I've seen a range of 160-300 t/s prompt and average 20t/s depending on the task. I need to test with the cuda 12.4 and vulkan servers to see if there's any difference.
blackhawk00001@reddit
Nice, I'll try those settings
TBG______@reddit
I tested: llama.cpp + Qwen3-Coder-Next-MXFP4_MOE.gguf on an RTX 5090 - Three Setups Compared
Setup 1 - Full GPU Layers (VRAM-heavy)
VRAM Usage: ~29 GB dedicated
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 28 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (65k token prompt):
Prompt eval: 381 tokens/sec
Generation: 8.1 tokens/sec
Note: Generation becomes CPU-bound due to partial offload; high VRAM but slower output.
Setup 2 - CPU Expert Offload (VRAM-light)
VRAM Usage: ~8 GB dedicated
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" -ot ".ffn_.*_exps.=CPU" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (70k token prompt):
Prompt eval: 60-140 tokens/sec (varies by cache hit)
Generation: 20-21 tokens/sec
Note: Keeps attention on GPU, moves heavy MoE experts to CPU; fits on smaller VRAM but generation still partially CPU-limited.
Setup 3 - Balanced MoE Offload (Sweet Spot)
VRAM Usage: ~27.6 GB dedicated (leaves ~5 GB headroom)
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --n-cpu-moe 24 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (95k token prompt):
Prompt eval: 105-108 tokens/sec
Generation: 23-24 tokens/sec
Note: First 24 layers' experts on CPU, rest on GPU. Best balance of VRAM usage and speed; ~3x faster generation than Setup 1 while using similar total VRAM.
Recommendation: Use Setup 3 for Claude Code with large contexts. It maximizes GPU utilization without spilling, maintains fast prompt caching, and delivers the highest sustained generation tokens per second.
Any ideas to speed it up ?
RevolutionaryTrust12@reddit
Check my reply!! i got 70 tk/s
Chromix_@reddit (OP)
With so much VRAM left on setup 3 you can bump the batch and ubatch size to 4096 as another commenter suggested. That should bring your prompt processing speed to roughly that of setup 1.
TBG______@reddit
Thanks. I needed a bit more ctx size, so I did: $env:GGML_CUDA_GRAPH_OPT=1
A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --n-cpu-moe 24 --ctx-size 180224 --batch-size 4096 --ubatch-size 2048 --threads 32 --threads-batch 32 --parallel 1
Speed (145k token prompt):
Prompt eval: 927 tokens/sec
Generation: 23 tokens/sec
Interactive speed (cached, 200-300 new tokens):
Prompt eval: 125-185 tokens/sec
Generation: 23-24 tokens/sec
Chromix_@reddit (OP)
Looks good, best of both worlds. Your interactive prompt speed is low because only a few new tokens get added, way below the batch size. The good thing is: it doesn't matter, since "just a few tokens" get processed quickly anyway.
Easy_Kitchen7819@reddit
Try DeepSwe
Money-Frame7664@reddit
Do you mean agentica-org/DeepSWE-Preview ?
Easy_Kitchen7819@reddit
Yes
Danmoreng@reddit
Did you try the fit and fit-ctx parameters instead of ngl and n-cpu-moe ? Just read the other benchmark thread and tested on my hardware, it gives better speed.
tmflynnt@reddit
FYI that I added an update to that thread with additional gains based on people's comments.
Chromix_@reddit (OP)
Yes, tried that (and even commented how to squeeze more performance out of it) but it's not faster for me, usually a bit slower.
AcePilot01@reddit
How do you "run " one like that?
I use Openwebui and ollama, so when I download them (forget how they even get placed in there, lmfao I just have ai do it all haha)
Chromix_@reddit (OP)
Ditch ollama for llama.cpp. He could do it, you can do it too. (To be fair you can also connect OpenCode to ollama, but why not switch to something nicer while being at it?)
AcePilot01@reddit
Maybe, trying to get it to work in Openwebui is being a freaking pain. having to merge them all etc, it should be as easy as downloading the model and sticking it in a damn folder lol. having to vibe code it to work is getting old lmfao
qubridInc@reddit
This aligns with what weβve observed as well. Qwen3 Coder Next works better in practice mainly because itβs an instruction-tuned MoE, not a reasoning-style model. That avoids internal reasoning loops and keeps latency predictable, which really matters for agent-style tools and long runs.
Tool calling and structured outputs are noticeably more reliable, and the long context (100k+) is actually usable on 24 GB VRAM thanks to its attention/memory characteristics. Combined with deterministic sampling (temp 0), it behaves stably for real-world coding instead of drifting or stalling.
Tudeus@reddit
Has anyone used it as the main drive for openclaw?
DHasselhoff77@reddit
Qwen3 Coder Next also supports fill-in-the-middle (FIM) tasks. This means you can use it for auto-completion via, for example, llama-vscode while also using it for agentic tasks. No need for two different models occupying VRAM simultaneously.
Chromix_@reddit (OP)
It'd be a rather good yet slow FIM model, yes. On the other hand there is Falcon 90M with FIM support which you could easily squeeze into the remaining VRAM or even run on CPU for auto-complete.
DHasselhoff77@reddit
The Falcon 90M GGUF didn't support llama.cpp's /infill endpoint, so it wasn't usable for me with llama-vscode. Using an OpenAI-compatible endpoint works, but in the case of that specific VSCode extension it requires extra configuration work. I also tried running Qwen Coder 2.5 (3B or 1.5B), but on the CPU and with a smaller context. It's pretty much the same speed as Qwen3 Coder Next on the GPU though.
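For anyone curious what the /infill endpoint mentioned above looks like, here's a rough sketch of a direct request. The host/port and the example prefix/suffix are assumptions; input_prefix and input_suffix are the code before and after the cursor:

```shell
# Build a FIM request for llama-server's /infill endpoint (the server is
# assumed to run a FIM-capable model on localhost:8080).
payload='{"input_prefix":"def add(a, b):\n    ","input_suffix":"\nprint(add(1, 2))","n_predict":32}'
cmd="curl -s http://localhost:8080/infill -d '${payload}'"
echo "$cmd"   # run this against a live server to get the completion
```

The response contains the generated middle section, which an editor plugin like llama-vscode splices between prefix and suffix.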
crablu@reddit
I have problems running qwen3-coder-next with opencode (RTX 5090, 64GB RAM). I tried with Qwen3-Coder-Next-UD-Q4_K_XL.gguf and Qwen3-Coder-Next-MXFP4_MOE.gguf. It works perfectly fine in chat.
start command:
models.ini:
OpenCode is not able to use the write tool; the UI says "invalid". I built the latest llama.cpp. Does anyone know how to fix this?
Chromix_@reddit (OP)
Try temperature 0, verify that you have the latest update of the Q4 model. It works reliably for me with that.
crablu@reddit
With temp 0 it seems to work now. Thank you.
Chromix_@reddit (OP)
Strange, maybe you can get the details of the failed tool calls and then figure out whether that's something on the OpenCode, llama.cpp or model side to solve.
sb6_6_6_6@reddit
1x 5090 + 2x 3090, Unsloth UD-Q6_K_XL, CPU: Ultra 9 285K, Docker on CachyOS - 76 t/s at 139000 context
version: '3.8'
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: llama-latest
    cpus: 8.0
    cpuset: "0-7"
    mem_swappiness: 1
    oom_kill_disable: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1","2","0"]
              capabilities: [gpu]
    environment:
      - NCCL_P2P_DISABLE=1
      - CUDA_VISIBLE_DEVICES=1,2,0
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - HUGGING_FACE_HUB_TOKEN=TOKEN
      - LLAMA_ARG_MAIN_GPU=0
      - LLAMA_ARG_ALIAS=Qwen3-Coder-80B
      - LLAMA_ARG_MLOCK=true
      - LLAMA_SET_ROWS=1
snipertoby@reddit
Yes! Qwen3-Coder-Next-REAP-48B-A3B-4bit-mlx can reach 60 t/s on a Mac mini 64GB
Revolutionary_Loan13@reddit
Anyone using a Docker image with llama-server on it, or does it not perform as well?
Chromix_@reddit (OP)
What would you use Docker for? One of the main points of llama.cpp is that you can use it as-is, without having to install any dependencies. You don't even need to install llama.cpp, just copy and run the binary distribution. Docker is usually used to run things that need dependencies, a running database server, and so on.
It'd be like taking your M&Ms out of the pack and wrapping them individually before eating them, just because you're used to unwrapping your candy one by one when snacking.
pol_phil@reddit
This model is great. My only problem is that its prefix caching doesn't work on vLLM. I think SGLang has solved this, but haven't tried it yet.
Are you aware of other frameworks which don't have this issue?
Chromix_@reddit (OP)
Two fixes in that area were just added for llama.cpp. vLLM is of course faster if you have the VRAM for it.
Brilliant-Length8196@reddit
Try Kilo Code instead of Roo Code.
Terminator857@reddit
Last time I tried, I didn't have an easy time figuring out how to wire kilocode with llama-server.
alexeiz@reddit
Use "openai compatible" settings.
HumanDrone8721@reddit
Just in case someone wonders, here are fresh benchmarks on a semi-potato PC, i7-14KF (4090 + 3090 + 128GB DDR5), for the 8-bit fat quant; coding performance later:
Chromix_@reddit (OP)
That TG speed looks slower than expected. In another comment here someone got 27 t/s with a single RTX 5090 and your CPU. Yes, the 5090 is faster, but not twice as fast. Have you tried only using the 4090, and the options/settings from my post?
HumanDrone8721@reddit
Those are separate benchmarks for the 4090 and the 3090. The fact that only 14 layers fit on the card, and that the difference between the cards is negligible, tells me that performance is RAM and CPU bound, not limited by the GPU's capabilities.
The poster with the 5090 probably managed to fit 39 or even 40 layers on the GPU, which gave a speed boost. Unfortunately, since almost no one bothers to post the actual precise command line and parameters, it's just anecdote.
Chromix_@reddit (OP)
The 4090 and the 3090 have almost the same VRAM bandwidth, and single inference of this quant is memory-bound. That could be an alternative explanation for why they give you the same TG speed in the benchmark. I haven't tested the Q8. If you download the exact Q4 model that I used and run with my command line then you should get the same PP/TG speeds that I posted. If you don't then there might be something to optimize on your side.
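The memory-bound argument above can be sketched with a rough, illustrative calculation. All numbers here are assumptions for a sanity check, not measurements: roughly 3B active parameters per token for this A3B MoE, about 0.56 bytes per weight at Q4, streamed from system RAM at an assumed effective 60 GB/s:

```shell
# Back-of-envelope upper bound for memory-bound token generation:
# t/s <= bandwidth / (active params * bytes per weight)
active_params_gb=3     # ~3B active parameters per generated token
bytes_per_weight=0.56  # rough Q4_K average
ram_bw_gbs=60          # assumed effective system RAM bandwidth (GB/s)
awk -v p="$active_params_gb" -v b="$bytes_per_weight" -v bw="$ram_bw_gbs" \
  'BEGIN { printf "~%.0f t/s upper bound\n", bw / (p * b) }'
```

With these placeholder numbers the bound lands in the few-dozen-t/s range, which is why faster GPUs alone don't help once the experts spill into system RAM.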
HumanDrone8721@reddit
It could very well be as you say, but I'm kind of done with small, fast, but (for me) mostly useless low-bit quants. I've reached the point where the wow factor is a proper clean implementation, not the token speed; that's nice to have, but it doesn't make me money for the next 3090 :). So I won't bother downloading the Q4 - not because it wouldn't be an interesting benchmark, but because the internet speed here is horrendous (well, in the largest economy in the EU).
Chromix_@reddit (OP)
Oh, that would have been just for a speed comparison, not for your daily usage, as any difference with the Q4 would also translate to your Q8. Aside from that, I'd be quite interested whether the ~2% that a Q4 scores worse in benchmarks translates to noticeable degradation in your usage.
Visit your nearest agricultural engineer outside the city to download the model, they have fiber to the barn.
HumanDrone8721@reddit
Well, using llama's little chat interface I put the guy through his paces and it actually gives me a consistent 33 TPS! And I concluded with this gem (ha, benchmax this ;) :
It went pretty well and ended with: *"Would you like this essay as a LaTeX print version, with source-code examples as C macro checks, or as a presentation for safety workshops? I'd be happy to adapt the formatting or provide deeper analyses of individual articles, for example how MISRA-C harmonizes with the relevant decisions of the German Federal Constitutional Court."* Endless evening fun :)
HumanDrone8721@reddit
And for good measure here is the CPU-only run for reference:
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Zestyclose_Yak_3174@reddit
Seems like the jury is still out. Many say it's good, yet many also say it's a very weak model in the real world.
shrug_hellifino@reddit
With any quant I use, I'm getting this error: "forcing full prompt re-processing due to lack of cache data" as shown below. Reloading 50, 60, 70, 150k of context over and over is quite miserable. All on the latest build, with a fresh quant download just in case, as of today 2/8/26. Any guidance or insight would be appreciated.
Chromix_@reddit (OP)
A potential fix for this was just merged, get the latest version and test again :-)
You could also increase --cache-ram if you have some free RAM to spare.
shrug_hellifino@reddit
Wow, I just rebuilt this morning, so this is that new? Thank you for the pointer!
rm-rf-rm@reddit
Looking for help in getting it working with MLX: https://old.reddit.com/r/LocalLLaMA/comments/1qwa7jy/qwen3codernext_mlx_config_for_llamaswap/
Savantskie1@reddit
I don't care about speed, I care about ease of use and being able to load and unload a model without needing to spawn a separate instance of the model runner. That's just a waste of resources.
IrisColt@reddit
Thanks!!!
jedsk@reddit
did you get any err outs with opencode?
it kept failing for me when just building/editing an html page
Chromix_@reddit (OP)
No obvious errors, aside from it initially uninstalling packages because it wasn't prompted to leave the dev environment alone. Well, and then there's this: the first and sometimes second LLM call in a sequence always fails for some reason, despite the server being available:
live4evrr@reddit
I was almost ready to give up on it, but after downloading the latest GGUF (4-bit XL) from Unsloth and updating llama.cpp, it is a good local option. Of course it can't be compared to frontier cloud models (no, it's not nearly as good as Sonnet 4.5), but it's pretty close. Amazing that it can run so well on a 32GB VRAM card with sufficient RAM (64+).
EliasOenal@reddit
I have had good results with Qwen3 Coder Next (Unsloth's Qwen3-Coder-Next-UD-Q4_K_XL.gguf) locally on Mac, it is accurate even with reasonably complex tool use and works with interactive tools through the term-cli skill in OpenCode. Here's a video clip of it interactively debugging with lldb. (Left side is me attaching a session to Qwen's interactive terminal to have a peek.)
Chromix_@reddit (OP)
So you're saying that when I install the term-cli plugin, my local OpenCode with Qwen can operate my Claude CLI for me?
EliasOenal@reddit
Haha, indeed! Yesterday I wanted to debug a sporadic crash I'd encountered twice in llama.cpp when called from OpenCode (one of the risks of being on git HEAD). So I spawned two term-cli sessions, one with llama.cpp and one with OpenCode, and asked another session of OpenCode to take over and debug this. It actually ended up typing into OpenCode and running prompts, but it wasn't able to find the crash 50k tokens in, so I halted that for now.
LoSboccacc@reddit
It's not much, but two years ago we were getting 15 TPS on Capybara 14B and it was barely coherent. Now we have a somewhat usable Haiku 3.5 at home.
HollowInfinity@reddit
I used OpenCode, Roo, my own agent and others, but found the best agent is (unsurprisingly) Qwen-Code. Its system prompts and tool setup are probably exactly what the model was trained for. Although, as I type this, you could probably just steal their tool definitions and prompts for whatever agent you're using.
StardockEngineer@reddit
Install oh my OpenCode into OpenCode to get the Q&A part of planning as you've described in Roo Code. It also provides Claude Code compatibility for skills, agents and hooks.
Chromix_@reddit (OP)
A vibe-coded vibe-coding tool plug-in? I'll give it a look.
txgsync@reddit
I like to vibe-code UIs for my vibe-coded plugins used in my vibe-coding platform.
msrdatha@reddit
Indeed, the speed, quality and context size points mentioned are spot-on in my test environment with a Mac M3 and Kilo Code as well.
This is my preferred model for coding now. I switch between this and Devstral-2-small from time to time.
Any thoughts on which is a good model for "Architect/Design" solution part? Does a thinking model make any difference in design only mode?
-dysangel-@reddit
How much RAM do you have? For architect/design work I think GLM 4.6/4.7 would be good. Unsloth's glm reap 4.6 at IQ2_XXS works well for me, taking up 89GB of RAM. I mostly use GLM Coding Plan anyway, so I just use local for chatting and experiments.
Having said that, I'm testing Qwen 3 Coder Next out just now, and it's created a better 3D driving simulation for me than GLM 4.7 did via the official coding plan. It also created a heuristic AI to play Tetris with no problems. I need to try pushing it even harder
https://i.redd.it/xgtfa0jo0aig1.gif
msrdatha@reddit
89GB of RAM at what context size?
-dysangel-@reddit
Took a while to find out how to find full RAM usage on the new LM Studio UI! The 89GB is the loaded base model only, and it's a total 130GB with 132000 context
msrdatha@reddit
for me ...above 90GB is "up above the world so high......"
any way, thanks for the confirmation.
-dysangel-@reddit
No worries. Give it a few years and this will be pretty normal stuff. When I was a kid I remember us adding a 512kb expansion card to our Amiga to double the RAM lol
msrdatha@reddit
Thanks.. but not on a Mac.
instead, I follow this logic... "The 90 in hand is better than 1024+ in cloud" :)
mycall@reddit
I cannot get LMStudio to give Q3CN more than 2048 context size. I wonder if anyone else has this issue.
-dysangel-@reddit
Qwen 3 Coder Next time trial game, single web page with three.js. Very often models will get the wheel orientation incorrect etc. It struggled a bit to get the road spline correct, but fixed it after a few iterations of feedback :)
https://i.redd.it/l9j55bxr0aig1.gif
Chromix_@reddit (OP)
Reasoning models excel in design mode for me as well. I guess a suitable high-quality flow would be:
Experimental IDE support for that could be interesting, especially now that llama.cpp allows model swapping via API. Still, the whole flow would take a while to be executed, which could still be feasible if you want a high quality design over lunch break (well, high quality given the local model & size constraint).
msrdatha@reddit
appreciate sharing these thoughts. makes sense very much.
I've been thinking whether a simple RAG system or memory could help in such cases. Just a thought, not yet tried; I didn't want to spend too much time learning a deep RAG or memory implementation. I see Kilo Code does have some of these in its settings, but I haven't tried them in an actual code scenario yet.
any thoughts or experience on such actions related to coding?
Chromix_@reddit (OP)
With larger (2M+ tokens), more complex code bases a RAG system (that you need to keep up-to-date) can make sense. Claude and others just grep their way through things, but that becomes way less efficient or even breaks with certain use-cases, code-bases and task complexity. The question is then whether Q3CN could handle that on top. Still, if you get good results most of the time without any added complexity: why add any? :-)
msrdatha@reddit
yes, this is exactly why I have been staying away from RAG till now. why complicate unnecessarily. I would rather focus on how to make it more useful at a task.
But from time to time I feel a small, simple RAG solution with a folder of data that we can ask the agent to learn from may help. Again, I would need to walk through it with the agent to ensure that it picks up the right concepts from the data.
anoni_nato@reddit
I'm getting quite good results coding with Mistral Vibe and GLM 4.5 air free (openrouter, can't self host yet).
It has its issues (search-and-replace fails often so it switches to file overwrite, and sometimes it loses track of context size), but it's producing code that works without me opening an IDE.
klop2031@reddit
Yeah i feel the same. For the first time this thing can do agentic tasks and can code well. I actually found myself not using a frontier model and just using this because of privacy. Im like wow so much better
jacek2023@reddit
How do you use OpenCode on 24 GB VRAM? How long do you wait for prefill? Do you have this fix? https://github.com/ggml-org/llama.cpp/pull/19408
Odd-Ordinary-5922@reddit
if you have --cache-ram set to something high, prefill isn't really a problem
jacek2023@reddit
I use --cache-ram 60000, what's your setting?
Odd-Ordinary-5922@reddit
just set it to your context size which usually fixes it for me.
Chromix_@reddit (OP)
Thanks for pointing that out. No, I haven't tested with this very recent fix yet. ggerganov states though that reprocessing is unavoidable if something early in the prompt changes - which is exactly what happens when Roo Code, for example, switches from "Architect" to "Code" mode.
jacek2023@reddit
Yes I am thinking about trying roo (I tested that it works), but I am not sure how "agentic" it is. Can you make it compile and run your app like in opencode? I use Claude Code (+Claude) and Codex (+GPT 5.3) simultaneously and opencode works similarly, can I achieve that workflow in roocode?
Chromix_@reddit (OP)
Roo will absolutely try to run syntax checks, protobuf compilation, unit tests and such. Actually running the application needs to be explicitly instructed, in my experience. Still, I prefer that rather conservative approach over the full YOLO that OpenCode seems to do by default; it's sort of the same as Claude's "try to get it working without bothering the user, no matter what". So in the end I guess it comes down to preference, although it seems that the model is a bit more capable with OpenCode than with Roo.
jacek2023@reddit
In all three cases (Claude Code, Codex, OpenCode), my workflow is to build a large number of .md files containing knowledge/experiences, this document set grows alongside the source code.
Chromix_@reddit (OP)
Well, in that case you'll have to see what it does. Claude loves writing documentation files and in-code comments so much that I explicitly instructed it multiple times to stop doing so unless I request it. I've barely tried documentation creation with Q3CN and Roo. The bit that I tried was OK-ish, yet what Claude creates is certainly better.
jacek2023@reddit
in CLAUDE.md / AGENTS.md I just have all the instructions what agent should or shouldn't do, for example I needed to teach it how to run my app, take screenshot and then compare with specification
rorowhat@reddit
The Q4 quant was taking 62GB of RAM in LM Studio as well, it didn't make sense.
dubesor86@reddit
Played around with it a bit: very flaky JSON, forgetful about including mandatory keys, and very verbose, akin to a thinker without an explicit reasoning field.
Chromix_@reddit (OP)
Verbose in code or in user-facing output? The latter seemed rather compact for me during the individual steps, with the regular 4 paragraph conclusion at the end of a task. Maybe temperature 0 has something to do with that.
Blues520@reddit
Do you find it able to solve difficult tasks? Because I used the same quant and it was coherent, but the quality was so-so.
-dysangel-@reddit
My feeling is that the small/medium models are not going to be that great at advanced problem solving, but they're getting to the stage where they will be able to follow instructions well to generate working code. I think you'd still want a larger model like GLM/Deepseek for more in depth planning and problem solving, and then Qwen 3 Coder has a chance of being able to implement individual steps. And you'd still want to fall back to a larger model or yourself if it gets stuck.
Chromix_@reddit (OP)
Yes, for the occasional really "advanced problem solving" I fill the context of the latest GPT model with manually curated pages of code and text, set it to high reasoning and max tokens, and get a coffee. Despite yielding pretty good results and insights for some things, it still frequently needs corrections due to missing optimal (or, well, better) solutions. Q3CN has no chance of competing with that. Yet it doesn't need to for regular day-to-day dev work, that's my point - it seems mostly good enough.
-dysangel-@reddit
Yeah exactly. They can do a good amount on their own, especially for more basic work. For more complex tasks I don't try to get them to do everything themselves; I treat them more as a pair programmer, or just chat through the problem with them and implement it myself. Especially for stuff like graphics work, you need a human in the loop for feedback anyway.
Blues520@reddit
That makes sense, and it does do well at tool calling, which some models, like Devstral, trip over.
Chromix_@reddit (OP)
The model & inference in llama.cpp had issues when they were released initially. These have since been fixed. So if you aren't on the latest version of llama.cpp, or haven't (re-)downloaded the updated quants, that could explain the mixed quality you were seeing. I also tried the Q8 REAP against a UD Q4, but the Q8 made more mistakes, probably because the REAP quants haven't been updated yet, or maybe it's due to REAP itself.
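For reference, updating a source build of llama.cpp follows the standard steps from the project README (the CUDA flag assumes an NVIDIA GPU; adjust to your backend):

```shell
# In an existing llama.cpp checkout: pull the latest code and rebuild.
# Swap -DGGML_CUDA=ON for your backend (Metal, Vulkan, CPU-only).
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The updated quants then need to be re-downloaded separately, since fixes to the conversion are baked into the GGUF files.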
For "difficult tasks": I didn't test the model on LeetCode challenges or implementing novel algorithms, but on normal dev work: adding new features, debugging and fixing broken things in a poorly documented real-life project. No patent-pending compression algorithms or highly exotic stuff.
The latest Claude 4.6 or GPT-5.2 Codex of course performs way better: a more direct path to the solution, and sometimes better approaches that Q3CN didn't find at all. Still, for just getting some dev work done, you no longer need the latest and greatest. Q3CN is the first local model that's usable for me in this area. Of course you might argue that using the latest SOTA is always best, since you always want the fastest, best solution no matter what, and I would agree.
Blues520@reddit
I pulled the latest model and llama.cpp yesterday, so the fixes were in. I'm not saying it's a bad model; I guess I was expecting more given the hype.
I didn't do any LeetCode either, just normal dev stuff as well. I suspect a higher quant would be better. I wouldn't bother with the REAP quant though.
Chromix_@reddit (OP)
Q4 seems good enough, yet I also thought there could be more. So I also tested a Q6, which should be relatively close to the full model in quality. But that then comes either with a decreased context size (which leads to bad results on its own: "compacting" before having read all the relevant pieces of code) or with harsh speed penalties due to not having enough VRAM for it.
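That trade-off can be sketched against the llama-server invocation from the post. The Q6 filename and flag values below are illustrative, not benchmarked:

```shell
# Same 24 GB VRAM / 64 GB RAM box: a Q6 quant needs more memory than
# the UD-Q4_K_XL, so either shrink the context (-c) or push more
# expert layers onto the CPU (--n-cpu-moe), which slows generation.
llama-server -m Qwen3-Coder-Next-UD-Q6_K_XL.gguf -ngl 99 -fa on \
  -c 64000 --n-cpu-moe 40 --temp 0 --cache-ram 0
```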
And yes, the hype is always bigger than the reality. In this case it's not so much about the hype for me, but about finally having something I didn't have with other models: the right feature/property combo to make it work well for me.
Blues520@reddit
That's great. I'm glad it works well for you, and it's good for your setup with decent context.
mysho@reddit
I tried to have it convert a simple systemd service into one activated by a socket, using Kilo Code with qwen3-coder-next. It took 30 requests for such a trivial task, but it managed in the end. I expected better, but it's kinda usable for trivial stuff.
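For reference, the conversion that comment describes boils down to a pair of units like these; the names, path and port are illustrative, and the sketch assumes the service accepts the socket passed in by systemd:

```ini
# --- myapp.socket ---
# systemd listens on the port and starts the service on demand.
[Unit]
Description=Socket for myapp

[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target

# --- myapp.service ---
# Started on the first connection; the listening socket is handed
# over by systemd instead of being opened by the service itself.
[Unit]
Description=myapp (socket-activated)
Requires=myapp.socket

[Service]
ExecStart=/usr/local/bin/myapp
```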
Status_Contest39@reddit
The same feeling here; for me it's not even as good as GLM 4.7 Flash.