Qwen3 Coder Next as first "usable" coding model < 60 GB for me

[-]

Chromix_@reddit (OP)

--n-cpu-moe let's the experts stay in system RAM. But it's easier to simply not specify -ngl these days. Use --fit-target 256 or so and you should get good enough results.

[-]

Thanks, I was able to solve this problem by simply using `--fit on` flag, no need for guessing cpu moe or gpu layers. I am getting about 120tps for prompt processing and 20tps for generation. Prompt Processing is honestly too slow... :-(

Is there anyway to improve that?

[-]

Chromix_@reddit (OP)

PP indeed looks way too slow, while TG seems OK. Check if your VRAM maybe spilled into shared memory. Test increasing batch & ubatch size as discussed here and elsewhere to speed up PP. Best run with llama-bench for systematic testing.

[-]

TargetIndependent435@reddit

Which of its quant(s) can I use considering a RTX3090+32GB DDR5, planning on using Claude Code with my codebase.

[-]

andrewmobbs@reddit

I've also found Qwen3-Coder-Next to be incredible, replacing gpt-oss-120b as my standard local coding model (on a 16GB VRAM, 64GB DDR5 system).

I found it worth the VRAM to increase `--ubatch-size` and `--batch-size` to 4096, which tripled prompt processing speed. Without that, the prompt processing was dominating query time for any agentic coding where the agents were dragging in large amounts of context. Having to offload another layer or two to system RAM didn't seem to hurt the eval performance nearly as much as that helped the processing.

[-]

STUDBOO@reddit

pls make a video or step by step instructions, I am using LM studio

[-]

Chromix_@reddit (OP)

Setting it that high gives me 2.5x more prompt processing speed, that's quite a lot. Yet the usage was mostly dominated by inference time for me, and this drops it to 75% due to less offloaded layers. With batch 2048 it's still 83% and 2x more PP speed. Context compaction speed is notably impacted by inference time (generating 20k tokens), so I prefer having as much of the model as possible on the GPU, as my usage is rarely impacted by having to re-process lots of data.

[-]

BrightRestaurant5401@reddit

wait, are you offloading layer linearly? Qwen3-Coder-Next is a moe so I think its better to offload up and down and maybe even gate?

[-]

Chromix_@reddit (OP)

I used to toy around with regex to find the optimal offload, but these days --fit usually works nicely, even for MoE models.

[-]

inphaser@reddit

How does it perform compared to minimax 2.5 REAP?

[-]

spadak@reddit

how did you set it up ? I was under impression that you need much more of VRAM, I have rtx 5070 ti and 96GB of DDR5 and would love to be able to use it locally, I'm on windows

[-]

JustSayin_thatuknow@reddit

Try linux my friend! 🙏🏻

[-]

genpfault@reddit

IQ4_NL seemed slightly better

In the tok/s sense, or quality-of-output sense?

[-]

-dysangel-@reddit

thanks for that - I remember playing around with these values a long time ago and seeing they didn't improve inference speed - but didn't realise they could make such a dramatic difference to prompt processing. That is a very big deal

[-]

inphaser@reddit

Idk what i'm doing wrong, i just managed to try it in llama.cpp after a SyCL bug has been fixed and made it into the docker image.
But the results are just unusable. I mean, what are we supposed to do with these results like below?

[-]

Chromix_@reddit (OP)

That looks broken, but in a special way. It looks like your prompt isn't being sent to the model. These "free form" results is what you get when you run inference without specifying a prompt.

Try it via CLI to see if you get better results: llama-cli -m Qwen3-Coder-Next-IQ4_XS.gguf -fa on -c 4096 --temp 0 -p "hi"

[-]

inphaser@reddit

Thanks i just tried, it looks the same:

$ docker run -it --rm --name llama.cpp --network=host --device /dev/dri -v $MODEL_DIR:/models ghcr.io/ggml-org/llama.cpp:light-intel -m /models/Qwen3-Coder-Next-IQ4_XS.gguf -ngl 99 -np 1 -c 32768

load_backend: loaded SYCL backend from /app/libggml-sycl.so

load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so

Loading model... |get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

|get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

/get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

-get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

/get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

-get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

▄▄ ▄▄

██ ██

██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄

██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██

██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀

██ ██

▀▀ ▀▀

build : b8172-a8b192b6e

model : Qwen3-Coder-Next-IQ4_XS.gguf

modalities : text

available commands:

/exit or Ctrl+C stop or exit

/regen regenerate the last response

/clear clear the chat history

/read add a text file

> hi

Type-Support

OpenXML

Dimensioned

Open

[ Prompt: 1.1 t/s | Generation: 1.9 t/s ]

> can you write hello world in c?

) {

} {

[ Prompt: 3.4 t/s | Generation: 2.0 t/s ]

[-]

Chromix_@reddit (OP)

Hmmm, you could download or create a CPU-only build of llama.cpp, without any SYCL functionality integrated. If it works with that version then it's a SYCL bug that you could create an issue for on GitHub. If it still doesn't work then download another quant from some other repo to see if your existing file might be corrupted or simply a bad quant. If it's still broken then... uh.. good luck.

[-]

inphaser@reddit

Thanks! indeed cpu works (and runs also at 5t/s vs 2 of SyCL)

[-]

Chromix_@reddit (OP)

Please make sure to report that as SYCL issue with all your details then, so that it can get fixed (and you'll get faster speeds)

[-]

BrianJThomas@reddit

Couldn’t do any tool calls successfully for me in opencode and I gave up.

[-]

benevbright@reddit

have you tried unsloth version?

[-]

Practical-Bed3933@reddit

I get "I don't have access to a listdir tool in my available tools." with Ollama on Mac. Did you resolve it u/BrianJThomas ?

[-]

Chromix_@reddit (OP)

Latest updated model quant, at least Q4, latest llama.cpp, latest opencode? There were issues 5 days ago that have been solved since then. I have not seen a single failed too call since then.

[-]

wisepal_app@reddit

i get "invalid [tool=write, error=Invalid input for tool write: JSON parsing failed:" error with opencode. i am using latest llama.cpp with cuda and unsloth ud Q4_K_XL GGUF quant. Any idea what could be the problem?

[-]

Chromix_@reddit (OP)

Not really without further details. You get invalid JSON - like for example the model messing up parenthesis, which is something that it occasionally did before the fixes. If you get the actual JSON output from OpenCode that's not accepted, then this would tell what's broken.

[-]

wisepal_app@reddit

Sorry this is the full message: "invalid [tool=write, error=Invalid input for tool write: JSON parsing failed: Text: {"content":"# Tokenizer Package\n__version__ = "1.0.0"","filePath":"C:\Users\Hp Studio\Desktop\Tokenizer\init.py","filePath"C:\Users\Hp Studio\Desktop\Tokenizer\models\init.py"}. Error message: JSON Parse error: Expected ':' before value in object property definition]"

[-]

Chromix_@reddit (OP)

Hmm, that's strange. The model didn't escape any JSON data at all, and it failed at the file path, writing "filePath"C:\Users instead of "filePath":"C:\Users
That looks broken, it shouldn't happen on temperature 0.
Try adding something like "Ensure correctly escaped JSON strings for tool calls" to your prompt and see if it does anything. Yet that second error looks like broken inference or model.

[-]

SatoshiNotMe@reddit

It’s also usable in Claude Code via llama-server, set up instructions here:

https://github.com/pchalasani/claude-code-tools/blob/main/docs/local-llm-setup.md

On my M1 Max MacBook 64 GB I get a decent 20 tok/s generation speed and around 180 tok/s prompt processing

[-]

benevbright@reddit

super slow for me. M2 Max 64gb

[-]

SatoshiNotMe@reddit

much worse than 20 tok/s with CC ?

[-]

benevbright@reddit

made video with Claude Code. super slow: https://youtu.be/ok3tNaWfq2Y?si=tQgWgHxaZnxu02PW

[-]

AcePilot01@reddit

is claude local?

[-]

wanderer_4004@reddit

Same hardware here - M1 Max MacBook 64 GB. With MLX I get 41 tok/s TG and 360 tok/s PP. However, MLX server is less good than llama.cpp in kv-caching and especially branching. Also occasionally seems to leak memory. Am using Qwen Code and am quite happy with it.

[-]

Consumerbot37427@reddit

Also on Apple Silicon w/ Max. I have had lots of issues with MLX, I might stop bothering with them and just stick with GGUFs. Waiting for prefill is so frustrating, and seeing log messages about "failed to trim x tokens, clearing cache instead" drove me nuts.

I had been doing successful coding with Mistral Vibe/Devstral Small, but the context management issue plus the release of Qwen3 Coder Next inspired me to try out Claude Code with LM Studio serving the Anthropic API, and it seems amazing! It seems to be much better at caching prefill and managing context, so not only do I get more tokens per second from a MoE model, the biggest bonus is how much less time is spent waiting for the context/prefill. Loving it!

[-]

wanderer_4004@reddit

Actually I have now a workflow that works really well for me with Qwen Code and mlx-server. The important thing is to compress the context after each bug fix or feature. Then I say just 'hi' to load the new compressed context again and the system is ready for immediate answers. The important part was to limit context size in the settings.json.

I tested yesterday evening the latest llama.cpp and unsloth Qwen3-Coder-Next-MXFP4_MOE.gguf with Qwen Code and it has trouble with tool calling. Maybe the MXFP4 gguf is no good...

[-]

Consumerbot37427@reddit

Man, I've had back luck with the unsloth quants, too. I've got 96GB so I can run Q6, but dropped to Q4 so I could get 200k tokens of context. Maybe try an official quant?

Haven't tried Qwen Code yet. Vibe was the first CLI coding tool I tried, then Claude Code. And there's OpenCode... Too many options, and it all moves so fast, I hate committing and investing too much in learning a tool.

The important thing is to compress the context after each bug fix or feature. Then I say just 'hi' to load the new compressed context again and the system is ready for immediate answers.

That sounds fantastic!

[-]

wanderer_4004@reddit

Too many options, and it all moves so fast, I hate committing and investing too much in learning a tool.

Yes, the same here. What I like on a fully local tool chain is that I get a much better feeling about token usage. Because finally Anthropic is in the token selling business and obviously they want to hook up each and every developer onto using as many tokens as possible.

Their goal is to convince employers that for $100 spend on a developer, the same can be achieved with $90 spend on tokens, thus a gain of 10%. That's the endgame. Or it would be if LocalLLaMA would not exist.

What I like about Gemini CLI and Qwen CLI (which is based on Gemini) is, that those companies are not primarily in the token selling business. Especially China is -due to export restrictions- heavily interested in efficient token use - at least for now.

[-]

txgsync@reddit

Yep this behavior led me to write my own implementation of a MLX server with “slots” like llama.cpp has so more than one thing can happen at a time. FLOPS/byte goes up!

Inferencer and LM Studio both now support this too. If you use Gas Town for parallel agentic coding this dramatically speeds things up for your Polecats. Qwen3-Coder-Next is promising on Mac with parallel agentic harnesses. But I have to test it a bit harder.

[-]

wanderer_4004@reddit

I have started to look into the kv cache. Especially saving to disk and loading from disk and also making it more resilient against branching and interruptions. But no real code yet. Unless you want to commercialise it, just put it out somewhere on Github...

[-]

txgsync@reddit

My dumb little bash scripts with llama.cpp don’t yet deserve publication :). But I think we could reduce benchmark time on M4 Max for lalmbench from over an hour to just a few minutes: https://github.com/txgsync/lalmbench

[-]

crantob@reddit

Your writeup is helpful to my stage of ignorance. Thank you.

[-]

Chromix_@reddit (OP)

Claude Code uses a whole lot of tokens for the system prompt though, before any code is processed at all. OpenCode and Roo used less last time I checked. Still, maybe the results are better? I haven't tested Claude CLI with local models so far.

[-]

Purple-Programmer-7@reddit

Opencode > Claude code. It’s okay that people don’t listen though 😂

[-]

arcanemachined@reddit

No kidding. The quality of the interfaces is like night and day.

Claude Code feels like I'm in a vibe-coded fever dream. I'm sure that OpenCode is written with LLM-assisted code, but the interface feels so much more coherent to me, it's not even funny.

[-]

cleverusernametry@reddit

Roo has a very large system prompt as well no? I'm guesing opencode is the same deal

[-]

Chromix_@reddit (OP)

Roo is about 9K tokens and OpenCode 11K.

[-]

msrdatha@reddit

Initially I was testing both CC and opencode, but then Claude started the drama like limiting other agents and tools on api usage etc. This made me think, may be CC will not be good for local ai, the moment they feel its gaining traction and we would be suddenly banned with some artificially introduced limitations. So left cc for good and continued with opencode and kilo

[-]

SatoshiNotMe@reddit

There seems to be a lot of confusion about this: Anthropic has nothing against using the Claude Code harness with other LLMs. They even have a guide for this:

https://code.claude.com/docs/en/llm-gateway

However what Anthropic specifically is allergic to is when other apps or coding agents try to leverage the all-you-can-eat buffet subscriptions (pro/max) to avoid API costs.

[-]

SatoshiNotMe@reddit

Yes CC has a sys prompt of at least 20K tokens. On my M1 Max MacBook the only interesting LLMs with good-enough generation speed are the Qwen variants such as 30B-A3B and the new coder-next. GLM-4.7-flash has been bad at around 10 tok/s.

[-]

XiRw@reddit

Why don’t you use their website at this point if you are going non local with Claude instead of tunneling through an API?

[-]

SatoshiNotMe@reddit

I mainly wanted to use the 30B local models for sensitive document work, so can’t use an API, and needed it to run on my Mac. I really wouldn’t use 30B models for serious coding; for that I just use my Max sub.

[-]

XiRw@reddit

Ah okay that makes sense then

[-]

benevbright@reddit

Agreed. I made live demo video with qwen3-coder-next on my 64GB Mac Studio. https://youtu.be/ok3tNaWfq2Y?si=tQgWgHxaZnxu02PW it would be great to get some feedback.

[-]

AcePilot01@reddit

Hey. im back lol, I was having issues with my ollama (idk why bit it like lost my models, even tho I had them) something wierd with dockerm prob. so I deleted them, but I am having trouble how to figure getting this working?

For one, where you get the gguf? 2, I did see it was 160gb lmfao. So how did you install this. I was goiing to use openwebui prob. unless it would be impossible that way

[-]

Chromix_@reddit (OP)

I've used Qwen3-Coder-Next-UD-Q4_K_XL.gguf in this test. You can get that or any other quant that fits your VRAM on HF.

[-]

AcePilot01@reddit

yeah I am using that one now too, since we both only had 24gb vram

what speeds are you getting? I am curious, how can you "benchmark" these? Obv different asks have it give a different reply that sometimes is more or less tokens per second.

[-]

Chromix_@reddit (OP)

Prompt between 150 and 600 TPS, inference around 30 TPS - all depending on the used options. Check llmama-bench documentation how to run tests yourself, as well as other comments in this thread for more numbers.

[-]

AcePilot01@reddit

ok were those numbers from bench or real world asks?

what context you using?

[-]

AcePilot01@reddit

You know, I just noticed, how do you do ngl 99? 99 layers should be larger than the gpu right?

[-]

FairAlternative8300@reddit

Pro tip for Windows users with 16GB VRAM like the 5070 Ti: the `--n-cpu-moe` flag is the magic sauce here. It offloads the MoE expert layers to CPU while keeping attention on GPU, so you get decent 20+ tok/s generation without needing a 5090.

With 96GB DDR5 you should be golden. Try something like:

`llama-server.exe -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -ngl 99 -fa on -c 80000 --n-cpu-moe 28`

Start with fewer CPU-MOE layers and increase until it fits in VRAM. Flash attention (`-fa on`) helps a lot with the 16GB constraint too.

[-]

simracerman@reddit

Good reference. I will give opencode and this model a try.

At what context size did you notice it lost the edge?

[-]

Chromix_@reddit (OP)

So far I didn't observe any issue where it did something where it "should have known better" due to having relevant information in its context, but I "only" used it up to 120k. Of course its long context handling is far from perfect, yet it seems good enough in practice for now. Kimi Linear should be better in that aspect (not coding though), but I haven't tested it yet.

[-]

zoyer2@reddit

you mean 48B? ive tested it and sadly not so good, Qwen3 coder next is a lot better

[-]

Terminator857@reddit

failed for me on a simple test. Asked to list recent files in directory tree. Worked. Then asked to show dates and human readable file sizes. Went into a loop. Opencode q8. Latest build of llama-server. strix-halo.

[-]

Chromix_@reddit (OP)

That was surprisingly interesting. When testing with Roo it listed the files right away, same as in your test with OpenCode. Then after asking about dates & sizes it started asking me back, not just once like it sometimes does, but forever in a loop. Powershell or cmd, how to format the output, exclude .git, only files or also directories, what date format, what size format, sort order, hidden files, and then it kept going into a loop asking about individual directories again and again. That indeed seems to be broken for some reason.

[-]

CSEliot@reddit

Perhaps when it comes to meta data, thats wheren the issue is? Is this on linux or windows?

[-]

Chromix_@reddit (OP)

The point is: It didn't even try to read the data, it just kept asking about what to include and how to output it, without issuing a single command (except for the generic directory list, directly supported by the harness, independent of the OS).

[-]

CSEliot@reddit

Is a unix command like "ls" typically how opencode agents get file info?

[-]

AccomplishedLeg527@reddit

i am running it on 8 gb vram 3070ti laptop with 32Gb ram :). First attempt was using accelerate offloading to disk (as 80gb safetensors won`t fit in ram), and i got 1 token per 255 second. Than i wrote custom offloading and reach 1 token per second (!!! 255x speedup !!!). On desktop 3070 with full PCE buss it should be 2x faster + 2x fsater because desktop GPU + can be used raid 0 (2 nvme ssd) - 3..5x on loading experts weights + if more vram (12-16Gb) can be more weights cashed on cuda (now i got 55% cache hit rate using only 3gb vram). In total 8-16Gb middle range desktop cards can run it with speed 3 to 10 tokens per second with only 32Gb ram. If someone interested i can share how i did this. Or should i pattent it?

[-]

Chromix_@reddit (OP)

So, you're running the 80GB Q8 quant on a system with 40 GB (V)RAM in total. Your SSD isn't reading the remaining 40+ GB once per second, but it also doesn't need to, since it's a MoE.

There was a posting here a while ago where someone compiled some stats on the predictability of expert selection per token and got things 80%+ correct IIRC. With that approach one can (pre)load the required experts to maximize generation speed without having to wait that much for the comparatively slow SSD. Maybe you did something similar? Or is it just pinning the shared expert(s) and other parts that are needed for each token into (V)RAM?

[-]

AccomplishedLeg527@reddit

most frequent experts indexes cached in vram for each layer, with only 3 Gb free vram i got 43-55% cache hit rate, for ram i have 2 options, one used by mmap to speedup loading, or do not use mmap and ram not needed at all (maybe up to few Mb for transfers), model without experts use only 4.6Gb vram (+we neeed some memory for context)

[-]

DOAMOD@reddit

https://i.redd.it/j3rbhpq2zfig1.gif

For me, it's been a bit disappointing in some tests, and also in a coding problem where the solution wasn't very helpful. It doesn't seem very intelligent. I suppose it will be good for other types of coding tasks like databases, etc. I had high expectations.

[-]

chickN00dle@reddit

spinning fishies cant be anything other than a success 🙌

joking

[-]

AcePilot01@reddit

I forgot to ask, is this the 160gb version?

[-]

Chromix_@reddit (OP)

Qwen3 Coder Next is 160 GB in the distributed base version, yet the quantized GGUFs in the 50 to 60 GB range work quite well.

[-]

AcePilot01@reddit

I kind of figured, since the 16ogb wouldn't fit lol, wasn't sure if (since it was a bunch of tensor files) that maybe it worked different lol.

I did try downloading a GGUF version and setting it up with llama.cpp but never could get it to work unfort.

[-]

Gimme_Doi@reddit

thanks

[-]

UmpireBorn3719@reddit

I use RTX 5090, AMD 9900X, RAM 64GB, MXFP4
Result: Prefill around 1500 tps, generation around 50 tps

slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 241664, batch.n_tokens = 4096, progress = 0.991353

slot update_slots: id 2 | task 3 | n_tokens = 241664, memory_seq_rm [241664, end)

slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 243260, batch.n_tokens = 1596, progress = 0.997900

slot update_slots: id 2 | task 3 | n_tokens = 243260, memory_seq_rm [243260, end)

slot update_slots: id 2 | task 3 | prompt processing progress, n_tokens = 243772, batch.n_tokens = 512, progress = 1.000000

slot update_slots: id 2 | task 3 | prompt done, n_tokens = 243772, batch.n_tokens = 512

slot init_sampler: id 2 | task 3 | init sampler, took 18.45 ms, tokens: text = 243772, total = 243772

slot update_slots: id 2 | task 3 | created context checkpoint 1 of 32 (pos_min = 243259, pos_max = 243259, size = 75.376 MiB)

slot print_timing: id 2 | task 3 |

prompt eval time = 170074.62 ms / 243772 tokens ( 0.70 ms per token, 1433.32 tokens per second)

eval time = 4125.05 ms / 182 tokens ( 22.67 ms per token, 44.12 tokens per second)

total time = 174199.66 ms / 243954 tokens

slot release: id 2 | task 3 | stop processing: n_tokens = 243953, truncated = 0

srv update_slots: all slots are idle

[-]

Savantskie1@reddit

I’m currently using vs code insiders, I can’t use cli coding tools. So can you check to see if this model will work with that? I use lm studio, I don’t care if llama.cpp is faster I won’t use it so don’t suggest it please.

[-]

Chromix_@reddit (OP)

Roo Code is a VSCode plugin that you can use with any OpenAI-compatible API, like for example LMStudio provides. Out of interest: Is there a specific reason to stick to LMStudio if it's only used as API endpoint for a IDE (or IDE plugin)? The difference can be very large as another commenter found out.

[-]

Savantskie1@reddit

I don’t care about speed, I care about ease of use and being able to load and unload a model without needing to spawn a separate instance of the model runner. That’s just waste of resources

[-]

Chromix_@reddit (OP)

llama-server support for loading / switching models via API has been added a few months ago. In terms of ease-of-use you'd indeed need to use something like OpenWebUI with llama-server if the standard functionality isn't sufficient for you. Ease-of-use is probably also why lots of people use ollama.

[-]

Savantskie1@reddit

I may revisit it if it’s truly easier to use, but I don’t chase numbers. So I think I’ll be fine using lm studio for now

[-]

Hot_Turnip_3309@reddit

Ok so I have everything up to date and downloaded multiple GGUFs... tool calling does NOT work

[-]

Chromix_@reddit (OP)

I know, it's always annoying to have that "it seems to work for everyone else, but not for me" case. Maybe go through the support / ticket process with your inference engine and agent harness, collect the necessarily information, maybe logits could be interesting for the tool call as well. Maybe there is some inference error left which happens to strike mostly in your specific use-case.

[-]

ai_tinkerer_29@reddit

This resonates with my experience too. I've been bouncing between different models for coding work and the MoE architecture really does make a difference for speed without sacrificing too much quality.

Quick question: How does the tool-calling reliability compare to something like DeepSeek-V3 or QwQ in your experience? I've had issues with some models hallucinating tool calls or breaking the JSON format mid-stream.

Also—curious about your OpenCode vs Roo Code comparison. The "YOLO permissions" thing in OpenCode is exactly why I've been hesitant. Did you end up configuring stricter permissions, or just stick with Roo for production work?

Appreciate the detailed write-up on the llama-server flags too. The GGML_CUDA_GRAPH_OPT tip is gold—didn't know about that one.

[-]

Chromix_@reddit (OP)

I didn't compare to DeepSeek-V3 as it's not in the same weight class. QwQ is old, it's a good model but tool calling wasn't trained as excessively back then.

The permissions was more a "allow/deny by default" issue, combined with OpenCode really trying hard to get things working, even if it made no sense. I went for stricter permissions combined with safe utility scripts to execute.

[-]

AcePilot01@reddit

Are your comparisons of Opencode and roo code compared to Qwen3 coder next, or am I missing something? or are those agents what you USE this model with?

[-]

Chromix_@reddit (OP)

You cannot compare "OpenCode" to "Qwen3", because OpenCode is a harness for using LLMs, and Qwen3 is a LLM. My post is about using both OpenCode as well as Roo Code with Qwen3 Coder Next (Q3CN).

You can also use OpenWebUI with Q3CN, but it doesn't give you any agentic coding functionality like OpenCode or Roo. You could paste in code though.

I assume Roo/Open code are basically just IDE's?

No, Roo Code is a plugin for VSCode (an IDE), so if you install it you have agentic coding in an IDE. Of course you could also rewire the Copilot that's forced into VSCode for local LLMs. OpenCode is less of an IDE, but more a vibe-coding tool.

[-]

AcePilot01@reddit

OH ok, when I went to Open code's site they seemed to indicate it was a subscription/online thing. Not local.

[-]

Chromix_@reddit (OP)

Quite a few offer some easy online services - no local setup required, yes. Although there are quite often fully local options available.

[-]

AcePilot01@reddit

Imight copy your settings there, cus I also have a 4090 and 64gb of ram lol

[-]

Chromix_@reddit (OP)

You'll need to ensure to have sufficient free VRAM to achieve similar numbers - or tweak the -n-cpu-moe parameter a bit.

[-]

AcePilot01@reddit

didn't you claim to have the same vram? lmfao

[-]

Chromix_@reddit (OP)

Oh, Linux is fine. It's mostly that users on Windows with a single GPU sometimes have so much additional processes occupying their VRAM that they don't have the full capacity left for LLMs, which is why the exact offload numbers would lead to exceeding the available capacity and thus to slowdowns.

[-]

fadedsmile87@reddit

I have an RTX 5090 + 96GB of RAM. I'm using the Q8_0 quant of Qwen3-Coder-Next with \~100k context window with Cline. It's magnificent. It's a very capable coding agent. The downside of using that big quant is the tokens per second. I'm getting 8-9 tokens / s for the first 10k tokens, then it drops to around 6 t/s at 50k full context.

[-]

Chromix_@reddit (OP)

That's surprisingly slow, especially given that you have a RTX 5090. You should be getting at least half the speed that I'm getting with a Q4. Did you try with my way of running it (of course with manually adjusted ncmoe to almost fill the VRAM)?

[-]

fadedsmile87@reddit

I have 2x 48GB DDR5 mem sticks. 6000 MT/s (down from 6400 for stability)
i9-14900K

I'm using the default settings in LM Studio.
context: 96k
offloading 15/48 layers onto GPU (LM Studio estimates 28.23GB on GPU, 90.23GB on RAM)

[-]

Chromix_@reddit (OP)

Ah, just two modules then, so that should be fine. You could try the latest llama.cpp as a comparison, and play around with manual CPU masks. The E Cores and general threaded had a tendency to slow things down a lot in the past. I mean, you can try with the same Q4 that I used, and if your TPS are the same as mine or lower, then there's something you can likely improve.

[-]

fadedsmile87@reddit

I downloaded the Q4_K_M variant (48GB size). I tested it and got 14 t/s for a 3k token output.

You're right. Something must be off in my settings if you're getting twice as that with a less powerful GPU and less VRAM. I'm not very familiar with llama.cpp. I'm a simple user lol.

[-]

Chromix_@reddit (OP)

With LMStudio I guess? Well, try llama.cpp then. Download the latest release, start a cmd, start with the exact two lines that I executed (don't forget about the graph opt), check the speed, use this to see if something improves. llama-server -m Qwen3-Coder-Next-UD-Q4_K_XL.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128

[-]

SillypieSarah@reddit

my llama.cpp can't find the model directory.. i wanna use the q6 i already have that's split into two files, but idk how @\~@

[-]

fadedsmile87@reddit

What is this sorcery?!
I got 40 t/s on Q4 variant and 27 t/s on the Q8 variant. How is it possible that LM Studio is doing such a bad job at utilizing my GPU?

This is amazing! And I thought I'd have to upgrade to RTX 6000 Pro to get fast speeds lol

Thank you!

By the way, are there any tradeoffs with your settings? Does it hurt quality?

[-]

Anarchaotic@reddit

I don't understand how you're getting 40 t/s - I have the exact same specs as you but I'm only seeing 10t/s prompt generation and 45 t/s on prompt processing.

In LM Studio (not Llama server) - what were your settings + the token/s you got from that same exact model.

[-]

fadedsmile87@reddit

See nasone32's response. He helped me achieve the same performance in LM Studio as I did using llama server.
In the new LM Studio versions, there's an option called "number of layers for which to force MoE weights onto CPU". Instead of partially offloading layers to GPU, offload all of them. Use the difference in the number of layers offloaded here -> "number of layers for which to force MoE weights onto CPU".
This should speed things up a lot.

[-]

Anarchaotic@reddit

Hm I tried the same exact prompt for llama CPP and in LM studio -but couldn't get past 25 t/s response time. Maybe something else is bottlenecking me.

[-]

nasone32@reddit

you can achieve the same in LM Studio. you don't have to offload the layers in LM studio, you need to set the GPU offload layers to 48 (the slider all to the right, all layers on GPU) then set "Number of layers for which to force MoE on CPU" to the smallest number you can, and it does the same as the above.

And it will do the same llamacpp does with those settings

[-]

fadedsmile87@reddit

Wow, you are absolutely correct! I just tested it.
Instead of 15/48 layers in the GPU offload setting, I set it to 48/48 and put 33 layers for "number of layers for which to force MoE weights onto CPU".

This is awesome! I like LM Studio UX better than llama.cpp anyway haha

[-]

tmvr@reddit

The --fit and --fit-ctx parameters do the heavy lifting. They put everyting important into the VRAM (dense layers, KV cache, context) and then deal with the sparse expert layers. If some fit they get into the VRAM the rest goes into the system RAM. And of course -fa on makes sure that your memory usage for the context does not go through the roof.

[-]

Chromix_@reddit (OP)

Congrats, you just got the performance equivalent of an additional $2000 hardware for free 😉. No trade-offs, no drawbacks, just unused PC capacity that you're now using.

We'll you might now want to get OpenWebUI or so to connect to you llama-server if you want a richer UI than llama-server provides.

[-]

fragment_me@reddit

That's expected. I get 17-18 tok/s with a 5090 and ddr4 using UD q6 k xl with .\llama-server.exe -m Qwen3-Coder-Next-UD-Q6_K_XL.gguf `

-ot "\.(19|[2-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" `

--no-mmap --jinja --threads -12 `

--cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --ctx-size 128000 -kvu `

--temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 `

--host 127.0.0.1 --parallel 4 --batch-size 512 `

[-]

fadedsmile87@reddit

I was using LM Studio.

Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128

Now I'm getting 27 t/s on the Q8_0 quant :-)

[-]

fragment_me@reddit

You should try Q8 KV cache, data shows it's pretty much the same.

[-]

blackhawk00001@reddit

Same setup here, 96GB/5090/7900x/windows/VS code IDE with kilo code extension.

Try using llama.cpp, below are the commands that I'm using to get 30 t/s with Q4_K_M and 20 t/s with Q8. The Q8 is slower but solved a problem in one pass that the Q4 could not figure out. Supposedly it's much faster on vulkan at this time but I haven't tried yet.

.\llama-server.exe -m "D:\llm_models\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf" --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 -fa on --fit on -c 131072 --no-mmap --host

[-]

fadedsmile87@reddit

I was using LM Studio.

Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128

Now I'm getting 27 t/s on the Q8_0 quant :-)

[-]

blackhawk00001@reddit

Are are you deploying your models on Linux or windows? I tried your settings but had to cancel it because the prompt processing became much slower. Output was still 20/s for me.

I noticed that your startup commands resulted in all of the model stored in RAM where mine was split between RAM and VRAM.

I’ll try mixing settings once I can research what they all are.

[-]

fadedsmile87@reddit

not sure what you mean by "startup commands resulted in all of the model stored in RAM". My GPU shows 31.1/31.5 GB usage and my RAM is 92.2/95.7 GB in Windows Task Manager -> Performance.

I'm using Windows.
I made another test now.
prompt eval is 122 t/s (a 2.5k token prompt)
output was 26.17 t/s (an additional 3k token output)

[-]

blackhawk00001@reddit

Which server distribution are you using? I'm now getting the same as you using vulkan server. My settings caused an out of memory error while loading on vulkan.

Hopefully the llamma.cpp distribution for cuda is optimized soon.

[-]

fadedsmile87@reddit

I downloaded and installed the latest release (b7972).

And chose these:

Windows x64 (CUDA 13) - CUDA 13.1 DLLs

[-]

blackhawk00001@reddit

I did some more testing and fed logs back into cuda Q4 to summarize results. I found that the prompt processing speed is more important to me than the small gain in response processing speed, mostly so when loading documents and files from the workspace. I found that cuda is still faster than vulkan but not by much at the moment for Q8. Q4 is many times faster than vulkan Q4 but I might have configured something wrong, it has different startup attributes. Most interesting was that cuda had many more prompt tokens than vulkan, reducing the effect of the faster processing. I wonder if that affects accuracy. If you are getting those results of yours with Q8 on cuda I'd be curious to know how many tokens it is handling in the prompt and response. I tested each by setting it in architect mode and asking what it would take to change the background of my home page, and letting it plan and then make the change.

## Summary Comparison


| Category | Prompt Tokens | Prompt Speed | Response Tokens | Response Speed | Avg Total Time |
|----------|---------------|--------------|-----------------|----------------|----------------|
| 
**Vulkan Q8_0**
 | 3521.67 | 102.07 tok/s | 81.87 | 26.82 tok/s | 32.17 sec |
| 
**Vulkan Q4_K_M**
 | 7175.75 | 380.79 tok/s | 153.50 | 14.82 tok/s | 26.86 sec |
| 
**CUDA13 Q8_0**
 | 5599.13 | 156.25 tok/s | 156.13 | 20.88 tok/s | 30.22 sec |
| 
**CUDA13 Q4_K_M**
 | 7915.40 | 607.00 tok/s | 130.60 | 28.70 tok/s | 16.77 sec |

[-]

blackhawk00001@reddit

Something about storing experts on the gpu and everything else on the cpu. I'm still learning so might not explain it well. For comparison, I'm sitting at 67/95.1GB RAM and 30.9/31.5GB GPU used, 1GB shared GPU memory. I'd need to reload with your settings but I had similar RAM usage but my Shared GPU memory was higher so there might have been extra swapping going on.

I've seen a range of 160-300 t/s prompt and average 20t/s depending on the task. I need to test with the cuda 12.4 and vulkan servers to see if there's any difference.

[-]

blackhawk00001@reddit

Nice, I’ll try those settings

[-]

TBG______@reddit

I tested: llama.cpp + Qwen3-Coder-Next-MXFP4_MOE.gguf on RTX 5090 – Three Setups Compared

Setup 1 – Full GPU Layers (VRAM-heavy)
VRAM Usage: \~29 GB dedicated
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 28 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (65k token prompt):
Prompt eval: 381 tokens/sec
Generation: 8.1 tokens/sec
Note: Generation becomes CPU-bound due to partial offload; high VRAM but slower output.

Setup 2 – CPU Expert Offload (VRAM-light)
VRAM Usage: \~8 GB dedicated
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" -ot ".ffn_.*_exps.=CPU" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (70k token prompt):
Prompt eval: 60-140 tokens/sec (varies by cache hit)
Generation: 20-21 tokens/sec
Note: Keeps attention on GPU, moves heavy MoE experts to CPU; fits on smaller VRAM but generation still partially CPU-limited.

Setup 3 – Balanced MoE Offload (Sweet Spot)
VRAM Usage: \~27.6 GB dedicated (leaves \~5 GB headroom)
Command: A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --n-cpu-moe 24 --ctx-size 131072 --batch-size 1024 --threads 32 --threads-batch 32 --parallel 1
Speed (95k token prompt):
Prompt eval: 105-108 tokens/sec
Generation: 23-24 tokens/sec
Note: First 24 layers' experts on CPU, rest on GPU. Best balance of VRAM usage and speed; \~3x faster generation than Setup 1 while using similar total VRAM.

Recommendation: Use Setup 3 for Claude Code with large contexts. It maximizes GPU utilization without spilling, maintains fast prompt caching, and delivers the highest sustained generation tokens per second.

Any ideas to speed it up ?

[-]

RevolutionaryTrust12@reddit

Check my reply!! i got 70 tk/s

[-]

Chromix_@reddit (OP)

With so much VRAM left on setup 3 you can bump the batch and ubatch size to 4096 as another commenter suggested. That should bring your prompt processing speed to roughly that of setup 1.

[-]

TBG______@reddit

Thanks: i needed a bit more ctx sizs so i did: $env:GGML_CUDA_GRAPH_OPT=1

A:\llama.cpp\build\bin\Release\llama-server.exe --model "A:\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf" --host 0.0.0.0 --port 8080 --alias "Qwen3-Coder-Next" --seed 3407 --temp 0.8 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --n-gpu-layers 999 --n-cpu-moe 24 --ctx-size 180224 --batch-size 4096 --ubatch-size 2048 --threads 32 --threads-batch 32 --parallel 1

Speed (145k token prompt):
Prompt eval: 927 tokens/sec
Generation: 23 tokens/sec

Interactive speed (cached, 200–300 new tokens):
Prompt eval: 125–185 tokens/sec
Generation: 23–24 tokens/sec

[-]

Chromix_@reddit (OP)

Looks good, best of both worlds. Your interactive speed is so low due to just adding a few new tokens, way below the batch size. The good thing is: It doesn't matter, since "just a few tokens" get processed quickly anyway.

[-]

Easy_Kitchen7819@reddit

Try DeepSwe

[-]

Money-Frame7664@reddit

Do you mean agentica-org/DeepSWE-Preview ?

[-]

Easy_Kitchen7819@reddit

Yes

[-]

Danmoreng@reddit

Did you try the fit and fit-ctx parameters instead of ngl and n-cpu-moe ? Just read the other benchmark thread and tested on my hardware, it gives better speed.

[-]

tmflynnt@reddit

FYI that I added an update to that thread with additional gains based on people's comments.

[-]

Chromix_@reddit (OP)

Yes, tried that (and even commented how to squeeze more performance out of it) but it's not faster for me, usually a bit slower.

[-]

AcePilot01@reddit

How do you "run " one like that?

I use Openwebui and ollama, so when I download them (forget how they even get placed in there, lmfao I just have ai do it all haha)

[-]

Chromix_@reddit (OP)

Ditch ollama for llama.cpp. He could do it, you can do it too. (To be fair you can also connect OpenCode to ollama, but why not switch to something nicer while being at it?)

[-]

AcePilot01@reddit

Maybe, trying to get it to work in Openwebui is being a freaking pain. having to merge them all etc, it should be as easy as downloading the model and sticking it in a damn folder lol. having to vibe code it to work is getting old lmfao

[-]

qubridInc@reddit

This aligns with what we’ve observed as well. Qwen3 Coder Next works better in practice mainly because it’s an instruction-tuned MoE, not a reasoning-style model. That avoids internal reasoning loops and keeps latency predictable, which really matters for agent-style tools and long runs.

Tool calling and structured outputs are noticeably more reliable, and the long context (100k+) is actually usable on 24 GB VRAM thanks to its attention/memory characteristics. Combined with deterministic sampling (temp 0), it behaves stably for real-world coding instead of drifting or stalling.

[-]

Tudeus@reddit

Has anyone used it as the main drive for openclaw?

[-]

DHasselhoff77@reddit

Qwen3 Coder Next also supports as a fill-in-the-middle (FIM) tasks. This means you can use it for auto completion via for example llama-vscode while also using it for agentic tasks. No need for two different models occuping VRAM simultaneously.

[-]

Chromix_@reddit (OP)

It'd be a rather good yet slow FIM model, yes. On the other hand there is Falcon 90M with FIM support which you could easily squeeze into the remaining VRAM or even run on CPU for auto-complete.

[-]

DHasselhoff77@reddit

The Falcom 90M GGUF didn't support llama.cpp's /infill endpoint so it wasn't usable for me with llama-vscode. Using an OpenAI compatible endpoint works but in the case of that specific VSCode extension, it requires extra configuration work.

I also tried running Qwen Coder 2.5, 3B or 1.5B, but on the CPU and with a smaller context. It's pretty much the same speed as Qwen3 Coder Next on the GPU though.

[-]

crablu@reddit

I have problems running qwen3-coder-next with opencode (RTX 5090, 64GB RAM). I tried with Qwen3-Coder-Next-UD-Q4_K_XL.gguf and Qwen3-Coder-Next-MXFP4_MOE.gguf. It works perfectly fine in chat.

start command:

llama-server.exe ^
 --models-preset "E:\LLM\llama-server\models.ini" ^
 --models-max 1 ^
 --parallel 1 ^
 --cont-batching ^
 --flash-attn on ^
 --jinja ^
 --port 8080

models.ini:

[qwen3-coder-next-mxfp4]
model   = E:\LLM\models\unsloth\Qwen3-Coder-Next-GGUF\Qwen3-Coder-Next-MXFP4_MOE.gguf
c = 65536
b = 8192
ub = 8192
temp = 1
top-p = 0.95
top-k = 40
min-p = 0.01
n-cpu-moe = 24
no-mmap = true

Opencode is not able to use the write tool. The UI says invalid. I built latest llama.cpp. Does anyone know how to fix this?

[-]

Chromix_@reddit (OP)

Try temperature 0, verify that you have the latest update of the Q4 model. It works reliably for me with that.

[-]

crablu@reddit

With temp 0 it seems to work now. Thank you.

[-]

Chromix_@reddit (OP)

Strange, maybe you can get the details of the failed tool calls and then figure out whether that's something on the OpenCode, llama.cpp or model side to solve.

[-]

sb6_6_6_6@reddit

1 x 5090 + 2 x 3090 Unsloth UD-Q6_K_XL CPU: Ultra 9 285K docker on CachyOs - 76 t/s context 139000

version: '3.8' services: llama: image: ghcr.io/ggml-org/llama.cpp:server-cuda container_name: llama-latest cpus: 8.0 cpuset: "0-7" mem_swappiness: 1 oom_kill_disable: true deploy: resources: reservations: devices: - driver: nvidia device_ids: ["1","2","0"] capabilities: [gpu] environment: - NCCL_P2P_DISABLE=1 - CUDA_VISIBLE_DEVICES=1,2,0 - CUDA_DEVICE_ORDER=PCI_BUS_ID - HUGGING_FACE_HUB_TOKEN=TOKEN - LLAMA_ARG_MAIN_GPU=0 - LLAMA_ARG_ALIAS=Qwen3-Coder-80B - LLAMA_ARG_MLOCK=true - LLAMA_SET_ROWS=1

ports:
  - "9999:9999"
ipc: host
security_opt:
  - seccomp:unconfined
  - apparmor:unconfined
ulimits:
  memlock: -1
  stack: 67108864
volumes:
  - /mnt/data/opt/huggingface-cache:/root/.cache/huggingface
  - /mnt/data/opt/80b:/opt

command: >
  -m /opt/UD-Q6_K_XL/Qwen3-Coder-Next-UD-Q6_K_XL-00001-of-00002.gguf
  --port 9999
  --host 0.0.0.0
  --fit on
  --n-predict -1
  -fa on
  --threads 8
  --threads-http 4
  --numa distribute
  --slots
  --jinja
  --prio 3
  --temp 1.0
  --top-p 0.95
  --top-k 40
  --min-p 0.01
  --cache-ram -1
  --batch-size 2048

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:9999/health"]
  interval: 180s
  timeout: 20s
  retries: 3
  start_period: 600s

[-]

snipertoby@reddit

Yes! Qwen3-Coder-Next-REAP-48B-A3B-4bit-mlx could reach 60T/s on Macmini 64G

[-]

Revolutionary_Loan13@reddit

Anyone using a docker image with lama-server on it or does it not perform as well?

[-]

Chromix_@reddit (OP)

What for would you use docker? One of the main points about llama.cpp is that you can just use it as-is, without having to install any dependencies. You don't even need to install llama.cpp, just copy and run the binary distribution. Docker is usually used to run things that need dependencies, a running database server, whatsoever.

It'd be like taking your M&Ms out of the pack and wrapping them individually before eating them, just because you're used to unwrap your candy one by one when snacking.

[-]

pol_phil@reddit

This model is great. My only problem is that its prefix caching doesn't work on vLLM. I think SGLang has solved this, but haven't tried it yet.

Are u aware of other frameworks which do not have this issue?

[-]

Chromix_@reddit (OP)

Two fixes in that area were just added for llama.cpp. vLLM is of course faster if you have the VRAM for it.

[-]

Brilliant-Length8196@reddit

Try Kilo Code instead of Roo Code.

[-]

Terminator857@reddit

Last time I tried, I didn't have an easy time figuring out how to wire kilocode with llama-server.

[-]

alexeiz@reddit

Use "openai compatible" settings.

[-]

HumanDrone8721@reddit

Just in case someone wonders here are the fresh benchmark on a semi-potato PC i7-14KF (4090+3090+128GB DDR5) for the 8bit fat quanta, coding performance later:

llama-bench -m  .cache/llama.cpp/unsloth_Qwen3-Coder-Next-GGUF_UD-Q8_K_XL_Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf -fa on -ngl 26 -mmp 0
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  26 |           pp512 |        220.85 ± 2.24 |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  26 |           tg128 |         14.68 ± 0.27 |

[-]

Chromix_@reddit (OP)

That TG speed looks slower than expected. In another comment here someone got 27 t/s with a single RTX 5090 and your CPU. Yes, the 5090 is faster, but not twice as fast. Have you tried only using the 4090, and the options/settings from my post?

[-]

HumanDrone8721@reddit

That's for the 4090 and 3090 separate benchmarks, the fact that only 14 layers fit in card and the difference is negligible between cards tells me the the performance is RAM and CPU bound and not on the capabilities of the GPU.

The poster with the 5090 probably managed to fit 39 or even 40 layers in the GPU and this gave a boost of speed, unfortunately as almost no one is bothered to post the actual precise command line and parameters, is just some anecdote.

CUDA_VISIBLE_DEVICES=0 llama-bench -m  .cache/llama.cpp/unsloth_Qwen3-Coder-Next-GGUF_UD-Q8_K_XL_Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf -fa on -ngl 14 -mmp 0
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  14 |           pp512 |        167.85 ± 1.60 |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  14 |           tg128 |         10.74 ± 0.05 |


CUDA_VISIBLE_DEVICES=1 llama-bench -m  .cache/llama.cpp/unsloth_Qwen3-Coder-Next-GGUF_UD-Q8_K_XL_Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf -fa on -ngl 14 -mmp 0
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  14 |           pp512 |        160.27 ± 1.55 |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  14 |           tg128 |         10.55 ± 0.15 |

build: 8872ad212 (7966)

[-]

Chromix_@reddit (OP)

The 4090 and the 3090 have almost the same VRAM bandwidth, and single inference of this quant is memory-bound. That could be an alternative explanation for why they give you the same TG speed in the benchmark. I haven't tested the Q8. If you download the exact Q4 model that I used and run with my command line then you should get the same PP/TG speeds that I posted. If you don't then there might be something to optimize on your side.

[-]

HumanDrone8721@reddit

It could very well be as you say, but I'm kind of done with small, fast, but mostly useless (for me) low bit quants, I have now reached to the point where the wow factor is proper clean implementation, not the token speed, this is nice to have but it doesn't make me money for the next 3090 :). So I will not bother with downloading the Q4, not because it's not an interesting bench to do, but because the internet speed here is horrendous (tja, the largest economy in EU).

[-]

Chromix_@reddit (OP)

Oh, that would have been just for speed comparison, not for your daily usage, as any difference with Q4 would also translate to your Q8. Aside from that I'd be quite interested whether those \~2% that a Q4 scores worse in benchmarks translates to noticeable degradation in your usage.

the internet speed here is horrendous

Visit your next agricultural engineer outside the city to download the model, they have fiber to the barn 😉.

[-]

HumanDrone8721@reddit

Well, using llama's little chat interface I have put the guy trough his paces and it actually give me a consistent 33tps !!! and I've concluded with this gem (ha, benchmax this ;) :

It went pretty well and ended with: *"Möchten Sie diesen Essay als LaTeX-Druckversion, mit Quellcode-Beispielen als C-Makro-Checks, oder als Präsentation für Safety-Workshops? Ich kann gern die Formatierung anpassen oder vertiefende Analysen für einzelne Artikel liefern — etwa wie MISRA-C mit den auslegungsrelevanten Entscheidungen des Bundesverfassungsgerichts harmonisiert."*, endless evening fun :)

[-]

HumanDrone8721@reddit

And for good measure here is the CPU-only run for reference:

CUDA_VISIBLE_DEVICES=none llama-bench -m  .cache/llama.cpp/unsloth_Qwen3-Coder-Next-GGUF_UD-Q8_K_XL_Qwen3-Coder-Next-    UD-Q8_K_XL-00001-of-00003.gguf -fa on --cpu-strict 1  -mmp 0
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
| model                          |       size |     params | backend    | ngl | cpu_strict |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | --------------: | -------------------: |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  99 |          1 |           pp512 |         75.66 ± 0.79 |
| qwen3next 80B.A3B Q8_0         |  79.83 GiB |    79.67 B | CUDA       |  99 |          1 |           tg128 |          7.80 ± 0.07 |

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Zestyclose_Yak_3174@reddit

Seems like the verdict is still out. Many seem to say it's good, yet also many seem to say it is a very weak model in the real world.

[-]

shrug_hellifino@reddit

Any quant I use, I am getting this error: forcing full prompt re-processing due to lack of cache data as shown below, reloading 50,60,70,150k context over and over is quite miserable.. all latest build and fresh quant dl just in case as of today 2/8/26. Any guidance or insight would be appreciated.

slot print_timing: id  3 | task 6556 | 
prompt eval time =   95123.58 ms / 54248 tokens (    1.75 ms per token,   570.29 tokens per second)
       eval time =   15815.35 ms /   666 tokens (   23.75 ms per token,    42.11 tokens per second)
      total time =  110938.94 ms / 54914 tokens
slot      release: id  3 | task 6556 | stop processing: n_tokens = 54913, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200
srv  params_from_: Chat format: Qwen3 Coder
slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = 99553296262
srv  get_availabl: updating prompt cache
srv   prompt_save:  - saving prompt with length 53502, total state size = 1329.942 MiB
srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.001
srv        update:  - cache size limit reached, removing oldest entry (size = 285.055 MiB)
srv        update:  - cache size limit reached, removing oldest entry (size = 840.596 MiB)
srv        update:  - cache size limit reached, removing oldest entry (size = 1335.463 MiB)
srv        update:  - cache state: 5 prompts, 6876.842 MiB (limits: 8192.000 MiB, 262144 tokens, 311062 est)
srv        update:    - prompt 0x5b0377f1a0f0:   50773 tokens, checkpoints:  1,  1341.325 MiB
srv        update:    - prompt 0x5b036f03bf40:   51802 tokens, checkpoints:  1,  1365.454 MiB
srv        update:    - prompt 0x5b0375648cd0:   52203 tokens, checkpoints:  1,  1374.857 MiB
srv        update:    - prompt 0x5b0376a89180:   52844 tokens, checkpoints:  1,  1389.888 MiB
srv        update:    - prompt 0x5b0378b94380:   53502 tokens, checkpoints:  1,  1405.317 MiB
srv  get_availabl: prompt cache update took 1473.25 ms
slot launch_slot_: id  2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  2 | task 7250 | processing task, is_child = 0
slot update_slots: id  2 | task 7250 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 54953
slot update_slots: id  2 | task 7250 | n_past = 36, slot.prompt.tokens.size() = 53502, seq_id = 2, pos_min = 53501, n_swa = 1
slot update_slots: id  2 | task 7250 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  2 | task 7250 | erased invalidated context checkpoint (pos_min = 52818, pos_max = 52818, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  2 | task 7250 | n_tokens = 0, memory_seq_rm [0, end)

[-]

Chromix_@reddit (OP)

A potential fix for this was just merged, get the latest version and test again :-)
You could also increase --cache-ram if you have some free RAM to spare.

[-]

shrug_hellifino@reddit

Wow, I just rebuilt this morning, so this is that new? Thank you for the pointer!

[-]

rm-rf-rm@reddit

Looking for help in getting it working with MLX 🙏 https://old.reddit.com/r/LocalLLaMA/comments/1qwa7jy/qwen3codernext_mlx_config_for_llamaswap/

[-]

Savantskie1@reddit

I don’t care about speed, I care about ease of use and being able to load and unload a model without needing to spawn a separate instance of the model runner. That’s just waste of resources

[-]

IrisColt@reddit

Thanks!!!

[-]

jedsk@reddit

did you get any err outs with opencode?

it kept failing for me when just building/editing an html page

[-]

Chromix_@reddit (OP)

No obvious errors aside from the initial uninstalling of packages because it wasn't prompted to leave the dev environment alone. Well, and then there's this - the first and sometimes second LLM call in a sequence always fails for some reason, despite the server being available:

[-]

live4evrr@reddit

I was almost ready to give up on it, but after downloading the latest GGUF (4 bit XL) from Unsloth and updating llama.cpp, it is a good local option. Of course can't be compared to frontier cloud models (no it is not nearly as good as Sonnet 4.5) but it is pretty close. Amazing how well it can run so well on a 32GB VRAM card with sufficient ram (64+).

[-]

EliasOenal@reddit

I have had good results with Qwen3 Coder Next (Unsloth's Qwen3-Coder-Next-UD-Q4_K_XL.gguf) locally on Mac, it is accurate even with reasonably complex tool use and works with interactive tools through the term-cli skill in OpenCode. Here's a video clip of it interactively debugging with lldb. (Left side is me attaching a session to Qwen's interactive terminal to have a peek.)

[-]

Chromix_@reddit (OP)

Soo you're saying when I install the term-cli plugin then my local OpenCode with Qwen can operate my Claude CLI for me? 😉

[-]

EliasOenal@reddit

Haha, indeed! I yesterday wanted to debug a sporadic crash I encountered twice in llama.cpp, when called from OpenCode. (One of the risks of being on git HEAD.) So I spawned two term-cli sessions, one with llama.cpp and one with OpenCode, asking another session of OpenCode to take over to debug this. It actually ended up typing into OpenCode, running prompts, but it wasn't able to find the crash 50k tokens in. So I halted that for now.

[-]

LoSboccacc@reddit

srv  update_slots: all slots are idle
srv  params_from_: Chat format: Qwen3 Coder
slot get_availabl: id  3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 0 | processing task, is_child = 0
slot update_slots: id  3 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 1042
slot update_slots: id  3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 978, batch.n_tokens = 978, progress = 0.938580
slot update_slots: id  3 | task 0 | n_tokens = 978, memory_seq_rm [978, end)
slot update_slots: id  3 | task 0 | prompt processing progress, n_tokens = 1042, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  3 | task 0 | prompt done, n_tokens = 1042, batch.n_tokens = 64
slot init_sampler: id  3 | task 0 | init sampler, took 0.13 ms, tokens: text = 1042, total = 1042
slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 977, pos_max = 977, size = 75.376 MiB)
slot print_timing: id  3 | task 0 |
prompt eval time =    8141.99 ms /  1042 tokens (    7.81 ms per token,   127.98 tokens per second)
       eval time =    4080.08 ms /    65 tokens (   62.77 ms per token,    15.93 tokens per second)

it's not much but two year ago we we'd having 15 tps on capybara 14b being barely coherent and now we have a somewhat usable haiku 3.5 at home

[-]

HollowInfinity@reddit

I used OpenCode, roo, my own agent and others but found the best agent is (unsurprisingly) Qwen-Code. The system prompts and tool setup is probably exactly what the agent is trained for. Although as I type this you could probably just steal their tool definitions and prompts for whatever agent you're using.

[-]

StardockEngineer@reddit

Install oh my OpenCode into OpenCode to get the Q&A part of planning as you’ve described in Roo Code. Also provides Claude Code compatibility for skills, agents and hooks.

[-]

Chromix_@reddit (OP)

99% of this project was built using OpenCode. I tested for functionality—I don't really know how to write proper TypeScript.

A vibe-coded vibe-coding tool plug-in? I'll give it a look.

[-]

txgsync@reddit

I like to vibe-code UIs for my vibe-coded plugins used in my vibe-coding platform.

[-]

msrdatha@reddit

Indeed the speed, quality and context size points mentioned are spot on with my test environment with mac M3 and kilo code as well.

This is my preferred model for coding now. I am switching this and Devstral-2-small from time to time.

Any thoughts on which is a good model for "Architect/Design" solution part? Does a thinking model make any difference in design only mode?

[-]

-dysangel-@reddit

How much RAM do you have? For architect/design work I think GLM 4.6/4.7 would be good. Unsloth's glm reap 4.6 at IQ2_XXS works well for me, taking up 89GB of RAM. I mostly use GLM Coding Plan anyway, so I just use local for chatting and experiments.

Having said that, I'm testing Qwen 3 Coder Next out just now, and it's created a better 3D driving simulation for me than GLM 4.7 did via the official coding plan. It also created a heuristic AI to play tetris with no problems. I need to try pushing it even harder

https://i.redd.it/xgtfa0jo0aig1.gif

[-]

msrdatha@reddit

89GB of RAM at what context size?

[-]

-dysangel-@reddit

Took a while to find out how to find full RAM usage on the new LM Studio UI! The 89GB is the loaded base model only, and it's a total 130GB with 132000 context

[-]

msrdatha@reddit

for me ...above 90GB is "up above the world so high......"

any way, thanks for the confirmation.

[-]

-dysangel-@reddit

No worries. Give it a few years and this will be pretty normal stuff. When I was a kid I remember us adding a 512kb expansion card to our Amiga to double the RAM lol

[-]

msrdatha@reddit

Thanks.. but not on a Mac.
instead, I follow this logic... "The 90 in hand is better than 1024+ in cloud" :)

[-]

mycall@reddit

I cannot get LMStudio to give Q3CN more than 2048 context size. I wonder if anyone else has this issue.

[-]

-dysangel-@reddit

Qwen 3 Coder Next time trial game, single web page with three.js. Very often models will get the wheel orientation incorrect etc. It struggled a bit to get the road spline correct, but fixed it after a few iterations of feedback :)

https://i.redd.it/l9j55bxr0aig1.gif

[-]

Chromix_@reddit (OP)

Reasoning models excel in design mode for me as well. I guess a suitable high-quality flow would be:

Ask your query to Q3CN, let it quickly dig through the code and summarize all the things one needs to know about the codebase for the requested feature.
Pass that through Qwen 3 Next Thinking, GLM 4.7 Flash, Apriel 1.6 and GPT OSS 120B and condense the different results back into options for the user to choose.
Manually choose an option / approach.
Give it back to Q3CN for execution.

Experimental IDE support for that could be interesting, especially now that llama.cpp allows model swapping via API. Still, the whole flow would take a while to be executed, which could still be feasible if you want a high quality design over lunch break (well, high quality given the local model & size constraint).

[-]

msrdatha@reddit

appreciate sharing these thoughts. makes sense very much.

I have been thinking if a simple RAG system or Memory can help in such cases. Just thought only - not yet tried. Did not want to spend too much time on learning deep RAG or Memory implementation. I see kilo code does have some of these in settings. not yet tired on an actual code scenario.

any thoughts or experience on such actions related to coding?

[-]

Chromix_@reddit (OP)

With larger (2M tokens+), more complex code bases a RAG system (that you need to keep up-to-date) can make sense. Claude and others just grep their way through things, but it becomes way less efficient or even breaks with certain use-cases, code-bases and complexity of the task. The question is then whether or not Q3CN could handle that on top. Still, if you get good results most of the time without any added complexity on top: Why add any? :-)

[-]

msrdatha@reddit

yes, this is exactly why I have been staying away from RAG till now. why complicate unnecessarily. I would rather focus on how to make it more useful at a task.

But from time to time, I feel a small-simple RAG solution with a folder of data that we can ask the agent to learn from may help. Again, wound need to walk through it with the agent to ensure that it picks up the right concepts only from the data.

[-]

anoni_nato@reddit

I'm getting quite good results coding with Mistral Vibe and GLM 4.5 air free (openrouter, can't self host yet).

Has its issues (search and replace fails often so it switches to file overwrite, and sometimes it loses track of context size) but it's producing code that works without me opening an IDE.

[-]

klop2031@reddit

Yeah i feel the same. For the first time this thing can do agentic tasks and can code well. I actually found myself not using a frontier model and just using this because of privacy. Im like wow so much better

[-]

jacek2023@reddit

How do you use OpenCode on 24 GB VRAM? How long do you wait for prefill? Do you have this fix? https://github.com/ggml-org/llama.cpp/pull/19408

[-]

Odd-Ordinary-5922@reddit

if you have --cache-ram set to something high prefill isnt really a problem

[-]

jacek2023@reddit

I use --cache-ram 60000, what's your setting?

[-]

Odd-Ordinary-5922@reddit

just set it to your context size which usually fixes it for me.

[-]

Chromix_@reddit (OP)

Thanks for pointing that out. No I haven't tested with this very recent fix yet. ggerganov states though that reprocessing would be unavoidable if something early in the prompt is changed - which is exactly what happens when Roo Code for example switches from "Architect" to "Code" mode.

[-]

jacek2023@reddit

Yes I am thinking about trying roo (I tested that it works), but I am not sure how "agentic" it is. Can you make it compile and run your app like in opencode? I use Claude Code (+Claude) and Codex (+GPT 5.3) simultaneously and opencode works similarly, can I achieve that workflow in roocode?

[-]

Chromix_@reddit (OP)

Roo will absolutely try to run syntax checks, protobuf compilation, unit tests and such. Actually running the application needs to be instructed in my experience. Still, I prefer that rather conservative approach over full YOLO that OpenCode seems to do by default. Sort of the same as Claude "try to get it working without bothering the user, no matter what". So in the end I guess it comes down to preference, although it seems to be that the model is a bit more capable with OpenCode than with Roo.

[-]

jacek2023@reddit

In all three cases (Claude Code, Codex, OpenCode), my workflow is to build a large number of .md files containing knowledge/experiences, this document set grows alongside the source code.

[-]

Chromix_@reddit (OP)

Well, in that case you'll have to see what it does. Claude loves to write documentation files and in-code comments to much that I explicitly instructed it multiple times to stop doing so, unless I request it. I barely tried documentation creation with Q3CN and Roo. The bit that I tried was OKish, yet what Claude creates is certainly better.

[-]

jacek2023@reddit

in CLAUDE.md / AGENTS.md I just have all the instructions what agent should or shouldn't do, for example I needed to teach it how to run my app, take screenshot and then compare with specification

[-]

rorowhat@reddit

The Q45 quanto was taking 62GB of Ram on LMstudio as well, it didn't make sense.

[-]

dubesor86@reddit

played around with it a bit, very flakey json, forgetful to include mandatory keys and very verbose, akin to a thinker without explicit reasoning field.

[-]

Chromix_@reddit (OP)

Verbose in code or in user-facing output? The latter seemed rather compact for me during the individual steps, with the regular 4 paragraph conclusion at the end of a task. Maybe temperature 0 has something to do with that.

[-]

Blues520@reddit

Do you find it able to solve difficult tasks because I used the same quant and it was coherent but the quality was so so.

[-]

-dysangel-@reddit

My feeling is that the small/medium models are not going to be that great at advanced problem solving, but they're getting to the stage where they will be able to follow instructions well to generate working code. I think you'd still want a larger model like GLM/Deepseek for more in depth planning and problem solving, and then Qwen 3 Coder has a chance of being able to implement individual steps. And you'd still want to fall back to a larger model or yourself if it gets stuck.

[-]

Chromix_@reddit (OP)

Yes, for the occasional really "advanced problem solving" I fill the context of the latest GPT model with manually curated pages of code and text, set it to high reasoning, max tokens and get a coffee. Despite that, and yielding pretty good results and insights for some things, it still frequently needs corrections due to missing optimal (or well, better) solutions. Q3CN has no chance to compete with that. Yet it doesn't need to for regular day-to-day dev work, that's my point - seems mostly good enough.

[-]

-dysangel-@reddit

Yeah exactly. They're able to do a good amount on their own, especially of more basic work. For more complex tasks, I don't try to get them to do everything on their own. I'll treat them more as a pair programmer or just chat through the problem with them and implement myself. Especially when it comes to stuff like graphics work, you need a human in the loop for feedback anyway.

[-]

Blues520@reddit

That makes sense and it does do well in tool calling which some models like Devstral trip themselves over with.

[-]

Chromix_@reddit (OP)

The model & inference in llama.cpp had issues when they were released initially. This has been fixed by now. So if you don't use the latest version of llama.cpp or haven't (re-)downloaded the updated quants then that could explain the mixed quality you were seeing. I also tried the Q8 REAP vs a UD Q4, but the Q8 was making more mistakes, probably because the REAP quants haven't been updated yet, or maybe it's due to REAP itself.

For "difficult tasks": I did not test the model on LeetCode challenges, implementing novel algorithms and things, but on normal dev work: Adding new features, debugging & fixing broken things in a poorly documented real-life project - no patent-pending compression algorithms and highly exotic stuff.

The latest Claude 4.6 or GPT-5.2 Codex performs of course way better. More direct approach towards the solution, sometimes better approaches that the Q3CN didn't find at all. Yet still, for "just getting some dev work done" it's no longer needed to have the latest and greatest. Q3CN is the first local model that's usable for me in this area. Of course you might argue that using the latest SOTA is always best, as you always want the fastest, best solution, no matter what, and I would agree.

[-]

Blues520@reddit

I pulled the latest model and llamacpp yesterday so the fixes were in. I'm not saying that it's a bad model, I guess I was expecting more given the hype.

I didn't do any leetcode but normal dev stuff as well. I suspect that a higher quant will be better. I wouldn't bother with the REAP quant though.

[-]

Chromix_@reddit (OP)

Q4 seems good enough, yet I also thought that there could be more. So I also tested with a Q6 which should be relatively close to the full model in terms of quality, yet this then either comes with a decreased context size (which leads to bad results on its own, "compacting" before having read all relevant pieces of code), or harsh speed penalties due not not having enough VRAM for it.

And yes, the hype is always bigger. In this case it's not so much about the hype for me, but "I now have something I didn't have with other models, the right feature/property combo to make it work fine for me".

[-]