I tried it *just* for you. Using the same settings and prompt to compare Q6\_K of this to the last quant I tried, UD-Q6\_K\_XL, whose output had [4 compile errors](https://www.reddit.com/r/LocalLLaMA/comments/1r3weq3/comment/o581ehs/) (ignoring the ones I said I can't blame it for) on my "make an entire TypeScript minigame" vibe check... First of all, the model file is 28% smaller, making it a bit faster because more of it fits on my GPUs. Also keep in mind this GGUF is a static quant--no importance matrix.
Like the original model was fond of doing, it kinda-cheated by leaving out all the imports and only defining a few instances of the main gameplay data (plant seeds, in this case) with "..." after the first three. And ignoring the one "function called on the wrong object" error because my spec doesn't say that `fullSave` belongs to `GameState` and not `City`, it crapped out 22 compile errors:
\- 2 uninitialized properties
\- 1 use of a nonexistent type (`Difficulty`, accounting for 4 errors)
\- 5 undeclared properties (accounting for 8 errors)
\- 2 undefined functions
\- 4 objects passed into functions that can't accept that type (calls concentrated in one function)
\- 1 complete duplicate assignment of a property (specifically, an event handler) in one object initializer
\- 1 function called on the wrong object (the spec *does* say it's a `City` method for this one, but the generated code tried to call it on `uiManager`)
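As a quick sanity check on the breakdown above, here is a minimal Python sketch that tallies the categories. It assumes each category without an explicit "accounting for N errors" note contributes one compile error per item, and it takes the baseline of 4 errors from the earlier UD-Q6\_K\_XL run mentioned at the top; the dictionary keys are just shorthand labels, not compiler output.

```python
# Compile-error counts per category, copied from the list above.
# Assumption: categories without an explicit "accounting for N errors"
# note produce exactly one compile error per item.
error_counts = {
    "uninitialized properties": 2,
    "nonexistent type (Difficulty)": 4,
    "undeclared properties": 8,
    "undefined functions": 2,
    "wrong argument types": 4,
    "duplicate property assignment": 1,
    "method called on wrong object": 1,
}

total = sum(error_counts.values())

# Baseline from the earlier UD-Q6_K_XL run of the original model.
baseline_errors = 4
ratio = total / baseline_errors

print(f"REAM Q6_K: {total} errors, {ratio:.1f}x the baseline of {baseline_errors}")
```

Under that assumption the categories add up to the 22 errors quoted, which is 5.5 times the count from the original model's run.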
Keep in mind this pruned version was intended to retain math and coding skills *specifically* (second paragraph of the [model card](https://huggingface.co/bknyaz/Qwen3-Coder-Next-REAM)), so the performance degradation should be more noticeable on other kinds of prompts. I conclude that this REAM version is *pretty* *bad* compared to the original model, at least for TypeScript.
I have indeed been using the model for about a week (together with the [b7972](https://github.com/ggml-org/llama.cpp/releases/tag/b7972) llama.cpp release). I definitely prefer the `mradermacher/Qwen3-Coder-Next-REAM-GGUF:Q4_K_M` variant for coding with Python over last year's `Qwen3-Coder-30B-A3B-Instruct` \- it is aware of a number of relatively new language features that last year's models never got right, and it gave satisfying answers in a light debugging session.
On a dual `4090` system I still have about 3GB of VRAM headroom left on each card with `--ctx-size 120000` at 95 tokens/s. I have used `Qwen3-Coder-Next` a few times over API and definitely noticed a significant difference when trying to use it in my native language (German) - here the API model is already quite bad, but the `REAM` model generated multiple grammatical errors.
Hard to say, as I did not use the normal model that much. I find that `minimax-2.5`, `gemini-3-flash-preview`, `GLM-5` and `Kimi-K2.5` all sit in a more attractive price/performance spot when used via API so I don't have that much of a comparison.
I have noticed (but can't tell so far if there are quantization/REAM-specific differences) that `Qwen3-Coder-Next` does have more of a hallucination problem than the above models. It also shows some of the self-correction behavior you'd find in the thinking process of thinking models, making the outputs a bit verbose.
To be honest, there are so many freaking models and variations of everything, it's a bit overwhelming to stay on top of them all, let alone benchmark them. Then, performance aside, the actual usability and coherence can vary from task to task and from user to user.
Yes. For your sanity's sake, the rule of thumb is to avoid shortcuts to the maximum possible extent. Use the biggest model, least quantized, non-abliterated/pruned etc. that can run at the slowest acceptable pace on your hardware. Favor quality tokens over more tokens faster.
This is why I just use whatever the latest and greatest is and ignore the rest... then I don't even look for a while until I'm bored, and then I check for updates again. It's just too much tbh, you can only have so many hobbies, fewer when they are fast-paced.
Yeah, my thought as well! This is also how I feel whenever I hear praise for the usability of low quants of huge models, which got me hooked enough to download and try them. Most of the time, the results seem decent on some simple test prompt. But as soon as I put them to use on a pretty technical, detail-driven task, they always fail miserably.
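That rule of thumb can be written down as a small selection routine. The following is only a sketch: it assumes file size is a usable proxy for "biggest, least quantized", and the `Candidate` type, the `pick_quant` helper and every number in the example are hypothetical placeholders, not measurements from this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str            # label for a model/quant combination
    size_gb: float       # file size, used here as a rough quality proxy
    tokens_per_s: float  # generation speed measured on your own hardware


def pick_quant(candidates: list[Candidate], memory_gb: float,
               min_tokens_per_s: float) -> Optional[Candidate]:
    """Return the largest candidate that fits in memory and is not too slow."""
    usable = [
        c for c in candidates
        if c.size_gb <= memory_gb and c.tokens_per_s >= min_tokens_per_s
    ]
    # Largest usable file wins; None if nothing meets both limits.
    return max(usable, key=lambda c: c.size_gb, default=None)


# Hypothetical placeholder numbers, for illustration only.
options = [
    Candidate("small-quant", size_gb=20.0, tokens_per_s=40.0),
    Candidate("medium-quant", size_gb=30.0, tokens_per_s=25.0),
    Candidate("large-quant", size_gb=45.0, tokens_per_s=8.0),
]
choice = pick_quant(options, memory_gb=48.0, min_tokens_per_s=10.0)
print(choice.name if choice else "nothing fits")
```

Lowering `min_tokens_per_s` is the "slowest acceptable pace" knob: the more patience you have, the larger the candidate that gets picked.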
I tried it at q2\_K... No bueno:
[https://www.reddit.com/r/LocalLLaMA/comments/1r4k79m/comment/o5c984o/](https://www.reddit.com/r/LocalLLaMA/comments/1r4k79m/comment/o5c984o/)
The answers are worthless. But I didn't get loops or nonsensical words, so that's already something.
I can't run any better quant to test.
Just as info, the Q3\_K\_M was already able to make a Flappy Bird game in one shot. Not out of the ordinary anymore, but a huge jump from the Q2 that was close to worthless.
Q2_K is problematic for longer answers and works best for shorter ones. I always try to use an imatrix Q4_K_M, it's always the best compromise for any model. Of course, that's if you have enough VRAM and RAM (24+64 here).
Those are not imatrix quants yet. Aside from that, the token embedding and output weights are quantized heavily, which can hurt coding models. I'll wait for a quant that was created using imatrix and also has token embedding and output weights at Q8 or better. That's usually the case for Bartowski and Unsloth L / XL quants.
21 Comments
DeProgrammer99@reddit
ClimateBoss@reddit
Significant_Fig_7581@reddit (OP)
Mkengine@reddit
rainbyte@reddit
nima3333@reddit
Round_Mixture_7541@reddit
Sufficient-Rent6078@reddit
Achso998@reddit
Sufficient-Rent6078@reddit
BetaOp9@reddit
StardockEngineer@reddit
cleverusernametry@reddit
AcePilot01@reddit
hieuphamduy@reddit
mouseofcatofschrodi@reddit
Significant_Fig_7581@reddit (OP)
mouseofcatofschrodi@reddit
Blizado@reddit
Significant_Fig_7581@reddit (OP)
Chromix_@reddit