ElectronSpiderwort@reddit
While you're waiting, load up GLM-4.5-Air-UD-Q3_K_XL, which is the best model I've tried *to date* on 64GB RAM. I know, I know; it's more than a month old and stale already, but it's better than twiddling thumbs
Free-Combination-773@reddit
The biggest issue with it is waiting approximately forever for prompts to be processed and for responses to finish, isn't it?
-dysangel-@reddit
Yeah. Though you have to trade it off vs Air being able to do more things correctly first try.
I feel like people with high RAM Macs are close to being able to ditch cloud models just now. The current best setup is probably along these lines:
- Planning: GLM 4.5 or a Deepseek-level model makes a detailed plan (if you cache the system prompt for this, you can proceed instantly rather than waiting for 8-10k tokens to process; see the sketch after this list)
- Code editing: Qwen 3 Next as first pass, GLM 4.5 Air as backup if Qwen fails (oooor, ideally a big brother linear attention Qwen 3.5 model whenever it comes out!)
- Debugging: GLM 4.5 Air, fallback to using Copilot with Claude or Gemini, or just fix things myself
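A rough sketch of what I mean by caching the planning prompt (treat the endpoint, file names, and model as placeholders), assuming a local llama.cpp server started with something like `llama-server -m planner-model.gguf --port 8080`:

```python
# Reuse the server's KV cache for the long planning system prompt so the
# ~8-10k shared tokens are only processed once, not on every request.
import requests

PLANNER_URL = "http://localhost:8080/completion"             # hypothetical local endpoint
PLANNING_SYSTEM_PROMPT = open("planning_prompt.txt").read()  # the long shared prefix

def plan(task: str) -> str:
    resp = requests.post(PLANNER_URL, json={
        "prompt": PLANNING_SYSTEM_PROMPT + "\n\nTask:\n" + task,
        "n_predict": 1024,
        "cache_prompt": True,  # llama.cpp server reuses the cached common prefix
    })
    resp.raise_for_status()
    return resp.json()["content"]

print(plan("Add retry logic to the upload client."))
```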
thrownawaymane@reddit
What do you think the minimum for that is, RAM-wise? 64GB or 128?
ElectronSpiderwort@reddit
Yes, the performance is predictable :/ But, muuuuch faster than 32b+ dense
-Ellary-@reddit (OP)
Oh, I'm using it right now, but regular Q4KS at 16k context.
It is a fun and balanced model, feels somewhat like a 40-50b dense.
Time_Reaper@reddit
Now get 256gb of ddr5 and run big glm at q4km.
VoidAlchemy@reddit
Wendell just did a video on which mobos are more likely to support this now: https://www.youtube.com/watch?v=Rn18jQSi8vg
u/Blizado do you have an AM5 mobo or older DDR4 (hence the comment on 128GB max)? Typically 4x DIMMs is the "verboten" configuration as the guaranteed data rate is much lower than with 2x DIMMs, however it is now becoming more likely - albeit not guaranteed. The silicon lottery is giving better odds these days lol
Blizado@reddit
Nope, an Intel Z790 board which can only handle 128GB max. And I'm actually having "fun" figuring out how to get it back closer to 6000MHz; right now I'm only at 5200MHz. XD
VoidAlchemy@reddit
Ahh, got it. Yeah, it's hard to track actual support over time with BIOS updates etc. As you say, plenty of "fun" haha <3
Blizado@reddit
Dumb when your motherboard can only handle 128GB max. XD
Borkato@reddit
What’s the tok/s like?
ElectronSpiderwort@reddit
Almost unusably slow on 2 channels of DDR4. 2.5 tok/sec inference. But, good answers. Would have been unbelievable just 5 years ago
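For context, a quick bandwidth sanity check with rough numbers (GLM-4.5-Air has roughly 12B active parameters per token, and dual-channel DDR4-3200 is about 51 GB/s):

```python
# Decoding is memory-bandwidth bound: every generated token has to read
# all active-parameter weights from RAM, which caps tokens/sec.
active_params = 12e9    # GLM-4.5-Air active parameters per token (approx.)
bpw = 3.5               # rough bits per weight for a Q3_K_XL-class quant
bandwidth = 51.2e9      # dual-channel DDR4-3200, bytes per second

bytes_per_token = active_params * bpw / 8
print(f"theoretical ceiling: ~{bandwidth / bytes_per_token:.1f} tok/s")  # ~9.8 tok/s
```

Real-world decode lands well below that ceiling, so 2.5 tok/s on two DDR4 channels is roughly what you'd expect.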
MerePotato@reddit
Don't MoE models seriously suffer at high quantisation levels?
VoidAlchemy@reddit
It's not quite that simple; my own quantizations of ubergarm/GLM-4.5-GGUF keep high-quality attn/shexp/first dense layers while the routed experts get more heavily quantized. This allows us to go below ~3bpw and maintain pretty decent perplexity (quality). Check my model cards for graphs of relative quality vs size for a number of recent large MoEs.
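A back-of-the-envelope illustration of why that works (rough hypothetical numbers here, not the exact recipe): the routed experts hold the overwhelming majority of the parameters, so their bit width dominates the overall average.

```python
# Effective bits-per-weight when a small high-precision group (attn/shexp/first
# dense layer) is combined with heavily quantized routed experts.
# Parameter counts and bit widths are rough, illustrative figures for a
# GLM-4.5-sized MoE (~355B total parameters).
def effective_bpw(groups):
    """groups: list of (parameter_count, bits_per_weight) tuples."""
    total_bits = sum(n * bpw for n, bpw in groups)
    total_params = sum(n for n, _ in groups)
    return total_bits / total_params

groups = [
    (10e9, 8.0),   # attention, shared expert, first dense layer kept near Q8
    (345e9, 2.5),  # routed experts squeezed to ~2.5 bpw
]
print(f"~{effective_bpw(groups):.2f} bpw overall")  # ~2.65 bpw
```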
ElectronSpiderwort@reddit
In my experience the one I listed above performs well, but with the occasional single-token error on long output or a missed prompt detail. If I had OP's rig with a nice GPU as well, I'd go with a larger quant, offloading just the experts to RAM. Objectively, on the triangle-ball pygame prompt, it scores better than any other local model I've tried so far, like gpt oss 20b, Qwen 30a3b and 32B, and GLM 4 32b. The latest dense Qwen thinking 32B Q8 might be better, but my system is so slow I'm not going to test it :shrug:
Pristine-Woodpecker@reddit
Depends. Qwen degrades quickly below Q4; DeepSeek is still reasonable down to Q1.
Dr_Me_123@reddit
Is it really better than GLM-Air, or is it just because it’s a new architecture?
Muted-Celebration-47@reddit
You can try it with Openrouter. From my tests, GLM-4.5 air is better.
random-tomato@reddit
Have to agree with this; I tested it with Hyperbolic API and I do feel like GLM 4.5 Air is noticeably better in many aspects (writing/coding/tool calling/etc.)
stoppableDissolution@reddit
There's not a sliver of a chance it is better than Air. But people fall into the new-and-shiny trap so easily.
DistanceSolar1449@reddit
Just use vllm
BananaPeaches3@reddit
It doesn't support older GPUs.
Outrageous_Cap_1367@reddit
with ram?
DistanceSolar1449@reddit
https://docs.vllm.ai/en/v0.7.1/getting_started/examples/cpu_offload.html
milksteak11@reddit
Looks like offline only, I was excited for a sec
DistanceSolar1449@reddit
https://docs.vllm.ai/en/v0.8.1/getting_started/examples/basic.html#cpu-offload
prusswan@reddit
Were you able to get this to work with any model right now? So far I haven't seen anyone get it working with Qwen3 Next.
solidsnakeblue@reddit
I tried and could not get this working with CPU offload; I managed to fit it (barely) with a 4-bit AWQ quant on my 2x 3090 without that setting, but the model output was somewhat nonsensical. Decided to give it some time and wait for proper support.
Marksta@reddit
I gave this a try; it's not hybrid inference. Even a 1GB offload crippled performance by 95%, which was weird because I used it on a model I have like 100x concurrency on. So the implementation must just be pretending you have VRAM you don't, and I suspect it doesn't really have a use besides maybe on Gen5 x16.
power97992@reddit
Just use sglang or vllm
nmkd@reddit
Not fun on Windows
mgr2019x@reddit
We need cpu offloading...
bolmer@reddit
vLLM has CPU offloading, doesn't it?
DistanceSolar1449@reddit
No. The vLLM CPU offload is actually misnamed.
They swap pages from VRAM to system RAM, but the actual calculations are done on the GPU. That limits you to PCIe speed, not memory bandwidth, so only ~32GB/sec for most Nvidia 3000/4000 GPUs.
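For reference, this is roughly what that option looks like in vLLM's Python API via `cpu_offload_gb` (the model name is just a placeholder and assumes the architecture is actually supported):

```python
# The "offloaded" gigabytes live in system RAM and get streamed back over PCIe
# on every forward pass, so throughput is capped by PCIe bandwidth
# (~32 GB/s on a Gen4 x16 slot), not by system RAM bandwidth.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # placeholder; needs a supported arch
    cpu_offload_gb=16,                         # pretend we have 16 GB more VRAM
    gpu_memory_utilization=0.9,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```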
iKy1e@reddit
The fast support for everything in MLX has amazed me. I thought it'd be like most of Apple's open source example code, released and never touched again, but it gets regular updates to support all sorts of models.
mearyu_@reddit
There are some geniuses who love their Macintoshes, like https://github.com/ml-explore/mlx-lm/pull/441
DistanceSolar1449@reddit
Look at how clean that PR is. 4 files, mostly just adding a qwen3_next.py file which is 500 lines, no big changes.
That’s why it was done quickly. I doubt you can do that in the mess of llama.cpp’s codebase.
Pristine-Woodpecker@reddit
I mean for one you can't paste Python code in the C++ codebase eh, nor will the proper offloading and multi-arch support materialize out of thin air.
ttkciar@reddit
I love llama.cpp with a burning love, but not gonna lie, sometimes it feels like that too.
belgradGoat@reddit
That’s a rare occasion I’d say
-Ellary-@reddit (OP)
So true.
LocoMod@reddit
I see you are doing your part in turning what is likely one of the best resources for LLMs in the entire web into a trashy meme sub.
autoencoder@reddit
it's tagged "funny", so I guess it lives up to its tag
-Ellary-@reddit (OP)
Since it's a trashy meme about a sub-related problem, I'd say I'm turning it into a thematic LLM trashy-meme sub.
Go get your own, this one is taken.
Muted-Celebration-47@reddit
From my tests, GLM4.5-air is better. So I think I will skip it and wait for new models.
WestPush7@reddit
Yo, LM Studio on my M4 Pro, 48GB RAM (37GB VRAM), is running smoother than a greased weenie. Peak RAM usage? 34GB, still got headroom! No hallucinations, no repetition loops, especially when I keep temp under 0.5 (all other settings default). Can even run a few apps and a bunch of other tabs alongside it, like a boss.
What shocks me:
45-59 T/S on a laptop? That’s unholy fast.
Qwen Next 80b 3-bit as my daily driver and SEED OSS 36B @ 4-bit too with decent speed and accuracy.
Pristine-Woodpecker@reddit
I guess that's with a quant like this one? https://huggingface.co/nightmedia/Qwen3-Next-80B-A3B-Instruct-qx3-mlx
WestPush7@reddit
That's correct.
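For anyone who wants to try the same thing outside LM Studio, a minimal mlx-lm sketch (assumes Apple silicon with `pip install mlx-lm`; the repo name is the quant linked above):

```python
# Load the 3-bit MLX quant of Qwen3-Next and generate a short reply.
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-Next-80B-A3B-Instruct-qx3-mlx")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about local inference."}],
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```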
-Ellary-@reddit (OP)
Oh right, M4 Pro, where did I put it...
spawncampinitiated@reddit
It's in your other coat
raysar@reddit
laptop price shocks me XD
geataa@reddit
No servers, no surveillance: https://youtube.com/shorts/tsl8SBYFuOg?feature=share
Check it out, guys.
edward-dev@reddit
Given their previous model release history, there's a strong likelihood that Qwen 3.5 will be available within three months. By then, most llama.cpp users will likely have moved on, as an improved version of the model will probably already exist.
Qwen 2 --> June 2024
Qwen 2.5 --> September 2024
Qwen 3 --> April 2025
Qwen 3.5 shouldn't be far off...
-Ellary-@reddit (OP)
True.
cGalaxy@reddit
Can the full model run on 96GB of VRAM, or does it need CPU offloading? Asking about vLLM.
-Ellary-@reddit (OP)
Even 64gb of vram will do.
cGalaxy@reddit
Are you talking about the full model or a quant?
I thought it needed about 160GB of VRAM, but then I see this post about CPU offload with less than 128GB.
-Ellary-@reddit (OP)
I'm talking about 4bits.
lemon07r@reddit
It doesn't seem to be very good sadly, or at least it feels like a sidestep from previous models
raysar@reddit
There is a way to load AWQ with vLLM using Docker, no?
I need to test it.
https://huggingface.co/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
UniqueAttourney@reddit
When you get 64GB of DDR4 instead :(
daHaus@reddit
I'd help if I had the hardware to work with but I can't bring myself to give money to them right now. I keep thinking about how the CEOs of nvidia and AMD are cousins and it makes me wonder
CBW1255@reddit
I run it on MLX and while I understand you will want to check it out for yourself, for my use case (classifications, spelling / formulating sentences) it's just meh.
So, you're not missing out on much. I do realize this won't help you much.
ComplexType568@reddit
welp, for some cope, at least this is the prerelease for Qwen 3 Next's (3.5?) lineup, so the moment the entire lineup comes out everything should work, if llama.cpp implements it in time!!!
TipIcy4319@reddit
It's very unfortunate. By the time there's proper support, the community will have already moved on. I have 22gb VRAM and 64gb RAM. Wondering how well it would run.
condition_oakland@reddit
In the original release thread some big-brained chad speculated that Qwen purposely released this now to give the OSS community time to implement support for the new architecture, so when the next major Qwen model drops it will have day-one support.
JeffieSandBags@reddit
With a small context it'll be usable for chatting; long context won't be usable for us, I believe.
Commercial-Celery769@reddit
Been waiting to distill it but I can't properly test without llama.cpp :(
ThinkExtension2328@reddit
When I have 128gb of ram and 28gb of vram locked and loaded but no gguf is available
MonitorAway2394@reddit
bwahahaha I be checking these f'ing subs 2-56600x daily hoping someone was like, legit, genius/miracle-made to absolve of us of this pain, this wait, this decay, I AM DECAYING QWEN, I AM DECAYING, HURRY! I NEED YOU BEFORE I FURTHER DECAY!!! ok my ass needs to get off the web now I'm too manic, :D
_raydeStar@reddit
Ahh the good old days when I was just mashing f5 on unsloth assuming it was coming any moment.
MonitorAway2394@reddit
the fucking 'web' lolololol I'm old...
Aware-Common-7368@reddit
What does it mean?
DrVonSinistro@reddit
The Qwen team will have an improved iteration of that model before llama.cpp is done implementing support for it.
-Ellary-@reddit (OP)
The Qwen team says this arch will be used for the new Qwen 3.5 models in the future; Qwen 3 Next is like a test run.
PANIC_EXCEPTION@reddit
Boy am I glad I have an M1 Max 64 GB
xxPoLyGLoTxx@reddit
Apparently the mlx version works. Just use that! ^^/s
MonitorAway2394@reddit
wait, what, where, how, and apologies I keep forgetting MLX is a thing since I got my m3 ultra sauce!! (I feel dumb as fuck lol)
MonitorAway2394@reddit
NVM got it running lol :D <3
xxPoLyGLoTxx@reddit
Nice! I haven’t tried it yet. But it’s on the docket. :)
jarec707@reddit
Maybe they could run a Mac emulator on one of those giant PCs /s
knownboyofno@reddit
If you have 6 GB VRAM, look at ktransformers. Here is a post talking about it a few hours ago.
https://www.reddit.com/r/LocalLLaMA/comments/1nipldx/ktransformers_now_supports_qwen3next/
-Ellary-@reddit (OP)
On it.
DataGOGO@reddit
Correct.
That is how big the model is, so you need that much RAM to load it. It doesn't matter whether it's VRAM or system memory, but you need the RAM.
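Rough math for illustration (approximate bits-per-weight figures, ignoring KV cache and runtime overhead) for an 80B-parameter model; the weights have to fit in VRAM plus system RAM combined:

```python
# Approximate weight size at a few common quant levels for an 80B-parameter model.
PARAMS = 80e9
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_XL", 3.8)]:
    print(f"{name}: ~{PARAMS * bpw / 8 / 1e9:.0f} GB of weights")
```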
Majestic_Complex_713@reddit
Someone is gonna try to run this off their swap file. They'll probably report back in 2027.
-Ellary-@reddit (OP)
Right at the llama.cpp support day!
SuperChewbacca@reddit
What cards do you have? If you are running two MI50's, I am backporting and making big changes to get the vllm-gfx906 v1 engine to work with Qwen 3 Next.
-Ellary-@reddit (OP)
The mighty 3060 12GB ;) I use it for cache.
A3B models are great to run on CPU with regular RAM.
SuperChewbacca@reddit
Ahh, gotcha, I see you said DDR5. ktransformers is what you want with a 4-bit quant, which won't get you a ton of context, but you should be able to make it work.
Mediocre-Method782@reddit
That's how high-face cultures say "yeah nah you do it"