ElectronSpiderwort@reddit
While you're waiting, load up GLM-4.5-Air-UD-Q3_K_XL, which is the best model I've tried *to date* on 64GB RAM. I know, I know; it's more than a month old and stale already, but it's better than twiddling thumbs
Free-Combination-773@reddit
The biggest issue with it is waiting approximately forever for prompts to be processed and for responses to finish, isn't it?
-dysangel-@reddit
Yeah. Though you have to trade it off vs Air being able to do more things correctly first try.
I feel like people with high RAM Macs are close to being able to ditch cloud models just now. The current best setup is probably along these lines:
- Planning: GLM 4.5 or a Deepseek-level model makes a detailed plan (if you cache the system prompt for this, you can proceed instantly rather than waiting for 8-10k tokens to process; see the sketch after this list)
- Code editing: Qwen 3 Next as first pass, GLM 4.5 Air as backup if Qwen fails (oooor, ideally a big brother linear attention Qwen 3.5 model whenever it comes out!)
- Debugging: GLM 4.5 Air, fallback to using Copilot with Claude or Gemini, or just fix things myself
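A rough sketch of what I mean by caching the planning prompt (treat the endpoint, file names, and model as placeholders), assuming a local llama.cpp server started with something like `llama-server -m planner-model.gguf --port 8080`:

```python
# Reuse the server's KV cache for the long planning system prompt so the
# ~8-10k shared tokens are only processed once, not on every request.
import requests

PLANNER_URL = "http://localhost:8080/completion"             # hypothetical local endpoint
PLANNING_SYSTEM_PROMPT = open("planning_prompt.txt").read()  # the long shared prefix

def plan(task: str) -> str:
    resp = requests.post(PLANNER_URL, json={
        "prompt": PLANNING_SYSTEM_PROMPT + "\n\nTask:\n" + task,
        "n_predict": 1024,
        "cache_prompt": True,  # llama.cpp server reuses the cached common prefix
    })
    resp.raise_for_status()
    return resp.json()["content"]

print(plan("Add retry logic to the upload client."))
```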
thrownawaymane@reddit
What do you think the minimum for that is, RAM-wise? 64GB or 128?
ElectronSpiderwort@reddit
Yes, the performance is predictable :/ But, muuuuch faster than 32b+ dense
-Ellary-@reddit (OP)
Oh, I'm using it right now, but regular Q4KS at 16k context.
It is a fun and balanced model, feels somewhat like a 40-50b dense.
Time_Reaper@reddit
Now get 256gb of ddr5 and run big glm at q4km.
VoidAlchemy@reddit
Wendell just did a video on which mobos are more likely to support this now: https://www.youtube.com/watch?v=Rn18jQSi8vg
u/Blizado do you have an AM5 mobo or older DDR4 (hence the comment on 128GB max)? Typically 4x DIMMs is the "verboten" configuration as the guaranteed data rate is much lower than with 2x DIMMs, however it is now becoming more likely - albeit not guaranteed. The silicon lottery is giving better odds these days lol
Blizado@reddit
Nope, an Intel Z790 board which can only handle 128GB max. And I'm actually having "fun" figuring out how to get it back closer to 6000MHz; right now I'm only at 5200MHz. XD
VoidAlchemy@reddit
Ahh, got it. Yeah, it's hard to track actual support over time with BIOS updates etc. As you say, plenty of "fun" haha <3
Blizado@reddit
Dumb when your motherboard can only handle 128GB max. XD
Borkato@reddit
What’s the tok/s like?
ElectronSpiderwort@reddit
Almost unusably slow on 2 channels of DDR4. 2.5 tok/sec inference. But, good answers. Would have been unbelievable just 5 years ago
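For context, a quick bandwidth sanity check with rough numbers (GLM-4.5-Air has roughly 12B active parameters per token, and dual-channel DDR4-3200 is about 51 GB/s):

```python
# Decoding is memory-bandwidth bound: every generated token has to read
# all active-parameter weights from RAM, which caps tokens/sec.
active_params = 12e9    # GLM-4.5-Air active parameters per token (approx.)
bpw = 3.5               # rough bits per weight for a Q3_K_XL-class quant
bandwidth = 51.2e9      # dual-channel DDR4-3200, bytes per second

bytes_per_token = active_params * bpw / 8
print(f"theoretical ceiling: ~{bandwidth / bytes_per_token:.1f} tok/s")  # ~9.8 tok/s
```

Real-world decode lands well below that ceiling, so 2.5 tok/s on two DDR4 channels is roughly what you'd expect.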
MerePotato@reddit
Don't MoE models seriously suffer at high quantisation levels?
VoidAlchemy@reddit
It's not quite that simple; my own quantizations of ubergarm/GLM-4.5-GGUF keep high-quality attn/shexp/first dense layers while the routed experts get more heavily quantized. This allows us to go below ~3bpw and maintain pretty decent perplexity (quality). Check my model cards for graphs of relative quality vs size for a number of recent large MoEs.
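A back-of-the-envelope illustration of why that works (rough hypothetical numbers here, not the exact recipe): the routed experts hold the overwhelming majority of the parameters, so their bit width dominates the overall average.

```python
# Effective bits-per-weight when a small high-precision group (attn/shexp/first
# dense layer) is combined with heavily quantized routed experts.
# Parameter counts and bit widths are rough, illustrative figures for a
# GLM-4.5-sized MoE (~355B total parameters).
def effective_bpw(groups):
    """groups: list of (parameter_count, bits_per_weight) tuples."""
    total_bits = sum(n * bpw for n, bpw in groups)
    total_params = sum(n for n, _ in groups)
    return total_bits / total_params

groups = [
    (10e9, 8.0),   # attention, shared expert, first dense layer kept near Q8
    (345e9, 2.5),  # routed experts squeezed to ~2.5 bpw
]
print(f"~{effective_bpw(groups):.2f} bpw overall")  # ~2.65 bpw
```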
ElectronSpiderwort@reddit
In my experience the one I listed above performs well, but with the occasional single-token error on long output or a missed prompt detail. If I had OP's rig with a nice GPU as well, I'd go with a larger quant, offloading just the experts to RAM. Objectively, on the triangle-ball pygame prompt, it scores better than any other local model I've tried so far, like gpt oss 20b, Qwen 30a3b and 32B, and GLM 4 32b. The latest dense Qwen thinking 32B Q8 might be better, but my system is so slow I'm not going to test it :shrug:
Pristine-Woodpecker@reddit
Depends. Qwen degrades quickly below Q4; DeepSeek is still reasonable down to Q1.
Dr_Me_123@reddit
Is it really better than GLM-Air, or is it just because it’s a new architecture?
Muted-Celebration-47@reddit
You can try it with Openrouter. From my tests, GLM-4.5 air is better.
random-tomato@reddit
Have to agree with this; I tested it with Hyperbolic API and I do feel like GLM 4.5 Air is noticeably better in many aspects (writing/coding/tool calling/etc.)
stoppableDissolution@reddit
There's not a sliver of a chance it is better than Air. But people fall into the new-and-shiny trap so easily.
DistanceSolar1449@reddit
Just use vllm
BananaPeaches3@reddit
It doesn't support older GPUs.
Outrageous_Cap_1367@reddit
with ram?
DistanceSolar1449@reddit
https://docs.vllm.ai/en/v0.7.1/getting_started/examples/cpu_offload.html
milksteak11@reddit
Looks like offline only, I was excited for a sec
DistanceSolar1449@reddit
https://docs.vllm.ai/en/v0.8.1/getting_started/examples/basic.html#cpu-offload
prusswan@reddit
Were you able to get this to work with any model right now? So far I haven't seen anyone get it working with Qwen3 Next.
solidsnakeblue@reddit
I tried and could not get this working with CPU offload; I managed to fit it (barely) with a 4-bit AWQ quant on my 2x 3090 without that setting, but the model output was somewhat nonsensical. Decided to give it some time and wait for proper support.
Marksta@reddit
I gave this a try; it's not hybrid inference. Even a 1GB offload crippled performance by 95%, which was weird because I used it on a model I have like 100x concurrency on. So the implementation must just be pretending you have VRAM you don't, and I suspect it doesn't really have a use besides maybe on Gen5 x16.
power97992@reddit
Just use sglang or vllm
nmkd@reddit
Not fun on Windows
mgr2019x@reddit
We need cpu offloading...
bolmer@reddit
vLLM has CPU offloading, doesn't it?
DistanceSolar1449@reddit
No. The vLLM CPU offload is actually misnamed.
They swap pages from VRAM to system RAM, but the actual calculations are done on the GPU. That limits you to PCIe speed, not memory bandwidth, so only ~32GB/sec for most Nvidia 3000/4000 GPUs.
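For reference, this is roughly what that option looks like in vLLM's Python API via `cpu_offload_gb` (the model name is just a placeholder and assumes the architecture is actually supported):

```python
# The "offloaded" gigabytes live in system RAM and get streamed back over PCIe
# on every forward pass, so throughput is capped by PCIe bandwidth
# (~32 GB/s on a Gen4 x16 slot), not by system RAM bandwidth.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",  # placeholder; needs a supported arch
    cpu_offload_gb=16,                         # pretend we have 16 GB more VRAM
    gpu_memory_utilization=0.9,
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```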
iKy1e@reddit
The fast support for everything in MLX has amazed me. I thought it'd be like most of Apple's open source example code, released and never touched again, but it gets regular updates to support all sorts of models.
mearyu_@reddit
There are some geniuses who love their Macintoshes, like https://github.com/ml-explore/mlx-lm/pull/441
DistanceSolar1449@reddit
Look at how clean that PR is. 4 files, mostly just adding a qwen3_next.py file which is 500 lines, no big changes.
That’s why it was done quickly. I doubt you can do that in the mess of llama.cpp’s codebase.
Pristine-Woodpecker@reddit
I mean for one you can't paste Python code in the C++ codebase eh, nor will the proper offloading and multi-arch support materialize out of thin air.
ttkciar@reddit
I love llama.cpp with a burning love, but not gonna lie, sometimes it feels like that too.
belgradGoat@reddit
That’s a rare occasion I’d say
-Ellary-@reddit (OP)
So true.
LocoMod@reddit
I see you are doing your part in turning what is likely one of the best resources for LLMs in the entire web into a trashy meme sub.
autoencoder@reddit
it's tagged "funny", so I guess it lives up to its tag
-Ellary-@reddit (OP)
Since it's a trashy meme about a sub-related problem, I'd say I'm turning it into a thematic LLM trashy-meme sub.
Go get your own, this one is taken.
Muted-Celebration-47@reddit
From my tests, GLM4.5-air is better. So I think I will skip it and wait for new models.
WestPush7@reddit
Yo, LM Studio on my M4 Pro, 48GB RAM (37GB VRAM), is running smoother than a greased weenie. Peak RAM usage? 34GB, still got headroom! No hallucinations, no repetition loops, especially when I keep temp under 0.5 (all other settings default). Can even run a few apps and a bunch of other tabs alongside it, like a boss.
What shocks me:
45-59 T/S on a laptop? That’s unholy fast.
Qwen Next 80b 3-bit as my daily driver and SEED OSS 36B @ 4-bit too with decent speed and accuracy.
Pristine-Woodpecker@reddit
I guess that's with a quant like this one? https://huggingface.co/nightmedia/Qwen3-Next-80B-A3B-Instruct-qx3-mlx
WestPush7@reddit
That's correct.
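For anyone who wants to try the same thing outside LM Studio, a minimal mlx-lm sketch (assumes Apple silicon with `pip install mlx-lm`; the repo name is the quant linked above):

```python
# Load the 3-bit MLX quant of Qwen3-Next and generate a short reply.
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/Qwen3-Next-80B-A3B-Instruct-qx3-mlx")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about local inference."}],
    add_generation_prompt=True,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```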
-Ellary-@reddit (OP)
Oh right, M4 Pro, where did I put it...
spawncampinitiated@reddit
It's in your other coat
raysar@reddit
laptop price shocks me XD
geataa@reddit
No servers, no surveillance: https://youtube.com/shorts/tsl8SBYFuOg?feature=share
Check it out, guys.
edward-dev@reddit
Given their previous model release history, there's a strong likelihood that Qwen 3.5 will be available within three months. By then, most llama.cpp users will likely have moved on, as an improved version of the model will probably already exist.
Qwen 2 --> June 2024
Qwen 2.5 --> September 2024
Qwen 3 --> April 2025
Qwen 3.5 shouldn't be far off...
-Ellary-@reddit (OP)
True.
cGalaxy@reddit
Can the full model run on 96GB of VRAM, or does it need CPU offloading? Asking about vLLM.
-Ellary-@reddit (OP)
Even 64gb of vram will do.
cGalaxy@reddit
Are you talking about the full model or a quant?
I thought it needed about 160GB of VRAM, but then I see this post about CPU offload with less than 128GB.
-Ellary-@reddit (OP)
I'm talking about 4bits.
lemon07r@reddit
It doesn't seem to be very good sadly, or at least it feels like a sidestep from previous models
raysar@reddit
There is a way to load AWQ with vLLM using Docker, no?
I need to test it.
https://huggingface.co/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
UniqueAttourney@reddit
When you get 64GB of DDR4 instead :(
daHaus@reddit
I'd help if I had the hardware to work with but I can't bring myself to give money to them right now. I keep thinking about how the CEOs of nvidia and AMD are cousins and it makes me wonder
CBW1255@reddit
I run it on MLX and while I understand you will want to check it out for yourself, for my use case (classifications, spelling / formulating sentences) it's just meh.
So, you're not missing out on much. I do realize this won't help you much.
ComplexType568@reddit
welp, for some cope, at least this is the prerelease for Qwen 3 Next's (3.5?) lineup, so the moment the entire lineup comes out everything should work, if llama.cpp implements it in time!!!
TipIcy4319@reddit
It's very unfortunate. By the time there's proper support, the community will have already moved on. I have 22gb VRAM and 64gb RAM. Wondering how well it would run.
condition_oakland@reddit
In the original release thread some big-brained chad speculated that Qwen purposely released this now to give the OSS community time to implement support for the new architecture, so when the next major Qwen model drops it will have day-one support.
JeffieSandBags@reddit
With a small context it'll be usable for chatting; long context won't be usable for us, I believe.
Commercial-Celery769@reddit
Been waiting to distill it but I can't properly test without llama.cpp :(
ThinkExtension2328@reddit
When I have 128gb of ram and 28gb of vram locked and loaded but no gguf is available
MonitorAway2394@reddit
bwahahaha I be checking these f'ing subs 2-56600x daily hoping someone was like, legit, genius/miracle-made to absolve of us of this pain, this wait, this decay, I AM DECAYING QWEN, I AM DECAYING, HURRY! I NEED YOU BEFORE I FURTHER DECAY!!! ok my ass needs to get off the web now I'm too manic, :D
_raydeStar@reddit
Ahh the good old days when I was just mashing f5 on unsloth assuming it was coming any moment.
MonitorAway2394@reddit
the fucking 'web' lolololol I'm old...
Aware-Common-7368@reddit
What does it mean?
DrVonSinistro@reddit
The Qwen team will have an improved iteration of that model before llama.cpp is done implementing support for it.
-Ellary-@reddit (OP)
The Qwen team says this arch will be used for the new Qwen 3.5 models in the future; Qwen 3 Next is like a test run.
PANIC_EXCEPTION@reddit
Boy am I glad I have an M1 Max 64 GB
xxPoLyGLoTxx@reddit
Apparently the mlx version works. Just use that! ^^/s
MonitorAway2394@reddit
wait, what, where, how, and apologies I keep forgetting MLX is a thing since I got my m3 ultra sauce!! (I feel dumb as fuck lol)
MonitorAway2394@reddit
NVM got it running lol :D <3
xxPoLyGLoTxx@reddit
Nice! I haven’t tried it yet. But it’s on the docket. :)
jarec707@reddit
Maybe they could run a Mac emulator on one of those giant PCs /s
knownboyofno@reddit
If you have 6 GB VRAM, look at ktransformers. Here is a post talking about it a few hours ago.
https://www.reddit.com/r/LocalLLaMA/comments/1nipldx/ktransformers_now_supports_qwen3next/
-Ellary-@reddit (OP)
On it.
DataGOGO@reddit
Correct.
That is how big the model is, so you need that much RAM to load it. It doesn't matter whether it's VRAM or system memory, but you need the RAM.
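Rough math for illustration (approximate bits-per-weight figures, ignoring KV cache and runtime overhead) for an 80B-parameter model; the weights have to fit in VRAM plus system RAM combined:

```python
# Approximate weight size at a few common quant levels for an 80B-parameter model.
PARAMS = 80e9
for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_XL", 3.8)]:
    print(f"{name}: ~{PARAMS * bpw / 8 / 1e9:.0f} GB of weights")
```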
Majestic_Complex_713@reddit
Someone is gonna try to run this off their swap file. They'll probably report back in 2027.
-Ellary-@reddit (OP)
Right at the llama.cpp support day!
SuperChewbacca@reddit
What cards do you have? If you are running two MI50's, I am backporting and making big changes to get the vllm-gfx906 v1 engine to work with Qwen 3 Next.
-Ellary-@reddit (OP)
The mighty 3060 12GB ;) I use it for cache.
A3B models are great to run on CPU with regular RAM.
SuperChewbacca@reddit
Ahh, gotcha, I see you said DDR5. ktransformers is what you want with a 4-bit quant, which won't get you a ton of context, but you should be able to make it work.
Mediocre-Method782@reddit
That's how high-face cultures say "yeah nah you do it"