GLM 5.1 Locally: 40tps, 2000+ pp/s

[-]

jacek2023@reddit

I have currently problem purchasing fourth 3090 because I don't see it available anywhere, so I am not sure when I could purchase four 6000 Pro

[-]

Blaze6181@reddit

Yeah they'll wave limit 1 if you make a good business case for it.

[-]

Blaze6181@reddit

They limit one per household. They will wave this if you have a business use for it.

[-]

val_in_tech@reddit (OP)

Its actually not that easy. I exchaisted local supply in my city few times just due to my purchases alone. And lots of defective ones.. Sometimes only show under sustained stress test.

[-]

thetaurean@reddit

Could you please share your test setup? Id like to make sure mine are behaving.

[-]

val_in_tech@reddit (OP)

Run largest model that fits your vram via vLLM and any llm bench for 60 mins non stop. Make sure vram is used 95-98% and gpu CPU is constantly 100%. This would rule out most typical hardware faults.

[-]

thetaurean@reddit

Whats the typical failure state? System crash?

[-]

Full crash, kernel panics, gpu dissapears, chained PCI failures and a bunch of them fallout. Worst part is - it's entirely possible to have faulty card but working just enough not to notice for awhile and then have super weird spontaneous problems. I had to do 2 returns on 1, they said they couldn't fail, I insisted and they could only crash it with my instructions. Then, I heard they just sold it to someone else... I'm talking about official distributor not some random people on the internet...

[-]

Ell2509@reddit

Yeah they are basically the same

/s

[-]

Bootes-sphere@reddit

That's impressive throughput on those RTX 6000 Pros! The 40tps you're hitting is genuinely solid for local inference at that scale. You're right that the software stack will unlock even more potential — sglang and similar optimizers are still evolving fast for enterprise GPU clusters. Have you experimented with different batch sizes or KV cache strategies to see if there's headroom left, or are you already hitting the hardware ceiling?

[-]

val_in_tech@reddit (OP)

I see potential to hit 70-90tps just tinkering with current setup. Likely sacrificing some context. It took some time to get to that stable version so just want to enjoy it for a bit before diving deeper. VLLM is very enterprise focused to the point that the only replies I even for in their discord was from AI bot.. RTX 6000s owners are a bit on their own figuring things out.

[-]

abmateen@reddit

350W x 4 almost 1200W plus Machines 150W is total 1400W is not local setup, it is most like a mini data center ..

[-]

val_in_tech@reddit (OP)

That rig has 2*2500w 🤠 CPU alone @ 300w

[-]

cstocks@reddit

hmmmm nice setup

[-]

xeeff@reddit

just told my gf it's crazy how rich some people are and they can spend huge £££ and have their own lil local SOTA at home and I see this

[-]

qwen_next_gguf_when@reddit

4xrtx 6000 pro , bro

[-]

IrisColt@reddit

humble bragging, heh

[-]

Blaze6181@reddit

Man... there's levels to this. It's all a question of how high a % of a home down payment you're willing to spend on inference gear. Fucking insane.

[-]

hitpopking@reddit

And I’m still rocking 3099

[-]

Cold_Tree190@reddit

You and me both

[-]

EbbNorth7735@reddit

Dude deserves an award for all 3 people he's going to help with their 50k setups

[-]

llamabott@reddit

Wish I was one of them, can't lie.

[-]

Ell2509@reddit

Asking if anyone else has run it. Dude has all of them!

[-]

jedsk@reddit

🥲

[-]

Front_Eagle739@reddit

I have not. And you get double my prefill and slightly more than double my token gen for the same model so I am very jealous. I did only spend 15k on my setup though at least

[-]

val_in_tech@reddit (OP)

Some places I got a card or two from should not he visited alone/with cash, and only during daylight 😅 what's your rig? Half the speed at 15k is really good.

[-]

Front_Eagle739@reddit

Lol the lure of the black gpu market is hard to resist even at risk of being pulled into a dark ally and mugged for your ram.

And I'm doing some research into low cost inference stacks as a launching point for doing some really interesting experiments ive got in mind. I've built a custom llama build that runs decode on a mac studio split between ffn on the mac and attention over network to an rtx 5090 workstation single gpu, 32gb ram 28GB/s nvme array that streams the whole model from disk to gpu during prefill once and does the whole batch in a single pass then hands off the kv to the mac for decode. Added dsa as well recently.

Currently at about 800 tok/s prefill for 16k batches. 350 ish at 100k context. Decode is about 18 tok/s though drops to 10ish long contexts. Kimi a bit better.

Anyway when Im happy ill open source this bit. Should stack nicely with anybody who has a gpu thats got spare compute but doesnt fit the whole model in vram.

[-]

FortiTree@reddit

The 28 Gb/s nvme cross-link would be your bottleneck adding massive latency when data transferred between the 2 hardware. GPU can PP says 5ms but the cross link will add 20ms+ depending on your context size.

This is why native internal nvlink within the same chassis is the industry standard.

[-]

Front_Eagle739@reddit

Pardon? The nvme is for the weight streaming on the rtx host. And yes it causes a first token latency so you only use it for prompts that wont complete in a few seconds on the mac anyway but you batch the whole lot through. The network link between devices is just 10gb eth and is A bottleneck though with a couple optimisations i get 100 to 150 us rtt from it even without rdma which would cut it right down. Enough for 10 to 18 tok/s when the rest of the stack is fast enough which is nice because it lets you bypass the quadratic attention cost the will have you under 5 tok/s near max context. Its not as fast at zero comtext true but you just switch over once its a benefit.

[-]

FullOf_Bad_Ideas@reddit

Nice speeds

managed to get reap-ed nvfp4 version

I can't believe this has good quality. REAP is terrible, NVFP4 on a model not trained for it, assuming it also quantizes attention, is probably a double whammy.

It's a very good model baseline so maybe it can remain fine after this but I think there might be better models to run on 4x 6000 Pros. For example IQ3_KS quant from ubergarm - https://huggingface.co/ubergarm/GLM-5.1-GGUF or Qwen 3.5 397B ~6bpw

[-]

val_in_tech@reddit (OP)

Agree with the sentiment, as never seen good REAP before this setup. Happy to run whatever bench for the sake of science. But I myself don't trust any benches except performance as its hard to argue. Used all the models you mention, this one is a real deal for my work. Qwen 397B was not impressive at all. Ubergarm quants are awesome.

[-]

FullOf_Bad_Ideas@reddit

Happy to run whatever bench for the sake of science.

I think PPL and KLD would be perfect but it's hard to measure in SGLang.

Someone did KLD for Qwen 3.5 397B in SGLang - https://github.com/voipmonitor/rtx6kpro/blob/master/benchmarks/kld-evaluation.md

if you have too much time on your hands you can try to do the same with GLM 5.1 REAP NVFP4 quants that you're using, you'd need to rent some GPUs to run FP8/BF16 model for reference though.

It would be easier to eval something like AIME 24, and it still should show up as it requires long reasoning which is likely show any issues if there are any major issues.

Used all the models you mention, this one is a real deal for my work. Qwen 397B was not impressive at all. Ubergarm quants are awesome.

You still could run GLM 4.7 with very good quant and it may come out better if you like how GLM 5.1 reasons. I do think I like Qwen 3.5 397B better than GLM 4.7 but largely because it's so much faster on my hardware.

[-]

val_in_tech@reddit (OP)

Yeah speed ends up a deciding factor sometimes. At the speed I run glm 5.1 with sglang now iterations are very fast and few sessions can run in parallel. Its worth a lot. Vs higher quants maybe be a bit smarter (maybe?) but then 30-60mins later I see a problem solving is not in the right direction. Ubergarm says the same thing about Qwen, he likes that one more partially coz he can use mmproj with it. I'll look at aime 24. But really happy with that version so far as a end user.

[-]

FullOf_Bad_Ideas@reddit

I saw Qwen 397B quant comparison - https://www.reddit.com/r/LocalLLaMA/comments/1roz3yl/if_youre_using_nvidias_nvfp4_of_qwen35397_try_a/ - NVFP4 looks very bad again but all are pretty bad. If you tried any of them I wouldn't be surprised if you didn't like them.

Exllamav3 quants have much higher accuracy with NeuroSenkos 5bpw quant having KLD of just 0.0079 vs 0.11 (yes, 14x higher despite being similar size) for Nvidia's official Qwen 3.5 397B NVFP4 at similar total size.

I'm personally using my own ~3.5bpw quant that has KLD 0.0286 despite being much smaller than AWQ/NVFP4 and i don't know anything better in that size.

[-]

DeepOrangeSky@reddit

How about GLM 5.1 via a q3 GGUF of some sort (maybe Q3_K_S or Q3_K_M or something)? That would still fit into VRAM + context, I think, and would presumably be superior to a REAP at nvfp4, considering how bad REAPs tend to be.

Also, how much slower does it run if use just one single RTX 6000, instead of all four of them, and use offloading, and just run the active stuff on the lone card and run the rest on dram, compared to running the whole entire model in VRAM? Is it like 2x slower? 10x slower? What is the speed difference?

[-]

val_in_tech@reddit (OP)

Even slightest ram offload totally kills the vibe. Maybe your get 1/4 token generation if your CPU is awesome and lots of fast ram, prompt processing don't remember, might be even worse. Like 1/10th. I did run pretty much all gguf quants of GLM 5.1 from Ubergarm (some of the best) and this one "feels" better than Q3s, partially because they are just not very usable past 80-100k because of slower speeds. Ik_llama github and Ubergarm community discussion is a good place to check. Lots of folks run them with 1-2 gpus and post very detailed benchmarks

[-]

DeepOrangeSky@reddit

Even slightest ram offload totally kills the vibe. Maybe your get 1/4 token generation if your CPU is awesome and lots of fast ram, prompt processing don't remember, might be even worse. Like 1/10th.

Damn... I always thought that rule was more for dense models, where it's like, either it all fits, and you're all good, or if any amount spills over then you're super fucked, and then I thought for MoEs it was much more forgiving and you basically just needed the active params/attention/KV etc to fit and it didn't matter nearly so much whether all the inactive stuff didn't fit.

But, maybe it's just bad in both cases and just way extra worse for the dense models, but still bad in both cases? I dunno. I don't really have much frame of reference yet since I only got into all this stuff a few months ago and have just been using a mac studio so far, so, yea I don't really know much about offloading or dedicated GPU VRAM to DRAM offloading ratios and speed effects from that or anything yet, other than just seeing the occasional random setups people post every so often (which I usually just ignored all those threads since I just have this mac studio or whatever) but now I've started finally glancing at them now that I'm becoming more curious about the setups and speeds I guess.

I did run pretty much all gguf quants of GLM 5.1 from Ubergarm (some of the best) and this one "feels" better than Q3s, partially because they are just not very usable past 80-100k because of slower speeds

Interesting, I wasn't expecting that tbh since whenever people talk about REAP models on here they usually talk about them like they are so terrible that it might as well be called the Grim Reaper because they're so shitty that you basically just instantly die if you try using it. But maybe people were exaggerating a little and they aren't so bad for all use cases or something, lol

Anyway, thanks for the reply, I'll check out that other stuff.

Recently I became obsessed with the idea of trying to do an NVMe Raid 0 setup with several sticks of high speed NVMe + a dedicated GPU, to see if it is somehow possible to get usable speeds (well, "usable" meaning 1/10th of the speeds that most people consider to be "usable" but this wouldn't be for coding necessarily/initially, so, that would be okay maybe) from huge models like GLM 5.1, or Qwen 397, or Kimi, or stuff like that, without having to spend a fortune. I figured as long as you can do the active params/attention stuff fit nice with room to spare for KV and everything on the GPU, then maybe even if the rest of the model was some some NVMe Raid 0 setup, you could still get good speeds on an MoE like that somehow, and get to save a few grand. But, I dunno, everyone I ask about it (including AI, but also humans) all always tell me "it won't work. It'll be slow AF" and that the speeds will be more or less however much memory bandwidth I can get, so like if I used 4 sticks of 14 GB/s with raid 0 to make it be like 14 * 4 = 56 GB/s of memory bandwidth from the NVMe, and let's say the big MoE had like 10 GB of active params/etc, then maybe it'd do like 5.6 tok/s in theory/on paper, and real world maybe 3-4 tok/s. But, I don't get why it works that way, like, I would've thought as long as you can do the active params/attention/whatever you call it part of the MoE on the GPU then I don't get why it would need to be limited to like 4 tok/s, it seems like if it had a 5090 hooked up to that setup then it would just go flying at like 50 tok/s or 100 tok/s or something, since who cares if the inactive part goes slow as long as it can do the active part super fast with the GPU. But, I'm not very computer savvy/technical yet so I guess there are some technical aspects to do with the routing and latency or I dunno, whatever technical stuff that I don't really understand yet. Both regarding that NVMe setup and also for GPU + ram setups, too, I guess.

[-]

val_in_tech@reddit (OP)

Ask around in ik_llama discussion sections or Ubergarm quants. Most of them do exactly that, offload a small portion to vram. I think it's mostly supposed to give you PP boost in some cases. Some results looked promising. Maybe only 2-5 times slower than what I get 100% in vram, same engine. Those folks are very eager to share. Don't be shy and ask them. But the thing is, if you're 100% in vram you are likely using engine with full TP support that can load gpus properly all at once. Then its a whole different game. The example I posted was that. And this is not even fully optimized. I think I can do 70-90tps by all sorts of tuning while sacrificing some context. Which is pretty crazy for a model of this size.

[-]

funding__secured@reddit

Omg the GPU poors are so annoying. Where are the mods?

[-]

Gold_Scholar1111@reddit

how about 2 mac studio m3 ultra 512gb?

[-]

val_in_tech@reddit (OP)

You can certainly run most models there. Its better than a bunch of 6000s that way. I wouldn't call that usable because of very slow PP. Modern ai harness is 5-20k initial prompt. Unless you're ok waiting 5-20minutes for first reply.

[-]

Gold_Scholar1111@reddit

i tried on a single mac m3 ultra 512gb with mlx 4bit glm5.1, it took about totally 8 minutes to process 30k tokens prompt and to generate around 6k tokens output.

[-]

val_in_tech@reddit (OP)

How do you feel about the speed? I really appreciate that machine for what it can run, but feel really weird about doing work and waiting more than 10 seconds. Need to see what direction it takes to feel the time is not gonna be wasted.

[-]

Gold_Scholar1111@reddit

it takes about one day for 200 tasks. it is slow, but dont mind as i feel good to have time for reading books. it is weekend anyway.

[-]

Gold_Scholar1111@reddit

1 mac 512 is around 19tg/s for 4bit glm5.1 at 4k tokens, two should be around 36tg/s. i dont know pp. i may test them if i have time.

[-]

FoxiPanda@reddit

You can get a non-REAP'd version of GLM5.1 running on a single 512GB Mac Studio at ~24tok/s TG but prettttty slow PP (like 100tok/s PP)...obviously that degrades to like 17-18tok/s when you get higher into the context.

For the dual 512G scenario, you'd honestly end up with slightly higher TG but lower PP and it's not worth it to split across multiple even using JACCL RDMA over TB5...at least currently. The stability and speed gains aren't worth it tbh. I've tried lol.

[-]

Dany0@reddit

The M5 ultra can't come soon enough. I hope the delay is because apple is stockpiling chips so that it doesn't sell out on day -1

[-]

FoxiPanda@reddit

It's going to be so hilariously expensive lol... I'm seriously expecting $20-25K for the 512GB if they even launch one at that capacity at all.

[-]

Dany0@reddit

I can see a future where we're all fundraising to get an m5 ultra colocated in a data centre and then we do a timeshare 80s style

[-]

FoxiPanda@reddit

Welp, time to go watch the wolf of wall st again... "fundraising" you say...

[-]

amitbahree@reddit

I am downloading the model as we speak and its one of the ones I am going to also benchmark. (More here: What do you want me to try? : r/LocalLLaMA)

[-]

elelem-123@reddit

As the person who asked you about that model, thank you for doing so, taking the time and making the effort.

[-]

val_in_tech@reddit (OP)

Would be interesting to see performance on that rig compared to RTX 6000s.

[-]

yammering@reddit

What patches? I’ve been failing to get the same model up on my Spark cluster.

[-]

val_in_tech@reddit (OP)

Updated the post. Hopefully it helps. Not sure spark is the same though. Good luck!

[-]

yammering@reddit

It’s not but generally needs at least the same as SM120.

[-]

val_in_tech@reddit (OP)

Share your numbers. If you get to run it. I'm sure lots of people are curious about your machine.

[-]

BankjaPrameth@reddit

The drop of PP and TG on high context is brutal even for 4 x RTX 6000 Bros

[-]

val_in_tech@reddit (OP)

Its improving. Was much higher drop earlier in a past. I'm pretty bullish on inferance software being further optimized for worskation Blackwell and new algoritms.

[-]

Clear-Ad-9312@reddit

That is interesting because the sm120 chips lack hardware instructions that make the sm100 chips capable of running newer algorithms like FA4. Those instructions offer 4x to 8x improvement in efficiency. Without them, the card has to use sm89 instruction set compatible algorithms.
I personally think the "Blackwell"(not real Blackwell in my opinion) workstation cards are fast enough for anyone who is doing solo inference.

[-]

val_in_tech@reddit (OP)

Most certainly ok for solo today. I do see ecosystem evolving into one person needing 10+ inferance requests working in parallel in the future. So hopefully eventually will get optimized one way or another.

[-]

JockY@reddit

You did the thing I procrastinated. Awesome. Sorry people are shitting on you for having good gear.

Would you mind sharing your patches, etc? I’d love to give this a whirl.

[-]

val_in_tech@reddit (OP)

Haters gonna hate. Included in the update post description. Let me know if anything else is missing. Good luck!

[-]

Dany0@reddit

And I have FRIENDS and a SOCIAL LIFE and I'm NOT jealous of your 4x rtx 6000 pro 😤😭

[-]

a_beautiful_rhind@reddit

Just ask them all to chip in for a friends' server.

[-]

speedb0at@reddit

Real

[-]

Eyelbee@reddit

You don't know how little it means for you to say "pretty close to sonnet experience". People here claim opus tier experience for every model every day.

[-]

val_in_tech@reddit (OP)

GLM is no opus. Nothing is. All benchmarks are useless in their own way. Lots of claims are self promotion. I'm just a dumb user of both and can speak from my lil bubble of experience.

[-]

jmakov@reddit

What's the pipi metric?

[-]

val_in_tech@reddit (OP)

Love it 😆 simply put - noone wants small pipi.

[-]

ttkciar@reddit

Prompt Processing, in tokens per second

[-]

3dom@reddit

4xrtx 6000 pro

Nah, I'll prefer 200t/s remotely for the mere $200/month.

[-]

InformationSweet808@reddit

Interesting drop from 2229 → 863 pp/s with context scaling. Any tricks to keep prefill higher at 32K+ or is it just memory bandwidth hitting limits?

[-]

val_in_tech@reddit (OP)

This is the best I've seen for this model, being short of a few B300s. Sometimes its just a matter of time when someone optimizes. Early minimax KV was very large and you can see similar performance drop in ik_llama, but vLLM for instance is much more consistent for minimax in particular, and KV quants gotten really good since. And once you go even higher like kimi / glm, challenge is - not many people have that much vram to share their experiences. I know a few and they all use it for very different usecases so you're on your own a lot.

[-]

moonrust-app@reddit

Crying in a 5090. Feels sad man.

[-]

ormandj@reddit

For the low, low price of $36,000 worth of GPU, you too can run local GLM 5.1! Models like that are best left to DC hardware - the good news is the smaller models are rapidly improving and getting closer and closer to SOTA models of last year. I suspect by EoY 2026 we'll have opus-quality running on single 6000 series blackwell cards, or even multiple 3090s.

[-]

TapAggressive9530@reddit

Agreed

[-]

putrasherni@reddit

Would you be better with 6 RTX 6000 ?

[-]

val_in_tech@reddit (OP)

Nope, actually worse. You need TP for max speed and vLLM / sglang only support 1/2/4/8/16 cards for that with glm architecture. You could run higher quality quant on more cards with ik_llama for example, but speed will drop by around 30-60% for both generation and prompt processing. So next stop is 8 RTX 6000.

[-]