mrinterweb@reddit
754B params? When there are models like Gemma 4 31B and Qwen 3.5 35B in similar benchmark territory, what value does a model this large bring? It's tricky to make apples-to-apples comparisons between GLM-5.1, Gemma 4, and Qwen 3.5, but my impression is that they're in the same neighborhood in output quality.
a_beautiful_rhind@reddit
Let me guess.. you've used none of them?
mrinterweb@reddit
I haven't used GLM-5.1. Just Gemma 4 26B-A4B and Qwen 3.5 35B-A3B.
Hoak-em@reddit
For agentic coding, I'd say it's at about Opus 4.5 level. It's fully capable of performing tasks on large codebases with tooling like forgecode, and I can't say the same for smaller models.
danielhanchen@reddit (OP)
We made some GGUFs for GLM 5.1 at https://huggingface.co/unsloth/GLM-5.1-GGUF
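If you haven't run GGUFs from a repo before, llama.cpp can pull a quant straight from Hugging Face. A minimal sketch (the quant tag and context size are just examples, pick whatever fits your hardware):

```
# Download and serve a quant directly from the Hugging Face repo
llama-server -hf unsloth/GLM-5.1-GGUF:UD-IQ2_M --ctx-size 16384
```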
putrasherni@reddit
wen quants for GPU poors like 32GB, 64GB, and 96GB?
Mashic@reddit
If you're poor with 32GB, what do I call myself with 12GB?
TheManicProgrammer@reddit
If you call yourself poor with 12GB, what do I call myself with 4GB?
Borkato@reddit
Starving to death like a medieval peasant during wartime and the pestilence
huffalump1@reddit
Man and I figured that dropping ~$500 on a 4070 was a decent choice for gaming + AI use for a few years...
...and it's not bad, but definitely not enough for even ~30B models!
Sometimes I wonder if a 4060 16GB would've been a better choice, but honestly, even THAT doesn't get you much further with these modern massive models.
MoudieQaha@reddit
I only have 6GB bruh ...
putrasherni@reddit
you have Qwen 3.5 9B and the new Gemma 4 models
look no further
false79@reddit
Technically, I cannot look further. Welp.
putrasherni@reddit
look back to openrouter and kilocode free models
false79@reddit
what? That's not local.
Mashic@reddit
I run gemma:4-31b/26b and qwen:3.5-35b at UD-IQ2 and UD-IQ3.
Skyline34rGt@reddit
Why such a poor quant? You should go for Q4_K_M with 12GB VRAM for these models, and offload the MoE layers to the CPU. Check my older posts, I've talked about it many times.
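Roughly like this with llama.cpp (a sketch; the model path is a placeholder, and the -ot regex is the usual pattern for pinning MoE expert tensors to system RAM while everything else stays on the 12GB card):

```
# Everything on the GPU except the MoE expert tensors, which go to CPU RAM
llama-server -m model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 8192
```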
jacek2023@reddit
potato
pmttyji@reddit
8GB in my current laptop
jacek2023@reddit
again?
pmttyji@reddit
:) Yep, a plan change happened in between. We're getting a Ryzen 9950X3D2 Dual Edition instead, so.... (The workstation/server plan didn't work out; crappy sellers in my country pushed ECC RDIMM prices sky-high.)
denoflore_ai_guy@reddit
Destitute
postitnote@reddit
Can you add UD-TQ1_0?
fallingdowndizzyvr@reddit
They said they weren't going to make those anymore.
Limp_Classroom_2645@reddit
thank you for your service sir
No_Conversation9561@reddit
Are you doing MLX UD as well?
Due-Memory-6957@reddit
So beautiful when even IQ1 is multi-file haha. I wonder if the old rule that a low quant of a bigger model still beats the lossless version of a smaller model applies nowadays. Has anyone tested that?
segmond@reddit
Thanks! Currently running Q5 for 5.0 and it's a very capable model. Can't wait to try this.
ttkciar@reddit
Thank you for sharing your hard work with the community :-)
danielhanchen@reddit (OP)
Thanks!
putrasherni@reddit
plz help ream or reap or glm 5.1 coder or glm 5.1 air pleaase
Ok_Technology_5962@reddit
Ream reap tq0.01
ShadyShroomz@reddit
Is this MoE? What speeds do you think I'd get with 4x 3090s with offloading? What about 4x 6000 Pros (the 96GB version)?
I was thinking I could convince my wife we could take out another mortgage on the house.
ormandj@reddit
The answer is "no".
ShadyShroomz@reddit
You're not my wife.
Plane_Yak2354@reddit
Holy duck! I’m strolling in with my AMD Ryzen AI Max+ 395 thinking alright let’s GO! Oh uhh wait… nevermind…
dsartori@reddit
I've seen benchmarks on the Strix Halo wiki for GLM 4.7 with two of these devices using RPC:
RPC · dual server
These results were produced with two Strix Halo systems (Framework Desktops, each 128 GB) connected over 50 Gbps Ethernet (likely bandwidth is not the limiting factor here, but latency). One runs rpc-server from llama.cpp; the other runs llama-bench --rpc.
This setup allows distributed inference, splitting large GGUF models across both machines. The metric shows what you can expect when performance is limited by network latency and the workload is balanced between two RPC participants.
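The wiki's setup boils down to something like this (a sketch; the IP, port, and model path are placeholders):

```
# Machine A: expose this box as an RPC backend for llama.cpp
rpc-server --host 0.0.0.0 --port 50052

# Machine B: run the benchmark with layers split across both machines
llama-bench -m GLM-4.7-Q4_K_M.gguf --rpc 192.168.1.10:50052
```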
Plane_Yak2354@reddit
My wife already hates I bought one of these machines. Sounds like it’s time for me to double down! :D
pchew@reddit
Out of curiosity, are you happy with anything you've got running on just the one?
ProfessionalSpend589@reddit
My experience - I'm not. I should preface that I use the Vulkan drivers; trying the Lemonade SDK is still on my todo list.
It's slow for dense models up to 30B. MoE models around 120B have to be quantised, which is not bad - usually Q6 with enough context takes a bit less than 100GB of VRAM (I have Qwen 3.5 122B, which I don't use at all) - but that leaves a bit of resource on the table that is hard to utilise when running headless. Running anything else may lead to contention for RAM between the CPU and GPU, in my opinion (and probably according to Chips and Cheese, though I had an LLM process their Strix Halo article and answer my questions - I didn't read it myself).
It's also not fast for prompt processing (PP).
For simple tasks like consuming an article and answering questions I find something on a GPU with 32GB VRAM to be superior for chat.
And with current prices - I can’t recommend it. Framework increased the prices again yesterday and I expect others will follow. :)
pchew@reddit
Yeah, well, I may have gone full stupid and got a confirmed-compatible OcuLink card and an RTX 4000 to Frankenstein onto it after watching too much Level1Techs, so… I'm sure I'll be cursing a lot.
ProfessionalSpend589@reddit
It's a nice combo. I have attached a Radeon AI Pro R9700 via OcuLink.
My idea was to run one of the 120B models on a single Strix Halo + eGPU, but I can't stop using Qwen 3.5 397B even if it's slower :)
Plane_Yak2354@reddit
Oh no… 😆
Plane_Yak2354@reddit
I haven't had enough time to figure that out yet. Being on AMD hardware is definitely holding me back a bit, but that's likely a skill issue on my side.
pchew@reddit
Yeah, that's a bit of a concern I'll have as well... seeing as I already have a 128GB one on the way. But NVIDIA abandoning/bricking/remotely crippling hardware scares me more, so I settled on the AMD route.
xrvz@reddit
Doesn't matter. Anyone serious about AI needs one as insurance.
coder543@reddit
Keep in mind that GLM-5/GLM-5.1 is substantially larger than GLM-4.7 was.
dsartori@reddit
Sure, but there is a quant that would fit on two Strix Halo boxes, while there isn't one that fits on one.
Separate-Forever-447@reddit
...checks for a cerebras REAP
oxygen_addiction@reddit
Stepfun 3.6 upcoming.
jacek2023@reddit
thanks but this is too big for my 84GB of VRAM
danielhanchen@reddit (OP)
:( The smallest quant is 206GB for now :(
mynamasteph@reddit
That's 1-bit. Are you wanting something even more compromised than that?
Irythros@reddit
0 bit. It's all 0's.
Fortyseven@reddit
Compresses REALLY well.
SawToothKernel@reddit
So Llama 4 then.
falcongsr@reddit
0000?
00000!
gnnr25@reddit
What did you call me?
-dysangel-@reddit
bonsaiiiiii
lolwutdo@reddit
how big would this model be if we got a bonsai quant?
-dysangel-@reddit
around 110GB
LegacyRemaster@reddit
Hey Daniel... TQ1 for 96+96GB VRAM? :D
jacek2023@reddit
let's wait then https://huggingface.co/zai-org/GLM-5.1/discussions/2
miniocz@reddit
Fits my 1TB SSD just fine. 1t/s here I come!
jacek2023@reddit
what's your use case for a 1 t/s model?
miniocz@reddit
Fun :) Or to think about setting up complex/novel pipelines and implementing new proposed methods. Essentially planning.
Altruistic_Heat_9531@reddit
wait, 84GB VRAM? What combination resulted in 84GB of VRAM?
jacek2023@reddit
72+12, but according to Reddit experts it's probably impossible, because they have a laptop and the cloud
Particular-Way7271@reddit
Maybe 4x 5060 Ti and a 7900 XT xD
LatentSpacer@reddit
Offload to disk.
nastypalmo@reddit
This is too big for my 6gb vram
Karnemelk@reddit
Can't wait for the first person to load it on a Raspberry Pi 8GB with SSD offloading.
deejeycris@reddit
Hopefully a proper provider picks this up. Sorry z.ai but your inference platform sucks, models are great tho.
-dysangel-@reddit
I managed to get through a whole Claude Code context today without the model falling apart - wondering if they've got more capacity now that they've finished tweaking 5.1.
GreenGreasyGreasels@reddit
Inference quality will be good for a while - I'm rolling in tokens on the legacy Pro sub at the moment. Good eating. I'm assuming two months of good service and two months of ass as a rule of thumb. Still worth the money if that holds.
deejeycris@reddit
They have too much throttling and too-low usage quotas for the price. I doubt optimizing model performance a bit will have any meaningful effect.
-dysangel-@reddit
Low usage quotas for the price? I got a whole year of Max for the same as 1.5 months of Claude, and I don't think I've ever hit usage limits!
I've been having problems over about 80k context for over a month now, but today it was working fine right up to the limit.
deejeycris@reddit
I don't have the Max plan so I can't judge, but Ollama Cloud provides way more usage for 20 bucks a month vs. z.ai. It's also slow, especially with some models, but at least it gives you a lot of usage and more or less stable token rates.
Comrade-Porcupine@reddit
DeepInfra seems to have it already.
StanPlayZ804@reddit
Sorry, this model is a bit too small for my 80 petabytes of VRAM.
darknecross@reddit
Agent swarm.
mxforest@reddit
Just cut it in pieces.
Ok-Contest-5856@reddit
These models are super important for when Anthropic and OpenAI decide to rug pull their coding plans.
GreenGreasyGreasels@reddit
Coding plans? Pulling all API access is not out of the question if they want the whole pie. They'll sell apps and access to them, not API tokens.
Corporate_Drone31@reddit
I mean, Anthropic literally said Mythos preview won't be on the public API. GPT 4.5 is likely only in use internally. API access may be limited in the coming years.
corruptbytes@reddit
really should've gone with the 512GB model instead of the 256GB
milkipedia@reddit
"754B parameters"
*** passes out ***
Limp_Classroom_2645@reddit
*** pisses pants ***
Vicar_of_Wibbly@reddit
Awesome! Although at 754B even an NVFP4 is going to be a very tight squeeze onto a 4x RTX 6000 PRO rig when taking context space into consideration. Fingers crossed it can be made to fit.
fanhed@reddit
I have 4x RTX Pro 6000s, but I can't run them at all.
t3rmina1@reddit
You can give me 1, I have 3
Vicar_of_Wibbly@reddit
What can’t you run?
getpodapp@reddit
With turboquant maybe
danielhanchen@reddit (OP)
Yes NVFP4 would be cool!
junolau@reddit
it's ok, with the past trend of VRAM growth we can expect to run this model locally on a single flagship consumer card, something like an RTX 9090 Ti Super Limited Edition, by the end of 2043. note that this is an expectation based on the trend, extrapolated by AI, so results... will vary
This_Maintenance_834@reddit
I heard rumors on Chinese social media that DeepSeek has a new architecture that allows efficiently running a 1T model on regular hardware (32GB VRAM?). When that comes out, these giant models should be able to run locally with updates.
We'll just have to wait and see if the rumors were made up.
florinandrei@reddit
haha
the__storm@reddit
What would that even mean? That's like 1/4 of a bit per weight lmao
coder543@reddit
I think they're imagining a future where deepseek's "engram" research means that deepseek-v4 is just going to be a <50B dense model with a terabyte of engrams that don't have to be stored in memory.
I do not think this is likely, but it is a nice dream.
Due-Memory-6957@reddit
According to rumors, Deepseek v4 was released in February.
frogsarenottoads@reddit
Starting to wonder what Google is up to. No release since January, and the likes of Z.ai, Qwen, and open source in general are absolutely cooking.
ShadyShroomz@reddit
Gemma4 released like last week bro lmao
frogsarenottoads@reddit
I am aware of that, and it wasn't last week, it was only a few days ago. I mean more along the lines of Gemini Pro.
coder543@reddit
Google has released a bunch of things since January, including Gemma 4.
fuutott@reddit
Wake up neo
Kaljuuntuva_Teppo@reddit
Dang.. 1.51 TB 😂
Well, at least some Mac Studio users with 512GB RAM might be able to run this at Q3/Q4.
joblesspirate@reddit
I'm trying now. The XL model was crawling. Giving unsloth/GLM-5.1-GGUF:UD-IQ2_M a shot. I'd love for this to work out!
-dysangel-@reddit
I've been using 5 at IQ2_XXS and it's been great, so no point taking up even more bandwidth. Going to try the same for 5.1
marhalt@reddit
Yes! Finally, another large model. Excited about this one. I know all the top posts will say "but what about my 6GB VRAM GPU", but we have a ton of small models. We need large models that can do impressive things.
klippers@reddit
Yay this means nanoGPT should add it back to the subscription
TopChard1274@reddit
Oh, Grandmother, what a big model you have!
coder543@reddit
Has Z.ai ever explained what GLM-5-Turbo is? Is it a smaller model, like a GLM 5 Air? Will it ever be released openly?
Cinci_Socialist@reddit
Nope, it's a mystery. All we know is that it's fast and basically made for openclaw.
Cinci_Socialist@reddit
GLM 5.1 is basically Opus 4.5. This is a huge win.
True_Requirement_891@reddit
glm-5-turbo pls
getting_serious@reddit
I still have a Xeon DDR3 mainboard here that is new old stock, and I've been telling myself that I'll never build a system with it. Dammit.
No_Conversation9561@reddit
TurboQuant is a godsend
AndreVallestero@reddit
Any M3 Ultra users? IQ4_XS looks to be viable with 100k context
Jackalzaq@reddit
Thank you for the quants!
qwen_next_gguf_when@reddit
Nvm, 754B.
false79@reddit
LFG!
-dysangel-@reddit
false79@reddit
you....lol
Adventurous-Okra-407@reddit
Even though I cannot run it myself (well outside of SSD shenanegans), it being open source does make me happy and also more likely to use zai/glm5.1 as a provider for cloud inference when I do need it.
Due-Memory-6957@reddit
Oh wow, all the doomers saying that the company that releases open-source models and said they were going to release open source models, wasn't going to, were wrong!?
nakedspirax@reddit
Yeah nahh
twack3r@reddit
Awesome!
I'm ready for it, UD-Q3_K_XL here we go.
danielhanchen@reddit (OP)
Let me know how it goes!
twack3r@reddit
Will do
themrzmaster@reddit
Thank god China!
bcdr1037@reddit
Imagine if we were stuck with the American companies... W China
dampflokfreund@reddit
Text only...?
danielhanchen@reddit (OP)
Yes sadly for now
ttkciar@reddit
A locally hostable model that nearly matches Claude at codegen, and it's text-only?? Oh noes!
Personally I think this is tremendous. Multimodality is overrated. We can do a lot with a model this capable.
danielhanchen@reddit (OP)
Haha ok this is a fair point :)
jeffwadsworth@reddit
Haha just have to let llama.cpp catch up.
Ok_Technology_5962@reddit
Isn't this the same as GLM 5? Does it even need updating?
Edzomatic@reddit
The API pricing is a bit more expensive than GLM 5's, which is a bummer considering they're the same size.
FrozenFishEnjoyer@reddit
Got excited about this release, but then I remembered I only have 16GB VRAM.
Significant_Fig_7581@reddit
No lite version ❤️🩹😢