mrinterweb@reddit
754B params? When there are models like Gemma 4 31B and Qwen 3.5 35B in similar benchmark territory, what value does a model this large bring? It's tricky to make apples-to-apples comparisons between GLM-5.1, Gemma 4, and Qwen 3.5, but my impression is that they're in the same neighborhood in output quality.
a_beautiful_rhind@reddit
Let me guess.. you've used none of them?
mrinterweb@reddit
I haven't used GLM-5.1. Just Gemma 4 26B-A4B and Qwen 3.5 35B-A3B.
Hoak-em@reddit
For agentic coding, I'd say it's at about Opus 4.5 level. It's fully capable of performing tasks on large codebases with tooling like forgecode, and I can't say the same for smaller models.
danielhanchen@reddit (OP)
We made some GGUFs for GLM 5.1 at https://huggingface.co/unsloth/GLM-5.1-GGUF
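If you haven't run GGUFs from a repo before, llama.cpp can pull a quant straight from Hugging Face. A minimal sketch (the quant tag and context size are just examples, pick whatever fits your hardware):

```
# Download and serve a quant directly from the Hugging Face repo
llama-server -hf unsloth/GLM-5.1-GGUF:UD-IQ2_M --ctx-size 16384
```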
putrasherni@reddit
wen quants for GPU poors like 32GB, 64GB, and 96GB?
Mashic@reddit
If you're poor with 32GB, what do I call myself with 12GB?
TheManicProgrammer@reddit
If you call yourself poor with 12GB, what do I call myself with 4GB?
Borkato@reddit
Starving to death like a medieval peasant during wartime and the pestilence
huffalump1@reddit
Man and I figured that dropping ~$500 on a 4070 was a decent choice for gaming + AI use for a few years...
...and it's not bad, but definitely not enough for even ~30B models!
Sometimes I wonder if a 4060 16GB would've been a better choice, but honestly, even THAT doesn't get you much further with these modern massive models.
MoudieQaha@reddit
I only have 6GB bruh ...
putrasherni@reddit
you have Qwen 3.5 9B and the new Gemma 4 models
look no further
false79@reddit
Technically, I cannot look further. Welp.
putrasherni@reddit
look back to openrouter and kilocode free models
false79@reddit
what? That's not local.
Mashic@reddit
I run gemma:4-31b/26b and qwen:3.5-35b at UD-IQ2 and UD-IQ3.
Skyline34rGt@reddit
Why such a poor quant? You should go for Q4_K_M with 12GB VRAM for these models, and offload the MoE layers to the CPU. Check my older posts, I've talked about it many times.
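Roughly like this with llama.cpp (a sketch; the model path is a placeholder, and the -ot regex is the usual pattern for pinning MoE expert tensors to system RAM while everything else stays on the 12GB card):

```
# Everything on the GPU except the MoE expert tensors, which go to CPU RAM
llama-server -m model-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  --ctx-size 8192
```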
jacek2023@reddit
potato
pmttyji@reddit
8GB in my current laptop
jacek2023@reddit
again?
pmttyji@reddit
:) Yep, a plan change happened in between. We're getting a Ryzen 9950X3D2 Dual Edition instead, so.... (The workstation/server plan didn't work out; crappy sellers in my country pushed ECC RDIMM prices sky-high.)
denoflore_ai_guy@reddit
Destitute
postitnote@reddit
Can you add UD-TQ1_0?
fallingdowndizzyvr@reddit
They said they weren't going to make those anymore.
Limp_Classroom_2645@reddit
thank you for your service sir
No_Conversation9561@reddit
Are you doing MLX UD as well?
Due-Memory-6957@reddit
So beautiful when even IQ1 is multi-file haha. I wonder if the old rule that a low quant of a bigger model still beats the lossless version of a smaller model applies nowadays. Has anyone tested that?
segmond@reddit
Thanks! Currently running Q5 for 5.0 and it's a very capable model. Can't wait to try this.
ttkciar@reddit
Thank you for sharing your hard work with the community :-)
danielhanchen@reddit (OP)
Thanks!
putrasherni@reddit
plz help ream or reap or glm 5.1 coder or glm 5.1 air pleaase
Ok_Technology_5962@reddit
Ream reap tq0.01
ShadyShroomz@reddit
Is this MoE? What speeds do you think I'd get with 4x 3090s with offloading? What about 4x 6000 Pros (the 96GB version)?
I was thinking I could convince my wife we could take out another mortgage on the house.
ormandj@reddit
The answer is "no".
ShadyShroomz@reddit
You're not my wife.
Plane_Yak2354@reddit
Holy duck! I’m strolling in with my AMD Ryzen AI Max+ 395 thinking alright let’s GO! Oh uhh wait… nevermind…
dsartori@reddit
I've seen benchmarks on the Strix Halo wiki for GLM 4.7 with two of these devices using RPC:
RPC · dual server
These results were produced with two Strix Halo systems (Framework Desktops, each 128 GB) connected over 50 Gbps Ethernet (likely bandwidth is not the limiting factor here, but latency). One runs rpc-server from llama.cpp; the other runs llama-bench --rpc.
This setup allows distributed inference, splitting large GGUF models across both machines. The metric shows what you can expect when performance is limited by network latency and the workload is balanced between two RPC participants.
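The wiki's setup boils down to something like this (a sketch; the IP, port, and model path are placeholders):

```
# Machine A: expose this box as an RPC backend for llama.cpp
rpc-server --host 0.0.0.0 --port 50052

# Machine B: run the benchmark with layers split across both machines
llama-bench -m GLM-4.7-Q4_K_M.gguf --rpc 192.168.1.10:50052
```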
Plane_Yak2354@reddit
My wife already hates I bought one of these machines. Sounds like it’s time for me to double down! :D
pchew@reddit
Out of curiosity, are you happy with anything you've got running on just the one?
ProfessionalSpend589@reddit
My experience - I'm not. I should preface that I use the Vulkan drivers; trying the Lemonade SDK is still on my todo list.
It's slow for dense models up to 30B. MoE models around 120B have to be quantised, which is not bad - usually Q6 with enough context takes a bit less than 100GB of VRAM (I have Qwen 3.5 122B, which I don't use at all) - but that leaves a bit of resource on the table that is hard to utilise when running headless. Running anything else may lead to contention for RAM between the CPU and GPU, in my opinion (and probably according to Chips and Cheese, though I had an LLM process their Strix Halo article and answer my questions - I didn't read it myself).
It's also not fast for prompt processing (PP).
For simple tasks like consuming an article and answering questions I find something on a GPU with 32GB VRAM to be superior for chat.
And with current prices - I can’t recommend it. Framework increased the prices again yesterday and I expect others will follow. :)
pchew@reddit
Yeah, well, I may have gone full stupid and got a confirmed-compatible OcuLink card and an RTX 4000 to Frankenstein onto it after watching too much Level1Techs, so… I'm sure I'll be cursing a lot.
ProfessionalSpend589@reddit
It's a nice combo. I have attached a Radeon AI Pro R9700 via OcuLink.
My idea was to run one of the 120B models on a single Strix Halo + eGPU, but I can't stop using Qwen 3.5 397B even if it's slower :)
Plane_Yak2354@reddit
Oh no… 😆
Plane_Yak2354@reddit
I haven't had enough time to figure that out yet. Being on AMD hardware is definitely holding me back a bit, but that's likely a skill issue on my side.
pchew@reddit
Yeah, that's a bit of a concern I'll have as well... seeing as I already have a 128GB one on the way. But NVIDIA abandoning/bricking/remotely crippling hardware scares me more, so I settled on the AMD route.
xrvz@reddit
Doesn't matter. Anyone serious about AI needs one as insurance.
coder543@reddit
Keep in mind that GLM-5/GLM-5.1 is substantially larger than GLM-4.7 was.
dsartori@reddit
Sure, but there is a quant that would fit on two Strix Halo boxes, while there isn't one that fits on one.
Separate-Forever-447@reddit
...checks for a cerebras REAP
oxygen_addiction@reddit
Stepfun 3.6 upcoming.
jacek2023@reddit
thanks but this is too big for my 84GB of VRAM
danielhanchen@reddit (OP)
:( The smallest quant is 206GB for now :(
mynamasteph@reddit
That's 1-bit. Are you wanting something even more compromised than that?
Irythros@reddit
0 bit. It's all 0's.
Fortyseven@reddit
Compresses REALLY well.
SawToothKernel@reddit
So Llama 4 then.
falcongsr@reddit
0000?
00000!
gnnr25@reddit
What did you call me?
-dysangel-@reddit
bonsaiiiiii
lolwutdo@reddit
how big would this model be if we got a bonsai quant?
-dysangel-@reddit
around 110GB
LegacyRemaster@reddit
Hey Daniel... TQ1 for 96+96GB VRAM? :D
jacek2023@reddit
let's wait then https://huggingface.co/zai-org/GLM-5.1/discussions/2
miniocz@reddit
Fits my 1TB SSD just fine. 1t/s here I come!
jacek2023@reddit
what's your use case for a 1 t/s model?
miniocz@reddit
Fun :) Or to think about setting up complex/novel pipelines and implementing new proposed methods. Essentially planning.
Altruistic_Heat_9531@reddit
wait, 84GB VRAM? What combination resulted in 84GB of VRAM?
jacek2023@reddit
72+12, but according to Reddit experts it's probably impossible, because they have a laptop and the cloud
Particular-Way7271@reddit
Maybe 4x 5060 Ti and a 7900 XT xD
LatentSpacer@reddit
Offload to disk.
nastypalmo@reddit
This is too big for my 6gb vram
Karnemelk@reddit
Can't wait for the first person to load it on a Raspberry Pi 8GB with SSD offloading.
deejeycris@reddit
Hopefully a proper provider picks this up. Sorry z.ai but your inference platform sucks, models are great tho.
-dysangel-@reddit
I managed to get through a whole Claude Code context today without the model falling apart - wondering if they've got more capacity now that they've finished tweaking 5.1.
GreenGreasyGreasels@reddit
Inference quality will be good for a while - I'm rolling in tokens on the legacy Pro sub at the moment. Good eating. I'm assuming two months of good service and two months of ass as a rule of thumb. Still worth the money if that holds.
deejeycris@reddit
They have too much throttling and too-low usage quotas for the price. I doubt optimizing model performance a bit will have any meaningful effect.
-dysangel-@reddit
Low usage quotas for the price? I got a whole year of Max for the same as 1.5 months of Claude, and I don't think I've ever hit usage limits!
I've been having problems over about 80k context for over a month now, but today it was working fine right up to the limit.
deejeycris@reddit
I don't have the Max plan so I can't judge, but Ollama Cloud provides way more usage for 20 bucks a month vs. z.ai. It's also slow, especially with some models, but at least it gives you a lot of usage and more or less stable token rates.
Comrade-Porcupine@reddit
DeepInfra seems to have it already.
StanPlayZ804@reddit
Sorry, this model is a bit too small for my 80 petabytes of VRAM.
darknecross@reddit
Agent swarm.
mxforest@reddit
Just cut it in pieces.
Ok-Contest-5856@reddit
These models are super important for when Anthropic and OpenAI decide to rug pull their coding plans.
GreenGreasyGreasels@reddit
Coding plans? Pulling all API access is not out of the question if they want the whole pie. They'll sell apps and access to them, not API tokens.
Corporate_Drone31@reddit
I mean, Anthropic literally said Mythos preview won't be on the public API. GPT 4.5 is likely only in use internally. API access may be limited in the coming years.
corruptbytes@reddit
really should've gone with the 512GB model instead of the 256GB
milkipedia@reddit
"754B parameters"
*** passes out ***
Limp_Classroom_2645@reddit
*** pisses pants ***
Vicar_of_Wibbly@reddit
Awesome! Although at 754B even an NVFP4 is going to be a very tight squeeze onto a 4x RTX 6000 PRO rig when taking context space into consideration. Fingers crossed it can be made to fit.
fanhed@reddit
I have 4x RTX Pro 6000s, but I can't run them at all.
t3rmina1@reddit
You can give me 1, I have 3
Vicar_of_Wibbly@reddit
What can’t you run?
getpodapp@reddit
With turboquant maybe
danielhanchen@reddit (OP)
Yes NVFP4 would be cool!
junolau@reddit
it's ok, with the past trend of VRAM growth we can expect to run this model locally on a single flagship consumer card, something like an RTX 9090 Ti Super Limited Edition, by the end of 2043. note that this is an expectation based on the trend, extrapolated by AI, so results... will vary
This_Maintenance_834@reddit
I heard rumors on Chinese social media that DeepSeek has a new architecture that allows efficiently running a 1T model on regular hardware (32GB VRAM?). When that comes out, these giant models should be able to run locally with updates.
We'll just have to wait and see if the rumors were made up.
florinandrei@reddit
haha
the__storm@reddit
What would that even mean? That's like 1/4 of a bit per weight lmao
coder543@reddit
I think they're imagining a future where deepseek's "engram" research means that deepseek-v4 is just going to be a <50B dense model with a terabyte of engrams that don't have to be stored in memory.
I do not think this is likely, but it is a nice dream.
Due-Memory-6957@reddit
According to rumors, Deepseek v4 was released in February.
frogsarenottoads@reddit
Starting to wonder what Google is up to. No release since January, and the likes of Z.ai, Qwen, and open source in general are absolutely cooking.
ShadyShroomz@reddit
Gemma4 released like last week bro lmao
frogsarenottoads@reddit
I am aware of that, and it wasn't last week, it was only a few days ago. I mean more along the lines of Gemini Pro.
coder543@reddit
Google has released a bunch of things since January, including Gemma 4.
fuutott@reddit
Wake up neo
Kaljuuntuva_Teppo@reddit
Dang.. 1.51 TB 😂
Well, at least some Mac Studio users with 512GB RAM might be able to run this at Q3/Q4.
joblesspirate@reddit
I'm trying now. The XL model was crawling. Giving unsloth/GLM-5.1-GGUF:UD-IQ2_M a shot. I'd love for this to work out!
-dysangel-@reddit
I've been using 5 at IQ2_XXS and it's been great, so no point taking up even more bandwidth. Going to try the same for 5.1
marhalt@reddit
Yes! Finally, another large model. Excited about this one. I know all the top posts will say "but what about my 6GB VRAM GPU", but we have a ton of small models. We need large models that can do impressive things.
klippers@reddit
Yay this means nanoGPT should add it back to the subscription
TopChard1274@reddit
Oh, Grandmother, what a big model you have!
coder543@reddit
Has Z.ai ever explained what GLM-5-Turbo is? Is it a smaller model, like a GLM 5 Air? Will it ever be released openly?
Cinci_Socialist@reddit
Nope, it's a mystery. All we know is that it's fast and basically made for openclaw.
Cinci_Socialist@reddit
GLM 5.1 is basically Opus 4.5. This is a huge win.
True_Requirement_891@reddit
glm-5-turbo pls
getting_serious@reddit
I still have a Xeon DDR3 mainboard here that is new old stock, and I've been telling myself that I'll never build a system with it. Dammit.
No_Conversation9561@reddit
TurboQuant is a godsend
AndreVallestero@reddit
Any M3 Ultra users? IQ4_XS looks to be viable with 100k context
Jackalzaq@reddit
Thank you for the quants!
qwen_next_gguf_when@reddit
Nvm, 754B.
false79@reddit
LFG!
-dysangel-@reddit
false79@reddit
you....lol
Adventurous-Okra-407@reddit
Even though I cannot run it myself (well outside of SSD shenanegans), it being open source does make me happy and also more likely to use zai/glm5.1 as a provider for cloud inference when I do need it.
Due-Memory-6957@reddit
Oh wow, all the doomers saying that the company that releases open-source models and said they were going to release open source models, wasn't going to, were wrong!?
nakedspirax@reddit
Yeah nahh
twack3r@reddit
Awesome!
I'm ready for it, UD-Q3_K_XL here we go.
danielhanchen@reddit (OP)
Let me know how it goes!
twack3r@reddit
Will do
themrzmaster@reddit
Thank god China!
bcdr1037@reddit
Imagine if we were stuck with the American companies... W China
dampflokfreund@reddit
Text only...?
danielhanchen@reddit (OP)
Yes sadly for now
ttkciar@reddit
A locally hostable model that nearly matches Claude at codegen, and it's text-only?? Oh noes!
Personally I think this is tremendous. Multimodality is overrated. We can do a lot with a model this capable.
danielhanchen@reddit (OP)
Haha ok this is a fair point :)
jeffwadsworth@reddit
Haha just have to let llama.cpp catch up.
Ok_Technology_5962@reddit
Isn't this the same as GLM 5? Does it even need updating?
Edzomatic@reddit
The API pricing is a bit more expensive than GLM 5's, which is a bummer considering they're the same size.
FrozenFishEnjoyer@reddit
Got excited about this release, but then I remembered I only have 16GB VRAM.
Significant_Fig_7581@reddit
No lite version ❤️🩹😢