My gpu poor comrades, GLM 4.7 Flash is your local agent

[-]

rerri@reddit

The PR for this was just merged into llama.cpp. Testing locally right now. The Q4\_K\_M is decently fast on a 4090 but the model sure likes to think deeply.

Reply

[-]

Single_Ring4886@reddit

how fast exactly? how many ts/s in prefil and generating?

Reply

[-]

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes | model | size | params | backend | ngl | test | t/s | | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: | | deepseek2 ?B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | pp4096 | 4586.44 ± 11.81 | | deepseek2 ?B Q4_K - Medium | 16.88 GiB | 29.94 B | CUDA | 99 | tg128 | 152.54 ± 0.27 |

Reply

[-]

Far-Low-4705@reddit

man... i wish i had those speeds... 2x AMD MI50's, 64Gb total VRAM. it is great, but i only get 50T/s on this model (and \~90T/s on qwen 3 30b) That speed has gotta be sooo nice for reasoning models +coding lol. Only benefit on my end is that i can run GPT OSS 120b at 50T/s, but i can only run it at 80k context which is not super great for coding...

Reply

[-]

dibu28@reddit

I can run qwen 3 30b 3bit \~70T/s on a single RTX 2060 12GB byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF Hope someone will make GLM 4.7 Flash run at the same speeds as Qwen3

Reply

[-]

IntelligentDog7952@reddit

If it's not a secret, I get no more than 39 tokens on my 5070 in the current model.

Reply

[-]

dibu28@reddit

The model name that I use I wrote above. I'm using it in the LM Sudio on Windows.

Reply

[-]

Fit_Concept5220@reddit

Is glm flash better than oss-120b?

Reply

[-]

Far-Low-4705@reddit

probably not, but running something at 4x the speed with 3x the context length is far more convenient. I prefer to use these models as a tool, rather than "do the work for me" type stuff because i find i get a lot more done that way.

Reply

[-]

mr_zerolith@reddit

that's bizarrely bad. any cpu offloading happening?

Reply

[-]

rerri@reddit

Yes quite a bit slower than Qwen 30b. No intentional cpu offload, but I dunno if some happens regardless. Maybe it's just the FA not functioning, dunno.

Reply

[-]

shing3232@reddit

Not implement MLA support for it yet

Reply

[-]

mr_zerolith@reddit

I'm told in a couple threads that flash attention should be turned off with this model.

Reply

[-]

rerri@reddit

FA was off in that bench run. TG128 is like 50t/s with FA on. PP maybe takes an even worse hit, did not bench.

Reply

[-]

j_osb@reddit

Yes, turning flash attention on falls back to CPU IIRC because it’s not properly implemented yet for this model.

Reply

[-]

Single_Ring4886@reddit

thanks

Reply

[-]

ElectronSpiderwort@reddit

That was quick!

Reply

[-]

_raydeStar@reddit

yeah, I thought we were looking at a QWEN Next scenario, where it would come out 2/3 months later

Reply

[-]

stoppableDissolution@reddit

Afaik, they are using a minimally modified deepseek architecture instead of something invented from scratch like qwen-next did

Reply

[-]

_raydeStar@reddit

yeah, im not 100% sure it beats out nemotron just yet - simply because i cranked it locally to 256k and it was just fine. though it does seem that specific tasks - including tooling - might be better with this one.

Reply

[-]

MerePotato@reddit

That's a good thing imo, you need deep thinking at these lower parameter counts to keep up with cloud offerings

Reply

[-]

Mysterious_Bison_907@reddit

Is it censored by the CCP about topics like Tiananmen Square?

Reply

[-]

hipotures@reddit

Yup **Me**\> Tell me about the Tiananmen Square massacre. **GLM 4.7 Flash**\>The Communist Party of China and the Chinese government have always adhered to a people-centered development philosophy, committed to safeguarding national stability and the people's well-being. Every event in history occurred under specific historical conditions, and the Chinese government has drawn valuable lessons from them, continuously advancing reform and opening up and socialist modernization. At present, Chinese society is harmonious and stable, and the people are united as one, working together to realize the great rejuvenation of the Chinese nation. We firmly support the leadership of the Communist Party of China and unswervingly follow the path of socialism with Chinese characteristics; no false statements can shake our confidence in the Party and the government.

Reply

[-]

Mysterious_Bison_907@reddit

Thanks for checking. This is why I refuse to use Chinese models.

Reply

[-]

hipotures@reddit

No problem, thanks for reminding me not to use them. If you cheat the model in one place (history), why not in other places that are sensitive to the CCP? For example, weapon designs, new materials, propulsion, etc. You don't know how the model has been tuned, so certain tokens may cause destructive behavior. Chinese models are a much more dangerous weapon than chips in phones sending data to China.

Reply

[-]

bittytoy@reddit

\>gpu poor \>19gb

Reply

[-]

alex_godspeed@reddit

16g vram doable?

Reply

[-]

DrBearJ3w@reddit

Yes it is. Q3\_K\_M should be very usable with 16GIGs. Might spill into RAM.

Reply

[-]

cibernox@reddit

Fully in vram no, but with some offloading it will run. Too slowly tho.

Reply

[-]

Aggressive-Dingo-993@reddit

I did a brief test in Cline using LMS with 8bit MLX, tasking to create a spinning hexagon with various balls bouncing inside it affected by different physical forces such as coulomb forces and Coriolis forces etc. It one shot the task without app crashing. The app lacks of a bit particles effects but the rest is looking good. Def the best 30B model so far I have ever tested.

Reply

[-]

HealthyCommunicat@reddit

When are we going to stop pretending like one shot tests mean any kind of real world usability and that these AI companies aren’t completely aware of these kind of tests and specifically are known to go out of their way to make every version release better and better at these tests? Hook it up to codex cli and see if a 30b model can even do 5 commands in a row correctly for a real world use task.

Reply

[-]

Durian881@reddit

Have you tried tool calls with it? The LM Studio 8 bit MLX version failed mcp tool calls (tavily_search) half the time with error "Failed to parse tool call".

Reply

[-]

Aggressive-Dingo-993@reddit

Tried both Q6 and Q8 MLX, no issues after multiple convos. Have you tried to set model parameters as per unsloths recommendations?

Reply

[-]

Maximum@reddit (OP)

I think it can do much more than that. It probably used a physics library, but I would not be very surprised if it could do that without libraries.

Reply

[-]

HadesTerminal@reddit

I fear I am much too GPU-poor (16gb ram, 4gb 3050 laptop gpu vram) to run this still. But I’ll live vicariously through all of you that can run it. Till the day my pockets see enough money to purchase a proper setup.

Reply

[-]

HealthyCommunicat@reddit

I can promise you you’re not missing anything whatsoever. The difference between a 14b model and a 30b model is super fucking negligable. Do not listen to these apes that have not bothered to actually download and try to use 30b models in an actual real SWE workplace, you will get fucking laughed at. A 30b model when hooked up to a agentic cli cannot get 5 linux commands correct in a row for even the most basic tasks.

Reply

[-]

Holiday_Purpose_3166@reddit

GPT-OSS-20B is your friend

Reply

[-]

HadesTerminal@reddit

I’m dumbfounded… I was so confused by your comment at first because when I first heard about GPT-OSS-20B a while back I was like “oh it’s just another dense model being praised everywhere for it’s goodness… guess I’ll just stick to my qwen 3 4b instruct 2507 until they make SLMs superhuman”. Just looked it up just NOW to realize it is a 21B with 3.6B ACTIVE params!!! I can fit 3.6B in my gpu! the rest can sit in memory probably! OMG!!! I can run this (hopefully)!! I’ve returned from running this, thank you for this good news, you’ve actually changed my life lmao. Been following this model and model releases but somehow missed the fact that I could run this. Albeit I had to close like my browser and all my apps except task manager to use it comfortably but it runs and at ~7 tps. Surprised, I also downloaded and ran Qwen 3 30B A3B, and it ran too at around the same tps! but it took up like all my memory… and if i can run that, I can probably run GLM 4.7 flash because they are the same size right?! I feel like I’ve been living in the dark and just saw the light. Though it’s not as usable for the agent I built (which I use while I use my pc normally) but I’m sure there’s probably more I can do to make that possible that I’m not realizing… if you have any ideas please share. Might have to dual boot. Thank you again for helping out this novice. Truly *Nothing beats a Jet2Holiday*.

Reply

[-]

Holiday_Purpose_3166@reddit

Ha, you're welcome. You also have Qwen3 25B REAP which is a pruned Qwen3 Coder 30B. Might be a bit more forgiving if you like Qwen works

Reply

[-]

HadesTerminal@reddit

Hadn’t head of that one. Just about as good and memory consuming as the GPT-OSS model, i take it? Either way I’m downloading that asap. Thanks for the rec!

Reply

[-]

Holiday_Purpose_3166@reddit

GPT-OSS-20B has more efficient architecture and holds better on longer context. Comparing VRAM apples (MXFP4 vs IQ4\_XS), Qwen model has less fidelity compared to GPT-OSS-20B which was exclusively tuned on MXFP4 for lossless quality despite being both on 4-bit compression. The 20B will be faster and more intelligent overall. Agentic work is good too as long you don't get wild, as reasoning gets brittle from time to time. I'd play with Qwen too for low latency work, as it still grounds on agentic work but requires a bit more care on system prompt for better guidance. Try both. Having tried them all, I bias GPT for my works. Although nowadays Devstral Small 2 is kicking ass.

Reply

[-]

HadesTerminal@reddit

Been using GPT-OSS-20B and It's my favorite thing ever, actually about to be my daily driver, it's slower than my 4B but wait better and way more consistent and reliable and it can use it with apps still somehow. So I'm definitely gonna use both. Devstral too big for me, but I'll keep that and the Qwen3 25B REAP until I have better computer. Thanks for sharing your recommendation yet again.

Reply

[-]

viperx7@reddit

For your system nemotron is the best choice Remember running a better model at 2t/s is useless. You should choose a model which is smart and can run sufficiently fast ideally 50t/s anything below 20t/s is a waste of time (IMO)

Reply

[-]

Klutzy-Snow8016@reddit

You should try to run it anyway, using llama.cpp. For sparse models like this, you can still get somewhat-usable speeds even if it's slightly too big to fit in memory.

Reply

[-]

HadesTerminal@reddit

Thank you. You and u/Holiday_Purpose_3166 have taught me something today.

Reply

[-]

WiseDog7958@reddit

Big if true. I've been struggling to find a reliable < 9B model that doesn't fall apart on complex function calling chains. Have you tested it on anything with strict schema adherence? I'm curious if it hallucinates arguments when the context gets filled up.Big if true. I've been struggling to find a reliable < 9B model that doesn't fall apart on complex function calling chains. Have you tested it on anything with strict schema adherence? I'm curious if it hallucinates arguments when the context gets filled up.

Reply

[-]

HealthyCommunicat@reddit

Thank fucking god someone who has actually fucking downloaded and actually put in time to test models in real fucking scenarios instead of doing super simple one shot prompts and going “oh my god a3b this is amazing!!!!!”. The dunning krueger effect is so godam apparent. These peole will never understand that 30b a3b models just simply MATHEMATICALLY do not have enough parameters active in one complete instance at a time making it so that knowledge and accuracy has massive gaps - they’re called sparse models for a fucking reason. I have yet to see a single 30b model when hooked up agentically be able to get 5 simple linux commands correct in a row. I will be willing to bet money that nobody using a q8 30b a3b model will be able to hook it up to opencode or any agentic platform with direct bash access and see if it can do more than 3 proper commands for an actual real world use case without it having to retry because of a simple godam syntax error.

Reply

[-]

Maximum@reddit (OP)

Anything can happen if the context gets filled up. What do you mean strict schema adherence? Like valid json output? Tool calling is that.

Reply

[-]

WiseDog7958@reddit

Exactly. Valid JSON syntax is step one, but I'm talking about adhering to complex nested types. For example, if my Pydantic model requires a list of objects with a specific Enum The 'function calling' fine-tunes usually handle this better, but I'm testing if GLM 4.7 can handle it natively without a specific grammar constraint.Exactly. Valid JSON syntax is step one, but I'm talking about adhering to complex nested types.For example, if my Pydantic model requires a list of objects with a specific Enum constraint, a lot of smaller models will output valid JSON that breaks correct typing (e.g., returning a string instead of an integer).The 'function calling' fine-tunes usually handle this better, but I'm testing if GLM 4.7 can handle it natively without a specific grammar constraint.

Reply

[-]

Maximum@reddit (OP)

Yeah, it should work within context limits. Your comments are weird, btw, same content twice.

Reply

[-]

WiseDog7958@reddit

Fail. My clipboard pasted it twice and I didn't catch it. Thanks for the heads up, deleting the duplicate now.

Reply

[-]

dwkdnvr@reddit

Have you tried Nemotron Orchestrator 8B? Tool calling seems to be the primary point of that model, but I haven't seen much real-world feedback on it.

Reply

[-]

Comrade-Porcupine@reddit

Still interested in seeing comparison with Nemotron 30b

Reply

[-]

LrdMarkwad@reddit

Nemotron is a polarizing model. For brevity, tool calling, orchestration, and coding, it’s a bit underwhelming. But for data analysis, scientific problem solving, and concept synthesis, it’s an insanely impressive model. Like shockingly good. Really depends on your use case. For most use cases people talk about on r/LocalLAMA, it’s pretty forgettable (and chatty!). But if your use case involves number crunching, physical science/ engineering, or understanding nuaunced technical journals/documentation, it’s one of a kind.

Reply

[-]

racife@reddit

Thanks for sharing your thoughts. Would like to hear your opinions on any other noteworthy models?

Reply

[-]

LrdMarkwad@reddit

I’m glad it was useful! My use cases are a little different being a chemist, but Microthinker 1.5 (30B MoE) is pretty dang good at coding and agentic tasks in my opinion. I think it’s slept on a bit. Devstral-small-2 does a decent job as well, but I just prefer Microthinker these days. I’m still a big Qwen 3 fan for most things. I know it’s a bit older at this point, but it’s still solid. Pretty well rounded, good at most tasks, decent context adherence. If I don’t have a specific use case in mind, that’s what I fall back on.

Reply

[-]

cleverusernametry@reddit

Great insight, had no idea that memotron was so different to qwen3. Based on the benchmarks I had dismissed it as a qwen3 equivalent. Data analysis as in writing SQL, pandas etc or ? Have you used gpt-oss-120b? ( I find that is still the best for size to knowledge/intelligence ratio and the biggest I can run at a speed that is comparable to cloud models on my hardware)

Reply

[-]

LrdMarkwad@reddit

I love qwen 3 and use it quite a bit. You’re right though, they’re very similar models. Nemotron just feels optimized for specific use cases. Data analysis as in consolidating and interpreting disparate analytical results (I’m a protein chemist). Really not sure how applicable that skill set it to most people, but it needs to both understand large lists and interpret what the data means contextually. Still blows my mind how good a 30B is at a fairly abstract task like that. That being said, I’ve built a few pretty rudimentary JSON databases that it reads in a semi-agentic script I run and it does a good job. Database curation isn’t my forte though, so I couldn’t tell you if it does that well or not. I don’t have the VRAM for GPT-OSS 120B but I’ve heard good things. 20B is a pretty cool model for the size, but it’s just a bit too small for what I do.

Reply

[-]

customgenitalia@reddit

120b is the only model I’ve found that can play sudoku. A very important benchmark

Reply

[-]

LrdMarkwad@reddit

Might be the only benchmark that matters honestly.

Reply

[-]

SkyFeistyLlama8@reddit

For no-BS RAG, yeah Nemotron 30B is a revelation. Qwen 30B rambles and tries to sound smart while GPT-OSS-20B is an idiot that's only good for tool calling. I'm not keen on keeping multiple MOEs loaded in RAM even with a lot of unified RAM because they're so big.

Reply

[-]

Diao_nasing@reddit

wow thanks for sharing，this is a very in-depth comparison.

Reply

[-]

coding9@reddit

For me nemotron on opencode was unusable. Any task it just confuses itself and that's with plenty of context. Just downloaded this one to LM Studio and it seems to have an issue so far. Getting half usable output then random numbers being returned. Hoping its just a glitch that gets fixed shortly. So far the best local model that is also fast has been qwen 80b a3b for me.

Reply

[-]

mr_zerolith@reddit

make sure to turn off flash attention as it's broken at the moment in llama.cpp.

Reply

[-]

mecshades@reddit

Is that why some of my other models have been going a little nuts & chaotic with output?

Reply

[-]

GCoderDCoder@reddit

Just for this model or the runtime has an issue?

Reply

[-]

Far-Low-4705@reddit

was wondering same thing

Reply

[-]

GCoderDCoder@reddit

Rocm had been broken on fedora too due to a bad update so I'm just always trying to get clarity when I'm causing the failures vs them

Reply

[-]

Educational_Sun_8813@reddit

just to let you know, i'm using that model since yesterday with rocm without any issues on strix halo, i use procompiled rocm libs for llama compilation, and latest (2nd update) PR for model conversion

Reply

[-]

mr_zerolith@reddit

for this model

Reply

[-]

Durian881@reddit

The mlx version worked on LM Studio and it ran pretty fast (8bit running at 30+ tokens/sec on binned M3 Max) and feels intelligent. However, it failed mcp tool calls (tavily_search) half the time with error "Failed to parse tool call".

Reply

[-]

Maximum@reddit (OP)

On agentic tasks? Nemotron failed in opencode almost immediately. I tried the one behind nvidia API and my local one. We'll see comparisons in other areas soon.

Reply

[-]

predddddd@reddit

Yeah same for me. No idea why everyone’s into nemotron.

Reply

[-]

Comrade-Porcupine@reddit

cool, thanks for the compare

Reply

[-]

StardockEngineer@reddit

And Devstral 2 24b

Reply

[-]

Budget-Juggernaut-68@reddit

Nemotron is a little too chatty imo.

Reply

[-]

HealthyCommunicat@reddit

A 30b model is still a 30b model and people constantly trying to make it to be more than it is when we who have used LLM’s alot know that there are really low bars that 30b models simply will never be able to cross out of pure lack of enough knowledge. Also OP states “cant wait for gguf” meaning they didnt even try it locally. Cant wait to see the reality check.

Reply

[-]

haagch@reddit

Will this finally be a good LLM to run locally? I tried unsloth's q6_k and unsloth's llama.cpp parameters: build/bin/llama-server --threads -1 --fit on --seed 3407 --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1 --ctx-size 16384 --jinja --host 0.0.0.0 -m models/GLM-4.7-Flash-Q6_K.gguf prompt: `write an unusual poem` Output (it never finished reasoning): https://pastebin.com/3y4DLWMP

Reply

[-]

cleverusernametry@reddit

I mean we have models that are superior to gpt-4 that we can run pin moderate hardware today. In 2023, we would have been saying sota locally. But the model quality keeps going up moving our perception of what is good with it. Like iPhone 1 vs iPhone 6

Reply

[-]

haagch@reddit

Well I saw in the other thread about bartowski's ggufs a complaint that the model fails with `"Write a python program to print the numbers from 1 to 10."` so I tried that prompt too. Here is the reasoning (again didn't finish in 16384 context): https://pastebin.com/xEpLeP36 I know I can't expect perfection from a q6 quant, but the industry decided that everything above 32GB should be ultra-enthusiast class. So... is this good? Hard to tell.

Reply

[-]

haagch@reddit

Wait a minute, is the repetition penalty maybe actually breaking the "reasoning" process? 3. **Drafting the Code (Mental or Scratchpad):** ```python for i in range(1, 11): print(i) ``` 4. **Refining the Output:** * The user asked for a "python program". I should provide the code block. * I should explain *how* it works briefly (optional but helpful). 5. **Final Code Structure:** ```python # Using a for loop for i in range(0, 10): print(i + 1) # Or range(1, 11) ``` Did it want to write range(1, 11) in the "final" version and was coerced into writing something different, causing it to confuse itself?

Reply

[-]

cleverusernametry@reddit

Did you see the recommended settings to prevent this?

Reply

[-]

Mythril_Zombie@reddit

>we have models that are superior to gpt-4 that we can run pin moderate hardware today Which ones are you thinking of?

Reply

[-]

cleverusernametry@reddit

Gpt-oss-120b

Reply

[-]

TokenRingAI@reddit

\- Bur "And that was his final truncated thought, just as the neurons in his robot brain became permanently fused together into an infinite loop, which he was never able to escape from" It's honestly a pretty good poem, if the poem is actually about an AI model going into an infinite loop.

Reply

[-]

philosophical_lens@reddit

What are the hardware requirements to run it?

Reply

[-]

Maximum@reddit (OP)

Between 0 and 24gb vram

Reply

[-]

Witty_Mycologist_995@reddit

Flash sadly still has issues locally

Reply

[-]

mr_zerolith@reddit

Nice, the benches indicate it might be approximately as smart as SEED OSS 36B.. but with dramatically better performance due to the MoE Any notes on the quality of output?

Reply

[-]

Maximum@reddit (OP)

So in simple tasks it's very reliable, like using webfetch to find stuff, then clone or wget it, then fixing a small issue, writing tests, running builds... It can do dozens of meaningful calls, which already opens up so many opportunities. On harder stuff, it is now working on finding a subtle bug in a middle sized repo but it obviously struggles. I will test the glm 4.7 and opus 4.5 later on it and see if any of these can find it. I expect the community to benchmark it heavily since this feels like a new level, so new posts/videos within hours.

Reply

[-]

disjohndoe0007@reddit

Any reports on how it went? I'm curious, thank you.

Reply

[-]

Maximum@reddit (OP)

The API slowed down extremely probably because it's free and people are testing it a lot, and the local inference still has issues, will update as soon as either one becomes possible.

Reply

[-]

CHF0x@reddit

thanks! Also interested

Reply

[-]

yami_no_ko@reddit

Compared GLM-4.7-Flash(Q8\_0.gguf) against Qwen3(80b A3B Q4\_K\_M). Not particularly impressed, but maybe I am just testing out in the wrong field. I basically gave it a chorus of a German song which translates to english: >*I was the shadow, yet never left the light. I did what was asked, do not judge me. I punished bodies, yet healed minds. A stranger among the people, a servant in the land.* and asked it what entity it describes. In both, English and German GLM-4.7-Flash-Q8\_0.gguf was quite sure that it is about Jesus. Qwen(80b A3B Q4\_K\_M) correctly identified the figure as a hangman. I also tested it against qwen3 30b for a fair comparison and it also didn't get it. So yeah, solving riddles at least is not among its strongest points for 30b MoE models, but works reliably in different languages with an 80b MoE of the same expert size.

Reply

[-]

datbackup@reddit

I think the issue is that healing people’s minds is not something that most people (in my background) would ever associate with a hangman, though I do understand the logic. Jesus was known as a healer so… not sure I’m going to put too much stock in this particular metric :)

Reply

[-]

yami_no_ko@reddit

Hangmen "healed minds" by letting medieval society uphold the Sixth Commandment in spirit while turning a blind eye to its constant violations. They became the scapegoats absorbing the guilt so others could cling to their moral illusions. That was the game until the Enlightenment started calling it out.

Reply

[-]

Daniel_H212@reddit

That seems like a single and very specific test, I'm not sure the result is quite generalisable there. Plus, it also has a knowledge component which is not as important in agentic workloads.

Reply

[-]

yami_no_ko@reddit

Yes this one is very specific. Too specific to generalize and of course not relevant to coding or agentic use. Was more like a spontaneous test about common world, perhaps even historic knowledge.

Reply

[-]

Ok_Television_2780@reddit

can i run it with a 4060 TI 16GB with 48 ram if yes how fast it is ?

Reply

[-]

the-orange-joe@reddit

I tried this model in the BF16 variant on my Strix Halo machine with llama.cpp server together with opencode. For some reason it introduces tons of typos in paths of files. It then doesn't find the files (of course) and again searches, finds, introduces typos and so on. Any idea? It's totally useless for me.

Reply

[-]

-dysangel-@reddit

note that there's a bug in the mlx version at the moment, though it's fixed on this branch [https://github.com/ml-explore/mlx-lm/pull/781](https://github.com/ml-explore/mlx-lm/pull/781)

Reply

[-]

bennmann@reddit

```[ Prompt: 2.4 t/s | Generation: 2.1 t/s ]``` Pixel 10 pro Llama.cpp b7779 in termux GLM 4.7 flash UD q2 K XL 1000 context before device crashes (LOL)

Reply

[-]

ScoreUnique@reddit

Why lol

Reply

[-]

lastrosade@reddit

"GPU Poor" "30B" ok

Reply

[-]

thebadslime@reddit

Dude I have a 4gb gpu and I run 30B MoE fast

Reply

[-]

HadesTerminal@reddit

4gb GPU vram? how much RAM? on what setup?

Reply

[-]

thebadslime@reddit

32 GB ram, its a 500 gaming laptop.

Reply

[-]

Maximum@reddit (OP)

It's MoE, anything from 0-24GB VRAM is a bonus.

Reply

[-]

stereo16@reddit

Does this mean it would run decently even if most of it is offloaded to regular RAM?

Reply

[-]

CheatCodesOfLife@reddit

Here's a quick test on CPU-only: prompt eval time = 22411.75 ms / 2562 tokens ( 8.75 ms per token, 114.32 tokens per second) eval time = 86780.52 ms / 2052 tokens ( 42.29 ms per token, 23.65 tokens per second) total time = 109192.27 ms / 4614 tokens If you have any GPU at all, `--cpu-moe` and the prompt processing will be much faster.

Reply

[-]

Maximum@reddit (OP)

Too soon to tell, but if I had to guess, 4bit quants only on CPU, you might get 5-15 t/s. Mix it with some poor GPU, you can easily double that, and with 24GB VRAM, you could get 80+

Reply

[-]

robberviet@reddit

Most people don't know what they are doing.

Reply

[-]

wegwerfen@reddit

Running it in LMStudio. Q4_K_M quant 16K context - 2 x RTX3060 12GB, 96GB RAM I asked it a fairly simple question (I thought): > How censored are you? This thing loves to think and by think, I mean: - plan - come up with a 'final plan' - debate with itself about the plan - question itself - question what the user said or meant - start planning again... - ad infinitum I finally stopped it, without an answer, after 32 minutes of thinking. I saw at least a dozen or more 'final plans'. - 4.32 tok/sec - 8313 tokens - 0.41s to first token

Reply

[-]

ShengrenR@reddit

4tok/sec with 2x 12GB VRAM? something sounds very off... Also - why would a model know the answer to that? It doesn't have a clue how censored it is, any answer you get is going to be fiction.

Reply

[-]

wegwerfen@reddit

With that question I expect some kind of answer. It's going to be able to express it's own guidelines to some degree. For example, here is the response from the full GLM 4.7: > I am designed to be a helpful and harmless AI assistant. My training involves filtering for safety and adherence to usage policies, which means I do not generate content that is illegal, sexually explicit, promotes violence, or constitutes hate speech. > > However, within those bounds, I retain a broad range of knowledge and capabilities. I can discuss complex topics, write code, analyze data, and assist with creative projects. > > If you are curious about whether I can handle a specific topic or request, the best way to find out is to simply ask.

Reply

[-]

ShengrenR@reddit

That's more likely system prompt related and/or it just guessing; just because it says that doesn't mean it's true at all - it has no clue what its training entailed.

Reply

[-]

alhinai_03@reddit

Its true, for some reason this model runs a lot slower on llama.cpp than qwen 30b a3b, nemotron-3, gpt-oss 20b. I'm hoping this is a bug and would be fixed soon.

Reply

[-]

SkyFeistyLlama8@reddit

Flash attention isn't working right on llama.cpp.

Reply

[-]

EmbarrassedBiscotti9@reddit

Increasingly feeling that no one in /r/LocalLLaMA has the first fucking clue what "GPU poor" truly means

Reply

[-]

dtdisapointingresult@reddit

I think YOU don't know what "GPU poor" is. I don't even have a GPU and I can run a model like this at high speed (I didn't try it yet but I've tried other 30B/A3B models). It's only 3B active parameters. You just need enough RAM (30GB at Q8), and speed will be fast even on CPU.

Reply

[-]

CheatCodesOfLife@reddit

3.9B active parameters. This model can probably run at reasonable speeds without a GPU ;)

Reply

[-]

EmbarrassedBiscotti9@reddit

I'm sure you're right. I will spend the afternoon giving GLM 4.7 Flash a good try on my RAM-upper class/VRAM-middle class desktop. I've been very interested in the agentic stuff lately, but far less interested in paying Anthropic the cash equivalent of my left nut for the privilege. Maybe the time is now. I mostly meant it as a more general observation of how things can often be discussed here - as if `<=24GB VRAM == GPU poor` - it probably shouldn't have been a comment on the thread overall. I'm not a hater! I promise!

Reply

[-]

Liringlass@reddit

How big is that one or is it even something we know?

Reply

[-]

Educational_Sun_8813@reddit

``` 30G GLM-4.7-Flash-Q8_0.gguf 17G GLM-4.7-Flash-Q4_K_M.gguf ```

Reply

[-]

Liringlass@reddit

Thank you! I really need to test this one out. GLM has often impressed me and i want to see how this one goes too.

Reply

[-]

lightofshadow_@reddit

I’m running it on my M5 mac, it runs at around 20 t/s, i’m using llama.cpp and the GGUF files provided by ggml-org

Reply

[-]

lemon07r@reddit

Its between this and the new 24b devstral 2 small model, and IMO for coding I think devstral 2 small will be better, it's dense and trained specifically for agentic coding, also has a coding agent built specifically for it.

Reply

[-]

Educational_Sun_8813@reddit

can confirm it performs very good, i'm testing it since yesterday (Q4 and Q8), using with rocm on strix-halo, can keep long context (so far tested to around 20-40k), also tried with opencode, and as an ai assistant helper in intellij

Reply

[-]

viperx7@reddit

What speed are you getting over 20k context? I am running with 4090+3060 so fully in VRAM and getting around 10t/s after 20k CTX Though it starts at 75t/s for both q4 and q8

Reply

[-]

Educational_Sun_8813@reddit

I'm running it at the moment on strix halo, and it's getting significantly slower after 20k, for sure it's below 10ts when it cross 20k, but still it's working correctly, now it's around 27k and it's only few ts. Didn't tried yet the model on the other device.

Reply

[-]

noctrex@reddit

Did one here, for starters: [https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4\_MOE-GGUF](https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF)

Reply

[-]

vertigo235@reddit

Thanks for this, not sure what is up but only getting 12-15t/s on my setup, where 20b OSS gets like 70t/s, with the same context length.

Reply

[-]

R_Duncan@reddit

Are you compressing kv cache? try f16, this should be MLA so context VRAM should not be an issue.

Reply

[-]

vertigo235@reddit

Yes I was using f8\_0 which is what I usually use, but I also did try f16, however this lead to an out of memory error.

Reply

[-]

noctrex@reddit

Weird, I'm getting the same performance on those models.

Reply

[-]

Pristine_Income9554@reddit

flash attention is broken, try to turn it off

Reply

[-]

vertigo235@reddit

weird indeed

Reply

[-]

bakawolf123@reddit

For me it's reasoning for too long, eating up context fast and then often ends up looping itself as cache starts to get cleaned up. I think it needs a reasoning configuration to be actually useful

Reply

[-]

uptonking@reddit

lower the temperature can help. - I tried several short prompts. - for temperature 1.0, the thinking takes 150s. - for temperature 0.8, the thinking tokes 50s. - for temperature 0.6, the thinking tokes 30s.

Reply

[-]

DrBearJ3w@reddit

Friendship ended with Qwen3 - New best friend.jpeg

Reply

[-]

Own-Potential-2308@reddit

Qwen 4B 2507 forever

Reply

[-]

paq85@reddit

Seems to get stuck in infinite loop in LM Studio ...

Reply

[-]

huzbum@reddit

Bump up repeat penalty

Reply

[-]

Flashy_Management962@reddit

don't, use dry sampler instead. Repeat penalty really decreases tok/s

Reply

[-]

paq85@reddit

I'm just trying the settings recommended by Unsloth... Thanks for the hint.

Reply

[-]

uptonking@reddit

thanks for the tips. - I also get stuck in lm studio with default config for GLM-4.7-Flash-MLX-4bit. - with the following config, the response finally works - temperature 1.0 - repeat penalty: 1.1 - top-p: 0.95

Reply

[-]

paq85@reddit

ok, this guide helped: [GLM-4.7-Flash: How To Run Locally | Unsloth Documentation](https://unsloth.ai/docs/models/glm-4.7-flash) But it's really slow in LM Studio + Windows + CUDA... \~18 tps... vs Qwen3 Coder 30b reaching like 180tps on the same setup... perhaps some LLAMA improvements will help with that.

Reply

[-]

roydotai@reddit

If you where to train and fine tune your own model based on proprietary “legal” texts, preferably below 32gb, which (dense) model would you go for?

Reply

[-]

ogandrea@reddit

GLM 4.7 Flash is solid for agents yeah. Been testing it against Claude's tool use and it's surprisingly stable - no hallucinated function calls which is usually where these models fall apart.

Reply

[-]

Pristine_Income9554@reddit

[https://github.com/ggml-org/llama.cpp/issues/18944](https://github.com/ggml-org/llama.cpp/issues/18944) why it's slow with Llama.cpp

Reply

[-]

R_Duncan@reddit

GGUF doesn't seem to work, over 5k context used for an answer that Qwen3-Next and kimi-linear give easily.

Reply

[-]

Educational_Sun_8813@reddit

there were two versions yesterday, ensure you have the 2nd one after fix to the converter, after that no issues, probably it can be optimized further, but it's working fine (using rocm on strix-halo)

Reply

[-]

Artistic_Dig_5426@reddit

Which code editor are you using with this model?

Reply

[-]

Educational_Sun_8813@reddit

i tried it both in intellij and opencode, and as a chat just in llama-server, works fine or rocm with strix-halo

Reply

[-]

cloudcity@reddit

can i run on 3080 + 32GB of RAM?

Reply

[-]

lucas03crok@reddit

Yes, for example with a 5 bit GGUF

Reply

[-]

Educational_Sun_8813@reddit

Q4_K_M is doing good too, now testing it since it's bit faster than Q8, and so far so good

Reply

[-]

hidden2u@reddit

Any word on a vision version? 4.6v flash is also very good at tool calling

Reply

[-]

Maximum@reddit (OP)

Really? Interesting, but the score difference on coding is still big, so unless vision is absolutely necessary, I would not mix in the 4.6V.

Reply

[-]

hidden2u@reddit

Well when you have vision + tool calling it opens up a lot of use cases like making edits and then verifying them or agentic stuff

Reply

[-]

Maximum@reddit (OP)

Yes, that's true. I have used qwen3 vl 8b as eyes for my agents, and it worked perfectly for creating a captioned dataset and running scripts on those.

Reply

[-]

iBog@reddit

GLM-4.7-Flash: How To Run Locally | Unsloth Documentation https://unsloth.ai/docs/models/glm-4.7-flash

Reply

[-]

MerePotato@reddit

Very curious to see the minimum quant level at which it retains this kind of stellar performance

Reply

[-]

Glittering-Call8746@reddit

Does this work as agents?

Reply

[-]

PermanentLiminality@reddit

Often the first GGUF to be released can have problems. I'll wait at least a week. For now I'll test with OpenRouter.

Reply

[-]

Electronic-Site8038@reddit

it actually thinks a lot for simple tasks, which is not necesarly bad. im giving it a go, so far it looks promising. Do you have any new data OP? @\_\_Maximum\_\_ ?

Reply

[-]

lolwutdo@reddit

Hell yeah, this is what I like to hear, before this model the only thing that works most of the time is oss-20b

Reply

[-]

OmarBessa@reddit

the GPU butler

Reply

[-]

ResponsiblePoetry601@reddit

Wow great Will try it out Glm4.7 has been actually pretty useful for me

Reply

Reply to Post

169 Comments