Most used models and performance on M3 Ultra 512GB
Posted by nomorebuttsplz@reddit | LocalLLaMA | View on Reddit | 47 comments
Bored, thought this screenshot was cute, might delete later.
Model: Kimi K2 Thinking
Use case: idk, it's just cool having a huge model running locally. I guess I will use it for brainstorming, medical stuff, and other questionable activities like academic writing. PP speed/context size is too limited for a lot of agentic workflows, but it's a modest step above other open-source models for pure smarts.
PP speed: Q3 GGUF: 25 t/s (4-bit GGUF at 26k context); faster with lower context
Token gen speed: ~3-20 t/s depending on context size
Model: GLM 4.6
Use Case: vibe coding (slow but actually can create working software semi-autonomously with Cline); creative writing; expository/professional writing; general quality-sensitive use
PP speed: 4-bit MLX: 50-70 t/s at large context sizes (greater than 40k)
Token gen speed: generally 10-20 t/s
Model: Minimax-M2
Use case: document review, finance, math
PP speed: 4-bit MLX: 300-400 t/s at modest context sizes (~10k)
Token gen speed: 40-50 t/s at modest context sizes
Model: GPT-OSS-120B
Use case: agentic searching, large-document ingestion; general medium-quality, fast use
PP speed: 4-bit MLX: near 1000 t/s at modest context sizes. But context caching doesn't work, so it has to reprocess the full prompt every turn.
Token gen speed: about 80 t/s at medium context sizes
Model: Hermes 405B
Use case: when you want stuff to have that early-2024 vibe... not really good at anything except maybe low-context roleplay/creative writing. Not the trivia king people seem to think.
PP speed: 4-bit MLX: low... maybe 25 t/s?
Token gen speed: super low... 3-5 t/s
Model: DeepSeek 3.1
Use case: used to be for roleplay and long-context, high-quality, slow work. Might be obsoleted by GLM 4.6... not sure it can do anything better.
PP speed: Q3 GGUF: 50 t/s
Token gen speed: 3-20 t/s depending on context size
hakyim@reddit
This post is super helpful. Do you have any update you want to share 5 months after this post?
nomorebuttsplz@reddit (OP)
GLM-5 is faster and only slightly stupider than Kimi K2 Thinking, which is still the smartest overall.
Qwen 3 27b and Gemma 4 31b have made everything between them and GLM-5 seem somewhat questionable. Minimax M2.7 might change that. I am not familiar with Step 3.5 etc.
Academic-Screen-3481@reddit
How are you running Kimi K2 Thinking Q3 GGUF?
Are you using llama.cpp? What are the command-line parameters?
19 t/s seems pretty fast.
Thanks for posting this.
nomorebuttsplz@reddit (OP)
Oh, sorry, were you asking about prefill time?
Academic-Screen-3481@reddit
Thanks for your reply. I was asking about generation speed - sorry for the confusion.
I've been experimenting with Kimi K2 Thinking Q3_K_XL on a 512GB M3 Ultra. With LM Studio I got 13.5 tokens/sec from an empty context. I've been using koboldcpp, which gives me 16 tokens/sec from an empty context. (Both of them should be using the same llama.cpp backend, but I'm guessing koboldcpp is more up to date.)
Then I tried Kimi K2 Thinking Q4_X from ubergarm (https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF). In theory this is essentially identical to the INT4 that Kimi K2 was based on. It's 540GB and doesn't fit on the 512GB M3 Ultra, so I connected my 128GB MacBook Pro M4 and used a distributed llama-server RPC setup. I used a Thunderbolt 5 cable to give it a nice fast connection and got 15 tokens/sec on an empty context. More realistically, I tried an 18k prompt: it was processed at 105.41 tokens/sec (so 171 seconds), with a generation speed of 3.87 tokens per second. And my MacBook got very hot.
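(For reference, the two-machine setup described above looks roughly like this. A sketch only, assuming llama.cpp built with RPC support via -DGGML_RPC=ON; the IP, port, context size, and model path are placeholders, not the commenter's exact configuration.)

```
# On the 128GB MacBook Pro: expose it as a remote compute/memory backend
# over the Thunderbolt bridge.
./rpc-server -H 0.0.0.0 -p 50052

# On the 512GB M3 Ultra: run llama-server with the local Metal backend
# plus the MacBook as an additional RPC device.
./llama-server \
  -m Kimi-K2-Thinking-Q4_X.gguf \
  --rpc 169.254.0.2:50052 \
  -ngl 99 -c 20480
```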
nomorebuttsplz@reddit (OP)
Yeah, I think 19 is a slight exaggeration. It slows down so fast that in practice you won't get that if the answer is more than a few tokens long. Typically it will be more like 12 for a first answer. I am running LM Studio.
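(Since the llama.cpp command line came up above: LM Studio hides these flags, but a bare-bones llama-server invocation for the same model would look something like the sketch below. The model path, context size, and port are placeholders, not OP's actual settings.)

```
# Serve a split GGUF on Apple Silicon: point -m at the first shard and the
# rest load automatically; -ngl 99 offloads all layers to Metal.
./llama-server \
  -m ./Kimi-K2-Thinking-Q3_K_XL/model-00001-of-00008.gguf \
  -c 26000 \
  -ngl 99 \
  --port 8080
```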
false79@reddit
Super cool post. All my questions already answered
Front-Relief473@reddit
So now my question is, are you going to buy the 512GB m3 Ultra or not?
nomorebuttsplz@reddit (OP)
My advice would be to get about $3k worth of DDR5 RAM and run GLM-sized models for 12 months until the M5 Ultra comes out. I think GLM, Minimax M2, and Qwen 235B all demonstrate that sub-400B is a sweet spot with current LLM tech.
That advice is based on total ignorance of your needs, uses, preferences, and budget, and of when the M5 Ultra will come out, if it ever does.
power97992@reddit
Nah, I'm thinking about the M6 Max. The M5 Ultra with 512GB or 1TB will come out in 5-8 months, but it's too expensive unless you get 96/128GB.
nomorebuttsplz@reddit (OP)
Thanks, glad it was helpful
synn89@reddit
Thanks for posting all these details. I've been curious what people were using day to day, practically, with the M3 Ultra. I'm hoping we continue to see strong models in the GLM size range, as I feel like in a couple of years these M3 Ultra hardware specs will be doable at around 5k USD with a reasonable home footprint.
cosimoiaia@reddit
Most used where?
nomorebuttsplz@reddit (OP)
Depends on task. Overall GLM 4.6 is most used. Then OSS-120 or Kimi.
cosimoiaia@reddit
I didn't ask for what task, I asked where: on what setup? Local, or some API service? Also, how did you get this data? This seems sus af to me.
nomorebuttsplz@reddit (OP)
Local; that's what "M3u 512" means in the title. This is LM Studio.
cosimoiaia@reddit
Ah, so this is just your preference... I don't have a Mac and have never used LM Studio, so I could never have guessed.
Maybe you could have been slightly clearer in the title, like "Most used models (for me)"; the way you posted it sounded more like a global statistic or a ground truth for a platform.
Thank you for the clarification though.
Ackerka@reddit
Add the Qwen3 Coder 480B 4-bit quant to your list. It works best for me for vibe coding.
Concerning Kimi K2 Thinking, the Q3_K_XL version consumes too much memory. If you add even a single-page document to the prompt, your Mac Studio M3 Ultra 512GB system can easily hang. Even for shorter questions, after an enormous amount of thinking, the responses were weaker, and certainly not stronger, than those of other smaller models. So I'm not convinced either. The original INT4 version might be stronger, but it does not fit into 512GB.
nomorebuttsplz@reddit (OP)
I was able to put about 28,000 tokens into K2 Thinking at Q3_K_XL. That should be many pages.
Ackerka@reddit
Interesting. I used LM Studio to run the model, added a one-page PDF, and my system hung during prompt processing. Simple text questions were answered, but slower and never better than somewhat smaller non-thinking models. After the computer froze I removed the model, so I cannot run further tests now without downloading the huge model again. I also tried the Q2_K_XL version, but it often got stuck in an endless thinking loop, so it was definitely useless. I've seen amazing results from Kimi K2 Thinking on different platforms, but I'm sure they are not from the Q3_K_XL versions. Probably the original INT4 is a big deal.
nomorebuttsplz@reddit (OP)
I had poor results (like literal nonsense) until I updated the Metal llama.cpp backend in LM Studio, even though there was nothing about Kimi in the release notes. Also, make sure you are running something like sudo sysctl iogpu.wired_limit_mb=510000 to make more RAM available to the GPU.
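(For reference, the wired-limit tweak in full. As I understand it, the value is in MB, it resets on reboot, and setting 0 restores the macOS default cap of roughly three quarters of RAM.)

```
# Check the current GPU wired-memory limit (0 = macOS default)
sysctl iogpu.wired_limit_mb

# Let the GPU wire ~510GB of the 512GB (resets on reboot)
sudo sysctl iogpu.wired_limit_mb=510000

# Restore the default
sudo sysctl iogpu.wired_limit_mb=0
```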
Ackerka@reddit
Thanks for the tip. I currently have Metal llama.cpp v1.56.0 in LM Studio, but I'm not absolutely sure I had the same version when I tried the model, as autoupdate is enabled. Nevertheless, I did get meaningful answers, just not perfect ones. E.g. prompt: "Create an HTML WEB page with javascript that displays an analog clock." It generated a working solution in 3602 tokens at 11.79 tokens/s, but the hands of the clock were rotated 90 degrees counter-clockwise compared to the correct solution. Only two models nailed this task perfectly in my local tests: qwen3-coder-480b and, interestingly, gpt-oss-120b. I tested it on 14 models, and Minimax-M2 performed worse than Kimi-K2-Thinking Q3_K_XL, by the way. glm-4.6-mlx-6 generated a fancy page without a working analog clock for me.
nomorebuttsplz@reddit (OP)
I am on v1.57.
Professional-Bear857@reddit
Did you try Qwen 235B Thinking? It's my favourite so far, although I have 256GB of RAM so I can't run a decent quant of DeepSeek.
nomorebuttsplz@reddit (OP)
I have tried it. It's definitely solid for actual work, but it doesn't seem to have the spark of intelligence that GLM and larger models have, and I don't like how it writes creative stuff, with a lot of AI slopisms.
Professional-Bear857@reddit
Yeah, I found GLM made too many one-shot mistakes; Qwen is really good with one-shot coding tasks.
lolwutdo@reddit
Thanks for the performance specs. Ngl, 4.6 running at around 10-20 t/s is kinda disappointing for a $10k+ computer when you can run the same model on CPU at 2-3 t/s on a ~$1500 DDR5 rig (pre price jump).
Don't get me wrong, I'd still love those speeds, but idk if that's worth spending an extra ~$500 per token/s of speed; it definitely reshapes my perspective on everything.
power97992@reddit
Good luck getting 512GB of RAM for 1500 bucks now... I checked yesterday; it was almost 3680 bucks (459.99 × 8). Also, you didn't factor the CPU, motherboard, power supply, and GPU into the price... Even a year ago it would have cost around 3500-4000...
lolwutdo@reddit
I'm talking about consumer hardware; 256GB DDR5 would be the max, and that can run full GLM.
But yeah, you're pretty much screwed if you didn't buy the RAM before the prices jumped.
My build with 128GB DDR5, a 5060 Ti, a Ryzen 8700G, and a B850M board ended up costing me around $1500-$1600 IIRC, and this was as of the beginning of October.
SexMedGPT@reddit
Dollar per token per second is a weird metric to use
lolwutdo@reddit
True, but if the main reason for buying an expensive computer is to run big models faster, it's a valid metric: how much value are you getting out of a $10k computer when a computer at 10% of the cost can do the same thing, just slower?
The_Hardcard@reddit
It seems like "barely slower" would only apply to short responses. For 5000-token responses, that is about 5 minutes versus 40 minutes, which is more than barely slower.
It would depend on how heavy your use case is, but heavy, serious interaction with the model would make that a pretty large gap.
egomarker@reddit
So you are unhappy because you want 20x speed for 7x price instead of 10x speed for 7x price.
hezarfenserden@reddit
where is the image from?
nomorebuttsplz@reddit (OP)
LM Studio
Only_Situation_4713@reddit
For comparison, 12× 3090s get me 12k t/s prompt processing with vLLM and 20 tokens per second of generation for GLM and Minimax.
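(A launch along those lines might look like the sketch below; the model ID, quantization, and the 4×3 tensor/pipeline split are assumptions, not the commenter's actual command.)

```
# 12 GPUs split as tensor-parallel 4 x pipeline-parallel 3.
# A quantized checkpoint (e.g. AWQ) would be needed to fit GLM on 24GB cards.
vllm serve zai-org/GLM-4.6 \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 3 \
  --max-model-len 32768
```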
foucist@reddit
Sounds like a $5,600 Mac Studio with 256GB RAM would be on par with that.
nomorebuttsplz@reddit (OP)
12k prompt processing t/s for both GLM and Minimax?
Only_Situation_4713@reddit
Yeah. I think each GPU hovers around 178W under load.
AvocadoArray@reddit
Gonna just turn the furnace off this year and run a few prompts a day instead.
Turbulent_Pin7635@reddit
Just imagine the sound and heat of it running...
vatarysong@reddit
K2 and M2 are both cool!
No_Conversation9561@reddit
what is roleplay in this context?
nomorebuttsplz@reddit (OP)
DnD style “game” that needs to keep track of characters, keep things interesting, and have some ability to model a world.
Investolas@reddit
No Qwen3-Next? The mlx-community version goes.
nomorebuttsplz@reddit (OP)
Seems similar to GPT-OSS performance-wise, but maybe a bit slower for gen and faster for prefill? How do you think it compares?