Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4

Posted by wombweed@reddit | LocalLLaMA | 13 comments

CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache. Model is at q8_0 to mitigate some weird behavior I was seeing at lower quants.

Speed is very slow at around 50tps pp, 10tps tg, but usable for coding agent workflows.
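To put those two numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes "128k" means 131072 tokens, that the prompt is processed from scratch with no cache reuse between requests, and that the rates stay constant across the whole context; the function names are made up for illustration.

```python
def prefill_seconds(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    """Time to process a prompt at a constant prompt-processing rate."""
    return prompt_tokens / pp_tokens_per_s


def generation_seconds(output_tokens: int, tg_tokens_per_s: float) -> float:
    """Time to generate a reply at a constant token-generation rate."""
    return output_tokens / tg_tokens_per_s


FULL_CONTEXT = 128 * 1024  # assumption: "128k" = 131072 tokens
PP_RATE = 50.0             # tokens/s, from the post
TG_RATE = 10.0             # tokens/s, from the post

if __name__ == "__main__":
    minutes = prefill_seconds(FULL_CONTEXT, PP_RATE) / 60
    print(f"Cold prefill of a full context: {minutes:.1f} min")
    print(f"1000-token reply: {generation_seconds(1000, TG_RATE):.0f} s")
```

In practice an agent rarely sends a completely cold full-length prompt, so real turn times should usually be shorter than the worst case this prints.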

Anybody else running MoE models in this size class on relatively low-end hardware? For my purposes, speed is less important than accuracy, as long as a task doesn't literally take all day. Are there any other models you'd recommend I try, or additional optimization tips that could help within my constraints? I wish they'd released the draft model for MTP on this model, but it looks like they declined to do so for 2.7.

My ik_llama flags -- sorry for the funny formatting, this is pasted out of my vibe coded NixOS config:

          "${ik-llama-cuda}/bin/llama-server"
          + " -m ${modelPath}"
          + " --host 0.0.0.0"
          + " --port ${toString cfg.port}"
          + " -c ${toString cfg.contextLength}"
          + " -ngl 999"
          + " --cpu-moe"
          + " -sm graph"
          + " -fa on"
          + " -t 16"
          + " -tb 16"
          + " -b 4096"
          + " -ub 4096"
          + " -np 1"
          + " -muge"
          + " -ger"
          + " --jinja"
          + " --metrics"
          + " --temp 1.0"
          + " --top-p 0.95"
          + " --top-k 40"
          + " --min-p 0.01"

FullstackSensei@reddit

X299 is such an underappreciated platform. You can very probably get a good uplift if you upgrade to a higher core count part. And don't be afraid to "downgrade" to a 9th or even 7th Gen CPU. They're all basically the same, with only minor frequency bumps. A 7980xe, 9980xe, or 10980xe will provide a very nice uplift.

wombweed@reddit (OP)

I completely agree! I got it because it seemed like a good value just for AVX-512 and using the RAM I already had lying around. Good call on the 10980xe, I definitely see that in this rig's future.

FullstackSensei@reddit

AVX-512 doesn't contribute much. It's the quad channel memory controller that can go up to 256GB and 4266 MT/s. That's like having 256GB RAM running at 8500 MT/s on dual channel DDR5.
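The equivalence is easy to check with the usual theoretical peak bandwidth formula (channels × bytes per transfer × transfer rate). This is a rough sketch that assumes 64 bits of data width per channel and ignores real-world efficiency; the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel moves `bus_bytes` bytes per transfer (8 bytes = 64 bits).
    """
    return channels * bus_bytes * mt_per_s / 1000


if __name__ == "__main__":
    print(peak_bandwidth_gb_s(4, 4266))  # quad channel DDR4 at 4266 MT/s
    print(peak_bandwidth_gb_s(2, 8500))  # dual channel DDR5 at 8500 MT/s
    print(peak_bandwidth_gb_s(4, 3200))  # quad channel DDR4 at 3200 MT/s
```

The first two come out at roughly 136 GB/s each, while quad channel at 3200 MT/s is about 102 GB/s, which is the headroom being pointed at here.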

DegenerateGandhi@reddit

Wait, do the older gen ones even support 256gb? I think my 7th gen only does 128gb max.

wombweed@reddit (OP)

That makes sense. I'm running my sticks at stock XMP 3200 settings, I'll have to see if I can push my kit higher.

MelodicRecognition7@reddit

-t 16

https://files.catbox.moe/5w3eqh.png

Lowkey_LokiSN@reddit

Have you tried KTransformers yet? I've yet to personally try it out but it's on my checklist as a potential performance-uplift candidate for heterogeneous CPU/GPU inference.

https://github.com/kvcache-ai/ktransformers/blob/main/doc%2Fen%2FMiniMax-M2.5.md

wombweed@reddit (OP)

Interesting suggestion, I have not tried this one, will star the repo and check it out on a rainy day. Thank you!

wombweed@reddit (OP)

I had minimax read this and I think it’s correct unfortunately. Intel i9-10900X (Cascade Lake-X) has AVX-512 F/BW/VL/VNNI but lacks AVX-512_BF16 and AVX-512_VBMI. The M2.5 guide uses --kt-method FP8 which requires both. This means the FP8 native CPU backend will not work.
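If you want to check your own CPU rather than trust a model's reading, here is a small sketch for Linux that looks for the two instruction sets in `/proc/cpuinfo`. It assumes the kernel's flag spellings `avx512_bf16` and `avx512vbmi`; whether the FP8 backend really needs both is taken from the comment above, not verified against KTransformers itself.

```python
REQUIRED = ("avx512_bf16", "avx512vbmi")  # Linux /proc/cpuinfo flag names


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def missing_features(cpuinfo_text: str, required=REQUIRED) -> list[str]:
    """List the required flags that the CPU does not report."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not Linux, or cpuinfo unavailable
    missing = missing_features(text)
    print("missing:", ", ".join(missing) if missing else "none")
```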

AI-Agent-Payments@reddit

With `--cpu-moe` on a 10900x you're leaving a lot on the table. The 10-core HT config means your expert dispatch threads are competing hard for L3. Dropping `-t` to 10 or even 8 and setting `-tb` to match sometimes squeezes out 15-20% TG improvement on that chip family because you stop thrashing the cache with excess threads. Also worth trying `-b 2048 -ub 512` if your batch sizes in the coding agent are mostly single-request, since the 4096/4096 pairing is optimized for throughput over latency and you're already bandwidth-bound on the CPU side.
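Since the gains here are "sometimes", the safest way to evaluate them is to try each combination and compare the measured pp/tg rates. A minimal sketch that only enumerates the flag sets to try, using the values mentioned in this thread; the function name is made up, and actually launching and timing the server is left out.

```python
from itertools import product


def candidate_settings(
    thread_counts=(8, 10, 16),
    batch_pairs=((4096, 4096), (2048, 512)),
):
    """Yield one list of flags per (threads, batch/ubatch) combination.

    -tb is set equal to -t, as suggested in the comment above.
    """
    for threads, (batch, ubatch) in product(thread_counts, batch_pairs):
        yield [
            "-t", str(threads),
            "-tb", str(threads),
            "-b", str(batch),
            "-ub", str(ubatch),
        ]


if __name__ == "__main__":
    for flags in candidate_settings():
        print(" ".join(flags))
```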

wombweed@reddit (OP)

Really appreciate these suggestions, I will be giving these a shot

Shoddy_Bed3240@reddit

50 tps pp is painfully slow

wombweed@reddit (OP)

Yeah, not appropriate for interactive sessions if you're used to bootstrapping an entire app in 5 minutes like Claude. Most of the stuff I have it work on is relatively simple, so for my purposes it's not a dealbreaker. But I can totally see this being unacceptable for other applications. I have a separate node running qwen3.6-35b-a3b for when the speed/quality tradeoff makes sense.