Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM
Posted by scorpios_1200@reddit | LocalLLaMA | View on Reddit | 32 comments
I have the following hardware and want to run MiniMax-M2.7 (230B) locally. What is the best software stack and configuration to maximize performance?
Specs:
- GPU: 3x RTX 5090
- CPU: AMD Threadripper Pro 9975
- RAM: 512GB ECC DDR5-5600
- What is the best technology to run this 230B model across my GPU and CPU/RAM?
- What is the ideal balance between context length and tokens per second for this specific setup?
- How should I optimize the weight offloading to the 512GB system RAM?
- Are there specific BIOS or OS tweaks to maximize throughput between the 9975 and the 5090s?
Noobysz@reddit
How many tokens per sec are you getting, if I may ask?
scorpios_1200@reddit (OP)
Not tested yet, I’m planning to benchmark minimax‑M2.7‑UD‑Q8_K_XL with llama.cpp vs minimax‑M2.7 via SGLang. I’ll post an update soon.
Noobysz@reddit
Yes please, I'm extra excited for the SGLang results
sleepingsysadmin@reddit
No chance of over 25TPS at low context sizes.
One-Macaron6752@reddit
For mixed inference (GPU/CPU) don't even think about llama.cpp (mainline). Go full ik_llama. Have patience, learn to master the extra parametrization and you'll be able to squeeze every bit of performance in a mixed inference mode for that hw of yours. You'll thank me later. 😎
scorpios_1200@reddit (OP)
Thanks for the tip! I'm curious, are you running a similar hybrid setup? I'd love your insight on how you balance the sparse experts. So far on mainline llama.cpp (Ubuntu 24.04), I've managed to get Qwen 397B (Q8) running at ~2.6–3.0 t/s with 13 layers on GPU and 47 on CPU. I'm using an 80K context with FP8 KV cache, but I hit a massive stability wall the second I cross 16 threads.
Urb4nn1nj4@reddit
Do you mind elaborating? I usually use mainline on my 2x 3090 threadripper ddr4 256gb for the bigger models. Is this basically because it’s easier to offload gpu layers on ik-llama?
One-Macaron6752@reddit
Check ubergarm HF model Page for MiniMax M2.7. There you have references for ik_llama invoke commands for mixed inference systems (GPU, CPU). https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF
He's producing some very nice quants, specifically for ik_llama. Also aessedai has great quants too, compatible with both llama mainline as well as ik_llama.
ik_llama.cpp is a high-performance fork of the original llama.cpp project designed to maximize CPU and hybrid GPU/CPU inference speeds, offering 3x to 4x speed improvements in multi-GPU configurations via its new "split mode graph" execution.
"-sm graph" is the golden nugget of ikawrakow's ik_llama that basically allows for graph execution on CUDA backends while providing up to 2x–3x speed over "-sm layer" (of course, depending on the model). Also, ik_llama makes KV cache rotation possible (and does it safely) via Hadamard transforms, which allows for higher KV consistency in long contexts (>65k) even with Q8_0 or Q4_0... Once you master it (please, have patience with it and yourself) you'll see it's the closest in performance to vLLM / SGLang backends.
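As a minimal sketch of what an ik_llama.cpp launch with `-sm graph` might look like on a 3-GPU hybrid setup: the model path, quant, and offload regex below are placeholders, and flag support should be verified against ubergarm's model card and your build of ik_llama.

```shell
# Hypothetical ik_llama.cpp invocation (paths/values are placeholders).
# -sm graph : the "split mode graph" execution described above
# -ngl 99   : offload all repeating layers to the GPUs...
# -ot ...   : ...but keep the large MoE expert FFN tensors in system RAM
# -t 16     : OP reported instability past 16 CPU threads
./llama-server \
  -m /models/MiniMax-M2.7-Q4_K.gguf \
  -sm graph \
  -ngl 99 \
  -ot "ffn_(up|down|gate)_exps=CPU" \
  -t 16 \
  -c 65536 -fa
```

The key idea is the split: dense/attention weights live on the GPUs, while the sparse expert tensors (the bulk of a MoE model's size) stay in RAM where the Threadripper's memory bandwidth serves them.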
cmndr_spanky@reddit
My advice, based on my own experience running LLMs that have to spill into system RAM, with some layers on GPU and others on CPU:
Run on linux (not windows, not WSL, actual linux server).
Do not run it at anything less than Q4 (at Q2 and below you're just pissing away electricity on a nerfed model; at that point you'd get better performance from a higher quant of a smaller model).
Use llama-server (llama.cpp) to host the model. Not Ollama: it's slower and has fewer options.
Use Claude Code with full access to your Linux server to help you test different llama-server configurations and context window sizes so you can optimally offload MoE layers between CPU and GPUs. You can ask it to run a few benchmarks at different context sizes and have some options to choose from.
Obviously don't copy these settings because this is for my hardware and a totally different model, but this was what Claude Code and I figured out iteratively was the best configuration to run Qwen 35b on my specific hardware (involving some spill over into system ram and CPU run model layers):
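(The commenter's actual settings block did not survive in this capture. As a purely hypothetical illustration of the kind of llama-server invocation being described, with every path and number made up rather than the commenter's real config:)

```shell
# Hypothetical example only -- not the commenter's real settings.
# -ngl 99        : offload all layers to GPU by default...
# --n-cpu-moe 20 : ...then send the first N layers' MoE expert tensors
#                  back to CPU/system RAM (flag availability depends on
#                  your llama.cpp version)
llama-server \
  -m /models/qwen-35b-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 20 \
  -t 12 -fa \
  --host 0.0.0.0 --port 8080
```

The iterative part is sweeping `-c`, `--n-cpu-moe`, and `-t` until VRAM is full but not overflowing, which is exactly the kind of grind an agent can run unattended.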
Urb4nn1nj4@reddit
Minimax suffers at quants below 8-bit more than other models do. llama.cpp and Ubuntu? I'd target native context and see what performance is. Don't give in to the urge to quantize, hahah. Just swap to Qwen 397B, which does much better at some 2-bit and most 3/4-bit quants.
scorpios_1200@reddit (OP)
I tested Qwen 397B Q8 with an offload split of 13 layers on GPU and 47 layers on CPU, running ~80K context with FP8 KV cache, and I'm seeing ~2.6–3.0 tok/s. Past 16 CPU threads, stability drops and the run becomes unreliable. Next I'm planning to try minimax‑M2.7‑UD‑Q8_K_XL in llama.cpp, or alternatively MiniMax‑M2.7 via SGLang, to see if it improves throughput.
Urb4nn1nj4@reddit
Check this out. You might be able to go down to IQ2M or 3KXL or 4KM if you’re paranoid on 397b. https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations
No_Conversation9561@reddit
3 tok/s?
I’m getting 27 tok/s on 2 x M3 Ultra 256GB
Specific-Rub-7250@reddit
Well, I have a similar setup with a Threadripper Pro 5995wx with 512GB of DDR4-3200 RAM (8-channel) and dual AMD Radeon AI PRO R9700. I am running Minimax 2.7 at Q8_0, which benchmarks at around 280 t/s prompt processing and 16 t/s generation. You need to benchmark the batch size (ubatch) and the number of threads to use.
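A sketch of how that sweep might look with llama-bench, which accepts comma-separated value lists so one run covers the whole grid (the model path is a placeholder; adjust `-ngl` for your VRAM):

```shell
# Sweep micro-batch sizes and thread counts in a single llama-bench run.
# -ub : micro-batch sizes to compare (mainly affects prompt processing)
# -t  : thread counts to compare (mainly affects generation)
# -p/-n : prompt and generation token counts per trial
llama-bench \
  -m /models/MiniMax-M2.7-Q8_0.gguf \
  -ngl 99 \
  -ub 256,512,1024 \
  -t 8,16,24 \
  -p 512 -n 128 \
  -o md
```

The markdown table output makes it easy to eyeball which (ubatch, threads) combination wins for pp and tg separately.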
jacek2023@reddit
With 3x3090 and x399 I use Minimax in Q3, with your setup you should probably use Q4 and it will be much faster than mine.
Buildthehomelab@reddit
why just why 3 5090?
scorpios_1200@reddit (OP)
higher TFLOPS for multi-turn than an RTX Pro 6000
Buildthehomelab@reddit
If you have this much money to spend on this, pay someone to help you with it.
If you don't understand why specifically 3 cards is a problem, you are going to be in for a world of hurt and wasted resources.
Leafytreedev@reddit
How does someone spend $30k+ on a multi-GPU AI build and put no effort into researching it? lol. I'm guessing it's because his mobo maxes out at 3.
Traditional_Fox1225@reddit
Mo money but ain’t got none brains
Buildthehomelab@reddit
Shit, here I am stressing about my 2x 3090s that are coming. I can't fathom getting a setup like that handed to you.
scorpios_1200@reddit (OP)
I got it for free. If you understand the setup, just help: either share your knowledge or move on.
Makers7886@reddit
I can solve your issues for free (5090). You pay shipping + handling though.
Buildthehomelab@reddit
Sure you did. You got 3x 5090s for free? I call BS.
Voxandr@reddit
Maybe he got it from his daddy
Buildthehomelab@reddit
Sugar daddy
Voxandr@reddit
You are underestimating how easily `some` physical assets can earn a lot more $$$$$$$ than skills and brains.
Buildthehomelab@reddit
Looks like it, that is like 20K plus USD setup
joost00719@reddit
Found it for free at 4am somewhere 😁
Buildthehomelab@reddit
of the back of a truck
Ok-Measurement-1575@reddit
Llama.cpp, offload the up|down expert tensors.
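Presumably this means keeping the MoE expert up/down projection tensors in system RAM via llama.cpp's `--override-tensor` (`-ot`), which takes a `regex=buffer` pair. A minimal sketch, with a placeholder model path:

```shell
# Keep the large MoE expert up/down projections on CPU,
# everything else (attention, gates, dense layers) on the GPUs.
# The regex matches tensor names like blk.12.ffn_up_exps.weight.
llama-server \
  -m /models/MiniMax-M2.7-Q4_K_M.gguf \
  -ngl 99 \
  -ot "ffn_(up|down)_exps=CPU"
```

Since expert tensors dominate a MoE model's footprint, this one flag is often the difference between fitting on 3x 5090s and not.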
Total_Activity_7550@reddit
Your setup is not common. I think you should find out empirically.
First, you won't be able to use vLLM efficiently, since tensor parallelism requires a power-of-two (2^n) GPU count.
Then, your option is llama.cpp.
Then, you can learn `llama-bench` to optimize further.