Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM
Posted by scorpios_1200@reddit | LocalLLaMA | View on Reddit | 32 comments
I have the following hardware and want to run MiniMax-M2.7 (230B) locally. What is the best software stack and configuration to maximize performance?
Specs:
- GPU: 3x RTX 5090
- CPU: AMD Threadripper Pro 9975
- RAM: 512GB ECC DDR5-5600
- What is the best technology to run this 230B model across my GPU and CPU/RAM?
- What is the ideal balance between context length and tokens per second for this specific setup?
- How should I optimize the weight offloading to the 512GB system RAM?
- Are there specific BIOS or OS tweaks to maximize throughput between the 9975 and the 5090s?
Noobysz@reddit
How many tokens per sec are you getting, if I may ask?
scorpios_1200@reddit (OP)
Not tested yet, I’m planning to benchmark minimax‑M2.7‑UD‑Q8_K_XL with llama.cpp vs minimax‑M2.7 via SGLang. I’ll post an update soon.
Noobysz@reddit
Yes please, I'm extra excited for the SGLang results
sleepingsysadmin@reddit
No chance of over 25TPS at low context sizes.
One-Macaron6752@reddit
For mixed inference (GPU/CPU) don't even think about llama.cpp (mainline). Go full ik_llama. Have patience, learn to master the extra parametrization and you'll be able to squeeze every bit of performance in a mixed inference mode for that hw of yours. You'll thank me later. 😎
scorpios_1200@reddit (OP)
Thanks for the tip! I'm curious, are you running a similar hybrid setup? I'd love your insight on how you balance the sparse experts. So far on mainline llama.cpp (Ubuntu 24.04), I've managed to get Qwen 397B (Q8) running at ~2.6–3.0 t/s with 13 layers on GPU and 47 on CPU. I'm using an 80K context with FP8 KV cache, but I hit a massive stability wall the second I cross 16 threads.
Urb4nn1nj4@reddit
Do you mind elaborating? I usually use mainline on my 2x 3090 threadripper ddr4 256gb for the bigger models. Is this basically because it’s easier to offload gpu layers on ik-llama?
One-Macaron6752@reddit
Check ubergarm HF model Page for MiniMax M2.7. There you have references for ik_llama invoke commands for mixed inference systems (GPU, CPU). https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF
He's producing some very nice quants, specifically for ik_llama. Also aessedai has great quants too, compatible with both llama mainline as well as ik_llama.
ik_llama.cpp is a high-performance fork of the original llama.cpp project designed to maximize CPU and hybrid GPU/CPU inference speeds, offering 3x to 4x speed improvements in multi-GPU configurations via its new "split mode graph" execution.
"-sm graph" is the golden nugget of ikawrakow's ik_llama that basically allows for graph execution on CUDA backends while providing up to 2x–3x speed over "-sm layer" (of course, depending on the model). Also, ik_llama makes KV cache rotation possible (and does it safely) via Hadamard transforms, which allows for higher KV consistency in long contexts (>65k) even with Q8_0 or Q4_0... Once you master it (please, have patience with it and yourself) you'll see it's the closest in performance to vLLM / SGLang backends.
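As a minimal sketch of what an ik_llama.cpp launch with `-sm graph` might look like on a 3-GPU hybrid setup: the model path, quant, and offload regex below are placeholders, and flag support should be verified against ubergarm's model card and your build of ik_llama.

```shell
# Hypothetical ik_llama.cpp invocation (paths/values are placeholders).
# -sm graph : the "split mode graph" execution described above
# -ngl 99   : offload all repeating layers to the GPUs...
# -ot ...   : ...but keep the large MoE expert FFN tensors in system RAM
# -t 16     : OP reported instability past 16 CPU threads
./llama-server \
  -m /models/MiniMax-M2.7-Q4_K.gguf \
  -sm graph \
  -ngl 99 \
  -ot "ffn_(up|down|gate)_exps=CPU" \
  -t 16 \
  -c 65536 -fa
```

The key idea is the split: dense/attention weights live on the GPUs, while the sparse expert tensors (the bulk of a MoE model's size) stay in RAM where the Threadripper's memory bandwidth serves them.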
cmndr_spanky@reddit
My advice, based on my own experience running LLMs that have to spill into system RAM, with some layers on GPU and others on CPU:
Run on linux (not windows, not WSL, actual linux server).
Do not run it at anything less than Q4 (at Q2 and below you're just pissing away electricity on a nerfed model; at that point you'd get better performance from a higher quant of a smaller model).
Use llama-server (llama.cpp) to host the model. Not Ollama: it's slower and has fewer options.
Use Claude Code with full access to your Linux server to help you test different llama-server configurations and context window sizes so you can optimally offload MoE layers between CPU and GPUs. You can ask it to run a few benchmarks at different context sizes and have some options to choose from.
Obviously don't copy these settings because this is for my hardware and a totally different model, but this was what Claude Code and I figured out iteratively was the best configuration to run Qwen 35b on my specific hardware (involving some spill over into system ram and CPU run model layers):
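(The commenter's actual settings block did not survive in this capture. As a purely hypothetical illustration of the kind of llama-server invocation being described, with every path and number made up rather than the commenter's real config:)

```shell
# Hypothetical example only -- not the commenter's real settings.
# -ngl 99        : offload all layers to GPU by default...
# --n-cpu-moe 20 : ...then send the first N layers' MoE expert tensors
#                  back to CPU/system RAM (flag availability depends on
#                  your llama.cpp version)
llama-server \
  -m /models/qwen-35b-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  --n-cpu-moe 20 \
  -t 12 -fa \
  --host 0.0.0.0 --port 8080
```

The iterative part is sweeping `-c`, `--n-cpu-moe`, and `-t` until VRAM is full but not overflowing, which is exactly the kind of grind an agent can run unattended.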
Urb4nn1nj4@reddit
Minimax suffers at quants below 8-bit more than other models do. llama.cpp and Ubuntu? I'd target native context and see what performance is. Don't give in to the urge to quantize, hahah. Just swap to Qwen 397B, which does much better at some 2-bit and most 3/4-bit quants.
scorpios_1200@reddit (OP)
I tested Qwen 397B Q8 with an offload split of 13 layers on GPU and 47 layers on CPU, running ~80K context with FP8 KV cache, and I'm seeing ~2.6–3.0 tok/s. Past 16 CPU threads, stability drops and the run becomes unreliable. Next I'm planning to try minimax‑M2.7‑UD‑Q8_K_XL in llama.cpp, or alternatively MiniMax‑M2.7 via SGLang, to see if it improves throughput.
Urb4nn1nj4@reddit
Check this out. You might be able to go down to IQ2M or 3KXL or 4KM if you’re paranoid on 397b. https://kaitchup.substack.com/p/summary-of-qwen35-gguf-evaluations
No_Conversation9561@reddit
3 tok/s?
I’m getting 27 tok/s on 2 x M3 Ultra 256GB
Specific-Rub-7250@reddit
Well, I have a similar setup with a Threadripper Pro 5995wx with 512GB of DDR4-3200 RAM (8-channel) and dual AMD Radeon AI PRO R9700. I am running Minimax 2.7 at Q8_0, which benchmarks at around 280 t/s prompt processing and 16 t/s generation. You need to benchmark the batch size (ubatch) and the number of threads to use.
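A sketch of how that sweep might look with llama-bench, which accepts comma-separated value lists so one run covers the whole grid (the model path is a placeholder; adjust `-ngl` for your VRAM):

```shell
# Sweep micro-batch sizes and thread counts in a single llama-bench run.
# -ub : micro-batch sizes to compare (mainly affects prompt processing)
# -t  : thread counts to compare (mainly affects generation)
# -p/-n : prompt and generation token counts per trial
llama-bench \
  -m /models/MiniMax-M2.7-Q8_0.gguf \
  -ngl 99 \
  -ub 256,512,1024 \
  -t 8,16,24 \
  -p 512 -n 128 \
  -o md
```

The markdown table output makes it easy to eyeball which (ubatch, threads) combination wins for pp and tg separately.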
jacek2023@reddit
With 3x3090 and x399 I use Minimax in Q3, with your setup you should probably use Q4 and it will be much faster than mine.
Buildthehomelab@reddit
why just why 3 5090?
scorpios_1200@reddit (OP)
higher TFLOPS for multi-turn than an RTX Pro 6000
Buildthehomelab@reddit
If you have this much money to spend on this, pay someone to help you with it.
If you don't understand why specifically 3 cards is a problem, you are going to be in for a world of hurt and wasted resources.
Leafytreedev@reddit
How does someone spend $30k+ on a multi-GPU AI build and put no effort into researching it? lol. I'm guessing it's because his mobo maxes out at 3.
Traditional_Fox1225@reddit
Mo money but ain’t got none brains
Buildthehomelab@reddit
Shit, here I am stressing about my 2x 3090s that are coming. I can't fathom getting a setup like that handed to you.
scorpios_1200@reddit (OP)
I got it for free. If you understand the setup, just help: either share your knowledge or move on.
Makers7886@reddit
I can solve your issues for free (5090). You pay shipping + handling though.
Buildthehomelab@reddit
Sure you did. You got 3x 5090s for free? I call BS.
Voxandr@reddit
Maybe he got it from his daddy
Buildthehomelab@reddit
Sugar daddy
Voxandr@reddit
You are underestimating how easily `some` physical assets can earn a lot more $$$$$$$ than skills and brains.
Buildthehomelab@reddit
Looks like it, that is like 20K plus USD setup
joost00719@reddit
Found it for free at 4am somewhere 😁
Buildthehomelab@reddit
of the back of a truck
Ok-Measurement-1575@reddit
Llama.cpp, offload the up|down expert tensors.
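Presumably this means keeping the MoE expert up/down projection tensors in system RAM via llama.cpp's `--override-tensor` (`-ot`), which takes a `regex=buffer` pair. A minimal sketch, with a placeholder model path:

```shell
# Keep the large MoE expert up/down projections on CPU,
# everything else (attention, gates, dense layers) on the GPUs.
# The regex matches tensor names like blk.12.ffn_up_exps.weight.
llama-server \
  -m /models/MiniMax-M2.7-Q4_K_M.gguf \
  -ngl 99 \
  -ot "ffn_(up|down)_exps=CPU"
```

Since expert tensors dominate a MoE model's footprint, this one flag is often the difference between fitting on 3x 5090s and not.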
Total_Activity_7550@reddit
Your setup is not common. I think you should find out empirically.
First, you won't be able to use vLLM efficiently, since tensor parallelism requires a power-of-two (2^n) GPU count.
Then, your option is llama.cpp.
Then, you can learn `llama-bench` to optimize further.