[Research use case] MiniMax-M2.7 with small context, CPU+GPU (5090) setup on Llama.cpp
Posted by Opening-Broccoli9190@reddit | LocalLLaMA | 5 comments
I was experimenting yesterday with running oversized models at a smaller context size, hoping that leaving them to work overnight would compensate for the slow token generation and the periodic pauses for compaction or task chunking.
Summary: for research you first and foremost need a model and quant that give you a 60k context window entirely in VRAM + RAM, and only then decide how many parameters you can afford. Harnesses like Hermes eat up 10k of context just to start working, and every search result needs about another 10k for reasoning. Running any model for research with a context below 40k is a gamble; ideally you need a 60k window (10k for the prompt, ~10k per search result × 5 search results).
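To make the arithmetic explicit: 10k (harness startup + prompt) + 5 × 10k (search results and the reasoning over them) = 60k. A 40k window covers only about three search results before compaction has to kick in, which matches the failure mode in the runs below.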
Below are my runs and iterations.
Setup:
I picked one of the more granularly quantized models - MiniMax-M2.7 with 229B parameters - and selected a 4-bit quant, which would leave me 12 GB of headroom on my system (32 GB VRAM on a 5090 plus 64 GB RAM) once deployed.
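Back-of-envelope on the fit (my rough numbers; unsloth's UD dynamic quants average out somewhere around 3 bits per weight despite the nominal label):
229B params × ~3 bits/weight ÷ 8 ≈ 86 GB of weights
32 GB VRAM + 64 GB RAM = 96 GB total
96 GB − 86 GB ≈ 10-12 GB left for KV cache, runtime buffers, and the OS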
Below is the docker-compose command fragment I used for the experiments:
command: >
  -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_S
  -ngl 18
  --jinja
  --fit-ctx 40000
  --no-mmap
  --parallel 1
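For completeness, here is roughly how that fragment slots into a full compose file. This is a sketch: the image tag, port, and cache path are illustrative (llama.cpp publishes CUDA server images on ghcr.io, and -hf downloads models into the container's llama.cpp cache), so adjust to your setup:
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    command: >
      -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_S
      -ngl 18
      --jinja
      --fit-ctx 40000
      --no-mmap
      --parallel 1
    ports:
      - "8080:8080"
    volumes:
      - ./models:/root/.cache/llama.cpp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]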
Tasks:
1. Chat completion with Search tool for "When was BF6 released"
2. Hermes-driven research for "What are the trending news on local llama subreddit in the last 24 hours"
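Task 1 boils down to a single OpenAI-style chat completion against llama-server's built-in endpoint with a search tool attached. A minimal sketch of the request (the web_search tool definition is a stand-in for whatever tool your harness actually registers):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "When was BF6 released"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "web_search",
        "description": "Search the web and return the top results",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }]
  }'
With --jinja enabled, llama-server applies the model's chat template, which is what lets the tool definitions actually reach the model.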
First run - manually configured: 18 layers on GPU, 45 on CPU, 100k context, progressive weight loading from SSD when needed (mmap).
22 tps for processing the query
3-4 tps for generating the response
Result:
1. Tool called, but the results were truncated and compacted with critical data loss. Wrong answer.
2. The research task for the latest news via the Hermes bot timed out after 30+ minutes.
Learning: using an SSD as extended memory is, in practice, a non-starter.
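For the record, run 1 maps onto the compose fragment above roughly like this (my reconstruction; mmap is llama.cpp's default, so it is on simply because --no-mmap is absent):
command: >
  -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_S
  -ngl 18
  --jinja
  -c 100000
  --parallel 1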
Second run - auto-fit: 13 layers on GPU, 50 on CPU, 10k context, progressive weight loading from SSD when needed (mmap).
200 tps for processing the query
14 tps for generating the response
Result:
1. Tool called, but the results were truncated and compacted with critical data loss. Wrong answer.
2. The research task for the latest news via the Hermes bot fell into recursive context compaction and timed out as well.
Learning: with a 10k context, the quality of the model means nothing for modern workloads and tool calling.
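For context on why the window size bites so hard: the KV cache grows linearly with it. A generic transformer estimate (I haven't checked MiniMax's exact layer and head counts, so treat the symbols as placeholders):
KV cache bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × ctx_len × 2 bytes (f16)
A 40k window therefore costs 4× the cache of a 10k one, out of the same 96 GB that also has to hold the weights.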
Third run - auto-fit: 10 layers on GPU, 53 on CPU, 40k context, everything in memory (no-mmap).
400 tps for processing the query
25 tps for generating the response
Result:
1. Tool called, but the results were truncated and compacted with critical data loss. Wrong answer.
2. The research task for the latest news via the Hermes bot fell into recursive context compaction and timed out as well.
Learning: GPU+CPU RAM may be 5-6 times slower on prompt processing and 2 times slower on generation, but that is survivable; without adequate room for context, a model's usability drops to zero.
RegularRecipe6175@reddit
FWIW I tried M2.7 up to Q4KXL on llama.cpp and the output was too inconsistent to use for any serious work. I tried a number of different settings. I also found reports that the minimax family really suffers from quantization. Since I'm practically limited to a 4-bit quant, I gave up on M2.7. For me, Qwen 3.6 27b is the current king. 4x3090 / Strix Halo. Of course, YMMV.
Opening-Broccoli9190@reddit (OP)
I'll try it out!
MelodicRecognition7@reddit
I'm afraid this is the reason
Opening-Broccoli9190@reddit (OP)
You mean the reason for the context overflow? The wrong answers were due to the BF6 release date being past the model's training cutoff. I don't think quants are involved in this.
MelodicRecognition7@reddit
I mean the reason for the failed tasks; when you go below 4 bits, wrong answers are way more likely to happen.