Testing MiMo-V2.5-IQ3_S with 1'048'576 context

Posted by LegacyRemaster@reddit | LocalLLaMA | 17 comments

llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel 1 --temp 0.2

load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 72842.29 MiB
load_tensors: Vulkan1 model buffer size = 34524.53 MiB
load_tensors: Vulkan_Host model buffer size = 488.91 MiB

RTX 6000 96 GB + W7800 48 GB

I started testing with the IQ3 version because the second W7800 is on another machine. What's impressed me so far is the processing speed, both on llama-server and VS Code + kilocode. While MiniMax drops very quickly in processing and prefill t/s at 50k context, MiMo is faster and more stable. It's still early to give an overall assessment.

The main problem I've encountered is that it tends to loop. With repetition penalty at 1.1 and temp at 0.2, the code seems to improve. Also, if it loops, it doesn't loop again after I stop and restart. Perhaps it's better to use a fixed seed. I'll let you know how it goes when I break 300k context.
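For a rough sense of the weight footprint, the three buffer sizes from the log can simply be added up. This is a minimal Python sketch: the numbers are copied from the `load_tensors` lines, the variable names are only illustrative, and the total covers model weights only, not the KV cache or compute buffers.

```python
# Model buffer sizes in MiB, copied from the load_tensors log lines.
buffers_mib = {
    "Vulkan0": 72842.29,
    "Vulkan1": 34524.53,
    "Vulkan_Host": 488.91,
}

total_mib = sum(buffers_mib.values())
total_gib = total_mib / 1024  # 1 GiB = 1024 MiB

# Print each buffer in GiB along with its share of the total weights.
for name, size in buffers_mib.items():
    share = size / total_mib
    print(f"{name:12s} {size / 1024:7.2f} GiB ({share:5.1%} of weights)")
print(f"{'total':12s} {total_gib:7.2f} GiB")
```

The total comes to roughly 105.3 GiB of weights out of the 144 GB across the two cards, which leaves the rest for context.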

17 Comments

LegacyRemaster@reddit (OP)

https://preview.redd.it/b5nk9kiip30h1.png?width=1435&format=png&auto=webp&s=067ef7f8ab06af3458d0f0f7a8c473ec98b49072 Update: 300k context, 33.4 t/s. Not bad. The output is good and consistent.

Jealous-Astronaut457@reddit

What settings?

tarruda@reddit

Yes, it is fast, but I found this IQ3_S quant to be kinda bad: in the few tests I did, it got stuck in a reasoning loop.
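For anyone who wants to flag this kind of loop automatically instead of eyeballing the output, here is a minimal Python sketch. It is only a heuristic: it checks whether the tail of the generated text is the same chunk repeated back to back. The function names and the thresholds (`min_len`, `max_len`, `threshold`) are made up for illustration and would need tuning on real output.

```python
def trailing_repeat_count(text, min_len=20, max_len=200):
    """Return how many times the final chunk of `text` repeats back to back.

    Tries chunk lengths from min_len to max_len and returns the highest
    consecutive repeat count found (1 means no repetition).
    """
    best = 1
    for size in range(min_len, min(max_len, len(text) // 2) + 1):
        chunk = text[-size:]
        count = 1
        pos = len(text) - 2 * size
        # Walk backwards one chunk at a time while the text keeps matching.
        while pos >= 0 and text[pos:pos + size] == chunk:
            count += 1
            pos -= size
        best = max(best, count)
    return best


def looks_like_loop(text, threshold=3):
    """Heuristic: treat `threshold` or more identical trailing chunks as a loop."""
    return trailing_repeat_count(text) >= threshold
```

Exact-match repetition like this misses loops where the wording drifts slightly between passes, so it is a rough first filter rather than a reliable detector.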

Digger412@reddit

AesSedai here - I've heard that the official API loops as well, and I think that agentic workloads and severe quantization contribute to it too. I've mostly used the Pro model for conversational usage / creative writing and haven't experienced looping, but I also run it at nearly Q8_0, so I certainly acknowledge there's a big gap there. Maybe try dropping the temp or increasing the rep penalty?

LegacyRemaster@reddit (OP)

https://preview.redd.it/f5l2ueq9v30h1.png?width=762&format=png&auto=webp&s=9b528b8e12a85d005444162fe4f223128d543334 Zero loops. Rep penalty 1.1, temp 0.2.
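These settings can also be sent per request rather than on the command line. Below is a minimal Python sketch against llama-server's `/completion` endpoint, using its `temperature`, `repeat_penalty`, `seed` and `n_predict` fields. The port 8080 (the server default, since the command above sets no `--port`), the seed value 42 and the helper names are assumptions for illustration, not the settings actually used in the screenshot.

```python
import json
import urllib.request


def build_payload(prompt, temperature=0.2, repeat_penalty=1.1, seed=42, n_predict=256):
    """Build a request body for llama-server's /completion endpoint.

    temperature and repeat_penalty mirror the values mentioned in this thread;
    seed=42 and n_predict=256 are arbitrary illustrative choices.
    """
    return {
        "prompt": prompt,
        "temperature": temperature,
        "repeat_penalty": repeat_penalty,
        "seed": seed,
        "n_predict": n_predict,
    }


def complete(prompt, base_url="http://127.0.0.1:8080", **overrides):
    """POST the payload to a running llama-server and return the generated text."""
    body = json.dumps(build_payload(prompt, **overrides)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]

# Example (needs a running server):
# print(complete("Write a Python function that reverses a string."))
```

A fixed seed makes it easier to tell whether a loop comes from the sampling settings or from one unlucky draw, since the same prompt can be rerun with only one parameter changed.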

tarruda@reddit

Note that temp 1.0 is the officially recommended temperature. Setting it to 0.2 might affect performance. I'm currently trying to create my own quant in the Q3 range. I'll play with the tensor sizes to see if I can improve it.

LegacyRemaster@reddit (OP)

Yes. I'm playing with the params to "mitigate" it.

FoxiPanda@reddit

I've done a fair amount of testing with MiMo-v2.5 now and I have to tell you, that model is great for the first ~110-130K of the context window and after that she kinda loses it lol. That 1M context window with coherence deep into that context is still a dream for me at least. I was trying with a Q5 quant of MiMo - I might bump that up to a Q8 and deal with the speed penalty if I can get better coherence later on in the context window.

LegacyRemaster@reddit (OP)

As always, it depends on the code, the task, etc. I've gotten to 400k of document analysis, graphs, and HTML page production with zero errors. I haven't seen any major problems with Python either. But obviously, there's C++ and 100 other languages, so it's difficult to draw conclusions based on the limited use case. Even Minimax and Qwen 27b get lost beyond a certain context. At 400k, I asked for the conversation to be condensed, and it was perfect.

unbannedfornothing@reddit

Did you experience looping issues? On default settings with Claude Code, it loops about once every 20-30 tool calls for me.

rhythmdev@reddit

I asked her “what’s up?” She couldn’t take it… went completely nuts

FoxiPanda@reddit

I have not experienced looping issues actually. Instead, the model just straight up goes off the rails on agentic tasks. It'll start making up code that doesn't exist or failing tool calls trying to use them on non-existent files or... a plethora of other bizarre behaviors. Looping actually hasn't been one of them for me on MiMo-v2.5 interestingly. I can also get it to do about 100 tool calls in a single turn early on in context before it starts to get weird there too.

unbannedfornothing@reddit

Did you use the latest quants? They were updated about 5 hours ago, though I haven't compared hashes, so I don't know if they actually changed.
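Checking whether a local shard matches the re-uploaded one only takes a hash comparison. Here is a minimal Python sketch, assuming the model repo page lists a SHA-256 for each file; the function names are illustrative, and the published hash has to be copied in by hand.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks
    so multi-gigabyte GGUF shards never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(local_path, published_sha256):
    """True if the local file's hash matches the published one (case-insensitive)."""
    return sha256_of_file(local_path) == published_sha256.strip().lower()
```

If `same_file` returns `False` for any shard, that shard changed upstream and needs to be downloaded again.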

takoulseum@reddit

RTX 6000 with Windows, that’s so sad

LegacyRemaster@reddit (OP)

I have Linux too

FatheredPuma81@reddit

Someone please test this in agentic work like OpenCode and see how long it takes until it has a total meltdown lol.

LegacyRemaster@reddit (OP)

https://preview.redd.it/aumoo9yt030h1.png?width=1505&format=png&auto=webp&s=3053260359d452979908a412df418b49640cace4 Testing with kilocode now.