Testing MiMo-V2.5-IQ3_S with 1'048'576 context
Posted by LegacyRemaster@reddit | LocalLLaMA | 17 comments
llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --parallel 1 --temp 0.2
load_tensors: offloaded 49/49 layers to GPU
load_tensors: Vulkan0 model buffer size = 72842.29 MiB
load_tensors: Vulkan1 model buffer size = 34524.53 MiB
load_tensors: Vulkan_Host model buffer size = 488.91 MiB
Hardware: RTX 6000 96 GB + W7800 48 GB
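For anyone wanting to check how much room is left for the KV cache, here is a rough sketch that adds up the model buffer sizes from the log lines above. The regex and the function name `sum_model_buffers` are my own illustration, not part of llama.cpp, and treating the cards' nominal 96 + 48 GB as GiB is an assumption (usable VRAM will be somewhat lower).

```python
import re

# Log lines copied from the load output above.
LOG = """\
load_tensors: Vulkan0 model buffer size = 72842.29 MiB
load_tensors: Vulkan1 model buffer size = 34524.53 MiB
load_tensors: Vulkan_Host model buffer size = 488.91 MiB
"""

# Hypothetical helper: matches "<device> model buffer size = <number> MiB".
BUFFER_RE = re.compile(r"load_tensors:\s+(\S+) model buffer size\s*=\s*([\d.]+) MiB")


def sum_model_buffers(log_text):
    """Return ({device: MiB}, total MiB) for every model buffer line found."""
    sizes = {dev: float(mib) for dev, mib in BUFFER_RE.findall(log_text)}
    return sizes, sum(sizes.values())


sizes, total_mib = sum_model_buffers(LOG)
total_gib = total_mib / 1024

# Assumption: nominal card sizes (96 + 48) treated as GiB.
nominal_vram_gib = 96 + 48
gpu_weights_gib = (sizes["Vulkan0"] + sizes["Vulkan1"]) / 1024

print(f"model buffers: {total_mib:.2f} MiB ({total_gib:.2f} GiB)")
# -> model buffers: 107855.73 MiB (105.33 GiB)
print(f"nominal VRAM left after GPU weights: {nominal_vram_gib - gpu_weights_gib:.2f} GiB")
```

So the weights alone take about 105 GiB, and whatever remains on the two cards has to cover the KV cache and compute buffers for the requested context.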
I started testing with the IQ3 version because the second W7800 is in another machine. What has impressed me so far is the processing speed, both in llama-server and in VS Code + Kilo Code. While MiniMax drops off very quickly in processing and prefill t/s at 50k context, MiMo is faster and more stable.
It's still too early for an overall assessment. The model tends to loop. With repetition penalty at 1.1 and temp at 0.2, the generated code seems to improve. Also, if it loops, stopping and restarting the generation doesn't reproduce the loop. Perhaps it's better to use a fixed seed. Looping is the main problem I've encountered so far. I'll report back once I get past 300k context.
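If you want to try the same sampling settings with a fixed seed, something like the sketch below should work against llama-server's `/completion` endpoint. The port 8080 (the llama-server default, since the command above doesn't set one), the seed value, the `n_predict` limit, and the helper names `build_payload` and `complete` are my assumptions. A fixed seed should make runs easier to compare, but I wouldn't count on bit-identical output across GPU runs.

```python
import json
import urllib.request


def build_payload(prompt, seed=42, n_predict=256):
    """Request body using the settings from the post plus a fixed seed."""
    return {
        "prompt": prompt,
        "temperature": 0.2,     # temp from the post
        "repeat_penalty": 1.1,  # repetition penalty from the post
        "seed": seed,           # assumption: any fixed integer
        "n_predict": n_predict, # assumption: cap on generated tokens
    }


def complete(prompt, base_url="http://127.0.0.1:8080", seed=42):
    """POST the payload to a running llama-server and return the generated text."""
    body = json.dumps(build_payload(prompt, seed)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["content"]
```

With the server running, `complete("Write a function that reverses a string.")` sends one request. Rerunning a prompt that looped with the same seed would show whether the loop is reproducible or just sampling luck.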