Minimax M2.7 at Q3_K_S, or a smaller model with greater precision?
Posted by iFai1x@reddit | LocalLLaMA | 8 comments
I'm currently looking for models that fit on my single DGX Spark. I also have an RTX Pro 6000 and a 5090 that I'm considering using alongside it if the DGX Spark is too slow, but the intent here is to play around with OpenClaw.
I've looked around for benchmarks, but I'm assuming sites such as PinchBench are testing full-precision models and reporting how well they accomplish tasks on average.
Any tips or experiences from what others here are using for their OpenClaw setups? I've considered Minimax M2.7, Qwen3.5-27B, Gemma 4 31B, Nemotron 3 Super 120B, and Qwen3.5-122B-A10B. I'd be running all of these at Q4 (except Minimax M2.7) on the DGX Spark, or perhaps at Q8 or higher for some of them on the Pro 6000.
My real question is whether Q3 is too aggressive a quant for Minimax M2.7, or whether running a smaller model at higher precision will net more consistent results in OpenClaw. Of course, published benchmarks only really show you a comparison at full precision.
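For context, here's the napkin math I've been using to figure out what actually fits. A rough sketch only: the parameter counts and bits-per-weight figures are illustrative guesses, not official numbers, and it ignores KV cache and runtime overhead.

```python
# Rough GGUF size estimate: params * bits_per_weight / 8 bytes.
# Parameter counts are placeholders for illustration; bpw values
# are typical for each llama.cpp quant family.
GIB = 1024**3

def gguf_size_gib(params_billion: float, bpw: float) -> float:
    """Approximate loaded model size in GiB, ignoring KV cache."""
    return params_billion * 1e9 * bpw / 8 / GIB

candidates = [
    ("~120B class @ Q3_K_S (~3.5 bpw)", 120, 3.5),
    ("~120B class @ Q4_K_M (~4.8 bpw)", 120, 4.8),
    ("~31B class  @ Q8_0   (~8.5 bpw)", 31, 8.5),
    ("~27B class  @ Q8_0   (~8.5 bpw)", 27, 8.5),
]

for name, params_billion, bpw in candidates:
    print(f"{name}: ~{gguf_size_gib(params_billion, bpw):.0f} GiB")
```

On a 128 GB box, that's roughly the difference between a Q3 of the big model barely fitting with room for context and a Q8 of the 27-31B class fitting comfortably.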
Any help would be appreciated!
ai_guy_nerd@reddit
Quantization at Q3 often hits a wall where logic starts to degrade, especially for complex reasoning tasks. In most cases, dropping to a slightly smaller model but keeping Q4_K_M or Q6 precision yields much more stable results. The "intelligence" of the model is often more dependent on the precision of the weights than the raw parameter count when you go below 4-bit.
For an OpenClaw setup, consistency is everything since the agents rely on strict formatting and tool-call precision. If the quant is too aggressive, you'll see more hallucinations in the JSON outputs. It might be worth testing Gemma 4 31B at a higher precision; it's often the sweet spot for that hardware profile.
OpenClaw usually plays well with Q4+ across the board, but if VRAM is the bottleneck, try a smaller model with better precision rather than a huge one at Q3.
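If you want to quantify that instead of eyeballing it, a quick smoke test against LM Studio's local OpenAI-compatible server works: fire the same tool-call-style prompt N times and count how often the output parses as valid JSON. This is a sketch with assumptions baked in: the URL is LM Studio's default, the model name is a placeholder, and the prompt is just an example.

```python
import json
import requests

# LM Studio's local server speaks the OpenAI chat completions API.
# Default port is 1234; the model name is a hypothetical placeholder.
URL = "http://localhost:1234/v1/chat/completions"
MODEL = "your-loaded-model"

PROMPT = (
    "Respond with ONLY a JSON object of the form "
    '{"tool": "<name>", "args": {"query": "<string>"}} '
    "for this request: search the web for DGX Spark reviews."
)

def valid_json_rate(n: int = 50) -> float:
    """Fraction of n completions that parse as valid JSON."""
    ok = 0
    for _ in range(n):
        resp = requests.post(URL, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": PROMPT}],
            "temperature": 0.7,
            "max_tokens": 128,
        }, timeout=120)
        text = resp.json()["choices"][0]["message"]["content"]
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / n

print(f"valid JSON rate: {valid_json_rate():.0%}")
```

Run the same test against each quant of the same model and compare the numbers; formatting failures tend to be the first symptom when a quant is too aggressive.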
iFai1x@reddit (OP)
Thanks for the input, have you had a chance to try out Qwen3.5-27B and compare it to Gemma 4 31B, both at full precision or near full (Q8)?
Qwen3.6-35B-A3B also released very recently, but I'm wondering if you had any comment on dense vs MoE models based on your experience for OpenClaw.
Rude_Ambassador_6270@reddit
strange that no one has done a benchmark simply "per GB", comparing the lowest quants of larger models against "sane" quants of smaller ones
(and if I haven't seen it, no one has done it)
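wouldn't even be hard to hack together. A sketch (the scoring function is a placeholder for whatever eval you trust, and the file names are hypothetical):

```python
import os

# Hypothetical "capability per GB" harness: score each quant on your
# own task set, then normalize by file size.

def run_task_suite(model_path: str) -> float:
    """Placeholder eval: load the model, run your tasks, and return
    the fraction solved in [0, 1]. Plug in your own harness here."""
    return 0.0  # stub so the sketch runs end to end

models = [
    "big-model-Q3_K_S.gguf",   # lowest sane quant of the large model
    "small-model-Q8_0.gguf",   # high-precision quant of the small one
]

for path in models:
    size_gb = os.path.getsize(path) / 1e9
    score = run_task_suite(path)
    print(f"{path}: score={score:.2f}, {score / size_gb:.3f} per GB")
```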
waitmarks@reddit
I don't have a DGX Spark, so I don't know how much you can actually allocate to the GPU on one, but I have a Strix Halo with 128 GB of RAM, and the best quant of Minimax 2.7 that fits on mine is the IQ4_XS from AesSedai.
iFai1x@reddit (OP)
Interesting. IQ quants are really slow on the DGX Spark; with Q quants I can get around 15 tok/s in LM Studio. What speeds are you getting, and how is the capability (i.e., how useful are the responses for your use case)?
waitmarks@reddit
I get 23 tok/s on the Strix Halo with the IQ4_XS quant. I'm still testing it for usefulness; I only learned it existed yesterday. Before that I was using the Q3_K_XL from Unsloth, and it was quite good for programming: it solved a bug for me that Opus couldn't, albeit with some prodding and retrying.
iFai1x@reddit (OP)
What was your context length?
waitmarks@reddit
That's with basically no context. Strix Halo performance drops off dramatically at long context, so I try to keep it as short as possible.
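If you want to see where the cliff is on your own hardware, a crude sweep like this with llama-cpp-python makes it visible. A sketch under assumptions: the model path is a placeholder, the repeated filler word only approximates the target prompt depth, and the tok/s figure includes prompt eval (which is exactly where the slowdown bites).

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at whatever GGUF you're testing.
MODEL_PATH = "minimax-2.7-IQ4_XS.gguf"

llm = Llama(model_path=MODEL_PATH, n_ctx=32768,
            n_gpu_layers=-1, verbose=False)

for depth in (128, 4096, 16384):
    # Pad the prompt to roughly `depth` tokens with a filler word.
    prompt = "hello " * depth
    start = time.time()
    out = llm(prompt, max_tokens=64)
    elapsed = time.time() - start
    n_gen = out["usage"]["completion_tokens"]
    print(f"~{depth} ctx: {n_gen / elapsed:.1f} tok/s (incl. prompt eval)")
```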