Best open-weight model to run locally on 8x A100 80GB for generating teacher data?
Posted by i_am__not_a_robot@reddit | LocalLLaMA | View on Reddit | 26 comments
I have access to a SLURM cluster with 8x NVIDIA A100 80GB GPUs (=640 GB VRAM) on a single task, and I want to run an open-weight model locally with llama.cpp for data generation, not coding.
My use case is generating teacher data for downstream fine-tuning of very small models on specific economic topics across multiple industries and sectors. I need reasonably strong general reasoning, structured answers, and good structured-output consistency at ~32-64k context.
Prior experiments indicate that 32-64k tokens total, including the prompt and a few relevant source documents, is sufficient for my use case. This is single-user / single-task inference only, so quality and consistency matter more to me than raw throughput.
What model would you pick, or recommend I look into, for this specific task?
I was looking at Kimi-K2.6-UD-Q4_K_XL, but it sadly won't fit (I did not account for the multi-GPU overhead and KV cache requirements).
Party-Log-1084@reddit
If KV cache is killing you, try DeepSeek V3. It uses MLA so the memory footprint for a 64k context is tiny compared to standard models. Alternatively, just run a Q6 of Llama 3.1 405B. You have 640GB, so it easily fits with plenty of room to spare for context if you enable flash attention.
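Rough numbers on why MLA matters here (layer/head counts are from memory, so double-check against each model's config.json before trusting this):

```python
# Back-of-envelope KV cache sizes at 64k context with an fp16/bf16 cache.
CTX = 65536
BYTES = 2  # bytes per element for fp16/bf16

def gqa_kv_bytes(n_layers, n_kv_heads, head_dim, ctx=CTX):
    # Standard attention (MHA/GQA): full K and V cached for every layer.
    return 2 * n_layers * n_kv_heads * head_dim * BYTES * ctx

def mla_kv_bytes(n_layers, kv_lora_rank, rope_dim, ctx=CTX):
    # MLA: one compressed latent (+ decoupled RoPE part) cached per layer.
    return n_layers * (kv_lora_rank + rope_dim) * BYTES * ctx

# Llama 3.1 405B: ~126 layers, 8 KV heads, head_dim 128
print(f"Llama 3.1 405B @ 64k: ~{gqa_kv_bytes(126, 8, 128) / 2**30:.1f} GiB")    # ~31.5 GiB
# DeepSeek V3: 61 layers, kv_lora_rank 512, RoPE dim 64
print(f"DeepSeek V3 (MLA) @ 64k: ~{mla_kv_bytes(61, 512, 64) / 2**30:.1f} GiB")  # ~4.3 GiB
```

Either way, that leaves plenty of headroom next to the weights in 640GB.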
Traditional-Gap-3313@reddit
Have you used 405B for such tasks? How is it on world knowledge and reasoning about text? My use case is similar to OP's. I need good prose, and modern MoE models are all maxed out for coding.
i_am__not_a_robot@reddit (OP)
Thanks for your suggestions!
mangoking1997@reddit
You should be able to do CPU offload since it's a MoE model, and the performance shouldn't be that bad?
i_am__not_a_robot@reddit (OP)
Yes, thanks for your suggestion. I considered this; however, our cluster is actually RAM- and CPU-poor, and the use policy forbids running CPU-intensive workloads, so this is sadly not an option.
SexyAlienHotTubWater@reddit
Training takes a lot of data. I wouldn't try to generate a large dataset on CPU.
mangoking1997@reddit
It's not running on the CPU; it just holds the experts it isn't currently using in RAM.
i_am__not_a_robot@reddit (OP)
Thanks for pointing that out. I need to learn more about how MoE models are served in practice, and I will look into whether this is feasible or allowed for our setup.
BreakIt-Boris@reddit
Download the original Kimi K2.6 release, not the GGUF, and use it via vLLM.
https://huggingface.co/moonshotai/Kimi-K2.6/tree/main
Kimi released their weights in INT4, which the A100 supports natively. In fact, it has better INT4 (not FP4) support than Ada/Hopper/Blackwell.
I thought Kimi also used DeepSeek's MLA for their attention mechanism. If so, you should easily be able to fit a single 65k context on top of the ~600 GB of weights.
Try tensor parallel first, but if that fails due to overhead, run with data parallel instead - that should reduce the overhead.
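Something like this is roughly what I mean - an untested sketch, so treat the exact model id and vLLM's support for that INT4 checkpoint on Ampere as things to verify:

```python
from vllm import LLM

# Offline engine, sharded across all 8 A100s, 64k context.
llm = LLM(
    model="moonshotai/Kimi-K2.6",   # the release linked above
    tensor_parallel_size=8,          # one shard per GPU
    max_model_len=65536,             # caps the KV cache allocation at 64k
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
)
```

If TP across 8 GPUs blows up on memory overhead, the fallback is a smaller tensor_parallel_size, or several independent engines run as data parallel.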
i_am__not_a_robot@reddit (OP)
Thanks for the advice!
ResidentPositive4122@reddit
You most definitely want to run vLLM or sglang on that and not llama.cpp. You want the best throughput possible, and those two are known for that.
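Even "single user" data generation benefits: if you hand vLLM the whole prompt list at once, continuous batching keeps the GPUs busy the entire time. A minimal sketch (the model id is just a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-chosen-model", tensor_parallel_size=8, max_model_len=65536)

# Thousands of prompts submitted in one call; vLLM batches them internally.
prompts = [f"Write a structured Q&A pair about topic {i}." for i in range(5000)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=1024, temperature=0.7))

for o in outputs[:3]:
    print(o.outputs[0].text)
```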
i_am__not_a_robot@reddit (OP)
Thanks for the advice! I thought vLLM was only preferable in scenarios with multiple concurrent users, but it seems like that's not the case.
Serprotease@reddit
I second this. Your best bet is GLM 5.1 at FP4, or a mix of FP8/FP4, with vLLM, and to send concurrent requests to maximize throughput.
GLM 5.1 is roughly between Sonnet 4.6 and Opus 4.6 level and the best thing you can run with this amount of VRAM. Kimi 2.5/2.6 is also an option (it's FP4 native), but it's not as good, and 2.6 tends to think for a very long time. If you care about the thinking trace, that might be an issue.
i_am__not_a_robot@reddit (OP)
Unfortunately, the A100 (Ampere) doesn't have native FP4/FP8 tensor core support, so I'm only looking at INT quants.
ResidentPositive4122@reddit
You can run FP8 on Ampere; it will be handled by the Marlin kernels. Not native-speed acceleration, but still pretty good.
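Sketch, assuming a pre-quantized FP8 checkpoint (placeholder id) - vLLM picks the quant method up from the checkpoint config and routes it through the weight-only Marlin path on Ampere, so weights sit in FP8 while compute runs in BF16/FP16:

```python
from vllm import LLM

# FP8 weights, BF16 compute on A100 (no FP8 tensor cores required).
llm = LLM(
    model="some-org/some-model-FP8",  # placeholder for an FP8 checkpoint
    tensor_parallel_size=8,
    max_model_len=65536,
)
```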
i_am__not_a_robot@reddit (OP)
That's good to know, thanks! I didn't realize that was an option on Ampere at competitive speed.
mangoking1997@reddit
Llama.cpp is better for single instances, is it not?
MelodicRecognition7@reddit
take a look at Kimi-K2.6-Q4_X or better Kimi-K2.5-Q4_X
i_am__not_a_robot@reddit (OP)
Does Q4_X have any meaningful VRAM footprint benefits over UD-Q4_K_XL? Because I couldn't get 64k context with Kimi-K2.6-UD-Q4_K_XL on my setup.
MelodicRecognition7@reddit
try FP8 or Q8 context. It is generally not a good idea to quantize the context, but 8-bit is still acceptable quality.
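in llama.cpp that's just the cache type options. A rough sketch with the Python bindings (untested, and the path is a placeholder); an 8-bit K/V cache is roughly half the size of the fp16 one:

```python
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="/path/to/Kimi-K2.6-UD-Q4_K_XL.gguf",  # placeholder path
    n_ctx=65536,
    n_gpu_layers=-1,        # keep every layer on the GPUs
    flash_attn=True,        # needed for a quantized V cache
    type_k=GGML_TYPE_Q8_0,  # 8-bit K cache
    type_v=GGML_TYPE_Q8_0,  # 8-bit V cache
)
```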
i_am__not_a_robot@reddit (OP)
The A100 (Ampere) doesn't support FP8 tensor cores, but it might be worth looking into Q8 KV cache. Thanks for the suggestion. That said, I've seen advice suggesting that for accuracy and reliability, it is better to run a slightly smaller model with BF16/FP16 KV cache than to run a larger model that only fits by relying on aggressively quantized KV cache...?
MelodicRecognition7@reddit
yes, a smaller model at a higher quant plus 16-bit cache will very likely be more accurate than a larger model at a lower quant plus 8-bit cache. But it depends on the models and on the task. You should try GLM as others recommend, but do not try Qwen, DeepSeek, or Llama.
Bubbly-Staff-9452@reddit
It depends heavily on the model. Generally FP8/Q8_0 is free, or at least extremely close to it. You can test the perplexity at the various KV quants and mixed KV quants. I run Turbo3 on some models with good results and can fit much more context. Also, as general advice: yes, I would try to fit a large model, but make sure it is also the most current generation of whichever model family you choose, because intelligence increases so fast that smaller models from a newer generation of the same family generally beat larger, older models. Someone mentioned GLM 5.1; that's a good choice at a Q4 quant.
MelodicRecognition7@reddit
Unsloth's quant is overinflated for no reason; AesSedai's Q4_X is already "almost original", like 99.99% of the original weights.
-dysangel-@reddit
GLM 5.1