Best open-weight model to run locally on 8x A100 80GB for generating teacher data?

Posted by i_am__not_a_robot@reddit | LocalLLaMA

I have access to a SLURM cluster and can get 8x NVIDIA A100 80GB GPUs (640 GB VRAM total) in a single job. I want to run an open-weight model locally with llama.cpp for data generation, not coding.

My use case is generating teacher data for downstream fine-tuning of very small models on specific economic topics across multiple industries and sectors. I need reasonably strong general reasoning and consistent, well-structured answers at ~32-64k context.
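
To be concrete about what I mean by "consistent structured output": I plan to constrain generations to JSON through llama-server's OpenAI-compatible endpoint. A rough sketch of the call I have in mind (the port, model name, and prompt are placeholders, and I haven't verified `response_format` handling against the latest llama.cpp build):

```python
# Sketch: requesting JSON-constrained output from a local llama-server
# via its OpenAI-compatible chat endpoint. Values are placeholders.
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [
            {"role": "system", "content": "Answer as JSON only."},
            {"role": "user", "content": "Summarize cost drivers in the steel sector."},
        ],
        # Ask the server to constrain decoding to valid JSON.
        "response_format": {"type": "json_object"},
        "temperature": 0.2,
        "max_tokens": 2048,
    },
    timeout=600,
)
data = json.loads(resp.json()["choices"][0]["message"]["content"])
print(data)
```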

Prior experiments indicate that 32-64k tokens total, including the prompt and a few relevant source documents, is sufficient for my use case. This is single-user / single-task inference only, so quality and consistency matter more to me than raw throughput.

What model would you pick, or recommend I look into, for this specific task?

I was looking at Kimi-K2.6-UD-Q4_K_XL, but it sadly won't fit; I hadn't accounted for multi-GPU overhead and KV-cache requirements.
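
For anyone else doing this math, here is the rough fit check I should have done up front. It is only a back-of-envelope sketch: the weight size, per-GPU overhead, and the GQA-style KV-cache formula are assumptions on my part (MLA-style models cache far less), not measured numbers.

```python
# Back-of-envelope VRAM fit check for a quantized model + KV cache
# on 8x A100 80GB. All numbers below are illustrative assumptions.

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_elem: int = 2) -> float:
    """FP16 K+V cache for standard GQA: 2 (K and V) * layers * kv_heads * head_dim * ctx."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem / 2**30

def fits(weights_gib: float, kv_gib: float, n_gpus: int = 8,
         gpu_gib: float = 80.0, overhead_per_gpu_gib: float = 4.0) -> bool:
    """Leave per-GPU headroom for CUDA context, compute buffers, and fragmentation."""
    usable = n_gpus * (gpu_gib - overhead_per_gpu_gib)
    return weights_gib + kv_gib <= usable

# Hypothetical config: ~600 GiB of ~Q4 weights (check the actual GGUF size!),
# 61 layers, 8 KV heads of dim 128, 64k context.
kv = kv_cache_gib(n_layers=61, n_kv_heads=8, head_dim=128, ctx_tokens=65_536)
print(f"KV cache: {kv:.1f} GiB, fits: {fits(600, kv)}")  # -> ~15.2 GiB, False
```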