Only 120 tps on Qwen 35b on h200

Posted by Theio666@reddit | LocalLLaMA | 15 comments

Just a sanity check: this is too slow and something is wrong, right? The setup is vLLM with AWQ quants and MTP enabled, and I suspect I misconfigured something. The machine has the 570 driver and CUDA 12.6, so to make things work I had to improvise: build a Singularity image from the vLLM Docker image and so on. What's the expected speed for this GPU, so I know when I've got the setup right?
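
For a rough reference point, here is a back-of-envelope sketch of a memory-bandwidth ceiling for single-stream decode. It rests on several assumptions that are not from my actual setup: roughly 4.8 TB/s peak HBM bandwidth for the H200 (NVIDIA's quoted figure), about 0.5 bytes per parameter for 4-bit AWQ weights (ignoring scales and zero points), and every active weight being read once per generated token. The function name `decode_ceiling_tps` is made up for illustration. It ignores KV cache reads, kernel overhead, and batching. With MTP, accepted draft tokens can yield more than one token per weight pass, so treat the result as a ballpark and not a hard limit.

```python
def decode_ceiling_tps(active_params: float,
                       bytes_per_param: float,
                       mem_bandwidth_bytes_per_s: float) -> float:
    """Rough tokens/s ceiling if decode is purely memory-bandwidth bound.

    Assumes each active weight is read from HBM exactly once per token.
    """
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_per_token


# Assumptions, not measurements:
H200_BANDWIDTH = 4.8e12   # bytes/s, quoted peak HBM bandwidth
AWQ_4BIT_BYTES = 0.5      # bytes per parameter, ignoring quantization metadata

if __name__ == "__main__":
    # If all 35B parameters are read per token (dense case):
    dense = decode_ceiling_tps(35e9, AWQ_4BIT_BYTES, H200_BANDWIDTH)
    print(f"dense 35B ceiling: {dense:.0f} tok/s")
```

If the model is a mixture-of-experts variant, pass the active parameter count per token instead of 35e9; the ceiling rises accordingly, which would make 120 tps look even further off.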