Only 120 tps on Qwen 35b on h200
Posted by Theio666@reddit | LocalLLaMA | 15 comments
Just a sanity check: this is too slow and something is wrong, right? The setup is vLLM with AWQ quants and MTP enabled, and I suspect I configured something incorrectly. The machine has the 570 driver and CUDA 12.6, so to make things work I had to improvise: build a Singularity image from the vLLM Docker image, and so on. What's the expected speed for this GPU, so I know when I've got the setup right?
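As an aside, a rough sketch of the Singularity-from-Docker step described above; the image tag and file name are examples rather than the OP's actual setup, and it assumes singularity (or apptainer) is on PATH.

```python
# Rough sketch of building a Singularity image from the public vLLM Docker image.
# File name and image tag are examples; assumes singularity/apptainer is installed.
import subprocess
from pathlib import Path

SIF = Path("vllm-openai.sif")
DOCKER_REF = "docker://vllm/vllm-openai:latest"  # example tag; pin a specific version in practice

if not SIF.exists():
    # Pull the Docker image and convert it into a single SIF file.
    subprocess.run(["singularity", "build", str(SIF), DOCKER_REF], check=True)

# The server is then launched inside the container, e.g.
#   singularity run --nv vllm-openai.sif <vllm args>
# where --nv exposes the host GPUs to the container.
```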
p4s2wd@reddit
Try to remove the line
Ok-Measurement-1575@reddit
How much with mtp disabled?
hurdurdur7@reddit
120 tps on that small model (for this hardware) doesn't sound right.
Unable-Tea3788@reddit
Can you share your vLLM configuration? I'm hitting 110 to 140 tok/s on 2x3090 with NVLink; an H200 should not be this low...
Theio666@reddit (OP)
This (or a similar) config worked just fine on an A100. I also had to patch the Marlin kernel to make all this work. Thanks for the answer, this definitely means it's a driver problem; I've asked the sysadmins to update to 580.
Unable-Tea3788@reddit
Try increasing num_speculative_tokens step by step, up to 5, and see at each step whether the results improve.
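A rough sketch of that kind of sweep with vLLM's offline API, purely illustrative: the model id is a placeholder and the speculative "method" string is an assumption (the exact name depends on the model and vLLM version); num_speculative_tokens is the knob being stepped. This assumes a recent vLLM where LLM() accepts a speculative_config dict.

```python
# Illustrative sweep over num_speculative_tokens. The model id and the
# speculative "method" are placeholders/assumptions, not the OP's actual config.
import sys
import time

from vllm import LLM, SamplingParams

MODEL = "REPLACE_WITH_YOUR_AWQ_MODEL"  # placeholder: local path or HF id of the quantized model

def tokens_per_second(num_spec_tokens: int) -> float:
    llm = LLM(
        model=MODEL,
        quantization="awq",
        speculative_config={
            "method": "mtp",                      # assumption; the method name varies by model/vLLM version
            "num_speculative_tokens": num_spec_tokens,
        },
    )
    prompts = ["Explain speculative decoding in one paragraph."] * 8
    params = SamplingParams(max_tokens=256, temperature=0.0)
    start = time.time()
    outputs = llm.generate(prompts, params)
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / (time.time() - start)

if __name__ == "__main__":
    # Run one setting per process (e.g. `python sweep.py 3`) so each engine
    # starts on a clean GPU instead of fighting an earlier instance for memory.
    k = int(sys.argv[1])
    print(f"num_speculative_tokens={k}: {tokens_per_second(k):.1f} tok/s")
```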
Environmental-Metal9@reddit
Have you tried without speculative decoding to get a real baseline? I've found that getting all the params right for each model is sometimes hard, and a bad setting can hurt TPS. I'd run the simplest version of the vllm command and see what speeds you get that way. Also, I've found that vLLM doesn't give me the fastest single-request speed, but when I batch 50 requests I get something like 5000 tps (because it counts total tokens per second across all concurrent requests), which is great if your task can be parallelized like that (synthetic data generation comes to mind) but not great if you're serving a single chat window for one user.
For single tasks, I've found llama.cpp gives better performance on models up to a certain size (a 300B model at quant 4 pushing 40 to 50 tps isn't too bad). You don't need to actually switch to llama.cpp; I'm suggesting it more as a diagnostic tool.
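A minimal sketch of that kind of baseline with vLLM's offline API: no speculative config at all, plus a batch-size knob to see the single-request vs. batched difference. The model id is a placeholder.

```python
# Minimal baseline: vLLM with default settings (no MTP / speculative decoding),
# measuring aggregate throughput for a configurable batch. Model id is a placeholder.
import time

from vllm import LLM, SamplingParams

MODEL = "REPLACE_WITH_YOUR_MODEL"  # placeholder: local path or HF id
BATCH = 50                         # try 1 for single-stream speed, 50 for aggregate throughput

llm = LLM(model=MODEL)             # defaults only, so this is the real baseline
params = SamplingParams(max_tokens=512, temperature=0.0)
prompts = ["Summarize the plot of Hamlet."] * BATCH

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)

# Aggregate tok/s counts tokens across all concurrent requests, which is why
# batched numbers look much higher than what a single chat stream would see.
print(f"batch={BATCH}: {generated / elapsed:.1f} tok/s total, "
      f"{generated / (elapsed * BATCH):.1f} tok/s per request")
```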
ImportancePitiful795@reddit
Have you BOUGHT the H200, or is it sitting somewhere in the cloud and you rent it?
Theio666@reddit (OP)
I haven't bought it; this is new hardware at my company and I'm learning how to use it effectively. If I had this at home or in the cloud it would be way easier to update everything and not fuck around with Singularity -_-
I asked to figure out whether this is a driver/CUDA problem, because if it is I can ask the sysadmins to update the drivers. So far it seems it is a driver issue; I've asked them to bump to 580.
ImportancePitiful795@reddit
Yeah, you need the latest drivers to make sure that's not the main issue.
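One quick check from inside the Singularity image, assuming torch and nvidia-ml-py (pynvml) are available there; it just prints what the container actually sees.

```python
# Diagnostic sketch: show the driver / CUDA versions visible inside the container.
# Assumes torch and nvidia-ml-py (import pynvml) are installed in the image.
import torch
import pynvml

pynvml.nvmlInit()
print(f"NVIDIA driver:        {pynvml.nvmlSystemGetDriverVersion()}")  # e.g. 570.x vs 580.x
print(f"PyTorch CUDA runtime: {torch.version.cuda}")                   # CUDA the torch wheel targets
print(f"CUDA available:       {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:                  {torch.cuda.get_device_name(0)}")
    print(f"Compute capability:   {torch.cuda.get_device_capability(0)}")  # H200 -> (9, 0)
pynvml.nvmlShutdown()
```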
mangoking1997@reddit
Running in fp8 or fp16?
Theio666@reddit (OP)
AWQ 4-bit, so not the native format, but it should not be this slow unless I'm missing something. For comparison, FP8 on an A100 gives 80 tps, and FP8 is also a non-native format for Ampere.
jacek2023@reddit
Speed depends on context length.
Theio666@reddit (OP)
I'm aware; this is like the first 5k of the context window, so it should not drop this hard.