Only 120 tps on Qwen 35b on h200
Posted by Theio666@reddit | LocalLLaMA | 15 comments
Just a sanity check: this is too slow and something is wrong, right? The setup is vLLM with AWQ quants and MTP enabled, and I suspect I configured something incorrectly. The machine has the 570 driver and CUDA 12.6, so to make things work I had to improvise: build a Singularity image from the vLLM Docker image, and so on. What's the expected speed for this GPU, so I know when I've got the setup right?
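As an aside, a rough sketch of the Singularity-from-Docker step described above; the image tag and file name are examples rather than the OP's actual setup, and it assumes singularity (or apptainer) is on PATH.

```python
# Rough sketch of building a Singularity image from the public vLLM Docker image.
# File name and image tag are examples; assumes singularity/apptainer is installed.
import subprocess
from pathlib import Path

SIF = Path("vllm-openai.sif")
DOCKER_REF = "docker://vllm/vllm-openai:latest"  # example tag; pin a specific version in practice

if not SIF.exists():
    # Pull the Docker image and convert it into a single SIF file.
    subprocess.run(["singularity", "build", str(SIF), DOCKER_REF], check=True)

# The server is then launched inside the container, e.g.
#   singularity run --nv vllm-openai.sif <vllm args>
# where --nv exposes the host GPUs to the container.
```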
p4s2wd@reddit
Try to remove the line
Ok-Measurement-1575@reddit
How much with mtp disabled?
hurdurdur7@reddit
120 tps on that small model (for this hardware) doesn't sound right.
Unable-Tea3788@reddit
Can you share your vLLM configuration? I'm hitting 110 to 140 tok/s on 2x3090 with NVLink; an H200 should not be this low...
Theio666@reddit (OP)
This (or a similar) config worked just fine on an A100. I also had to patch the Marlin kernel to make all this work. Thanks for the answer, this definitely means it's a driver problem; I've asked the sysadmins to update to 580.
Unable-Tea3788@reddit
Try increasing num_speculative_tokens step by step, up to 5, and see at each step whether the results improve.
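A rough sketch of that kind of sweep with vLLM's offline API, purely illustrative: the model id is a placeholder and the speculative "method" string is an assumption (the exact name depends on the model and vLLM version); num_speculative_tokens is the knob being stepped. This assumes a recent vLLM where LLM() accepts a speculative_config dict.

```python
# Illustrative sweep over num_speculative_tokens. The model id and the
# speculative "method" are placeholders/assumptions, not the OP's actual config.
import sys
import time

from vllm import LLM, SamplingParams

MODEL = "REPLACE_WITH_YOUR_AWQ_MODEL"  # placeholder: local path or HF id of the quantized model

def tokens_per_second(num_spec_tokens: int) -> float:
    llm = LLM(
        model=MODEL,
        quantization="awq",
        speculative_config={
            "method": "mtp",                      # assumption; the method name varies by model/vLLM version
            "num_speculative_tokens": num_spec_tokens,
        },
    )
    prompts = ["Explain speculative decoding in one paragraph."] * 8
    params = SamplingParams(max_tokens=256, temperature=0.0)
    start = time.time()
    outputs = llm.generate(prompts, params)
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / (time.time() - start)

if __name__ == "__main__":
    # Run one setting per process (e.g. `python sweep.py 3`) so each engine
    # starts on a clean GPU instead of fighting an earlier instance for memory.
    k = int(sys.argv[1])
    print(f"num_speculative_tokens={k}: {tokens_per_second(k):.1f} tok/s")
```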
Environmental-Metal9@reddit
Have you tried without speculative decoding to get a real baseline? I've found that getting all the params right for each model is sometimes hard, and a bad setting can hurt TPS. I'd run the simplest version of the vllm command and see what speeds you get that way. Also, I've found that vLLM doesn't give me the fastest single-request speed, but when I batch 50 requests I get something like 5000 tps (because it counts total tokens per second across all concurrent requests), which is great if your task can be parallelized like that (synthetic data generation comes to mind) but not great if you're serving a single chat window for one user.
For single tasks, I've found llama.cpp gives better performance on models up to a certain size (a 300B model at quant 4 pushing 40 to 50 tps isn't too bad). You don't need to actually switch to llama.cpp; I'm suggesting it more as a diagnostic tool.
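A minimal sketch of that kind of baseline with vLLM's offline API: no speculative config at all, plus a batch-size knob to see the single-request vs. batched difference. The model id is a placeholder.

```python
# Minimal baseline: vLLM with default settings (no MTP / speculative decoding),
# measuring aggregate throughput for a configurable batch. Model id is a placeholder.
import time

from vllm import LLM, SamplingParams

MODEL = "REPLACE_WITH_YOUR_MODEL"  # placeholder: local path or HF id
BATCH = 50                         # try 1 for single-stream speed, 50 for aggregate throughput

llm = LLM(model=MODEL)             # defaults only, so this is the real baseline
params = SamplingParams(max_tokens=512, temperature=0.0)
prompts = ["Summarize the plot of Hamlet."] * BATCH

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start
generated = sum(len(o.outputs[0].token_ids) for o in outputs)

# Aggregate tok/s counts tokens across all concurrent requests, which is why
# batched numbers look much higher than what a single chat stream would see.
print(f"batch={BATCH}: {generated / elapsed:.1f} tok/s total, "
      f"{generated / (elapsed * BATCH):.1f} tok/s per request")
```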
ImportancePitiful795@reddit
Have you BOUGHT the H200, or is it sitting somewhere in the cloud and you rent it?
Theio666@reddit (OP)
I haven't bought it; this is new hardware at my company and I'm learning how to use it effectively. If I had this at home or in the cloud it would be way easier to update everything and not fuck around with Singularity -_-
I asked to figure out whether this is a driver/CUDA problem, because if it is I can ask the sysadmins to update the drivers. So far it seems it is a driver issue; I've asked them to bump to 580.
ImportancePitiful795@reddit
Yeah, you need the latest drivers to make sure that's not the main issue.
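One quick check from inside the Singularity image, assuming torch and nvidia-ml-py (pynvml) are available there; it just prints what the container actually sees.

```python
# Diagnostic sketch: show the driver / CUDA versions visible inside the container.
# Assumes torch and nvidia-ml-py (import pynvml) are installed in the image.
import torch
import pynvml

pynvml.nvmlInit()
print(f"NVIDIA driver:        {pynvml.nvmlSystemGetDriverVersion()}")  # e.g. 570.x vs 580.x
print(f"PyTorch CUDA runtime: {torch.version.cuda}")                   # CUDA the torch wheel targets
print(f"CUDA available:       {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU:                  {torch.cuda.get_device_name(0)}")
    print(f"Compute capability:   {torch.cuda.get_device_capability(0)}")  # H200 -> (9, 0)
pynvml.nvmlShutdown()
```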
mangoking1997@reddit
Running in fp8 or fp16?
Theio666@reddit (OP)
AWQ 4-bit, so not the native format, but it should not be this slow unless I'm missing something. For comparison, FP8 on an A100 gives 80 tps, and FP8 is also a non-native format for Ampere.
jacek2023@reddit
Speed depends on context length.
Theio666@reddit (OP)
I'm aware; this is like the first 5k of the context window, so it should not drop this hard.