Can't get over 250TPS on RTX5090 with Qwen3.5-4B
Posted by luckyj@reddit | LocalLLaMA | View on Reddit | 22 comments
My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.
I'm building llama.cpp and running on docker windows. I've also tried havenoammo/llama:cuda13-server, and get exactly the same performance so I think my build flags are OK. I've also tested with LM Studio and performance is similar.
I think I should be getting much better performance out of a tiny 4B model on an RTX5090, and have tried everything I can think of, and still there's a bottleneck somewhere.
GPU use is low(ish), around 50%, and CPU is basically idle.
My docker-compose.yml:
llama2:
image: havenoammo/llama:cuda13-server
container_name: llama-cuda13-3
runtime: nvidia
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ports:
- "8081:8080"
volumes:
- E:\user\Documents\LM Studio Models\unsloth:/models
- ./model2.ini:/app/models.ini
environment:
- NVIDIA_VISIBLE_DEVICES=all
command: >
--models-preset /app/models.ini
--port 8080
--host 0.0.0.0
-t 8
-n -1
restart: unless-stopped
and models2.ini:
version = 1
[*]
n-gpu-layers = -1
batch-size = 4096
ubatch-size = 4096
jinja = true
cache-type-k = q8_0
cache-type-v = q8_0
perf = true
metrics = true
parallel = 4
cont-batching = true
kv-unified = true
ctx-checkpoints = 8
[qwen3.5-4b]
load-on-startup = true
model = /models/Qwen3.5-4B-GGUF/Qwen3.5-4B-Q4_K_S.gguf
; mmproj = /models/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf
ctx-size = 32000
chat-template-kwargs = {}
reasoning = off
temp = 1
top-p = 1
top-k = 20
min-p = 0.0
presence-penalty = 2.0
repeat-penalty = 1.0
flash-attn = on
slalomz@reddit
I tried on my 5090 and I get around 300 t/s with that same model.
luckyj@reddit (OP)
I haven't tried llama-bench. I've just been running my regular jobs. Will try that! Thanks
jtjstock@reddit
100tps on 27B with what kind of task, quant and context? I can get that on my dual 5060ti’s using p2p, so it seems low to me for a 5090…
luckyj@reddit (OP)
100-110tps with Qwen3.6-27B-UD-Q5_K_XL with MTP. Context is 128k, and as it fills up, TPS goes down to 80-90TPS. Prefill is about 2400tps.
Usage right now is hermes-agent, context is always between 20k and 80k tokens.
jtjstock@reddit
I guess I need to try a q5 to see how much my tps drops then
luckyj@reddit (OP)
what quant an kv quant are you using?
YehowaH@reddit
Kv Quant leads to wild degeneration, would not recommend to quant kv.
jtjstock@reddit
IQ4XS with Q8 MTP heads, so I expect to see a drop at Q5_K_XL
Kv is q8.
BitGreen1270@reddit
What prompt are you using to benchmark? I'm using the exact same gguf on my 5090 and I think I'm getting a bit higher. I'm using q8_0 for everything , including ctkd and ctvd. But am on ubuntu.Can run the prompt and share my speed.
Also what is the smaller model you are running? Even with the Gemma e2b I'm not seeing more than 190 tps
Main_Problem_2696@reddit
You're hitting the compute ceiling, not a memory bottleneck. A 4B Q4 model fits entirely in L2 cache, so 250 tps is roughly the upper limit for single batch processing on a 5090. Lower batch size from 4096 to 256-512 and test 4-8 concurrent requests. That's where your parallel setting matters and total throughput will rise even if per request TPS stays the same. Used Runable to benchmark batch sizes on a 4090, clean TPS chart in 20 minutes showed tiny models hit a compute wall fast. Your 5090 is fine. Just physics
PaMRxR@reddit
Maybe this post will be interesting for you. It's for datacenter GPUs but it goes into a lot of details and I found it generally educational. https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/
jikilan_@reddit
Try switch to linux next for the free increased performance
JustinPooDough@reddit
Would you consider trying bare metal Linux instead?
luckyj@reddit (OP)
I can't at the moment. My Linux server is not capable of physically hosting the rtx5090 (too big, too much power), so it's mounted on my windows pc in which I need windows.
anykeyh@reddit
How much tps with MTP disabled? If about 40tps, 250tps for a 4B model is expected.
luckyj@reddit (OP)
250tps with MTP disabled, testing now with MTP enabled, and it's somehow worse. I'm probably missing something here. Would need more time. But it wouldn't work for me either way
ridablellama@reddit
if your trying to max throughput you go vllm fine tune it and add more conccurrent reqeusts fill that last 50% for more total tokens per second. llama.cpp has parrallel slots you can try but vllm is the best. It has paged attention and chunked prefill
jtjstock@reddit
I need to try q5’s to see how much my tps drops haha.
FDosha@reddit
5090 has bandwidth = 1.79 tb/sec. 4b model will have theoretically max tps =450ts with 8bit, or 225 with 16bit precision. So that looks as your numbers
luckyj@reddit (OP)
I think you're right
iMrParker@reddit
That's about normal. 50% utilization is expected as you're hitting the upper level for memory bandwidth and other overheads. It's a dense model so it makes it easy to estimate
Model with context ~6gb on disk. Memory bandwidth is 1700. So 1700/6 gives around 280 tokens per second. Given overhead and you get around 250tps
luckyj@reddit (OP)
sigh, I think you're right. I don't know why I was expecting more. A 27B with no MTP gives me around 50tps. So math checks