Can't get over 250TPS on RTX5090 with Qwen3.5-4B

Posted by luckyj@reddit | LocalLLaMA | View on Reddit | 22 comments

My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.

I'm building llama.cpp and running on docker windows. I've also tried havenoammo/llama:cuda13-server, and get exactly the same performance so I think my build flags are OK. I've also tested with LM Studio and performance is similar.

I think I should be getting much better performance out of a tiny 4B model on an RTX5090, and have tried everything I can think of, and still there's a bottleneck somewhere.

GPU use is low(ish), around 50%, and CPU is basically idle.

My docker-compose.yml:

llama2:
    image: havenoammo/llama:cuda13-server
    container_name: llama-cuda13-3
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8081:8080"
    volumes:
      - E:\user\Documents\LM Studio Models\unsloth:/models
      - ./model2.ini:/app/models.ini
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --models-preset /app/models.ini
      --port 8080
      --host 0.0.0.0
      -t 8
      -n -1
    restart: unless-stopped

and models2.ini:

version = 1

[*]
n-gpu-layers    = -1

batch-size      = 4096
ubatch-size     = 4096
jinja           = true
cache-type-k    = q8_0
cache-type-v    = q8_0
perf            = true
metrics         = true
parallel        = 4
cont-batching   = true
kv-unified      = true
ctx-checkpoints    = 8

[qwen3.5-4b]
load-on-startup = true
model           = /models/Qwen3.5-4B-GGUF/Qwen3.5-4B-Q4_K_S.gguf
; mmproj          = /models/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf
ctx-size        = 32000
chat-template-kwargs = {}
reasoning       = off
temp            = 1
top-p           = 1
top-k           = 20
min-p           = 0.0
presence-penalty = 2.0
repeat-penalty  = 1.0
flash-attn      = on

[-]

slalomz@reddit

I tried on my 5090 and I get around 300 t/s with that same model.

> llama-bench -fa 1 --mmap 0 -p "2048,16384" -n "128,1024" -r 3 -hf "unsloth/Qwen3.5-4B-GGUF:Q4_K_S"

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| qwen35 4B Q4_K - Small         |   2.40 GiB |     4.21 B | CUDA       |  99 |  1 |    0 |          pp2048 |     15735.10 ± 32.77 |
| qwen35 4B Q4_K - Small         |   2.40 GiB |     4.21 B | CUDA       |  99 |  1 |    0 |         pp16384 |      14878.74 ± 1.00 |
| qwen35 4B Q4_K - Small         |   2.40 GiB |     4.21 B | CUDA       |  99 |  1 |    0 |           tg128 |        315.50 ± 0.74 |
| qwen35 4B Q4_K - Small         |   2.40 GiB |     4.21 B | CUDA       |  99 |  1 |    0 |          tg1024 |        314.32 ± 5.10 |

[-]

luckyj@reddit (OP)

I haven't tried llama-bench. I've just been running my regular jobs. Will try that! Thanks

[-]

jtjstock@reddit

100tps on 27B with what kind of task, quant and context? I can get that on my dual 5060ti’s using p2p, so it seems low to me for a 5090…

[-]

luckyj@reddit (OP)

100-110tps with Qwen3.6-27B-UD-Q5_K_XL with MTP. Context is 128k, and as it fills up, TPS goes down to 80-90TPS. Prefill is about 2400tps.

Usage right now is hermes-agent, context is always between 20k and 80k tokens.

[-]

jtjstock@reddit

I guess I need to try a q5 to see how much my tps drops then

[-]

luckyj@reddit (OP)

what quant an kv quant are you using?

[-]

YehowaH@reddit

Kv Quant leads to wild degeneration, would not recommend to quant kv.

[-]

jtjstock@reddit

IQ4XS with Q8 MTP heads, so I expect to see a drop at Q5_K_XL

Kv is q8.

[-]

BitGreen1270@reddit

What prompt are you using to benchmark? I'm using the exact same gguf on my 5090 and I think I'm getting a bit higher. I'm using q8_0 for everything , including ctkd and ctvd. But am on ubuntu.Can run the prompt and share my speed.

Also what is the smaller model you are running? Even with the Gemma e2b I'm not seeing more than 190 tps

[-]

Main_Problem_2696@reddit

You're hitting the compute ceiling, not a memory bottleneck. A 4B Q4 model fits entirely in L2 cache, so 250 tps is roughly the upper limit for single batch processing on a 5090. Lower batch size from 4096 to 256-512 and test 4-8 concurrent requests. That's where your parallel setting matters and total throughput will rise even if per request TPS stays the same. Used Runable to benchmark batch sizes on a 4090, clean TPS chart in 20 minutes showed tiny models hit a compute wall fast. Your 5090 is fine. Just physics

[-]

PaMRxR@reddit

Maybe this post will be interesting for you. It's for datacenter GPUs but it goes into a lot of details and I found it generally educational. https://blog.kog.ai/building-a-single-kernel-latency-optimized-llm-inference-engine-on-amd-mi300x-gpus/

[-]

jikilan_@reddit

Try switch to linux next for the free increased performance

[-]

JustinPooDough@reddit

Would you consider trying bare metal Linux instead?

[-]

luckyj@reddit (OP)

I can't at the moment. My Linux server is not capable of physically hosting the rtx5090 (too big, too much power), so it's mounted on my windows pc in which I need windows.

[-]

anykeyh@reddit

How much tps with MTP disabled? If about 40tps, 250tps for a 4B model is expected.

[-]

luckyj@reddit (OP)

250tps with MTP disabled, testing now with MTP enabled, and it's somehow worse. I'm probably missing something here. Would need more time. But it wouldn't work for me either way

[-]

ridablellama@reddit

if your trying to max throughput you go vllm fine tune it and add more conccurrent reqeusts fill that last 50% for more total tokens per second. llama.cpp has parrallel slots you can try but vllm is the best. It has paged attention and chunked prefill

[-]

jtjstock@reddit

I need to try q5’s to see how much my tps drops haha.

[-]

FDosha@reddit

5090 has bandwidth = 1.79 tb/sec. 4b model will have theoretically max tps =450ts with 8bit, or 225 with 16bit precision. So that looks as your numbers

[-]

luckyj@reddit (OP)

I think you're right

[-]

iMrParker@reddit

That's about normal. 50% utilization is expected as you're hitting the upper level for memory bandwidth and other overheads. It's a dense model so it makes it easy to estimate

Model with context ~6gb on disk. Memory bandwidth is 1700. So 1700/6 gives around 280 tokens per second. Given overhead and you get around 250tps

[-]

luckyj@reddit (OP)

sigh, I think you're right. I don't know why I was expecting more. A 27B with no MTP gives me around 50tps. So math checks