Can't get over 250TPS on RTX5090 with Qwen3.5-4B

Posted by luckyj@reddit | LocalLLaMA | View on Reddit | 22 comments

My main model is qwen3.6-27b-mtp and I'm getting around 100tps and 2500tps prefill, which is great. I've tried adding a second small model for auxiliary tasks, and even when it's the only model running, it doesn't go over 200-250tps.

I'm building llama.cpp and running on docker windows. I've also tried havenoammo/llama:cuda13-server, and get exactly the same performance so I think my build flags are OK. I've also tested with LM Studio and performance is similar.

I think I should be getting much better performance out of a tiny 4B model on an RTX5090, and have tried everything I can think of, and still there's a bottleneck somewhere.

GPU use is low(ish), around 50%, and CPU is basically idle.

My docker-compose.yml:

llama2:
    image: havenoammo/llama:cuda13-server
    container_name: llama-cuda13-3
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8081:8080"
    volumes:
      - E:\user\Documents\LM Studio Models\unsloth:/models
      - ./model2.ini:/app/models.ini
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --models-preset /app/models.ini
      --port 8080
      --host 0.0.0.0
      -t 8
      -n -1
    restart: unless-stopped

and models2.ini:

version = 1

[*]
n-gpu-layers    = -1

batch-size      = 4096
ubatch-size     = 4096
jinja           = true
cache-type-k    = q8_0
cache-type-v    = q8_0
perf            = true
metrics         = true
parallel        = 4
cont-batching   = true
kv-unified      = true
ctx-checkpoints    = 8

[qwen3.5-4b]
load-on-startup = true
model           = /models/Qwen3.5-4B-GGUF/Qwen3.5-4B-Q4_K_S.gguf
; mmproj          = /models/Qwen3.6-27B-MTP-GGUF/mmproj-BF16.gguf
ctx-size        = 32000
chat-template-kwargs = {}
reasoning       = off
temp            = 1
top-p           = 1
top-k           = 20
min-p           = 0.0
presence-penalty = 2.0
repeat-penalty  = 1.0
flash-attn      = on