2 old RTX 2080 Ti with 22GB VRAM each: Qwen3.6 27B at 38 tokens/s with f16 KV cache
Posted by snapo84@reddit | LocalLLaMA | 26 comments
PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise)
-------
Just wanted to share my current setup; it might help some users out there...
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b9128
    container_name: llama-server
    restart: unless-stopped
    ports:
      - "16384:8080"
    volumes:
      - ./models:/models:ro
    command: >
      --server
      --model /models/Qwen3.6-27B-IQ4_XS-uc.gguf
      --alias "Qwen3.6 27B"
      --temp 0.6
      --top-p 0.95
      --min-p 0.00
      --top-k 20
      --port 8080
      --host 0.0.0.0
      --cache-type-k f16
      --cache-type-v f16
      --fit on
      --presence-penalty 1.32
      --repeat-penalty 1.0
      --jinja
      --chat-template-file /models/Qwen3.6.jinja
      --mmproj /models/Qwen3.6-27B-mmproj-BF16.gguf
      --webui
      --spec-default
      --chat-template-kwargs '{"preserve_thinking": true}'
      --reasoning-budget 8192
      --reasoning-budget-message "... thinking budget exceeded, let's answer now.\n"
      --split-mode tensor
    user: "1000:1000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
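Once the container is up, a quick sanity check against the OpenAI-compatible endpoint should look roughly like this (16384 is the host port mapped above, and the model name just has to match the --alias):

docker compose up -d
curl http://localhost:16384/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6 27B",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 64
  }'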
This is my exact config. My 2 extremely old 2080 Ti GPUs were upgraded in China to 22GB VRAM each... and on eBay I bought an NVLink bridge (I do not recommend buying it, as it makes no measurable difference).
The quantisation I run is IQ4_XS.
If I change the KV cache to q8_0, the model sometimes starts looping during long coding sessions; that is why I run the KV cache at f16, and I have never had the problem since.
I use the hauhaucs uncensored Qwen3.6 model in IQ4 imatrix quants.
You can also forget about MTP, as with these cards you are compute bound, not bandwidth bound.
The absolute biggest boost came from --split-mode tensor; it took me from 14 tokens/s to 38 tokens/s.
I think without the power limit we should get 45 tokens/s.
What I also never thought about is --fit on... I always declared the context length manually, which worked great, but it looks like it is not a good idea to always run at 95% VRAM consumption. --fit on also improved token generation a little.
Btw, this is a <1k USD setup drawing 400W peak at the wall, and it works great with Hermes and opencode.
The Jinja template I use is this one:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates (in this setup I use template 11; I have not yet tested the newer templates)

pseudobacon@reddit
Any downsides to the 2080 Tis? Do they idle properly? What about fan control, does that work out of the box? Anything special needed for drivers?
snapo84@reddit (OP)
Special about the 2080 Ti... you are pretty much stuck with drivers in the CUDA 12 range...
Additionally, vLLM and SGLang do not support this architecture anymore; that is why I use llama.cpp.
I only use these cards for the 27B dense model from Qwen, I have no other need...
Jammystocker@reddit
I recommend trying out vLLM with FlashInfer, it is really great since it uses Marlin kernels.
The webhie/Qwen3.6-27B-int4-AutoRound model on Hugging Face works really well; for a single request I get around 40-45 tps, and I can have 4 running in parallel at around 36 tps, which I never thought possible.
snapo84@reddit (OP)
Would you be able to share your exact docker compose config for the vLLM setup? I thought the RTX 20 series isn't supported on vLLM and SGLang.
Jammystocker@reddit
Sure! I am not using Docker; I just cloned the GitHub repo, created a Python venv, and did uv pip install vllm.
Here is my full vllm serve command:
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 uv run vllm serve /localdir/models/Qwen3.6-27B-int4-AutoRound \
  --attention-backend FLASHINFER \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  -tp 2 \
  --gpu-memory-utilization 0.97 \
  --max-model-len 200000 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --port 8081 \
  --enable-chunked-prefill \
  --safetensors-load-strategy=prefetch \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --cudagraph-capture-sizes 1 \
  --trust-remote-code \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
I know the 0.97 GPU utilization is kinda pushing it, but the cards are literally not being used for anything else, and I haven't had an OOM so far. I could probably also go for more context, something like 220K, but I do not really need more for most tasks. This model also uses the fixed chat Jinja template, so you do not need to change it. Also, you probably do not need the VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 variable; it was good to have in previous versions, but I think it is the default now.
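Once it is serving, you can hit it the same OpenAI-compatible way, roughly like this (8081 is the port from the command above, and the model name is simply the path vLLM was started with):

curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/localdir/models/Qwen3.6-27B-int4-AutoRound",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 32
  }'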
snapo84@reddit (OP)
Thank you very much, I might try it out... but I am interested in uncensored LLMs :-) The reason is vulnerability research. I absolutely love the int4 AutoRound model from Intel... but in a censored fashion it doesn't fit my pipeline very well.
For example, the Unsloth Qwen3.6 27B model declined nearly ALL reverse engineering tasks on USB devices that I connect to the agentic system to extract the binary and try to reverse it....
I would love to see a model as good as the hauhauCS aggressive uncensored Qwen3.6 27B, because it works like a charm for reverse engineering....
Jammystocker@reddit
Really interesting use case!
I have not tried it, but I saw this uncensored quant recently, in case you hadn't already seen it: https://www.reddit.com/r/LocalLLM/comments/1sypbu6/qwen3627b_uncensored_heretic_is_out_now_with_kld/
I think it also has the fixed template, but I haven't looked into it.
SarcasticBaka@reddit
CUDA 13 and vLLM don't support Turing anymore? Since when?
NickCanCode@reddit
So the RTX 2080 Ti cannot use MTP because, with the power limit, it will be compute bound?
snapo84@reddit (OP)
It can, but at the 150W I run each card at, I am not bandwidth limited, I am compute limited: I still have 650GB/s of bandwidth per card, but only about 60% of the compute at 150W.
If I put the cards at 220W each I get pretty much exactly 44.5 tokens/s. MTP helps a lot if you are extremely bandwidth constrained (cough, DGX "scam" Spark, 250GB/s).
If I ran the Q8_0 version, then MTP would maybe help, but I still think compute, not bandwidth, is the limiting factor on these cards specifically.
Remember, those cards do about 27 TFLOPS of fp16, so all operations run at that rate...
The new 5090s have much newer tensor cores that Turing didn't have, and a lot more of them. While Turing does its roughly 27 TFLOPS whether you feed it fp16, fp8, or fp4, the Blackwell cards reach 500-1000 TFLOPS at int4 or even more. So yes, the cards are compute limited, but they are enough for multi-session inference: with llama.cpp and 4 parallel requests I still see 15 tokens/s per request, which equates to 60 tokens/s, while GPU utilisation sits flat at 100%.
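Rough back-of-the-envelope, assuming IQ4_XS is around 4.25 bits per weight: 27B weights x 4.25 bit is roughly 14-15GB of weights to read per token. Split across both cards with --split-mode tensor, that is about 7GB per card per token, so a pure bandwidth ceiling of roughly 650GB/s / 7GB ≈ 90 tokens/s. At 38 tokens/s I am nowhere near that ceiling, which is exactly why I say compute at 150W is the bottleneck, not bandwidth.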
a_beautiful_rhind@reddit
nvlink really only gives you gains for TP and that's if P2P is being used.
snapo84@reddit (OP)
NVLink is active but I don't see much data going over it.... also, on the PCI Express bus you only sometimes see more than 1GB/s.
a_beautiful_rhind@reddit
Did you run p2p bandwidth tests and all that? Does it show in nvidia-smi? Is llama.cpp built with NCCL and peer access?
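The usual checks look something like this (the sample's location can differ between cuda-samples releases):

nvidia-smi topo -m          # should show an NV# link between the two GPUs if NVLink is detected
git clone https://github.com/NVIDIA/cuda-samples
# build and run p2pBandwidthLatencyTest (under Samples/5_Domain_Specific in recent releases;
# Makefile or CMake depending on the version)
./p2pBandwidthLatencyTest   # prints measured GPU<->GPU copy bandwidth with P2P enabled vs disabled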
snapo84@reddit (OP)
a_beautiful_rhind@reddit
That's not really a p2p speed test, just the theoretical bandwidth of the link. But it is detected, so you just gotta make sure the backend is built with all the goodies.
In ik_llama it shows me that peer access is enabled, but I don't remember what mainline does. I remember having to compile both with p2p on and with the NCCL libraries where it could find them. Easy for me because I use ccmake, so I can see all the parameters and enable/disable them. A lot of people still build by passing flags on the command line, so many things end up at their defaults.
No-Refrigerator-1672@reddit
Hi! Did you, by any chance, try those cards in ComfyUI? I'm considering buying one strictly for image generation purposes.
TristeCloud@reddit
I own an RTX 2080 Ti 22GB myself and it's pretty good for ComfyUI. I did have to fork SageAttention2 (https://github.com/gameblabla/SageAttention2) because it was horribly slow for video generation in particular. I can generate a 360p video in ~50 seconds and a 1024x1024 video in around 180 seconds.
a_beautiful_rhind@reddit
This is my main ComfyUI card. Stuff not supporting Turing anymore is a huge PITA. The card itself is pretty fast though if you tweak it up.
snapo84@reddit (OP)
No, not tested, as I never had a need to generate images or videos.... the only thing I use currently is the LLM and image recognition (getting coordinates from an image to know where to click for automatic GUI testing and detecting user interface errors after "vibe coding").
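Since the mmproj is loaded, that vision part is just the normal OpenAI-style chat endpoint with an image attached; something roughly like this is all the "where do I click" step needs (screenshot.png being whatever screen grab the test harness produced):

curl http://localhost:16384/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6 27B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Give the x,y pixel coordinates of the Save button in this screenshot."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 screenshot.png)"'"}}
      ]
    }]
  }'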
Endlesscrysis@reddit
Mind if I ask where you got the 22GB 2080 cards? I'm assuming through Chinese sellers? How was the process of getting them, and how did you feel secure enough to buy them? I'm worried about getting scammed lmao.
autisticit@reddit
It's probably 2x 11GB
snapo84@reddit (OP)
Nope, it's 2 x 22GB VRAM.
Endlesscrysis@reddit
I thought so too, but the 22GB VRAM each kinda threw me off.
snapo84@reddit (OP)
I got them from AliExpress.... you can search there for "22GB RTX 2080 Ti".
Keep in mind I run mine at ONLY a 150W power limit, because my PSU is only 650W and I am a little allergic to noise...
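Setting the limit is just nvidia-smi, and it does not survive a reboot on its own, so something like this in a startup script does it:

sudo nvidia-smi -pm 1          # persistence mode, so the setting sticks while the box is up
sudo nvidia-smi -i 0 -pl 150   # cap card 0 at 150W
sudo nvidia-smi -i 1 -pl 150   # cap card 1 at 150W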
When I got mine they were $350; unfortunately they are now about $500-550.
As a PC I took a very old machine that had 2 PCI Express x16 slots (since the cards only support PCI Express 3.0, it doesn't really matter).
But I am still fascinated that such old cards deliver 40 tokens/s on a 27B dense model...
Derefringence@reddit
There are tons of reliable sellers, but unfortunately also scams. Chat with them, check their sales history and reviews, and ask for proof or documentation; some are really happy to share.
jacek2023@reddit
I have a 2070 somewhere 😄