Best config for Qwen3.6 27b / llama.cpp / opencode
Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 106 comments
Please share your best config <3
llama.cpp:
```
"A:/0_llama_server/llama-server.exe" ^
  -m "a:\0_LM_Studio\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q5_K_XL.gguf" ^
  --port 8080 --alias qwen3.6:27b -ngl 999 --threads 22 --flash-attn on ^
  --host 0.0.0.0 --no-mmap -mg 1 --batch-size 1024 --ubatch-size 512 ^
  --ctx-checkpoints 128 --ctx-size 196610 --reasoning on --jinja ^
  --draft-max 128 --spec-ngram-size-n 48 --draft-min 2 --spec-type ngram-mod ^
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 ^
  --repeat-penalty 1.0 --presence-penalty 0.0 ^
  --chat-template-kwargs "{\"preserve_thinking\":true}" ^
  --tensor-split 0.46,0.54
```
legodfader@reddit
Anyone with dual 3090s?
AdamDhahabi@reddit
Sort of, 3090 + 2x 5070 Ti, running Unsloth Q8 at 25 t/s
Important_Quote_1180@reddit
That’s a cool setup! I have just a single 3090 and an old 1660 running Unreal, and the UD Q5 was doing about 35 tok/s
AdamDhahabi@reddit
I previously had a P5000, which is 1080-equivalent (288 GB/s); it bottlenecked me in several ways:
- Pascal generation locked me into CUDA 12.x, now 13.x -> small percentage speedup
- More VRAM allowed me to run Bartowski GGUF instead of Unsloth UD -> 10% speedup (Unsloth UD squeezes more into VRAM but at speed penalty)
- Replaced with a latest gen card with more memory bandwidth -> large speedup
- Some memory overclock (nvidia-smi) which I could not do before with my P5000 -> more speedup
psyclik@reddit
Getting around 40 t/s (1300 prefill) on 4x 3090 with graph parallel on ik_llama (Q8, 256k context)
Familiar_Wish1132@reddit (OP)
share your run command please
Cferra@reddit
export CUDA_VISIBLE_DEVICES=0,1
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj ~/models/mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--jinja \
--reasoning-format deepseek \
-a qwen3.6-27b \
--host 0.0.0.0 --port 8003
enables all features
Cferra@reddit
# Qwen 3.6 27B on 2x RTX 3090 NVLink — benching three ways (TurboQuant TQ3_4S vs standard Q4_K_M vs TurboQuant V-cache on current-main llama.cpp)
Spent the afternoon benchmarking [Qwen 3.6 27B](https://huggingface.co/Qwen/Qwen3.6-27B-Instruct) on a dual-3090 box with three different llama.cpp configurations, because I wanted to see where TurboQuant actually helps on Ampere — and the answer turned out to be more nuanced than I expected. Sharing the numbers because they complicate some of the "TurboQuant makes quants faster" claims floating around.
**TL;DR:** On Qwen 3.6 27B specifically, **vanilla llama.cpp + standard Q4_K_M + q8_0 KV wins on every axis at every context depth** on 2x RTX 3090 NVLink. TurboQuant isn't needed for long-context fit (Qwen 3.6's hybrid attention keeps KV cheap without it) and costs 10–20% generation throughput vs the plain q8_0 V-cache. The TurboQuant fork that does win on *memory* does so by an amount that doesn't matter on this hardware.
## Hardware
| | |
|-|-|
| GPUs | 2x RTX 3090, NVLink (~56 GB/s aggregate), compute capability 8.6, 24 GB each |
| Host | Ubuntu 24.04.4 LTS, kernel 6.8.0-110 |
| Driver / CUDA | 590.48.01 / CUDA 12.0 |
| Compiler | gcc 13.3.0 |
## Configurations tested
| # | llama.cpp base | Weights | K-cache | V-cache | llama-server version |
|---|----------------|---------|---------|---------|----------------------|
| **A** | [`turbo-tan/llama.cpp-tq3`](https://github.com/turbo-tan/llama.cpp-tq3) `main` @ `794c5dc` | `TQ3_4S` (~3.5 bpw, WHT) | q8_0 | **tq3_0** (3-bit WHT) | `102 (794c5dc)` |
| **B** | [`ggerganov/llama.cpp`](https://github.com/ggerganov/llama.cpp) `master` @ `8bccdbbff` | `Q4_K_M` (~4.5 bpw) | q8_0 | q8_0 | `8890 (8bccdbbff)` |
| **C** | [`TheTom/llama-cpp-turboquant`](https://github.com/TheTom/llama-cpp-turboquant) `feature/turboquant-kv-cache` @ `9e3fb40e8` | `Q4_K_M` | q8_0 | **turbo3** (3-bit WHT) | `8983 (9e3fb40e8)` |
All three: `-ngl 999 --split-mode layer --tensor-split 1,1 --flash-attn on --parallel 1`. Models from `unsloth/Qwen3.6-27B-GGUF` (Q4_K_M) and `YTan2000/Qwen3.6-27B-TQ3_4S`.
Note the `llama-server --version` column — **turbo-tan's fork is on upstream b102 (~6 months behind master)** while TheTom's is actively rebased and is actually 93 commits *ahead* of my vanilla checkout. That matters a lot, as you'll see.
## Results (matched 32k ctx, temp=0)
| Prompt (tok) | A: turbo-tan TQ3_4S + tq3_0 V | B: vanilla Q4_K_M + q8_0 V | C: TheTom Q4_K_M + turbo3 V |
|-------------:|:------------------------------|:---------------------------|:----------------------------|
| 1,028 | 782 PP / 34.5 TG | **1,161 PP / 41.0 TG** | 1,134 PP / 40.4 TG |
| 3,028 | 1,045 PP / 33.7 TG | 1,638 PP / 40.5 TG | **1,659 PP** / 39.3 TG |
| 7,028 | 1,130 PP / 32.3 TG | 1,849 PP / **39.7 TG** | **1,877 PP** / 37.0 TG |
| 15,028 | 1,118 PP / 30.4 TG | 1,822 PP / **38.2 TG** | **1,851 PP** / 33.4 TG |
| 28,028 | 1,069 PP / 27.3 TG | 1,717 PP / **36.0 TG** | **1,745 PP** / 28.8 TG |
All three generated correct output on the same correctness probe (17 × 23 → 391 with coherent reasoning traces). Quality parity; performance very different.
### VRAM @ 262k native context (`--parallel 1`)
| Config | Total VRAM | Fits on 2x 3090? |
|--------|-----------:|------------------|
| A: turbo-tan TQ3_4S | 26.7 GB | ✅ |
| B: vanilla Q4_K_M + q8_0 V | **30.9 GB** | ✅ (by ~15 GB margin) |
| C: TheTom Q4_K_M + turbo3 V | 28.1 GB | ✅ |
## What the numbers actually say
### 1. Turbo-tan's fork is slower because of its aged llama.cpp base, not because of TurboQuant
Config A is ~35–55% slower than Config C on the same hardware, even though both use TurboQuant V-cache. The only difference: A's fork is on upstream `b102` (October 2025-ish), C's is on `b8983` (current master). TurboQuant weight-quant infrastructure is fine — the fork is just missing 6 months of MMQ/FlashAttention kernel tuning.
If you want to use TurboQuant, **use TheTom's fork, not turbo-tan's**, unless you specifically need `TQ3_4S` weights (which only turbo-tan supports). For most use cases, Q4_K_M weights + TheTom's TurboQuant V-cache is the right call.
### 2. On Qwen 3.6, TurboQuant V-cache doesn't unlock any context you couldn't already reach
I assumed TurboQuant would be necessary to fit 256k context. **It isn't.** Qwen 3.6's architecture (48 linear-attention + 16 full-attention layers) means only 25% of the 64 layers have traditional KV — the rest use recurrent state with a tiny memory footprint. Plain q8_0 KV at 256k uses ~12 GB of KV total, and comfortably fits on 2x 3090 with Q4_K_M weights.
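As a back-of-envelope check on that figure (only the 16 full-attention layers and the q8_0 format come from my setup above; the KV head count and head dim below are illustrative guesses, not Qwen 3.6's published config):

```python
# Back-of-envelope KV-cache size for a hybrid-attention model at 256k context.
# Only the 16 full-attention layers and the q8_0 format come from the post;
# the KV head count and head dim are illustrative assumptions.
full_attn_layers = 16          # the linear-attention layers carry ~no KV
n_ctx = 262_144                # 256k context
n_kv_heads = 8                 # assumption
head_dim = 128                 # assumption
q8_0_bytes_per_elem = 34 / 32  # q8_0 block: 32 int8 values + fp16 scale

kv_bytes = 2 * full_attn_layers * n_ctx * n_kv_heads * head_dim * q8_0_bytes_per_elem
print(f"KV cache @ 256k: {kv_bytes / 2**30:.1f} GiB")  # 8.5 GiB
```

Same ballpark as the ~12 GB observed; the remainder would be the recurrent-state buffers plus allocator overhead and rounding of the assumed dims.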
For non-hybrid models (Qwen 2.5, Llama 3, Gemma, Mistral — anything with full attention on every layer) TurboQuant would be more compelling. For Qwen 3.6 on this hardware, it isn't.
### 3. TurboQuant V-cache costs ~10–20% TG on Ampere, with the gap widening at long context
Compare B vs C — same weights, same base branch, only the V-cache differs. TG hit from turbo3 V:
- 1k ctx: 41.0 → 40.4 tok/s (−1.4%)
- 7k: 39.7 → 37.0 (−6.8%)
- 15k: 38.2 → 33.4 (−12.5%)
- 28k: 36.0 → 28.8 (−20.0%)
This is the "3-bit codebook dequant cost during decode" that TheTom's own MI355X benchmark called out: WHT inverse rotation + 8-entry codebook lookup per token per KV head every generation step isn't free. Prefill is barely affected because prefill is KV-*write*-dominated, and writes benefit from the smaller quant.
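For intuition, the per-read work looks roughly like this — an illustrative sketch only, NOT the fork's actual fused CUDA kernels, and the 8-entry linear codebook is a made-up stand-in:

```python
import math

# Illustrative sketch of the decode-time cost of a 3-bit codebook V-cache:
# rotate, store 3-bit indices, then codebook lookup + inverse rotation on
# every read. NOT the fork's actual kernels; the 8-entry linear codebook
# is a made-up stand-in.

def fwht(vec):
    """Fast Walsh-Hadamard transform with orthonormal scaling (self-inverse)."""
    v = list(vec)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    scale = math.sqrt(len(v))
    return [x / scale for x in v]

CODEBOOK = [-1.5 + i * (3.0 / 7) for i in range(8)]   # 8 entries = 3 bits

def quantize(vec):
    rotated = fwht(vec)                               # rotation spreads outliers
    return [min(range(8), key=lambda k: abs(r - CODEBOOK[k])) for r in rotated]

def dequantize(indices):
    return fwht([CODEBOOK[i] for i in indices])       # lookup + inverse rotation

v = [0.2, -0.7, 1.1, 0.0, -0.3, 0.5, -1.0, 0.8]
v_hat = dequantize(quantize(v))
print([round(x, 2) for x in v_hat])
```

By contrast, q8_0 dequant is a single scale-multiply per 32-value block, which is why the plain V-cache keeps its TG lead during decode.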
If `turbo4` (4-bit, simpler dequant) eventually lands in turbo-tan's fork or someone ports it into TheTom's current base, it'd probably recover most of that TG gap per the MI355X data (84% of f16 TG for turbo4 vs 64% for turbo3 on that hardware). On 3090 the gap would be smaller in absolute terms.
### 4. The "~25 tok/s prompt generation" gotcha
If you see a low PP number on a toy prompt (32 tokens → ~25 tok/s "PP"), don't panic. That's fixed request/slot/first-token overhead diluted across too few prefill tokens. PP only becomes meaningful past ~500 prompt tokens. My real-workload PP numbers above start at ~1,000-token prompts.
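The dilution effect is easy to model (the overhead and true rate below are made-up illustrative values, not measurements from this bench):

```python
# Why a toy prompt reports a misleading "PP" number: fixed per-request
# overhead is amortized over the prompt length. Overhead and true rate
# are made-up illustrative values, not measurements.

def measured_pp(n_prompt_tokens, overhead_s=1.0, true_pp_rate=1800.0):
    """Reported tok/s = tokens / (fixed overhead + actual prefill time)."""
    return n_prompt_tokens / (overhead_s + n_prompt_tokens / true_pp_rate)

for n in (32, 512, 1_000, 8_000, 28_000):
    print(f"{n:>6} prompt tokens -> {measured_pp(n):7.1f} tok/s reported")
```

With a fixed 1 s of overhead, a 32-token prompt can't report much more than ~32 tok/s no matter how fast the GPU is.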
## Winning configuration for Qwen 3.6 27B on 2x RTX 3090
```bash
~/llama.cpp/build/bin/llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--host 0.0.0.0 --port 8003
```
This gives you native 256k context, ~1,700–1,850 tok/s prefill in the 3–15k sweet spot, ~36–40 tok/s generation, and 30.9 GB VRAM. No fork, no patches.
## Build command (any of the three forks)
```bash
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=86 \
-DGGML_CUDA_GRAPHS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=ON \
-GNinja
cd build && ninja llama-server llama-cli llama-quantize
```
In `-DCMAKE_CUDA_ARCHITECTURES=86`, substitute your GPU's actual compute capability, or drop the flag entirely to build for the default arch list.
## When TurboQuant actually helps (based on this data)
- **Dense-attention models** (not Qwen 3.6): KV dominates VRAM, compression pays.
- **High `--parallel`**: each slot carries its own KV, so per-slot savings multiply.
- **Multi-model deployments**: if you need Qwen 3.6 alongside another model on the same GPUs, the ~3 GB KV savings at 256k matter.
- **On AMD MI300X/MI355X** per TheTom's own benchmarks: the gap closes to parity with f16 at pp512, and `turbo4` is actively competitive.
- **When you need more than 256k context** on 2x 3090 specifically: at that point standard q8_0 V *would* start becoming the limiting factor (call it the 350k–400k range on this hardware).
## Links
- Models: [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) · [`YTan2000/Qwen3.6-27B-TQ3_4S`](https://huggingface.co/YTan2000/Qwen3.6-27B-TQ3_4S)
- Forks: [`turbo-tan/llama.cpp-tq3`](https://github.com/turbo-tan/llama.cpp-tq3) · [`TheTom/llama-cpp-turboquant`](https://github.com/TheTom/llama-cpp-turboquant) · [`ggerganov/llama.cpp`](https://github.com/ggerganov/llama.cpp)
- TurboQuant paper: [arXiv 2504.19874](https://arxiv.org/abs/2504.19874) (ICLR 2026) — PolarQuant + QJL
Happy to share raw logs if anyone wants to cross-check a specific cell. And if you've got a non-hybrid 30B–70B model handy where KV dominates, I'd be curious to see the comparison there — my hunch is TurboQuant's story gets much more compelling on Llama/Qwen-2.5-class architectures.
Cferra@reddit
i have 2x 3090s with NVlink
Swedgetarian@reddit
Q4_K_XL on a 4090 24GB, fully in VRAM. Squeezed for context without KV-cache quant. But on short (~1k) context getting 40 t/s TG.
docker run -v /mnt/data/gguf:/mnt/data/gguf \
-p 8095:8095 \
--gpus all \
ghcr.io/ggml-org/llama.cpp:full-cuda \
-s \
-m /mnt/data/gguf/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--host 0.0.0.0 \
--port 8095 \
--ctx-size 32000 \
--no-mmap \
--flash-attn on \
--n-gpu-layers 999 \
--chat-template-kwargs "{\"preserve_thinking\":true}" \
--temp 0.7 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--repeat-penalty 1.0 \
--presence-penalty 0.0
Familiar_Wish1132@reddit (OP)
thx, try https://huggingface.co/Jackrong/Qwopus3.6-27B-v1-preview-GGUF
jacek2023@reddit
cache?
Familiar_Wish1132@reddit (OP)
? i have 256GB RAM, do i need to specify cache? isn't it taking max? i would gladly give more as i have enough :D please what param to set?
jacek2023@reddit
just asking because I use cache ram: https://www.reddit.com/r/LocalLLaMA/comments/1sqp8pp/opencode_with_gemma_26b/
Familiar_Wish1132@reddit (OP)
--cache-ram → CPU memory (system RAM). So if you do it intentionally, okay, but if you have enough VRAM then it's a bottleneck
Familiar_Wish1132@reddit (OP)
okay thx will test it out <3
soyalemujica@reddit
24gb vram 7900XTX 35t/s, and 27t/s at 160k context:
llama-server.exe -ctv q8_0 -ctk q8_0 -c 160000 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 --fit on
dodistyo@reddit
How?? i can barely run it with a 64k ctx window, and that's using KV-cache Q4 quantization.
I have the same hardware, same model but with lmstudio.
the model itself is 19 GB-ish, right? unless i downloaded the wrong model here.
soyalemujica@reddit
I do not use lm studio, I use llama.cpp, the model size is 17gb
dodistyo@reddit
yea, lmstudio is actually using llama.cpp under the hood, so the results should not be too different i believe. full GPU offload right? I'll give it a try myself tho using llama.cpp.
Familiar_Wish1132@reddit (OP)
what GPU?
sumrix@reddit
Which quantization format are you using?
soyalemujica@reddit
UD Q4_K_XL
sumrix@reddit
Man, I get only 13 t/s on the same GPU, with the same quantization format and the same parameters.
soyalemujica@reddit
You must have something else using your GPU VRAM already
Techngro@reddit
The difference is he has RGB in his rig.
Familiar_Wish1132@reddit (OP)
Vulkan?
soyalemujica@reddit
Yeah, Vulkan, PP at 400
Familiar_Wish1132@reddit (OP)
ufff noice !
Familiar_Wish1132@reddit (OP)
what ? 27t/s at 160k? what is your pp at 160k context? Thank you very much for this info
dero_name@reddit
Ah, I see you're running `llama-server` from a floppy drive. Bold choice!
SkyFeistyLlama8@reddit
I can hear the drive motor crunching from across the Internet as it tries to load 20 GB at like 100 kilobytes per second.
Radiant_Condition861@reddit
got mine on zip drive.
Ready to take on the click of death !
neverbyte@reddit
dude! I had one of these as a kid and my mind was blown. each disk could hold like 250? regular floppies worth of data? it was awesome! did I have anything of any real size that needed storing? did I actually use it for anything? don't remember, but i felt like a baller. nostalgia!
illforgetsoonenough@reddit
Horrible memories unlocked
Familiar_Wish1132@reddit (OP)
xD
Impossible_Art9151@reddit
ymmd
No_Mango7658@reddit
Go home kids
SingleProgress8224@reddit
ymmgta
Familiar_Wish1132@reddit (OP)
xD xD xD ofcourse ahahaha
hedsht@reddit
5090: web dev
jessez05@reddit
```
llama-server.exe ^
--alias qwen3.6-27b ^
-m "C:\Users\pv\models\Qwen3.6-27B-UD-Q4_K_XL.gguf" ^
--host 127.0.0.1 ^
--port 11434 ^
--ctx-size 262144 ^
-ngl -1 ^
--parallel 2 ^
--jinja ^
--chat-template-kwargs "{\"enable_thinking\": true, \"preserve_thinking\": true}" ^
--reasoning on ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
--flash-attn on ^
--batch-size 2048 ^
--ubatch-size 512 ^
--threads 14 ^
--threads-batch 22 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0
```
output 50tok/s, 29GB VRAM, rtx 5090
nunodonato@reddit
2 questions:
1 - how much tok/s?
2 - do you see any inference speedup from using spec decoding?
hedsht@reddit
1.) depends, but 40-45 tokens/s on avg
2.) Generation throughput improved by about 28.0%; prompt throughput improved by about 4.2%
Familiar_Wish1132@reddit (OP)
thx will try it out <3
hedsht@reddit
if you run --no-mmproj and a 128k context you can even fit a Q6, but i need mmproj for my workflow.
srigi@reddit
You can keep mmproj in RAM/CPU with
--no-mmproj-offload. You save GPU memory while still being able to process images/PDFs.
hedsht@reddit
ah yeah, totally forgot about that, good stuff, thank you!
SmallHoggy@reddit
Thank you 🙏🏼
Familiar_Wish1132@reddit (OP)
i have put the mmproj for qwen3.6 35ba3b on another MI50 GPU that i have, to separate it from the main coding llm. So with oh-my-openagent i have set up the visual category to use the other model with mmproj.
Cferra@reddit
export CUDA_VISIBLE_DEVICES=0,1
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj ~/models/mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--jinja \
--reasoning-format deepseek \
-a qwen3.6-27b \
--host 0.0.0.0 --port 8003
enables all features
Familiar_Wish1132@reddit (OP)
can you please test it without? From my testing it looks faster, idk if it's turboquant or ngram, but the parallel setting seems to slow down generation
oxygen_addiction@reddit
You should also play around with batch size
https://github.com/ggml-org/llama.cpp/discussions/15396
https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/comment/o7rszuj/?context=3
Pleasant-Shallot-707@reddit
Pi
lemondrops9@reddit
Qwen3.6 27B is out???
Familiar_Wish1132@reddit (OP)
yep finally ^^
lemondrops9@reddit
sweet I was trying out the 3.5 version last night. Time to download
WoodCreakSeagull@reddit
Splitting the Q4_K_M + BF16 mmproj between an RTX 5070 Ti (16GB) and Arc B580 (12GB) using llama.cpp for vulkan.
-c 200000 --fit off --parallel 2 -ngl 99 --tensor-split 57,43 -b 1024 -ub 256 --flash-attn on --no-mmap --mlock --temp 0.7 --top-p 0.9 --min-p 0.05 --top-k 40 --repeat-penalty 1.05 --repeat-last-n 64 -ctk q8_0 -ctv q8_0 --chat-template-kwargs '{"preserve_thinking": true}' --no-warmup --jinja
25 t/s on first prompt, 15 t/s with 50k context loaded.
Feels pretty slow compared to 35B but definitely usable. Had to tinker with some of the values at the edges and lower context to 200k/use smaller batch sizes to keep from spilling over into CPU.
timanu90@reddit
You can mix cards?
Would be possible to mix nvidia and AMD cards?
WoodCreakSeagull@reddit
If they share a backend, yes. Vulkan is fairly universal so it should be able to work fine for AMD and Nvidia I think.
timanu90@reddit
Thanks for the info. Will check on that
akumaburn@reddit
It looks like you’ve set --draft-min/--draft-max, but there’s no draft model configured, so those flags won’t have any effect.
You might also want to reduce the number of threads. llama.cpp doesn’t scale particularly well with higher thread counts, so try something in the 6–8 range instead.
A --top-k of 20 is on the low side as well; something around 40 or higher is usually a better starting point. Everything else looks fine.
andy2na@reddit
its "Draftless" N-Gram Speculative Decoding
https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md
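The gist of the draftless approach, as a hedged sketch (illustrative only, not llama.cpp's actual implementation): reuse the context itself as the draft model by matching the trailing n-gram against earlier occurrences, then let the main model verify the proposed tokens in one batch.

```python
# Minimal sketch of draftless n-gram speculation -- illustrative only, not
# llama.cpp's actual implementation. Tokens are plain ints for simplicity.

def ngram_draft(tokens, n=3, max_draft=8):
    """Match the trailing n-gram against earlier context; if found, propose
    the tokens that followed it last time as a speculative draft."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # scan right-to-left so the most recent earlier match wins
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            follow = tokens[start + n:start + n + max_draft]
            if follow:
                return follow
    return []

ctx = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]      # trailing "1 2 3" occurred before
print(ngram_draft(ctx, max_draft=4))       # -> [4, 5, 6, 7]
```

That's also why it helps most on repetitive text (code edits, logs, agent loops) and does little on novel prose.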
akumaburn@reddit
hmm interesting; wasn't aware this was a thing.. though I'm not sure how much this will help when using opencode
Familiar_Wish1132@reddit (OP)
Thank you will test it out <3
Willing-Toe1942@reddit
if you want the best, run the pi coding agent instead of opencode
Familiar_Wish1132@reddit (OP)
Why? i need to have a webui, for me it's very important. You think that the smaller system prompt context could help, huh?
Sir-Draco@reddit
I haven't used the pi coding agent, but I do know it is very configurable (technically it is entirely configurable), and there is a lot of research coming out about how smaller system prompts let today's models perform better, since they are RL-trained out of their minds (or out of their weights) and already know how to be agents. Something to consider: my bet is that pi probably works better for these smaller models. Will be trying it out this weekend
Familiar_Wish1132@reddit (OP)
it makes sense. let us know <3
t2noob@reddit
What do people think of mine? It runs on a dual P40 setup. I use it as a daily with nanobot. I was mostly playing fix-the-config with openclaw.
ExecStart=/usr/bin/numactl --interleave=all /root/llama-cpp-turboquant/build-cuda-only/bin/llama-server \
-m /storage/ollama/models/gguf/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
--mmproj /storage/ollama/models/gguf/mmproj-qwen3.6-35b-f16.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
--no-mmproj-offload \
-c 65536 \
-ctk turbo4 \
-ctv turbo4 \
-sm layer \
-np 1 \
-b 2048 \
-ub 2048 \
--image-max-tokens 2048 \
--metrics \
--jinja \
--reasoning-format deepseek \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--repeat-penalty 1.05
morejona@reddit
I'm pretty sure you don't want to apply the turbo quant to both K and V caches, only V. Otherwise your context will drastically worsen. At least that's been reported by others: https://www.reddit.com/r/LocalLLaMA/comments/1sm6d2k/what_is_the_current_status_with_turbo_quant
Familiar_Wish1132@reddit (OP)
uff, turbo4! would you share a gh link for that llama.cpp please? you are using reasoning format, huh?
t2noob@reddit
https://github.com/TheTom/llama-cpp-turboquant
Familiar_Wish1132@reddit (OP)
Thx
Impossible_Art9151@reddit
what kind of optimization is your command for?
27B is running on my dgx ... and it is a little bit too slow, <10 t/s
Maybe you can provide a dgx command that performs better than mine?
I am running the big Q8 with 512000 ctx and --parallel 2
Familiar_Wish1132@reddit (OP)
Share yours i will update the post and also ask for dgx
Impossible_Art9151@reddit
./llama-server -hf unsloth/... --host ... --port ... --ctx-size 512000 --no-mmap --parallel 2 --flash-attn on --n-gpu-layers 999 --chat-template-kwargs "{\"preserve_thinking\":true}" ... (followed by temp, top-p, ...)
Able_Zombie_7859@reddit
...512k context
Far_Cat9782@reddit
512k context wtf
Familiar_Wish1132@reddit (OP)
Please paste the full command for people, the exact one that you are using <3
Impossible_Art9151@reddit
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL --host 0.0.0.0 --port 8095 --ctx-size 512000 --no-mmap --parallel 2 --flash-attn on --n-gpu-layers 999 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 --presence-penalty 0.0
Familiar_Wish1132@reddit (OP)
updated thx
Impossible_Art9151@reddit
for linux you should include "./": ./llama-server
not just llama-server
Familiar_Wish1132@reddit (OP)
people will figure it out :D
anthonyg45157@reddit
what system? this runs so slow on my 3090, but it seems it's set up to split with system ram
Familiar_Wish1132@reddit (OP)
100K filled context, i have 400/11 pp/tg on 2x 3080 20GB, 256GB DDR4
It's in vram, idk, seems the same speed as the 3.5 27b
anthonyg45157@reddit
Hmmm I'm only getting 11 per second as well with my 3090... it seems VRAM and system RAM are both being used. 11 tok/s is pretty damn slow, should get around 30-40 on GPU RAM only... idk what I'm missing lol
andy2na@reddit
its your context window. at 256k on my 3090, I was getting 13t/s, when you drop to 32k, goes to 30t/s
anthonyg45157@reddit
Yup, same, which means context is being shared with system ram I guess?
Seems the 35b MoE is best for people who have a decent GPU but a ton of ram; these dense models can be run but are so slow (depending on your needs)
andy2na@reddit
Dense is too slow for my everyday use-case; 35b MoE is best to keep in VRAM all the time, and switch to dense if you need to code or do heavy agent use
Familiar_Wish1132@reddit (OP)
Or maybe do planning with dense and coding with moe?
andy2na@reddit
I think people recommend the reverse? Planning with MoE and Dense for the actual coding - or just use dense for it all and just let it run
Familiar_Wish1132@reddit (OP)
idk, maybe. but logically, if planning is done correctly and the plan is prepared, a dumber model can just put it in place?
Familiar_Wish1132@reddit (OP)
Interesting, but i don't see much CPU usage, regular 5%
anthonyg45157@reddit
Gonna check into it more, working and tinkering at the same time is rough 😂
Familiar_Wish1132@reddit (OP)
Yeah i feel you, i was forced to put my work aside xD xD xD damn qwen team xD
ComfyUser48@reddit
-m /models/Qwen3.6-27B-UD-Q6_K_XL.gguf
--jinja
--alias "qwen36-27"
--ctx-size 112640
--no-mmproj-offload
-ngl 999
--presence-penalty 1.5
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--chat-template-kwargs '{"enable_thinking": false}'
--flash-attn on
Familiar_Wish1132@reddit (OP)
without thinking and preserve thinking? it was recommended to use those parameters!
ComfyUser48@reddit
preserve thinking is on by default. for coding there is no need for thinking 95% of time. it's way faster without it.
Familiar_Wish1132@reddit (OP)
From the blog post, preserve thinking is not on by default.
It is specifically stated that when using agentic/coding it is recommended to enable it
ComfyUser48@reddit
This is new to me! ty !
Ell2509@reddit
Where did you get the gguf? I have been waiting for it on ollama.
Familiar_Wish1132@reddit (OP)
unsloth, lmstudio
Constandinoskalifo@reddit
I thought --reasoning flag didn't work for qwen3.5? Does it work for 3.6?
Familiar_Wish1132@reddit (OP)
i saw in the logs that the kwargs are deprecated, so i just put it in, but the reasoning process is in place when using --reasoning