Best config for Qwen3.6 27b / llama.cpp / opencode
Posted by Familiar_Wish1132@reddit | LocalLLaMA | View on Reddit | 106 comments
Please share your best config <3
llama.cpp:
```
"A:/0_llama_server/llama-server.exe" ^
  -m "a:\0_LM_Studio\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q5_K_XL.gguf" ^
  --port 8080 --alias qwen3.6:27b -ngl 999 --threads 22 --flash-attn on ^
  --host 0.0.0.0 --no-mmap -mg 1 --batch-size 1024 --ubatch-size 512 ^
  --ctx-checkpoints 128 --ctx-size 196610 --reasoning on --jinja ^
  --draft-max 128 --spec-ngram-size-n 48 --draft-min 2 --spec-type ngram-mod ^
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 ^
  --repeat-penalty 1.0 --presence-penalty 0.0 ^
  --chat-template-kwargs "{\"preserve_thinking\":true}" ^
  --tensor-split 0.46,0.54
```
legodfader@reddit
Anyone with dual 3090s?
AdamDhahabi@reddit
Sort of, 3090 + 2x 5070 Ti, running Unsloth Q8 at 25 t/s
Important_Quote_1180@reddit
That’s a cool setup! I have just a single 3090 and an old 1660 running Unreal, and the UD Q5 was doing about 35 tok/s
AdamDhahabi@reddit
I previously had a P5000, which is 1080-equivalent (288 GB/s); it bottlenecked me in several ways:
- Pascal generation locked me into CUDA 12.x, now 13.x -> small percentage speedup
- More VRAM allowed me to run Bartowski GGUF instead of Unsloth UD -> 10% speedup (Unsloth UD squeezes more into VRAM but at speed penalty)
- Replaced with a latest gen card with more memory bandwidth -> large speedup
- Some memory overclock (nvidia-smi) which I could not do before with my P5000 -> more speedup
psyclik@reddit
Getting around 40 t/s (1300 prefill) on 4x 3090 with graph parallel on ik_llama (Q8, 256k context)
Familiar_Wish1132@reddit (OP)
share your run command please
Cferra@reddit
export CUDA_VISIBLE_DEVICES=0,1
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj ~/models/mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--jinja \
--reasoning-format deepseek \
-a qwen3.6-27b \
--host 0.0.0.0 --port 8003
enables all features
Cferra@reddit
# Qwen 3.6 27B on 2x RTX 3090 NVLink — benching three ways (TurboQuant TQ3_4S vs standard Q4_K_M vs TurboQuant V-cache on current-main llama.cpp)
Spent the afternoon benchmarking [Qwen 3.6 27B](https://huggingface.co/Qwen/Qwen3.6-27B-Instruct) on a dual-3090 box with three different llama.cpp configurations, because I wanted to see where TurboQuant actually helps on Ampere — and the answer turned out to be more nuanced than I expected. Sharing the numbers because they complicate some of the "TurboQuant makes quants faster" claims floating around.
**TL;DR:** On Qwen 3.6 27B specifically, **vanilla llama.cpp + standard Q4_K_M + q8_0 KV wins on every axis at every context depth** on 2x RTX 3090 NVLink. TurboQuant isn't needed for long-context fit (Qwen 3.6's hybrid attention keeps KV cheap without it) and costs 10–20% generation throughput vs the plain q8_0 V-cache. The TurboQuant fork that does win on *memory* does so by an amount that doesn't matter on this hardware.
## Hardware
| | |
|-|-|
| GPUs | 2x RTX 3090, NVLink (~56 GB/s aggregate), compute capability 8.6, 24 GB each |
| Host | Ubuntu 24.04.4 LTS, kernel 6.8.0-110 |
| Driver / CUDA | 590.48.01 / CUDA 12.0 |
| Compiler | gcc 13.3.0 |
## Configurations tested
| # | llama.cpp base | Weights | K-cache | V-cache | llama-server version |
|---|----------------|---------|---------|---------|----------------------|
| **A** | [`turbo-tan/llama.cpp-tq3`](https://github.com/turbo-tan/llama.cpp-tq3) `main` @ `794c5dc` | `TQ3_4S` (~3.5 bpw, WHT) | q8_0 | **tq3_0** (3-bit WHT) | `102 (794c5dc)` |
| **B** | [`ggerganov/llama.cpp`](https://github.com/ggerganov/llama.cpp) `master` @ `8bccdbbff` | `Q4_K_M` (~4.5 bpw) | q8_0 | q8_0 | `8890 (8bccdbbff)` |
| **C** | [`TheTom/llama-cpp-turboquant`](https://github.com/TheTom/llama-cpp-turboquant) `feature/turboquant-kv-cache` @ `9e3fb40e8` | `Q4_K_M` | q8_0 | **turbo3** (3-bit WHT) | `8983 (9e3fb40e8)` |
All three: `-ngl 999 --split-mode layer --tensor-split 1,1 --flash-attn on --parallel 1`. Models from `unsloth/Qwen3.6-27B-GGUF` (Q4_K_M) and `YTan2000/Qwen3.6-27B-TQ3_4S`.
Note the `llama-server --version` column — **turbo-tan's fork is on upstream b102 (~6 months behind master)** while TheTom's is actively rebased and is actually 93 commits *ahead* of my vanilla checkout. That matters a lot, as you'll see.
## Results (matched 32k ctx, temp=0)
| Prompt (tok) | A: turbo-tan TQ3_4S + tq3_0 V | B: vanilla Q4_K_M + q8_0 V | C: TheTom Q4_K_M + turbo3 V |
|-------------:|:------------------------------|:---------------------------|:----------------------------|
| 1,028 | 782 PP / 34.5 TG | **1,161 PP / 41.0 TG** | 1,134 PP / 40.4 TG |
| 3,028 | 1,045 PP / 33.7 TG | 1,638 PP / 40.5 TG | **1,659 PP** / 39.3 TG |
| 7,028 | 1,130 PP / 32.3 TG | 1,849 PP / **39.7 TG** | **1,877 PP** / 37.0 TG |
| 15,028 | 1,118 PP / 30.4 TG | 1,822 PP / **38.2 TG** | **1,851 PP** / 33.4 TG |
| 28,028 | 1,069 PP / 27.3 TG | 1,717 PP / **36.0 TG** | **1,745 PP** / 28.8 TG |
All three generated correct output on the same correctness probe (17 × 23 → 391 with coherent reasoning traces). Quality parity; performance very different.
### VRAM @ 262k native context (`--parallel 1`)
| Config | Total VRAM | Fits on 2x 3090? |
|--------|-----------:|------------------|
| A: turbo-tan TQ3_4S | 26.7 GB | ✅ |
| B: vanilla Q4_K_M + q8_0 V | **30.9 GB** | ✅ (by ~15 GB margin) |
| C: TheTom Q4_K_M + turbo3 V | 28.1 GB | ✅ |
## What the numbers actually say
### 1. Turbo-tan's fork is slower because of its aged llama.cpp base, not because of TurboQuant
Config A is ~35–55% slower than Config C on the same hardware, even though both use TurboQuant V-cache. The only difference: A's fork is on upstream `b102` (October 2025-ish), C's is on `b8983` (current master). TurboQuant weight-quant infrastructure is fine — the fork is just missing 6 months of MMQ/FlashAttention kernel tuning.
If you want to use TurboQuant, **use TheTom's fork, not turbo-tan's**, unless you specifically need `TQ3_4S` weights (which only turbo-tan supports). For most use cases, Q4_K_M weights + TheTom's TurboQuant V-cache is the right call.
### 2. On Qwen 3.6, TurboQuant V-cache doesn't unlock any context you couldn't already reach
I assumed TurboQuant would be necessary to fit 256k context. **It isn't.** Qwen 3.6's architecture (48 linear-attention + 16 full-attention layers) means only 25% of the 64 layers have traditional KV — the rest use recurrent state with a tiny memory footprint. Plain q8_0 KV at 256k uses ~12 GB of KV total, and comfortably fits on 2x 3090 with Q4_K_M weights.
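As a back-of-envelope check on that figure (only the 16 full-attention layers and the q8_0 format come from my setup above; the KV head count and head dim below are illustrative guesses, not Qwen 3.6's published config):

```python
# Back-of-envelope KV-cache size for a hybrid-attention model at 256k context.
# Only the 16 full-attention layers and the q8_0 format come from the post;
# the KV head count and head dim are illustrative assumptions.
full_attn_layers = 16          # the linear-attention layers carry ~no KV
n_ctx = 262_144                # 256k context
n_kv_heads = 8                 # assumption
head_dim = 128                 # assumption
q8_0_bytes_per_elem = 34 / 32  # q8_0 block: 32 int8 values + fp16 scale

kv_bytes = 2 * full_attn_layers * n_ctx * n_kv_heads * head_dim * q8_0_bytes_per_elem
print(f"KV cache @ 256k: {kv_bytes / 2**30:.1f} GiB")  # 8.5 GiB
```

Same ballpark as the ~12 GB observed; the remainder would be the recurrent-state buffers plus allocator overhead and rounding of the assumed dims.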
For non-hybrid models (Qwen 2.5, Llama 3, Gemma, Mistral — anything with full attention on every layer) TurboQuant would be more compelling. For Qwen 3.6 on this hardware, it isn't.
### 3. TurboQuant V-cache costs ~10–20% TG on Ampere, with the gap widening at long context
Compare B vs C — same weights, same base branch, only the V-cache differs. TG hit from turbo3 V:
- 1k ctx: 41.0 → 40.4 tok/s (−1.4%)
- 7k: 39.7 → 37.0 (−6.8%)
- 15k: 38.2 → 33.4 (−12.5%)
- 28k: 36.0 → 28.8 (−20.0%)
This is the "3-bit codebook dequant cost during decode" that TheTom's own MI355X benchmark called out: WHT inverse rotation + 8-entry codebook lookup per token per KV head every generation step isn't free. Prefill is barely affected because prefill is KV-*write*-dominated, and writes benefit from the smaller quant.
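For intuition, the per-read work looks roughly like this — an illustrative sketch only, NOT the fork's actual fused CUDA kernels, and the 8-entry linear codebook is a made-up stand-in:

```python
import math

# Illustrative sketch of the decode-time cost of a 3-bit codebook V-cache:
# rotate, store 3-bit indices, then codebook lookup + inverse rotation on
# every read. NOT the fork's actual kernels; the 8-entry linear codebook
# is a made-up stand-in.

def fwht(vec):
    """Fast Walsh-Hadamard transform with orthonormal scaling (self-inverse)."""
    v = list(vec)
    h = 1
    while h < len(v):
        for i in range(0, len(v), h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    scale = math.sqrt(len(v))
    return [x / scale for x in v]

CODEBOOK = [-1.5 + i * (3.0 / 7) for i in range(8)]   # 8 entries = 3 bits

def quantize(vec):
    rotated = fwht(vec)                               # rotation spreads outliers
    return [min(range(8), key=lambda k: abs(r - CODEBOOK[k])) for r in rotated]

def dequantize(indices):
    return fwht([CODEBOOK[i] for i in indices])       # lookup + inverse rotation

v = [0.2, -0.7, 1.1, 0.0, -0.3, 0.5, -1.0, 0.8]
v_hat = dequantize(quantize(v))
print([round(x, 2) for x in v_hat])
```

By contrast, q8_0 dequant is a single scale-multiply per 32-value block, which is why the plain V-cache keeps its TG lead during decode.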
If `turbo4` (4-bit, simpler dequant) eventually lands in turbo-tan's fork or someone ports it into TheTom's current base, it'd probably recover most of that TG gap per the MI355X data (84% of f16 TG for turbo4 vs 64% for turbo3 on that hardware). On 3090 the gap would be smaller in absolute terms.
### 4. The "~25 tok/s prompt generation" gotcha
If you see a low PP number on a toy prompt (32 tokens → ~25 tok/s "PP"), don't panic. That's fixed request/slot/first-token overhead diluted across too few prefill tokens. PP only becomes meaningful past ~500 prompt tokens. My real-workload PP numbers above start at ~1,000-token prompts.
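The dilution effect is easy to model (the overhead and true rate below are made-up illustrative values, not measurements from this bench):

```python
# Why a toy prompt reports a misleading "PP" number: fixed per-request
# overhead is amortized over the prompt length. Overhead and true rate
# are made-up illustrative values, not measurements.

def measured_pp(n_prompt_tokens, overhead_s=1.0, true_pp_rate=1800.0):
    """Reported tok/s = tokens / (fixed overhead + actual prefill time)."""
    return n_prompt_tokens / (overhead_s + n_prompt_tokens / true_pp_rate)

for n in (32, 512, 1_000, 8_000, 28_000):
    print(f"{n:>6} prompt tokens -> {measured_pp(n):7.1f} tok/s reported")
```

With a fixed 1 s of overhead, a 32-token prompt can't report much more than ~32 tok/s no matter how fast the GPU is.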
## Winning configuration for Qwen 3.6 27B on 2x RTX 3090
```bash
~/llama.cpp/build/bin/llama-server \
-m Qwen3.6-27B-Q4_K_M.gguf \
--mmproj mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--host 0.0.0.0 --port 8003
```
This gives you native 256k context, ~1,700–1,850 tok/s prefill in the 3–15k sweet spot, ~36–40 tok/s generation, and 30.9 GB VRAM. No fork, no patches.
## Build command (any of the three forks)
```bash
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=86 \
-DGGML_CUDA_GRAPHS=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=ON \
-GNinja
cd build && ninja llama-server llama-cli llama-quantize
```
In `-DCMAKE_CUDA_ARCHITECTURES=86`, substitute your GPU's actual compute capability, or drop the flag entirely to build for the default arch list.
## When TurboQuant actually helps (based on this data)
- **Dense-attention models** (not Qwen 3.6): KV dominates VRAM, compression pays.
- **High `--parallel`**: each slot carries its own KV, so per-slot savings multiply.
- **Multi-model deployments**: if you need Qwen 3.6 alongside another model on the same GPUs, the ~3 GB KV savings at 256k matter.
- **On AMD MI300X/MI355X** per TheTom's own benchmarks: the gap closes to parity with f16 at pp512, and `turbo4` is actively competitive.
- **When you need more than 256k context** on 2x 3090 specifically: at that point standard q8_0 V *would* start becoming the limiting factor (call it the 350k–400k range on this hardware).
## Links
- Models: [`unsloth/Qwen3.6-27B-GGUF`](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) · [`YTan2000/Qwen3.6-27B-TQ3_4S`](https://huggingface.co/YTan2000/Qwen3.6-27B-TQ3_4S)
- Forks: [`turbo-tan/llama.cpp-tq3`](https://github.com/turbo-tan/llama.cpp-tq3) · [`TheTom/llama-cpp-turboquant`](https://github.com/TheTom/llama-cpp-turboquant) · [`ggerganov/llama.cpp`](https://github.com/ggerganov/llama.cpp)
- TurboQuant paper: [arXiv 2504.19874](https://arxiv.org/abs/2504.19874) (ICLR 2026) — PolarQuant + QJL
Happy to share raw logs if anyone wants to cross-check a specific cell. And if you've got a non-hybrid 30B–70B model handy where KV dominates, I'd be curious to see the comparison there — my hunch is TurboQuant's story gets much more compelling on Llama/Qwen-2.5-class architectures.
Cferra@reddit
i have 2x 3090s with NVlink
Swedgetarian@reddit
Q4_K_XL on a 4090 24GB, fully in VRAM. Squeezed for context without KV-cache quant. But on short (~1k) context getting 40 t/s TG.
docker run -v /mnt/data/gguf:/mnt/data/gguf \
-p 8095:8095 \
--gpus all \
ghcr.io/ggml-org/llama.cpp:full-cuda \
-s \
-m /mnt/data/gguf/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--host 0.0.0.0 \
--port 8095 \
--ctx-size 32000 \
--no-mmap \
--flash-attn on \
--n-gpu-layers 999 \
--chat-template-kwargs "{\"preserve_thinking\":true}" \
--temp 0.7 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--repeat-penalty 1.0 \
--presence-penalty 0.0
Familiar_Wish1132@reddit (OP)
thx, try https://huggingface.co/Jackrong/Qwopus3.6-27B-v1-preview-GGUF
jacek2023@reddit
cache?
Familiar_Wish1132@reddit (OP)
? i have 256GB RAM, do i need to specify cache? isn't it taking max? i would gladly give more as i have enough :D please what param to set?
jacek2023@reddit
just asking because I use cache ram: https://www.reddit.com/r/LocalLLaMA/comments/1sqp8pp/opencode_with_gemma_26b/
Familiar_Wish1132@reddit (OP)
--cache-ram → CPU memory (system RAM). So if you do it intentionally, okay, but if you have enough VRAM then it's a bottleneck
Familiar_Wish1132@reddit (OP)
okay thx will test it out <3
soyalemujica@reddit
24gb vram 7900XTX 35t/s, and 27t/s at 160k context:
llama-server.exe -ctv q8_0 -ctk q8_0 -c 160000 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.0 --fit on
dodistyo@reddit
How?? i can barely run it with a 64k ctx window, and that's using KV-cache Q4 quantization.
I have the same hardware, same model but with lmstudio.
the model itself is 19 GB-ish, right? unless i downloaded the wrong model here.
soyalemujica@reddit
I do not use lm studio, I use llama.cpp, the model size is 17gb
dodistyo@reddit
yea, lmstudio is actually using llama.cpp under the hood, so the results should not be too different i believe. full GPU offload right? I'll give it a try myself tho using llama.cpp.
Familiar_Wish1132@reddit (OP)
what GPU?
sumrix@reddit
Which quantization format are you using?
soyalemujica@reddit
UD Q4_K_XL
sumrix@reddit
Man, I get only 13 t/s on the same GPU, with the same quantization format and the same parameters.
soyalemujica@reddit
You must have something else using your GPU VRAM already
Techngro@reddit
The difference is he has RGB in his rig.
Familiar_Wish1132@reddit (OP)
Vulkan?
soyalemujica@reddit
Yeah, Vulkan, PP at 400
Familiar_Wish1132@reddit (OP)
ufff noice !
Familiar_Wish1132@reddit (OP)
what ? 27t/s at 160k? what is your pp at 160k context? Thank you very much for this info
dero_name@reddit
Ah, I see you're running `llama-server` from a floppy drive. Bold choice!
SkyFeistyLlama8@reddit
I can hear the drive motor crunching from across the Internet as it tries to load 20 GB at like 100 kilobytes per second.
Radiant_Condition861@reddit
got mine on zip drive.
Ready to take on the click of death !
neverbyte@reddit
dude! I had one of these as a kid and my mind was blown. each disk could hold like 250? regular floppies worth of data? it was awesome! did I have anything of any real size that needed storing? did I actually use it for anything? don't remember, but i felt like a baller. nostalgia!
illforgetsoonenough@reddit
Horrible memories unlocked
Familiar_Wish1132@reddit (OP)
xD
Impossible_Art9151@reddit
ymmd
No_Mango7658@reddit
Go home kids
SingleProgress8224@reddit
ymmgta
Familiar_Wish1132@reddit (OP)
xD xD xD ofcourse ahahaha
hedsht@reddit
5090: web dev
jessez05@reddit
```
llama-server.exe ^
--alias qwen3.6-27b ^
-m "C:\Users\pv\models\Qwen3.6-27B-UD-Q4_K_XL.gguf" ^
--host 127.0.0.1 ^
--port 11434 ^
--ctx-size 262144 ^
-ngl -1 ^
--parallel 2 ^
--jinja ^
--chat-template-kwargs "{\"enable_thinking\": true, \"preserve_thinking\": true}" ^
--reasoning on ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 0.0 ^
--repeat-penalty 1.0 ^
--flash-attn on ^
--batch-size 2048 ^
--ubatch-size 512 ^
--threads 14 ^
--threads-batch 22 ^
--cache-type-k q8_0 ^
--cache-type-v q8_0
```
output 50tok/s, 29GB VRAM, rtx 5090
nunodonato@reddit
2 questions:
1 - how much tok/s?
2 - do you see any inference speedup from using spec decoding?
hedsht@reddit
1.) depends, but 40-45 tokens/s on avg
2.) Generation throughput improved by about 28.0%; prompt throughput improved by about 4.2%
Familiar_Wish1132@reddit (OP)
thx will try it out <3
hedsht@reddit
if you run --no-mmproj and a 128k context you can even fit a Q6, but i need mmproj for my workflow.
srigi@reddit
You can keep mmproj in RAM/CPU with
--no-mmproj-offload. You save GPU memory while still being able to process images/PDFs.
hedsht@reddit
ah yeah, totally forgot about that, good stuff, thank you!
SmallHoggy@reddit
Thank you 🙏🏼
Familiar_Wish1132@reddit (OP)
i have put the mmproj for qwen3.6 35ba3b on another MI50 GPU that i have, to separate it from the main coding llm. So with oh-my-openagent i have set up the visual category to use the other model with mmproj.
Cferra@reddit
export CUDA_VISIBLE_DEVICES=0,1
~/llama.cpp/build/bin/llama-server \
-m ~/models/Qwen3.6-27B-Q4_K_M.gguf \
--mmproj ~/models/mmproj-Qwen3.6-27B-F16.gguf \
-ngl 999 --split-mode layer --tensor-split 1,1 \
--ctx-size 262144 --parallel 1 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn on \
--jinja \
--reasoning-format deepseek \
-a qwen3.6-27b \
--host 0.0.0.0 --port 8003
enables all features
Familiar_Wish1132@reddit (OP)
can you please test it without? From my testing it looks faster, idk if it's turboquant or ngram, but the parallel setting seems to slow down generation
oxygen_addiction@reddit
You should also play around with batch size
https://github.com/ggml-org/llama.cpp/discussions/15396
https://www.reddit.com/r/LocalLLaMA/comments/1rg4zqv/comment/o7rszuj/?context=3
Pleasant-Shallot-707@reddit
Pi
lemondrops9@reddit
Qwen3.6 27B is out???
Familiar_Wish1132@reddit (OP)
yep finally ^^
lemondrops9@reddit
sweet I was trying out the 3.5 version last night. Time to download
WoodCreakSeagull@reddit
Splitting the Q4_K_M + BF16 mmproj between an RTX 5070 Ti (16GB) and Arc B580 (12GB) using llama.cpp for vulkan.
-c 200000 --fit off --parallel 2 -ngl 99 --tensor-split 57,43 -b 1024 -ub 256 --flash-attn on --no-mmap --mlock --temp 0.7 --top-p 0.9 --min-p 0.05 --top-k 40 --repeat-penalty 1.05 --repeat-last-n 64 -ctk q8_0 -ctv q8_0 --chat-template-kwargs '{"preserve_thinking": true}' --no-warmup --jinja
25 t/s on first prompt, 15 t/s with 50k context loaded.
Feels pretty slow compared to 35B but definitely usable. Had to tinker with some of the values at the edges and lower context to 200k/use smaller batch sizes to keep from spilling over into CPU.
timanu90@reddit
You can mix cards?
Would be possible to mix nvidia and AMD cards?
WoodCreakSeagull@reddit
If they share a backend, yes. Vulkan is fairly universal so it should be able to work fine for AMD and Nvidia I think.
timanu90@reddit
Thanks for the info. Will check on that
akumaburn@reddit
It looks like you’ve set --draft-min/--draft-max, but there’s no draft model configured, so those flags won’t have any effect.
You might also want to reduce the number of threads. llama.cpp doesn’t scale particularly well with higher thread counts, so try something in the 6–8 range instead.
A --top-k of 20 is on the low side as well; something around 40 or higher is usually a better starting point. Everything else looks fine.
andy2na@reddit
its "Draftless" N-Gram Speculative Decoding
https://github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md
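The gist of the draftless approach, as a hedged sketch (illustrative only, not llama.cpp's actual implementation): reuse the context itself as the draft model by matching the trailing n-gram against earlier occurrences, then let the main model verify the proposed tokens in one batch.

```python
# Minimal sketch of draftless n-gram speculation -- illustrative only, not
# llama.cpp's actual implementation. Tokens are plain ints for simplicity.

def ngram_draft(tokens, n=3, max_draft=8):
    """Match the trailing n-gram against earlier context; if found, propose
    the tokens that followed it last time as a speculative draft."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # scan right-to-left so the most recent earlier match wins
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            follow = tokens[start + n:start + n + max_draft]
            if follow:
                return follow
    return []

ctx = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]      # trailing "1 2 3" occurred before
print(ngram_draft(ctx, max_draft=4))       # -> [4, 5, 6, 7]
```

That's also why it helps most on repetitive text (code edits, logs, agent loops) and does little on novel prose.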
akumaburn@reddit
hmm interesting; wasn't aware this was a thing.. though I'm not sure how much this will help when using opencode
Familiar_Wish1132@reddit (OP)
Thank you will test it out <3
Willing-Toe1942@reddit
if you want the best, run the pi coding agent instead of opencode
Familiar_Wish1132@reddit (OP)
Why? i need to have a webui, for me it's very important. You think that the smaller system prompt context could help, huh?
Sir-Draco@reddit
I haven't used the pi coding agent, but I do know it is very configurable (technically it is entirely configurable), and there is a lot of research coming out about how smaller system prompts let today's models perform better, since they are RL-trained out of their minds (or out of their weights) and already know how to be agents. Something to consider: my bet is that pi probably works better for these smaller models. Will be trying it out this weekend
Familiar_Wish1132@reddit (OP)
it makes sense. let us know <3
t2noob@reddit
What do people think of mine? It runs on a dual P40 setup. I use it as a daily with nanobot. I was mostly playing fix-the-config with openclaw.
ExecStart=/usr/bin/numactl --interleave=all /root/llama-cpp-turboquant/build-cuda-only/bin/llama-server \
-m /storage/ollama/models/gguf/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
--mmproj /storage/ollama/models/gguf/mmproj-qwen3.6-35b-f16.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 99 \
--no-mmproj-offload \
-c 65536 \
-ctk turbo4 \
-ctv turbo4 \
-sm layer \
-np 1 \
-b 2048 \
-ub 2048 \
--image-max-tokens 2048 \
--metrics \
--jinja \
--reasoning-format deepseek \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0 \
--repeat-penalty 1.05
morejona@reddit
I'm pretty sure you don't want to apply the turbo quant to both K and V caches, only V. Otherwise your context will drastically worsen. At least that's been reported by others: https://www.reddit.com/r/LocalLLaMA/comments/1sm6d2k/what_is_the_current_status_with_turbo_quant
Familiar_Wish1132@reddit (OP)
uff, turbo4! would you share a gh link for that llama.cpp please? you are using reasoning format, huh?
t2noob@reddit
https://github.com/TheTom/llama-cpp-turboquant
Familiar_Wish1132@reddit (OP)
Thx
Impossible_Art9151@reddit
what kind of optimization is your command for?
27B is running on my dgx ... and it is a little bit too slow, <10 t/s
Maybe you can provide a dgx command that performs better than mine?
I am running the big Q8 with 512000 ctx and --parallel 2
Familiar_Wish1132@reddit (OP)
Share yours i will update the post and also ask for dgx
Impossible_Art9151@reddit
./llama-server -hf unsloth/... --host ... --port ... --ctx-size 512000 --no-mmap --parallel 2 --flash-attn on --n-gpu-layers 999 --chat-template-kwargs "{\"preserve_thinking\":true}" ... (followed by temp, top-p, ...)
Able_Zombie_7859@reddit
...512k context
Far_Cat9782@reddit
512k context wtf
Familiar_Wish1132@reddit (OP)
Please paste the full command for people, the exact one that you are using <3
Impossible_Art9151@reddit
llama-server -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL --host 0.0.0.0 --port 8095 --ctx-size 512000 --no-mmap --parallel 2 --flash-attn on --n-gpu-layers 999 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.7 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.0 --presence-penalty 0.0
Familiar_Wish1132@reddit (OP)
updated thx
Impossible_Art9151@reddit
for linux you should include "./": ./llama-server
not just llama-server
Familiar_Wish1132@reddit (OP)
people will figure it out :D
anthonyg45157@reddit
what system? this runs so slow on my 3090, but it seems it's set up to split with system ram
Familiar_Wish1132@reddit (OP)
100K filled context, i have 400/11 pp/tg on 2x 3080 20GB, 256GB DDR4
It's in vram, idk, seems the same speed as the 3.5 27b
anthonyg45157@reddit
Hmmm I'm only getting 11 per second as well with my 3090... it seems VRAM and system RAM are both being used. 11 tok/s is pretty damn slow, should get around 30-40 on GPU RAM only... idk what I'm missing lol
andy2na@reddit
its your context window. at 256k on my 3090, I was getting 13t/s, when you drop to 32k, goes to 30t/s
anthonyg45157@reddit
Yup, same, which means context is being shared with system ram I guess?
Seems the 35b MoE is best for people who have a decent GPU but a ton of ram; these dense models can be run but are so slow (depending on your needs)
andy2na@reddit
Dense is too slow for my everyday use-case; 35b MoE is best to keep in VRAM all the time, and switch to dense if you need to code or do heavy agent use
Familiar_Wish1132@reddit (OP)
Or maybe do planning with dense and coding with moe?
andy2na@reddit
I think people recommend the reverse? Planning with MoE and Dense for the actual coding - or just use dense for it all and just let it run
Familiar_Wish1132@reddit (OP)
idk, maybe. but logically, if planning is done correctly and the plan is prepared, a dumber model can just put it in place?
Familiar_Wish1132@reddit (OP)
Interesting, but i don't see much CPU usage, regular 5%
anthonyg45157@reddit
Gonna check into it more, working and tinkering at the same time is rough 😂
Familiar_Wish1132@reddit (OP)
Yeah i feel you, i was forced to put my work aside xD xD xD damn qwen team xD
ComfyUser48@reddit
-m /models/Qwen3.6-27B-UD-Q6_K_XL.gguf
--jinja
--alias "qwen36-27"
--ctx-size 112640
--no-mmproj-offload
-ngl 999
--presence-penalty 1.5
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0
--chat-template-kwargs '{"enable_thinking": false}'
--flash-attn on
Familiar_Wish1132@reddit (OP)
without thinking and preserve thinking? it was recommended to use those parameters!
ComfyUser48@reddit
preserve thinking is on by default. for coding there is no need for thinking 95% of time. it's way faster without it.
Familiar_Wish1132@reddit (OP)
From the blog post, preserve thinking is not on by default.
It is specifically stated that when using agentic/coding it is recommended to enable it
ComfyUser48@reddit
This is new to me! ty !
Ell2509@reddit
Where did you get the gguf? I have been waiting for it on ollama.
Familiar_Wish1132@reddit (OP)
unsloth, lmstudio
Constandinoskalifo@reddit
I thought --reasoning flag didn't work for qwen3.5? Does it work for 3.6?
Familiar_Wish1132@reddit (OP)
i saw in the logs that the kwargs are deprecated, so i just put it in, but the reasoning process is in place when using --reasoning