Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)
Posted by AmazingDrivers4u@reddit | LocalLLaMA | 38 comments
Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS).
We’ve been pushing further on both context length and stability for tool-agent workloads.
Current results:
- ~218K context @ ~50 / 66 TPS (text; narrative / code)
- ~198K + vision @ ~51 / 68 TPS
- tool calls with ~25K-token outputs now complete without OOM
So lower TPS than our earlier config, but significantly higher context + stability under real workloads.
---
### What changed
Previously, long tool outputs (~25K tokens) would consistently crash.
This turned out to be related to a Genesis patch (PN12) that was supposed to mitigate a memory issue, but wasn’t actually applying on vLLM dev205+:
- `apply_all` reported success
- but the underlying code path was unchanged
Root cause was anchor drift in the patch: the code it anchored on had changed in the newer vLLM dev builds, so the patch silently matched nothing.
After fixing that, the tool-prefill OOM disappeared and higher context configs became usable.
Fix:
https://github.com/Sandermage/genesis-vllm-patches (PR #13)
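If you're running the patches yourself, one cheap sanity check (a rough sketch; the compose service name here is illustrative for whatever profile you boot) is to grep the startup log for the sidecars' "applied" lines instead of trusting `apply_all`'s return status:

```bash
# Rough sanity check, assuming a docker-compose style setup; service name is illustrative.
# Confirm the Genesis sidecars actually landed by looking for their "applied" lines
# in the startup log rather than trusting apply_all's success report.
docker compose logs vllm 2>&1 | grep -E "pn12_ffn_pool_anchor_fix|fa_max_seqlen_clamp"
# Expected on a healthy boot:
#   [pn12_ffn_pool_anchor_fix] SiluAndMul.forward_cuda: applied
#   [fa_max_seqlen_clamp] _flash_attn_varlen: applied
```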
---
### What we’re optimizing for
The goal here isn’t just max TPS or max context in isolation, but pushing both together on a single 3090:
- high context (200K+)
- usable throughput
- stable tool-agent workloads
---
### Notes / limitations
- There is still a second memory cliff around ~50–60K for single-prompt workloads on 1 GPU
- That one doesn’t apply with tensor parallelism (e.g. 2× 3090)
- Results depend heavily on quantization + config
---
### Repro
https://github.com/noonghunna/club-3090
---
Curious how others are balancing context vs TPS on 3090/4090 setups.
disgruntledempanada@reddit
Not seeing many setups for a 5090 but I imagine using the same setup I could boost context to max?
Optimal-Bass-5246@reddit
Yes, with a 5090, I get 160+tps and the full 256k.
Dany0@reddit
I really want this but my workflow requires vision, has anyone gotten those vllm forks to work with vision?
Optimal-Bass-5246@reddit
Let me know if you have any issues. That is only the 2nd time I've created a repo. I am currently testing updated scripts posted here: https://github.com/noonghunna/club-3090
Dany0@reddit
I can confirm cobraphil's recipe works with only small changes through Docker on WSL2. I was able to replicate the 160 tps decode, but only on simple & pure coding prompts. Real TPS on my mixed coding+generalist real-world use case is ~120 TPS on my 45k-200k context tasks, which is still brilliant.
The base setup leaves ~2.5 GB VRAM free.
So far output quality feels around the UD Q4_K_XL I used before. Not quite as good as my usual Q5_K_XL setup, but the speed is worth it.
rp1226@reddit
Do you have setup details?
AmazingDrivers4u@reddit (OP)
theoretically yes.
Equal_Jellyfish_4771@reddit
The PN12 anchor drift issue is wild; vLLM dev branches can be brutal for patch stability. Glad you tracked down the root cause instead of just throwing more VRAM at it. Are you seeing similar stability gains with other long-context workloads, or is this mainly a tool-prefill win?
AmazingDrivers4u@reddit (OP)
Having to redo all the testing, since new Genesis patches arrived late last night. Will only be able to shed some light after testing them.
Zyj@reddit
With long context beyond 70k deteriorating in quality anyway, perhaps it's better to go another route…?
satyaloka93@reddit
I can't get even 128k from these club 3090 scripts, no matter how I tweak gpu utilization. I think you can't be running any window manager with this (I'm on Windows 11/wsl and RTX 4090).
kapitanfind-us@reddit
This is indeed a lot of great work. As the owner of a 3090, big thanks!
AccomplishedFix3476@reddit
218k on a 3090 is wild, the PN12 fix unblocks so many agentic flows ngl. tool call stability is the part most ppl underestimate — 0-shot benchmarks dont say much about an agent calling tools 50x in a session. honestly the fact that we're squeezing this out of consumer hardware now is insane, the gap between local and frontier keeps shrinking 🔥 what u stress testing tools on next 👀
Important_Quote_1180@reddit
Appreciate the follow-up! This is what I got from your last post and it's been great. It was a good guide you put forth, and I'd really like to have this other config in the bank.
G2 vLLM Stack — qwen3.6-27b-autoround on RTX 3090
Model: qwen3.6-27b-autoround-int4 (AutoRound INT4 quantization) served via vLLM nightly (dev21) on port 8020. Context window: 125K tokens. KV cache uses TurboQuant 3-bit NC. Speculative decoding via MTP with 3 draft tokens. Cudagraph mode set to PIECEWISE — this is the critical setting that makes MTP work without garbling output (the default FULL mode breaks speculative decoding on this rig).
Hardware: RTX 3090 24GB, NVIDIA driver 580.126, GPU memory at 97% utilization (23.1GB of 24.5GB). Running at 348W out of a 350W power limit, 66°C, 98% utilization during benchmark.
Key launch flags: --gpu-memory-utilization 0.97, --max-num-seqs 1, --max-num-batched-tokens 4128, --enable-chunked-prefill, --enable-prefix-caching, --reasoning-parser qwen3, --tool-call-parser qwen3_coder, --kv-cache-dtype turboquant_3bit_nc, --compilation-config.cudagraph_mode PIECEWISE, --speculative-config for MTP with 3 speculative tokens. Also applies Genesis unified patch and tolist cudagraph patch at container startup.
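Putting those flags together, the launch looks roughly like the sketch below. This is reconstructed from the list above rather than copied from the repo, so treat the model path, the `--speculative-config` JSON, and the `--compilation-config` syntax as assumptions that may differ by vLLM nightly.

```bash
# Sketch of the serve command assembled from the flags listed above; the repo's
# compose file applies the Genesis unified + tolist cudagraph patches before this
# runs and may carry extra flags not listed here.
vllm serve qwen3.6-27b-autoround-int4 \
  --port 8020 \
  --max-model-len 125000 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4128 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --kv-cache-dtype turboquant_3bit_nc \
  --compilation-config.cudagraph_mode PIECEWISE \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```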
Live benchmark results from 2026-04-26: 100-token output generated at 82.4 tok/s in 1.21s total. 400-token output at 82.1 tok/s in 4.87s. 800-token output at 71.3 tok/s in 11.22s. Time-to-first-token estimated at 0.3-0.6 seconds depending on prompt length. Sustained baseline is roughly 67-89 tok/s depending on workload shape.
The PIECEWISE cudagraph setting costs about 15-20% throughput versus theoretical FULL mode speeds (which could hit 100+ tok/s) but FULL mode produces garbled, repeating output when combined with MTP speculative decoding on this hardware. The tradeoff is worth it — clean output at 82 tok/s beats garbled output at 108 tok/s.
Bottom line: 27B parameter model, INT4 quantized, running single-GPU on a consumer 3090, delivering 82 tokens per second with sub-second first-token latency and full reasoning/tool-calling support.
cleversmoke@reddit
I'm trying this today with a RTX 3090. At ~70 tok/s, it would get into the RTX 5090 range of speed that I wanted with llama.cpp!
Hello_my_name_is_not@reddit
Is this AutoRound INT4 quant "better" than the Unsloth or Bartowski Q4s from Hugging Face?
I've never used the int4 models you're mentioning. I have an AMD R9700 though, so not sure if that makes a difference or whether it works on mine?
Important_Quote_1180@reddit
AutoRound seems to be top tier for this config. I have used Unsloth for many other models; Bartowski I'm not familiar with personally, but it's popular.
VolandBerlioz@reddit
In OpenCode, on a new chat with tools, I hit cliff 1 (29k tokens, sys prompt + tools) no matter the setup - lower context, lower memory util, whatever - every time I hit cliff 1, on both vision and no-vision with the latest patches.
AmazingDrivers4u@reddit (OP)
Cliff 1 has been addressed in today's updates. Did you try the latest and greatest with the upgraded context sizes?
VolandBerlioz@reddit
Hopefully this might be helpful (Codex-generated, based on logs etc.)
What Happens:
The model boots cleanly and accepts requests, but real OpenCode long-context requests fail during prefill, not during startup or decode.
Failure mode:
- Client sends a large OpenCode request/tool context.
- vLLM begins chunked prefill.
- CUDA OOM occurs inside the compiled Qwen FFN path.
- Server returns HTTP 500 / stream breaks, and the vLLM process shuts down.
The observed OOM site is:
/root/.cache/vllm/.../inductor_cache/...py", line 1208
buf9 = empty_strided_cuda((s18, 17408), (17408, 1), torch.float16)
torch.OutOfMemoryError: Tried to allocate 138.00 MiB
OOM numbers from the failing run:
GPU total: 23.56 GiB
Free at failure: 131.75 MiB
Process memory in use: 23.40 GiB
PyTorch allocated: 22.59 GiB
CUDA graph private pools: 24.00 MiB
PyTorch reserved but unallocated: 472.77 MiB
Requested allocation: 138.00 MiB
Input Size
The synthetic stress test uses about 100000 chars, roughly 25K tokens, as a mock tool response. That synthetic test succeeds on the current vllm-text profile.
For the real OpenCode failure, vLLM did not log the exact prompt token count in the crash trace. Based on the failure class, it is happening in the same region: a large tool/context prefill of roughly the ~30K-token OpenCode tool/schema/context.
Configs Tested
We tested the relevant long-context vLLM profiles and memory-util variants, including 0.975 and 0.985 plus lower context lengths; the real OpenCode workload still breaks. Lowering/raising mem-util shifts the KV/activation tradeoff but does not reliably create enough activation headroom for the real request.
The current synthetic test passes on:
vllm-qwen36-27b-long-text
max_model_len=128000
gpu_memory_utilization=0.975
language_model_only=True
Synthetic result:
~25K-token mock tool prefill
HTTP 200
response length: 723 chars
finish=stop
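A minimal way to fire that kind of synthetic prefill at the running server is sketched below; it is illustrative only (the port and served model name are assumptions, and random filler only approximates a real ~25K-token tool result).

```bash
# Illustrative synthetic tool-prefill probe, not the repo's verify-stress script.
# ~100k chars of noisy base64 stands in for a large mock tool response.
FILLER=$(head -c 75000 /dev/urandom | base64 -w0)
curl -s http://localhost:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"qwen3.6-27b-autoround-int4\",
       \"messages\": [{\"role\": \"user\", \"content\": \"Summarize this tool output: ${FILLER}\"}],
       \"max_tokens\": 200}"
```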
Why It Fails
This is no longer the FA2 softmax_lse Cliff 1 path. The FA clamp sidecar and PN12 sidecar are both applying.
Confirmed applied:
[pn12_ffn_pool_anchor_fix] SiluAndMul.forward_cuda: applied
[fa_max_seqlen_clamp] _flash_attn_varlen: applied
The remaining failure is the FFN intermediate activation allocation:
empty_strided_cuda((s18, 17408), ..., torch.float16)
That buffer is roughly:
max_num_batched_tokens × intermediate_size × fp16
≈ 4128 × 17408 × 2 bytes
≈ 138-144 MiB
At the moment of failure, the GPU only has ~131 MiB free, so the allocation misses by a few MiB. Real OpenCode requests appear to create a worse activation/KV/cache pressure profile than the synthetic test, even when the synthetic 25K tool-prefill succeeds.
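For reference, the arithmetic works out as below (the exact MiB shifts a little with the real chunk size `s18`, but it lines up with the 138 MiB the allocator requested):

```bash
# Back-of-envelope check of the failing FFN intermediate buffer:
# max_num_batched_tokens x intermediate_size x 2 bytes (fp16)
echo $(( 4128 * 17408 * 2 ))            # 143720448 bytes
echo $(( 4128 * 17408 * 2 / 1048576 ))  # ~137 MiB, vs ~131 MiB free at failure
```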
AmazingDrivers4u@reddit (OP)
Thanks for the detailed diagnosis — your read is exactly right and our PN12 sidecar has a real gap on this. Filed it as
https://github.com/noonghunna/club-3090/issues/16 with the full analysis.
**TL;DR:** PN12 patches the eager-mode `SiluAndMul.forward_cuda`, but vLLM's torch.compile inductor-compiled FFN forward inlines the SiluAndMul op and never calls our patched method. So the pool is bypassed in the compile path. Our verify-stress 25K synthetic happens to hit shapes that go through eager, which is why it passes; real OpenCode 29K with sys+tools mixed prefill produces shapes that hit the inductor path and OOM at the FFN intermediate exactly where you saw it.
**Three workarounds, in order:**
1. **Stick with `tools-text.yml`** — already works for you. fp8 KV uses Genesis PN8 (not PN12), which closes Cliff 1 mech B via a different mechanism that does reach the compile path. 75K context handles your 30K OpenCode prefill comfortably.
2. **Add `--enforce-eager` to the `long-text.yml` or `long-vision.yml` command list** (sketch below). Forces all forwards through eager Python, where PN12's pool reliably applies. Costs ~20-30% TPS but preserves the 218K / 198K context. Just append it to the `command:` block before booting.
3. **Lower `--gpu-memory-utilization` to 0.94** on long-text/long-vision. Frees ~250 MiB of activation headroom at the cost of KV pool size (effective max_model_len drops to ~150K). Same idea as why our default 48K + 0.92 never hits this.
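For option 2, the change is a single extra line in the profile's compose file; the sketch below assumes the file/service naming used in the repo, so adjust for whichever profile you actually run.

```bash
# Sketch of workaround 2: add --enforce-eager to the profile's command list so
# every forward goes through eager Python, where PN12's pool applies.
# In long-text.yml (or long-vision.yml), append to the existing command block:
#
#   command:
#     - ... existing vllm serve flags ...
#     - --enforce-eager
#
# then recreate the service:
docker compose -f long-text.yml up -d --force-recreate
```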
I just pushed updated comments on all three affected composes documenting these escape hatches (commit 6bff99a).
Real fix needs an inductor-pass-level intervention or a torch.compile-aware sidecar — bigger work, tracking in #16. If anyone wants to dig into that, the briefing for it is in the issue.
VolandBerlioz@reddit
Great one! And in general the whole project! What you guys are doing is admirable! Thanks!
VolandBerlioz@reddit
Yeah absolutely, latest commit.
AmazingDrivers4u@reddit (OP)
please shoot a bug report on git.
Important_Quote_1180@reddit
Do you have any testing for the 35B A3B on a single 24GB card with RAM offloading? I'm stuck with one 3090, and I have 192 GB of DDR5 I can use. I want to load up LoRA adapters for Unreal Engine game design, but the 27B dense cannot fit a LoRA at any context level.
AmazingDrivers4u@reddit (OP)
Not at this point in time. I've tested 35B on a single card in the past but haven't gotten around to shipping a proper config for it yet.
NewtoAlien@reddit
Following this as a fellow single-3090 user 😁, thanks for the update.
tomz17@reddit
As an owner of several 3090s, following with interest. Keep up the good work.
shoonmcgregor@reddit
Thanks for the work, would love to see a minimal Docker image for this
jax_cooper@reddit
I've had the previous post open in a tab for like a week now, meaning to read it and try it out, but now I really have to.
ZachCope@reddit
Posting as a dual 3090 owner so I can find this thanks
Tough_Frame4022@reddit
What is the quality of recall on that 200k token context?
AmazingDrivers4u@reddit (OP)
100%
youcloudsofdoom@reddit
Just jumping in to say that I found your repo via another comment on this sub, and it's made this dual 3090 owner very happy - just got the dflash variant working and I am now never going back to my janky homebrewed llama.cpp build with 30 TG on 27B. Seeing a big jump up in p/p and t/s, as well as a notable increase in tool use stability with Hermes. Will be keeping an eye on the repo for more development, thanks for the work!
AmazingDrivers4u@reddit (OP)
By default thinking is disabled, but 27B does generate lots of tokens when thinking is enabled. I'm currently evaluating https://github.com/andthattoo/structured-cot/tree/main and if the bench results are positive I might include it in the build to help keep the guff out of thinking blocks.
jacek2023@reddit
I am following these posts, will try to reproduce all your workflows at some point later on my multiple 3090s
DeltaSqueezer@reddit
Can you share a bit more on this. What is the issue and impact?
Long_comment_san@reddit
Just a random question: y'all using presence penalty 1.5 like devs recommend or some alternative settings (like DRY)?