Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix)

Posted by AmazingDrivers4u@reddit | LocalLLaMA

Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS).

We’ve been pushing further on both context length and stability for tool-agent workloads.

Current results:

- ~218K context (text-only) @ ~50/66 TPS (narrative/code)

- ~198K context (with vision) @ ~51/68 TPS (narrative/code)

- tool calls with ~25K-token outputs now complete without OOM

That’s lower TPS than our earlier config, but significantly more context and better stability under real workloads.

---

### What changed

Previously, long tool outputs (~25K tokens) would consistently crash with an OOM during prefill.

This turned out to be a Genesis patch (PN12) that was supposed to mitigate a memory issue but had silently stopped applying on vLLM dev205+:

- `apply_all` reported success

- but the underlying code path was unchanged

Root cause was anchor drift in the patch: the upstream source lines the patch anchors on had changed in newer vLLM builds, so the patcher matched nothing and skipped the hunk while still reporting success.

After fixing that, the tool-prefill OOM disappeared and higher context configs became usable.

Fix:

https://github.com/Sandermage/genesis-vllm-patches (PR #13)
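
For anyone maintaining similar patch sets: this failure mode is generic to anchor-based patching, where the patcher searches for a source snippet and rewrites it in place. A minimal sketch of a defensive version that fails loudly instead of silently no-op’ing (hypothetical code, not the actual genesis-vllm-patches implementation; `apply_patch` and its signature are made up for illustration):

```python
# Hypothetical anchor-based patcher that verifies its own work.
from pathlib import Path

def apply_patch(path: Path, anchor: str, replacement: str) -> None:
    """Replace `anchor` with `replacement` in `path`, failing loudly on drift."""
    src = path.read_text()
    if anchor not in src:
        # Anchor drift: upstream moved or rewrote the anchored lines, so a
        # naive patcher would match nothing and still report success.
        raise RuntimeError(f"anchor not found in {path}; upstream code drifted")
    patched = src.replace(anchor, replacement, 1)
    if patched == src:
        raise RuntimeError(f"patch for {path} was a no-op")
    path.write_text(patched)
```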

---

### What we’re optimizing for

The goal here isn’t just max TPS or max context in isolation; it’s pushing both together on a single 3090 (see the config sketch after this list):

- high context (200K+)

- usable throughput

- stable tool-agent workloads
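
For a sense of what that looks like in practice, here’s a rough single-GPU long-context engine setup via vLLM’s Python API. The exact settings live in the repro repo linked below; the model id, quantization, and KV-cache dtype here are assumptions for illustration, not our measured config:

```python
# Illustrative long-context single-GPU setup; all values are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B",     # placeholder id; point at your local weights
    max_model_len=218_000,        # the ~218K text config (assumes the model/rope config allows it)
    gpu_memory_utilization=0.95,  # leave only a small margin on the 3090's 24 GB
    kv_cache_dtype="fp8",         # assumption: FP8 KV cache to shrink the context footprint
    quantization="awq",           # assumption; results depend heavily on the quant used
    # tensor_parallel_size=2,     # 2x 3090 also avoids the ~50-60K single-GPU cliff (see below)
)

out = llm.generate(["very long prompt ..."], SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)
```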

---

### Notes / limitations

- There’s still a second memory cliff around ~50–60K tokens for single-prompt workloads on 1 GPU

- That cliff doesn’t show up with tensor parallelism (e.g. 2× 3090)

- Results depend heavily on quantization and engine config

---

### Repro

https://github.com/noonghunna/club-3090

---

Curious how others are balancing context vs TPS on 3090/4090 setups.