Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer

Posted by Electronic-Fly-6465@reddit | LocalLLaMA

I’ve been running Qwen3.6 MoE on vLLM behind a custom proxy and hit a specific reliability issue: occasional runaway reasoning loops.

This isn’t a criticism of Qwen3.6. The model is excellent — in my setup, it’s more robust than Qwen3.5 for agentic coding, path handling, debugging, and tool-style workflows. But occasionally, especially on file-path, debugging, and code-tracing prompts, it can get stuck inside a reasoning block and repeat itself endlessly.

At 180+ tokens/sec, even a 20–30 second loop burns through several thousand tokens, blocks GPU time, and stalls agents.

So I built a Reasoning Guard at the proxy layer.

Architecture

Client → Proxy → vLLM → Model

The proxy watches the streaming response as it leaves vLLM. It doesn’t modify the model weights, it doesn’t require a second LLM call, and it doesn’t use embeddings or semantic analysis. It just applies cheap, deterministic checks while the stream is active.
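
To make that concrete, here is a minimal sketch of what the stream-side wrapper can look like. This is not the author's code: the guard object and its `feed()`/`reason` interface are placeholder names I'm assuming.

```python
# Sketch of proxy-side stream interception (not the actual implementation).
from typing import AsyncIterator, Protocol


class Guard(Protocol):
    reason: str
    def feed(self, chunk: str) -> bool: ...    # True => cut the stream


class GuardTriggered(Exception):
    """Raised when the reasoning stream should be cut and recovery should start."""


async def guarded_stream(upstream: AsyncIterator[str], guard: Guard) -> AsyncIterator[str]:
    """Relay text deltas from vLLM to the client, running cheap checks on each one."""
    async for chunk in upstream:
        if guard.feed(chunk):            # deterministic checks only; no second LLM call
            raise GuardTriggered(guard.reason)
        yield chunk                      # otherwise pass the delta through untouched
```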

What It Checks

The guard currently monitors (sketched in code below):

  * How long the reasoning block has run, measured against configured token limits
  * Repetition in the streamed reasoning: repeated paragraphs, n-gram repetition, recurring sentence patterns, and common reasoning-loop openings
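
The exact heuristics and thresholds aren't published, so the following is only an illustrative sketch of the cheaper checks (a token budget plus n-gram repetition); every name and number is a made-up default, not the author's configuration.

```python
# Illustrative loop checks: a reasoning-token budget plus simple n-gram repetition.
from collections import deque


class LoopDetector:
    def __init__(self, max_reasoning_tokens: int = 2000, ngram: int = 8,
                 max_repeats: int = 3, window: int = 600):
        self.max_reasoning_tokens = max_reasoning_tokens  # hard budget for the thinking block
        self.ngram = ngram                                # length of the phrase we look for
        self.max_repeats = max_repeats                    # repeats that count as a loop
        self.words = deque(maxlen=window)                 # sliding window of recent words
        self.token_count = 0
        self.reason = ""

    def feed(self, chunk: str) -> bool:
        """Return True when the reasoning stream looks stuck."""
        new_words = chunk.split()
        self.token_count += len(new_words)        # words as a crude stand-in for tokens
        self.words.extend(new_words)

        if self.token_count > self.max_reasoning_tokens:
            self.reason = "token_limit"
            return True

        if len(self.words) >= self.ngram * self.max_repeats:
            words = list(self.words)
            tail = tuple(words[-self.ngram:])     # the most recent n-gram
            hits = sum(
                1
                for i in range(len(words) - self.ngram + 1)
                if tuple(words[i:i + self.ngram]) == tail
            )
            if hits >= self.max_repeats:          # the same phrase keeps coming back
                self.reason = "ngram_repetition"
                return True
        return False
```
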
Recovery Flow

When the guard triggers, it does the following (sketched in code after the list):

  1. Stops the upstream stream
  2. Captures the reasoning produced so far
  3. Reissues the request with that reasoning baked in as prior assistant context
  4. Disables thinking for the continuation
  5. Merges phase 1 and phase 2 usage stats
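
The implementation isn't shown in the post, so here is one plausible shape for steps 3–5, assuming vLLM's OpenAI-compatible chat endpoint and a Qwen-style `enable_thinking` switch exposed through `chat_template_kwargs`; the helper names and the continuation nudge are my own, and steps 1–2 happen in the stream wrapper shown earlier.

```python
# One plausible shape for steps 3-5; field names, the continuation nudge, and the
# enable_thinking switch are assumptions, not the author's exact payload.
def build_continuation(original_request: dict, partial_reasoning: str) -> dict:
    """Steps 3-4: reissue with the captured reasoning as prior assistant context, thinking off."""
    messages = list(original_request["messages"])
    messages.append({"role": "assistant", "content": partial_reasoning})   # reasoning so far
    messages.append({"role": "user", "content": "Continue and give the final answer."})
    return {
        **original_request,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": False},  # no new thinking block in phase 2
    }


def merge_usage(phase1: dict, phase2: dict) -> dict:
    """Step 5: report a single combined usage object to the client."""
    return {
        key: phase1.get(key, 0) + phase2.get(key, 0)
        for key in ("prompt_tokens", "completion_tokens", "total_tokens")
    }
```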

Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100 ms TTFT, so the client just sees reasoning flow directly into the final answer instead of hanging.

Good reasoning still comes through. The guard only steps in when reasoning exceeds configured limits or starts showing repetition patterns.

Why This Exists

This isn’t trying to compete with provider-side reasoning controls. OpenAI, Anthropic, DeepSeek, and others already have model/API-level systems for this. This is narrower: a practical runtime guard for teams running their own inference stack who want deterministic protection from runaway reasoning without changing the model or swapping proxies.

Observability

My proxy logs the details of each trigger.

I’ve tested it against both normal requests and stress cases derived from real trace logs. The loop detector catches repeated paragraphs, n-gram repetition, recurring sentence patterns, and common reasoning-loop openings. The cut-and-continue path has been validated end-to-end through the live proxy.
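
The real test corpus isn't shared, but as a usage example, here's the shape of a toy stress case run against the `LoopDetector` sketch from earlier (again, my own illustration, not the author's tests).

```python
# Toy stress case against the LoopDetector sketch above.
def test_catches_repeated_reasoning():
    detector = LoopDetector(max_reasoning_tokens=5000, ngram=8, max_repeats=3)
    chunk = "Let me check the file path again. The path resolves to src/utils. "
    # Stream the same "thought" over and over, the way a stuck model would.
    triggered = any(detector.feed(chunk) for _ in range(10))
    assert triggered and detector.reason == "ngram_repetition"
```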

Result

Before: Occasional 2000+ token reasoning blocks that went nowhere.
After: The model still reasons when useful, but runaway thinking gets cut and redirected into an answer.

It’s basically a proxy-level seatbelt for local LLM inference.

Not magic. No model surgery. Just stream interception, token counting, loop detection, and a clean recovery path.

I would love to discuss other neat mitigations like this that help smaller models operate more effectively.