Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer
Posted by Electronic-Fly-6465@reddit | LocalLLaMA
I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops.
This isn’t a criticism of Qwen3.6. The model is excellent — in my setup, it’s more robust than Qwen3.5 for agentic coding, path handling, debugging, and tool-style workflows. But occasionally, especially on file-path, debugging, and code-tracing prompts, it can get stuck inside a reasoning block and repeat itself endlessly.
At 180+ tokens/sec, even a 20–30 second loop burns through roughly 3,600–5,400 tokens, blocks GPU time, and stalls agents.
So I built a Reasoning Guard at the proxy layer.
Architecture
Client → Proxy → vLLM → Model
The proxy watches the streaming response as it leaves vLLM. It doesn’t modify the model weights, it doesn’t require a second LLM call, and it doesn’t use embeddings or semantic analysis. It just applies cheap, deterministic checks while the stream is active.
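Conceptually, the interception is just a thin wrapper around the upstream stream. Here is a stripped-down Python sketch of the idea; names like `guard_reasoning` and `looks_like_loop` are illustrative placeholders, not the production code:

```python
from typing import AsyncIterator

class GuardTripped(Exception):
    """Raised when the reasoning stream exceeds a cap or starts looping."""

async def guard_reasoning(upstream: AsyncIterator[str],
                          max_reasoning_tokens: int = 600) -> AsyncIterator[str]:
    """Pass reasoning deltas through while applying cheap checks per chunk."""
    buffer: list[str] = []   # reasoning text seen so far
    token_count = 0
    async for delta in upstream:
        buffer.append(delta)
        token_count += len(delta.split())  # crude estimate; real code counts model tokens
        if token_count > max_reasoning_tokens or looks_like_loop("".join(buffer)):
            # Caller catches this and switches to the cut-and-continue path.
            raise GuardTripped("".join(buffer))
        yield delta

def looks_like_loop(text: str) -> bool:
    # Placeholder; the repetition checks are sketched in the next section.
    return False
```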
What It Checks
The guard currently monitors the following (a rough sketch of the repetition checks follows the list):
- Reasoning token caps (configurable by effort level)
- Repeated paragraph detection
- Sliding-window n-gram repetition
- Repeated sentence fingerprinting
- Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
- Cut-and-continue recovery path
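A simplified Python sketch of the repetition side of those checks; the thresholds and helper names here are illustrative, not the exact values I run:

```python
import hashlib
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 8, window: int = 400) -> float:
    """Fraction of duplicated n-grams over the last `window` words."""
    words = text.split()[-window:]
    if len(words) < n:
        return 0.0
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return duplicates / len(grams)

def repeated_paragraphs(text: str, min_repeats: int = 2) -> bool:
    """Fingerprint each paragraph; flag when an identical one recurs."""
    seen: Counter = Counter()
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        fingerprint = hashlib.md5(para.lower().encode()).hexdigest()
        seen[fingerprint] += 1
        if seen[fingerprint] >= min_repeats:
            return True
    return False

def looks_like_loop(text: str) -> bool:
    # Illustrative thresholds; tune against your own trace logs.
    return ngram_repetition_ratio(text) > 0.35 or repeated_paragraphs(text)
```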
Recovery Flow
When the guard triggers, it:
- Stops the upstream stream
- Captures the reasoning produced so far
- Reissues the request with that reasoning baked in as prior assistant context
- Disables thinking for the continuation
- Merges phase 1 and phase 2 usage stats
Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client just sees reasoning flow directly into the final answer instead of hanging.
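A simplified sketch of how phase 2 can be reissued against an OpenAI-compatible vLLM endpoint. Treat the details as assumptions about a Qwen-style setup, not my exact implementation: the model name is a placeholder, and `enable_thinking` in `chat_template_kwargs` is the usual way Qwen-style templates toggle thinking in vLLM.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def continue_after_cut(messages: list[dict], captured_reasoning: str,
                       model: str = "qwen3-moe"):
    """Reissue the request with the captured reasoning baked in, thinking off."""
    # The partial reasoning becomes prior assistant context so the prefix
    # cache can reuse it; the new turn then goes straight to the answer.
    followup = messages + [{"role": "assistant", "content": captured_reasoning}]
    return client.chat.completions.create(
        model=model,
        messages=followup,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumption: Qwen-style flag
    )
```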
Good reasoning still comes through. The guard only steps in when reasoning exceeds configured limits or starts showing repetition patterns.
Why This Exists
This isn’t trying to compete with provider-side reasoning controls. OpenAI, Anthropic, DeepSeek, and others already have model/API-level systems for this. This is narrower: a practical runtime guard for teams running their own inference stack who want deterministic protection from runaway reasoning without changing the model or swapping proxies.
Observability
My proxy logs each trigger with (example entry after the list):
- Whether the guard fired
- Trigger reason
- Token cap used
- Reasoning token count
- Merged total usage
- Stream-end metadata
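An illustrative shape for one of those log entries; the field names mirror the list above but are simplified, not my exact schema:

```python
import json
import time

def log_guard_event(fired: bool, reason: str, token_cap: int,
                    reasoning_tokens: int, usage: dict, finish_reason: str) -> None:
    """Emit one structured guard record per request."""
    entry = {
        "ts": time.time(),
        "guard_fired": fired,
        "trigger_reason": reason,        # e.g. "ngram_repetition" or "token_cap"
        "token_cap": token_cap,
        "reasoning_tokens": reasoning_tokens,
        "merged_usage": usage,           # phase 1 + phase 2 totals
        "finish_reason": finish_reason,  # stream-end metadata
    }
    print(json.dumps(entry))             # or ship to your log pipeline
```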
I’ve tested it against both normal requests and stress cases derived from real trace logs. The loop detector catches repeated paragraphs, n-gram repetition, recurring sentence patterns, and common reasoning-loop openings. The cut-and-continue path has been validated end-to-end through the live proxy.
Result
Before: Occasional 2000+ token reasoning blocks that went nowhere.
After: The model still reasons when useful, but runaway thinking gets cut and redirected into an answer.
It’s basically a proxy-level seatbelt for local LLM inference.
Not magic. No model surgery. Just stream interception, token counting, loop detection, and a clean recovery path.
I would love to discuss other neat mitigations like this that help smaller models operate more effectively.
Clean_Initial_9618@reddit
Sorry, I am running Qwen3.6 27B on my RTX 3090 and yes, it keeps looping in its thinking. How does this solution solve it?
Electronic-Fly-6465@reddit (OP)
You will need to have a proxy layer in between. It takes a bit to get correct: there are many different cases to handle, and the timing and token preservation need to be exact.
What are you serving it with now, llama.cpp or LM Studio? I am using vLLM.
LetsGoBrandon4256@reddit
Double check your sampler settings https://huggingface.co/Qwen/Qwen3.6-27B#using-qwen36-via-the-chat-completions-api
Qwen needs repetition_penalty, even higher for the MoE models.
Cultural_Meeting_240@reddit
nice solution. i had the same issue with qwen3 getting stuck in think loops during agentic stuff. proxy layer fix makes way more sense than messing with model params. curious how much token waste you actually saved after deploying this.
Electronic-Fly-6465@reddit (OP)
I had it running on a C++ skateboarding game for a few hours, and what would normally be 16k thinking blocks every 5–10 minutes are now cleanly stopped at 600 tokens.
The agent harness doesn't even notice it; it's like when another big request hits and things delay a bit, is all.
I actually run most of my tokens without reasoning. Proper context beats reasoning in my experience.
the-username-is-here@reddit
> Not magic. No model surgery.
No code, no examples, no stats.
Everyone and their dog does anti-loop guards. I did mine as Pi extension which just injects steering. Your point being?..