Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer
Posted by Electronic-Fly-6465@reddit | LocalLLaMA
I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops.
This isn’t a criticism of Qwen3.6. The model is excellent — in my setup, it’s more robust than Qwen3.5 for agentic coding, path handling, debugging, and tool-style workflows. But occasionally, especially on file-path, debugging, and code-tracing prompts, it can get stuck inside a reasoning block and repeat itself endlessly.
At 180+ tokens/sec, even a 20–30 second loop burns through roughly 3,600–5,400 tokens, blocks GPU time, and stalls agents.
So I built a Reasoning Guard at the proxy layer.
Architecture
Client → Proxy → vLLM → Model
The proxy watches the streaming response as it leaves vLLM. It doesn’t modify the model weights, it doesn’t require a second LLM call, and it doesn’t use embeddings or semantic analysis. It just applies cheap, deterministic checks while the stream is active.
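Conceptually, the interception is just a thin wrapper around the upstream stream. Here is a stripped-down Python sketch of the idea; names like `guard_reasoning` and `looks_like_loop` are illustrative placeholders, not the production code:

```python
from typing import AsyncIterator

class GuardTripped(Exception):
    """Raised when the reasoning stream exceeds a cap or starts looping."""

async def guard_reasoning(upstream: AsyncIterator[str],
                          max_reasoning_tokens: int = 600) -> AsyncIterator[str]:
    """Pass reasoning deltas through while applying cheap checks per chunk."""
    buffer: list[str] = []   # reasoning text seen so far
    token_count = 0
    async for delta in upstream:
        buffer.append(delta)
        token_count += len(delta.split())  # crude estimate; real code counts model tokens
        if token_count > max_reasoning_tokens or looks_like_loop("".join(buffer)):
            # Caller catches this and switches to the cut-and-continue path.
            raise GuardTripped("".join(buffer))
        yield delta

def looks_like_loop(text: str) -> bool:
    # Placeholder; the repetition checks are sketched in the next section.
    return False
```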
What It Checks
The guard currently monitors the following (a rough sketch of the repetition checks follows the list):
- Reasoning token caps (configurable by effort level)
- Repeated paragraph detection
- Sliding-window n-gram repetition
- Repeated sentence fingerprinting
- Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
- Cut-and-continue recovery path
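A simplified Python sketch of the repetition side of those checks; the thresholds and helper names here are illustrative, not the exact values I run:

```python
import hashlib
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 8, window: int = 400) -> float:
    """Fraction of duplicated n-grams over the last `window` words."""
    words = text.split()[-window:]
    if len(words) < n:
        return 0.0
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return duplicates / len(grams)

def repeated_paragraphs(text: str, min_repeats: int = 2) -> bool:
    """Fingerprint each paragraph; flag when an identical one recurs."""
    seen: Counter = Counter()
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        fingerprint = hashlib.md5(para.lower().encode()).hexdigest()
        seen[fingerprint] += 1
        if seen[fingerprint] >= min_repeats:
            return True
    return False

def looks_like_loop(text: str) -> bool:
    # Illustrative thresholds; tune against your own trace logs.
    return ngram_repetition_ratio(text) > 0.35 or repeated_paragraphs(text)
```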
Recovery Flow
When the guard triggers, it:
- Stops the upstream stream
- Captures the reasoning produced so far
- Reissues the request with that reasoning baked in as prior assistant context
- Disables thinking for the continuation
- Merges phase 1 and phase 2 usage stats
Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client just sees reasoning flow directly into the final answer instead of hanging.
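A simplified sketch of how phase 2 can be reissued against an OpenAI-compatible vLLM endpoint. Treat the details as assumptions about a Qwen-style setup, not my exact implementation: the model name is a placeholder, and `enable_thinking` in `chat_template_kwargs` is the usual way Qwen-style templates toggle thinking in vLLM.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def continue_after_cut(messages: list[dict], captured_reasoning: str,
                       model: str = "qwen3-moe"):
    """Reissue the request with the captured reasoning baked in, thinking off."""
    # The partial reasoning becomes prior assistant context so the prefix
    # cache can reuse it; the new turn then goes straight to the answer.
    followup = messages + [{"role": "assistant", "content": captured_reasoning}]
    return client.chat.completions.create(
        model=model,
        messages=followup,
        stream=True,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumption: Qwen-style flag
    )
```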
Good reasoning still comes through. The guard only steps in when reasoning exceeds configured limits or starts showing repetition patterns.
Why This Exists
This isn’t trying to compete with provider-side reasoning controls. OpenAI, Anthropic, DeepSeek, and others already have model/API-level systems for this. This is narrower: a practical runtime guard for teams running their own inference stack who want deterministic protection from runaway reasoning without changing the model or swapping proxies.
Observability
My proxy logs each trigger with (example entry after the list):
- Whether the guard fired
- Trigger reason
- Token cap used
- Reasoning token count
- Merged total usage
- Stream-end metadata
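An illustrative shape for one of those log entries; the field names mirror the list above but are simplified, not my exact schema:

```python
import json
import time

def log_guard_event(fired: bool, reason: str, token_cap: int,
                    reasoning_tokens: int, usage: dict, finish_reason: str) -> None:
    """Emit one structured guard record per request."""
    entry = {
        "ts": time.time(),
        "guard_fired": fired,
        "trigger_reason": reason,        # e.g. "ngram_repetition" or "token_cap"
        "token_cap": token_cap,
        "reasoning_tokens": reasoning_tokens,
        "merged_usage": usage,           # phase 1 + phase 2 totals
        "finish_reason": finish_reason,  # stream-end metadata
    }
    print(json.dumps(entry))             # or ship to your log pipeline
```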
I’ve tested it against both normal requests and stress cases derived from real trace logs. The loop detector catches repeated paragraphs, n-gram repetition, recurring sentence patterns, and common reasoning-loop openings. The cut-and-continue path has been validated end-to-end through the live proxy.
Result
Before: Occasional 2000+ token reasoning blocks that went nowhere.
After: The model still reasons when useful, but runaway thinking gets cut and redirected into an answer.
It’s basically a proxy-level seatbelt for local LLM inference.
Not magic. No model surgery. Just stream interception, token counting, loop detection, and a clean recovery path.
I would love to discuss other neat mitigations like this that help smaller models operate more effectively.
Clean_Initial_9618@reddit
Sorry, I am running Qwen3.6 27B on my RTX 3090 and yes, it keeps looping in its thinking. How does this solution solve it?
Electronic-Fly-6465@reddit (OP)
You will need to have a proxy layer in between. It takes a bit to get correct: there are many different cases to handle, and the timing and token preservation need to be exact.
What are you serving it with now, llama.cpp or LM Studio? I am using vLLM.
LetsGoBrandon4256@reddit
Double check your sampler settings https://huggingface.co/Qwen/Qwen3.6-27B#using-qwen36-via-the-chat-completions-api
Qwen needs repetition_penalty, even higher for the MoE models.
Cultural_Meeting_240@reddit
nice solution. i had the same issue with qwen3 getting stuck in think loops during agentic stuff. proxy layer fix makes way more sense than messing with model params. curious how much token waste you actually saved after deploying this.
Electronic-Fly-6465@reddit (OP)
I had it running on a C++ skateboarding game for a few hours, and what would normally be 16k thinking blocks every 5–10 minutes are now cleanly stopped at 600 tokens.
The agent harness doesn't even notice it; it's like when another big request hits and things delay a bit, is all.
I actually run most of my tokens without reasoning. Proper context beats reasoning in my experience.
the-username-is-here@reddit
> Not magic. No model surgery.
No code, no examples, no stats.
Everyone and their dog does anti-loop guards. I did mine as Pi extension which just injects steering. Your point being?..