The RAG pipeline kept failing silently in production. Turned out the chain abstraction was hiding the actual failure point. Here's what we changed.

Posted by MammothChildhood9298@reddit | LocalLLaMA

Had a multi-stage RAG pipeline in production:
query classification → parallel retrieval from two sources → re-ranking → generation. Ran fine in dev. In production under load, it would silently return degraded results with no error. Took me way too long to debug.

The problem: the framework I was using modelled the pipeline as a linear chain. When the web retrieval stage timed out, the chain caught the exception internally, continued with partial results, and the generation step had no idea it was working with incomplete context. No error surfaced. Just quietly bad outputs.
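To make the failure mode concrete, here's a minimal sketch of what the chain was effectively doing (function names are hypothetical, not the framework's actual code):

```python
import asyncio

async def retrieve_web(query: str) -> list[str]:
    # Simulate the web retriever hanging under load.
    await asyncio.sleep(10)
    return ["web result"]

async def chain(query: str) -> list[str]:
    context: list[str] = []
    # The chain catches the timeout internally and keeps going,
    # so the generation step never learns the context is incomplete.
    try:
        context += await asyncio.wait_for(retrieve_web(query), timeout=0.1)
    except asyncio.TimeoutError:
        pass  # swallowed: no error surfaces downstream
    context += ["vector db result"]  # second retriever succeeds
    return context  # silently degraded: web results are missing

print(asyncio.run(chain("q")))  # only the vector db result comes back
```

Generation then runs on that partial context as if nothing went wrong, which is exactly why the outputs looked plausible but bad.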

The fix was architectural. I needed:

  1. Each stage to be an isolated unit with explicit success/failure state

  2. The orchestrator to know the dependency graph — not just "run A then B then C" but "C depends on both A and B, and if A fails, C should fail loudly"

  3. Parallel execution where dependencies allow, not sequential by default

I ended up rebuilding it as a DAG: nodes are async tasks, edges are data dependencies. The re-ranker explicitly depends on both retrievers. If either fails, the re-ranker fails with a clear error instead of silently continuing. Been running SynapseKit for this:

github.com/SynapseKit/SynapseKit. Has anyone else hit silent failures in chain-based pipelines? Curious how others are handling failure isolation in production RAG.
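The fail-loud dependency idea can be sketched in plain asyncio (node names are hypothetical; this is the general pattern, not SynapseKit's actual API):

```python
import asyncio

async def retrieve_vector(query: str) -> list[str]:
    return [f"vector hit for {query!r}"]

async def retrieve_web(query: str) -> list[str]:
    raise TimeoutError("web retrieval timed out")  # simulated failure

async def rerank(query: str) -> list[str]:
    # Edge semantics: rerank depends on BOTH retrievers. gather() with the
    # default return_exceptions=False runs them in parallel and propagates
    # the first failure, so a dead dependency fails this node loudly
    # instead of quietly handing generation partial context.
    vector_hits, web_hits = await asyncio.gather(
        retrieve_vector(query), retrieve_web(query)
    )
    return vector_hits + web_hits

try:
    asyncio.run(rerank("local llms"))
except TimeoutError as e:
    print(f"re-ranker failed loudly: {e}")
```

The key design choice is that "continue with partial results" has to be an explicit, opt-in policy on an edge, never the default behavior of the orchestrator.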