Best architecture for internal support system + log anomaly detection (RAG + observability)?

Posted by Brief_Watch7221@reddit | ExperiencedDevs | 12 comments

Hi all,

I’m working on designing an internal system for an oceanographic/environmental data company, and I’d really value input from people who’ve built similar systems in production.

We monitor sensor data across ports and harbours, and I’m trying to design a system with two main components:

  1. Internal support / knowledge system

    • Centralised knowledge base (likely sourced from structured docs, e.g. an Obsidian vault or similar)

    • Natural language querying for internal engineers/support team

    • Strong requirement: very high accuracy with minimal hallucination

    • Ideally with citations / traceability

  2. Log analysis + anomaly detection

    • Sensor logs (format still being defined)

    • Detect anomalies or failures before customers report them

    • Integrate with dashboards (we currently use ThingsBoard)
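For context on part 1, the retrieval-plus-citation flow I have in mind looks roughly like this. It's a minimal sketch: a toy bag-of-words similarity stands in for a real embedding model, and the chunk IDs, doc paths, and KB content are all made up — the point is just that every answer carries its source so it stays traceable:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a sentence-embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical knowledge-base chunks, each tagged with its source doc for citations.
CHUNKS = [
    {"id": "kb-001", "source": "sensors/calibration.md",
     "text": "Tide gauges must be recalibrated every six months"},
    {"id": "kb-002", "source": "ops/alerts.md",
     "text": "Salinity alerts fire when readings drift beyond two standard deviations"},
]

def retrieve(query, k=1):
    """Rank chunks by similarity and return the top-k, sources included."""
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]

def answer(query):
    """In the real system the hits would be passed to an LLM constrained to
    answer only from them; here we just return text plus citation."""
    return [(h["text"], h["source"]) for h in retrieve(query)]
```

The citation field is also the main hallucination-control lever I'm considering: if the generated answer can't be tied back to a retrieved chunk, it gets refused rather than served.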

What I’m trying to figure out:

•   Is a RAG-based system the right approach for the support side?

•   For logs:

    •   Do you preprocess + structure logs first, or ever feed raw logs into LLMs?

    •   Are people combining traditional anomaly detection (rules/ML) with LLMs for explanation?

•   Recommended stack:

    •   LLMs (open-source vs API?)

    •   Embeddings + vector DB choices

    •   Time-series/log storage solutions

•   How are people handling:

    •   Hallucination control in production?

    •   Evaluation / observability of LLM outputs?

    •   False positives in anomaly detection?
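On the "rules/ML plus LLM for explanation" question, the split I keep sketching is: a cheap deterministic detector flags windows, and only flagged windows ever reach the LLM, which is asked to explain rather than detect. A minimal version with a z-score detector — the threshold, sensor ID, and sample readings are invented for illustration:

```python
import statistics

def detect_anomalies(readings, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the window mean.
    A stand-in for whatever rule/ML detector the pipeline settles on."""
    mean = statistics.mean(readings)
    stdev = statistics.stdev(readings)
    if stdev == 0:
        return []
    return [i for i, r in enumerate(readings) if abs(r - mean) / stdev > threshold]

def build_explanation_prompt(sensor_id, readings, anomaly_indices):
    """Only anomalous windows are summarised into a prompt for the LLM."""
    points = ", ".join(f"index {i}: {readings[i]}" for i in anomaly_indices)
    return (
        f"Sensor {sensor_id} produced anomalous readings ({points}) "
        f"against a window mean of {statistics.mean(readings):.2f}. "
        "Summarise the likely failure mode for the on-call engineer."
    )

# Example window: one obvious spike among steady tide-gauge readings.
window = [10.1, 10.0, 9.9, 10.2, 10.0, 55.0, 10.1, 9.8]
flags = detect_anomalies(window, threshold=2.0)
if flags:
    print(build_explanation_prompt("tide-gauge-07", window, flags))
```

The appeal of this split for false positives is that the detector's threshold can be tuned independently of anything LLM-related, and LLM cost/latency only applies to the (hopefully rare) flagged windows.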

Constraints:

•   Likely self-hosted (we have IONOS servers)

•   Early-stage, so still exploring architecture

•   Logs/data scale not fully known yet

I’m not looking for generic advice; I’m more interested in real architectures, lessons learned, or things that failed.

If you’ve built something similar (RAG systems, observability tools, log analysis pipelines), I’d love to hear how you approached it.

Thanks!