Best architecture for internal support system + log anomaly detection (RAG + observability)?
Posted by Brief_Watch7221@reddit | ExperiencedDevs | 12 comments
Hi all,
I’m working on designing an internal system for an oceanographic/environmental data company, and I’d really value input from people who’ve built similar systems in production.
We monitor sensor data across ports and harbours, and I’m trying to design a system with two main components:
1. Internal support / knowledge system
• Centralised knowledge base (likely from structured docs like Obsidian or similar)
• Natural language querying for internal engineers/support team
• Strong requirement: very high accuracy with minimal hallucination
• Ideally with citations / traceability
2. Log analysis + anomaly detection
• Sensor logs (format still being defined)
• Detect anomalies or failures before customers report them
• Integrate with dashboards (we currently use ThingsBoard)
⸻
What I’m trying to figure out:
• Is a RAG-based system the right approach for the support side?
• For logs:
• Do you preprocess + structure logs first, or ever feed raw logs into LLMs?
• Are people combining traditional anomaly detection (rules/ML) with LLMs for explanation?
• Recommended stack:
• LLMs (open-source vs API?)
• Embeddings + vector DB choices
• Time-series/log storage solutions
• How are people handling:
• Hallucination control in production?
• Evaluation / observability of LLM outputs?
• False positives in anomaly detection?
⸻
Constraints:
• Likely self-hosted (we have IONOS servers)
• Early-stage, so still exploring architecture
• Logs/data scale not fully known yet
⸻
I’m not looking for generic advice; I’m more interested in real architectures, lessons learned, or things that failed.
If you’ve built something similar (RAG systems, observability tools, log analysis pipelines), I’d love to hear how you approached it.
Thanks!
Effective-Wave-8486@reddit
Been down this road with IoT sensor networks before. For the RAG component, I'd recommend separating your embedding pipeline from your retrieval service. Use something like Pinecone or Weaviate for vector storage, and consider chunking your docs by topic/system rather than fixed sizes since you're dealing with technical documentation.
For anomaly detection on sensor data, statistical methods often work better than ML for environmental data because seasonality and environmental factors are huge. I've had good luck with isolation forests combined with time series decomposition to handle seasonal patterns.
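The decomposition-plus-isolation-forest combination described above can be sketched roughly like this. Everything here is invented for illustration (the fake tide-like signal, the window of 14 days of hourly readings, the injected spike), and it assumes scikit-learn is available; the "decomposition" step is a deliberately crude per-hour-of-day baseline rather than a full STL decomposition:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
t = np.arange(24 * 14)                         # 14 days of hourly readings
signal = 10 + 3 * np.sin(2 * np.pi * t / 24)   # daily tide-like cycle
signal += rng.normal(0, 0.3, t.size)           # sensor noise
signal[100] += 8.0                             # injected fault spike

# Crude seasonal decomposition: subtract the per-hour-of-day mean so the
# isolation forest sees residuals, not the (normal) daily cycle.
hour = t % 24
baseline = np.array([signal[hour == h].mean() for h in range(24)])[hour]
residuals = (signal - baseline).reshape(-1, 1)

clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(residuals)            # -1 marks an anomaly
anomalies = np.where(labels == -1)[0]
print(anomalies)                               # the spike at index 100 should appear here
```

Running the forest on residuals rather than raw readings is the whole point: a 13-unit reading at high tide is normal, while the same reading at low tide is not.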
Architecture-wise, keep your knowledge base updates async from queries. Queue doc updates through something like Redis or RabbitMQ so ingestion doesn't impact response times. Also make sure your vector embeddings can handle technical jargon specific to oceanographic equipment.
The real challenge will be keeping your knowledge base current with sensor configs and troubleshooting steps. Build in workflows for your engineers to easily update procedures when they solve new issues.
Brief_Watch7221@reddit (OP)
This is really helpful, thanks! The separation between embedding and retrieval and the async ingestion point both clicked for me.
One thing I didn’t mention earlier, our logs are a mix of ASCII (plain text) and JSON. From what you said, it sounds like the right approach would be to first build a parsing/normalisation layer to convert everything into a consistent structured format before doing any anomaly detection or feeding anything downstream. Does that match how you handled it?
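The normalisation layer described here can be a thin function that tries JSON first and falls back to a text pattern. This is only a sketch; the plain-text line format, field names, and common schema below are all made up, since the actual log format is still being defined:

```python
import json
import re

# Hypothetical plain-text line format for illustration:
#   "2024-05-01T12:00:00Z sensor-17 WARN salinity out of range"
TEXT_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\w+)\s+(.*)$")

def normalise(line: str) -> dict:
    """Parse a raw log line (JSON or plain text) into one common schema."""
    line = line.strip()
    try:
        rec = json.loads(line)
        return {
            "ts": rec.get("timestamp"),
            "sensor": rec.get("sensor_id"),
            "level": rec.get("level", "INFO"),
            "message": rec.get("message", ""),
        }
    except json.JSONDecodeError:
        m = TEXT_RE.match(line)
        if not m:
            # Keep unparseable lines rather than dropping them silently.
            return {"ts": None, "sensor": None, "level": "UNPARSED", "message": line}
        ts, sensor, level, msg = m.groups()
        return {"ts": ts, "sensor": sensor, "level": level, "message": msg}

json_rec = normalise('{"timestamp": "2024-05-01T12:00:00Z", "sensor_id": "s1", "level": "ERROR", "message": "dropout"}')
text_rec = normalise("2024-05-01T12:00:00Z sensor-17 WARN salinity out of range")
print(json_rec)
print(text_rec)
```

Everything downstream (anomaly detection, storage, LLM explanation) then only ever sees the one schema, so adding a new input format later means adding one parser, not touching the pipeline.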
A couple of follow-ups if you don’t mind:
Also really agree with your point about keeping the knowledge base updated; that feels like it could become the biggest bottleneck long-term.
Appreciate you sharing this, super useful to hear from someone who’s actually worked on similar systems.
micseydel@reddit
You're replying to a bot https://www.reddit.com/r/softwaredevelopment/comments/1silbyl/comment/ofkwoxr/ https://imgur.com/a/rW0PKvu
Brief_Watch7221@reddit (OP)
Ohh 😮
EnderWT@reddit
Obvious bot
Brief_Watch7221@reddit (OP)
Not a bot bro 🤖
xpingu69@reddit
I am building a RAG. I have different loaders that scrape the internet and create the docs. I have a chunking and indexing pipeline to create embeddings (I use pgvector + text-embedding-3-large). Then I have a retrieval service for the queries. I store all the metadata of the docs, and then I can do a join on the source for each chunk to create citations. But there are some docs where citations are more complicated. Most of the work is in the chunking, indexing, and retrieval. Depending on the query, you want to use different retrieval strategies, and maybe different chunking strategies. This may help: https://graphrag.com (I don't have a graph, but it explains the concepts).
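The citation join described here boils down to: chunks carry a `doc_id`, docs carry source metadata, and after vector retrieval you join back to build the citation. This sketch uses in-memory SQLite purely to show the join shape (the real thing would be Postgres + pgvector); all table and column names, the sample doc, and the "retrieved" chunk ids are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE docs(id INTEGER PRIMARY KEY, title TEXT, url TEXT);
CREATE TABLE chunks(id INTEGER PRIMARY KEY, doc_id INTEGER, content TEXT);
INSERT INTO docs VALUES (1, 'Sensor calibration guide', 'https://example.internal/calib');
INSERT INTO chunks VALUES (10, 1, 'Calibrate the salinity probe monthly...');
""")

# Pretend the vector similarity search already returned these chunk ids;
# join back to docs to attach a citation to each retrieved chunk.
retrieved = [10]
placeholders = ",".join("?" * len(retrieved))
rows = con.execute(
    "SELECT c.content, d.title, d.url FROM chunks c "
    f"JOIN docs d ON d.id = c.doc_id WHERE c.id IN ({placeholders})",
    retrieved,
).fetchall()
for content, title, url in rows:
    print(f"{content!r} (source: {title}, {url})")
```

Keeping the metadata in the same database as the vectors (as pgvector allows) is what makes this a single cheap join instead of a second lookup service.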
Leading_Yoghurt_5323@reddit
for logs, don’t feed raw into LLMs… preprocess + detect anomalies with rules/ML first, then use LLMs just for explanation
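The split being suggested here is: a cheap deterministic rule fires first, and the LLM only ever sees the already-detected event, never the raw log stream. A minimal sketch, with the sensor name, range, and reading invented, and the actual model call left as a stub:

```python
def threshold_rule(reading: float, low: float = 0.0, high: float = 40.0) -> bool:
    """Plain rule-based detection: flag anything outside the expected range."""
    return not (low <= reading <= high)

def build_explanation_prompt(sensor: str, reading: float) -> str:
    """The LLM's only input: a compact summary of the detected event."""
    return (f"Sensor {sensor} reported {reading}, outside its expected range. "
            "Explain likely causes for a support engineer.")

event = {"sensor": "temp-03", "reading": 57.2}
if threshold_rule(event["reading"]):
    prompt = build_explanation_prompt(event["sensor"], event["reading"])
    print(prompt)  # this string, not the logs, is what would be sent to the LLM
```

Besides cutting token cost, this keeps detection deterministic and testable; the LLM can be wrong about the explanation without ever causing a missed or phantom alert.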
psaux_grep@reddit
For logs, both ELK and Sentry work great in a self-hosted environment.
Once it’s up and running I’d argue it’s mostly maintenance-free, but there’s a bit more effort involved in setting these up nowadays than there used to be, especially ELK, which has gotten quite «needy» on the setup. Still can’t believe how poorly the UX of Kibana has evolved over the last decade…
Brief_Watch7221@reddit (OP)
That’s really helpful, thanks. I hadn’t fully considered using something like ELK or Sentry as the base layer instead of building log storage myself.
Makes sense to separate concerns and let a dedicated system handle ingestion/search/visualisation.
In your experience, did you use ELK just for storage and querying, or also for any kind of anomaly detection/alerting? I’m trying to figure out where to draw the line between what tools like ELK handle vs what I should build separately (e.g. statistical detection + LLM layer on top).
Also good to know about the overhead, I’ve heard similar things about ELK being heavy. Would you still recommend it for a smaller setup, or would something lighter (like Loki/Grafana or similar) be more practical to start with?
Appreciate the insight, especially the real-world tradeoffs.
NANO56@reddit
Separation of concerns is the way to go. Using an established logging stack will give you traditional observability, which is easy to build.
We created a clever process to pull logs out: an LLM writes the Elastic query, which then pulls the logs for analysis. For our time-series data, it also writes Prometheus queries. The user uses natural language with RAG to investigate. I’m hand-waving the implementation, but hopefully it makes sense.
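One detail worth adding to this flow: you probably want to validate the LLM-written query before running it against Elasticsearch. A hedged sketch, with the model call and the ES client both stubbed out, and the field allowlist and query entirely made up:

```python
import json

# Only fields the generated query is allowed to touch (invented names).
ALLOWED_FIELDS = {"sensor_id", "level", "@timestamp", "message"}

def fake_llm_write_query(question: str) -> str:
    """Stand-in for the real model call: returns Elasticsearch query DSL as JSON."""
    return json.dumps({
        "query": {"bool": {"filter": [
            {"term": {"sensor_id": "s17"}},
            {"range": {"@timestamp": {"gte": "now-1h"}}},
        ]}}
    })

def validate(dsl: dict) -> bool:
    """Reject generated queries that reference fields outside the allowlist."""
    def fields(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k in ("term", "range", "match"):
                    yield from v.keys()
                yield from fields(v)
        elif isinstance(node, list):
            for item in node:
                yield from fields(item)
    return all(f in ALLOWED_FIELDS for f in fields(dsl))

dsl = json.loads(fake_llm_write_query("errors from sensor s17 in the last hour"))
print("safe to run:", validate(dsl))
```

The nice property of "LLM writes the query" over "LLM reads the logs" is exactly this: the generated artifact is small, structured, and checkable before anything touches the data store.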
Dense_Gate_5193@reddit
For a system monitoring harbor sensors and managing engineering docs, you might want to look at a Graph-RAG architecture rather than a standard vector-only setup. Instead of just chunking your Obsidian files into a vector DB, you can treat them as a Canonical Graph Ledger. This maps your docs into a graph where relationships—like a specific sensor model linked to its failure modes—are preserved as edges. When a support query comes in, the system isn't just "guessing" based on text similarity; it’s traversing known paths in the graph, which is a lot more reliable for providing the citations and traceability your engineers need.
On the observability side, feeding raw logs directly to an LLM is a token-burn, so the play is to structure those logs as nodes in the same database. By using a co-located engine (like NornicDB), you can run your traditional anomaly rules (thresholds/ML) and then have the database automatically link those "Anomaly" nodes to the relevant "Support Doc" nodes via embedding similarity. This allows you to achieve sub-10ms retrieval while maintaining strict Schema Contracts (using REQUIRE blocks) to keep the LLM from hallucinating data that doesn't exist in your sensor registry. It’s basically a single-binary way to handle the whole loop—from detection to explanation—without the latency tax of a massive microservice sprawl.
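Setting the specific engine aside, the graph idea in this comment reduces to: store the sensor-to-failure-mode-to-doc relationships as explicit edges and traverse them at query time, instead of hoping embedding similarity lands on the right chunk. A minimal sketch with a plain adjacency dict; every node name is invented:

```python
# Edges link a sensor model to its known failure modes, and those to the
# support docs that cover them (all names are made up for illustration).
edges = {
    "sensor:CTD-900": ["failure:biofouling", "failure:cable-corrosion"],
    "failure:biofouling": ["doc:cleaning-procedure"],
    "failure:cable-corrosion": ["doc:connector-replacement"],
}

def traverse(node: str, depth: int = 2) -> list:
    """Collect all nodes reachable from `node` within `depth` hops."""
    found = []
    frontier = [node]
    for _ in range(depth):
        nxt = []
        for n in frontier:
            for m in edges.get(n, []):
                found.append(m)
                nxt.append(m)
        frontier = nxt
    return found

print(traverse("sensor:CTD-900"))
```

Because every hop is a stored edge, each retrieved doc comes with a traceable path ("CTD-900 → biofouling → cleaning procedure") that doubles as the citation the support team needs.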