Simple Multi-Agent Architecture Running Across Our Entire Org. Keeping everything in Loop.

Posted by Silent_Employment966@reddit | LocalLLaMA | View on Reddit | 22 comments

Currently, we're running agents at org scale. There were multiple problems we faced, like the credential problem, the state problem, and the execution trace problem, during our initial days but we overcame it & here's our simplified architecture.

Our setup runs three agent classes against a shared context layer. Observer agents sit at the edge pulling in external signals and writing structured events. Task agents pick up work from that stream, execute bounded actions, and write results back. Goal agents read the full execution history, build plans, sequence task agents, and re-plan when conditions shift.

LangGraph handles the goal agent layer. The stateful graph structure maps to how goal agents work: conditional branching, checkpointed state, and the ability to resume mid-plan when a task agent fails or returns a partial result. Hand-rolling that logic is how you end up with orchestration code nobody wants to touch six months later.

CrewAI handles task coordination. Role-based agent assignment with shared short-term and long-term memory, plus a planning agent that sequences tasks before execution starts. The crew model maps closely enough to the task-oriented agent class that we use it without custom scaffolding.

Harbor sits underneath all of it. Every agent in the fleet gets scoped access to tools, files, and workflows through Harbor's workspace model. Credentials stay in Harbor, not in model context. Every tool call produces a trace. When an agent calls a database, hits an external API, or triggers a downstream workflow, that action is logged with full provenance. At fleet scale, that trace layer is what lets you debug a failure in under an hour instead of a day.

The ring-based protocol governs message routing on top of this. Kernel agents at Ring 0 manage agent lifecycle. Orchestrators at Ring 1 route messages by agent metadata and classification. Goal agents at Ring 2 decompose intents into task plans. Task agents at Ring 3 execute with least privilege. Observer agents at Ring 4 run continuously, posting events without making decisions.

As the shared conversation deepens, newer agents start with a richer operational history than earlier ones did. Our best thing is that the coordination overhead per agent drops as the history grows.

[-]

Fickle_Temporary_794@reddit

This is a solid pattern. separating observers, task agents, and goal agents makes a lot more sense than one general agent trying to do everything

The part i’d watch closely is the shared context layer. If the history is structured, typed, and compacted well, newer agents get better over time. But if it becomes a dumping ground for logs, partial results, and stale assumptions, you can end up spreading noise across the whole fleet.

Feels like the real advantage here is not just multi-agent coordination, but having durable state + execution traces + scoped tool access in one system.

OrganicDress8135@reddit

The shared context layer is exactly where things go sideways. Once it starts accumulating stale assumptions and partial results, every downstream agent inherits the noise. Compaction strategy and explicit TTLs on context entries saved me a lot of headaches, treating context as a typed schema rather than a log dump makes a real difference.

LocalLLaMA-ModTeam@reddit

Rule 4 - Post is primarily commercial promotion.

the-username-is-here@reddit

Where's the mod when you need it?

Obviously it's a plug for some shitty startup in the last link.

rm-rf-rm@reddit

Just report it if we miss it and we'll take care of it. (in this case someone did report, but if many people report, it gets autoremoved)

AI-Agent-Payments@reddit

The piece nobody talks about in these three-tier architectures is what happens when a goal agent re-plans mid-execution but task agents already have in-flight work that was scoped to the old plan. We hit a state divergence bug where LangGraph's checkpoint reflected the new plan while two CrewAI tasks were still writing results back under old task IDs, and reconciling that took more effort than the original orchestration build. Worth designing an explicit "plan version" field into your shared context layer before you hit that at scale.

rduser@reddit

You post this again and you will be banned. You've been warned

Limp_Statistician529@reddit

i ran into this, when newer agents inherit a richer history they also inherit whatever bad data slipped in earlier. one observer pulls a weird signal, it gets written as an event, now every goal agent downstream is planning around it.

the memory engine we use scores every memory at write time so injection patterns n sketchy sources get flagged, then low-trust stuff gets down-ranked before it reaches the agent. history grows but doesnt get poisoned.

curious how Harbor's trace layer deals w that. Would love to share what i use with you

AmoebaDue6638@reddit

The credential scoping problem is the one that bites hardest at scale. We have been using Orthogonal for the API access layer since it handles auth for 200+ APIs behind one key, which removes a ton of the per-agent credential management overhead you described.

Potential-Leg-639@reddit

Ai slop / ad

AuggieKC@reddit

You are correct. Most of the replies are, too. This place is becoming unbearable.

shoumakongtou@reddit

The "credential problem" you mentioned is real — if you're running Goal/Task/Observer agents across multiple model families (Claude for goal, GPT for task, etc.), keeping each agent's SDK happy is its own problem.
We built openmodel.ai for exactly this: three native API surfaces (OpenAI, Anthropic, Gemini) so each agent's SDK speaks its native protocol while routing happens by model name. Worth a look if you ever route different agents to different providers.

kcarriedo@reddit

The credential / state / execution-trace triad is exactly the right framing — those are the three things that bite once you scale agents past one developer. Would love to hear more about how you solved the state problem specifically. The two patterns we've seen work are (a) shared filesystem with file-locking plus a coordinator process that arbitrates, and (b) a small Redis or SQLite service the agents check before any non-idempotent action. Which way did you go? (We're working on a coordinator-first approach at claudeverse.ai — collecting field reports like yours helps a lot.)

terraslate@reddit

how do you handle upgrading to a newer inference model when the whole show is built on a stack of cards? not jibing - just a reality check question.

Silent_Employment966@reddit (OP)

context layer is model-agnostic by design. swap the inference model and the tools, state, and traces stay intact

redballooon@reddit

Nice story. With my experience hard to believe. How often did you switch models so far?

sn2006gy@reddit

That's an impossibility though. You can lock the tools, but state and traces reflect the model(s) behavior.

this is the part that keeps the whole thing certain though

"Hand-rolling that logic is how you end up with orchestration code nobody wants to touch six months later."

Raseaae@reddit

How are you actually managing the communication between the different rings? Are the rings strictly isolated via network policies/Harbor workspaces?

Deep_Structure2023@reddit

most multi-agent setups get overcomplicated fast, but this separation of observer/task/goal agents actually feels maintainable.

The ring protocol is what keeps it that way. once orchestrators handle routing, each agent class stays narrow enough to reason independently

Makes sense, keeping agents narrow in scope is probably what keeps the whole system debuggable.