Architecture discussion: The missing infrastructure for continuously running AI Agents
Posted by Potato_Farmer_1993@reddit | ExperiencedDevs | 18 comments
From an engineering perspective, the current AI agent stack feels incomplete. We have frameworks (LangChain), execution runtimes (sandboxes/Browserbase), and harnesses (DeepAgents/Claude Code). But they all share a fundamental flaw for long-running systems: they are trigger-based.
If you are tasked with building an agent that operates continuously and sustainably on its own, an Agent Harness isn't enough. What we actually need is a dedicated Agent Runtime Environment.
To clarify, I'm not talking about an Agent Execution Runtime Env (where the agent safely executes Python). I'm talking about the persistent daemon/supervisor layer—the environment that gives the agent a continuous lifecycle, manages its state, handles self-healing when the LLM inevitably hallucinates a crash, and provides a heartbeat for proactive background work.
How are you all architecting this? Are you just wrapping your agents in Kubernetes cronjobs and temporal workflows, or is there a better pattern emerging for true persistent agent environments?
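The daemon/supervisor layer the OP describes can be sketched in a few lines. This is a minimal illustration, not a real framework: `run_agent_step` is a hypothetical stand-in for "poll triggers, call the model, act", and the simulated crash stands in for the LLM failure mode mentioned above.

```python
import time
import traceback

def run_agent_step(state):
    """Hypothetical agent tick: poll triggers, call the model, act."""
    state["ticks"] += 1
    if state["ticks"] == 2:
        raise RuntimeError("model hallucinated a crash")  # simulated failure
    return state["ticks"]

def supervise(max_ticks=4, max_restarts=3):
    """Daemon/supervisor loop: continuous lifecycle, heartbeat pacing,
    and a self-healing restart around each agent step."""
    state = {"ticks": 0}
    restarts = 0
    while state["ticks"] < max_ticks:
        try:
            run_agent_step(state)
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # repeated failure: escalate instead of looping forever
            traceback.print_exc()  # self-heal: record the crash, keep running
        time.sleep(0)  # heartbeat interval; > 0 in a real daemon
    return state["ticks"], restarts

print(supervise())  # (4, 1): four ticks counted, one self-healed crash
```

In practice the loop body would be a subprocess or container restart rather than a function call, but the shape (bounded retries, heartbeat pacing, escalation on repeated failure) is the same.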
NotMyRealNameObv@reddit
I built it myself (or rather, kiro built the initial implementation and once I got it running it has been building itself) using shell scripts and Python.
gibonai@reddit
I think the other commenters are right that the pain point isn't quite clear. Are you tired of sitting in front of a chat window and feeding it input every few minutes to keep it on track? Or is there something else going on?
Even human devs are "trigger" based; your manager assigns tasks, CS files bug reports, product hands you a ticket. Good devs may identify improvements without any input, but you probably don't want an LLM doing that, you'll just chew through tokens and it'll spit out lots of code you didn't ask for.
Now for my shameless pitch....my company built Gibon to solve the "babysitting the chatbot" pain point. You assign it a task (from JIRA, Linear, Slack, or our web UI) and it implements it and opens a PR, no need to keep prompting it to stay on track. If it sounds useful you can apply for early access at https://gibon.ai.
micseydel@reddit
Can you elaborate? My project is "trigger-based" and that seems to totally make sense, why would it do anything without having a reason to do it? They can set timers or subscribe to emails as events, so I really don't see how being event/trigger based is a problem.
CodelinesNL@reddit
Everything in software is trigger based in a sense. The question is just bad.
CodelinesNL@reddit
Sounds like this is written by someone who hasn’t written code for a while and absolutely has not written an AI API integration.
Start with what you’re trying to accomplish. The trigger-based bit does not make sense; it’s just API calls. Everything a running microservice does is also responding to ‘triggers’.
I have a few simple agents running. One responds to questions. The other also runs on a schedule. It’s trivial to build.
ravenclau13@reddit
while true
dbenc@reddit
make no mistakes
So_Rusted@reddit
the ralph loop
ThirdWaveCat@reddit
(1) Event-based systems like AWS Step Functions or Argo Workflows on Kubernetes can be configured to effectively loop indefinitely, usually with a locking mechanism you provide for safety (probably conditional puts on object storage).
(2) You could use Kubernetes high-availability controllers to ensure exactly one process is active. This handles locking but is difficult for other reasons.
(3) You could build what Kubernetes provides using databases or third-party implementations of "consensus" algorithms (Apache ZooKeeper, etcd, Apache Ratis, OmniPaxos). This is referred to as a "consistent core".
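The "conditional put" lock in option (1) has a local analogue that shows the shape of the pattern: an atomic create-if-absent, where only one worker wins. This sketch uses an exclusive file create in place of an object-store conditional PUT; the filename and helper are illustrative, not from any library.

```python
import os
import tempfile

def try_acquire_lock(path):
    """Single-writer lock via atomic exclusive create -- the local analogue
    of a create-if-absent conditional put on object storage."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True   # we created the lock object: we hold the lock
    except FileExistsError:
        return False  # another worker already holds it

lock = os.path.join(tempfile.gettempdir(), "agent-loop.lock")
if os.path.exists(lock):
    os.remove(lock)       # clean slate for the demo

first = try_acquire_lock(lock)   # wins the race
second = try_acquire_lock(lock)  # loses: lock already exists
print(first, second)             # True False
os.remove(lock)                  # release
```

Against real object storage you would also want a TTL or fencing token so a crashed holder does not wedge the loop forever.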
konm123@reddit
What problem are you trying to solve? Why do you need to continuously run AI agents? What value you try to create?
Miserable_Heron_9007@reddit
Yeah, that's the question: where is the value? We've been tasked with creating an offer to sell that would include some product part plus services. But people don't know what they need (neither do I), and when they do imagine something, they can't articulate the value beyond the press release.
Etiennera@reddit
I don't see one frankly. Either your agents are small and you have license to host them and you can load the model in memory, or you rely on big tech to do that and you use triggers.
Character-Cattle6565@reddit
Just create a "Judge" LLM to judge your agents. Make sure to write: "You are a senior architect. Your job is to judge your servant agents. Make no mistake. Do not allow agents to hallucinate, and if they do, you should self-heal them; if they fail again you should punish them, and if they don't, reward them."
lolimouto_enjoyer@reddit
"Do performance reviews on each and terminate the bottom 10%"
BluebellRhymes@reddit
This is work. You're not saying anything, just asking us to work for you.
Routine_Internal_771@reddit
https://code.claude.com/docs/en/tools-reference#monitor-tool
zangler@reddit
I mean...you are basically answering your own question. Build it. You are unlikely to want something truly perpetual... treat it like anything else that needs to be up. Heartbeat, observer, retrigger.
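"Heartbeat, observer, retrigger" is a standard watchdog shape, sketched below. All names here are illustrative: the agent calls `beat()` each cycle, and an observer restarts it when the heartbeat goes stale.

```python
import time

class Watchdog:
    """Observer that retriggers the agent when its heartbeat goes stale."""
    def __init__(self, timeout, restart_fn):
        self.timeout = timeout          # max seconds between heartbeats
        self.restart_fn = restart_fn    # how to retrigger (e.g. respawn process)
        self.last_beat = time.monotonic()
        self.restarts = 0

    def beat(self):
        """Called by the agent on every successful cycle."""
        self.last_beat = time.monotonic()

    def check(self):
        """Called periodically by the observer (cron, sidecar, etc.)."""
        if time.monotonic() - self.last_beat > self.timeout:
            self.restarts += 1
            self.restart_fn()   # retrigger the agent
            self.beat()         # reset the clock for the fresh instance

restarted = []
dog = Watchdog(timeout=0.01, restart_fn=lambda: restarted.append(1))
dog.check()          # heartbeat is fresh: no restart
time.sleep(0.02)
dog.check()          # heartbeat is stale: observer retriggers
print(dog.restarts)  # 1
```

In production the observer usually runs out-of-process (systemd watchdog, a Kubernetes liveness probe, or a cronjob checking a timestamp row), so a hung agent can't take its own watchdog down with it.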
johnpeters42@reddit
Apart from "are agents doing more good than harm" (let's say you've eliminated whatever you want to on that front), what's the use case for this? "Whenever you haven't done anything for (time period), pull the next ticket off the backlog and start working on it"? Which does sound simple enough to just throw a cronjob or equivalent at it.
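The "pull the next ticket when idle" policy really is about this small. A hypothetical sketch of the cron-tick version, with an in-memory queue standing in for the real backlog:

```python
from collections import deque

def idle_tick(backlog, in_progress):
    """One cron tick: if the agent is idle, pull the next ticket.
    `backlog` and `in_progress` stand in for a real ticket tracker."""
    if not in_progress and backlog:
        in_progress.append(backlog.popleft())
    return in_progress

backlog = deque(["TICKET-1", "TICKET-2"])
work = []
idle_tick(backlog, work)   # idle: pulls TICKET-1
idle_tick(backlog, work)   # busy: pulls nothing
print(work)                # ['TICKET-1']
```

Everything harder than this (what "idle" means, what happens when a ticket fails) lands back in the supervisor questions the OP raised.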