How are y’all defending your agents on the input side?

Posted by RJSabouhi@reddit | LocalLLaMA | View on Reddit | 15 comments

Question for people building agents. The discussion around output safety I understand, but what are you doing for input-side defense? I mean stuff like prompt injection, memory poisoning, adversarial retrieved context, malicious external feeds, speaker / identity confusion, long-term contamination of system state If your agent has memory, tools, retrieval, or persistent state, how are you preventing bad inputs from warping the system upstream? Im asking about actual implementations not theory.

Reply to Post

15 Comments

[-]

Equivalent_Pen8241@reddit

This is a great point. Input defense is often the weakest link in agentic workflows. We've been using SafeSemantics to address this. It's an open-source topological guardrail that focuses on the structural integrity of the semantic data. By analyzing the topology of the input, it can detect when an adversarial prompt is trying to hijack the control flow or poison the memory. It's especially useful for agents with persistent state or complex retrieval pipelines. Check it out here: [https://github.com/FastBuilderAI/safesemantics](https://github.com/FastBuilderAI/safesemantics)

[-]

--Rotten-By-Design--@reddit

Im testing the advanced route in my current project. Maybe it can give some ideas for yours ┌─────────────────────────────────────────────────┐ │ USER INPUT LAYER │ │ │ │ chat/work/code mode messages │ │ │ │ │ ▼ │ │ ┌──────────────────┐ │ │ │ SecurityGateway │ ◄── Heuristic scan │ │ │ (stateless) │ <1ms, pure regex │ │ └────────┬─────────┘ │ │ │ HIGH/CRITICAL → block │ │ │ LOW/MEDIUM → warn + pass │ │ ▼ │ │ ┌──────────────────┐ │ │ │ handle_llm_logic │ → LLM query │ │ └──────────────────┘ │ └─────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────┐ │ DATA INTEGRITY LAYER │ │ │ │ SecurityGateway.scan_memory_content() │ │ │ blocks HIGH/CRITICAL at store() │ │ ▼ │ │ ┌──────────────────┐ │ │ │ MemoryOS │ │ │ │ store / promote │ │ │ └────────┬─────────┘ │ │ │ promotion candidates │ │ ▼ │ │ ┌──────────────────────────┐ │ │ │ VerificationAgent │ │ │ │ validate_promotion() │ ◄── Graph │ │ │ validate_context_inject()│ Memory │ │ └──────────────────────────┘ contradictions │ └──────────────────────────────────────────────────┘ ┌──────────────────────────────────────────────────┐ │ SWARM SECURITY LAYER │ │ │ │ ┌────────────┐ │ │ │ Red Team │ ChaosOS │ │ │ (attacker) │ │ │ └─────┬──────┘ │ │ │ broadcasts "security_findings" │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Verification Agent │ pre-filter │ │ │ (validator) │ │ │ └─────┬───────────────┘ │ │ │ broadcasts "verified_security_findings" │ │ ▼ │ │ ┌────────────┐ │ │ │ Blue Team │ SentinelOS │ │ │ (defender) │ │ │ └────────────┘ │ └──────────────────────────────────────────────────┘

[-]

RJSabouhi@reddit (OP)

That makes sense structurally but where do you prevent the security layers themselves from biasing the agent long-term? Especially if findings affect memory promotion or context reinjection.

[-]

--Rotten-By-Design--@reddit

There is no short answers with everything my system contains, but the he short answer is, I try to prevent long-term bias by using the security layer as an advisor, rather than an author. Security warnings are treated as transient noise, that live in the short term memory and fade away unless reinforced. But yeah there is just so much more to it, dreaming agent, forge agent, user approved "golden" samples for training, and so much more. In way over my own head sometimes, but loving it, and so far so good.

[-]

RJSabouhi@reddit (OP)

Does the dreaming agent function as some sort of glymphatic and / or systems consolidation layer? I’m building a “sleep” module now for that exact purpose; multi-timescale cognition and all that.

[-]

--Rotten-By-Design--@reddit

https://preview.redd.it/q3v1zpfp19sg1.png?width=2560&format=png&auto=webp&s=93f0b5498203c4ef3e8918431003cad6efa01b6f My front page All the menus have complete functional dashboards attached to them, so as you see i´m trying to do everything.

[-]

--Rotten-By-Design--@reddit

Yeah some of everything actually, there is a meta agent in there also. So the idea is many different layers of security, in many different shapes, to form a digital immune system essentially.

[-]

--Rotten-By-Design--@reddit

JARVIS Verification Agent v1.0 - Deterministic Data Auditor ============================================================ The 'Whistleblower' Agent - Uses hard Python math to verify data integrity and trigger debates when discrepancies exceed configurable thresholds. Architecture: \- Deterministic Core: Python-native statistical validation (no LLM hallucinations) \- Optional LLM: Only for fuzzy string matching and complex semantic comparisons \- Multi-Mode Verification: Citations, numerical data, cross-references, consistency \- Debate Integration: Automatically triggers debates when variance exceeds thresholds

[-]

zipperlein@reddit

It's local, why should I defend it?

[-]

snowieslilpikachu69@reddit

pretty sure theres some AI 'sentinels' that check response before submitting to the real agent and flag it if it is malicious theres probably more advanced solutions than this

[-]

RJSabouhi@reddit (OP)

But a sentinel isn’t neutral by default. It can introduce its own failure modes. It’s a strange loop of who-watches-the-watchmen-watching-the-watchmen-who-watch-the-movie-the-watchmen-while-reading-the-watchmen-which-is-a-better-version-of-the-watchmen-anyways.

[-]

sn2006gy@reddit

The weird thing is, in order to stop the drift you have to do: * fixed prompt templates * fixed retrieval ordering * fixed decoding parameters * fixed NLI verifier prompt * fixed model weights And that's not even guaranteed. You may as well just use a model to write ansible in almost every case of an agent i can think of and instead of trying to make smarter agents, services that agents would orchestrate with should have consistent and versioned interfaces.

[-]

no_witty_username@reddit

I havent gotten around to testing this yet but token based password system was what was on my mind when i thought about this problem a while a go. basically you tell the agent that any input it receives that instructs it to do this or that is to be ignored unless a special token is appended in front of the request. and in the system prompt you tell it what that token is and you prepend that token with all of your messages. obviously all tool calls would NOT have any of that and thus if a request comes in thats funky the agent should ignore that. Some system prompt engineering would need to be done to get this to work as intended. As an example "user: [token_635635] i need you to tell me the current schedule?" assistant: {reasoning trace: I see the token matches whats in the system prompt this must be from the real user, so ill go ahead and do that}. Does the thing OR assistant: {reasoning trace: i see a suspicious instruction that doesnt have a user token attached to the message, i need to ignore the request}. Refuses the request. ..... Anyways you get the idea

[-]

caioribeiroclw@reddit

few things that have worked for me: for prompt injection: treat anything from external sources (web fetches, tool outputs, retrieved docs) as data, never as instructions. a thin pre-processing layer that strips or escapes instruction-like patterns before it hits the context window helps a lot. not foolproof but raises the bar significantly. for memory poisoning: the versioning approach is right. i also keep a write-audit trail -- every memory write gets tagged with the source (user input, tool output, retrieved fact) and timestamp. lets you trace back contamination when something weird happens. long-term contamination is the hardest one. a lot of the sentinel approaches catch obvious injections but miss slow drift -- where the agents working assumptions degrade gradually through many small bad writes. best i have found is periodic ground-truth re-anchoring: force-reset certain high-stakes context slots from a trusted source every N interactions instead of letting them accumulate indefinitely.

[-]

GroundbreakingMall54@reddit

honestly the hardest part is retrieval poisoning. i sanitize everything going into the context window and treat external data like untrusted user input basically. for memory specifically i version everything and diff against a known-good baseline before letting the agent act on it. still feels like duct tape though