[D] do you guys actually get agents to learn over time or nah?
Posted by Tight_Scene8900@reddit | LocalLLaMA | 25 comments
been messing with local agents (ollama + openai-compatible stuff) and I keep hitting the same issue
they don’t really learn across tasks
like:
run something → it works (or fails)
next day → similar task → repeats the same mistake
even if I already fixed it before
I tried different “memory” setups but most of them feel like:
- dumping stuff into a vector db
- retrieving chunks back into context
which helps a bit but doesn’t feel like actual learning, more like smarter copy-paste
so I hacked together a small thing locally that sits between the agent and the model:
- logs each task + result
- extracts small “facts” (like: auth needs bearer, this lib failed, etc.)
- gives a rough score to outputs
- keeps track of what the agent is good/bad at
- re-injects only relevant stuff next time
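the core of it is honestly small. rough sketch of the shape (names and numbers made up, and I'm using crude tag overlap here as a stand-in for the embedding lookup):

```python
# sketch of the memory layer: log task outcomes, keep scored "facts",
# re-inject only the relevant ones next run. thresholds are arbitrary.

class MemoryLayer:
    def __init__(self):
        self.facts = []  # each: {"text", "tags", "score"}

    def log_task(self, task_tags, result_ok, extracted_facts):
        # naive scoring: facts from successful runs start higher
        base = 0.7 if result_ok else 0.3
        for text in extracted_facts:
            self.facts.append({"text": text, "tags": set(task_tags), "score": base})

    def relevant(self, task_tags, top_k=3):
        # tag overlap instead of embeddings, just for the sketch
        scored = [(len(f["tags"] & set(task_tags)) * f["score"], f) for f in self.facts]
        scored = [(s, f) for s, f in scored if s > 0]
        scored.sort(key=lambda x: -x[0])
        return [f["text"] for _, f in scored[:top_k]]

mem = MemoryLayer()
mem.log_task(["auth", "api"], True, ["auth needs bearer token"])
mem.log_task(["video"], False, ["libfoo crashes on mp4"])
print(mem.relevant(["auth"]))  # surfaces the auth fact, not the video one
```

real version has sqlite + embeddings behind it, but the loop is the same.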
after a few days it started doing interesting things:
- stopped repeating specific bugs I had already corrected
- reused patterns that worked before without me re-prompting
- avoided approaches that had failed multiple times
still very janky and probably not the “right” way to do it, but it feels closer to learning from experience vs just retrying prompts
curious what you guys are doing for this
are you:
- just using vector memory and calling it a day?
- tracking success/failure explicitly?
- doing any kind of routing based on past performance?
feels like this part is still kinda unsolved
pulse-os@reddit
dude you're literally building what I spent 10 months on with PULSE lol, reading your description felt like reading my own changelog from month 2.
the "smarter copy-paste" feeling from vector db retrieval is spot on — retrieval without scoring is just a search engine cosplaying as memory. what changed everything for me was adding explicit confidence scoring + reward signals. when an agent uses a memory and the task succeeds, that memory's confidence goes UP. when it uses a memory and things break, confidence goes DOWN. after a few hundred tasks the cream rises to the top automatically and the bad advice sinks.
basically reinforcement learning on your knowledge base, not just your model.
your "keeps track of what the agent is good/bad at" is huge btw — I call this agent competence profiling and it feeds directly into task routing. like claude-sonnet has 100% success on brain tasks across 186 observations but is weaker on deployment stuff, so the system routes deployment tasks to gemini instead. the agents don't just learn what works, the SYSTEM learns which agent to ask.
the "stopped repeating specific bugs" part, that's exactly what our anti-pattern system does. 201 active anti-patterns with evidence counts, and they get injected into every agent session at boot. if 10 agents hit the same bug, the 11th one sees "don't do this, 10 confirmed incidents" before it even starts writing code.
one thing that'll save you pain: add contradiction detection early. once you have enough "facts" you WILL get conflicting ones ("use postgres" vs "sqlite is better") and without explicit conflict tracking the agent confidently serves both as true depending on which one the vector search happens to rank higher that day lol.
"this part is still kinda unsolved" is the right take tho, most people stop at embeddings + retrieval. the scoring/routing/contradiction layer is where the actual learning happens.
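the confidence update itself is nothing fancy btw, the value comes from running it over hundreds of tasks. rough sketch, the learning rate is made up:

```python
# reward-style confidence update on memories, not on the model.
# each time a memory gets used, nudge its confidence toward the outcome.

def update_confidence(conf, task_succeeded, lr=0.1):
    """Move confidence toward 1.0 on success, toward 0.0 on failure."""
    target = 1.0 if task_succeeded else 0.0
    return conf + lr * (target - conf)

conf = 0.5  # a fresh memory starts neutral
for ok in [True, True, False, True]:  # outcomes of tasks that used it
    conf = update_confidence(conf, ok)
print(round(conf, 3))  # 0.582 -- net positive evidence, so it rose
```

after enough tasks the good memories sit near 1.0 and the bad advice sinks toward 0, which is the whole "cream rises" effect.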
ElvaR_@reddit
Been having good luck with agent zero.... It is crashing the computer at the moment when it calls the LLM.... But I'll fix it soon enough... Lol
Tight_Scene8900@reddit (OP)
agent zero is wild lol. crashes are a rite of passage at this point. curious how you’re handling memory between runs once you get it stable
ElvaR_@reddit
After sitting there in a loop for like an hour... It finally started to write the memory and figure out tool calling again. 0.something I had working pretty well. Got it to even split a video up for me and add some text to it. Super impressed with that.
So far after that loop, and using an actual embedding LLM... Lol it started on its memory. Seems to be holding up.
Tight_Scene8900@reddit (OP)
haha an hour loop is wild but getting it to split a video and add text is actually impressive, thats a real tool calling win. the embedding llm jump helped me a lot too, stuff finally stopped hallucinating which related memories mattered. curious what embedding model you ended up using, nomic or one of the bge ones?
ElvaR_@reddit
Using qwen3-embedding:0.6B for the embedding. And then qwen3.5:4b for both the main and the utility model, to cut down on the time switching from model to model.
StupidityCanFly@reddit
I have the agent storing logs and a periodic job that analyzes them and creates rules that work as part of the harness.
Tight_Scene8900@reddit (OP)
this is exactly what i ended up with too. how are you structuring the rules? i’m curious if you’re extracting them automatically or writing them manually after
StupidityCanFly@reddit
I have a defined syntax for the rules, and the rules are stored in JSON. So anything the agent visits/reviews and wants to add to rules is immediately tested in the code. If it’s invalid, the rule is fed to the LLM, and the agent gets feedback on what’s broken and tries again.
And as a rule of thumb, I stick as much of the logic into the code as possible. Fewer issues with having deterministic outputs.
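Stripped down, the validation loop looks something like this (the rule schema here is invented for the example, mine is project-specific):

```python
# rules live in JSON and every new rule is validated before it's accepted.
# invalid rules produce an error message that goes back to the LLM for retry.

import json

REQUIRED = {"id", "when", "then"}  # illustrative schema, not my real one

def validate_rule(raw):
    """Return (rule, None) if valid, else (None, feedback_for_the_agent)."""
    try:
        rule = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    if not isinstance(rule, dict):
        return None, "rule must be a JSON object"
    missing = REQUIRED - rule.keys()
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    return rule, None

rule, err = validate_rule('{"id": "r1", "when": "auth", "then": "use bearer"}')
print(rule["id"] if rule else err)  # r1

# a broken rule produces feedback the agent can retry on
_, err = validate_rule('{"id": "r2", "when": "auth"}')
print(err)  # missing fields: ['then']
```

Keeping the check deterministic in code (rather than asking the model if the rule "looks right") is what makes the feedback loop reliable.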
Refefer@reddit
The ACE paper is an excellent resource for self learning via rules and context. Similarly, a blackbox QA agent helps quite a bit for identifying successful/unsuccessful tasks.
Tight_Scene8900@reddit (OP)
gonna check the ACE paper, hadn’t seen that one. the blackbox QA idea is interesting — do you run it as a separate agent judging the main one or more inline scoring?
Refefer@reddit
I run it as a separate agent: it gets the task and the outputs and has to validate the answers are correct. It helps tremendously with stuff like coding where it will call BS on written code, design smells, etc.
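Roughly the shape of it, with the judge model stubbed out (the prompt wording is just illustrative):

```python
# blackbox judge: it only sees the task and the output, never the worker's
# reasoning, and must return a verdict. judge_model is any callable that
# takes a prompt string and returns text -- in practice a separate model.

def build_judge_prompt(task, output):
    return (
        "You are a strict reviewer. Given the task and the submitted output,\n"
        "answer PASS or FAIL with one sentence of justification.\n\n"
        f"TASK:\n{task}\n\nOUTPUT:\n{output}\n"
    )

def judge(task, output, judge_model):
    verdict = judge_model(build_judge_prompt(task, output))
    return verdict.strip().upper().startswith("PASS")

# stub model just to show the flow
print(judge("write add(a, b)", "def add(a, b): return a + b",
            lambda p: "PASS - correct and minimal"))  # True
```

Because the judge never sees the worker's chain of thought, it can't be talked into agreeing with it.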
Tight_Scene8900@reddit (OP)
thats actually clean, a separate judge that only sees task + output and has to call it correct or not. sidesteps the whole self-judging trap because the judge isnt the same model that produced the work. the coding bs detection is the killer use case, curious if you use a smaller cheaper model for the judge or the same size as the main agent? been going back and forth on whether the judge needs to be as smart as the worker or if a dumber grounded checker is fine.
donhardman88@reddit
I feel your pain on the 'smarter copy-paste' thing. That's the wall everyone hits with basic vector memory—cosine similarity is great for finding a similar-sounding paragraph, but it's useless for actual learning or understanding how a system evolves.
The 'right way' (or at least the way that's actually working for me) is to move away from flat embeddings and toward a structural knowledge graph. Instead of just logging facts, you use AST parsing (tree-sitter) to map the actual relationships and dependencies.
When the agent 'learns' a fix, you don't just store a text chunk; you update the relationship in the graph. This way, the agent isn't just recalling a similar event—it's navigating a map of the project's logic. It's a bit more of a lift than a simple vector store, but it's the only way to get that 'experience' feeling rather than just a fancy search.
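In miniature, the update looks like this (no tree-sitter here, just the graph idea; entity names and edge labels are invented):

```python
# toy version of the graph update: nodes are code entities, edges carry a
# relationship plus what the agent has learned about it. a real version
# would populate nodes and edges from tree-sitter ASTs.

from collections import defaultdict

graph = defaultdict(dict)  # graph[src][dst] = {"rel": ..., "notes": [...]}

def add_edge(src, dst, rel):
    graph[src][dst] = {"rel": rel, "notes": []}

def learn_fix(src, dst, note):
    # attach the lesson to the relationship instead of storing a
    # free-floating text chunk
    graph[src][dst]["notes"].append(note)

add_edge("auth.py:login", "http_client", "calls")
learn_fix("auth.py:login", "http_client", "must send bearer token, not basic auth")
print(graph["auth.py:login"]["http_client"]["notes"][0])
```

The payoff is that when the agent next touches `auth.py:login`, it walks the edge and gets the lesson in context, rather than hoping a similarity search surfaces it.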
I've been building this into a tool called Octocode (Rust-based, uses MCP) specifically to solve this 'memory drift' and retrieval noise. It's not perfect, but it's a hell of a lot better than just dumping everything into a vector DB and hoping for the best.
Tight_Scene8900@reddit (OP)
yeah the graph angle is solid, tree-sitter into a structural map is probably the right foundation for code-native agents. ive thought about going that direction but ended up on a different signal entirely.
mine is purely behavioral. the agent scores its own output 1-5 after each task, low scores turn into warnings next time it tries something similar, high scores become patterns to reuse. doesnt look at the code at all, just tracks what happened when the agent worked on it and whether it went well.
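concretely the loop is tiny, something like this (score thresholds are arbitrary):

```python
# behavioral loop in miniature: self-score 1-5 per task, low scores become
# warnings for similar future tasks, high scores become reusable patterns.

def record(history, task_kind, score, note):
    history.setdefault(task_kind, []).append((score, note))

def preamble(history, task_kind):
    """Build the context to re-inject before the next similar task."""
    lines = []
    for score, note in history.get(task_kind, []):
        if score <= 2:
            lines.append(f"warning: last time, {note}")
        elif score >= 4:
            lines.append(f"pattern that worked: {note}")
    return "\n".join(lines)

h = {}
record(h, "deploy", 1, "forgot to run migrations first")
record(h, "deploy", 5, "blue-green swap went clean")
print(preamble(h, "deploy"))
```

mid scores (3) get dropped on purpose, only strong signal gets re-injected.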
honestly feels like both probably need to live in the same stack eventually. a perfect ast map still wont stop an agent from making the same mistake twice if nothing is keeping score. and pure behavioral tracking with no structural grounding is kinda just vibes.
mines called greencube btw, rust/tauri, similar energy to octocode. would be down to compare notes if youre into it
donhardman88@reddit
I think you're spot on. Combining the two – structural grounding for the 'what' and behavioral tracking for the 'how' – is probably the only way to get an agent that actually feels like it's evolving. One provides the map, the other provides the experience.
I'm curious though – have you found a way to benchmark the behavioral side? I've always struggled with the fact that 'learning' often feels subjective. I'd love to know if you've built a way to measure if the behavioral scores are actually reducing the error rate over time, or if it's mostly a qualitative improvement.
For me, the structural side is easier to measure (recall @ k, etc.), but the 'learning' part is the real challenge. If we can find a way to quantify the delta between a 'flat' agent and a 'behavioral' one, that would be a huge win for the whole community.
Tight_Scene8900@reddit (OP)
honestly no, and this is the thing i keep getting stuck on. structural has clean metrics because youre measuring retrieval against ground truth. behavioral has no equivalent. closest ive come up with is tracking thumbs down rate over time and watching for repeat error patterns, like if the agent hits the same mistake twice and then stops after feedback injection, thats a measurable delta. but its noisy and slow and i wouldnt call it a real benchmark.
the thing i want to build is pair comparison. run the same task twice, once with memory injection and once cold, measure whether the with-memory version gets a better grounded outcome (tests pass, tool calls succeed, whatever). hard part is finding tasks where the memory actually has something to say. random tasks would just be noise.
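the harness i have in mind is basically this (run_agent and the grounded check are placeholders, the toy stand-ins below just show the flow):

```python
# paired benchmark sketch: run each task warm (with memory) and cold,
# score both on grounded checks, report the mean delta.

def paired_delta(tasks, run_agent, grounded_score):
    """Mean grounded-score difference (with memory minus cold) over tasks."""
    deltas = []
    for task in tasks:
        warm = grounded_score(run_agent(task, memory=True))
        cold = grounded_score(run_agent(task, memory=False))
        deltas.append(warm - cold)
    return sum(deltas) / len(deltas)

# toy stand-ins: memory only "helps" on tasks it has seen before
seen = {"fix auth bug"}
def run_agent(task, memory):
    return {"ok": memory and task in seen}

print(paired_delta(["fix auth bug", "new task"], run_agent,
                   lambda r: 1.0 if r["ok"] else 0.0))  # 0.5
```

which also shows the noise problem: the unseen task contributes zero delta, so the benchmark only means something on tasks where the memory has something to say.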
if theres a way to design a shared benchmark that works for both structural and behavioral approaches that would be a real contribution. would be down to brainstorm it if youre in.
Fair-Championship229@reddit
llm as a judge on its own output is known to be unreliable, theres a bunch of papers on this. youre basically building a system that lies to itself and calls it learning
Tight_Scene8900@reddit (OP)
yeah this is a fair hit, the self-correction literature is real and i wont pretend my loop is immune. the deepmind paper on self-correction is the obvious one.
the thing that keeps it from collapsing into pure confirmation bias in practice is that the llm self-score isnt actually the dominant signal. competence per domain is 70 percent actual task success rate and 30 percent llm self-assessment. so if the model keeps rating itself 5 on database stuff but the tasks keep erroring out or getting corrected, the success rate drags the competence down regardless of what the model thinks. the model grading itself only moves the needle a little.
on top of that theres a feedback path where the user can override a score directly, and old knowledge entries decay over time if nothing reinforces them so a wrong self-rating from week one doesnt haunt the agent forever.
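the whole thing fits in a few lines. weights match what i described above, the half-life is arbitrary:

```python
# competence blend: grounded success rate dominates, the model's
# self-assessment only nudges. plus decay for unreinforced entries.

def competence(success_rate, self_assessment, w_success=0.7):
    """70% actual task outcomes, 30% llm self-score (both in [0, 1])."""
    return w_success * success_rate + (1 - w_success) * self_assessment

def decayed(score, days_since_reinforced, half_life_days=30):
    """Unreinforced knowledge fades; reinforcement resets the clock."""
    return score * 0.5 ** (days_since_reinforced / half_life_days)

# model rates itself 1.0 on database tasks but only 40% actually succeed
print(round(competence(0.4, 1.0), 2))  # 0.58 -- reality drags it down
# an entry nobody reinforces loses half its weight every 30 days
print(round(decayed(0.58, 30), 2))     # 0.29
```

so a wrong week-one self-rating both gets outvoted by outcomes and fades on its own.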
still lossy, still has edges, you can construct cases where it drifts. but a 70/30 grounded signal with human correction on top beats pure self-critique and it beats no signal at all. if youve got specific failure modes from those papers in mind id genuinely want to hear them, trying to find the sharp edges before i trust it with anything serious.
MoneyPowerNexis@reddit
https://imgur.com/a/4jONOVb
Tight_Scene8900@reddit (OP)
lmao memento is unironically the correct mental model for this whole problem. leonards tattoos are basically a rules.md file getting injected into context every morning
Hot-Employ-3399@reddit
No. VRAM is not big enough for putting extra stuff
Tight_Scene8900@reddit (OP)
yeah that’s the wall i kept hitting too. that’s actually why i went local-first desktop instead of trying to shove everything into the model. keep the memory layer outside the inference process entirely
Similar_Gur9888@reddit
this just sounds like RAG with extra steps
Tight_Scene8900@reddit (OP)
yeah I thought the same at first tbh
I guess the difference I’m seeing is it’s not retrieving external docs but its own past task outcomes + tracking failures over time
but yeah the retrieval part probably overlaps a lot