[D] do you guys actually get agents to learn over time or nah?
Posted by Tight_Scene8900@reddit | LocalLLaMA | 25 comments
been messing with local agents (ollama + openai-compatible stuff) and I keep hitting the same issue
they don’t really learn across tasks
like:
run something → it works (or fails)
next day → similar task → repeats the same mistake
even if I already fixed it before
I tried different “memory” setups but most of them feel like:
- dumping stuff into a vector db
- retrieving chunks back into context
which helps a bit but doesn’t feel like actual learning, more like smarter copy-paste
so I hacked together a small thing locally that sits between the agent and the model:
- logs each task + result
- extracts small “facts” (like: auth needs bearer, this lib failed, etc.)
- gives a rough score to outputs
- keeps track of what the agent is good/bad at
- re-injects only relevant stuff next time
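the core of it is honestly small. rough sketch of the shape (names and numbers made up, and I'm using crude tag overlap here as a stand-in for the embedding lookup):

```python
# sketch of the memory layer: log task outcomes, keep scored "facts",
# re-inject only the relevant ones next run. thresholds are arbitrary.

class MemoryLayer:
    def __init__(self):
        self.facts = []  # each: {"text", "tags", "score"}

    def log_task(self, task_tags, result_ok, extracted_facts):
        # naive scoring: facts from successful runs start higher
        base = 0.7 if result_ok else 0.3
        for text in extracted_facts:
            self.facts.append({"text": text, "tags": set(task_tags), "score": base})

    def relevant(self, task_tags, top_k=3):
        # tag overlap instead of embeddings, just for the sketch
        scored = [(len(f["tags"] & set(task_tags)) * f["score"], f) for f in self.facts]
        scored = [(s, f) for s, f in scored if s > 0]
        scored.sort(key=lambda x: -x[0])
        return [f["text"] for _, f in scored[:top_k]]

mem = MemoryLayer()
mem.log_task(["auth", "api"], True, ["auth needs bearer token"])
mem.log_task(["video"], False, ["libfoo crashes on mp4"])
print(mem.relevant(["auth"]))  # surfaces the auth fact, not the video one
```

real version has sqlite + embeddings behind it, but the loop is the same.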
after a few days it started doing interesting things:
- stopped repeating specific bugs I had already corrected
- reused patterns that worked before without me re-prompting
- avoided approaches that had failed multiple times
still very janky and probably not the “right” way to do it, but it feels closer to learning from experience vs just retrying prompts
curious what you guys are doing for this
are you:
- just using vector memory and calling it a day?
- tracking success/failure explicitly?
- doing any kind of routing based on past performance?
feels like this part is still kinda unsolved
pulse-os@reddit
dude you're literally building what I spent 10 months on with PULSE lol, reading your description felt like reading my own changelog from month 2.
the "smarter copy-paste" feeling from vector db retrieval is spot on — retrieval without scoring is just a search engine cosplaying as memory. what changed everything for me was adding explicit confidence scoring + reward signals. when an agent uses a memory and the task succeeds, that memory's confidence goes UP. when it uses a memory and things break, confidence goes DOWN. after a few hundred tasks the cream rises to the top automatically and the bad advice sinks.
basically reinforcement learning on your knowledge base, not just your model.
your "keeps track of what the agent is good/bad at" is huge btw — I call this agent competence profiling and it feeds directly into task routing. like claude-sonnet has 100% success on brain tasks across 186 observations but is weaker on deployment stuff, so the system routes deployment tasks to gemini instead. the agents don't just learn what works, the SYSTEM learns which agent to ask.
the "stopped repeating specific bugs" part, that's exactly what our anti-pattern system does. 201 active anti-patterns with evidence counts, and they get injected into every agent session at boot. if 10 agents hit the same bug, the 11th one sees "don't do this, 10 confirmed incidents" before it even starts writing code.
one thing that'll save you pain: add contradiction detection early. once you have enough "facts" you WILL get conflicting ones ("use postgres" vs "sqlite is better") and without explicit conflict tracking the agent confidently serves both as true depending on which one the vector search happens to rank higher that day lol.
"this part is still kinda unsolved" is the right take tho, most people stop at embeddings + retrieval. the scoring/routing/contradiction layer is where the actual learning happens.
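the confidence update itself is nothing fancy btw, the value comes from running it over hundreds of tasks. rough sketch, the learning rate is made up:

```python
# reward-style confidence update on memories, not on the model.
# each time a memory gets used, nudge its confidence toward the outcome.

def update_confidence(conf, task_succeeded, lr=0.1):
    """Move confidence toward 1.0 on success, toward 0.0 on failure."""
    target = 1.0 if task_succeeded else 0.0
    return conf + lr * (target - conf)

conf = 0.5  # a fresh memory starts neutral
for ok in [True, True, False, True]:  # outcomes of tasks that used it
    conf = update_confidence(conf, ok)
print(round(conf, 3))  # 0.582 -- net positive evidence, so it rose
```

after enough tasks the good memories sit near 1.0 and the bad advice sinks toward 0, which is the whole "cream rises" effect.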
ElvaR_@reddit
Been having good luck with agent zero.... It is crashing the computer at the moment when it calls the LLM.... But I'll fix it soon enough... Lol
Tight_Scene8900@reddit (OP)
agent zero is wild lol. crashes are a rite of passage at this point. curious how you’re handling memory between runs once you get it stable
ElvaR_@reddit
After sitting there in a loop for like an hour... It finally started to write the memory and figure out tool calling again. 0.something I had working pretty well. Got it to even split a video up for me and add some text to it. Super impressed with that.
So far after that loop, and using an actual embedding LLM... Lol it started on its memory. Seems to be holding up.
Tight_Scene8900@reddit (OP)
haha an hour loop is wild but getting it to split a video and add text is actually impressive, thats a real tool calling win. the embedding llm jump helped me a lot too, stuff finally stopped hallucinating which related memories mattered. curious what embedding model you ended up using, nomic or one of the bge ones?
ElvaR_@reddit
Using qwen3-embedding:0.6B for the embedding. And then qwen3.5:4b for both the main and the utility model, to cut down on the time switching from model to model.
StupidityCanFly@reddit
I have the agent storing logs and a periodic job that analyzes them and creates rules that work as part of the harness.
Tight_Scene8900@reddit (OP)
this is exactly what i ended up with too. how are you structuring the rules? i’m curious if you’re extracting them automatically or writing them manually after
StupidityCanFly@reddit
I have a defined syntax for the rules, and the rules are stored in JSON. So anything the agent visits/reviews and wants to add to rules is immediately tested in the code. If it’s invalid, the rule is fed to the LLM, and the agent gets feedback on what’s broken and tries again.
And as a rule of thumb, I stick as much of the logic into the code as possible. Fewer issues with having deterministic outputs.
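Stripped down, the validation loop looks something like this (the rule schema here is invented for the example, mine is project-specific):

```python
# rules live in JSON and every new rule is validated before it's accepted.
# invalid rules produce an error message that goes back to the LLM for retry.

import json

REQUIRED = {"id", "when", "then"}  # illustrative schema, not my real one

def validate_rule(raw):
    """Return (rule, None) if valid, else (None, feedback_for_the_agent)."""
    try:
        rule = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"invalid JSON: {e}"
    if not isinstance(rule, dict):
        return None, "rule must be a JSON object"
    missing = REQUIRED - rule.keys()
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    return rule, None

rule, err = validate_rule('{"id": "r1", "when": "auth", "then": "use bearer"}')
print(rule["id"] if rule else err)  # r1

# a broken rule produces feedback the agent can retry on
_, err = validate_rule('{"id": "r2", "when": "auth"}')
print(err)  # missing fields: ['then']
```

Keeping the check deterministic in code (rather than asking the model if the rule "looks right") is what makes the feedback loop reliable.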
Refefer@reddit
The ACE paper is an excellent resource for self learning via rules and context. Similarly, a blackbox QA agent helps quite a bit for identifying successful/unsuccessful tasks.
Tight_Scene8900@reddit (OP)
gonna check the ACE paper, hadn’t seen that one. the blackbox QA idea is interesting — do you run it as a separate agent judging the main one or more inline scoring?
Refefer@reddit
I run it as a separate agent: it gets the task and the outputs and has to validate the answers are correct. It helps tremendously with stuff like coding where it will call BS on written code, design smells, etc.
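Roughly the shape of it, with the judge model stubbed out (the prompt wording is just illustrative):

```python
# blackbox judge: it only sees the task and the output, never the worker's
# reasoning, and must return a verdict. judge_model is any callable that
# takes a prompt string and returns text -- in practice a separate model.

def build_judge_prompt(task, output):
    return (
        "You are a strict reviewer. Given the task and the submitted output,\n"
        "answer PASS or FAIL with one sentence of justification.\n\n"
        f"TASK:\n{task}\n\nOUTPUT:\n{output}\n"
    )

def judge(task, output, judge_model):
    verdict = judge_model(build_judge_prompt(task, output))
    return verdict.strip().upper().startswith("PASS")

# stub model just to show the flow
print(judge("write add(a, b)", "def add(a, b): return a + b",
            lambda p: "PASS - correct and minimal"))  # True
```

Because the judge never sees the worker's chain of thought, it can't be talked into agreeing with it.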
Tight_Scene8900@reddit (OP)
thats actually clean, a separate judge that only sees task + output and has to call it correct or not. sidesteps the whole self-judging trap because the judge isnt the same model that produced the work. the coding bs detection is the killer use case, curious if you use a smaller cheaper model for the judge or the same size as the main agent? been going back and forth on whether the judge needs to be as smart as the worker or if a dumber grounded checker is fine.
donhardman88@reddit
I feel your pain on the 'smarter copy-paste' thing. That's the wall everyone hits with basic vector memory—cosine similarity is great for finding a similar-sounding paragraph, but it's useless for actual learning or understanding how a system evolves.
The 'right way' (or at least the way that's actually working for me) is to move away from flat embeddings and toward a structural knowledge graph. Instead of just logging facts, you use AST parsing (tree-sitter) to map the actual relationships and dependencies.
When the agent 'learns' a fix, you don't just store a text chunk; you update the relationship in the graph. This way, the agent isn't just recalling a similar event—it's navigating a map of the project's logic. It's a bit more of a lift than a simple vector store, but it's the only way to get that 'experience' feeling rather than just a fancy search.
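In miniature, the update looks like this (no tree-sitter here, just the graph idea; entity names and edge labels are invented):

```python
# toy version of the graph update: nodes are code entities, edges carry a
# relationship plus what the agent has learned about it. a real version
# would populate nodes and edges from tree-sitter ASTs.

from collections import defaultdict

graph = defaultdict(dict)  # graph[src][dst] = {"rel": ..., "notes": [...]}

def add_edge(src, dst, rel):
    graph[src][dst] = {"rel": rel, "notes": []}

def learn_fix(src, dst, note):
    # attach the lesson to the relationship instead of storing a
    # free-floating text chunk
    graph[src][dst]["notes"].append(note)

add_edge("auth.py:login", "http_client", "calls")
learn_fix("auth.py:login", "http_client", "must send bearer token, not basic auth")
print(graph["auth.py:login"]["http_client"]["notes"][0])
```

The payoff is that when the agent next touches `auth.py:login`, it walks the edge and gets the lesson in context, rather than hoping a similarity search surfaces it.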
I've been building this into a tool called Octocode (Rust-based, uses MCP) specifically to solve this 'memory drift' and retrieval noise. It's not perfect, but it's a hell of a lot better than just dumping everything into a vector DB and hoping for the best.
Tight_Scene8900@reddit (OP)
yeah the graph angle is solid, tree-sitter into a structural map is probably the right foundation for code-native agents. ive thought about going that direction but ended up on a different signal entirely.
mine is purely behavioral. the agent scores its own output 1-5 after each task, low scores turn into warnings next time it tries something similar, high scores become patterns to reuse. doesnt look at the code at all, just tracks what happened when the agent worked on it and whether it went well.
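concretely the loop is tiny, something like this (score thresholds are arbitrary):

```python
# behavioral loop in miniature: self-score 1-5 per task, low scores become
# warnings for similar future tasks, high scores become reusable patterns.

def record(history, task_kind, score, note):
    history.setdefault(task_kind, []).append((score, note))

def preamble(history, task_kind):
    """Build the context to re-inject before the next similar task."""
    lines = []
    for score, note in history.get(task_kind, []):
        if score <= 2:
            lines.append(f"warning: last time, {note}")
        elif score >= 4:
            lines.append(f"pattern that worked: {note}")
    return "\n".join(lines)

h = {}
record(h, "deploy", 1, "forgot to run migrations first")
record(h, "deploy", 5, "blue-green swap went clean")
print(preamble(h, "deploy"))
```

mid scores (3) get dropped on purpose, only strong signal gets re-injected.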
honestly feels like both probably need to live in the same stack eventually. a perfect ast map still wont stop an agent from making the same mistake twice if nothing is keeping score. and pure behavioral tracking with no structural grounding is kinda just vibes.
mines called greencube btw, rust/tauri, similar energy to octocode. would be down to compare notes if youre into it
donhardman88@reddit
I think you're spot on. Combining the two – structural grounding for the 'what' and behavioral tracking for the 'how' – is probably the only way to get an agent that actually feels like it's evolving. One provides the map, the other provides the experience.
I'm curious though – have you found a way to benchmark the behavioral side? I've always struggled with the fact that 'learning' often feels subjective. I'd love to know if you've built a way to measure if the behavioral scores are actually reducing the error rate over time, or if it's mostly a qualitative improvement.
For me, the structural side is easier to measure (recall @ k, etc.), but the 'learning' part is the real challenge. If we can find a way to quantify the delta between a 'flat' agent and a 'behavioral' one, that would be a huge win for the whole community.
Tight_Scene8900@reddit (OP)
honestly no, and this is the thing i keep getting stuck on. structural has clean metrics because youre measuring retrieval against ground truth. behavioral has no equivalent. closest ive come up with is tracking thumbs down rate over time and watching for repeat error patterns, like if the agent hits the same mistake twice and then stops after feedback injection, thats a measurable delta. but its noisy and slow and i wouldnt call it a real benchmark.
the thing i want to build is pair comparison. run the same task twice, once with memory injection and once cold, measure whether the with-memory version gets a better grounded outcome (tests pass, tool calls succeed, whatever). hard part is finding tasks where the memory actually has something to say. random tasks would just be noise.
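the harness i have in mind is basically this (run_agent and the grounded check are placeholders, the toy stand-ins below just show the flow):

```python
# paired benchmark sketch: run each task warm (with memory) and cold,
# score both on grounded checks, report the mean delta.

def paired_delta(tasks, run_agent, grounded_score):
    """Mean grounded-score difference (with memory minus cold) over tasks."""
    deltas = []
    for task in tasks:
        warm = grounded_score(run_agent(task, memory=True))
        cold = grounded_score(run_agent(task, memory=False))
        deltas.append(warm - cold)
    return sum(deltas) / len(deltas)

# toy stand-ins: memory only "helps" on tasks it has seen before
seen = {"fix auth bug"}
def run_agent(task, memory):
    return {"ok": memory and task in seen}

print(paired_delta(["fix auth bug", "new task"], run_agent,
                   lambda r: 1.0 if r["ok"] else 0.0))  # 0.5
```

which also shows the noise problem: the unseen task contributes zero delta, so the benchmark only means something on tasks where the memory has something to say.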
if theres a way to design a shared benchmark that works for both structural and behavioral approaches that would be a real contribution. would be down to brainstorm it if youre in.
Fair-Championship229@reddit
llm as a judge on its own output is known to be unreliable, theres a bunch of papers on this. youre basically building a system that lies to itself and calls it learning
Tight_Scene8900@reddit (OP)
yeah this is a fair hit, the self-correction literature is real and i wont pretend my loop is immune. the deepmind paper on self-correction is the obvious one.
the thing that keeps it from collapsing into pure confirmation bias in practice is that the llm self-score isnt actually the dominant signal. competence per domain is 70 percent actual task success rate and 30 percent llm self-assessment. so if the model keeps rating itself 5 on database stuff but the tasks keep erroring out or getting corrected, the success rate drags the competence down regardless of what the model thinks. the model grading itself only moves the needle a little.
on top of that theres a feedback path where the user can override a score directly, and old knowledge entries decay over time if nothing reinforces them so a wrong self-rating from week one doesnt haunt the agent forever.
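the whole thing fits in a few lines. weights match what i described above, the half-life is arbitrary:

```python
# competence blend: grounded success rate dominates, the model's
# self-assessment only nudges. plus decay for unreinforced entries.

def competence(success_rate, self_assessment, w_success=0.7):
    """70% actual task outcomes, 30% llm self-score (both in [0, 1])."""
    return w_success * success_rate + (1 - w_success) * self_assessment

def decayed(score, days_since_reinforced, half_life_days=30):
    """Unreinforced knowledge fades; reinforcement resets the clock."""
    return score * 0.5 ** (days_since_reinforced / half_life_days)

# model rates itself 1.0 on database tasks but only 40% actually succeed
print(round(competence(0.4, 1.0), 2))  # 0.58 -- reality drags it down
# an entry nobody reinforces loses half its weight every 30 days
print(round(decayed(0.58, 30), 2))     # 0.29
```

so a wrong week-one self-rating both gets outvoted by outcomes and fades on its own.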
still lossy, still has edges, you can construct cases where it drifts. but a 70/30 grounded signal with human correction on top beats pure self-critique and it beats no signal at all. if youve got specific failure modes from those papers in mind id genuinely want to hear them, trying to find the sharp edges before i trust it with anything serious.
MoneyPowerNexis@reddit
https://imgur.com/a/4jONOVb
Tight_Scene8900@reddit (OP)
lmao memento is unironically the correct mental model for this whole problem. leonards tattoos are basically a rules.md file getting injected into context every morning
Hot-Employ-3399@reddit
No. VRAM is not big enough for putting extra stuff
Tight_Scene8900@reddit (OP)
yeah that’s the wall i kept hitting too. that’s actually why i went local-first desktop instead of trying to shove everything into the model. keep the memory layer outside the inference process entirely
Similar_Gur9888@reddit
this just sounds like RAG with extra steps
Tight_Scene8900@reddit (OP)
yeah I thought the same at first tbh
I guess the difference I’m seeing is it’s not retrieving external docs but its own past task outcomes + tracking failures over time
but yeah the retrieval part probably overlaps a lot