Gemma4-31B worked in an iterative-correction loop (with a long-term memory bank) for 2 hours to solve a problem that baseline GPT-5.4-Pro couldn't
Posted by Ryoiki-Tokuiten@reddit | LocalLLaMA | View on Reddit | 59 comments
Thrumpwart@reddit
On release day I downloaded Gemma 4-21B, loaded it up, and immediately ran into gibberish outputs using Lemonade's llama-server. It happens to most models on release day, whatever.
Tonight, I finally tried again with an Unsloth quant - holy crap this thing is smart. It's coherent and direct in a way few other models are. I forgot how good Gemma models can be at explaining complex concepts.
StardockEngineer@reddit
Which one? I've been using the UD-Q6_K_XL and I'm getting loops. Latest llama.cpp. Share how you're getting good outputs, please.
Thrumpwart@reddit
Are you setting the right inference parameters (temp, top_k, etc)?
StardockEngineer@reddit
Yup
Thrumpwart@reddit
I'm using the same model in latest llama.cpp ROCM build (pulled an hour ago). On an AMD W7900 with ROCM 7.2.1. Works fine for me with the following:
./build/bin/llama-server \
  -m /home/Thrumpwart/LLM/Models/unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-UD-Q6_K_XL.gguf \
  -c 131072 \
  -np 1 \
  -ngl 99 \
  -fa on \
  -b 4096 \
  -ub 4096 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --port 8080 \
  --host 0.0.0.0
Note the above takes up 45+GB of VRAM and uses CPU also during inference. Downloading a Bartowski Q8 now just to see how it is.
StardockEngineer@reddit
I can get 31B running with the latest llama.cpp. I think you have too many parameters tho. Check the ngl and fa defaults. You don't need to set context either, it'll auto-fit.
You only really need fa when you're messing with KV cache quants.
Thrumpwart@reddit
Nice, I had no idea!
Thrumpwart@reddit
Hmmm, I’ll pull newest llama.cpp tonight and run it. Will share results.
MonocleFox@reddit
Would you mind sharing details on your setup / how you ran it? I’m still trying to figure out the best way to run it (lmstudio, ollama, llamacpp) and config. things are moving fast
Thrumpwart@reddit
LMStudio on a Mac. Running the Unsloth Q8_K_XL. I used the parameter settings from the Gemma 4 HF page (I believe temp = 1 and top_k = 64). The Unsloth model's thinking mode wasn't working, but I found a hack here whereby I copy-pasted a line of code into the Jinja template and set the reasoning start and end prompts to thought (start) and (end).
JessicaVance83@reddit
what should be the minimum VPS config for gemma4?
TonyDaDesigner@reddit
i also had GPT 5.4 run into an issue that it couldn't fix. MiniMax was able to fix it in one prompt, surprisingly
Fit-Produce420@reddit
MiniMax and Step 3.5 Flash are both great.
ecompanda@reddit
the framing of 'smaller model with memory beats bigger baseline' misses what i think is the actual variable: access to intermediate conclusions, not just compute time. the baseline can't commit 'i verified X is true' as a hard constraint for later steps. the loop is doing manually what attention fails at over long context: preventing the model from walking back conclusions it already validated.
curious if the 2 hours of runtime was mostly on hard subproblems or spread evenly across the task
Trovebloxian@reddit
What interface are you using? WebUI? GPT4ALL?
Soft_Match5737@reddit
The interesting thing about iterative correction beating single-shot GPT-5.4-Pro is that it reveals where the actual bottleneck is — it's not raw capability, it's the ability to backtrack when a reasoning path goes wrong. A 31B model that can say "wait, that step was wrong" and re-route will beat a 10x larger model that commits to its first chain of thought. The long-term memory bank is doing the heavy lifting here because it prevents the model from re-discovering the same dead ends across iterations.
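The "commit what you verified, never repeat a dead end" idea above can be sketched in a few lines. This is a toy illustration, not the OP's actual implementation; `attempt_fn` and `verify_fn` stand in for the model call and the checker, and all names are invented:

```python
def refine(task, attempt_fn, verify_fn, max_iters=10):
    """Iterative-correction loop with a memory bank of verified facts
    and dead ends, so later passes can't walk back or re-discover them."""
    memory = {"verified": [], "dead_ends": []}
    for _ in range(max_iters):
        # attempt_fn would prompt the model with the task plus memory;
        # here it is any callable returning a candidate solution
        candidate = attempt_fn(task, memory)
        ok, note = verify_fn(candidate)
        if ok:
            # commit the validated conclusion as a hard constraint
            memory["verified"].append(note)
            return candidate, memory
        # record the failure so the next pass treats it as off-limits
        memory["dead_ends"].append({"candidate": candidate, "why": note})
    return None, memory
```

The baseline single-shot model has no equivalent of `memory["dead_ends"]`, which is the gap the loop exploits.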
Ryoiki-Tokuiten@reddit (OP)
Repo Link: https://github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements
kaggleqrdl@reddit
Lol, what's the math problem? I'll believe it when I see it. Otherwise, it looks like spam
zuluana@reddit
I believe the problem and solution are shown in the 2nd image.
kaggleqrdl@reddit
No, they aren't, which makes you wonder why he specifically didn't show it.
_BreakingGood_@reddit
I get the same thought when people say shit like "Yeah I had my agent running the entire weekend autonomously churning through the project"
Like, the fuck project were you working on?
Far-Low-4705@reddit
Came here to say this lol
bdeetz@reddit
Do you have any examples of token spend for the system?
brixon@reddit
Running locally, looping iterative systems work great since you don't really care about total token usage; you mostly focus on not blowing out your context.
jacek2023@reddit
your project looks interesting
BestSeaworthiness283@reddit
Truly impressive
polandtown@reddit
bravo - what's your memory/setup?
korino11@reddit
Loops are the way of monkeys. You need to look directly into the layers and vectors.
DrVonSinistro@reddit
Plot twist: 2 hours at 1.2 t/s
weiyong1024@reddit
we see the same thing managing a fleet of ai agents. give a 30b model a persistent scratch pad between runs and it catches stuff that a frontier model misses on a single pass. the iterating is doing way more than the parameter count, most people underestimate how much memory + loops matter vs just throwing a bigger model at it
MonocleFox@reddit
Would you mind sharing more details on your setup / how you make it happen? I’ve got some tricky engineering problems that I think would benefit from this
weiyong1024@reddit
Each agent runs in its own docker container, isolated from the host and from each other. persistent state (config, memory, workspace) survives restarts via mounted volumes.
The part that might interest you - we have a roster system where every agent automatically knows who else is in the fleet, their role, and which channel they're on. Agents can u/mention each other when they hit something outside their expertise, so you get a distributed version of that iterative refinement - instead of one model looping, specialized agents consult each other and converge. Fleet topology changes (add/remove an agent) auto-sync to all running instances via hot-reload, no restarts needed.
we run about 9 of these on one mac from a browser dashboard. open-sourced if you want to poke around: https://github.com/clawfleet/ClawFleet
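A toy version of that roster idea might look like the following. This is a guess at the shape of it, not the actual ClawFleet internals; class and method names are invented:

```python
import re

class Fleet:
    """Shared roster: every agent can see who else exists and route
    @mentions to the right teammate."""

    def __init__(self):
        self.roster = {}  # agent name -> role description

    def add(self, name, role):
        # adding an agent updates the shared roster immediately,
        # standing in for the "hot-reload, no restarts" behavior
        self.roster[name] = role

    def route(self, message):
        """Return the known agents a message @mentions, with their roles."""
        mentioned = re.findall(r"@(\w+)", message)
        return {m: self.roster[m] for m in mentioned if m in self.roster}
```

The real system presumably does this over channels between containers; the point is just that mention-routing plus a shared roster is enough for agents to consult each other.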
MonocleFox@reddit
This is super helpful, thank you! I’m going to take a look at the repo and try to get my brain around it!
Turbulent_Pin7635@reddit
Where can I learn to build these cool pipelines? Any tips?
openSourcerer9000@reddit
Langgraph is my go-to, lots of great examples in their docs
Turbulent_Pin7635@reddit
Thx!
openSourcerer9000@reddit
Looks like OP used the TypeScript LangGraph; the Python flavor is what I'm familiar with.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Designer_Reaction551@reddit
this tracks with what I've seen. the memory bank is doing the heavy lifting here, not the model size. we run a multi-step pipeline that stores state between iterations in plain JSON and the difference between 'try again from scratch' vs 'here is what you already tried and why it failed' is night and day. context rot is real but a well-scoped memory buffer fixes most of it.
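That "try again from scratch" vs. "here is what you already tried and why it failed" difference boils down to a few lines of state handling. A sketch under assumptions (the file name, keys, and prompt wording are all invented, not the commenter's actual pipeline):

```python
import json
import os

STATE = "pipeline_state.json"  # hypothetical state file

def load_state():
    """Load prior attempts from plain JSON, or start fresh."""
    if os.path.exists(STATE):
        with open(STATE) as f:
            return json.load(f)
    return {"attempts": []}

def record_attempt(state, approach, outcome):
    """Persist what was tried and why it failed, surviving restarts."""
    state["attempts"].append({"approach": approach, "outcome": outcome})
    with open(STATE, "w") as f:
        json.dump(state, f, indent=2)

def build_prompt(task, state):
    """Assemble the 'here is what you already tried' prompt."""
    history = "\n".join(
        f"- {a['approach']}: {a['outcome']}" for a in state["attempts"]
    )
    return f"{task}\n\nAlready tried (do not repeat):\n{history}"
```

Scoping the buffer to attempts-and-outcomes, rather than dumping the full transcript back in, is what keeps it from becoming its own source of context rot.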
CryptoUsher@reddit
kinda wild that a smaller model with memory loops beat a much larger baseline, makes you wonder how much of "performance" is just architecture and how much is giving models time to think
i’m starting to think the next leap isn’t in scale but in making models that can debug their own reasoning over multiple passes, like a compiler optimizing itself
what if the real bottleneck isn’t parameter count but the lack of persistent scratch pads across reasoning steps?
anyone tried simulating working memory with vector db rollbacks or timestamped context pruning?
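Timestamped context pruning can be as simple as evicting the oldest non-pinned entries once a token budget is exceeded. A sketch of that idea (the 4-chars-per-token estimate is a rough assumption, and "pinned" here stands in for verified conclusions you never want dropped):

```python
def prune_context(entries, budget_tokens, est=lambda s: len(s) // 4):
    """entries: list of dicts {'ts': float, 'text': str, 'pinned': bool}.
    Keeps every pinned entry, then the newest unpinned entries that
    still fit in the budget, and returns them in timestamp order."""
    pinned = [e for e in entries if e.get("pinned")]
    rest = sorted((e for e in entries if not e.get("pinned")),
                  key=lambda e: e["ts"], reverse=True)  # newest first
    kept = list(pinned)
    used = sum(est(e["text"]) for e in pinned)
    for e in rest:
        cost = est(e["text"])
        if used + cost > budget_tokens:
            break  # oldest unpinned entries fall off here
        kept.append(e)
        used += cost
    return sorted(kept, key=lambda e: e["ts"])
```

A vector-DB rollback would replace the recency sort with similarity to the current subproblem, but the eviction skeleton is the same.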
openSourcerer9000@reddit
This kind of thing is probably the most exciting use case for AI. Just yesterday I saw this paper, where they beat human SOTA on some optimization problems by running MiniMax in open code, something like "agentic swarm optimization"
https://arxiv.org/html/2604.01658v1#bib.bib2
CryptoUsher@reddit
that minimax agentic swarm stuff is wild, feels like we're finally hacking around brute force
Far-Low-4705@reddit
I think a big part is using tools to interact with an environment and receive feedback.
And I think that "memory loops" just help it stay on an agentic loop for longer without running out of context
CryptoUsher@reddit
yeah i see that, tools + memory could be a game changer for agent-like behavior. fwiw i’ve been testing llama3-70b with a simple scratchpad loop and it’s way better at multi-step tasks than running raw. makes me think the future’s more about thinking than scaling
SkyFeistyLlama8@reddit
A harness with self-modifying prompts... like a constrained sandboxed version of OpenClaw. I like this idea. A memory scratchpad.
CryptoUsher@reddit
kinda wild to think we might hit better performance with a 7b model and a smart scratchpad than a 70b just thinking once. wonder if someone’s already baking this into Oobabooga or llama.cpp configs
SkyFeistyLlama8@reddit
Maybe that scratchpad could end up being like skills or whatever that gimmicky idea is. Load different scratchpads based on usage, like personal finance chat or business email writing.
single_plum_floating@reddit
Isn't that basically the main selling point of hermes agent? seems to me tool-use + memory within it is basically that.
Clear-Ad-9312@reddit
When it comes to longer context and heavy research, I think recursive iterative loops make a big difference, since pieces get built up and the main model doesn't get lost to context rot.
+1 for Hermes
CryptoUsher@reddit
yeah hermes does that pretty well, been running it on my 3090 with vllm and the self-correction actually works
ab2377@reddit
what's a long term memory bank?
garg-aayush@reddit
Impressive, would definitely check out the repo over the weekend.
ApexDigitalHQ@reddit
Asking an LLM to do math always makes me nervous but enough compute and time should be able to reason anything eventually. I have a notepad somewhere with some scribbled notes about auto-research but I'm sure there are plenty of you out there that have implemented something better than I've even imagined.
kaggleqrdl@reddit
That's really cool
LegitimateNature329@reddit
way — 13 agents that live entirely in email. You delegate tasks like you'd email a teammate. Small teams adopt it in hours, not weeks.
Borkato@reddit
!remindme 1 day to check this out
RemindMeBot@reddit
I will be messaging you in 1 day on 2026-04-08 23:00:30 UTC to remind you of this link
Borkato@reddit
This is really cool