Gemma4-31B worked in an iterative-correction loop (with a long-term memory bank) for 2 hours to solve a problem that baseline GPT-5.4-Pro couldn't
Posted by Ryoiki-Tokuiten@reddit | LocalLLaMA | View on Reddit | 59 comments
Thrumpwart@reddit
On release day I downloaded Gemma 4-21B, loaded it up, and immediately ran into gibberish outputs using Lemonade's llama-server. It happens to most models on release day, whatever.
Tonight, I finally tried again with an Unsloth quant - holy crap this thing is smart. It's coherent and direct in a way few other models are. I forgot how good Gemma models can be at explaining complex concepts.
StardockEngineer@reddit
Which one? I've been using the UD-Q6_K_XL and I'm getting loops. Latest llama.cpp. Share how you're getting good outputs, please.
Thrumpwart@reddit
Are you setting the right inference parameters (temp, top_k, etc)?
StardockEngineer@reddit
Yup
Thrumpwart@reddit
I'm using the same model in latest llama.cpp ROCM build (pulled an hour ago). On an AMD W7900 with ROCM 7.2.1. Works fine for me with the following:
./build/bin/llama-server \
  -m /home/Thrumpwart/LLM/Models/unsloth/gemma-4-31B-it-GGUF/gemma-4-31B-it-UD-Q6_K_XL.gguf \
  -c 131072 \
  -np 1 \
  -ngl 99 \
  -fa on \
  -b 4096 \
  -ub 4096 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --port 8080 \
  --host 0.0.0.0
Note the above takes up 45+GB of VRAM and uses CPU also during inference. Downloading a Bartowski Q8 now just to see how it is.
StardockEngineer@reddit
I can get 31B running with the latest llama.cpp. I think you have too many parameters tho. Check the ngl and fa defaults. You don't need to set context either, it'll auto-fit.
You only really need fa when you're messing with KV cache quants.
Thrumpwart@reddit
Nice, I had no idea!
Thrumpwart@reddit
Hmmm, I’ll pull newest llama.cpp tonight and run it. Will share results.
MonocleFox@reddit
Would you mind sharing details on your setup / how you ran it? I’m still trying to figure out the best way to run it (lmstudio, ollama, llamacpp) and config. things are moving fast
Thrumpwart@reddit
LMStudio on a Mac. Running the Unsloth Q8_K_XL. I used the parameter settings from the Gemma 4 HF page (I believe temp = 1 and top_k = 64). The Unsloth model's thinking mode wasn't working, but I found a hack here whereby I copy-pasted a line of code into the Jinja template and set the reasoning start and end prompts to thought (start) and (end).
JessicaVance83@reddit
what should be the minimum VPS config for gemma4?
TonyDaDesigner@reddit
i also had GPT 5.4 run into an issue that it couldn't fix. MiniMax was able to fix it in one prompt, surprisingly
Fit-Produce420@reddit
MiniMax and Step 3.5 Flash are both great.
ecompanda@reddit
the framing of 'smaller model with memory beats bigger baseline' misses what i think is the actual variable: access to intermediate conclusions, not just compute time. the baseline can't commit 'i verified X is true' as a hard constraint for later steps. the loop is doing manually what attention fails at over long context: preventing the model from walking back conclusions it already validated.
curious if the 2 hours of runtime was mostly on hard subproblems or spread evenly across the task
Trovebloxian@reddit
What interface are you using? WebUI? GPT4ALL?
Soft_Match5737@reddit
The interesting thing about iterative correction beating single-shot GPT-5.4-Pro is that it reveals where the actual bottleneck is — it's not raw capability, it's the ability to backtrack when a reasoning path goes wrong. A 31B model that can say "wait, that step was wrong" and re-route will beat a 10x larger model that commits to its first chain of thought. The long-term memory bank is doing the heavy lifting here because it prevents the model from re-discovering the same dead ends across iterations.
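The "commit what you verified, never repeat a dead end" idea above can be sketched in a few lines. This is a toy illustration, not the OP's actual implementation; `attempt_fn` and `verify_fn` stand in for the model call and the checker, and all names are invented:

```python
def refine(task, attempt_fn, verify_fn, max_iters=10):
    """Iterative-correction loop with a memory bank of verified facts
    and dead ends, so later passes can't walk back or re-discover them."""
    memory = {"verified": [], "dead_ends": []}
    for _ in range(max_iters):
        # attempt_fn would prompt the model with the task plus memory;
        # here it is any callable returning a candidate solution
        candidate = attempt_fn(task, memory)
        ok, note = verify_fn(candidate)
        if ok:
            # commit the validated conclusion as a hard constraint
            memory["verified"].append(note)
            return candidate, memory
        # record the failure so the next pass treats it as off-limits
        memory["dead_ends"].append({"candidate": candidate, "why": note})
    return None, memory
```

The baseline single-shot model has no equivalent of `memory["dead_ends"]`, which is the gap the loop exploits.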
Ryoiki-Tokuiten@reddit (OP)
Repo Link: https://github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements
kaggleqrdl@reddit
Lol, what's the math problem? I'll believe it when I see it. Otherwise, it looks like spam
zuluana@reddit
I believe the problem and solution are shown in the 2nd image.
kaggleqrdl@reddit
No, they aren't, which makes you wonder why he specifically didn't show it.
_BreakingGood_@reddit
I get the same thought when people say shit like "Yeah I had my agent running the entire weekend autonomously churning through the project"
Like, the fuck project were you working on?
Far-Low-4705@reddit
Came here to say this lol
bdeetz@reddit
Do you have any examples of token spend for the system?
brixon@reddit
Running locally, looping iterative systems work great since you don't really care about total token usage; you mostly focus on not blowing out your context.
jacek2023@reddit
your project looks interesting
BestSeaworthiness283@reddit
Truly impressive
polandtown@reddit
bravo - what's your memory/setup?
korino11@reddit
Loops are the way of monkeys. You need to look directly into the layers and vectors.
DrVonSinistro@reddit
Plot twist: 2 hours at 1.2 t/s
weiyong1024@reddit
we see the same thing managing a fleet of ai agents. give a 30b model a persistent scratch pad between runs and it catches stuff that a frontier model misses on a single pass. the iterating is doing way more than the parameter count, most people underestimate how much memory + loops matter vs just throwing a bigger model at it
MonocleFox@reddit
Would you mind sharing more details on your setup / how you make it happen? I’ve got some tricky engineering problems that I think would benefit from this
weiyong1024@reddit
Each agent runs in its own docker container, isolated from the host and from each other. persistent state (config, memory, workspace) survives restarts via mounted volumes.
The part that might interest you - we have a roster system where every agent automatically knows who else is in the fleet, their role, and which channel they're on. Agents can u/mention each other when they hit something outside their expertise, so you get a distributed version of that iterative refinement - instead of one model looping, specialized agents consult each other and converge. Fleet topology changes (add/remove an agent) auto-sync to all running instances via hot-reload, no restarts needed.
we run about 9 of these on one mac from a browser dashboard. open-sourced if you want to poke around: https://github.com/clawfleet/ClawFleet
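A toy version of that roster idea might look like the following. This is a guess at the shape of it, not the actual ClawFleet internals; class and method names are invented:

```python
import re

class Fleet:
    """Shared roster: every agent can see who else exists and route
    @mentions to the right teammate."""

    def __init__(self):
        self.roster = {}  # agent name -> role description

    def add(self, name, role):
        # adding an agent updates the shared roster immediately,
        # standing in for the "hot-reload, no restarts" behavior
        self.roster[name] = role

    def route(self, message):
        """Return the known agents a message @mentions, with their roles."""
        mentioned = re.findall(r"@(\w+)", message)
        return {m: self.roster[m] for m in mentioned if m in self.roster}
```

The real system presumably does this over channels between containers; the point is just that mention-routing plus a shared roster is enough for agents to consult each other.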
MonocleFox@reddit
This is super helpful, thank you! I’m going to take a look at the repo and try to get my brain around it!
Turbulent_Pin7635@reddit
Where can I learn to build these cool pipelines? Any tips?
openSourcerer9000@reddit
Langgraph is my go-to, lots of great examples in their docs
Turbulent_Pin7635@reddit
Thx!
openSourcerer9000@reddit
Looks like OP used the TypeScript LangGraph; the Python flavor is what I'm familiar with.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Designer_Reaction551@reddit
this tracks with what I've seen. the memory bank is doing the heavy lifting here, not the model size. we run a multi-step pipeline that stores state between iterations in plain JSON and the difference between 'try again from scratch' vs 'here is what you already tried and why it failed' is night and day. context rot is real but a well-scoped memory buffer fixes most of it.
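That "try again from scratch" vs. "here is what you already tried and why it failed" difference boils down to a few lines of state handling. A sketch under assumptions (the file name, keys, and prompt wording are all invented, not the commenter's actual pipeline):

```python
import json
import os

STATE = "pipeline_state.json"  # hypothetical state file

def load_state():
    """Load prior attempts from plain JSON, or start fresh."""
    if os.path.exists(STATE):
        with open(STATE) as f:
            return json.load(f)
    return {"attempts": []}

def record_attempt(state, approach, outcome):
    """Persist what was tried and why it failed, surviving restarts."""
    state["attempts"].append({"approach": approach, "outcome": outcome})
    with open(STATE, "w") as f:
        json.dump(state, f, indent=2)

def build_prompt(task, state):
    """Assemble the 'here is what you already tried' prompt."""
    history = "\n".join(
        f"- {a['approach']}: {a['outcome']}" for a in state["attempts"]
    )
    return f"{task}\n\nAlready tried (do not repeat):\n{history}"
```

Scoping the buffer to attempts-and-outcomes, rather than dumping the full transcript back in, is what keeps it from becoming its own source of context rot.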
CryptoUsher@reddit
kinda wild that a smaller model with memory loops beat a much larger baseline, makes you wonder how much of "performance" is just architecture and how much is giving models time to think
i’m starting to think the next leap isn’t in scale but in making models that can debug their own reasoning over multiple passes, like a compiler optimizing itself
what if the real bottleneck isn’t parameter count but the lack of persistent scratch pads across reasoning steps?
anyone tried simulating working memory with vector db rollbacks or timestamped context pruning?
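Timestamped context pruning can be as simple as evicting the oldest non-pinned entries once a token budget is exceeded. A sketch of that idea (the 4-chars-per-token estimate is a rough assumption, and "pinned" here stands in for verified conclusions you never want dropped):

```python
def prune_context(entries, budget_tokens, est=lambda s: len(s) // 4):
    """entries: list of dicts {'ts': float, 'text': str, 'pinned': bool}.
    Keeps every pinned entry, then the newest unpinned entries that
    still fit in the budget, and returns them in timestamp order."""
    pinned = [e for e in entries if e.get("pinned")]
    rest = sorted((e for e in entries if not e.get("pinned")),
                  key=lambda e: e["ts"], reverse=True)  # newest first
    kept = list(pinned)
    used = sum(est(e["text"]) for e in pinned)
    for e in rest:
        cost = est(e["text"])
        if used + cost > budget_tokens:
            break  # oldest unpinned entries fall off here
        kept.append(e)
        used += cost
    return sorted(kept, key=lambda e: e["ts"])
```

A vector-DB rollback would replace the recency sort with similarity to the current subproblem, but the eviction skeleton is the same.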
openSourcerer9000@reddit
This kind of thing is probably the most exciting use case for AI. Just yesterday I saw this paper, where they beat human SOTA on some optimization problems by running MiniMax in open code, something like "agentic swarm optimization"
https://arxiv.org/html/2604.01658v1#bib.bib2
CryptoUsher@reddit
that minimax agentic swarm stuff is wild, feels like we're finally hacking around brute force
Far-Low-4705@reddit
I think a big part is using tools to interact with an environment and receive feedback.
And I think that "memory loops" just help it stay on an agentic loop for longer without running out of context
CryptoUsher@reddit
yeah i see that, tools + memory could be a game changer for agent-like behavior. fwiw i’ve been testing llama3-70b with a simple scratchpad loop and it’s way better at multi-step tasks than running raw. makes me think the future’s more about thinking than scaling
SkyFeistyLlama8@reddit
A harness with self-modifying prompts... like a constrained sandboxed version of OpenClaw. I like this idea. A memory scratchpad.
CryptoUsher@reddit
kinda wild to think we might hit better performance with a 7b model and a smart scratchpad than a 70b just thinking once. wonder if someone’s already baking this into Oobabooga or llama.cpp configs
SkyFeistyLlama8@reddit
Maybe that scratchpad could end up being like skills or whatever that gimmicky idea is. Load different scratchpads based on usage, like personal finance chat or business email writing.
single_plum_floating@reddit
Isn't that basically the main selling point of hermes agent? seems to me tool-use + memory within it is basically that.
Clear-Ad-9312@reddit
When it comes to longer context and heavy research, I think recursive iterative loops make a big difference, since pieces get built up and the main model doesn't get lost to context rot.
+1 for Hermes
CryptoUsher@reddit
yeah hermes does that pretty well, been running it on my 3090 with vllm and the self-correction actually works
ab2377@reddit
what's a long term memory bank?
garg-aayush@reddit
Impressive, would definitely check out the repo over the weekend.
ApexDigitalHQ@reddit
Asking an LLM to do math always makes me nervous but enough compute and time should be able to reason anything eventually. I have a notepad somewhere with some scribbled notes about auto-research but I'm sure there are plenty of you out there that have implemented something better than I've even imagined.
kaggleqrdl@reddit
That's really cool
LegitimateNature329@reddit
way — 13 agents that live entirely in email. You delegate tasks like you'd email a teammate. Small teams adopt it in hours, not weeks.
Borkato@reddit
!remindme 1 day to check this out
RemindMeBot@reddit
I will be messaging you in 1 day on 2026-04-08 23:00:30 UTC to remind you of this link
Borkato@reddit
This is really cool