server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Posted by jacek2023@reddit | LocalLLaMA | 40 comments

Imagine you are using a local model for agentic coding. You discuss the idea (50k tokens), then say “implement it”. The agent reads files, writes files, runs commands, produces another 20k tokens, and the code is ready. Then your next prompt is just “thank you”, and... nothing happens. You have to wait for "something".

What is happening is that some tools, like opencode, try to be smart and optimize the context by modifying something in the conversation history. In the best case, llama.cpp has to reprocess everything from that point on. In the worst case, it has to reprocess the entire context (70k tokens) and you get “forcing full prompt re-processing...”. To avoid that, I switched from opencode to pi. Not because pi has some magical features, but because it does not do that kind of context rewriting.

Another issue is the model being smart by removing reasoning from the context. In the best case, llama.cpp only has to reprocess the last run (20k tokens). In the worst case, again, it has to reprocess everything (70k tokens). To avoid that, you can enable “preserve thinking”, at least with Qwen 3.6.

The goal of this PR is to avoid the worst case (full prompt re-processing) and get closer to the best case, where llama.cpp only reprocesses what actually changed. I have been using this code for about two weeks, and in my opinion agentic coding is now more responsive.
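A minimal sketch of the prefix-reuse idea described above, not llama.cpp's actual implementation: cached work can only be kept for the tokens that are identical at the start of the old and the new prompt. The function names and the toy token lists are made up for illustration, with integers standing in for token ids.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Count how many leading tokens are identical in both sequences."""
    n = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n


def tokens_to_reprocess(cached_tokens, new_tokens):
    """Tokens of the new prompt that cannot be served from the cached prefix."""
    return len(new_tokens) - common_prefix_len(cached_tokens, new_tokens)


# Toy history: 70 tokens stand in for the 70k-token conversation.
cached = list(range(70))

# Case 1: the harness only appends the next user message.
appended = cached + [100, 101]

# Case 2: the harness rewrites one token in the middle of the history.
rewritten = cached[:50] + [999] + cached[51:] + [100, 101]

append_cost = tokens_to_reprocess(cached, appended)
rewrite_cost = tokens_to_reprocess(cached, rewritten)
print(append_cost, rewrite_cost)
```

In this toy case, appending costs only the two new tokens, while a single edit at position 50 forces everything from position 50 onward to be processed again. That is the "best case" bound from the post; whether a real server reaches it depends on what saved state it can resume from.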

40 Comments

tomobobo@reddit

In my experience, this works great. You saved everyone so much time! Thank you!!!

DistanceSolar1449@reddit

Now we just need checkpoints on SSD. Keeping checkpoints in VRAM is terrible; storing them in RAM is slightly better for dense models (but not good for MoE models or Macs). Ideally, checkpoints should be stored on a fast SSD. A MacBook Pro M1 SSD reads at something like 5 GB/s, so you can read a Qwen 3.6 35b max-context checkpoint at BF16 from SSD in a bit more than 1 second. The Qwen 3.6 27b max KV cache is around 16 GB, so loading a checkpoint from SSD takes a bit more than 3 seconds.
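The load-time estimate above is a plain size-over-bandwidth division. A quick sketch using the commenter's rough figures (16 GB of KV cache, about 5 GB/s sequential read), which are estimates and not measurements:

```python
def load_seconds(checkpoint_gb, ssd_gb_per_s):
    """Ideal sequential read time, ignoring filesystem and copy overhead."""
    return checkpoint_gb / ssd_gb_per_s


# Figures quoted in the comment: ~16 GB cache, ~5 GB/s SSD read speed.
qwen_27b_seconds = load_seconds(16, 5)
print(f"{qwen_27b_seconds:.1f} s")
```

Real load times would be somewhat higher, since this ignores everything except raw read throughput.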

k-u-got-me@reddit

There is an option to save the slot to disk via an endpoint, and I've been using that feature extensively. I even made a small, rudimentary edit to my llama.cpp server code to save the checkpoint data itself, and it recovers about 80-90% of the checkpoint data and KV cache. It's really fast on a Gen4 SSD, to the point where I'm trying to save everything I can, around 5 GB of cache, on disk. It only works with the chat that the cache was generated for. The next step for me would be to make the on-disk KV cache more or less universal across all chats, but I already have everything I need for my use cases. I would love to work on a PR this weekend if enough people want something like this. It would be my first open source commit too! llama.cpp is an amazing tool, and I'm very grateful to the team for everything, tbh.

mr_Owner@reddit

Oh nice how though!

k-u-got-me@reddit

`--slot-save-path` is the flag, and you can specify the folder. Hitting the endpoint then just saves the KV cache for that slot to disk. The change I made basically saves the checkpoint metadata itself, i.e. which n tokens were processed by the KV cache, instead of only saving the KV cache. That allows it to work like normal checkpointing, since new prompts only change the latter part (the head) of the context and the remaining 80-90% stays the same, at least for agentic coding with large pre-processed rules and so on. Basically, before, it saved all its thoughts, but they had to align perfectly for it to "remember". Now it has the links, i.e. the `n_tokens` it had processed before. The number of `n_tokens` that match depends directly on how much the head of the context has changed.
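A hedged sketch of what calling the stock save endpoint can look like, assuming the `POST /slots/{id}?action=save` route with a `filename` field as described in the llama.cpp server README (check the documentation for your build). The host, port, slot id, and file name are placeholders, and this covers only the unmodified save, not the checkpoint-metadata change described above. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_slot_request(base_url, slot_id, action, filename):
    """Build (but do not send) a POST request for the slot save/restore route."""
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: local server on port 8080, slot 0, arbitrary file name.
req = build_slot_request("http://localhost:8080", 0, "save", "agent-session.bin")
print(req.get_method(), req.full_url)

# To send it against a server started with --slot-save-path:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Passing `"restore"` as the action with the same file name builds the matching request to load the cache back into the slot.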

MuDotGen@reddit

Oh snap, it got merged finally? Great! Is it in the latest main branch bin releases yet?

Formal-Exam-8767@reddit

Can the KV cache be spliced? What if you kept the question's KV cache, spliced out the reasoning part, and glued the rest of the response's KV cache onto the end of the question's KV cache?

Imaginary-Unit-3267@reddit

If anything changes anywhere in the KV cache everything after it has to be recomputed. This is a fundamental limitation of autoregressive transformer models, unfortunately.

jacek2023@reddit (OP)

If reasoning is removed from the middle, KV cache after it becomes invalid. So in theory we could compute cache twice: once with reasoning, once without it. But then there is no speed gain.
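A toy illustration of why the splice does not work, using a hash chain as a stand-in for the KV cache. This is an analogy only, not how attention is computed: the one property it borrows is that the state at position i depends on every token before it. All names here are invented for the example.

```python
import hashlib


def chained_states(tokens):
    """Toy stand-in for a KV cache: state i depends on tokens 0..i."""
    states = []
    running = hashlib.sha256()
    for tok in tokens:
        running.update(str(tok).encode("utf-8") + b"|")
        states.append(running.copy().hexdigest())
    return states


question = ["q1", "q2"]
reasoning = ["r1", "r2", "r3"]
answer = ["a1", "a2"]

with_reasoning = chained_states(question + reasoning + answer)
without_reasoning = chained_states(question + answer)

# The question part is identical in both prompts, so its states can be reused.
question_reusable = with_reasoning[:2] == without_reasoning[:2]

# The answer states were computed with the reasoning in front of them,
# so they do not match what the reasoning-free prompt needs.
answer_reusable = with_reasoning[5:] == without_reasoning[2:]

print(question_reusable, answer_reusable)
```

Gluing the old answer states onto the question would reuse values that were computed for a different preceding context, which is why everything after the removed span has to be computed again.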

Napster3301@reddit

great fix, but this is papering over the real bug: agent harnesses rewriting conversation mid-task and breaking kv cache. why is every harness reinventing context management instead of agreeing on a spec inference engines can optimize for?

Confident_Ideal_5385@reddit

Because the OpenAI API is fundamentally stateless, which may make sense if you're hosting models for SaaS users, but adds an insane amount of utterly pointless bookkeeping for single-user use cases. The big guys want a stateless protocol that matches cache in a "best effort" fashion. As an app developer, the ideal API would be able to push/pop tokens to a dedicated KV cache, removing the abstraction tax. This is, coincidentally, the ideal interface for single/few-user local inference. But because tools like Pi need to run everywhere, everyone uses the lowest common denominator, which means forking a conversation is stateless and difficult, when it shouldn't be.
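To make the push/pop idea concrete, here is a purely hypothetical sketch of such a stateful interface. Nothing like `TokenStack` exists in llama.cpp or the OpenAI API; the class and its methods are invented, and a plain list stands in for the server-side KV cache.

```python
class TokenStack:
    """Hypothetical stateful session: the client edits the cached sequence directly."""

    def __init__(self):
        self.tokens = []     # stands in for the server-side KV cache contents
        self.processed = 0   # total tokens the model had to evaluate

    def push(self, new_tokens):
        """Append tokens; only the new ones need to be evaluated."""
        self.tokens.extend(new_tokens)
        self.processed += len(new_tokens)

    def pop(self, count):
        """Drop the last `count` tokens; no evaluation is needed."""
        if count > len(self.tokens):
            raise ValueError("cannot pop more tokens than are cached")
        del self.tokens[len(self.tokens) - count:]

    def fork(self):
        """Branch the conversation by copying the cached state."""
        clone = TokenStack()
        clone.tokens = list(self.tokens)
        return clone


session = TokenStack()
session.push([1, 2, 3, 4, 5])   # initial conversation
session.pop(2)                  # discard the last two tokens
session.push([6, 7, 8])         # continue from the trimmed state
branch = session.fork()         # try an alternative continuation
branch.push([9])
print(session.tokens, branch.tokens, session.processed)
```

With an interface like this, the client states exactly what changed, so the server never has to guess the reusable prefix by comparing whole prompts.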

crantob@reddit

A MILLION POINTS TO YOU, SIR. Two years ago we had effortless continuing convos with straight llama.cpp. I miss that. We have been enshittified already.

Imaginary-Unit-3267@reddit

I still haven't gotten used to the whole `/v1/chat/completions` thing. I miss just pasting text into EleutherAI's GPT-J demo and pasting its responses back into it. So much simpler than all this shit. And more creative than any chatbot, on enough temperature.

Xera1@reddit

Because there is no best way yet. There is no consensus and trying to force one at this point would be overly restrictive and have to be replaced in a month.

jacek2023@reddit (OP)

Watch a few sentences at 11:35 here: [https://youtu.be/Dli5slNaJu0?si=3i_a8piWcg3K3MX2](https://youtu.be/Dli5slNaJu0?si=3i_a8piWcg3K3MX2). This guy wrote Pi.

farkinga@reddit

My subjective impression is: this works great! I am noticing vastly less prompt re-processing. Nice work!

sammcj@reddit

Nice work, and thanks for the contribution!

FiLo420blazeit@reddit

Solid work. Does the checkpoint store an actual KV cache snapshot, or just enough metadata to find the longest reusable prefix on the next request? Curious how it handles mid-conversation edits vs pure tail-trimming, since opencode's rewrites aren't always append-only.

New_Spray_7886@reddit

Great work Jacek - the PR thread was a pleasure to read

Several-Tax31@reddit

Awesome work! This was a big headache lately.

am17an@reddit

Oh wow the poster becomes postee, Congrats on the merge!

jacek2023@reddit (OP)

Now let's wait for regressions and side effects 😉

LegacyRemaster@reddit

classic 😃

ilintar@reddit

Note: we've cooperated with u/jacek2023 to ensure that all the supported models/parsers are compatible with this. It has been something we've been discussing for some time, but his PR gave us the motivation to actually work through it 😄 This might sound like a small change, but it's really a big deal, and Jacek put a lot of hard work into this.

jacek2023@reddit (OP)

I have a big collection of models, and over the weekend I tried to run them all (from Mistral Nemo / Llama 3 to MiniMax/Step) to make sure nothing crashes. 😄

PaceZealousideal6091@reddit

Congratulations Jacek! Thanks a lot for the amazing work!

YetAnotherAnonymoose@reddit

Does beellama have a fix like this already /u/Anbeeld ?

Anbeeld@reddit

I'm currently rebasing it to the latest llama.cpp, so no worries there.

YetAnotherAnonymoose@reddit

👍🏻

ImpossibleHot@reddit

a big hug for you 🤗

joost00719@reddit

Finally. Been struggling with this a lot. Thank you man.

Existing_Bet_350@reddit

Context optimization in agentic workflows is a real pain point, that checkpoint fix addresses the symptom but the root issue is state management across long conversations. Tools modifying history mid-session breaks determinism and makes debugging nearly impossible. This is actually a core problem Yellow SDK tackles for AI agent infrastructure. State channels maintain cryptographic proofs of conversation state, so you get verifiable checkpoints without the context mutation issues. Useful when agents need to settle transactions or coordinate with other agents mid-workflow. If you're building agentic tools that need reliable state handling, worth checking out the SDK docs at [yellow.org](http://yellow.org) Cheers

jacek2023@reddit (OP)

I understand, but can you configure OpenCode to not mess with the context?
