How does the system prompt actually work? Does it differ per provider and per model? Also, how does it impact prompt caching?
Posted by haodocowsfly@reddit | LocalLLaMA | View on Reddit | 12 comments
So I'm reading: https://developers.openai.com/cookbook/examples/prompt_caching_201 and https://platform.claude.com/docs/en/build-with-claude/prompt-caching, and they say that, for the cache to stay stable, the prompt should be ordered tools > system prompt > message content.
I'm a bit confused about the system prompt part. From what I remember when I briefly played around with Gemma, the format should be:
“””
[message history] (stripped of system prompt)
and then in the next message:
system: [attached system prompt]
user: (new message)
“””
Doesn’t that mean the most important part of the cache is “message history content” and not the tools/system prompt? Or are there other strategies for the system prompt?
I’m trying to figure this out because I noticed this:
https://haowjy.github.io/blog/75-percent-redundant-reads (sorry for some of the AI slop, especially at the bottom; I haven't had time to clean up my theory/experiment yet).
The main technique I'm trying to figure out is whether we can ditch most "tool results" from the history and put them into the system prompt dynamically, as a sort of exact "working memory" for the most recent tool calls (especially reads) that always holds the most up-to-date contents of something, so the message history doesn't get polluted with constant re-reads.
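Rough sketch of the shape I have in mind (all names here are made up, purely illustrative):

```python
# Hypothetical "working memory" sketch: keep only the latest read of each
# resource in a dict, render it into the system prompt every turn, and drop
# the raw tool-result messages from the history.
BASE_SYSTEM = "You are a coding assistant."

def build_system_prompt(working_memory: dict[str, str]) -> str:
    sections = [BASE_SYSTEM]
    for path, contents in working_memory.items():
        sections.append(f"<file path={path!r}>\n{contents}\n</file>")
    return "\n\n".join(sections)

# Latest read wins; older reads of the same file are overwritten, not appended.
working_memory = {"src/app.py": "<most recent contents of src/app.py>"}

messages = [
    {"role": "system", "content": build_system_prompt(working_memory)},
    # ...conversation turns, without the repeated raw read results...
    {"role": "user", "content": "Refactor the handler in src/app.py."},
]
```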
Awwtifishal@reddit
Gemma didn't have a system prompt until version 4.
The prompt cache only works as long as the prompt remains exactly the same from the beginning up to the point where the cache is used. If you change or insert something near the beginning, then the whole context has to be reprocessed from that point through the last message.
The system prompt and the tool definitions are both placed at the beginning of the context, at least for all the models I know about. The chat template converts the list of messages, tools, and system message into one continuous piece of text/tokens.
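A toy Python sketch of that behavior (the template and "tokens" here are made up, it's just to show the prefix matching):

```python
# Toy illustration (not any specific runtime's code) of why prefix caching
# breaks when anything near the top changes: the template flattens
# tools + system + messages into one sequence, and the cache can only be
# reused up to the first position that differs.
def render(tools: str, system: str, messages: list[tuple[str, str]]) -> list[str]:
    # Stand-in for a real chat template: tools and system come first,
    # then the conversation, as one continuous stream.
    tokens = f"[TOOLS]{tools}[SYSTEM]{system}".split()
    for role, content in messages:
        tokens += f"[{role.upper()}]{content}".split()
    return tokens

def reusable_prefix(cached: list[str], new: list[str]) -> int:
    n = 0
    while n < min(len(cached), len(new)) and cached[n] == new[n]:
        n += 1
    return n  # everything after this index must be recomputed

old = render("search", "You are helpful.", [("user", "hi"), ("assistant", "hello")])
new = render("search", "You are helpful. Today is Tuesday.",  # changed near the top
             [("user", "hi"), ("assistant", "hello"), ("user", "anything new?")])
print(reusable_prefix(old, new))  # small: almost nothing before the change is reused
```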
a_beautiful_rhind@reddit
It did if you used TC or altered the jinja.
Awwtifishal@reddit
Well, of course you can do anything with any model in text completion mode or by altering the template. The point is what the model was natively (post)trained with.
a_beautiful_rhind@reddit
We don't know with Gemma. They probably removed the system prompt as part of the censoring. It did respond to one, though.
I surmise that for part of its training it had a system prompt or guidelines, and then in the end it was bowled over before release.
Awwtifishal@reddit
Yeah, the template had a system message slot, but it was just prepended to the first user message. The model was trained to follow instructions, so the system message works, just maybe not as well as in models it's designed for. For example, if the model is supposed to follow instructions in the system message but not in the user message, it has no way to tell them apart, unless you know how the system message is inserted so you can include instructions about what comes below it.
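If I remember right, the rendered prompt ends up roughly like this (Gemma template details from memory; exact whitespace/tokens may differ):

```python
# Rough shape of what the Gemma chat template does with a system message
# (from memory; check the actual template in the model repo for exact details):
system = "Always answer in French."
first_user = "What's the capital of Japan?"

prompt = (
    "<bos><start_of_turn>user\n"
    + system + "\n\n"                   # system text is simply prepended to the first user turn
    + first_user + "<end_of_turn>\n"
    "<start_of_turn>model\n"
)
print(prompt)
```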
haodocowsfly@reddit (OP)
I was playing around with Gemma 3. That makes sense - I guess the system prompt is trained to be understood at the top? Is that really the best place to put the system prompt?
I think my idea will end up putting it as a user message then.
yuicebox@reddit
It's the best in the sense that it allows for effective caching. You can also insert guidance/messages further down in your chat to steer things, which can help if you're having issues with the model not following the system prompt later in conversations.
haodocowsfly@reddit (OP)
Do you know how much control you have over prompt caching (maybe in llama.cpp)? Specifically, if you had messages you wanted to send as part of the prompt but didn't want to be part of the "cache," can you control that? It looks like you can with the Claude API.
yuicebox@reddit
There are definitely ways you can control it, but I'm not that familiar with them because I've never tried to do anything fancy with caching.
My high-level understanding: I'm not sure how you'd do this optimally in practice, but if you go down to the bottom of the llama.cpp server documentation, where they go over the different endpoints the server provides, there's some info about it.
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
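Something roughly like this against the /completion endpoint, going off that README (I haven't verified the fancier options):

```python
# llama.cpp server: cache_prompt asks the server to reuse the KV cache for the
# longest matching prefix of the prompt from a previous request.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "You are a helpful assistant.\n\nUser: hello\nAssistant:",
        "n_predict": 64,
        "cache_prompt": True,
    },
)
print(resp.json()["content"])
```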
Awwtifishal@reddit
You can do both. A system message and then further guidance near the bottom.
Ha_Deal_5079@reddit
Stuffing tool results into the system prompt nukes your cache hit rate, since Claude needs exact token matches. Keep tool results in the messages, with a cache breakpoint between the system prompt and the messages.
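Something like this with the Anthropic SDK (model name is just an example; see the caching docs linked in the OP):

```python
# Cache breakpoint sketch: the stable system prompt gets a cache_control marker,
# tool results stay in the messages list below it, so the cached prefix never changes.
import anthropic

LONG_STABLE_SYSTEM_PROMPT = "You are a coding agent. <long, byte-identical instructions...>"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model name; any cache-capable model works
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache everything up to here
        }
    ],
    messages=[
        {"role": "user", "content": "Read src/app.py and summarize it."},
        # ...tool_use / tool_result turns live here, below the breakpoint...
    ],
)
```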
HiddenoO@reddit
It's not exactly the same between providers/models, but generally speaking, the system prompt and available tools (plus some additional context based on parameters in some cases) are stored at the very top as they're expected to stay constant for the most part, and the message history follows afterwards. That way, you can cache anything up to the latest message.
This also means that you want to avoid changing your system prompt regularly, especially dynamically (e.g., a current timestamp), as you'll generally be unable to reuse anything after the changed part from the cache.
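For example (illustrative only), the same timestamp placed where it does vs. doesn't hurt the cache:

```python
# A changing system prompt invalidates everything after it; putting volatile
# info in the newest user message instead leaves the cached prefix intact.
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Bad: the system prompt changes every request, so nothing after it is cacheable.
messages_bad = [
    {"role": "system", "content": f"You are a helpful assistant. Current time: {now}"},
    {"role": "user", "content": "What's on my calendar today?"},
]

# Better: the system prompt stays constant; the volatile part rides in the latest message.
messages_good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"(Current time: {now}) What's on my calendar today?"},
]
```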