How does the system prompt actually work? Does it differ per provider and per model? Also, how does it impact prompt caching?
Posted by haodocowsfly@reddit | LocalLLaMA | View on Reddit | 12 comments
So I'm reading: https://developers.openai.com/cookbook/examples/prompt_caching_201 and https://platform.claude.com/docs/en/build-with-claude/prompt-caching, and they say that, for the cache to stay stable, the prompt should be ordered tools > system prompt > message content.
I'm a bit confused about the system prompt part. From what I remember when I briefly played around with Gemma, the format should be:
“””
[message history] (stripped of system prompt)
and then in the next message:
system: [attached system prompt]
user: (new message)
“””
Doesn’t that mean the most important part of the cache is “message history content” and not the tools/system prompt? Or are there other strategies for the system prompt?
I’m trying to figure this out because I noticed this:
https://haowjy.github.io/blog/75-percent-redundant-reads (sorry for some of the AI slop, especially at the bottom; I haven't had time to clean up my theory/experiment yet).
The main technique I'm trying to figure out is whether we can ditch most "tool results" from the history and put them into the system prompt dynamically, as a sort of exact "working memory" for the most recent tool calls (especially reads) that always holds the most up-to-date contents of something, so the message history doesn't get polluted with constant re-reads.
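Rough sketch of the shape I have in mind (all names here are made up, purely illustrative):

```python
# Hypothetical "working memory" sketch: keep only the latest read of each
# resource in a dict, render it into the system prompt every turn, and drop
# the raw tool-result messages from the history.
BASE_SYSTEM = "You are a coding assistant."

def build_system_prompt(working_memory: dict[str, str]) -> str:
    sections = [BASE_SYSTEM]
    for path, contents in working_memory.items():
        sections.append(f"<file path={path!r}>\n{contents}\n</file>")
    return "\n\n".join(sections)

# Latest read wins; older reads of the same file are overwritten, not appended.
working_memory = {"src/app.py": "<most recent contents of src/app.py>"}

messages = [
    {"role": "system", "content": build_system_prompt(working_memory)},
    # ...conversation turns, without the repeated raw read results...
    {"role": "user", "content": "Refactor the handler in src/app.py."},
]
```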
Awwtifishal@reddit
Gemma didn't have a system prompt until version 4.
The prompt cache only works as long as the prompt remains exactly the same from the beginning up to the point where the cache is used. If you change or insert something near the beginning, then the whole context has to be reprocessed from that point through the last message.
The system prompt and the tool definitions are both placed at the beginning of the context, at least for all the models I know about. The chat template converts the list of messages, tools, and system message into one continuous piece of text/tokens.
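A toy Python sketch of that behavior (the template and "tokens" here are made up, it's just to show the prefix matching):

```python
# Toy illustration (not any specific runtime's code) of why prefix caching
# breaks when anything near the top changes: the template flattens
# tools + system + messages into one sequence, and the cache can only be
# reused up to the first position that differs.
def render(tools: str, system: str, messages: list[tuple[str, str]]) -> list[str]:
    # Stand-in for a real chat template: tools and system come first,
    # then the conversation, as one continuous stream.
    tokens = f"[TOOLS]{tools}[SYSTEM]{system}".split()
    for role, content in messages:
        tokens += f"[{role.upper()}]{content}".split()
    return tokens

def reusable_prefix(cached: list[str], new: list[str]) -> int:
    n = 0
    while n < min(len(cached), len(new)) and cached[n] == new[n]:
        n += 1
    return n  # everything after this index must be recomputed

old = render("search", "You are helpful.", [("user", "hi"), ("assistant", "hello")])
new = render("search", "You are helpful. Today is Tuesday.",  # changed near the top
             [("user", "hi"), ("assistant", "hello"), ("user", "anything new?")])
print(reusable_prefix(old, new))  # small: almost nothing before the change is reused
```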
a_beautiful_rhind@reddit
It did if you used TC or altered the jinja.
Awwtifishal@reddit
Well, of course you can do anything with any model in text completion mode or by altering the template. The point is what the model was natively (post)trained with.
a_beautiful_rhind@reddit
We don't know with Gemma. They probably removed the system prompt as part of the censoring. It did respond to one, though.
I surmise that for part of its training it had a system prompt or guidelines, and then in the end it was bowled over before release.
Awwtifishal@reddit
Yeah, the template had a system message slot, but it was just prepended to the first user message. The model was trained to follow instructions, so the system message works, just maybe not as well as in models it's designed for. For example, if the model is supposed to follow instructions in the system message but not in the user message, it has no way to tell them apart, unless you know how the system message is inserted so you can include instructions about what comes below it.
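If I remember right, the rendered prompt ends up roughly like this (Gemma template details from memory; exact whitespace/tokens may differ):

```python
# Rough shape of what the Gemma chat template does with a system message
# (from memory; check the actual template in the model repo for exact details):
system = "Always answer in French."
first_user = "What's the capital of Japan?"

prompt = (
    "<bos><start_of_turn>user\n"
    + system + "\n\n"                   # system text is simply prepended to the first user turn
    + first_user + "<end_of_turn>\n"
    "<start_of_turn>model\n"
)
print(prompt)
```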
haodocowsfly@reddit (OP)
I was playing around with Gemma 3. That makes sense - I guess the system prompt is trained to be understood at the top? Is that really the best place to put the system prompt?
I think my idea will end up putting it as a user message then.
yuicebox@reddit
It's the best in the sense that it allows for effective caching. You can also insert guidance/messages further down in your chat to steer things, which can help if you're having issues with the model not following the system prompt later in conversations.
haodocowsfly@reddit (OP)
Do you know how much control you have over prompt caching (maybe in llama.cpp)? Specifically, if you had messages you wanted to send as part of the prompt but didn't want to be part of the "cache," can you control that? It looks like you can with the Claude API.
yuicebox@reddit
There are definitely ways you can control it, but I'm not that familiar with them because I've never tried to do anything fancy with caching.
My high-level understanding: I'm not sure how you'd do this optimally in practice, but if you go down to the bottom of the llama.cpp server documentation, where they go over the different endpoints the server provides, there's some info about it.
https://github.com/ggml-org/llama.cpp/tree/master/tools/server
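Something roughly like this against the /completion endpoint, going off that README (I haven't verified the fancier options):

```python
# llama.cpp server: cache_prompt asks the server to reuse the KV cache for the
# longest matching prefix of the prompt from a previous request.
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "You are a helpful assistant.\n\nUser: hello\nAssistant:",
        "n_predict": 64,
        "cache_prompt": True,
    },
)
print(resp.json()["content"])
```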
Awwtifishal@reddit
You can do both. A system message and then further guidance near the bottom.
Ha_Deal_5079@reddit
Stuffing tool results into the system prompt nukes your cache hit rate, since Claude needs exact token matches. Keep tool results in the messages, with a cache breakpoint between the system prompt and the messages.
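Something like this with the Anthropic SDK (model name is just an example; see the caching docs linked in the OP):

```python
# Cache breakpoint sketch: the stable system prompt gets a cache_control marker,
# tool results stay in the messages list below it, so the cached prefix never changes.
import anthropic

LONG_STABLE_SYSTEM_PROMPT = "You are a coding agent. <long, byte-identical instructions...>"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # example model name; any cache-capable model works
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache everything up to here
        }
    ],
    messages=[
        {"role": "user", "content": "Read src/app.py and summarize it."},
        # ...tool_use / tool_result turns live here, below the breakpoint...
    ],
)
```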
HiddenoO@reddit
It's not exactly the same between providers/models, but generally speaking, the system prompt and available tools (plus some additional context based on parameters in some cases) are stored at the very top as they're expected to stay constant for the most part, and the message history follows afterwards. That way, you can cache anything up to the latest message.
This also means that you want to avoid changing your system prompt regularly, especially dynamically (e.g., a current timestamp), as you'll generally be unable to reuse anything after the changed part from the cache.
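For example (illustrative only), the same timestamp placed where it does vs. doesn't hurt the cache:

```python
# A changing system prompt invalidates everything after it; putting volatile
# info in the newest user message instead leaves the cached prefix intact.
from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Bad: the system prompt changes every request, so nothing after it is cacheable.
messages_bad = [
    {"role": "system", "content": f"You are a helpful assistant. Current time: {now}"},
    {"role": "user", "content": "What's on my calendar today?"},
]

# Better: the system prompt stays constant; the volatile part rides in the latest message.
messages_good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"(Current time: {now}) What's on my calendar today?"},
]
```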