Is there a difference between chat and repeated calling from scratch?
Posted by yelling-at-clouds-40@reddit | LocalLLaMA | 12 comments
When I chat with a bot, it goes like:
- #0: my message
- #1: the bot's reply
- #2: my message
- #3: the bot's reply
- #4: my message
- #5: the bot's reply
Is there any fundamental difference between that and calling the LLM from scratch each time: first with #0, then with the concatenation of #0 and #2 (or is it #0, #1, #2?), and then with #0, #2, and #4 (or is it #0..#4?)?
And does this differ between models, i.e. do some models respond in significantly different ways?
Radiant_Dog1937@reddit
The context is just a string of text with the sum of all the messages. If you load the exact same context, however you load it, and have the exact same settings, you get the exact same response.
this-just_in@reddit
This isn't really true. Your inference parameters will lead to sampling changes and therefore different responses.
Mart-McUH@reddit
Well, there are other parts besides the input/output messages: there might be a system prompt, some summary/memory, etc. But it is essentially true. The LLM just gets one big text as input and produces the next token. Chat is just an illusion; the frontend already does what OP proposes, i.e. it concatenates all messages into a single blob (with the proper tag tokens, of course).
Sampling only affects the output, which I do not think was OP's question. If he properly puts into #0 everything that happened in #1-#5 (including the instruct tags if it is an instruct model, etc.), he will get exactly the same prompt and thus exactly the same response in both scenarios. The response itself will be affected by the sampler, of course, but in the same way, since it is the same input (e.g. a deterministic sampler should produce the same output in both cases).
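For illustration, here is roughly what that flattening looks like, as a minimal sketch using Hugging Face's apply_chat_template; the model name is just an example of an instruct-tuned model that ships a chat template:

# Minimal sketch: the whole "chat" is flattened into one prompt string before inference.
from transformers import AutoTokenizer

# Example model; any instruct-tuned model with a chat template in its tokenizer config works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "user", "content": "#0 first user message"},
    {"role": "assistant", "content": "#1 first model reply"},
    {"role": "user", "content": "#2 second user message"},
]

# One big string with the model's special tags; this is what actually gets tokenized.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)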
Radiant_Dog1937@reddit
What parameters change in this case?
LumpyWelds@reddit
I'm pretty sure that the RNG will be different on the second run. Also, unless you are using CPU mode, your output isn't guaranteed to be the same even with a fixed seed for the RNG, because GPUs are usually non-deterministic for performance reasons and that can affect the output.
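As a toy sketch of where the randomness enters (not any particular engine's sampler code):

# Toy sketch of temperature sampling; not any specific engine's implementation.
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    if temperature == 0:
        return int(np.argmax(logits))  # greedy: same input always gives the same token
    if rng is None:
        rng = np.random.default_rng()  # unseeded RNG: differs from run to run
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.5, 0.3])
print(sample_next_token(logits, temperature=0))                  # deterministic
print(sample_next_token(logits, rng=np.random.default_rng(42)))  # reproducible with a fixed seed
print(sample_next_token(logits))                                 # can change on every run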
No_Afternoon_4260@reddit
The model "sees" all the tokens before the one it is generating. "User" and "assistant", which you perceive as new bubbles in the conversation (UI), are just special tokens for the model.
Hope it helps; I'm not sure I got the question right.
segmond@reddit
It's #0..#4. The LLM's responses are part of the conversation; if you take those out, it might just think you're talking to yourself and respond as if you were, instead of as in a chat. The LLM doesn't have memory of its own generations/responses, so you must feed them back.
ShengrenR@reddit
To add to this, keep in mind there's nothing actually 'unique' about the individual messages/responses - they all get wrapped with tags/tokens and formatted into a single chunk before being tokenized and sent to the LLM:
e.g. for 'chatml format':
<|im_start|>system
Provide some context and/or instructions to the model.
<|im_end|>
<|im_start|>user
The user’s message goes here
<|im_end|>
<|im_start|>assistant
First LLM/assistant response
<|im_end|>
<|im_start|>user
The second user message
<|im_end|>
<|im_start|>assistant
Your front-end may have just sent {messages: [{message1}, {message2}, ...]}, but that gets picked up by the API/backend and converted into the format above (or another format; it's just a pattern trained into the model during fine-tuning/instruct-tuning). The LLM is handed that whole back-and-forth as one huge preamble to what it writes next.
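A rough sketch of that conversion, with a hypothetical helper for the ChatML case (a real backend uses whatever template ships with the model):

# Hypothetical helper showing how a messages list becomes the single ChatML string above.
def to_chatml(messages):
    parts = []
    for m in messages:  # each m looks like {"role": "user", "content": "..."}
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # the model continues writing from here
    return "".join(parts)

messages = [
    {"role": "system", "content": "Provide some context and/or instructions to the model."},
    {"role": "user", "content": "The user's message goes here"},
]
print(to_chatml(messages))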
mylittlethrowaway300@reddit
I don't know why, but Llama.cpp and Ollama have both completion endpoints and chat completion endpoints for the same loaded model. If you use the completion endpoint, you can do a basic chat, but it's way more likely to go off the rails, or spit out way more text than is necessary to satisfy the latest prompt. I'm not sure what the inference engine is doing, specifically, but using chat completion seems to make for a better chat experience than using the completion endpoint.
Otherwise, yes, the chat history is fed back into the prompt every time. Some LLMs don't have the concept of a system prompt, so the "system prompt" is added silently before the first prompt.
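For example, with Ollama the two endpoints look roughly like this (the model name is just an example, and this is only a sketch of the request shapes):

import requests

# Raw completion: you send one string, no chat template is applied for you.
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.1",
    "prompt": "The capital of France is",
    "stream": False,
})
print(r.json()["response"])

# Chat completion: you send messages, the server applies the model's chat template.
r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": False,
})
print(r.json()["message"]["content"])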
Evening_Ad6637@reddit
Chat-completion means that a chat template has been applied automatically. Llama.cpp tries to read the corresponding chat format from the gguf file itself.
The completion endpoint, on the other hand, is just raw completion and should be used either with base models (e.g. to complete a story, to complete JSON output, etc.) or with a fine-tuned model where you apply the chat template manually. The latter is useful if you are using a model that has been fine-tuned on different chat templates and you want to investigate which of the templates gives you the best results.
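For example, a sketch of applying a ChatML template by hand and sending the result to llama.cpp's plain /completion endpoint (the right tags depend on what the model was actually fine-tuned with):

import requests

def chatml_prompt(user_msg, system_msg="You are a helpful assistant."):
    # Build the prompt manually instead of letting the server apply the chat template.
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>\n"
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

r = requests.post("http://localhost:8080/completion", json={
    "prompt": chatml_prompt("Why is the sky blue?"),
    "n_predict": 128,
})
print(r.json()["content"])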
mylittlethrowaway300@reddit
That makes sense. Thanks!
tengo_harambe@reddit
There are two things you need to consider:
- AFAIK all LLMs are completely agnostic about this; they just work off of whatever is passed to them by the UI.
- Most chatbot UIs include the entire chat history by default, otherwise it would be a terrible chatbot (unless it's a service for roleplaying with someone with Alzheimer's). This includes the bot's own messages, so it can refer to what it output previously. The caveat to all this is that if the message history exceeds the model's context limit, the earliest messages will not be included (see the toy sketch below).
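As a toy sketch of that trimming (real UIs count tokens with the model's tokenizer; this just counts words):

# Toy sketch: drop the oldest messages once the history exceeds a budget.
def trim_history(messages, max_words=2000):
    kept, total = [], 0
    for m in reversed(messages):          # walk from the newest message backwards
        words = len(m["content"].split())
        if total + words > max_words:
            break                         # everything older than this gets dropped
        kept.append(m)
        total += words
    return list(reversed(kept))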