llama: limit max outputs of `llama_context` by am17an · Pull Request #23861 · ggml-org/llama.cpp

Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 27 comments

Overview

continue #23764, this PR only reserves logits space for n_seqs when possible. With -ub 2048 and MTP, this saves another 1.2GB of VRAM for me. I've tested llama-perplexity also and it seems to work fine. But maybe there is a better API, putting up as a draft for now According to me an API in llama-context is a good solution for this, by default it will reserve all tokens but specifically in server-context we can set it to 1 whenever possible.

- u/am17an