Ollama and LM Studio should support dynamically increasing the context size as it fills up, instead of requiring it be set at load-time

Posted by gigaflops_@reddit | LocalLLaMA | 8 comments

When you load a model in these programs, you have to manually choose your context size or accept the default of 4096. In contrast, the newly released Unsloth Studio does not have this limitation: VRAM/RAM is allocated as needed, so conversations can continue arbitrarily long, until resource utilization or speed becomes unsatisfactory. In my humble opinion, LM Studio and Ollama, which are supposed to be the beginner-friendly "plug-and-play" replacements for cloud providers, should support this basic feature.
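To make the "allocated as needed" idea concrete, here's a minimal sketch of how a runner could grow the KV cache on demand with amortized doubling (like a dynamic array), rather than reserving the full window at load time. The class and method names are hypothetical, not any actual runner's API, and the real GPU buffer reallocation is only indicated by a comment:

```python
# Hypothetical sketch: grow the KV cache as the conversation grows,
# doubling capacity when it fills, instead of pre-allocating the
# maximum context at load time.

class GrowableKVCache:
    def __init__(self, initial_tokens=1024):
        self.capacity = initial_tokens  # tokens of KV storage reserved
        self.used = 0                   # tokens actually in the cache

    def ensure_room(self, n_new_tokens):
        """Double capacity until the new tokens fit, so reallocation
        happens once per doubling, not once per message."""
        needed = self.used + n_new_tokens
        while self.capacity < needed:
            self.capacity *= 2  # a real runner would reallocate/copy the GPU buffer here
        self.used = needed

cache = GrowableKVCache()
for msg_tokens in (500, 800, 3000):
    cache.ensure_room(msg_tokens)
print(cache.used, cache.capacity)  # 4300 tokens used, 8192 reserved
```

The point of the doubling strategy is that reserved memory stays within 2x of what the conversation actually needs, instead of gigabytes sitting idle from the first message.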

Problem #1: the unnecessary burden of choice. When the user loads a model before starting a new conversation, they're forced to guess ahead of time how long the discussion will be. Should I set the context window to 8192 because generation is faster and I'm probably not going to need more than that? Or do I set it to 16384, using up more resources and running slower, in case the model calls several tools or I need to ask more follow-up questions? Forget configuring a default context size that "just works" whenever you need it to. It's frustrating to me that local models are often plenty capable for the task, but major points of friction like this one still render it faster and easier to ask ChatGPT.

Problem #2: performance. I hinted at this earlier, but in these model runners, if you set the context window to 100K and only use 10K of it, generation is usually considerably slower than if you had chosen a 10K context window at load time. My understanding is that this occurs because additional VRAM is allocated for the KV cache, causing more layers to spill into system RAM. This is horribly inefficient because the amount of context needed for a conversation starts at zero and grows slowly with each additional message, yet for the entire conversation, gigabytes of empty KV cache occupy VRAM, forcing more layers to sit in slower system RAM and run on the CPU. It astounds me that the local LLM community goes to such lengths to squeeze a few more tokens/sec out of their hardware, yet every model runner besides Unsloth Studio (correct me if wrong) still requires that your GPU keep gigabytes of VRAM allocated for KV cache that isn't needed yet.
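A back-of-the-envelope calculation shows why that empty reservation hurts. A rough sketch, using illustrative model dimensions (approximately a Llama-3-8B-class model with grouped-query attention and an fp16 cache; check your model's config for the real numbers):

```python
# Estimate VRAM reserved for the KV cache at a given context length.
# Dimensions are illustrative: 32 layers, 8 KV heads, head_dim 128,
# 2 bytes/element (fp16, no cache quantization).

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Bytes reserved for the KV cache; the leading factor of 2
    covers both the K and the V tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (4096, 10_000, 100_000):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1024**3:.1f} GiB")
# 4096 tokens -> 0.5 GiB, 100K tokens -> ~12.2 GiB
```

Under these assumptions, a 100K window reserves roughly ten times the VRAM of a 10K window up front, which on a consumer GPU is exactly the memory that would otherwise keep more layers off the CPU.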

Problem #3: beginner-friendliness. I would love for local LLMs to eventually become a mainstream alternative to cloud models, but that will never happen until it's possible for somebody to use basic chatbot functionality without needing to know what a "token" or "context window" is. It's unnecessarily confusing for beginners when they see a model that "supports 256K context," upload a document that's 10K tokens, and get a gibberish response because they didn't know their model runner silently truncated it to 4096 tokens instead of allocating more VRAM. I would bet that a non-zero number of people have had this happen to them, given up, and left having concluded that local LLMs aren't very good. I find it a crying shame that all my attempts to show local LLMs to my non-technical friends end with them losing interest before I'm done with the 10-minute spiel about how to choose the optimal context window.