Ollama and LM Studio should support dynamically increasing the context size as it fills up, instead of requiring it be set at load-time
Posted by gigaflops_@reddit | LocalLLaMA | 8 comments
When you load a model in these programs, you have to manually choose a context size or accept the default of 4096. In contrast, the newly released Unsloth Studio does not have this limitation: VRAM/RAM is allocated as needed, so a conversation can continue arbitrarily long, until resource utilization or speed becomes unsatisfactory. In my humble opinion, LM Studio and Ollama, which are supposed to be the beginner-friendly "plug-and-play" replacements for cloud providers, should support this basic feature.
Problem #1: the unnecessary burden of choice. When the user loads a model before starting a new conversation, they're forced to guess ahead of time how long the discussion will be. Should I set the context window to 8192 because generation is faster and I'm probably not going to need more than that? Or do I set it to 16384, using up more resources and running slower, in case the model calls several tools or I need to ask more follow-up questions? Forget configuring a default context size that "just works" whenever you need it to. It's frustrating to me that local models are often plenty capable for the task, but major points of friction like this one still render it faster and easier to ask ChatGPT.
Problem #2: performance. I hinted at this earlier, but in these model runners, if you set the context window to 100K and only use 10K of it, generation speed is usually considerably slower than if you had chosen a 10K context window at load time. My understanding is that this occurs because additional VRAM is allocated for the KV cache, causing more layers to spill into system RAM. This is horribly inefficient because the amount of context a conversation needs starts at zero and grows slowly with each additional message, yet for the entire conversation, gigabytes of empty KV cache occupy VRAM, forcing more layers to sit in slower system RAM and run on the CPU. It astounds me how much effort the local LLM community puts into squeezing a few more tokens/sec out of its hardware, yet every model runner besides Unsloth Studio (correct me if wrong) still requires that your GPU keep gigabytes of VRAM allocated for KV cache that isn't needed yet.
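To make the "gigabytes of empty KV cache" claim concrete, here's a back-of-the-envelope calculation. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16 cache) is an assumption for illustration, roughly matching a Llama-3-8B-class model; actual runners may quantize the cache or lay it out differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer,
    per KV head, per cached token, at bytes_per_elem precision."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical 8B-class model: 32 layers, 8 KV heads, head_dim 128, fp16
for n_ctx in (10_000, 100_000):
    gb = kv_cache_bytes(32, 8, 128, n_ctx) / 1e9
    print(f"{n_ctx:>7} tokens -> {gb:.1f} GB of KV cache")
```

Under these assumptions, a 100K window reserves roughly 13 GB of KV cache up front, versus about 1.3 GB for 10K — enough of a difference to push several transformer layers out of VRAM even though most of that cache sits empty for the whole conversation.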
Problem #3: beginner-friendliness. I would love for local LLMs to eventually be a mainstream alternative to cloud models, but that will never happen until it's possible for somebody to use basic chatbot functionality without needing to know what a "token" or "context window" is. It's unnecessarily confusing for beginners when they see a model that "supports 256K context" and upload a document that's 10K tokens, only to get a gibberish response because they didn't know their model runner silently truncated it to 4096 tokens instead of allocating more VRAM. I would bet that a non-zero number of people have had this happen to them, gave up, and left having concluded that local LLMs aren't very good. I find it a crying shame that all my attempts to show local LLMs to my non-technical friends end with them losing interest before I'm done with the 10-minute spiel about how to choose the optimal context window.
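The silent-truncation failure mode described above is easy to avoid at the runner level. This is a minimal sketch of the alternative behavior, not any runner's actual code: surface a message when the prompt doesn't fit rather than dropping tokens behind the user's back.

```python
def prepare_prompt(prompt_tokens, n_ctx):
    """Fit a tokenized prompt into the context window.
    Returns (tokens_to_use, warning_or_None) instead of truncating silently."""
    if len(prompt_tokens) <= n_ctx:
        return prompt_tokens, None
    dropped = len(prompt_tokens) - n_ctx
    kept = prompt_tokens[-n_ctx:]  # keep the most recent tokens
    warning = (f"Prompt is {len(prompt_tokens)} tokens but the context "
               f"window is {n_ctx}; the first {dropped} tokens were dropped.")
    return kept, warning
```

A runner that did this (or, better, grew the window when VRAM allows) would turn the "256K model gives gibberish on a 10K document" scenario into an actionable message instead of a mystery.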
Anxious_Comparison77@reddit
There is a setting in LM Studio: just crank it to unlimited and done. It's a hard setting that is applied to all models you load.
Local LLMs in their current state will never replace frontier models. Local LLMs are effectively shitty toys. Ask a 14B model, or even a 30B model, something remotely complex and half its output will be one big hallucination. There is nothing that can be done, as they are simply too small. The latest Qwen3.5 is still trash on anything that requires an IQ over 85.
gigaflops_@reddit (OP)
On the other hand, I think that a lackluster tool ecosystem, rather than outright bad intelligence, explains a lot of local LLM deficits. I mean, good luck finding a web search or visit-website tool that doesn't require an API key and can reliably scrape text from a website without including 3000 tokens of invisible text, navigation button labels, and advertisements. Tool calls aren't pruned from the context like reasoning tokens are, so while ChatGPT can search 30+ websites in one answer and discard the tokens when finished, models in LM Studio have to retain those tokens. If a model has to search 10+ websites for one answer, the conversation's over because the context is literally completely full. Rant over.
Anxious_Comparison77@reddit
You can force the model's layers onto GPU VRAM and dump the context window onto system RAM with LM Studio. That way you get long context; mine is 250K tokens with 128 GB of system RAM. The work is in the weights and layers, not in feeding it a text file, so slower RAM is fine for the context window. Aside from keeping up to speed with development, I don't use local models. They are all crap, and half of them are broken on Hugging Face. Honestly, I've never used more than 10K tokens in a chat.
Uncensored models are dumb as fuck. Sure, they don't say no, but their guardrails against stupid outputs get wiped out too; a 9B Qwen3.5 uncensored model I tried was 50% hallucinations.
chibop1@reddit
I think this is what happens if you use Ollama MLX runner or mlx-lm directly.
"The context size not set in the server. So it can grow arbitrarily large with the prompt / tokens being generated."
https://github.com/ml-explore/mlx-examples/issues/1170
Ok_Technology_5962@reddit
And just have the thing crash once you're out of space, instead of understanding that's how much context you have and that's all?
gigaflops_@reddit (OP)
That doesn't need to be what happens. Just expand the context size until a predefined resource limit is reached, then notify the user that future messages will result in truncation. In fact, LM Studio, and probably other model runners, already enforce a memory usage limit and refuse to load an oversized context window rather than crashing. Even a conservative default configuration of "don't use more than 85% of the available VRAM" would allow the context to grow far beyond 4096 without reducing performance until the chat actually grew too long for your hardware.
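The growth policy described here is simple enough to sketch. Everything below is hypothetical: the doubling strategy, the 90% fullness trigger, and the 85% VRAM headroom are illustrative parameters, and a real runner would also need to reallocate the KV cache when the window grows.

```python
def grow_context(used_tokens, n_ctx, kv_bytes_per_token,
                 free_vram_bytes, headroom=0.85):
    """Double the context window when it is nearly full, as long as the
    enlarged KV cache fits within a VRAM budget.
    Returns (new_n_ctx, will_truncate)."""
    if used_tokens < n_ctx * 0.9:
        return n_ctx, False        # plenty of room left, do nothing
    proposed = n_ctx * 2
    needed = proposed * kv_bytes_per_token
    if needed <= free_vram_bytes * headroom:
        return proposed, False     # grow the window (reallocate KV cache)
    return n_ctx, True             # at the limit: notify user, then truncate
```

With a policy like this, a chat starts cheap at 4096 tokens and only pays for KV cache as the conversation actually grows, and the user gets a warning exactly once — when the hardware, not a load-time guess, runs out.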
Ok_Technology_5962@reddit
I think beginner-friendly means they have to learn something in order to advance; hence "beginner" friendly. No mother or other non-tech-savvy person will be able to use an LLM, and in fact they wouldn't want to on their own, since they don't have a use case. I don't think there is a problem here. The entry point is lower, but the user who starts off here usually migrates to llama.cpp or custom backends eventually. The point of LM Studio is just to learn what context means, what temperature does, etc. Notice how you don't have complex inference settings on the front end, like mirostat or logit bias, even though LM Studio exposes them via the API.