Openclaw context limit exceeded
Posted by Certain_Pen_1982@reddit | LocalLLaMA | View on Reddit | 13 comments
I’m trying to run GLM 4.7 Flash with llama.cpp on OpenClaw, but I can’t seem to get past an issue where, whenever I ask it any question, it responds that my context limit was exceeded. I’ve tried changing the limit in the JSON and in my commands to run llama-server, but it’s always the same error and I can’t find any documentation. Any help/advice is appreciated.
CoolHeadeGamer@reddit
Getting the same issue even after using /new. Using NVIDIA NIM with GLM 5, GLM 4.7, GPT-OSS, Nemotron 3, etc. Seems to be an OpenClaw-specific issue, since opencode works just fine.
ai_guy_nerd@reddit
The context limit error in OpenClaw usually stems from a mismatch between the llama-server's -c parameter and the configuration in the JSON setup. If the server is launched with a smaller context window than what the orchestrator expects to send, the server will reject the request.
Check the startup logs for the llama-server to see the actual allocated context. If using a laptop with 32GB, try capping the context to 8k or 16k specifically in both the server command and the OpenClaw config to keep it stable. Sometimes the "automatic" settings over-estimate what the VRAM can actually hold, leading to those crashes or limit errors.
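As a rough sketch of aligning the two (the model filename and port here are placeholders, not the OP's actual setup), the server side might look like:

```shell
# Launch llama-server with an explicit 16k context window.
# -c / --ctx-size sets the KV-cache context length the server allocates;
# the model path and port are placeholders.
llama-server -m ./GLM-4.7-Flash-Q4_K_S.gguf -c 16384 --port 8080
```

Whatever value is passed to `-c`, the context size OpenClaw is configured to send must not exceed it; the exact JSON key depends on the OpenClaw version, so check its config schema, and confirm against the `n_ctx` value llama-server reports in its startup log.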
Practical-Collar3063@reddit
It is hard to help you without knowing the specs of your computer; please include them, since the issue could be related to your hardware.
Certain_Pen_1982@reddit (OP)
Sorry about that, I updated the post to include my specs
Practical-Collar3063@reddit
Additionally, what model quant are you running (Q8, BF16, Q4...)? You need to include all the details of your setup in order for people to help you. If you don't know, then paste the exact full name of the model that you use when launching with llama.cpp.
Your PC specs might be too low for the model that you are trying to run. A good way to check is to open Resource Monitor (I think that's what it's called on Windows) and see how much VRAM is being utilised after you've loaded the model with llama.cpp. If both RAM and VRAM are at 80%+ utilisation, then you don't have enough memory for that model.
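If you'd rather check from a terminal than Resource Monitor, `nvidia-smi` reports the same numbers on NVIDIA GPUs (a sketch; run it after the model has loaded):

```shell
# Show VRAM used vs. total per GPU (NVIDIA only)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```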
Certain_Pen_1982@reddit (OP)
I don’t think it’s a problem with the model, as I can comfortably run the Q4k_s version at 20 tok/s. I’ll have a look at the usage tomorrow, but when I open the dashboard, the immediate response is that the context limit was exceeded, without me entering any prompt. This was after I had completely reinstalled OpenClaw.
-dysangel-@reddit
GLM Coding Plan has been doing this for me ever since 5.1 was released. The first day it seemed that I was able to get full long context in Claude Code with no degradation like with 5.0 - but since then it's been hitting limits a lot.
Certain_Pen_1982@reddit (OP)
Have you managed to find a solution? The only thing I found to work was to use Ollama instead, but I can’t use the same model because of my VRAM (8 GB): even Qwen3.5 9B exceeds it, so I have to use a 7B model.
lemondrops9@reddit
Have you tried LM Studio?
Certain_Pen_1982@reddit (OP)
Not yet. I’ve tried llama.cpp with gpt-oss-20b and always found better results with it, but I might try LM Studio tomorrow. How do you use it with OpenClaw? Is it the same as llama.cpp?
lemondrops9@reddit
LM Studio uses llama.cpp under the hood, so you're probably getting different results because of settings. I've used it a bit with Opencode with good success.
Practical-Collar3063@reddit
He is talking about GLM on llama.cpp; I am confused about how this relates to the GLM Coding Plan.
-dysangel-@reddit
Good point, I should have read more carefully. I assumed he was talking about the Coding Plan since I've been having the same problem recently.