Windows freezing up as VRAM fills up - Does this happen for everyone?
Posted by llmenjoyer0954@reddit | LocalLLaMA | 9 comments
Hey everyone,
I run precompiled llama.cpp with CUDA 12.4 on Windows 11 with an RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL) or a medium-sized model with a larger context, I get this weird behaviour:
When the VRAM fills up, Windows 11 starts to freeze: windows become unresponsive and the taskbar turns white. YouTube may stop playing and the whole OS becomes unusable. Mouse movement comes to a halt. (--no-mmap and --mlock don't change that.)
This happens exclusively on Windows. I have a CachyOS dual-boot where I can run a model like Qwen3.6-27B with 60K context. (--fit works best there.)
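For reference, the kind of command I'm launching on both systems (a rough sketch, not my exact line; paths and numbers are approximate, 60K context is roughly 61440 tokens, and on CachyOS I let --fit pick the layer split instead of -ngl):

llama-server -m Qwen3.6-27B-IQ4_NL.gguf -c 61440 -ngl 99 --no-mmap --mlock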
I'm trying to understand: is everybody else struggling with this? Are Windows and models that fill up the VRAM just incompatible? Is it a configuration thing?
I can safely say it's not a hardware issue, because the same software (llama.cpp) with the same models on the same hard drives runs just fine under Linux.
I'd love to get feedback on this. Thanks!
BitGreen1270@reddit
I have the same thing happening on Linux if I keep a conversation going for a long time on a 26B (on my super modest 780M laptop). I'm experimenting with --ctx and quantizing the KV cache to Q8_0, i.e. --ctk and --ctv. Too soon to say, but I'll share if it makes things more stable. I also use -fitt at 2048 to give it some headroom.
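Roughly what I'm testing, in case it helps (a sketch, not my exact line; the model path and context size are placeholders, and on my build quantizing the V cache also needs flash attention enabled):

llama-server -m /path/to/model.gguf -c 8192 -fa -ctk q8_0 -ctv q8_0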
llmenjoyer0954@reddit (OP)
Reducing the context works for me as well, but that can't be the end goal. Agents require a certain context length to work properly, so something like 50K is a given.
Would still be interested in your results!
BitGreen1270@reddit
I can't really tell much from my use cases. But yeah, the context is quite important; I tried ctx 2048 and that was woefully inadequate. Now, when I run llama-cli, I keep a watch on available memory, and if it runs too low I kill it and restart it.
I guess my goal is to have it orchestrated by a Python script interacting with llama-server. Need to see if there is a way to have llama-server reset the context when memory runs low.
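In the meantime, the manual watch-and-restart can be scripted; a rough sketch of what I have in mind (Linux, assumes free/pkill and llama-server on the PATH; the 2048 MiB threshold and model path are placeholders):

while true; do
  avail=$(free -m | awk '/^Mem:/ {print $7}')      # available RAM in MiB
  if [ "$avail" -lt 2048 ]; then
    pkill -f llama-server                          # memory running low: kill the server
    sleep 2
    llama-server -m /path/to/model.gguf -c 8192 &  # restart with a fresh (empty) context
  fi
  sleep 10
done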
car_lower_x@reddit
Make sure your monitors are plugged into the iGPU, not your discrete GPU. That gives the PC some VRAM headroom. That being said, Linux works far better.
Jack5500@reddit
This helped me as well, but you can't use the GPU's native features like G-SYNC then.
car_lower_x@reddit
This is true, but why use or need G-SYNC for something that isn't high refresh rate?
Mart-McUH@reddit
Yes, it can freeze for a short moment but then behaves normally. It also happens when switching between VRAM and RAM. E.g. I use an LLM and diffusion (imagegen), and when they switch (depending on whether I generate text or an image) it can freeze for a short time (a few seconds) too.
Kodix@reddit
That's just due to the fact that your OS requires some VRAM to, well, display graphics. Linux is far less greedy (and more customizable) when it comes to that, which is why it lets you get away with more.
llmenjoyer0954@reddit (OP)
If that were the case, then just offloading a few layers would resolve the issue, but that isn't the case for me.
When I run Qwen3.6-27B-IQ4_NL.gguf with -ngl 40, leaving 15 layers off the GPU and just filling the VRAM with the context (258560 tokens), it runs just fine at 23.2/24 GB VRAM.
But if I run it with --fit, it uses 65 layers with an 89088-token context and Windows starts behaving as described above.
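Concretely, the two runs look roughly like this (paths shortened; the numbers are the ones from above):

llama-server -m Qwen3.6-27B-IQ4_NL.gguf -ngl 40 -c 258560   # 40 layers on GPU, 23.2/24 GB VRAM: runs fine
llama-server -m Qwen3.6-27B-IQ4_NL.gguf --fit               # 65 layers, 89088-token context: Windows freezes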