Windows freezing up as VRAM fills up - Does this happen for everyone?

Posted by llmenjoyer0954@reddit | LocalLLaMA | 9 comments

Hey everyone,

I run llama.cpp (precompiled with CUDA 12.4) on Windows 11 with an RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL), or a medium-sized model with a larger context, I get this weird behaviour:

When the VRAM fills up, Windows 11 starts to freeze: the desktop becomes unresponsive, the taskbar turns white, YouTube may stop playing, mouse movement grinds to a halt, and the whole OS becomes unusable. (--no-mmap and --mlock don't change that.)

This happens exclusively on Windows. I have a CachyOS dual-boot, where I can run a model like Qwen3.6-27B with 60K context without issues. (--fit works best there.)
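For scale, here's a rough back-of-the-envelope sketch of why a ~27B IQ4-class model plus a 60K context comes close to a 4090's 24 GB: the fp16 KV cache grows linearly with context length on top of the quantized weights. The layer/head numbers below are hypothetical placeholders for a 27B-class model, not the actual config of any specific checkpoint:

```python
# Rough VRAM estimate: quantized weights + fp16 KV cache.
# All architecture numbers below are hypothetical placeholders.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """K and V tensors per layer: n_kv_heads * head_dim * n_ctx elements each."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# Hypothetical 27B-class config: 48 layers, 8 KV heads, head dim 128.
kv = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=60_000)

# ~4 bits per weight for an IQ4-class quant of 27B parameters.
weights = 27e9 * 0.5

print(f"KV cache : {kv / 2**30:.1f} GiB")       # ~11.0 GiB at 60K context
print(f"weights  : {weights / 2**30:.1f} GiB")  # ~12.6 GiB
print(f"total    : {(kv + weights) / 2**30:.1f} GiB")
```

With these assumed numbers the total lands around 23.6 GiB before counting the compute buffers and whatever the Windows desktop compositor itself holds, which would leave essentially no headroom on a 24 GB card.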

I'm trying to understand: is everybody else struggling with this? Are Windows and models that fill up the VRAM just incompatible? Is it a configuration thing?

I can safely say it's not a hardware issue, because the same software (llama.cpp) with the same models on the same hard drives runs just fine under Linux.

I'd love to get feedback on this. Thanks!