Weird vram behavior with qwen 3.5 80b q8 vs q6
Posted by Panthau@reddit | LocalLLaMA | View on Reddit | 6 comments
I use LM Studio on Fedora. When I load the Q6 model, nvtop shows 70GB VRAM usage (~4GB system, 65GB model). This stays the same whether I ask it to code or it's idle.
When I load the Q8 model, nvtop shows 85GB VRAM usage, but the moment the model starts working (I use PearAI), it shoots up to over 120GB and crashes.
Settings are the same for both (context length, kv, etc.).
The Q6 behavior suggests it's not using any KV cache? For Q8, I tried K and V cache quantisation (4-bit), which made no difference at all.
My system is a Strix Halo 395+ with 128GB unified memory. Any ideas?
qubridInc@reddit
Q8 is likely pushing your KV cache + activation overhead over the edge: the weights fit, but the runtime working memory doesn't.
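To see how much that runtime overhead can be, here's a back-of-envelope KV-cache estimate using the generic transformer formula (2 bytes per element for fp16, K and V each stored per layer per token). The layer/head numbers below are illustrative placeholders, not the real Qwen config:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # K and V each hold n_kv_heads * head_dim values per layer per token,
    # hence the factor of 2 in front.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 48-layer model with 8 KV heads of dim 128 at 128k context:
gib = kv_cache_bytes(48, 8, 128, 131072) / 2**30
print(f"{gib:.1f} GiB")  # -> 24.0 GiB of fp16 KV cache alone
```

At long contexts this cache plus per-token activations can easily be tens of GB on top of the weights, which is why the Q8 weights fitting at idle doesn't guarantee inference fits.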
Panthau@reddit (OP)
Still, the behavior seems weird. I load and work with even bigger models at the same context just fine... maybe the file is broken or something.
Ell2509@reddit
Do you mean qwen 3 coder?
Panthau@reddit (OP)
Yes, I should've let AI write my question lol
putrasherni@reddit
80b param model ? where ?
Hungry_Elk_3276@reddit
Let me guess: they're using the llama.cpp backend without the `--no-mmap` flag. So the model gets mapped into memory first; the Q6 didn't crash because memory barely fits two models with swap, and the Q8 is just over the limit?
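The mmap effect described above is easy to demonstrate: a memory-mapped file doesn't actually occupy resident memory until its pages are touched, which is why an mmap-loaded model can look cheap in nvtop at idle and then balloon the moment inference starts faulting pages in. A minimal sketch using Python's `mmap` module (the 64MB file stands in for a model weights file):

```python
import mmap
import os
import tempfile

# Create a 64 MiB sparse file standing in for a GGUF weights file.
fd, path = tempfile.mkstemp()
size = 64 * 1024 * 1024
os.ftruncate(fd, size)

with os.fdopen(fd, "rb") as f:
    # Mapping the file is nearly free: no pages are resident yet,
    # so memory monitors barely register it.
    mm = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

    # Reading through the mapping (like the model starting to
    # generate) faults every touched page into memory; only now
    # does real memory usage climb.
    total = sum(mm[i] for i in range(0, size, 4096))
    mm.close()

os.remove(path)
print(total)  # sparse file reads back as zeros -> 0
```

With `--no-mmap`, llama.cpp instead reads the weights into anonymous memory up front, so the full cost shows immediately at load time rather than surprising you mid-inference.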