Misunderstanding memory usage - 11.68 GB quantized model takes up 22 GB of RAM?

Posted by NotARedditUser3@reddit | LocalLLaMA | 17 comments

I'm running unsloth/qwen3.6-35b-a3b IQ2_XSS. It's 11.68 GB on disk, and when I load it in LM Studio, it claims it will use (or is using) about 13 GB of RAM. In Task Manager, my total memory usage goes from 7 GB to 30 GB or more. The individual process shows only ~15.5 GB, but the total really does rise by that much when I load the model, and it goes back down when I eject it in LM Studio. What's up with this?

I've been struggling to load this model for a while now, thinking that quantized versions should need less RAM, but I'm running out. I'm running on CPU and running out of system RAM. I can get ~20 tokens per second, but I have no memory left to keep anything else open, so no other apps on this machine can make use of the model. (This happens with both the MTP and non-MTP versions of this model, btw.)

Am I missing something? I had figured RAM usage would always be roughly the disk size, but this is quite a bit off.

nickless07@reddit

Turn off "Keep model in memory". The model will still take a good amount of RAM, as you figured from the file size, but LM Studio won't hold an extra copy of it for quick reload/swap.

NotARedditUser3@reddit (OP)

This sounded promising, but I just tried it and I still see the same behavior.

nickless07@reddit

Your operating system automatically maps the GGUF model file into virtual memory, and it treats this file cache as expendable RAM. When you combine mmap with **Keep Model in Memory**, LM Studio commands the backend to lock those specific pages into the active runtime process so they cannot be paged out.
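
To make the mmap behavior described above concrete, here is a minimal Python sketch using the standard-library `mmap` module. It only illustrates the OS mechanism; it is not LM Studio's or llama.cpp's actual code, and the helper names (`map_file_readonly`, `touch_pages`, `demo`) and the 1 MiB temp file are made up for the example. Creating the mapping is cheap; pages are pulled into RAM only as they are read, and the OS may account for them as file cache rather than as the process's private memory.

```python
import mmap
import os
import tempfile


def map_file_readonly(path):
    """Memory-map a file read-only; return (file_object, mapping)."""
    f = open(path, "rb")
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mapped


def touch_pages(mapped, page_size=mmap.PAGESIZE):
    """Read one byte per page so the OS has to fault every page in."""
    touched = 0
    for offset in range(0, len(mapped), page_size):
        _ = mapped[offset]  # indexing a mapping returns an int in Python 3
        touched += 1
    return touched


def demo(size_bytes=1024 * 1024):
    """Write a throwaway file, map it, touch every page, then clean up."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x01" * size_bytes)
        path = tmp.name
    try:
        f, mapped = map_file_readonly(path)
        try:
            mapped_len = len(mapped)
            pages = touch_pages(mapped)
        finally:
            mapped.close()
            f.close()
    finally:
        os.remove(path)
    return mapped_len, pages
```

Calling `demo()` returns the mapped length and the number of pages touched. With a multi-gigabyte model file the same pattern applies, which is one reason a per-process figure and the system-wide figure can disagree.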

NotARedditUser3@reddit (OP)

I just figured it out... if I turn off GPU offload, this issue goes away. I think it may be putting the model into both system RAM and GPU memory, even if I tell it not to. I'm not 100% sure, but that would make the most sense from what I'm seeing. I just posted a screenshot of the settings I'm loading the model with.

nickless07@reddit

An integrated GPU (iGPU) uses shared system memory instead of having its own dedicated VRAM. The amount can be dynamically allocated by the system.

Wrong_Mushroom_7350@reddit

Your real memory requirement is **Disk Size + KV Cache + Overhead + OS Idle**.

* **The Model Weights:** The static file size is your absolute baseline before it even processes a word.
* **The KV Cache:** The model needs a temporary workspace to remember your conversation history. Depending on the context size you set (e.g., 8k or 32k tokens), this can instantly demand another **4 GB to 10 GB+**.
* **Compute:** The engine under the hood needs scratch buffers to do the actual math, eating another **1 to 2 GB**.
* **The iGPU:** Because you are on a Ryzen with an integrated GPU, you don't have dedicated VRAM. Your iGPU *steals* from your normal system RAM. Turning on "GPU Offload" forces your CPU and fake GPU to fight over the exact same physical memory sticks.
* **Windows mmap:** Modern engines memory-map (`mmap`) the file. Windows Task Manager is notoriously terrible at reporting this, making it look like a massive leak when the OS is actually just heavily reserving it in the standby cache.
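
The sum in this comment can be worked through as plain arithmetic. The sketch below plugs in the numbers quoted in the thread: the 11.68 GB file and roughly 7 GB in use before loading come from the original post, while the KV cache and compute ranges are the commenter's rough figures, not measurements. The function name `ram_budget_gb` is made up for the example.

```python
def ram_budget_gb(weights_gb, kv_cache_gb, compute_gb, os_idle_gb):
    """Sum the four terms of the estimate above (all values in GB)."""
    return weights_gb + kv_cache_gb + compute_gb + os_idle_gb


# Low end: smallest KV cache and compute figures quoted in the comment.
low = ram_budget_gb(weights_gb=11.68, kv_cache_gb=4.0, compute_gb=1.0, os_idle_gb=7.0)

# High end: largest figures quoted in the comment.
high = ram_budget_gb(weights_gb=11.68, kv_cache_gb=10.0, compute_gb=2.0, os_idle_gb=7.0)

print(f"estimated total: {low:.2f} to {high:.2f} GB")
```

The spread between the two results comes almost entirely from the KV cache term, so the estimate is only as good as the KV cache figure you feed it.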

NotARedditUser3@reddit (OP)

I think the mmap is the bit that was throwing me for a loop. Thanks, this was a helpful explanation.

Badger-Purple@reddit

Your context and KV cache are the other 15 GB.

NotARedditUser3@reddit (OP)

This is happening at 4096 context, even with K and V both quantized.
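
A rough way to sanity-check whether a 4096-token KV cache could account for many gigabytes is the standard formula for a plain attention stack: 2 (one K, one V) × layers × KV heads × head dimension × context length × bytes per element. The sketch below is an estimate under stated assumptions: the layer count, KV head count, and head dimension are HYPOTHETICAL placeholders, not the real configuration of the model in this thread, and the helper name `kv_cache_bytes` is made up. The per-element sizes for `q8_0` and `q4_0` follow ggml's 32-element block layouts (34 and 18 bytes per block).

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Approximate size of the K and V caches for a standard attention stack."""
    per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V
    return per_token * context_len * BYTES_PER_ELEM[cache_type]


# HYPOTHETICAL architecture values, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128

for cache_type in BYTES_PER_ELEM:
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, 4096, cache_type)
    print(f"{cache_type}: {size / 2**20:.0f} MiB")
```

Under these placeholder values a 4096-token cache comes to a few hundred MiB at most. The real model's numbers will differ, but the result scales linearly in each factor, so a short context is unlikely to explain a double-digit-gigabyte gap on its own.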

Badger-Purple@reddit

Yeah, you solved it… you don't have a dedicated GPU. List specs for better help :)

NotARedditUser3@reddit (OP)

Ohh... I didn't realize this was a thing with mmap. Thank you for this. I think that's the missing piece I wasn't aware of.

Snoo_81913@reddit

Just so I have this clear: what hardware are you running this on? How much RAM do you have, and is there no dedicated GPU? Is there a reason you're using LM Studio rather than llama.cpp or, if you're on a Mac, an MLX-specific server/model?

NotARedditUser3@reddit (OP)

Using LM Studio because I like GUIs 😄 I just updated the post with a potential root cause: this only happens when I have GPU offload enabled. I'm on a Ryzen 9 7940HS with integrated graphics. When I tell it to offload to the GPU, it's almost as if it keeps one copy of the model in "system RAM" and pushes a separate copy to "VRAM", even though on my system, with this integrated GPU, those are the same shared memory.
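
The two-copies hypothesis in this comment can be checked against the numbers in the thread with simple arithmetic. This is only a plausibility check, not proof of what LM Studio actually does; `expected_jump_gb` is a made-up helper, and the observed range is the 22-23 GB jump the OP reports.

```python
def expected_jump_gb(weights_gb, copies, other_gb=0.0):
    """Memory increase if `copies` full copies of the weights are resident."""
    return weights_gb * copies + other_gb


WEIGHTS_GB = 11.68            # file size reported in the post
OBSERVED_JUMP_GB = (22, 23)   # increase the OP reports on model load

one_copy = expected_jump_gb(WEIGHTS_GB, copies=1)
two_copies = expected_jump_gb(WEIGHTS_GB, copies=2)

print(f"one copy:   {one_copy:.2f} GB")
print(f"two copies: {two_copies:.2f} GB")
print(f"observed:   {OBSERVED_JUMP_GB[0]}-{OBSERVED_JUMP_GB[1]} GB")
```

Two resident copies land much closer to the reported jump than one copy does, which is consistent with the hypothesis but does not rule out other explanations.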

Massive-Question-550@reddit

Some software, like LM Studio, stores another full copy of the model in your system RAM for fast loading and switching between models. It's a bad feature to have enabled by default: it uses up a lot of RAM, and if you aren't aware of it, you won't be using it anyway.

NotARedditUser3@reddit (OP)

WHAT?! HOLY MOLY.... I will look for this... thank you.

Happy_Brilliant7827@reddit

LM Studio itself uses RAM. The context can also use as much RAM as the model. What context length are you setting? A model that takes up 13 GB of storage can only load in 13 GB of RAM with no context.

NotARedditUser3@reddit (OP)

This happens even with a 4096 context length. No Docker running. Let's say LM Studio is already running, right? If I look at the memory usage with LM Studio open, and then load this one model without changing anything else, my memory usage jumps by 22-23 GB.