I want to run qwen3.5 27B q4_k_m on CPU, and I need help.
Posted by Personal_Storage_876@reddit | LocalLLaMA | 17 comments
I am a local LLM beginner and I found this subreddit while looking for help. (Please understand that I am unfamiliar with Reddit.)
(System: i5 4440 @ 1.8 GHz / B85M-DS3H / 32 GB DDR3 / 128 GB SSD / Ubuntu 25.10 Questing)
I loaded Qwen3.5 27B Q4_K_M into a llama.cpp built for CPU with the options shown in the photo, and the remaining memory was less than 1 GB.
However, when I loaded it with a llama.cpp built for Vulkan using -ngl 0 on an RX570 8GB, the remaining memory was 8 GB. (VRAM usage was about 1.8 GB.)
When I loaded Qwen3.5 27B IQ4_XS on the CPU, the remaining memory was 10 GB. I am currently using IQ4_XS and have no complaints about its quality so far, but I am curious why this happens with Q4_K_M.
vasimv@reddit
As I understand it, in a memory-limited environment you want: "-np 1" (only one parallel request, which limits KV cache usage), "-b 512 -ub 256" (or even smaller; these are the buffers for prompt processing), "-ctk q8_0 -ctv q8_0" (uses 8 bits for the KV cache instead of 16, halving its memory), and "--cache-ram 0" (disables prompt caching in RAM).
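Putting those flags together, a minimal launch sketch (the model path is hypothetical, and flag spellings drift between llama.cpp versions, so verify each one against `llama-server --help` on your build):

```shell
# -np 1           one parallel slot, so only one KV cache is allocated
# -b 512 -ub 256  smaller batch / micro-batch buffers for prompt processing
# -ctk/-ctv q8_0  quantize the KV cache to 8 bits (half the memory of f16)
# --cache-ram 0   disable the in-RAM prompt cache
llama-server -m ./Qwen3.5-27B-IQ4_XS.gguf \
  -np 1 -b 512 -ub 256 -ctk q8_0 -ctv q8_0 --cache-ram 0
```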
EsotericTechnique@reddit
Go for the 35B one; dense models on CPU are harsh.
Healthy-Nebula-3603@reddit
If you want more free RAM, use the --no-mmap option; then what sits in VRAM will not also be mapped into RAM.
ag789@reddit
I used to worry about mmap, but noted that if a page is mapped read-only, the kernel can discard it when memory is tight. It then works like swap and is a useful thing, e.g. it may let you run models that are bigger than your physical memory.
Personal_Storage_876@reddit (OP)
Thank you. To be honest, my budget is very limited, so I think I will ultimately end up building it with just the i5 4440 and DDR3 32GB.
ag789@reddit
I don't have a working GPU, so I can't tell you much about that, but I'm running on a plain old Haswell i7 4790, also with 32 GB of memory.
I have run big models, e.g. I actually ran Qwen 3.5 35B A3B Q4_K_M, and it 'used up all the memory'.
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
But I noted that some (a big fraction) of that memory is disk cache; Linux tends to do that.
The model runs nevertheless; I get a few tokens per second, like 4-5 t/s, with AVX2 and all 4 cores of the i7 engaged.
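The disk-cache point above can be checked directly: Linux counts reclaimable page cache as "used", so MemAvailable, not MemFree, is the number that matters. A quick Linux-only sketch:

```shell
# MemFree understates headroom; MemAvailable includes reclaimable page cache,
# which is why an mmap'd model can "use up all the memory" yet still run.
grep -E '^(MemTotal|MemFree|MemAvailable|Cached):' /proc/meminfo
```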
To get a lighter-weight model, try one of those 'REAP' ones, e.g.
https://huggingface.co/mradermacher/Qwen-3.5-28B-A3B-REAP-GGUF
I get about 7-8 tokens/sec running this stripped-down Qwen 3.5 28B REAP.
It 'looks the same' as the 35B model for 'easy' prompts/tasks, but you may run up against its limits on 'difficult' stuff, e.g. refactoring code. In one case it looped, burning more than 12,000 tokens over more than an hour of CPU processing at full blast, maximum temperature throughout. It never exited the 'thinking' loop and did not return a response!
What I did then was switch back to the 'slower' 35B model, and Qwen 3.5 broke through the 'thinking' on a 'difficult' code refactoring and 'fixed everything' in a small script, testifying to its capabilities.
https://www.reddit.com/r/LocalLLaMA/comments/1sjprna/qwen_35_28b_a3b_reap_for_coding_initial/
Personal_Storage_876@reddit (OP)
Thank you for the valuable information. Currently I am running Qwen3.5 27B IQ4_XS with llama-server connected to open-webui, drawing 45 W at 1.2 t/s. It took a few days since it was my first time, but it was a very rewarding achievement. The 35B-A3B and REAP versions are also very interesting, so I will look into them.
I came here because I was curious why llama.cpp uses 8 GB+ more memory when built for CPU only, compared to loading with -ngl 0 on a dGPU. The symptom occurs with Q4_K_M but not with IQ4_XS.
ag789@reddit
Accordingly, llama.cpp is able to use both the GPU and the CPU; you may want to review the docs about it.
Smaller models do have limits. My guess is the 35B A3B and the 27B dense are 'similar' models in terms of performance, but smaller models, e.g. 9B etc., may 'perform poorer' for certain tasks, especially the 'difficult' ones.
For generating code, my guess is it's more a 'memory' problem, hence even small models like 9B can 'generate code' as long as they 'know it' (it is more a matter of recall). But for things like refactoring code where there is, say, a bug you want the model to try to fix, I think it takes bigger models to have the capacity.
ag789@reddit
I think the 27B models are 'dense' (i.e. fully connected), unlike those like the 35B A3B (mixture of experts). The 27B model got me just under 1 token/sec with AVX2 on all 4 cores at maximum temperature.
For the 27B model, my guess is that it may help if you can run it on your GPU. 'Dense' models are very compute intensive; I'd guess they are either O(n^2), O(n^3) (matrix multiplication / BLAS) or higher.
btw 27B^3 ≈ 1.9683e31 operations, that is 19,683,000,000,000,000,000,000,000,000,000 operations.
To use the 27B model, try to work it on the GPU.
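If the RX570 stays in the box anyway, a middle ground is partial offload rather than -ngl 0. A sketch, where the model path and the layer count are guesses to tune against the 8 GB of VRAM:

```shell
# Vulkan build: push some transformer layers to the RX570, keep the rest on CPU.
# -ngl 20 is only a starting guess; raise it until VRAM is nearly full.
llama-server -m ./Qwen3.5-27B-IQ4_XS.gguf -ngl 20 -ctk q8_0 -ctv q8_0
```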
Potential-Gold5298@reddit
llama.cpp likes to create a repacked copy of the weights for Q4 quants on CPU. Check the model-loading log; it looks like this:
load_tensors: CPU_Mapped model buffer size = 16529.63 MiB
load_tensors: CPU_REPACK model buffer size = 11694.38 MiB
CPU_Mapped is the model itself (the gguf file mapped into memory), and CPU_REPACK is something like an unpacked copy laid out to speed up inference. If you see the same, then this is it. I didn't notice any speed gain from CPU_REPACK, but I mostly use Q5 quants, so this issue doesn't bother me.
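Those two buffers add up to roughly the whole 32 GB box, which matches the "less than 1 GB remaining" in the post. A sketch that reproduces the arithmetic from the quoted log lines (the log file path is made up for the demo):

```shell
# The two buffer lines quoted above, written to a scratch file for the demo.
cat > /tmp/llama-load.log <<'EOF'
load_tensors: CPU_Mapped model buffer size = 16529.63 MiB
load_tensors: CPU_REPACK model buffer size = 11694.38 MiB
EOF
# Sum the MiB column: mapped gguf + repacked copy.
awk '{s += $(NF-1)} END {printf "total %.2f MiB = %.1f GiB\n", s, s/1024}' /tmp/llama-load.log
# total 28224.01 MiB = 27.6 GiB -> on 32 GB, under 1 GiB left once KV cache etc. is added
```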
Personal_Storage_876@reddit (OP)
Thank you so much. I'm glad I mustered the courage to come here.
jacek2023@reddit
Run llama-server and observe the logs; memory usage is listed there in detail.
Personal_Storage_876@reddit (OP)
Understood! I will give it a try.
Healthy-Nebula-3603@reddit
That's why
Personal_Storage_876@reddit (OP)
I also want to use Q4_K_M or Q4_K_XL. As I described, if I load it with -ngl 0 after installing a dGPU, I can use 27B Q4_K_M without swapping, with 8 GB of headroom available. However, keeping a dGPU installed just because "it works well without swapping" means sacrificing 20 W of power consumption. I came here because I was curious why this discrepancy occurs between loading Q4_K_M with -ngl 0 and simply loading it with a CPU-only build of llama.cpp.
tmvr@reddit
That must be brutally slow on CPU only, and even with the GPU. Try Qwen3.5 35B A3B and see if it does what you need; it will be much faster with the DDR3 RAM you have. If you are on CPU only, then try to stick to the Q_K quants; IQ is slower on CPU afaik. Also, with llama.cpp use the --fit switch when you also want to use the GPU; it will help with the 35B A3B MoE model.
Personal_Storage_876@reddit (OP)
Thank you for the information. I built it to run 24/7 at 1 t/s.