Llama.cpp's auto fit works much better than I expected
Posted by a9udn9u@reddit | LocalLLaMA | View on Reddit | 65 comments
I always thought that with 32GB of VRAM, the biggest models I could run were around 20GB, like Qwen3.5 27B at Q4 or Q6. I was under the impression that everything had to fit in VRAM or I'd get 2 t/s.
Man was I wrong. I just tested Qwen3.6 Q8 with 256k context on llama.cpp, with `--fit` on, the weights alone are bigger than my VRAM, and my 5090 is hooked up via Oculink, but I’m still getting 57 t/s! This is literally magic. If you’ve been stuck in the same boat as me thinking it’s all VRAM or nothing, you should try this now!
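For reference, the kind of command I mean looks roughly like this (the model path is just a placeholder for whatever GGUF you load; `--fit-target` is the VRAM headroom in MiB and defaults to 1024):

```
# Rough sketch, not my exact command; the model path is a placeholder.
# --fit auto-places weights and KV cache across VRAM and RAM;
# --fit-target sets how much VRAM headroom (in MiB) to leave free.
llama-server -m ./Qwen3.6-35B-A3B-Q8_0.gguf -c 262144 --fit --fit-target 1024
```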
pmttyji@reddit
You could squeeze even more with `--fit-target` by giving it a low value like 512 (the default is 1024, i.e. 1GB). Also, KV cache Q8 is great now (no need for F16 anymore after a recent change).
klim2009@reddit
Setting `-fitt` lower than 2048 makes it error out when trying to analyze pictures.
SosirisTseng@reddit
The image projector takes about 1GB of VRAM. I usually use `--no-mmproj-offload` (loading the image projector to RAM instead of VRAM) to squeeze out one more GB.
klim2009@reddit
Interesting! I'll try that. Thanks!
ea_man@reddit
That's for rich people, on Linux I use `--fit-target 128` or 64.
DrAlexander@reddit
I'm assuming for a headless run, right? Or does the desktop run on only 64 MB vram?
ea_man@reddit
That value is the headroom left, calculated after whatever is in VRAM by the time llama launches.
DrAlexander@reddit
Ok. So, assuming I'm not doing anything else, it could be dropped a bit. On Windows surely not so drastically, but I'll have to try things out.
Thanks!
ea_man@reddit
You should not start anything else that goes into VRAM afterwards; it's fine if your stuff is already running when llama launches.
That said, the less shit you keep in VRAM, the better...
the__storm@reddit
As ea_man says, the target is what it tries to leave free including anything else on your system already using VRAM. However, yes, if you set the margin that low you'll want to run headless - even something as simple as alt-tabbing can cause huge slowdowns otherwise.
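If you do go that route, it's just something like this (model path is a placeholder):

```
# Headless box: leave only ~128 MiB of VRAM headroom (model path is a placeholder).
llama-server -m ./model.gguf --fit-target 128
```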
ea_man@reddit
Well, you could disable all kinds of 3D stuff in the DE or use integrated graphics for that.
Me, I'm using a fit target of 60-100 if I'm running LXQt/KDE (no compositor).
ea_man@reddit
No that does not have to do with the desktop usage.
ParaboloidalCrest@reddit
Will that cannibalize the memory necessary for the compute buffer, which I assume is not accounted for by `--fit`? I mean, is there a `-fitt` value so low that one might get an OOM?
ea_man@reddit
If you are headless you should use all your VRAM for compute.
Also, it's not like the machine explodes if you go 10MB over VRAM; it will just be slower.
YourNightmar31@reddit
Wait what? What is fit target and why is it different from "fit on"?
puncia@reddit
It's not different. Fit (`--fit`) attempts to fit everything into VRAM, leaving some headroom (1024 MiB by default). Fit target (`-fitt`) is just an override for that headroom.
DunderSunder@reddit
So I'm on a laptop and the Nvidia GPU is not normally used (the iGPU handles apps and display). Would that mean I can set `-fitt` as small as possible, or is there some stuff that has to be taken into account?
puncia@reddit
Yes, you could even theoretically set it to 0 if you are sure nothing else is using VRAM. Even if you were to open a game (which would want to go to your powerful GPU), let's say, nothing would happen besides the game running at like 10 fps, because by default Nvidia drivers let the excess VRAM spill into system RAM (sysmem fallback).
Also, `-fitt` supports a comma-separated list of values for multi-GPU. For example `-fitt 256,512` would leave 256 and 512 MiB of headroom on your GPU 0 and GPU 1 respectively. You'd have to watch llama's output to see where it's actually allocating in your case though, because I'm not sure myself how it behaves with a setup of that kind.
a9udn9u@reddit (OP)
It sets a free memory target, so 2048 means leave 2GB free VRAM for other things.
GregoryfromtheHood@reddit
Wait is this true for Qwen 3.5 and 3.6? I thought even F16 sucks on those and bf16 is the best for them and Q8 is basically garbage.
ParaboloidalCrest@reddit
While it has improved, there are still some edge-cases. For example, GLM4.5-Air and GLM4.6V both perform a lot worse on longer context (> 64k) with q8 cache, certainly worse than f16.
a9udn9u@reddit (OP)
Good tips! I set the fit target to 2G and the KV cache is always set to Q8.
ghostopera@reddit
If you use quantization for the KV cache (say, Q8_0) you might be able to fit everything into VRAM, including 256k context, and get double or more the token speed you're currently getting.
For example, I'm fitting Qwen 3.6 35B Q3_K_M with 256k context on my 24gb 7900 xtx and am getting about 84 tok/s.
On your 32gb you should be able to do the same thing, but fitting a higher model quantization than I'm using :).
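The relevant flags look roughly like this (the model path is a placeholder, and quantizing the V cache generally needs flash attention enabled):

```
# Q8_0 KV cache plus auto-fit; the model path is a placeholder.
# Quantizing the V cache generally requires flash attention to be enabled.
llama-server -m ./Qwen3.6-35B-A3B-Q3_K_M.gguf -c 262144 -ctk q8_0 -ctv q8_0 --fit
```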
Mattthhdp@reddit
Can you share your config? I have the same GPU and barely get 20 tok/s.
ghostopera@reddit
Are you sure that's not running off the CPU? If you are using, say, the ROCm llama-server with a mismatched ROCm library, it will fall back to CPU. I've been using the Vulkan version for this myself.
My models.ini:
Been meaning to test with `chat-template-kwargs = {"preserve_thinking": true}` as well but haven't gotten around to it just yet.
If you are using something like LM Studio you can still do pretty much all the same settings.
Mattthhdp@reddit
I will have to take a closer look; with your setup I'm at 25 tok/s... Maybe it's because I'm on Windows?
ghostopera@reddit
So weird. It could certainly be a Windows vs Linux thing! Unfortunately I don't have an installation of Windows to test from. (Stopped using Windows entirely at the start of last year)
If it helps, this is my llama-server command:
(llama.cpp has two copies of llama; in this case I'm using the Vulkan version).
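Something along these lines (model path and port are placeholders, the rest are the flags discussed above):

```
# Representative invocation of the Vulkan build; model path and port are placeholders.
llama-server -m ./Qwen3.6-35B-A3B-Q3_K_M.gguf -c 262144 -ngl 99 -ctk q8_0 -ctv q8_0 --fit --port 8080
```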
Two thoughts come to mind:
1. It's loading onto your CPU
2. It's loading onto your integrated graphics instead of your GPU
You can check out what Vulkan devices are available (or ROCm if you are using that) with:
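For example (assuming a reasonably recent build):

```
# Ask llama.cpp which devices it can see (Vulkan/ROCm/CUDA, depending on the build)
llama-server --list-devices

# Or use the Vulkan SDK's own tool
vulkaninfo --summary
```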
You should see your GPU listed here. If you do see your GPU here but llama still seems to be running on your CPU, you can tell it which device to use. I don't have any integrated graphics, so it only outputs my dedicated GPU.
But you could use --dev to force it.
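For example (the device name comes from the listing above):

```
# Force a specific device; the exact name (e.g. Vulkan0) comes from the device listing.
llama-server -m ./model.gguf --device Vulkan0
```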
You should check the fit messaging from the command. For example:
In this case, it fit everything I asked for into vram including all layers and the full context the model supports.
If it's using your CPU for the whole thing for some reason, you will notice it reporting your system RAM in the free device memory numbers.
Though with the params I am using it should shorten the context before dropping slices.
Should also check the output around load_tensors:
In this case you can see that it stuck all layers on to the GPU.
Also worth noting, you could also try LM Studio. You will have to fiddle around to get all the same settings, but I've found it to work just fine as well.
Mattthhdp@reddit
Here is my latest config; I got 74 tok/s.
ghostopera@reddit
Yeah maybe? I just upgraded everything but my video card and am now getting 130 tok/s.
When you start llama-server, it tells you how well it fits onto the vram. Does it say if it isn't fitting well? There should be a message about it. It's possible more vram is being used on your system by the OS and everything else, so the model is sitting more in cpu/ram than on the card.
It could also be the rest of your hardware holding things back. For example, do you have REBAR on?
Side note, I've recently moved back to 128k of context. Less important with the fit configuration since it will scale that down as needed, but it was useful for model stability.
Mattthhdp@reddit
I will take a look tonight; pretty sure ReBAR is on, but it doesn't hurt to confirm. I did a few tweaks like trying ROCm and I'm stuck at 38 tok/s XD
ghostopera@reddit
Are you using the Vulkan build of llama-server? You can also make sure it's using the GPU. Possibly it's not finding your video card?
Mattthhdp@reddit
Well, the VRAM usage jumps to 22GB and the compute to 100%, so it's definitely using the GPU ^^
a9udn9u@reddit (OP)
Q4 used to be my go-to quantization but I was never impressed by the response quality, so I wanted to try Q8 this time. 57 t/s is on par with the speed I got from the 27B dense model; it's good enough for me.
Digger412@reddit
Ghostopera meant KV cache quantization, which is separate from the model quantization itself. Your KV cache is in FP16 regardless of what the model quantization level is, unless you choose to quant the KV cache.
It costs some accuracy to do so, but a Q8 KV cache is a fairly small drop and lets you use twice the context length in the same memory.
a9udn9u@reddit (OP)
Oh, I misread his message; my KV is at Q8. But how could KV quantization help fit 35GB of weights in 32GB of VRAM?
Digger412@reddit
If you're using `--fit`, and if you don't increase the context size, then your KV takes up half the size it did previously and `--fit` will load more weights from RAM to VRAM to fill it up. So you'd get more by proxy.
antwon_dev@reddit
True. If OP is already getting 256k context at FP16, though, then maybe consider not quantizing the KV cache. 256k is a lot of usable context, and quality can certainly diminish when messing with attention.
But if you try out 512k @ Q8, let us know how it affects output quality/speed and memory usage!
waruby@reddit
See the `-ctk` and `-ctv` command line options. If you compile the Rotorquant fork of llama.cpp you can do `-ctk planar3 -ctv turbo3`, which gives 10.3x compression of the KV cache for negligible loss in quality.
Old-Sherbert-4495@reddit
Are you running on a 5090? Because I'm getting 40 t/s on a 4060 Ti (16GB VRAM) with 32GB system RAM and a 20-core CPU, so I think you can squeeze more out of it. I tweaked it manually using `-ngl` and the CPU MoE count, with Q8 KV, instead of fit.
One thing to note is that fit does offload to CPU, but it works very differently with dense (27B) and MoE (only 3B active). You have to fit a dense model fully to get the best performance, offloading hurts a lot, but for MoE offloading helps.
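For reference, the manual route looks something like this (model path and the CPU-MoE layer count are placeholders you tune per card):

```
# Manual offload instead of fit: all layers on GPU, but keep the MoE expert
# weights of the first N layers on the CPU, with a Q8_0 KV cache.
# The model path and the --n-cpu-moe count are placeholders to tune per GPU.
llama-server -m ./Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 16 -ctk q8_0 -ctv q8_0
```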
relmny@reddit
Have a look at this thread, you might find better options:
https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_5070_ti_9800x3d_running_qwen3635ba3b_at_79_ts/
btw, with a slower 32GB GPU + 128gb RAM + ssd, I can run qwen3.5-397b q4kl at 5 t/s or 3.6-35b ud-q6-kl at 114 t/s (can even run kimi.k2.6 smol-iq2-ks but at only 2.18 t/s)
It's all about offloading to CPU
Worried-Squirrel2023@reddit
the 35B-A3B MoE part is the lever. only 3B params active per token means even spilling weights to system ram doesn't kill throughput like it would on a dense 35B. would be a different story on a dense model of the same total size.
fallingdowndizzyvr@reddit
I didn't find it to work that well. When I run models that will barely fit spread across multiple devices, GPUs or machines, I find that many times it fails to find a split that will run. It'll just OOM a device. But if I split the model by hand, I can squeeze it in and have it run.
see_spot_ruminate@reddit
Really? I have a quad 5060ti setup and it works great. If I don't mess/tweak with --fit-target, then I never have an issue. What is your command and what driver are you on?
StardockEngineer@reddit
How many GPUs? It was working great with my 5090 and A6000 combo. Only early vision-capable models were a problem when it first launched.
tomt610@reddit
It isn't magic; in more complex multi-GPU/CPU scenarios it leaves a lot of performance/context on the table unused. It may be good on simple systems but it has a long way to go.
No_Mango7658@reddit
Q4_K_M with 256k fits like a glove. There is 1 layer (idk how to actually check) that spills into RAM. I get about 165 t/s at 128k context with everything in VRAM, and I get 145 t/s with 256k context and very slight spillover into RAM. This is on my gaming desktop in LM Studio.
StardockEngineer@reddit
You should be getting 190-220 tok/s. BTW, `--fit` is on by default. Just don't specify context.
ikkiho@reddit
The reason this works so well is MoE + Oculink only has to shuttle the active experts per token (~3B active for Qwen 3.6 35B), not the full weights. A dense model of the same size would top out at maybe 5-10 t/s with that VRAM deficit. Also worth stacking KV cache Q8 on top of --fit; that single change usually matters more than which experts land on CPU.
a9udn9u@reddit (OP)
True, I'd never use a dense model with CPU offloading
RoomyRoots@reddit
I wish I could say the same, I need to specify the fit flags manually or I get OOM. It can be ROCm fucking me over but in general it runs well.
_bones__@reddit
Wow, this got Qwen 3.6 35B UD Q3 K XL to run at 48t/s for me, where before I got 12. Pretty damn good!
danigoncalves@reddit
Did you offload any layers to the CPU? Currently I have the same VRAM but I have to offload 35 layers if I want reasonable speed.
a9udn9u@reddit (OP)
Awesome! What was your setup before? Manual offloading or all CPU?
_bones__@reddit
GPU as far as I could: `-ngl 99 --cpu-moe`. Basically copy-pasted settings. GPU memory utilization is now quite high, but it was quite low before, so I think it did way too much on the CPU.
The 45 t/s is on a 12GB RTX 3080, so I'm pretty happy. Now it's worth trying to see what a Q3 can do as a coding agent...
draetheus@reddit
This works well because the 35B model is an MoE architecture with only 3B active params. You'll have a much worse time with a dense model like the 27B.
GregoryfromtheHood@reddit
Wow, you're right! There's some magic here. I never used it because it used to do weird stuff and just cause OOMs doing weird splits when I could easily fit the model by playing with tensor split numbers myself, so I've been spending hours on every model finding the exact right tensor split and context to perfectly fit the GPUs as best I can. I had Qwen 3.6 up and running with 650k context and it was just barely fitting into my GPUs with a few hundred MB of headroom after I got the tensor splits right.
I just tried fit and fit target and somehow it now fits with like 5GB free on one GPU and another few GB free on the others, running at the same speeds. The heck? Where'd it pull all this extra headroom from? The GPUs were all entirely full when I did my own tensor split.
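For anyone who hasn't done the manual dance, the before/after is roughly this (model path, split ratios and headroom values are placeholders):

```
# Old way: hand-tuned tensor split across GPUs (ratios are placeholders you iterate on)
llama-server -m ./model.gguf -ngl 99 --tensor-split 60,40 -c 650000

# New way: let fit place things, just give each GPU its headroom in MiB
llama-server -m ./model.gguf -c 650000 --fit -fitt 256,512
```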
ANTONBORODA@reddit
Does fit actually take checkpoints into account? Because I recently found out that the crazy slowdowns I encounter during prolonged usage are because of context checkpoints that are also being saved to memory and they can actually grow huge.
OddDesigner9784@reddit
Running qwen 3.6 35b 2 bit quant on my 16 gb AMD card we ball
No-Manufacturer-3315@reddit
Well, when I added the vision coder, fit did not work. It would overflow my GPU.
Octopotree@reddit
So does it fit as much as it can on your GPU vram and put the rest (including context?) on your CPU ram?
a9udn9u@reddit (OP)
I think it offloads model layers.
Miriel_z@reddit
Ollama auto-manages model weights and KV cache. You can always offload to RAM/CPU, it will be way slower though.
leonbollerup@reddit
Ollama is "too simple"... honestly.
a9udn9u@reddit (OP)
I heard that ollama is easy to use but slow, never tried it, started with vLLM but it requires more VRAM overhead, now switched to llama.cpp, love it!
Miriel_z@reddit
Slower if you offload to CPU, sure. It uses llama.cpp itself with extra management on top. I haven't compared it directly to llama.cpp though.