Llama.cpp's auto fit works much better than I expected

[-]

pmttyji@reddit

You could squeeze even more with fit target by giving low value like 512 (Default is 1024 .... 1GB). Also KVCache Q8 is great now(No need for F16 anymore after recent change)

[-]

klim2009@reddit

Setting fitt to lower than 2048 makes it error out when trying to analyze pictures

[-]

SosirisTseng@reddit

The image projector takes about 1GB of VRAM. I usually use --no-mmproj-offload (loading the image projector to RAM instead of VRAM) to squeeze out one more GB.

[-]

ea_man@reddit

That's for rich people, on Linux I use --fit-target 128

or 64

[-]

DrAlexander@reddit

I'm assuming for a headless run, right? Or does the desktop run on only 64 MB vram?

[-]

ea_man@reddit

That value is the headroom left, calculated after whatever is in VRAM by the time llama launches.

[-]

DrAlexander@reddit

Ok. So, assuming I'm not doing anything else, it could be dropped a bit. On windows for sure not so drastically, but I'll have to try things out.

Thanks!

[-]

ea_man@reddit

you should not start anything else that goes in vram, it's ok if you got your stuff already running.

And yet, the less shit you keep in vram...

[-]

As ea_man says, the target is what it tries to leave free including anything else on your system already using VRAM. However, yes, if you set the margin that low you'll want to run headless - even something as simple as alt-tabbing can cause huge slowdowns otherwise.

[-]

ea_man@reddit

Well you could disable all kind of 3d stuff in the DE or use a integrated graphic for that.

Me I'm using 60-100 fit if I'm running LXqt-KDE (no composer).

[-]

ea_man@reddit

No that does not have to do with the desktop usage.

[-]

ParaboloidalCrest@reddit

Will that cannibalize the memory necessary for compute buffer, which I assume is not accounted for by fit? I mean is there a fitt value so low that one might get an OOM?

[-]

ea_man@reddit

If you are headless you should use all your VRAM for compute.

Also it's not like that if you go 10mb out of VRAM the machine explodes, it will just be slower.

[-]

YourNightmar31@reddit

Wait what? What is fit target and why is it different from "fit on"?

[-]

puncia@reddit

It's not different. Fit (-fit) attempts to fit everything into VRAM, leaving some headroom (1024MiB by default). Fit target (-fitt) is just an override for that headroom.

[-]

DunderSunder@reddit

so i'm on a laptop and the Nvidia GPU is not used normally (iGPU is used for apps and display) . Would that mean I can set fitt as small as possible or there is some stuff that has to be taken into account?

[-]

puncia@reddit

Yes, you could even theoretically set it to 0 if you are sure nothing else is using VRAM. Even if you were to open a game (which would want to go to your powerful gpu), let's say, nothing would happen besides the game going at like 10 fps, and this is because by default nvidia drivers let the excess VRAM spill into system RAM (sysmem fallback).

Also, -fitt supports a comma-separated list of values for multi-gpu. For example -fitt 256, 512 would let 256 and 512 MiB of headroom in your GPU 0 and GPU 1 respectively. You'd have to watch llama's output to see where it's actually allocating GPU in your case though, because I'm not sure myself how it behaves with a setup of that kind.

[-]

a9udn9u@reddit (OP)

It sets a free memory target, so 2048 means leave 2GB free VRAM for other things.

[-]

GregoryfromtheHood@reddit

Wait is this true for Qwen 3.5 and 3.6? I thought even F16 sucks on those and bf16 is the best for them and Q8 is basically garbage.

[-]

ParaboloidalCrest@reddit

KVCache Q8 is great now

While it has improved, there are still some edge-cases. For example, GLM4.5-Air and GLM4.6V both perform a lot worse on longer context (> 64k) with q8 cache, certainly worse than f16.

[-]

a9udn9u@reddit (OP)

Good tips! I did fit target 2g and kv cache always set to Q8

[-]

ghostopera@reddit

If you use quantization for the KV (say, Q8_0) you might be able to fit everything into vram, including 256k context, and get double or more the token speed you currently getting.

For example, I'm fitting Qwen 3.6 35B Q3_K_M with 256k context on my 24gb 7900 xtx and am getting about 84 tok/s.

On your 32gb you should be able to do the same thing, but fitting a higher model quantization than I'm using :).

[-]

Mattthhdp@reddit

Can you share your config ? I have the same GPU and barely get 20tok/a

[-]

ghostopera@reddit

Are you sure that's not running off the CPU? If you are using say, rocm llama-server with a mismatched rocm library it will fall back to CPU. I've been using the vulkan version for this myself.

My models.ini:

version = 1

[*]
jinja = true
parallel = 1
fit = on
fit-ctx = 8192
cache-type-k = q8_0
cache-type-v = q8_0

[Qwen3.6-35B-A3B-UD-Q3_K_M]
model = models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
flash-attn = on
temperature = 0.6
presence-penalty = 0.0
repeat-penalty = 1.0
top-p = 0.95
top-k = 20
min-p = 0.00

Been meaning to test with chat-template-kwargs = {"preserve_thinking": true} as well but haven't gotten around to it just yet.

If you are using something like LM Studio you can still do pretty much all the same settings.

[-]

Mattthhdp@reddit

I will have to take a closer look, with your setup I'm at 25 tk/a .... Maybe it's because I'm on windows ?

[-]

ghostopera@reddit

So weird. It could certainly be a Windows vs Linux thing! Unfortunately I don't have an installation of Windows to test from. (Stopped using Windows entirely at the start of last year)

If it helps, this is my llama-server command:

llama.cpp/vulkan/llama-server --models-preset ./models.ini --models-max 1

(llama.cpp has two copies of llama. In this case I'm using the vulkan version version).

Two thoughts come to mind: 1. It's loading into your CPU 2. It's loading into your integrated graphics instead of your GPU

You can check out what Vulkan devices are available (or ROCm if you are using that) with:

$ llama.cpp/vulkan/llama-server --list-devices
load_backend: loaded RPC backend from /home/lholden/LLMs/llama.cpp/vulkan/libggml-rpc.so
load_backend: loaded Vulkan backend from /home/lholden/LLMs/llama.cpp/vulkan/libggml-vulkan.so
load_backend: loaded CPU backend from /home/lholden/LLMs/llama.cpp/vulkan/libggml-cpu-haswell.so
Available devices:
  Vulkan0: AMD Radeon RX 7900 XTX (RADV NAVI31) (24576 MiB, 20364 MiB free)

You should see your GPU listed here. If you do see your GPU here, but it doesn't seem to be using your CPU, you can tell it which device to use. I don't have any integrated graphics, so it only outputs my dedicated GPU.

But you could use --dev to force it.

You should check the fit messaging from the command. For example:

[52905] llama_params_fit_impl: projected to use 19019 MiB of device memory vs. 20416 MiB of free device memory
[52905] llama_params_fit_impl: will leave 1396 >= 1024 MiB of free device memory, no changes needed
[52905] llama_params_fit: successfully fit params to free device memory

In this case, it fit everything I asked for into vram including all layers and the full context the model supports.

If it's using your CPU for the full thing for some reason, you will notice it will use your system ram for the free device memory numbers)

Though with the params I am using it should shorten the context before dropping slices.

Should also check the output around load_tensors:

[52905] load_tensors: offloading output layer to GPU
[52905] load_tensors: offloading 39 repeating layers to GPU
[52905] load_tensors: offloaded 41/41 layers to GPU
[52905] load_tensors:   CPU_Mapped model buffer size =   397.85 MiB
[52905] load_tensors:      Vulkan0 model buffer size = 15423.34 MiB

In this case you can see that it stuck all layers on to the GPU.

Also worth noting, you could also try LM Studio. You will have to fiddle around to get all the same settings, but I've found it to work just fine as well.

[-]

Mattthhdp@reddit

%LLAMA_DIR%\llama-server.exe ^
  -m "%MODEL%" ^
  -ngl 99 ^
  -ncmoe 4 ^
  -fa on ^
  -ctk q8_0 ^
  -ctv q8_0 ^
  -c 262144 ^
  --fit on ^
  --fit-ctx 32768 ^
  --context-shift ^
  --cache-reuse 256 ^
  -t 6 ^
  -tb 12 ^
  --batch-size 2048 ^
  --ubatch-size 512 ^
  --parallel 1 ^
  --prio 2 ^
  --chat-template-kwargs "{\"preserve_thinking\":true}" ^
  --host 0.0.0.0 ^
  --port 8080


models--unsloth--Qwen3.6-35B-A3B-GGUF\snapshots\a483e9e6cbd595906af30beda3187c2663a1118c\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

Here is my lastest config i got 74tok/s

[-]

ghostopera@reddit

Yeah maybe? I just upgraded everything but my video card and am now getting 130 tok/s.

When you start llama-server, it tells you how well it fits onto the vram. Does it say if it isn't fitting well? There should be a message about it. It's possible more vram is being used on your system by the OS and everything else, so the model is sitting more in cpu/ram than on the card.

It could also be the rest of your hardware holding things back. For example, do you have REBAR on?

Side note, I've recently moved back to 128k of context. Less important with the fit configuration since it will scale that down as needed, but it was useful for model stability.

[-]

Mattthhdp@reddit

I will take a look tonight, pretty su rebar is on, but doesn't hurt to confirm . I did a few tweak like try rdom and I'm stuck at 38tk/a XD

[-]

ghostopera@reddit

Are you using the vulkan build of llama-server? Can also make sure its using the gpu. Possible its not finding your video card?

[-]

Mattthhdp@reddit

Well the beam usage jump to 22gb and the compute to 100% so definitely using the GPU ^^

[-]

a9udn9u@reddit (OP)

Q4 used to be my go to quantization but I was never impressed by their response quality, I wanted to try Q8 this time. 57 t/s is on par with the speed I got from the 27B dense model, it's good enough for me

[-]

Digger412@reddit

Ghostopera meant KV cache quantization, which is separate from the model quantization itself. Your KV cache is in FP16 regardless of what the model quantization level is, unless you choose to quant the KV cache.

It costs some accuracy to do so, but Q8 KV cache would be a fairly small drop in accuracy but allow you to use twice the context length.

[-]

a9udn9u@reddit (OP)

Oh I misread his message, my kv is at Q8, but how could kv quantization help fitting 35GB weights in 32GB VRAM?

[-]

Digger412@reddit

If you're using `--fit`, and if you don't increase the context size, then your KV takes up half the size it did previously and `--fit` will load more weights from RAM to VRAM to fill it up. So you'd get more by proxy.

[-]

antwon_dev@reddit

True. If OP is getting 256k context on fp16 alr, though, then maybe consider not quantizing the KV cache. 256k is a lot of usable context, and quality can certainly diminish when messing with attention.

But if you try out 512k @ Q8, let us know how it affects output quality/speed and memory usage!

[-]

waruby@reddit

see the -ctk and -ctv command line options. If you compile the Rotorquant fork of llama.cpp you can do -ctk planar3 -ctv turbo3 which give 10.3x compression of the KV cache for negligible loss in quality.

[-]

Old-Sherbert-4495@reddit

are u running on a 5090?? coz im getting 40tkps on 4060ti 16vram and 32 sys ram 20 core cpu. so i think u can squeeze more out of it. i tweaked it manually using, ngl and cpu moe count. with q8 kv. instead of fit.

one thing to note is that fit does offload to cpu. but it works very different with dense (27b) and moe (active only 3b). u have to fit dense model fully to get best performance, offloading hurts a lot. but for moe offloading helps.

[-]

relmny@reddit

Have a look at this thread, you might find better options:

https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_5070_ti_9800x3d_running_qwen3635ba3b_at_79_ts/

btw, with a slower 32GB GPU + 128gb RAM + ssd, I can run qwen3.5-397b q4kl at 5 t/s or 3.6-35b ud-q6-kl at 114 t/s (can even run kimi.k2.6 smol-iq2-ks but at only 2.18 t/s)

It's all about offloading to CPU

[-]

Worried-Squirrel2023@reddit

the 35B-A3B MoE part is the lever. only 3B params active per token means even spilling weights to system ram doesn't kill throughput like it would on a dense 35B. would be a different story on a dense model of the same total size.

[-]

fallingdowndizzyvr@reddit

I didn't find it to work that well. When I run models that will barely fit spread across multiples devices, GPUs or machines, I find that many times it falls to find a split that will run. It'll just OOM a device. But if I split the model by hand, I can squeeze it in and have it run.

[-]

see_spot_ruminate@reddit

Really? I have a quad 5060ti setup and it works great. If I don't mess/tweak with --fit-target, then I never have an issue. What is your command and what driver are you on?

[-]

StardockEngineer@reddit

How many GPUs? Was working great with my 5090 and A6000 combo. Only early vision capable models were a problem with it first launched.

[-]

tomt610@reddit

It isn't magic, in more complex multi gpu/cpu scenarios it leaves a lot of performance/context on the table and unused. It may be good on simple systems but has long way to go

[-]

No_Mango7658@reddit

Q4km with 256k fits like a glove. There is 1 layer(idk how to actually check) that spills into ram. I get about 165tps at 128k context with every in vram, and I get 145tps with 256k context and very slight spillover into ram. This is on my gaming desktop in lmstudio

J

[-]

StardockEngineer@reddit

You should be getting 190-220 tok/s. BTW --fit on is on by default. Just don't specify context.

[-]

ikkiho@reddit

the reason this works so well is moe + oculink only has to shuttle the active experts per token (~3b active for qwen 3.6 35b), not the full weights. dense model same size would top out at maybe 5-10 t/s with that vram deficit. also worth stacking kv-cache q8 on top of --fit — that single change usually matters more than which experts land on cpu.

[-]

a9udn9u@reddit (OP)

True, I'd never use a dense model with CPU offloading

[-]

RoomyRoots@reddit

I wish I could say the same, I need to specify the fit flags manually or I get OOM. It can be ROCm fucking me over but in general it runs well.

[-]

_bones__@reddit

Wow, this got Qwen 3.6 35B UD Q3 K XL to run at 48t/s for me, where before I got 12. Pretty damn good!

[-]

danigoncalves@reddit

Did you offloaded any layer to the CPU? Currently I have the same VRAM but have to offload 35 layers if I want reasonable speed.

[-]

a9udn9u@reddit (OP)

Awesome! What's your setup before? Manual offloading or all CPU?

[-]

_bones__@reddit

GPU as far as I could, -ngl 99 -cpu-moe. Basically copy pasted settings. GPU memory utilization is now quite high, but was quite low, so I think it did way too much on CPU

The 45t/s is on a 12GB rtx3080, so pretty happy. Now it's worth trying to see what a Q3 can do as a coding agent...

[-]

draetheus@reddit

This works well because the 35B model is an MoE architecture with only 3B active params. You'll have a much worse time with a dense model like the 27B.

[-]

GregoryfromtheHood@reddit

Wow you're right! There's some magic here. I never used it because it used to do weird stuff and just cause OOMs doing weird splits when I could easily fit the model playing with tensor split numbers myself, so I've been spending hours on every model finding the exact right tensor split and context to perfectly fit the GPUs as best I can. I had Qwen 3.6 up and running with 650k context and it was juust barely fitting into my GPUs with a few hundred mb headroom after I got the tensor splits right.

I just tried fit and fit target and somehow now it fits with like 5gb free on one GPU, another few GB free on the others, running at the same speeds. The heck? Where'd it pull all this extra headroom from? The GPUs were all entirely full when I did my own tensor split.

[-]

ANTONBORODA@reddit

Does fit actually take checkpoints into account? Because I recently found out that the crazy slowdowns I encounter during prolonged usage are because of context checkpoints that are also being saved to memory and they can actually grow huge.

[-]

OddDesigner9784@reddit

Running qwen 3.6 35b 2 bit quant on my 16 gb AMD card we ball

[-]

No-Manufacturer-3315@reddit

Well, I added the visioning coder fit did not work. It would overflow my GPU.

[-]

Octopotree@reddit

So does it fit as much as it can on your GPU vram and put the rest (including context?) on your CPU ram?

[-]

a9udn9u@reddit (OP)

I think it offload model layers

[-]

Miriel_z@reddit

Ollama automanages model weights and kv cache. You can always offload to RAM/CPU, will be way slower though.

[-]

leonbollerup@reddit

ollama is "to simple" .. honestly..

[-]

a9udn9u@reddit (OP)

I heard that ollama is easy to use but slow, never tried it, started with vLLM but it requires more VRAM overhead, now switched to llama.cpp, love it!

[-]

Miriel_z@reddit

Slower if you offload to CPU, sure. It uses cpp itself with extra management on top. Have not compared it directly to cpp though.