The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
Posted by raketenkater@reddit | LocalLLaMA | View on Reddit | 73 comments
This is V2 of my previous post.
What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.
My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM.
Model | llama-server (no tuning) | llm-server v1 tuning | llm-server v2 (AI-Tune)
---|---|---|---
Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s
Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s
gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s

What I think is best here: --ai-tune keeps up with updates to llama.cpp / ik_llama.cpp automatically, because it feeds llama-server --help into the LLM tuning loop as context. New flags land → the tuner can use them → you get the best performance.
I think those are some solid gains (max tokens yeaaahh), plus more stability and a nice TUI via llm-server-gui.
Check it out: https://github.com/raketenkater/llm-server
segmond@reddit
provide an example of the parameters it used vs the previous to go from 4.1tk/s to 17.47tk/s
raketenkater@reddit (OP)
So 4.1 tok/s is just llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf
And the tuned command, after layer 1 handles MoE expert offloading and ai-tune runs as layer 2, would be:
llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf \
-ngl 48 \
--tensor-split 0.54,0.23,0.23 \
-sm graph \
-fa on \
--cache-type-k q8_0 --cache-type-v q8_0 \
-ot "blk\.(1[4-9]|2[0-9])\.ffn_.*_exps=CUDA1" \
-ot "blk\.(3[0-9]|4[0-7])\.ffn_.*_exps=CUDA2" \
--run-time-repack -khad --defrag-thold 0.1 \
--threads 8 --threads-batch 16 \
--batch-size 2048 --ubatch-size 256
Gets 17.47 tok/s
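For anyone puzzling over the -ot flags in that command: they are regexes matched against tensor names, pinning the expert FFN tensors of blocks 14-29 to CUDA1 and blocks 30-47 to CUDA2, while everything else follows --tensor-split. A quick sketch of what the first pattern selects:

```python
import re

# First -ot pattern from the tuned command above:
# expert FFN tensors of blocks 14-29 are pinned to CUDA1.
pat = re.compile(r"blk\.(1[4-9]|2[0-9])\.ffn_.*_exps")

print(bool(pat.match("blk.14.ffn_gate_exps")))  # True: block 14 is in range
print(bool(pat.match("blk.29.ffn_up_exps")))    # True: block 29 is in range
print(bool(pat.match("blk.13.ffn_gate_exps")))  # False: block 13 keeps its --tensor-split placement
```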
Equivalent_Job_2257@reddit
Can you compare with the same command as the first, with only the '--fit' flag added?
666666thats6sixes@reddit
--fit is on by default unless they used some ancient (pre 75xx) build
ecompanda@reddit
the cpu offload strategy being the default when ngl is not set explains a lot of the bad benchmarks people post. most "my llama.cpp is slow" threads are just missing that one flag
IrisColt@reddit
ngl... ngl is a must
Liquos@reddit
I thought in the latest versions it defaults to offloading to the GPU?
StardockEngineer@reddit
It does. llama.cpp defaults to maxing out the GPU
Glittering-Call8746@reddit
Will it work for ik_llama.cpp ?
draetheus@reddit
I have 96GB RAM and a single 9070 XT, and with Vulkan I get about the same TG speed. What is your PP speed though? If you have any MoE layer spilling onto CPU, a ubatch size of 256 is going to be horrible for PP speed. I'm not sure I trust these as the most optimized settings possible.
raketenkater@reddit (OP)
Yes, my 3060 is currently only x1, but I will upgrade to x4 using an M.2-to-PCIe adapter soon hehe
ForsookComparison@reddit
i think they meant the before and after
segmond@reddit
I think they showed the before, they ran it without offloading to GPU
"llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf"
To OP, at least offload to GPUs and use the fit parameters, that should be your minimal baseline.
StardockEngineer@reddit
That would totally offload to the GPU. That's the default.
raketenkater@reddit (OP)
Before 4.1 tok/s:
llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf
After 17.47 tok/s:
ForsookComparison@reddit
siiiiiiiggghhhhh
RelicDerelict@reddit
Will Linux be supported in the future?
unculturedperl@reddit
It works on linux now.
Leather_Flan5071@reddit
this is some nice concept I'm gonna watch for improvements
fragment_me@reddit
Anything serious needs to benchmark perplexity and KLD after the changes.
Wise-Hunt7815@reddit
I tested it on two 3090s with 64G RAM, and it did improve the speed, but the AI changed the KV cache to Q4_0... I can't accept that, lol
Qwen3.5-122B-A10-Q4_K_M
rearwebpidgeon@reddit
Seems like --ai-tune isn't implemented in llm-server-mac - that wasn't clear to me from docs (unless I just didn't RTFM enough).
mrtrly@reddit
The self-referential loop is the clever part here. Most people hand-tune tensor splits once and forget about it, but flag interactions are combinatorial enough that automated search beats human intuition past two GPUs. Quant level probably shifts the optimal split enough that each one needs its own tuning pass.
Wide_Veterinarian100@reddit
My noob ass just learned to do this manually, thank you for this!
Glittering-Call8746@reddit
So basically it keeps trying till it gets the right tensor split?
raketenkater@reddit (OP)
There are 2 stages:
1. It measures VRAM, PCIe lane speed, model architecture and so on, and based on that chooses a strategy: 1. dense single GPU, 2. dense multi GPU, 3. CPU offloading with expert placement (first a conservative placement, then filling it until tight).
2. The ai-tune flag prompts the running LLM with the help output of the selected backend, and the LLM then tries to improve its tok/s performance.
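The repo's actual implementation isn't shown in the thread, but stage 2 as OP describes it could be sketched roughly like this; `propose_flags` and `benchmark` are hypothetical stand-ins for the local-LLM call and the tok/s measurement:

```python
import subprocess

def help_text(server_bin="llama-server"):
    # Feeding --help into the tuner is what keeps it current with new flags.
    return subprocess.run([server_bin, "--help"],
                          capture_output=True, text=True).stdout

def ai_tune(propose_flags, benchmark, rounds=8):
    """Stage-2 loop as described: try `rounds` candidates, skip crashes,
    keep the config with the best tok/s.

    propose_flags(history) -> list[str]  (in llm-server this would be the
        local LLM, prompted with the --help text and past results)
    benchmark(flags) -> float tok/s, or None if the server crashed
    """
    best_flags, best_tps, history = None, 0.0, []
    for _ in range(rounds):
        flags = propose_flags(history)
        tps = benchmark(flags)
        history.append((flags, tps))
        if tps is not None and tps > best_tps:  # crashes (None) do not count
            best_flags, best_tps = flags, tps
    return best_flags, best_tps
```

This matches OP's later description: a fixed number of rounds, crashed runs discarded, best tok/s cached.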
RelicDerelict@reddit
Fantastic, finally PCIe lanes are getting taken into consideration. Building a PC with PCIe 5.0 doesn't sound so useless anymore.
Glittering-Call8746@reddit
No it doesn't, until you figure out PCIe 5.0 extensions have to be shorter and are more expensive, and not all consumer mobos have PCIe 5.0 bifurcation..
fulgencio_batista@reddit
Hey this is pretty handy! I saw around a 50% boost in tg over my baseline command from the auto-detected command, though I didn't have any luck with my LLM tuning it further (no change).
I was trying to run qwen3.5-110b-a10b-reap-40 (~46gb) with 32gb vram.
unculturedperl@reddit
For schnitts and giggles I ran it on my dgx spork (n100/16gb).
Corosus@reddit
Cool stuff, on a whim I decided to try it. Bothering to switch from Windows to WSL2 alone has given me a nice lil boost from 26 tps to 30 tps for Qwopus3.5-27B-v3-Q4_K_M.gguf. Let's see if it can beat my current kinda-mostly-optimized ik_llama on my dual 5070 Ti / 5060 Ti setup with bad PCIe communication speed
~/projects/git/ik_llama.cpp/build/bin/llama-server -m /home/corosus/projects/ai/jackrong/Qwopus3.5-27B-v3-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.4 --repeat-penalty 1.5 -ngl 99 -sm layer --merge-qkv -rtr -fa on -ctk q8_0 -ctv q8_0 -c 100000 -b 16384 -ub 4096 -ts 28,20 --no-warmup --jinja --numa isolate -tb 16
at Round 4/8 so far on the --ai-tune
CornerLimits@reddit
Maybe a simple script without an LLM could be faster/better, with no burned tokens? It will bench a lot of times; I can't see the real value of having an LLM.
However cool idea!
fishhf@reddit
Using the Optuna library or an old-school genetic algorithm would be less overkill.
IsopodInitial6766@reddit
The value isn't faster search it's zero-maintenance search. Optuna
needs you to define the parameter space up front: which flags exist,
valid values, conflicts. An llm reading `llama-server --help` each run
picks up new flags (like `--ubatch-size`) without updating your config
Hybrid is probably best: LLM constrains the search space, a
deterministic tuner does the sweep
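For the hybrid idea, the deterministic half can be a plain grid sweep over a space the LLM (or a human) spells out; a minimal sketch, where the flag names and values are just illustrative examples, not llm-server's actual search space:

```python
import itertools

# Example search space, hand-written or emitted by the LLM after reading --help.
# Values here are illustrative only.
SPACE = {
    "--batch-size":   [512, 1024, 2048],
    "--ubatch-size":  [256, 512],
    "--cache-type-k": ["f16", "q8_0"],
}

def candidates(space):
    """Yield every flag combination in the space as an argv fragment."""
    keys = list(space)
    for combo in itertools.product(*space.values()):
        yield [tok for k, v in zip(keys, combo) for tok in (k, str(v))]

def sweep(benchmark, space=SPACE):
    """Exhaustive sweep; benchmark(flags) -> tok/s, or None on crash."""
    best_flags, best_tps = None, 0.0
    for flags in candidates(space):
        tps = benchmark(flags)
        if tps is not None and tps > best_tps:
            best_flags, best_tps = flags, tps
    return best_flags, best_tps
```

The combinatorics is also the weakness: 3×2×2 is 12 benchmark runs here, and it multiplies with every flag you add, which is where a smarter search (Optuna, GA, or an LLM pruning the space) pays off.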
raketenkater@reddit (OP)
There is no token burn, it is using your local LLM. And you do not have to use ai-tune; it's an optional flag.
CornerLimits@reddit
I mean a token is a token even if it's local :D Btw this is certainly a good approach, because you don't have to update it when new flags come out. It only sounds a bit overkill to me, and I suspect low-tier LLMs will make a mess with all the flags, but maybe I'm wrong.
How can it detect when we are at absolute maximum performance (nothing left to try)?
raketenkater@reddit (OP)
Even small current LLMs are pretty capable, but I have not tested that. As for maximum performance: ai-tune basically just runs the number of rounds you set (crashes do not count), then checks which config got the best tok/s and saves that. But I think maximum performance is relative, with the AI space and llama.cpp moving so fast that it's just going up.
ai_without_borders@reddit
tensor split is doing a lot of heavy lifting here. with mixed vram capacities (like 3090+4090), the default 50/50 split hammers the slower card and you get bottlenecked at the compute boundary. finding the right ratio is sometimes worth 2x on its own, separate from any flag tuning. curious what the split ended up being in the optimized config.
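As a toy illustration of why the ratio matters: a first-cut split proportional to VRAM alone, which ignores exactly the compute and PCIe factors this comment is talking about:

```python
def naive_split(vram_gib):
    """Tensor-split ratios proportional to VRAM alone.
    Real tuning also has to weigh compute speed and PCIe bandwidth,
    which is why the fastest ratio drifts away from this baseline."""
    total = sum(vram_gib)
    return [round(v / total, 2) for v in vram_gib]

# e.g. a 24 GiB 3090 Ti with a 12 GiB 4070 and a 12 GiB 3060:
print(naive_split([24, 12, 12]))  # [0.5, 0.25, 0.25]
```

For what it's worth, OP's tuned 122B config above landed on --tensor-split 0.54,0.23,0.23, shifted toward the fastest card relative to this VRAM-only baseline.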
RelicDerelict@reddit
Does this calculate the ratio between CPU and GPU too?
sonicnerd14@reddit
Interesting, ironically I've been working on a skill that does something similar, called local inference optimizer. Except that it relies on an agent outside of the LLM, working on the host machine itself, to find the optimal settings. I think both ideas are pretty solid and useful, so that we don't have to spend so much time tuning these models ourselves.
Danmoreng@reddit
Have you tried optimal default settings with fit and fit-ctx? See here: https://github.com/Danmoreng/local-qwen3-coder-env
JLeonsarmiento@reddit
What kind of witchcraft is this?
Designer_Reaction551@reddit
the self-tuning loop idea is actually brilliant for multi-GPU setups where the optimal layer split is basically impossible to guess manually. we spent hours tweaking ngl and tensor split values for a 3090 + 3060 combo before just writing a similar brute-force search. 4.1 -> 17.47 on the 122B is wild tho, most of that is probably just proper GPU offloading vs CPU default.
HopePupal@reddit
brute force seems fine for multi-GPU setups you can count on one hand. you don't need an LLM for that. you don't even really need Optuna or another hyperparameter search tool.
Professional_Let8686@reddit
I am using an RTX 5070 (12G VRAM) with 128G RAM. What is the best inference tok/s I can expect with these large models? I am currently running the Qwen 3.5 9B unsloth Q4 quant with q4_1 KV cache and getting around 90 tok/s.
Queasy_Asparagus69@reddit
For a v3 add speculative decoding
Queasy_Asparagus69@reddit
And jinja chat templates too ;)
raketenkater@reddit (OP)
Yeah, I wrote that up already; it would be another big performance-gaining path.
fuchelio@reddit
Does --ai-tune support hard constraints? For example, a 256K context, mmproj, or thinking as a non-negotiable requirement.
raketenkater@reddit (OP)
Yes, ctx-size is set by the user and cannot be changed by ai-tune, same for vision.
Queasy_Asparagus69@reddit
Wow I love this
Pixer---@reddit
Will there be a rocm / Vulkan version ?
raketenkater@reddit (OP)
working on vulkan right now for v2.2
Clean_Initial_9618@reddit
Saying the same thing: can I tell Claude Code to do this, like give it access to a shell, ask it to run llama-server, query it, look at the stats and find the best settings, and give it access to the llama.cpp docs? Sorry, just asking as I have been trying to find the right flags for my setup as well: RTX 3090 and 64GB system RAM, trying to run a Hermes agent with either gemma4-26B-A4B-it or qwen3.5-27b. Any help or suggestions would be great. Thank you
raketenkater@reddit (OP)
Yes, you could do that using Claude Code as well, I think, but you would burn your tokens
Clean_Initial_9618@reddit
Makes sense. How does ai-tune work in the background, is it safe? Can I just add that feature to my existing llama-server, or do I need to clone and make llama-server again??
raketenkater@reddit (OP)
So llm-server just builds on top of any llama-server binary. And as for ai-tune being safe: it is just your locally hosted AI reading the --help output of your binary and, based on that, tuning the flags of the model currently running.
Craftkorb@reddit
If it fits into VRAM go with vllm, much faster
Theboyscampus@reddit
It sounds like auto OCing on graphic cards lol
b1231227@reddit
Can it export the parameters after ai-tune as a reference? Because I am using another llama.cpp branch that has some functions I need, so I cannot directly jump to the llm-server you developed.
raketenkater@reddit (OP)
which binary to run is pluggable using the --server-bin flag too
raketenkater@reddit (OP)
It saves them as configs so yes
andy2na@reddit
Any easy way to run this in a Docker container? I've tried to run it in unraid and it's not working at all
mister2d@reddit
It's always nice to see optimization on consumer hardware. I've had to do this by hand while keeping up with all the new flags like n-cpu-moe and tensor parallelism.
And since buying a new rig is out of the question, I have to squeeze everything out of my DDR3 box.
raketenkater@reddit (OP)
Exactly same for me
TomHale@reddit
Very cool! With your AI's knowledge and context, could you ask it for a plan on how to do the same but with Lemonade for AMD?
A markdown file on that in your repo would be amazing! 😉
denoflore_ai_guy@reddit
OmG iTs SeLf ImPrOvInG ai!?!?!?! 🤪 but srsly nice stuff.
ketosoy@reddit
Do you have a genetic algo in there or is it pure random testing?
ecompanda@reddit
the multi GPU split is probably doing as much work as the flag tuning honestly. tensor split across a 3090 Ti and two smaller cards is notoriously fussy and most people never get past default even distribution.
curious whether the ai tuning is finding a non obvious tensor split ratio or mostly optimizing batch size and context window flags. because those are two pretty different wins.
27B at 40 tok/s is legitimately fast for a rig like that though.
raketenkater@reddit (OP)
And tensor splits are handled by the deterministic first layer: conservative fill based on measured VRAM, then squeezed until tight.
raketenkater@reddit (OP)
The context-window flags and so on can be set by the user and are treated as fixed by ai-tune
qwen_next_gguf_when@reddit
What is your ik llamacpp cmake command?
raketenkater@reddit (OP)
just