The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
Posted by raketenkater@reddit | LocalLLaMA | View on Reddit | 73 comments
This is V2 of my previous post.
What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.
My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM.
Model | llama-server (no tuning) | llm-server v1 tuning | llm-server v2 (AI-Tune)
---|---|---|---
Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s
Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s
gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s

What I think is best here: --ai-tune keeps up with updates to llama.cpp / ik_llama.cpp automatically, because it feeds llama-server --help into the LLM tuning loop as context. New flags land → the tuner can use them → you get the best performance.
I think those are some solid gains (max tokens yeaaahh), plus more stability and a nice TUI via llm-server-gui.
Check it out: https://github.com/raketenkater/llm-server
segmond@reddit
provide an example of the parameters it used vs the previous to go from 4.1tk/s to 17.47tk/s
raketenkater@reddit (OP)
So 4.1 tok/s is just llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf
And the tuned command, after layer 1 handles MoE expert offloading and ai-tune runs as layer 2, would be:
llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf \
-ngl 48 \
--tensor-split 0.54,0.23,0.23 \
-sm graph \
-fa on \
--cache-type-k q8_0 --cache-type-v q8_0 \
-ot "blk\.(1[4-9]|2[0-9])\.ffn_.*_exps=CUDA1" \
-ot "blk\.(3[0-9]|4[0-7])\.ffn_.*_exps=CUDA2" \
--run-time-repack -khad --defrag-thold 0.1 \
--threads 8 --threads-batch 16 \
--batch-size 2048 --ubatch-size 256
Gets 17.47 tok/s
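For anyone puzzling over the -ot flags in that command: they are regexes matched against tensor names, pinning the expert FFN tensors of blocks 14-29 to CUDA1 and blocks 30-47 to CUDA2, while everything else follows --tensor-split. A quick sketch of what the first pattern selects:

```python
import re

# First -ot pattern from the tuned command above:
# expert FFN tensors of blocks 14-29 are pinned to CUDA1.
pat = re.compile(r"blk\.(1[4-9]|2[0-9])\.ffn_.*_exps")

print(bool(pat.match("blk.14.ffn_gate_exps")))  # True: block 14 is in range
print(bool(pat.match("blk.29.ffn_up_exps")))    # True: block 29 is in range
print(bool(pat.match("blk.13.ffn_gate_exps")))  # False: block 13 keeps its --tensor-split placement
```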
Equivalent_Job_2257@reddit
Can you compare with the same command as the first, with only the '--fit' flag added?
666666thats6sixes@reddit
--fit is on by default unless they used some ancient (pre 75xx) build
ecompanda@reddit
the cpu offload strategy being the default when ngl is not set explains a lot of the bad benchmarks people post. most "my llama.cpp is slow" threads are just missing that one flag
IrisColt@reddit
ngl... ngl is a must
Liquos@reddit
I thought in the latest versions it defaults to offloading to the GPU?
StardockEngineer@reddit
It does. llama.cpp defaults to maxing out the GPU
Glittering-Call8746@reddit
Will it work for ik_llama.cpp ?
draetheus@reddit
I have 96GB RAM and a single 9070 XT, and with Vulkan I get about the same TG speed. What is your PP speed though? If you have any MoE layer spilling onto CPU, a ubatch size of 256 is going to be horrible for PP speed. I'm not sure I trust these as the most optimized settings possible.
raketenkater@reddit (OP)
Yes, my 3060 is currently only x1, but I will upgrade to x4 using an M.2-to-PCIe adapter soon hehe
ForsookComparison@reddit
i think they meant the before and after
segmond@reddit
I think they showed the before, they ran it without offloading to GPU
"llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf"
To OP, at least offload to GPUs and use the fit parameters, that should be your minimal baseline.
StardockEngineer@reddit
That would totally offload to the GPU. That's the default.
raketenkater@reddit (OP)
Before 4.1 tok/s:
llama-server -m Qwen3.5-122B-A10B-Opus-Reasoning-Q4_K_M.gguf
After 17.47 tok/s:
ForsookComparison@reddit
siiiiiiiggghhhhh
RelicDerelict@reddit
Will Linux be supported in the future?
unculturedperl@reddit
It works on linux now.
Leather_Flan5071@reddit
this is some nice concept I'm gonna watch for improvements
fragment_me@reddit
Anything serious needs to benchmark perplexity and KLD after the changes.
Wise-Hunt7815@reddit
I tested it on two 3090s with 64G RAM, and it did improve the speed, but the AI changed the KV cache to Q4_0... I can't accept that, lol
Qwen3.5-122B-A10-Q4_K_M
rearwebpidgeon@reddit
Seems like --ai-tune isn't implemented in llm-server-mac - that wasn't clear to me from docs (unless I just didn't RTFM enough).
mrtrly@reddit
The self-referential loop is the clever part here. Most people hand-tune tensor splits once and forget about it, but flag interactions are combinatorial enough that automated search beats human intuition past two GPUs. Quant level probably shifts the optimal split enough that each one needs its own tuning pass.
Wide_Veterinarian100@reddit
My noob ass just learned to do this manually, thank you for this!
Glittering-Call8746@reddit
So basically it keeps trying till it gets the right tensor split?
raketenkater@reddit (OP)
There are 2 stages:
1. It measures VRAM, PCIe lane speed, model architecture and so on, and based on that chooses a strategy: 1. dense single GPU, 2. dense multi GPU, 3. CPU offloading with expert placement (first a conservative placement, then filling it until tight).
2. The ai-tune flag prompts the running LLM with the help output of the selected backend, and the LLM then tries to improve its tok/s performance.
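The repo's actual implementation isn't shown in the thread, but stage 2 as OP describes it could be sketched roughly like this; `propose_flags` and `benchmark` are hypothetical stand-ins for the local-LLM call and the tok/s measurement:

```python
import subprocess

def help_text(server_bin="llama-server"):
    # Feeding --help into the tuner is what keeps it current with new flags.
    return subprocess.run([server_bin, "--help"],
                          capture_output=True, text=True).stdout

def ai_tune(propose_flags, benchmark, rounds=8):
    """Stage-2 loop as described: try `rounds` candidates, skip crashes,
    keep the config with the best tok/s.

    propose_flags(history) -> list[str]  (in llm-server this would be the
        local LLM, prompted with the --help text and past results)
    benchmark(flags) -> float tok/s, or None if the server crashed
    """
    best_flags, best_tps, history = None, 0.0, []
    for _ in range(rounds):
        flags = propose_flags(history)
        tps = benchmark(flags)
        history.append((flags, tps))
        if tps is not None and tps > best_tps:  # crashes (None) do not count
            best_flags, best_tps = flags, tps
    return best_flags, best_tps
```

This matches OP's later description: a fixed number of rounds, crashed runs discarded, best tok/s cached.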
RelicDerelict@reddit
Fantastic, finally PCIe lanes are getting taken into consideration. Building a PC with PCIe 5.0 doesn't sound so useless anymore.
Glittering-Call8746@reddit
No it doesn't, until you figure out PCIe 5.0 extensions have to be shorter and are more expensive, and not all consumer mobos have PCIe 5.0 bifurcation..
fulgencio_batista@reddit
Hey this is pretty handy! I saw around a 50% boost in tg over my baseline command from the auto-detected command, though I didn't have any luck with my LLM tuning it further (no change).
I was trying to run qwen3.5-110b-a10b-reap-40 (~46gb) with 32gb vram.
unculturedperl@reddit
For schnitts and giggles I ran it on my dgx spork (n100/16gb).
Corosus@reddit
Cool stuff, on a whim I decided to try it. Bothering to switch from Windows to WSL2 alone has given me a nice lil boost from 26 tps to 30 tps for Qwopus3.5-27B-v3-Q4_K_M.gguf. Let's see if it can beat my current kinda-mostly-optimized ik_llama on my dual 5070 Ti / 5060 Ti setup with bad PCIe communication speed
~/projects/git/ik_llama.cpp/build/bin/llama-server -m /home/corosus/projects/ai/jackrong/Qwopus3.5-27B-v3-Q4_K_M.gguf --host 0.0.0.0 --port 8080 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.4 --repeat-penalty 1.5 -ngl 99 -sm layer --merge-qkv -rtr -fa on -ctk q8_0 -ctv q8_0 -c 100000 -b 16384 -ub 4096 -ts 28,20 --no-warmup --jinja --numa isolate -tb 16
at Round 4/8 so far on the --ai-tune
CornerLimits@reddit
Maybe a simple script without an LLM could be faster/better, with no burned tokens? It will bench a lot of times; I can't see the real value of having an LLM.
However cool idea!
fishhf@reddit
Using the Optuna library or an old-school genetic algorithm would be less overkill.
IsopodInitial6766@reddit
The value isn't faster search it's zero-maintenance search. Optuna
needs you to define the parameter space up front: which flags exist,
valid values, conflicts. An llm reading `llama-server --help` each run
picks up new flags (like `--ubatch-size`) without updating your config
Hybrid is probably best: LLM constrains the search space, a
deterministic tuner does the sweep
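For the hybrid idea, the deterministic half can be a plain grid sweep over a space the LLM (or a human) spells out; a minimal sketch, where the flag names and values are just illustrative examples, not llm-server's actual search space:

```python
import itertools

# Example search space, hand-written or emitted by the LLM after reading --help.
# Values here are illustrative only.
SPACE = {
    "--batch-size":   [512, 1024, 2048],
    "--ubatch-size":  [256, 512],
    "--cache-type-k": ["f16", "q8_0"],
}

def candidates(space):
    """Yield every flag combination in the space as an argv fragment."""
    keys = list(space)
    for combo in itertools.product(*space.values()):
        yield [tok for k, v in zip(keys, combo) for tok in (k, str(v))]

def sweep(benchmark, space=SPACE):
    """Exhaustive sweep; benchmark(flags) -> tok/s, or None on crash."""
    best_flags, best_tps = None, 0.0
    for flags in candidates(space):
        tps = benchmark(flags)
        if tps is not None and tps > best_tps:
            best_flags, best_tps = flags, tps
    return best_flags, best_tps
```

The combinatorics is also the weakness: 3×2×2 is 12 benchmark runs here, and it multiplies with every flag you add, which is where a smarter search (Optuna, GA, or an LLM pruning the space) pays off.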
raketenkater@reddit (OP)
There is no token burn, it is using your local LLM. And you do not have to use ai-tune; it's an optional flag.
CornerLimits@reddit
I mean a token is a token even if it's local :D Btw this is certainly a good approach, because you don't have to update it when new flags come out. It only sounds a bit overkill to me, and I suspect low-tier LLMs will make a mess with all the flags, but maybe I'm wrong.
How can it detect when we are at absolute maximum performance (nothing left to try)?
raketenkater@reddit (OP)
Even small current LLMs are pretty capable, but I have not tested that. As for maximum performance: ai-tune basically just runs the number of rounds you set (crashes do not count), then checks which config got the best tok/s and saves that. But I think maximum performance is relative, with the AI space and llama.cpp moving so fast that it's just going up.
ai_without_borders@reddit
tensor split is doing a lot of heavy lifting here. with mixed vram capacities (like 3090+4090), the default 50/50 split hammers the slower card and you get bottlenecked at the compute boundary. finding the right ratio is sometimes worth 2x on its own, separate from any flag tuning. curious what the split ended up being in the optimized config.
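As a toy illustration of why the ratio matters: a first-cut split proportional to VRAM alone, which ignores exactly the compute and PCIe factors this comment is talking about:

```python
def naive_split(vram_gib):
    """Tensor-split ratios proportional to VRAM alone.
    Real tuning also has to weigh compute speed and PCIe bandwidth,
    which is why the fastest ratio drifts away from this baseline."""
    total = sum(vram_gib)
    return [round(v / total, 2) for v in vram_gib]

# e.g. a 24 GiB 3090 Ti with a 12 GiB 4070 and a 12 GiB 3060:
print(naive_split([24, 12, 12]))  # [0.5, 0.25, 0.25]
```

For what it's worth, OP's tuned 122B config above landed on --tensor-split 0.54,0.23,0.23, shifted toward the fastest card relative to this VRAM-only baseline.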
RelicDerelict@reddit
Does this calculate the ratio between CPU and GPU too?
sonicnerd14@reddit
Interesting, ironically I've been working on a skill that does something similar, called local inference optimizer. Except that it relies on an agent outside of the LLM, working on the host machine itself, to find the optimal settings. I think both ideas are pretty solid and useful, so that we don't have to spend so much time tuning these models ourselves.
Danmoreng@reddit
Have you tried optimal default settings with fit and fit-ctx? See here: https://github.com/Danmoreng/local-qwen3-coder-env
JLeonsarmiento@reddit
What kind of witchcraft is this?
Designer_Reaction551@reddit
the self-tuning loop idea is actually brilliant for multi-GPU setups where the optimal layer split is basically impossible to guess manually. we spent hours tweaking ngl and tensor split values for a 3090 + 3060 combo before just writing a similar brute-force search. 4.1 -> 17.47 on the 122B is wild tho, most of that is probably just proper GPU offloading vs CPU default.
HopePupal@reddit
brute force seems fine for multi-GPU setups you can count on one hand. you don't need an LLM for that. you don't even really need Optuna or another hyperparameter search tool.
Professional_Let8686@reddit
I am using an RTX 5070 (12G VRAM) with 128G RAM. What is the best inference tok/s I can expect with these large models? I am currently running the Qwen 3.5 9B unsloth Q4 quant with q4_1 KV cache and getting around 90 tok/s.
Queasy_Asparagus69@reddit
For a v3 add speculative decoding
Queasy_Asparagus69@reddit
And jinja chat templates too ;)
raketenkater@reddit (OP)
Yeah, I wrote that up already; it would be another big performance-gaining path.
fuchelio@reddit
Does --ai-tune support hard constraints? For example, a 256K context, mmproj, or thinking as a non-negotiable requirement.
raketenkater@reddit (OP)
Yes, ctx-size is set by the user and cannot be changed by ai-tune, same for vision.
Queasy_Asparagus69@reddit
Wow I love this
Pixer---@reddit
Will there be a rocm / Vulkan version ?
raketenkater@reddit (OP)
working on vulkan right now for v2.2
Clean_Initial_9618@reddit
Saying the same thing: can I tell Claude Code to do this, like give it access to a shell, ask it to run llama-server, query it, look at the stats and find the best settings, and give it access to the llama.cpp docs? Sorry, just asking as I have been trying to find the right flags for my setup as well: RTX 3090 and 64GB system RAM, trying to run a Hermes agent with either gemma4-26B-A4B-it or qwen3.5-27b. Any help or suggestions would be great. Thank you
raketenkater@reddit (OP)
Yes, you could do that using Claude Code as well, I think, but you would burn your tokens
Clean_Initial_9618@reddit
Makes sense. How does ai-tune work in the background, is it safe? Can I just add that feature to my existing llama-server, or do I need to clone and make llama-server again??
raketenkater@reddit (OP)
So llm-server just builds on top of any llama-server binary. And as for ai-tune being safe: it is just your locally hosted AI reading the --help output of your binary and, based on that, tuning the flags of the model currently running.
Craftkorb@reddit
If it fits into VRAM go with vllm, much faster
Theboyscampus@reddit
It sounds like auto OCing on graphic cards lol
b1231227@reddit
Can it export the parameters after ai-tune as a reference? Because I am using another llama.cpp branch that has some functions I need, so I cannot directly jump to the llm-server you developed.
raketenkater@reddit (OP)
which binary to run is pluggable using the --server-bin flag too
raketenkater@reddit (OP)
It saves them as configs so yes
andy2na@reddit
Any easy way to run this in a Docker container? I've tried to run it in unraid and it's not working at all
mister2d@reddit
It's always nice to see optimization on consumer hardware. I've had to do this by hand while keeping up with all the new flags like n-cpu-moe and tensor parallelism.
And since buying a new rig is out of the question, I have to squeeze everything out of my DDR3 box.
raketenkater@reddit (OP)
Exactly same for me
TomHale@reddit
Very cool! With your AI's knowledge and context, could you ask it for a plan on how to do the same but with Lemonade for AMD?
A markdown file on that in your repo would be amazing! 😉
denoflore_ai_guy@reddit
OmG iTs SeLf ImPrOvInG ai!?!?!?! 🤪 but srsly nice stuff.
ketosoy@reddit
Do you have a genetic algo in there or is it pure random testing?
ecompanda@reddit
the multi GPU split is probably doing as much work as the flag tuning honestly. tensor split across a 3090 Ti and two smaller cards is notoriously fussy and most people never get past default even distribution.
curious whether the ai tuning is finding a non obvious tensor split ratio or mostly optimizing batch size and context window flags. because those are two pretty different wins.
27B at 40 tok/s is legitimately fast for a rig like that though.
raketenkater@reddit (OP)
And tensor splits are handled by the deterministic first layer: conservative fill based on measured VRAM, then squeezed until tight.
raketenkater@reddit (OP)
The context-window flags and so on can be set by the user and are treated as fixed by ai-tune
qwen_next_gguf_when@reddit
What is your ik llamacpp cmake command?
raketenkater@reddit (OP)
just