The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)

Posted by raketenkater@reddit | LocalLLaMA | View on Reddit | 73 comments

This is V2 of my previous post.

What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.

My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM.

| Model | llama-server (no tuning) | llm-server v1 tuning | llm-server v2 (AI-Tune) |
|---|---|---|---|
| Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s |
| Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s |
| gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s |

What I think is best here: --ai-tune keeps up with llama.cpp / ik_llama.cpp updates automatically, because it feeds the output of llama-server --help into the LLM tuning loop as context. New flags land → the tuner can use them → you get the best performance.
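For anyone curious what a loop like that could look like: here's a minimal sketch, not the actual llm-server code. The `tune`, `CANDIDATES`, and `fake_benchmark` names are hypothetical; in the real tool the candidate flag sets come from prompting the LLM with `llama-server --help` output, and the benchmark step would launch llama-server with those flags and time a fixed prompt.

```python
import json

def tune(candidates, benchmark, cache_path=None):
    """Pick the flag set with the highest tok/s; optionally cache it as JSON."""
    best_flags, best_tps = None, 0.0
    for flags in candidates:
        # Real tool: start llama-server with these flags, measure tok/s on a test prompt.
        tps = benchmark(flags)
        if tps > best_tps:
            best_flags, best_tps = flags, tps
    if cache_path:
        # Cache the fastest config so later runs can skip the search.
        with open(cache_path, "w") as f:
            json.dump({"flags": best_flags, "tok_s": best_tps}, f)
    return best_flags, best_tps

# Mocked proposal + benchmark steps (the flags are real llama.cpp options,
# but the scores below are illustrative, not measured).
CANDIDATES = [
    "--n-gpu-layers 99",
    "--n-gpu-layers 99 --flash-attn",
    "--n-gpu-layers 99 --flash-attn --split-mode row",
]

def fake_benchmark(flags):
    return {
        "--n-gpu-layers 99": 18.5,
        "--n-gpu-layers 99 --flash-attn": 25.9,
        "--n-gpu-layers 99 --flash-attn --split-mode row": 40.0,
    }.get(flags, 0.0)

best, tps = tune(CANDIDATES, fake_benchmark)
```

The nice property is that the candidate list is regenerated from `--help` each run, so the loop itself never has to be updated when llama.cpp grows new flags.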

I think those are some solid gains (max tokens, yeah!), plus more stability and a nice TUI via llm-server-gui.

Check it out: https://github.com/raketenkater/llm-server