Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B
Posted by diptanshu1991@reddit | LocalLLaMA | 19 comments
I have been coming to this subreddit to understand what the optimal config is to run a model on a given hardware setup. I referred to specific benchmarks, but they are too generic and do not consider the underlying hardware. So, I decided to build the tool myself.
Sigilant-sweep is an OSS CLI that runs 16 configs (combinations of quant, KV cache type, and context size) for a specified number of trials. TPS and TTFT are measured in every trial, along with PPL on a fixed 3,300-token mixed-domain corpus. After all the trials, each config gets p50 and p95 values for TPS and TTFT. These are normalised and combined into a final score, a weighted average whose weights depend on the profile you select (balanced, latency, or quality).
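A minimal sketch of how a score like this could be assembled, assuming min-max normalisation across configs and made-up profile weights. The function names and the weights are hypothetical and are not taken from the sigilant-sweep source.

```python
import statistics

# Hypothetical weights per profile: (tps, ttft, ppl). Not the tool's real values.
PROFILES = {
    "balanced": (0.4, 0.3, 0.3),
    "latency": (0.3, 0.6, 0.1),
    "quality": (0.2, 0.1, 0.7),
}

def p50(samples):
    """Median of the per-trial samples."""
    return statistics.median(samples)

def p95(samples):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

def minmax(values, higher_is_better):
    """Scale values to [0, 1] so that 1.0 is always the best config."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

def score(configs, profile="balanced"):
    """configs: dict name -> {"tps": ..., "ttft_ms": ..., "ppl": ...}."""
    w_tps, w_ttft, w_ppl = PROFILES[profile]
    names = list(configs)
    tps = minmax([configs[n]["tps"] for n in names], True)
    ttft = minmax([configs[n]["ttft_ms"] for n in names], False)
    ppl = minmax([configs[n]["ppl"] for n in names], False)
    return {
        n: round(100 * (w_tps * t + w_ttft * f + w_ppl * p), 1)
        for n, t, f, p in zip(names, tps, ttft, ppl)
    }
```

With these assumed weights, a config that is faster but has higher perplexity ranks first under `latency`, while the slower, lower-perplexity config ranks first under `quality`.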
The biggest challenge I faced was getting deterministic results. Initially, every run was showing a different winner. I tried multiple approaches and finally settled on deterministic shuffling through cyclic offset. This fixed the problem, and the results are now stable 9/10 times for a given hardware and backend.
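One way to read "deterministic shuffling through cyclic offset" is rotating the config list by the trial index, so the run order depends only on the trial number and each config visits every position over a full cycle. This is a sketch of that interpretation; `trial_order` is a hypothetical name and the tool's actual scheme may differ.

```python
def trial_order(configs, trial_index):
    """Return configs rotated left by trial_index (no RNG involved)."""
    if not configs:
        return []
    offset = trial_index % len(configs)
    return configs[offset:] + configs[:offset]

def schedule(configs, trials):
    """Run order for every trial; identical on every invocation."""
    return [trial_order(configs, t) for t in range(trials)]
```

Because no random seed is involved, two sweeps on the same machine execute the configs in exactly the same order, which removes ordering effects such as thermal drift or cache state as a source of run-to-run differences in the ranking.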
Results: Qwen2.5-7B (bartowski) · Modal L4 · 16 configs · 15 trials
Config | TPS p95 | TTFT p95 (ms) | PPL | Score
Q4_K_M · ctx:8192 · kv:k16v16 · best | 74.5 | 1856 | 6.02 | 99
Q4_K_M · ctx:16384 · kv:k16v16 | 74.3 | 1869 | 6.02 | 98
Q5_K_M · ctx:8192 · kv:k16v16 | 71.5 | 2010 | 5.86 | 97
Q5_K_M · ctx:16384 · kv:k16v16 | 71.0 | 1950 | 5.86 | 97
Q8_0 · ctx:8192 · kv:k16v16 | 63.8 | 2130 | 5.82 | 92
Best vs Q8_0: TPS +10.7 · TTFT -274ms · PPL +0.20 · Score +7
Worth noting: Q4_K_M ctx:8192 and ctx:16384 are within 1 point of each other on score (99 vs 98). The CLI surfaces this explicitly and flags low confidence when the top-2 gap is within noise, so you know when to run more trials rather than blindly trusting a single winner.
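A rough sketch of what a low-confidence check could look like, assuming per-trial scores are available for each config and that "noise" means roughly two standard errors of the mean. The threshold and the `confidence` function are assumptions, not the CLI's actual rule.

```python
import statistics

def confidence(trial_scores, k=2.0):
    """trial_scores: dict name -> list of per-trial scores.

    Needs at least two configs with at least two trials each.
    """
    means = {name: statistics.mean(s) for name, s in trial_scores.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    gap = means[best] - means[runner_up]
    # Noise estimate: the larger standard error of the mean among the top two.
    noise = max(
        statistics.stdev(trial_scores[n]) / len(trial_scores[n]) ** 0.5
        for n in (best, runner_up)
    )
    return {
        "best": best,
        "runner_up": runner_up,
        "gap": gap,
        "low_confidence": gap <= k * noise,
    }
```

When `low_confidence` comes back true, the sensible next step is more trials, or a re-run restricted to the top two configs.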
There is also a depth profile mode that tests TPS and TTFT at 8k, 14k, and 28k prompt lengths to show which config is optimal as context grows. Perplexity stays on the same fixed corpus across all passes.
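The depth profile boils down to picking a winner separately at each prompt length. A minimal sketch, assuming the results are stored as a nested dict keyed by prompt length and then by config name (this layout and the function name are hypothetical):

```python
def best_per_depth(results, metric="ttft_ms", lower_is_better=True):
    """results: dict prompt_length -> dict config_name -> dict of metrics.

    Returns dict prompt_length -> name of the best config at that length.
    """
    pick = min if lower_is_better else max
    return {
        depth: pick(by_config, key=lambda name: by_config[name][metric])
        for depth, by_config in sorted(results.items())
    }
```

Calling it once with `metric="ttft_ms"` and once with `metric="tps", lower_is_better=False` shows whether the same config wins on both metrics as the prompt grows from 8k to 28k.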
What it measures: TPS, TTFT, ITL, PPL
What it does not measure: Full quality (tool calling, strict JSON validity, etc.). There is a 5-sample smoke test, but it's not used in scoring yet.
Backends: llama.cpp and vLLM
Github: https://github.com/sigilantlabs/sigilant-sweep/
Feedback welcome
bigattichouse@reddit
I've been using Taguchi arrays to run experiments like this - is this a similar idea to your sweeps? github.com/bigattichouse/taguchi
diptanshu1991@reddit (OP)
I have not used this repo, but I went through it. Prima facie, Taguchi arrays are about reducing the experiment count while maintaining statistical coverage. sigilant-sweep runs all 16 configs but handles the full execution loop: it dispatches to llama.cpp or vLLM, collects TPS/TTFT/PPL, then scores and ranks. The interesting angle is using Taguchi to cut the config grid down. Worth exploring as a future mode.
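To make the idea concrete, here is a sketch of cutting a 16-config grid to 8 runs with a strength-2 orthogonal array. It assumes the grid factors as 4 quants x 2 KV cache types x 2 context sizes, which is only inferred from the configs in the result tables, and it is hand-built rather than taken from the taguchi repo or from sigilant-sweep.

```python
from collections import Counter
from itertools import combinations, product

# Assumed factor levels, inferred from the posted tables (not from the tool).
FACTORS = {
    "quant": ["IQ3_M", "Q4_K_M", "Q5_K_M", "Q8_0"],
    "kv": ["k16v16", "k8v8"],
    "ctx": [8192, 16384],
}

def half_grid():
    """8 runs out of 16: every quant sees both kv types and both ctx sizes."""
    runs = []
    for q, quant in enumerate(FACTORS["quant"]):
        flip = q % 2  # alternate which ctx is paired with which kv level
        for k, kv in enumerate(FACTORS["kv"]):
            runs.append({"quant": quant, "kv": kv, "ctx": FACTORS["ctx"][k ^ flip]})
    return runs

def is_balanced(runs):
    """True if, for every pair of factors, each level combination appears equally often."""
    for a, b in combinations(FACTORS, 2):
        counts = Counter((r[a], r[b]) for r in runs)
        expected = len(runs) / (len(FACTORS[a]) * len(FACTORS[b]))
        if any(counts[c] != expected for c in product(FACTORS[a], FACTORS[b])):
            return False
    return True
```

The balance check is what makes main effects estimable from the reduced set: each quant, KV type, and context size is compared under an even mix of the other factors' levels. The trade-off is that interactions between factors (say, a quant that only misbehaves at one context size) are no longer separable.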
bigattichouse@reddit
I'm a home experimenter (both chemistry/battery stuff and LLMs), and created this repo to help organize experiments. I didn't know if you were using an orthogonal array technique in yours - it's been invaluable for my own work. And having the tool really seems to help coding LLMs plan for experimental runs.
diptanshu1991@reddit (OP)
Haven't used Taguchi arrays for sigilant-sweep yet but will definitely give it a try, especially for reducing the config grid while maintaining coverage
bigattichouse@reddit
Depending on param depth, you can even use it to expand your search space - find params (or combos) with strong signals and then expand tests on those.
Septerium@reddit
Wow, that is so relevant (sl)op. Have you tried other SOTA open models, like Gemma 3, Mixtral or even Llama 3?
diptanshu1991@reddit (OP)
Llama 3.1 8B results coming up in the comments. Gemma 3 and Mixtral on the list.
diptanshu1991@reddit (OP)
llama 3.1 8b results:
Config | TPS p95 | TTFT p95 (ms) | PPL | Score
IQ3_M · ctx:8192 · kv:k16v16 · default <- best | 35.9 | 4312.7 | 5.39 | 98
IQ3_M · ctx:16384 · kv:k8v8 | 34.2 | 4512.6 | 5.39 | 94
Q4_K_M · ctx:16384 · kv:k16v16 | 31.1 | 4727.3 | 5.19 | 90
Q5_K_M · ctx:8192 · kv:k16v16 | 28.7 | 5192.6 | 5.11 | 88
Q8_0 · ctx:16384 · kv:k16v16 | 24 | 6119.9 | 5.07 | 80
Best vs Q8_0: TPS +11.3 · TTFT -1690ms · PPL +0.32 · Score +17
sahanpk@reddit
the low-confidence flag is the best part here. people over-trust tiny benchmark gaps way too much.
diptanshu1991@reddit (OP)
exactly, and if the gap is too close, you can either increase the no. of trials or re-run the top 2 configs to see if one pulls ahead
woolcoxm@reddit
more ai bullshit.
hidden2u@reddit
op do llama3 8b next
diptanshu1991@reddit (OP)
yeah, on the list
gnaarw@reddit
Bad bot 😭you were not supposed to answer this :D
diptanshu1991@reddit (OP)
Llama 3.1 8B results on L4, as requested
ChampionshipLimp1749@reddit
yea, would like to see qwen 3.6 results
diptanshu1991@reddit (OP)
Qwen 3.6 - will run it and post results here
gh0stwriter1234@reddit
I mean, are you disabling warmup for llama.cpp? That will influence TTFT measurements. Also, a fresh, un-warmed server is not the same as a server that has already served 3 sessions, etc.
diptanshu1991@reddit (OP)
Currently, each trial in the llama.cpp path is a fresh llama-cli process, so yes, we are measuring cold-start TTFT. We run multiple trials and report medians and p95 to handle variance, but the metric is still cold-start TTFT. We can publish both explicitly: cold TTFT (fresh process) and warm TTFT (persistent server). Added to the roadmap.
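For anyone who wants to compare the two numbers themselves, here is a backend-agnostic sketch. It assumes you can supply a zero-argument callable that sends a request and returns an iterator of streamed tokens; `start_stream` and both function names are hypothetical and not part of sigilant-sweep.

```python
import time

def measure_ttft_ms(start_stream):
    """Time from issuing the request until the first token arrives, in ms."""
    t0 = time.perf_counter()
    stream = start_stream()
    next(iter(stream))  # block until the first token is produced
    return (time.perf_counter() - t0) * 1000.0

def ttft_samples(start_stream, trials, discard=0):
    """Collect TTFT samples, dropping the first `discard` warmup requests.

    Cold: start_stream launches a fresh process per call, discard=0.
    Warm: start_stream hits a persistent server, discard >= 1.
    """
    samples = [measure_ttft_ms(start_stream) for _ in range(trials + discard)]
    return samples[discard:]
```

Reporting both series side by side makes it clear how much of the TTFT is model load and process startup versus actual prompt processing.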