Added myself as a baseline to my LLM benchmark
Posted by Interesting_Fly_6576@reddit | LocalLLaMA | 4 comments
Running a pipeline to classify WST problems in ~590K Uzbek farmer messages: 19 categories, sourced from Telegram, government news, and focus groups, in a mix of Uzbek and Russian.
Built a 100-text benchmark with 6 models, then decided to annotate it myself blind. 58 minutes, 100 texts done.
Result: my F1 = 76.9% with Sonnet's labels as ground truth. Basically the same as Kimi K2.5.
Then flipped it — used my labels as ground truth instead of Sonnet's. Turns out Sonnet was too conservative and missed ~22% of real problems. Against my annotations:
- Qwen 3.5-27B AWQ 4-bit (local): F1 = 86.1%
- Kimi K2.5: F1 = 87.9%
- Gemma 4 26B AWQ 4-bit (local): F1 = 70.2%
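A minimal sketch of the flip (hypothetical 10-text labels, binary problem/no-problem for brevity; the real benchmark has 19 categories): the same model predictions score very differently depending on whose labels you treat as ground truth.

```python
# Hypothetical toy labels: 1 = "real problem", 0 = "no problem".
human  = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]  # blind human annotations
sonnet = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # conservative: misses two problems
qwen   = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]  # model under evaluation

def f1(y_true, y_pred):
    """Binary F1 from scratch (no sklearn dependency)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Same predictions, two ground truths: Qwen's "false positives" under
# Sonnet's labels become true positives under the human labels.
print(f"vs Sonnet: {f1(sonnet, qwen):.3f}")  # penalized for flagging real problems
print(f"vs human:  {f1(human, qwen):.3f}")   # substantially higher
```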
Setup: RTX 5090, 32GB VRAM. Qwen runs at ~50 tok/s per request; the median text is 87 tokens, so ~1.8s/text. Aggregate throughput is ~200-330 tok/s at c=16-32.
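A back-of-envelope check of those numbers (treating the 87-token median as the per-text token budget, as the post does):

```python
median_tokens = 87            # median text length, from the post
per_request_tps = 50          # single-request decode speed, tok/s
aggregate_tps = (200, 330)    # batched throughput at concurrency 16-32
n_texts = 590_000             # corpus size

latency = median_tokens / per_request_tps  # per-text latency
hours = [n_texts * median_tokens / tps / 3600 for tps in aggregate_tps]
print(f"{latency:.2f} s/text")                       # matches the ~1.8s figure
print(f"full corpus: {hours[1]:.0f}-{hours[0]:.0f} h at c=16-32")
```

So the whole ~590K corpus fits in roughly two to three days of continuous batched inference on one card.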
Gemma 4 26B on vLLM was too slow for production (most probably a Triton problem), so I ended up using OpenRouter for it, and cloud APIs for Kimi/Gemini/GPT.
The ensemble (Qwen screens → Gemma verifies → Kimi tiebreaks) runs 63% locally and hits F1 = 88.2%, edging out Kimi K2.5 alone (87.9%) with zero API cost for most of it.
Good enough. New local models are impressive!
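The screen → verify → tiebreak cascade above can be sketched roughly like this (the `classify_*` callables are hypothetical stand-ins for the actual model calls; the real pipeline presumably batches requests and uses the full 19-category label set):

```python
# Hypothetical sketch of the screen -> verify -> tiebreak cascade.
# Each classify_* callable takes a text and returns a category label.

def ensemble(text, classify_qwen, classify_gemma, classify_kimi):
    """Local-first cascade: only disagreements escalate to the cloud tiebreaker."""
    first = classify_qwen(text)    # local screen (cheap, runs on every text)
    second = classify_gemma(text)  # verification pass
    if first == second:
        return first               # agreement: no tiebreak call needed
    return classify_kimi(text)     # cloud tiebreak on disagreements only

# Toy usage: the two stub models agree, so the Kimi stub is never called.
tiebreak_calls = []
label = ensemble(
    "Suv yetishmayapti",  # "water is scarce"
    lambda t: "irrigation",
    lambda t: "irrigation",
    lambda t: tiebreak_calls.append(t) or "irrigation",
)
print(label, len(tiebreak_calls))
```

The 63%-local figure then falls out of how often the first two models agree; only the disagreeing minority ever costs an API call.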
Miserable-Dare5090@reddit
I would love for someone to have a way to do this with email. I annotate 100 messages as important, not important, work, shopping, spam, etc., and then use that as ground truth.
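The same recipe transfers directly. A minimal sketch of the annotation step (everything here is hypothetical: the `annotate` helper, the label set, and the CSV format are made up, and `labeler` would be `input` in a real terminal session):

```python
import csv
import random

LABELS = ["important", "not_important", "work", "shopping", "spam"]

def annotate(messages, labeler, n=100, out="ground_truth.csv", seed=0):
    """Sample n messages, label each one, and save a ground-truth CSV."""
    random.seed(seed)
    sample = random.sample(messages, min(n, len(messages)))
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for msg in sample:
            choice = labeler(msg)  # e.g. input(f"{msg[:120]}\nlabel: ") by hand
            writer.writerow([msg, choice if choice in LABELS else "unknown"])
    return out

# Stub labeler for illustration; a real session would prompt a human.
path = annotate(
    ["Meeting at 3pm", "50% off shoes!", "Win a free iPhone"],
    labeler=lambda msg: "spam" if "free" in msg.lower() else "work",
    n=3,
)
```

The resulting CSV is then the fixed ground truth to score each model (or prompt variant) against, exactly as in the post.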
Interesting_Fly_6576@reddit (OP)
I also tried fine-tuning models using Opus/Sonnet labels as ground truth, but I did not get significantly better results. A good model + good prompt + good context is usually more than enough to get solid results.
draconisx4@reddit
Solid move annotating blind to check your own biases; that's crucial for reliable AI control. How do you plan to scale this without letting dataset noise creep in and mess with governance?
Interesting_Fly_6576@reddit (OP)
Right — if model agreement is above some confidence threshold, the predicted distribution converges to the true one regardless of individual label noise. That's the actual validation the benchmark provides, not per-text accuracy. We are interested in the distribution, not in classifying any single message precisely.
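A toy simulation of that claim (hypothetical numbers, not from the actual pipeline): when per-text errors are balanced, i.e. false positives roughly cancel false negatives, the aggregate prevalence estimate stays close to the truth even with a 15% miss rate on individual texts.

```python
import random

random.seed(42)
true_p = 0.30                      # true share of "problem" texts
n = 100_000
truth = [random.random() < true_p for _ in range(n)]

fnr = 0.15                         # miss rate on real problems (hypothetical)
# Choose the FP rate so expected FP count matches expected FN count:
fpr = true_p * fnr / (1 - true_p)  # ~0.064

pred = [(t and random.random() > fnr) or (not t and random.random() < fpr)
        for t in truth]

# Per-text labels are noisy, but the aggregate distribution barely moves.
pred_p = sum(pred) / n
print(f"true {true_p:.3f}  predicted {pred_p:.3f}")
```

With unbalanced errors (like Sonnet's conservative bias) the estimate drifts instead, which is exactly why the blind human pass was worth 58 minutes.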