Benchmarking Local LLM/Harness Combinations
Posted by pminervini@reddit | LocalLLaMA | 8 comments
Hi, I'm trying to find the best local model/harness combinations for agentic coding tasks involving PyTorch, JAX, Transformers, etc., and I ended up building a small private benchmark (private to avoid contamination). Let me know if there's anything you'd like to see!
MuDotGen@reddit
I really like Pi.dev. It's so lightweight it actually works with smaller LLMs and hardware, and it's highly customizable.
pminervini@reddit (OP)
It does great on this benchmark!
Eyelbee@reddit
What about cline/roo code?
pminervini@reddit (OP)
Right, I completely forgot they existed; adding them and starting the sweeps
StorageHungry8380@reddit
Perhaps you mentioned it, but did you check for randomness? That is, run a couple of the combinations multiple times to see how often they pass? I find it quite surprising that Q8 results in a net regression.
pminervini@reddit (OP)
not yet, it's in the pipeline! right now I'm looking into crystallising the benchmark and set of models/harnesses
StorageHungry8380@reddit
I can understand you not wanting to do that for all combos, but I think it's important to do for a few, just to get a handle on the spread. Perhaps pick one harness, a couple of models and one hard and one easy task, then do at least 5 runs each. At least when using them casually, I sometimes get very different outputs from the same prompt.
Anyway, interesting to see, I was considering doing something similar, but more open-ended, ie make them plan a task and then implement, selecting recommended choices for questions. Then use a couple of frontier models to grade the work.
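For what it's worth, the repeated-run check could be summarised with something like the sketch below. It assumes each run is reduced to a pass/fail flag per (model, task) pair, and it reports the pass rate with a 95% Wilson interval so the spread at 5 runs is visible. The function names, the tuple layout, and the model/task labels are all made up for illustration; nothing here comes from the actual benchmark.

```python
import math
from collections import defaultdict


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def summarize(results):
    """results: iterable of (model, task, passed) tuples, one tuple per run."""
    counts = defaultdict(lambda: [0, 0])  # (model, task) -> [passes, runs]
    for model, task, passed in results:
        counts[(model, task)][0] += int(passed)
        counts[(model, task)][1] += 1
    summary = {}
    for key, (passes, runs) in counts.items():
        summary[key] = {
            "runs": runs,
            "pass_rate": passes / runs,
            "ci95": wilson_interval(passes, runs),
        }
    return summary


if __name__ == "__main__":
    # Placeholder outcomes only, NOT measured results.
    fake_runs = [
        ("model-a", "easy-task", True),
        ("model-a", "easy-task", True),
        ("model-a", "easy-task", False),
        ("model-a", "easy-task", True),
        ("model-a", "easy-task", False),
    ]
    for key, stats in summarize(fake_runs).items():
        print(key, stats)
```

With only 5 runs the interval is wide (3 passes out of 5 is compatible with anything from roughly a quarter to almost nine in ten), which is a decent argument for checking the spread before reading much into a single-run difference between quantisations.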
pminervini@reddit (OP)
> I can understand you not wanting to do that for all combos, but I think it's important to do for a few, just to get a handle on the spread.
Totally agree -- it's mainly that I'm trying to run everything on Apple silicon to stress the "local LLM" component, and it may take a bit to do ~3 seeds, but will 100% do it.
> Anyway, interesting to see, I was considering doing something similar, but more open-ended, ie make them plan a task and then implement, selecting recommended choices for questions.
I hope to share something soon-ish on that front! 🙂