Benchmarking Local LLM/Harness Combinations

Posted by pminervini@reddit | LocalLLaMA | 8 comments

Hi, I'm trying to find the best local model/harness combinations for agentic coding tasks involving PyTorch, JAX, Transformers, etc., so I put together a small benchmark, kept private to avoid contamination. Let me know if there's anything you'd like to see!

MuDotGen@reddit

I really like Pi.dev. It's so lightweight that it actually works with smaller LLMs and hardware, and it's highly customizable.

pminervini@reddit (OP)

It does great on this benchmark!

Eyelbee@reddit

What about cline/roo code?

pminervini@reddit (OP)

Right, I completely forgot they existed; adding them and starting the sweeps

StorageHungry8380@reddit

Perhaps you mentioned it, but did you check for randomness? That is, did you run a couple of the combinations multiple times to see how often they pass? I find it quite surprising that Q8 results in a net regression.

pminervini@reddit (OP)

Not yet, it's in the pipeline! Right now I'm looking into crystallising the benchmark and the set of models/harnesses.

StorageHungry8380@reddit

I can understand you not wanting to do that for all combos, but I think it's important to do for a few, just to get a handle on the spread. Perhaps pick one harness, a couple of models and one hard and one easy task, then do at least 5 runs each. At least when using them casually, I sometimes get very different outputs from the same prompt.
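To make the "one harness, a couple of models, one easy and one hard task, 5 runs each" idea concrete, here is a rough Python sketch. It only plans the run grid and summarises pass/fail outcomes with a Wilson score interval, which is one common way to express spread for a small number of binary trials. The names `plan_runs`, `summarise`, `harness-a`, `model-a` and so on are placeholders invented for illustration, not anything from the actual benchmark, and the outcomes in the usage note are made up.

```python
import math
from itertools import product


def plan_runs(harness, models, tasks, n_runs=5):
    """Enumerate (harness, model, task, seed) tuples for a repeated-run sweep."""
    return [(harness, m, t, seed) for m, t, seed in product(models, tasks, range(n_runs))]


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% coverage)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def summarise(outcomes):
    """Map each (model, task) key to (pass_rate, lower, upper).

    `outcomes` maps (model, task) to a list of booleans, one per run.
    """
    summary = {}
    for key, results in outcomes.items():
        passes = sum(results)
        lower, upper = wilson_interval(passes, len(results))
        summary[key] = (passes / len(results), lower, upper)
    return summary
```

For example, `plan_runs("harness-a", ["model-a", "model-b"], ["easy", "hard"])` gives 20 runs in total. If one model/task pair passed 3 of 5 runs, `wilson_interval(3, 5)` comes out at roughly 0.23 to 0.88, which shows how wide the uncertainty still is at 5 runs and why a single pass/fail per combination can be misleading.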

Anyway, interesting to see, I was considering doing something similar, but more open-ended, ie make them plan a task and then implement, selecting recommended choices for questions. Then use a couple of frontier models to grade the work.

pminervini@reddit (OP)

> I can understand you not wanting to do that for all combos, but I think it's important to do for a few, just to get a handle on the spread.

Totally agree -- it's mainly that I'm trying to run everything on Apple silicon to stress the "local LLM" component, and it may take a bit to do ~3 seeds, but will 100% do it.

> Anyway, interesting to see, I was considering doing something similar, but more open-ended, ie make them plan a task and then implement, selecting recommended choices for questions.

I hope to share something soon-ish on that front! 🙂