Do you benchmark local models as agents, or only on single prompts?

Posted by sahanpk@reddit | LocalLLaMA

Curious how people test tool use locally. A model can look fine in chat and still fall apart once state, retries, and bad tool results show up.
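
For anyone who wants something concrete to compare against, here is a rough sketch of what an agent-style check could look like, as opposed to a single-prompt eval. Everything in it is made up for illustration: `FlakyTool`, `run_episode`, and `stub_model` are hypothetical names, and `stub_model` is a hard-coded stand-in where a call to a real local model would go. The idea is just to inject bad tool results, keep state across turns, and score the final state rather than one reply.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FlakyTool:
    """Hypothetical key-value tool that fails on its first `fail_first` calls."""
    fail_first: int = 1
    calls: int = 0
    store: dict = field(default_factory=dict)

    def __call__(self, key: str, value: str) -> dict:
        self.calls += 1
        if self.calls <= self.fail_first:
            # Injected bad tool result: the model has to notice and retry.
            return {"ok": False, "error": "timeout"}
        self.store[key] = value
        return {"ok": True}


def run_episode(model: Callable[[list], dict], tool: FlakyTool, max_steps: int = 5) -> dict:
    """Run a multi-turn loop, feeding every tool result back to the model."""
    history = []
    for step in range(1, max_steps + 1):
        action = model(history)
        if action["type"] == "finish":
            return {"finished": True, "steps": step, "tool_calls": tool.calls}
        result = tool(**action["args"])
        history.append({"action": action, "result": result})
    # Step budget exhausted without the model declaring it was done.
    return {"finished": False, "steps": max_steps, "tool_calls": tool.calls}


def stub_model(history: list) -> dict:
    """Stand-in for a real local model: retries until the last tool call succeeded."""
    if history and history[-1]["result"]["ok"]:
        return {"type": "finish"}
    return {"type": "call", "args": {"key": "city", "value": "Paris"}}


tool = FlakyTool(fail_first=2)
report = run_episode(stub_model, tool)
# Score on the final tool state, not on how the last message reads.
passed = report["finished"] and tool.store == {"city": "Paris"}
```

Swapping `stub_model` for a function that prompts an actual model and parses its tool call would turn this into a real test; raising `fail_first` or lowering `max_steps` is one way to see where a model stops recovering.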