We’ve been running nightly benchmarks: WozCode vs Claude Code (same model, same tasks)
Posted by ChampionshipNo2815@reddit | LocalLLaMA | 4 comments
We’ve been running nightly CI benchmarks comparing our coding agent (WozCode) against Claude Code.
All runs use the same model (Opus), identical prompts, and the same repositories. The only variable is how the agent executes.
Across multiple tasks (portfolio updates, todo app features, multi-file styling changes), the output quality is largely equivalent. Both agents produce functionally correct results with similar code changes.
However, the execution patterns differ significantly.
Claude follows a structured Read → Edit workflow, typically reading multiple files before making incremental changes. This often results in a high number of tool calls, especially for multi-file or repetitive updates.
WozCode, by contrast, batches edits aggressively. It frequently skips pre-reads when context is already sufficient and consolidates multi-file or multi-hunk changes into a single operation. It also handles obvious follow-up steps within the same pass instead of waiting for additional prompts.
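To make the difference concrete, here's a toy sketch (not either agent's actual internals, and the file names are made up) counting tool calls for a per-file Read → Edit loop versus a single consolidated edit:

```python
# Illustrative only: tool-call counts for two execution strategies.
files = [f"src/component_{i}.tsx" for i in range(10)]  # hypothetical repo files

def read_then_edit(paths):
    """Structured workflow: one Read call plus one Edit call per file."""
    return sum(2 for _ in paths)

def batched_edit(paths):
    """Batched workflow: skip pre-reads, apply every hunk in one operation."""
    return 1

print(read_then_edit(files))  # 20 tool calls for 10 files
print(batched_edit(files))    # 1 tool call
```

The gap grows linearly with file count, which is why repetitive multi-file updates show the largest divergence.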
A representative example (color scheme update across files):
- Claude Code: 44 tool calls (~$2.42)
- WozCode: 3 turns (~$0.22)
Both produced comparable results in the repository.
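Back-of-envelope math on the numbers above:

```python
# Relative cost on this task, from the figures reported above.
claude_cost = 2.42   # ~$ for 44 tool calls
wozcode_cost = 0.22  # ~$ for 3 turns

savings = 1 - wozcode_cost / claude_cost
print(f"{savings:.0%}")  # 91% -- roughly an order-of-magnitude reduction
```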
This pattern is consistent across all evaluated prompts:
- WozCode uses fewer tool calls
- Lower total cost
- Faster completion time
One area where WozCode still needs improvement is initial file discovery. Early-stage searches sometimes use incorrect or overly broad glob patterns, producing a failed search before the agent recovers. It also occasionally re-searches for files it already modified in earlier steps.
These appear to be heuristic issues rather than model limitations.
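A hypothetical illustration of that discovery failure mode (the paths and patterns here are invented for the example): a first glob scoped to the wrong directory matches nothing, and a retry with a recursive pattern recovers.

```python
# Sketch of the glob-pattern failure-then-recovery pattern described above.
import glob
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "src"))
open(os.path.join(root, "src", "theme.css"), "w").close()

# Wrong first attempt: only looks for .css files at the repo root.
first_try = glob.glob(os.path.join(root, "*.css"))
print(first_try)  # [] -- failed search

# Recovery: recursive pattern that walks subdirectories.
second_try = glob.glob(os.path.join(root, "**", "*.css"), recursive=True)
print(len(second_try))  # 1
```

Each failed attempt like this costs an extra tool call, which is why tightening the search heuristics matters even when the model itself is fine.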
Overall, the results suggest that, with comparable model capability, execution strategy plays a significant role in performance. Agents optimized for batching and forward execution can achieve similar outcomes with substantially lower overhead.
We plan to continue running these benchmarks and refining the system.
Curious if others working with coding agents are observing similar differences in execution patterns.
ttkciar@reddit
This is off-topic for LocalLLaMA. Consider posting instead to r/LLMDevs or r/ClaudeAI.
ChampionshipNo2815@reddit (OP)
sure what’s the best channel to post??
ttkciar@reddit
Try the two I literally recommended in my previous comment.
ChampionshipNo2815@reddit (OP)
it showed me now it didn’t before. thanks a lot