Claude Code vs Codex: 36 files vs 28, an infinite loop, and a $0.46 difference. Guess which one needed a patch.
Posted by Straight_Stomach812@reddit | LocalLLaMA | 4 comments

Been meaning to do this for a while. Sick of seeing benchmark screenshots so I just… built stuff.
Two tasks. Same prompt to both agents. Same MCP setup (GitHub + Slack via a dashboard). No hints. No extra help.
Task 1: PR triage bot -> read open PRs, score them by complexity, write a report, ping Slack for the high-priority ones. Retry logic, error logging, strict TS, no "any". The kind of thing you'd actually run on a cron job.
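For anyone skimming the spec, "retry logic, error logging, no any" means roughly the shape below. This is my own sketch of what the task asks for (`withRetry` and the `Logger` type are invented names), not code from either agent's run:

```ts
// Illustrative retry + logging wrapper in strict TS, no "any".
// Names and the linear backoff are assumptions, not either agent's output.
type Logger = { error: (msg: string, meta?: Record<string, unknown>) => void };

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts: number,
  logger: Logger,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      logger.error(`attempt ${i}/${attempts} failed`, { err });
      // simple linear backoff before the next attempt
      await new Promise((resolve) => setTimeout(resolve, 500 * i));
    }
  }
  throw lastError;
}
```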
Task 2: Real-time code review UI -> React, WebSockets, inline comment threads, optimistic updates that roll back on failure, a virtualized diff viewer, WS reconnect with backoff. No UI libraries. Build everything from scratch.
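For context, the reconnect-with-backoff requirement in its most minimal form looks roughly like this (browser WebSocket API; the function name and delay numbers are mine for illustration, not from either run):

```ts
// Minimal reconnect-with-exponential-backoff sketch for Task 2.
// connectReviewSocket and the delay constants are assumptions, not agent code.
function connectReviewSocket(url: string, onMessage: (data: string) => void): void {
  let attempt = 0;

  const open = (): void => {
    const ws = new WebSocket(url);

    ws.onopen = () => {
      attempt = 0; // connected again, reset the backoff counter
    };

    ws.onmessage = (event: MessageEvent) => onMessage(String(event.data));

    ws.onclose = () => {
      // exponential backoff, capped at 30s
      const delay = Math.min(30_000, 1_000 * 2 ** attempt);
      attempt += 1;
      setTimeout(open, delay);
    };
  };

  open();
}
```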
Here's what happened:
Claude Code verified its MCP tools were live before writing a line. Then built 36 files in 12 minutes and wrote a two-client WebSocket smoke test I didn't ask for. Broadcast latency: 3ms. Zero "any". Passed typecheck first try.
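For anyone wondering what a two-client smoke test even measures: roughly this. It's my reconstruction of the shape (Node plus the `ws` package, local server URL assumed), not Claude's actual file:

```ts
// Two-client broadcast latency check: client A sends a timestamp,
// client B receives the broadcast and prints the elapsed time.
import WebSocket from "ws";

const URL = "ws://localhost:8080"; // assumed local dev server

const sender = new WebSocket(URL);
const receiver = new WebSocket(URL);

receiver.on("message", (data) => {
  const sentAt = Number(data.toString());
  console.log(`broadcast latency: ${Date.now() - sentAt} ms`);
  sender.close();
  receiver.close();
});

// wait until both clients are connected before sending the timestamp
Promise.all([
  new Promise<void>((resolve) => sender.on("open", () => resolve())),
  new Promise<void>((resolve) => receiver.on("open", () => resolve())),
]).then(() => sender.send(String(Date.now())));
```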
Codex via Cursor couldn't access the GitHub MCP on Task 1: the execution path in Cursor didn't expose the tool descriptors, so it got a "tool not found" error after 3 retries. But it handled the failure cleanly: logged everything, didn't crash, didn't invent output. Environment issue, not model quality. On Task 2 it shipped a working UI in ~15 min and the smoke test passed at 5ms, but it hit some TypeScript errors on first compile and an infinite React loop (`useEffect` calling `hydrate` repeatedly) that needed a ref guard patch.
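For the curious, a ref guard is the standard fix for that class of loop: `hydrate` updates state, the effect re-fires on every render, so you gate the call behind a ref. Sketch below; the hook name and shape are my guess from the symptom, not Codex's exact patch:

```ts
// Guarded hydrate hook: the ref stops repeated hydrate() calls when the effect re-fires.
// useHydrateOnce is a hypothetical name; only the pattern is the point.
import { useEffect, useRef } from "react";

function useHydrateOnce(hydrate: () => void): void {
  const hasHydrated = useRef(false);

  useEffect(() => {
    if (hasHydrated.current) return; // already hydrated once, bail out
    hasHydrated.current = true;
    hydrate();
  }, [hydrate]);
}
```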
Estimated API cost across both tasks: Claude ~$2.50, Codex ~$2.04. Not a huge gap, but Claude was about 23% more expensive for more granular architecture and a first-run clean UI.
Actual learning: they're not really competing for the same use case. Claude Code feels like pairing with someone who reads the docs before touching a keyboard. Codex feels like a senior dev who just wants to ship the thing and move on. Both valid depending on what you're building.
The thing that actually surprised me: neither leaked "any", neither hallucinated a tool name, and both got WebSocket broadcast under 10ms. Six months ago that wasn't a given. The baseline has genuinely moved.
I have all the code, run logs, and cost breakdowns if anyone wants to dig in or poke holes in the methodology. Will drop the link in comments.
ttkciar@reddit
This is off-topic for LocalLLaMA.
fgp121@reddit
Interesting that Claude Code verified MCP tools before writing. I wonder how much of that upfront validation is baked in versus dynamic. For pipeline debugging, I've been using Neo to plan and run experiments on agent workflows - it's helped catch issues like the WebSocket infinite loop you mentioned before they hit production.
Ha_Deal_5079@reddit
the mcp tool stuff told me more than the benchmarks ngl. claude verifying tools before starting vs cursor not even exposing them says a lot about their design philosophy. there's a project called skillsgate on github that tracks agent config diffs if you're juggling both
EveningIncrease7579@reddit
Guess which one is local and open source