Claude Code vs Codex: 36 files vs 28, an infinite loop, and a $0.46 difference. Guess which one needed a patch.
Posted by Straight_Stomach812@reddit | LocalLLaMA | 4 comments

Been meaning to do this for a while. Sick of seeing benchmark screenshots so I just… built stuff.
Two tasks. Same prompt to both agents. Same MCP setup (GitHub + Slack via a dashboard). No hints. No extra help.
Task 1: PR triage bot -> read open PRs, score them by complexity, write a report, ping Slack for the high-priority ones. Retry logic, error logging, strict TS, no "any". The kind of thing you'd actually run on a cron job.
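For anyone skimming the spec, "retry logic, error logging, no any" means roughly the shape below. This is my own sketch of what the task asks for (`withRetry` and the `Logger` type are invented names), not code from either agent's run:

```ts
// Illustrative retry + logging wrapper in strict TS, no "any".
// Names and the linear backoff are assumptions, not either agent's output.
type Logger = { error: (msg: string, meta?: Record<string, unknown>) => void };

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts: number,
  logger: Logger,
): Promise<T> {
  let lastError: unknown;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      logger.error(`attempt ${i}/${attempts} failed`, { err });
      // simple linear backoff before the next attempt
      await new Promise((resolve) => setTimeout(resolve, 500 * i));
    }
  }
  throw lastError;
}
```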
Task 2: Real-time code review UI -> React, WebSockets, inline comment threads, optimistic updates that roll back on failure, a virtualized diff viewer, WS reconnect with backoff. No UI libraries. Build everything from scratch.
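For context, the reconnect-with-backoff requirement in its most minimal form looks roughly like this (browser WebSocket API; the function name and delay numbers are mine for illustration, not from either run):

```ts
// Minimal reconnect-with-exponential-backoff sketch for Task 2.
// connectReviewSocket and the delay constants are assumptions, not agent code.
function connectReviewSocket(url: string, onMessage: (data: string) => void): void {
  let attempt = 0;

  const open = (): void => {
    const ws = new WebSocket(url);

    ws.onopen = () => {
      attempt = 0; // connected again, reset the backoff counter
    };

    ws.onmessage = (event: MessageEvent) => onMessage(String(event.data));

    ws.onclose = () => {
      // exponential backoff, capped at 30s
      const delay = Math.min(30_000, 1_000 * 2 ** attempt);
      attempt += 1;
      setTimeout(open, delay);
    };
  };

  open();
}
```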
Here's what happened:
Claude Code verified its MCP tools were live before writing a line. Then built 36 files in 12 minutes and wrote a two-client WebSocket smoke test I didn't ask for. Broadcast latency: 3ms. Zero "any". Passed typecheck first try.
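For anyone wondering what a two-client smoke test even measures: roughly this. It's my reconstruction of the shape (Node plus the `ws` package, local server URL assumed), not Claude's actual file:

```ts
// Two-client broadcast latency check: client A sends a timestamp,
// client B receives the broadcast and prints the elapsed time.
import WebSocket from "ws";

const URL = "ws://localhost:8080"; // assumed local dev server

const sender = new WebSocket(URL);
const receiver = new WebSocket(URL);

receiver.on("message", (data) => {
  const sentAt = Number(data.toString());
  console.log(`broadcast latency: ${Date.now() - sentAt} ms`);
  sender.close();
  receiver.close();
});

// wait until both clients are connected before sending the timestamp
Promise.all([
  new Promise<void>((resolve) => sender.on("open", () => resolve())),
  new Promise<void>((resolve) => receiver.on("open", () => resolve())),
]).then(() => sender.send(String(Date.now())));
```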
Codex via Cursor couldn't access the GitHub MCP on Task 1: the execution path in Cursor didn't expose the tool descriptors, so it got a "tool not found" error after 3 retries. But it handled the failure cleanly: logged everything, didn't crash, didn't invent output. Environment issue, not model quality. On Task 2 it shipped a working UI in ~15 min and the smoke test passed at 5ms, but it hit some TypeScript errors on first compile and an infinite React loop (`useEffect` calling `hydrate` repeatedly) that needed a ref guard patch.
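For the curious, a ref guard is the standard fix for that class of loop: `hydrate` updates state, the effect re-fires on every render, so you gate the call behind a ref. Sketch below; the hook name and shape are my guess from the symptom, not Codex's exact patch:

```ts
// Guarded hydrate hook: the ref stops repeated hydrate() calls when the effect re-fires.
// useHydrateOnce is a hypothetical name; only the pattern is the point.
import { useEffect, useRef } from "react";

function useHydrateOnce(hydrate: () => void): void {
  const hasHydrated = useRef(false);

  useEffect(() => {
    if (hasHydrated.current) return; // already hydrated once, bail out
    hasHydrated.current = true;
    hydrate();
  }, [hydrate]);
}
```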
Estimated API cost across both tasks: Claude ~$2.50, Codex ~$2.04. Not a huge gap, but Claude was about 23% more expensive for more granular architecture and a first-run clean UI.
Actual learning: they're not really competing for the same use case. Claude Code feels like pairing with someone who reads the docs before touching a keyboard. Codex feels like a senior dev who just wants to ship the thing and move on. Both valid depending on what you're building.
The thing that actually surprised me: neither leaked "any", neither hallucinated a tool name, and both got WebSocket broadcast under 10ms. Six months ago that wasn't a given. The baseline has genuinely moved.
I have all the code, run logs, and cost breakdowns if anyone wants to dig in or poke holes in the methodology. Will drop the link in comments.
ttkciar@reddit
This is off-topic for LocalLLaMA.
fgp121@reddit
Interesting that Claude Code verified MCP tools before writing. I wonder how much of that upfront validation is baked in versus dynamic. For pipeline debugging, I've been using Neo to plan and run experiments on agent workflows - it's helped catch issues like the WebSocket infinite loop you mentioned before they hit production.
Ha_Deal_5079@reddit
the mcp tool stuff told me more than the benchmarks ngl. claude verifying tools before starting vs cursor not even exposing them says a lot about their design philosophy. there's a project called skillsgate on github that tracks agent config diffs if you're juggling both
EveningIncrease7579@reddit
Guess which one is local and open source