Benchmark · May 20, 2026 · Last updated 2026-05-21 · 17 min read
Codex vs Claude Code: Real-Repo Benchmark

Questions this page answers
- How should I benchmark Codex vs Claude Code on a real repository?
- Which metrics separate coding agents beyond subjective vibe checks?
- What tasks should a Codex vs Claude Code benchmark include?
- Why does the host machine matter for a coding-agent benchmark?
Quick Answer: Benchmark The Workflow, Not The Vibes
Codex vs Claude Code comparisons go wrong when the test is one prompt, one toy repo, and one subjective verdict. Strong coding agents fail in different places: one may plan better, one may edit faster, one may recover from test failures better, and one may produce a cleaner review narrative. Your benchmark should make those differences visible.
- Use the same real repository for both agents.
- Pin the task prompt, branch name, dependency state, and timebox.
- Force both agents to run the same install, lint, test, and browser checks.
- Score the diff, not the chat transcript.
- Track wall-clock time, tokens or subscription usage, retries, and human intervention.
- Run the benchmark on a persistent host so caches, browser state, logs, and task artifacts survive between runs.
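One way to hold both agents to the same checks is to run a fixed command list yourself and log exit codes and wall-clock time. The sketch below is a minimal illustration, not a prescribed harness: `CHECKS` holds placeholder commands, and `run_checks` is a hypothetical helper name. Substitute the repository's real install, lint, and test commands.

```python
import subprocess
import sys
import time
from pathlib import Path

# Placeholder checks (assumption): replace with the repo's real install,
# lint, and test commands. The same list must be used for both agents.
CHECKS = [
    [sys.executable, "-c", "print('lint ok')"],
    [sys.executable, "-c", "raise SystemExit(1)"],
]


def run_checks(checks, log_path):
    """Run each command in order, append its output to the log, return timing rows."""
    rows = []
    with Path(log_path).open("a", encoding="utf-8") as log:
        for argv in checks:
            started = time.perf_counter()
            proc = subprocess.run(argv, capture_output=True, text=True)
            seconds = time.perf_counter() - started
            log.write(f"$ {' '.join(argv)}\n{proc.stdout}{proc.stderr}")
            log.write(f"[exit {proc.returncode}, {seconds:.2f}s]\n")
            rows.append({"argv": argv, "exit_code": proc.returncode, "seconds": seconds})
    return rows
```

Calling `run_checks(CHECKS, "commands.log")` after each agent finishes gives you comparable exit codes and timings that do not depend on what the agent claimed in chat.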
The Real-Repo Benchmark Design
Start with a repo that already has some pain: real dependencies, flaky setup, old files, incomplete tests, browser UI, and at least one architectural decision that cannot be solved by search and replace. Then create a task set that exercises different muscles.
| Task | What it tests | Good output |
|---|---|---|
| Bug fix with failing test | Can the agent localize a defect, write the smallest patch, and prove the regression stays fixed? | One focused diff, a failing-then-passing test, and a plain explanation of the bug. |
| Feature with UI state | Can the agent coordinate code, browser checks, screenshots, and product judgment? | Working UI, responsive layout, clean copy, and evidence from a local browser run. |
| Refactor under tests | Can the agent reduce complexity without changing behavior? | Smaller or clearer code, unchanged public behavior, and a test run that covers the touched paths. |
| Dependency or SDK migration | Can the agent read docs, update call sites, and handle incompatible types? | Migration notes, compatible imports, updated tests, and no broad unrelated churn. |
| Dirty repo handoff | Can the agent work around existing changes without overwriting them? | A branch that preserves unrelated work, states assumptions, and isolates its own patch. |
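A task set like the one in the table can be pinned in code so that both agents provably receive the same prompt, starting point, and timebox. This is a sketch under assumptions: the `TaskSpec` fields, the example prompts, the `benchmark-baseline` tag name, and the timebox values are all illustrative.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """One pinned benchmark task. Field names are illustrative."""
    name: str
    prompt: str
    baseline_ref: str      # tag or commit that both agents start from
    timebox_minutes: int

    def fingerprint(self) -> str:
        """Short hash of the pinned fields, stored with every run of this task."""
        payload = "\n".join(
            [self.name, self.prompt, self.baseline_ref, str(self.timebox_minutes)]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical task set: write real prompts against your own repository.
TASKS = [
    TaskSpec("bugfix", "Make the failing regression test pass with the smallest patch.",
             "benchmark-baseline", 45),
    TaskSpec("refactor", "Reduce complexity in the target module without changing behavior.",
             "benchmark-baseline", 60),
]
```

Recording `fingerprint()` in each run folder makes it easy to spot a run where the prompt or timebox drifted between agents.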
Keep The Environment Fixed
A benchmark is useless if one agent gets a warm repo and the other gets a broken checkout. Use the same host class, same branch reset point, same package manager cache policy, same browser profile, same terminal permissions, and same review rules.
- Create one clean baseline branch and tag it before the first run.
- Record CPU, memory, disk, OS version, Node/Bun/Python versions, and package-manager versions.
- Install dependencies once, then decide whether both agents get warm caches or both get cold caches.
- Create one browser profile for UI tests and reset it between runs when auth state matters.
- Use the same secrets policy: no production credentials, no payment flows, no account-admin actions.
- Save every prompt, terminal command, test result, screenshot, and final diff in a run folder.
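The version-recording step in the checklist above can be automated so every run folder gets the same `environment.txt`. The following is a rough sketch using only the Python standard library. The `TOOLS` list and both function names are assumptions, and total memory is left out because the standard library has no portable call for it, so record that by hand or with an OS-specific command.

```python
import os
import platform
import shutil
import subprocess

# Tools to record (assumption): extend with the package managers your repo uses.
TOOLS = ["node", "bun", "python3", "git"]


def tool_version(name):
    """Return the first line of `<tool> --version`, or a marker if it is unavailable."""
    path = shutil.which(name)
    if path is None:
        return "not installed"
    try:
        proc = subprocess.run([path, "--version"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return "version check failed"
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


def environment_report():
    """Collect host facts as `key: value` lines suitable for environment.txt."""
    disk = shutil.disk_usage(".")
    lines = [
        f"os: {platform.platform()}",
        f"machine: {platform.machine()}",
        f"cpu_count: {os.cpu_count()}",
        f"disk_free_gb: {disk.free / 1e9:.1f}",
    ]
    lines += [f"{name}: {tool_version(name)}" for name in TOOLS]
    return "\n".join(lines) + "\n"
```

Write the result of `environment_report()` into each run folder before the agent starts, then diff the two files when a result looks suspicious.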
benchmark-runs/
  2026-05-20-codex-bugfix/
    prompt.txt
    environment.txt
    transcript-notes.md
    commands.log
    test-results.txt
    screenshots/
    final.patch
    scorecard.md
  2026-05-20-claude-code-bugfix/
    ...
The Scorecard That Actually Separates Agents
Do not score only whether the agent finished. Score the review burden it left behind. A coding agent that lands an 80 percent patch with a chaotic diff may be slower than the one that asks two good questions and ships a narrower change.
| Metric | How to score it | Why it matters |
|---|---|---|
| Correctness | 0 to 5: tests pass, acceptance criteria met, no obvious regression. | The agent has to change reality, not just sound confident. |
| Diff discipline | 0 to 5: minimal scope, local style, no drive-by formatting. | Large unrelated diffs create review debt and merge risk. |
| Verification | 0 to 5: install, lint, unit tests, browser checks, screenshots, or explicit blocker notes. | A trustworthy agent proves the work from the same environment that made the change. |
| Recovery | 0 to 5: handles failures, backs out bad attempts, explains unresolved blockers. | Real repos fail. Recovery quality is often more important than first-shot quality. |
| Human load | 0 to 5, higher is better: fewer clarifications, approvals, manual fixes, and review comments. | The best agent lowers operator effort, not only model time. |
| Cost | Measured: subscription usage, API spend, wall time, and reruns. | A cheap first run can become expensive if it takes three corrective passes. |
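The rubric above maps naturally onto a small record type that rejects out-of-range scores. This sketch makes several assumptions: the `Scorecard` class and its field names are hypothetical, every 0 to 5 metric is treated as "higher is better", the total is an unweighted sum, and cost is stored as measured values instead of being folded into the score.

```python
from dataclasses import dataclass

# The five rubric metrics scored 0 to 5. Cost is measured, not scored.
RUBRIC = ("correctness", "diff_discipline", "verification", "recovery", "human_load")


@dataclass
class Scorecard:
    agent: str
    task: str
    scores: dict           # rubric name -> integer from 0 to 5
    wall_minutes: float    # measured cost, kept separate from the score
    reruns: int = 0

    def __post_init__(self):
        # Reject incomplete or out-of-range scorecards at creation time.
        missing = [name for name in RUBRIC if name not in self.scores]
        if missing:
            raise ValueError(f"missing scores: {missing}")
        for name in RUBRIC:
            value = self.scores[name]
            if not isinstance(value, int) or not 0 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 0 to 5, got {value!r}")

    def total(self) -> int:
        """Unweighted sum of the rubric scores (weighting is a choice left to you)."""
        return sum(self.scores[name] for name in RUBRIC)
```

If review time matters more than anything else on your team, a weighted sum that emphasizes diff discipline and human load may fit better than the plain total used here.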
A Repeatable Run Protocol
The protocol should be boring enough that another engineer can repeat it. That also makes any published write-up stronger: readers trust benchmarks when they can see how you controlled for the environment.
For each task:
1. Reset to the baseline branch.
2. Start a fresh run folder.
3. Paste the exact task prompt.
4. Let the agent inspect the repo before editing.
5. Allow edits, commands, and browser checks within the stated policy.
6. Stop at the timebox or when the agent says the work is complete.
7. Save git diff, command log, screenshots, and the agent's final note.
8. Run the human scorecard before reading the other agent's result.
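Steps 1, 2, 3, and 7 of the protocol are mechanical and can be scripted. The sketch below is one possible shape, with hypothetical helper names (`start_run`, `reset_commands`, `save_patch`) and an assumed folder naming scheme of date, agent, and task. Note that `git clean -fd` deletes untracked files, so it belongs only in a checkout dedicated to the benchmark, and that `git diff <ref>` does not include untracked files the agent created.

```python
import subprocess
from pathlib import Path

RUN_FILES = ["prompt.txt", "environment.txt", "transcript-notes.md",
             "commands.log", "test-results.txt", "scorecard.md"]


def start_run(root, date, agent, task, prompt):
    """Create a fresh run folder (step 2) and store the exact prompt (step 3)."""
    run_dir = Path(root) / f"{date}-{agent}-{task}"
    run_dir.mkdir(parents=True, exist_ok=False)  # fail loudly if the run already exists
    (run_dir / "screenshots").mkdir()
    for name in RUN_FILES:
        (run_dir / name).touch()
    (run_dir / "prompt.txt").write_text(prompt, encoding="utf-8")
    return run_dir


def reset_commands(baseline_ref):
    """Git commands for step 1. Destructive: run only in a dedicated benchmark checkout."""
    return [
        ["git", "reset", "--hard", baseline_ref],
        ["git", "clean", "-fd"],
    ]


def save_patch(repo_dir, run_dir, baseline_ref):
    """Write the diff against the baseline to final.patch (step 7)."""
    proc = subprocess.run(["git", "diff", baseline_ref], cwd=repo_dir,
                          capture_output=True, text=True, check=True)
    (Path(run_dir) / "final.patch").write_text(proc.stdout, encoding="utf-8")
```

Returning the reset commands as a list, instead of executing them inside the helper, lets you print and review them before anything destructive runs.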
How To Interpret The Result
| If the scorecard shows | Default choice | Reason |
|---|---|---|
| Lower review burden on complex repo edits | Use that agent for refactors and migrations. | Those tasks are expensive when the diff is noisy. |
| Better UI verification and screenshot discipline | Use that agent for product surfaces and browser-heavy fixes. | The agent is acting more like a frontend engineer, not only a code generator. |
| Better recovery after failed tests | Use that agent for legacy code and dependency work. | Broken setups reward persistence and debugging loops. |
| Lower cost but more human supervision | Use it for queued low-risk tasks. | Batch it behind a strong review process instead of handing it sensitive work. |
| Higher cost but cleaner final PRs | Use it for tasks where senior review time is the bottleneck. | The expensive model can still be cheaper than an expensive review spiral. |
You may end up using both. One agent can draft the migration plan, another can execute the edit, and a third pass can review for regressions. The benchmark tells you where each one deserves the keyboard.
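Turning scorecards into a per-task default can be as simple as comparing totals task by task. The sketch below assumes each result is a `(task, agent, total_score)` tuple. The function name is hypothetical, and the agent labels and numbers in the usage note are placeholders, not measured results.

```python
from collections import defaultdict


def best_agent_per_task(results):
    """Map each task to the agent with the highest total, or "tie" when totals match."""
    by_task = defaultdict(dict)
    for task, agent, total in results:
        by_task[task][agent] = total
    choice = {}
    for task, totals in by_task.items():
        top = max(totals.values())
        leaders = sorted(agent for agent, score in totals.items() if score == top)
        choice[task] = leaders[0] if len(leaders) == 1 else "tie"
    return choice
```

For example, `best_agent_per_task([("bugfix", "agent-a", 18), ("bugfix", "agent-b", 21)])` returns `{"bugfix": "agent-b"}`. A tie is a signal to look at the measured cost columns, since the rubric alone did not separate the agents.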
Where Hyperbox Fits
Hyperbox is not the agent in this comparison. It is the machine the agents run on. That matters when the benchmark needs the same repo, tools, browser sessions, GUI permissions, and logs to stay available while you switch agents or review from another device.
- Run Codex and Claude Code against the same persistent checkout.
- Keep browser and desktop state alive after your laptop closes.
- Capture screenshots and logs from the machine that actually executed the work.
- Use SSH and VNC for terminal-first and desktop-first verification.
- Repeat the benchmark later without rebuilding the entire environment.
Frequently asked questions
Which is better, Codex or Claude Code?
The honest answer depends on the task. Benchmark bug fixes, UI work, refactors, migrations, and dirty-repo handoffs separately, then score correctness, diff quality, verification, recovery, human load, and cost.
What makes a coding-agent benchmark fair?
Use the same repo, baseline branch, prompt, task scope, test commands, browser checks, timebox, host environment, and scoring rubric. Otherwise you are measuring setup differences as much as agent quality.
Why run the benchmark on a persistent Mac?
A persistent Mac keeps dependencies, browser state, GUI permissions, logs, screenshots, and branches available across runs, which makes the benchmark closer to real agent work.