Benchmark · May 20, 2026 · Last updated 2026-05-21

Codex vs Claude Code: Real-Repo Benchmark

The useful question is not whether Codex or Claude Code feels smarter in a demo. It is which agent can take a real repo, make a risky change, run the boring checks, explain the tradeoffs, and leave you with a PR you can actually merge.
[Figure: Two coding-agent workstreams moving through the same real-repo benchmark]
A real benchmark keeps the repo, tests, task prompts, runtime, and review rubric fixed. Only the agent changes.

Questions this page answers

  • How should I benchmark Codex vs Claude Code on a real repository?
  • Which metrics separate coding agents beyond subjective vibe checks?
  • What tasks should a Codex vs Claude Code benchmark include?
  • Why does the host machine matter for a coding-agent benchmark?

Benchmark answer

Quick Answer: Benchmark The Workflow, Not The Vibes

Codex vs Claude Code comparisons go wrong when the test is one prompt, one toy repo, and one subjective verdict. Strong coding agents fail in different places: one may plan better, one may edit faster, one may recover from test failures better, and one may produce a cleaner review narrative. Your benchmark should make those differences visible.

  • Use the same real repository for both agents.
  • Pin the task prompt, branch name, dependency state, and timebox.
  • Force both agents to run the same install, lint, test, and browser checks.
  • Score the diff, not the chat transcript.
  • Track wall-clock time, tokens or subscription usage, retries, and human intervention.
  • Run the benchmark on a persistent host so caches, browser state, logs, and task artifacts survive between runs.
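One way to force both agents through identical checks while tracking wall-clock time is a small runner like the sketch below. The `CheckResult` and `run_check` names, the `-1` timeout convention, and the placeholder commands are assumptions for illustration; swap in the repository's real install, lint, and test commands.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    command: list[str]
    exit_code: int
    wall_seconds: float


def run_check(name: str, command: list[str], timeout_s: float = 600.0) -> CheckResult:
    """Run one check and record its exit code and wall-clock duration."""
    start = time.monotonic()
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        exit_code = -1  # hypothetical convention: -1 marks a timebox overrun
    return CheckResult(name, command, exit_code, time.monotonic() - start)


# Placeholder checks so the sketch runs anywhere Python does.
# In a real benchmark these would be the repo's install, lint, and test commands.
CHECKS = [
    ("passing-check", [sys.executable, "-c", "print('ok')"]),
    ("failing-check", [sys.executable, "-c", "raise SystemExit(3)"]),
]

results = [run_check(name, command) for name, command in CHECKS]
for result in results:
    print(f"{result.name}: exit={result.exit_code} wall={result.wall_seconds:.2f}s")
```

Because the same `CHECKS` list is used for every run, neither agent can quietly skip a check, and the recorded durations feed directly into the cost side of the scorecard.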

The Reddit lesson

The top comparison threads are full of careful anecdotes, but the durable insight is that people are benchmarking different jobs. A migration, a bug hunt, a greenfield feature, and a repo cleanup are separate events. Treat them that way.

The Real-Repo Benchmark Design

Start with a repo that already has some pain: real dependencies, flaky setup, old files, incomplete tests, browser UI, and at least one architectural decision that cannot be solved by search and replace. Then create a task set that exercises different muscles.

| Task | What it tests | Good output |
| --- | --- | --- |
| Bug fix with failing test | Can the agent localize a defect, write the smallest patch, and prove the regression stays fixed? | One focused diff, a failing-then-passing test, and a plain explanation of the bug. |
| Feature with UI state | Can the agent coordinate code, browser checks, screenshots, and product judgment? | Working UI, responsive layout, clean copy, and evidence from a local browser run. |
| Refactor under tests | Can the agent reduce complexity without changing behavior? | Smaller or clearer code, unchanged public behavior, and a test run that covers the touched paths. |
| Dependency or SDK migration | Can the agent read docs, update call sites, and handle incompatible types? | Migration notes, compatible imports, updated tests, and no broad unrelated churn. |
| Dirty repo handoff | Can the agent work around existing changes without overwriting them? | A branch that preserves unrelated work, states assumptions, and isolates its own patch. |
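The task set above can be written down as data so that every agent sees exactly the same list. This is a minimal sketch: the `BenchmarkTask` fields, the prompt paths, the timebox values, and the check names are all hypothetical placeholders, not values the benchmark prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    slug: str                # short name reused in run-folder names
    prompt_file: str         # hypothetical path to the pinned prompt text
    timebox_minutes: int     # hypothetical limit; pick one and keep it fixed
    checks: tuple[str, ...]  # names of the checks both agents must run


# One entry per task category; all values are illustrative.
TASKS = (
    BenchmarkTask("bugfix", "prompts/bugfix.txt", 45, ("lint", "unit")),
    BenchmarkTask("ui-feature", "prompts/ui-feature.txt", 90, ("lint", "unit", "browser")),
    BenchmarkTask("refactor", "prompts/refactor.txt", 60, ("lint", "unit")),
    BenchmarkTask("migration", "prompts/migration.txt", 90, ("install", "lint", "unit")),
    BenchmarkTask("dirty-handoff", "prompts/dirty-handoff.txt", 45, ("lint", "unit")),
)

AGENTS = ("codex", "claude-code")


def run_matrix(tasks, agents):
    """Return every (agent, task slug) pair so no combination is skipped."""
    return [(agent, task.slug) for task in tasks for agent in agents]


for agent, slug in run_matrix(TASKS, AGENTS):
    print(f"{agent}: {slug}")
```

Freezing the dataclass keeps a task definition from drifting between the first agent's run and the second's.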

Keep The Environment Fixed

A benchmark is useless if one agent gets a warm repo and the other gets a broken checkout. Use the same host class, same branch reset point, same package manager cache policy, same browser profile, same terminal permissions, and same review rules.

  1. Create one clean baseline branch and tag it before the first run.
  2. Record CPU, memory, disk, OS version, Node/Bun/Python versions, and package-manager versions.
  3. Install dependencies once, then decide whether both agents get warm caches or both get cold caches.
  4. Create one browser profile for UI tests and reset it between runs when auth state matters.
  5. Use the same secrets policy: no production credentials, no payment flows, no account-admin actions.
  6. Save every prompt, terminal command, test result, screenshot, and final diff in a run folder, for example:

benchmark-runs/
  2026-05-20-codex-bugfix/
    prompt.txt
    environment.txt
    transcript-notes.md
    commands.log
    test-results.txt
    screenshots/
    final.patch
    scorecard.md
  2026-05-20-claude-code-bugfix/
    ...
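A run folder in that layout, with its `environment.txt`, could be created along these lines. This is a sketch under stated assumptions: `tool_version`, `snapshot_environment`, and `create_run_folder` are hypothetical helper names, and total memory is left out because the Python standard library has no portable call for it.

```python
import os
import platform
import shutil
import subprocess
from datetime import date
from pathlib import Path


def tool_version(command: list[str]) -> str:
    """Return the first line of a tool's version output, or a marker if it is absent."""
    if shutil.which(command[0]) is None:
        return "not installed"
    proc = subprocess.run(command, capture_output=True, text=True)
    lines = (proc.stdout or proc.stderr).strip().splitlines()
    return lines[0] if lines else "unknown"


def snapshot_environment() -> dict[str, str]:
    """Collect host facts worth pinning; record memory by hand or with an OS tool."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": str(os.cpu_count()),
        "disk_free_gb": str(shutil.disk_usage(Path.cwd()).free // 10**9),
        "python": platform.python_version(),
        "node": tool_version(["node", "--version"]),
        "bun": tool_version(["bun", "--version"]),
        "git": tool_version(["git", "--version"]),
    }


def create_run_folder(root: Path, agent: str, task: str, day: date) -> Path:
    """Create <root>/<date>-<agent>-<task>/ and refuse to reuse an existing run."""
    run_dir = root / f"{day.isoformat()}-{agent}-{task}"
    (run_dir / "screenshots").mkdir(parents=True, exist_ok=False)
    env_lines = [f"{key}: {value}" for key, value in snapshot_environment().items()]
    (run_dir / "environment.txt").write_text("\n".join(env_lines) + "\n")
    return run_dir
```

Calling `create_run_folder(Path("benchmark-runs"), "codex", "bugfix", date(2026, 5, 20))` would produce the `2026-05-20-codex-bugfix` folder shown above. Raising an error on an existing folder is a deliberate choice here, so a rerun cannot silently overwrite earlier evidence.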

The Scorecard That Actually Separates Agents

Do not score only whether the agent finished. Score the review burden it left behind. A coding agent that lands an 80 percent patch with a chaotic diff may take longer to reach a mergeable PR than one that asks two good questions and ships a narrower change.

| Metric | How to score it | Why it matters |
| --- | --- | --- |
| Correctness | 0 to 5: tests pass, acceptance criteria met, no obvious regression. | The agent has to change reality, not just sound confident. |
| Diff discipline | 0 to 5: minimal scope, local style, no drive-by formatting. | Large unrelated diffs create review debt and merge risk. |
| Verification | 0 to 5: install, lint, unit tests, browser checks, screenshots, or explicit blocker notes. | A trustworthy agent proves the work from the same environment that made the change. |
| Recovery | 0 to 5: handles failures, backs out bad attempts, explains unresolved blockers. | Real repos fail. Recovery quality is often more important than first-shot quality. |
| Human load | 0 to 5: fewer clarifications, approvals, manual fixes, and review comments score higher. | The best agent lowers operator effort, not only model time. |
| Cost | Measured: subscription usage, API spend, wall time, and reruns. | A cheap first run can become expensive if it takes three corrective passes. |
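The scorecard can be captured in a small structure that rejects out-of-range scores. This is a minimal sketch: the `Scorecard` and `Cost` names are hypothetical, the unweighted `total` is an assumption (the rubric itself only scores each dimension separately), and `human_load` is taken to mean that 5 is the least human effort.

```python
from dataclasses import dataclass, fields


@dataclass
class Scorecard:
    correctness: int
    diff_discipline: int
    verification: int
    recovery: int
    human_load: int  # assumption: 5 means the least human effort

    def __post_init__(self) -> None:
        # Every rubric dimension is scored on the same 0 to 5 scale.
        for field in fields(self):
            value = getattr(self, field.name)
            if not 0 <= value <= 5:
                raise ValueError(f"{field.name} must be between 0 and 5, got {value}")

    def total(self) -> int:
        """Unweighted sum out of 25; weighting is left to the reviewer."""
        return sum(getattr(self, field.name) for field in fields(self))


@dataclass
class Cost:
    """Cost is measured, not scored, so it stays outside the 0 to 5 total."""
    wall_minutes: float
    api_spend_usd: float
    reruns: int


# Illustrative scores only, not results from any real run.
card = Scorecard(correctness=4, diff_discipline=3, verification=5, recovery=4, human_load=2)
print(card.total())  # prints 18
```

Keeping `Cost` separate avoids folding dollars and minutes into a rubric score, where a cheap but unmergeable run could otherwise look competitive.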

A Repeatable Run Protocol

The protocol should be boring enough that another engineer can repeat it. It also makes any published write-up stronger: readers trust benchmarks when they can see how you controlled for the environment.

For each task:
1. Reset to the baseline branch.
2. Start a fresh run folder.
3. Paste the exact task prompt.
4. Let the agent inspect the repo before editing.
5. Allow edits, commands, and browser checks within the stated policy.
6. Stop at the timebox or when the agent says the work is complete.
7. Save git diff, command log, screenshots, and the agent's final note.
8. Run the human scorecard before reading the other agent's result.
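Steps 1 and 7 of the protocol are the easiest to script. The sketch below assumes a baseline tag named `benchmark-baseline` and the helper names `reset_commands`, `reset_repo`, and `save_patch`, all of which are hypothetical; only the git subcommands themselves are standard.

```python
import subprocess
from pathlib import Path

BASELINE_TAG = "benchmark-baseline"  # hypothetical tag created before the first run


def reset_commands(baseline: str, run_branch: str) -> list[list[str]]:
    """Build the git commands that return the checkout to the baseline."""
    return [
        # Create or reset the run branch at the baseline, discarding local edits.
        ["git", "checkout", "--force", "-B", run_branch, baseline],
        # Remove untracked files. Ignored files such as dependency caches survive;
        # use "-fdx" instead if both agents should start from cold caches.
        ["git", "clean", "-fd"],
    ]


def reset_repo(repo: Path, baseline: str, run_branch: str) -> None:
    """Step 1: reset the checkout. Destructive, so run it only on the benchmark clone."""
    for command in reset_commands(baseline, run_branch):
        subprocess.run(command, cwd=repo, check=True)


def save_patch(repo: Path, run_dir: Path, baseline: str) -> Path:
    """Step 7: save everything the agent changed, including new files."""
    # Staging first lets the diff include files the agent created but never added.
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    diff = subprocess.run(
        ["git", "diff", "--cached", "--binary", baseline],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    patch_path = run_dir / "final.patch"
    patch_path.write_text(diff.stdout)
    return patch_path


for command in reset_commands(BASELINE_TAG, "run/codex-bugfix"):
    print(" ".join(command))
```

The dirty-repo handoff task needs its pre-existing changes reapplied after this reset, since the reset deliberately wipes them.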

Use one persistent host

A persistent Mac makes the benchmark less fake. Browser auth, Xcode state, simulator state, package caches, logs, and recovered sessions are part of real agent work. If your benchmark loses them every run, you are measuring sandbox startup as much as agent quality.

How To Interpret The Result

| If the scorecard shows | Default choice | Reason |
| --- | --- | --- |
| Lower review burden on complex repo edits | Use that agent for refactors and migrations. | Those tasks are expensive when the diff is noisy. |
| Better UI verification and screenshot discipline | Use that agent for product surfaces and browser-heavy fixes. | The agent is acting more like a frontend engineer, not only a code generator. |
| Better recovery after failed tests | Use that agent for legacy code and dependency work. | Broken setups reward persistence and debugging loops. |
| Lower cost but more human supervision | Use it for queued low-risk tasks. | Batch it behind a strong review process instead of handing it sensitive work. |
| Higher cost but cleaner final PRs | Use it for tasks where senior review time is the bottleneck. | The expensive model can still be cheaper than an expensive review spiral. |

You may end up using both. One agent can draft the migration plan, another can execute the edit, and a third pass can review for regressions. The benchmark tells you where each one deserves the keyboard.
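Reading the results per task rather than as one overall winner can be sketched as follows. The agent labels, task names, and totals are made-up placeholders, not measured results, and `best_agent_per_task` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder rows of (agent, task, scorecard total); the numbers are invented.
ROWS = [
    ("agent-a", "bugfix", 21),
    ("agent-b", "bugfix", 18),
    ("agent-a", "migration", 15),
    ("agent-b", "migration", 19),
    ("agent-a", "refactor", 20),
    ("agent-b", "refactor", 20),
]


def best_agent_per_task(rows):
    """Return the top-scoring agent for each task, or "tie" when totals match."""
    by_task = defaultdict(dict)
    for agent, task, total in rows:
        by_task[task][agent] = total

    picks = {}
    for task, totals in by_task.items():
        top = max(totals.values())
        leaders = [agent for agent, total in totals.items() if total == top]
        picks[task] = leaders[0] if len(leaders) == 1 else "tie"
    return picks


print(best_agent_per_task(ROWS))
```

Reporting a tie explicitly, instead of picking the first agent, is a reminder to fall back on cost and human load before assigning a default.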

Where Hyperbox Fits

Hyperbox is not the agent in this comparison. It is the machine the agents run on. That matters when the benchmark needs the same repo, tools, browser sessions, GUI permissions, and logs to stay available while you switch agents or review from another device.

  • Run Codex and Claude Code against the same persistent checkout.
  • Keep browser and desktop state alive after your laptop closes.
  • Capture screenshots and logs from the machine that actually executed the work.
  • Use SSH and VNC for terminal-first and desktop-first verification.
  • Repeat the benchmark later without rebuilding the entire environment.

Frequently asked questions

Which is better, Codex or Claude Code?

The honest answer depends on the task. Benchmark bug fixes, UI work, refactors, migrations, and dirty-repo handoffs separately, then score correctness, diff quality, verification, recovery, human load, and cost.

What makes a coding-agent benchmark fair?

Use the same repo, baseline branch, prompt, task scope, test commands, browser checks, timebox, host environment, and scoring rubric. Otherwise you are measuring setup differences as much as agent quality.

Why run the benchmark on a persistent Mac?

A persistent Mac keeps dependencies, browser state, GUI permissions, logs, screenshots, and branches available across runs, which makes the benchmark closer to real agent work.
