The automated peer review experiment
Software development is entering a new phase in which AI agents no longer just write code; they audit it. A recent head-to-head evaluation pitted Claude Code against Codex in a high-stakes Laravel project. The task was to implement a brand-new "teams" feature, one so recent that neither model had it in its training data. Because the agents had to rely on the provided git commits rather than memory, the test exposed the raw reasoning capabilities of modern LLMs.
Codex wins on aesthetics and UI
In the initial build, Codex demonstrated a superior grasp of user experience. While Claude Code delivered a functional but bare-bones interface, Codex automatically grouped menu items and used cards and borders to create a professional-looking dashboard. However, visual polish can hide structural rot. The real value of the experiment emerged when the agents were told to swap files and perform a "second opinion" audit.
Claude Code uncovers dangerous deletion bugs
In the audit phase, Claude Code proved the more meticulous reviewer, identifying 12 distinct issues in the Codex codebase. The most alarming find was a "silent cascade" bug: deleting a category instantly wiped out all associated posts without a confirmation prompt. A missing safety net of this kind is a critical failure in any production environment. Claude Code also flagged excessive database queries and potential security vulnerabilities involving fillable team IDs, meaning a client could set a record's team through mass assignment.
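To make the two findings concrete, here is a minimal, framework-free sketch in Python rather than the project's actual Laravel code. It assumes an in-memory store; `Store`, `delete_category`, `ConfirmationRequired`, `FILLABLE`, and `build_post` are hypothetical names invented for illustration. The first function refuses a cascading delete unless the caller explicitly confirms it, and the second keeps `team_id` out of the client-settable fields.

```python
from dataclasses import dataclass, field


class ConfirmationRequired(Exception):
    """Raised when a delete would also remove dependent posts."""


@dataclass
class Store:
    # Hypothetical in-memory tables, standing in for a real database.
    categories: dict = field(default_factory=dict)  # category_id -> name
    posts: dict = field(default_factory=dict)       # post_id -> category_id


def delete_category(store: Store, category_id: int, confirm_cascade: bool = False) -> int:
    """Delete a category and return how many posts were removed with it.

    If the category still has posts, the caller must pass
    confirm_cascade=True; otherwise nothing is deleted.
    """
    if category_id not in store.categories:
        raise KeyError(category_id)
    dependents = [pid for pid, cid in store.posts.items() if cid == category_id]
    if dependents and not confirm_cascade:
        raise ConfirmationRequired(
            f"category {category_id} still has {len(dependents)} post(s)"
        )
    for pid in dependents:
        del store.posts[pid]
    del store.categories[category_id]
    return len(dependents)


# Fields a client may set directly. team_id is deliberately absent.
FILLABLE = {"title", "body", "category_id"}


def build_post(payload: dict, current_team_id: int) -> dict:
    """Keep only allow-listed fields, then set team_id on the server side."""
    attrs = {key: value for key, value in payload.items() if key in FILLABLE}
    attrs["team_id"] = current_team_id
    return attrs
```

In a sketch like this, the user interface would catch `ConfirmationRequired`, show the number of affected posts, and retry with `confirm_cascade=True` only after the user agrees. Any `team_id` smuggled into the request payload is simply discarded.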
Cross-model auditing as the new standard
Although Codex found fewer errors in Claude Code's work, it did catch a significant validation oversight: a forged POST request could reference categories belonging to other teams. These results suggest that relying on a single AI model is a gamble. The takeaway is clear: the "second opinion" workflow, in which one model builds and another tries to break the result, mimics human pair programming and can substantially reduce the likelihood of shipping catastrophic bugs. For serious developers, the cost of running two agents is a small price for such rigorous quality control.
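The cross-team validation gap can be illustrated with a short Python sketch. This is an assumption-laden toy, not the code either agent produced: `CATEGORY_TEAM`, `ValidationError`, and `validate_category` are hypothetical names, and the lookup table stands in for a database query. The point is that a submitted category ID is accepted only if it belongs to the requester's own team.

```python
class ValidationError(Exception):
    """Raised when submitted input fails a validation rule."""


# Hypothetical data: category_id -> id of the team that owns it.
CATEGORY_TEAM = {1: 100, 2: 100, 3: 200}


def validate_category(category_id: int, current_team_id: int) -> int:
    """Return category_id if it exists and belongs to the current team."""
    owner = CATEGORY_TEAM.get(category_id)
    # Use one message for "missing" and "owned by another team" so the
    # response does not reveal which IDs exist elsewhere.
    if owner is None or owner != current_team_id:
        raise ValidationError("The selected category is invalid.")
    return category_id
```

Checking only that the category exists, without the team comparison, is the kind of oversight the audit describes: a hand-crafted request carrying category `3` would pass for a member of team `100`.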