Claude Code and Codex expose critical security gaps in each other's work

The automated peer review experiment

Software development is entering a new phase where AI agents no longer just write code; they audit it. A recent head-to-head evaluation pitted Claude Code against Codex in a high-stakes Laravel project. The task involved implementing brand-new "teams" functionality, a feature so fresh that neither model had it in its training data. By forcing these agents to rely on provided git commits rather than memory, the test revealed the raw reasoning capabilities of modern LLMs.

Codex wins on aesthetics and UI

When it came to the initial build, Codex demonstrated a superior grasp of user experience. While Claude Code delivered a functional but bare-bones interface, Codex automatically grouped menu items and used cards and borders to create a professional-looking dashboard. However, visual polish often hides structural rot. The real value of the experiment emerged when the agents were ordered to swap files and perform a "second opinion" audit.

Claude Code uncovers dangerous deletion bugs

In the audit phase, Claude Code proved to be the more meticulous reviewer, identifying 12 distinct issues in the Codex codebase. The most alarming find was a "silent cascade" bug: deleting a category would instantly wipe out all associated posts without a confirmation prompt. This lack of a safety net is a critical failure in any production environment. Claude Code also flagged excessive database queries and a potential security vulnerability involving mass-assignable team IDs.
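The two most serious findings map onto well-known Laravel patterns. A minimal sketch of what such bugs typically look like, assuming hypothetical `posts`/`categories` tables and a `Post` model (the actual experiment's code was not published):

```php
<?php

use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// A cascading foreign key like this deletes every post the instant its
// category is deleted, with no confirmation prompt anywhere in the app.
Schema::create('posts', function (Blueprint $table) {
    $table->id();
    $table->foreignId('category_id')
          ->constrained()
          ->cascadeOnDelete(); // the "silent cascade"
    $table->timestamps();
});

// Listing team_id in $fillable means a crafted request can mass-assign a
// record into another team, the kind of fillable-team-ID issue flagged above.
class Post extends Model
{
    protected $fillable = ['title', 'body', 'team_id']; // risky: user-controllable
}
```

A common fix is to drop `team_id` from `$fillable` and set it server-side from the authenticated user's team, and to replace the hard cascade with a soft-delete or an explicit confirmation step.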

Cross-model auditing as the new standard

While Codex found fewer errors in Claude Code's work, it did catch a significant validation oversight: forged post requests could access categories belonging to other teams. These results suggest that relying on a single AI model is a gamble. The takeaway is clear: the "second opinion" workflow, using one model to build and another to break, mimics human pair programming and drastically reduces the likelihood of shipping catastrophic bugs. For serious developers, the cost of running two agents is a small price for such rigorous quality control.
