Grok 4.3 fails coding stress tests while charging four times more than rivals
The high cost of synthetic speed
xAI recently released Grok 4.3, and the developer community immediately looked for performance gains. This iteration follows a lineage of so-called fast models, such as Grok Code Fast 1, which initially impressed the market with low latency. However, speed is a dangerous metric when detached from reliability. In a series of standardized benchmarks involving Laravel and Filament admin panels, Grok 4.3 demonstrated an alarming disconnect between its rapid execution and the actual quality of its output.
Fundamental errors in Laravel and PHP
When tasked with building a Laravel API, the model stumbled on basic architectural requirements. It failed to apply required route name prefixes and, more critically, dropped crucial type hints while refactoring: moving a route into a group stripped the Request type hint from the $request parameter, an error that breaks the code the moment it runs. These are not nuanced architectural disagreements; they are fundamental syntax and logic failures that an experienced developer would expect a modern LLM to handle with ease.
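To make the failure mode concrete, here is a minimal sketch of the kind of refactor described above. The route names and endpoint are illustrative, not from the actual test prompt; the snippet assumes a standard Laravel installation.

```php
<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

// Correct refactor: the group carries the required name prefix,
// and the closure keeps its Request type hint so Laravel injects
// the current request object.
Route::prefix('products')->name('products.')->group(function () {
    Route::get('/', function (Request $request) {
        return response()->json([
            'search' => $request->query('search'),
        ]);
    })->name('index');
});

// The failure mode: during the move into the group, the model
// emitted the closure without the type hint, e.g.
//     Route::get('/', function ($request) { ... })
// Without the hint, Laravel no longer injects the Request instance,
// so calls like $request->query() fail at runtime.
```

Dropping a type hint is exactly the kind of regression that compiles silently in PHP and only surfaces when the route is hit, which is why it is so damaging in automated workflows.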
Broken interfaces and inconsistent enums
The struggles continued with Filament. The prompt required the implementation of specific PHP enums using HasLabel and HasColor interfaces. Grok 4.3 failed three consecutive attempts, often ignoring the interface requirements entirely or hallucinating string values that deviated from the prompt. While one attempt was almost successful, it was marred by unnecessary "creativity" that broke automated tests. This inconsistency makes it impossible to trust the model for automated workflows.
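For reference, the pattern the prompt asked for looks roughly like the following sketch. The enum name, case values, and color choices are hypothetical; the snippet assumes a project with Filament installed, whose `HasLabel` and `HasColor` contracts it implements.

```php
<?php

use Filament\Support\Contracts\HasColor;
use Filament\Support\Contracts\HasLabel;

// A backed enum implementing both Filament contracts, as the
// prompt required. Case names and string values are illustrative.
enum OrderStatus: string implements HasLabel, HasColor
{
    case Pending = 'pending';
    case Shipped = 'shipped';
    case Cancelled = 'cancelled';

    // Required by HasLabel: the string Filament displays in
    // tables, forms, and infolists.
    public function getLabel(): string
    {
        return match ($this) {
            self::Pending => 'Pending',
            self::Shipped => 'Shipped',
            self::Cancelled => 'Cancelled',
        };
    }

    // Required by HasColor: a named color Filament maps to its
    // badge styles.
    public function getColor(): string
    {
        return match ($this) {
            self::Pending => 'warning',
            self::Shipped => 'success',
            self::Cancelled => 'danger',
        };
    }
}
```

The failures described above amount to omitting the `implements` clause or inventing string values other than the ones the prompt specified, both of which immediately break tests that assert on exact enum values.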

The verdict on price and performance
The most staggering data point is the cost. Accessed via OpenRouter, the model was billed at roughly $0.50 per prompt. This makes it nearly four times as expensive as Kimi, a model that consistently delivered bug-free code in the same tests. While Grok 4.3 is fast, averaging two minutes per task, it is an expensive luxury that currently yields broken results. For serious development, Claude 3.5 Sonnet and GPT-4o remain the standard-bearers for accuracy and value.

I Tested Grok 4.3 vs Other LLMs for Coding: Clear Answer
WatchAI Coding Daily // 7:25
This channel is not for vibe-coders. It's for professional devs who want to use AI as a powerful assistant while still keeping control of their codebase. My name is Povilas Korop, and I'm passionate about coding with AI. So I started this THIRD YouTube channel, in addition to my other ones, Laravel Daily and Filament Daily. You will see a lot of my experiments with AI: I will try new things and share my discoveries along the way.