Ngo warns code cannot audit agents without telemetry first
The shift from code to telemetry
In the traditional software world, predictable logic paths allow developers to audit systems by simply reading the code. AI agents break this paradigm. Dat Ngo, AI Architect at Arize AI, argues that because these systems are non-deterministic, code alone no longer serves as a reliable audit record. Instead, telemetry becomes the primary source of truth. By utilizing OpenTelemetry (OTEL), engineers can generate traces and spans that act as a forensic account of an agent's behavior, revealing when a model makes a tool call out of order or experiences a dependency failure that static code would never catch.

Five flavors of evaluation signal
Building reliable AI products requires moving beyond simple qualitative "vibes" toward structured signal derivation. Dat Ngo categorizes these signals into five distinct flavors. While LLM as a judge is the most discussed, it remains just one piece of the puzzle. Human feedback provides the grounded reality of end-user satisfaction, while golden datasets offer a trusted baseline for tuning automated judges. For cost-conscious teams, deterministic checks—such as validating JSON schemas or non-null fields—offer high-speed, low-cost verification. Finally, business metrics serve as the ultimate north star, measuring if an agent actually saves time or generates revenue.
Scaling evaluation from spans to sessions
Granularity is the defining challenge of modern AI observability. Evaluation must occur at multiple scopes to be effective. A single span eval looks at one specific input and output, which is the baseline for most developers. However, multi-span evals track how data passes between different components, ensuring that Agent A's output is actually compatible with Agent B's requirements. At a higher altitude, trajectory evals analyze the entire path taken to complete a business process, while session evals examine the full state machine of a conversation to detect user frustration or unresolved queries.
Automating the observability flywheel
The future of AI engineering points toward the total automation of the debugging process. Through products like Arize Phoenix and the enterprise-grade Arize AX, the goal is to create a self-correcting loop. Arize AI recently introduced Alex AI, an AI system designed to scan traces and surface errors or latency issues autonomously. This shift suggests a world where engineers no longer manually pick their evaluations; instead, an AI with context of the system's traces creates and runs them on the fly.

LLM Observability, Evaluation, Experimentation Platform — Dat Ngo, Arize
WatchAI Engineer // 16:32