Ngo warns code cannot audit agents without telemetry first

AI Engineer // Jun 7, 2026 // 3 min read

The shift from code to telemetry

In traditional software, predictable logic paths let developers audit a system by reading its code. AI agents break this paradigm. Dat Ngo, AI Architect at Arize AI, argues that because these systems are non-deterministic, code alone no longer serves as a reliable audit record. Instead, telemetry becomes the primary source of truth. With OpenTelemetry (OTel), engineers can generate traces and spans that act as a forensic account of an agent's behavior, revealing when a model makes a tool call out of order or hits a dependency failure that reading the code would never catch.
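To make the idea of auditing from telemetry concrete, here is a minimal sketch in plain Python. It does not use the OpenTelemetry SDK: the `Span` class, the `tool.name` attribute, the `audit_trace` helper, and the refund workflow are all hypothetical stand-ins, chosen only to show how recorded spans can expose an out-of-order tool call or a failed dependency.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified stand-in for an exported span (not the OpenTelemetry SDK type)."""
    name: str
    start_ns: int
    status: str = "OK"  # "OK" or "ERROR"
    attributes: dict = field(default_factory=dict)


def audit_trace(spans, expected_tool_order):
    """Return a list of findings for one trace; an empty list means nothing was flagged."""
    findings = []
    ordered = sorted(spans, key=lambda s: s.start_ns)

    # Dependency failures surface as spans recorded with an error status.
    for span in ordered:
        if span.status == "ERROR":
            findings.append(f"error in span '{span.name}'")

    # Tool calls are read from a hypothetical "tool.name" attribute and
    # compared against the order the workflow is supposed to follow.
    rank = {tool: i for i, tool in enumerate(expected_tool_order)}
    observed = [s.attributes["tool.name"] for s in ordered if "tool.name" in s.attributes]
    ranks = [rank[t] for t in observed if t in rank]
    if any(earlier > later for earlier, later in zip(ranks, ranks[1:])):
        findings.append(f"tool calls out of order: {observed}")
    return findings


# Hypothetical trace: the refund was issued before the customer lookup,
# and a vector database query failed along the way.
EXPECTED_ORDER = ["lookup_customer", "issue_refund"]
trace = [
    Span("vector_db.query", start_ns=10, status="ERROR"),
    Span("tool.call", start_ns=20, attributes={"tool.name": "issue_refund"}),
    Span("tool.call", start_ns=30, attributes={"tool.name": "lookup_customer"}),
]
findings = audit_trace(trace, EXPECTED_ORDER)
```

Neither finding is visible in the agent's source code, since both depend on what the model chose to do at run time. In a real deployment the spans would come from an OpenTelemetry exporter or tracing backend rather than hand-built objects.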

Source video: LLM Observability, Evaluation, Experimentation Platform (Dat Ngo, Arize), AI Engineer, 16:32

Five flavors of evaluation signal

Building reliable AI products requires moving beyond qualitative "vibes" toward structured evaluation signals. Dat Ngo groups these signals into five flavors. LLM-as-a-judge is the most discussed, but it is only one of the five. Human feedback captures end-user satisfaction directly, while golden datasets offer a trusted baseline for tuning automated judges. For cost-conscious teams, deterministic checks, such as validating a JSON schema or confirming that required fields are non-null, offer fast, low-cost verification. Finally, business metrics serve as the ultimate north star, measuring whether an agent actually saves time or generates revenue.
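A deterministic check of the kind described above can be as small as the following sketch. The required fields (`answer`, `sources`) and the `deterministic_check` helper are hypothetical, picked only for illustration. The point is that the check runs without any model call, so it is cheap enough to apply to every output.

```python
import json

# Hypothetical output contract: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def deterministic_check(raw_output):
    """Return a list of failures; an empty list means the output passed."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    failures = []
    for name, expected_type in REQUIRED_FIELDS.items():
        value = payload.get(name)
        if value is None:
            # Covers both an absent key and an explicit JSON null.
            failures.append(f"'{name}' is missing or null")
        elif not isinstance(value, expected_type):
            failures.append(f"'{name}' should be {expected_type.__name__}")
    return failures


good = deterministic_check('{"answer": "42", "sources": ["doc1"]}')
bad = deterministic_check('{"answer": null, "sources": "doc1"}')
```

Here `good` is an empty list, while `bad` reports a null `answer` and a `sources` value of the wrong type. A fuller version might validate against a formal JSON Schema, but that requires a third-party library and is left out of this sketch.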

Scaling evaluation from spans to sessions

Granularity is a defining challenge of modern AI observability: evaluation must occur at multiple scopes to be effective. A single-span eval looks at one input and its output, which is the starting point for most developers. Multi-span evals track how data passes between components, checking that Agent A's output is actually compatible with Agent B's requirements. At a higher altitude, trajectory evals analyze the entire path taken to complete a business process, while session evals examine the full state machine of a conversation to detect user frustration or unresolved queries.
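The four scopes can be sketched as four small functions over the same recorded spans. Everything here is an assumption made for illustration: the `SpanRecord` shape, the `resolved` flag, and the pass/fail rules are hypothetical, and a real evaluator at any of these scopes would usually be richer (for example, an LLM judge rather than a boolean rule).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """Hypothetical flattened span: one step of one trace in one session."""
    session_id: str
    trace_id: str
    name: str
    output: dict


def span_eval(span):
    """Single-span scope: did this one step produce any output?"""
    return bool(span.output)


def multi_span_eval(producer, required_keys):
    """Multi-span scope: does the producer's output carry what the next step needs?"""
    return set(required_keys) <= set(producer.output)


def trajectory_eval(trace_spans, expected_path):
    """Trajectory scope: did the trace follow the expected sequence of steps?"""
    return [s.name for s in trace_spans] == expected_path


def session_eval(session_spans):
    """Session scope: did the conversation end with the query marked resolved?"""
    return bool(session_spans) and session_spans[-1].output.get("resolved") is True


def group_by(spans, key):
    """Group spans by an attribute such as 'trace_id' or 'session_id'."""
    groups = defaultdict(list)
    for span in spans:
        groups[getattr(span, key)].append(span)
    return dict(groups)


# One session containing two traces (two user turns).
spans = [
    SpanRecord("s1", "t1", "retrieve", {"documents": ["d1"]}),
    SpanRecord("s1", "t1", "answer", {"text": "See the policy.", "resolved": False}),
    SpanRecord("s1", "t2", "answer", {"text": "Yes, you qualify.", "resolved": True}),
]
traces = group_by(spans, "trace_id")
sessions = group_by(spans, "session_id")
```

With this data, trace `t1` passes the trajectory check against `["retrieve", "answer"]` while `t2` fails it because it skipped retrieval, yet the session as a whole passes because the final turn is resolved. The same spans can therefore pass at one scope and fail at another, which is why a single scope is not enough.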

Automating the observability flywheel

The future of AI engineering points toward automating much of the debugging process. With products like the open-source Arize Phoenix and the enterprise-grade Arize AX, the goal is to create a self-correcting loop. Arize AI recently introduced Alyx, an AI system designed to scan traces and surface errors or latency issues autonomously. This shift suggests a world where engineers no longer manually pick their evaluations; instead, an AI with context of the system's traces creates and runs them on the fly.
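As a rough illustration of what "scanning traces for errors or latency issues" means at its simplest, the sketch below applies two fixed rules to a list of spans. It is not how Arize's assistant works, which the talk summary does not describe; the span dictionaries, the `scan_spans` helper, and the "three times the median" latency rule are all assumptions made for this example.

```python
from statistics import median


def scan_spans(spans, latency_factor=3.0):
    """Flag spans that errored or ran far slower than the typical span.

    Each span is a dict with 'name', 'duration_ms', and 'status' keys.
    """
    if not spans:
        return []
    typical = median(s["duration_ms"] for s in spans)
    issues = []
    for s in spans:
        if s["status"] == "ERROR":
            issues.append((s["name"], "error"))
        elif s["duration_ms"] > latency_factor * typical:
            issues.append((s["name"], "slow"))
    return issues


# Hypothetical spans from one agent run.
run = [
    {"name": "llm.plan", "duration_ms": 100, "status": "OK"},
    {"name": "retriever", "duration_ms": 120, "status": "OK"},
    {"name": "llm.answer", "duration_ms": 110, "status": "OK"},
    {"name": "rerank", "duration_ms": 900, "status": "OK"},
    {"name": "tool.search", "duration_ms": 105, "status": "ERROR"},
]
issues = scan_spans(run)
```

The median duration here is 110 ms, so only `rerank` exceeds the 330 ms threshold, and `tool.search` is flagged for its error status. The automated loop described above would go further by deciding which checks to write in the first place, rather than applying rules an engineer fixed in advance.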
