Arize AI – Research, Videos, Insights & Reviews

// AI Engineer

The shift from code to telemetry

In traditional software, predictable logic paths let developers audit a system by reading its code. AI agents break this paradigm. Dat Ngo, AI Architect at Arize AI, argues that because these systems are non-deterministic, code alone no longer serves as a reliable audit record. Instead, telemetry becomes the primary source of truth. Using OpenTelemetry (OTel), engineers can generate traces and spans that act as a forensic account of an agent's behavior, revealing when a model makes a tool call out of order or hits a dependency failure that static code would never show.

Five flavors of evaluation signal

Building reliable AI products requires moving beyond qualitative "vibes" toward structured signals. Ngo groups these signals into five flavors. While **LLM as a judge** is the most discussed, it is just one piece of the puzzle. **Human feedback** provides the ground truth of end-user satisfaction, while **golden datasets** offer a trusted baseline for tuning automated judges. For cost-conscious teams, **deterministic checks**, such as validating output against a JSON schema or checking that fields are non-null, offer fast, low-cost verification. Finally, **business metrics** serve as the north star, measuring whether an agent actually saves time or generates revenue.

Scaling evaluation from spans to sessions

Granularity is the defining challenge of modern AI observability. Evaluation must occur at multiple scopes to be effective. A **single span eval** looks at one specific input and output, which is the baseline for most developers. **Multi-span evals**, however, track how data passes between components, ensuring that Agent A's output is actually compatible with Agent B's requirements.
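
A multi-span check of this kind can also be a deterministic check. The sketch below is only an illustration and is not taken from the talk or from any Arize product. The function `check_handoff`, the `REQUIRED_FIELDS` contract, and the field names are all hypothetical. It assumes Agent A emits a JSON string and that Agent B needs three typed, non-null fields.

```python
import json

# Hypothetical contract: the fields Agent B needs from Agent A's output.
REQUIRED_FIELDS = {"customer_id": str, "intent": str, "priority": int}


def check_handoff(agent_a_output: str) -> list[str]:
    """Return the problems found in Agent A's output; an empty list means compatible."""
    try:
        payload = json.loads(agent_a_output)
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["output is not a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = payload.get(field)
        if value is None:
            # Covers both an absent key and an explicit JSON null.
            problems.append(f"missing or null field: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems


print(check_handoff('{"customer_id": "c-17", "intent": "refund", "priority": 2}'))
# prints []
print(check_handoff('{"customer_id": "c-17", "intent": null}'))
# prints ['missing or null field: intent', 'missing or null field: priority']
```

Because the check uses no model call, it is cheap enough to run on every handoff, which is the appeal of deterministic checks for cost-conscious teams.
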
At a higher altitude, **trajectory evals** analyze the entire path taken to complete a business process, while **session evals** examine the full state machine of a conversation to detect user frustration or unresolved queries.

Automating the observability flywheel

The future of AI engineering points toward automating the debugging process itself. Through products like Arize Phoenix and the enterprise-grade Arize AX, the goal is to create a self-correcting loop. Arize recently introduced Alex, an AI system designed to scan traces and surface errors or latency issues autonomously. This shift suggests a world where engineers no longer manually pick their evaluations; instead, an AI with access to the system's traces creates and runs them on the fly.
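
To make "scanning traces for errors or latency issues" concrete, here is a minimal rule-based sketch. It does not describe how Alex works, since the source gives no detail on its internals. The `Span` record, the `scan_trace` function, and the 2000 ms latency budget are all assumptions chosen for illustration. Real OpenTelemetry spans carry many more fields.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record for illustration only.
    name: str
    duration_ms: float
    status: str  # "OK" or "ERROR"


def scan_trace(spans: list[Span], latency_budget_ms: float = 2000.0) -> list[str]:
    """Flag spans that errored or exceeded the latency budget."""
    findings = []
    for span in spans:
        if span.status == "ERROR":
            findings.append(f"{span.name}: error")
        if span.duration_ms > latency_budget_ms:
            findings.append(f"{span.name}: slow ({span.duration_ms:.0f} ms)")
    return findings


trace = [
    Span("plan", 420.0, "OK"),
    Span("tool:search", 3150.0, "OK"),
    Span("tool:lookup_order", 95.0, "ERROR"),
]
for finding in scan_trace(trace):
    print(finding)
# prints:
# tool:search: slow (3150 ms)
# tool:lookup_order: error
```

A fixed rule like this only catches what someone thought to encode in advance. The loop described above goes further: the system would decide which checks to create and run based on what the traces contain.
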

5 days ago

Ngo warns code cannot audit agents without telemetry first

Madura says programming with DSPy ends the era of prompt engineering

SallyAnn DeLucia builds prompt learning loops to kill static instructions