
Session 38: Evals for Product Managers

Session recording: Evals for Product Managers with Gautham Muthukumar

Welcome to session 38! We have Gautham Muthukumar joining us this week.

Gautham is a product leader at Intuit specializing in Gen AI and evals, with previous experience at Microsoft. He brings deep expertise in building and evaluating AI systems at scale.

Here are the notes from the meeting:

Meeting Purpose

To introduce the complexities of evaluating AI agents and the role of observability.

Key Takeaways

  • Agent evals are harder than traditional software evals due to non-deterministic outputs, hidden multi-step trajectories, and subjective quality metrics.
  • Observability is the required precursor to evaluation. Tools like Langfuse trace the agent's internal "thought process" (trajectory), which is essential for debugging and creating meaningful metrics.
  • A hybrid eval strategy is best, combining scalable Automated Evals (code-based similarity, LLM judges), high-quality Human-in-the-Loop Evals, and real-world End-User Feedback.
  • A continuous feedback loop is critical: Promote problematic production traces into the "golden dataset" to improve offline batch evals and prevent performance drift.

Topics Covered

The Challenge of Evaluating AI Agents

  • Multi-step Failure Points: Failures can occur at any stage of an agent's internal process, from intent understanding to tool calls and final response generation.
  • Hidden Trajectories: The agent's internal reasoning and tool-use sequence is a black box without proper tracing.
  • Subjective Quality: "Good" answers lack a single, objective ground truth, unlike traditional software.
  • Non-Deterministic Behavior: Generative models produce varied outputs for identical inputs, making simple pass/fail tests ineffective (see the similarity-check sketch after this list).
  • Subtle Performance Drift: Unlike sudden crashes, agent quality often degrades slowly over time, requiring continuous monitoring.
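
Because identical inputs produce different outputs, a useful check scores the agent's answer against a reference instead of asserting exact equality. Here's a minimal sketch of that idea; `call_agent`, the lexical similarity metric, and the 0.8 threshold are illustrative assumptions, not anything shown in the session.

```python
# Sketch: replace exact-match assertions with a graded similarity score.
# `call_agent` is a hypothetical function that invokes the agent under test.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap lexical similarity in [0, 1]; real setups often use embedding cosine similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def check_response(question: str, reference: str, call_agent) -> dict:
    answer = call_agent(question)          # non-deterministic: varies between runs
    score = similarity(answer, reference)  # graded score instead of strict equality
    return {"answer": answer, "score": score, "passed": score >= 0.8}
```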

Observability: The Foundation for Evals

  • Observability is the prerequisite for evaluation. It provides the data needed to understand why an agent produced a specific output.
  • Three Pillars of Observability:
    • Logs: Timestamped event records.
    • Traces: A map of the agent's full trajectory, connecting logs into a sequence of events.
    • Evaluations: Metrics derived from the trace data.
  • Langfuse Demo: Traces agent trajectories, showing each step (span) with inputs, outputs, latency, and cost.
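
To make the demo concrete, here is a minimal tracing sketch assuming the Langfuse Python SDK's `@observe` decorator (shown with the newer top-level import; older SDK versions import it from `langfuse.decorators`). The two functions are hypothetical agent steps, not the ones from the demo.

```python
# Each @observe-decorated call becomes a span on the trace, with its inputs,
# outputs, and latency captured automatically. Requires LANGFUSE_PUBLIC_KEY
# and LANGFUSE_SECRET_KEY (and host) set in the environment.
from langfuse import observe

@observe()  # nested call -> child span (e.g., a tool call)
def lookup_policy(query: str) -> str:
    return "Refunds are processed within 5 business days."  # stand-in for a real tool

@observe()  # outermost call -> the trace itself
def answer_question(question: str) -> str:
    policy = lookup_policy(question)
    return f"Based on our policy: {policy}"

if __name__ == "__main__":
    print(answer_question("How long do refunds take?"))
    # Short-lived scripts may need to flush the Langfuse client before exiting
    # so the trace is actually sent.
```

Once traces like this are flowing, each span's inputs and outputs become the raw material for the eval metrics below.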

Evaluation Strategies

  • Three Pillars of Eval Metrics:
    • Efficiency: Operational metrics (latency, cost, token usage).
    • Effectiveness: Quality metrics (accuracy, helpfulness, intent understanding).
    • Trustworthiness: Safety and alignment (guardrails, PII masking).
  • Three Approaches to Performing Evals:
    • Automated Evals: Code-based statistical methods and LLM-as-a-judge (see the judge sketch after this list).
    • Human-in-the-Loop Evals: Domain experts evaluate agent outputs.
    • End-User Feedback: Explicit signals (thumbs up/down) and implicit signals.
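
The LLM-as-a-judge approach can be as small as a rubric prompt plus a structured score. This is a hedged sketch, not the setup from the session: the rubric, the 1-5 scale, and the `gpt-4o-mini` judge model are all illustrative choices.

```python
# LLM-as-a-judge sketch: a second model grades the agent's answer against a rubric.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness and accuracy from 1 (poor) to 5 (excellent).
Reply with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},  # keeps the reply parseable
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)
```

Judge scores are themselves noisy, which is why they work best alongside human-in-the-loop review and end-user feedback rather than replacing them.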

The Dev-to-Prod Feedback Loop

  • Offline Batch Evals (Dev): Uses a "golden dataset" to test agent changes.
  • Online Live Evals (Prod): Evaluates agent performance with real-time user data.
  • The Loop: Monitor production -> Promote problematic traces -> Run offline evals -> Deploy improvements.
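
As a rough sketch of how that loop might be wired up (the file name, record fields, and the `evaluate` callable, which returns True/False for one example, are all assumptions for illustration):

```python
# Sketch of the dev-to-prod loop: flagged production traces are promoted into a
# golden dataset, which offline batch evals replay against every agent change.
import json
from pathlib import Path

GOLDEN_PATH = Path("golden_dataset.jsonl")

def promote_trace(trace: dict) -> None:
    """Append a problematic production trace (input plus expected behavior) to the golden set."""
    record = {"input": trace["input"], "expected": trace["expected"]}
    with GOLDEN_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

def run_offline_evals(call_agent, evaluate) -> float:
    """Replay the golden dataset against the current agent build and return the pass rate."""
    records = [json.loads(line) for line in GOLDEN_PATH.read_text().splitlines() if line]
    passed = sum(evaluate(call_agent(r["input"]), r["expected"]) for r in records)
    return passed / len(records) if records else 0.0
```

Running this on every change is what catches the slow performance drift described earlier, before it reaches users.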

Here's the entire recording of the session.