Welcome to session 40! We have Gautham Muthukumar continuing his deep dive on evaluations for product managers.
Gautham works on Applied AI Products, GenAI, ML & Evals Platforms at Microsoft and previously at Intuit.
Resources
Here are the notes from the session:
Key Takeaways
- Data is the Foundation: Secure high-quality, statistically significant datasets from production logs, domain experts, or synthetic generation. Ensure data diversity (e.g., via pairwise dissimilarity) to prevent evaluation bias.
- Human-in-the-Loop (HITL) is the Blueprint: Use HITL to discover an agent's unique failure modes. "Open coding" identifies specific errors, while "axial coding" clusters them into patterns that can be automated.
- Automate with Judge LLMs: Convert identified error patterns into Judge LLMs for scalable, automated evaluation. Crucially, calibrate these judges against human experts (targeting 80-90% correlation) to ensure they accurately reflect human judgment.
- Monitor for Drift: Continuously monitor production for performance drift. Investigate only statistically significant changes, and analyze both output (e.g., accuracy) and input (e.g., user intent) to diagnose root causes.
Topics Covered
The Data Challenge
- High-quality data is the prerequisite for effective evaluation.
- Data Sources:
- Production Logs: The best source for real-world user requests and responses.
- Domain Experts: Essential for generating "golden" datasets with deep, nuanced knowledge not found in public data.
- Synthetic Data: Generate data from a few examples using an LLM (e.g., via Langfuse's UI or SDK).
- Data Quality Requirements:
- Privacy: Anonymize PII (e.g., names, credit card numbers) to mitigate privacy risk and keep sensitive values out of evaluation datasets.
- Expert Review: Have a domain expert validate all datasets, especially synthetic ones.
- Diversity: Ensure the dataset covers a wide range of scenarios using pairwise dissimilarity metrics.
- Statistical Significance: Use a sample-size calculator to determine how many examples you need for trustworthy metrics (typically 100-250).
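The pairwise-dissimilarity idea above can be sketched as follows. This is a minimal illustration, not the method from the session: it uses word-set Jaccard dissimilarity as a cheap stand-in for the embedding-based distance a real setup would use, and the function names and sample questions are hypothetical.

```python
from itertools import combinations

def jaccard_dissimilarity(a: str, b: str) -> float:
    """1 minus Jaccard similarity over word sets (crude proxy for semantic distance)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(sa & sb) / len(sa | sb)

def mean_pairwise_dissimilarity(texts: list[str]) -> float:
    """Average dissimilarity over all pairs; low values signal a redundant dataset."""
    pairs = list(combinations(texts, 2))
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)

diverse = [
    "How do I reset my password?",
    "What is the refund policy for annual plans?",
    "Can the agent export reports to CSV?",
]
redundant = [
    "How do I reset my password?",
    "How can I reset my password?",
    "How do I reset the password?",
]

# A diverse dataset scores higher on mean pairwise dissimilarity
assert mean_pairwise_dissimilarity(diverse) > mean_pairwise_dissimilarity(redundant)
```

In practice you would swap the Jaccard function for cosine distance over embeddings, but the aggregation logic stays the same.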
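The "100-250 examples" guidance falls out of the standard sample-size formula for estimating a proportion. A hedged sketch (the 95% z-value and worst-case p = 0.5 are conventional defaults, not figures from the session):

```python
import math

def sample_size(margin_of_error: float, z: float = 1.96, p: float = 0.5) -> int:
    """Minimum n to estimate a pass rate within +/- margin_of_error at ~95% confidence.

    Uses n = z^2 * p * (1 - p) / e^2 with the worst-case variance p = 0.5.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# A +/-10% margin needs ~97 examples; tightening to +/-6.5% needs ~228,
# which is roughly where the 100-250 range comes from.
assert sample_size(0.10) == 97
assert sample_size(0.065) == 228
```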
Human-in-the-Loop (HITL) Evaluation
- HITL is the foundational process for discovering an agent's specific failure modes.
- Process:
- Run the agent on a golden dataset.
- Measure performance (e.g., correctness rate).
- Analyze errors to find root causes.
- Iterate on the agent (prompts, models, tools).
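The run/measure/analyze steps above can be sketched as a single loop. This is a toy harness, not a production framework: the agent is assumed to be any callable, and exact-match correctness stands in for whatever scoring your domain needs.

```python
def evaluate_agent(agent, golden_dataset):
    """One HITL iteration: run the agent, measure correctness, collect errors."""
    errors = []
    correct = 0
    for example in golden_dataset:
        output = agent(example["input"])  # hypothetical agent callable
        if output == example["expected"]:
            correct += 1
        else:
            # Failed traces feed the open/axial coding analysis
            errors.append({"input": example["input"],
                           "output": output,
                           "expected": example["expected"]})
    return correct / len(golden_dataset), errors

# Toy agent and dataset to exercise the loop
dataset = [{"input": "2+2", "expected": "4"},
           {"input": "capital of France", "expected": "Paris"}]
rate, errs = evaluate_agent(lambda q: "4" if q == "2+2" else "Lyon", dataset)
assert rate == 0.5 and len(errs) == 1
```

The `errors` list is the raw material for the error-analysis step that follows.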
- Error Analysis Framework:
- Open Coding: Manually review individual errors and label the exact cause, without forcing them into predefined categories.
- Axial Coding: Cluster the open-coded errors into broad categories (e.g., "tool call errors," "lengthy answers").
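The two coding passes reduce to a labeling step and a grouping step. A minimal sketch, where the specific labels and the label-to-category mapping are invented for illustration:

```python
from collections import Counter

# Open coding: each failed trace gets a specific, unbiased label
open_codes = [
    "called search tool with empty query",
    "answer restated the question before responding",
    "called search tool with wrong date format",
    "answer was three paragraphs for a yes/no question",
]

# Axial coding: cluster the specific labels into broad categories
# (hypothetical mapping; in practice a human or an LLM proposes the clusters)
axial_map = {
    "called search tool with empty query": "tool call errors",
    "called search tool with wrong date format": "tool call errors",
    "answer restated the question before responding": "lengthy answers",
    "answer was three paragraphs for a yes/no question": "lengthy answers",
}

category_counts = Counter(axial_map[code] for code in open_codes)
assert category_counts == {"tool call errors": 2, "lengthy answers": 2}
```

The resulting categories are exactly the candidates for automation with a Judge LLM.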
Automating with Judge LLMs
- Judge LLMs automate the evaluation of specific error patterns identified via HITL.
- Critical Step - Calibration:
- Ensure the Judge LLM's scores align with human expert judgment.
- Target: 80-90% correlation. If lower, refine the Judge LLM's prompt or model.
- After calibration, Judge LLMs can run on live traces to flag issues like conciseness, even without a reference answer.
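Calibration in its simplest form is an agreement check between judge and human labels on the same examples. A sketch with invented pass/fail labels (real calibration might also use correlation coefficients like Cohen's kappa for graded scores):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the Judge LLM matches the human expert."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labels on ten shared examples
human = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

rate = agreement_rate(judge, human)
assert rate == 0.9  # within the 80-90% calibration target; below it, refine the judge prompt
```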
Drift Detection & Fixing
- Monitor evaluation metrics for statistically significant changes.
- Investigate both output drift and input drift (e.g., changes in user intent) to diagnose root causes.
- Fixing Techniques:
- Model/Prompt: Test different models or tune prompt parameters.
- Tools: Verify tool accuracy and ensure tool schemas are unambiguous.
- RAG: Refine the retrieval process.
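"Investigate only statistically significant changes" can be made concrete with a two-proportion z-test on pass rates between two time windows. A sketch with invented numbers (500 traces per week is an assumption, and 1.96 is the conventional 5% threshold):

```python
import math

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z-statistic comparing two pass rates, using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Last week: 92% correct on 500 traces; this week: 85% on 500 traces
z = two_proportion_z(0.92, 500, 0.85, 500)
assert abs(z) > 1.96  # significant at the 5% level -> investigate the drift
```

A drop that fails this test is likely noise; only significant drops warrant the root-cause analysis of output and input drift described above.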
Here's the entire recording of the session.