Welcome to session 40! We have Gautham Muthukumar continuing his deep dive on evaluations for product managers.
Gautham works on Applied AI Products, GenAI, ML & Evals Platforms at Microsoft and previously at Intuit.
Resources
Here are the notes from the session:
Key Takeaways
- Data is the Foundation: Secure high-quality, statistically significant datasets from production logs, domain experts, or synthetic generation. Ensure data diversity (e.g., via pairwise dissimilarity) to prevent evaluation bias.
- Human-in-the-Loop (HITL) is the Blueprint: Use HITL to discover an agent's unique failure modes. "Open coding" identifies specific errors, while "axial coding" clusters them into patterns that can be automated.
- Automate with Judge LLMs: Convert identified error patterns into Judge LLMs for scalable, automated evaluation. Crucially, calibrate these judges against human experts (targeting 80-90% correlation) to ensure they accurately reflect human judgment.
- Monitor for Drift: Continuously monitor production for performance drift. Investigate only statistically significant changes, and analyze both output (e.g., accuracy) and input (e.g., user intent) to diagnose root causes.
Topics Covered
The Data Challenge
- High-quality data is the prerequisite for effective evaluation.
- Data Sources:
- Production Logs: The best source for real-world user requests and responses.
- Domain Experts: Essential for generating "golden" datasets with deep, nuanced knowledge not found in public data.
- Synthetic Data: Generate data from a few examples using an LLM (e.g., via Langfuse's UI or SDK).
- Data Quality Requirements:
- Privacy: Anonymize PII (e.g., names, credit card numbers) to mitigate privacy risk and keep sensitive values out of evaluation datasets.
- Expert Review: Have a domain expert validate all datasets, especially synthetic ones.
- Diversity: Ensure the dataset covers a wide range of scenarios using pairwise dissimilarity metrics.
- Statistical Significance: Use a sample-size calculator to determine how many examples you need for trustworthy metrics (typically 100-250).
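The pairwise-dissimilarity idea above can be sketched as follows. This is a minimal illustration, not the method from the session: it uses word-set Jaccard dissimilarity as a cheap stand-in for the embedding-based distance a real setup would use, and the function names and sample questions are hypothetical.

```python
from itertools import combinations

def jaccard_dissimilarity(a: str, b: str) -> float:
    """1 minus Jaccard similarity over word sets (crude proxy for semantic distance)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(sa & sb) / len(sa | sb)

def mean_pairwise_dissimilarity(texts: list[str]) -> float:
    """Average dissimilarity over all pairs; low values signal a redundant dataset."""
    pairs = list(combinations(texts, 2))
    return sum(jaccard_dissimilarity(a, b) for a, b in pairs) / len(pairs)

diverse = [
    "How do I reset my password?",
    "What is the refund policy for annual plans?",
    "Can the agent export reports to CSV?",
]
redundant = [
    "How do I reset my password?",
    "How can I reset my password?",
    "How do I reset the password?",
]

# A diverse dataset scores higher on mean pairwise dissimilarity
assert mean_pairwise_dissimilarity(diverse) > mean_pairwise_dissimilarity(redundant)
```

In practice you would swap the Jaccard function for cosine distance over embeddings, but the aggregation logic stays the same.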
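The "100-250 examples" guidance falls out of the standard sample-size formula for estimating a proportion. A hedged sketch (the 95% z-value and worst-case p = 0.5 are conventional defaults, not figures from the session):

```python
import math

def sample_size(margin_of_error: float, z: float = 1.96, p: float = 0.5) -> int:
    """Minimum n to estimate a pass rate within +/- margin_of_error at ~95% confidence.

    Uses n = z^2 * p * (1 - p) / e^2 with the worst-case variance p = 0.5.
    """
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# A +/-10% margin needs ~97 examples; tightening to +/-6.5% needs ~228,
# which is roughly where the 100-250 range comes from.
assert sample_size(0.10) == 97
assert sample_size(0.065) == 228
```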
Human-in-the-Loop (HITL) Evaluation
- HITL is the foundational process for discovering an agent's specific failure modes.
- Process:
- Run the agent on a golden dataset.
- Measure performance (e.g., correctness rate).
- Analyze errors to find root causes.
- Iterate on the agent (prompts, models, tools).
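The run/measure/analyze steps above can be sketched as a single loop. This is a toy harness, not a production framework: the agent is assumed to be any callable, and exact-match correctness stands in for whatever scoring your domain needs.

```python
def evaluate_agent(agent, golden_dataset):
    """One HITL iteration: run the agent, measure correctness, collect errors."""
    errors = []
    correct = 0
    for example in golden_dataset:
        output = agent(example["input"])  # hypothetical agent callable
        if output == example["expected"]:
            correct += 1
        else:
            # Failed traces feed the open/axial coding analysis
            errors.append({"input": example["input"],
                           "output": output,
                           "expected": example["expected"]})
    return correct / len(golden_dataset), errors

# Toy agent and dataset to exercise the loop
dataset = [{"input": "2+2", "expected": "4"},
           {"input": "capital of France", "expected": "Paris"}]
rate, errs = evaluate_agent(lambda q: "4" if q == "2+2" else "Lyon", dataset)
assert rate == 0.5 and len(errs) == 1
```

The `errors` list is the raw material for the error-analysis step that follows.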
- Error Analysis Framework:
- Open Coding: Manually review individual errors and label the exact cause, without forcing them into predefined categories.
- Axial Coding: Cluster the open-coded errors into broad categories (e.g., "tool call errors," "lengthy answers").
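The two coding passes reduce to a labeling step and a grouping step. A minimal sketch, where the specific labels and the label-to-category mapping are invented for illustration:

```python
from collections import Counter

# Open coding: each failed trace gets a specific, unbiased label
open_codes = [
    "called search tool with empty query",
    "answer restated the question before responding",
    "called search tool with wrong date format",
    "answer was three paragraphs for a yes/no question",
]

# Axial coding: cluster the specific labels into broad categories
# (hypothetical mapping; in practice a human or an LLM proposes the clusters)
axial_map = {
    "called search tool with empty query": "tool call errors",
    "called search tool with wrong date format": "tool call errors",
    "answer restated the question before responding": "lengthy answers",
    "answer was three paragraphs for a yes/no question": "lengthy answers",
}

category_counts = Counter(axial_map[code] for code in open_codes)
assert category_counts == {"tool call errors": 2, "lengthy answers": 2}
```

The resulting categories are exactly the candidates for automation with a Judge LLM.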
Automating with Judge LLMs
- Judge LLMs automate the evaluation of specific error patterns identified via HITL.
- Critical Step - Calibration:
- Ensure the Judge LLM's scores align with human expert judgment.
- Target: 80-90% correlation. If lower, refine the Judge LLM's prompt or model.
- After calibration, Judge LLMs can run on live traces to flag issues like conciseness, even without a reference answer.
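Calibration in its simplest form is an agreement check between judge and human labels on the same examples. A sketch with invented pass/fail labels (real calibration might also use correlation coefficients like Cohen's kappa for graded scores):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the Judge LLM matches the human expert."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Hypothetical labels on ten shared examples
human = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

rate = agreement_rate(judge, human)
assert rate == 0.9  # within the 80-90% calibration target; below it, refine the judge prompt
```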
Drift Detection & Fixing
- Monitor evaluation metrics for statistically significant changes.
- Investigate both output drift and input drift (e.g., changes in user intent) to diagnose root causes.
- Fixing Techniques:
- Model/Prompt: Test different models or tune prompt parameters.
- Tools: Verify tool accuracy and ensure tool schemas are unambiguous.
- RAG: Refine the retrieval process.
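"Investigate only statistically significant changes" can be made concrete with a two-proportion z-test on pass rates between two time windows. A sketch with invented numbers (500 traces per week is an assumption, and 1.96 is the conventional 5% threshold):

```python
import math

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z-statistic comparing two pass rates, using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Last week: 92% correct on 500 traces; this week: 85% on 500 traces
z = two_proportion_z(0.92, 500, 0.85, 500)
assert abs(z) > 1.96  # significant at the 5% level -> investigate the drift
```

A drop that fails this test is likely noise; only significant drops warrant the root-cause analysis of output and input drift described above.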
Here's the entire recording of the session.