Session 12: Introduction to AI Evals

How to apply Evals to test your AI Agents

Hey, Bala again. In this session, Gautham Muthukumar from Microsoft takes us through AI evals.

Key Takeaways

  • Evals are critical for assessing AI model/agent performance and continuously improving them
  • Key eval types: benchmarks, programmatic evaluations, system performance metrics, human-in-the-loop, and LLM as judge
  • Tools like Langfuse provide observability and eval capabilities out-of-the-box
  • Evals should be precise yet continuously updated as models and use cases evolve

Topics

Importance and Complexity of Evals

  • Evals are crucial for determining AI product success, considered "the real moat" by industry leaders
  • Evals cover general evaluations, observability metrics, and responsible AI aspects
  • Agent evals are more complex than traditional software/ML testing due to their non-deterministic, multi-path nature

Types of Evals

  • Benchmarks: Standard datasets (e.g. MMLU, GPQA) or custom synthetic data with golden answers
  • Programmatic evaluations: Simple automated checks (e.g. SQL syntax, summary length); a sketch of this and the LLM-as-judge approach follows this list
  • System performance: Latency, cost, error rates
  • Human-in-the-loop: Expert review of agent outputs
  • LLM as judge: Using another LLM to evaluate outputs based on rubrics
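
To make two of these concrete, here is a minimal Python sketch of a programmatic evaluation (a summary-length check) and an LLM-as-judge call. The rubric wording, score scale, pass threshold, and model name are illustrative assumptions rather than anything shown in the session; the judge uses the OpenAI chat completions client purely as an example backend.

```python
# Minimal sketches of two eval types: a programmatic check and an LLM-as-judge call.
# Assumptions (not from the session): the rubric wording, the 1-5 scale, the pass
# threshold, and the illustrative model name.

from openai import OpenAI


def eval_summary_length(summary: str, max_words: int = 150) -> dict:
    """Programmatic evaluation: check that a summary stays within a word budget."""
    n_words = len(summary.split())
    return {"metric": "summary_length", "value": n_words, "passed": n_words <= max_words}


JUDGE_RUBRIC = (
    "You are grading an AI assistant's answer for faithfulness to the provided context. "
    "Score it from 1 (unsupported) to 5 (fully supported). Reply with a single integer only."
)


def eval_llm_judge(question: str, context: str, answer: str, model: str = "gpt-4o-mini") -> dict:
    """LLM-as-judge evaluation: a second model scores the answer against a rubric."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {
                "role": "user",
                "content": f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:\n{answer}",
            },
        ],
    )
    score = int(response.choices[0].message.content.strip())
    return {"metric": "faithfulness_judge", "value": score, "passed": score >= 4}
```

Checks like these can run over a benchmark or synthetic dataset offline, or be attached to live traces as online evals.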

Eval Tools Demo (Langfuse)

  • Provides observability metrics, trace visualization, and eval capabilities
  • Allows creating datasets, experiments, human annotations, and LLM judges
  • Integrates with notebooks/code via API keys and tracing libraries (see the tracing sketch below)
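
The session walked through Langfuse in its UI; as a rough companion, here is a minimal sketch of hooking a script into Langfuse tracing and attaching a score. It assumes the v2-style Python SDK (`langfuse.decorators`); newer SDK versions expose a different, OpenTelemetry-based interface, so treat the exact imports and method names as version-dependent and check the Langfuse docs.

```python
# Minimal Langfuse tracing sketch. Assumes the v2-style Python SDK
# (`pip install langfuse`); newer SDK versions use a different interface.

import os

from langfuse.decorators import langfuse_context, observe

# API keys come from the Langfuse project settings page.
os.environ.setdefault("LANGFUSE_PUBLIC_KEY", "pk-lf-...")
os.environ.setdefault("LANGFUSE_SECRET_KEY", "sk-lf-...")
os.environ.setdefault("LANGFUSE_HOST", "https://cloud.langfuse.com")


@observe()  # records this function call as a trace in Langfuse
def answer_question(question: str) -> str:
    # Placeholder for the real agent/LLM call being observed.
    answer = f"Stub answer to: {question}"
    # Attach an eval score (e.g. from a programmatic check or an LLM judge) to the trace.
    langfuse_context.score_current_trace(name="length_check", value=1.0)
    return answer


if __name__ == "__main__":
    print(answer_question("What is an eval?"))
    langfuse_context.flush()  # send any buffered events before the script exits
```

From there, the Langfuse UI can group traces into datasets, run experiments, and layer human annotations or LLM judges on top.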

Best Practices

  • Combine offline and online evals
  • Be precise in metrics, but expect to evolve evals over time
  • Consider implicit user feedback (e.g. follow-up questions) in addition to explicit ratings; a rough sketch follows this list
  • For subjective answers, lean on majority customer opinion and apply personalization where individual preferences differ
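
One way to act on the implicit-feedback point: the hypothetical sketch below treats a quick follow-up that largely rephrases the original question as a signal that the first answer missed. The similarity heuristic, the threshold, and the `record_score` sink are illustrative assumptions, not something prescribed in the session.

```python
# Hypothetical sketch: turn implicit user feedback into an eval signal.
# The rephrase heuristic, threshold, and record_score sink are illustrative only.

from difflib import SequenceMatcher


def record_score(trace_id: str, name: str, value: float) -> None:
    # Stand-in for sending the score to your observability/eval store (e.g. Langfuse).
    print(f"trace={trace_id} {name}={value}")


def score_implicit_feedback(trace_id: str, question: str, follow_up: str | None) -> None:
    """Log implicit feedback: a near-duplicate follow-up suggests the first answer missed."""
    if follow_up is None:
        record_score(trace_id, "implicit_feedback", 1.0)  # no follow-up: treat as satisfied
        return
    similarity = SequenceMatcher(None, question.lower(), follow_up.lower()).ratio()
    # A very similar follow-up likely means the user is re-asking the same thing.
    record_score(trace_id, "implicit_feedback", 0.0 if similarity > 0.8 else 0.5)


score_implicit_feedback("trace-123", "How do I reset my password?", "How can I reset my password?")
```

Signals like this complement explicit thumbs-up/down ratings, which users often skip.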

Next Steps

  • Explore the Hugging Face agents course and the Weights & Biases course on app evaluation
  • Review the shared YouTube video on setting up evals
  • Investigate Langfuse documentation for more advanced usage (e.g. tagging)
  • Continue evolving evals as models and use cases develop

Here's the entire recording of the session.