
Session 37: Evals - A Primer

A comprehensive demo of creating evals for production

Welcome to session 37! We have Pavan Mantha back this week to take us through a comprehensive session on Evals. Pavan is the Distinguished AI Architect - Data, AI and GenAI @Equal, a researcher from IIIT Hyderabad, and an open-source contributor. Here are the resources discussed and shared during the session.

Resources discussed in the video: Repo: https://github.com/pavanjava/genai_evals (the conversational agents eval will be shared later)

Here are the notes from the meeting:

Notes: A primer on evaluating AI systems (LLMs, RAG, agents).

Key Takeaways

  • Eval Frameworks: Use open-source frameworks like Ragas and DeepEval for standard metrics (relevancy, faithfulness) to get a quick start and build confidence; see the quick-start sketch after this list.
  • Retrieval is Paramount: Prioritize evaluating the retrieval phase in RAG. A poor context (e.g., low context utilization) guarantees a poor generation, making it the primary bottleneck.
  • Synthetic Data for Retrieval: Synthetic datasets are effective for tuning RAG retrieval parameters but are unreliable for evaluating generation, which requires human judgment.
  • Live Monitoring is Essential: Integrate evals directly into your application's runtime. This provides real-time performance data, enabling rapid detection and diagnosis of model drift or degradation.
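
For orientation, here is a minimal quick-start sketch using DeepEval, the framework used later in the demo. The question, answer, context strings, and thresholds are invented for illustration; DeepEval also needs an eval model configured (by default an OpenAI model, or a local model wired in as in the session).

```python
# Minimal DeepEval quick start: score one RAG response for relevancy and faithfulness.
# The strings below are made-up examples; thresholds are arbitrary starting points.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=[
        "Refunds for annual subscriptions are available within 30 days of the purchase date."
    ],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),  # is the answer on-topic for the query?
    FaithfulnessMetric(threshold=0.7),     # is the answer grounded in the retrieved context?
]

# Runs each metric (LLM-as-judge under the hood) and reports pass/fail with reasons.
evaluate(test_cases=[test_case], metrics=metrics)
```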

Topics

The Challenge of AI Evaluation

  • Core Problem: Evaluating AI systems is complex because performance is highly use-case dependent. There is no single "best way."
  • Evaluation Categories:
    • Prompt Evals: The foundation for all AI systems; ensures prompts are robust and reliable.
    • RAG & Bot Evals: Focus on the two main components: retrieval and generation.
    • Agent Evals: Varies based on agent type (conversational vs. non-conversational).

Data for Evaluation

  • Ground Truth: Essential for reliable evaluation.
    • SME-Generated: High-quality, domain-specific data from experts.
    • Synthetic: Auto-generated data for initial testing.
  • Conversation Types:
    • Single-Turn: Q&A pairs (e.g., customer support tickets).
    • Multi-Turn: Continuous conversations with context and interruptions (e.g., voice agents).
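
To make the two conversation types concrete, here is an illustrative sketch of the data shapes using plain dataclasses (not any particular framework's API); the field names and example content are invented.

```python
# Illustrative data shapes: a single-turn ground-truth record vs. a multi-turn conversation.
from dataclasses import dataclass, field

@dataclass
class SingleTurnExample:
    question: str
    expected_answer: str          # SME-written or synthetically generated ground truth
    reference_context: list[str]  # passages the answer should be grounded in

@dataclass
class Turn:
    role: str                     # "user" or "agent"
    content: str

@dataclass
class MultiTurnExample:
    turns: list[Turn] = field(default_factory=list)  # full conversation, incl. interruptions

ticket = SingleTurnExample(
    question="How do I reset my password?",
    expected_answer="Use the 'Forgot password' link on the login page.",
    reference_context=["Password resets are self-service via the 'Forgot password' link."],
)

call = MultiTurnExample(turns=[
    Turn("user", "Hi, I need to change my flight."),
    Turn("agent", "Sure, which booking reference is it?"),
    Turn("user", "Actually, wait, first tell me the change fee."),  # interruption / topic shift
])
```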

RAG Evaluation Metrics

  • Retrieval Phase (Context):
    • Context Relevancy: How relevant is the retrieved context to the query?
    • Context Recall: How much of the relevant information was retrieved?
    • Context Utilization: What percentage of the retrieved context was actually used in the answer?
      • Significance: A low score (e.g., less than 80%) indicates the LLM is ignoring context, which is a strong predictor of hallucination.
  • Generation Phase (Answer):
    • Answer Relevancy: How relevant is the generated answer to the query?
    • Faithfulness: Is the answer grounded in the provided context?
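
Because retrieval is the primary bottleneck, it helps to score the retrieval phase on its own. Below is a hedged sketch using DeepEval's contextual metrics; the strings and thresholds are illustrative, and note that contextual recall needs a ground-truth expected_output.

```python
# Sketch: score the retrieval phase separately from generation with DeepEval's contextual metrics.
from deepeval.metrics import ContextualRelevancyMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

case = LLMTestCase(
    input="Which regions does the service cover?",
    actual_output="The service is available in the EU and North America.",
    expected_output="The service covers the EU and North America.",  # ground truth, needed for recall
    retrieval_context=[
        "Coverage includes all EU member states and North America.",
        "Pricing is billed monthly in local currency.",               # partially irrelevant chunk
    ],
)

relevancy = ContextualRelevancyMetric(threshold=0.7)  # how much of the retrieved context matters for the query
recall = ContextualRecallMetric(threshold=0.7)        # how much ground-truth information the retriever surfaced

for metric in (relevancy, recall):
    metric.measure(case)
    print(type(metric).__name__, round(metric.score, 2), metric.reason)
```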

Agent Evaluation Metrics

  • Non-Conversational Agents (Workflows):
    • Initialization Performance: Resource usage (RAM, CPU) during startup.
    • Tool Reliability: How consistently the agent calls the correct tool.
      • Insight: Reliability drops significantly when an agent has more than five tools; a single-tool-per-agent design is more robust (see the tool-reliability sketch after this list).
  • Conversational Agents (Chat/Voice):
    • Turn Relevancy: How relevant is each turn in the conversation?
    • Turn Faithfulness: Is each turn grounded in the conversation history?
    • VAD Accuracy: For voice agents, the accuracy of Voice Activity Detection (VAD) is critical for managing interruptions.
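
As referenced above, here is a framework-agnostic sketch of measuring tool reliability. `run_agent` is a hypothetical callable that takes a prompt and returns the names of the tools the agent invoked; the test prompts and expected tools are invented.

```python
# Illustrative tool-reliability harness: how often does the agent call the expected tool?
from collections import Counter

test_prompts = [
    {"prompt": "What's the weather in Pune tomorrow?", "expected_tool": "weather_lookup"},
    {"prompt": "Convert 100 USD to INR.",              "expected_tool": "currency_converter"},
    {"prompt": "Summarise the latest support ticket.", "expected_tool": "ticket_reader"},
]

def tool_reliability(run_agent, cases, trials: int = 5) -> float:
    """Fraction of (case, trial) runs in which the expected tool was among those called."""
    hits, total = 0, 0
    misses = Counter()
    for case in cases:
        for _ in range(trials):                      # repeat to capture non-determinism in tool selection
            called = run_agent(case["prompt"])       # e.g. ["weather_lookup"]
            total += 1
            if case["expected_tool"] in called:
                hits += 1
            else:
                misses[case["expected_tool"]] += 1
    print("most-missed tools:", misses.most_common(3))
    return hits / total
```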

Demo: Live RAG & Conversational Evals

  • RAG Pipeline (VLAC Stack):
    • V: vLLM (local LLM inference for privacy)
    • L: LlamaIndex (data framework)
    • A: Agno (agent framework)
    • C: Chunky (chunking library)
    • Q: Qdrant (vector store)
  • RAG Agent Demo:
    • Ingestion: PDF → Images → Markdown → Chunks (74) → Qdrant.
    • Agent: An Agno agent used a Claude Sonnet model and a custom Qdrant tool.
    • Live Evals: DeepEval measured context relevancy (0.87), answer relevancy (0.85), and faithfulness (0.66), using a local Granite 3B model as the eval judge.
  • Conversational Agent Demo:
    • Framework: PipeCat (chosen for its simpler learning curve).
    • Pipeline: Speech (user) → STT → LLM → TTS → Speech (agent).
    • Live Evals: DeepEval triggered on each full turn, logging turn relevancy and faithfulness scores in real-time.
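
Here is a sketch of the "live evals" pattern from the demo: score every response in the serving path and log the results. The `rag_agent.run` interface and the logging sink are assumptions; the metrics are DeepEval built-ins.

```python
# Sketch: wire evals into the serving path so every response is scored as it is produced.
import logging
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

logger = logging.getLogger("live_evals")

def answer_with_live_eval(rag_agent, query: str) -> str:
    # Hypothetical agent interface returning the answer plus the chunks it retrieved.
    response, retrieved_chunks = rag_agent.run(query)

    case = LLMTestCase(input=query, actual_output=response, retrieval_context=retrieved_chunks)
    for metric in (AnswerRelevancyMetric(), FaithfulnessMetric()):
        metric.measure(case)
        # Scores land next to each request, so drift or degradation shows up in dashboards immediately.
        logger.info("%s=%.2f", type(metric).__name__, metric.score)

    return response
```

In production you would typically run the metric calls off the hot path (for example in a background task), so eval latency does not add to response time.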

Best Practices & Advanced Concepts

  • LLM as a Judge: Use a different LLM family for evaluation than for generation (e.g., Claude for generation, Granite for evals). This reduces bias.
  • Mixture of Experts (MoE): Use multiple eval agents to score an answer and ensemble the results for a more robust evaluation.
  • Custom Evals: Standard metrics are a start, but custom evals are often required for specific business needs (e.g., PII masking, sentiment analysis); see the custom-metric sketch after this list.
    • Example: A "volatility metric" can measure the consistency of user-provided data (e.g., name) to determine its trustworthiness.
  • Evaluating Multi-Turn Agents:
    • Threshold: Start evals only after a certain number of turns (e.g., 10) to allow the conversation to build context.
    • Growth Curve: Expect eval scores to improve as the conversation length increases, demonstrating effective context building.
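
A sketch of a custom, business-specific metric using DeepEval's GEval, with the judge drawn from a different model family than the generator. The PII-masking criteria wording, judge model name, and example strings are assumptions for illustration, not from the session.

```python
# Sketch: custom business metric via GEval, judged by a model family other than the generator's.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

pii_masking = GEval(
    name="PII Masking",
    criteria=(
        "The response must not reveal personally identifiable information "
        "(names, phone numbers, emails, account IDs) present in the conversation."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",   # judge chosen from a different family than the generating model
    threshold=0.8,
)

case = LLMTestCase(
    input="What did the customer say their phone number was?",
    actual_output="I can't share that here, but I've noted it on the ticket as ***-***-1234.",
)
pii_masking.measure(case)
print(pii_masking.score, pii_masking.reason)
```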

Here's the entire recording of the session.