
Session 37: Evals - A Primer

A comprehensive demo of creating evals for production

Welcome to session 37! We have Pavan Mantha back this week to take us through a comprehensive session on Evals. Pavan is the Distinguished AI Architect - Data, AI and GenAI @Equal, a researcher from IIIT Hyderabad, and an open-source contributor. Here are the resources discussed and shared during the session.

Resources discussed in the video: Repo: https://github.com/pavanjava/genai_evals (the conversational agents eval will be shared later)

Here are the notes from the meeting:

Notes: A primer on evaluating AI systems (LLMs, RAG, agents).

Key Takeaways

  • Eval Frameworks: Use open-source frameworks like Ragas and DeepEval for standard metrics (relevancy, faithfulness) to get a quick start and build confidence; see the quick-start sketch after this list.
  • Retrieval is Paramount: Prioritize evaluating the retrieval phase in RAG. A poor context (e.g., low context utilization) guarantees a poor generation, making it the primary bottleneck.
  • Synthetic Data for Retrieval: Synthetic datasets are effective for tuning RAG retrieval parameters but are unreliable for evaluating generation, which requires human judgment.
  • Live Monitoring is Essential: Integrate evals directly into your application's runtime. This provides real-time performance data, enabling rapid detection and diagnosis of model drift or degradation.
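
For orientation, here is a minimal quick-start sketch using DeepEval, the framework used later in the demo. The question, answer, context strings, and thresholds are invented for illustration; DeepEval also needs an eval model configured (by default an OpenAI model, or a local model wired in as in the session).

```python
# Minimal DeepEval quick start: score one RAG response for relevancy and faithfulness.
# The strings below are made-up examples; thresholds are arbitrary starting points.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=[
        "Refunds for annual subscriptions are available within 30 days of the purchase date."
    ],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),  # is the answer on-topic for the query?
    FaithfulnessMetric(threshold=0.7),     # is the answer grounded in the retrieved context?
]

# Runs each metric (LLM-as-judge under the hood) and reports pass/fail with reasons.
evaluate(test_cases=[test_case], metrics=metrics)
```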

Topics

The Challenge of AI Evaluation

  • Core Problem: Evaluating AI systems is complex because performance is highly use-case dependent. There is no single "best way."
  • Evaluation Categories:
    • Prompt Evals: The foundation for all AI systems; ensures prompts are robust and reliable.
    • RAG & Bot Evals: Focus on the two main components: retrieval and generation.
    • Agent Evals: Varies based on agent type (conversational vs. non-conversational).

Data for Evaluation

  • Ground Truth: Essential for reliable evaluation.
    • SME-Generated: High-quality, domain-specific data from experts.
    • Synthetic: Auto-generated data for initial testing.
  • Conversation Types:
    • Single-Turn: Q&A pairs (e.g., customer support tickets).
    • Multi-Turn: Continuous conversations with context and interruptions (e.g., voice agents).
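
To make the two conversation types concrete, here is an illustrative sketch of the data shapes using plain dataclasses (not any particular framework's API); the field names and example content are invented.

```python
# Illustrative data shapes: a single-turn ground-truth record vs. a multi-turn conversation.
from dataclasses import dataclass, field

@dataclass
class SingleTurnExample:
    question: str
    expected_answer: str          # SME-written or synthetically generated ground truth
    reference_context: list[str]  # passages the answer should be grounded in

@dataclass
class Turn:
    role: str                     # "user" or "agent"
    content: str

@dataclass
class MultiTurnExample:
    turns: list[Turn] = field(default_factory=list)  # full conversation, incl. interruptions

ticket = SingleTurnExample(
    question="How do I reset my password?",
    expected_answer="Use the 'Forgot password' link on the login page.",
    reference_context=["Password resets are self-service via the 'Forgot password' link."],
)

call = MultiTurnExample(turns=[
    Turn("user", "Hi, I need to change my flight."),
    Turn("agent", "Sure, which booking reference is it?"),
    Turn("user", "Actually, wait, first tell me the change fee."),  # interruption / topic shift
])
```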

RAG Evaluation Metrics

  • Retrieval Phase (Context):
    • Context Relevancy: How relevant is the retrieved context to the query?
    • Context Recall: How much of the relevant information was retrieved?
    • Context Utilization: What percentage of the retrieved context was actually used in the answer?
      • Significance: A low score (e.g., less than 80%) indicates the LLM is ignoring context, which is a strong predictor of hallucination.
  • Generation Phase (Answer):
    • Answer Relevancy: How relevant is the generated answer to the query?
    • Faithfulness: Is the answer grounded in the provided context?
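
Because retrieval is the primary bottleneck, it helps to score the retrieval phase on its own. Below is a hedged sketch using DeepEval's contextual metrics; the strings and thresholds are illustrative, and note that contextual recall needs a ground-truth expected_output.

```python
# Sketch: score the retrieval phase separately from generation with DeepEval's contextual metrics.
from deepeval.metrics import ContextualRelevancyMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

case = LLMTestCase(
    input="Which regions does the service cover?",
    actual_output="The service is available in the EU and North America.",
    expected_output="The service covers the EU and North America.",  # ground truth, needed for recall
    retrieval_context=[
        "Coverage includes all EU member states and North America.",
        "Pricing is billed monthly in local currency.",               # partially irrelevant chunk
    ],
)

relevancy = ContextualRelevancyMetric(threshold=0.7)  # how much of the retrieved context matters for the query
recall = ContextualRecallMetric(threshold=0.7)        # how much ground-truth information the retriever surfaced

for metric in (relevancy, recall):
    metric.measure(case)
    print(type(metric).__name__, round(metric.score, 2), metric.reason)
```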

Agent Evaluation Metrics

  • Non-Conversational Agents (Workflows):
    • Initialization Performance: Resource usage (RAM, CPU) during startup.
    • Tool Reliability: How consistently the agent calls the correct tool.
      • Insight: Reliability drops significantly when an agent has more than five tools; a single-tool-per-agent design is more robust (see the tool-reliability sketch after this list).
  • Conversational Agents (Chat/Voice):
    • Turn Relevancy: How relevant is each turn in the conversation?
    • Turn Faithfulness: Is each turn grounded in the conversation history?
    • VAD Accuracy: For voice agents, the accuracy of Voice Activity Detection (VAD) is critical for managing interruptions.
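
As referenced above, here is a framework-agnostic sketch of measuring tool reliability. `run_agent` is a hypothetical callable that takes a prompt and returns the names of the tools the agent invoked; the test prompts and expected tools are invented.

```python
# Illustrative tool-reliability harness: how often does the agent call the expected tool?
from collections import Counter

test_prompts = [
    {"prompt": "What's the weather in Pune tomorrow?", "expected_tool": "weather_lookup"},
    {"prompt": "Convert 100 USD to INR.",              "expected_tool": "currency_converter"},
    {"prompt": "Summarise the latest support ticket.", "expected_tool": "ticket_reader"},
]

def tool_reliability(run_agent, cases, trials: int = 5) -> float:
    """Fraction of (case, trial) runs in which the expected tool was among those called."""
    hits, total = 0, 0
    misses = Counter()
    for case in cases:
        for _ in range(trials):                      # repeat to capture non-determinism in tool selection
            called = run_agent(case["prompt"])       # e.g. ["weather_lookup"]
            total += 1
            if case["expected_tool"] in called:
                hits += 1
            else:
                misses[case["expected_tool"]] += 1
    print("most-missed tools:", misses.most_common(3))
    return hits / total
```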

Demo: Live RAG & Conversational Evals

  • RAG Pipeline (VLAC Stack):
    • V: vLLM (local LLM inference for privacy)
    • L: LlamaIndex (data framework)
    • A: Agno (agent framework)
    • C: Chunky (chunking library)
    • Q: Qdrant (vector store)
  • RAG Agent Demo:
    • Ingestion: PDF → Images → Markdown → Chunks (74) → Qdrant.
    • Agent: An Agno agent used a Claude Sonnet model and a custom Qdrant tool.
    • Live Evals: DeepEval measured context relevancy (0.87), answer relevancy (0.85), and faithfulness (0.66), using a local Granite 3B model as the eval judge.
  • Conversational Agent Demo:
    • Framework: PipeCat (chosen for its simpler learning curve).
    • Pipeline: Speech (user) → STT → LLM → TTS → Speech (agent).
    • Live Evals: DeepEval triggered on each full turn, logging turn relevancy and faithfulness scores in real-time.
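
Here is a sketch of the "live evals" pattern from the demo: score every response in the serving path and log the results. The `rag_agent.run` interface and the logging sink are assumptions; the metrics are DeepEval built-ins.

```python
# Sketch: wire evals into the serving path so every response is scored as it is produced.
import logging
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

logger = logging.getLogger("live_evals")

def answer_with_live_eval(rag_agent, query: str) -> str:
    # Hypothetical agent interface returning the answer plus the chunks it retrieved.
    response, retrieved_chunks = rag_agent.run(query)

    case = LLMTestCase(input=query, actual_output=response, retrieval_context=retrieved_chunks)
    for metric in (AnswerRelevancyMetric(), FaithfulnessMetric()):
        metric.measure(case)
        # Scores land next to each request, so drift or degradation shows up in dashboards immediately.
        logger.info("%s=%.2f", type(metric).__name__, metric.score)

    return response
```

In production you would typically run the metric calls off the hot path (for example in a background task), so eval latency does not add to response time.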

Best Practices & Advanced Concepts

  • LLM as a Judge: Use a different LLM family for evaluation than for generation (e.g., Claude for generation, Granite for evals). This reduces bias.
  • Mixture of Experts (MoE): Use multiple eval agents to score an answer and ensemble the results for a more robust evaluation.
  • Custom Evals: Standard metrics are a start, but custom evals are often required for specific business needs (e.g., PII masking, sentiment analysis); see the custom-metric sketch after this list.
    • Example: A "volatility metric" can measure the consistency of user-provided data (e.g., name) to determine its trustworthiness.
  • Evaluating Multi-Turn Agents:
    • Threshold: Start evals only after a certain number of turns (e.g., 10) to allow the conversation to build context.
    • Growth Curve: Expect eval scores to improve as the conversation length increases, demonstrating effective context building.
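
A sketch of a custom, business-specific metric using DeepEval's GEval, with the judge drawn from a different model family than the generator. The PII-masking criteria wording, judge model name, and example strings are assumptions for illustration, not from the session.

```python
# Sketch: custom business metric via GEval, judged by a model family other than the generator's.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

pii_masking = GEval(
    name="PII Masking",
    criteria=(
        "The response must not reveal personally identifiable information "
        "(names, phone numbers, emails, account IDs) present in the conversation."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o-mini",   # judge chosen from a different family than the generating model
    threshold=0.8,
)

case = LLMTestCase(
    input="What did the customer say their phone number was?",
    actual_output="I can't share that here, but I've noted it on the ticket as ***-***-1234.",
)
pii_masking.measure(case)
print(pii_masking.score, pii_masking.reason)
```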

Here's the entire recording of the session.