Blogs & Webinars

Session 11: Introduction to AI voice agents

An introductory session on how voice agents work.

Introduction to AI voice agents

Hey, Bala again. In this session, Manikantha S from Sarvam AI takes us through the evolution of voice agents, their current state, and where they are headed.

Key Takeaways

  • Voice agents have significantly improved due to advancements in speech recognition and large language models (LLMs)
  • Building effective voice agents requires careful scoping, prompt engineering, and consideration of latency
  • The ecosystem around voice agents (analytics, testing, evals) is still evolving and presents opportunities
  • Voice agents are seeing successful production deployments, with comparable or better conversion rates than humans in some cases

Voice Agent Architecture

  • Components: Automatic Speech Recognition (ASR), Translation, Language Model (LLM), Text-to-Speech (TTS)
  • Typical latency breakdown: ~50% LLM, ~40% speech models (ASR/TTS), ~10% translation
  • Challenges include handling background noise, interruptions, and voice activity detection
  • Current focus on reducing latency through model optimization, caching, and streaming techniques
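The cascaded architecture above can be sketched as a simple function chain. The component functions below are hypothetical stand-ins for real ASR, translation, LLM, and TTS services (not any particular SDK); the per-stage timing mirrors how the latency breakdown discussed in the session would be measured:

```python
import time

# Hypothetical stand-ins for real ASR, translation, LLM, and TTS services.
def transcribe(audio: bytes) -> str:
    return "what is my account balance"

def translate(text: str, target_lang: str) -> str:
    return text  # pass-through when no translation is needed

def generate_reply(prompt: str) -> str:
    return "You can check your balance in the app under Accounts."

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")

def handle_turn(audio: bytes, lang: str = "en") -> tuple[bytes, dict]:
    """One conversational turn: ASR -> translation -> LLM -> TTS.
    Returns the reply audio plus per-stage wall-clock timings."""
    timings, t = {}, time.perf_counter()
    text = transcribe(audio)
    timings["asr"] = time.perf_counter() - t

    t = time.perf_counter()
    text = translate(text, lang)
    timings["translation"] = time.perf_counter() - t

    t = time.perf_counter()
    reply = generate_reply(text)
    timings["llm"] = time.perf_counter() - t

    t = time.perf_counter()
    audio_out = synthesize(reply)
    timings["tts"] = time.perf_counter() - t
    return audio_out, timings

reply_audio, timings = handle_turn(b"\x00\x01")
```

In production, each stage streams its output into the next rather than running sequentially like this, which is the main lever for the latency reductions mentioned above.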

Building Voice Agents

  • OpenAI's platform (platform.openai.com) offers a simple way to create assistants
  • Playground feature allows testing of voice agents with different voices and models
  • WebRTC integration possible for deploying agents on websites or telephony providers
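Creating an assistant programmatically mirrors what the Playground does. Here is a minimal sketch using the OpenAI Python SDK's Assistants endpoint; the model name, assistant name, and instructions are placeholders, and the API call is skipped when no key is configured:

```python
import os

# Placeholder configuration for a hypothetical customer-support voice agent.
assistant_params = {
    "model": "gpt-4o-mini",  # any chat-capable model works here
    "name": "support-voice-agent",
    "instructions": (
        "You are a concise customer-support agent. "
        "Keep answers under two sentences; they will be spoken aloud via TTS."
    ),
}

if os.environ.get("OPENAI_API_KEY"):
    # Requires `pip install openai`; this block is skipped without a key.
    from openai import OpenAI

    client = OpenAI()
    assistant = client.beta.assistants.create(**assistant_params)
```

The instruction to keep answers short matters for voice: long replies inflate TTS latency and feel unnatural when spoken.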

Voice Agent Ecosystem

  • Analytics: Still evolving, need for better tools to understand conversation metrics and business insights
  • Testing: Manual testing common, but automated testing becoming more important for scalability
  • Evals: An overlooked area, but crucial for assessing model performance and conversation quality
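As a concrete illustration of the evals gap, a minimal offline eval can replay logged conversation turns against simple per-case checks. The cases and the keyword check below are illustrative only (real evals often use an LLM judge or semantic similarity instead):

```python
# Illustrative transcript-level eval: each case pairs a logged agent
# reply with a simple pass/fail check (keyword presence here).
cases = [
    {"user": "what time do you close?", "reply": "We close at 9 pm.",
     "must_contain": "9 pm"},
    {"user": "cancel my order", "reply": "Sure, your order is cancelled.",
     "must_contain": "cancelled"},
    {"user": "do you ship abroad?", "reply": "Let me check that for you.",
     "must_contain": "yes"},
]

def run_evals(cases):
    """Return the fraction of cases whose reply passes its check."""
    results = [c["must_contain"].lower() in c["reply"].lower() for c in cases]
    return sum(results) / len(results)

pass_rate = run_evals(cases)  # 2 of the 3 checks pass here
```

Even a crude pass rate like this, tracked over time, catches regressions that manual spot-testing misses.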

Current Adoption and Performance

  • Successfully deployed for large-scale use cases (e.g., Aadhaar customer service)
  • Companies have handled tens of millions of calls using voice agents
  • Conversion rates comparable to or better than human agents (2.7–4.5% vs. 3–4% for manual calling)
  • Cost savings of around 50% reported

Future Developments

  • Voice-to-voice systems eliminating need for separate ASR, TTS, and LLM components
  • Cross-platform integration for seamless user experience across different channels
  • Increased memory and context length for more personalized and complex interactions
  • Broader context understanding, moving beyond narrow use cases
  • Automatic syncing with developer documentation for easier maintenance

Emerging Use Cases

  • Companion/friend voice agents
  • Educational applications (teachers, mentors, guides)
  • Faith tech and spirituality-related voice agents
  • Astrology and other niche applications

Next Steps

  • Attendees can explore building voice agents using OpenAI's platform
  • Connect with Manikantha on LinkedIn for further discussions
  • Watch for the public release of Sarvam AI's platform for voice agent development
  • Review additional reading materials on VC insights into voice agent market trends

Here's the entire recording of the session.