Blogs & Webinars

Session 11: Introduction to AI voice agents

An introductory session on how voice agents work.

Introduction to AI voice agents

Hey, Bala again. In this session, Manikantha S from Sarvam AI takes us through the evolution of voice agents, their current state, and where they are headed.

Key Takeaways

  • Voice agents have significantly improved due to advancements in speech recognition and large language models (LLMs)
  • Building effective voice agents requires careful scoping, prompt engineering, and consideration of latency
  • The ecosystem around voice agents (analytics, testing, evals) is still evolving and presents opportunities
  • Voice agents are seeing successful production deployments, with comparable or better conversion rates than humans in some cases

Voice Agent Architecture

  • Components: Automatic Speech Recognition (ASR), Translation, Language Model (LLM), Text-to-Speech (TTS)
  • Typical latency breakdown: ~50% LLM, ~40% speech models (ASR/TTS), ~10% translation
  • Challenges include handling background noise, interruptions, and voice activity detection
  • Current focus on reducing latency through model optimization, caching, and streaming techniques
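The cascaded architecture above can be sketched as a simple function chain. The component functions below are hypothetical stand-ins for real ASR, translation, LLM, and TTS services (not any particular SDK); the per-stage timing mirrors how the latency breakdown discussed in the session would be measured:

```python
import time

# Hypothetical stand-ins for real ASR, translation, LLM, and TTS services.
def transcribe(audio: bytes) -> str:
    return "what is my account balance"

def translate(text: str, target_lang: str) -> str:
    return text  # pass-through when no translation is needed

def generate_reply(prompt: str) -> str:
    return "You can check your balance in the app under Accounts."

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")

def handle_turn(audio: bytes, lang: str = "en") -> tuple[bytes, dict]:
    """One conversational turn: ASR -> translation -> LLM -> TTS.
    Returns the reply audio plus per-stage wall-clock timings."""
    timings, t = {}, time.perf_counter()
    text = transcribe(audio)
    timings["asr"] = time.perf_counter() - t

    t = time.perf_counter()
    text = translate(text, lang)
    timings["translation"] = time.perf_counter() - t

    t = time.perf_counter()
    reply = generate_reply(text)
    timings["llm"] = time.perf_counter() - t

    t = time.perf_counter()
    audio_out = synthesize(reply)
    timings["tts"] = time.perf_counter() - t
    return audio_out, timings

reply_audio, timings = handle_turn(b"\x00\x01")
```

In production, each stage streams its output into the next rather than running sequentially like this, which is the main lever for the latency reductions mentioned above.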

Building Voice Agents

  • OpenAI's platform (platform.openai.com) offers a simple way to create assistants
  • Playground feature allows testing of voice agents with different voices and models
  • WebRTC integration possible for deploying agents on websites or telephony providers
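Creating an assistant programmatically mirrors what the Playground does. Here is a minimal sketch using the OpenAI Python SDK's Assistants endpoint; the model name, assistant name, and instructions are placeholders, and the API call is skipped when no key is configured:

```python
import os

# Placeholder configuration for a hypothetical customer-support voice agent.
assistant_params = {
    "model": "gpt-4o-mini",  # any chat-capable model works here
    "name": "support-voice-agent",
    "instructions": (
        "You are a concise customer-support agent. "
        "Keep answers under two sentences; they will be spoken aloud via TTS."
    ),
}

if os.environ.get("OPENAI_API_KEY"):
    # Requires `pip install openai`; this block is skipped without a key.
    from openai import OpenAI

    client = OpenAI()
    assistant = client.beta.assistants.create(**assistant_params)
```

The instruction to keep answers short matters for voice: long replies inflate TTS latency and feel unnatural when spoken.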

Voice Agent Ecosystem

  • Analytics: Still evolving, need for better tools to understand conversation metrics and business insights
  • Testing: Manual testing common, but automated testing becoming more important for scalability
  • Evals: An overlooked area, but crucial for assessing model performance and conversation quality
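As a concrete illustration of the evals gap, a minimal offline eval can replay logged conversation turns against simple per-case checks. The cases and the keyword check below are illustrative only (real evals often use an LLM judge or semantic similarity instead):

```python
# Illustrative transcript-level eval: each case pairs a logged agent
# reply with a simple pass/fail check (keyword presence here).
cases = [
    {"user": "what time do you close?", "reply": "We close at 9 pm.",
     "must_contain": "9 pm"},
    {"user": "cancel my order", "reply": "Sure, your order is cancelled.",
     "must_contain": "cancelled"},
    {"user": "do you ship abroad?", "reply": "Let me check that for you.",
     "must_contain": "yes"},
]

def run_evals(cases):
    """Return the fraction of cases whose reply passes its check."""
    results = [c["must_contain"].lower() in c["reply"].lower() for c in cases]
    return sum(results) / len(results)

pass_rate = run_evals(cases)  # 2 of the 3 checks pass here
```

Even a crude pass rate like this, tracked over time, catches regressions that manual spot-testing misses.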

Current Adoption and Performance

  • Successfully deployed for large-scale use cases (e.g., Aadhaar customer service)
  • Companies have handled tens of millions of calls using voice agents
  • Conversion rates comparable to or better than human agents (2.7–4.5% vs. 3–4% for manual calling)
  • Cost savings of around 50% reported

Future Developments

  • Voice-to-voice systems eliminating need for separate ASR, TTS, and LLM components
  • Cross-platform integration for seamless user experience across different channels
  • Increased memory and context length for more personalized and complex interactions
  • Broader context understanding, moving beyond narrow use cases
  • Automatic syncing with developer documentation for easier maintenance

Emerging Use Cases

  • Companion/friend voice agents
  • Educational applications (teachers, mentors, guides)
  • Faith tech and spirituality-related voice agents
  • Astrology and other niche applications

Next Steps

  • Attendees can explore building voice agents using OpenAI's platform
  • Connect with Manikantha on LinkedIn for further discussions
  • Watch for the public release of Sarvam AI's platform for voice agent development
  • Review additional reading materials on VC insights into voice agent market trends

Here's the entire recording of the session.