
The ‘Building Real-Time Voice AI with NVIDIA Riva on E2E AI Cloud’ webinar tackled exactly that gap. Three companies brought their perspectives: E2E Networks on cloud infrastructure, NVIDIA on speech models, and Gnani.ai on running voice AI at population scale.
When GPU pricing becomes the actual decision point
Vishnu Kumar, AI Solutions Architect at E2E Networks, pulled up the platform and took the attendees through the instance creation process. The H200, the most powerful GPU currently available to most teams, runs at Rs 300 per hour. That’s the standard pricing: you spin up an instance, it’s yours until you shut it down.
But there’s another option that matters more when you’re trying to keep a voice AI product alive without venture funding: spot instances. If you’re okay with your training job pausing when demand spikes elsewhere, you get the same H200 for Rs 88 an hour instead of Rs 300. For development work, that’s the difference between affordable and unaffordable.
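The arithmetic is simple enough to sketch. Here is a minimal comparison using the hourly rates quoted in the session; the 40-hour job length is an illustrative assumption.

```python
# Rough cost comparison using the hourly rates quoted in the session.
# The 40-hour job length is an illustrative assumption.
ON_DEMAND_RATE = 300  # Rs per H200-hour, standard pricing
SPOT_RATE = 88        # Rs per H200-hour, preemptible spot pricing

training_hours = 40   # hypothetical fine-tuning run

print(f"On-demand: Rs {training_hours * ON_DEMAND_RATE}")            # Rs 12000
print(f"Spot:      Rs {training_hours * SPOT_RATE}")                 # Rs 3520
print(f"Discount:  {100 * (1 - SPOT_RATE / ON_DEMAND_RATE):.0f}%")   # 71%
```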
Then came the inference endpoints. Kumar showed the auto-scaling setup that adjusts GPU allocation based on actual request volume. Your voice AI doesn’t need eight GPUs sitting idle at 3 am, he noted, and auto-scaling is how you avoid paying for capacity you’re not using.
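E2E configures this through its console rather than code, so the sketch below is only an illustration of the policy Kumar described: size the GPU replica count to recent request volume, with a floor for quiet hours and a ceiling for peaks. The thresholds and function name are assumptions, not E2E’s API.

```python
# Illustrative autoscaling policy (not E2E's actual API): pick a replica
# count from recent request volume, clamped between a floor and a ceiling.
def desired_replicas(requests_per_min: int,
                     capacity_per_replica: int = 120,
                     min_replicas: int = 1,
                     max_replicas: int = 8) -> int:
    needed = -(-requests_per_min // capacity_per_replica)  # ceiling division
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(950))  # daytime peak  -> 8 replicas
print(desired_replicas(30))   # 3 am traffic  -> 1 replica
```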
The models that make sub-300ms latency possible
Akash Paul, Senior Solutions Architect at NVIDIA, came in with context about why any of this infrastructure matters. Voice AI has been climbing out of the uncanny valley for years, moving from DSP-based systems that listened for keywords, through Google’s WaveNet in 2016, to current models that can handle actual conversations.
Paul focused on NVIDIA’s NeMo family: ASR (automatic speech recognition), TTS (text-to-speech), and NMT (neural machine translation). All open, all on Hugging Face. These models now rank at the top of ASR leaderboards, even in multilingual scenarios, without the hallucinations that plague some alternatives.
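Those models can be tried directly. Below is a minimal sketch with the NeMo toolkit (`pip install "nemo_toolkit[asr]"`); the checkpoint name is one example from NVIDIA’s Hugging Face collection, and you would swap in whichever ASR model covers your languages.

```python
# Minimal NeMo ASR transcription sketch. The checkpoint name is an example;
# see NVIDIA's Hugging Face page for current ASR/TTS/NMT releases.
import nemo.collections.asr as nemo_asr

# Downloads the pretrained checkpoint on first use.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Batch ("offline") transcription of a local 16 kHz mono WAV file.
outputs = asr_model.transcribe(["sample_call.wav"])

# Recent NeMo versions return Hypothesis objects; older ones return strings.
first = outputs[0]
print(first.text if hasattr(first, "text") else first)
```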
He highlighted Corover, an Indian company running voice bots for IRCTC and others using Riva, NVIDIA’s serving framework. Paul then played a demo of Persona Plex, a duplex speech-to-speech model that handled back-channeling, those small acknowledgments people make while listening, with natural pacing that wasn’t robotic.
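Riva itself is consumed over gRPC once a server is deployed. A rough offline-recognition sketch with the `nvidia-riva-client` Python package follows; the server address, language code, and audio format are assumptions about your own deployment, not values from the webinar.

```python
# Offline (batch) recognition against a running Riva server. The server
# address, language code, and audio format here are assumptions.
import riva.client

auth = riva.client.Auth(uri="localhost:50051")
asr_service = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    audio_channel_count=1,
    language_code="en-US",
    max_alternatives=1,
    enable_automatic_punctuation=True,
)

with open("sample_call.wav", "rb") as f:
    audio_bytes = f.read()

response = asr_service.offline_recognize(audio_bytes, config)
print(response.results[0].alternatives[0].transcript)
```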
What 30,000 concurrent calls actually look like
Avinash Benke, Technical Lead for Agentic AI at Gnani.ai, came with receipts. His team handles 10 million interactions daily across 40-plus languages. The peak load they’ve hit: 30,000 concurrent calls. The company has been working on voice and speech technology since 2017 and has built its entire stack in-house.
Gnani.ai’s platform, India.ai, is a no-code builder for voice and chat agents. Benke walked through the configuration panel, showing how to set system prompts, choose from multiple languages, and toggle between streaming and batch ASR. The interruption settings show the level of detail that matters in production: users can be allowed to interrupt after one, two, or three words, and interruptions can be disabled entirely during the initial greeting.
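The configuration panel is a UI, so the dictionary below is only a hypothetical rendering of the knobs Benke showed: the system prompt, language list, ASR mode, and interruption thresholds. The field names are illustrative, not Gnani.ai’s actual schema.

```python
# Hypothetical voice-agent configuration mirroring the settings shown in the
# demo. Field names are illustrative, not Gnani.ai's schema.
agent_config = {
    "system_prompt": "You are a polite collections assistant for Acme Finance.",
    "languages": ["en-IN", "hi-IN", "kn-IN"],    # English, Hindi, Kannada
    "asr_mode": "streaming",                     # or "batch"
    "interruption": {
        "enabled": True,
        "min_words_before_interrupt": 2,         # barge-in after two words
        "allow_during_greeting": False,          # no interruptions in the opening line
    },
}
```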
Then came the actual demo call. The agent started in English, switched to Kannada mid-conversation, then Hindi, following the user’s language changes without being explicitly told to switch. The transcript showed code-mixed phrases, the kind of real-world linguistic chaos that breaks most systems.
Even with code-mixed input, Benke noted, the system picked up the context and stayed in the user’s language. And this wasn’t a controlled demo environment: it was a debt-collection agent, where getting language detection wrong loses you money and customers.
The real constraint is production
What tied the sessions together was a shared acknowledgment: production is where assumptions are tested.
Low latency means thinking in milliseconds. Multilingual support means handling code-mixed, noisy audio in real environments, not curated datasets. Cost optimization means choosing the right GPU, scaling policy, and model size.
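To make ‘thinking in milliseconds’ concrete, here is a hypothetical budget for one conversational turn against a 300 ms target; the per-stage numbers are illustrative assumptions, not figures from the webinar.

```python
# Hypothetical latency budget for one conversational turn, in milliseconds.
# Stage figures are illustrative assumptions, not numbers from the webinar.
budget_ms = 300
pipeline_ms = {
    "streaming ASR (final partial)": 80,
    "language detection + routing": 20,
    "LLM first token": 120,
    "TTS first audio chunk": 60,
}
total = sum(pipeline_ms.values())
print(f"total: {total} ms, headroom: {budget_ms - total} ms")  # 280 ms used, 20 ms spare
```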
The webinar ended without grand claims about AI replacing humans. Instead, it left developers with specific next steps: try the models on Hugging Face, experiment with E2E’s spot instances for training, and test whether streaming or batch ASR makes more sense for their use case.
For teams trying to move voice AI from proof of concept to production, that’s probably more valuable than another slide deck about the future.
