
Voice AI’s Big Moment: Why Everything Is Changing Now (ft. Neil Zeghidour, Gradium AI)
Summary
The episode explains why voice AI is hitting an inflection point: improvements in models, data, and engineering are finally making talking to machines feel natural and convenient. Neil Zeghidour contrasts the dominant cascaded stack (ASR → text model → TTS) with emerging speech-native and full‑duplex approaches that preserve paralinguistic signals and remove turn-taking latency. Practical concerns dominate the conversation: high-quality curated data, efficient on-device models, selective use of large models, and careful script generation for recordings. The discussion also covers product strategy (building blocks vs verticals), privacy risks around voice cloning, and skepticism about audio watermarking as a provenance solution.
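The cascaded stack described above can be sketched in a few lines. This is a minimal illustration with stubbed, hypothetical functions (`asr`, `text_model`, `tts` are not a real API); it shows where paralinguistic information is dropped and why each stage adds latency.

```python
# Illustrative sketch of the cascaded stack (ASR -> text model -> TTS).
# All functions are hypothetical stubs: only text crosses stage
# boundaries, so tone, emotion, and timing are lost after ASR.

def asr(audio: bytes) -> str:
    """Speech recognition: audio in, plain text out."""
    return "what's the weather like"  # stubbed transcript

def text_model(prompt: str) -> str:
    """A text-only LLM sees just the transcript, not the voice."""
    return f"Reply to: {prompt}"  # stubbed response

def tts(text: str) -> bytes:
    """Text-to-speech: synthesizes audio from text alone."""
    return text.encode()  # stubbed waveform

def cascaded_turn(audio: bytes) -> bytes:
    # Each stage waits for the previous one to finish, which is
    # where the turn-taking latency of the cascade comes from.
    return tts(text_model(asr(audio)))

print(cascaded_turn(b"...").decode())
```

Speech-native models, by contrast, operate on audio end to end, so the emotional and prosodic signal never has to squeeze through a text bottleneck.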
Key Takeaways
1. Voice AI is now at a genuine inflection point where talking to AI can be more natural and convenient than talking to a human in many scenarios.
2. Full‑duplex and speech‑native (speech-to-speech) models remove turn-taking and preserve paralinguistic information, improving conversational fluidity.
3. The cascaded stack remains popular for modularity and customization but loses paralinguistic signals and adds latency.
4. High-quality, curated datasets and intelligent script generation are far more important than raw data scale for TTS and expressive voice models.
5. Practical voice products require small, efficient on-device models with selective access to large models; adaptive compute makes scale economically viable.
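The "selective access to large models" idea can be made concrete with a small routing sketch. Everything here is an illustrative assumption, not the approach discussed in the episode: the model names, the confidence score, and the word-count difficulty proxy are all stand-ins.

```python
# Hedged sketch of selective/adaptive compute: a small on-device model
# answers by default and escalates to a large model only when its own
# confidence is low. Names and thresholds are illustrative.

def small_model(query: str) -> tuple[str, float]:
    """Cheap on-device model: returns (answer, confidence)."""
    if len(query.split()) <= 4:  # toy difficulty proxy
        return f"small: {query}", 0.9
    return f"small: {query}", 0.3

def large_model(query: str) -> str:
    """Expensive remote model, called only when needed."""
    return f"large: {query}"

def answer(query: str, threshold: float = 0.5) -> str:
    reply, confidence = small_model(query)
    if confidence >= threshold:
        return reply              # stay on-device: fast and cheap
    return large_model(query)     # escalate for the hard cases

print(answer("set a timer"))
print(answer("please summarize my last three meetings"))
```

The economic point from the episode is that most turns take the cheap path, so the large model's cost is amortized over only the queries that need it.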
Notable Quotes
"For the first time, it actually can be enjoyable and even more convenient to talk to an AI on the phone than talking to a human."
"One of the things we contributed doing is getting rid of speaker turns completely with what we call full‑duplex conversations."
"When we speak there are a lot of information that come about us and this is lost through [text] — emotional state, irony, lying… a lot of information is conveyed that is not in what we say."
"We trained on seven million hours of speech for Moshi — it's ridiculous, I mean we could probably do that with 10,000 hours if we had the right method."
"For example for TTS you want to have expressive data of high quality; you don't want to have something that is recorded with arbitrary conditions — you want to use studio recording, very low level of noise, professional or semi-professional actors."
"We spent a lot of time on making very complex machines for script generation with like a taxonomy of all possible topics and sub topics...every time we generate a phone number it's generated with an actual random number generator so that we actually cover the whole scope."
"I think these models need to be small but sometimes they require access to large models because they're solving a complex task — selective and adaptive compute usage based on the context and the difficulty of the task."
"Watermarking is a scam — I'm sorry I have to say it. It just doesn't work; I worked on it — we have an appendix...around how we could break so easily any watermarking that was supposed to be state of the art."