What happens when AI-voice becomes good enough?

Question

I fell into the rabbit hole of TTS models lately. Tried all the major paid tools (ElevenLabs/InWorld/etc.) and all the newest open-source models.

I started asking myself: what happens when the voice is "solved"? That is, it becomes impossible to distinguish from a human. Wanted to hear your opinions!

Sketched some of my own thoughts, and I see two futures:

Future 1: the nuanced version

Audiobooks: I think established authors will still prefer human narrators. If you can afford a $3k–$4k fixed cost for narration, a good human voice is usually worth it. TTS may even push human narration prices down, making that choice easier.

But for new/self-published authors, especially in non-fiction, AI narration may become the default. The choice is often not "AI vs. human narrator," but "AI audiobook vs. no audiobook." There will be backlash, but I think people will partly get used to it.

The more interesting threat may be AI readers. If I can buy an ebook for $8–$10 and have it narrated in a voice/style I like for $1–$2, why pay for an AI-narrated audiobook as a separate product? This could partly unbundle audiobooks from platforms like Audible. I'm torn here: AI-narrated self-published audiobooks and AI readers may co-exist, but AI readers could eventually replace most non-human audiobook editions.

Business content: training videos, museum guides, phone systems, short ads, internal explainers, etc. will be mostly AI. Anywhere "good enough is good enough" meets budget pressure, TTS wins. It already does.

Content creation: YouTube, podcasts, TikTok, etc. are different. Among top creators, I think human narration still dominates because personality and authenticity matter. If the voice is part of the brand, TTS is counterproductive.

That said, AI narration will explode in low-effort content. As generative text/video tools create more slop, most of that slop will probably have AI narration. So maybe the ratio of human vs. TTS voices on social media becomes 1:10 by volume, but 10:1 by total viewership in favor of human voices.

Dubbing/translations: heavily AI-dominated, except for high-end creative work like major films or books.

Films: only humans for now, but it could change. I can easily see generative AI technology going far enough that films of Hollywood quality are fully produced with AI. It would involve a new type of "producer," someone who could manipulate generative AI and mold it into something beautiful, and it would require a new set of tools. Essentially, there would be many, many Pixar-style studios focused on ultra-realistic video with relatively small budgets. For such cases, AI narration would be used, and eventually it could eat almost the whole industry.

Games: TTS seems especially strong here: many distinct voices, short lines, lots of minor characters, and poor economics for hiring actors for everything. I think studios will still use humans for main characters, but many NPCs and indie-game voices will become AI.

Future 2: the hardline version

Anything outside of personal-brand stuff would be AI-generated. If it gets cheap and good enough, and society accepts it, everything from books to films and ads would be AI-narrated.

Human narration would evolve as a profession: you would "sell" the rights to have your voice AI-generated.

A new profession of AI sound engineers will emerge, who will use AI to get creative with voice design and voice orchestration to get the best results.

I also feel like voice is quite different from text or image generation, in the sense that there is a weaker uncanny valley. In 95% of cases, voice is just a tool for correctly conveying creatively written text, hopefully written by a human. And for tools, it is mostly a question of getting good enough.

It is also possible that it is not either/or between the two futures: the first future is the next 10 years, and the second future comes a bit after that.

kvasserman · Accepted Answer

I think of it this way. LLMs are supposed to be good at generating text/writing, right? Well, they are not very good at it. They generate plausible content that superficially makes sense. Most people can easily tell AI-generated slop from human writing. I suspect that mimicking the human voice is multiple levels more difficult for LLMs than mimicking human content. The level of nuance that humans produce in their speech is probably staggering. So I may be completely wrong, but I see no evidence so far to support the idea that either LLMs' writing or speaking is going to get much better any time soon.

damnesian · Answer

I wonder whether, once it truly becomes indistinguishable from reality, people won't increasingly seek direct experiences with fellow humans. We're already experiencing this as a family. AI is such a strange mental rabbit hole, and we're suffering from "tailored for you" fatigue. When you just want some objective answers, what pleases you most is NOT useful, and at this point in the curve, you have to work harder to get LLMs to give you what you need rather than what they think will engage you more. My adult kids have started gathering to play board games and hang out in person, whereas three years ago they'd have been content to play online games together. We're hitting that threshold, right now, where our biology is pushing back.

I don't think the future as presently painted for us is as guaranteed as those who would profit from it would like you to think.

Jblx2 · Answer

Dystopian Future 3: Elderly people getting scammed out of their life savings by scammers on the phone who sound indistinguishable from their grandchildren. (The ones whose grandchildren had their voices scraped from TikToks.)