Can AI break a speech audio file into individual words?

Question

I'm wondering: is there a way to feed in an audio file of someone talking and get back multiple little audio files, one per word? To be specific, I mean cutting up the original input file on word boundaries. One audio file per word. I'm curious whether, given a large enough input set, you could create a sort of dictionary of words, each with one or more tiny audio files representing words spoken by a given person. Bonus points if it could do the same for sentences. Does such a thing exist? The next step would be using the database of words and sentences to reproduce someone's speech using audio of their actual words. I'm aware that there is AI voice cloning, but that's not what I'm asking about.

smoldesu · Accepted Answer

People used to do this without AI by downloading YouTube captions and using a script to cut out each timestamped word. Then you could rearrange each sample and sentence-mix it to your heart's desire. There was a semi-popular app that did this circa 2013, but I can't seem to find it today.
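
A minimal sketch of the "script to cut out each timestamped word" idea, assuming you already have word timings as `(word, start_sec, end_sec)` tuples (from captions or any aligner) and an uncompressed WAV file. The function name `cut_words` and the output file naming are made up for illustration, and this is not the 2013 app mentioned above. It also returns a word-to-clips mapping, which is roughly the per-speaker "dictionary" the question describes.

```python
import os
import wave
from collections import defaultdict


def cut_words(wav_path, timed_words, out_dir):
    """Write one WAV clip per (word, start_sec, end_sec) entry.

    Returns a dict mapping each lowercased word to a list of clip paths.
    `timed_words` is assumed input; this function does no speech recognition.
    """
    os.makedirs(out_dir, exist_ok=True)
    clips = defaultdict(list)
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for i, (word, start, end) in enumerate(timed_words):
            # Convert seconds to frame indices and clamp to the file length.
            first = max(0, int(start * rate))
            last = min(total, int(end * rate))
            if last <= first:
                continue  # skip empty or inverted spans
            src.setpos(first)
            frames = src.readframes(last - first)
            key = word.lower()
            # Keep only letters and digits so the word is safe in a filename.
            safe = "".join(c for c in key if c.isalnum()) or "word"
            path = os.path.join(out_dir, f"{i:04d}_{safe}.wav")
            with wave.open(path, "wb") as dst:
                dst.setparams(params)    # same channels, width, and rate
                dst.writeframes(frames)  # header frame count is fixed on close
            clips[key].append(path)
    return dict(clips)
```

Something like `cut_words("talk.wav", [("hello", 0.00, 0.42), ("world", 0.42, 0.90)], "clips")` would then produce one file per word. Clips cut at exact caption timestamps often sound clipped, so padding each span by a few milliseconds may help.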

andrewstuart · Answer

I found a pretty good discussion of the topic here: https://github.com/openai/whisper/discussions/1243

adastra22 · Answer

You don't need AI for this. There are plenty of audio tools that will chunk a file based on speech pauses. I think even sox can do this on the command line.
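
For what "chunk a file based on speech pauses" might look like, here is a rough sketch in plain Python rather than sox. It assumes 16-bit mono PCM WAV input; the function name `find_speech_chunks` and the default window, threshold, and minimum-silence values are arbitrary choices for illustration and would need tuning per recording. One caveat: in connected speech, pauses tend to fall between phrases rather than between every word, so this is more likely to give you phrase-sized chunks than clean single words.

```python
import array
import math
import sys
import wave


def find_speech_chunks(wav_path, window_sec=0.02, threshold=500,
                       min_silence_sec=0.3):
    """Return a list of (start_sec, end_sec) spans of non-silent audio.

    Assumes 16-bit mono PCM. `threshold` is an RMS amplitude on the
    16-bit scale (0..32767); a window quieter than that counts as silence.
    """
    with wave.open(wav_path, "rb") as src:
        if src.getsampwidth() != 2 or src.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    window = max(1, int(round(window_sec * rate)))
    min_silent = max(1, int(round(min_silence_sec / window_sec)))
    n_windows = math.ceil(len(samples) / window)

    chunks = []
    start = None     # window index where the current chunk began
    silent_run = 0   # consecutive silent windows seen inside a chunk
    for w in range(n_windows):
        block = samples[w * window:(w + 1) * window]
        rms = math.sqrt(sum(s * s for s in block) / len(block))
        if rms >= threshold:
            if start is None:
                start = w
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent:
                # The pause is long enough: close the chunk where it began.
                end = w - silent_run + 1
                chunks.append((start * window / rate, end * window / rate))
                start = None
                silent_run = 0
    if start is not None:
        # The file ended mid-chunk; drop any trailing silent windows.
        end = min((n_windows - silent_run) * window, len(samples))
        chunks.append((start * window / rate, end / rate))
    return chunks
```

The returned spans are only timestamps; writing each span to its own file is a separate step.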
