HACKER Q&A
📣 andrewstuart

Can AI break a speech audio file into individual words?


I’m wondering, is there a way for me to feed in an audio file of someone talking and get back multiple little audio files, one for each word?

To be specific I mean cutting up the original input file on word boundaries. One audio file per word.

I’m curious to know whether, given a large enough input set, you could create a sort of dictionary of words, each with one or more tiny audio files of that word as spoken by a given person.

Bonus points if it could do the same for sentences.

Does such a thing exist?

The next step would be using the database of words and sentences to reproduce someone’s speech using audio of their actual words.

I’m aware that there is AI voice cloning but that’s not what I’m asking about.


  👤 smoldesu Accepted Answer ✓
People used to do this without AI by downloading YouTube captions and using a script to cut out each timestamped word. Then you could rearrange each sample and sentence-mix it to your heart's desire. There was a semi-popular app that did this circa 2013 but I can't seem to find it today.
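
For illustration, the cutting step could look roughly like the sketch below. It is not the script or app mentioned above. It assumes the audio is already an uncompressed WAV file and that the caption timestamps have been parsed into (label, start_sec, end_sec) tuples. The function name cut_words and that tuple layout are made up for this example.

```python
import os
import wave


def cut_words(src_path, words, out_dir="."):
    """Write one WAV file per (label, start_sec, end_sec) entry.

    `words` is a hypothetical, already-parsed list of timestamps;
    getting it from captions or a transcriber is not covered here.
    Returns the list of paths written, in input order.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (label, start, end) in enumerate(words):
            first = int(start * rate)                # first frame of the word
            count = max(0, int(end * rate) - first)  # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            path = os.path.join(out_dir, f"{i:04d}_{label}.wav")
            with wave.open(path, "wb") as dst:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(path)
    return paths
```

Calling cut_words("talk.wav", [("hello", 0.0, 0.25), ("world", 0.5, 1.0)], "out") would write out/0000_hello.wav and out/0001_world.wav, provided the out directory already exists. Labels are used in the filenames as-is, so anything with slashes or other awkward characters would need cleaning up first.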

👤 andrewstuart
I found a pretty good discussion on the topic here:

https://github.com/openai/whisper/discussions/1243


👤 adastra22
You don’t need AI for this. There are plenty of audio tools that will chunk a file based on speech pauses. I think even sox can do this on the command line.
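
A rough sketch of the pause-based idea, using only the Python standard library, might look like this. It is not how sox or any particular tool does it. The threshold, window size, and minimum pause length are arbitrary assumptions that would need tuning per recording, and the function names are invented for the example. It only handles 16-bit mono WAV input.

```python
import sys
import wave
from array import array


def read_mono16(path):
    """Load a 16-bit mono PCM WAV file as (samples, sample_rate)."""
    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = f.getframerate()
        samples = array("h")
        samples.frombytes(f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def find_speech_chunks(samples, rate, threshold=500,
                       window_ms=20, min_silence_ms=200):
    """Return (start_sec, end_sec) spans separated by long enough pauses.

    A window counts as silent when its peak amplitude is below
    `threshold`; a chunk ends after `min_silence_ms` of silent windows.
    """
    window = max(1, rate * window_ms // 1000)         # samples per window
    min_silent = max(1, min_silence_ms // window_ms)  # windows per pause
    n_windows = (len(samples) + window - 1) // window
    chunks = []
    start = None     # window index where the current chunk began
    silent_run = 0   # consecutive silent windows inside the chunk
    for w in range(n_windows):
        block = samples[w * window:(w + 1) * window]
        loud = max(abs(s) for s in block) >= threshold
        if loud:
            if start is None:
                start = w
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent:
                end = w - silent_run + 1  # chunk stops at first silent window
                chunks.append((start * window / rate, end * window / rate))
                start, silent_run = None, 0
    if start is not None:  # audio ended while still inside a chunk
        end = min((n_windows - silent_run) * window, len(samples))
        chunks.append((start * window / rate, end / rate))
    return chunks
```

The resulting spans can then be cut out of the original file. Since the split happens on pauses rather than on recognized words, a chunk may well contain several words when the speaker runs them together.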