HACKER Q&A
📣 andrewoodleyjr

Could we skip speech to text using vector databases?


Nearly all voice applications translate speech to text. But while voice applications do most of their analysis on that text, we still return to the audio itself for deeper processing.

For example, we can analyze text to extract intent, but we need to analyze the waveform of the audio itself to get "real" sentiment analysis. I wonder if there is a better approach.

I am learning Spanish, and right now I translate the words I hear into English for further processing. This is a multi-step process that is time-consuming and delays my response. A friend said he didn't really learn English until he challenged himself to stop translating in his head and instead associated the words with the objects themselves. At that point he could skip translating and respond extremely fast because he had essentially learned the language - we do this with our native tongue.

What if we took the same approach with audio in voice AI applications? We could skip translating speech to text - understanding what is being said by embedding the audio itself and comparing it against past records in a vector database: past records of audio, translations, intent, transcriptions, etc. If we don't have a similar record, we fall back to speech to text (i.e., we inquire). A rough sketch of that loop is below.
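Here is a minimal sketch of that cache-then-fall-back loop. Everything here is hypothetical: embed_audio() and transcribe() stand in for a real audio-embedding model and a speech-to-text service, the similarity threshold is made up, and a real system would use an actual vector database rather than a Python list.

    import numpy as np

    THRESHOLD = 0.9   # assumed similarity cutoff; would need tuning
    records = []      # (embedding, meaning) pairs seen so far

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def understand(audio, embed_audio, transcribe):
        emb = embed_audio(audio)
        # Nearest-neighbor lookup over past records (a vector DB in practice).
        best = max(records, key=lambda r: cosine(emb, r[0]), default=None)
        if best is not None and cosine(emb, best[0]) >= THRESHOLD:
            return best[1]                  # cache hit: skip speech to text
        meaning = transcribe(audio)         # cache miss: fall back to STT
        records.append((emb, meaning))      # remember it for next time
        return meaning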

If this were to work, over time it would reduce the cost and time required for voice AI applications to understand and respond.

What is also interesting is that it follows the human way of learning: we need to be exposed to things, and directed and corrected for a certain amount of time, before we understand a language.


  👤 Imanari Accepted Answer ✓
Interesting idea. This made me think of these audio illusions[0] where what you hear depends on what you expect to hear. I wonder if this would present challenges for the proposed approach.

[0] https://www.youtube.com/watch?v=8FXQ38-ZQK0 sorry for the fast-food-tier video, best I could find that was not a short.


👤 minimaxir
It is possible to create audio/speech embeddings using a model like CLAP: https://huggingface.co/laion/larger_clap_music_and_speech

The results aren't good for nearest-neighbor vector lookup, however.
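For reference, here is a minimal sketch of computing an audio embedding with that model via the Hugging Face transformers library (the model name is from the link above; the exact usage details, the placeholder audio, and the 48 kHz sampling rate CLAP expects are my assumptions, not something from the thread):

    import numpy as np
    import torch
    from transformers import ClapModel, ClapProcessor

    model = ClapModel.from_pretrained("laion/larger_clap_music_and_speech")
    processor = ClapProcessor.from_pretrained("laion/larger_clap_music_and_speech")

    # Placeholder input: 1 second of silence at 48 kHz; in practice this
    # would be a mono float waveform loaded from a real recording.
    audio = np.zeros(48000, dtype=np.float32)

    inputs = processor(audios=audio, sampling_rate=48000, return_tensors="pt")
    with torch.no_grad():
        embedding = model.get_audio_features(**inputs)  # shape: (1, 512)

The resulting vectors could then feed the nearest-neighbor lookup described in the question, with the quality caveat above.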