For example, we can analyze text to get intent, but we need to analyze the wave patterns of the audio file itself to get "real" sentiment analysis. But I wonder if there is a better approach.
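Rough sketch of what I mean by analyzing the wave patterns (librosa assumed, the sentiment classifier itself is left out; the feature choice here is just an illustration, not a recipe):

    import librosa
    import numpy as np

    def prosodic_features(path: str) -> np.ndarray:
        # Load audio and pull frame-level energy plus a pitch contour.
        y, sr = librosa.load(path, sr=16000)
        rms = librosa.feature.rms(y=y)[0]                   # loudness per frame
        f0, voiced, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
        f0 = f0[~np.isnan(f0)]                              # keep voiced frames only
        # Summary stats a sentiment/arousal classifier could consume,
        # instead of classifying the transcript text.
        return np.array([
            rms.mean(), rms.std(),                          # how loud / how variable
            f0.mean() if f0.size else 0.0,                  # average pitch
            f0.std() if f0.size else 0.0,                   # pitch variability
        ])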
I am learning Spanish, and right now I translate the words I hear into English for further processing. This is a multi-step process that is time-consuming and delays my response. A friend said he didn't really learn English until he challenged himself to stop translating in his head and instead associate the words with the objects themselves. At that point he could skip translating and respond extremely fast because he had essentially learned the language - we do this with our native tongue.
What if we took the same approach with audio in voice AI applications? We could skip the speech-to-text step, understanding what is being said by analyzing the audio itself and comparing it against past records - audio, translations, intent, speech-to-text transcripts, etc. - stored in a vector database. If we don't have a similar record, fall back to speech to text (aka inquire). A rough sketch of the loop is below.
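Very roughly, something like this (embed_audio is a toy stand-in for a real speech embedding model, transcribe is whatever STT service you'd call, and the similarity threshold is made up and would need tuning):

    import numpy as np

    SIMILARITY_THRESHOLD = 0.90   # assumed cutoff, would need tuning

    # Past records: each pairs an audio embedding with what we already
    # learned about that utterance (transcript, intent, etc.).
    records = []

    def embed_audio(audio: np.ndarray, dims: int = 64) -> np.ndarray:
        # Toy embedding: binned magnitude spectrum. A real system would use
        # a learned speech embedding (e.g. a wav2vec-style model), not this.
        spectrum = np.abs(np.fft.rfft(audio))
        vec = np.array([b.mean() for b in np.array_split(spectrum, dims)])
        return vec / (np.linalg.norm(vec) + 1e-9)

    def transcribe(audio: np.ndarray) -> str:
        # Placeholder for the expensive speech-to-text fallback.
        return "<transcript from your STT service>"

    def understand(audio: np.ndarray) -> dict:
        query = embed_audio(audio)
        if records:
            # Brute-force cosine similarity; a real system would use a
            # vector database / ANN index instead of a flat scan.
            matrix = np.stack([r["embedding"] for r in records])
            sims = matrix @ query / (np.linalg.norm(matrix, axis=1)
                                     * np.linalg.norm(query) + 1e-9)
            best = int(np.argmax(sims))
            if sims[best] >= SIMILARITY_THRESHOLD:
                return records[best]          # reuse past understanding, skip STT
        # No close match: fall back to speech to text and remember the result.
        record = {"embedding": query, "transcript": transcribe(audio), "intent": None}
        records.append(record)
        return record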
If this were to work, over time it would reduce the cost and time required for voice AI applications to understand and respond.
What is also interesting is that it follows the human way of learning. We need to be exposed to things, directed, and corrected for a certain amount of time before we understand a language.
[0] https://www.youtube.com/watch?v=8FXQ38-ZQK0 sorry for the fast-food-tier video, best I could find that was not a short.
The results aren't good for nearest neighbor vector lookup, however.