I find the streaming text-based response distracting, and honestly I'd just like to talk to it as if it were a partner/collaborator.
My thinking is to use OpenAI Whisper as the Speech-to-Text (STT) input:
https://openai.com/research/whisper
Then stream the results to a Text-to-Speech (TTS) engine.
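For the STT leg, here's a minimal sketch assuming the open-source `whisper` package and a pre-recorded file (the file name and model size are just placeholders; you could also hit the hosted whisper-1 endpoint instead of running it locally):

```python
# pip install openai-whisper
import whisper

# "base" is a placeholder; larger models are more accurate but slower.
model = whisper.load_model("base")

# Transcribe a recorded utterance; in the real app this would be
# whatever the microphone capture step wrote to disk.
result = model.transcribe("utterance.wav")
print(result["text"])
```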
I'm going to use the raw OpenAI chat API to access GPT directly.
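For reference, this is roughly what the streaming call looks like with the official `openai` Python SDK (1.x style; the model name and prompt are just examples):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4",  # example model name
    messages=[{"role": "user", "content": "Explain how transformers work."}],
    stream=True,
)

# Each chunk carries a small delta of text; this is what gets buffered
# before being handed to the TTS step.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```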
I'm going to open source it and put it up on GitHub.
The problem I'm having now is that I need a way to stream the output of ChatGPT to the TTS system.
I think I need something BETTER than the Google Cloud Text-to-Speech API, so I'm looking for a recommendation.
My idea is to buffer the output from GPT until I have 1-2 sentences, THEN start streaming it to the TTS system on sentence boundaries.
That should keep latency low while still giving the TTS system a decent chance of sounding like an actual human, without any jarring breaks (at least I hope).
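Here's a rough sketch of that buffering logic, assuming the streamed deltas from above and a hypothetical `speak()` wrapper around whichever TTS service I end up using. The sentence splitter is deliberately crude (it will trip on abbreviations like "e.g."), so a real version would swap in a proper sentence tokenizer:

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_chunks(text_stream):
    """Accumulate streamed text deltas and yield complete sentences."""
    buffer = ""
    for fragment in text_stream:
        buffer += fragment
        *done, buffer = SENTENCE_END.split(buffer)
        for sentence in done:
            yield sentence
    if buffer.strip():
        # Flush trailing text that never got terminal punctuation.
        yield buffer.strip()

def narrate(text_stream, speak, prebuffer=2):
    """Hold back the first couple of sentences, then feed the TTS one
    sentence at a time so playback never has to pause mid-thought.
    speak() is hypothetical and stands in for the eventual TTS call."""
    pending = []
    started = False
    for sentence in sentence_chunks(text_stream):
        pending.append(sentence)
        if not started and len(pending) < prebuffer:
            continue  # wait until 1-2 sentences are banked before starting
        started = True
        while pending:
            speak(pending.pop(0))
    # Flush anything still pending (e.g. a reply shorter than prebuffer).
    while pending:
        speak(pending.pop(0))
```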
The problem is that the Google Cloud Text-to-Speech API doesn't have a streaming API.
Though I guess if I only send 2-3 sentences at a time, I can aggregate the results in the client and keep the latency small.
Thoughts?