HACKER Q&A
📣 hardwaresofton

Did OpenAI "just" convert all input to voice for direct voice output?


One of the (many) things that came out of OpenAI’s announcements last year was the massive improvement in their voice API. This improvement evidently comes from going straight to voice output (i.e. not converting voice input to text and then text back to voice on either end).

Am I massively oversimplifying, or is it possible that they just took all the text input, ran it through Whisper at training time, and did an otherwise “normal” training run?

It can’t be this simple, right? I’m assuming I just don’t know enough to know how wrong I am.
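For what it’s worth, part of why “just train on voice directly” is at least plausible: speech-to-speech models are generally described as quantizing audio into discrete tokens so an ordinary next-token model can consume them, the same way it consumes text. A toy sketch of that quantization idea, using μ-law companding as a crude stand-in for the learned neural audio codecs real systems use (everything here is illustrative, not OpenAI’s actual pipeline):

```python
import numpy as np

def mulaw_encode(x, n_tokens=256, mu=255.0):
    """Map audio samples in [-1, 1] to discrete token IDs in 0..n_tokens-1.

    Mu-law companding is a simplistic stand-in for a learned codec;
    the point is only that a waveform becomes an integer sequence,
    which a language model can be trained on like text.
    """
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # compand to [-1, 1]
    return ((y + 1) / 2 * (n_tokens - 1)).round().astype(np.int64)

def mulaw_decode(tokens, n_tokens=256, mu=255.0):
    """Inverse mapping: token IDs back to approximate audio samples."""
    y = tokens.astype(np.float64) / (n_tokens - 1) * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

# A 10 ms 440 Hz tone at 16 kHz becomes a short run of integers.
t = np.linspace(0, 0.01, 160)
wave = 0.5 * np.sin(2 * np.pi * 440 * t)
tokens = mulaw_encode(wave)   # integer sequence, like text token IDs
recon = mulaw_decode(tokens)  # close-to-lossless round trip
```

With audio in token form, the “otherwise normal training run” framing isn’t crazy; the open question is where the paired audio/text data and the codec come from, which is where Whisper-style transcription could fit in.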

[EDIT] Before someone asks: I did ask ChatGPT this question, and what it spat out is roughly what I would characterize as the title, but obviously I can’t falsify it. ChatGPT currently knows more about AI than I do.