How do GPT models generate variable-length outputs?
Traditional ML models generally make fixed-length predictions, e.g. text of exactly 100 characters. But GPT models can answer with a simple yes or no, and sometimes they generate much longer responses. How is that output generated?
They generate one token at a time. At each step, the model outputs a probability distribution over every possible next token, and the tooling around the model can sample from that distribution in several ways, depending on how much "creativity" you want in the output. The sampled token is then appended to the input, and the model is run again on the extended sequence. Generation stops when the model emits a special end-of-sequence token or hits a length limit. That repeated forward pass is why it can take tenths of a second to generate each token, even on hardware with teraflops of processing power.
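The loop can be sketched in a few lines of Python. This is a minimal illustration, not a real GPT: `toy_model` here is a made-up stand-in that returns a fixed-shape distribution, where an actual model would run a transformer forward pass. The sampling, append-and-repeat, and stop-token logic are the parts that mirror real generation.

```python
import random

random.seed(0)

# Toy vocabulary; "<eos>" is the special stop token.
VOCAB = ["yes", "no", "maybe", "<eos>"]

def toy_model(tokens):
    """Return a probability distribution over VOCAB given the context.
    A real GPT computes this with a transformer forward pass; this
    stand-in just makes stopping more likely as the context grows."""
    p_stop = min(1.0, 0.2 * len(tokens))
    rest = (1.0 - p_stop) / (len(VOCAB) - 1)
    return [rest, rest, rest, p_stop]

def sample(probs, temperature=1.0):
    """Sample a token index; lower temperature sharpens the
    distribution (less 'creative'), higher flattens it."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    r = random.random() * total
    acc = 0.0
    for i, p in enumerate(scaled):
        acc += p
        if acc >= r:
            return i
    return len(probs) - 1

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = toy_model(tokens)            # one full model run per token
        token = VOCAB[sample(probs, temperature=0.8)]
        if token == "<eos>":                 # model chose to stop early
            break
        tokens.append(token)                 # output becomes part of the input
    return tokens

out = generate(["question:"])
print(out)
```

Because the stop decision is itself just another token the model can emit, the output length is not fixed in advance: a one-token "yes" and a long paragraph come out of the same loop.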