Why does GPT-4's latency increase for uncommon questions?
When asked about uncommon topics, GPT-4 seems to spend a few seconds "thinking". By contrast, when asked about things that are easier to answer, its latency is much shorter.
Is there some sort of caching going on? From a transformer perspective, why would some answers take longer to generate?
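One way to probe this would be to time the gap before each streamed chunk arrives; a genuine mid-sentence stall should show up as an outlier gap. Here's a rough sketch using the OpenAI Python client's streaming API (the model name and prompts are just placeholders):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_chunk_gaps(prompt: str, model: str = "gpt-4") -> list[float]:
    """Stream a completion and record the delay before each chunk arrives."""
    gaps, last = [], time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for _chunk in stream:
        now = time.perf_counter()
        gaps.append(now - last)  # the first gap includes request setup time
        last = now
    return gaps

# Compare an easy prompt against an unusual one.
for p in ["What is the capital of France?",
          "Invent a plausible etymology for the word 'glorp'."]:
    gaps = time_chunk_gaps(p)
    print(f"{p!r}: max gap {max(gaps):.2f}s, mean {sum(gaps)/len(gaps):.3f}s")
```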
I’ve wondered exactly the same thing. At first I thought it was just a clever client-side trick: insert some random pauses to enhance the sense that the model really is ‘thinking’ before it writes. But now I’m not so sure. As you describe, if you come up with something particularly weird to ask, it really does seem to pause and ponder, sometimes midway through a sentence, and at a perfectly natural point too.
I am guessing it might not be the main transformer that is occasionally slower, but rather a secondary LLM used to censor content, which only runs occasionally?
So basically, an unusual answer needs additional processing to make sure it is "safe" to be streamed to the client.
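Purely to illustrate the idea (the moderate() stand-in, the buffering, and the timings are all invented, not OpenAI's actual pipeline), a slow safety check on a buffered chunk would look to the client exactly like the model pausing mid-sentence:

```python
import time

def moderate(text: str) -> bool:
    """Stand-in for a secondary safety model; pretend odd content needs more scrutiny."""
    time.sleep(0.5 if "unusual" in text else 0.01)
    return True

def stream_with_safety(tokens: list[str], buffer_size: int = 3):
    """Hypothetical server loop: buffer tokens, vet each chunk, then emit it."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if len(buffer) >= buffer_size:
            chunk = "".join(buffer)
            if moderate(chunk):  # the occasional extra latency lives here
                yield chunk
            buffer.clear()
    leftover = "".join(buffer)
    if leftover and moderate(leftover):
        yield leftover

tokens = ["This ", "answer ", "looks ", "unusual ", "so ", "it ",
          "gets ", "extra ", "scrutiny."]
for chunk in stream_with_safety(tokens):
    print(chunk, end="", flush=True)  # a visible pause appears before the odd chunk
print()
```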
For sure they use some caching. The other factor is the computational capacity available - load can change within seconds, even in the midst of "thinking".
The currency here is the number of tokens to be processed and delivered back. The outcome depends on that :)
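As a toy illustration of why caching alone would make common questions feel near-instant (the key scheme and the generate callback are made up, not anything OpenAI has documented):

```python
import hashlib
import time

_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def answer(model: str, prompt: str, generate) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:               # uncommon question -> slow path
        _cache[key] = generate(prompt)  # full forward passes happen here
    return _cache[key]                  # common question -> fast path

def slow_generate(prompt: str) -> str:
    time.sleep(2)  # pretend this is the full model
    return f"answer to {prompt!r}"

print(answer("gpt-4", "What is 2+2?", slow_generate))  # slow the first time
print(answer("gpt-4", "What is 2+2?", slow_generate))  # instant on repeat
```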
Maybe they do run it as an iterative loop that composes a final answer from many sampled answers, and if there are disagreements it does more computation.
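That would be something like self-consistency sampling. A minimal sketch, assuming a sample() callback that returns one answer per call and an invented agreement threshold:

```python
import random
from collections import Counter

def self_consistent_answer(sample, n: int = 3, max_n: int = 9,
                           threshold: float = 0.6) -> str:
    """Keep sampling until a clear majority emerges; disagreement costs more compute (and time)."""
    answers = [sample() for _ in range(n)]
    while len(answers) < max_n:
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:  # consensus reached, stop early
            return top
        answers.append(sample())               # disagreement -> another pass
    return Counter(answers).most_common(1)[0][0]

# Toy demo: a noisy "model" that usually says 42.
print(self_consistent_answer(lambda: random.choice(["42", "42", "41"])))
```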