How are thinking efforts implemented?

Question

Claude and ChatGPT have thinking efforts where you can tune the amount of thinking allowed, like low, medium, high, xhigh and so on. But are they different models underneath, or the same model with a different parameter?

The reason I ask is that if I change the effort param mid-conversation in Claude Code, I get a warning suggesting I'm breaking the cache. I don't think this happens in Codex, because when I change the effort there, the responses are still quick.

__patchbit__ · Accepted Answer

This is a guess, but it may be tied to the token length of the context window. Selecting a lower effort would be consistent with the warning message: it forces a cutoff of the context window, with "cache" in the warning acting as a synonym for that context. Increasing the headroom for more "thinking" should let the implementation access more resources without warning about the cache breaking.

aabdi · Answer

Different models use slightly different variants. Usually it's done in post-training, so that the model's behavior is conditioned on the prompt, i.e. a system prompt containing something like thinking:max, thinking:low, or whatever. Enforcement then goes via constrained decoding: checking for the think start and end tokens with max lengths, or other variations.
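
To make the "think start and end tokens with max lengths" idea concrete, here is a minimal sketch of a decoding loop that force-closes the thinking span once a budget is used up. Everything in it is an assumption for illustration: the `<think>` / `</think>` markers, the `EFFORT_BUDGETS` numbers, and the `toy_model` stand-in are all hypothetical, and this is not a description of how Claude, ChatGPT, or any other product actually does it.

```python
from typing import Callable, List

# Hypothetical marker tokens; real models use their own special tokens.
THINK_START = "<think>"
THINK_END = "</think>"
EOS = "<eos>"

# Hypothetical mapping from effort level to a thinking-token budget.
EFFORT_BUDGETS = {"low": 2, "medium": 8, "high": 32}


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             effort: str,
             max_total: int = 64) -> List[str]:
    """Decode token by token, forcing THINK_END once the budget is spent."""
    budget = EFFORT_BUDGETS[effort]
    context = list(prompt)
    output: List[str] = []
    thinking = False
    used = 0  # thinking tokens emitted in the current think span

    while len(output) < max_total:
        if thinking and used >= budget:
            # Budget exhausted: override the model and close the think span.
            token = THINK_END
        else:
            token = next_token(context)

        if token == THINK_START:
            thinking, used = True, 0
        elif token == THINK_END:
            thinking = False
        elif thinking:
            used += 1

        context.append(token)
        output.append(token)
        if token == EOS:
            break
    return output


def toy_model(context: List[str]) -> str:
    """Stand-in for a model: wants 5 thinking steps, then answers."""
    if THINK_START not in context:
        return THINK_START
    if THINK_END not in context:
        return "step" if context.count("step") < 5 else THINK_END
    if "answer" not in context:
        return "answer"
    return EOS


low = generate(toy_model, ["question"], "low")
medium = generate(toy_model, ["question"], "medium")
print(low)
print(medium)
```

With the "low" budget the loop cuts the toy model off after two thinking steps, while with "medium" the toy model finishes its five steps and closes the span itself. The same weights are used in both runs; only the enforcement parameter changes.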