I tend to agree with the premise. However, what if the generative process were overlaid with an "inner debate", as a substitute for having the model play against itself, à la AlphaGo Zero?
The sequence of prompts would go:
1. Please explain X
2. Criticize your explanation of X, using reason and logic.
3. Based on your own critique, improve your explanation of X.
I have manually toyed with this approach (the actual prompts are longer, but you get the gist; see the sketch below), and it gives very interesting results. This could let GPT re-create, on its own, a higher-quality corpus to learn from.
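
A minimal sketch of the loop, assuming a hypothetical query_model(prompt, history) helper that wraps whatever chat API you use (the helper name, the round count, and the exact prompt wording are my own, not a specific product's API):

    def query_model(prompt: str, history: list[str]) -> str:
        """Placeholder: send the conversation so far plus a new prompt
        to your chat model of choice and return its reply."""
        raise NotImplementedError("wire this up to your LLM API")

    def inner_debate(topic: str, rounds: int = 2) -> str:
        """Explain -> criticize -> improve, repeated for a few rounds."""
        history: list[str] = []
        explanation = query_model(f"Please explain {topic}.", history)
        history.append(explanation)
        for _ in range(rounds):
            critique = query_model(
                f"Criticize your explanation of {topic}, "
                "using reason and logic.",
                history,
            )
            history.append(critique)
            explanation = query_model(
                f"Based on your own critique, improve your "
                f"explanation of {topic}.",
                history,
            )
            history.append(explanation)
        return explanation

The final explanations (or the whole transcripts) could then be collected as candidate training data.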
Is anybody pursuing this approach for LLMs?
For an LLM to use this technique on the kind of reasoning you're talking about, you need a human in the loop to explain why it's wrong or right; otherwise it just hallucinates random stuff.
That's basically what RLHF[0] is, which was used to great success in training ChatGPT.