HACKER Q&A
📣 gtsnexp

What dataset was used to train ChatGPT? How can we find out?


Here's GPT's answer:

What dataset was used to train you?

I am a large language model trained by OpenAI using a dataset of millions of documents. I do not have a specific dataset that I was trained on, as I am a general-purpose language model that can generate responses on a wide range of topics.

As a large language model, I have been trained to generate human-like text by predicting the next word in a sequence of words. I use a process called unsupervised learning, where I learn to generate text by analyzing the patterns and structures in the training data without explicit labels or supervision.


  👤 sinenomine Accepted Answer ✓
It is likely a combination of a goal-driven human dialogue corpus sourced from their Mechanical Turk-like contractors, plus a subset of Common Crawl (more specifically Stack Overflow, Stack Exchange, Reddit, and maybe even Twitter) bootstrapped from that seed corpus with text classification.
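To make the "bootstrapped with text classification" idea concrete, here is a minimal sketch of how one might filter a large crawl down to seed-like documents. This is purely illustrative (a toy log-odds bag-of-words scorer), not OpenAI's actual pipeline; all names and thresholds are assumptions.

```python
# Hypothetical sketch: bootstrapping a web-crawl subset with a seed classifier.
# None of this reflects OpenAI's real pipeline; it only illustrates the idea.
from collections import Counter
import math

def train_scores(seed_docs, background_docs):
    """Per-token log-odds: how much more likely a token is in the
    seed (wanted) corpus than in the background crawl (Laplace-smoothed)."""
    seed = Counter(tok for d in seed_docs for tok in d.lower().split())
    back = Counter(tok for d in background_docs for tok in d.lower().split())
    vocab = set(seed) | set(back)
    s_total, b_total = sum(seed.values()), sum(back.values())
    return {
        t: math.log((seed[t] + 1) / (s_total + len(vocab)))
           - math.log((back[t] + 1) / (b_total + len(vocab)))
        for t in vocab
    }

def score(doc, weights):
    """Average log-odds over a document's tokens; unseen tokens score 0."""
    toks = doc.lower().split()
    return sum(weights.get(t, 0.0) for t in toks) / max(len(toks), 1)

def bootstrap_filter(crawl, weights, threshold=0.0):
    """Keep only crawl documents that look like the seed corpus."""
    return [d for d in crawl if score(d, weights) > threshold]

# Toy usage: seed = dialogue-style text, background = everything else.
seed = ["how do I fix this python error", "what is the best way to learn rust"]
background = ["buy cheap watches now", "weather forecast sunny tomorrow"]
weights = train_scores(seed, background)
crawl = ["how do I learn python", "cheap watches sale sale"]
kept = bootstrap_filter(crawl, weights)  # only the dialogue-like doc survives
```

At scale one would swap the toy scorer for a real classifier (e.g. fastText or a fine-tuned transformer) trained on the curated seed data, then run it over Common Crawl shards to pull out similar text.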

Read their recent arXiv papers, specifically the "InstructGPT" paper; there is enough there to get a picture of their pipeline.