HACKER Q&A
📣 permo-w

GPT-3 training data as a percentage of all discrete text?


GPT-3 was trained on 45 TB of text. What would you estimate that is as a percentage of all extant discrete written text? Or, perhaps, as a percentage of text written in the Roman script?


  👤 kuhewa Accepted Answer ✓
If a book averages 2-3 MB of text, the Library of Congress's 24 million books (excluding its large holdings of manuscripts and other items) would come to roughly 60 TB of text, as a reference point. Depending on how you count what is available on the internet, I imagine it may dwarf that number, but I would expect a lot more redundancy and less information per word.
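
A quick back-of-the-envelope sketch of that arithmetic in Python. The 2-3 MB per book figure is an assumption (2.5 MB midpoint used here); the 24 million books and the 45 TB corpus size come from the thread above:

    # Assumed figures: ~24 million Library of Congress books at
    # an assumed 2-3 MB of plain text each; GPT-3's reported 45 TB corpus.
    AVG_BOOK_MB = 2.5                    # midpoint of the 2-3 MB assumption
    NUM_BOOKS = 24_000_000               # Library of Congress book count
    GPT3_TRAINING_TB = 45                # reported GPT-3 training data size

    # Convert MB to TB using decimal units (1 TB = 1,000,000 MB).
    library_tb = NUM_BOOKS * AVG_BOOK_MB / 1_000_000
    print(f"LoC books: ~{library_tb:.0f} TB of text")
    print(f"GPT-3 corpus as a share of that: {GPT3_TRAINING_TB / library_tb:.0%}")

This prints ~60 TB and 75%, i.e. on these assumptions the GPT-3 corpus is about three quarters the size of the Library of Congress's book collection in plain text.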