HACKER Q&A
📣 permo-w

GPT-3 training data as a percentage of all discrete text?


GPT-3 was trained on 45 TB of text. What would you estimate that is as a percentage of all extant discrete written text? Or, perhaps, as a percentage of text written in the Roman script?


  👤 kuhewa Accepted Answer ✓
If a book averages 2-3 MB of text, the Library of Congress's 24 million books (excluding its large holdings of manuscripts and other items) would come to roughly 60 TB of text, as a reference point. Depending on how you count what is available on the internet, I imagine it may dwarf that number, but I would expect a lot more redundancy and less information per word.
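
A quick back-of-the-envelope sketch of that arithmetic in Python. The 2-3 MB per book figure is an assumption (2.5 MB midpoint used here); the 24 million books and the 45 TB corpus size come from the thread above:

    # Assumed figures: ~24 million Library of Congress books at
    # an assumed 2-3 MB of plain text each; GPT-3's reported 45 TB corpus.
    AVG_BOOK_MB = 2.5                    # midpoint of the 2-3 MB assumption
    NUM_BOOKS = 24_000_000               # Library of Congress book count
    GPT3_TRAINING_TB = 45                # reported GPT-3 training data size

    # Convert MB to TB using decimal units (1 TB = 1,000,000 MB).
    library_tb = NUM_BOOKS * AVG_BOOK_MB / 1_000_000
    print(f"LoC books: ~{library_tb:.0f} TB of text")
    print(f"GPT-3 corpus as a share of that: {GPT3_TRAINING_TB / library_tb:.0%}")

This prints ~60 TB and 75%, i.e. on these assumptions the GPT-3 corpus is about three quarters the size of the Library of Congress's book collection in plain text.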