HACKER Q&A
📣 lostpharoah

What does the data engineering behind LLMs look like?


I've seen a lot of discussion about key aspects of LLMs like ML (research, architecture), Infrastructure (GPUs, Cloud), and Product (ChatGPT et al), but not much on the data engineering side. There's a lot of hand-waving, like you "just" train on the entire public Internet. There must be a ton of complexity here as well.

What is the difference between web scraping and crawling? They are not simply indexing websites; these systems must be extracting and storing vast amounts of data from the crawled sites (hence Reddit, Twitter, etc. crying foul). Do these systems rely on tons of proxy IPs?

There's probably not too much going on after ingestion beyond storing all this data as text or images in an optimal format for the training system(s) to use.


  👤 adt Accepted Answer ✓
If they don't already own the data (Alphabet/YouTube, Meta/FB, etc.), they scrape it or use ready-made datasets.

Comprehensive analysis paper (the 'what' not the 'how'):

https://lifearchitect.ai/whats-in-my-ai/

Holly wrote very eloquently on how/why they tokenize words:

A token is a way of dealing with rare words by breaking a word up into subword units drawn from a vocabulary of roughly 50k pieces using byte pair encoding (BPE) (Neural Machine Translation of Rare Words with Subword Units [arxiv.org], Sennrich et al., 2015). This is particularly helpful with agglutinative or polysynthetic languages, where an infinite number of words can be created by combining morphemes. For example, the Yup’ik word tuntussuqatarniksaitengqiggtuq is composed of many morphemes that translate to “He had not yet said again that he was going to hunt reindeer” (Describing Morphosyntax: A Guide for Field Linguists [cambridge.org], Payne, 1997). Rather than training GPT-3 on tuntussuqatarniksaitengqiggtuq, it is more efficient to train on the BPEs: "t", "unt", "uss", "u", "q", "at", "arn", "i", "ks", "ait", "eng", "q", "igg", "tu", "q". Breaking up words like this has some strange side effects. For instance, GPT-3 performs better at addition when you use a comma as a separator (GPT-3 Prompts: Math [wikidot.com], Brockman, 2020). BPE encoding may also confuse GPT-3 by obscuring what it needs to understand in the text.

https://hollygrimm.com/gpt3musings
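
To make the subword splitting concrete, here is a tiny sketch using the open-source tiktoken library (my choice for illustration; "r50k_base" is the ~50k-token BPE vocabulary used by GPT-2 and the original GPT-3 models):

    # Inspect how a GPT-2/GPT-3-style BPE tokenizer splits rare words.
    # Assumes `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")

    for word in ["hello", "tuntussuqatarniksaitengqiggtuq"]:
        token_ids = enc.encode(word)
        # Map each token id back to the byte string it represents.
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
                  for t in token_ids]
        print(f"{word!r} -> {len(token_ids)} tokens: {pieces}")

Common words map to a single token, while rare or agglutinative words explode into many subword pieces, which is exactly the behaviour described in the quote above.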


👤 Closi
The other element here is quality content. For commercial LLMs you probably aren't just training on public internet data; ideally you can also train on scanned books, closed academic journals, radio transcripts, photograph archives, map data, codebases, technical documentation...

👤 jeeeb
I don’t have any insider insight on this, but the GPT-3 paper discusses some of their datasets and curation techniques (https://arxiv.org/pdf/2005.14165.pdf).
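
As a concrete example of the kind of curation that paper describes: they filtered Common Crawl with a classifier trained to separate "reference-quality" text from raw crawl, plus fuzzy deduplication. A minimal sketch of the classifier idea (scikit-learn and the toy features here are my assumptions for illustration, not what OpenAI actually used):

    # Toy "quality classifier" document filter in the spirit of GPT-3's
    # Common Crawl curation. Positives are curated reference documents,
    # negatives are raw crawled documents.
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression

    reference_docs = ["A well written article about machine learning ...",
                      "Carefully edited encyclopedia entry on linguistics ..."]
    raw_crawl_docs = ["click here buy now cheap cheap cheap !!!",
                      "lorem ipsum footer nav login cookie banner ..."]

    vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vectorizer.transform(reference_docs + raw_crawl_docs)
    y = [1] * len(reference_docs) + [0] * len(raw_crawl_docs)
    clf = LogisticRegression().fit(X, y)

    def keep_document(text: str, threshold: float = 0.5) -> bool:
        """Keep a crawled document if it scores as 'reference-like'."""
        score = clf.predict_proba(vectorizer.transform([text]))[0, 1]
        return score >= threshold

    print(keep_document("An in-depth tutorial on data pipelines for training."))

(If I recall the paper correctly, they actually re-sample documents stochastically based on the classifier score rather than using a hard threshold, and deduplicate with MinHash, but the basic idea is the same.)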

The recent DINOv2 paper is also interesting reading (https://arxiv.org/pdf/2304.07193.pdf), as it focuses in particular on techniques for improving the training set.

OpenAI have also been open about making heavy use of RL (via PPO) to fine-tune the models.

For RL it seems they’ve essentially developed a second model that scores the quality of responses based on the encoded preferences of human evaluators. I.e., you build a ranking of different responses based on desired characteristics (e.g. polite, helpful, etc.) and use those rankings to train a second model that approximates the RL reward function. This reward model can then be used to fine-tune the main model.
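
That pairwise preference setup boils down to a ranking loss: the reward model should score the human-preferred response higher than the rejected one. A minimal PyTorch sketch (the tiny architecture and random data are placeholders, not OpenAI's actual setup):

    # Train a reward model so that "chosen" responses outrank "rejected" ones.
    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        def __init__(self, vocab_size=50_000, dim=128):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a transformer
            self.score = nn.Linear(dim, 1)                 # scalar reward per response

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            return self.score(self.embed(token_ids)).squeeze(-1)

    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One toy batch of preference pairs (tokenized prompt + response).
    chosen = torch.randint(0, 50_000, (4, 32))
    rejected = torch.randint(0, 50_000, (4, 32))

    # Pairwise ranking loss: push the chosen score above the rejected score.
    loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

The resulting scalar score is what PPO then maximises (typically with a KL penalty against the original model) when fine-tuning the main LLM.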