HACKER Q&A
📣 teekert

Do LLMs translate text or need knowledge input in other languages?


In my sessions with ChatGPT, it seems less capable in my native language (Dutch). I assumed this was because it has seen less training material in Dutch. When I asked ChatGPT itself, it said it effectively translates its English knowledge. I don't know how to interpret this. Does it really "understand" that it has knowledge and then translate it into the requested (or detected) language?


  👤 pk-protect-ai Accepted Answer ✓
An LLM works on token sequences and does not recognize individual letters; it operates on token indices. The tokenizer builds a vocabulary from the training datasets, mapping frequently occurring segments of text, which could be subwords or sentence fragments, to indices in a lookup table. The LLM then learns the dependencies between these token indices: the probabilities of one index following another. No actual words or letters are involved.
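To make the "indices, not letters" point concrete, here is a toy sketch of a tokenizer. The vocabulary and the greedy longest-match rule are hypothetical simplifications; real tokenizers (e.g. BPE) are learned from data and far larger, but the output is the same kind of thing: a list of integers.

```python
# Toy vocabulary: text segments mapped to integer indices (hypothetical,
# hand-picked; a real vocabulary holds tens of thousands of entries).
vocab = {"un": 0, "believ": 1, "able": 2, "the": 3, " ": 4, "cat": 5}

def tokenize(text, vocab):
    """Greedy longest-match segmentation of text into vocabulary indices."""
    ids = []
    i = 0
    while i < len(text):
        # Find the longest piece starting at position i that is in the vocab.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("unbelievable", vocab))  # → [0, 1, 2]
```

Everything downstream of this step sees only `[0, 1, 2]`; the model never looks at the letters again.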

Most training data originates from English datasets. Some datasets contain aligned Dutch and English sequences, from which the LLM learns the relationship between the indices of Dutch phrases and English phrases. Translation quality depends on the size of this aligned dataset and the breadth of topics it covers. Most of the knowledge is therefore derived from English data, with Dutch merely being another representation of the same data learned during training.
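The alignment idea can be sketched as simple co-occurrence statistics over token indices. The corpus and index values below are made up for illustration; real models learn these associations implicitly through gradient descent rather than explicit counting, but the raw signal is the same: Dutch and English indices that appear in the same aligned pair become statistically linked.

```python
from collections import Counter, defaultdict

# Hypothetical aligned corpus: each item is (Dutch-side token indices,
# English-side token indices) for the same sentence.
aligned = [
    ([10, 11], [20, 21]),   # e.g. "de kat"  / "the cat"
    ([10, 12], [20, 22]),   # e.g. "de hond" / "the dog"
    ([10, 11], [20, 21]),
]

# Count how often each Dutch index co-occurs with each English index.
cooc = defaultdict(Counter)
for nl, en in aligned:
    for a in nl:
        for b in en:
            cooc[a][b] += 1

# Dutch index 10 ("de") is most strongly associated with English index 20 ("the").
best_match = cooc[10].most_common(1)[0][0]
print(best_match)  # → 20
```

With more aligned pairs covering more topics, these cross-language associations get sharper, which is why translation quality tracks the size and breadth of the aligned data.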