The tokenizer might need tweaking. Base Llama models, for example, are trained on text where consecutive spaces have been collapsed to a single space. That's unhelpful for code, where specific amounts of whitespace are at least very nice to have and can even be meaningful.
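Easy enough to check for whichever tokenizer you're using, by the way. A quick sketch (assumes the Hugging Face transformers tokenizer; the gated Llama checkpoints need access, so swap in whatever model you actually have):

```python
from transformers import AutoTokenizer

# Example checkpoint only -- substitute the tokenizer you care about.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

snippet = "def f(x):\n        return x  # 8-space indent"
print(tok.tokenize(snippet))
# If each space in the indent comes back as its own token (rather than a
# single run-of-spaces token), deep indentation gets expensive and the
# model has more chances to get the exact count wrong.
```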
When you talk about a grammar, that's in the decoder, right? You don't need to retrain a model to use one of those.
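Right, the grammar only filters what the decoder is allowed to emit at each step, so the weights never change. Something like this, where `allowed_ids` is a hypothetical stand-in for whatever your grammar engine (llama.cpp's GBNF, Outlines, etc.) says is legal given the text so far:

```python
import torch

def constrained_step(logits: torch.Tensor, allowed_ids: list[int]) -> int:
    # Mask every token the grammar forbids, then pick greedily from the rest.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return int(torch.argmax(logits + mask))
```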