HACKER Q&A
📣 dougSF70

Would GPT-n produce a better model if trained on a structured language?


Chatting with my son yesterday - who has learned some disparate languages (Spanish, Japanese, Arabic) - he mentioned that in Japanese there are (at least) two verbs for "to touch": one you use to talk about physically touching something, and another you use if something touched you 'emotionally', e.g. "they really touched my heart with that poem." In English we can obviously use many different verbs, but often we use plain ol' "touch" and rely on the context of the sentence to provide the more precise meaning. What LLMs are doing is inferring context from the other words used in the same sentence, and judging by their responses they are clearly doing a good job of it. My questions are: would the model perform better if it had to do less inference from the sentence because the language being analyzed was more prescriptive in its vocabulary, syntax and grammar? Or, conversely, does the model work well precisely because English relies more on contextual meaning than prescriptive grammar?


  👤 PaulHoule Accepted Answer ✓
I’m a little skeptical about claims that this language or that language has a different vocabulary. I mean, you could watch anime and think they’ve got few phrases other than 任せる (“leave it to…”) or 絶対負けない! (“I absolutely won’t lose!”), but really languages have tons and tons of alternate vocabulary you could use, say

https://www.merriam-webster.com/thesaurus/touch

20 years ago it seemed to me there was very little NLP literature on languages other than English. Today I see papers on arXiv every day where people train an LLM for some “minor” language or run experiments with multilingual models, so your question is very much an active research area.

https://arxiv.org/search/?query=multilingual&searchtype=all&...


👤 verdverm
It might be more interesting to think about this with respect to programming languages, which are much more structured than human languages. Even within these, there is variation that we can explore, like comparing Go (with fewer keywords and building blocks) and JS|PY (where there are many more ways to do the same thing).
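To make that contrast concrete, here is a minimal Python sketch (Python standing in for the JS|PY side of the comparison) showing several equivalent ways to square a list, where idiomatic Go would essentially offer just the plain loop. A model trained on Python has to learn that all of these express the same intent.

```python
# Several equivalent ways to square a list in Python; a model trained on
# Python code has to treat all of these as expressing the same intent.
nums = [1, 2, 3, 4]

squares_loop = []
for n in nums:                  # explicit loop, closest to the single idiomatic Go form
    squares_loop.append(n * n)

squares_comp = [n * n for n in nums]             # list comprehension
squares_map = list(map(lambda n: n * n, nums))   # map + lambda

assert squares_loop == squares_comp == squares_map == [1, 4, 9, 16]
```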

Another interesting thought related to this is that programming languages often have a spec or grammar made available. Can we help LLMs learn faster or better by supplying these? How can a model draw out the common patterns across languages while staying effective with any one language on a specific problem? Can they few-shot learn a library in the ecosystem and map it onto the problem/solution? JSON vs JSON5 is an interesting example. I was trying to get ChatGPT to work with CUE, but it kept wanting to produce JSON or YAML instead.
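As a rough illustration of the "supply the spec in the prompt" idea, here is a minimal Python sketch that just assembles a few-shot prompt with a small CUE example in front of the task. The CUE snippet is a hand-written illustration, not an excerpt from the official spec, and the resulting prompt would be sent to whatever LLM client you happen to use.

```python
# Sketch: prepend a small CUE example (a stand-in for a fuller spec/grammar)
# to a few-shot prompt so the model is less tempted to fall back to JSON/YAML.
CUE_EXAMPLE = """\
#Service: {
    name:     string
    replicas: int & >=1
    env:      *"dev" | "prod"
}
api: #Service & {name: "api", replicas: 3}
"""

def build_prompt(task: str) -> str:
    # Combine an instruction, the CUE example, and the concrete task.
    return (
        "You write CUE, not JSON or YAML.\n\n"
        "Example of valid CUE:\n"
        f"{CUE_EXAMPLE}\n"
        f"Task: {task}\n"
        "Answer with CUE only."
    )

prompt = build_prompt("Define a #Database schema with a host, a port >= 1024, "
                      "and an instance named 'primary'.")
print(prompt)  # send this string to whatever LLM client you use
```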