I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.
I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.
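For what it's worth, the nanoGPT-style pipeline boils down to: build a vocabulary from your text, encode it to integer ids, and dump train/val splits as binary files. A minimal character-level sketch (the toy corpus string and output filenames are placeholders, not anything from nanoGPT itself beyond its general `prepare.py` pattern):

```python
import numpy as np

# Toy corpus standing in for the tweet dump (hypothetical snippet);
# in practice you'd read your collected text from a file.
text = "wesh rak? labas hamdullah"

chars = sorted(set(text))                      # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = np.array(encode(text), dtype=np.uint16)  # nanoGPT stores token ids as uint16
n = int(0.9 * len(ids))                        # simple 90/10 train/val split
ids[:n].tofile("train.bin")
ids[n:].tofile("val.bin")
```

For short-form dialect text a character-level (or small BPE) vocabulary is probably the right call anyway, since spelling is inconsistent and mixes Arabic and Latin script.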
I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I put time and effort into this.
Thanks!
Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here https://ai.googleblog.com/2023/01/google-research-2022-beyon...
But also when it's been successful, it's an effort of many different researchers. And it usually starts with data.
Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
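(For anyone unfamiliar with the metric: perplexity is just the exponential of the average per-token negative log-likelihood, so lower is better and a uniform guess over V tokens scores exactly V. A toy illustration with made-up probabilities:)

```python
import math

# Hypothetical probabilities a model assigned to the actual next token
# at each of four positions.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Perplexity = exp(mean negative log-likelihood per token).
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(perplexity)  # 2**1.75, about 3.36
```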
I suggest you contact https://www.icompass.tn/, a Tunisian startup specialized in Natural Language Processing that works on Arabic dialects and African languages.
On a general note, I believe this kind of work should be (urgently) nationally funded, because otherwise these countries will be forced to use second languages like French or literary Arabic when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [1]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key, IMO.
[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...
If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and using it to filter https://commoncrawl.org/
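The classifier-then-filter idea can be surprisingly low-tech. In practice you'd reach for something like fastText's language identification, but the core is just character n-gram statistics. A self-contained sketch (the seed snippets and the "dz"/"fr" labels are made up; real use needs thousands of lines per label):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    text = f"  {text.lower()}  "          # pad so word boundaries become features
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramClassifier:
    """Tiny character-trigram Naive Bayes; a stand-in for e.g. fastText langid."""

    def __init__(self):
        self.counts = {}                   # label -> Counter of trigrams
        self.totals = {}

    def fit(self, samples):                # samples: list of (text, label)
        for text, label in samples:
            self.counts.setdefault(label, Counter()).update(ngrams(text))
        self.totals = {l: sum(c.values()) for l, c in self.counts.items()}

    def predict(self, text):
        best, best_score = None, -math.inf
        for label, c in self.counts.items():
            total, vocab = self.totals[label], len(c)
            # log-likelihood with add-one smoothing
            score = sum(math.log((c[g] + 1) / (total + vocab))
                        for g in ngrams(text))
            if score > best_score:
                best, best_score = label, score
        return best

# Toy seed data: hypothetical Algerian-dialect vs. French snippets.
clf = CharNgramClassifier()
clf.fit([
    ("wesh rak khouya labas", "dz"),
    ("rani ghaya hamdullah", "dz"),
    ("bonjour comment allez vous", "fr"),
    ("je suis content de vous voir", "fr"),
])
print(clf.predict("wesh rani hna"))        # "dz" on this toy seed data
```

You'd run something like this over Common Crawl text and keep only the lines scored as dialect; the precision you need from the filter is much lower than what you'd need from the final model, so even a crude classifier bootstrapped from a small seed corpus can get you a usable training set.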
There is also this model, which comes with a paper describing the methods behind a BERT-family model designed for the Algerian dialect.