I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.
I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.
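For what it's worth, the nanoGPT-style pipeline boils down to: build a vocabulary from your text, encode it to integer ids, and dump train/val splits as binary files. A minimal character-level sketch (the toy corpus string and output filenames are placeholders, not anything from nanoGPT itself beyond its general `prepare.py` pattern):

```python
import numpy as np

# Toy corpus standing in for the tweet dump (hypothetical snippet);
# in practice you'd read your collected text from a file.
text = "wesh rak? labas hamdullah"

chars = sorted(set(text))                      # character-level vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = np.array(encode(text), dtype=np.uint16)  # nanoGPT stores token ids as uint16
n = int(0.9 * len(ids))                        # simple 90/10 train/val split
ids[:n].tofile("train.bin")
ids[n:].tofile("val.bin")
```

For short-form dialect text a character-level (or small BPE) vocabulary is probably the right call anyway, since spelling is inconsistent and mixes Arabic and Latin script.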
I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I put time and effort into this.
Thanks!
Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here https://ai.googleblog.com/2023/01/google-research-2022-beyon...
But also when it's been successful, it's an effort of many different researchers. And it usually starts with data.
Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
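(For anyone unfamiliar with the metric: perplexity is just the exponential of the average per-token negative log-likelihood, so lower is better and a uniform guess over V tokens scores exactly V. A toy illustration with made-up probabilities:)

```python
import math

# Hypothetical probabilities a model assigned to the actual next token
# at each of four positions.
token_probs = [0.5, 0.25, 0.125, 0.5]

# Perplexity = exp(mean negative log-likelihood per token).
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))
print(perplexity)  # 2**1.75, about 3.36
```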
I suggest you contact https://www.icompass.tn/, a Tunisian startup specialized in Natural Language Processing that works on Arabic dialects and African languages.
On a general note, I believe this kind of work should be (urgently) nationally funded, because otherwise these countries will be forced to use second languages like French or literary Arabic when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [1]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key, IMO.
[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...
If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and using it to filter https://commoncrawl.org/
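The classifier-then-filter idea can be surprisingly low-tech. In practice you'd reach for something like fastText's language identification, but the core is just character n-gram statistics. A self-contained sketch (the seed snippets and the "dz"/"fr" labels are made up; real use needs thousands of lines per label):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    text = f"  {text.lower()}  "          # pad so word boundaries become features
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramClassifier:
    """Tiny character-trigram Naive Bayes; a stand-in for e.g. fastText langid."""

    def __init__(self):
        self.counts = {}                   # label -> Counter of trigrams
        self.totals = {}

    def fit(self, samples):                # samples: list of (text, label)
        for text, label in samples:
            self.counts.setdefault(label, Counter()).update(ngrams(text))
        self.totals = {l: sum(c.values()) for l, c in self.counts.items()}

    def predict(self, text):
        best, best_score = None, -math.inf
        for label, c in self.counts.items():
            total, vocab = self.totals[label], len(c)
            # log-likelihood with add-one smoothing
            score = sum(math.log((c[g] + 1) / (total + vocab))
                        for g in ngrams(text))
            if score > best_score:
                best, best_score = label, score
        return best

# Toy seed data: hypothetical Algerian-dialect vs. French snippets.
clf = CharNgramClassifier()
clf.fit([
    ("wesh rak khouya labas", "dz"),
    ("rani ghaya hamdullah", "dz"),
    ("bonjour comment allez vous", "fr"),
    ("je suis content de vous voir", "fr"),
])
print(clf.predict("wesh rani hna"))        # "dz" on this toy seed data
```

You'd run something like this over Common Crawl text and keep only the lines scored as dialect; the precision you need from the filter is much lower than what you'd need from the final model, so even a crude classifier bootstrapped from a small seed corpus can get you a usable training set.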
There is also this model, which comes with a paper describing the methods behind a BERT-family model designed for the Algerian dialect.