My company is celebrating 10 years next year, and I would love to create a fun little LLM that's trained on our data. I would then like to create a front-end for it, much like ChatGPT. I've got around 5000 articles (not sure how many words) and around 30k images to work with. How would I go about training a small AI on this dataset? Where do I even get started? It's so overwhelming!
Many thanks.
Have a look at LangChain's data connection docs (they cover loading, splitting, embedding, and retrieving your documents):
https://js.langchain.com/docs/modules/data_connection/
https://python.langchain.com/docs/modules/data_connection/
Also, OpenAI last week released Assistants, which is an easy way to get RAG without adopting new tools such as vector DBs. Although 5000 articles is perhaps too large for Assistants.
The first decision is whether to use an open model such as Llama 2 and host it yourself, or a hosted model such as GPT-4 from OpenAI or Claude 2 from Anthropic.
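Whichever model you pick, the RAG pattern itself is the same: index your articles once, retrieve the most relevant ones for each question, and paste them into the prompt. Here's a toy, dependency-free sketch of that loop. The bag-of-words "embedding" and the three sample articles are stand-ins for illustration only; a real system would use a proper embedding model and a vector store (which is what the LangChain docs above walk you through).

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector. A real pipeline
    # would call a learned embedding model instead.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: embed every article once, up front.
#    (Hypothetical sample articles standing in for your 5000.)
articles = [
    "Our company was founded ten years ago in a small garage.",
    "The support team answers tickets within one business day.",
    "We released our flagship product in the third year.",
]
index = [(text, embed(text)) for text in articles]

def retrieve(question, k=1):
    # 2. Retrieve: rank articles by similarity to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question):
    # 3. Augment: stuff the retrieved context into the prompt that
    #    you then send to whichever chat model you chose.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When was the company founded?"))
```

The point of the sketch is that no training or fine-tuning is needed: the base model stays frozen, and your company data only ever enters through the prompt.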
Now the tiny George Carlin in my head won't stop saying "jumbo shrimp, tiny LLM."
Good luck.