My company is celebrating 10 years next year, and I would love to create a fun little LLM that's trained on our data. I would then like to create a front-end for it, much like ChatGPT. I've got around 5000 articles (not sure how many words) and around 30k images to work with. How would I go about training a small AI on this dataset? Where do I even get started? It's so overwhelming!
Many thanks.
Have a look at LangChain's data connection docs (they cover loading, splitting, embedding, and retrieving your documents):
https://js.langchain.com/docs/modules/data_connection/
https://python.langchain.com/docs/modules/data_connection/
Also, OpenAI last week released Assistants, which is an easy way to get RAG without adopting new tools such as vector DBs. Although 5000 articles is perhaps too large for Assistants.
The first decision is whether to use an open model such as Llama 2 and host it yourself, or a hosted model such as GPT-4 from OpenAI or Claude 2 from Anthropic.
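Whichever model you pick, the RAG pattern itself is the same: index your articles once, retrieve the most relevant ones for each question, and paste them into the prompt. Here's a toy, dependency-free sketch of that loop. The bag-of-words "embedding" and the three sample articles are stand-ins for illustration only; a real system would use a proper embedding model and a vector store (which is what the LangChain docs above walk you through).

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector. A real pipeline
    # would call a learned embedding model instead.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Index: embed every article once, up front.
#    (Hypothetical sample articles standing in for your 5000.)
articles = [
    "Our company was founded ten years ago in a small garage.",
    "The support team answers tickets within one business day.",
    "We released our flagship product in the third year.",
]
index = [(text, embed(text)) for text in articles]

def retrieve(question, k=1):
    # 2. Retrieve: rank articles by similarity to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question):
    # 3. Augment: stuff the retrieved context into the prompt that
    #    you then send to whichever chat model you chose.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("When was the company founded?"))
```

The point of the sketch is that no training or fine-tuning is needed: the base model stays frozen, and your company data only ever enters through the prompt.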
Now the tiny George Carlin in my head won't stop saying "jumbo shrimp, tiny LLM."
Good luck.