Where can I find practical comparative data regarding different LLMs?

Question

I've been trying to keep up with the advances in the world of AI and LLMs. NLP was a world that I knew pretty well 7 years ago, when I knew most of the major NLP libraries, and their various strengths and weaknesses. However, nowadays, I'm having trouble finding good discussions about the real uses of the LLMs.I have gone to Hugging Face, and the amount of data there is overwhelming, but it seems poorly organized:https://huggingface.coDoes anyone know a secret that makes that site tractable? I've experimented with a few of the libraries posted there, but I can only sample a tiny fraction of what is there, and what I'm missing is some method for finding the useful stuff while disposing of the junk.7 years ago I could tell you the strengths of weaknesses of the Google's Tensorflow or the Stanford NLP library. But where do I go to get good comparative information now, about the strengths and weaknesses of the various libraries that interact with the new LLM tools?I'm looking to answer practical questions, that I can use in my own work with AI startups.For an example of a question, for which I cannot find an answer, I am aware of a startup that has developed a chat client that, the startup says, can entirely replace a company's customer support team. Among the claims made by the startup is that when their chat client makes a mistake, it can be easily adjusted so it won't make that mistake any more. I am curious, what approaches are the engineers at that startup probably using to fix mistakes? If I search Hugging Face for ways to fix factual errors in LLMs then I see some libraries, but I've no idea what is considered good or bad.So I asked the Hacker News community, how are you keeping up with advances around LLMs and associated tools?Also, every LLM seems to have an embedded finite state machine that remembers the state of the current conversation, so where can I go to learn about the strengths and weaknesses of those finite state machines? How would I go about adjusting them?Or, let me offer another example of the kind of information I want:I've been testing different AI chats by trying to play text adventures with them. For instance:https://huggingface.co/spaces/HuggingFaceH4/zephyr-7b-gemma-chathttps://chat.openai.comIf I use the same prompt with each of them, I can see how different they are, but how do I know if my observations are general (would other people get similar results) and how do I learn about other AI chats (since I cannot test them all).

BMSR · Accepted Answer

I'm also learning. The models get more accurate when they have more parameters, say 7b (7 billion parameters) vs 8x7b (56 billion parameters). They also take more time and resources at higher parameters. TheBloke at Huggingface uploads quantized models, which means they can run on lower spec computers but with a possible hit on quality, he offers multiple configurations per model depending on what you prefer. Big models can be too heavy and slow, the sweet spot is probably something like 13b. You can try different gguf models with this program: https://github.com/madprops/meltdown

ActorNightly · Answer

>Does anyone know a secret that makes that site tractable?
Its basically just a repo for models. Most original models are uploaded in fp16 format, with different parameter counts - higher parameter count = better performance. If you were to fine tune the model on your own data set, you have to keep the model in fp16, because gradients need higher resolution
On the flip side, inference is pretty much statistically most likely token which can be obtained without such resolution. As such, these models are usually quantized with GPTQ (GPU first), GGUF (CPU first, born from the llama.cpp project, but supports ), and AWQ (new method, supposedly faster than GPTQ).
Primer for quantization https://archive.ph/2023.11.21-144133/https://towardsdatascie...
When using the model, you generally want to use the model with the largest parameters, with the highest bit quantization that fits in your system if you are running this on personal hardware. Easiest way to do this is with ollama, because its basically just does what pytorch or llama.cpp do in terms of loading models onto gpu (or ram for apple silicon), and executing them with whatever hardware you have. It can auto download models (usually 4 bit quantized) as well, and integrates into vscode with Continue extension.
>For an example of a question, for which I cannot find an answer, I am aware of a startup that has developed a chat client that, the startup says, can entirely replace a company's customer support team. Among the claims made by the startup is that when their chat client makes a mistake, it can be easily adjusted so it won't make that mistake any more. I am curious, what approaches are the engineers at that startup probably using to fix mistakes?
Highly likely some form of prompt engineering. Langchain is a popular tool to utilize for this. Most companies pay for api access rather than set up their own hardware.
>how are you keeping up with advances around LLMs and associated tools?
Wait for a model to drop on ollama, try it out.