HACKER Q&A
📣 sa-code

What is the best Llama model that you can deploy on a single A10G?


It's hard to choose between a 13B model with 8-bit quantization and a 33B model with 4-bit quantization.

For some context, the idea is to build a text-to-SQL interface. The interface lets you select certain tables from the data warehouse and injects their definitions into the prompt, so the 4096-token context limit matters here.
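
For a concrete picture, the prompt assembly looks roughly like the sketch below. The table DDL, the 512-token completion reserve, and the ~4-characters-per-token estimate are all illustrative placeholders, not our real schema or tokenizer:

    # Hypothetical sketch of the prompt construction; the schema and the
    # chars-per-token heuristic are made up for illustration.
    TABLE_DDL = {
        "orders": "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL, created_at TIMESTAMP);",
        "customers": "CREATE TABLE customers (id INT, name TEXT, country TEXT);",
    }

    CONTEXT_LIMIT = 4096       # model context window
    COMPLETION_RESERVE = 512   # tokens left free for the generated SQL

    def build_prompt(selected_tables: list[str], question: str) -> str:
        ddl = "\n".join(TABLE_DDL[t] for t in selected_tables)
        prompt = (
            "You are a SQL assistant. Given these table definitions:\n"
            f"{ddl}\n\n"
            f"Write a SQL query that answers: {question}\nSQL:"
        )
        # Crude estimate (~4 characters per token); a real deployment would
        # count tokens with the model's actual tokenizer instead.
        est_tokens = len(prompt) // 4
        if est_tokens > CONTEXT_LIMIT - COMPLETION_RESERVE:
            raise ValueError(f"Prompt too long: ~{est_tokens} tokens")
        return prompt

    print(build_prompt(["orders", "customers"], "total revenue per country"))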


  👤 pocketarc Accepted Answer ✓
The 33B model with 4-bit quantisation would be better.

Check this PR out; the chart shows that even the best 13B quantisation is a far cry from the 30B with 2-bit quantisation: https://github.com/ggerganov/llama.cpp/pull/1684
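
As a back-of-envelope check on why both options fit on a 24 GB A10G but the 33B is the better trade: weights at n-bit quantisation take roughly params × n / 8 bytes. The sketch below ignores the KV cache and runtime overhead, which is an assumption, not a measurement:

    def weight_gb(params_billion: float, bits: int) -> float:
        # Approximate weight memory: params * (bits / 8) bytes, in GB.
        return params_billion * bits / 8

    print(f"13B @ 8-bit: ~{weight_gb(13, 8):.1f} GB")  # ~13.0 GB
    print(f"33B @ 4-bit: ~{weight_gb(33, 4):.1f} GB")  # ~16.5 GB

Both leave headroom for the KV cache at 4096 context, and the chart in that PR shows quantised 30B models sitting well below any 13B quantisation on perplexity.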