Edit: the above is about PCs. Macs are much faster at CPU-based token generation, but not nearly as fast as big GPUs, and their prompt ingestion is still slow.
Recommended reading: Tim Dettmers' guide https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
Related: could someone please point me in the right direction on how to run Wizard Vicuna Uncensored or Llama 2 13B locally on Linux? I've been searching for a guide and haven't found what I need as a beginner. In the GitHub repo I referenced, the download is Mac-only at the moment. I have a MacBook Pro M1 I can use, though it's running Debian.
Thank you.
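If it helps, here's a minimal sketch of one way to do it on Linux with llama-cpp-python (pip install llama-cpp-python); the model filename below is just a placeholder for whichever quantized Wizard Vicuna or Llama 2 13B file you download:

```python
# Minimal sketch: run a local quantized model with llama-cpp-python.
# The model path is a placeholder; point it at your downloaded file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # hypothetical filename
    n_ctx=2048,       # context window
    n_gpu_layers=0,   # 0 = CPU only; raise this if you have a GPU build
)

out = llm("Q: What is the capital of France? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```

On an M1 running Debian you're most likely CPU-only, so leave n_gpu_layers at 0.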
You need about a gig of RAM/VRAM per billion parameters (plus some headroom for the context window). Lower precision doesn't really affect quality.
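Rough sketch of that arithmetic (assuming ~1 byte per parameter, i.e. 8-bit quantization, and a ballpark couple of GB of headroom):

```python
# Back-of-the-envelope memory estimate for running an LLM locally.
# Assumes ~1 byte per parameter (8-bit); 4-bit roughly halves the weights.
def estimated_memory_gb(params_billion: float, bytes_per_param: float = 1.0,
                        context_overhead_gb: float = 2.0) -> float:
    """Rough RAM/VRAM needed: weights plus headroom for the KV cache."""
    return params_billion * bytes_per_param + context_overhead_gb

for size in (7, 13, 70):
    print(f"{size}B model: ~{estimated_memory_gb(size):.0f} GB at 8-bit, "
          f"~{estimated_memory_gb(size, 0.5):.0f} GB at 4-bit")
```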
When Ethereum flipped from proof of work to proof of stake, a lot of used high-end cards hit the market.
4 of them in a cheap server would do the trick. Would be a great business model for some cheap colo to stand up a crap-ton of those and rent whole servers to everyone here.
In the meantime, if you're interested in a cheap server as described above, post in this thread.
It might or might not run at reasonable speeds, but I would reason that it avoids the "sunk cost irony" of deciding, at some point, that ChatGPT would have sufficed for your task. It's rare, but it can happen.
If you want to take this silly logic further, you can theoretically run a model of any size on any computer. You could even attempt this dumb idea on a computer running Windows 95. I don't care how long it takes; if it takes seven and a half million years to produce 42 tokens, I would still call it a success!