I want to build an air-cooled home server to run the larger-parameter models (like Llama 3 70B, which is roughly 40 GB when quantized to 4 bits). It seems like running two 3090s or 4090s is the way to go for this.
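For context, a rough back-of-envelope VRAM check, assuming ~0.5 bytes per weight at 4-bit plus a ballpark allowance for KV cache and runtime overhead (the overhead figure is an assumption, not a measurement):

```python
# Rough VRAM estimate for a 4-bit quantized 70B model (illustrative numbers).
params = 70e9              # Llama 3 70B parameter count
bytes_per_param = 0.5      # ~4 bits per weight after quantization
weights_gb = params * bytes_per_param / 1e9   # ~35 GB of weights
overhead_gb = 5            # assumed KV cache + runtime overhead
total_gb = weights_gb + overhead_gb           # ~40 GB total

vram_per_card_gb = 24      # 3090 / 4090
cards = 2
print(f"Estimated need: ~{total_gb:.0f} GB, available: {cards * vram_per_card_gb} GB")
# Two 24 GB cards leave ~8 GB of headroom; a third card mostly buys longer context.
```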
1) Does Ollama support loading the model across multiple GPUs? (A quick way to check is sketched after these questions.)
2) Anyone have a general parts list that works well that I can copy? I'd prefer to go with 3 GPUs, but I feel like cooling may be an issue.
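On (1): Ollama runs on llama.cpp, which can spread a model's layers across GPUs when one card's VRAM isn't enough. A minimal sketch of how to sanity-check the split on a given box, assuming Ollama is serving on its default port (11434) and `nvidia-smi` is on the PATH:

```python
# Load the model via Ollama's HTTP API, then check whether VRAM is
# actually being used on both GPUs (i.e., the layers got split across cards).
import json
import subprocess
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "llama3:70b", "prompt": "hello", "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req).read()  # forces the model to load into VRAM

# Per-GPU memory usage: with a ~40 GB model on 24 GB cards, you'd expect
# substantial memory.used on both GPUs if the split worked.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True,
).stdout)
```

If only one card fills up while the other sits idle, the model isn't being split and you'd be falling back to CPU offload for the remainder.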
* Unified memory will effectively let you run much larger models without requiring you to add in more GPUs
* GPUs will require loading the model from system memory, which is always going to be slower than macOS with Metal
* Fitting multiple GPUs (especially 4090s) into a case is difficult, and motherboards that can support it are expensive (such as https://www.amazon.com/ASUS-Pro-WS-Motherboard-Server-Grade/...)
The one benefit of the 4090s is that they _should_ in theory be faster, but YMMV.