I think that since transformers are highly parallelizable and have a large number of parameters, it would make sense to use CPU RAM, which is significantly cheaper, and rely on multi-core training to offset the performance gap versus GPUs? (I assume something like 10 cores would perform as well as 1 GPU?).
Please tell me why this is a stupid idea and won't work before I go and blow my cash on a 70-core, 512GB RAM machine.
Transformers are very easily parallelizable.
Take a look at the H100: https://www.nvidia.com/en-us/data-center/technologies/hopper... It has 18,432 CUDA cores, which for the purposes of this discussion means 18,432 multiplications in parallel. It also has Tensor Cores, which speed things up several times over again; they're essentially dedicated hardware for matrix multiplications.
We could use AVX-512 to do multiplications in parallel on each CPU core (512/32 = 16 single-precision or 512/64 = 8 double-precision values per instruction), but it doesn't compare. The GPU still wins.
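Just to put rough numbers on it, here's a back-of-the-envelope sketch in Python. The CPU clock speed and FMA-unit count are assumptions, the H100 numbers are the usual spec-sheet figures, and it ignores memory bandwidth and throttling entirely, so treat it as ballpark only:

```python
# Rough theoretical FP32 throughput comparison (ballpark figures, not measurements):
# - assumed 70-core CPU at ~2.5 GHz with 2 AVX-512 FMA units per core
#   -> 2 FMAs * 16 fp32 lanes * 2 flops/FMA = 64 flops/cycle/core
# - H100 SXM spec-sheet figures: ~67 TFLOPS FP32 (CUDA cores),
#   ~990 TFLOPS FP16/BF16 dense on Tensor Cores

cores, ghz = 70, 2.5
cpu_tflops = cores * ghz * 1e9 * 64 / 1e12   # ~11 TFLOPS theoretical peak FP32
h100_fp32_tflops = 67                        # plain CUDA cores
h100_tensor_tflops = 990                     # FP16/BF16 Tensor Cores, dense

print(f"CPU peak FP32:     ~{cpu_tflops:.0f} TFLOPS")
print(f"H100 FP32:         ~{h100_fp32_tflops} TFLOPS ({h100_fp32_tflops / cpu_tflops:.0f}x)")
print(f"H100 Tensor Cores: ~{h100_tensor_tflops} TFLOPS ({h100_tensor_tflops / cpu_tflops:.0f}x)")
```

Even giving the CPU its theoretical peak, the GPU is an order of magnitude ahead on plain FP32 and nearly two orders of magnitude ahead once Tensor Cores and reduced precision come into play, and in practice the CPU won't hit its peak because the matrices won't fit in cache.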
A 32-core CPU may have more capability per core, but that doesn't scale anywhere near what training LLMs requires.
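If you want to see it for yourself before spending the cash, a minimal PyTorch timing sketch along these lines shows the gap on real hardware (the matrix size and rep count are arbitrary choices, and the GPU half assumes a CUDA-capable card is available):

```python
import time
import torch

def bench_matmul(device: str, n: int = 4096, reps: int = 10) -> float:
    """Time an n x n float32 matmul and return average seconds per multiply."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    # Warm-up so one-time initialization isn't counted.
    torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps

cpu_t = bench_matmul("cpu")
print(f"CPU: {cpu_t * 1e3:.1f} ms per 4096x4096 matmul")
if torch.cuda.is_available():
    gpu_t = bench_matmul("cuda")
    print(f"GPU: {gpu_t * 1e3:.1f} ms per 4096x4096 matmul ({cpu_t / gpu_t:.0f}x faster)")
```

And that's just one matmul; training runs billions of them, so the gap compounds into the difference between days and months.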