I think that since transformers are highly parallelizable and have a large number of parameters, it would make sense to use CPU RAM, which is significantly cheaper, and rely on multi-core training to offset the performance gap versus GPUs? (I assume something like 10 cores would perform as well as 1 GPU?).
Please tell me why this is a stupid idea and won't work before I go and blow my cash on a 70-core, 512GB RAM machine.
Transformers are very easily parallelizable.
Take a look at the H100: https://www.nvidia.com/en-us/data-center/technologies/hopper... It has 18,432 CUDA cores, which for the purposes of this discussion means 18,432 multiplications in parallel. It also has Tensor Cores, which speed things up several times over again; they're essentially dedicated hardware for matrix multiplications.
We could use AVX-512 to do multiplications in parallel on each CPU core (512/32 = 16 single-precision or 512/64 = 8 double-precision values per instruction), but it doesn't compare. The GPU still wins.
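Just to put rough numbers on it, here's a back-of-the-envelope sketch in Python. The CPU clock speed and FMA-unit count are assumptions, the H100 numbers are the usual spec-sheet figures, and it ignores memory bandwidth and throttling entirely, so treat it as ballpark only:

```python
# Rough theoretical FP32 throughput comparison (ballpark figures, not measurements):
# - assumed 70-core CPU at ~2.5 GHz with 2 AVX-512 FMA units per core
#   -> 2 FMAs * 16 fp32 lanes * 2 flops/FMA = 64 flops/cycle/core
# - H100 SXM spec-sheet figures: ~67 TFLOPS FP32 (CUDA cores),
#   ~990 TFLOPS FP16/BF16 dense on Tensor Cores

cores, ghz = 70, 2.5
cpu_tflops = cores * ghz * 1e9 * 64 / 1e12   # ~11 TFLOPS theoretical peak FP32
h100_fp32_tflops = 67                        # plain CUDA cores
h100_tensor_tflops = 990                     # FP16/BF16 Tensor Cores, dense

print(f"CPU peak FP32:     ~{cpu_tflops:.0f} TFLOPS")
print(f"H100 FP32:         ~{h100_fp32_tflops} TFLOPS ({h100_fp32_tflops / cpu_tflops:.0f}x)")
print(f"H100 Tensor Cores: ~{h100_tensor_tflops} TFLOPS ({h100_tensor_tflops / cpu_tflops:.0f}x)")
```

Even giving the CPU its theoretical peak, the GPU is an order of magnitude ahead on plain FP32 and nearly two orders of magnitude ahead once Tensor Cores and reduced precision come into play, and in practice the CPU won't hit its peak because the matrices won't fit in cache.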
A 32-core CPU may have more capability per core, but that doesn't scale anywhere near what training LLMs requires.
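If you want to see it for yourself before spending the cash, a minimal PyTorch timing sketch along these lines shows the gap on real hardware (the matrix size and rep count are arbitrary choices, and the GPU half assumes a CUDA-capable card is available):

```python
import time
import torch

def bench_matmul(device: str, n: int = 4096, reps: int = 10) -> float:
    """Time an n x n float32 matmul and return average seconds per multiply."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    # Warm-up so one-time initialization isn't counted.
    torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / reps

cpu_t = bench_matmul("cpu")
print(f"CPU: {cpu_t * 1e3:.1f} ms per 4096x4096 matmul")
if torch.cuda.is_available():
    gpu_t = bench_matmul("cuda")
    print(f"GPU: {gpu_t * 1e3:.1f} ms per 4096x4096 matmul ({cpu_t / gpu_t:.0f}x faster)")
```

And that's just one matmul; training runs billions of them, so the gap compounds into the difference between days and months.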