HACKER Q&A
📣 mcdow

What's Holding AMD Back?


Nvidia has an effective monopoly on the GPUs used for AI/ML. AMD GPUs only seem to be useful for gaming. The common sentiment is that the difference between the two vendors' GPUs is only software; the hardware is comparable. In particular, many people report issues with AMD's drivers crashing.

I'd like to hear from those familiar with writing and using GPU software.

1. Is it really only a software issue?

2. What are all of the issues? Crashing? Interface compatibility with CUDA?

3. What is the scale of work that would need to be done to get AMD GPU software to the level of Nvidia GPU software? Does this require a complete revolution of the culture at AMD, or is this a task that could be accomplished by a highly motivated and competent vanguard?

I ask because the common sentiment seems silly to me. How hard can it really be to write good drivers? I'd like to challenge my own beliefs.


👤 talldayo
> 1. Is it really only a software issue?

There are hardware issues in the mix, but they're mostly negligible. Nvidia is really good at hardware design, bought up a huge amount of TSMC capacity (à la Apple with N5), and both its consumer and server lineups are well stocked with options.

It's also important to understand what we mean by "software issue" in this case: CUDA is the de facto leader because no other effort has enough backing to matter. OpenCL was promising but got abandoned, Vulkan compute would be nice if people would converge on Vulkan, and DirectML/MLX/ONNX have all faded into obscurity for lack of vision. AMD can't do this alone; they need multiple stakeholders to convince themselves that standardizing a CUDA-equivalent feature set is a good idea.
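
For a sense of how small the language gap actually is: HIP, AMD's CUDA-like C++ dialect, mirrors CUDA almost token for token. Here's a minimal vector-add sketch (assuming a working hipcc install); the CUDA version differs only in the hip* prefixes and the runtime header.

    // Minimal vector add in HIP. s/hip/cuda/ and you have the CUDA version.
    #include <hip/hip_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);
        float *da, *db, *dc;
        hipMalloc(&da, bytes); hipMalloc(&db, bytes); hipMalloc(&dc, bytes);
        hipMemcpy(da, ha.data(), bytes, hipMemcpyHostToDevice);
        hipMemcpy(db, hb.data(), bytes, hipMemcpyHostToDevice);
        vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        hipMemcpy(hc.data(), dc, bytes, hipMemcpyDeviceToHost);
        printf("c[0] = %.1f\n", hc[0]);  // expect 3.0
        hipFree(da); hipFree(db); hipFree(dc);
    }

Writing this was never the hard part. The hard part is that everything above this layer - cuDNN, cuBLAS, NCCL, and a decade of pet projects - assumes the CUDA spelling.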

> 2. What are all of the issues? Crashing? Interface compatibility with CUDA?

Here's a good example. All of Nvidia's modern GPUs support some version of CUDA, be it a weak GT 720 or an RTX 4090. AMD's software support is nowhere near as uniform: some mobile chipsets support ROCm and others don't; some desktop GPUs support it and many others don't. That fragmentation makes it hard to convince anyone that supporting AMD is worthwhile.
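
You can see the fragmentation from the runtime itself. A minimal probe, assuming a working ROCm/HIP install: enumerate the devices and print the gfx target each one reports, then compare that against ROCm's (short) official support list.

    // Print the architecture string each device reports. Whether that gfx
    // target is on ROCm's officially supported list is a separate question,
    // and for consumer cards the answer is often "no".
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            printf("no HIP-capable device (or the runtime itself failed)\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop;
            hipGetDeviceProperties(&prop, i);
            // e.g. "gfx90a" (MI200 server) vs "gfx1030" (RX 6800 desktop)
            printf("device %d: %s, arch %s\n", i, prop.name, prop.gcnArchName);
        }
        return 0;
    }

In practice, people with unofficial consumer cards end up setting HSA_OVERRIDE_GFX_VERSION so their GPU masquerades as a nearby supported target, which tells you everything about the state of the support matrix.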

> 3. What is the scale of work that would need to be done to get AMD GPU software to the level of Nvidia GPU software?

It might not even be possible, or worth doing. Simply getting AI workloads running by 2028 would be a Pyrrhic victory, and Lisa Su may be acutely aware of that. By then there might be another edge-compute market, like cryptocurrency or Folding@Home, that CUDA will take by storm.

The half that AMD can control is their GPU architecture. They could start by committing to a standardized, CUDA-like pipeline across their mobile, desktop, and server GPUs. That would be a huge commitment, and it would reverse a lot of the simplifications AMD is known for, but it's their choice to make.
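
To make "standardized across the lineup" concrete: today even the wavefront width differs between AMD's own families (64 lanes on GCN/CDNA server parts, 32 on RDNA consumer parts), so portable code has to query it instead of assuming CUDA's fixed 32. A small sketch:

    // Code that hardcodes warpSize == 32 (true on every CUDA GPU) silently
    // misbehaves on CDNA parts, where wavefronts are 64 lanes wide.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, 0) != hipSuccess) return 1;
        // 64 on GCN/CDNA (e.g. MI250), 32 on RDNA (e.g. RX 7900 XTX)
        printf("%s: wavefront size %d\n", prop.name, prop.warpSize);
        return 0;
    }

Erasing that kind of divergence across mobile, desktop, and server parts is exactly the scale of commitment in question.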

But there are two other parts I think are worth considering. One is that AMD might build their mansion for a ghost town, wasting billions serving a market that came and went. If the CUDA boom turns out to be a passing frenzy and AMD was right to simplify all along, then course-correcting a decade too late might be a waste of capital. The other problem is that their competitors might fight them instead of helping them. Apple wants nothing to do with them, and AMD lost their negotiating leverage when Apple quit needing x86 chips. Intel is a wild card right now, and Nvidia will support anything as long as it's a selling point.

People like to accuse AMD of non-competition, but the sadder truth is that there's a general apathy across the industry toward what Nvidia did. Apple, AMD, and Intel are all betting on weaker dedicated NPU hardware rather than taking Nvidia's design choices to heart. That's led to chronic neglect of GPU compute libraries, and the few that exist are usually domain-specific enough to be someone's CUDA pet project.