HACKER Q&A
📣 rramadass

What is an A.I. chip and how does it work?


With all the current news about NVIDIA AI/ML chips:

Can anybody give an overview of AI/ML/NPU/TPU/etc. chips and pointers to detailed technical papers/books/videos about them? All I am able to find are marketing/sales/general overviews which really don't explain anything.

I'm looking for a technical deep dive.


  👤 nologic01 Accepted Answer ✓
It may help your digging if you keep in mind what those chips really try to do: accelerate numerical linear algebra calculations.

If you are familiar with linear algebra: these specialized chips dedicate silicon to performing vector (and more generally multi-dimensional array, or tensor) computations faster than a general-purpose CPU. They do that by loading and operating on a whole set of numbers (a chunk of a vector or matrix) simultaneously, whereas a CPU would operate mostly serially, one at a time.

The advantage, in a nutshell, is that you can get a significant speedup. How much depends on the problem and on how big a chunk you can process simultaneously, but it can be a large factor.
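
To make the serial-versus-parallel contrast concrete, here is a minimal sketch in CUDA (illustrative only, not tied to any particular accelerator; the function names are just for the example). The CPU loop touches one element per iteration, while the GPU kernel assigns one thread to each element so a whole chunk is processed at once:

    // Serial CPU version: one multiply-add at a time.
    void scale_add_cpu(const float* x, const float* y, float* out, float a, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];
    }

    // GPU version: launch n threads; each handles one element, all in flight together.
    __global__ void scale_add_gpu(const float* x, const float* y, float* out, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * x[i] + y[i];
    }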

There are disadvantages that people ignore in the current AI hype:

* The speedup is a one-off gain; the death of Moore's law applies equally to "AI chips" and CPUs.

* The software you need to develop and run is extremely specialized and fine-tuned, and it only applies to the linear algebra problems above.

* In the past, such specialized numerical algebra hardware was the domain of HPC (high-performance computing). Many a supercomputer vendor went bankrupt because the cost versus market size was not there.


👤 ftxbro
This isn't a technical deep dive, but here's a simplified explanation.

It's a matrix multiplication (https://en.wikipedia.org/wiki/Matrix_multiplication) accelerator chip. Matrix multiplication is important for a few reasons. 1) it's the slow part of a lot of AI algorithms, 2) it's a 'high intensity' algorithm (naively n^3 computation vs. n^2 data), 3) it's easily parallelizable, and 4) it's conceptually and computationally simple.
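
To make points 2 and 3 concrete, here is a naive matmul kernel in CUDA (a sketch, not any particular chip's implementation): each of the n^2 outputs performs n multiply-adds, so roughly 2*n^3 operations are done over only about 3*n^2 numbers, and every output element is independent, which is where the easy parallelism comes from.

    // Naive C = A * B for n x n row-major matrices.
    // ~2*n^3 floating point ops over ~3*n^2 values: arithmetic intensity grows with n.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[row * n + k] * B[k * n + col];  // one multiply-add per k
            C[row * n + col] = acc;                      // each output is independent
        }
    }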

The situation is kind of analogous to FPUs (floating point math co-processor units) when they were first introduced, before they were integrated into the CPU.

There's more to it than this, but that's the basic idea.


👤 phh
There are very different architectures in the wild. Some are simply standard GPUs (maybe with additional support for bf16/float16); the Rockchip RK1808 has one like that. You give it a list of instructions pretty much like a CPU (except massively parallel), and it'll execute them. BTW, when I say standard GPU, I don't mean "kind of like a GPU"; it really is literally a GPU architecture. Linux mainline support for the Amlogic A311D2's NPU is 10 lines, and it just declares a no-output Vivante GPU.

Some are just a hardware pipeline that computes 2D/3D convolutions plus an activation function (the Rockchip RK3588 has one like that). You give it the memory address and dimensions of the input matrix, the memory address and dimensions of the weights, the memory address and dimensions of the output, and which activation function you want (only about 4 are supported). Then you tell it RUN, wait a bit, and the result is at the output memory address.
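
To give a flavour of what driving such a fixed-function block looks like, here is a hypothetical descriptor sketch (the struct, field names and activation list are invented for illustration; the RK3588's real register layout is different):

    #include <stdint.h>

    // Hypothetical job descriptor for a fixed-function convolution engine.
    enum Activation { ACT_NONE, ACT_RELU, ACT_SIGMOID, ACT_TANH };  // "only like 4 supported"

    struct ConvJob {
        uint64_t input_addr;        // DMA address of the input feature map
        uint32_t in_h, in_w, in_c;  // input dimensions
        uint64_t weight_addr;       // DMA address of the weights
        uint32_t k_h, k_w, out_c;   // kernel dims / number of output channels
        uint64_t output_addr;       // where the engine writes the result
        Activation act;             // which activation to apply
    };

    // Driver flow, roughly: fill in a ConvJob, write it to the engine's registers,
    // poke the RUN bit, wait for the interrupt, then read from output_addr.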

(I took Rockchip examples to show that even within one vendor's microcosm it can vary a lot.)

And then you can imagine any architecture in-between.

AFAIK they all work with some RAM as input and some RAM as output, but some have their own RAM, some share RAM with the system, and some have mixed usage (the RK3588 has some SRAM, and when you ask it to compute a convolution, you can tell it to write either to the SRAM or to system RAM).

It's possible that there are some components that are borderline between an ISP (Image Signal Processor) and an NPU, where the input is the direct camera stream, but my guess is that they do some very light processing on the camera stream, then dump it to RAM, then do all the heavy work from RAM to RAM. I think the Pixel 4-5 had something like that.


👤 elseless
H.T. Kung’s 1982 paper on systolic arrays is the genesis of what are now called TPUs: http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-s...

👤 neximo64
Short story: CPUs can do calculations, but they do them one at a time. Think of something like 1 + 1 = 2. If you had 1 million equations like these, a CPU will generally do them one at a time, i.e. the first one, then the second, etc.

GPUs were optimised to draw, so they were able to do many of these at a go. That means they can be used for AI/ML in both gradient descent (training) and inference (forward passes). Because you can do many operations at once, in parallel, they speed things up dramatically. Geoff Hinton and others experimented with GPUs to exploit this ability, even though GPUs aren't actually optimised for it; it just turned out to be the best way available at the time, and it still is.

AI chips are optimised to do either inference or gradient descent. They are not good at drawing the way GPUs are. They are optimised for machine learning and for joining with other AI chips, so you can have massive networks of chips computing in parallel.

One other class of chip that has not yet shown up is ASICs that hard-wire the transformer architecture for even more speed, though the architecture changes too much at the moment for that to be useful.

Also, because of the economics of manufacturing at scale, GPUs are currently cheaper per FLOP of compute, since the volume is shared with graphics uses. Over time, if there is enough scale, AI chips should end up cheaper.


👤 binarymax
I’d start with CUDA, because knowing what a chip does won’t click until you see how it can be programmed to do massively parallel computation and matmul.

I read the first book in this list about 10 years ago, and though it’s pretty old, the concepts are solid.

https://developer.nvidia.com/cuda-books-archive
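
For a first taste of the model, here's a minimal, self-contained CUDA program (a sketch; error checking omitted, kernel and variable names are just for the example): copy data to the GPU, launch a grid with about a million threads, copy the result back.

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void square(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread handles one element
        if (i < n) data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1 << 20;                        // ~1 million elements
        std::vector<float> host(n, 3.0f);
        float* dev = nullptr;
        cudaMalloc((void**)&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;                            // threads per block
        int blocks  = (n + threads - 1) / threads;    // enough blocks to cover every element
        square<<<blocks, threads>>>(dev, n);          // ~1M threads scheduled by the hardware

        cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("host[0] = %f\n", host[0]);            // 9.0
        cudaFree(dev);
        return 0;
    }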


👤 imakwana
Relevant: Stanford Online course - CS217 Hardware acceleration for machine learning

https://online.stanford.edu/courses/cs217-hardware-accelerat...

Course website with lecture notes: https://cs217.stanford.edu/

Reading list: https://cs217.stanford.edu/readings


👤 psychphysic
Google's TPU (the edge version of which they sell via Coral) is just a systolic array of multiply-accumulate units arranged in a grid.

Here's a decent overview from the horse's mouth. https://cloud.google.com/blog/products/ai-machine-learning/a...

It's called a systolic array because the data pulses through it in waves, similar to how an engineer imagines the heart pumping blood :)
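
If it helps to see the "wave" idea in code, here is a tiny software model of a weight-stationary systolic array (a sketch of the concept only, not Google's actual MXU design; sizes and values are made up): weights sit still in the grid, activations march right, partial sums march down, one heartbeat per cycle.

    #include <cstdio>
    #include <vector>

    int main() {
        const int K = 3, N = 3;                     // K x N grid of multiply-accumulate cells
        float W[K][N] = {{1,2,3},{4,5,6},{7,8,9}};  // weights preloaded into the cells
        float x[K]    = {1, 1, 2};                  // one input vector, fed in skewed in time

        std::vector<std::vector<float>> a(K, std::vector<float>(N, 0.0f));  // activation regs
        std::vector<std::vector<float>> p(K, std::vector<float>(N, 0.0f));  // partial-sum regs
        float y[N] = {0, 0, 0};                     // results collected at the bottom row

        for (int t = 0; t < K + N; ++t) {           // one "heartbeat" per iteration
            auto a_next = a, p_next = p;
            for (int k = 0; k < K; ++k) {
                for (int n = 0; n < N; ++n) {
                    float a_in = (n == 0) ? ((t == k) ? x[k] : 0.0f) : a[k][n-1];
                    float p_in = (k == 0) ? 0.0f : p[k-1][n];
                    p_next[k][n] = p_in + W[k][n] * a_in;  // multiply-accumulate
                    a_next[k][n] = a_in;                   // pass activation to the right
                }
            }
            a = a_next; p = p_next;
            for (int n = 0; n < N; ++n)
                if (t == K - 1 + n) y[n] = p[K-1][n];      // column n drains at cycle K-1+n
        }
        // y[n] = sum over k of W[k][n] * x[k]; for these numbers that's 19 23 27
        printf("%g %g %g\n", y[0], y[1], y[2]);
        return 0;
    }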


👤 visarga
Trying to make a list of AI accelerator chip families, anything missing?

- GPU (Graphics Processing Unit)

- TPU (Tensor Processing Unit): ASIC designed for TensorFlow

- IPU (Intelligence Processing Unit): Graphcore

- HPU (Habana Processing Unit): Intel Habana Labs' Gaudi and Goya AI

- NPU (Neural Processing Unit): Huawei, Samsung, Microsoft Brainwave

- VPU (Vision Processing Unit): Intel Movidius

- DPU (Data Processing Unit): NVIDIA data center infrastructure processing unit

- Amazon's Inferentia: Amazon's accelerator chip focused on low cost

- Cerebras Wafer Scale Engine (WSE)

- SambaNova Systems DataScale

- Groq Tensor Streaming Processor (TSP)


👤 neom
Veritasium did a pretty good video on some of them: https://www.youtube.com/watch?v=GVsUOuSjvcg

👤 fragmede
Starting from https://cloud.google.com/tpu/docs/system-architecture-tpu-vm what are you looking for?

👤 zoogeny
There is a YouTube channel, TechTechPotato [1], that has a podcast on AI hardware called "The AI Hardware Show". It's pretty small, which gives you a sense of how niche this market is, but if you want the 10,000-foot view from young, budding tech journalists then I think this fits the bill.

Some random examples of video titles from the last 6 months of the channel:

* A Deep Dive into IBM's New Machine Learning Chip

* Does my PC actually use Machine Learning?

* Intel's Next-Gen 2023 Max CPU and Max GPU

* A Deep Dive into Avant, the new chip from Lattice Semiconductor (White Paper Video)

* The AI Hardware Show 2023, Episode 1: TPU, A100, AIU, BR100, MI250X

I think the podcaster's background is actually in HPC (High Performance Computing), i.e. supercomputers. But that overlaps just enough with AI hardware that he saw an opportunity to capitalize on the new AI hype.

1. https://www.youtube.com/c/TechTechPotato


👤 nl
There's a lot of information here about chips which are mostly built for training neural networks.

It's worth noting there are very widely deployed chips primarily built for inference (running the network), especially on mobile phones.

Depending on the device and manufacturer, this is sometimes implemented as part of the CPU itself, but functionally it's the same idea.

The Apple Neural Engine is a good example of this. It is separate from the GPU, which is also on the same chip.

Further information is here: https://machinelearning.apple.com/research/neural-engine-tra...

The Google Tensor chip used in the Pixel has a similar coprocessor called the EdgeTPU.


👤 anon291
I've worked in this space for the past five years. The chips are essentially highly parallel processors. There's no unifying architecture. You have the graph-based / HPC-simulator chips like Cerebras, Graphcore, etc., which are basically a stick-as-many-cores-on-as-possible situation with a high-speed networking fabric. You have the 'tensor' chips like Groq, where the chip operates as a whole and is just well suited to tensor processing (parallelizable, high-speed memory, etc.).

At the end of the day, it's matrix multiplication acceleration mostly, and then IO optimization. Literally most of the optimization has nothing to do with compute. We can compute faster than we can ingest.


👤 sremani
I want to latch onto this question a bit: which company out there is primed to bring us a CUDA competitor? AMD has failed, so any wise words from the people in the industry?

👤 gojo_dog
There's a five-part series outlining AI accelerator chips that came out last year, starting with an introduction, motivation, technical foundations, and existing solutions:

https://medium.com/@adi.fu7/ai-accelerators-part-i-intro-822...


👤 fulafel
Here's one from Google (paper link at the end): https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...

👤 sharph
Great video from Asianometry explaining AI chips' (GPGPUs, general-purpose GPUs) roots in GPUs (graphics processing units): how did we get here, and what do these chips do?

https://www.youtube.com/watch?v=GuV-HyslPxk


👤 tomek32
You might find this talk interesting: "The AI Chip Revolution" with Andrew Feldman of Cerebras, https://youtu.be/JjQkNzzPm_Q

He's the founder of a new AI chip company, and they talk a bit about the differences.


👤 nottorp
An "AI" chip is marketing. But as other posts say, "linear algebra coprocessor" doesn't roll off the tongue as well.

Incidentally, there used to be a proper "AI" chip: the original perceptron was intended to be implemented in hardware. But general-purpose chips evolved much faster.

https://en.wikipedia.org/wiki/Perceptron


👤 ttul
On top of what others have said here about TPUs and their kin, you can make things really scream by taping out an ASIC for a specific frozen neural network (i.e. including the weights and parameters).

If you never have to change the network - for instance to do image segmentation or object recognition - then you can’t get any more efficient than a custom silicon design that bakes in the weights as transistors.
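
A rough software analogy (only an analogy, and the weights below are made up: the real win is in the silicon, not the compiler): when the weights are compile-time constants, the "layer" can be fully specialized, the way a frozen-network ASIC hard-wires them.

    // Software analogy for a frozen network: weights are constants the compiler
    // can fold and specialize, much as an ASIC bakes them into the datapath.
    // The weight values are invented purely for illustration.
    constexpr float W[2][3] = {{0.5f, -1.0f, 2.0f},
                               {1.5f,  0.0f, -0.5f}};

    inline void frozen_layer(const float in[3], float out[2]) {
        for (int i = 0; i < 2; ++i) {
            float acc = 0.0f;
            for (int j = 0; j < 3; ++j)
                acc += W[i][j] * in[j];       // no memory traffic for W: it is baked in
            out[i] = acc > 0.0f ? acc : 0.0f; // fixed ReLU, chosen just for the example
        }
    }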


👤 mongol
What would be an affordable/cheap way to get hands-on with this type of hardware? Right now I have zero knowledge.

👤 turbojerry
The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.

http://nvdla.org/


👤 sethgoodluck
Remember crypto-mining ASICs? It's the same idea, but built for the math around AI work instead of blockchain work.

👤 joyeuse6701
Interestingly, this might be well answered by the LLMs built on the technology you’re interested in.

👤 m3kw9
An AI chip is basically a chip that calculates matrices better than general-purpose CPUs.

👤 arroz
AI chips are just regular chips that do AI stuff faster.

So, dedicated hardware to do math stuff.


👤 pulkas
They are what ASICs were for Bitcoin: a new era, this time for AI models.

👤 HarHarVeryFunny
Modern AI/ML is increasingly about neural nets (deep learning), whose performance is based on floating point math - mostly matrix multiplication and multiply-and-add operations. These neural nets are increasingly massive, e.g. GPT-3 has 175 billion parameters, meaning that each pass through the net (each word generated) involves in excess of 175B floating point multiplications!

When you're multiplying two large matrices together (or other similar operations) there are thousands of individual multiply operations that need to be performed, and they can be done in parallel since these are all independent (one result doesn't depend on the other).

So, training/running these ML/AI models as fast as possible requires the ability to perform massive numbers of floating point operations in parallel, but a desktop CPU has only a limited capacity to do that, since it is designed as a general-purpose device, not just for math. A modern CPU has multiple "cores" (individual processors that can run in parallel), but only a small number (~10), and not all of these can do floating point, since the CPU has specialized FPU units to do that, typically fewer in number than the cores.

This is where GPU/TPU/etc. "AI/ML" chips come in, and what makes them special. They are designed specifically for this job - to do massive numbers of floating point multiplications in parallel. A GPU of course can run games too, but it turns out the requirements for real-time graphics are very similar - a massive amount of parallelism. In contrast to the CPU's ~10 cores, GPUs have thousands of cores (e.g. the NVIDIA RTX 4070 has 5,888) running in parallel, and these are all floating-point capable. This results in the ability to do huge numbers of floating point operations per second (FLOPS), e.g. the RTX 4070 can do about 30 TFLOPS (tera-FLOPS) - i.e. 30,000,000,000,000 floating point operations per second !!

This brings us to the second specialization of these GPU/TPU chips - since they can do this ridiculous number of FLOPS, they need to be fed data at an equally ridiculous rate to keep them busy, so they need massive memory bandwidth - way more than a CPU needs to be kept busy. The normal RAM in a desktop computer is too slow for this, and is in any case in the wrong place - on the motherboard, where it can only be accessed across the PCIe bus, which is again way too slow to keep up. GPUs solve this memory speed problem with a specially designed memory architecture and lots of very fast RAM co-located very close to the GPU chip. For example, that RTX 4070 has 12GB of RAM and can move data from it into its processing cores at a speed (memory bandwidth) of roughly 500GB/sec, with higher-end cards like the RTX 4090 approaching 1TB/sec !!
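
To see why the bandwidth figure matters as much as the FLOPS figure, here is the back-of-envelope arithmetic as a runnable sketch (using the rough figures quoted above, not exact specs):

    #include <cstdio>

    int main() {
        // Rough figures from the paragraph above (approximate, not exact specs).
        double peak_flops = 30e12;   // ~30 TFLOPS of compute
        double bandwidth  = 0.5e12;  // ~500 GB/s of memory bandwidth

        // To keep the compute units busy, every byte fetched has to feed this many FLOPs:
        double flops_per_byte  = peak_flops / bandwidth;           // = 60
        double flops_per_float = flops_per_byte * sizeof(float);   // = 240 per 4-byte value

        printf("Need %.0f FLOPs per byte (%.0f per float) to stay compute-bound.\n",
               flops_per_byte, flops_per_float);
        // A big matmul easily exceeds this (its intensity grows with matrix size),
        // but element-wise ops (a few FLOPs per value) are hopelessly memory-bound.
        return 0;
    }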

The exact designs of the various chips differ a bit (and a lot is proprietary), but they are all designed to provide these two capabilities - massive floating point parallelism, and massive memory bandwidth to feed it.

If you want to get into this in detail, the best place to start would be to look into low-level CUDA programming for NVIDIA's cards. CUDA is the lowest-level API that NVIDIA provides to program their GPUs.