Can anybody give an overview of AI/ML/NPU/TPU/etc chips and pointers to detailed technical papers/books/videos about them? All I am able to find are marketing/sales/general overviews which really don't explain anything.
Am looking for a technical deep dive.
If you are familiar with linear algebra: these specialized chips literally dedicate etched silicon to performing vector (and more generally multi-dimensional array, or tensor) computations faster than a general purpose CPU. They do that by loading and operating on a whole set of numbers (a chunk of a vector or a matrix) simultaneously, whereas the CPU would work mostly serially, one element at a time.
The advantage, in a nutshell, is a speedup. How much depends on the problem and on how big a chunk you can process simultaneously, but it can be a very large factor.
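To make the serial-vs-chunked difference concrete, here is a rough Python/NumPy sketch (my own illustration, nothing vendor-specific): the loop works one element at a time the way a scalar CPU core would, while the NumPy call hands the whole array to vectorized code in one go - the same idea these chips push much further in silicon.

```python
import time
import numpy as np

n = 5_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Serial: one multiply-add per iteration, the way a scalar inner loop works.
t0 = time.perf_counter()
acc = 0.0
for i in range(n):
    acc += a[i] * b[i]
t_serial = time.perf_counter() - t0

# Chunked/vectorized: the whole arrays are handed off in a single call.
t0 = time.perf_counter()
acc_vec = np.dot(a, b)
t_vec = time.perf_counter() - t0

print(f"serial: {t_serial:.3f}s  vectorized: {t_vec:.4f}s  "
      f"speedup: {t_serial / t_vec:.0f}x")
```

(Most of the gap you will see comes from Python interpreter overhead rather than hardware, but the shape of the argument - wide, batched operations beat element-at-a-time ones - is the same one that motivates dedicated hardware.)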
There are disadvantages that people ignore in the current AI hype:
* The speedup is a one-off gain; Moore's law is just as dead for "AI chips" as it is for CPUs.
* The software you need to develop and run is extremely specialized and fine-tuned, and it only applies to the linear algebra problems described above.
* In the past, such specialized numerical linear algebra hardware was the domain of HPC (high performance computing). Many a supercomputer vendor went bankrupt because the cost versus market size was not there.
It's a matrix multiplication (https://en.wikipedia.org/wiki/Matrix_multiplication) accelerator chip. Matrix multiplication is important for a few reasons. 1) it's the slow part of a lot of AI algorithms, 2) it's a 'high intensity' algorithm (naively n^3 computation vs. n^2 data), 3) it's easily parallelizable, and 4) it's conceptually and computationally simple.
The situation is kind of analogous to FPUs (floating point math co-processors) when they were first introduced, before they were integrated into the CPU itself.
There's more to it than this, but that's the basic idea.
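As a rough illustration of points 2 and 3 (my own sketch, not from the comment above): the naive algorithm does on the order of n^3 multiply-adds while touching only on the order of n^2 numbers, and the i/j outputs are all independent, so the work grows faster than the data and splits cleanly across many multipliers.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple-loop matrix multiply: ~2*n^3 flops over ~3*n^2 numbers."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):           # every (i, j) output is independent of the others
            for p in range(k):       # n*m*k multiply-adds in total
                C[i, j] += A[i, p] * B[p, j]
    return C

n = 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
flops = 2 * n**3                     # one multiply + one add per inner iteration
data = 3 * n**2                      # elements of A, B and C
print(f"n={n}: ~{flops} flops over ~{data} numbers "
      f"({flops / data:.0f} flops per number touched)")
assert np.allclose(naive_matmul(A, B), A @ B)
```

Because each loaded number is reused roughly 2n/3 times, hardware with thousands of multipliers can stay busy without needing proportionally more memory traffic.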
Some are just a hardware pipeline to compute 2D/3D convolutions + an activation function (the Rockchip RK3588 has one like that). You give it the memory address + dimensions of the input matrix, the memory address + dimensions of the weights, the memory address + dimensions of the output, and which activation function you want (only about 4 are supported); then you tell it RUN, you wait for a bit, and you have the result at the output memory address.
(I picked the Rockchip example to show that even within one microcosm it can vary a lot.)
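As a sketch of what "give it addresses, tell it RUN, wait" looks like from software (the register names, descriptor layout and mock register file below are hypothetical - this is not the actual RK3588 driver interface), the flow is roughly: fill in a job descriptor, kick the block, poll for completion, then read the result straight out of the output buffer.

```python
from dataclasses import dataclass

# Hypothetical job descriptor - real NPUs differ, but the shape of the interface is similar.
@dataclass
class ConvJob:
    input_addr: int        # DMA address of the input feature map
    input_dims: tuple      # (channels, height, width)
    weight_addr: int       # address of the convolution weights
    weight_dims: tuple     # (out_channels, in_channels, kh, kw)
    output_addr: int       # where the hardware should write the result
    output_dims: tuple
    activation: int        # tiny enum, e.g. 0 = none, 1 = ReLU, 2 = sigmoid, 3 = tanh

def run_conv(mmio_write, mmio_read, job: ConvJob):
    """Fill in the descriptor, hit RUN, poll until the block reports DONE."""
    mmio_write("JOB_DESCRIPTOR", job)     # in reality: a series of register writes
    mmio_write("CONTROL", "RUN")
    while mmio_read("STATUS") != "DONE":  # real drivers sleep or wait for an interrupt
        pass
    # No return value: the result is simply sitting at job.output_addr now.

# Minimal mock so the sketch runs without hardware: the "registers" are just a dict.
regs = {"STATUS": "DONE"}
run_conv(lambda reg, val: regs.__setitem__(reg, val), lambda reg: regs[reg],
         ConvJob(0x1000, (3, 224, 224), 0x2000, (16, 3, 3, 3),
                 0x3000, (16, 222, 222), activation=1))
```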
And then you can imagine any architecture in-between.
AFAIK they all work with some RAM as input and some RAM as output, but some may have their own RAM, some may share RAM with the system, and some might have mixed usage (the RK3588 has some SRAM, and when you ask it to compute the convolution, you can tell it to write either to SRAM or to system RAM).
It's possible that there are some components that are borderline between an ISP (Image Signal Processing) block and an NPU, where the input is the direct camera stream, but my guess is that they do some very small processing on the camera stream, then dump it to RAM, then do all the heavy work from RAM to RAM. I think the Pixel 4-5 had something like that.
GPUs were optimised to draw, so they were already able to do dozens of these operations at a go. That means they can be used for AI/ML in both training (gradient descent) and inference (forward passes). Because you can do many operations at a go, in parallel, they speed things up dramatically. Geoff Hinton and others experimented with GPUs to exploit this ability, even though GPUs aren't actually optimised for it; it just turned out to be the best way available at the time, and it still is.
AI chips are optimised to do either inference or gradient descent. They are not good at drawing the way GPUs are. They are optimised for machine learning and for joining with other AI chips, so you can have massive networks of chips computing in parallel.
One other class of chips that has not yet shown up is ASICs that hard-wire the transformer architecture for even more speed - though the architecture still changes too much at the moment for that to be useful.
Also, because of the economics of manufacturing at scale, GPUs are currently cheaper per FLOP of compute, since the volume is shared with graphics uses. Though with time, if there is enough scale, AI chips should end up cheaper.
I read the first book in this list about 10 years ago, and though it’s pretty old the concepts are solid.
https://online.stanford.edu/courses/cs217-hardware-accelerat...
Course website with lecture notes: https://cs217.stanford.edu/
Reading list: https://cs217.stanford.edu/readings
Here's a decent overview from the horse's mouth. https://cloud.google.com/blog/products/ai-machine-learning/a...
It's called a systolic array because the data pulses through it in waves, similar to the way an engineer imagines the heart pumping :)
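If it helps, here is a small, purely illustrative cycle-level simulation (my own toy using the textbook weight-stationary dataflow, not Google's actual design) of that wave-like movement: each processing element holds one weight, activations march rightward one cell per cycle, partial sums march downward one cell per cycle, and finished results drip out of the bottom edge in a staggered wave.

```python
import numpy as np

def systolic_matmul(X, W):
    """
    Toy weight-stationary systolic array computing Y = X @ W.
    PE[k][n] permanently holds W[k, n]; activations enter on the left edge with
    a diagonal skew and move right; partial sums move down and exit at the bottom.
    Purely illustrative - not any specific product's implementation.
    """
    T, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((T, N))

    act = np.zeros((K, N))       # activation register inside each PE
    psum = np.zeros((K, N))      # partial-sum register inside each PE

    for cycle in range(T + K + N):           # enough cycles to drain the pipeline
        new_act = np.zeros((K, N))
        new_psum = np.zeros((K, N))
        for k in range(K):
            for n in range(N):
                # Activation comes from the left neighbour, or the skewed input edge.
                if n == 0:
                    t = cycle - k            # X[t, k] enters row k at cycle t + k
                    a_in = X[t, k] if 0 <= t < T else 0.0
                else:
                    a_in = act[k, n - 1]
                # Partial sum comes from the neighbour above, or 0 at the top edge.
                p_in = psum[k - 1, n] if k > 0 else 0.0
                new_act[k, n] = a_in                    # pass the activation along
                new_psum[k, n] = p_in + W[k, n] * a_in  # one multiply-accumulate per PE
        act, psum = new_act, new_psum

        # Finished rows of results leave the bottom of each column, staggered by
        # one cycle per column - the "wave".
        for n in range(N):
            t = cycle - (K - 1) - n
            if 0 <= t < T:
                Y[t, n] = psum[K - 1, n]
    return Y

X = np.random.rand(5, 4)    # five input vectors of length 4
W = np.random.rand(4, 3)    # weights held in a 4x3 grid of PEs
assert np.allclose(systolic_matmul(X, W), X @ W)
```

The assert at the end checks the toy array against a plain matrix multiply; the staggered output schedule is the rhythmic pulsing the name refers to.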
- GPU (Graphics Processing Unit)
- TPU (Tensor Processing Unit): ASIC designed for TensorFlow
- IPU (Intelligence Processing Unit): Graphcore
- HPU (Habana Processing Unit): Intel Habana Labs' Gaudi and Goya AI
- NPU (Neural Processing Unit): Huawei, Samsung, Microsoft Brainwave
- VPU (Vision Processing Unit): Intel Movidius
- DPU (Data Processing Unit): NVIDIA data center infrastructure processing unit
- Amazon's Inferentia: Amazon's accelerator chip focused on low cost
- Cerebras Wafer Scale Engine (WSE)
- SambaNova Systems DataScale
- Groq Tensor Streaming Processor (TSP)
Some random examples of video titles from the last 6 months of the channel:
* A Deep Dive into IBM's New Machine Learning Chip
* Does my PC actually use Machine Learning?
* Intel's Next-Gen 2023 Max CPU and Max GPU
* A Deep Dive into Avant, the new chip from Lattice Semiconductor (White Paper Video)
* The AI Hardware Show 2023, Episode 1: TPU, A100, AIU, BR100, MI250X
I think the podcaster's background is actually in HPC (High Performance Computing), i.e. supercomputers. But that overlaps just enough with AI hardware that he saw an opportunity to capitalize on the new AI hype.
It's worth noting there are very widely deployed chips built primarily for inference (running the network), especially on mobile phones.
Depending on the device and manufacturer sometimes this is implemented as part of the CPU itself, but functionally it's the same idea.
The Apple Neural Engine is a good example of this. It is separate from the GPU, which is also integrated on the same chip.
Further information is here: https://machinelearning.apple.com/research/neural-engine-tra...
The Google Tensor SoC used in the Pixel has a similar coprocessor called the EdgeTPU.
At the end of the day, it's matrix multiplication acceleration mostly, and then IO optimization. Literally most of the optimization has nothing to do with compute. We can compute faster than we can ingest.
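You can see the "compute faster than we can ingest" effect on an ordinary machine with a rough NumPy benchmark (illustrative only - numbers vary a lot by hardware and BLAS build): a large matrix multiply does a lot of arithmetic per byte it moves and gets close to the machine's compute limit, while an elementwise add does one operation per several bytes moved and is limited by memory.

```python
import time
import numpy as np

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

def rate(op, flops, repeats=10):
    op()                                   # warm-up
    t0 = time.perf_counter()
    for _ in range(repeats):
        op()
    dt = (time.perf_counter() - t0) / repeats
    return flops / dt / 1e9                # achieved GFLOP/s

matmul_gflops = rate(lambda: A @ B, 2 * n**3)   # ~2n^3 flops over ~3n^2 values moved
add_gflops = rate(lambda: A + B, n**2)          # n^2 flops over ~3n^2 values moved

print(f"matmul: {matmul_gflops:8.1f} GFLOP/s")
print(f"add:    {add_gflops:8.1f} GFLOP/s")
```

The matmul sustains a far higher arithmetic rate because every value it loads gets reused many times; the add spends its time waiting on memory, which is why so much accelerator engineering goes into feeding the multipliers rather than just building more of them.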
https://medium.com/@adi.fu7/ai-accelerators-part-i-intro-822...
It's the founder of a new AI chip company and they talk a bit about the differences.
Incidentally there used to be a proper "AI" chip. The original perceptron was intended to be implemented in hardware. But general purpose chips evolved much faster.
If you never have to change the network - for instance to do image segmentation or object recognition - then you can’t get any more efficient than a custom silicon design that bakes in the weights as transistors.
So dedicated hardware to do math stuff
When you're multiplying two large matrices together (or doing other similar operations), there are thousands of individual multiply operations that need to be performed, and they can be done in parallel since they are all independent (one result doesn't depend on another).
So, to train/run these ML/AI models as fast as possible requires the ability to perform massive numbers of floating point operations in parallel, but a desktop CPU only has a limited capacity to do that, since it is designed as a general purpose device, not just for math. A modern CPU has multiple "cores" (individual processors that can run in parallel), but only a small number (~10), and each core has only a few floating point (SIMD) execution units to do the math with.
This is where GPU/TPU/etc "AI/ML" chips come in, and what makes them special. They are designed specifically for this job - to do massive numbers of floating point multiplications in parallel. A GPU can of course run games too, but it turns out the requirements for real-time graphics are very similar - a massive amount of parallelism. In contrast to the CPU's ~10 cores, GPUs have thousands of cores (e.g. the NVIDIA RTX 4070 has 5,888) running in parallel, and these are all floating-point capable. This results in the ability to do huge numbers of floating point operations per second (FLOPS), e.g. the RTX 4070 can do about 30 TFLOPS (tera-FLOPS) - i.e. 30,000,000,000,000 floating point operations per second!
This brings us to the second specialization of these GPU/TPU chips - since they can do this ridiculous number of FLOPS, they need to be fed data at an equally ridiculous rate to keep them busy, so they need massive memory bandwidth - way more than a CPU needs to be kept busy. The normal RAM in a desktop computer is too slow for this, and is in any case in the wrong place - on the motherboard, where it can only be reached across the PCIe bus, which is again far too slow to keep up. GPUs solve this memory speed problem with a specially designed memory architecture and lots of very fast RAM co-located right next to the GPU chip. For example, that RTX 4070 has 12GB of RAM and can move data from it into its processing cores at a speed (memory bandwidth) of about 500GB/sec!
The exact designs of the various chips differ a bit (and a lot is proprietary), but they are all built to provide these two capabilities - massive floating point parallelism, and massive memory bandwidth to feed it.
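To connect those two capabilities with some back-of-the-envelope arithmetic (using the figures above and the idealized assumption that each fp32 matrix is moved to or from memory exactly once): dividing peak FLOPS by memory bandwidth gives the "machine balance" - how many operations the chip must do per byte it loads just to stay busy - and a matrix multiply only clears that bar once the matrices are big enough.

```python
# Back-of-the-envelope roofline check with the RTX 4070-class numbers quoted above.
peak_flops = 30e12          # ~30 TFLOPS fp32
bandwidth = 500e9           # ~500 GB/s memory bandwidth
balance = peak_flops / bandwidth
print(f"machine balance: {balance:.0f} flops per byte loaded")   # ~60

# Idealized n x n matmul: 2*n^3 flops, 3*n^2 fp32 values (A and B read, C written),
# i.e. 12*n^2 bytes, assuming each matrix crosses the memory bus exactly once.
def intensity(n):
    return (2 * n**3) / (12 * n**2)       # flops per byte

for n in (64, 256, 1024, 4096):
    i = intensity(n)
    bound = "compute-bound" if i > balance else "memory-bound"
    print(f"n={n:5d}: {i:6.0f} flops/byte -> {bound}")
```

Small matrices (and anything elementwise) sit below the bar and are limited by how fast data arrives, which is exactly why the memory system gets as much design attention as the arithmetic units.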
If you want to get into this in detail, the best place to start would be to look into low-level CUDA programming for NVIDIA's cards. CUDA is the lowest-level API that NVIDIA provides to program their GPUs.
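To give a flavour of that programming model without setting up a C++ toolchain, here is a minimal sketch using Numba's CUDA support from Python (assuming the numba package and a CUDA-capable NVIDIA GPU are available - real CUDA work is normally done in C/C++, but the concepts are the same): you write a kernel describing what one thread does, then launch a grid of many thousands of threads over the data.

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale_add_kernel(a, b, out):
    i = cuda.grid(1)              # this thread's global index within the launch grid
    if i < out.size:              # guard: the grid may be slightly larger than the data
        out[i] = 2.0 * a[i] + b[i]

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
scale_add_kernel[blocks, threads_per_block](a, b, out)   # numba copies the arrays to and from the GPU

assert np.allclose(out, 2.0 * a + b)
```

Note there is no loop over the data inside the kernel - the "loop" is the grid of threads itself, which is the massive parallelism described above.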