HACKER Q&A
📣 vinni2

How Do AI Apps Like ChatGPT Achieve Such High Speeds?


I’ve been experimenting with building my own Retrieval-Augmented Generation (RAG) application on an NVIDIA H100 GPU with 80GB of memory, using web search APIs like Serper and serving various GPT models through OpenAI and Azure, as well as quantized models on Ollama. Smaller models run reasonably fast, but anything beyond roughly 7 billion parameters becomes significantly slower.

In contrast, AI applications like ChatGPT and Perplexity demonstrate impressive speed in real-time searching, web scraping, content generation, and reasoning. Their ability to deliver results so quickly, even with large-scale models, is quite remarkable.

I’m curious to understand the engineering strategies and optimizations these companies use to achieve such high performance. Are there any insightful engineering blogs or technical resources that explain how they optimize their infrastructure, parallelize workloads, or manage latency effectively? Any insights into their backend architecture, caching mechanisms, or inference optimization techniques would be greatly appreciated.


  👤 verdverm Accepted Answer ✓
Look up Google's TPU. Generally speaking, the Google engineering blogs are quite good. A lot of effort has gone into making all the parts around the tensor processing fast as well. There are also AI-specific techniques, such as batching many concurrent requests into a single pass through the model.
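
To make the batching idea concrete, here is a minimal sketch of dynamic batching: incoming requests are queued briefly, then served together in one batched model call so the GPU stays busy. The names `batched_generate`, `MAX_BATCH_SIZE`, and `MAX_WAIT_MS` are placeholders I made up; real serving stacks (e.g. vLLM, TensorRT-LLM, TGI) implement far more sophisticated continuous batching, but the scheduling idea is similar in spirit.

    import asyncio
    import time

    # Placeholder for a batched model call; a real server would run one
    # forward pass over the whole batch on the GPU.
    def batched_generate(prompts):
        return [f"completion for: {p}" for p in prompts]

    MAX_BATCH_SIZE = 8   # assumed limit; real values depend on GPU memory
    MAX_WAIT_MS = 10     # how long to wait for more requests before flushing

    request_queue: asyncio.Queue = asyncio.Queue()

    async def submit(prompt: str) -> str:
        """Called per user request; resolves when its batch has been served."""
        fut = asyncio.get_running_loop().create_future()
        await request_queue.put((prompt, fut))
        return await fut

    async def batching_loop():
        """Collect requests for a short window, then serve them in one call."""
        while True:
            prompt, fut = await request_queue.get()
            batch = [(prompt, fut)]
            deadline = time.monotonic() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH_SIZE:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(request_queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = batched_generate([p for p, _ in batch])
            for (_, f), out in zip(batch, outputs):
                f.set_result(out)

    async def main():
        asyncio.create_task(batching_loop())
        # Ten concurrent "users" arrive at once and get served in batches.
        results = await asyncio.gather(*(submit(f"question {i}") for i in range(10)))
        print(results)

    asyncio.run(main())

The payoff is throughput: serving eight prompts in one batched forward pass costs only slightly more than serving one, so per-request latency stays low even under heavy concurrent load.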