a) you have to normalize the input stream
b) you have to create multiple versions with lower bitrate for slow consumers
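To be concrete, by b) I mean something along these lines - one input, several independently encoded renditions in a single ffmpeg run (the bitrates, resolutions and output names are just placeholders):

```
# Illustrative only: one ffmpeg run producing three HLS renditions from one input.
ffmpeg -i input.flv \
  -c:v libx264 -preset veryfast -s 1920x1080 -b:v 5000k -c:a aac -b:a 128k -f hls -hls_time 2 out_1080p.m3u8 \
  -c:v libx264 -preset veryfast -s 1280x720  -b:v 2500k -c:a aac -b:a 128k -f hls -hls_time 2 out_720p.m3u8 \
  -c:v libx264 -preset veryfast -s 854x480   -b:v 800k  -c:a aac -b:a 96k  -f hls -hls_time 2 out_480p.m3u8
```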
The issue I got stuck on was that the transcoding speed of ffmpeg was really slow. No matter how you configure it, it is always far too slow to be usable for LIVE video streaming, especially if you have to do two runs (as mentioned above).
So I am wondering how these services are doing it. Even if I take really specced-out servers into account - let's say 64 cores, that is 128 threads, leave 1 thread for application stuff, so 127 threads. You split the stream into 2-second-long segments (GOPs), which means such a machine can transcode only 127 live video streams at any one time. And since the transcoding is slow, processing a segment takes longer than the 2 seconds of video it contains, so there is inherent lag being added into the data flow on top of the CPU/hardware limitation, and this lag keeps accumulating over time. So if you have 10k live video streams, that's over 156 really powerful servers, and you still end up with lag nevertheless.
And that is only ingress. So I wonder, how on earth are they doing it without introducing massive amounts of lag and spending massive amounts of money on hardware? Or do they do just that: spend a ton of money on hardware to handle a realistic amount of ingress, and use GPUs or specialized CPUs/hardware to make the transcoding faster and keep the lag low? I.e., is it only about money, or is there some tech I am missing?
At a fundamental level, yes, encoding is one of the most expensive components of a live-streaming system (at low scale), and honestly, your guess of 2 bitrates for each video is very much on the low end - on average, most platforms create about 5 different qualities for any one stream, ranging from ~500 kbps to ~5+ Mbps.
If you look at modern video platforms, you can see the high cost of ingest and transcode captured in their pricing:
AWS IVS - $2.00 per input hour.
API.video - $2.40 per input hour.
Mux - $4.20 per input hour.
Generally there isn't too much use of hardware acceleration on ASICs or GPUs for H.264 processing today. FFmpeg (x264) is plenty "fast enough" when tuned on commodity x86 hardware with things like AVX extensions. Transcoding a 4-6 second segment of video should generally only take a couple of hundred milliseconds at most.
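For a sense of what "tuned" means here, a speed-oriented invocation looks something like the following; the preset, GOP length and bitrates are just example values, not a recommendation from any particular platform:

```
# Hypothetical speed-oriented live transcode: fast preset, low-latency tune,
# and a fixed 2-second GOP (at 30 fps) so segments cut cleanly.
ffmpeg -i rtmp://localhost/live/stream \
  -c:v libx264 -preset veryfast -tune zerolatency \
  -g 60 -keyint_min 60 -sc_threshold 0 \
  -s 1280x720 -b:v 2500k -maxrate 2500k -bufsize 5000k \
  -c:a aac -b:a 128k \
  -f hls -hls_time 2 -hls_list_size 6 out_720p.m3u8
```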
As for how the larger platforms keep live streaming cost competitive: the number of qualities (and often the codecs used) varies depending on how many viewers a stream has, so they're not wasting resources on encodes that only a couple of viewers will see. Many platforms also implement just-in-time encoding, to limit the amount of content that's transcoded when it isn't being viewed.
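A sketch of that kind of policy (the thresholds and rendition ladders here are entirely made up, purely to show the shape of the logic):

```
#!/bin/bash
# Hypothetical policy: decide how much to transcode based on live viewer count.
viewers=${1:-0}
if [ "$viewers" -lt 5 ]; then
  echo "transmux only (no transcode)"        # passthrough, see below
elif [ "$viewers" -lt 100 ]; then
  echo "ladder: 720p 480p"                   # small ladder for a small audience
else
  echo "ladder: 1080p 720p 480p 360p 160p"   # full ladder once it pays off
fi
```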
Some platforms also drop all the way down to a transmux for streams with very low viewer numbers - transmux in this context just changes the packaging of the inbound stream without changing the actual encoded picture data.
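In ffmpeg terms a transmux is roughly this - note -c copy, so the picture and audio data pass through untouched (the input URL and segment length are placeholders):

```
# Hypothetical transmux: repackage the inbound stream as HLS without re-encoding.
ffmpeg -i rtmp://localhost/live/stream -c copy -f hls -hls_time 2 -hls_list_size 6 out.m3u8
```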
It's also worth considering, from a business perspective, that many UGC live platforms will be taking a loss on small streamers with few viewers, and covering that with the ad and subscription revenue from larger streamers.
I hope that helps!