HACKER Q&A
📣 someinteresting

How do MS Teams, Google Meet and Zoom virtual backgrounds work?


All right, so we need person/body segmentation. That's not a huge deal these days, but then we need to serve it somewhere, and we need to decide whether to do this on the client side, the server side, or somewhere in between.

All of these have pros and cons, and I'm sure the big companies have already measured and optimized the hell out of their solutions, which is why I'm interested in the topic.


  👤 rapsey Accepted Answer ✓
I think the most popular solution for this is Google's MediaPipe selfie segmentation: https://developers.google.com/mediapipe/solutions/vision/ima...

They definitely do it client side.
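
For illustration, a minimal client-side sketch using MediaPipe's legacy Python Solutions API (this assumes the mediapipe and opencv-python packages; the linked page documents the newer Tasks API, and the 0.5 threshold and 640x480 resolution are arbitrary choices):

    # Minimal selfie-segmentation sketch using MediaPipe's legacy Python
    # Solutions API (assumes the mediapipe and opencv-python packages).
    import cv2
    import mediapipe as mp
    import numpy as np

    background = np.zeros((480, 640, 3), dtype=np.uint8)
    background[:] = (0, 120, 0)  # placeholder "virtual background"

    cap = cv2.VideoCapture(0)
    with mp.solutions.selfie_segmentation.SelfieSegmentation(model_selection=1) as seg:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.resize(frame, (640, 480))
            result = seg.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            person = result.segmentation_mask[..., None] > 0.5  # float mask in [0, 1]; 0.5 is arbitrary
            out = np.where(person, frame, background)
            cv2.imshow("virtual background", out)
            if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
                break
    cap.release()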

I just wish there were something more easily deployable to desktops as a C/Rust solution.


👤 samlinnfer
It is a classic computer vision problem: image segmentation. There are open source tools like rembg (https://github.com/danielgatis/rembg) which call into pre-trained models. It is also easy to train your own model; I've used code from https://github.com/SkyTNT/anime-segmentation with my own data to train an image segmentation model to separate objects, with really good results.
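
To give a sense of how little code the pre-trained route takes, a minimal rembg sketch (file names are placeholders, and this assumes rembg has downloaded its default model):

    # Minimal background-removal sketch with rembg's pre-trained model.
    from rembg import remove
    from PIL import Image

    image = Image.open("person.jpg")   # placeholder input
    cutout = remove(image)             # RGBA output with the background made transparent
    cutout.save("person_cutout.png")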

👤 pavlov
I'm not aware of any of the big players rendering virtual backgrounds on the server. It's all client-side.

A couple years ago when I worked at Meta, I built the server bits for a demo that was a hybrid solution. The client did the segmentation as usual but instead of sending the final composite, it sent the raw alpha channel (i.e. the transparency mask) in addition to the video stream. The composite was then rendered on a server. This added an interesting level of flexibility to the composition, as you could change the background independently and move/scale the participant's masked video to adapt the layout on the fly. But I don't suppose it ever went into production.
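
A rough sketch of what that server-side composite step could look like; this is my own NumPy/OpenCV illustration, not the actual code we had, and the scale/offset parameters are made up:

    # Hypothetical server-side composite: the client sends a frame plus its
    # alpha (segmentation) mask, and the server places the masked participant
    # onto any background at any position/scale.
    import cv2
    import numpy as np

    def composite(frame, alpha, background, scale=0.5, offset=(40, 60)):
        # frame: HxWx3 uint8, alpha: HxW float in [0, 1], background: canvas to draw onto
        h, w = frame.shape[:2]
        new_w, new_h = int(w * scale), int(h * scale)
        frame_s = cv2.resize(frame, (new_w, new_h)).astype(np.float32)
        alpha_s = cv2.resize(alpha, (new_w, new_h))[..., None]  # keep a channel axis for broadcasting

        x, y = offset
        out = background.copy()
        roi = out[y:y + new_h, x:x + new_w].astype(np.float32)
        out[y:y + new_h, x:x + new_w] = (alpha_s * frame_s + (1.0 - alpha_s) * roi).astype(np.uint8)
        return out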


👤 rcarmo
It’s done client side. Even a measly iGPU can do this today, and you don’t even get the option to turn it on if your hardware isn’t able to do it.

👤 theGeatZhopa
Computer vision, Otsu's algorithm, and some movement comparison: if a recorded pixel stays the same in value, shrink its weight and display it. Over time, all stable pixels lose their information, but the ones that record movement are refreshed each time. With that you can easily segment the background from moving bodies, which is what you are while talking.
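
That's roughly classic background subtraction. A toy sketch of the idea with OpenCV: a running-average background model plus Otsu's method to pick the threshold on the difference image (the 0.05 learning rate is an arbitrary choice):

    # Toy background-subtraction sketch: stable pixels sink into a running
    # average, moving pixels stand out, and Otsu's method picks the threshold.
    import cv2
    import numpy as np

    cap = cv2.VideoCapture(0)
    model = None  # running-average background model (float32)

    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if model is None:
            model = gray.astype(np.float32)
        cv2.accumulateWeighted(gray, model, 0.05)           # stable pixels converge into the model
        diff = cv2.absdiff(gray, cv2.convertScaleAbs(model))
        _, mask = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        cv2.imshow("moving foreground", mask)
        if cv2.waitKey(1) & 0xFF == 27:                      # Esc quits
            break
    cap.release()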

And then one needs to think of other possibilities:)


👤 doix
It definitely used to all happen client side; if your graphics stack was messed up, those tools wouldn't apply the effects.

I haven't used those tools in a couple years, but I doubt it's any different now. I know that Google Meet (the tool I use nowadays) still does it client side. Every now and then I break something when updating my Nvidia drivers on Linux and I can no longer blur my background.


👤 jameshush
What's your use case? If it's for live video calls, I'd recommend using a vendor to handle this (and every other WebRTC edge case you can think of) for you. I'm a bit biased because I work for Daily.co, which does this.

Banuba is great if you're not trying to integrate any WebRTC stuff too.


👤 rspoerri
OBS has a plugin that does background removal and incorporates different solutions to do so.

https://github.com/occ-ai/obs-backgroundremoval


👤 scrapheap
This is an assumption, so it may not be how they do it, but I'd assume you'd push it to the client side, and if the client doesn't have the hardware or the power for an option, then you just don't offer it on that device.