All of these have pros and cons, and I am sure the big companies have already measured and optimized the hell out of their solutions, which is why I am interested in the topic.
They definitely do it client side.
I just wish there were something more easily deployable to desktops as a C/Rust solution.
A couple years ago when I worked at Meta, I built the server bits for a demo that was a hybrid solution. The client did the segmentation as usual but instead of sending the final composite, it sent the raw alpha channel (i.e. the transparency mask) in addition to the video stream. The composite was then rendered on a server. This added an interesting level of flexibility to the composition, as you could change the background independently and move/scale the participant's masked video to adapt the layout on the fly. But I don't suppose it ever went into production.
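The compositing step itself is just a per-pixel alpha blend. As a rough sketch (not Meta's actual code; the function name, the tightly packed RGB24 buffers, and the 8-bit mask are all assumptions on my part), the server-side blend could look something like this in Rust:

    /// Composite a participant's frame over a background using a
    /// per-pixel 8-bit alpha mask received alongside the video stream.
    /// All buffers are tightly packed RGB24 of the same dimensions;
    /// `alpha` holds one byte per pixel (255 = fully foreground).
    fn composite(fg: &[u8], alpha: &[u8], bg: &[u8], out: &mut [u8]) {
        debug_assert_eq!(fg.len(), bg.len());
        debug_assert_eq!(fg.len(), out.len());
        debug_assert_eq!(fg.len(), alpha.len() * 3);

        for (i, &a) in alpha.iter().enumerate() {
            let a = a as u16;
            for c in 0..3 {
                let p = i * 3 + c;
                // Integer blend with rounding: out = (a*fg + (255-a)*bg) / 255.
                // Max intermediate value is 255*255 = 65025, so u16 is enough.
                let v = a * fg[p] as u16 + (255 - a) * bg[p] as u16;
                out[p] = ((v + 127) / 255) as u8;
            }
        }
    }

    fn main() {
        // Tiny 2x1 frame: left pixel fully foreground, right pixel fully background.
        let fg = [255u8, 0, 0, 255, 0, 0]; // red participant
        let bg = [0u8, 0, 255, 0, 0, 255]; // blue background
        let alpha = [255u8, 0];
        let mut out = [0u8; 6];
        composite(&fg, &alpha, &bg, &mut out);
        assert_eq!(out, [255, 0, 0, 0, 0, 255]);
    }

Because the mask and the video arrive as separate streams, the server is free to swap the background or scale/translate the foreground before blending, which is where that layout flexibility comes from.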
And then one needs to think of other possibilities :)
I haven't used those tools in a couple years, but I doubt it's any different now. I know that Google Meet (the tool I use nowadays) still does it client side. Every now and then I break something when updating my Nvidia drivers on Linux and I can no longer blur my background.
Banuba is great as long as you're not also trying to integrate any WebRTC stuff.