HACKER Q&A
📣 bluenose69

Why, technically, is video conferencing so bad?


Television interviews nowadays are worse than the radio interviews of my youth, with so much of the sound being garbled. It would seem sensible to switch to audio-only when bandwidth is limited, but this appears not to be the case. Another solution might be to insert a delay for buffering, to cover rough patches. Things do not have to be exactly synchronous -- people can get used to saying "over" at the end of a sentence and waiting a few seconds.

The problem seems to occur with multiple software systems. I wonder, can anyone on HN point to a technical treatment of this issue?


  👤 hctaw Accepted Answer ✓
Disclaimer: I've only worked on lower parts of the stack so my experience is biased. It's mostly the concept of "garbage in, garbage out."

There's a simple fix for most audio problems:

- use wired headphones

- use a wired microphone

- disable all DSP in your conferencing platform (echo cancellation is the big one, AGC helps as well)

Almost all the fidelity problems in audio conferencing stacks come down to crappy microphones, codec and latency constraints on wireless audio (almost all Bluetooth headsets; it doesn't really matter how much you spend, the only model I've heard that doesn't wreck fidelity is the Earpods), and DSP that is intended to remedy environmental problems and compensate for the aforementioned crappy microphones and mic conditions.

> It would seem sensible to switch to audio-only when bandwidth is limited

Audio doesn't need that much bandwidth to be perceptually lossless. The bigger problem for conferencing encoders is latency.
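For a sense of scale: Opus, the codec most WebRTC-based conferencing stacks use, gives perfectly intelligible speech at roughly 16-24 kb/s; what the encoder is really fighting is frame size, because every extra millisecond of look-ahead is added delay. A rough libopus sketch of a low-latency VoIP encoder setup (the 24 kb/s bitrate, 20 ms frames, and loss percentage are illustrative values I picked, not anything a particular product ships with):

    /* Sketch: low-latency speech encoding with libopus.
       Build with: gcc demo.c -lopus */
    #include <opus/opus.h>
    #include <stdio.h>

    #define RATE       48000
    #define FRAME_MS   20
    #define FRAME_SAMP (RATE * FRAME_MS / 1000)   /* 960 samples */

    int main(void) {
        int err;
        OpusEncoder *enc = opus_encoder_create(RATE, 1, OPUS_APPLICATION_VOIP, &err);
        if (err != OPUS_OK) return 1;

        /* ~24 kb/s is already near-transparent for speech */
        opus_encoder_ctl(enc, OPUS_SET_BITRATE(24000));
        /* trade a little bitrate for resilience to packet loss */
        opus_encoder_ctl(enc, OPUS_SET_INBAND_FEC(1));
        opus_encoder_ctl(enc, OPUS_SET_PACKET_LOSS_PERC(10));

        opus_int16 pcm[FRAME_SAMP] = {0};          /* one 20 ms frame of silence */
        unsigned char packet[4000];
        opus_int32 n = opus_encode(enc, pcm, FRAME_SAMP, packet, sizeof packet);
        printf("encoded %d bytes for 20 ms of audio\n", (int)n);

        opus_encoder_destroy(enc);
        return 0;
    }

At 24 kb/s a 20 ms frame is only about 60 bytes of payload, trivial next to video; the hard constraint is the latency budget (frame size plus look-ahead plus network jitter), not the bitrate.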

I'd also add that the "garbling" sound is definitely different from how poor fidelity manifests in analog telecoms/radio. There you'd get nonlinear distortion (frequencies that weren't in the original but are correlated with it, not a huge deal), limited bandwidth (the human brain can interpolate missing frequency information quite well), and wideband noise (human brains are good at dealing with this).

The garbling you hear from crappy DSP, including encoders doing their best to discard information that doesn't matter and echo cancellation doing realtime adaptive filtering (which is like training a neural net as you evaluate it!), manifests in really strange ways where small amounts of temporal/phase coherence are lost.
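To make "adaptive filtering" concrete, here's a toy NLMS echo canceller, the textbook core of acoustic echo cancellation. Real implementations add double-talk detection, nonlinear suppression, and so on; the tap count and step size here are arbitrary illustrative values:

    /* Toy NLMS adaptive filter: estimates the speaker-to-mic echo path
       and subtracts the predicted echo from the microphone signal. */
    #define TAPS 256                 /* length of the modeled echo path */

    static float w[TAPS];            /* adaptive weights (echo path estimate) */
    static float x[TAPS];            /* delay line of far-end (speaker) samples */

    /* Process one sample: mic_in = near-end speech + speaker echo,
       far_in = the sample just played out of the speaker. */
    float aec_process(float mic_in, float far_in)
    {
        /* push the far-end sample into the delay line */
        for (int i = TAPS - 1; i > 0; --i) x[i] = x[i - 1];
        x[0] = far_in;

        /* predict the echo and subtract it */
        float y = 0.0f, energy = 1e-6f;
        for (int i = 0; i < TAPS; ++i) { y += w[i] * x[i]; energy += x[i] * x[i]; }
        float e = mic_in - y;        /* "cleaned" signal sent to the far end */

        /* NLMS update: the filter keeps re-learning the room while the call
           is live, which is why a misadapted canceller mangles speech in
           ways that don't sound like any natural kind of distortion. */
        const float mu = 0.1f;
        for (int i = 0; i < TAPS; ++i) w[i] += (mu * e / energy) * x[i];

        return e;
    }

The weights only converge while the echo path and levels hold still; every time someone moves, changes volume, or the device's own AGC kicks in, the filter is chasing a moving target, and the residual it leaves behind during re-adaptation is part of that strange garbling.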

I'm not an expert on perceptual codecs, but my guess is that overlap-add/save and other time/frequency analysis/synthesis algorithms that can lose phase coherence between frames wind up mutating the edges of some phonemes in human speech in a way our brains can't interpolate, and it's a very alien kind of distortion compared to harmonic distortion, limited bandwidth, and noise, which all occur in nature.
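To make the overlap-add point concrete, here's a bare-bones overlap-add loop (Hann window, 50% overlap, with a placeholder process() callback standing in for whatever transform/quantization a codec applies per frame). Reconstruction relies on neighbouring frames summing coherently at the seams:

    /* Minimal overlap-add loop: window, process, and re-sum frames.
       'out' is assumed zero-initialized and at least 'total' samples long. */
    #include <math.h>
    #include <stddef.h>

    #define N   512            /* frame length */
    #define HOP (N / 2)        /* 50% overlap  */

    void overlap_add(const float *in, float *out, size_t total,
                     void (*process)(float *frame, size_t n))
    {
        const float pi = 3.14159265f;
        float frame[N];

        for (size_t pos = 0; pos + N <= total; pos += HOP) {
            /* analysis: apply a Hann window to the current frame */
            for (size_t i = 0; i < N; ++i) {
                float hann = 0.5f * (1.0f - cosf(2.0f * pi * i / (N - 1)));
                frame[i] = in[pos + i] * hann;
            }

            process(frame, N);   /* transform, quantize, inverse transform... */

            /* synthesis: frames must sum coherently where they overlap.
               If process() perturbs each frame's phase independently, the
               seams stop lining up and transients (phoneme edges) smear. */
            for (size_t i = 0; i < N; ++i)
                out[pos + i] += frame[i];
        }
    }

With 50%-overlapped Hann windows and an untouched process(), the overlaps sum back to roughly a scaled copy of the input; it's the per-frame mangling that breaks that identity, and nothing in nature sounds quite like the result.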


👤 taurath
> Things do not have to be exactly synchronous -- people can get used to saying "over" at the end of a sentence and waiting a few seconds.

As someone who has supported users of conferencing systems, I can tell you nobody will do this, and on a human-connection level it doesn't make much sense. The ideal is the lowest latency possible.

TV interviews, which you asked about specifically, are often handled cheaply over Skype, which has a ton of its own problems. You'll notice Anderson Cooper on CNN has no problem broadcasting from his home, but a $50,000 cart of dedicated connectivity plus pro AV equipment goes with him (not to mention lighting, etc.). Using consumer video conferencing lets the station connect to most people's home equipment and mux it together at the station.

Perhaps I’ve misunderstood, though. Do you have an example?