> video in real time
These two are going to be mutually exclusive for a few years, I'm afraid. Sure, the hardware investment will set you back a few thousand dollars, but even with that, running the latest temporally-stable video models is not realistic for most experienced devs, much less a beginner. If you wanted to be corny you could train a few-shot voice model to sound like you and plug it into GPT-3, but that would be neither real-time nor particularly gratifying.
It's not exactly a surprise that this is a bad idea. Now I'm going to have nightmares about my AI doppelganger repeating GPT-3 gobbledygook in a monotone...