I have had some fun playing with TorToiSe TTS, which gives mixed results compared to ElevenLabs. In small snippets it does sound better, but overall it does not. I mention it because it's openly available and runs locally. I didn't spend more than a weekend on it, and it's popular enough to have a small community collection of voices. You have to search for them, but the files are small and generation is zero-shot. It feels very similar to how Stable Diffusion felt when it first came out: a lot of trial and error and no consensus on the "right" answers.
The main reason I liked it, even though the bad generations are really bad, is that you have full control of the training data set. I haven't kept up with it in a few weeks, so I'm sure there have been advances I'm not aware of.
https://git.ecker.tech/mrq/ai-voice-cloning