YouTube's AI-powered auto-generated captions are often inaccurate for uncommon words and proper nouns (names, etc.).
Here's a deep link to an example: https://youtu.be/z9IXfYHhKYI?t=20m33s
The instructor actually says "UIs" (user interfaces), but the auto-generated transcript reads "you eyes".
Therefore, accurate transcripts require humans to manually correct the errors, and most podcast creators don't have the time or money budget for that.
If your question is actually asking why iPhone/Android apps don't dynamically generate and display closed captions while the podcast/audiobook is playing, similar to YouTube or Chrome, it's because the devices don't include an onboard trained neural network that's big and comprehensive enough to transcribe general audio containing uncommon terminology. Today's smartphones only have ~4GB to 8GB of RAM. (I read that the iOS API has a max RAM allocation of 5GB per app.)
I found a Google paper about reducing a 2GB voice recognition model to 450MB with alternative techniques, but the resulting neural net is probably not good enough to accurately transcribe uncommon words: https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...
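To see what a small on-device-class model actually does with uncommon words, here's a minimal sketch using the offline Vosk library and its ~50MB small English model. Vosk is my stand-in for illustration, not the model from the Google paper, and the model directory and audio file names are placeholders:

    import json
    import wave

    from vosk import Model, KaldiRecognizer  # pip install vosk

    # The ~50MB small English model, downloaded and unpacked separately.
    model = Model("vosk-model-small-en-us-0.15")

    # Assumes a 16kHz mono PCM WAV clip of the podcast.
    wf = wave.open("podcast_clip.wav", "rb")
    rec = KaldiRecognizer(model, wf.getframerate())

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        # Each time the recognizer finalizes a segment, print its text.
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])
    print(json.loads(rec.FinalResult())["text"])

Feed it a clip with jargon like "UIs" and you'll get exactly the "you eyes" class of error described above.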
So the current state of technology means accurate closed captions on smartphones would require a separate "subtitle track" (authored by humans), akin to *.srt files in mkv video containers, instead of attempting dynamic speech recognition of the audio. The container support partly exists: ID3v2 defines a synchronized lyrics/text frame (SYLT)[1], implemented in tag libraries like atldotnet[2], but maybe there's not enough consumer demand for mp3 or m4b file formats with transcripts embedded.
There is also LRC[3], a synchronized lyrics format stored in an extra file (*.lrc). A sketch of writing both is below the footnotes.
[1]: https://id3.org/id3v2.3.0#sec4.10
[2]: https://github.com/Zeugma440/atldotnet/blob/749d9ccb03032667...
[3]: https://en.wikipedia.org/wiki/LRC_(file_format)
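For concreteness, here's a rough sketch (not anyone's production workflow) of embedding a human-authored synchronized transcript using the mutagen Python library, which can write the SYLT frame from [1], then dumping the same cues as the *.lrc sidecar from [3]. The file names and cue data are made up:

    from mutagen.id3 import ID3, SYLT

    # Hypothetical human-authored cues: (text, timestamp in milliseconds).
    cues = [
        ("Welcome back.", 0),
        ("Today we're talking about UIs,", 2500),   # not "you eyes"
        ("user interfaces, on mobile.", 5200),
    ]

    # Embed as an ID3v2 SYLT (synchronized lyrics/text) frame.
    # Assumes episode.mp3 already carries an ID3 tag.
    tags = ID3("episode.mp3")
    tags.setall("SYLT", [SYLT(encoding=3,  # 3 = UTF-8
                              lang="eng",
                              format=2,    # 2 = absolute time, in ms
                              type=1,      # 1 = lyrics/text content
                              text=cues)])
    tags.save()

    # Write the same cues as an LRC sidecar file.
    with open("episode.lrc", "w", encoding="utf-8") as f:
        for text, ms in cues:
            m, s = divmod(ms / 1000, 60)
            f.write(f"[{int(m):02d}:{s:05.2f}]{text}\n")

The point being: the formats are trivial to produce; it's the human-authored cue data that costs money.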
If it’s automatic, I don’t see how it would be beneficial to listeners/readers. If it’s manual, it’s not cheap.
But machine transcriptions are IMO useless for the purpose. (They're fine for interviews where they're just source material for something you're writing.) And they're especially useless for audio you're having trouble understanding anyway.
Human transcriptions aren't that expensive, about $1/minute, but they do cost money, and you still need to spend some time cleaning them up and formatting them. Even then, you end up with words here and there that you can't make out.
A quick hack for fellow people who are hard of hearing: I often start a private Google Meet session, turn video off, and have Google generate automatic captions there when there's a captionless video I really want to watch. I don't expect any privacy, and you do need a silent home for it, but hey. It works.
And human transcription can cost a lot, like 45 cents a word; imagine a 1-hour conversation. (At a typical conversational ~150 words per minute, that's roughly 9,000 words, or about $4,000.)
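A back-of-the-envelope check on that, where the ~150 wpm speaking rate is an assumption, not from the comment above:

    # Back-of-the-envelope cost of per-word human transcription.
    rate_per_word = 0.45       # dollars per word, from the comment above
    words_per_minute = 150     # assumed conversational speaking rate
    minutes = 60               # a 1-hour conversation

    words = words_per_minute * minutes
    cost = words * rate_per_word
    print(f"{words} words -> ${cost:,.2f}")   # 9000 words -> $4,050.00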