HACKER Q&A
📣 amichail

Why don't podcasts/audiobooks have closed captions for misheard words?


This would be like the automatically generated captions on YouTube, but for audio-only content.


  👤 imgabe Accepted Answer ✓
Podcasts don’t do it because good transcription costs money. Audiobooks already do it, it’s called … the book.

👤 jasode
>automatically generated captions in YouTube

YouTube's AI-powered auto-generated captions are often inaccurate for uncommon words and proper nouns (names, etc.).

Here's an example (deep link): https://youtu.be/z9IXfYHhKYI?t=20m33s

The instructor actually says "UIs" (user interfaces), but the auto-generated transcript renders it as "you eyes".

Therefore, accurate transcripts require humans to manually correct the errors, and most podcast creators don't have the time or money budget for that.

If your question is actually asking why iPhone/Android apps don't dynamically generate and display closed captions while the podcast/audiobook is playing, similar to YouTube or Chrome, it's because the devices don't include an onboard trained neural network that's big and comprehensive enough to transcribe general audio containing uncommon terminology. Today's smartphones only have ~4GB to 8GB of RAM. (I read that the iOS API has a max RAM allocation of 5GB per app.)

I found a Google paper about reducing a 2GB voice recognition model to 450MB with alternative techniques, but that neural net is probably not good enough for accurately transcribing uncommon words: https://ai.googleblog.com/2019/03/an-all-neural-on-device-sp...

So the current state of technology means accurate closed captions on smartphones would require a separate "subtitle track" (authored by humans), akin to *.srt files in mkv video containers -- instead of attempting dynamic speech recognition of the audio. There's maybe not enough consumer demand for mp3 or m4b file formats with embedded transcripts.
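To make the "subtitle track" idea concrete, here is a minimal sketch of parsing SRT-style cues into (start, end, caption) triples, the kind of human-authored track a podcast player could display. The regex and function are my own illustration, not any app's actual implementation:

```python
import re

# One SRT cue: index line, "HH:MM:SS,mmm --> HH:MM:SS,mmm" line, caption text.
CUE_RE = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption) tuples."""
    cues = []
    for m in CUE_RE.finditer(text):
        h1, m1, s1, ms1 = (int(x) for x in m.group(2, 3, 4, 5))
        h2, m2, s2, ms2 = (int(x) for x in m.group(6, 7, 8, 9))
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end, m.group(10).strip()))
    return cues

sample = """\
1
00:20:33,000 --> 00:20:35,500
The instructor says "UIs", not "you eyes".
"""
cues = parse_srt(sample)
print(cues)
```

A player would then show whichever cue brackets the current playback position -- no speech recognition needed, which is the point.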


👤 WesleyLivesay
The simple answer is probably just that the vast majority of people are not looking at the screen while listening, which makes it hard to make a business case for it to happen.

👤 sandreas
There is an ID3v2.3 tag field "synchronized lyrics" (frame: SYLT)[1] where you can store a "phrase" with a "timestamp". One library I know of that may support writing this tag is atldotnet[2], though I haven't tried it out. At least there is a Type TRANSCRIPTION in the linked code segment.

There is also LRC[3], a synchronized lyrics format stored in an extra file (*.lrc).

[1]: https://id3.org/id3v2.3.0#sec4.10

[2]: https://github.com/Zeugma440/atldotnet/blob/749d9ccb03032667...

[3]: https://en.wikipedia.org/wiki/LRC_(file_format)
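Since LRC is just timestamped plain text, a reader is easy to sketch. This is a minimal illustration of the `[mm:ss.xx]phrase` line format (my own code, not a full implementation of the spec; it skips metadata tags like `[ar:...]`):

```python
import re

# "[mm:ss.xx]phrase" -- the hundredths part is optional in some files.
LINE_RE = re.compile(r"\[(\d+):(\d{2})(?:\.(\d{1,2}))?\](.*)")

def parse_lrc(text):
    """Return (seconds, phrase) pairs for each timestamped line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip metadata tags like [ar:Artist] and blank lines
        minutes, seconds = int(m.group(1)), int(m.group(2))
        hundredths = int(m.group(3) or 0)
        timestamp = minutes * 60 + seconds + hundredths / 100
        entries.append((timestamp, m.group(4).strip()))
    return entries

sample = "[00:12.00]Line one\n[00:17.20]Line two"
print(parse_lrc(sample))
# → [(12.0, 'Line one'), (17.2, 'Line two')]
```

The same (timestamp, phrase) pairs are conceptually what SYLT stores inside the ID3 tag, just in binary form instead of a sidecar *.lrc file.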


👤 baxtr
Amazon's WhisperSync is pretty cool. I love listening to audio books but I also like to read and annotate them. WhisperSync solves for that. I don't buy any eBooks without that feature.

👤 sokoloff
My experience with automatic transcription is that it introduces far more errors than it would eliminate.

If it’s automatic, I don’t see how it would be beneficial to listeners/readers. If it’s manual, it’s not cheap.


👤 ghaff
Some sites do for podcasts.

But machine transcriptions are IMO useless for the purpose. (They're fine for interviews where they're just source material for something you're writing.) And they're especially useless for audio you're having trouble understanding anyway.

Human transcriptions aren't that expensive, about $1/minute, but they do cost money and you still need to spend some time cleaning them up and formatting them--and you still end up with words here and there you can't make out.


👤 elil17
For audiobooks, there’s a legal question of whether it causes licensing issues. Audible got in a huge lawsuit over it which they settled out of court: https://amp.theguardian.com/books/2020/jan/15/audible-settle...

👤 donatj
Audiobooks in the kindle app highlight the spoken text as the audio plays.

👤 Overtonwindow
I believe that’s called a book, and you could use it to read along?


👤 kradeelav
My guess would be "it's expensive", the same reason live-action TV shows online didn't have subs until fairly recently.

A quick hack for fellow people who are hard of hearing: I often start a private Google Meet session, turn video off, and have Google generate automatic captions there if there's a captionless video I really want to see. I don't expect any privacy, and you do need a silent home for it, but hey. Works.


👤 paxys
Search for "[podcast name] transcript". A lot of them do publish the full text of the show on their website.

👤 t-3
Generally, when I'm listening to podcasts or audiobooks, I don't have a screen in front of me to put CC on. Might just be me, but I can only follow one source of information at a time, so I don't like non-music audio while using the computer or reading.

👤 annagrigoryan2
For podcasts, only human transcription works well. The tech is still very hard on people with accents, and even worse with non-English languages.

And human transcription costs a lot, like 45 cents a word; imagine a 1-hour conversation.
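A quick back-of-envelope check of that quoted rate, assuming a conversational pace of roughly 150 words per minute (the pace is my assumption, not from the comment):

```python
# Back-of-envelope cost of human transcription at the quoted per-word rate.
rate_per_word = 0.45     # dollars per word, as quoted above
words_per_minute = 150   # typical conversational pace (assumed)
minutes = 60             # a 1-hour conversation

cost = rate_per_word * words_per_minute * minutes
print(f"${cost:,.2f}")   # → $4,050.00
```

So at that rate a single hour-long episode would run into thousands of dollars, which makes the "no budget for it" point above very concrete.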


👤 ohiovr
Would something like this help? https://www.youtube.com/watch?v=pPBpxoIqDSA

👤 assbuttbuttass
On Android, you can turn on "Live Caption" for autogenerated captions a la YouTube

👤 shreyshnaccount
Nice project idea

👤 seydor
Is there an audio file format with teletext?