Remove the HTML transcript format #591

ryan-lp · 2024-01-21T12:18:05Z

very low adoption
very low conformance

The standard dictates the following format:

<cite>Kevin:</cite>
<time>0:00</time>
<p>We have an update planned where we would like to give the ability to upload an artwork file for these videos</p>
<cite>Alban :</cite>
<time>0:09</time>
<p>You're triggering Tom right now with a hey, here's a cool feature.</p>

with <cite> and <time> being optional, and <p> containing transcript segments.
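For reference, a parser for a conforming document can be quite small. The following is only a sketch using Python's standard `html.parser`; the names `TranscriptParser` and `parse_transcript` are made up for illustration, and it assumes that a `<cite>` or `<time>` applies only to the next `<p>`, which is my reading of the example rather than something the spec text above states.

```python
from html.parser import HTMLParser


class TranscriptParser(HTMLParser):
    """Collects (speaker, time, text) tuples from <cite>/<time>/<p> markup."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._tag = None       # the cite/time/p element currently open, if any
        self._buf = []         # text collected inside that element
        self._speaker = None   # pending speaker for the next <p>
        self._time = None      # pending time code for the next <p>

    def handle_starttag(self, tag, attrs):
        if tag in ("cite", "time", "p"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore inline tags such as <b> nested inside a segment
        text = "".join(self._buf).strip()
        if tag == "cite":
            # "Kevin:" and "Alban :" both reduce to the bare name
            self._speaker = text.rstrip(":").strip()
        elif tag == "time":
            self._time = text
        else:
            # </p> closes a segment; assumption: speaker/time do not carry over
            self.segments.append((self._speaker, self._time, text))
            self._speaker = None
            self._time = None
        self._tag = None


def parse_transcript(markup):
    parser = TranscriptParser()
    parser.feed(markup)
    parser.close()
    return parser.segments
```

Fed the example above, this yields one tuple per `<p>`, with `None` in place of a missing speaker or time code.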

In practice, people are using this format as a lazy option to link to any arbitrarily formatted web page, whether or not it contains a transcript, without providing any machine-readable alternative. Any parser hoping to extract speaker tags and time codes soon meets the disappointing reality that transcripts purporting to be in this format contain all manner of content, including:

podcast summaries (which are not transcripts)
navigation bars (which are not transcripts)
control panel pages hosted on Descript and similar services, some of which are JavaScript apps without any parsable HTML
and so on...
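One rough way to screen for such pages before trying to parse them is to list every element that is not one of the three the format describes. This is a sketch under a strict reading of the format (only `<cite>`, `<time>` and `<p>` are expected); `unexpected_tags` and `TagCollector` are hypothetical names, and a real ingester might want to tolerate some inline markup.

```python
from html.parser import HTMLParser

# Assumption: a strictly conforming transcript uses only these elements.
EXPECTED_TAGS = {"cite", "time", "p"}


class TagCollector(HTMLParser):
    """Records the name of every element that is opened in the document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/>
        self.tags.add(tag)


def unexpected_tags(markup):
    """Return the set of element names that fall outside the expected three."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags - EXPECTED_TAGS
```

An empty result does not prove the content is a transcript (a podcast summary in plain `<p>` tags would pass), but a result containing `nav`, `script` or `html` is a strong hint that the URL points at a full web page.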

Even Buzzsprout hosts non-conformant HTML transcripts:

https://feeds.buzzsprout.com/697189/3387430/transcript

<p><!--block--><b>spk_1:</b>&nbsp;&nbsp;&nbsp;0:04<br>
tha system<br>
<br>
<b>spk_0:</b>&nbsp;&nbsp;&nbsp;0:05<br>
green with what really matters interviews. And today I'm really happy to be
interviewing a woman named Carla King. She's an adventure motorcycle writer.
She is a adventure, right tour and author. She's been writing about adventure,
motorcycling and other areas for at least 20 years. I think 25 going back to
1994 maybe even before that. Ah, she speaks to the venture travel industry.
She's an expert on that. There's so many things about her. And so, Carla,
thanks for joining us.<br>
<br>
<b>spk_1:</b>&nbsp;&nbsp;&nbsp;0:39<br>
Thanks for asking the<br>
<br>
<b>spk_0:</b>&nbsp;&nbsp;&nbsp;0:40<br>
So why did you pick a moment something to just kind of illustrate
...

Although in fact Buzzsprout's HTML transcripts follow some sort of pattern that is at least consistent with other Buzzsprout HTML transcripts, they nonetheless fail to conform to the actual standard.
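To illustrate what supporting that pattern would take, here is a sketch of a host-specific parser. The regular expression is inferred solely from the excerpt above (bold speaker label, `&nbsp;` padding, a time code, then `<br>`-separated text), so it is an assumption about Buzzsprout's output rather than a documented format, and `parse_buzzsprout_like` is a hypothetical name.

```python
import html
import re

SEGMENT_RE = re.compile(
    r"<b>(?P<speaker>[^<:]+):</b>"     # speaker label, e.g. <b>spk_1:</b>
    r"(?:&nbsp;|\s)*"                   # padding between label and time code
    r"(?P<time>\d+(?::\d{2}){1,2})"     # m:ss or h:mm:ss
    r"\s*<br\s*/?>"                     # line break after the time code
    r"(?P<text>.*?)"                    # segment body, matched lazily
    r"(?=<b>[^<:]+:</b>|\Z)",           # stop at the next label or end of input
    re.DOTALL,
)


def parse_buzzsprout_like(markup):
    """Return (speaker, time, text) tuples for markup shaped like the excerpt."""
    segments = []
    for match in SEGMENT_RE.finditer(markup):
        body = re.sub(r"<br\s*/?>", " ", match.group("text"))
        body = re.sub(r"<[^>]+>", "", body)      # drop any remaining tags
        body = " ".join(html.unescape(body).split())
        segments.append((match.group("speaker").strip(), match.group("time"), body))
    return segments
```

Every host with its own pattern would need a parser like this, which is exactly the kind of per-host special-casing a standard is supposed to make unnecessary.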

("And what" said Alice, "is the sense of a rule that everyone pays no heed to, or, if they do happen to pay heed, simply toss it aside and go on with their own peculiar doings?")

Here is another HTML transcript which is basically a web page, and does not conform to the standard:

https://www.parolaprogetto.com/podcast-episode/vergo-progettare-la-musica-oltre-i-generi/

Here's one that actually conforms to the standard but doesn't take advantage of time codes or speaker tags; it consists of one <p>very long paragraph containing the entire episode transcript</p>:

https://transcripts.captivate.fm/transcript/8fe97677-2591-4a98-8cc9-2d4df89c42f0/index.html

As someone building a service that ingests transcripts, I have had nothing but hassle from this format (I cannot distinguish transcript from non-transcript parts, speaker tags from non-speaker-tag parts, and so on), and I will be removing support for it. I would also recommend removing it from the spec outright, to discourage more people from adopting it when superior options are available.

ryan-lp · 2024-01-21T13:55:57Z

Re "very low adoption":

Actually, it seems PodcastGuru and steno.fm do display them, though I'm not sure about steno.fm. In a few tests, it successfully opened some of the non-standard HTML transcripts (and not others), but for some reason it failed to open any of the standard HTML transcripts with time codes that I tried. This may hint at how difficult this format is to parse given the lack of conformance.

In the end, though, the listener will rarely see these HTML transcripts, because almost all transcribed episodes have an SRT/VTT/JSON transcript available, and apps will prefer those.

ryan-lp · 2024-01-22T12:26:26Z

Along with this, we could probably also remove the rel="captions" attribute on the transcript tag. This attribute was only ever of value for the HTML transcript type, to indicate that the transcript contained timestamps. Since in practice apps prefer SRT/JSON, the attribute is redundant there because those formats always have timestamps. And even when an HTML transcript is provided, rel="captions" can't reliably be used as an indicator of the type of HTML content being offered. For example, Buzzsprout (who contribute to the spec) don't use this attribute on any of their HTML transcripts, even where the spec indicates it should be used. If it's not actually being used, we don't need it.

By the way, removing this from the spec will not break apps. Apps already contain logic to deal with the very common situation where this attribute is absent, and probably ignore this attribute anyway given that it is an unreliable indicator of the type of HTML file being offered. Apps will continue working as they've always worked after its removal, hence it is truly redundant and safe to remove.
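As a sketch of the kind of selection logic I mean, an app can pick a transcript purely by its declared type and never look at rel. The function name `pick_transcript`, the dict shape, the exact MIME strings, and the ordering among the timed formats are all assumptions for illustration; the only thing taken from the discussion above is that timed formats are preferred over HTML.

```python
# Assumed preference order: timed formats first, HTML last.
PREFERENCE = [
    "application/json",
    "text/vtt",
    "application/x-subrip",
    "application/srt",
    "text/html",
]


def pick_transcript(candidates):
    """Choose one transcript from a list of dicts with 'url' and 'type' keys.

    The 'rel' key, if present, is deliberately never consulted.
    """
    rank = {mime: position for position, mime in enumerate(PREFERENCE)}
    known = [c for c in candidates if c.get("type") in rank]
    if not known:
        return None
    return min(known, key=lambda c: rank[c["type"]])
```

With logic like this, the result is the same whether or not rel="captions" is present on any of the candidates.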