Remove the HTML transcript format #591
Replies: 2 comments
Actually, it seems PodcastGuru and steno.fm do show them, although I'm not sure about steno.fm. In a few tests, it successfully opened some of the non-standard HTML transcripts (and not others), but for some reason it failed to open the standard HTML transcripts with time codes that I tested. This may hint at the difficulty of parsing this format given the lack of conformance. In the end, though, listeners will rarely see these HTML transcripts, because almost all transcribed episodes also have an SRT/VTT/JSON transcript available, and apps will prefer those.
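To illustrate how an app might prefer the machine-readable formats over HTML, here is a minimal Python sketch. The preference order, the `pick_transcript` helper, and the dict shape are assumptions made for illustration and are not taken from any real app.

```python
# Hypothetical preference order: structured, time-coded formats first,
# HTML and plain text last. Real apps may rank these differently.
PREFERRED_TYPES = [
    "application/json",
    "text/vtt",
    "application/x-subrip",
    "application/srt",
    "text/html",
    "text/plain",
]


def pick_transcript(transcripts):
    """Return the transcript dict with the most preferred MIME type, or None.

    `transcripts` is a list of dicts with "url" and "type" keys (an assumed
    shape for the transcript entries already parsed out of a feed).
    """
    def rank(entry):
        # Drop any parameters such as "; charset=utf-8" before comparing.
        mime = entry.get("type", "").split(";")[0].strip().lower()
        if mime in PREFERRED_TYPES:
            return PREFERRED_TYPES.index(mime)
        return len(PREFERRED_TYPES)  # unknown types sort last

    return min(transcripts, key=rank, default=None)
```

With a ranking like this, an HTML transcript is only ever chosen when nothing better is offered, which matches what I saw in practice.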
Along with this, we could also probably remove the

By the way, removing this from the spec will not break apps. Apps already contain logic to deal with the very common situation where this attribute is absent, and they probably ignore it anyway, given that it is an unreliable indicator of the type of HTML file being offered. Apps will keep working as they always have after its removal, so it is truly redundant and safe to remove.
The HTML transcript format has:
The standard dictates the following format:
with <cite> and <time> being optional, and <p> containing transcript segments.

In practice, people are using this format as a lazy option to link to any arbitrarily formatted web page, whether or not it contains a transcript, while providing no machine-readable options. A parser that hopes to extract the speaker tags and time codes will soon meet the disappointing reality that transcripts purporting to be in this format contain all manner of content, including:
Even Buzzsprout hosts non-conformant HTML transcripts:
https://feeds.buzzsprout.com/697189/3387430/transcript
Although in fact Buzzsprout's HTML transcripts follow some sort of pattern that is at least consistent with other Buzzsprout HTML transcripts, they nonetheless fail to conform to the actual standard.
("And what" said Alice, "is the sense of a rule that everyone pays no heed to, or, if they do happen to pay heed, simply toss it aside and go on with their own peculiar doings?")
Here is another HTML transcript which is basically a web page, and does not conform to the standard:
https://www.parolaprogetto.com/podcast-episode/vergo-progettare-la-musica-oltre-i-generi/
Here's one that actually conforms to the standard, but it doesn't take advantage of time codes or speaker tags; it consists of one <p>very long paragraph containing the entire episode transcript</p>:
https://transcripts.captivate.fm/transcript/8fe97677-2591-4a98-8cc9-2d4df89c42f0/index.html
As someone building a service that ingests transcripts, this format has caused me nothing but hassle: I cannot distinguish transcript from non-transcript parts, or speaker tags from other content, and so on. I will be removing support for it, and I would recommend removing it from the spec outright to discourage more people from using it when superior options are available.
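For anyone who wants to see the problem concretely, here is a rough Python sketch of the kind of conformance check an ingesting service might run, using only the standard library's `html.parser`. It assumes the <cite>/<time>/<p> structure described above. The `TranscriptParser` class, the `parse_transcript` helper, the list of tolerated wrapper tags, and the choice not to carry a speaker over to the next paragraph are all my own assumptions, not part of the spec.

```python
from html.parser import HTMLParser

TRANSCRIPT_TAGS = {"cite", "time", "p"}
# Assumption: tolerate ordinary document wrappers instead of flagging them.
WRAPPER_TAGS = {"html", "head", "body", "meta", "title"}


class TranscriptParser(HTMLParser):
    """Collects (speaker, time, text) segments and records unexpected tags."""

    def __init__(self):
        super().__init__()
        self.segments = []       # list of (speaker, time, text) tuples
        self.unexpected = set()  # tags outside the cite/time/p vocabulary
        self._current = None     # transcript tag whose text is being collected
        self._buffer = []
        self._speaker = None
        self._time = None

    def handle_starttag(self, tag, attrs):
        if tag in TRANSCRIPT_TAGS:
            self._current = tag
            self._buffer = []
        elif tag not in WRAPPER_TAGS:
            self.unexpected.add(tag)

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace inside the collected text.
        text = " ".join("".join(self._buffer).split())
        if tag == "cite":
            self._speaker = text.rstrip(":").strip() or None
        elif tag == "time":
            self._time = text or None
        elif text:  # a non-empty <p> closes one segment
            self.segments.append((self._speaker, self._time, text))
            self._speaker = None
            self._time = None
        self._current = None


def parse_transcript(html_text):
    """Return (segments, sorted list of unexpected tag names)."""
    parser = TranscriptParser()
    parser.feed(html_text)
    parser.close()
    return parser.segments, sorted(parser.unexpected)


sample = "<cite>Kevin:</cite><time>0:00</time><p>Hello there.</p><p>No speaker here.</p>"
segments, unexpected = parse_transcript(sample)
# segments   -> [("Kevin", "0:00", "Hello there."), (None, None, "No speaker here.")]
# unexpected -> []
```

A long list of unexpected tags, or segments that never carry a speaker or time, is a reasonable hint that the file is a general web page rather than a transcript in this format. It remains a heuristic, though: a page like the single-paragraph example above passes the check while giving a parser nothing useful to work with.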