Proposal: Reserve the JSON transcript format for word timestamps #484
-
A minor update to my proposal: In light of the proposal in #452 to have more appropriate filename extensions other than the generic …
-
I just wanted to note that WebVTT actually supports word timestamps: Docs, Spec
This allows someone to (optionally) mark up when exactly a certain word inside a cue occurred, meaning WebVTT can be used both for "medium-fidelity" text sequences to display on the screen together and for "high-fidelity" word timestamps (to incrementally show words, highlight the current word, etc.).

It also means there is progressive enhancement: clients that use word timestamps can use them, while clients that don't have a good way to display them would just ignore them and display the whole cue at once (as opposed to JSON, where the producer decides whether there's going to be one cue per word or per sentence).

In the browser, when a WebVTT cue is marked up with timestamps before each word, the CSS selectors `::cue(:past)` and `::cue(:future)` can style the words before and after the current playback position differently. This is just an example of how the browser enables this; of course native apps can parse the WebVTT too and do their own handling of styles.

As the example below shows, the timestamps are trivial to parse out and the format is much more concise than having a JSON object for each individual word.
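To make that concrete, here is a minimal sketch of a cue with inline word timestamps (the cue text and times are invented for illustration):

```vtt
WEBVTT

00:00:01.000 --> 00:00:04.000
<00:00:01.000>Welcome <00:00:01.400>back <00:00:01.700>to <00:00:01.900>the <00:00:02.100>show
```

In a browser, the past/future pseudo-classes can then style words relative to the current playback position (support varies by browser):

```css
/* Words at or before the current playback position */
video::cue(:past) {
  color: yellow;
}

/* Words after the current playback position */
video::cue(:future) {
  opacity: 0.5;
}
```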
-
What is the proposal?
Restrict the scope of the JSON transcript format from supporting arbitrary fidelity to only supporting word-level fidelity.
Why would we want this?
It is super helpful when a format specifies the intended level of fidelity. For example, SRT is useful precisely because the apps that consume it know to expect 1-2 lines in each block, and that is what makes the subtitles use case possible. It is this assumption that lets an app safely display one block at a time: we know each block will stay on the screen long enough to be read before it disappears and the next block is shown.
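For reference, a typical SRT block looks like this (timings and text invented for illustration); an app can rely on each block fitting comfortably on screen:

```srt
1
00:00:01,000 --> 00:00:04,000
Welcome back to the show. Today
we're talking about transcripts.
```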
There is another style of rendering captions based on word timestamps, where you incrementally display more words as they are being spoken. Once 2 lines have been filled, you scroll the lines up so that the top line goes out of view and creates space for a new line below, where words can continue being added. Once again, it is super helpful if an app can assume that such a format definitely contains word-level fidelity before attempting to render captions this way. If the assumption is wrong, and the publisher suddenly inserts 50 words (rather than 1 word) into the next segment, this method of rendering completely breaks down. Indeed, some publishers have put the entire podcast episode in a single JSON segment. Such is the over-flexibility of this format! (Although this is rare; by far most people use JSON for word-level fidelity.)
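For comparison, word-level segments in the JSON transcript format look roughly like this (field names follow the podcast-namespace JSON transcript format; the speaker, times, and words are invented):

```json
{
  "version": "1.0.0",
  "segments": [
    { "speaker": "Alice", "startTime": 1.00, "endTime": 1.40, "body": "Welcome" },
    { "speaker": "Alice", "startTime": 1.40, "endTime": 1.70, "body": "back" },
    { "speaker": "Alice", "startTime": 1.70, "endTime": 1.90, "body": "to" },
    { "speaker": "Alice", "startTime": 1.90, "endTime": 2.10, "body": "the" },
    { "speaker": "Alice", "startTime": 2.10, "endTime": 2.50, "body": "show" }
  ]
}
```

A renderer that assumes one word per `body` can append words incrementally; a renderer handed a 50-word `body` here has no graceful fallback, which is exactly the ambiguity this proposal removes.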
Basically, what is desired is more specificity in each format, rather than a general-purpose format so broadly defined that an app may not know what to do with it.
So I think we have an opportunity here to have separate formats appropriate for each level of fidelity: low fidelity (a full-text transcript to read as a document), medium fidelity (SRT/WebVTT cues of 1-2 lines), and high fidelity (JSON word timestamps).
In terms of backward compatibility, old apps that accept arbitrary fidelity in JSON will continue to work. New apps that depend on word-level fidelity in JSON should still do normal error handling and detect invalid transcripts. But the spec change should at least encourage publishers to provide transcripts in formats that apps can usefully consume.
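As a sketch of that error handling, an app could cheaply check whether a JSON transcript is really word-level before choosing the incremental renderer (the segment shape follows the podcast-namespace JSON transcript format; `isWordLevel` and its one-token heuristic are hypothetical):

```ts
interface Segment {
  startTime: number;
  endTime: number;
  body: string;
  speaker?: string;
}

// Hypothetical heuristic: treat a transcript as word-level only if every
// segment body is a single whitespace-free token.
function isWordLevel(segments: Segment[]): boolean {
  return (
    segments.length > 0 &&
    segments.every((s) => s.body.trim().split(/\s+/).length === 1)
  );
}

// Fall back to block-at-a-time rendering when the assumption fails.
function chooseRenderer(segments: Segment[]): "incremental" | "block" {
  return isWordLevel(segments) ? "incremental" : "block";
}
```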