Proposal: Reserve the JSON transcript format for word timestamps #484
-
A minor update to my proposal: In light of the proposal in #452 to have more appropriate filename extensions other than the generic …
-
I just wanted to note that WebVTT actually supports word timestamps: Docs, Spec
This allows someone to (optionally) mark up when exactly a certain word inside a cue occurred, meaning WebVTT can be used both for "medium-fidelity" text sequences to display on the screen together and for "high-fidelity" word timestamps (to incrementally show words, highlight the current word, etc.).

It also means there is progressive enhancement: clients that use word timestamps can use them, while clients that don't have a good way to display them would just ignore them and display the whole cue at once (as opposed to JSON, where the producer decides whether there's going to be one cue per word or per sentence).

In the browser, when a WebVTT cue is marked up with timestamps before each word, the CSS selectors `::cue(:past)` and `::cue(:future)` can style the words before and after the current playback position differently. This is just an example of how the browser enables this; of course native apps can parse the WebVTT too and do their own handling of styles.

As the example below shows, the timestamps are trivial to parse out and the format is much more concise than having a JSON object for each individual word.
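To make that concrete, here is a minimal sketch of a cue with inline word timestamps (the cue text and times are invented for illustration):

```vtt
WEBVTT

00:00:01.000 --> 00:00:04.000
<00:00:01.000>Welcome <00:00:01.400>back <00:00:01.700>to <00:00:01.900>the <00:00:02.100>show
```

In a browser, the past/future pseudo-classes can then style words relative to the current playback position (support varies by browser):

```css
/* Words at or before the current playback position */
video::cue(:past) {
  color: yellow;
}

/* Words after the current playback position */
video::cue(:future) {
  opacity: 0.5;
}
```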
-
What is the proposal?
Restrict the scope of the JSON transcript format from supporting arbitrary fidelity to only supporting word-level fidelity.
Why would we want this?
It is super helpful when a format specifies the intended level of fidelity. For example, SRT is useful precisely because the apps that consume it know to expect 1-2 lines in each block, and that is what makes the subtitles use case possible. It is this assumption that lets an app safely display one block at a time: we know each block will stay on the screen long enough to be read before it disappears and the next block is shown.
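For reference, a typical SRT block looks like this (timings and text invented for illustration); an app can rely on each block fitting comfortably on screen:

```srt
1
00:00:01,000 --> 00:00:04,000
Welcome back to the show. Today
we're talking about transcripts.
```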
There is another style of rendering captions based on word timestamps, where you incrementally display more words as they are being spoken. Once 2 lines have been filled, you scroll the lines up so that the top line goes out of view and creates space for a new line below, where words can continue being added. Once again, it is super helpful if an app can assume that such a format definitely contains word-level fidelity before attempting to render captions this way. If the assumption is wrong, and the publisher suddenly inserts 50 words (rather than 1 word) into the next segment, this method of rendering completely breaks down. Indeed, some publishers have put the entire podcast episode in a single JSON segment. Such is the over-flexibility of this format! (Although this is rare; by far most people use JSON for word-level fidelity.)
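For comparison, word-level segments in the JSON transcript format look roughly like this (field names follow the podcast-namespace JSON transcript format; the speaker, times, and words are invented):

```json
{
  "version": "1.0.0",
  "segments": [
    { "speaker": "Alice", "startTime": 1.00, "endTime": 1.40, "body": "Welcome" },
    { "speaker": "Alice", "startTime": 1.40, "endTime": 1.70, "body": "back" },
    { "speaker": "Alice", "startTime": 1.70, "endTime": 1.90, "body": "to" },
    { "speaker": "Alice", "startTime": 1.90, "endTime": 2.10, "body": "the" },
    { "speaker": "Alice", "startTime": 2.10, "endTime": 2.50, "body": "show" }
  ]
}
```

A renderer that assumes one word per `body` can append words incrementally; a renderer handed a 50-word `body` here has no graceful fallback, which is exactly the ambiguity this proposal removes.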
Basically, what is desired is more specificity in each format, rather than a general-purpose format so broadly defined that an app may not know what to do with it.
So I think we have an opportunity here to have separate formats appropriate for each level of fidelity: low fidelity (a full-text transcript to read as a document), medium fidelity (SRT/WebVTT cues of 1-2 lines), and high fidelity (JSON word timestamps).
In terms of backward compatibility, old apps that accept arbitrary fidelity in JSON will continue to work. New apps that depend on word-level fidelity in JSON should still do normal error handling and detect invalid transcripts. But the spec change should at least encourage publishers to provide transcripts in formats that apps can usefully consume.
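As a sketch of that error handling, an app could cheaply check whether a JSON transcript is really word-level before choosing the incremental renderer (the segment shape follows the podcast-namespace JSON transcript format; `isWordLevel` and its one-token heuristic are hypothetical):

```ts
interface Segment {
  startTime: number;
  endTime: number;
  body: string;
  speaker?: string;
}

// Hypothetical heuristic: treat a transcript as word-level only if every
// segment body is a single whitespace-free token.
function isWordLevel(segments: Segment[]): boolean {
  return (
    segments.length > 0 &&
    segments.every((s) => s.body.trim().split(/\s+/).length === 1)
  );
}

// Fall back to block-at-a-time rendering when the assumption fails.
function chooseRenderer(segments: Segment[]): "incremental" | "block" {
  return isWordLevel(segments) ? "incremental" : "block";
}
```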