Use referenceScaling instead of scalabilityMode for decodingInfo queries #182
Pinging @chcunningham, since he's an editor, plus a few RTC experts representing different browsers, to get their opinions on this proposal.
My first thought would be to instead add scalabilityMode to decodingInfo. This is hopefully easier to use and more future-proof against a future where some new mode is introduced with features that break decoding support. WDYT?
@chcunningham That's the approach taken in WebRTC-SVC, where a decoder that can only decode a subset of modes advertises that subset. The only tricky part is interpreting the meaning when the decoder returns no modes.
I do not understand what it means for a "decoder to support a certain scalability mode". When there is scalability, there is filtering: the decoder sees only part of the stream. For example, even if the encoder produces an L3T3 stream, a smart middlebox may detect that a client doesn't support referenceScaling and forward just the lower spatial layer to that client; that substream would be decodable. On the other hand, even without scalability it is possible to encode a stream with increasing resolution; such a stream wouldn't be decodable by a decoder that can't change resolution on the fly. The inability to decode streams encoded with a certain scalability mode is a symptom. The root cause is lack of reference scaling support.
It seems like there are two issues to resolve here:
An encoder producing an L3T3 stream might encounter a middle box that reduces it to L3T1 (invented term: "no temporal scaling"), which would work for decoders that support reference scaling. The middle box might instead reduce it to L1T3 for decoders that don't support reference scaling but do support temporal scalability, or to L1T1 for decoders that support neither. What this shows is that in the case of the smart middle box, it's the middle box that needs to know the receiver's capabilities; the sender doesn't need to know, and probably shouldn't know when the stream has several recipients. It's the job of the middle box to figure out what to tell the sender makes sense.
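The middle box decision described above can be sketched as follows. This is a hypothetical illustration only: the function name and capability fields (pickForwardedMode, referenceScaling, temporalScalability) are invented for this sketch, not part of any API.

```javascript
// Given a receiver's capabilities, pick what to forward from an L3T3 input.
// Capability field names here are invented for illustration.
function pickForwardedMode(receiverCaps) {
  if (receiverCaps.referenceScaling) {
    // Receiver can decode spatial layers; keep spatial scalability.
    return receiverCaps.temporalScalability ? "L3T3" : "L3T1";
  }
  if (receiverCaps.temporalScalability) {
    // Strip spatial layers, keep temporal ones.
    return "L1T3";
  }
  return "L1T1"; // single layer only
}
```

The point of the sketch is that this decision needs the receiver's capabilities, which only the middle box is in a position to collect for every recipient.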
In SFU scenarios, the SFU is indeed a "peer" for the SVC capability exchange, just as it participates in signaling.
Yes, I agree, but I'm not sure if that changes anything. I'm thinking that the MediaCapabilities API could then be used to get the information that is needed by the middle box to make the decision of what to forward etc. So the question remains whether reference scaling should be used, or whether it should be hidden in the various scalability modes that are supported or not.
I think the consideration of SFU pushes me towards considering reference scaling to be its own decoder attribute, and not forcing listing of all supported modes on the decoder. It's something the SFU has to take into consideration, not something the video producer has to take into consideration, so having the same ID set on sender and receiver isn't important.
The video producer still might need to take reference scaling into consideration.
Reference scaling support on the decoder may constrain the encoder's choice of modes, or alternatively the SFU's choice of what to strip out. If I look at the scalability mode dependency diagrams in https://www.w3.org/TR/webrtc-svc/#dependencydiagrams*, it seems that L2T1_KEY and L2T3_KEY_SHIFT are special in that they only require reference scaling at key frames, but basically all Ln modes with n > 1 seem to require reference scaling.
I think I agree that referenceScaling works to answer all the known questions, but I dislike how it introduces new vocabulary for describing the SVC capabilities. If we instead put scalabilityMode in VideoDecoderConfig, we get to re-use the existing definition, which seems simpler. Using scalabilityMode, callers can still determine what filtering, if any, is needed by initially calling with decoderConfig.scalabilityMode = encoderConfig.scalabilityMode and making a second call with decoderConfig.scalabilityMode = filteredScalabilityMode if needed. This is no worse than the two calls needed for referenceScaling = true followed by referenceScaling = false. I'm also worried about whether referenceScaling is sufficient to address all future questions of decode support. Even now, are we certain that any device that supports decode for reference scaling in L2T1 can also support it in L2T3_KEY_SHIFT? Even if the answer is yes, I worry for the future where some new mode is introduced such that referenceScaling is no longer adequate to describe what makes/breaks decode support. With scalabilityMode, the decode and encode configs advance in sync.
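The two-call pattern described here could look roughly like the sketch below. `needsFiltering` is an invented helper; the query function is injected rather than calling navigator.mediaCapabilities.decodingInfo() directly, so this is an illustration of the logic, not a real API usage.

```javascript
// Sketch of the two-call pattern: first ask about the encoder's mode, and
// only if that is unsupported, ask about the SFU-filtered mode.
// In a browser, `query` would wrap navigator.mediaCapabilities.decodingInfo().
async function needsFiltering(query, config, encoderMode, filteredMode) {
  const full = await query({ ...config, scalabilityMode: encoderMode });
  if (full.supported) return { supported: true, mode: encoderMode };
  const filtered = await query({ ...config, scalabilityMode: filteredMode });
  return { supported: filtered.supported, mode: filteredMode };
}
```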
We do have the future-looking problem for decoders returning "" for their reference scaling modes (for codecs where a compliant decoder is currently able to decode any mode). A list of modes has a bit of a safety edge over referenceScaling; if the drawbacks are known and acceptable, it is probably a viable way forward too. But: if we want to stick with a list of modes for decoders, we should just ban "", and insist that people enumerate all the modes they know about when they support "everything". Which means that introducing new modes will not be beneficial until the decoders are upgraded, even if the new mode is decodable by any compliant decoder. Costs on all sides.
IIUC you're referring to the behavior @aboba mentioned above. I'd like to understand it better. Is this defined in https://w3c.github.io/webrtc-pc ? Does it have any impact on MediaCapabilities, or are we just talking about RTC's getCapabilities()?
Note: for MC, I'm not proposing we list the modes (I agree w/ @drkron above; scalabilityMode should be part of the query). The model with MC is to ask about one thing at a time. Callers can walk the list with multiple calls. |
I still strongly think that the question "does the decoder support that scalability mode?" is wrong. Let me try to explain it with a different semi-theoretical example. Suppose we have a decoder that does not support decoding odd resolutions. If in your application you're using resolution 320x180, then you may notice that the decoder fails to handle it with any scalability mode with 3 spatial layers, but succeeds with other scalability modes. However, it is wrong to conclude that the decoder doesn't support L3T3 mode in this example: this decoder can decode a stream encoded in L3T3 mode with resolution 640x180, and would fail to decode a stream encoded in L1T3 mode with resolution 80x45. The same goes for referenceScaling. While there is a correlation between this missing feature and the supported scalability modes, it is incorrect to say the decoder doesn't support certain scalability modes.
I agree that resolution is not a property of scalabilityMode. This particular example could be handled today by returning support=false whenever config.width or config.height is odd.
My understanding is that reference scaling (unlike resolution) is a property of certain scalability modes. Concretely, if referenceScaling = false => scalabilityMode = LN* for N > 1 is not supported. Can you give an example of where this is wrong?
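The claim above ("referenceScaling = false => LN* for N > 1 is not supported") could be encoded as a small predicate. This is a hypothetical helper for illustration; note it deliberately ignores the nuance from earlier in the thread that *_KEY modes only need reference scaling at key frames.

```javascript
// Coarse check: a mode string "LnTm..." with n > 1 spatial layers is taken
// to require reference scaling. Simulcast ("S3T3") and temporal-only modes
// ("L1T3") do not match the n > 1 condition and return false.
function modeRequiresReferenceScaling(mode) {
  const m = /^L(\d+)T\d+/.exec(mode);
  if (!m) return false;          // not an L-prefixed mode (e.g. simulcast)
  return parseInt(m[1], 10) > 1; // more than one spatial layer
}
```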
I doubt it can be done that easily: requested resolution would be 320x180, which is even. but with 3 spatial layers each reducing resolution by half, the lowest resolution would be 80x45, which is a bit odd.
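The arithmetic here checks out, as a small sketch shows. `layerResolutions` is an invented helper, assuming each additional spatial layer halves both dimensions:

```javascript
// Per-layer resolutions for a spatially scalable stream, lowest layer first,
// assuming each layer halves width and height relative to the one above it.
function layerResolutions(width, height, layers) {
  const out = [];
  for (let i = layers - 1; i >= 0; i--) {
    out.push({
      width: Math.floor(width / 2 ** i),
      height: Math.floor(height / 2 ** i),
    });
  }
  return out;
}

// layerResolutions(320, 180, 3) yields 80x45, 160x90, 320x180:
// the requested 320x180 is even, but the bottom layer's height (45) is odd.
```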
Even when there is no scalability, there is no requirement that resolution stays the same, i.e. L3T3_KEY can be decoded with referenceScaling = false if the SFM chooses to forward the bottom layer only. The structures webrtc uses for VP9 screenshare (not described in the webrtc-svc spec) use spatial layers as quality layers, i.e. the structure is similar to L3T1 but the resolution of all spatial layers is the same, so referenceScaling = false. Afaik, the decoders that have problems with scalability do support referenceScaling on key frames, but don't support increasing resolution on delta frames. Such a decoder can support any existing scalability mode, as long as the SFM or encoder avoids delta frames without temporal dependencies.
@DanilChapovalov @drkron and I had a call this morning.
Bug details for ^this are here: http://crbug.com/1022359. The issue occurs when switching from an upper to a lower spatial layer, with the lower-layer frame being a delta frame that references a key frame from the higher layer.

I finally grok the point @DanilChapovalov and @alvestrand were making. The SFU may filter down to produce a stream that is technically valid, but no longer matches any named scalabilityMode. My understanding is that codec specs give enough flexibility that we probably don't want to mint new scalabilityModes for every valid stream. I'm now leaning against using scalabilityMode for decoder configs. I don't want to give the impression that SFUs are restricted to just the named modes.

Having said all that, referenceScaling also doesn't work to describe that particular bug. The decoder does support reference scaling generally, but does not support this particular scenario. Saying referenceScaling = false would harm this receiver, excluding many common forms of SVC that it actually supports. What you really need for that bug is something like

As a general guideline, MC should avoid describing bugs. The API will likely outlive most bugs, leaving warts that confuse developers down the road. Meet is already working around the above bug w/ server filtering logic that only switches resolution at key frames, and the long-term work to fix the ChromeOS decoder is still being tracked. Hence, I don't think that particular bug (http://crbug.com/1022359) motivates a change to MC.

We may still have a problem to solve, and referenceScaling may be the solution. @drkron noticed in local testing that his mac would fall back to software decoding when SVC was received. The theory is that AVFoundation may not support reference scaling generally (even at key frames), and that this may be a common gap for other platform decoders (MediaFoundation, MediaCodec, ...). But they may still support SVC without reference scaling (i.e. just temporal scaling). He's looking into it further. @youennf, @aboba: do you know the details of SVC support on mac/windows?
Thanks for the summary @chcunningham! I've done some tests now on both Windows and MacOS with VP9 hardware decoding. The only type of stream with spatial scaling that I could produce was a k-SVC stream, or more specifically scalability mode L2T3_KEY. I was not able to decode this stream on either platform.
@chcunningham Is this a problem in the Media Foundation VP9 decoder? If so, I would file a Chromium bug, with CC: to [email protected] (Steve Becker). He can pull in the Sigma team. Also, are you seeing a similar problem with AV1 decoders?
I would think that on MacOS/iOS, referenceScaling=false is probably the most accurate choice as of now.
I've asked around and the ChromeOS support for spatial scalability seems to be split into the following groups:
Thanks @youennf. Just to note: offline you mentioned this is just for VP9. Do you know if MacOS/iOS has any plans to add this support? Is it even feasible?
Thanks @drkron. Same question as above: can we fix this in software? I know we're tracking a fix for group 2 (albeit, without much urgency). |
We did a WebRTC-specific fix (https://bugs.webkit.org/show_bug.cgi?id=231071).
To summarize the discussion, there seem to be three different levels of support for scalability mode on the decoder side:
Although a few HW decoders might be moved from "no support" to "k-SVC support", I don't think that we can expect that all HW bugs will be solved or even classified; it's probably not even feasible. Falling back to SW decoding for k-SVC streams, which is what Chrome does and now also Safari does, emphasizes the need for a field in VideoConfiguration to query on scalability mode also for decoders, to get correct predictions. Based on the three levels of support listed above, I propose that this issue is closed and the specification is kept as is. That is, with an optional scalabilityMode field that can be used for querying both encoding and decoding info. The motivation to use scalabilityMode is that it's more suitable to distinguish between k-SVC support and general SVC support and is also an existing concept, whereas referenceScaling would introduce a new term.
I can support that. Saying referenceScaling = false in cases where k-SVC is actually supported seems pretty harmful. From earlier discussion, it's regrettable that scalabilityMode only allows for N predefined arrangements of layers, but allowing k-SVC where supported is a higher priority. In practice I think callers can use heuristics like choosing the closest matching scalabilityMode, or we can mint new scalabilityMode values if needed.
For folks reading along (@aboba @DanilChapovalov @alvestrand @youennf), please yell if you have any objections / concerns / better ideas concerning my previous comment. I'd like to settle this discussion and unblock @drkron.
I still think it is a bad idea to use scalability mode to describe partial decoding support, but I don't have new arguments. Scalability mode is an existing concept for the encoder, but it is not defined for the decoding process, so to use it one still has to define what it means to "support decoding a certain scalability mode". In particular it would be nice to describe how it can be tested. Even if you define it, I still do not see how scalability mode can describe all the scenarios:
Re: definitions, my first thought is along the lines of: the decoder is capable of decoding a sequence of frames described by the given mode, ordered by time and dependencies. WDYT? I think this definition is usable for encoding as well. Just s/decoder/encoder. Re: scenarios, I agree that scalabilityMode does not cover those.
I definitely hear your points; scalabilityMode isn't perfect by any stretch. Would you agree that it's better than referenceScaling for answering the critical questions (e.g. is k-SVC supported)? Do you see a better way?
I do not think encoding and decoding definitions can be symmetric.

O  <- O
V     V
KF <- O <- O

So rather than trying to define what it means for a "decoder to support a scalability mode" and then describe support levels with scalability modes, I think it is better to have new words for the different features that a decoder doesn't support, and work directly on their definitions. I do not think there are existing methods for detecting what receivers don't support. Afaik software decoders support all features, so as soon as a hardware decoder fails to decode, software fallback is used. Most applications do not use spatial scalability and so don't encounter these problems in the first place.
To clarify the features above, consider the following examples (q - a frame encoded in qvga resolution, O - a frame encoded in VGA resolution):

q <- q <- O <- ...

This structure (without scalability) requires prediction from a different resolution (2nd feature), but doesn't require several frames per temporal unit (1st feature).

O <- O
|    |
O <- O

This structure requires several frames per temporal unit (1st feature), but doesn't require prediction from a frame with a different resolution (2nd feature). Support for the 3rd feature (change of display resolution) likely implies support for prediction from a different resolution (2nd feature); the k-SVC example below demonstrates that they are not the same.

O <- O
|
q <- q

This does require several frames per temporal unit (1st feature) and prediction from a different resolution (2nd feature).

Full SVC with delayed spatial upswitch:

     O <- O
     |    |
q <- q <- q

Requires all 3 features above.

Temporal scalability:

  q    q
 /    /
q <- q <- q

Doesn't require any of the 3 features above.
@DanilChapovalov @drkron spoke offline again. We concluded that the earlier description of decoders that support just k-SVC is not quite right. While k-SVC is the most tested configuration, it is expected that such decoders actually support all manner of SVC. There are known bugs (ex) in that support, but so far none that deserve permanent documentation via MediaCapabilities. So this leaves just two states for svc support: true and false. That suggests that the decoder signal could be reduced to something like
Friendly ping for thoughts on proposals in my final paragraph. |
I think you're right about this. I can prepare a PR unless there are objections from someone else?
This bit really needs confirmation from @DanilChapovalov.
I would like to think that every hardware decoder supports quality layers, but I worry it might not be true. I'm not aware of such decoders though.
Thanks. In summary, we now agree to drop scalabilityMode from the decoder configuration and instead use referenceScaling. The encoder configuration would still use scalabilityMode. @aboba @alvestrand - are y'all on board? If so, we'll send a PR. (Aside: we should probably reorganize the various dictionaries to not inherit from each other, as the encode/decode divergence makes it clumsy.)
@chcunningham What would we do in the case of H.264/AVC with temporal scalability? This is supported in WebCodecs today (both for encoding and decoding). This isn't H.264/SVC, so spatial scalability can't be supported on the decoder side (e.g. the only modes would be 'L1T2' and 'L1T3'). So even if reference scaling is supported, that wouldn't imply support for spatial scalability. The same is true of VP8.
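The point above could be modeled as a per-codec mode table. This is an invented sketch, not a real API: the function name and the mode lists are illustrative (and deliberately not exhaustive), showing only that for temporal-only codecs the referenceScaling answer doesn't widen the set of usable modes.

```javascript
// Hypothetical sketch: which scalability modes make sense for a codec,
// given whether the decoder supports reference scaling. For codecs without
// spatial scalability (H.264/AVC, VP8), referenceScaling is irrelevant.
function plausibleModesForCodec(codec, referenceScaling) {
  const temporalOnly = ["L1T2", "L1T3"];
  if (codec === "h264" || codec === "vp8") return temporalOnly;
  const spatial = ["L2T1", "L2T3_KEY", "L3T3"]; // illustrative subset
  return referenceScaling ? temporalOnly.concat(spatial) : temporalOnly;
}
```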
@chcunningham I have submitted a PR to remove
In the PR, it is recommended to utilize MC, and to interpret
In the case of H.264/AVC with temporal scalability, we would presumably have
Does this make sense?
Thank you! That makes sense to me. @chcunningham Regarding the need to reorganize the various dictionaries. What do you think of instead keeping it as is but having something like
in the description of each field. This way a lot of duplication is avoided - or do you still think it's clumsy?
Why would scalabiltyMode and referenceScaling only apply to WebRTC? I could see this info being of use with WebCodecs and some transport (e.g. RTCDataChannel or WebTransport). |
This may be a misunderstanding from my side, but I thought that the scalability modes that we use here are the ones defined in https://www.w3.org/TR/webrtc-svc/, which seems to be targeted towards WebRTC? If it makes sense for the types "file", "media-source", and "record", we could of course have something like:
under discussion in w3c/media-capabilities#182
@aboba and I met today. Recording our conclusions and some new questions.
The current plan is to assume temporal scalability is always supported, irrespective of reference scaling.
Thanks. Generally looks good, but then your example made me wonder: how does the app learn of SFM recv capabilities now? I hadn't considered this before.
Generally the meanings above sound correct, but we should swap the default value of referenceScaling to false for backward compatibility. Defaulting to true could change resulting MediaCapabilitiesInfo values when compared to today's results for a given combination of inputs. One other thing is that the language above should be modified to make it clear that we're talking about the track (content) rather than the decoder. This improves consistency w/ existing MC semantics and avoids constraining decoder selection for referenceScaling = false (i.e. fine to choose a decoder that does support reference scaling even if it isn't used by the content).
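The backward-compatibility point about the default could be sketched as a tiny normalization step. `normalizeVideoConfiguration` is an invented helper for illustration: a missing referenceScaling member is treated as false, so queries that predate the member keep their current results.

```javascript
// Default a missing referenceScaling member to false, leaving any
// explicitly provided value (and all other members) untouched.
function normalizeVideoConfiguration(config) {
  return { referenceScaling: false, ...config };
}
```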
Thinking on it more, keeping the inheritance is ok. I did a quick pass on member:context validity just now (#187) and it's complicated enough that we shouldn't try to further separate the dictionaries. We should have some validity checks though. Please weigh in on that issue. To answer this specific question, I agree scalabilityMode is only desired for webrtc for now. It does make sense for WebCodecs too, but that's not currently part of MC. For referenceScaling, my first thought is to allow this for all of file/media-source/webrtc, as it is technically possible outside of WebRTC (even if it's not used much in practice).
Forgot to note one other thing: here @aboba is using the term spatial scalability to mean scalability modes that change the resolution. For me this is the intuitive meaning, but just highlighting it because earlier in this thread there was some discussion of "spatial" scalability for quality layers where resolution remains fixed. For the purposes of that PR, such scalability is not "spatial". Just FYI for folks following along. |
Submitted PR w3c/webrtc-svc#56 to reflect Chris's guidance on default behavior. |
Thanks, will take a look tomorrow. @aboba did you see this question in my wall of text above?
The SFM can send the info on codecs/modes it can receive in the format that would have been used by RTCRtpReceiver.getCapabilities(). The intersection of the browser's RTCRtpSender.capabilities(kind) and the SFM's simulated RTCRtpReceiver.getCapabilities(kind) is what the browser can send to the SFM. Since the SFM typically isn't a browser, removing support for the RTCRtpReceiver.getCapabilities() method on the browser doesn't impact the SFM.
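The intersection described here could be modeled roughly as below. `sendableCodecs` is an invented helper, and for brevity it compares codecs by MIME type only (real capability matching also considers sdpFmtpLine parameters and clock rates).

```javascript
// What the browser can send is the overlap of its sender capabilities and
// the SFM's (simulated) receiver capabilities. Comparison here is by MIME
// type only, case-insensitively, as a simplification.
function sendableCodecs(senderCaps, sfmRecvCaps) {
  const recvMimes = new Set(
    sfmRecvCaps.codecs.map((c) => c.mimeType.toLowerCase())
  );
  return senderCaps.codecs.filter((c) =>
    recvMimes.has(c.mimeType.toLowerCase())
  );
}
```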
Thanks @aboba. Re-reading my comments above triggers one last bikeshed: I think spatialScalability is less ambiguous than referenceScaling. "Scaling" in SVC refers generally to any of "spatial", "temporal" and/or "quality" scaling, and we only want to describe spatial. On the other hand, folks may point out that reference scaling can be used outside of SVC. In practice I think that's rare enough that using SVC vernacular to name this member is still a fine call.
I like 'spatialScalability' better than 'referenceScaling'. |
During the implementation in Chrome I've come across an issue related to scalabilityMode.
Scalability mode specifies how the video stream should be encoded and puts a few requirements on the encoder. In theory all decoders should support decoding of all valid streams regardless of which scalability mode that was used (https://www.w3.org/TR/webrtc-svc/). In reality this is not always the case and it may happen that a decoder is not able to decode a stream due to certain properties of the stream. All cases I know of where this happens are tied to reference scaling, which means that a frame of resolution A is used as a reference when decoding a frame of resolution B != A.
My proposal is therefore to add a boolean called referenceScaling to the dictionary VideoConfiguration.
referenceScaling would be an optional member that could be set when querying decodingInfo(). The member scalabilityMode remains in the dictionary but will only be allowed when querying encodingInfo.
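The "only allowed when querying encodingInfo" rule could be enforced by a validation step like the sketch below. `validateVideoConfiguration` and its queryType parameter are invented for illustration; the sketch only encodes the one rule stated in the proposal.

```javascript
// Per the proposal: scalabilityMode is only allowed in encodingInfo()
// queries; a decodingInfo() query carrying it is rejected.
function validateVideoConfiguration(config, queryType) {
  if (queryType === "decoding" && "scalabilityMode" in config) {
    throw new TypeError("scalabilityMode is only allowed in encodingInfo() queries");
  }
  return config;
}
```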