This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Improving Videoencoder #395

Open · wants to merge 10 commits into master

Conversation

hermann-noll
Contributor

@hermann-noll hermann-noll commented Jun 18, 2020

After a couple of shooting days we had some issues with the resulting footage, namely stutters caused by duplicated frames and what looked like frames starting and ending at the wrong times relative to the video (the video pausing for up to several minutes at the start and freezing at the end). This is why I looked into improving the video encoder a bit, ideally lowering the required CPU usage, especially for split channels where it is very high and leaves little room for the application itself (jumping from ~30% to 70-80% on a modern 6-core CPU).

I found a couple of things, but as some of these changes are a bit bigger I would like to check with you whether you are OK with the proposed API changes, or whether you have further insights I missed. Also, sorry in advance for the long text; I deemed it necessary to explain the complexity involved.

Current Changes

Unnecessary memory allocations/copies

With the current state of master, a rendered frame takes the following path to the IMFSinkWriter:

  • Fetched from the GPU into an (unguarded) ring of 10 buffers owned by UnityCompositorInterface (UnityCompositorInterface.cpp:181)
  • Copied into a newly allocated temporary buffer owned by the current video encoder (VideoEncoder.cpp:346)
  • Copied into another newly allocated IMFMediaBuffer owned by the WriteVideo task (VideoEncoder.cpp:387)
  • (Presumably) uploaded back to the GPU for hardware encoding; alternatively processed directly by the CPU

As far as I can see, these copies are done for buffering (multiple render events for one UpdateCompositor call), multithreading (so the WriteVideo task does not access the ring pool) and maybe encapsulation (so the UnityCompositorInterface does not have to know about Media Foundation)? I drafted an alternative approach, based on move semantics, that serves the same purposes without copies, albeit with somewhat looser encapsulation. For CPU encoding, the new flow is probably best explored at 9cc0fbb.

This way the fetched frame is never copied around in memory (at least not by SpectatorView) and no temporary memory is allocated and freed per frame. The biggest caveat for me is that IMFMediaBuffer gets exposed, although the UnityCompositorInterface does not have to interact with it.

For the implementation I had to replace the usage of concurrency::create_task from the PPL; I also could not use concurrency::concurrent_queue for the frame pool, as the PPL does not seem to support move-only types. Luckily there is standard C++ functionality to replace create_task (std::async with the std::launch::async policy) and a simple pair of std::queue and std::mutex works for the pool; a minimal sketch is below.
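To make the idea concrete, here is a minimal sketch of the move-based frame pool, assuming hypothetical names (VideoFrame, FramePool) rather than the actual SpectatorView types:

```cpp
#include <mutex>
#include <queue>
#include <utility>

// Frames are move-only and travel pool -> compositor -> write task -> pool
// without ever being copied.
struct VideoFrame
{
    // In the real code this would own the IMFMediaBuffer (and timestamp) for one frame.
    VideoFrame() = default;
    VideoFrame(VideoFrame&&) = default;
    VideoFrame& operator=(VideoFrame&&) = default;
    VideoFrame(const VideoFrame&) = delete;            // move-only: no accidental copies
    VideoFrame& operator=(const VideoFrame&) = delete;
};

class FramePool
{
public:
    void Return(VideoFrame&& frame)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        frames_.push(std::move(frame));
    }

    bool TryTake(VideoFrame& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (frames_.empty())
            return false;
        out = std::move(frames_.front());
        frames_.pop();
        return true;
    }

private:
    std::mutex mutex_;                  // concurrency::concurrent_queue cannot hold
    std::queue<VideoFrame> frames_;     // move-only types, so a locked std::queue is used
};
```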

Potentially unordered WriteSample calls

As far as I understand, multiple concurrency::create_task calls are not guaranteed to run in order? This would mean that samples could be written out of order, and we would be relying on the MFT implementation to sort the frames. Also, since the completion of the tasks is never waited for, WriteSample calls could still be queued while the Finalize call for the sinkWriter is executed.

By using std::async we get a std::future that can be waited on, which ensures that frames are written in order (VideoEncoder.cpp:370) and that all frames are written before finalizing (VideoEncoder.cpp:449), roughly as sketched below.
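A rough sketch of the shape this takes; videoWriteFuture matches the member seen in the diff, while QueueVideoFrame, StopRecording and the VideoFrame/FramePool types from the sketch above are illustrative, not the actual VideoEncoder members:

```cpp
#include <future>
#include <mfreadwrite.h>   // IMFSinkWriter

std::future<void> videoWriteFuture;   // future of the most recently queued write

void QueueVideoFrame(FramePool& pool, VideoFrame&& frame)
{
    videoWriteFuture = std::async(std::launch::async,
        [&pool, frame = std::move(frame), previous = std::move(videoWriteFuture)]() mutable {
            if (previous.valid())
                previous.wait();                 // keep WriteSample calls in order
            // ... WriteSample with the frame's IMFMediaBuffer ...
            pool.Return(std::move(frame));       // recycle the buffer instead of freeing it
        });
}

void StopRecording(IMFSinkWriter* sinkWriter)
{
    if (videoWriteFuture.valid())
    {
        videoWriteFuture.wait();                 // all queued frames are written...
        videoWriteFuture = {};
    }
    sinkWriter->Finalize();                      // ...before the sink writer is finalized
}
```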

(Actual) hardware encoding

While trying to transfer the video texture directly to the video encoder without fetching its contents, I stumbled upon this line. It seems to me that by setting the result to an error code, the attributes responsible for hardware encoding (and for disabling throttling, lines 92-96) are never set? So Media Foundation could have chosen a software encoder all along.
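For reference, the attributes in question are roughly of this shape (a simplified sketch, not the exact code from VideoEncoder.cpp); if hr already holds a failure code before this point, the SUCCEEDED guards skip them entirely and Media Foundation is free to pick a software encoder:

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT CreateSinkWriter(ComPtr<IMFSinkWriter>& sinkWriter)
{
    ComPtr<IMFAttributes> attributes;
    HRESULT hr = MFCreateAttributes(&attributes, 2);

    // Allow hardware MFTs and stop the sink writer from throttling input.
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE);
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_SINK_WRITER_DISABLE_THROTTLING, TRUE);

    if (SUCCEEDED(hr))
        hr = MFCreateSinkWriterFromURL(L"output.mp4", nullptr, attributes.Get(), &sinkWriter);
    return hr;
}
```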

Encoding the video texture directly

As already stated by a TODO comment, it would be great to encode the rendered video texture directly without temporarily fetching its contents to RAM and uploading them back to the GPU. Unfortunately, this is easier said than done.

Getting the textures to the video encoder is still easy enough: the texture should be copied into a temporary one so the original can be modified while the older ones are encoded. The VideoInput class for hardware encoding does need a reference to the ID3D11Device, though, to create the texture (so as not to restrict the input texture format) and to trigger the GPU texture copy. The corresponding IMFMediaBuffer is created with MFCreateDXGISurfaceBuffer (VideoEncoder.h:97), which is then passed to the sinkWriter as usual.

To prepare the sinkWriter to accept textures, it needs synchronized access to the ID3D11Device, for which an IMFDXGIDeviceManager is required and must be set as an attribute (VideoEncoder.cpp:115). This manager was already created in the code but never passed. I also think it is supposed to synchronize access to the device for the entire application? That is not possible here, as we cannot control Unity's usage of the device. As an easy and working solution, I opted to move the synchronization into DirectX itself by using ID3D10Multithread (VideoEncoder.cpp:61), but I guess a cleaner way would be to create a new headless ID3D11Device and use shared textures?
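A rough sketch of the pieces involved (illustrative function and variable names, simplified error handling; d3dDevice stands for Unity's device and sinkWriterAttributes for the attribute store used above):

```cpp
#include <d3d10.h>   // ID3D10Multithread
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT SetupTextureEncoding(const ComPtr<ID3D11Device>& d3dDevice,
                             const ComPtr<IMFAttributes>& sinkWriterAttributes,
                             const ComPtr<ID3D11Texture2D>& frameTexture,
                             ComPtr<IMFMediaBuffer>& mediaBuffer)
{
    // Create the DXGI device manager and associate it with Unity's device.
    UINT resetToken = 0;
    ComPtr<IMFDXGIDeviceManager> deviceManager;
    HRESULT hr = MFCreateDXGIDeviceManager(&resetToken, &deviceManager);
    if (SUCCEEDED(hr))
        hr = deviceManager->ResetDevice(d3dDevice.Get(), resetToken);

    // Serialize device access inside D3D itself, since Unity's use of the device
    // cannot be synchronized through the device manager.
    ComPtr<ID3D11DeviceContext> context;
    ComPtr<ID3D10Multithread> multithread;
    if (SUCCEEDED(hr))
    {
        d3dDevice->GetImmediateContext(&context);
        hr = context.As(&multithread);
    }
    if (SUCCEEDED(hr))
        multithread->SetMultithreadProtected(TRUE);

    // Let the sink writer accept D3D11 textures as samples.
    if (SUCCEEDED(hr))
        hr = sinkWriterAttributes->SetUnknown(MF_SINK_WRITER_D3D_MANAGER, deviceManager.Get());

    // Wrap the copied frame texture as an IMFMediaBuffer for WriteSample.
    if (SUCCEEDED(hr))
        hr = MFCreateDXGISurfaceBuffer(__uuidof(ID3D11Texture2D), frameTexture.Get(),
                                       0 /*subresource*/, FALSE /*bottom-up*/, &mediaBuffer);
    return hr;
}
```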

Also, I could not get the encoding to work with the NV12 conversion, as WriteSample would return MF_E_BUFFERTOOSMALL. My guess is that despite every input buffer being set up and every buffer size being stated as stride=4 (as RGBA), the encoder tries to internally copy the entire texture (which is still stride=4 after conversion) into an internally allocated buffer with stride=2 for NV12.
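To illustrate the size mismatch I suspect (this is an assumption about the failure mode, not something I verified inside the MFT):

```cpp
#include <mfapi.h>

// Back-of-the-envelope comparison for a 1080p frame: the RGB32 buffer Media Foundation
// would copy from is much larger than the NV12-sized buffer it would copy into, which
// would explain MF_E_BUFFERTOOSMALL.
void CompareBufferSizes()
{
    const UINT32 width = 1920, height = 1080;
    UINT32 rgb32Size = 0, nv12Size = 0;
    MFCalculateImageSize(MFVideoFormat_RGB32, width, height, &rgb32Size); // 1920*1080*4   = 8,294,400 bytes
    MFCalculateImageSize(MFVideoFormat_NV12,  width, height, &nv12Size);  // 1920*1080*1.5 = 3,110,400 bytes
}
```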

Changes left to do (and not yet discussed)

Wrong sample time calculation?

I have not yet tackled the problem of some videos starting or ending with long freezes; I think the former can be reproduced by recording twice for a short while without leaving play mode in between.

Color conversions

I would guess that the RGB->NV12 conversion is no longer needed for hardware encoding but would still accelerate CPU encoding (so the current logic should be turned around). In the current state of this PR, the conversion is disabled in all cases (with some uncleaned parts still lying around).

Audio side

Similar changes (especially the frame moving) can be applied to the audio side of the video encoder as well, but I wanted to discuss the more complex (and more performance-heavy) video changes with you first.

Runtime-fallback

In the current state of this PR, video encoding probably no longer works on machines that do not provide any hardware encoding; it might be better if the video encoder could fall back to software encoding in that case. Implementing that would require further API changes and the removal of the HARDWARE_ENCODE_VIDEO macro.

@chrisfromwork chrisfromwork requested review from chrisfromwork and matthejo and removed request for chrisfromwork June 18, 2020 23:38
@chrisfromwork
Contributor

There's a lot to digest in this review. In general, the recording stack could use some love and we are definitely open to these sorts of changes. I will try to provide a formal review and dive deep into these changes at the beginning of next week.

@chrisfromwork chrisfromwork requested a review from wiwei June 18, 2020 23:55
#endif
inputFormat = MFVideoFormat_RGB32;
}
Contributor

Does this change mean that hardware encoding succeeds using RGB32?

Contributor Author

For me it only worked with RGB32. I guess this is because the converted texture is still reported to D3D as being RGB, while MF is told that the input is supposed to be NV12 and therefore has a smaller buffer.
MF seems to get confused, and writing samples fails with MF_E_BUFFERTOOSMALL.

if (videoWriteFuture.valid())
{
    videoWriteFuture.wait();
    videoWriteFuture = {};
Contributor

It's inconsistent, but I have seen an exception thrown during this cleanup process for debug flavors of the compositor dll when running locally. I have been using the split channels recording option.

Contributor

Specifically, what looks like a stack overflow when setting this videoWriteFuture to an empty struct.

Contributor Author

I'll look into reproducing that exception; I haven't experienced any stack overflows with the current changes before.

Contributor Author

A very nice little bug, caused by future objects keeping their async lambdas alive. By waiting on the previous future we create a chain of future objects in which every element keeps the next one alive, growing rapidly over a long recording. When manually destroyed at line 450, this chain is destructed recursively, causing the stack overflow.
I guess the really proper solution would be the more classic approach of a single task/thread consuming queued video frames; the solution chosen in cae0f9f just frees the previous future early, so only the unprocessed frames remain in the chain.
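In terms of the earlier QueueVideoFrame sketch (hypothetical names, not the exact PR code), the fix amounts to dropping the captured previous future right after waiting on it; only tasks that have not yet run then keep their predecessor alive, so destroying the chain no longer overflows the stack:

```cpp
#include <future>

void QueueVideoFrame(FramePool& pool, VideoFrame&& frame)
{
    videoWriteFuture = std::async(std::launch::async,
        [&pool, frame = std::move(frame), previous = std::move(videoWriteFuture)]() mutable {
            if (previous.valid())
                previous.wait();         // ordering is still guaranteed
            previous = {};               // release the previous future early (the cae0f9f idea),
                                         // breaking the recursive destruction chain
            // ... WriteSample with the frame's IMFMediaBuffer ...
            pool.Return(std::move(frame));
        });
}
```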

Contributor

That does sound quite nice. :)

In general, would you like me to test this out again locally so we can look into getting these changes in? Or would you prefer to continue looking into the TODO items?

Contributor Author
@hermann-noll hermann-noll Jul 17, 2020

I would prefer to do a bit more work on it; there are still some unclean parts. Nevertheless, it would be great to know whether you have a dependency on the RGB->NV12 color conversion (e.g. I don't know whether it works on Intel/AMD hardware encoders).

EDIT: Testing-wise, with the current state I could no longer reproduce any stuttering caused by recording, although insertions of old frames can definitely be observed under very high load (I used artificial CPU/GPU load for the repro). This is caused by the unguarded ring buffers in both the k4a and Decklink frame providers and should be dealt with in separate issues (a workaround is setting the ring size to 100).

@chrisfromwork
Contributor

I took a look at your changes. In general, things look good. In regards to your other follow-up items, I have the following comments:

  1. I also believe these incorrect sample time calculations may still exist. In testing your changes today I have not been able to repro the error. However, I am seeing a crash when stopping the compositor after recording for an extended period of time (the crash is inconsistent, but can somewhat consistently be reproduced after multiple recording attempts). After figuring out this std::future cleanup, we may still see the same frame time errors. Do you know if this is something you have seen with hardware encoding, or is this something that may only exist for software encoding?

  2. The folks who know the most about the origins of the NV12 codec conversions are out of office this week. I will follow up with them next week to understand whether the NV12 conversions are still needed.

  3. Sounds good to look into the audio changes.

  4. The HARDWARE_ENCODE_VIDEO macro may not have been properly used in the current master branch, based on the bugs you found with these changes, but we have never suggested to external developers through documentation that the HARDWARE_ENCODE_VIDEO macro can be changed. I think we may be able to do away with software encoding to reduce our testing surface. I will reach out to some other folks internally who may have perspectives on whether they need the software encoding functionality. If you have preferences on keeping/removing the non-hardware encoder video logic, let me know. :)

@hermann-noll
Contributor Author

hermann-noll commented Jul 20, 2020

So I adapted the audio side to be more symmetric with the video side; nothing really new there, just two fewer allocation-copy-free cycles per audio frame.

While testing these changes by playing some music during recording, I observed that in the recording the music was sped up and chopped. As this also occurred without my audio changes, this bug might be unrelated to this PR? I tracked it down to the sample time of the audio frames.
When the capture frame index is ignored and the sample time is instead calculated by adding up the frame durations, the bug disappears, but that method is of course not very resilient to stutters in either audio or video.
Instead I drafted an estimation method to align DSP time and capture frame time, which works alright, but I want to stress-test it a bit more. Do you have any feedback regarding this method?
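For clarity, the duration-accumulation variant mentioned above looks roughly like this (hypothetical names, times in the 100 ns units Media Foundation uses; the DSP/capture-time estimation itself is not reproduced here):

```cpp
#include <cstdint>

int64_t nextAudioSampleTime = 0;   // running sample time built from frame durations

int64_t StampAudioFrame(int64_t frameDuration)
{
    // Ignore the capture frame index and derive the sample time purely from the
    // accumulated durations of the audio frames written so far.
    const int64_t sampleTime = nextAudioSampleTime;
    nextAudioSampleTime += frameDuration;
    return sampleTime;
}
```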

@chrisfromwork
Contributor

I don't think anyone has a strong preference or suggestion for this time interpolation.

@hermann-noll hermann-noll marked this pull request as ready for review July 28, 2020 05:05