This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Improving Videoencoder #395

Open · wants to merge 10 commits into master

Conversation

hermann-noll
Contributor

@hermann-noll hermann-noll commented Jun 18, 2020

After a couple of shooting days we had some issues with the resulting footage, namely stutters caused by duplicated frames and what looked like frames starting and ending at the wrong times relative to the video (the video pausing for up to several minutes at the start and freezing at the end). This is why I looked into improving the video encoder a bit, ideally lowering the required CPU usage, especially for split channels where it is very high and leaves little room for the application itself (jumping from ~30% to 70-80% on a modern 6-core CPU).

I found a couple of things, but as some of these changes are a bit bigger I would like to check with you whether you are OK with the proposed API changes, or whether you have further insights I missed. Also, sorry in advance for the long text; I deemed it necessary to explain the complexity involved.

Current Changes

Unnecessary memory allocations/copies

With the current state of master, a rendered frame takes the following path to the IMFSinkWriter:

  • Fetched from the GPU into an (unguarded) ring of 10 buffers owned by UnityCompositorInterface (UnityCompositorInterface.cpp:181)
  • Copied into a newly allocated temporary buffer owned by the current video encoder (VideoEncoder.cpp:346)
  • Copied into another newly allocated IMFMediaBuffer owned by the WriteVideo task (VideoEncoder.cpp:387)
  • (Presumably) uploaded back to the GPU for hardware encoding; alternatively processed directly by the CPU

As far as I can see, these copies are done for buffering (multiple render events for one UpdateCompositor call), multithreading (so the WriteVideo task does not access the ring pool) and maybe encapsulation (so the UnityCompositorInterface does not have to know about Media Foundation)? I drafted an alternative approach, based on move semantics, that serves the same purposes without copies, albeit with somewhat looser encapsulation. For CPU encoding, the new flow is probably best explored at 9cc0fbb.

This way the fetched frame is never copied around in memory (at least not by SpectatorView) and no temporary memory is allocated and freed per frame. The biggest caveat for me is that IMFMediaBuffer gets exposed, although the UnityCompositorInterface does not have to interact with it.

For the implementation I had to replace the usage of concurrency::create_task from the PPL; I also could not use concurrency::concurrent_queue for the frame pool, as the PPL does not seem to support move-only types. Luckily there is standard C++ functionality to replace create_task (std::async with the std::launch::async policy) and a simple pair of std::queue and std::mutex works for the pool; a minimal sketch is below.
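To make the idea concrete, here is a minimal sketch of the move-based frame pool, assuming hypothetical names (VideoFrame, FramePool) rather than the actual SpectatorView types:

```cpp
#include <mutex>
#include <queue>
#include <utility>

// Frames are move-only and travel pool -> compositor -> write task -> pool
// without ever being copied.
struct VideoFrame
{
    // In the real code this would own the IMFMediaBuffer (and timestamp) for one frame.
    VideoFrame() = default;
    VideoFrame(VideoFrame&&) = default;
    VideoFrame& operator=(VideoFrame&&) = default;
    VideoFrame(const VideoFrame&) = delete;            // move-only: no accidental copies
    VideoFrame& operator=(const VideoFrame&) = delete;
};

class FramePool
{
public:
    void Return(VideoFrame&& frame)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        frames_.push(std::move(frame));
    }

    bool TryTake(VideoFrame& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (frames_.empty())
            return false;
        out = std::move(frames_.front());
        frames_.pop();
        return true;
    }

private:
    std::mutex mutex_;                  // concurrency::concurrent_queue cannot hold
    std::queue<VideoFrame> frames_;     // move-only types, so a locked std::queue is used
};
```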

Potentially unordered WriteSample calls

As far as I understand, multiple concurrency::create_task calls are not guaranteed to run in order? This would mean that samples could be written out of order, and we would be relying on the MFT implementation to sort the frames. Also, since the completion of the tasks is never waited for, WriteSample calls could still be queued while the Finalize call for the sinkWriter is executed.

By using std::async we get a std::future that can be waited on, which ensures that frames are written in order (VideoEncoder.cpp:370) and that all frames are written before finalizing (VideoEncoder.cpp:449), roughly as sketched below.
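A rough sketch of the shape this takes; videoWriteFuture matches the member seen in the diff, while QueueVideoFrame, StopRecording and the VideoFrame/FramePool types from the sketch above are illustrative, not the actual VideoEncoder members:

```cpp
#include <future>
#include <mfreadwrite.h>   // IMFSinkWriter

std::future<void> videoWriteFuture;   // future of the most recently queued write

void QueueVideoFrame(FramePool& pool, VideoFrame&& frame)
{
    videoWriteFuture = std::async(std::launch::async,
        [&pool, frame = std::move(frame), previous = std::move(videoWriteFuture)]() mutable {
            if (previous.valid())
                previous.wait();                 // keep WriteSample calls in order
            // ... WriteSample with the frame's IMFMediaBuffer ...
            pool.Return(std::move(frame));       // recycle the buffer instead of freeing it
        });
}

void StopRecording(IMFSinkWriter* sinkWriter)
{
    if (videoWriteFuture.valid())
    {
        videoWriteFuture.wait();                 // all queued frames are written...
        videoWriteFuture = {};
    }
    sinkWriter->Finalize();                      // ...before the sink writer is finalized
}
```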

(Actual) hardware encoding

While trying to transfer the video texture directly to the video encoder without fetching its contents, I stumbled upon this line. It seems to me that by setting the result to an error code, the attributes responsible for hardware encoding (and for disabling throttling, lines 92-96) are never set? So Media Foundation could have chosen a software encoder all along.
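For reference, the attributes in question are roughly of this shape (a simplified sketch, not the exact code from VideoEncoder.cpp); if hr already holds a failure code before this point, the SUCCEEDED guards skip them entirely and Media Foundation is free to pick a software encoder:

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT CreateSinkWriter(ComPtr<IMFSinkWriter>& sinkWriter)
{
    ComPtr<IMFAttributes> attributes;
    HRESULT hr = MFCreateAttributes(&attributes, 2);

    // Allow hardware MFTs and stop the sink writer from throttling input.
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE);
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_SINK_WRITER_DISABLE_THROTTLING, TRUE);

    if (SUCCEEDED(hr))
        hr = MFCreateSinkWriterFromURL(L"output.mp4", nullptr, attributes.Get(), &sinkWriter);
    return hr;
}
```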

Encoding the video texture directly

As already stated by a TODO comment, it would be great to encode the rendered video texture directly without temporarily fetching its contents to RAM and uploading them back to the GPU. Unfortunately, this is easier said than done.

Getting the textures to the video encoder is still easy enough: the texture should be copied into a temporary one so the original can be modified while the older ones are encoded. The VideoInput class for hardware encoding does need a reference to the ID3D11Device, though, to create the texture (so as not to restrict the input texture format) and to trigger the GPU texture copy. The corresponding IMFMediaBuffer is created with MFCreateDXGISurfaceBuffer (VideoEncoder.h:97), which is then passed to the sinkWriter as usual.

To prepare the sinkWriter to accept textures, it needs synchronized access to the ID3D11Device, for which an IMFDXGIDeviceManager is required and must be set as an attribute (VideoEncoder.cpp:115). This manager was already created in the code but never passed. I also think it is supposed to synchronize access to the device for the entire application? That is not possible here, as we cannot control Unity's usage of the device. As an easy and working solution, I opted to move the synchronization into DirectX itself by using ID3D10Multithread (VideoEncoder.cpp:61), but I guess a cleaner way would be to create a new headless ID3D11Device and use shared textures?
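A rough sketch of the pieces involved (illustrative function and variable names, simplified error handling; d3dDevice stands for Unity's device and sinkWriterAttributes for the attribute store used above):

```cpp
#include <d3d10.h>   // ID3D10Multithread
#include <d3d11.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT SetupTextureEncoding(const ComPtr<ID3D11Device>& d3dDevice,
                             const ComPtr<IMFAttributes>& sinkWriterAttributes,
                             const ComPtr<ID3D11Texture2D>& frameTexture,
                             ComPtr<IMFMediaBuffer>& mediaBuffer)
{
    // Create the DXGI device manager and associate it with Unity's device.
    UINT resetToken = 0;
    ComPtr<IMFDXGIDeviceManager> deviceManager;
    HRESULT hr = MFCreateDXGIDeviceManager(&resetToken, &deviceManager);
    if (SUCCEEDED(hr))
        hr = deviceManager->ResetDevice(d3dDevice.Get(), resetToken);

    // Serialize device access inside D3D itself, since Unity's use of the device
    // cannot be synchronized through the device manager.
    ComPtr<ID3D11DeviceContext> context;
    ComPtr<ID3D10Multithread> multithread;
    if (SUCCEEDED(hr))
    {
        d3dDevice->GetImmediateContext(&context);
        hr = context.As(&multithread);
    }
    if (SUCCEEDED(hr))
        multithread->SetMultithreadProtected(TRUE);

    // Let the sink writer accept D3D11 textures as samples.
    if (SUCCEEDED(hr))
        hr = sinkWriterAttributes->SetUnknown(MF_SINK_WRITER_D3D_MANAGER, deviceManager.Get());

    // Wrap the copied frame texture as an IMFMediaBuffer for WriteSample.
    if (SUCCEEDED(hr))
        hr = MFCreateDXGISurfaceBuffer(__uuidof(ID3D11Texture2D), frameTexture.Get(),
                                       0 /*subresource*/, FALSE /*bottom-up*/, &mediaBuffer);
    return hr;
}
```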

Also, I could not get the encoding to work with the NV12 conversion, as WriteSample would return MF_E_BUFFERTOOSMALL. My guess is that despite every input buffer being set up and every buffer size being stated as stride=4 (as RGBA), the encoder tries to internally copy the entire texture (which is still stride=4 after conversion) into an internally allocated buffer with stride=2 for NV12.
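To illustrate the size mismatch I suspect (this is an assumption about the failure mode, not something I verified inside the MFT):

```cpp
#include <mfapi.h>

// Back-of-the-envelope comparison for a 1080p frame: the RGB32 buffer Media Foundation
// would copy from is much larger than the NV12-sized buffer it would copy into, which
// would explain MF_E_BUFFERTOOSMALL.
void CompareBufferSizes()
{
    const UINT32 width = 1920, height = 1080;
    UINT32 rgb32Size = 0, nv12Size = 0;
    MFCalculateImageSize(MFVideoFormat_RGB32, width, height, &rgb32Size); // 1920*1080*4   = 8,294,400 bytes
    MFCalculateImageSize(MFVideoFormat_NV12,  width, height, &nv12Size);  // 1920*1080*1.5 = 3,110,400 bytes
}
```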

Changes left to do (and not yet discussed)

Wrong sample time calculation?

I have not yet tackled the problem of some videos starting or ending with long freezes; I think the former can be reproduced by recording twice for a short while without leaving play mode in between.

Color conversions

I would guess that the RGB->NV12 conversion is no longer needed for hardware encoding but would still accelerate CPU encoding (so the current logic should be turned around). In the current state of this PR, the conversion is disabled in all cases (with some uncleaned parts still lying around).

Audio side

Similar changes (especially the frame moving) can be applied to the audio side of the video encoder as well, but I wanted to discuss the more complex (and more performance-heavy) video changes with you first.

Runtime-fallback

In the current state of this PR, video encoding probably no longer works on machines that do not provide any hardware encoding; it might be better if the video encoder could fall back to software encoding in that case. Implementing that would require further API changes and the removal of the HARDWARE_ENCODE_VIDEO macro.

@chrisfromwork chrisfromwork requested review from chrisfromwork and matthejo and removed request for chrisfromwork June 18, 2020 23:38
@chrisfromwork
Contributor

There's a lot to digest in this review. In general, the recording stack could use some love and we are definitely open to these sorts of changes. I will try to provide a formal review and dive deep into these changes at the beginning of next week.

@chrisfromwork chrisfromwork requested a review from wiwei June 18, 2020 23:55
#endif
inputFormat = MFVideoFormat_RGB32;
}
Contributor

Does this change mean that hardware encoding succeeds using RGB32?

Contributor Author

For me it only worked with RGB32. I guess this is because the converted texture is still reported to D3D as being RGB, while MF is told that the input is supposed to be NV12 and therefore has a smaller buffer.
MF seems to get confused, and writing samples fails with MF_E_BUFFERTOOSMALL.

if (videoWriteFuture.valid())
{
    videoWriteFuture.wait();
    videoWriteFuture = {};
Contributor

It's inconsistent, but I have seen an exception thrown during this cleanup process for debug flavors of the compositor dll when running locally. I have been using the split channels recording option.

Contributor

Specifically, what looks like a stack overflow when setting this videoWriteFuture to an empty struct.

Contributor Author

I'll look into reproducing that exception; I haven't experienced any stack overflows with the current changes before.

Contributor Author

A very nice little bug, caused by future objects keeping their async lambdas alive. By waiting on the previous future we create a chain of future objects in which every element keeps the next one alive, growing rapidly over a long recording. When manually destroyed at line 450, this chain is destructed recursively, causing the stack overflow.
I guess the really proper solution would be the more classic approach of a single task/thread consuming queued video frames; the solution chosen in cae0f9f just frees the previous future early, so only the unprocessed frames remain in the chain.
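In terms of the earlier QueueVideoFrame sketch (hypothetical names, not the exact PR code), the fix amounts to dropping the captured previous future right after waiting on it; only tasks that have not yet run then keep their predecessor alive, so destroying the chain no longer overflows the stack:

```cpp
#include <future>

void QueueVideoFrame(FramePool& pool, VideoFrame&& frame)
{
    videoWriteFuture = std::async(std::launch::async,
        [&pool, frame = std::move(frame), previous = std::move(videoWriteFuture)]() mutable {
            if (previous.valid())
                previous.wait();         // ordering is still guaranteed
            previous = {};               // release the previous future early (the cae0f9f idea),
                                         // breaking the recursive destruction chain
            // ... WriteSample with the frame's IMFMediaBuffer ...
            pool.Return(std::move(frame));
        });
}
```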

Contributor

That does sound quite nice. :)

In general, would you like me to test this out again locally so we can look into getting these changes in? Or would you prefer to continue looking into the TODO items?

Contributor Author
@hermann-noll hermann-noll Jul 17, 2020

I would prefer to do a bit more work on it; there are still some unclean parts. Nevertheless, it would be great to know whether you have a dependency on the RGB->NV12 color conversion (e.g. I don't know whether it works on Intel/AMD hardware encoders).

EDIT: Testing-wise, with the current state I could no longer reproduce any stuttering caused by recording, although insertions of old frames can definitely be observed under very high load (I used artificial CPU/GPU load for the repro). This is caused by the unguarded ring buffers in both the k4a and Decklink frame providers and should be dealt with in separate issues (a workaround is setting the ring size to 100).

@chrisfromwork
Contributor

I took a look at your changes. In general, things look good. In regards to your other follow-up items, I have the following comments:

  1. I also believe these incorrect sample time calculations may still exist. In testing your changes today I have not been able to repro the error. However, I am seeing a crash when stopping the compositor after recording for an extended period of time (the crash is inconsistent, but can somewhat consistently be reproduced after multiple recording attempts). After figuring out this std::future cleanup, we may still see the same frame time errors. Do you know if this is something you have seen with hardware encoding, or is this something that may only exist for software encoding?

  2. The folks who know the most about the origins of the NV12 codec conversions are out of office this week. I will follow up with them next week to understand whether the NV12 conversions are still needed.

  3. Sounds good to look into the audio changes.

  4. The HARDWARE_ENCODE_VIDEO macro may not have been properly used in the current master branch, based on the bugs you found with these changes, but we have never suggested to external developers through documentation that the HARDWARE_ENCODE_VIDEO macro can be changed. I think we may be able to do away with software encoding to reduce our testing surface. I will reach out to some other folks internally who may have perspectives on whether they need the software encoding functionality. If you have preferences on keeping/removing the non-hardware encoder video logic, let me know. :)

@hermann-noll
Contributor Author

hermann-noll commented Jul 20, 2020

So I adapted the audio side to be more symmetric with the video side; nothing really new there, just two fewer allocation-copy-free cycles per audio frame.

While testing these changes by playing some music during recording, I observed that in the recording the music was sped up and chopped. As this also occurred without my audio changes, this bug might be unrelated to this PR? I tracked it down to the sample time of the audio frames.
When the capture frame index is ignored and the sample time is instead calculated by adding up the frame durations, the bug disappears, but that method is of course not very resilient to stutters in either audio or video.
Instead I drafted an estimation method to align DSP time and capture frame time, which works alright, but I want to stress-test it a bit more. Do you have any feedback regarding this method?
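For clarity, the duration-accumulation variant mentioned above looks roughly like this (hypothetical names, times in the 100 ns units Media Foundation uses; the DSP/capture-time estimation itself is not reproduced here):

```cpp
#include <cstdint>

int64_t nextAudioSampleTime = 0;   // running sample time built from frame durations

int64_t StampAudioFrame(int64_t frameDuration)
{
    // Ignore the capture frame index and derive the sample time purely from the
    // accumulated durations of the audio frames written so far.
    const int64_t sampleTime = nextAudioSampleTime;
    nextAudioSampleTime += frameDuration;
    return sampleTime;
}
```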

@chrisfromwork
Contributor

I don't think anyone has a strong preference or suggestion for this time interpolation.

@hermann-noll hermann-noll marked this pull request as ready for review July 28, 2020 05:05