Improving Videoencoder #395
base: master
Conversation
as to prevent another unnecessary memory allocation
There's a lot to digest in this review. In general, the recording stack could use some love and we are definitely open to these sorts of changes. I will try to provide a formal review and dive deep into these changes at the beginning of next week.
src/SpectatorView.Native/SpectatorView.Compositor/Compositor/VideoEncoder.h
#endif
    inputFormat = MFVideoFormat_RGB32;
}
Does this change mean that hardware encoding succeeds using RGB32?
For me it only worked with RGB32. I guess this is because the converted texture is still reported to D3D as being RGB, while Media Foundation is told that the input is supposed to be NV12 and therefore expects a smaller buffer. MF seems to get confused, and writing samples failed with MF_E_BUFFERTOOSMALL.
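For context, a minimal sketch of what an RGB32 input media type for the sink writer looks like; the dimensions, frame rate and the helper function are illustrative, not the compositor's actual code:

```cpp
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Declare the samples handed to the sink writer as 32-bit RGB. With NV12 the
// declared buffer size no longer matches the RGBA-sized texture data, which is
// where MF_E_BUFFERTOOSMALL came from.
HRESULT SetRgb32InputType(IMFSinkWriter* sinkWriter, DWORD streamIndex,
                          UINT32 width, UINT32 height, UINT32 fps)
{
    ComPtr<IMFMediaType> inputType;
    HRESULT hr = MFCreateMediaType(&inputType);
    if (SUCCEEDED(hr)) hr = inputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    if (SUCCEEDED(hr)) hr = inputType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    if (SUCCEEDED(hr)) hr = inputType->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive);
    if (SUCCEEDED(hr)) hr = MFSetAttributeSize(inputType.Get(), MF_MT_FRAME_SIZE, width, height);
    if (SUCCEEDED(hr)) hr = MFSetAttributeRatio(inputType.Get(), MF_MT_FRAME_RATE, fps, 1);
    if (SUCCEEDED(hr)) hr = sinkWriter->SetInputMediaType(streamIndex, inputType.Get(), nullptr);
    return hr;
}
```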
src/SpectatorView.Native/SpectatorView.Compositor/Compositor/VideoEncoder.cpp
if (videoWriteFuture.valid())
{
    videoWriteFuture.wait();
    videoWriteFuture = {};
It's inconsistent, but I have seen an exception thrown during this cleanup process for debug flavors of the compositor dll when running locally. I have been using the split channels recording option.
Specifically, it looks like a stack overflow when setting this videoWriteFuture to an empty struct.
I'll look into reproducing that exception; I haven't experienced any stack overflows with the current changes so far.
A very nice little bug, caused by future objects keeping the async lambda alive. By waiting on the previous future we create a chain of future objects in which every element keeps the next one alive, growing rapidly over a long recording. When manually destroyed in line 450, this chain is recursively destructed, causing the stack overflow.

I guess the really proper solution would be a more classic approach with a single task/thread consuming queued video frames; the solution chosen in cae0f9f just frees the previous future early, so that only the unprocessed frames remain in the chain.
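To illustrate the mechanism, here is a simplified sketch (not the actual compositor code): every queued write captures the previous future, so the futures form a linked chain unless the captured future is released as soon as it has been waited on.

```cpp
#include <future>
#include <utility>

std::future<void> videoWriteFuture;

void QueueWriteSketch(/* frame data */)
{
    // The new task keeps the previous future alive through its capture. Without the
    // explicit release below, destroying the last future at the end of a recording
    // recursively destroys every earlier one, which is the observed stack overflow.
    videoWriteFuture = std::async(std::launch::async,
        [previous = std::move(videoWriteFuture)]() mutable
        {
            if (previous.valid())
            {
                previous.wait(); // preserve write order
                previous = {};   // drop the link so the chain only holds unprocessed frames
            }
            // WriteSample(...) would happen here.
        });
}
```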
That does sound quite nice. :)
In general, would you like me to test this out again locally so we can look into getting these changes in? Or would you prefer to continue looking into the TODO items?
I would prefer to do a bit more work on it; there are still some unclean parts. Nevertheless, it would be great to know whether you have a dependency on the RGB->NV12 color conversion (e.g. I don't know whether it works with the Intel/AMD hardware encoders)?

EDIT: Testing-wise, with the current state I could no longer reproduce any stuttering caused by recording. Insertions of old frames can still be observed under very high load (I used artificial CPU/GPU load for the repro), but this is caused by the unguarded ring buffers in both the k4a and Decklink frame providers and should be dealt with in separate issues (a workaround is setting the ring size to 100).
I took a look at your review. In general, things look good. Regarding your other follow-up items, I have the following comments:
> So I adapted the audio side to be more symmetric to the video side; nothing really new there, just two fewer allocate-copy-free cycles for the audio frames. While testing these changes by playing some music during a recording, I observed that in the recording the music was sped up and chopped. As this also occurred without my audio changes, this bug might be unrelated to this PR? I tracked it down to the sample time of the audio frames.

I don't think anyone has a strong preference or suggestion for this time interpolation.
So after a couple of shooting days we had some issues with the resulting footage, namely stutters caused by duplicated frames and what seemed like frames starting and ending at the wrong times relative to the video (the video pausing for up to several minutes at the start and freezing at the end). This is why I tried looking into improving the videoencoder a bit, ideally lowering the required CPU usage, especially for split channels, where it is very high and leaves little room for the application itself (jumping from ~30% to 70-80% on a modern 6-core CPU).

I found a couple of things, but as some of these changes are a bit bigger I would like to check with you whether you are OK with the proposed API changes, or whether you have further insights that I missed. Also, I am sorry in advance for the long text; I deemed it necessary to explain the complexity involved.
Current Changes
Unnecessary memory allocations/copies
With the current state of master a rendered frame takes several copies on its way to the IMFSinkWriter, ending up in an IMFMediaBuffer owned by the WriteVideo task (VideoEncoder.cpp:387). As far as I can see these copies are done for buffering (multiple render events for one UpdateCompositor call), multithreading (so the WriteVideo task does not access the ring pool) and maybe encapsulation (so the UnityCompositorInterface does not have to know about Media Foundation)? I drafted an alternative method that serves these intentions without copies, although with somewhat looser encapsulation, based on move semantics. So (for CPU encoding) the new flow would look like this (probably best explored at 9cc0fbb):

1. A frame object (a std::unique_ptr) is obtained from the videoencoder (UnityCompositorInterface.cpp:125)
2. It wraps an IMFMediaBuffer in which the texture is fetched (VideoEncoder.h:127)
3. The filled frame is moved back to the encoder via QueueVideoFrame (UnityCompositorInterface.cpp:128)
4. Its buffer is consumed directly by WriteVideo (VideoEncoder.cpp:366)

This way the fetched frame is never copied around in memory (at least not by SpectatorView) and no temporary memory is allocated and freed per frame. The biggest caveat for me is the exposing of IMFMediaBuffer, although the UnityCompositorInterface does not have to interact with it.

For the implementation I had to replace the usage of concurrency::create_task from the PPL; I also could not use concurrency::concurrent_queue for the frame pool, as the PPL does not seem to support move-only datatypes. Luckily there is standard C++ functionality to replace create_task (std::async with the std::launch::async policy set), and a simple pair of std::queue and std::mutex works for the pool (a rough sketch of the pool and the write chain follows below).

Potentially unordered WriteSample calls

As far as I understood, multiple concurrency::create_task calls are not guaranteed to be executed in order? This would mean that samples could be written out of order and we would rely on the implementation of the MFT to sort the frames. Also, as the completion of the tasks is never waited for, there could still be WriteSample calls queued while the Finalize call for the sinkWriter is executed.

By using std::async we get a std::future one can wait on to ensure that frames are written in order (VideoEncoder.cpp:370) and that all frames are written before finalizing (VideoEncoder.cpp:449).
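To make the intended structure more concrete, here is a rough sketch of such a move-only frame pool and of waiting on the write chain before finalizing; the type and function names are illustrative, not the exact ones used in the PR:

```cpp
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>
#include <windows.h>
#include <mfobjects.h>
#include <wrl/client.h>

// Illustrative frame type: owns the IMFMediaBuffer the texture is fetched into.
struct VideoFrame
{
    Microsoft::WRL::ComPtr<IMFMediaBuffer> buffer;
    LONGLONG timestamp = 0;
};

// Move-only frame pool built from std::queue + std::mutex, since
// concurrency::concurrent_queue cannot hold move-only element types.
class FramePool
{
public:
    std::unique_ptr<VideoFrame> Acquire()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pool_.empty())
            return std::make_unique<VideoFrame>();
        auto frame = std::move(pool_.front());
        pool_.pop();
        return frame;
    }

    void Release(std::unique_ptr<VideoFrame> frame)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        pool_.push(std::move(frame));
    }

private:
    std::mutex mutex_;
    std::queue<std::unique_ptr<VideoFrame>> pool_;
};

// Each queued frame becomes a std::async(std::launch::async, ...) task that first
// waits on the previous write's future, so WriteSample calls stay ordered. Stopping
// the recording then only needs to wait on the most recent future before Finalize.
std::future<void> videoWriteFuture;

void StopRecordingSketch(/* IMFSinkWriter* sinkWriter */)
{
    if (videoWriteFuture.valid())
    {
        videoWriteFuture.wait(); // all earlier writes have completed once this returns
        videoWriteFuture = {};
    }
    // sinkWriter->Finalize();
}
```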
(Actual) hardware encoding

While trying to transfer the video texture directly to the video encoder without fetching its contents I stumbled upon this line. It seems to me that by setting the result to an error code, the attributes responsible for hardware encoding (and for disabling throttling, lines 92-96) are never set? So Media Foundation could have chosen a software encoder all along.
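For reference, a minimal sketch of what applying those attributes looks like; the attribute names are the Media Foundation ones this paragraph refers to, while the surrounding function is illustrative:

```cpp
#include <mfapi.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Attributes intended for MFCreateSinkWriterFromURL. If hr already holds a failure
// code before this point (as on the linked line), both SetUINT32 calls are skipped
// and Media Foundation is free to pick a software encoder.
HRESULT CreateSinkWriterAttributes(ComPtr<IMFAttributes>& attributes)
{
    HRESULT hr = MFCreateAttributes(&attributes, 2);
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE);
    if (SUCCEEDED(hr))
        hr = attributes->SetUINT32(MF_SINK_WRITER_DISABLE_THROTTLING, TRUE);
    return hr;
}
```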
Encoding the video texture directly

As already stated by a TODO comment, it would be great to encode the rendered video texture directly, without temporarily fetching its content to RAM and uploading it back to the GPU. Unfortunately this is a bit easier said than done.

Getting the textures to the video encoder is still easy enough: the texture should be copied into a temporary one so the original can be modified while the older ones are encoded. The VideoInput class for hardware encoding needs a reference to the ID3D11Device though, to create the texture (so as not to restrict the texture format of the input) and to trigger the GPU texture copy. The corresponding IMFMediaBuffer is created with MFCreateDXGISurfaceBuffer (VideoEncoder.h:97), which is then passed to the sinkWriter as normal.
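A rough sketch of that path, assuming a hypothetical helper; the real VideoInput code differs in naming and error handling:

```cpp
#include <d3d11.h>
#include <mfapi.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// Copy the rendered texture into a temporary texture on the GPU (no CPU readback)
// and wrap that copy as an IMFMediaBuffer for the sink writer.
HRESULT WrapTextureForSinkWriter(ID3D11Device* device, ID3D11DeviceContext* context,
                                 ID3D11Texture2D* renderedTexture,
                                 ComPtr<ID3D11Texture2D>& tempTexture,
                                 ComPtr<IMFMediaBuffer>& buffer)
{
    // Temporary texture with the same description, so the original can keep being
    // modified while older frames are still waiting to be encoded.
    D3D11_TEXTURE2D_DESC desc;
    renderedTexture->GetDesc(&desc);
    HRESULT hr = device->CreateTexture2D(&desc, nullptr, &tempTexture);
    if (FAILED(hr)) return hr;

    context->CopyResource(tempTexture.Get(), renderedTexture);

    // The DXGI surface buffer can then be attached to an IMFSample and written as usual.
    return MFCreateDXGISurfaceBuffer(__uuidof(ID3D11Texture2D), tempTexture.Get(),
                                     0 /*subresource*/, FALSE /*bottom-up*/, &buffer);
}
```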
In order to prepare the sinkWriter to accept textures, it needs synchronized access to the ID3D11Device, for which an IMFDXGIDeviceManager is required and has to be set as an attribute (VideoEncoder.cpp:115). This manager was already created in the code but never passed along. I also think it is supposed to synchronize access to the device for the entire application? That is not possible here, as we cannot control Unity's usage of the device. For an easy and working solution I opted to move the synchronization into DirectX itself by using ID3D10Multithread (VideoEncoder.cpp:61), but I guess a cleaner way would be to create a new headless ID3D11Device and use shared textures?
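Roughly, the setup described above looks like this; the interface and attribute names are the real Media Foundation ones, while the helper itself is an illustrative sketch:

```cpp
#include <d3d11.h>
#include <d3d10misc.h>
#include <mfapi.h>
#include <mfreadwrite.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT PrepareD3DForSinkWriter(ID3D11Device* device, IMFAttributes* sinkWriterAttributes,
                                ComPtr<IMFDXGIDeviceManager>& deviceManager)
{
    // Share the D3D11 device with the sink writer / encoder MFT.
    UINT resetToken = 0;
    HRESULT hr = MFCreateDXGIDeviceManager(&resetToken, &deviceManager);
    if (FAILED(hr)) return hr;

    hr = deviceManager->ResetDevice(device, resetToken);
    if (FAILED(hr)) return hr;

    hr = sinkWriterAttributes->SetUnknown(MF_SINK_WRITER_D3D_MANAGER, deviceManager.Get());
    if (FAILED(hr)) return hr;

    // Since Unity's own use of the device cannot be routed through the device manager,
    // fall back to D3D's built-in context locking instead.
    ComPtr<ID3D10Multithread> multithread;
    hr = device->QueryInterface(IID_PPV_ARGS(&multithread));
    if (FAILED(hr)) return hr;

    multithread->SetMultithreadProtected(TRUE);
    return S_OK;
}
```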
Also, I could not get the encoding to work with the NV12 conversion, as WriteSample would return MF_E_BUFFERTOOSMALL. My guess would be that despite setting every input buffer and stating every buffer size as stride=4 (as RGBA), the encoder tries to internally copy the entire texture (which is still stride=4 after conversion) into an internally allocated buffer with stride=2 for NV12.

Changes left to do (and not yet discussed)
Wrong sample time calculation?
I did not yet tackle the problem of some videos starting or ending with long freezes. I think the first one can be reproduced by recording twice for a bit without leaving play mode in between.
Color conversions
I would guess that the RGB->NV12 conversion is no longer needed for hardware encoding but would still accelerate CPU encoding (so the current logic would be turned around). The current state of this PR is that this conversion is disabled entirely (with uncleaned parts still lying around).
Audio side
Similar changes (especially the frame moving) can be applied to the audio side of the video encoder as well, but I wanted to discuss the more complex (and more performance-heavy) video changes with you first.
Runtime-fallback
With the current state of this PR, video encoding probably no longer works on machines that do not provide any hardware encoding; maybe it would be better if the video encoder could fall back to software encoding in that case. Implementing that would require more API changes and the removal of the HARDWARE_ENCODE_VIDEO macro.