use xav instead of ffmpeg #403
base: main
Conversation
```elixir
|> Stream.chunk_every(1000)
|> Stream.map(&Nx.Batch.concatenate/1)
|> Stream.map(fn batch -> Nx.Defn.jit_apply(&Function.identity/1, [batch]) end)
```
This function just needs to return a stream of chunks, so we don't need to do this concatenation.
Do you know what determines the length of each chunk, could that be configurable perhaps?
I am not completely sure, but I'm guessing that a chunk/frame in a video context is the audio duration of one frame. When I remove the `Stream.chunk_every/2` step from the pipeline, I get a massive performance degradation and the processing does not finish in a reasonable time. I imagine it's far more efficient to read a chunk of frames first before sending it to the Whisper serving.
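For illustration, `Stream.chunk_every/2` simply groups consecutive stream elements into fixed-size lists (the last chunk may be shorter), which is why removing it means every single frame hits the serving on its own. A toy example:

```elixir
# Toy illustration of Stream.chunk_every/2: it groups a stream into
# fixed-size lists; the final chunk may be shorter than the chunk size.
chunks =
  1..7
  |> Stream.chunk_every(3)
  |> Enum.to_list()

IO.inspect(chunks)
# => [[1, 2, 3], [4, 5, 6], [7]]
```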
I just reused the implementation of the Elixir WebRTC team from https://github.com/elixir-webrtc/xav/blob/master/README.md
```elixir
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(whisper, featurizer, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# Read a couple of frames.
# See https://hexdocs.pm/bumblebee/Bumblebee.Audio.WhisperFeaturizer.html for the default sampling rate.
frames =
  Xav.Reader.stream!("sample.mp3", read: :audio, out_format: :f32, out_channels: 1, out_sample_rate: 16_000)
  |> Stream.take(200)
  |> Enum.map(fn frame -> Xav.Reader.to_nx(frame) end)

batch = Nx.Batch.concatenate(frames)
batch = Nx.Defn.jit_apply(&Function.identity/1, [batch])

Nx.Serving.run(serving, batch)
```
I see. The serving transforms the stream to accumulate smaller chunks, but there is a place where we need to append to a list and that may be the reason why it's inefficient with tiny chunks.
However, either way, I think it's wasteful to convert every frame to a tensor just to concatenate later. With the current ffmpeg code we get a single binary for the whole 30s and create a tensor from that. So ideally we want to replicate this.
Would this suffice?
```elixir
path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.chunk_every(1000)
|> Stream.map(fn frames ->
  [frame | _] = frames
  binary = Enum.reduce(frames, <<>>, fn frame, acc -> acc <> frame.data end)
  Nx.with_default_backend(Nx.BinaryBackend, fn -> Nx.from_binary(binary, frame.format) end)
end)
```
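As an aside, repeatedly appending with `acc <> frame.data` copies the accumulator on each step; collecting the parts as iodata and converting once at the end avoids that. A quick sketch with plain binaries (not the PR's actual code, just the idiom):

```elixir
# Joining many small binaries: collect them as iodata and convert once,
# instead of copying the accumulator on every `<>` append.
parts = for i <- 1..4, do: <<i>>
joined = IO.iodata_to_binary(parts)

IO.inspect(joined)
# => <<1, 2, 3, 4>>
```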
The chunk size of 1000 is currently arbitrary, because we don't know the frame size of the codec in use. This could be added to `xav`, I think.
Correction: the information is already there. `Xav.Frame` has a `samples` field, from which the number of frames per 30 s chunk could be calculated as `round(sampling_rate / frame.samples * 30)`.
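Plugging in some hypothetical numbers (a 16 kHz sampling rate and 1024 samples per frame; the actual frame size depends on the codec), that works out as:

```elixir
# Hypothetical values: 16 kHz audio, 1024 samples per frame.
# The real frame size depends on the codec xav decodes from.
sampling_rate = 16_000
frame_samples = 1024

frames_per_chunk = round(sampling_rate / frame_samples * 30)
IO.inspect(frames_per_chunk)
# => 469
```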
Handling the frame binaries directly is a good call. I think we can transform the stream so that we accumulate the binary as we go, instead of waiting for 1000 binaries and joining them. I was thinking we could determine the number of samples from the binary size, but `frame.samples` is even more convenient.
So it would be something like this:
```elixir
chunk_samples = sampling_rate * 30

path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.transform(
  fn -> {<<>>, 0} end,
  fn frame, {buffer, samples} ->
    buffer = buffer <> frame.data
    samples = samples + frame.samples

    if samples >= chunk_samples do
      chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
      {[chunk], {<<>>, 0}}
    else
      {[], {buffer, samples}}
    end
  end,
  fn {buffer, _samples} ->
    chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
    {[chunk], {<<>>, 0}}
  end,
  fn _ -> :ok end
)
```
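The accumulate-and-flush pattern can be sanity-checked without Xav or Nx installed. The sketch below runs the same `Stream.transform/5` shape over synthetic frames (plain maps with `:data` and `:samples` standing in for `Xav.Frame`) and emits the raw buffer where the real code would call `Nx.from_binary/3`; the last-element function here also skips an empty leftover buffer:

```elixir
# Synthetic frames standing in for Xav.Frame: 4 bytes of data, 1 sample each.
frames = for i <- 1..5, do: %{data: <<i::32>>, samples: 1}
chunk_samples = 2

chunks =
  frames
  |> Stream.transform(
    fn -> {<<>>, 0} end,
    fn frame, {buffer, samples} ->
      buffer = buffer <> frame.data
      samples = samples + frame.samples

      if samples >= chunk_samples do
        # A full chunk is ready: emit it and reset the accumulator.
        {[buffer], {<<>>, 0}}
      else
        {[], {buffer, samples}}
      end
    end,
    fn {buffer, _samples} ->
      # Flush whatever remains at the end of the stream, if anything.
      {if(buffer == <<>>, do: [], else: [buffer]), {<<>>, 0}}
    end,
    fn _ -> :ok end
  )
  |> Enum.to_list()

IO.inspect(length(chunks))
# => 3 (two full chunks of 2 samples, plus the flushed 1-sample remainder)
```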
Wow! That is a great solution! Thanks!
@kevinschweikert thanks for the PR! Dropping the ffmpeg dependency would be great, but yeah, I agree that we need xav to be precompiled for this to be beneficial.
I would keep `ffmpeg` but also have `xav` as an optional dependency.
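In `mix.exs`, an optional dependency could look something like this (the version requirements are placeholders, not the actual ones from the PR):

```elixir
# Hypothetical deps entry: marking :xav optional means users who stick
# with the ffmpeg-based path don't pull in the NIF at all.
defp deps do
  [
    {:nx, "~> 0.7"},
    {:xav, "~> 0.4", optional: true}
  ]
end
```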
Sounds good to me!
Co-authored-by: Jonatan Kłosko <[email protected]>
Awesome idea! I will bring back the …
I've just realised that it's not just about precompilation, the main blocker is that …
Precompiling FFmpeg is not an easy task, as it has so many other dependencies. I agree that until Xav solves this problem, it's better for Bumblebee to stay with the current solution.
The `xav` library is a wrapper around the libav libraries, from which FFmpeg is built. The nice thing is that this compiles the libav functions into a NIF, so you don't need FFmpeg installed on your computer. There are two prerequisites until this could be merged:

1. The `xav` library could precompile the NIF, so we get a real benefit from this change. Otherwise, we just make it harder for the user by requiring the development packages (install instructions from xav) instead of the normal FFmpeg binary.
2. `xav` needs to update `nx` so we can merge. I've already opened a PR: version bump nx, exla and bumblebee elixir-webrtc/xav#19. To test this branch, I temporarily set `override: true` on the `nx` dependency.