use xav instead of ffmpeg #403
base: main
Conversation
```elixir
|> Stream.chunk_every(1000)
|> Stream.map(&Nx.Batch.concatenate/1)
|> Stream.map(fn batch -> Nx.Defn.jit_apply(&Function.identity/1, [batch]) end)
```
This function just needs to return a stream of chunks, so we don't need to do this concatenation.
Do you know what determines the length of each chunk, could that be configurable perhaps?
I am not completely sure, but I'm guessing that a chunk/frame in a video context is the audio duration of one frame. When I remove the `Stream.chunk_every/2` step from the pipeline, I get a massive performance degradation and the processing does not finish in a reasonable time. I imagine it's far more efficient to read a chunk of frames first before sending it to the Whisper serving.
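For illustration, `Stream.chunk_every/2` simply groups consecutive stream elements into fixed-size lists (the last chunk may be shorter), which is why removing it means every single frame hits the serving on its own. A toy example:

```elixir
# Toy illustration of Stream.chunk_every/2: it groups a stream into
# fixed-size lists; the final chunk may be shorter than the chunk size.
chunks =
  1..7
  |> Stream.chunk_every(3)
  |> Enum.to_list()

IO.inspect(chunks)
# => [[1, 2, 3], [4, 5, 6], [7]]
```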
I just reused the implementation of the Elixir WebRTC team from https://github.com/elixir-webrtc/xav/blob/master/README.md
```elixir
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text_whisper(whisper, featurizer, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# Read a couple of frames.
# See https://hexdocs.pm/bumblebee/Bumblebee.Audio.WhisperFeaturizer.html for the default sampling rate.
frames =
  Xav.Reader.stream!("sample.mp3", read: :audio, out_format: :f32, out_channels: 1, out_sample_rate: 16_000)
  |> Stream.take(200)
  |> Enum.map(fn frame -> Xav.Reader.to_nx(frame) end)

batch = Nx.Batch.concatenate(frames)
batch = Nx.Defn.jit_apply(&Function.identity/1, [batch])

Nx.Serving.run(serving, batch)
```
I see. The serving transforms the stream to accumulate smaller chunks, but there is a place where we need to append to a list and that may be the reason why it's inefficient with tiny chunks.
However, either way, I think it's wasteful to convert every frame to a tensor just to concatenate later. With the current ffmpeg code we get a single binary for the whole 30s and create a tensor from that. So ideally we want to replicate this.
Would this suffice?
```elixir
path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.chunk_every(1000)
|> Stream.map(fn frames ->
  [frame | _] = frames
  binary = Enum.reduce(frames, <<>>, fn frame, acc -> acc <> frame.data end)
  Nx.with_default_backend(Nx.BinaryBackend, fn -> Nx.from_binary(binary, frame.format) end)
end)
```
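As an aside, repeatedly appending with `acc <> frame.data` copies the accumulator on each step; collecting the parts as iodata and converting once at the end avoids that. A quick sketch with plain binaries (not the PR's actual code, just the idiom):

```elixir
# Joining many small binaries: collect them as iodata and convert once,
# instead of copying the accumulator on every `<>` append.
parts = for i <- 1..4, do: <<i>>
joined = IO.iodata_to_binary(parts)

IO.inspect(joined)
# => <<1, 2, 3, 4>>
```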
The chunk size of 1000 is currently arbitrary, because we don't know the frame size of the codec in use. This could be added to `xav`, I think.
Correction: the information is already there. `Xav.Frame` has a `samples` field, from which the number of frames per 30 s chunk could be calculated as `round(sampling_rate / frame.samples * 30)`.
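Plugging in some hypothetical numbers (a 16 kHz sampling rate and 1024 samples per frame; the actual frame size depends on the codec), that works out as:

```elixir
# Hypothetical values: 16 kHz audio, 1024 samples per frame.
# The real frame size depends on the codec xav decodes from.
sampling_rate = 16_000
frame_samples = 1024

frames_per_chunk = round(sampling_rate / frame_samples * 30)
IO.inspect(frames_per_chunk)
# => 469
```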
Handling the frame binaries directly is a good call. I think we can transform the stream so that we accumulate the binary as we go, instead of waiting for 1000 binaries and joining them. I was thinking we could determine the number of samples from the binary size, but `frame.samples` is even more convenient.
So it would be something like this:
```elixir
chunk_samples = sampling_rate * 30

path
|> Xav.Reader.stream!(
  read: :audio,
  out_format: :f32,
  out_channels: 1,
  out_sample_rate: sampling_rate
)
|> Stream.transform(
  fn -> {<<>>, 0} end,
  fn frame, {buffer, samples} ->
    buffer = buffer <> frame.data
    samples = samples + frame.samples

    if samples >= chunk_samples do
      chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
      {[chunk], {<<>>, 0}}
    else
      {[], {buffer, samples}}
    end
  end,
  fn {buffer, _samples} ->
    chunk = Nx.from_binary(buffer, :f32, backend: Nx.BinaryBackend)
    {[chunk], {<<>>, 0}}
  end,
  fn _ -> :ok end
)
```
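The accumulate-and-flush pattern can be sanity-checked without Xav or Nx installed. The sketch below runs the same `Stream.transform/5` shape over synthetic frames (plain maps with `:data` and `:samples` standing in for `Xav.Frame`) and emits the raw buffer where the real code would call `Nx.from_binary/3`; the last-element function here also skips an empty leftover buffer:

```elixir
# Synthetic frames standing in for Xav.Frame: 4 bytes of data, 1 sample each.
frames = for i <- 1..5, do: %{data: <<i::32>>, samples: 1}
chunk_samples = 2

chunks =
  frames
  |> Stream.transform(
    fn -> {<<>>, 0} end,
    fn frame, {buffer, samples} ->
      buffer = buffer <> frame.data
      samples = samples + frame.samples

      if samples >= chunk_samples do
        # A full chunk is ready: emit it and reset the accumulator.
        {[buffer], {<<>>, 0}}
      else
        {[], {buffer, samples}}
      end
    end,
    fn {buffer, _samples} ->
      # Flush whatever remains at the end of the stream, if anything.
      {if(buffer == <<>>, do: [], else: [buffer]), {<<>>, 0}}
    end,
    fn _ -> :ok end
  )
  |> Enum.to_list()

IO.inspect(length(chunks))
# => 3 (two full chunks of 2 samples, plus the flushed 1-sample remainder)
```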
Wow! That is a great solution! Thanks!
@kevinschweikert thanks for the PR! Dropping the ffmpeg dependency would be great, but yeah, I agree that we need xav to be precompiled for this to be beneficial.
I would keep `ffmpeg` but also have `xav` as an optional dependency.
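In `mix.exs`, an optional dependency could look something like this (the version requirements are placeholders, not the actual ones from the PR):

```elixir
# Hypothetical deps entry: marking :xav optional means users who stick
# with the ffmpeg-based path don't pull in the NIF at all.
defp deps do
  [
    {:nx, "~> 0.7"},
    {:xav, "~> 0.4", optional: true}
  ]
end
```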
Sounds good to me!
Co-authored-by: Jonatan Kłosko <[email protected]>
Awesome idea! I will bring back the …
I've just realised that it's not just about precompilation, the main blocker is that …
Precompiling FFmpeg is not an easy task, as it has so many other dependencies. I agree that until Xav solves this problem, it's better for Bumblebee to stay with the current solution.
The `xav` library is a wrapper around the libav libraries, from which FFmpeg is built. The nice thing is that this compiles the libav functions into a NIF, so you don't need FFmpeg installed on your computer. There are two prerequisites until this could be merged:

1. The `xav` library could precompile the NIF, so we get a real benefit from this change. Otherwise, we just make it harder for the user by requiring the development packages (install instructions from xav) instead of the normal FFmpeg binary.
2. `xav` needs to update `nx` so we can merge. I've already opened a PR: version bump nx, exla and bumblebee elixir-webrtc/xav#19. To test this branch, I temporarily set `override: true` on the `nx` dependency.