Improved detect silence #745

Open
wants to merge 3 commits into master from improved_detect_silence_rebased
Conversation

lumip commented Jul 23, 2023

Overview

Reimplementation of detect_silence: previously, this function invoked an RMS computation independently for each slice of min_silence_len in the given audio segment, which leads to a lot of recomputation of overlapping values when seek_step is small. The new implementation avoids this, resulting in a much shorter detection time.
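As a minimal sketch of the idea (not the exact code in this PR; the helper name, the sample-based units, and the cumulative-sum trick are illustrative), each window's RMS can be derived from a single pass over the squared samples instead of an independent computation per slice:

```python
import numpy as np

def _sliding_rms(samples, window_len, seek_step):
    # Square the samples once; every window's sum of squares is then a
    # difference of two cumulative-sum entries, so no value is recomputed.
    squared = samples.astype(np.float64) ** 2
    csum = np.concatenate(([0.0], np.cumsum(squared)))
    starts = np.arange(0, len(samples) - window_len + 1, seek_step)
    window_sums = csum[starts + window_len] - csum[starts]
    return np.sqrt(window_sums / window_len)
```

Here window_len and seek_step are in samples, whereas detect_silence takes them in milliseconds; a silent window is then simply one whose RMS falls below the converted silence_thresh.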

Caveats

This introduces numpy as a new dependency. This is for two reasons:

  1. it makes the computation easy to express
  2. it is very performant due to numpy being highly optimized for computations on large numeric arrays

While implementing this without numpy would be possible, it would likely neither achieve the same performance gain nor be as easy to implement.

detect_silence previously used audioop to compute the RMS of each slice, which rounds the computed value down to the nearest integer, while the silence threshold itself is not rounded. The new implementation no longer rounds, so some slices that were previously detected as silent no longer are. In practice this means that detected silent regions might be slightly shorter than before (usually by one or two seek_steps).
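To illustrate the rounding caveat with a hypothetical two-sample fragment (audioop.rms truncates its result to an int, the float computation does not):

```python
import audioop
import math
import struct

fragment = struct.pack("<2h", 3, 4)   # two 16-bit samples with values 3 and 4

audioop.rms(fragment, 2)               # -> 3 (true RMS truncated to an integer)
math.sqrt((3 ** 2 + 4 ** 2) / 2)       # -> 3.5355..., the unrounded value
```

A slice whose unrounded RMS sits just above the threshold but truncates below it was previously classified as silent; with the float computation it no longer is.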

Performance results

%timeit results on audio segments consisting mostly of silence

20 minute segment

# old
> %timeit detect_silence(aus_short, silence_thresh=-50, seek_step=1)
1min 36s ± 914 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# new
> %timeit detect_silence(aus_short, silence_thresh=-50, seek_step=1)
2.66 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

~114 minute segment

# old
> %timeit detect_silence(aus, silence_thresh=-50, seek_step=1)
8min 37s ± 10.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# new
> %timeit detect_silence(aus, silence_thresh=-50, seek_step=1)
15 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

lumip added 3 commits July 23, 2023 17:51
detect_silence finds separate slices of silence and, in a last step, combines subsequent silent slices into ranges of continuous silence. The added tests specifically ensure that this combination step works correctly.

Previously, detect_silence would collect all slices of min_silence_len in a list and then process that list to merge subsequent slices into continuous silent ranges. This change performs the merging immediately when silence is detected for a slice, eliminating the need for a second pass over the list and the memory overhead of keeping it (a sketch of this merging follows after the commit list).

Using numpy to compute RMS for silence detection reduces redundant computation (and benefits from numpy's highly optimized implementation) compared to the previous implementation of detect_silence.

Some caveats:
- adds numpy as a new dependency
- previously, RMS values were rounded down to the nearest integer; this is no longer the case, so the borders of silence ranges may vary slightly compared to the previous implementation
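A minimal sketch of the on-the-fly merging described in the second commit (the function and variable names are illustrative, not the PR's code): each silent window start is folded into the currently open range as soon as it is seen, instead of being buffered in a list and merged in a second pass.

```python
def _merge_silent_starts(silent_starts, min_silence_len):
    # Fold each silent window start into the open range immediately;
    # a start that overlaps or touches the open range extends it,
    # anything else closes it and opens a new range.
    ranges = []
    current = None  # [range_start, range_end] of the open range
    for start in silent_starts:
        end = start + min_silence_len
        if current is not None and start <= current[1]:
            current[1] = end
        else:
            current = [start, end]
            ranges.append(current)
    return ranges
```

For example, _merge_silent_starts([0, 1, 2, 10], 3) yields [[0, 5], [10, 13]]: the first three overlapping windows collapse into one range while the fourth starts a new one.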
lumip force-pushed the improved_detect_silence_rebased branch from 29837f7 to 61a9459 on July 25, 2023 07:54

from .utils import db_to_float


def detect_silence(audio_segment, min_silence_len=1000, silence_thresh=-16, seek_step=1):
def _convert_to_numpy(audio_segment):
How about adding a property to AudioSegment?
Something like:

@property
def as_numpy(self):
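
A sketch of what such a property could look like (a suggestion only, not code from this PR), built on the existing get_array_of_samples() method:

```python
import numpy as np

@property
def as_numpy(self):
    # Sketch: expose the segment's samples as a numpy array.
    # get_array_of_samples() returns an array.array of signed integers
    # whose element width matches self.sample_width, and numpy infers
    # its dtype from that array.
    return np.array(self.get_array_of_samples())
```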

emsi commented Feb 5, 2024

It does not seem to work properly. When run on a YouTube video (~4 h long) with:
split_on_silence(audio_segment, min_silence_len=800, keep_silence=True)

It returns the following ranges (just 4 segments):
(0, 6812736, 6812736, 13615464, 13615464, 13635080, 13635080, 13677621)

When running the same file with the same arguments (min_silence_len=800, silence_thresh=-16) in Audacity, it finds lots and lots of silence (and I can confirm at a glance that those findings are correct):
[screenshot: Audacity showing many detected silent regions]

lumip commented Feb 24, 2024

Hey, sorry, I only just saw your responses. Could you perhaps provide a link to the video in question so that I can have a look?

emsi commented Feb 24, 2024

I believe I was processing the audio from this video:

https://youtu.be/AY9MnQ4x3zk

BTW: I eventually used ffmpeg instead. Super fast and accurate.

lumip commented Feb 29, 2024

> It does not seem to work properly. When run on a YouTube video (~4 h long) with: split_on_silence(audio_segment, min_silence_len=800, keep_silence=True)
>
> It returns the following ranges (just 4 segments): (0, 6812736, 6812736, 13615464, 13615464, 13635080, 13635080, 13677621)
>
> When running the same file with the same arguments (min_silence_len=800, silence_thresh=-16) in Audacity, it finds lots and lots of silence (and I can confirm at a glance that those findings are correct): [screenshot]

To come back to this: I first want to point out that the changes in this PR overall match the regions of silence found by pydub's current implementation fairly well. There were some larger deviations that I might look into a bit more, but I think these are all explained by the caveats I already pointed out.

With regard to the discrepancy with Audacity and ffmpeg: if I run detect_silence with silence_thresh=-32, I obtain results that reasonably match those produced by ffmpeg with threshold -16. pydub's db_to_float applies a different conversion depending on whether the using_amplitude keyword argument is True: in one case the passed-in decibel value is divided by a value twice as large as in the other. I therefore believe that pydub's silence detection interprets the dB value differently than Audacity and ffmpeg do. I tried to figure out which one is more canonical, but I couldn't find reliable definitions of dBFS that do not contradict each other.
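
For reference, the factor-of-two difference mentioned above comes down to the amplitude vs. power dB conventions; this mirrors pydub's db_to_float, paraphrased here as a sketch:

```python
def db_to_float(db, using_amplitude=True):
    # Amplitude ratios use 20*log10, power ratios use 10*log10,
    # hence the factor-of-two difference in the divisor of the exponent.
    if using_amplitude:
        return 10 ** (db / 20)
    return 10 ** (db / 10)

db_to_float(-32, using_amplitude=True)    # 10**-1.6 ≈ 0.0251
db_to_float(-16, using_amplitude=False)   # 10**-1.6 ≈ 0.0251, the same ratio
```

So a threshold of -32 dB on the amplitude scale corresponds to the same linear ratio as -16 dB on the power scale, which would explain the observed offset.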
