💡 Technical Details

Captain FLAM edited this page Sep 29, 2023 · 7 revisions

Volume Compensations

These are very important values that need to be fine-tuned for each model to obtain the best results.

Volume compensation is a process that adjusts the volume of the audio to make up for the volume changes that occur during separation.
This is necessary because the volume of the audio is reduced during the separation process.
Volume compensation is performed internally and does not generate a separate output file.

How do we calculate the Cut-Off?

Jarredou has found a way to calculate the cutoff frequency of a model!
It is calculated with the following formula:

cutoff = samplerate / N_FFT_scale * dim_F_set

You can find all the parameter values in Models_DATA.csv in the App folder
(which you can open with Excel or LibreOffice Calc).
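The formula can be checked in a few lines of Python. The parameter values below are hypothetical example values; the real per-model values are the ones listed in Models_DATA.csv:

```python
# Hypothetical example values; the real per-model values
# are listed in Models_DATA.csv in the App folder.
samplerate = 44100   # Hz
n_fft_scale = 6144   # N_FFT_scale column
dim_f_set = 2048     # dim_F_set column

# cutoff = samplerate / N_FFT_scale * dim_F_set
cutoff = samplerate / n_fft_scale * dim_f_set
print(cutoff)  # prints 14700.0 (i.e. a 14.7 kHz cutoff)
```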

About « Silent » filter

Makes silent the parts of the audio where the dynamic range (RMS) goes below a threshold.
Don't misunderstand: this function is NOT a noise reduction!
Its behavior is to clean the audio of "silent parts" (below -50 dB) in order to:

  • prevent the MLM model from working on "silent parts", and save GPU time
  • prevent the MLM model from producing artifacts on "silent parts"
  • clean the final Vocals audio files of residues of "silent parts" (and put them back in "Music")
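As a rough sketch of what such a filter does (the function name, window size, and hard muting are assumptions for illustration, not KaraFan's actual code), one could mute every window whose RMS level falls below -50 dB:

```python
import numpy as np

def apply_silent_filter(audio, samplerate, threshold_db=-50.0, window_s=0.1):
    # Sketch only: mute windows whose RMS falls below threshold_db.
    # A real implementation would likely use overlapping windows
    # and smooth fades instead of hard cuts.
    out = audio.copy()
    win = max(1, int(window_s * samplerate))
    for start in range(0, len(out), win):
        chunk = out[start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        db = 20 * np.log10(rms) if rms > 0 else -np.inf
        if db < threshold_db:
            out[start:start + win] = 0.0  # "silent part": zero it out
    return out
```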

GOD MODE ??

Gives you GOD's POWER: each audio file is reloaded IF it was created before,
so there is NO NEED to process it again and again!!

Be warned: you have to MANUALLY delete each file that you want to re-process!

For example:

  • you process the song for the first time
  • then decide that the Vocals are not good:
    • Keep the "1 - Music_extract" & "2 - Audio_sub_Music" files
    • Delete the "3 - Vocals" & "4 - Music" files
    • Modify the parameters as you want
    • Click the « Start » button again

It will re-process only 3 & 4, and load 1 & 2 instead of re-processing them ... got it?
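The caching idea above can be sketched like this (a hypothetical helper using NumPy files for simplicity; the real app of course reloads audio files, not arrays):

```python
import os
import numpy as np

def load_or_process(path, process_fn):
    # GOD MODE sketch: if the file already exists, reload it
    # instead of re-running the (expensive) processing step;
    # otherwise compute it and save it for next time.
    if os.path.exists(path):
        return np.load(path)
    result = process_fn()
    np.save(path, result)
    return result
```

Deleting the file on disk is then the only way to force that step to run again, which is exactly the "delete MANUALLY" rule above.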

« SRS » - Soprano mode

Option to use the soprano mode as a model bandwidth extender, to make narrowband models fullband (at least those with a cutoff between 14 kHz and 17.5 kHz).

Description of the trick :

  • process the input audio at the original sample rate
  • process the input audio with a shifted sample rate, by a ratio that makes the original audio spectrum fit within the model bandwidth, then restore the original sample rate
  • use lowpass & highpass filters to create a multiband ensemble of the 2 separated audios, using the shifted-sample-rate result as the high band, to fill what is above the model's cutoff
  • with scipy.signal.resample_poly, a ratio of 5/4 for up/down before processing does the trick for models with a cutoff at 17.5 kHz
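The shifted-sample-rate pass can be sketched as follows, assuming a `separate_fn(audio)` that runs the model on a 1-D signal (the function name and structure are illustrative, not KaraFan's actual code). Upsampling by 5/4 compresses the spectrum so that content up to 17.5 kHz fits under a 14 kHz model cutoff; the inverse ratio then restores the original sample rate:

```python
import numpy as np
from scipy.signal import resample_poly

def srs_pass(mixture, separate_fn, up=5, down=4):
    # Spectrum is compressed by down/up: 17.5 kHz content
    # now sits at 14 kHz from the model's point of view.
    shifted = resample_poly(mixture, up, down)
    separated = separate_fn(shifted)
    # Inverse ratio restores the original sample rate.
    restored = resample_poly(separated, down, up)
    return restored[:len(mixture)]
```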

User Stories/Use Cases/Benefits :

Fullband results with "old" narrowband models

Potential Challenges/Considerations :

A smooth transition with zero-phase soft filtering between the 2 bands works better than brickwall filters; around 14000 Hz was a good value in my few tests.
Make sure not to have volume changes in the crossover region (I've used Linkwitz-Riley filters).

The downsides are, first, the doubled separation time because of the 2 passes, and second, that the separation quality of the shifted-sample-rate audio is often lower than the normally processed one. But in most cases, since only its high frequencies are used, it is enough to make that "fullband trick" work very well!
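A zero-phase Linkwitz-Riley-style crossover can be sketched with SciPy (function and parameter names are assumptions, not the actual implementation): `sosfiltfilt` runs a Butterworth filter forward and backward, which squares its magnitude response and cancels its phase, so the lowpass and highpass bands sum back flat with no volume change at the crossover:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def crossover_blend(normal_band, srs_band, samplerate, fc=14000, order=4):
    # Zero-phase filtering: forward-backward Butterworth gives a
    # Linkwitz-Riley-style magnitude response, and since Butterworth
    # LP/HP pairs are power complementary, low + high sums flat.
    lp = butter(order, fc, btype="lowpass", fs=samplerate, output="sos")
    hp = butter(order, fc, btype="highpass", fs=samplerate, output="sos")
    low = sosfiltfilt(lp, normal_band)   # model band, from the normal pass
    high = sosfiltfilt(hp, srs_band)     # above the cutoff, from the SRS pass
    return low + high
```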

« Big Shifts » Trick

Feature Description :

Improved version of Demucs' "shift trick", which originally shifts the full audio by inserting a silence of random length between 0 and 0.5 seconds.

The "big shift trick" instead uses a fixed shifting value (+1 second) for each shift.
This larger shift gives slightly better results than the original one, and it removes the randomness.

This works great with Demucs AND MDX! (Not tried with VR arch models)

User Stories/Use Cases/Benefits :

Gives slightly better quality separations.

Potential Challenges/Considerations :

My first implementation padded the full mixture with zeros, as is done in most cases. But with each "BigShift" being 1 second long, BigShifts=20 means adding 20 seconds of silence, which is kind of a waste of resources.
The current implementation takes the "shifted" part from the beginning of the mixture, moves it to the end, processes the file, and then moves the "shifted" part back to its original place. Surprisingly, that doesn't create artifacts (which would be easily identified, every second).
This needs to be confirmed with Demucs processing, as my implementation is only for MDX models.

A downside of this implementation is that it must have a limit depending on the length of the processed song (because, e.g., it can't do a 40-second shift in a 30-second mixture).
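The rotate-instead-of-pad idea can be sketched as follows (hypothetical names; assumes a `separate_fn(audio)` that returns audio of the same length). Each pass rotates the mixture by a fixed multiple of 1 second, separates, rotates back, and the passes are averaged:

```python
import numpy as np

def big_shift_separate(mixture, separate_fn, shifts=4,
                       shift_s=1.0, samplerate=44100):
    # Sketch of the "Big Shifts" trick for a 1-D mixture.
    out = np.zeros_like(mixture)
    for i in range(shifts):
        # Fixed +1 s steps; the modulo guards against shifting
        # further than the mixture is long.
        offset = int(i * shift_s * samplerate) % len(mixture)
        rolled = np.roll(mixture, -offset)    # move the start to the end
        separated = separate_fn(rolled)
        out += np.roll(separated, offset)     # move it back into place
    return out / shifts
```

Rotating instead of zero-padding keeps the processed length constant, which is why BigShifts=20 no longer costs 20 extra seconds of silence per pass.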
