
💡 Technical Details

Captain FLAM edited this page Sep 10, 2023 · 7 revisions

« My Magic Recipe »

For now, I use only these 2 MDX models: « Inst HQ 3 » & « Kim Vocal 2 ».
(I know some people prefer the « Voc_FT » model, but I personally found it muddier than Kim Vocal 2 in my tests)

| Step | Filename |
| --- | --- |
| 1 - Normalization of Original audio | 1 - NORMALIZED.flac |
| 2 - Instrumental Extraction (with A.I.) from Normalized | 2 - Music_extract.flac |
| 3 - Volume Compensation for Instrumental | (internal) |
| 4 - Subtraction of Instrumental from Normalized (remove Music) | 3 - Audio_sub_Music.flac |
| 5 - FINAL Vocal Extraction (with A.I.) from "Audio_sub_Music" | 4_F - Vocals.flac |
| 6 - Volume Compensation for Vocals | (internal) |
| 7 - FINAL Subtraction of Vocals from Normalized | 5_F - Music.flac |
| 8 - Bleeding Vocals/Other in final "Music" | 6 - Bleeding_in_Music.flac |

Details of each step :

1️⃣ Normalization of Original audio

  • Normalize audio to -1.0 dB peak amplitude

This is mandatory because every subsequent process is based on RMS dB levels
(Volume Compensations & audio Subtractions).
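Peak normalization of this kind can be sketched in a few lines of NumPy. This is only an illustration of the principle, not the project's actual implementation; the function name is mine:

```python
import numpy as np

def normalize_peak(audio: np.ndarray, target_db: float = -1.0) -> np.ndarray:
    """Scale audio so its peak amplitude sits at `target_db` dBFS."""
    peak = np.max(np.abs(audio))
    if peak == 0.0:
        return audio  # pure silence, nothing to scale
    target_amplitude = 10.0 ** (target_db / 20.0)  # -1.0 dB ~ 0.891
    return audio * (target_amplitude / peak)
```

Because every later subtraction assumes consistent levels, this scaling is done once, up front, on the original mix.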

2️⃣ Instrumental Extraction from Normalized
You will understand that I use this model only to extract the instrumental part, so as to obtain vocals that are as clean as possible; it is not used in the final result.
In all my tests, I saw (in Audacity) & heard that this helps reduce the artifacts in the final Vocals result.

  • Use the MDX model to isolate the instrumental parts of the audio track.

3️⃣ Volume Compensation for Instrumental

  • Internal step involving volume compensation for the extracted instrumental.

4️⃣ Subtraction of Instrumental from Normalized
The instrumental part is then subtracted from the previously normalized audio to obtain a track containing only vocals.

  • Isolate the vocal parts.
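Steps 4, 7 and 8 are all the same sample-wise operation on time-aligned tracks. A minimal sketch, assuming both arrays share the same sample rate and length (which holds here, since each stem was extracted from this exact mix):

```python
import numpy as np

def subtract_tracks(mix: np.ndarray, stem: np.ndarray) -> np.ndarray:
    """Sample-wise subtraction of one stem from a mix.

    Assumes the two tracks are time-aligned, same sample rate and
    same length.
    """
    return mix - stem

# Step 4: vocals-only residue = normalized mix minus extracted instrumental
# Step 7: final Music         = normalized mix minus final Vocals
```

This is why the volume compensation steps matter: the subtraction only cancels cleanly if the stem is at the right level.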

5️⃣ Vocal Extraction from cleaned "Audio_sub_Music"

  • Use the MDX model to isolate the vocal component of the music track, removing any remaining instrumental or background noise.

6️⃣ Volume Compensation for Vocals + « Silent »

  • Internal step involving volume compensation for the extracted vocal audio.
  • Pass the result through the « Silent » filter (read below ...)

7️⃣ Subtraction of Vocals from Normalized
The vocal parts are subtracted from the previously normalized audio to obtain a track with only instrumental music.

  • Isolate the instrumental component from the normalized original audio.

8️⃣ Bleeding Vocals/Other in "Music"
The bleeding vocals or other elements are obtained by subtracting the first "Music_extract" track from the final "Music" track.

  • Obtain an audio track that contains any residual vocal or other elements still present in the final instrumental music.

These steps collectively represent the audio processing workflow, which separates vocals and instruments from a music track and handles various audio adjustments and filtering.
Some steps involve internal operations without generating separate output files.

Volume Compensations

These are very important values that need to be fine-tuned for each model to obtain the best results.

Volume compensation adjusts the audio to make up for the level changes that occur during separation: the separation process slightly reduces the volume of the stems it outputs.
It is performed internally and does not generate a separate output file.
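In the simplest case this is just a per-model scale factor applied to the stem before subtraction. A sketch, where the factor value is purely illustrative (the real values have to be fine-tuned per model, as stated above):

```python
import numpy as np

# Hypothetical, model-specific factor: illustrative value only.
VOLUME_COMPENSATION = 1.035

def compensate(stem: np.ndarray, factor: float = VOLUME_COMPENSATION) -> np.ndarray:
    """Scale a separated stem back up so its level matches what the
    subtraction step expects. Done in memory; no file is written."""
    return stem * factor
```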

About « Silent » filter

Make silent the parts of the audio where the dynamic range (RMS) drops below a threshold.
Don't misunderstand: this function is NOT noise reduction!
Its role is to clean the audio of "silent parts" (below -50 dB) in order to:

  • avoid making the A.I. model work on "silent parts", and save GPU time
  • avoid the A.I. model producing artifacts on "silent parts"
  • clean the final Vocals audio files of "silent part" residues (and get them back in "Music")
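The core idea is a gate over fixed-size windows: measure the RMS level of each window and mute it entirely when it falls under the threshold. A minimal sketch (window size and implementation details are my assumptions, not the project's actual code):

```python
import numpy as np

def silent_filter(audio: np.ndarray, threshold_db: float = -50.0,
                  window: int = 4096) -> np.ndarray:
    """Zero out windows whose RMS level falls below `threshold_db`.

    This is a gate, not noise reduction: quiet residues are muted
    entirely (and can then be recovered in the "Music" track by the
    later subtraction step).
    """
    out = audio.copy()
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        rms = np.sqrt(np.mean(chunk ** 2))
        rms_db = 20.0 * np.log10(rms) if rms > 0 else -np.inf
        if rms_db < threshold_db:
            out[start:start + window] = 0.0
    return out
```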

GOD MODE ??

It gives you GOD's POWER: each audio file is reloaded IF it was created before,
so there is NO NEED to process it again and again!!

Be warned: you have to delete MANUALLY each file that you want to re-process!

For example:

  • you process a song for the first time
  • then decide that the Vocals are not good:
    • Keep the "Music_extract" & "Audio_sub_Music" files
    • Delete the "Vocals" & "Music" files
    • Modify the parameters as you want
    • Click the « Start » button again

It will re-process only the Vocals & Music steps, and load the first two files instead of re-processing them ... got it?
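The caching logic behind GOD MODE can be sketched as "load the file if it exists, otherwise compute and save it". This is only an illustration of the idea (the real tool writes FLAC files; here I use `.npy` so the sketch stays self-contained):

```python
import os
import numpy as np

def cached_step(path: str, process):
    """GOD MODE sketch: reload a step's output if its file already
    exists; otherwise run the (expensive) processing and save it.
    Delete the file manually to force re-processing of that step."""
    if os.path.exists(path):
        return np.load(path)   # reload instead of re-processing
    result = process()         # e.g. an A.I. separation pass
    np.save(path, result)
    return result
```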

« SRS » - Soprano mode by Jarredou

Option to use the soprano mode as a model bandwidth extender, to make narrowband models fullband (at least those with a cutoff between 14 kHz and 17.5 kHz).

Description of the trick :

  • process the input audio at its original sample rate
  • process the input audio again with the sample rate shifted by a ratio that makes the original audio spectrum fit within the model's bandwidth, then restore the original sample rate
  • use lowpass & highpass filters to build a multiband ensemble of the 2 separated results, using the shifted-sample-rate pass as the high band to fill in what lies above the model's cutoff
  • with scipy.signal.resample_poly, a ratio of 5/4 for up/down before processing does the trick for models with a cutoff at 17.5 kHz
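The shifted-sample-rate pass can be sketched as follows. Upsampling by 5/4 while keeping the nominal sample rate compresses the spectrum by 4/5, so content up to ~22 kHz lands below a 17.5 kHz cutoff; after separation, the inverse resample restores the original pitch. The `separate` callable here stands in for an MDX model call, which I leave abstract:

```python
import numpy as np
from scipy.signal import resample_poly

def srs_pass(audio: np.ndarray, separate, up: int = 5, down: int = 4) -> np.ndarray:
    """Shifted-sample-rate pass: resample by up/down so the spectrum
    fits below the model's cutoff, run the separation, then restore
    the original sample rate."""
    shifted = resample_poly(audio, up, down)       # spectrum squeezed by 4/5
    restored = resample_poly(separate(shifted), down, up)
    return restored[:len(audio)]
```

Only the high band of this pass is kept; the low band still comes from the normal pass.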

User Stories/Use Cases/Benefits:

Fullband results with "old" narrowband models

Potential Challenges/Considerations:

A smooth transition with zero-phase soft filtering between the 2 bands works better than brickwall filters; around 14000 Hz was a good value in my few tests.
Make sure there are no volume changes in the crossover region (I've used Linkwitz-Riley filters).
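One way to get such a flat, zero-phase crossover with SciPy: applying `filtfilt` to an order-2 Butterworth squares its magnitude response (giving a Linkwitz-Riley-style 4th-order slope with zero phase), and since Butterworth lowpass/highpass magnitudes are power-complementary, the two bands sum flat through the crossover. A sketch under those assumptions, not the project's actual code:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def crossover_merge(low_src: np.ndarray, high_src: np.ndarray,
                    sr: int = 44100, fc: float = 14000.0,
                    order: int = 2) -> np.ndarray:
    """Zero-phase crossover: low band from the normal pass, high band
    from the SRS pass, split around `fc` with no level bump or dip in
    the crossover region."""
    b_lo, a_lo = butter(order, fc, btype="low", fs=sr)
    b_hi, a_hi = butter(order, fc, btype="high", fs=sr)
    return filtfilt(b_lo, a_lo, low_src) + filtfilt(b_hi, a_hi, high_src)
```

Feeding the same signal to both inputs should return it unchanged, which is exactly the "no volume change in the crossover region" property.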

The downsides are, first, the doubled separation time because of the 2 passes, and second, that the separation quality of the shifted-sample-rate audio is often lower than the normally processed one. But in most cases, since only its high frequencies are used, that is enough to make this "fullband trick" work very well!


