MIDI SVS mode may produce uneven rhythms #60

yqzhishen · 2022-07-28T14:37:31Z

Hello, and thank you for your great work. I tried the MIDI SVS mode of DiffSinger and found what may be a conceptual mistake in the phoneme duration inference logic, which can lead to uneven rhythms in the output voice.
The possible mistake concerns the definition of "note duration". Below I show several examples.

Explanations of the duration of notes

As shown in the picture below, the duration of a note (containing one single syllable) is normally defined as the duration between the beginning of its vowel part and the beginning of the vowel part of the next note.

That is to say, notes begin at the beginning of their VOWEL parts, not their CONSONANT parts (as notes in the MIDI SVS mode of DiffSinger currently do). When we sing, the rhythm sounds correct because every vowel starts in the right place, not because the consonants do. The length of a consonant may affect the emphasis we perceive, but in theory it does not affect the rhythm.
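To make the definition above concrete, here is a minimal sketch of the timing it implies. It is only an illustration: the function name `phoneme_onsets` and the consonant lengths are hypothetical, and nothing here is taken from the DiffSinger code. Each note starts at its vowel onset, and the consonant is placed just before that point, borrowing time from the end of the previous note.

```python
from typing import List, Tuple


def phoneme_onsets(
    note_durations: List[float],
    consonant_lengths: List[float],
    start: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (consonant_onset, vowel_onset) in seconds for each syllable.

    note_durations are vowel-to-vowel durations, as read from a score.
    consonant_lengths are assumed values; a real system would predict them.
    """
    if len(note_durations) != len(consonant_lengths):
        raise ValueError("need one consonant length per note")
    onsets = []
    vowel_onset = start
    for duration, consonant in zip(note_durations, consonant_lengths):
        # The vowel sits on the beat; the consonant begins earlier.
        onsets.append((vowel_onset - consonant, vowel_onset))
        vowel_onset += duration
    return onsets


# Three notes of equal length with made-up consonant lengths.
for consonant_onset, vowel_onset in phoneme_onsets([0.5, 0.5, 0.5], [0.1, 0.05, 0.2]):
    print(f"consonant at {consonant_onset:.3f} s, vowel at {vowel_onset:.3f} s")
```

In this sketch the vowel onsets stay evenly spaced whatever the consonant lengths are, while the consonant onsets do not.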

Consequences of this kind of inconsistency

This kind of inconsistency can lead to chaotic rhythms. Take the demo lyric "小酒窝长睫毛是你最美的记号" as an example. Here is its music score:

Thus, we input:

input text
小 酒 窝 长 睫 毛 SP 是 你 最 美 的 记 号

input note
C#4 | F#4 | G#4 | A#4 F#4 | F#4 C#4 | C#4 | rest | C#4 | A#4 | G#4 | A#4 | G#4 | F#4 | C#4 

input duration
0.315789 | 0.315789 | 0.315789 | 0.315789 0.315789 | 0.315789 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789

The output audio sounds weird and is probably not sung in rhythm ("小酒窝_diffsinger_raw.wav" in the attachment).

I then used another algorithm to predict the duration of each phone and adjusted the input durations to compensate for the incorrect rhythm:

input text
小 酒 窝 长 睫 毛 SP 是 你 最 美 的 记 号

input note
C#4 | F#4 | G#4 | A#4 F#4 | F#4 C#4 | C#4 | rest | C#4 | A#4 | G#4 | A#4 | G#4 | F#4 | C#4 

input duration
0.390789 | 0.375789 | 0.25579 | 0.420789 0.210789 | 0.420789 0.21079 | 0.420789 | 0.13579 | 0.405789 | 0.30079 | 0.330789 | 0.36079 | 0.25579 | 0.315789 | 0.42079

The output audio sounds much better ("小酒窝_diffsinger_ fixed_phone_durations.wav" in the attachment).
However, since the MIDI SVS mode of DiffSinger only lets us specify the beginning of the consonant parts, not the vowel parts, in theory we can never get fully correct rhythms.
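One way to approximate such a compensation is sketched below. It is an assumption about how a workaround could look, not the algorithm used to produce the numbers above: the function name `shift_to_consonant_onsets` and the consonant lengths are hypothetical. The idea is to start each syllable earlier by its own consonant length and to end it earlier by the next syllable's consonant length, so that the vowels land on the score positions even though the input still describes consonant onsets.

```python
from typing import List


def shift_to_consonant_onsets(
    syllable_notes: List[List[float]],
    consonant_lengths: List[float],
) -> List[List[float]]:
    """Convert vowel-aligned note durations to consonant-aligned ones.

    syllable_notes holds one list of note durations per syllable (a slur
    has several notes). consonant_lengths holds one assumed consonant
    length per syllable, in seconds (0.0 for rests or vowel-only syllables).
    """
    if len(syllable_notes) != len(consonant_lengths):
        raise ValueError("need one consonant length per syllable")
    adjusted = [list(notes) for notes in syllable_notes]
    for i, notes in enumerate(adjusted):
        # The syllable now starts at its consonant, so its first note grows.
        notes[0] += consonant_lengths[i]
        # The next consonant eats into the end of this syllable's last note.
        if i + 1 < len(adjusted):
            notes[-1] -= consonant_lengths[i + 1]
    return adjusted


beat = 0.315789  # the note length used in the example input above
# A single note, a two-note slur, then a single note, with made-up consonant lengths.
adjusted = shift_to_consonant_onsets([[beat], [beat, beat], [beat]], [0.18, 0.105, 0.105])
print(" | ".join(" ".join(f"{d:.6f}" for d in notes) for notes in adjusted))
```

The sketch does not guard against a consonant longer than the note it borrows from, which would yield a negative duration. It also relies on the model producing consonants of roughly the assumed lengths, which the input cannot enforce.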

As a comparison, I produced a piece of audio with X Studio (Xiaoice Sing) that has the correct rhythm ("小酒窝_xiaoicesing_correct_rhythm.wav" in the attachment).

Here are the audios: audios.zip

My expectations

My teammates and I are trying to bring DiffSinger to more ordinary fans and users of SVS technology and products. These people (or you can say, most people) are more familiar with the interaction mode that takes notes or music scores as input. Therefore, correct rhythms are important and can help a lot.
It would help a lot if you could fix this rhythm issue (i.e. let users specify the beginning of vowels and have the model predict the duration of the consonants).
I'm looking forward to your improvements.


yqzhishen mentioned this issue Oct 1, 2022

Fix rhythm (CV to VC) openvpi/DiffSinger#16

Closed

yqzhishen mentioned this issue Dec 27, 2022

Question about training in other languages like English openvpi/DiffSinger#29

Closed

yqzhishen mentioned this issue Mar 3, 2023

Rhythmizers for other languages openvpi/DiffSinger#62

Closed

MoonInTheRiver pinned this issue Mar 26, 2023

MoonInTheRiver added the enhancement (New feature or request), documentation (Improvements or additions to documentation), and must-read labels, and removed the documentation label Apr 6, 2023
