Hello and thank you for your great work. However, I tried MIDI SVS of DiffSinger and found that there might be a conceptual mistake in the phoneme duration inference logic, which may lead to uneven rhythms of the output voice.
This possible mistake relates to the definitions of "note duration". Here I would like to show several examples.
Explanations of the duration of notes
As shown in the picture below, the duration of a note (containing one single syllable) is normally defined as the duration between the beginning of its vowel part and the beginning of the vowel part of the next note.
That is to say, a note begins where its VOWEL part begins, not where its CONSONANT part begins (as notes in MIDI SVS of DiffSinger currently do). When we sing, the rhythm sounds correct because every vowel starts in the right place, not because the consonants do; the length of a consonant may affect how strongly we perceive the note, but in theory it does not affect the rhythm.
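To make the difference concrete, here is a minimal sketch that compares where the vowels land under the two conventions. The note onsets and consonant lengths are made-up numbers for illustration, not values taken from DiffSinger, and the function names are hypothetical.

```python
def vowel_onsets_consonant_first(note_onsets, consonant_durs):
    """Vowel start times when every note begins at its consonant."""
    return [t + c for t, c in zip(note_onsets, consonant_durs)]


def inter_onset_intervals(onsets):
    """Time between consecutive vowel onsets, i.e. the rhythm we hear."""
    return [round(b - a, 6) for a, b in zip(onsets, onsets[1:])]


# Hypothetical input: four equal 0.5 s notes whose consonants differ in length.
note_onsets = [0.0, 0.5, 1.0, 1.5]
consonant_durs = [0.05, 0.15, 0.02, 0.10]

# Convention A: the note starts at the consonant, so the vowel is delayed.
consonant_first = vowel_onsets_consonant_first(note_onsets, consonant_durs)
# Convention B: the note starts at the vowel, so vowels sit on the note onsets.
vowel_first = list(note_onsets)

print(inter_onset_intervals(consonant_first))  # [0.6, 0.37, 0.58]
print(inter_onset_intervals(vowel_first))      # [0.5, 0.5, 0.5]
```

With equal notes in the score, the consonant-first convention gives vowel-to-vowel intervals that vary with the consonant lengths, while the vowel-first convention keeps them even.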
Consequences of this kind of inconsistency
This kind of inconsistency can lead to chaotic rhythms. Take the demo lyric "小酒窝长睫毛是你最美的记号" as an example; here is its music score:
Thus, we input:
The output audio sounds weird and is probably not sung in rhythm ("小酒窝_diffsinger_raw.wav" in the attachment).
I then used another algorithm to predict the duration of each phone and tried to fix the incorrect rhythm:
The output audio sounds much better ("小酒窝_diffsinger_ fixed_phone_durations.wav" in the attachment).
However, as only the beginning of consonant parts, but not the vowel parts, can be specified in MIDI SVS mode of DiffSinger, we may never get correct rhythms (in theory).
As a comparison, I produced a piece of audio with X Studio (Xiaoice Sing) that has the correct rhythm ("小酒窝_xiaoicesing_correct_rhythm.wav" in the attachment).
Here are the audios: audios.zip
My expectations
My teammates and I are trying to bring DiffSinger to more ordinary fans and users of SVS technology and products. These people (or you can say, most people) are more familiar with the interaction mode that takes notes or music scores as input. Therefore, correct rhythms are important and can help a lot.
It would help a lot if you could fix this rhythm issue (i.e., let users specify where the vowels begin, and have the model predict the duration of the consonants).
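As a rough sketch of what I mean (not DiffSinger's actual code): given note onsets from the score and consonant durations from some predictor, each vowel can be pinned to its note onset and each consonant placed just before it, borrowing time from the previous note. The `Phone` class, the `align_to_vowel_onsets` function, the `max_consonant_ratio` cap, the phoneme split, and all numbers below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    name: str
    start: float
    end: float


def align_to_vowel_onsets(notes, max_consonant_ratio=0.5):
    """Place each vowel on its note onset and each consonant right before it.

    notes: list of (onset, duration, consonant_or_None, vowel, consonant_dur),
    sorted by onset and non-overlapping. consonant_dur is assumed to come
    from a separate duration predictor.
    """
    phones = []
    for i, (onset, dur, cons, vowel, cons_dur) in enumerate(notes):
        if cons is not None:
            if i > 0:
                # Do not let the consonant eat more than a fraction of the previous note.
                cons_dur = min(cons_dur, max_consonant_ratio * notes[i - 1][1])
            else:
                # The first consonant cannot start before time zero.
                cons_dur = min(cons_dur, onset)
            cons_start = onset - cons_dur
            if phones:
                # Shorten the previous vowel to make room for this consonant.
                phones[-1].end = min(phones[-1].end, cons_start)
            phones.append(Phone(cons, cons_start, onset))
        # The vowel always starts exactly on the note onset.
        phones.append(Phone(vowel, onset, onset + dur))
    return phones


# Hypothetical input for "小酒窝" with a 0.2 s lead-in before the first note.
notes = [
    (0.2, 0.5, "x", "iao", 0.12),
    (0.7, 0.5, "j", "iu", 0.08),
    (1.2, 1.0, "w", "o", 0.05),
]
phones = align_to_vowel_onsets(notes)
for p in phones:
    print(f"{p.name:>4} {p.start:.2f} -> {p.end:.2f}")
```

In this sketch the vowels "iao", "iu" and "o" start at 0.2 s, 0.7 s and 1.2 s, exactly on the note onsets, whatever the consonant lengths are.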
I'm looking forward to your improvements.