Replies: 2 comments
-
@lucidrains just creating this discussion to jot down ideas as we have them
-
InstructTTS uses the variance adaptor from FastSpeech2 as its pitch/duration predictor, together with its own style encoder. It is also a NAR (non-autoregressive) model built on a neural audio codec, and it generates discrete tokens with discrete diffusion. Could a standard text encoder + pitch/duration predictor, conditioned on a prompt, work nicely here? It seems much more doable than training a good text-to-semantic transformer.
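For anyone not familiar with the duration side of this idea, here is a minimal sketch of the length-regulator step that a FastSpeech-style duration predictor feeds into: each phoneme-level feature vector is repeated for its predicted number of frames. The function name `length_regulate` and the plain-list representation are assumptions made for illustration only; this is not code from InstructTTS, FastSpeech2, or this repo, and a real model would do this on batched tensors.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_feats: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand phoneme-level features to frame level.

    Hypothetical helper: phoneme_feats[i] is repeated durations[i] times,
    so the output length equals sum(durations).
    """
    if len(phoneme_feats) != len(durations):
        raise ValueError("need exactly one duration per phoneme")

    frames: List[List[float]] = []
    for feat, dur in zip(phoneme_feats, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the feature vector once per frame so frames do not alias.
        frames.extend(list(feat) for _ in range(dur))
    return frames


# Three phonemes with (assumed) predicted durations of 2, 0, and 1 frames.
expanded = length_regulate([[1.0], [2.0], [3.0]], [2, 0, 1])
```

A duration of 0 simply drops that phoneme from the frame sequence. Predicted pitch would then be added at the frame level, after this expansion.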
-
Reference: PR Comment
Ideas / Approach to incorporating duration / speech priors
--> e.g. NaturalSpeech2 Impl