layout | title | subtitle |
---|---|---|
home | FluentTTS | Text-dependent Fine-grained Style Control for Multi-style TTS |
Submitted to INTERSPEECH 2022 (Paper ID: 988)
Changhwan Kim, Seyun Um, Hyungchan Yoon, Hong-goo Kang
In this paper, we propose a method to flexibly control the local prosodic variation of a neural text-to-speech (TTS) model. To make synthesized speech expressive, conventional TTS models use utterance-wise global style embeddings obtained by compressing frame-level embeddings along the time axis. However, because utterance-wise global features do not carry enough information to represent word-level local characteristics, they are not suitable for directly controlling prosody at a fine scale.
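As a minimal sketch of why a global embedding loses local detail, assume the compression along the time axis is a plain mean over frames. This is an assumption made for illustration: the page does not specify the pooling operation, and `global_style_embedding` is a hypothetical name. Under that assumption, two frame sequences with opposite trajectories collapse to the same vector:

```python
from typing import List


def global_style_embedding(frames: List[List[float]]) -> List[float]:
    """Mean-pool frame-level embeddings over time (assumed pooling, for illustration)."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


# Toy 2-dimensional frame embeddings: the first dimension rises in one
# sequence and falls in the other, mimicking opposite local prosodic movement.
rising = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
falling = [[3.0, 0.0], [2.0, 0.0], [1.0, 0.0]]

# Both sequences pool to the same global vector, so the direction of the
# local movement cannot be recovered from the global embedding alone.
same_global = global_style_embedding(rising) == global_style_embedding(falling)
```

Real systems may pool with attention or a recurrent encoder rather than a mean, but any fixed-size summary of the whole utterance discards some timing information in a similar way.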
In multi-style TTS models, it is very important to have the capability to control local prosody because it plays a key role in finding the most appropriate text-to-speech pair among many one-to-many mapping candidates.
To explicitly present local prosodic characteristics to the contextual information of the corresponding input text, we propose a module to predict the fundamental frequency (
Text | 두 사람 눈치 보기 싫어서 한 발이라도 먼저 나간다. (I don't want to tiptoe around those two, so I'm leaving a step ahead of them.) | 아침 하기 싫어서 나오는 거 내 모를 줄 알아? (Do you think I don't know you're coming out because you don't want to make breakfast?) | 내 앞에 앉기 싫은 모양인데, 그럼 앉지마. (It looks like you don't want to sit in front of me, so don't sit down.) | 들어오기 싫으면, 이참에 끝장을 내라고 그러세요. (If he doesn't want to come in, tell him to end it once and for all.) |
Ground Truth | ||||
Baseline | ||||
Proposed |
Text | 난 아저씨가 빨리 우리 아빠가 됐으면 좋겠어. (I hope he will be my father soon.) | 엿장수 맘대로 아니고, 지혜 맘대로. (It's not up to the candy seller, it's up to Jihye.) | 그런 맘 먹기 힘들었을텐데, 고맙다 인경아. (It must have been hard for you to make up your mind, thank you In-kyung.) | 아뇨, 전 호텔에서의 만찬보다는 이런 자리가 훨씬 편하고 좋은데요. (No, I find an occasion like this much more comfortable and enjoyable than a banquet at a hotel.) |
Ground Truth | ||||
Baseline | ||||
Proposed |
Text | 엄마가 날 안낳았다니까 더 우울하고 더 슬퍼. (It's even more depressing and sadder that my mom didn't give birth to me.) | 나 아까워서 우리 혜인이 시집 못보내겠어. (My Hye-in is too precious to me; I can't bear to marry her off.) | 오늘 만나면 어제 했던 말 취소한다고 할까봐 밤새 걱정했어. (I worried all night that when we met today you would take back what you said yesterday.) | 공휴일이라 쉬실텐데, 전화드려서 죄송합니다. (I'm sorry to call you when you must be resting on a public holiday.) |
Ground Truth | ||||
Baseline | ||||
Proposed |
Text | 세상에 둘도 없는 범생이 차림으로 갈 거다. (I'm going to dress up as the best student in the world.) | 일본 작가가 쓴 소설을 출판하고 싶은가봐. (He seems to want to publish a novel written by a Japanese writer.) | 애한테 이런 불량식품을 사먹이면 어떡해요. (You shouldn't buy such junk food for that kid.) | 아빠, 우리 유치원 얼마나 좋은데요. (Dad, my kindergarten is so nice.) |
Ground Truth | ||||
Baseline | ||||
Proposed |
Text 1. 이번 여름엔 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)
F0 Shift | Original | +50Hz | -50Hz |
Happiness | |||
Sadness |
Text 2. 아니야, 정말로 엄마랑 살고 싶어. (No, I really want to live with my mom.)
F0 Shift | Original | +50Hz | -50Hz |
Anger |
(The part where F0 has changed is marked in bold.)
Text 1. 이번 여름엔(yeo reum en) 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)
F0 Shift | Original | +50Hz | -50Hz |
Anger | |||
Sadness | |||
Neutral |
Text 2. 데려다주기로 한 거니까, 어디든(eo di deun) 데려다주마. (I promised to take you, so I'll take you anywhere.)
F0 Shift | Original | +50Hz | -50Hz |
Happiness |
(The part where F0 has changed is marked in bold.)
Text 1. 이번 여름엔(reum en) 다같이 놀러가면 좋겠다. (I hope we can all go on a trip together this summer.)
F0 Shift | Original | 름(reum) | 엔(en) |
Happiness | |||
Neutral |
Text 2. 아빠, 우리 유(yu)치원(won) 얼마나 좋은데요. (Dad, my kindergarten is so nice.)
F0 Shift | Original | 유(yu) | 원(won) |
Anger |
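The F0 manipulations in the tables above can be pictured as adding a constant offset (such as the ±50 Hz used in the first tables) to a frame-level F0 contour, either over the whole utterance or over the frames of one word or syllable. The sketch below illustrates only that arithmetic and is not the model's actual control mechanism. The function name `shift_f0`, the frame indices, the convention that unvoiced frames are stored as 0 Hz, and the `floor_hz` clamp are all assumptions introduced for illustration.

```python
from typing import List, Optional


def shift_f0(f0_hz: List[float], offset_hz: float,
             start: int = 0, end: Optional[int] = None,
             floor_hz: float = 1.0) -> List[float]:
    """Return a copy of f0_hz with offset_hz added to voiced frames in [start, end)."""
    if end is None:
        end = len(f0_hz)
    if not 0 <= start <= end <= len(f0_hz):
        raise ValueError("span must satisfy 0 <= start <= end <= len(f0_hz)")
    shifted = list(f0_hz)
    for i in range(start, end):
        if shifted[i] > 0.0:  # assumed convention: 0 Hz marks an unvoiced frame
            # Clamp so a large negative offset cannot produce a non-positive F0.
            shifted[i] = max(shifted[i] + offset_hz, floor_hz)
    return shifted


# Toy contour: six frames, two of them unvoiced.
contour = [0.0, 200.0, 210.0, 220.0, 0.0, 180.0]

# Utterance-level shift: every voiced frame moves up by 50 Hz.
utterance_up = shift_f0(contour, 50.0)

# Local shift: only frames 1..2 (a hypothetical word or syllable span) move down.
local_down = shift_f0(contour, -50.0, start=1, end=3)
```

In practice the span boundaries would come from an alignment between text units and frames, which this page does not describe.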
Since our proposed model provides local information from F0 values, its synthesized output exhibits diverse local prosodic variation. In contrast, the baseline generates only averaged prosodic variation.
Please focus on the prosodic variation of the following samples.
Baseline | |||
Proposed |
Baseline | ||||
Proposed |