Multilingual support #72

Status: Open. Wants to merge 24 commits into mozilla-ai:main from multilingual-support.

Commits (24):
- 286a93c: [WIP] Add bark and parler multi support (Kostis-S-Z, Dec 17, 2024)
- 14b69bf: Add config files for other models to easily test across models (Kostis-S-Z, Dec 17, 2024)
- 20ab8e9: Use model loading wrapper function for download_models.py (Kostis-S-Z, Dec 17, 2024)
- ee38e10: Make sure transformers>4.31.0 (required for bark model) (Kostis-S-Z, Dec 17, 2024)
- 890c684: Add parler dependency (Kostis-S-Z, Dec 17, 2024)
- 8cc7b0d: Use TTSModelWrapper for demo code (Kostis-S-Z, Dec 17, 2024)
- dcbb254: Use TTSModelWrapper for cli (Kostis-S-Z, Dec 17, 2024)
- b0d40bc: Add outetts_language attribute (Kostis-S-Z, Dec 17, 2024)
- 5e47b1e: Add TTSModelWrapper (Kostis-S-Z, Dec 17, 2024)
- 945c44f: Update text_to_speech.py (Kostis-S-Z, Dec 17, 2024)
- 4565fb8: Pass model-specific variables as **kwargs (Kostis-S-Z, Dec 18, 2024)
- 01d0e7a: Rename TTSModelWrapper to TTSInterface (Kostis-S-Z, Dec 18, 2024)
- 5af3e72: Update language argument to kwargs (Kostis-S-Z, Dec 18, 2024)
- e3a3f17: Remove parler from dependencies (Kostis-S-Z, Dec 18, 2024)
- a918574: Merge branch 'mozilla-ai:main' into multilingual-support (Kostis-S-Z, Dec 18, 2024)
- fb814fa: Separate inference from TTSModel (Kostis-S-Z, Dec 19, 2024)
- 672c0e0: Make sure config model is properly registered (Kostis-S-Z, Dec 19, 2024)
- 28b02b8: Decouple loading & inference of TTS model (Kostis-S-Z, Dec 19, 2024)
- b489e0d: Decouple loading & inference of TTS model (Kostis-S-Z, Dec 19, 2024)
- dc89668: Enable user to exit podcast generation gracefully (Kostis-S-Z, Dec 19, 2024)
- 0d143eb: Add Q2 Oute version to TTS_LOADERS (Kostis-S-Z, Dec 19, 2024)
- e9ca498: Add comment for support in TTS_INFERENCE (Kostis-S-Z, Dec 19, 2024)
- 47112a0: Update test_model_loaders.py (Kostis-S-Z, Dec 19, 2024)
- ec0fe5a: Update test_text_to_speech.py (Kostis-S-Z, Dec 19, 2024)
10 changes: 5 additions & 5 deletions demo/app.py
@@ -5,13 +5,13 @@
import soundfile as sf
import streamlit as st

from document_to_podcast.inference.text_to_speech import text_to_speech
from document_to_podcast.preprocessing import DATA_LOADERS, DATA_CLEANERS
from document_to_podcast.inference.model_loaders import (
load_llama_cpp_model,
load_outetts_model,
load_tts_model,
)
from document_to_podcast.config import DEFAULT_PROMPT, DEFAULT_SPEAKERS, Speaker
from document_to_podcast.inference.text_to_speech import text_to_speech
from document_to_podcast.inference.text_to_text import text_to_text_stream


@@ -24,7 +24,7 @@ def load_text_to_text_model():

@st.cache_resource
def load_text_to_speech_model():
return load_outetts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
return load_tts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")


script = "script"
@@ -153,7 +153,7 @@ def gen_button_clicked():
speech_model,
voice_profile,
)
st.audio(speech, sample_rate=speech_model.audio_codec.sr)
st.audio(speech, sample_rate=speech_model.sample_rate)

st.session_state.audio.append(speech)
text = ""
@@ -164,7 +164,7 @@ def gen_button_clicked():
sf.write(
"podcast.wav",
st.session_state.audio,
samplerate=speech_model.audio_codec.sr,
samplerate=speech_model.sample_rate,
)
st.markdown("Podcast saved to disk!")

4 changes: 2 additions & 2 deletions demo/download_models.py
@@ -4,10 +4,10 @@

from document_to_podcast.inference.model_loaders import (
load_llama_cpp_model,
load_outetts_model,
load_tts_model,
)

load_llama_cpp_model(
"allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
)
load_outetts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
load_tts_model("OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf")
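The hunks above replace the per-family loaders (`load_outetts_model`, `load_parler_tts_model_and_tokenizer`) with a single `load_tts_model` entry point. A minimal sketch of how such a dispatcher could be structured; the registry contents, the `TTSModel` fields, and the loader body here are illustrative assumptions, not the PR's exact code:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TTSModel:
    """Uniform wrapper so callers never touch model-specific attributes."""
    model_id: str
    model: Any
    sample_rate: int
    custom_args: dict = field(default_factory=dict)


def _load_oute(model_id: str, **kwargs) -> TTSModel:
    # Stand-in for the real OuteTTS loading logic; 24 kHz is assumed.
    return TTSModel(model_id=model_id, model=object(), sample_rate=24_000)


# Maps model ids to loader callables; supporting a new model only
# requires registering a new entry, not editing call sites.
TTS_LOADERS: dict[str, Callable[..., TTSModel]] = {
    "OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf": _load_oute,
}


def load_tts_model(model_id: str, **kwargs) -> TTSModel:
    return TTS_LOADERS[model_id](model_id, **kwargs)
```

Callers then depend only on the wrapper's uniform attributes (e.g. `speech_model.sample_rate`), which is exactly what the demo/app.py and cli.py hunks switch to.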
31 changes: 31 additions & 0 deletions example_data/config_bark.yaml
Contributor comment:

Not sure if we want to add multiple configs here; I guess this is a discussion to have with the developer hub. All these seem like potential "Use Case" / "Customization" examples.
@@ -0,0 +1,31 @@
input_file: "example_data/a.md"
output_folder: "example_data/bark/"
text_to_text_model: "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
text_to_speech_model: "suno/bark"
text_to_text_prompt: |
You are a podcast scriptwriter generating engaging and natural-sounding conversations in JSON format.
The script features the following speakers:
{SPEAKERS}
Instructions:
- Write dynamic, easy-to-follow dialogue.
- Include natural interruptions and interjections.
- Avoid repetitive phrasing between speakers.
- Format output as a JSON conversation.
Example:
{
"Speaker 1": "Welcome to our podcast! Today, we're exploring...",
"Speaker 2": "Hi! I'm excited to hear about this. Can you explain...",
"Speaker 1": "Sure! Imagine it like this...",
"Speaker 2": "Oh, that's cool! But how does..."
}
speakers:
- id: 1
name: Laura
description: The main host. She explains topics clearly using anecdotes and analogies, teaching in an engaging and captivating way.
voice_profile: "v2/en_speaker_0"

- id: 2
name: Daniel
description: The co-host. He keeps the conversation on track, asks curious follow-up questions, and reacts with excitement or confusion, often using interjections like hmm or umm.
voice_profile: "v2/en_speaker_1"
outetts_language: "en" # Supported languages in version 0.2-500M: en, zh, ja, ko.
Kostis-S-Z marked this conversation as resolved.
31 changes: 31 additions & 0 deletions example_data/config_parler.yaml
@@ -0,0 +1,31 @@
input_file: "example_data/a.md"
output_folder: "example_data/parler/"
text_to_text_model: "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
text_to_speech_model: "parler-tts/parler-tts-mini-v1.1"
text_to_text_prompt: |
You are a podcast scriptwriter generating engaging and natural-sounding conversations in JSON format.
The script features the following speakers:
{SPEAKERS}
Instructions:
- Write dynamic, easy-to-follow dialogue.
- Include natural interruptions and interjections.
- Avoid repetitive phrasing between speakers.
- Format output as a JSON conversation.
Example:
{
"Speaker 1": "Welcome to our podcast! Today, we're exploring...",
"Speaker 2": "Hi! I'm excited to hear about this. Can you explain...",
"Speaker 1": "Sure! Imagine it like this...",
"Speaker 2": "Oh, that's cool! But how does..."
}
speakers:
- id: 1
name: Laura
description: The main host. She explains topics clearly using anecdotes and analogies, teaching in an engaging and captivating way.
voice_profile: Laura's voice is calm and slow in delivery, with no background noise.

- id: 2
name: Daniel
description: The co-host. He keeps the conversation on track, asks curious follow-up questions, and reacts with excitement or confusion, often using interjections like hmm or umm.
voice_profile: Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.
outetts_language: "en" # Supported languages in version 0.2-500M: en, zh, ja, ko.
31 changes: 31 additions & 0 deletions example_data/config_parler_multi.yaml
@@ -0,0 +1,31 @@
input_file: "example_data/a.md"
output_folder: "example_data/parler_multi/"
text_to_text_model: "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf"
text_to_speech_model: "parler-tts/parler-tts-mini-multilingual-v1.1"
text_to_text_prompt: |
You are a podcast scriptwriter generating engaging and natural-sounding conversations in JSON format.
The script features the following speakers:
{SPEAKERS}
Instructions:
- Write dynamic, easy-to-follow dialogue.
- Include natural interruptions and interjections.
- Avoid repetitive phrasing between speakers.
- Format output as a JSON conversation.
Example:
{
"Speaker 1": "Welcome to our podcast! Today, we're exploring...",
"Speaker 2": "Hi! I'm excited to hear about this. Can you explain...",
"Speaker 1": "Sure! Imagine it like this...",
"Speaker 2": "Oh, that's cool! But how does..."
}
speakers:
- id: 1
name: Laura
description: The main host. She explains topics clearly using anecdotes and analogies, teaching in an engaging and captivating way.
voice_profile: Laura's voice is calm and slow in delivery, with no background noise.

- id: 2
name: Daniel
description: The co-host. He keeps the conversation on track, asks curious follow-up questions, and reacts with excitement or confusion, often using interjections like hmm or umm.
voice_profile: Daniel's voice is monotone yet slightly fast in delivery, with a very close recording that almost has no background noise.
outetts_language: "en" # Supported languages in version 0.2-500M: en, zh, ja, ko.
1 change: 1 addition & 0 deletions pyproject.toml
@@ -18,6 +18,7 @@ dependencies = [
"pydantic",
"PyPDF2[crypto]",
"python-docx",
"transformers>4.31.0",
"streamlit",
]

74 changes: 36 additions & 38 deletions src/document_to_podcast/cli.py
@@ -12,15 +12,14 @@
Speaker,
DEFAULT_PROMPT,
DEFAULT_SPEAKERS,
SUPPORTED_TTS_MODELS,
TTS_LOADERS,
)
from document_to_podcast.inference.model_loaders import (
load_llama_cpp_model,
load_outetts_model,
load_parler_tts_model_and_tokenizer,
load_tts_model,
)
from document_to_podcast.inference.text_to_text import text_to_text_stream
from document_to_podcast.inference.text_to_speech import text_to_speech
from document_to_podcast.inference.text_to_text import text_to_text_stream
from document_to_podcast.preprocessing import DATA_CLEANERS, DATA_LOADERS


@@ -30,8 +29,9 @@ def document_to_podcast(
output_folder: str | None = None,
text_to_text_model: str = "allenai/OLMoE-1B-7B-0924-Instruct-GGUF/olmoe-1b-7b-0924-instruct-q8_0.gguf",
text_to_text_prompt: str = DEFAULT_PROMPT,
text_to_speech_model: SUPPORTED_TTS_MODELS = "OuteAI/OuteTTS-0.1-350M-GGUF/OuteTTS-0.1-350M-FP16.gguf",
text_to_speech_model: TTS_LOADERS = "OuteAI/OuteTTS-0.1-350M-GGUF/OuteTTS-0.1-350M-FP16.gguf",
speakers: list[Speaker] | None = None,
outetts_language: str = "en", # Only applicable to OuteTTS models
from_config: str | None = None,
):
"""
@@ -70,8 +70,10 @@
speakers (list[Speaker] | None, optional): The speakers for the podcast.
Defaults to DEFAULT_SPEAKERS.

from_config (str, optional): The path to the config file. Defaults to None.
outetts_language (str): For OuteTTS models we need to specify which language to use.
Supported languages in 0.2-500M: en, zh, ja, ko. More info: https://github.com/edwko/OuteTTS

from_config (str, optional): The path to the config file. Defaults to None.

If provided, all other arguments will be ignored.
"""
@@ -86,6 +88,7 @@
text_to_text_prompt=text_to_text_prompt,
text_to_speech_model=text_to_speech_model,
speakers=[Speaker.model_validate(speaker) for speaker in speakers],
outetts_language=outetts_language,
)

output_folder = Path(config.output_folder)
@@ -106,15 +109,9 @@
text_model = load_llama_cpp_model(model_id=config.text_to_text_model)

logger.info(f"Loading {config.text_to_speech_model}")
if "oute" in config.text_to_speech_model.lower():
speech_model = load_outetts_model(model_id=config.text_to_speech_model)
speech_tokenizer = None
sample_rate = speech_model.audio_codec.sr
else:
speech_model, speech_tokenizer = load_parler_tts_model_and_tokenizer(
model_id=config.text_to_speech_model
)
sample_rate = speech_model.config.sampling_rate
speech_model = load_tts_model(
model_id=config.text_to_speech_model, outetts_language=outetts_language
)

# ~4 characters per token is considered a reasonable default.
max_characters = text_model.n_ctx() * 4
@@ -133,33 +130,34 @@
system_prompt = system_prompt.replace(
"{SPEAKERS}", "\n".join(str(speaker) for speaker in config.speakers)
)
for chunk in text_to_text_stream(
clean_text, text_model, system_prompt=system_prompt
):
text += chunk
podcast_script += chunk
if text.endswith("\n") and "Speaker" in text:
logger.debug(text)
speaker_id = re.search(r"Speaker (\d+)", text).group(1)
voice_profile = next(
speaker.voice_profile
for speaker in config.speakers
if speaker.id == int(speaker_id)
)
speech = text_to_speech(
text.split(f'"Speaker {speaker_id}":')[-1],
speech_model,
voice_profile,
tokenizer=speech_tokenizer, # Applicable only for parler models
)
podcast_audio.append(speech)
text = ""

try:
for chunk in text_to_text_stream(
clean_text, text_model, system_prompt=system_prompt
):
text += chunk
podcast_script += chunk
if text.endswith("\n") and "Speaker" in text:
logger.debug(text)
speaker_id = re.search(r"Speaker (\d+)", text).group(1)
voice_profile = next(
speaker.voice_profile
for speaker in config.speakers
if speaker.id == int(speaker_id)
)
speech = text_to_speech(
text.split(f'"Speaker {speaker_id}":')[-1],
speech_model,
voice_profile,
)
podcast_audio.append(speech)
text = ""
except KeyboardInterrupt:
logger.warning("Podcast generation stopped by user.")
logger.info("Saving Podcast...")
sf.write(
str(output_folder / "podcast.wav"),
np.concatenate(podcast_audio),
samplerate=sample_rate,
samplerate=speech_model.sample_rate,
)
(output_folder / "podcast.txt").write_text(podcast_script)
logger.success("Done!")
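The try/except added in the cli.py hunk above (commit dc89668, "Enable user to exit podcast generation gracefully") lets Ctrl-C stop generation while still writing whatever audio was produced so far. The pattern reduced to its essentials, with a stand-in chunk source:

```python
def generate_with_graceful_exit(chunks):
    """Collect chunks until the source is exhausted or the user interrupts.

    Always returns the partial work collected so far, so the caller can
    still save it (as cli.py does with sf.write).
    """
    collected = []
    try:
        for chunk in chunks:
            collected.append(chunk)
    except KeyboardInterrupt:
        # Ctrl-C: stop generating, but keep everything collected so far.
        pass
    return collected
```

Because `KeyboardInterrupt` derives from `BaseException`, a bare `except Exception` would not catch it; the PR's explicit `except KeyboardInterrupt` is the right shape for this.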
26 changes: 16 additions & 10 deletions src/document_to_podcast/config.py
@@ -1,10 +1,11 @@
from pathlib import Path
from typing import Literal
from typing_extensions import Annotated

from pydantic import BaseModel, FilePath
from pydantic.functional_validators import AfterValidator

from document_to_podcast.inference.model_loaders import TTS_LOADERS
from document_to_podcast.inference.text_to_speech import TTS_INFERENCE
from document_to_podcast.preprocessing import DATA_LOADERS


@@ -41,14 +42,6 @@
},
]

SUPPORTED_TTS_MODELS = Literal[
"OuteAI/OuteTTS-0.1-350M-GGUF/OuteTTS-0.1-350M-FP16.gguf",
"OuteAI/OuteTTS-0.2-500M-GGUF/OuteTTS-0.2-500M-FP16.gguf",
"parler-tts/parler-tts-large-v1",
"parler-tts/parler-tts-mini-v1",
"parler-tts/parler-tts-mini-v1.1",
]


def validate_input_file(value):
if Path(value).suffix not in DATA_LOADERS:
@@ -73,6 +66,18 @@ def validate_text_to_text_prompt(value):
return value


def validate_text_to_speech_model(value):
if value not in TTS_LOADERS:
raise ValueError(
f"Model {value} is missing a loading function. Please define it under model_loaders.py"
)
if value not in TTS_INFERENCE:
raise ValueError(
f"Model {value} is missing an inference function. Please define it under text_to_speech.py"
)
return value


class Speaker(BaseModel):
id: int
name: str
@@ -88,5 +93,6 @@ class Config(BaseModel):
output_folder: str
text_to_text_model: Annotated[str, AfterValidator(validate_text_to_text_model)]
text_to_text_prompt: Annotated[str, AfterValidator(validate_text_to_text_prompt)]
text_to_speech_model: SUPPORTED_TTS_MODELS
text_to_speech_model: Annotated[str, AfterValidator(validate_text_to_speech_model)]
speakers: list[Speaker]
outetts_language: str = "en"
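With the `SUPPORTED_TTS_MODELS` Literal removed, a "supported" model is simply one that has an entry in both registries. A sketch of that validation idea, using hypothetical registry contents (in the PR, the real dicts live in model_loaders.py and text_to_speech.py):

```python
# Hypothetical registry contents for illustration only.
TTS_LOADERS = {"suno/bark": lambda model_id, **kw: object()}
TTS_INFERENCE = {"suno/bark": lambda text, model, profile, **kw: b""}


def validate_text_to_speech_model(value: str) -> str:
    # A model must be registered for both loading and inference;
    # each check points the developer at the file to extend.
    if value not in TTS_LOADERS:
        raise ValueError(
            f"Model {value} is missing a loading function. "
            "Please define it under model_loaders.py"
        )
    if value not in TTS_INFERENCE:
        raise ValueError(
            f"Model {value} is missing an inference function. "
            "Please define it under text_to_speech.py"
        )
    return value
```

Wired into pydantic via `Annotated[str, AfterValidator(validate_text_to_speech_model)]`, as the config.py hunk shows, this keeps the config model open to new TTS backends without touching the config schema itself.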