Commit d76ddbc

Merge pull request #384 from idiap/dev
v0.26.1
2 parents 746c377 + 4446003 commit d76ddbc

File tree: 15 files changed, +275 -39 lines changed

Dockerfile

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ RUN curl --proto '=https' --tlsv1.2 -sSf "https://sh.rustup.rs" | sh -s -- -y
 ENV PATH="/root/.cargo/bin:${PATH}"

 RUN pip3 install -U pip setuptools wheel
-RUN pip3 install -U "spacy[ja]<3.8"
 RUN pip3 install llvmlite --ignore-installed

 # Install Dependencies:

TTS/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 # Bark
 torch.serialization.add_safe_globals(
     [
-        np.core.multiarray.scalar,
+        np._core.multiarray.scalar,
         np.dtype,
         np.dtypes.Float64DType,
         _codecs.encode,  # TODO: safe by default from Pytorch 2.5
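
For context, a minimal sketch of what this allow-listing enables (assuming PyTorch >= 2.4 with `weights_only` loading and NumPy >= 2.0, where `np.core` became `np._core`; the checkpoint path is illustrative):

import _codecs

import numpy as np
import torch

# Allow-list the NumPy and codec globals that Bark checkpoints pickle, so they
# can be unpickled under torch.load's restricted weights_only mode.
torch.serialization.add_safe_globals(
    [np._core.multiarray.scalar, np.dtype, np.dtypes.Float64DType, _codecs.encode]
)

# checkpoint = torch.load("bark_model.pth", map_location="cpu", weights_only=True)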

TTS/demos/xtts_ft_demo/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+You can open the notebook in Google Colab: https://colab.research.google.com/github/idiap/coqui-ai-TTS/blob/dev/TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb
TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Th91ofnQWr8Y"
+      },
+      "source": [
+        "## Dataset building + XTTS finetuning and inference\n",
+        "\n",
+        "#### Running the demo\n",
+        "To start the demo, run the first two cells (ignore pip install errors in the first one).\n",
+        "\n",
+        "Then click on the link `Running on public URL: ` when the demo is ready.\n",
+        "\n",
+        "#### Downloading the results\n",
+        "\n",
+        "You can run cell [3] to zip and download the default dataset path.\n",
+        "\n",
+        "You can run cell [4] to zip and download the latest model you trained."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "cdWKA_xFqkKq"
+      },
+      "source": [
+        "### Installing the requirements"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "lmUUQqdN6BXk"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install coqui-tts\n",
+        "!pip install gradio==4.7.1 faster_whisper"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "g7rNt1e2qtDP"
+      },
+      "source": [
+        "### Running the gradio UI"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "zd2xo_7a8wyj"
+      },
+      "outputs": [],
+      "source": [
+        "!python -m TTS.demos.xtts_ft_demo.xtts_demo --batch_size 2 --num_epochs 6"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "oXEBRA_kq23i"
+      },
+      "source": [
+        "### Downloading the dataset"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "dBxgdKcvi4kO"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import files\n",
+        "\n",
+        "!zip -q -r dataset.zip /tmp/xtts_ft/dataset\n",
+        "files.download('dataset.zip')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ZKzoP53Nq_rJ"
+      },
+      "source": [
+        "### Downloading the model"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "NpfdzHvKaX8M"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import files\n",
+        "import os\n",
+        "import glob\n",
+        "import torch\n",
+        "\n",
+        "def find_latest_best_model(folder_path):\n",
+        "    search_path = os.path.join(folder_path, '**', 'best_model.pth')\n",
+        "    files = glob.glob(search_path, recursive=True)\n",
+        "    latest_file = max(files, key=os.path.getctime, default=None)\n",
+        "    return latest_file\n",
+        "\n",
+        "model_path = find_latest_best_model(\"/tmp/xtts_ft/run/training/\")\n",
+        "checkpoint = torch.load(model_path, map_location=torch.device(\"cpu\"))\n",
+        "del checkpoint[\"optimizer\"]\n",
+        "for key in list(checkpoint[\"model\"].keys()):\n",
+        "    if \"dvae\" in key:\n",
+        "        del checkpoint[\"model\"][key]\n",
+        "torch.save(checkpoint, \"model.pth\")\n",
+        "model_dir = os.path.dirname(model_path)\n",
+        "files.download(os.path.join(model_dir, 'config.json'))\n",
+        "files.download(os.path.join(model_dir, 'vocab.json'))\n",
+        "files.download('model.pth')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Eh9_SusYdRE4"
+      },
+      "source": [
+        "### Copy files to your google drive\n",
+        "\n",
+        "The two previous cells are a prerequisite for this step, but copying to Google Drive can be much faster than downloading."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piLAaVHSdQs5"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import drive\n",
+        "import shutil\n",
+        "drive.mount('/content/drive')\n",
+        "!mkdir /content/drive/MyDrive/XTTS_ft_colab\n",
+        "shutil.copy(os.path.join(model_dir, 'config.json'), \"/content/drive/MyDrive/XTTS_ft_colab/config.json\")\n",
+        "shutil.copy(os.path.join(model_dir, 'vocab.json'), \"/content/drive/MyDrive/XTTS_ft_colab/vocab.json\")\n",
+        "shutil.copy('model.pth', \"/content/drive/MyDrive/XTTS_ft_colab/model.pth\")\n",
+        "shutil.copy('dataset.zip', \"/content/drive/MyDrive/XTTS_ft_colab/dataset.zip\")"
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
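
For reference, a minimal inference sketch once `config.json`, `vocab.json` and `model.pth` have been downloaded from the cells above (file paths and the reference clip are illustrative; it assumes a local coqui-tts install as in the first cell):

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned checkpoint exported by the Colab cells above.
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json", use_deepspeed=False)
if torch.cuda.is_available():
    model.cuda()

# "reference.wav" is a placeholder for a short clip of the target speaker.
outputs = model.synthesize(
    "It took me quite a long time to develop a voice.",
    config,
    speaker_wav="reference.wav",
    language="en",
)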

TTS/server/server.py

Lines changed: 20 additions & 6 deletions
@@ -38,7 +38,10 @@ def create_argparser() -> argparse.ArgumentParser:
         default="tts_models/en/ljspeech/tacotron2-DDC",
         help="Name of one of the pre-trained tts models in format <language>/<dataset>/<model_name>",
     )
-    parser.add_argument("--vocoder_name", type=str, default=None, help="name of one of the released vocoder models.")
+    parser.add_argument("--vocoder_name", type=str, default=None, help="Name of one of the released vocoder models.")
+    parser.add_argument(
+        "--speaker_idx", type=str, default=None, help="Target speaker ID for a multi-speaker TTS model."
+    )

     # Args for running custom models
     parser.add_argument("--config_path", default=None, type=str, help="Path to model config file.")
@@ -163,10 +166,10 @@ def tts():
     with lock:
         text = request.headers.get("text") or request.values.get("text", "")
         speaker_idx = (
-            request.headers.get("speaker-id") or request.values.get("speaker_id", "") if api.is_multi_speaker else None
+            request.headers.get("speaker-id") or request.values.get("speaker_id", args.speaker_idx)
+            if api.is_multi_speaker
+            else None
         )
-        if speaker_idx == "":
-            speaker_idx = None
         language_idx = (
             request.headers.get("language-id") or request.values.get("language_id", "")
             if api.is_multi_lingual
@@ -207,6 +210,13 @@ def mary_tts_api_voices():
         model_details = args.model_name.split("/")
     else:
         model_details = ["", "en", "", "default"]
+    if api.is_multi_speaker:
+        return render_template_string(
+            "{% for speaker in speakers %}{{ speaker }} {{ locale }} {{ gender }}\n{% endfor %}",
+            speakers=api.speakers,
+            locale=model_details[1],
+            gender="u",
+        )
     return render_template_string(
         "{{ name }} {{ locale }} {{ gender }}\n", name=model_details[3], locale=model_details[1], gender="u"
     )
@@ -218,12 +228,16 @@ def mary_tts_api_process():
     with lock:
         if request.method == "POST":
             data = parse_qs(request.get_data(as_text=True))
-            # NOTE: we ignore param. LOCALE and VOICE for now since we have only one active model
+            speaker_idx = data.get("VOICE", [args.speaker_idx])[0]
+            # NOTE: we ignore parameter LOCALE for now since we have only one active model
            text = data.get("INPUT_TEXT", [""])[0]
         else:
             text = request.args.get("INPUT_TEXT", "")
+            speaker_idx = request.args.get("VOICE", args.speaker_idx)
+
         logger.info("Model input: %s", text)
-        wavs = api.tts(text)
+        logger.info("Speaker idx: %s", speaker_idx)
+        wavs = api.tts(text, speaker=speaker_idx)
         out = io.BytesIO()
         api.synthesizer.save_wav(wavs, out)
         return send_file(out, mimetype="audio/wav")
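
As an illustration of the new option, a short client sketch (assuming the server was started with something like `tts-server --model_name tts_models/en/vctk/vits --speaker_idx p376` and listens on the default port 5002; the speaker ID and text are only examples):

import requests

# /api/tts reads the speaker from the "speaker_id" query parameter (or the
# "speaker-id" header) and falls back to --speaker_idx when it is omitted.
wav = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Hello from a VCTK speaker.", "speaker_id": "p243"},
).content
with open("p243.wav", "wb") as f:
    f.write(wav)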

TTS/tts/layers/xtts/tokenizer.py

Lines changed: 9 additions & 6 deletions
@@ -6,12 +6,6 @@

 import torch
 from num2words import num2words
-from spacy.lang.ar import Arabic
-from spacy.lang.en import English
-from spacy.lang.es import Spanish
-from spacy.lang.hi import Hindi
-from spacy.lang.ja import Japanese
-from spacy.lang.zh import Chinese
 from tokenizers import Tokenizer

 from TTS.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
@@ -21,6 +15,15 @@


 def get_spacy_lang(lang):
+    try:
+        from spacy.lang.ar import Arabic
+        from spacy.lang.en import English
+        from spacy.lang.es import Spanish
+        from spacy.lang.hi import Hindi
+        from spacy.lang.ja import Japanese
+        from spacy.lang.zh import Chinese
+    except ImportError as e:
+        raise ImportError("enable_text_splitting=True requires Spacy: pip install spacy[ja]") from e
     """Return Spacy language used for sentence splitting."""
     if lang == "zh":
         return Chinese()
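
A small usage sketch of the now lazily imported splitter (assuming `spacy` is installed, e.g. via `pip install spacy[ja]`):

from TTS.tts.layers.xtts.tokenizer import get_spacy_lang

# The Spacy language classes are only imported inside get_spacy_lang, so plain
# inference without enable_text_splitting no longer needs Spacy at all.
nlp = get_spacy_lang("en")
nlp.add_pipe("sentencizer")  # blank pipelines need an explicit sentence-boundary component
doc = nlp("This is the first sentence. This is the second one.")
print([sent.text for sent in doc.sents])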

TTS/tts/models/forward_tts.py

Lines changed: 2 additions & 2 deletions
@@ -331,7 +331,7 @@ def format_durations(self, o_dr_log, x_mask):
         return o_dr

     def _forward_encoder(
-        self, x: torch.LongTensor, x_mask: torch.FloatTensor, g: torch.FloatTensor = None
+        self, x: torch.LongTensor, x_mask: torch.FloatTensor, g: torch.FloatTensor | None = None
     ) -> tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
         """Encoding forward pass.

@@ -356,7 +356,7 @@ def _forward_encoder(
             - g: :math:`(B, C)`
         """
         if hasattr(self, "emb_g"):
-            g = g.type(torch.LongTensor)
+            g = g.type(torch.LongTensor).to(x.device)
             g = self.emb_g(g)  # [B, C, 1]
         if g is not None:
             g = g.unsqueeze(-1)
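
A minimal illustration of the device mismatch this fixes (a sketch with made-up shapes, assuming a CUDA device is available): `tensor.type(torch.LongTensor)` always returns a CPU tensor, so the embedding lookup on a GPU module would fail without the added `.to(x.device)`.

import torch

emb_g = torch.nn.Embedding(10, 16).cuda()  # speaker embedding table on the GPU
x = torch.randint(0, 50, (2, 30)).cuda()   # token IDs on the GPU
g = torch.tensor([3, 7]).cuda()            # speaker IDs on the GPU

g_cpu = g.type(torch.LongTensor)           # casting to the CPU LongTensor type moves g back to CPU
# emb_g(g_cpu)                             # would raise a device-mismatch RuntimeError
g_ok = g.type(torch.LongTensor).to(x.device)
print(emb_g(g_ok).shape)                   # torch.Size([2, 16])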

TTS/utils/audio/numpy_transforms.py

Lines changed: 7 additions & 3 deletions
@@ -281,16 +281,20 @@ def compute_f0(
     >>> wav = ap.load_wav(WAV_FILE, sr=ap.sample_rate)[:5 * ap.sample_rate]
     >>> pitch = ap.compute_f0(wav)
     """
-    assert pitch_fmax is not None, " [!] Set `pitch_fmax` before caling `compute_f0`."
-    assert pitch_fmin is not None, " [!] Set `pitch_fmin` before caling `compute_f0`."
+    assert pitch_fmax is not None, " [!] Set `pitch_fmax` before calling `compute_f0`."
+    assert pitch_fmin is not None, " [!] Set `pitch_fmin` before calling `compute_f0`."
+
+    if sample_rate / pitch_fmin >= win_length - 1:
+        logger.warning("pitch_fmin=%.2f is too small for win_length=%d", pitch_fmin, win_length)
+        pitch_fmin = sample_rate / (win_length - 1) + 0.1
+        logger.warning("pitch_fmin increased to %f", pitch_fmin)

     f0, voiced_mask, _ = pyin(
         y=x.astype(np.double),
         fmin=pitch_fmin,
         fmax=pitch_fmax,
         sr=sample_rate,
         frame_length=win_length,
-        win_length=win_length // 2,
         hop_length=hop_length,
         pad_mode=stft_pad_mode,
         center=center,
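
As a worked example of the new guard (illustrative values: 22.05 kHz audio with a 1024-sample analysis window):

sample_rate, win_length = 22050, 1024
pitch_fmin = 20.0  # requested floor, too low for this window

# pyin needs roughly one full period of the lowest pitch to fit into the
# analysis frame, so the floor is raised just above sample_rate / (win_length - 1).
if sample_rate / pitch_fmin >= win_length - 1:
    pitch_fmin = sample_rate / (win_length - 1) + 0.1

print(round(pitch_fmin, 2))  # 21.65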

docs/source/marytts.md

Lines changed: 22 additions & 2 deletions
@@ -39,5 +39,25 @@ You can enter the same URLs in your browser and check-out the results there as w

 ### How it works and limitations

-A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined).
-We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
+#### Single-speaker models
+
+A classic Mary-TTS server would usually show all installed locales and voices
+via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE`
+for processing. For Coqui-TTS we usually start the server with one specific
+locale and model and thus cannot return all available options. Instead, for
+single-speaker models, we return the active locale and use the model name as
+"voice". Since we only have one active model and always want to return a
+WAV-file, we currently ignore all other processing parameters except
+`INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always
+return `u` (undefined). We think that this is an acceptable compromise, since
+users are often only interested in one specific voice anyways, but the API might
+get extended in the future to support multiple languages and voices at the same
+time.
+
+#### Multi-speaker models
+
+For multi-speaker models, a specific speaker ID can be passed with the `VOICE`
+parameter. The `/voices` endpoint will return all available speaker IDs.
+Alternatively, the server can be started with e.g. `tts-server --model_name
+tts_models/en/vctk/vits --speaker_idx p376` to set a default speaker that will
+be used if the `VOICE` parameter is left out.
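
A small client sketch against the MaryTTS-compatible endpoints (assuming a multi-speaker server as in the `tts-server` example above, listening on the default port 5002):

import requests

base = "http://localhost:5002"

# /voices returns one "<speaker> <locale> <gender>" line per available speaker.
first_speaker = requests.get(f"{base}/voices").text.splitlines()[0].split()[0]

# /process synthesizes INPUT_TEXT; VOICE selects the speaker and falls back to --speaker_idx.
wav = requests.get(
    f"{base}/process",
    params={"INPUT_TEXT": "Hello world.", "VOICE": first_speaker},
).content
with open("out.wav", "wb") as f:
    f.write(wav)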

docs/source/models/xtts.md

Lines changed: 2 additions & 2 deletions
@@ -197,7 +197,7 @@ pip install deepspeed
 - `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
 - `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
 - `speed`: The speed rate of the generated audio. Defaults to 1.0. (can produce artifacts if far from 1.0)
-- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to True.
+- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to False.


 #### Inference
@@ -295,7 +295,7 @@ The user can run this gradio demo locally or remotely using a Colab Notebook.
 #### Run demo on Colab
 To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available we did a Google Colab Notebook.

-The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
+The Colab Notebook is available [here](https://colab.research.google.com/github/idiap/coqui-ai-TTS/blob/dev/TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb).

 To learn how to use this Colab Notebook please check the [XTTS fine-tuning video](https://www.youtube.com/watch?v=8tpDiiouGxc).

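For instance, sentence splitting can be opted back into per call (a sketch, assuming `model` is an `Xtts` instance loaded as elsewhere in these docs; the reference clip is illustrative):

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

out = model.inference(
    "A long passage with several sentences. Each sentence is synthesized separately when splitting is enabled.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    speed=1.0,
    enable_text_splitting=True,  # opt back in, since the default is now False
)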