Commit d76ddbc

Merge pull request #384 from idiap/dev
v0.26.1
2 parents 746c377 + 4446003 commit d76ddbc

File tree: 15 files changed, +275 -39 lines changed

Dockerfile

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ RUN curl --proto '=https' --tlsv1.2 -sSf "https://sh.rustup.rs" | sh -s -- -y
 ENV PATH="/root/.cargo/bin:${PATH}"

 RUN pip3 install -U pip setuptools wheel
-RUN pip3 install -U "spacy[ja]<3.8"
 RUN pip3 install llvmlite --ignore-installed

 # Install Dependencies:

TTS/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
 # Bark
 torch.serialization.add_safe_globals(
     [
-        np.core.multiarray.scalar,
+        np._core.multiarray.scalar,
         np.dtype,
         np.dtypes.Float64DType,
         _codecs.encode,  # TODO: safe by default from Pytorch 2.5
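
For context, a minimal sketch of what this allow-listing enables (assuming PyTorch >= 2.4 with `weights_only` loading and NumPy >= 2.0, where `np.core` became `np._core`; the checkpoint path is illustrative):

import _codecs

import numpy as np
import torch

# Allow-list the NumPy and codec globals that Bark checkpoints pickle, so they
# can be unpickled under torch.load's restricted weights_only mode.
torch.serialization.add_safe_globals(
    [np._core.multiarray.scalar, np.dtype, np.dtypes.Float64DType, _codecs.encode]
)

# checkpoint = torch.load("bark_model.pth", map_location="cpu", weights_only=True)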

TTS/demos/xtts_ft_demo/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+You can open the notebook in Google Colab: https://colab.research.google.com/github/idiap/coqui-ai-TTS/blob/dev/TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb
TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Th91ofnQWr8Y"
+      },
+      "source": [
+        "## Dataset building + XTTS finetuning and inference\n",
+        "\n",
+        "#### Running the demo\n",
+        "To start the demo, run the first two cells (ignore pip install errors in the first one).\n",
+        "\n",
+        "Then click on the link `Running on public URL: ` when the demo is ready.\n",
+        "\n",
+        "#### Downloading the results\n",
+        "\n",
+        "You can run cell [3] to zip and download the default dataset path.\n",
+        "\n",
+        "You can run cell [4] to zip and download the latest model you trained."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "cdWKA_xFqkKq"
+      },
+      "source": [
+        "### Installing the requirements"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "lmUUQqdN6BXk"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install coqui-tts\n",
+        "!pip install gradio==4.7.1 faster_whisper"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "g7rNt1e2qtDP"
+      },
+      "source": [
+        "### Running the gradio UI"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "zd2xo_7a8wyj"
+      },
+      "outputs": [],
+      "source": [
+        "!python -m TTS.demos.xtts_ft_demo.xtts_demo --batch_size 2 --num_epochs 6"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "oXEBRA_kq23i"
+      },
+      "source": [
+        "### Downloading the dataset"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "dBxgdKcvi4kO"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import files\n",
+        "\n",
+        "!zip -q -r dataset.zip /tmp/xtts_ft/dataset\n",
+        "files.download('dataset.zip')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ZKzoP53Nq_rJ"
+      },
+      "source": [
+        "### Downloading the model"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "NpfdzHvKaX8M"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import files\n",
+        "import os\n",
+        "import glob\n",
+        "import torch\n",
+        "\n",
+        "def find_latest_best_model(folder_path):\n",
+        "    search_path = os.path.join(folder_path, '**', 'best_model.pth')\n",
+        "    files = glob.glob(search_path, recursive=True)\n",
+        "    latest_file = max(files, key=os.path.getctime, default=None)\n",
+        "    return latest_file\n",
+        "\n",
+        "model_path = find_latest_best_model(\"/tmp/xtts_ft/run/training/\")\n",
+        "checkpoint = torch.load(model_path, map_location=torch.device(\"cpu\"))\n",
+        "del checkpoint[\"optimizer\"]\n",
+        "for key in list(checkpoint[\"model\"].keys()):\n",
+        "    if \"dvae\" in key:\n",
+        "        del checkpoint[\"model\"][key]\n",
+        "torch.save(checkpoint, \"model.pth\")\n",
+        "model_dir = os.path.dirname(model_path)\n",
+        "files.download(os.path.join(model_dir, 'config.json'))\n",
+        "files.download(os.path.join(model_dir, 'vocab.json'))\n",
+        "files.download('model.pth')"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Eh9_SusYdRE4"
+      },
+      "source": [
+        "### Copy files to your google drive\n",
+        "\n",
+        "The two previous cells are a prerequisite for this step, but copying to Google Drive can be much faster than downloading."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "piLAaVHSdQs5"
+      },
+      "outputs": [],
+      "source": [
+        "from google.colab import drive\n",
+        "import shutil\n",
+        "drive.mount('/content/drive')\n",
+        "!mkdir /content/drive/MyDrive/XTTS_ft_colab\n",
+        "shutil.copy(os.path.join(model_dir, 'config.json'), \"/content/drive/MyDrive/XTTS_ft_colab/config.json\")\n",
+        "shutil.copy(os.path.join(model_dir, 'vocab.json'), \"/content/drive/MyDrive/XTTS_ft_colab/vocab.json\")\n",
+        "shutil.copy('model.pth', \"/content/drive/MyDrive/XTTS_ft_colab/model.pth\")\n",
+        "shutil.copy('dataset.zip', \"/content/drive/MyDrive/XTTS_ft_colab/dataset.zip\")"
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
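
For reference, a minimal inference sketch once `config.json`, `vocab.json` and `model.pth` have been downloaded from the cells above (file paths and the reference clip are illustrative; it assumes a local coqui-tts install as in the first cell):

import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned checkpoint exported by the Colab cells above.
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json", use_deepspeed=False)
if torch.cuda.is_available():
    model.cuda()

# "reference.wav" is a placeholder for a short clip of the target speaker.
outputs = model.synthesize(
    "It took me quite a long time to develop a voice.",
    config,
    speaker_wav="reference.wav",
    language="en",
)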

TTS/server/server.py

Lines changed: 20 additions & 6 deletions
@@ -38,7 +38,10 @@ def create_argparser() -> argparse.ArgumentParser:
         default="tts_models/en/ljspeech/tacotron2-DDC",
         help="Name of one of the pre-trained tts models in format <language>/<dataset>/<model_name>",
     )
-    parser.add_argument("--vocoder_name", type=str, default=None, help="name of one of the released vocoder models.")
+    parser.add_argument("--vocoder_name", type=str, default=None, help="Name of one of the released vocoder models.")
+    parser.add_argument(
+        "--speaker_idx", type=str, default=None, help="Target speaker ID for a multi-speaker TTS model."
+    )

     # Args for running custom models
     parser.add_argument("--config_path", default=None, type=str, help="Path to model config file.")
@@ -163,10 +166,10 @@ def tts():
     with lock:
         text = request.headers.get("text") or request.values.get("text", "")
         speaker_idx = (
-            request.headers.get("speaker-id") or request.values.get("speaker_id", "") if api.is_multi_speaker else None
+            request.headers.get("speaker-id") or request.values.get("speaker_id", args.speaker_idx)
+            if api.is_multi_speaker
+            else None
         )
-        if speaker_idx == "":
-            speaker_idx = None
         language_idx = (
             request.headers.get("language-id") or request.values.get("language_id", "")
             if api.is_multi_lingual
@@ -207,6 +210,13 @@ def mary_tts_api_voices():
         model_details = args.model_name.split("/")
     else:
         model_details = ["", "en", "", "default"]
+    if api.is_multi_speaker:
+        return render_template_string(
+            "{% for speaker in speakers %}{{ speaker }} {{ locale }} {{ gender }}\n{% endfor %}",
+            speakers=api.speakers,
+            locale=model_details[1],
+            gender="u",
+        )
     return render_template_string(
         "{{ name }} {{ locale }} {{ gender }}\n", name=model_details[3], locale=model_details[1], gender="u"
     )
@@ -218,12 +228,16 @@ def mary_tts_api_process():
     with lock:
         if request.method == "POST":
             data = parse_qs(request.get_data(as_text=True))
-            # NOTE: we ignore param. LOCALE and VOICE for now since we have only one active model
+            speaker_idx = data.get("VOICE", [args.speaker_idx])[0]
+            # NOTE: we ignore parameter LOCALE for now since we have only one active model
            text = data.get("INPUT_TEXT", [""])[0]
         else:
             text = request.args.get("INPUT_TEXT", "")
+            speaker_idx = request.args.get("VOICE", args.speaker_idx)
+
         logger.info("Model input: %s", text)
-        wavs = api.tts(text)
+        logger.info("Speaker idx: %s", speaker_idx)
+        wavs = api.tts(text, speaker=speaker_idx)
         out = io.BytesIO()
         api.synthesizer.save_wav(wavs, out)
         return send_file(out, mimetype="audio/wav")
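
As an illustration of the new option, a short client sketch (assuming the server was started with something like `tts-server --model_name tts_models/en/vctk/vits --speaker_idx p376` and listens on the default port 5002; the speaker ID and text are only examples):

import requests

# /api/tts reads the speaker from the "speaker_id" query parameter (or the
# "speaker-id" header) and falls back to --speaker_idx when it is omitted.
wav = requests.get(
    "http://localhost:5002/api/tts",
    params={"text": "Hello from a VCTK speaker.", "speaker_id": "p243"},
).content
with open("p243.wav", "wb") as f:
    f.write(wav)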

TTS/tts/layers/xtts/tokenizer.py

Lines changed: 9 additions & 6 deletions
@@ -6,12 +6,6 @@

 import torch
 from num2words import num2words
-from spacy.lang.ar import Arabic
-from spacy.lang.en import English
-from spacy.lang.es import Spanish
-from spacy.lang.hi import Hindi
-from spacy.lang.ja import Japanese
-from spacy.lang.zh import Chinese
 from tokenizers import Tokenizer

 from TTS.tts.layers.xtts.zh_num2words import TextNorm as zh_num2words
@@ -21,6 +15,15 @@


 def get_spacy_lang(lang):
+    try:
+        from spacy.lang.ar import Arabic
+        from spacy.lang.en import English
+        from spacy.lang.es import Spanish
+        from spacy.lang.hi import Hindi
+        from spacy.lang.ja import Japanese
+        from spacy.lang.zh import Chinese
+    except ImportError as e:
+        raise ImportError("enable_text_splitting=True requires Spacy: pip install spacy[ja]") from e
     """Return Spacy language used for sentence splitting."""
     if lang == "zh":
         return Chinese()
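
A small usage sketch of the now lazily imported splitter (assuming `spacy` is installed, e.g. via `pip install spacy[ja]`):

from TTS.tts.layers.xtts.tokenizer import get_spacy_lang

# The Spacy language classes are only imported inside get_spacy_lang, so plain
# inference without enable_text_splitting no longer needs Spacy at all.
nlp = get_spacy_lang("en")
nlp.add_pipe("sentencizer")  # blank pipelines need an explicit sentence-boundary component
doc = nlp("This is the first sentence. This is the second one.")
print([sent.text for sent in doc.sents])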

TTS/tts/models/forward_tts.py

Lines changed: 2 additions & 2 deletions
@@ -331,7 +331,7 @@ def format_durations(self, o_dr_log, x_mask):
         return o_dr

     def _forward_encoder(
-        self, x: torch.LongTensor, x_mask: torch.FloatTensor, g: torch.FloatTensor = None
+        self, x: torch.LongTensor, x_mask: torch.FloatTensor, g: torch.FloatTensor | None = None
     ) -> tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
         """Encoding forward pass.

@@ -356,7 +356,7 @@ def _forward_encoder(
             - g: :math:`(B, C)`
         """
         if hasattr(self, "emb_g"):
-            g = g.type(torch.LongTensor)
+            g = g.type(torch.LongTensor).to(x.device)
             g = self.emb_g(g)  # [B, C, 1]
         if g is not None:
             g = g.unsqueeze(-1)
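
A minimal illustration of the device mismatch this fixes (a sketch with made-up shapes, assuming a CUDA device is available): `tensor.type(torch.LongTensor)` always returns a CPU tensor, so the embedding lookup on a GPU module would fail without the added `.to(x.device)`.

import torch

emb_g = torch.nn.Embedding(10, 16).cuda()  # speaker embedding table on the GPU
x = torch.randint(0, 50, (2, 30)).cuda()   # token IDs on the GPU
g = torch.tensor([3, 7]).cuda()            # speaker IDs on the GPU

g_cpu = g.type(torch.LongTensor)           # casting to the CPU LongTensor type moves g back to CPU
# emb_g(g_cpu)                             # would raise a device-mismatch RuntimeError
g_ok = g.type(torch.LongTensor).to(x.device)
print(emb_g(g_ok).shape)                   # torch.Size([2, 16])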

TTS/utils/audio/numpy_transforms.py

Lines changed: 7 additions & 3 deletions
@@ -281,16 +281,20 @@ def compute_f0(
     >>> wav = ap.load_wav(WAV_FILE, sr=ap.sample_rate)[:5 * ap.sample_rate]
     >>> pitch = ap.compute_f0(wav)
     """
-    assert pitch_fmax is not None, " [!] Set `pitch_fmax` before caling `compute_f0`."
-    assert pitch_fmin is not None, " [!] Set `pitch_fmin` before caling `compute_f0`."
+    assert pitch_fmax is not None, " [!] Set `pitch_fmax` before calling `compute_f0`."
+    assert pitch_fmin is not None, " [!] Set `pitch_fmin` before calling `compute_f0`."
+
+    if sample_rate / pitch_fmin >= win_length - 1:
+        logger.warning("pitch_fmin=%.2f is too small for win_length=%d", pitch_fmin, win_length)
+        pitch_fmin = sample_rate / (win_length - 1) + 0.1
+        logger.warning("pitch_fmin increased to %f", pitch_fmin)

     f0, voiced_mask, _ = pyin(
         y=x.astype(np.double),
         fmin=pitch_fmin,
         fmax=pitch_fmax,
         sr=sample_rate,
         frame_length=win_length,
-        win_length=win_length // 2,
         hop_length=hop_length,
         pad_mode=stft_pad_mode,
         center=center,
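
As a worked example of the new guard (illustrative values: 22.05 kHz audio with a 1024-sample analysis window):

sample_rate, win_length = 22050, 1024
pitch_fmin = 20.0  # requested floor, too low for this window

# pyin needs roughly one full period of the lowest pitch to fit into the
# analysis frame, so the floor is raised just above sample_rate / (win_length - 1).
if sample_rate / pitch_fmin >= win_length - 1:
    pitch_fmin = sample_rate / (win_length - 1) + 0.1

print(round(pitch_fmin, 2))  # 21.65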

docs/source/marytts.md

Lines changed: 22 additions & 2 deletions
@@ -39,5 +39,25 @@ You can enter the same URLs in your browser and check-out the results there as w

 ### How it works and limitations

-A classic Mary-TTS server would usually show all installed locales and voices via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE` for processing. For Coqui-TTS we usually start the server with one specific locale and model and thus cannot return all available options. Instead we return the active locale and use the model name as "voice". Since we only have one active model and always want to return a WAV-file, we currently ignore all other processing parameters except `INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always return `u` (undefined).
-We think that this is an acceptable compromise, since users are often only interested in one specific voice anyways, but the API might get extended in the future to support multiple languages and voices at the same time.
+#### Single-speaker models
+
+A classic Mary-TTS server would usually show all installed locales and voices
+via the corresponding endpoints and accept the parameters `LOCALE` and `VOICE`
+for processing. For Coqui-TTS we usually start the server with one specific
+locale and model and thus cannot return all available options. Instead, for
+single-speaker models, we return the active locale and use the model name as
+"voice". Since we only have one active model and always want to return a
+WAV-file, we currently ignore all other processing parameters except
+`INPUT_TEXT`. Since the gender is not defined for models in Coqui-TTS we always
+return `u` (undefined). We think that this is an acceptable compromise, since
+users are often only interested in one specific voice anyways, but the API might
+get extended in the future to support multiple languages and voices at the same
+time.
+
+#### Multi-speaker models
+
+For multi-speaker models, a specific speaker ID can be passed with the `VOICE`
+parameter. The `/voices` endpoint will return all available speaker IDs.
+Alternatively, the server can be started with e.g. `tts-server --model_name
+tts_models/en/vctk/vits --speaker_idx p376` to set a default speaker that will
+be used if the `VOICE` parameter is left out.
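
A small client sketch against the MaryTTS-compatible endpoints (assuming a multi-speaker server as in the `tts-server` example above, listening on the default port 5002):

import requests

base = "http://localhost:5002"

# /voices returns one "<speaker> <locale> <gender>" line per available speaker.
first_speaker = requests.get(f"{base}/voices").text.splitlines()[0].split()[0]

# /process synthesizes INPUT_TEXT; VOICE selects the speaker and falls back to --speaker_idx.
wav = requests.get(
    f"{base}/process",
    params={"INPUT_TEXT": "Hello world.", "VOICE": first_speaker},
).content
with open("out.wav", "wb") as f:
    f.write(wav)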

docs/source/models/xtts.md

Lines changed: 2 additions & 2 deletions
@@ -197,7 +197,7 @@ pip install deepspeed
 - `top_k`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.
 - `top_p`: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.
 - `speed`: The speed rate of the generated audio. Defaults to 1.0. (can produce artifacts if far from 1.0)
-- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to True.
+- `enable_text_splitting`: Whether to split the text into sentences and generate audio for each sentence. It allows you to have infinite input length but might loose important context between sentences. Defaults to False.


 #### Inference
@@ -295,7 +295,7 @@ The user can run this gradio demo locally or remotely using a Colab Notebook.
 #### Run demo on Colab
 To make the `XTTS_v2` fine-tuning more accessible for users that do not have good GPUs available we did a Google Colab Notebook.

-The Colab Notebook is available [here](https://colab.research.google.com/drive/1GiI4_X724M8q2W-zZ-jXo7cWTV7RfaH-?usp=sharing).
+The Colab Notebook is available [here](https://colab.research.google.com/github/idiap/coqui-ai-TTS/blob/dev/TTS/demos/xtts_ft_demo/XTTS_finetune_colab.ipynb).

 To learn how to use this Colab Notebook please check the [XTTS fine-tuning video](https://www.youtube.com/watch?v=8tpDiiouGxc).

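For instance, sentence splitting can be opted back into per call (a sketch, assuming `model` is an `Xtts` instance loaded as elsewhere in these docs; the reference clip is illustrative):

gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

out = model.inference(
    "A long passage with several sentences. Each sentence is synthesized separately when splitting is enabled.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    speed=1.0,
    enable_text_splitting=True,  # opt back in, since the default is now False
)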