You can convert text to speech (TTS) by using Serverless AI. To do so:
  1. Create a Docker image powered by the Piper engine for TTS.
  2. Run a fine-tuning job based on this image. This job produces an ONNX model for TTS.
  3. Deploy the model as a Serverless AI endpoint.
  4. Synthesize speech from text by using the deployed model.

Costs

Nebius AI Cloud charges you for the following billing items:
  • The CPU-only VM and its boot disk.
  • The Object Storage bucket with the dataset and model artifacts.
  • The Serverless AI fine-tuning job and endpoint.

Steps

Prepare infrastructure

  1. Create a CPU-only VM. You need it to build the Docker image: the VM runs Linux, which matches the required image architecture. Configure SSH access to the VM so that you can connect to it later.
    1. In the web console, go to Compute → Virtual machines.
    2. Click Create virtual machine.
    3. On the page that opens, set the following VM configuration:
      • Computing resources: Without GPU.
      • Platform: Non-GPU AMD EPYC Genoa.
      • Preset: 16 CPUs — 64 GiB RAM.
      • Boot disk size: at least 100 GiB.
      • Public IP address: Auto assign dynamic IP.
      • Username and SSH key: Configure access credentials.
    4. Click Create VM.
  2. Create a bucket to store fine-tuning artifacts.
    1. In the web console, go to Storage → Object Storage.
    2. Click Create bucket.
    3. In the Maximum size field, select Unlimited. Leave the other settings at their default values.
    4. Click Create bucket.

Prepare a dataset

On a local machine, prepare a dataset for training the ONNX model. After that, upload the dataset to the bucket.
  1. Create a working directory:
    mkdir -p ~/voice-demo-upload/input/raw
    cd ~/voice-demo-upload
    
  2. Create and activate a virtual Python environment:
    python3 -m venv .venv
    source .venv/bin/activate
    
  3. In this environment, install the required Python dependencies for the dataset preparation:
    pip3 install --upgrade pip
    pip3 install datasets soundfile torchcodec torch
    
  4. Install FFmpeg. It is a tool for recording and converting audio, and TorchCodec requires it to decode the samples. For example, run conda install ffmpeg, or brew install ffmpeg on macOS.
  5. Download five training samples from Hugging Face:
    python3 - <<'PY'
    from pathlib import Path
    from datasets import load_dataset, Audio
    
    out = Path("input/raw")
    out.mkdir(parents=True, exist_ok=True)
    
    ds = load_dataset(
        "openslr/librispeech_asr",
        "clean",
        split="train.100",
        streaming=True,
    )
    
    ds = ds.cast_column("audio", Audio(decode=False))
    
    for i, row in enumerate(ds.take(5)):
        audio = row["audio"]
        audio_bytes = audio.get("bytes")
    
        if not audio_bytes:
            raise RuntimeError(f"No audio bytes found for sample {i}")
    
        with open(out / f"sample_{i:04d}.flac", "wb") as f:
            f.write(audio_bytes)
    
    print("Done")
    PY
    
  6. After the script prints Done, check that the samples are downloaded:
    find ~/voice-demo-upload/input -maxdepth 3 -type f | sort
    
    The output should be the following:
    ~/voice-demo-upload/input/raw/sample_0000.flac
    ~/voice-demo-upload/input/raw/sample_0001.flac
    ~/voice-demo-upload/input/raw/sample_0002.flac
    ~/voice-demo-upload/input/raw/sample_0003.flac
    ~/voice-demo-upload/input/raw/sample_0004.flac
    
  7. Upload the input folder to the bucket created earlier:
    1. In the web console, go to Storage → Object Storage.
    2. Open the bucket page.
    3. In the bucket, create the input/raw folder structure. To do so, click Add → Folder for each directory in this path. Because the job mounts the bucket at /mnt/data, these objects appear at /mnt/data/input/raw inside the job.
    4. Go to input/raw and then click Add → Object.
    5. Upload the samples.
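You can also confirm locally that each downloaded sample is a valid FLAC stream: every FLAC file begins with the 4-byte marker fLaC. A minimal sketch using only the standard library (is_flac is a helper name introduced here, not part of the tutorial files); run it from ~/voice-demo-upload:

```python
from pathlib import Path


def is_flac(path: Path) -> bool:
    # FLAC streams always begin with the 4-byte marker b"fLaC"
    with open(path, "rb") as f:
        return f.read(4) == b"fLaC"


if __name__ == "__main__":
    for sample in sorted(Path("input/raw").glob("*.flac")):
        print(sample.name, "ok" if is_flac(sample) else "NOT FLAC")
```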

Prepare files for the Docker image

  1. To connect to the VM, get its public IP address:
    1. In the web console, go to Compute → Virtual machines.
    2. Open the VM page.
    3. In Network → Public IPv4, copy the address.
  2. Connect to the VM by using SSH:
    ssh <username>@<IP_address>
    
    Specify the username that you set when creating the VM.
  3. On the VM, create a working directory:
    mkdir ~/piper-nebius
    cd ~/piper-nebius
    
  4. In this directory, create the following files for building the Docker image:
    train.py — prepares the dataset (segmentation and Whisper transcription), fine-tunes Piper, and exports the ONNX model:
    #!/usr/bin/env python3
    import argparse
    import csv
    import shutil
    import subprocess
    import sys
    import urllib.request
    from pathlib import Path
    
    import torch
    import whisper
    
    
    BASE_CKPT_URL = (
        "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/"
        "en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"
    )
    
    
    def run(cmd, cwd=None):
        print("+", " ".join(cmd), flush=True)
        subprocess.run(cmd, cwd=cwd, check=True)
    
    
    def parse_args():
        parser = argparse.ArgumentParser(description="Train a Piper voice model non-interactively.")
        parser.add_argument("--raw-dir", default="/mnt/data/input/raw")
        parser.add_argument("--work-dir", default="/mnt/data/work")
        parser.add_argument("--output-dir", default="/mnt/data/output")
        parser.add_argument("--voice-name", default="custom_voice")
        parser.add_argument("--espeak-voice", default="en-us")
        parser.add_argument("--sample-rate", type=int, default=22050)
        parser.add_argument("--segment-seconds", type=int, default=10)
        parser.add_argument("--whisper-model", default="turbo")
        parser.add_argument("--max-epochs", type=int, default=4000)
        parser.add_argument("--batch-size", type=int, default=32)
        parser.add_argument("--num-workers", type=int, default=8)
        parser.add_argument("--base-ckpt-url", default=BASE_CKPT_URL)
        parser.add_argument("--no-base-ckpt", action="store_true")
        parser.add_argument("--device", default="cuda")
        parser.add_argument("--piper-repo", default="/opt/piper1-gpl")
        return parser.parse_args()
    
    
    def collect_audio_files(raw_dir: Path):
        exts = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
        files = sorted(p for p in raw_dir.rglob("*") if p.suffix.lower() in exts)
        if not files:
            raise FileNotFoundError(f"No audio files found under {raw_dir}")
        return files
    
    
    def segment_audio(files, wav_dir: Path, segment_seconds: int, sample_rate: int):
        wav_dir.mkdir(parents=True, exist_ok=True)
        for src in files:
            stem = src.stem.replace(" ", "_")
            out_pattern = wav_dir / f"{stem}_%04d.wav"
            run(
                [
                    "ffmpeg",
                    "-y",
                    "-i",
                    str(src),
                    "-vn",
                    "-ac",
                    "1",
                    "-ar",
                    str(sample_rate),
                    "-c:a",
                    "pcm_s16le",
                    "-f",
                    "segment",
                    "-segment_time",
                    str(segment_seconds),
                    str(out_pattern),
                ]
            )
    
    
    def transcribe_segments(wav_dir: Path, metadata_path: Path, whisper_model: str, device: str):
        model = whisper.load_model(whisper_model).to(device)
        wav_files = sorted(wav_dir.glob("*.wav"))
        if not wav_files:
            raise FileNotFoundError(f"No segmented wav files found under {wav_dir}")
    
        with metadata_path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="|", lineterminator="\n")
            for wav_path in wav_files:
                result = model.transcribe(str(wav_path))
                transcript = " ".join(result["text"].strip().split())
                if transcript:
                    writer.writerow([wav_path.stem, transcript])
    
    
    def download_base_checkpoint(url: str, checkpoint_path: Path):
        checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
        print(f"+ download {url} -> {checkpoint_path}", flush=True)
        urllib.request.urlretrieve(url, checkpoint_path)
    
    
    def sanitize_checkpoint(checkpoint_path: Path):
        checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
        checkpoint["hyper_parameters"] = {}
        torch.save(checkpoint, checkpoint_path)
    
    
    def latest_checkpoint(logs_dir: Path):
        checkpoints = sorted(logs_dir.glob("**/checkpoints/*.ckpt"), key=lambda p: p.stat().st_mtime)
        if not checkpoints:
            raise FileNotFoundError(f"No checkpoints found under {logs_dir}")
        return checkpoints[-1]
    
    
    def main():
        args = parse_args()
    
        raw_dir = Path(args.raw_dir)
        work_dir = Path(args.work_dir)
        output_dir = Path(args.output_dir)
        wav_dir = work_dir / "wav"
        metadata_path = work_dir / "metadata.csv"
        cache_dir = work_dir / "cache"
        config_path = work_dir / "config.json"
        base_ckpt_path = work_dir / "base.ckpt"
        logs_dir = Path(args.piper_repo) / "lightning_logs"
    
        work_dir.mkdir(parents=True, exist_ok=True)
        output_dir.mkdir(parents=True, exist_ok=True)
    
        audio_files = collect_audio_files(raw_dir)
        segment_audio(audio_files, wav_dir, args.segment_seconds, args.sample_rate)
        transcribe_segments(wav_dir, metadata_path, args.whisper_model, args.device)
    
        if not args.no_base_ckpt:
            download_base_checkpoint(args.base_ckpt_url, base_ckpt_path)
            sanitize_checkpoint(base_ckpt_path)
    
        fit_cmd = [
            sys.executable,
            "-m",
            "piper.train",
            "fit",
            "--data.voice_name",
            args.voice_name,
            "--data.csv_path",
            str(metadata_path),
            "--data.audio_dir",
            str(wav_dir),
            "--model.sample_rate",
            str(args.sample_rate),
            "--data.espeak_voice",
            args.espeak_voice,
            "--data.cache_dir",
            str(cache_dir),
            "--data.config_path",
            str(config_path),
            "--data.batch_size",
            str(args.batch_size),
            "--data.num_workers",
            str(args.num_workers),
            "--trainer.log_every_n_steps",
            "1",
            "--trainer.max_epochs",
            str(args.max_epochs),
            "--trainer.accelerator",
            "gpu",
            "--trainer.devices",
            "1",
        ]
    
        if not args.no_base_ckpt:
            fit_cmd.extend(
                [
                    "--ckpt_path",
                    str(base_ckpt_path),
                    "--weights_only",
                    "true",
                ]
            )
    
        run(fit_cmd, cwd=args.piper_repo)
    
        checkpoint_path = latest_checkpoint(logs_dir)
        output_model = output_dir / "model.onnx"
        output_config = output_dir / "model.onnx.json"
    
        run(
            [
                sys.executable,
                "-m",
                "piper.train.export_onnx",
                "--checkpoint",
                str(checkpoint_path),
                "--output-file",
                str(output_model),
            ],
            cwd=args.piper_repo,
        )
    
        if config_path.exists():
            if output_config.exists():
                print(f"Config already exists at {output_config}, leaving as-is", flush=True)
            else:
                try:
                    shutil.copy2(config_path, output_config)
                except PermissionError:
                    try:
                        shutil.copyfile(config_path, output_config)
                    except PermissionError as exc:
                        print(
                            f"Warning: could not write config to {output_config}: {exc}. "
                            "Model export succeeded; continuing without output JSON copy.",
                            flush=True,
                        )
    
        print(f"Training complete. ONNX model: {output_model}", flush=True)
    
    
    if __name__ == "__main__":
        main()
    
    app.py — the FastAPI application that the endpoint runs to serve synthesis requests:
    #!/usr/bin/env python3
    import subprocess
    import tempfile
    from pathlib import Path

    from fastapi import FastAPI, HTTPException
    from fastapi.responses import Response
    from pydantic import BaseModel


    MODEL_PATH = Path("/mnt/data/output/model.onnx")
    CONFIG_PATH = Path("/mnt/data/output/model.onnx.json")

    app = FastAPI(title="Piper Voice Endpoint")


    class SynthesizeRequest(BaseModel):
        text: str


    @app.get("/health")
    def health():
        return {
            "ok": MODEL_PATH.exists(),
            "model": str(MODEL_PATH),
            "config": str(CONFIG_PATH),
        }


    @app.post("/synthesize")
    def synthesize(request: SynthesizeRequest):
        if not MODEL_PATH.exists():
            raise HTTPException(status_code=503, detail="model.onnx not found at /mnt/data/output")

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            output_path = Path(tmp.name)

        try:
            subprocess.run(
                ["piper", "-m", str(MODEL_PATH), "--output_file", str(output_path)],
                input=request.text,
                text=True,
                check=True,
            )
            audio = output_path.read_bytes()
        except subprocess.CalledProcessError as exc:
            raise HTTPException(status_code=500, detail=f"inference failed: {exc}") from exc
        finally:
            # Remove the temporary file so repeated calls don't fill the disk.
            output_path.unlink(missing_ok=True)

        return Response(
            content=audio,
            media_type="audio/wav",
            headers={"Content-Disposition": 'attachment; filename="speech.wav"'},
        )
    
    requirements.txt — Python dependencies installed into the image:
    datasets
    soundfile
    torch<2.6
    openai-whisper
    fastapi
    uvicorn[standard]
    piper-phonemize
    
    sitecustomize.py — registers pathlib.PosixPath as a safe global so that torch.load in weights-only mode can read checkpoints that contain PosixPath objects:
    import pathlib
    import torch.serialization
    
    torch.serialization.add_safe_globals([pathlib.PosixPath])
    
    Dockerfile — builds the CUDA image with Piper and the files above:
    FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
    
    ENV DEBIAN_FRONTEND=noninteractive
    ENV PIPER_REPO=/opt/piper1-gpl
    ENV PYTHONUNBUFFERED=1
    ENV PYTHONPATH=/app
    
    RUN apt-get update && apt-get install -y \
        python3 \
        python3-pip \
        python3-venv \
        python3-dev \
        ffmpeg \
        git \
        wget \
        curl \
        cmake \
        build-essential \
        ninja-build \
        espeak-ng \
        && rm -rf /var/lib/apt/lists/*
    
    WORKDIR /app
    
    COPY requirements.txt /app/requirements.txt
    
    RUN python3 -m pip install --upgrade pip setuptools wheel scikit-build && \
        python3 -m pip install -r /app/requirements.txt
    
    RUN git clone https://github.com/OHF-voice/piper1-gpl.git ${PIPER_REPO} && \
        python3 -m pip install ${PIPER_REPO} && \
        cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.so /tmp/espeakbridge.so && \
        cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.pyi /tmp/espeakbridge.pyi && \
        python3 -m pip install -e ${PIPER_REPO}[train] && \
        cp /tmp/espeakbridge.so ${PIPER_REPO}/src/piper/espeakbridge.so && \
        cp /tmp/espeakbridge.pyi ${PIPER_REPO}/src/piper/espeakbridge.pyi && \
        ${PIPER_REPO}/build_monotonic_align.sh
    
    COPY train.py /app/train.py
    COPY app.py /app/app.py
    COPY sitecustomize.py /app/sitecustomize.py
    
    EXPOSE 8000
    
    CMD ["python3", "/app/train.py"]
    
    Check that you created all five files — train.py, app.py, requirements.txt, sitecustomize.py, and Dockerfile — for example, by running ls or tree.
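For reference, transcribe_segments in train.py writes metadata.csv in a pipe-delimited, headerless format: one <wav stem>|<transcript> row per segment. A minimal sketch of parsing such content (the sample rows below are illustrative):

```python
import csv
import io

# Two example rows in the format that transcribe_segments writes:
# <wav stem> | <normalized transcript>, pipe-delimited, no header.
metadata = "sample_0000_0000|hello world\nsample_0000_0001|a second segment\n"

rows = list(csv.reader(io.StringIO(metadata), delimiter="|"))
for stem, transcript in rows:
    print(f"{stem}.wav -> {transcript!r}")
```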

Build and push the Docker image

On the VM:
  1. Install Docker.
  2. Install additional packages and prepare Docker for building the image:
    sudo apt-get update
    sudo apt-get install -y git curl wget unzip python3 python3-venv python3-pip ca-certificates
    sudo usermod -aG docker "$USER"
    newgrp docker
    
  3. Check that the Docker daemon is running:
    docker ps
    
    If Docker is running, this command prints a table of containers (it can be empty). If the command fails because the daemon isn’t running, start it.
  4. Create an account in Docker Hub. Use it for authentication when you push your image to a repository.
  5. Create a public repository in Docker Hub. You will push your Docker image there.
  6. In the ~/piper-nebius directory, build the image:
    docker build -t <repository>/<image>:piper-nebius-ui-tutorial .
    
    In the command, specify your public repository. For example, myrepository/tts:piper-nebius-ui-tutorial.
    If you build the image on a system other than Linux on x86-64, the image architecture won’t match the job environment, and the fine-tuning job will fail.
  7. Authenticate in Docker Hub:
    docker login -u <username>
    
    Specify your username at Docker Hub and enter your password when prompted.
  8. Push the image to the repository:
    docker push <repository>/<image>:piper-nebius-ui-tutorial
    
    This operation can take several minutes to complete.

Create and deploy the ONNX model by using a Serverless AI job and endpoint

  1. Create a fine-tuning job that generates the ONNX model:
    1. In the web console, go to AI Services → Jobs.
    2. Click Create job.
    3. On the page that opens, specify the following job parameters:
      • Image path: <repository>/<image>:piper-nebius-ui-tutorial. Set the image that you’ve pushed to the Docker repository.
      • Advanced settings → Entrypoint command: python3.
      • Advanced settings → Arguments: /app/train.py --raw-dir /mnt/data/input/raw --work-dir /tmp/work --output-dir /mnt/data/output --voice-name demo_voice --no-base-ckpt --max-epochs 50 --batch-size 4 --num-workers 0.
        • --raw-dir /mnt/data/input/raw: Matches the uploaded files.
        • --work-dir /tmp/work: Keeps intermediate files (segments, transcripts, cache) on the local disk instead of the mounted bucket.
        • --output-dir /mnt/data/output: Saves the exported ONNX model to the mounted volume.
        • --no-base-ckpt: Trains from scratch, which avoids compatibility problems with the base checkpoint.
        • --batch-size 4 --num-workers 0: Conservative settings for a small dataset.
      • Computing resources: Keep the predefined settings.
      • Mount volumes: Bucket.
      • Mount path: /mnt/data. After that, click Attach bucket and then select the bucket created earlier.
    4. Click Create.
    After the job reaches the Complete status, the files output/model.onnx and output/model.onnx.json are created in the bucket. These files contain the produced model.
  2. Deploy the model on a Serverless AI endpoint:
    1. In the web console, go to AI Services → Endpoints.
    2. Click Create endpoint.
    3. On the page that opens, specify the following endpoint parameters:
      • Image path: <repository>/<image>:piper-nebius-ui-tutorial. Set the image that you’ve pushed to the Docker repository.
      • Ports: 8000.
      • Advanced settings → Entrypoint command: uvicorn.
      • Advanced settings → Arguments: app:app --host 0.0.0.0 --port 8000.
      • Computing resources: Keep the predefined settings.
      • Mount volumes: Bucket.
      • Mount path: /mnt/data. After that, click Attach bucket and then select the bucket created earlier.
      • IP address: Public static IP.
    4. Click Create.
    Wait until the endpoint reaches the Running status.
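While waiting, you can poll the endpoint's /health route (defined in app.py) from your machine until the model is visible. A hedged sketch using only the standard library (is_healthy and wait_healthy are helper names introduced here; substitute your endpoint's real IP address):

```python
import json
import time
import urllib.request


def is_healthy(payload: dict) -> bool:
    # /health returns {"ok": true, ...} once model.onnx is present in the bucket
    return bool(payload.get("ok"))


def wait_healthy(base_url: str, timeout_s: float = 300.0, interval_s: float = 5.0) -> dict:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
                payload = json.loads(resp.read())
            if is_healthy(payload):
                return payload
        except OSError:
            pass  # endpoint not reachable yet, keep polling
        time.sleep(interval_s)
    raise TimeoutError(f"{base_url}/health did not report ok within {timeout_s} s")


# wait_healthy("http://<IP_address>:8000")
```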

Synthesize speech

  1. Get the endpoint IP address:
    1. In the web console, go to AI Services → Endpoints.
    2. Open the page of the deployed endpoint.
    3. Copy the IP address from the Network → Public endpoints field.
  2. To verify the endpoint health, run a health check:
    curl http://<IP_address>:8000/health
    
    Expected output:
    {"ok":true,"model":"/mnt/data/output/model.onnx","config":"/mnt/data/output/model.onnx.json"}
    
    The "ok":true message shows that the endpoint is healthy.
  3. To synthesize speech, call the endpoint:
    curl -X POST "http://<IP_address>:8000/synthesize" \
       -H "Content-Type: application/json" \
       -d '{"text":"Hello world"}' \
       --output speech.wav
    
    The method generates the speech.wav file with the spoken Hello world phrase. The audio quality can be low because the model was trained on only five samples; that is expected, since this tutorial only showcases the speech synthesis workflow. To improve the quality, train the model on a bigger dataset.
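The same calls can be scripted. A minimal Python client using only the standard library (build_request and synthesize are helper names introduced here; replace the address with your endpoint's IP):

```python
import json
import urllib.request


def build_request(base_url: str, text: str) -> urllib.request.Request:
    # POST /synthesize with a JSON body, matching the curl call above
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str = "speech.wav") -> str:
    # The response body is the WAV audio; save it to disk.
    with urllib.request.urlopen(build_request(base_url, text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


# synthesize("http://<IP_address>:8000", "Hello world")
```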

How to delete the created resources

Some of the created resources are chargeable. If you don’t need them, delete them so that Nebius AI Cloud doesn’t charge you for them:
  • CPU-only VM.
  • Boot disk attached to the VM.
  • Bucket.
  • Endpoint. When you delete an endpoint, Serverless AI automatically deletes the endpoint VM and container (boot) disk.