You can convert text to speech (TTS) by using Serverless AI. To do so:
  1. Create a Docker image powered by the Piper engine for TTS.
  2. Run a fine-tuning job based on this image. This job produces an ONNX model for TTS.
  3. Deploy the model as a Serverless AI endpoint.
  4. Synthesize speech from text by using the deployed model.

Costs

Nebius AI Cloud charges you for the following billing items:
  • The CPU-only VM and its boot disk.
  • The Object Storage bucket with the dataset and model artifacts.
  • The Serverless AI fine-tuning job and endpoint.

Steps

Prepare infrastructure

  1. Create a CPU-only VM. You need it to build the Docker image: the VM runs Linux, which matches the required image architecture. Configure SSH access to the VM so that you can connect to it later.
    1. In the web console, go to Compute → Virtual machines.
    2. Click Create virtual machine.
    3. On the page that opens, set the following VM configuration:
      • Computing resources: Without GPU.
      • Platform: Non-GPU AMD EPYC Genoa.
      • Preset: 16 CPUs — 64 GiB RAM.
      • Boot disk size: at least 100 GiB.
      • Public IP address: Auto assign dynamic IP.
      • Username and SSH key: Configure access credentials.
    4. Click Create VM.
  2. Create a bucket to store fine-tuning artifacts.
    1. In the web console, go to Storage → Object Storage.
    2. Click Create bucket.
    3. In the Maximum size field, select Unlimited. Leave the other settings at their default values.
    4. Click Create bucket.

Prepare a dataset

On a local machine, prepare a dataset for training the ONNX model. After that, upload the dataset to the bucket.
  1. Create a working directory:
    mkdir -p ~/voice-demo-upload/input/raw
    cd ~/voice-demo-upload
    
  2. Create and activate a virtual Python environment:
    python3 -m venv .venv
    source .venv/bin/activate
    
  3. In this environment, install the required Python dependencies for the dataset preparation:
    pip3 install --upgrade pip
    pip3 install datasets soundfile torchcodec torch
    
  4. Install FFmpeg. It is a tool for recording and converting audio, and TorchCodec requires it to decode the samples. For example, run conda install ffmpeg, or brew install ffmpeg on macOS.
  5. Download five training samples from Hugging Face:
    python3 - <<'PY'
    from pathlib import Path
    from datasets import load_dataset, Audio
    
    out = Path("input/raw")
    out.mkdir(parents=True, exist_ok=True)
    
    ds = load_dataset(
        "openslr/librispeech_asr",
        "clean",
        split="train.100",
        streaming=True,
    )
    
    ds = ds.cast_column("audio", Audio(decode=False))
    
    for i, row in enumerate(ds.take(5)):
        audio = row["audio"]
        audio_bytes = audio.get("bytes")
    
        if not audio_bytes:
            raise RuntimeError(f"No audio bytes found for sample {i}")
    
        with open(out / f"sample_{i:04d}.flac", "wb") as f:
            f.write(audio_bytes)
    
    print("Done")
    PY
    
  6. After the script prints Done, check that the samples are downloaded:
    find ~/voice-demo-upload/input -maxdepth 3 -type f | sort
    
    The output should be the following:
    ~/voice-demo-upload/input/raw/sample_0000.flac
    ~/voice-demo-upload/input/raw/sample_0001.flac
    ~/voice-demo-upload/input/raw/sample_0002.flac
    ~/voice-demo-upload/input/raw/sample_0003.flac
    ~/voice-demo-upload/input/raw/sample_0004.flac
    
  7. Upload the input folder to the bucket created earlier:
    1. In the web console, go to Storage → Object Storage.
    2. Open the bucket page.
    3. In the bucket, create the input/raw folder structure. To do so, click Add → Folder for each directory in this path. Because the job mounts the bucket at /mnt/data, these objects appear at /mnt/data/input/raw inside the job.
    4. Go to input/raw and then click Add → Object.
    5. Upload the samples.
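You can also confirm locally that each downloaded sample is a valid FLAC stream: every FLAC file begins with the 4-byte marker fLaC. A minimal sketch using only the standard library (is_flac is a helper name introduced here, not part of the tutorial files); run it from ~/voice-demo-upload:

```python
from pathlib import Path


def is_flac(path: Path) -> bool:
    # FLAC streams always begin with the 4-byte marker b"fLaC"
    with open(path, "rb") as f:
        return f.read(4) == b"fLaC"


if __name__ == "__main__":
    for sample in sorted(Path("input/raw").glob("*.flac")):
        print(sample.name, "ok" if is_flac(sample) else "NOT FLAC")
```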

Prepare files for the Docker image

  1. To connect to the VM, get its public IP address:
    1. In the web console, go to Compute → Virtual machines.
    2. Open the VM page.
    3. In Network → Public IPv4, copy the address.
  2. Connect to the VM by using SSH:
    ssh <username>@<IP_address>
    
    Specify the username that you set when creating the VM.
  3. On the VM, create a working directory:
    mkdir ~/piper-nebius
    cd ~/piper-nebius
    
  4. In this directory, create the following files for building the Docker image:
    train.py — prepares the dataset (segmentation and Whisper transcription), fine-tunes Piper, and exports the ONNX model:
    #!/usr/bin/env python3
    import argparse
    import csv
    import shutil
    import subprocess
    import sys
    import urllib.request
    from pathlib import Path
    
    import torch
    import whisper
    
    
    BASE_CKPT_URL = (
        "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/"
        "en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"
    )
    
    
    def run(cmd, cwd=None):
        print("+", " ".join(cmd), flush=True)
        subprocess.run(cmd, cwd=cwd, check=True)
    
    
    def parse_args():
        parser = argparse.ArgumentParser(description="Train a Piper voice model non-interactively.")
        parser.add_argument("--raw-dir", default="/mnt/data/input/raw")
        parser.add_argument("--work-dir", default="/mnt/data/work")
        parser.add_argument("--output-dir", default="/mnt/data/output")
        parser.add_argument("--voice-name", default="custom_voice")
        parser.add_argument("--espeak-voice", default="en-us")
        parser.add_argument("--sample-rate", type=int, default=22050)
        parser.add_argument("--segment-seconds", type=int, default=10)
        parser.add_argument("--whisper-model", default="turbo")
        parser.add_argument("--max-epochs", type=int, default=4000)
        parser.add_argument("--batch-size", type=int, default=32)
        parser.add_argument("--num-workers", type=int, default=8)
        parser.add_argument("--base-ckpt-url", default=BASE_CKPT_URL)
        parser.add_argument("--no-base-ckpt", action="store_true")
        parser.add_argument("--device", default="cuda")
        parser.add_argument("--piper-repo", default="/opt/piper1-gpl")
        return parser.parse_args()
    
    
    def collect_audio_files(raw_dir: Path):
        exts = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
        files = sorted(p for p in raw_dir.rglob("*") if p.suffix.lower() in exts)
        if not files:
            raise FileNotFoundError(f"No audio files found under {raw_dir}")
        return files
    
    
    def segment_audio(files, wav_dir: Path, segment_seconds: int, sample_rate: int):
        wav_dir.mkdir(parents=True, exist_ok=True)
        for src in files:
            stem = src.stem.replace(" ", "_")
            out_pattern = wav_dir / f"{stem}_%04d.wav"
            run(
                [
                    "ffmpeg",
                    "-y",
                    "-i",
                    str(src),
                    "-vn",
                    "-ac",
                    "1",
                    "-ar",
                    str(sample_rate),
                    "-c:a",
                    "pcm_s16le",
                    "-f",
                    "segment",
                    "-segment_time",
                    str(segment_seconds),
                    str(out_pattern),
                ]
            )
    
    
    def transcribe_segments(wav_dir: Path, metadata_path: Path, whisper_model: str, device: str):
        model = whisper.load_model(whisper_model).to(device)
        wav_files = sorted(wav_dir.glob("*.wav"))
        if not wav_files:
            raise FileNotFoundError(f"No segmented wav files found under {wav_dir}")
    
        with metadata_path.open("w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="|", lineterminator="\n")
            for wav_path in wav_files:
                result = model.transcribe(str(wav_path))
                transcript = " ".join(result["text"].strip().split())
                if transcript:
                    writer.writerow([wav_path.stem, transcript])
    
    
    def download_base_checkpoint(url: str, checkpoint_path: Path):
        checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
        print(f"+ download {url} -> {checkpoint_path}", flush=True)
        urllib.request.urlretrieve(url, checkpoint_path)
    
    
    def sanitize_checkpoint(checkpoint_path: Path):
        checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
        checkpoint["hyper_parameters"] = {}
        torch.save(checkpoint, checkpoint_path)
    
    
    def latest_checkpoint(logs_dir: Path):
        checkpoints = sorted(logs_dir.glob("**/checkpoints/*.ckpt"), key=lambda p: p.stat().st_mtime)
        if not checkpoints:
            raise FileNotFoundError(f"No checkpoints found under {logs_dir}")
        return checkpoints[-1]
    
    
    def main():
        args = parse_args()
    
        raw_dir = Path(args.raw_dir)
        work_dir = Path(args.work_dir)
        output_dir = Path(args.output_dir)
        wav_dir = work_dir / "wav"
        metadata_path = work_dir / "metadata.csv"
        cache_dir = work_dir / "cache"
        config_path = work_dir / "config.json"
        base_ckpt_path = work_dir / "base.ckpt"
        logs_dir = Path(args.piper_repo) / "lightning_logs"
    
        work_dir.mkdir(parents=True, exist_ok=True)
        output_dir.mkdir(parents=True, exist_ok=True)
    
        audio_files = collect_audio_files(raw_dir)
        segment_audio(audio_files, wav_dir, args.segment_seconds, args.sample_rate)
        transcribe_segments(wav_dir, metadata_path, args.whisper_model, args.device)
    
        if not args.no_base_ckpt:
            download_base_checkpoint(args.base_ckpt_url, base_ckpt_path)
            sanitize_checkpoint(base_ckpt_path)
    
        fit_cmd = [
            sys.executable,
            "-m",
            "piper.train",
            "fit",
            "--data.voice_name",
            args.voice_name,
            "--data.csv_path",
            str(metadata_path),
            "--data.audio_dir",
            str(wav_dir),
            "--model.sample_rate",
            str(args.sample_rate),
            "--data.espeak_voice",
            args.espeak_voice,
            "--data.cache_dir",
            str(cache_dir),
            "--data.config_path",
            str(config_path),
            "--data.batch_size",
            str(args.batch_size),
            "--data.num_workers",
            str(args.num_workers),
            "--trainer.log_every_n_steps",
            "1",
            "--trainer.max_epochs",
            str(args.max_epochs),
            "--trainer.accelerator",
            "gpu",
            "--trainer.devices",
            "1",
        ]
    
        if not args.no_base_ckpt:
            fit_cmd.extend(
                [
                    "--ckpt_path",
                    str(base_ckpt_path),
                    "--weights_only",
                    "true",
                ]
            )
    
        run(fit_cmd, cwd=args.piper_repo)
    
        checkpoint_path = latest_checkpoint(logs_dir)
        output_model = output_dir / "model.onnx"
        output_config = output_dir / "model.onnx.json"
    
        run(
            [
                sys.executable,
                "-m",
                "piper.train.export_onnx",
                "--checkpoint",
                str(checkpoint_path),
                "--output-file",
                str(output_model),
            ],
            cwd=args.piper_repo,
        )
    
        if config_path.exists():
            if output_config.exists():
                print(f"Config already exists at {output_config}, leaving as-is", flush=True)
            else:
                try:
                    shutil.copy2(config_path, output_config)
                except PermissionError:
                    try:
                        shutil.copyfile(config_path, output_config)
                    except PermissionError as exc:
                        print(
                            f"Warning: could not write config to {output_config}: {exc}. "
                            "Model export succeeded; continuing without output JSON copy.",
                            flush=True,
                        )
    
        print(f"Training complete. ONNX model: {output_model}", flush=True)
    
    
    if __name__ == "__main__":
        main()
    
    app.py — the FastAPI application that the endpoint runs to serve synthesis requests:
    #!/usr/bin/env python3
    import subprocess
    import tempfile
    from pathlib import Path

    from fastapi import FastAPI, HTTPException
    from fastapi.responses import Response
    from pydantic import BaseModel


    MODEL_PATH = Path("/mnt/data/output/model.onnx")
    CONFIG_PATH = Path("/mnt/data/output/model.onnx.json")

    app = FastAPI(title="Piper Voice Endpoint")


    class SynthesizeRequest(BaseModel):
        text: str


    @app.get("/health")
    def health():
        return {
            "ok": MODEL_PATH.exists(),
            "model": str(MODEL_PATH),
            "config": str(CONFIG_PATH),
        }


    @app.post("/synthesize")
    def synthesize(request: SynthesizeRequest):
        if not MODEL_PATH.exists():
            raise HTTPException(status_code=503, detail="model.onnx not found at /mnt/data/output")

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            output_path = Path(tmp.name)

        try:
            subprocess.run(
                ["piper", "-m", str(MODEL_PATH), "--output_file", str(output_path)],
                input=request.text,
                text=True,
                check=True,
            )
            audio = output_path.read_bytes()
        except subprocess.CalledProcessError as exc:
            raise HTTPException(status_code=500, detail=f"inference failed: {exc}") from exc
        finally:
            # Remove the temporary file so repeated calls don't fill the disk.
            output_path.unlink(missing_ok=True)

        return Response(
            content=audio,
            media_type="audio/wav",
            headers={"Content-Disposition": 'attachment; filename="speech.wav"'},
        )
    
    requirements.txt — Python dependencies installed into the image:
    datasets
    soundfile
    torch<2.6
    openai-whisper
    fastapi
    uvicorn[standard]
    piper-phonemize
    
    sitecustomize.py — registers pathlib.PosixPath as a safe global so that torch.load in weights-only mode can read checkpoints that contain PosixPath objects:
    import pathlib
    import torch.serialization
    
    torch.serialization.add_safe_globals([pathlib.PosixPath])
    
    Dockerfile — builds the CUDA image with Piper and the files above:
    FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04
    
    ENV DEBIAN_FRONTEND=noninteractive
    ENV PIPER_REPO=/opt/piper1-gpl
    ENV PYTHONUNBUFFERED=1
    ENV PYTHONPATH=/app
    
    RUN apt-get update && apt-get install -y \
        python3 \
        python3-pip \
        python3-venv \
        python3-dev \
        ffmpeg \
        git \
        wget \
        curl \
        cmake \
        build-essential \
        ninja-build \
        espeak-ng \
        && rm -rf /var/lib/apt/lists/*
    
    WORKDIR /app
    
    COPY requirements.txt /app/requirements.txt
    
    RUN python3 -m pip install --upgrade pip setuptools wheel scikit-build && \
        python3 -m pip install -r /app/requirements.txt
    
    RUN git clone https://github.com/OHF-voice/piper1-gpl.git ${PIPER_REPO} && \
        python3 -m pip install ${PIPER_REPO} && \
        cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.so /tmp/espeakbridge.so && \
        cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.pyi /tmp/espeakbridge.pyi && \
        python3 -m pip install -e ${PIPER_REPO}[train] && \
        cp /tmp/espeakbridge.so ${PIPER_REPO}/src/piper/espeakbridge.so && \
        cp /tmp/espeakbridge.pyi ${PIPER_REPO}/src/piper/espeakbridge.pyi && \
        ${PIPER_REPO}/build_monotonic_align.sh
    
    COPY train.py /app/train.py
    COPY app.py /app/app.py
    COPY sitecustomize.py /app/sitecustomize.py
    
    EXPOSE 8000
    
    CMD ["python3", "/app/train.py"]
    
    Check that you created all five files — train.py, app.py, requirements.txt, sitecustomize.py, and Dockerfile — for example, by running ls or tree.
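For reference, transcribe_segments in train.py writes metadata.csv in a pipe-delimited, headerless format: one <wav stem>|<transcript> row per segment. A minimal sketch of parsing such content (the sample rows below are illustrative):

```python
import csv
import io

# Two example rows in the format that transcribe_segments writes:
# <wav stem> | <normalized transcript>, pipe-delimited, no header.
metadata = "sample_0000_0000|hello world\nsample_0000_0001|a second segment\n"

rows = list(csv.reader(io.StringIO(metadata), delimiter="|"))
for stem, transcript in rows:
    print(f"{stem}.wav -> {transcript!r}")
```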

Build and push the Docker image

On the VM:
  1. Install Docker.
  2. Install additional packages and prepare Docker for building the image:
    sudo apt-get update
    sudo apt-get install -y git curl wget unzip python3 python3-venv python3-pip ca-certificates
    sudo usermod -aG docker "$USER"
    newgrp docker
    
  3. Check that the Docker daemon is running:
    docker ps
    
    If Docker is running, this command prints a table of containers (it can be empty). If the command fails because the daemon isn’t running, start it.
  4. Create an account in Docker Hub. Use it for authentication when you push your image to a repository.
  5. Create a public repository in Docker Hub. You will push your Docker image there.
  6. In the ~/piper-nebius directory, build the image:
    docker build -t <repository>/<image>:piper-nebius-ui-tutorial .
    
    In the command, specify your public repository. For example, myrepository/tts:piper-nebius-ui-tutorial.
    If you build the image on a system other than Linux on x86-64, the image architecture won’t match the job environment, and the fine-tuning job will fail.
  7. Authenticate in Docker Hub:
    docker login -u <username>
    
    Specify your username at Docker Hub and enter your password when prompted.
  8. Push the image to the repository:
    docker push <repository>/<image>:piper-nebius-ui-tutorial
    
    This operation can take several minutes to complete.

Create and deploy the ONNX model by using a Serverless AI job and endpoint

  1. Create a fine-tuning job that generates the ONNX model:
    1. In the web console, go to AI Services → Jobs.
    2. Click Create job.
    3. On the page that opens, specify the following job parameters:
      • Image path: <repository>/<image>:piper-nebius-ui-tutorial. Set the image that you’ve pushed to the Docker repository.
      • Advanced settings → Entrypoint command: python3.
      • Advanced settings → Arguments: /app/train.py --raw-dir /mnt/data/input/raw --work-dir /tmp/work --output-dir /mnt/data/output --voice-name demo_voice --no-base-ckpt --max-epochs 50 --batch-size 4 --num-workers 0.
        • --raw-dir /mnt/data/input/raw: Matches the uploaded files.
        • --work-dir /tmp/work: Keeps intermediate files (segments, transcripts, cache) on the local disk instead of the mounted bucket.
        • --output-dir /mnt/data/output: Saves the exported ONNX model to the mounted volume.
        • --no-base-ckpt: Trains from scratch, which avoids compatibility problems with the base checkpoint.
        • --batch-size 4 --num-workers 0: Conservative settings for a small dataset.
      • Computing resources: Keep the predefined settings.
      • Mount volumes: Bucket.
      • Mount path: /mnt/data. After that, click Attach bucket and then select the bucket created earlier.
    4. Click Create.
    After the job reaches the Complete status, the files output/model.onnx and output/model.onnx.json are created in the bucket. These files contain the produced model.
  2. Deploy the model on a Serverless AI endpoint:
    1. In the web console, go to AI Services → Endpoints.
    2. Click Create endpoint.
    3. On the page that opens, specify the following endpoint parameters:
      • Image path: <repository>/<image>:piper-nebius-ui-tutorial. Set the image that you’ve pushed to the Docker repository.
      • Ports: 8000.
      • Advanced settings → Entrypoint command: uvicorn.
      • Advanced settings → Arguments: app:app --host 0.0.0.0 --port 8000.
      • Computing resources: Keep the predefined settings.
      • Mount volumes: Bucket.
      • Mount path: /mnt/data. After that, click Attach bucket and then select the bucket created earlier.
      • IP address: Public static IP.
    4. Click Create.
    Wait until the endpoint reaches the Running status.
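While waiting, you can poll the endpoint's /health route (defined in app.py) from your machine until the model is visible. A hedged sketch using only the standard library (is_healthy and wait_healthy are helper names introduced here; substitute your endpoint's real IP address):

```python
import json
import time
import urllib.request


def is_healthy(payload: dict) -> bool:
    # /health returns {"ok": true, ...} once model.onnx is present in the bucket
    return bool(payload.get("ok"))


def wait_healthy(base_url: str, timeout_s: float = 300.0, interval_s: float = 5.0) -> dict:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
                payload = json.loads(resp.read())
            if is_healthy(payload):
                return payload
        except OSError:
            pass  # endpoint not reachable yet, keep polling
        time.sleep(interval_s)
    raise TimeoutError(f"{base_url}/health did not report ok within {timeout_s} s")


# wait_healthy("http://<IP_address>:8000")
```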

Synthesize speech

  1. Get the endpoint IP address:
    1. In the web console, go to AI Services → Endpoints.
    2. Open the page of the deployed endpoint.
    3. Copy the IP address from the Network → Public endpoints field.
  2. To verify the endpoint health, run a health check:
    curl http://<IP_address>:8000/health
    
    Expected output:
    {"ok":true,"model":"/mnt/data/output/model.onnx","config":"/mnt/data/output/model.onnx.json"}
    
    The "ok":true message shows that the endpoint is healthy.
  3. To synthesize speech, call the endpoint:
    curl -X POST "http://<IP_address>:8000/synthesize" \
       -H "Content-Type: application/json" \
       -d '{"text":"Hello world"}' \
       --output speech.wav
    
    The method generates the speech.wav file with the spoken Hello world phrase. The audio quality can be low because the model was trained on only five samples; that is expected, since this tutorial only showcases the speech synthesis workflow. To improve the quality, train the model on a bigger dataset.
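The same calls can be scripted. A minimal Python client using only the standard library (build_request and synthesize are helper names introduced here; replace the address with your endpoint's IP):

```python
import json
import urllib.request


def build_request(base_url: str, text: str) -> urllib.request.Request:
    # POST /synthesize with a JSON body, matching the curl call above
    return urllib.request.Request(
        f"{base_url}/synthesize",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url: str, text: str, out_path: str = "speech.wav") -> str:
    # The response body is the WAV audio; save it to disk.
    with urllib.request.urlopen(build_request(base_url, text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


# synthesize("http://<IP_address>:8000", "Hello world")
```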

How to delete the created resources

Some of the created resources are chargeable. If you don’t need them, delete them so that Nebius AI Cloud doesn’t charge you for them:
  • CPU-only VM.
  • Boot disk attached to the VM.
  • Bucket.
  • Endpoint. When you delete an endpoint, Serverless AI automatically deletes the endpoint VM and container (boot) disk.