# Speech synthesis in Serverless AI

You can convert text to speech (TTS) by using Serverless AI. To do so:

1. Create a Docker image powered by the [Piper](https://github.com/OHF-Voice/piper1-gpl) engine for TTS.
2. Run a fine-tuning job based on this image. This job produces an [Open Neural Network Exchange](https://onnx.ai) (ONNX) model for TTS.
3. Deploy the model as a Serverless AI endpoint.
4. Synthesize speech from text by using the deployed model.

## Costs

Nebius AI Cloud charges you for the following billing items:

* [Compute virtual machines](/compute/resources/pricing#virtual-machines-gpus-vcpus-ram) (VMs)
* [Boot disks](/compute/resources/pricing#disks) attached to the VMs
* Used space in Standard storage in an [Object Storage bucket](/object-storage/resources/pricing#storing-data)

## Prerequisites

Make sure you are in a [group](/iam/authorization/groups/index) that has at least the `editor` role within your tenant or project; for example, the default `editors` group. You can check this in the [Administration → IAM](https://console.nebius.com/iam) section of the web console.

## Steps

### Prepare infrastructure

Create all resources in the same project.

1. Create a CPU-only VM. The VM is required to build the Docker image based on the VM's Linux operating system (OS). If you build the image on a non-Linux OS, the image architecture will be incompatible with Serverless AI, and the fine-tuning job will fail. Configure SSH access to the VM so that you can connect to it later.

   1. In the [web console](https://console.nebius.com), go to **Compute** → **Virtual machines**.
   2. Click **Create virtual machine**.
   3. On the page that opens, set the following VM configuration:

      * **Computing resources**: Without GPU.
      * **Platform**: Non-GPU AMD EPYC Genoa.
      * **Preset**: 16 CPUs — 64 GiB RAM.
      * **Boot disk size**: At least 100 GiB.
      * **Public IP address**: `Auto assign dynamic IP`.
      * **Username and SSH key**: Configure access credentials.
   4. Click **Create VM**.

   By using the CLI:

   1. Create a boot disk:

      ```bash theme={null}
      nebius compute disk create \
        --name my-boot-disk \
        --size-gibibytes 100 \
        --type network_ssd \
        --source-image-family-image-family ubuntu24.04-driverless \
        --block-size-bytes 4096
      ```
   2. To add a user for connections to the VM, create a configuration by using the [cloud-init](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#users-and-groups) format:

      ```bash theme={null}
      export USER_DATA=$(jq -Rrs '.' < \
        --boot-disk-attach-mode READ_WRITE \
        --resources-platform cpu-d3 \
        --resources-preset 16vcpu-64gb \
        --network-interfaces "[{\"name\": \"eth0\", \"subnet_id\": \"<subnet_ID>\", \"ip_address\": {}, \"public_ip_address\": {}}]" \
        --cloud-init-user-data "$USER_DATA"
      ```

      This command creates a VM without GPUs, assigns a dynamic public IP address, and configures SSH access. For details about the subnet ID, see [How to get a subnet ID](/vpc/networking/resources#how-to-get-a-subnet-id).

   By using Terraform:

   1. [Install and configure](/terraform-provider/quickstart) the Nebius AI Cloud provider for Terraform.
   2. Create a boot disk by using the following configuration:

      ```hcl theme={null}
      resource "nebius_compute_v1_disk" "my_boot_disk" {
        name                = "my-boot-disk"
        parent_id           = "<project_ID>"
        size_gibibytes      = 100
        type                = "NETWORK_SSD"
        source_image_family = { image_family = "ubuntu24.04-driverless" }
        block_size_bytes    = 4096
      }
      ```

      To get the project ID, go to the [web console](https://console.nebius.com) and expand the top-left list of projects. Next to the project's name, click → **Copy project ID**.
   3. To add a user for connections to the VM, create a configuration by using the [cloud-init](https://cloudinit.readthedocs.io/en/latest/reference/modules.html#users-and-groups) format:

      ```bash theme={null}
      export USER_DATA=$(jq -Rrs '.' <
      ```

2. Create a bucket to store fine-tuning artifacts.

   1.
      In the web console, go to **Storage** → **Object Storage**.
   2. Click **Create bucket**.
   3. In the **Maximum size** field, select **Unlimited**. Leave the other settings at their default values.
   4. Click **Create bucket**.

   Run the following command:

   ```bash theme={null}
   nebius storage bucket create --name my-tts-bucket
   ```

   1. Use the following configuration file:

      ```hcl theme={null}
      resource "nebius_storage_v1_bucket" "my_bucket" {
        name      = "my-tts-bucket"
        parent_id = ""
      }
      ```
   2. Check that the configuration is correct:

      ```bash theme={null}
      terraform validate
      ```
   3. Apply the changes:

      ```bash theme={null}
      terraform apply
      ```

### Prepare a dataset

On a local machine, prepare a dataset for training the ONNX model. After that, upload the dataset to the bucket.

1. Create a working directory:

   ```bash theme={null}
   mkdir -p ~/voice-demo-upload/input/raw
   cd ~/voice-demo-upload
   ```
2. Create and activate a [virtual Python environment](https://docs.python.org/3/library/venv.html):

   ```bash theme={null}
   python3 -m venv .venv
   source .venv/bin/activate
   ```
3. In this environment, install the required Python dependencies for the dataset preparation:

   ```bash theme={null}
   pip3 install --upgrade pip
   pip3 install datasets soundfile torchcodec torch
   ```
4. Install [FFmpeg](https://www.ffmpeg.org). This is a tool that allows you to record and convert audio, and that is [required for TorchCodec](https://github.com/meta-pytorch/torchcodec?tab=readme-ov-file#installing-cpu-only-torchcodec).

   You can install FFmpeg by running `conda install "ffmpeg"` or `brew install "ffmpeg"` (macOS only).
5.
   Download five training samples from Hugging Face:

   ```bash theme={null}
   python3 - <<'PY'
   from pathlib import Path
   from datasets import load_dataset, Audio

   out = Path("input/raw")
   out.mkdir(parents=True, exist_ok=True)

   ds = load_dataset(
       "openslr/librispeech_asr",
       "clean",
       split="train.100",
       streaming=True,
   )
   ds = ds.cast_column("audio", Audio(decode=False))

   for i, row in enumerate(ds.take(5)):
       audio = row["audio"]
       audio_bytes = audio.get("bytes")
       if not audio_bytes:
           raise RuntimeError(f"No audio bytes found for sample {i}")
       with open(out / f"sample_{i:04d}.flac", "wb") as f:
           f.write(audio_bytes)

   print("Done")
   PY
   ```
6. After the script prints `Done`, check that the samples are downloaded:

   ```bash theme={null}
   find ~/voice-demo-upload/input -maxdepth 3 -type f | sort
   ```

   The output should list the five samples, similar to the following (`find` prints the paths with `~` expanded to your home directory):

   ```text theme={null}
   ~/voice-demo-upload/input/raw/sample_0000.flac
   ~/voice-demo-upload/input/raw/sample_0001.flac
   ~/voice-demo-upload/input/raw/sample_0002.flac
   ~/voice-demo-upload/input/raw/sample_0003.flac
   ~/voice-demo-upload/input/raw/sample_0004.flac
   ```
7. Upload the `input` folder to the bucket created earlier:

   1. In the web console, go to **Storage** → **Object Storage**.
   2. Open the bucket page.
   3. Create the `/mnt/data/input/raw` directory. To do so, click **Add** → **Folder** for every directory in this path.
   4. Go to `/mnt/data/input/raw` and then click **Add** → **Object**.
   5. Upload the samples.

### Prepare files for the Docker image

1. To connect to the VM, get its public IP address:

   1. In the web console, go to **Compute** → **Virtual machines**.
   2. Open the VM page.
   3. In **Network** → **Public IPv4**, copy the address.

   Run the following command:

   ```bash theme={null}
   nebius compute instance get-by-name --name my-vm \
     --format jsonpath='{.status.network_interfaces[0].public_ip_address.address}'
   ```
2.
   [Connect to the VM](/compute/virtual-machines/connect#connect-to-the-vm-by-using-ssh) by using SSH:

   ```bash theme={null}
   ssh <username>@<public_IP_address>
   ```

   Specify the username that you set when creating the VM and the public IP address that you got in the previous step.
3. On the VM, create a working directory:

   ```bash theme={null}
   mkdir ~/piper-nebius
   cd ~/piper-nebius
   ```
4. In this directory, create the following files for building the Docker image:

   `train.py`:

   ```python theme={null}
   #!/usr/bin/env python3
   import argparse
   import csv
   import shutil
   import subprocess
   import sys
   import urllib.request
   from pathlib import Path

   import torch
   import whisper

   BASE_CKPT_URL = (
       "https://huggingface.co/datasets/rhasspy/piper-checkpoints/resolve/main/"
       "en/en_US/lessac/medium/epoch%3D2164-step%3D1355540.ckpt"
   )


   def run(cmd, cwd=None):
       # Echo the command, then run it and fail on a non-zero exit code.
       print("+", " ".join(cmd), flush=True)
       subprocess.run(cmd, cwd=cwd, check=True)


   def parse_args():
       parser = argparse.ArgumentParser(description="Train a Piper voice model non-interactively.")
       parser.add_argument("--raw-dir", default="/mnt/data/input/raw")
       parser.add_argument("--work-dir", default="/mnt/data/work")
       parser.add_argument("--output-dir", default="/mnt/data/output")
       parser.add_argument("--voice-name", default="custom_voice")
       parser.add_argument("--espeak-voice", default="en-us")
       parser.add_argument("--sample-rate", type=int, default=22050)
       parser.add_argument("--segment-seconds", type=int, default=10)
       parser.add_argument("--whisper-model", default="turbo")
       parser.add_argument("--max-epochs", type=int, default=4000)
       parser.add_argument("--batch-size", type=int, default=32)
       parser.add_argument("--num-workers", type=int, default=8)
       parser.add_argument("--base-ckpt-url", default=BASE_CKPT_URL)
       parser.add_argument("--no-base-ckpt", action="store_true")
       parser.add_argument("--device", default="cuda")
       parser.add_argument("--piper-repo", default="/opt/piper1-gpl")
       return parser.parse_args()


   def collect_audio_files(raw_dir: Path):
       # Recursively find input audio files with a supported extension.
       exts = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}
       files = sorted(p for p in raw_dir.rglob("*") if p.suffix.lower() in exts)
       if not files:
           raise FileNotFoundError(f"No audio files found under {raw_dir}")
       return files


   def segment_audio(files, wav_dir: Path, segment_seconds: int, sample_rate: int):
       # Convert each input file to mono 16-bit PCM WAV and split it into fixed-length segments.
       wav_dir.mkdir(parents=True, exist_ok=True)
       for src in files:
           stem = src.stem.replace(" ", "_")
           out_pattern = wav_dir / f"{stem}_%04d.wav"
           run(
               [
                   "ffmpeg",
                   "-y",
                   "-i", str(src),
                   "-vn",
                   "-ac", "1",
                   "-ar", str(sample_rate),
                   "-c:a", "pcm_s16le",
                   "-f", "segment",
                   "-segment_time", str(segment_seconds),
                   str(out_pattern),
               ]
           )


   def transcribe_segments(wav_dir: Path, metadata_path: Path, whisper_model: str, device: str):
       # Transcribe every segment with Whisper and write "<segment name>|<transcript>" rows.
       model = whisper.load_model(whisper_model).to(device)
       wav_files = sorted(wav_dir.glob("*.wav"))
       if not wav_files:
           raise FileNotFoundError(f"No segmented wav files found under {wav_dir}")
       with metadata_path.open("w", encoding="utf-8", newline="") as f:
           writer = csv.writer(f, delimiter="|", lineterminator="\n")
           for wav_path in wav_files:
               result = model.transcribe(str(wav_path))
               transcript = " ".join(result["text"].strip().split())
               if transcript:
                   writer.writerow([wav_path.stem, transcript])


   def download_base_checkpoint(url: str, checkpoint_path: Path):
       checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
       print(f"+ download {url} -> {checkpoint_path}", flush=True)
       urllib.request.urlretrieve(url, checkpoint_path)


   def sanitize_checkpoint(checkpoint_path: Path):
       # Clear the hyperparameters stored in the checkpoint and save it back in place.
       checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
       checkpoint["hyper_parameters"] = {}
       torch.save(checkpoint, checkpoint_path)


   def latest_checkpoint(logs_dir: Path):
       # Return the most recently modified checkpoint written by the training run.
       checkpoints = sorted(logs_dir.glob("**/checkpoints/*.ckpt"), key=lambda p: p.stat().st_mtime)
       if not checkpoints:
           raise FileNotFoundError(f"No checkpoints found under {logs_dir}")
       return checkpoints[-1]


   def main():
       args = parse_args()
       raw_dir = Path(args.raw_dir)
       work_dir = Path(args.work_dir)
       output_dir = Path(args.output_dir)
       wav_dir = work_dir / "wav"
       metadata_path = work_dir / "metadata.csv"
       cache_dir = work_dir / "cache"
       config_path = work_dir / "config.json"
       base_ckpt_path = work_dir / "base.ckpt"
       logs_dir = Path(args.piper_repo) / "lightning_logs"

       work_dir.mkdir(parents=True, exist_ok=True)
       output_dir.mkdir(parents=True, exist_ok=True)

       # Prepare the dataset: segment the raw audio and transcribe the segments.
       audio_files = collect_audio_files(raw_dir)
       segment_audio(audio_files, wav_dir, args.segment_seconds, args.sample_rate)
       transcribe_segments(wav_dir, metadata_path, args.whisper_model, args.device)

       if not args.no_base_ckpt:
           download_base_checkpoint(args.base_ckpt_url, base_ckpt_path)
           sanitize_checkpoint(base_ckpt_path)

       # Train the voice with Piper.
       fit_cmd = [
           sys.executable, "-m", "piper.train", "fit",
           "--data.voice_name", args.voice_name,
           "--data.csv_path", str(metadata_path),
           "--data.audio_dir", str(wav_dir),
           "--model.sample_rate", str(args.sample_rate),
           "--data.espeak_voice", args.espeak_voice,
           "--data.cache_dir", str(cache_dir),
           "--data.config_path", str(config_path),
           "--data.batch_size", str(args.batch_size),
           "--data.num_workers", str(args.num_workers),
           "--trainer.log_every_n_steps", "1",
           "--trainer.max_epochs", str(args.max_epochs),
           "--trainer.accelerator", "gpu",
           "--trainer.devices", "1",
       ]
       if not args.no_base_ckpt:
           fit_cmd.extend(
               [
                   "--ckpt_path", str(base_ckpt_path),
                   "--weights_only", "true",
               ]
           )

       run(fit_cmd, cwd=args.piper_repo)

       # Export the latest checkpoint to ONNX.
       checkpoint_path = latest_checkpoint(logs_dir)
       output_model = output_dir / "model.onnx"
       output_config = output_dir / "model.onnx.json"

       run(
           [
               sys.executable, "-m", "piper.train.export_onnx",
               "--checkpoint", str(checkpoint_path),
               "--output-file", str(output_model),
           ],
           cwd=args.piper_repo,
       )

       # Copy the voice config next to the model; fall back to a plain copy without metadata.
       if config_path.exists():
           if output_config.exists():
               print(f"Config already exists at {output_config}, leaving as-is", flush=True)
           else:
               try:
                   shutil.copy2(config_path, output_config)
               except PermissionError:
                   try:
                       shutil.copyfile(config_path, output_config)
                   except PermissionError as exc:
                       print(
                           f"Warning: could not write config to {output_config}: {exc}. "
                           "Model export succeeded; continuing without output JSON copy.",
                           flush=True,
                       )

       print(f"Training complete. ONNX model: {output_model}", flush=True)


   if __name__ == "__main__":
       main()
   ```

   `app.py`:

   ```python theme={null}
   #!/usr/bin/env python3
   import subprocess
   import tempfile
   from pathlib import Path

   from fastapi import FastAPI, HTTPException
   from fastapi.responses import FileResponse
   from pydantic import BaseModel

   MODEL_PATH = Path("/mnt/data/output/model.onnx")
   CONFIG_PATH = Path("/mnt/data/output/model.onnx.json")

   app = FastAPI(title="Piper Voice Endpoint")


   class SynthesizeRequest(BaseModel):
       text: str


   @app.get("/health")
   def health():
       # "ok" is true only when the exported model file is present on the mounted volume.
       return {
           "ok": MODEL_PATH.exists(),
           "model": str(MODEL_PATH),
           "config": str(CONFIG_PATH),
       }


   @app.post("/synthesize")
   def synthesize(request: SynthesizeRequest):
       if not MODEL_PATH.exists():
           raise HTTPException(status_code=503, detail="model.onnx not found at /mnt/data/output")

       # Reserve a temporary file name for the generated audio.
       with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
           output_path = Path(tmp.name)

       try:
           # Pass the text to the piper CLI on stdin; piper writes the WAV file.
           subprocess.run(
               ["piper", "-m", str(MODEL_PATH), "--output_file", str(output_path)],
               input=request.text,
               text=True,
               check=True,
           )
           return FileResponse(output_path, media_type="audio/wav", filename="speech.wav")
       except subprocess.CalledProcessError as exc:
           raise HTTPException(status_code=500, detail=f"inference failed: {exc}") from exc
   ```

   `requirements.txt`:

   ```txt theme={null}
   datasets
   soundfile
   torch<2.6
   openai-whisper
   fastapi
   uvicorn[standard]
   piper-phonemize
   ```

   `sitecustomize.py`:

   ```python theme={null}
   import pathlib
   import torch.serialization

   torch.serialization.add_safe_globals([pathlib.PosixPath])
   ```

   `Dockerfile`:

   ```dockerfile theme={null}
   FROM nvidia/cuda:12.8.0-cudnn-runtime-ubuntu24.04

   ENV DEBIAN_FRONTEND=noninteractive
   ENV PIPER_REPO=/opt/piper1-gpl
   ENV PYTHONUNBUFFERED=1
   ENV PYTHONPATH=/app

   RUN apt-get update && apt-get install -y \
       python3 \
       python3-pip \
       python3-venv \
       python3-dev \
       ffmpeg \
       git \
       wget \
       curl \
       cmake \
       build-essential \
       ninja-build \
       espeak-ng \
       && rm -rf /var/lib/apt/lists/*

   WORKDIR /app

   COPY requirements.txt /app/requirements.txt

   RUN python3 -m pip install --upgrade pip setuptools wheel scikit-build && \
       python3 -m pip install -r /app/requirements.txt

   RUN git clone https://github.com/OHF-voice/piper1-gpl.git ${PIPER_REPO} && \
       python3 -m pip install ${PIPER_REPO} && \
       cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.so /tmp/espeakbridge.so && \
       cp /usr/local/lib/python3.10/dist-packages/piper/espeakbridge.pyi /tmp/espeakbridge.pyi && \
       python3 -m pip install -e ${PIPER_REPO}[train] && \
       cp /tmp/espeakbridge.so ${PIPER_REPO}/src/piper/espeakbridge.so && \
       cp /tmp/espeakbridge.pyi ${PIPER_REPO}/src/piper/espeakbridge.pyi && \
       ${PIPER_REPO}/build_monotonic_align.sh

   COPY train.py /app/train.py
   COPY app.py /app/app.py
   COPY sitecustomize.py /app/sitecustomize.py

   EXPOSE 8000

   CMD ["python3", "/app/train.py"]
   ```

   To verify that all files are present, run `ls` or `tree`.

### Build and push the Docker image

On the VM:

1. [Install Docker](https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository).
2. Install additional packages and prepare Docker for building the image:

   ```bash theme={null}
   sudo apt-get update
   sudo apt-get install -y git curl wget unzip python3 python3-venv python3-pip ca-certificates
   sudo usermod -aG docker "$USER"
   newgrp docker
   ```
3. Check that the Docker daemon is running:

   ```bash theme={null}
   docker ps
   ```

   If Docker is running, this command returns a table of containers, which can be empty. If you don't see the table, the daemon isn't running; [launch it](https://docs.docker.com/engine/daemon/start/).
4. [Create an account](https://docs.docker.com/accounts/create-account/) in Docker Hub. Use it for authentication when you push your image to a repository.
5. [Create a public repository](https://docs.docker.com/get-started/docker-concepts/the-basics/what-is-a-registry/) in Docker Hub. You will push your Docker image there.
6.
   In the `~/piper-nebius` directory, build the image:

   ```bash theme={null}
   docker build -t <username>/<repository>:piper-nebius-ui-tutorial .
   ```

   In the command, specify your public repository. For example, `myrepository/tts:piper-nebius-ui-tutorial`.

   This operation can take several minutes to complete.
7. Authenticate in Docker Hub:

   ```bash theme={null}
   docker login -u <username>
   ```

   Specify your Docker Hub username and enter your password when prompted.
8. Push the image to the repository:

   ```bash theme={null}
   docker push <username>/<repository>:piper-nebius-ui-tutorial
   ```

   This operation can take several minutes to complete.

### Create and deploy the ONNX model by using a Serverless AI job and endpoint

1. Create a fine-tuning job that generates the ONNX model:

   1. In the web console, go to **AI Services** → **Jobs**.
   2. Click **Create job**.
   3. On the page that opens, specify the following job parameters:

      * **Image path**: `<username>/<repository>:piper-nebius-ui-tutorial`. Set the image that you've pushed to the Docker repository.
      * **Entrypoint command**:

        ```bash theme={null}
        python3 /app/train.py --raw-dir /mnt/data/input/raw --work-dir /tmp/work --output-dir /mnt/data/output --voice-name demo_voice --no-base-ckpt --max-epochs 50 --batch-size 4 --num-workers 0
        ```

        * `--raw-dir /mnt/data/input/raw`: Matches the uploaded files.
        * `--work-dir /tmp/work`: Properly saves files to Object Storage.
        * `--output-dir /mnt/data/output`: Saves the exported ONNX model to the mounted volume.
        * `--no-base-ckpt`: Helps avoid checkpoint compatibility problems in the dataset path.
        * `--batch-size 4 --num-workers 0`: Use standard settings for a small dataset.
      * **Computing resources**: Keep the predefined settings.
      * **Mount volumes**: Bucket.
      * **Mount path**: `/mnt/data`. After that, click **Attach bucket** and then select the bucket created earlier.
   4. Click **Create**.
   Run the following command:

   ```bash theme={null}
   nebius ai job create \
     --name my-job \
     --image <username>/<repository>:piper-nebius-ui-tutorial \
     --container-command python3 \
     --args "/app/train.py --raw-dir /mnt/data/input/raw --work-dir /tmp/work --output-dir /mnt/data/output --voice-name demo_voice --no-base-ckpt --max-epochs 50 --batch-size 4 --num-workers 0" \
     --volume "<bucket_ID>:/mnt/data" \
     --platform gpu-l40s-a \
     --preset 1gpu-8vcpu-32gb \
     --disk-size 250Gi \
     --subnet-id <subnet_ID>
   ```

   To get the bucket ID, run `nebius storage bucket list`. For details about the subnet ID, see [How to get a subnet ID](/vpc/networking/resources#how-to-get-a-subnet-id).

   After the job reaches the `Complete` status, the files `output/model.onnx` and `output/model.onnx.json` are created in the bucket. These files contain the produced model.

2. Deploy the model on a Serverless AI endpoint:

   1. In the web console, go to **AI Services** → **Endpoints**.
   2. Click **Create endpoint**.
   3. On the page that opens, specify the following endpoint parameters:

      * **Image path**: `<username>/<repository>:piper-nebius-ui-tutorial`. Set the image that you've pushed to the Docker repository.
      * **Ports**: `8000`.
      * **Entrypoint command**:

        ```bash theme={null}
        uvicorn app:app --host 0.0.0.0 --port 8000
        ```
      * **Computing resources**: Keep the predefined settings.
      * **Mount volumes**: Bucket.
      * **Mount path**: `/mnt/data`. After that, click **Attach bucket** and then select the bucket created earlier.
      * **IP address**: Public static IP.
   4. Click **Create**.

   Run the following command:

   ```bash theme={null}
   nebius ai endpoint create \
     --name my-endpoint \
     --image <username>/<repository>:piper-nebius-ui-tutorial \
     --container-port 8000 \
     --container-command uvicorn \
     --args "app:app --host 0.0.0.0 --port 8000" \
     --volume "<bucket_ID>:/mnt/data" \
     --subnet-id <subnet_ID> \
     --public
   ```

   Wait until the endpoint reaches the `Running` status.

### Synthesize speech

1. Get the endpoint IP address:

   1. In the web console, go to **AI Services** → **Endpoints**.
   2. Open the page of the deployed endpoint.
   3.
      Copy the IP address from the **Network** → **Public endpoints** field.

   Run the following command:

   ```bash theme={null}
   nebius ai endpoint get <endpoint_ID> \
     --format json | jq -r '.status.instances[0].public_ip'
   ```

   To get the endpoint ID, run `nebius ai endpoint list`.
2. To verify the endpoint health, run a health check:

   ```bash theme={null}
   curl http://<endpoint_IP_address>:8000/health
   ```

   Expected output:

   ```text theme={null}
   {"ok":true,"model":"/mnt/data/output/model.onnx","config":"/mnt/data/output/model.onnx.json"}
   ```

   The `"ok":true` field shows that the endpoint is healthy.
3. To synthesize speech, call the endpoint:

   ```bash theme={null}
   curl -X POST "http://<endpoint_IP_address>:8000/synthesize" \
     -H "Content-Type: application/json" \
     -d '{"text":"Hello world"}' \
     --output speech.wav
   ```

   The request saves the synthesized `Hello world` phrase to the `speech.wav` file.

   The audio quality can be low because the model was trained on only five samples. This is expected: the tutorial only demonstrates the speech synthesis workflow. To improve the audio quality, train the model on a larger dataset.

## How to delete the created resources

Some of the created resources are chargeable. If you no longer need them, delete them so that Nebius AI Cloud doesn't charge for them:

* [CPU-only VM](/compute/virtual-machines/delete).
* [Boot disk](/compute/storage/manage#how-to-delete-a-volume) attached to the VM.
* [Bucket](/object-storage/buckets/manage#how-to-delete-buckets).
* [Endpoint](/serverless/endpoints/manage#how-to-delete-an-endpoint). When you delete an endpoint, Serverless AI automatically deletes the endpoint VM and container (boot) disk.
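
While the endpoint is still running, the `/synthesize` call from the **Synthesize speech** section can also be scripted instead of run with `curl`. The following is a minimal sketch that uses only the Python standard library. It assumes that the endpoint serves the tutorial's `app.py` over plain HTTP on port 8000 and returns a WAV file. The helper names `build_synthesize_request`, `looks_like_wav` and `synthesize`, and the script name `tts_client.py`, are hypothetical and not part of the tutorial files.

```python
import json
import sys
import urllib.request


def build_synthesize_request(base_url: str, text: str) -> urllib.request.Request:
    """Build the POST request with the JSON body that the /synthesize route expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/synthesize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def looks_like_wav(data: bytes) -> bool:
    """Check the RIFF/WAVE magic bytes at the start of a WAV file."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def synthesize(base_url: str, text: str, out_path: str, timeout: float = 60.0) -> int:
    """Send the text to the endpoint, save the audio, and return the number of bytes written."""
    request = build_synthesize_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()
    if not looks_like_wav(audio):
        raise ValueError("The response does not look like a WAV file")
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)


if __name__ == "__main__":
    # Assumed usage: python3 tts_client.py http://<endpoint_IP_address>:8000 "Hello world"
    if len(sys.argv) == 3:
        size = synthesize(sys.argv[1], sys.argv[2], "speech.wav")
        print(f"Saved speech.wav ({size} bytes)")
```

The WAV header check is a design choice for this sketch: it makes the script fail early if the endpoint returns a JSON error body instead of audio.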