With Serverless AI, you can translate and dub a video into another language. To do so, create a Docker image and run a fine-tuning job based on it. The job converts audio to text, translates the text and creates a dubbed video.

Costs

Nebius AI Cloud charges you for the following billing items:

Steps

Prepare infrastructure

Create resources in the eu-north1 region. The most suitable platform for Serverless AI jobs and endpoints, NVIDIA® L40S PCIe with Intel Ice Lake, is only available in eu-north1. All the resources must be located in the same project.
  1. Create a CPU-only VM. You build the Docker image on this VM so that the image is built on a Linux operating system (OS). If you build the image on a non-Linux OS, the image architecture will be incompatible with Serverless AI, and the fine-tuning job will fail. Configure SSH access to the VM so that you can connect to it later.
    1. In the web console, go to Compute → Virtual machines.
    2. Click Create virtual machine.
    3. On the page that opens, set the following VM configuration:
      • Computing resources: Without GPU.
      • Platform: Non-GPU AMD EPYC Genoa.
      • Preset: 4 CPUs — 16 GiB RAM.
      • Boot disk operating system: Ubuntu 24.04 LTS.
      • Boot disk size: At least 100 GiB.
      • Public IP address: Auto assign dynamic IP.
      • Username and SSH key: Configure access credentials.
    4. Click Create VM.
  2. Create a bucket to store fine-tuning artifacts.
    1. In the web console, go to Storage → Object Storage.
    2. Click Create bucket.
    3. In the Maximum size field, select Unlimited. Leave the other settings at their default values.
    4. Click Create bucket.

Prepare files for the Docker image

  1. To connect to the VM, get its public IP address:
    1. In the web console, go to Compute → Virtual machines.
    2. Open the VM page.
    3. In Network → Public IPv4, copy the address.
  2. Connect to the VM by using SSH:
    ssh <username>@<IP_address>
    
    Specify the username that you set when creating the VM.
  3. On the VM, create a working directory:
    mkdir ~/video-translation-nebius
    cd ~/video-translation-nebius
    
  4. In this directory, create the following four files for building the Docker image: requirements.txt, process_video.py, app.py and Dockerfile.
    requirements.txt:
    # Core API
    fastapi==0.115.12
    uvicorn[standard]==0.30.6
    requests==2.32.3
    
    # ASR
    openai-whisper==20250625
    
    # Translation
    transformers==4.48.3
    accelerate==1.6.0
    sentencepiece==0.2.0
    
    # Text to speech
    TTS==0.22.0
    
    # Media pipeline
    moviepy==1.0.3
    
    # Torch stack
    torch==2.5.1
    torchaudio==2.5.1
    
    process_video.py:
    #!/usr/bin/env python3
    import argparse
    import subprocess
    from pathlib import Path
    import requests
    import torch
    import whisper
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    from TTS.api import TTS
    
    def run(cmd):
        print("+", " ".join(cmd), flush=True)
        subprocess.run(cmd, check=True)
    
    def download(url: str, dst: Path):
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with dst.open("wb") as f:
                for chunk in r.iter_content(chunk_size=1024 * 1024):
                    if chunk:
                        f.write(chunk)
    
    def split_text(text: str, max_chars: int = 700):
        text = " ".join(text.split())
        chunks, cur = [], ""
        for sent in text.split(". "):
            s = sent.strip()
            if not s:
                continue
            s = s + ("" if s.endswith(".") else ".")
            if len(cur) + len(s) + 1 > max_chars:
                if cur:
                    chunks.append(cur.strip())
                cur = s
            else:
                cur = (cur + " " + s).strip()
        if cur:
            chunks.append(cur)
        return chunks
    
    def translate_text(text: str, target_lang: str):
        model_name = "jbochi/madlad400-3b-mt"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto",
        )
        out = []
        for chunk in split_text(text):
            prompt = f"<2{target_lang}> {chunk}"
            ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)
            gen = model.generate(**ids, max_new_tokens=512)
            out.append(tokenizer.decode(gen[0], skip_special_tokens=True))
        return " ".join(out).strip()
    
    def main():
        p = argparse.ArgumentParser()
        p.add_argument("--url", required=True)
        p.add_argument("--target-lang", default="de")
        p.add_argument("--work-dir", default="/tmp/work")
        p.add_argument("--output-dir", default="/mnt/data/output")
        p.add_argument("--tts-model", default="tts_models/de/thorsten/vits")
        args = p.parse_args()
    
        work = Path(args.work_dir)
        out = Path(args.output_dir)
        work.mkdir(parents=True, exist_ok=True)
        out.mkdir(parents=True, exist_ok=True)
    
        in_mp4 = work / "input.mp4"
        asr_wav = work / "asr.wav"
        dub_wav = work / "dub.wav"
        tmp_out_mp4 = work / "output_video_with_audio.mp4"
        out_mp4 = out / "output_video_with_audio.mp4"
        transcript_txt = out / "transcript.txt"
        translated_txt = out / "translated.txt"
    
        download(args.url, in_mp4)
    
        run(["ffmpeg", "-y", "-i", str(in_mp4), "-vn", "-ac", "1", "-ar", "16000", str(asr_wav)])
    
        device = "cuda" if torch.cuda.is_available() else "cpu"
        asr = whisper.load_model("turbo", device=device)
        r = asr.transcribe(str(asr_wav))
        source_text = " ".join(r["text"].split())
        transcript_txt.write_text(source_text, encoding="utf-8")
    
        translated = translate_text(source_text, args.target_lang)
        translated_txt.write_text(translated, encoding="utf-8")
    
        tts = TTS(model_name=args.tts_model, gpu=torch.cuda.is_available())
        tts.tts_to_file(text=translated, file_path=str(dub_wav))
    
        run([
            "ffmpeg", "-y",
            "-i", str(in_mp4),
            "-i", str(dub_wav),
            "-map", "0:v:0",
            "-map", "1:a:0",
            "-c:v", "copy",
            "-c:a", "aac",
            "-shortest",
            str(tmp_out_mp4),
        ])
    
        # IMPORTANT: write locally first, then copy to mounted Object Storage
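        # Copy in 1 MiB chunks so that the whole video is never held in memory
        with tmp_out_mp4.open("rb") as src, out_mp4.open("wb") as dst:
            while chunk := src.read(1024 * 1024):
                dst.write(chunk)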
    
        print(f"Done: {out_mp4}", flush=True)
    
    if __name__ == "__main__":
        main()
    
    app.py:
    from fastapi import FastAPI, HTTPException
    from fastapi.responses import FileResponse
    from pathlib import Path
    
    app = FastAPI()
    OUT = Path("/mnt/data/output")
    
    @app.get("/health")
    def health():
        return {"ok": True, "output_dir": str(OUT), "exists": OUT.exists()}
    
    @app.get("/outputs")
    def outputs():
        if not OUT.exists():
            return {"files": []}
        return {"files": sorted([p.name for p in OUT.iterdir() if p.is_file()])}
    
    @app.get("/download")
    def download():
        f = OUT / "output_video_with_audio.mp4"
        if not f.exists():
            raise HTTPException(status_code=404, detail="output_video_with_audio.mp4 not found")
        return FileResponse(str(f), media_type="video/mp4", filename=f.name)
    
    Dockerfile:
    FROM nvidia/cuda:13.1.2-cudnn-runtime-ubuntu24.04
    
    ENV DEBIAN_FRONTEND=noninteractive
    WORKDIR /app
    
    RUN apt-get update && apt-get install -y --no-install-recommends \
        software-properties-common curl git ffmpeg libgl1 libglib2.0-0 espeak-ng \
        && add-apt-repository ppa:deadsnakes/ppa -y \
        && apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3.11-venv python3.11-dev \
        && rm -rf /var/lib/apt/lists/*
    
    ENV VENV_PATH=/opt/venv
    RUN python3.11 -m venv $VENV_PATH
    ENV PATH="$VENV_PATH/bin:$PATH"
    
    COPY requirements.txt /app/requirements.txt
    RUN pip install --upgrade pip setuptools wheel && \
        pip install --no-cache-dir -r /app/requirements.txt
    
    COPY process_video.py /app/process_video.py
    COPY app.py /app/app.py
    
    EXPOSE 8000
    CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
    
    To verify that all files are present, run ls or tree.
  5. Make the process_video.py file executable:
    chmod +x process_video.py
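
As an alternative to checking the directory by eye, the verification in steps 4 and 5 can be sketched in Python. This is a minimal sketch, not part of the tutorial's image: the helper names missing_files and is_executable are hypothetical, and the expected file names are assumed from the COPY lines of the Dockerfile above.

```python
import stat
from pathlib import Path

# Assumption: file names inferred from the Dockerfile's COPY lines,
# plus the Dockerfile itself.
EXPECTED_FILES = ("requirements.txt", "process_video.py", "app.py", "Dockerfile")


def missing_files(directory: Path) -> list[str]:
    """Return the expected build files that are absent from `directory`."""
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


def is_executable(path: Path) -> bool:
    """Return True if `path` is a file with the owner execute bit set."""
    return path.is_file() and bool(path.stat().st_mode & stat.S_IXUSR)


if __name__ == "__main__":
    workdir = Path.home() / "video-translation-nebius"
    print("Missing files:", missing_files(workdir) or "none")
    print("process_video.py executable:", is_executable(workdir / "process_video.py"))
```

If the script reports missing files, create them before you build the image. The executable check only inspects the owner permission bit, which chmod +x sets.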
    

Build and push the Docker image

On the VM:
  1. Install Docker.
  2. Install additional packages and prepare Docker for building the image:
    sudo apt-get update
    sudo apt-get install -y docker.io git curl wget unzip python3 python3-pip ca-certificates
    sudo usermod -aG docker "$USER"
    newgrp docker
    
  3. Check that the Docker daemon is running:
    docker ps
    
    If Docker is running, this command returns a table of containers (the table can be empty). If the command returns an error instead, the daemon isn’t running. Start it, for example with sudo systemctl start docker.
  4. Create an account in Docker Hub. Use it for authentication when you push your image to a repository.
  5. Create a public repository in Docker Hub. You will push your Docker image there.
  6. In the ~/video-translation-nebius directory, build the image:
    docker build -t <repository>/<image>:video-translation-nebius .
    
    In the command, specify your public repository and image name, for example myrepository/dubbing:video-translation-nebius. This operation can take several minutes to complete.
  7. Authenticate in Docker Hub:
    docker login -u <username>
    
    Specify your Docker Hub username and enter your password when prompted.
  8. Push the image to the repository:
    docker push <repository>/<image>:video-translation-nebius
    
    This operation can take several minutes to complete.
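
The build and push steps use the same image reference, so it can help to construct it once. The following is a minimal sketch under stated assumptions: the helpers image_ref, build_and_push_commands and run_all are hypothetical, the repository and image names come from the example above, and docker login is assumed to have been run already because it prompts for a password. By default the script only prints the commands.

```python
import shlex
import subprocess

TAG = "video-translation-nebius"  # tag used throughout this tutorial


def image_ref(repository: str, image: str, tag: str = TAG) -> str:
    """Build the <repository>/<image>:<tag> reference used by build and push."""
    return f"{repository}/{image}:{tag}"


def build_and_push_commands(repository: str, image: str, context: str = ".") -> list[list[str]]:
    """Return the docker build and docker push commands for one image."""
    ref = image_ref(repository, image)
    return [
        ["docker", "build", "-t", ref, context],
        ["docker", "push", ref],
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: prints the commands without calling Docker.
    run_all(build_and_push_commands("myrepository", "dubbing"))
```

To execute the commands, call run_all with dry_run=False from the ~/video-translation-nebius directory.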

Create a dubbed video

  1. Create a fine-tuning job that transcribes and translates the audio track and dubs the video:
    1. In the web console, go to AI Services → Jobs.
    2. Click Create job.
    3. On the page that opens, specify the following job parameters:
      • Image path: <repository>/<image>:video-translation-nebius. Set the image that you’ve pushed to the Docker repository.
      • Entrypoint command:
        python3 /app/process_video.py --url https://archive.org/download/BigBuckBunny_328/BigBuckBunny_512kb.mp4 --target-lang de --work-dir /tmp/work --output-dir /mnt/data/output
        
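        The --url parameter contains a link to the video being processed. The --target-lang parameter specifies the language that the audio track is translated into. The default text-to-speech model in process_video.py is German (tts_models/de/thorsten/vits), so if you change --target-lang, also pass a matching model in --tts-model.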
      • Computing resources and Container disk: Keep the predefined settings.
      • Mount volumes: Bucket.
      • Mount path: /mnt/data. After that, click Attach bucket and then select the bucket created earlier.
    4. Click Create.
    While the job is running, you can check its logs on the Logs tab of the job’s page. The logs show how the models process the audio, transcribe the speech and translate the text.
    After the job reaches the Complete status, the following files are created in the bucket:
    • output/transcript.txt: Speech that the model recognized in the video.
    • output/translated.txt: Translation of this speech.
    • output/output_video_with_audio.mp4: Dubbed video.
    The speech-to-text (STT) quality in this tutorial is not production-level. Accuracy may be low with short sample videos and default model settings. That is expected because the tutorial’s purpose is only to showcase the process of STT, video translation and dubbing. To improve the quality, use stronger STT or translation models, split audio into smaller segments and add audio post-processing.
  2. Download the dubbed video:
    1. Open the bucket’s page and go to the output directory.
    2. In the line of the output/output_video_with_audio.mp4 object, click ⋮ → Download.
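
To see how the entrypoint command maps to the script's settings, the argument parser from process_video.py can be reproduced on its own. This is a minimal sketch: build_parser copies the add_argument calls from the script, while parse_entrypoint and translation_prompt are hypothetical helpers added for illustration. The prompt format follows the f-string used in translate_text.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Same arguments and defaults as main() in process_video.py."""
    p = argparse.ArgumentParser()
    p.add_argument("--url", required=True)
    p.add_argument("--target-lang", default="de")
    p.add_argument("--work-dir", default="/tmp/work")
    p.add_argument("--output-dir", default="/mnt/data/output")
    p.add_argument("--tts-model", default="tts_models/de/thorsten/vits")
    return p


def parse_entrypoint(command: str) -> argparse.Namespace:
    """Parse a full entrypoint string, skipping 'python3' and the script path."""
    argv = shlex.split(command)[2:]
    return build_parser().parse_args(argv)


def translation_prompt(target_lang: str, chunk: str) -> str:
    """Prefix a text chunk with the target-language token, as translate_text does."""
    return f"<2{target_lang}> {chunk}"


if __name__ == "__main__":
    url = "https://archive.org/download/BigBuckBunny_328/BigBuckBunny_512kb.mp4"
    args = parse_entrypoint(f"python3 /app/process_video.py --url {url} --target-lang fr")
    # Only the language changed; the TTS model silently stays at its German default.
    print(args.target_lang, args.tts_model)
    print(translation_prompt(args.target_lang, "Hello."))
```

The example shows that changing only --target-lang leaves --tts-model at its German default, so the translated text would be spoken by a German voice unless a matching model is passed as well.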

How to delete the created resources

Some of the created resources are chargeable. If you no longer need them, delete them so that Nebius AI Cloud stops charging for them: