Serverless AI is a Nebius AI Cloud service for running containerized AI workloads without creating or operating virtual machines or clusters. To run a workload in Serverless AI, you choose how to deploy it (as an interactive endpoint or as a non-interactive job), specify the path to your container, and select the computing and storage resources that the workload requires. Serverless AI provisions the resources, manages their lifecycle (endpoints and jobs run in Compute containers on top of VMs), and bills you per second based on usage, so you can focus on interacting with the workload and getting results from it. To catch and handle errors or unexpected outcomes, use the observability and debugging tools that Serverless AI provides.

Endpoints and jobs

You can deploy your workload as an endpoint that listens for requests and returns results immediately, or as a job that runs in the background and quits after completing its task. Here is a comparison of endpoints and jobs at a glance:
| | Endpoint | Job |
| --- | --- | --- |
| Workflow | Interactive; listens for requests until you terminate it | Non-interactive; terminates upon task completion or timeout |
| Stop/start | Yes | No |
| Public URL for requests | Yes | No |
| Typical lifetime | Hours to days | Minutes to days |
| Use cases | Persistent workloads: serving and A/B-testing models, real-time inference | Batch workloads: pre-processing data, training and fine-tuning models, batch inference and model evaluation, scientific simulations |
| Guides | Getting started with endpoints | Getting started with jobs |
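Because an endpoint exposes a public URL, you interact with it over plain HTTP. The sketch below builds (but does not send) a JSON POST request to an endpoint; the URL, token, and request body are placeholders for illustration, not values defined by Serverless AI:

```python
import json
import urllib.request

# Hypothetical values: substitute your endpoint's actual public URL and token.
ENDPOINT_URL = "https://example.invalid/v1/infer"

def build_inference_request(payload: dict, token: str = "YOUR_TOKEN") -> urllib.request.Request:
    """Prepare a JSON POST to the endpoint's public URL (without sending it)."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request({"prompt": "Hello"})
print(req.get_method(), req.full_url)
```

To actually call a running endpoint, pass the request to `urllib.request.urlopen` (or use any HTTP client) with your real URL and credentials.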

Observability and debugging

Each Serverless AI endpoint and job has a status that indicates the current stage of its lifecycle. If your endpoint or job fails, you can view its logs. All endpoints and jobs also provide a wide range of GPU and vCPU utilization metrics, sourced from the Compute service and visualized in the web console. For more details, see Monitoring endpoints and jobs.
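Since a job terminates on completion or timeout, a common pattern is to poll its status until it reaches a terminal state. The sketch below shows such a loop with exponential backoff; `get_status` stands in for whatever API or CLI call returns the status, and the status names are illustrative, not the service's actual vocabulary:

```python
import time

# Illustrative terminal statuses; check the service's documentation for
# the real status names.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_completion(get_status, interval=1.0, max_interval=30.0, timeout=3600.0):
    """Poll get_status() with exponential backoff until a terminal status or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # back off between polls
    raise TimeoutError("job did not reach a terminal status in time")

# Example with a stubbed status source that finishes on the third poll.
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_completion(lambda: next(statuses), interval=0.01))  # SUCCEEDED
```

Backing off between polls keeps long-running jobs from generating a constant stream of status requests.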

Pricing and quotas

Serverless AI follows Compute billing and quota rules. Billing is usage-based: the service charges you per second for the computing and storage resources that you allocate to endpoints and jobs. Only active endpoints and jobs are billed and count towards quotas, which helps you avoid unnecessary costs compared to always-on infrastructure. For more details, see Pricing and quotas.
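Per-second billing is simple arithmetic: the charge is the hourly rate divided by 3600, times the seconds the workload was active. The rate below is a made-up figure for illustration, not Nebius's actual price list:

```python
def cost(hourly_rate: float, seconds: float) -> float:
    """Usage-based charge: per-second rate derived from an hourly rate."""
    return hourly_rate / 3600 * seconds

# A job that runs for 90 minutes on a hypothetical $2.50/hour resource:
job_cost = cost(2.50, 90 * 60)
print(f"${job_cost:.2f}")  # $3.75
```

An inactive endpoint or a completed job accrues no charge, since only active time enters the calculation.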