Serverless AI is a Nebius AI Cloud service for running containerized AI workloads without creating or operating virtual machines or clusters. To run a workload in Serverless AI, you choose how to deploy it (as an interactive endpoint or as a non-interactive job), specify the path to your container, and select the computing and storage resources that the workload requires. Serverless AI provisions the resources, manages their lifecycle (endpoints and jobs run in Compute containers on top of VMs), and bills you per second based on usage, so you can focus on interacting with the workload and getting results from it. To catch and handle errors or unexpected outcomes, use the observability and debugging tools that Serverless AI provides.

Endpoints and jobs

You can deploy your workload as an endpoint that listens for requests and returns results immediately, or as a job that runs in the background and quits after completing its task. Here is a comparison of endpoints and jobs at a glance:
| | Endpoint | Job |
| --- | --- | --- |
| Workflow | Interactive; listens for requests until you terminate it | Non-interactive; terminates upon task completion or timeout |
| Stop/start | Yes | No |
| Public URL for requests | Yes | No |
| Typical lifetime | Hours to days | Minutes to days |
| Use cases | Persistent workloads: serving and A/B-testing models, real-time inference | Batch workloads: pre-processing data, training and fine-tuning models, batch inference and model evaluation, scientific simulations |
| Guides | Getting started with endpoints | Getting started with jobs |
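Because an endpoint exposes a public URL, you interact with it over plain HTTP. The sketch below builds (but does not send) a JSON POST request to an endpoint; the URL, token, and request body are placeholders for illustration, not values defined by Serverless AI:

```python
import json
import urllib.request

# Hypothetical values: substitute your endpoint's actual public URL and token.
ENDPOINT_URL = "https://example.invalid/v1/infer"

def build_inference_request(payload: dict, token: str = "YOUR_TOKEN") -> urllib.request.Request:
    """Prepare a JSON POST to the endpoint's public URL (without sending it)."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request({"prompt": "Hello"})
print(req.get_method(), req.full_url)
```

To actually call a running endpoint, pass the request to `urllib.request.urlopen` (or use any HTTP client) with your real URL and credentials.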

Observability and debugging

Each Serverless AI endpoint and job has a status that indicates the current stage of its lifecycle. If your endpoint or job fails, you can view its logs. All endpoints and jobs also provide a wide range of GPU and vCPU utilization metrics, sourced from the Compute service and visualized in the web console. For more details, see Monitoring endpoints and jobs.
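Since a job terminates on completion or timeout, a common pattern is to poll its status until it reaches a terminal state. The sketch below shows such a loop with exponential backoff; `get_status` stands in for whatever API or CLI call returns the status, and the status names are illustrative, not the service's actual vocabulary:

```python
import time

# Illustrative terminal statuses; check the service's documentation for
# the real status names.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_completion(get_status, interval=1.0, max_interval=30.0, timeout=3600.0):
    """Poll get_status() with exponential backoff until a terminal status or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # back off between polls
    raise TimeoutError("job did not reach a terminal status in time")

# Example with a stubbed status source that finishes on the third poll.
statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
print(wait_for_completion(lambda: next(statuses), interval=0.01))  # SUCCEEDED
```

Backing off between polls keeps long-running jobs from generating a constant stream of status requests.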

Pricing and quotas

Serverless AI follows Compute billing and quota rules. Billing is usage-based: the service charges you per second for the computing and storage resources that you allocate to endpoints and jobs. Only active endpoints and jobs are billed and count towards quotas, which helps you avoid unnecessary costs compared to always-on infrastructure. For more details, see Pricing and quotas.
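Per-second billing is simple arithmetic: the charge is the hourly rate divided by 3600, times the seconds the workload was active. The rate below is a made-up figure for illustration, not Nebius's actual price list:

```python
def cost(hourly_rate: float, seconds: float) -> float:
    """Usage-based charge: per-second rate derived from an hourly rate."""
    return hourly_rate / 3600 * seconds

# A job that runs for 90 minutes on a hypothetical $2.50/hour resource:
job_cost = cost(2.50, 90 * 60)
print(f"${job_cost:.2f}")  # $3.75
```

An inactive endpoint or a completed job accrues no charge, since only active time enters the calculation.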