vllm/vllm-openai:latest image. vLLM automatically downloads the model from Hugging Face when the endpoint starts. The container exposes an OpenAI-compatible /v1/chat/completions API.
Costs
Nebius AI Cloud charges you for the Compute virtual machines that run the endpoint.
Prerequisites
1. Install and configure the Nebius AI Cloud CLI to work in the project in the eu-north1 region.
2. Install jq to parse JSON outputs in this tutorial.
3. Make sure that you are in a group that has the admin role within your tenant; for example, the default admins group.
4. In the Quota section of the web console, check that you have quotas on the following resources in the region you use:
   - NVIDIA® L40S for regular VMs without reservations, under Compute: at least one GPU available.
   - Number of virtual machines, under Compute: at least one VM available.
   - Total number of allocations, under Virtual Private Cloud: at least one allocation available.
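jq is available from most package managers. For example, on an apt-based system (an assumption; use your platform's package manager otherwise):

```shell
# Install jq on Debian/Ubuntu (assumption: apt-based system; use brew, dnf, etc. elsewhere)
sudo apt-get update && sudo apt-get install -y jq

# Verify the installation
jq --version
```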
Steps
Create an endpoint
1. Create a token for authorization and save it to an environment variable.
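For example, using the Nebius CLI (assumption: the `nebius iam get-access-token` subcommand; check `nebius iam --help` for your CLI version):

```shell
# Assumption: subcommand name; verify with `nebius iam --help`
export IAM_TOKEN=$(nebius iam get-access-token)
```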
2. Save the model ID. The model ID is a Hugging Face model identifier. Use a small model if you want a faster startup.
   You can use any compatible model from Hugging Face. Replace Qwen/Qwen3-0.6B with the model ID of your choice.
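For example, with the Qwen/Qwen3-0.6B model named in this tutorial:

```shell
# Hugging Face model identifier; swap in any compatible model
export MODEL_ID="Qwen/Qwen3-0.6B"
```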
3. Get a subnet ID (for example, the first subnet in the project) and save it to an environment variable.
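A sketch of this step, assuming the `nebius vpc subnet list` subcommand and the JSON output shape shown below (both are assumptions; inspect the raw output of your CLI version first):

```shell
# Assumptions: subcommand name, --format json flag, and JSON layout
export SUBNET_ID=$(nebius vpc subnet list --format json | jq -r '.items[0].metadata.id')
echo "${SUBNET_ID}"
```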
4. Create an endpoint.
Check the endpoint status
1. Save the endpoint ID to an environment variable.
2. Check the status of the endpoint. Wait until the status is Running.
Test the endpoint
1. Get the endpoint IP address and save it to an environment variable.
2. Test the endpoint by listing available models:
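Assuming vLLM serves on its default port 8000 and ENDPOINT_IP was set in the previous step, the OpenAI-compatible models endpoint can be queried with curl:

```shell
# List models served by the endpoint (assumption: default vLLM port 8000)
curl -s "http://${ENDPOINT_IP}:8000/v1/models"
```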
3. Send a simple chat request to the model:
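A minimal chat request against the OpenAI-compatible API (assumptions: default vLLM port 8000; ENDPOINT_IP and MODEL_ID were set in earlier steps):

```shell
# Send one user message to the /v1/chat/completions endpoint
curl -s "http://${ENDPOINT_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_ID}"'",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```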
4. Send a request with more stable (less random) output and extract just the response text:
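One way to make the output more stable is to set the temperature to 0, then pull only the message text out with jq (a sketch; the response shape follows the OpenAI chat-completions format, same assumptions on port and variables as above):

```shell
# temperature 0 makes sampling greedy, so repeated runs give near-identical answers
curl -s "http://${ENDPOINT_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_ID}"'",
        "messages": [{"role": "user", "content": "Name three prime numbers."}],
        "temperature": 0
      }' | jq -r '.choices[0].message.content'
```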
How to delete the created resources
The endpoint and its computing resources are chargeable. If you don’t need the endpoint, delete it so Nebius AI Cloud doesn’t charge you for it.