Serverless AI lets you deploy and manage endpoints without handling infrastructure yourself. With endpoints, you can create an OpenAI-compatible model backend in a few minutes.
This tutorial shows how to prepare your environment, create your first endpoint with an open-source large language model (LLM), and send a chat request.
The endpoint is based on the vllm/vllm-openai container image. vLLM automatically downloads the model from Hugging Face when the endpoint starts, and the container exposes an OpenAI-compatible /v1/chat/completions API.
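Before the setup steps, it may help to see the shape of the response this API returns. The JSON below is illustrative (a typical OpenAI-style chat completion, not output captured from a live endpoint); the jq filter is the same one the test steps later in this tutorial use:

```shell
# Illustrative response from /v1/chat/completions (shape only; the values
# are examples, not output from a real endpoint).
RESPONSE='{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello Nebius AI"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 4, "total_tokens": 16}
}'

# Extract the assistant reply, exactly as the test steps below do:
echo "$RESPONSE" | jq -r '.choices[0].message.content'
```

The `usage` object reports token accounting per request, which is useful when you compare models or prompt sizes.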
Costs
Nebius AI Cloud charges you for the Compute virtual machines that back the endpoint.
Prerequisites
- Make sure that you are in a group that has the admin role within your tenant; for example, the default admins group. You can check this in the Administration → IAM section of the web console.
- On the Administration → Limits → Quotas page of the web console, check that you have quotas on the following resources in the region you use:
  - NVIDIA® L40S for regular VMs without reservations (under Compute): at least one GPU available.
  - Number of virtual machines (under Compute): at least one VM available.
  - Total number of allocations (under Virtual Private Cloud): at least one allocation available.
  Increase quotas if needed.
- To follow the CLI steps, install and configure the Nebius AI Cloud CLI to work in the project in the eu-north1 region.
- Install jq to parse JSON outputs in this tutorial.
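Before starting, you can run a quick sanity check that the command-line tools this tutorial relies on are available. This is a sketch; the tool list simply mirrors the commands used in the steps below:

```shell
# Check that the command-line tools used in this tutorial are on PATH.
# `nebius` is the Nebius AI Cloud CLI from the prerequisites above.
check_tools() {
  for tool in nebius jq openssl curl; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing - install it before continuing"
    fi
  done
}

check_tools
```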
Steps
Create an endpoint
Web console
- In the sidebar, go to AI Services → Endpoints.
- Click Create endpoint.
- On the page that opens, specify the following endpoint settings:
  - Image path: vllm/vllm-openai:v0.18.0-cu130.
  - Ports: 8000.
  - Advanced settings → Entrypoint command: python3 -m vllm.entrypoints.openai.api_server.
  - Advanced settings → Arguments: --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000.
  - Advanced settings → Authentication: Token authentication. Copy and save the generated token.
  - Computing resources: With GPU.
  - Available platform: NVIDIA® L40S PCIe with Intel Ice Lake.
  - Preset: 1 GPU — 8 CPUs — 32 GiB RAM.
  - Network: Public static IP.
- Click Create.
CLI
- Create a token for authorization and save it to an environment variable:
  export AUTH_TOKEN=$(openssl rand -hex 32)
- Save the model ID. The model ID is a Hugging Face model identifier; use a small model if you want a faster startup:
  export MODEL_ID="Qwen/Qwen3-0.6B"
  You can use any compatible model from Hugging Face: replace Qwen/Qwen3-0.6B with the model ID of your choice.
- Get a subnet ID (for example, the first subnet in the project) and save it to an environment variable:
  export SUBNET_ID=$(nebius vpc subnet list --format jsonpath='{.items[0].metadata.id}')
  echo "SUBNET_ID=$SUBNET_ID"
- Create an endpoint:
  nebius ai endpoint create \
    --name qs-vllm-chat \
    --image vllm/vllm-openai:v0.18.0-cu130 \
    --container-command "python3 -m vllm.entrypoints.openai.api_server" \
    --args "--model $MODEL_ID --host 0.0.0.0 --port 8000" \
    --platform gpu-l40s-a \
    --preset 1gpu-8vcpu-32gb \
    --public \
    --container-port 8000 \
    --auth token \
    --token "$AUTH_TOKEN" \
    --shm-size 16Gi \
    --subnet-id "$SUBNET_ID"
  The endpoint creation takes approximately five minutes.
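Before running `nebius ai endpoint create`, it can save a failed call to confirm that the earlier steps actually produced non-empty values. A defensive sketch (the `require_var` helper and the example values are illustrative, not part of the Nebius CLI):

```shell
# Fail fast if a required environment variable is empty or unset.
require_var() {
  if [ -z "$(printenv "$1")" ]; then
    echo "error: $1 is empty - re-run the step that sets it" >&2
    return 1
  fi
  echo "$1 is set"
}

# Example values so the sketch runs end to end; in the tutorial these come
# from the openssl, MODEL_ID, and subnet-lookup steps above.
export AUTH_TOKEN="example-token"
export MODEL_ID="Qwen/Qwen3-0.6B"
export SUBNET_ID="example-subnet-id"

require_var AUTH_TOKEN
require_var MODEL_ID
require_var SUBNET_ID
```

An empty `SUBNET_ID`, for example, usually means the jq filter matched nothing in your project's subnet list.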
Check the endpoint status
Web console
Wait until the endpoint status is Running. You can check the status on the endpoint page.
CLI
- Save the endpoint ID to an environment variable:
  export ENDPOINT_ID=$(nebius ai endpoint list --format json | jq -r '.items[0].metadata.id')
  echo "ENDPOINT_ID=$ENDPOINT_ID"
- Check the status of the endpoint:
  nebius ai endpoint get $ENDPOINT_ID
  Wait until the endpoint status is Running.
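Instead of re-running the status check by hand, the wait can be scripted as a polling loop. This is a sketch: `get_endpoint_status` below is a stand-in that simulates an endpoint becoming Running on the third poll; against a real endpoint you would replace its body with the actual `nebius ai endpoint get` call plus a jq filter for the status field (check your CLI's JSON output for the exact field name):

```shell
# Stand-in for the real status query, so the loop can run end to end here.
# With a real endpoint, replace the body with the CLI call, for example
# piping `nebius ai endpoint get "$ENDPOINT_ID" --format json` through jq.
get_endpoint_status() {
  POLLS=$((POLLS + 1))
  if [ "$POLLS" -ge 3 ]; then
    STATUS="Running"
  else
    STATUS="Provisioning"
  fi
}

POLLS=0
STATUS=""
until [ "$STATUS" = "Running" ]; do
  get_endpoint_status
  echo "poll $POLLS: $STATUS"
  # Against a real endpoint, sleep between polls, e.g.: sleep 15
done
echo "Endpoint is ready"
```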
Test the endpoint
Web console
- In the sidebar, go to AI Services → Endpoints.
- Open the page of the required endpoint.
- In the Network section, copy the IP address from the Public endpoints or Private endpoints field.
- Test the endpoint by listing available models:
  curl "http://<endpoint_IP_address>/v1/models" \
    -H "Authorization: Bearer <token>" | jq
- Send a chat request to the model:
  curl "http://<endpoint_IP_address>/v1/chat/completions" \
    -H "Authorization: Bearer <token>" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Qwen/Qwen3-0.6B",
      "messages": [
        {"role": "user", "content": "Say '\''Hello Nebius AI'\'' and nothing else"}
      ]
    }' | jq -r '.choices[0].message.content'
CLI
- Get the endpoint IP address and save it to an environment variable:
  export ENDPOINT_IP=$(nebius ai endpoint get-by-name --name qs-vllm-chat \
    --format json | jq -r '.status.public_endpoints[0]')
  echo "ENDPOINT_IP=$ENDPOINT_IP"
- Test the endpoint by listing available models:
  curl "http://$ENDPOINT_IP/v1/models" -H "Authorization: Bearer $AUTH_TOKEN" | jq
- Send a chat request to the model:
  curl "http://$ENDPOINT_IP/v1/chat/completions" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"$MODEL_ID\",
      \"messages\": [{\"role\":\"user\",\"content\":\"Say 'Hello Nebius AI' and nothing else\"}]
    }" | jq -r '.choices[0].message.content'
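Before sending chat traffic, you can check programmatically that the endpoint is serving the model you requested. The JSON below is an illustrative /v1/models response shape; against the real endpoint you would capture the output of the curl listing call above into the same variable:

```shell
# Illustrative /v1/models response (shape only). With a real endpoint,
# you would instead capture the curl call from the listing step, e.g.
# MODELS=$(curl -s "http://$ENDPOINT_IP/v1/models" -H "Authorization: Bearer $AUTH_TOKEN")
MODELS='{"object":"list","data":[{"id":"Qwen/Qwen3-0.6B","object":"model","owned_by":"vllm"}]}'

SERVED=$(echo "$MODELS" | jq -r '.data[0].id')
if [ "$SERVED" = "Qwen/Qwen3-0.6B" ]; then
  echo "model ready: $SERVED"
else
  echo "unexpected model: $SERVED" >&2
fi
```

A mismatch here usually means the endpoint was created with different --args, or the model is still downloading.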
How to delete the created resources
The endpoint and its computing resources are chargeable. If you no longer need the endpoint, delete it so that Nebius AI Cloud doesn't charge you for it.
Web console
- In the sidebar, go to AI Services → Endpoints.
- Locate the endpoint, open its actions menu, and click Delete.
- In the window that opens, confirm the deletion.
CLI
nebius ai endpoint delete $ENDPOINT_ID