Serverless AI lets you deploy and manage model endpoints without handling infrastructure yourself: with endpoints, you can bring up an OpenAI-compatible model backend in a few minutes. This tutorial shows how to prepare your environment, create your first endpoint with an open-source large language model (LLM), and send a chat request.

The endpoint is based on the vllm/vllm-openai:latest image. When the endpoint starts, vLLM automatically downloads the model from Hugging Face, and the container exposes an OpenAI-compatible /v1/chat/completions API.

Costs

Nebius AI Cloud charges you for the Compute virtual machines that back the endpoint.

Prerequisites

  1. Install and configure the Nebius AI Cloud CLI to work in the project in the eu-north1 region.
  2. Install jq to parse JSON outputs in this tutorial:
    sudo apt-get install jq
    
  3. Make sure that you are in a group that has the admin role within your tenant; for example, the default admins group.
  4. In the Quota section of the web console, check that you have quotas on the following resources in the region you use:
    • NVIDIA® L40S for regular VMs without reservations (under Compute): at least one GPU available.
    • Number of virtual machines (under Compute): at least one VM available.
    • Total number of allocations (under Virtual Private Cloud): at least one allocation available.
    Increase quotas if needed.
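Before moving on, you can confirm that both command-line tools from the prerequisites are actually on your PATH. The check_tools function below is a small helper sketched for this tutorial, not part of the Nebius CLI:

```shell
# Verify that the required tools are installed before starting.
# check_tools is a local helper written for this tutorial.
check_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "Missing: $tool" >&2; return 1; }
  done
}
check_tools nebius jq || echo "Install the missing tools before continuing."
```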

Steps

Create an endpoint

  1. Create a token for authorization and save it to an environment variable:
    export AUTH_TOKEN=$(openssl rand -hex 32)
    
  2. Save the model ID. The model ID is a Hugging Face model identifier. Use a small model if you want a faster startup:
    export MODEL_ID="Qwen/Qwen3-0.6B"
    
    You can use any compatible model from Hugging Face. Replace Qwen/Qwen3-0.6B with the model ID of your choice.
  3. Get a subnet ID (for example, the first subnet in the project) and save it to an environment variable:
    export SUBNET_ID=$(nebius vpc subnet list --format jsonpath='{.items[0].metadata.id}')
    echo "SUBNET_ID=$SUBNET_ID"
    
  4. Create an endpoint:
    nebius ai endpoint create \
      --name qs-vllm-chat \
      --image vllm/vllm-openai:latest \
      --container-command "python3 -m vllm.entrypoints.openai.api_server" \
      --args "--model $MODEL_ID --host 0.0.0.0 --port 8000" \
      --platform gpu-l40s-a \
      --preset 1gpu-8vcpu-32gb \
      --public \
      --container-port 8000 \
      --auth token \
      --token "$AUTH_TOKEN" \
      --shm-size 16Gi \
      --subnet-id "$SUBNET_ID"
    
The endpoint creation takes approximately five minutes.
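If the create call fails with a subnet-related error, the lookup in step 3 may have returned an empty string (for example, when your CLI profile points at a project with no subnets). A quick guard catches this; require_var is a helper sketched for this tutorial, not a CLI feature:

```shell
# Fail fast when a required variable is empty.
# require_var is a local helper, not part of the Nebius CLI.
require_var() {
  [ -n "$2" ] || { echo "ERROR: $1 is empty" >&2; return 1; }
}
require_var SUBNET_ID "$SUBNET_ID" || echo "Create a subnet or check your CLI profile and region."
```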

Check the endpoint status

  1. Save the endpoint ID to an environment variable:
    export ENDPOINT_ID=$(nebius ai endpoint list --format json | jq -r '.items[0].metadata.id')
    echo "ENDPOINT_ID=$ENDPOINT_ID"
    
  2. Check the status of the endpoint:
    nebius ai endpoint get $ENDPOINT_ID
    
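Step 1 above takes the first endpoint in the list. If the project contains several endpoints, select yours by the name given at creation time instead:

```shell
# Select the endpoint ID by name rather than by list position.
export ENDPOINT_ID=$(nebius ai endpoint list --format json \
  | jq -r '.items[] | select(.metadata.name == "qs-vllm-chat") | .metadata.id')
echo "ENDPOINT_ID=$ENDPOINT_ID"
```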
Wait until the endpoint status is Running.
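Instead of re-running the get command by hand, you can poll in a loop. The JSON field path used below (.status.state) is an assumption; confirm it against the real output of `nebius ai endpoint get $ENDPOINT_ID --format json` before relying on it:

```shell
# wait_running polls until the endpoint reports Running.
# NOTE: the .status.state field path is an assumption; check the actual
# JSON output of `nebius ai endpoint get` and adjust it if needed.
wait_running() {
  while [ "$(nebius ai endpoint get "$1" --format json 2>/dev/null \
             | jq -r '.status.state')" != "Running" ]; do
    echo "Not running yet; waiting 15 s..."
    sleep 15
  done
}
```

Run it as `wait_running "$ENDPOINT_ID"`; it returns once the status is Running.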

Test the endpoint

  1. Get the endpoint IP address and save it to an environment variable:
    export ENDPOINT_IP=$(nebius ai endpoint get $ENDPOINT_ID --format json | jq -r '.status.public_endpoints[0]')
    echo "ENDPOINT_IP=$ENDPOINT_IP"
    
  2. Test the endpoint by listing available models:
    curl "http://$ENDPOINT_IP/v1/models" -H "Authorization: Bearer $AUTH_TOKEN" | jq
    
  3. Send a simple chat request to the model:
    curl "http://$ENDPOINT_IP/v1/chat/completions" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{
       \"model\": \"$MODEL_ID\",
       \"messages\": [
          {\"role\": \"user\", \"content\": \"Hello world :)\"}
       ]
    }"
    
  4. Send a request with a deterministic prompt, suppress curl's progress output (-sS shows errors only), and extract just the model's reply:
    curl -sS "http://$ENDPOINT_IP/v1/chat/completions" \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d "{
        \"model\": \"$MODEL_ID\",
        \"messages\": [{\"role\":\"user\",\"content\":\"Say 'Hello Nebius AI' and nothing else\"}]
      }" \
    | jq -r '.choices[0].message.content'
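The hand-escaped JSON bodies above become fragile once the prompt itself contains quotes. jq can build the payload instead: -n starts from an empty input, and --arg injects shell variables as properly escaped JSON strings:

```shell
# Build the chat request body with jq instead of hand-escaping quotes.
PROMPT='Say "Hello Nebius AI" and nothing else'
BODY=$(jq -n --arg model "$MODEL_ID" --arg prompt "$PROMPT" \
  '{model: $model, messages: [{role: "user", content: $prompt}]}')
echo "$BODY"
```

Send it with the same curl call as in step 4, replacing the inline -d payload with `-d "$BODY"`.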
    

How to delete the created resources

The endpoint and its compute resources are billed for as long as they exist. If you no longer need the endpoint, delete it so that Nebius AI Cloud stops charging you for it:
nebius ai endpoint delete $ENDPOINT_ID
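To confirm the deletion, list the remaining endpoints in the project; the deleted ID should no longer appear. The jq path is the same one used earlier to fetch the endpoint ID:

```shell
# List the IDs of all remaining endpoints in the project.
nebius ai endpoint list --format json | jq -r '.items[].metadata.id'
```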