
# Deploying a large language model and chatting with it by using Serverless AI endpoints

Serverless AI lets you deploy and manage endpoints without handling infrastructure yourself. With endpoints, you can create an OpenAI-compatible model backend in a few minutes.

This tutorial shows how to prepare your environment, create your first endpoint with an open-source large language model (LLM), and send a chat request.

The endpoint is based on the `vllm/vllm-openai:v0.18.0-cu130` image. [vLLM](https://github.com/vllm-project/vllm) automatically downloads the model from [Hugging Face](https://huggingface.co) when the endpoint starts. The container exposes an OpenAI-compatible `/v1/chat/completions` API.

## Costs

Nebius AI Cloud charges you for the [Compute virtual machines](/compute/resources/pricing) that run your endpoint.

## Prerequisites

<Tabs group="interfaces">
  <Tab title="Web console">
    * Make sure that you are in a [group](/iam/authorization/groups/index) that has the `admin` role within your tenant; for example, the default `admins` group.

    * On the [Administration → Limits → Quotas](https://console.nebius.com/quota) page of the web console, check that you have quotas on the following resources in the region you use:

      * **NVIDIA® L40S for regular VMs without reservations** (under **Compute**): at least one GPU available.
      * **Number of virtual machines** (under **Compute**): at least one VM available.
      * **Total number of allocations** (under **Virtual Private Cloud**): at least one allocation available.

      [Increase quotas](/overview/quotas#change-quotas) if needed.
  </Tab>

  <Tab title="CLI">
    1. [Install](/cli/install) and [configure](/cli/configure) the Nebius AI Cloud CLI to work in the project in the `eu-north1` region.

    2. Install [jq](https://jqlang.org/) to parse JSON outputs in this tutorial:

       <CodeGroup>
         ```bash Ubuntu theme={null}
         sudo apt-get install jq
         ```

         ```bash macOS theme={null}
         brew install jq
         ```
       </CodeGroup>

    3. Make sure you are in a [group](/iam/authorization/groups/index) that has the `admin` role within your tenant; for example, the default `admins` group. You can check this in the [Administration → IAM](https://console.nebius.com/iam) section of the web console.

    4. In the [Quota](https://console.nebius.com/quota) section of the web console, check that you have quotas on the following resources in the region you use:

       * **NVIDIA® L40S for regular VMs without reservations** (under **Compute**): at least one GPU available.
       * **Number of virtual machines** (under **Compute**): at least one VM available.
       * **Total number of allocations** (under **Virtual Private Cloud**): at least one allocation available.

       [Increase quotas](/overview/quotas#change-quotas) if needed.
  </Tab>
</Tabs>
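
If you follow the CLI steps, a quick preflight check can confirm that the command-line tools called on this page are installed. The following is a minimal sketch, not part of the Nebius AI Cloud CLI: `missing_tools` is a hypothetical helper name, and the tool list (`nebius`, `jq`, `curl`, `openssl`) is taken from the commands used later in this tutorial.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Nebius AI Cloud CLI): prints every tool
# from its arguments that is not found on PATH, one per line, and returns
# non-zero if at least one tool is missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Tools used by the CLI steps in this tutorial.
if missing=$(missing_tools nebius jq curl openssl); then
  echo "All required tools are installed."
else
  # Unquoted on purpose: joins the missing tool names on one line.
  echo "Missing tools:" $missing
fi
```

The check only looks at `PATH`. It does not verify that the CLI is configured for the right project or that your quotas are sufficient.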

## Steps

### Create an endpoint

<Tabs group="interfaces">
  <Tab title="Web console">
    1. In the sidebar, go to <Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/sidebar/ai-services.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=ab4ff229f7690c99deb1dc52d3daf987" width="16" height="16" data-path="_assets/sidebar/ai-services.svg" /> **AI Services** → **Endpoints**.

    2. Click <Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/plus.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=7c9efc69d65fc58db0eb73702fd81aa1" width="16" height="16" data-path="_assets/plus.svg" /> **Create endpoint**.

    3. On the page that opens, specify the following endpoint settings:

       * **Image path**: `vllm/vllm-openai:v0.18.0-cu130`.

       * **Ports**: `8000`.

       * **Entrypoint command**:

         ```bash theme={null}
         python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000
         ```

       * **Authentication**: Token authentication. Copy and save the generated token.

       * **Computing resources**: With GPU.

       * **Available platform**: NVIDIA® L40S PCIe with Intel Ice Lake.

       * **Preset**: 1GPU — 8 CPUs — 32 GiB RAM.

       * **Network**: Public static IP.

    4. Click **Create**.
  </Tab>

  <Tab title="CLI">
    1. Create a token for authorization and save it to an environment variable:

       ```bash theme={null}
       export AUTH_TOKEN=$(openssl rand -hex 32)
       ```

    2. Save the model ID. The model ID is a Hugging Face model identifier. Use a small model if you want a faster startup:

       ```bash theme={null}
       export MODEL_ID="Qwen/Qwen3-0.6B"
       ```

       You can use any compatible model from Hugging Face. Replace `Qwen/Qwen3-0.6B` with the model ID of your choice.

    3. Get a subnet ID (for example, the first subnet in the project) and save it to an environment variable:

       ```bash theme={null}
       export SUBNET_ID=$(nebius vpc subnet list --format jsonpath='{.items[0].metadata.id}')
       echo "SUBNET_ID=$SUBNET_ID"
       ```

    4. Create an endpoint:

       ```bash theme={null}
       nebius ai endpoint create \
         --name qs-vllm-chat \
         --image vllm/vllm-openai:v0.18.0-cu130 \
         --container-command "python3 -m vllm.entrypoints.openai.api_server" \
         --args "--model $MODEL_ID --host 0.0.0.0 --port 8000" \
         --platform gpu-l40s-a \
         --preset 1gpu-8vcpu-32gb \
         --public \
         --container-port 8000 \
         --auth token \
         --token "$AUTH_TOKEN" \
         --shm-size 16Gi \
         --subnet-id "$SUBNET_ID"
       ```
  </Tab>
</Tabs>

The endpoint creation takes approximately five minutes.
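
If the create command fails in the CLI, an empty or malformed environment variable is a common cause worth ruling out first. The sketch below checks the three variables that the command reads. `validate_endpoint_env` is a hypothetical helper, and its checks only mirror how this tutorial produces the values (a 64-character hex token from `openssl rand -hex 32`, a Hugging Face ID of the form `<owner>/<model>`, a non-empty subnet ID). They are not rules enforced by Nebius AI Cloud.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sanity-checks AUTH_TOKEN, MODEL_ID and SUBNET_ID.
# The patterns reflect this tutorial's values, not service-side requirements.
validate_endpoint_env() {
  local rc=0
  local re_token='^[0-9a-f]{64}$'                   # openssl rand -hex 32 prints 64 hex characters
  local re_model='^[^/[:space:]]+/[^/[:space:]]+$'  # <owner>/<model>

  if [[ ! "${AUTH_TOKEN:-}" =~ $re_token ]]; then
    echo "AUTH_TOKEN is not a 64-character hex string" >&2
    rc=1
  fi
  if [[ ! "${MODEL_ID:-}" =~ $re_model ]]; then
    echo "MODEL_ID does not look like <owner>/<model>" >&2
    rc=1
  fi
  # An empty SUBNET_ID usually means that the subnet lookup returned nothing.
  if [[ -z "${SUBNET_ID:-}" ]]; then
    echo "SUBNET_ID is empty" >&2
    rc=1
  fi
  return "$rc"
}

# Example:
# validate_endpoint_env && echo "Variables look good."
```

If you generated the token in a different way, adjust or drop the token check.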

### Check the endpoint status

<Tabs group="interfaces">
  <Tab title="Web console">
    Wait until the endpoint status is `Running`. You can check the status on the endpoint page.
  </Tab>

  <Tab title="CLI">
    1. Save the endpoint ID to an environment variable:

       ```bash theme={null}
       export ENDPOINT_ID=$(nebius ai endpoint get-by-name --name qs-vllm-chat \
         --format json | jq -r '.metadata.id')
       echo "ENDPOINT_ID=$ENDPOINT_ID"
       ```

    2. Check the status of the endpoint:

       ```bash theme={null}
       nebius ai endpoint get "$ENDPOINT_ID"
       ```

    Wait until the endpoint status is `Running`.
  </Tab>
</Tabs>
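
Instead of re-running the status command by hand, you can poll it in a loop. The sketch below is one way to do that under stated assumptions: `wait_for_status` and `endpoint_status` are hypothetical helpers, and the jq path `.status.state` is an assumption about the JSON output. Inspect the real output of `nebius ai endpoint get` and adjust the path and the expected value before relying on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: runs a status command repeatedly until it prints the
# expected value, or gives up after a fixed number of attempts.
# Usage: wait_for_status <expected> <max_attempts> <delay_seconds> <command...>
wait_for_status() {
  local expected=$1 max_attempts=$2 delay=$3
  shift 3
  local attempt current=""
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # A failing status command counts as "not ready yet".
    current=$("$@" 2>/dev/null) || current=""
    if [[ "$current" == "$expected" ]]; then
      echo "Status is $expected after $attempt attempt(s)."
      return 0
    fi
    sleep "$delay"
  done
  echo "Timed out waiting for status $expected (last value: ${current:-none})." >&2
  return 1
}

# Assumption: the status is exposed at .status.state in the JSON output.
endpoint_status() {
  nebius ai endpoint get "$ENDPOINT_ID" --format json | jq -r '.status.state'
}

# Example: poll every 15 seconds for up to 10 minutes.
# wait_for_status Running 40 15 endpoint_status
```

Passing the status command as arguments keeps the loop independent of the CLI, so the same function works with any command that prints a single status word.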

### Test the endpoint

<Tabs group="interfaces">
  <Tab title="Web console">
    1. In the sidebar, go to <Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/sidebar/ai-services.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=ab4ff229f7690c99deb1dc52d3daf987" width="16" height="16" data-path="_assets/sidebar/ai-services.svg" /> **AI Services** → **Endpoints**.
    2. Open the page of the required endpoint.
    3. In the **Network** section, copy the IP address from the **Public endpoints** or **Private endpoints** field.

    Test the endpoint by listing available models:

    ```bash theme={null}
    curl "http://<endpoint_IP_address>/v1/models" \
      -H "Authorization: Bearer <token>" | jq
    ```

    Send a chat request to the model:

    ```bash theme={null}
    curl "http://<endpoint_IP_address>/v1/chat/completions" \
      -H "Authorization: Bearer <token>" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
          {"role": "user", "content": "Say '\''Hello Nebius AI'\'' and nothing else"}
        ]
      }' | jq -r '.choices[0].message.content'
    ```
  </Tab>

  <Tab title="CLI">
    1. Get the endpoint IP address and save it to an environment variable:

       ```bash theme={null}
       export ENDPOINT_IP=$(nebius ai endpoint get-by-name --name qs-vllm-chat \
         --format json | jq -r '.status.public_endpoints[0]')
       echo "ENDPOINT_IP=$ENDPOINT_IP"
       ```

    2. Test the endpoint by listing available models:

       ```bash theme={null}
       curl "http://$ENDPOINT_IP/v1/models" -H "Authorization: Bearer $AUTH_TOKEN" | jq
       ```

    3. Send a chat request to the model:

       ```bash theme={null}
       curl "http://$ENDPOINT_IP/v1/chat/completions" \
         -H "Authorization: Bearer $AUTH_TOKEN" \
         -H "Content-Type: application/json" \
         -d "{
           \"model\": \"$MODEL_ID\",
           \"messages\": [{\"role\":\"user\",\"content\":\"Say 'Hello Nebius AI' and nothing else\"}]
         }" | jq -r '.choices[0].message.content'
       ```
  </Tab>
</Tabs>
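
The chat requests above escape quotes inside the JSON body by hand, which gets error-prone once a prompt contains quotes or newlines of its own. As a sketch of an alternative, the body can be built with `jq -n --arg`, which handles the escaping. `build_chat_payload` and `chat` are hypothetical helper names, and `chat` assumes that `ENDPOINT_IP`, `AUTH_TOKEN` and `MODEL_ID` are set as in the CLI steps.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds a chat completions request body with jq, so that
# quotes and other special characters in the prompt are escaped correctly.
build_chat_payload() {
  local model=$1 prompt=$2
  jq -n -c --arg model "$model" --arg prompt "$prompt" \
    '{model: $model, messages: [{role: "user", content: $prompt}]}'
}

# Hypothetical helper: sends one prompt to the endpoint and prints only the reply.
# Reads ENDPOINT_IP, AUTH_TOKEN and MODEL_ID from the environment.
chat() {
  build_chat_payload "$MODEL_ID" "$1" |
    curl -sS "http://$ENDPOINT_IP/v1/chat/completions" \
      -H "Authorization: Bearer $AUTH_TOKEN" \
      -H "Content-Type: application/json" \
      -d @- |
    jq -r '.choices[0].message.content'
}

# Example:
# chat 'Say "Hello Nebius AI" and nothing else'
```

Here `-d @-` makes curl read the request body from standard input, and `-sS` hides the progress meter while still printing errors.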

## How to delete the created resources

The endpoint and its computing resources are chargeable. If you no longer need the endpoint, delete it so that Nebius AI Cloud stops charging for it:

<Tabs group="interfaces">
  <Tab title="Web console">
    1. In the sidebar, go to <Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/sidebar/ai-services.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=ab4ff229f7690c99deb1dc52d3daf987" width="16" height="16" data-path="_assets/sidebar/ai-services.svg" /> **AI Services** → **Endpoints**.
    2. Locate the endpoint and then click <Icon icon="https://mintcdn.com/nebius-ai-cloud/1Ha0sWR6e1mnIaHS/_assets/button-vellipsis.svg?fit=max&auto=format&n=1Ha0sWR6e1mnIaHS&q=85&s=e80b8e57c43bfd117679262e6a1334ad" width="12" height="24" data-path="_assets/button-vellipsis.svg" /> → **Delete**.
    3. In the window that opens, confirm the deletion.
  </Tab>

  <Tab title="CLI">
    ```bash theme={null}
    nebius ai endpoint delete "$ENDPOINT_ID"
    ```
  </Tab>
</Tabs>
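
Because a leftover endpoint keeps incurring charges, it can be worth confirming from the CLI that the endpoint is gone. The sketch below reuses the `get-by-name` command from the test step. `endpoint_exists` is a hypothetical helper, and it assumes that `get-by-name` exits with a non-zero status when no endpoint matches the name, which you should verify against your CLI version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds while an endpoint with the given name exists.
# Assumption: `get-by-name` exits non-zero when nothing matches the name.
endpoint_exists() {
  nebius ai endpoint get-by-name --name "$1" --format json >/dev/null 2>&1
}

# Example:
# if endpoint_exists qs-vllm-chat; then
#   echo "Endpoint still exists."
# else
#   echo "Endpoint not found."
# fi
```

If the deletion has only just been requested, the endpoint may still be listed for a short time, so repeat the check after a moment before concluding that it failed.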
