vllm/vllm-openai:latest image. vLLM automatically downloads the model from Hugging Face when the endpoint starts. The container exposes an OpenAI-compatible /v1/chat/completions API.
Costs
Nebius AI Cloud charges you for the Compute virtual machines that run the endpoint.
Prerequisites
1. Install and configure the Nebius AI Cloud CLI to work in the project in the eu-north1 region.
2. Install jq to parse JSON outputs in this tutorial.
3. Make sure that you are in a group that has the admin role within your tenant; for example, the default admins group.
4. In the Quota section of the web console, check that you have quotas on the following resources in the region you use:
   - NVIDIA® L40S for regular VMs without reservations, under Compute: at least one GPU available.
   - Number of virtual machines, under Compute: at least one VM available.
   - Total number of allocations, under Virtual Private Cloud: at least one allocation available.
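jq is available from most package managers. For example, on an apt-based system (an assumption; use your platform's package manager otherwise):

```shell
# Install jq on Debian/Ubuntu (assumption: apt-based system; use brew, dnf, etc. elsewhere)
sudo apt-get update && sudo apt-get install -y jq

# Verify the installation
jq --version
```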
Steps
Create an endpoint
1. Create a token for authorization and save it to an environment variable.
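For example, using the Nebius CLI (assumption: the `nebius iam get-access-token` subcommand; check `nebius iam --help` for your CLI version):

```shell
# Assumption: subcommand name; verify with `nebius iam --help`
export IAM_TOKEN=$(nebius iam get-access-token)
```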
2. Save the model ID. The model ID is a Hugging Face model identifier. Use a small model if you want a faster startup.
   You can use any compatible model from Hugging Face. Replace Qwen/Qwen3-0.6B with the model ID of your choice.
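For example, with the Qwen/Qwen3-0.6B model named in this tutorial:

```shell
# Hugging Face model identifier; swap in any compatible model
export MODEL_ID="Qwen/Qwen3-0.6B"
```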
3. Get a subnet ID (for example, the first subnet in the project) and save it to an environment variable.
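A sketch of this step, assuming the `nebius vpc subnet list` subcommand and the JSON output shape shown below (both are assumptions; inspect the raw output of your CLI version first):

```shell
# Assumptions: subcommand name, --format json flag, and JSON layout
export SUBNET_ID=$(nebius vpc subnet list --format json | jq -r '.items[0].metadata.id')
echo "${SUBNET_ID}"
```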
4. Create an endpoint.
Check the endpoint status
1. Save the endpoint ID to an environment variable.
2. Check the status of the endpoint. Wait until the status is Running.
Test the endpoint
1. Get the endpoint IP address and save it to an environment variable.
2. Test the endpoint by listing available models:
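Assuming vLLM serves on its default port 8000 and ENDPOINT_IP was set in the previous step, the OpenAI-compatible models endpoint can be queried with curl:

```shell
# List models served by the endpoint (assumption: default vLLM port 8000)
curl -s "http://${ENDPOINT_IP}:8000/v1/models"
```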
3. Send a simple chat request to the model:
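A minimal chat request against the OpenAI-compatible API (assumptions: default vLLM port 8000; ENDPOINT_IP and MODEL_ID were set in earlier steps):

```shell
# Send one user message to the /v1/chat/completions endpoint
curl -s "http://${ENDPOINT_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_ID}"'",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```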
4. Send a request with more stable (less random) output and extract just the response text:
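One way to make the output more stable is to set the temperature to 0, then pull only the message text out with jq (a sketch; the response shape follows the OpenAI chat-completions format, same assumptions on port and variables as above):

```shell
# temperature 0 makes sampling greedy, so repeated runs give near-identical answers
curl -s "http://${ENDPOINT_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_ID}"'",
        "messages": [{"role": "user", "content": "Name three prime numbers."}],
        "temperature": 0
      }' | jq -r '.choices[0].message.content'
```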
How to delete the created resources
The endpoint and its computing resources are chargeable. If you don’t need the endpoint, delete it so Nebius AI Cloud doesn’t charge you for it.