To optimize your ML/AI workloads, you can use the Run:ai management platform. It dynamically allocates GPU resources, prevents idle GPUs and enables GPU sharing across multiple workloads and users, so that all resources are utilized. This guide explains how to configure your Nebius AI Cloud resources for use in Run:ai.
Costs
This tutorial includes the following chargeable resources:
Prerequisites
Get a Run:ai account token.
Prepare the environment:
Install and configure the Nebius AI Cloud CLI.
Install Terraform.
Install kubectl and Helm.
Install jq to extract IDs and tokens from the JSON data returned by the Nebius AI Cloud CLI. For more details, see the jq documentation.
Save the domain name you control to an environment variable:
export DOMAIN_NAME=<example.com>
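Before moving on, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the official procedure; the helper name check_tools is hypothetical, and the binary name nebius for the Nebius AI Cloud CLI is taken from the commands used later in this guide.

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 if every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools named in the prerequisites above.
if check_tools nebius terraform kubectl helm jq; then
  echo "all prerequisite tools found"
fi
```

Later steps also call envsubst and psql, so you may want to add them to the list if you plan to follow the optional sections.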
Steps
Set up a Managed Service for Kubernetes cluster
For this tutorial, a Managed Service for Kubernetes cluster must have:
A node group with at least three nodes. Each of these nodes must have a public IP address allocated.
A mounted filesystem.
NVIDIA® GPU Operator.
To create the necessary resources quickly, use the k8s-training solution for Terraform:
Clone the nebius-solution-library repository from GitHub and go to the k8s-training directory:
git clone https://github.com/nebius/nebius-solution-library.git
cd nebius-solution-library/k8s-training
Generate an SSH key pair.
If you use a custom file name to save the key pair, specify your public key path in terraform.tfvars.
Load the environment variables:
Initialize Terraform to download providers and modules:
Set enable_grafana and enable_prometheus to false, gpu_nodes_assign_public_ip to true and enter your project settings in the k8s-training/terraform.tfvars file, or overwrite the values while applying the configuration:
terraform apply -var enable_grafana=false -var enable_prometheus=false \
-var parent_id=<project_ID> -var subnet_id=<subnet_ID> \
-var region=<your_project_region> -var gpu_nodes_assign_public_ip=true
The command contains the following parameters:
parent_id: Project ID.
subnet_id: Subnet ID.
region: The project region is displayed in the upper-left corner of the web console, next to your project name.
When the cluster and the nodes are ready, connect to the cluster:
export NEBIUS_CLUSTER_ID=$(terraform output -json kube_cluster | jq -r '.id')
nebius mk8s cluster get-credentials --id "$NEBIUS_CLUSTER_ID" --external
kubectl cluster-info
KServe is an open-source framework for serving ML models on Kubernetes. KServe uses KNative for serverless deployment and the auto-scaling of ML models.
KNative can only run on nodes with public IP addresses. To ensure this, identify nodes without public IPs and cordon them.
Identify nodes:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" - InternalIP: "}{.status.addresses[?(@.type=="InternalIP")].address}{" - ExternalIP: "}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
Nodes that were created without public IP addresses show only an InternalIP in the output.
If the resulting list contains nodes with InternalIP only, cordon these nodes:
kubectl cordon <node_without_public_ip>
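With more than a few nodes, picking out the ones without a public IP by eye is error-prone. The sketch below filters text in the format printed by the jsonpath query above and prints only the names of nodes whose ExternalIP field is empty. The function name nodes_without_external_ip and the sample node names and addresses are hypothetical.

```shell
# Read lines in the format printed by the jsonpath query above and
# print the names of nodes whose ExternalIP field is empty.
nodes_without_external_ip() {
  awk -F' - ' '{
    ext = $3
    sub(/^ExternalIP: */, "", ext)   # strip the label, keep the address (if any)
    if (ext == "") print $1
  }'
}

# Sample input for illustration only (hypothetical node names and addresses).
printf '%s\n' \
  'node-a - InternalIP: 10.0.0.1 - ExternalIP: 203.0.113.5' \
  'node-b - InternalIP: 10.0.0.2 - ExternalIP: ' |
  nodes_without_external_ip
```

In a real cluster, you would pipe the output of the kubectl get nodes command into the function and pass each printed name to kubectl cordon.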
Install KNative:
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-core.yaml
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.14.0/kourier.yaml
kubectl patch configmap/config-network \
--namespace knative-serving \
--type merge \
--patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-default-domain.yaml
kubectl patch configmap/config-features \
--namespace knative-serving \
--type merge \
--patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled","kubernetes.podspec-volumes-emptydir":"enabled","kubernetes.podspec-securitycontext":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled","kubernetes.podspec-persistent-volume-write":"enabled","multi-container":"enabled","kubernetes.podspec-init-containers":"enabled"}}'
Check the result:
kubectl get jobs -n knative-serving
kubectl get pods -n kourier-system
Make sure that the default-domain job reaches the Complete status and the 3scale-kourier-gateway pod is running.
Install operators
Install the Kubeflow Training Operator (also known as Kubeflow Trainer):
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
kubectl delete customresourcedefinition mpijobs.kubeflow.org
kubectl patch deployment training-operator -n kubeflow \
--type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=paddlejob"]}]'
Install the MPI Operator:
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
(Optional) Prepare nodes for installing other applications
If you cordoned nodes without public IP addresses during the KServe installation, uncordon them:
kubectl uncordon <node_without_public_ip>
Set up nginx
Get the IP addresses of the nodes:
export INTERNAL_IPS=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{","}{end}' | sed 's/,$//')
export EXTERNAL_IPS=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="ExternalIP")].address}{","}{end}' | sed 's/,$//')
export IPS=$(echo "$INTERNAL_IPS,$EXTERNAL_IPS" | sed 's/,,*/,/g' | sed 's/^,//' | sed 's/,$//')
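The three sed expressions in the last command collapse repeated commas and trim a leading or trailing comma, so that IPS stays a clean comma-separated list even when one of the two inputs is empty. The sketch below wraps the same pipeline in a hypothetical helper, merge_ips, and runs it on made-up addresses so you can see the effect without a cluster.

```shell
# Join two comma-separated lists and drop the empty entries that appear
# when one of the lists is empty (same sed pipeline as above).
merge_ips() {
  printf '%s\n' "$1,$2" | sed 's/,,*/,/g' | sed 's/^,//' | sed 's/,$//'
}

merge_ips "10.0.0.1,10.0.0.2" "203.0.113.5"   # prints 10.0.0.1,10.0.0.2,203.0.113.5
merge_ips "10.0.0.1,10.0.0.2" ""              # prints 10.0.0.1,10.0.0.2
merge_ips "" "203.0.113.5"                    # prints 203.0.113.5
```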
Install nginx:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
--namespace nginx-ingress --create-namespace \
--set controller.kind=DaemonSet \
--set controller.service.externalIPs="{$IPS}"
In your DNS provider, link your domain name (DOMAIN_NAME) to the public IP address of one of the nodes.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace --set grafana.enabled=false
(Optional) Install cert-manager
If you do not have public TLS certificates, install cert-manager:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace \
--version v1.10.1 --set installCRDs=true \
--set podDnsConfig.nameservers={"1.1.1.1"}
Create certs.yaml with the certificate resources. You will use it later when you create a TLS secret:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-runai
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: $EMAIL
    privateKeySecretRef:
      name: letsencrypt-runai
    solvers:
      - http01:
          ingress:
            class: nginx
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: runai-tls
  namespace: runai-backend
spec:
  secretName: runai-backend-tls
  issuerRef:
    name: letsencrypt-runai
    kind: ClusterIssuer
  commonName: $DOMAIN_NAME
  dnsNames:
    - $DOMAIN_NAME
  usages:
    - digital signature
    - key encipherment
    - server auth
  duration: 2160h # 90 days (default for Let's Encrypt)
  renewBefore: 360h # Renew 15 days before expiration
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: runai-cluster-domain-tls
  namespace: runai
spec:
  secretName: runai-cluster-domain-tls-secret
  issuerRef:
    name: letsencrypt-runai
    kind: ClusterIssuer
  commonName: $DOMAIN_NAME
  dnsNames:
    - $DOMAIN_NAME
  usages:
    - digital signature
    - key encipherment
    - server auth
  duration: 2160h # 90 days (default for Let's Encrypt)
  renewBefore: 360h # Renew 15 days before expiration
(Optional) Configure a Managed Service for PostgreSQL cluster
Setting up a Managed Service for PostgreSQL cluster is not strictly necessary for using Run:ai, but it is strongly recommended for production environments.
Install the postgresql package.
Create a Managed PostgreSQL cluster:
export PG_PASSWORD="<PostgreSQL_password>"
nebius msp postgresql v1alpha1 cluster create \
--name runai-backend-postgres \
--network-id <network_ID> \
--backup-backup-window-start "01:00:00" \
--backup-retention-policy "7d" \
--bootstrap-user-name "runai" \
--bootstrap-user-password "$PG_PASSWORD" \
--bootstrap-db-name "runai" \
--config-version 16 \
--config-pooler-config-pooling-mode "session" \
--config-template-disk-type "network-ssd" \
--config-template-disk-size-gibibytes 20 \
--config-template-hosts-count 2 \
--config-template-resources-platform "cpu-e2" \
--config-template-resources-preset "4vcpu-16gb" \
--config-postgresql-config-16-max-connections 320 \
--config-public-access
In this command, specify the network ID.
Configure the Managed Service for PostgreSQL cluster to work with Run:ai:
export PG_ENDPOINT=$(nebius msp postgresql v1alpha1 cluster get-by-name --name runai-backend-postgres --format json | jq -r ".status.connection_endpoints.private_read_write")
export SQL=$(cat <<EOF
CREATE ROLE grafana WITH LOGIN PASSWORD '$PG_PASSWORD';
ALTER USER grafana set search_path='grafana';
GRANT grafana to runai;
CREATE SCHEMA grafana authorization grafana;
EOF
)
export PG_CONNECTION_STRING="postgresql://runai:$PG_PASSWORD@$PG_ENDPOINT/runai"
psql "$PG_CONNECTION_STRING" -c "$SQL"
Create resources
Create namespaces:
kubectl create ns runai-backend
kubectl create ns runai
Use your email and the token received from Run:ai to create a Kubernetes secret with Run:ai credentials:
export TOKEN="<your_run_ai_token>"
export EMAIL="<your_email>"
kubectl create secret docker-registry runai-reg-creds \
--docker-server=https://runai.jfrog.io \
--docker-username=self-hosted-image-puller-prod \
--docker-password="$TOKEN" \
--docker-email="$EMAIL" \
--namespace=runai-backend
If you have not created a Managed Service for PostgreSQL cluster in (Optional) Configure a Managed Service for PostgreSQL cluster, create a password and save it to the PG_PASSWORD environment variable:
export PG_PASSWORD="<PostgreSQL_password>"
Create a Kubernetes secret with Managed Service for PostgreSQL credentials:
kubectl create secret generic postgresql-credentials -n runai-backend \
--from-literal=postgres-password="$PG_PASSWORD" \
--from-literal=password="$PG_PASSWORD"
kubectl create secret generic grafana-postgresql-credentials -n runai-backend \
--from-literal=user=grafana \
--from-literal=password="$PG_PASSWORD"
Create a TLS secret:
With cert-manager
Make sure that you set up the DOMAIN_NAME and EMAIL variables, and run the following set of commands:
envsubst < certs.yaml | kubectl apply -f -
kubectl wait --for=condition=Ready certificate/runai-tls -n runai-backend --timeout=300s
Without cert-manager
To be able to log in, upload your existing certificate:
kubectl create secret tls runai-backend-tls -n runai-backend \
--cert /path/to/tls.crt --key /path/to/tls.key
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
--cert /path/to/tls.crt --key /path/to/tls.key
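envsubst replaces an unset variable with an empty string, so a missing DOMAIN_NAME or EMAIL would produce manifests with blank fields instead of an error. A small guard like the one below can catch that before anything is applied. This is a sketch only; the helper name require_vars is hypothetical.

```shell
# Return non-zero and name the offenders if any listed variable is unset or empty.
require_vars() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"        # indirect lookup that works in POSIX sh
    if [ -z "$value" ]; then
      echo "variable $name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Run this check before piping certs.yaml through envsubst.
if require_vars DOMAIN_NAME EMAIL; then
  echo "DOMAIN_NAME and EMAIL are set"
fi
```

The same check applies to any other file in this guide that is rendered with envsubst.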
Install Run:ai
Update CoreDNS configuration
Managed Kubernetes uses Cilium as it provides eBPF-based networking, load balancing and security policies, and improves overall observability.
However, eBPF-based networking can conflict with CoreDNS and cause DNS resolution failures. To ensure proper DNS resolution for Kubernetes services, create and apply a custom ConfigMap:
Create a custom ConfigMap coredns-custom.yaml that rewrites DNS queries to match your $DOMAIN_NAME:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  custom.override: |
    rewrite name exact '$DOMAIN_NAME' nginx-ingress-ingress-nginx-controller.nginx-ingress.svc.cluster.local
Make sure that you set up the DOMAIN_NAME variable and run the following command:
envsubst < coredns-custom.yaml | kubectl apply -f -
Install control plane
If you did not configure a Managed Service for PostgreSQL cluster, install the control plane:
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
--version "~2.18.0" --set global.domain=$DOMAIN_NAME
Wait until all pods are ready. You can check their status by running the following command:
If you configured a Managed Service for PostgreSQL cluster, install the control plane with its connection settings:
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --version "~2.18.0" \
--set global.domain=$DOMAIN_NAME \
--set postgresql.enabled=false \
--set global.postgresql.auth.host="$PG_ENDPOINT" \
--set global.postgresql.auth.database=runai \
--set global.postgresql.auth.username=runai \
--set global.postgresql.auth.existingSecret="postgresql-credentials" \
--set grafana.dbScheme=runai \
--set grafana.db.existingSecret="grafana-postgresql-credentials"
Wait until all pods are ready. You can check their status by running the following command:
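The status command itself is not shown in this section. One way to check, assuming the control plane pods run in the runai-backend namespace created earlier, is to list them with kubectl get pods -n runai-backend --no-headers and filter the result. The sketch below shows such a filter on sample text; the function name pods_not_ready and the pod names are hypothetical.

```shell
# Read `kubectl get pods --no-headers` output (NAME READY STATUS ...) on stdin
# and print pods that are neither Completed nor fully ready and Running.
pods_not_ready() {
  awk '{
    split($2, ready, "/")                      # READY column, e.g. "1/2"
    if ($3 == "Completed") next                # finished Job pods are fine
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Sample input for illustration only (hypothetical pod names).
printf '%s\n' \
  'pod-a 1/1 Running 0 5m' \
  'pod-b 0/1 Pending 0 5m' \
  'pod-c 0/1 Completed 0 5m' \
  'pod-d 1/2 Running 0 5m' |
  pods_not_ready
```

An empty result means every pod is either running with all containers ready or has completed.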
Create a Run:ai cluster
Go to DOMAIN_NAME in your browser.
Log in with the default credentials:
Username: test@run.ai
Password: Abcd!234
Immediately change the password in the interface.
Create a new cluster:
In the interface, select the Run:ai version 2.18 and the Same as the control plane cluster location.
Copy the provided command to install the cluster Helm chart.
In your terminal, run the provided command.
You have now deployed Run:ai in the Managed Service for Kubernetes cluster and you can work with it in the Run:ai web interface.
How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so Nebius AI Cloud does not charge for them:
Delete installed operators:
kubectl delete -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
kubectl delete -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml
Delete the Managed Kubernetes cluster:
terraform destroy -target=nebius_mk8s_v1_cluster.k8s-cluster
Delete the Managed PostgreSQL cluster.
Postgres, PostgreSQL and the Slonik Logo are trademarks or registered trademarks of the PostgreSQL Community Association of Canada, and used with their permission.