Costs
This tutorial includes the following chargeable resources:
- Managed Kubernetes cluster.
- Run:ai infrastructure.
- (Optional) Managed Service for PostgreSQL® cluster.
Prerequisites
- Get a Run:ai account token.
- Prepare the environment:
  - Install and configure the Nebius AI Cloud CLI.
  - Install Terraform.
  - Install kubectl and Helm.
- Install jq to extract IDs and tokens from the JSON data returned by the Nebius AI Cloud CLI. For more details, see the jq documentation.
- Save the domain name you control to an environment variable:
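A minimal sketch of this preparation, assuming a POSIX shell. The missing_tools helper is hypothetical, the executable names passed to it are assumptions about how each tool is installed, and example.com is a placeholder for your own domain:

```shell
# Hypothetical helper: print each listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Report missing prerequisites (prints nothing when all are installed).
missing_tools nebius terraform kubectl helm jq

# Placeholder value: replace example.com with the domain you control.
export DOMAIN_NAME="example.com"
echo "DOMAIN_NAME=${DOMAIN_NAME}"
```

Later steps refer to the DOMAIN_NAME variable, so run them in the same shell session.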
Steps
Set up a Managed Service for Kubernetes cluster
For this tutorial, a Managed Service for Kubernetes cluster must have:
- A node group with at least three nodes. Each of these nodes must have a public IP address allocated.
- A mounted filesystem.
- NVIDIA® GPU Operator.
- Clone the nebius-solution-library repository from GitHub and go to the k8s-training directory:
- Generate an SSH key pair:
  If you want to use a different name for the SSH key pair, specify your public key path in terraform.tfvars.
- Load the environment variables:
- Initialize Terraform to download providers and modules:
- Set enable_grafana and enable_prometheus to false, gpu_nodes_assign_public_ip to true, and enter your project settings in the k8s-training/terraform.tfvars file, or overwrite the values while applying the configuration:
  The command contains the following parameters:
  - parent_id: Project ID.
  - subnet_id: Subnet ID.
  - region: The project region is displayed in the upper-left corner of the web console, next to your project name.
- When the cluster and the nodes are ready, connect to the cluster:
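The Terraform steps in this list, plus a quick check to run once you are connected, might look roughly like the sketch below. It is not the tutorial's exact command: PARENT_ID, SUBNET_ID and REGION are hypothetical shell variables holding your project settings, and both function names are made up for illustration. The Terraform variable names come from the step above.

```shell
# Sketch of the Terraform steps, meant to run from the k8s-training directory.
# PARENT_ID, SUBNET_ID and REGION are hypothetical shell variables that
# hold your project ID, subnet ID and region.
apply_cluster() {
  terraform init &&
  terraform apply \
    -var "enable_grafana=false" \
    -var "enable_prometheus=false" \
    -var "gpu_nodes_assign_public_ip=true" \
    -var "parent_id=${PARENT_ID}" \
    -var "subnet_id=${SUBNET_ID}" \
    -var "region=${REGION}"
}

# After connecting, count nodes whose STATUS column starts with "Ready".
ready_node_count() {
  kubectl get nodes --no-headers |
    awk '$2 ~ /^Ready/ { n++ } END { print n + 0 }'
}
```

Since this tutorial needs at least three nodes, a result of 3 or more from ready_node_count suggests the node group is usable.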
Configure KServe
KServe is an open-source framework for serving ML models on Kubernetes. KServe uses KNative for serverless deployment and the auto-scaling of ML models.
- KNative can only run on nodes with public IP addresses. To ensure this, identify nodes without public IPs and cordon them.
  - Identify nodes:
    If you earlier created nodes without public IP addresses, they will have only InternalIP in the output.
  - If the resulting list contains nodes with InternalIP only, cordon these nodes:
- Install KNative:
- Check the result:
  Make sure that the default-domain job reaches the Complete status and the 3scale-kourier-gateway pod is running.
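The node check at the start of this list can be sketched as follows. This is one possible approach rather than the tutorial's original commands: it assumes kubectl prints <none> in a custom column when a node has no ExternalIP address, and both function names are hypothetical.

```shell
# List nodes that have no ExternalIP address (one name per line).
# kubectl prints <none> in a custom column when the JSONPath matches nothing.
nodes_without_public_ip() {
  kubectl get nodes --no-headers \
    -o 'custom-columns=NAME:.metadata.name,EXTERNAL:.status.addresses[?(@.type=="ExternalIP")].address' |
    awk '$2 == "<none>" { print $1 }'
}

# Cordon every node from that list so that new pods are not scheduled there.
cordon_private_nodes() {
  for node in $(nodes_without_public_ip); do
    kubectl cordon "$node"
  done
}
```

The same loop with kubectl uncordon in place of kubectl cordon reverses the change once KNative is installed.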
Install operators
- Install the Kubeflow Training Operator:
- Install the MPI Operator:
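After both operators are installed, a quick way to confirm that their custom resource definitions are registered might look like this. The function names are hypothetical, and the CRD names in the usage note are assumptions based on the upstream Kubeflow projects, so adjust them to the versions you installed.

```shell
# Return success if the named CustomResourceDefinition exists in the cluster.
crd_installed() {
  kubectl get crd "$1" >/dev/null 2>&1
}

# Print one status line per CRD name passed as an argument.
report_crds() {
  for crd in "$@"; do
    if crd_installed "$crd"; then
      echo "$crd: installed"
    else
      echo "$crd: missing"
    fi
  done
}
```

For example, report_crds pytorchjobs.kubeflow.org mpijobs.kubeflow.org prints one line for each of the two assumed CRD names.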
(Optional) Prepare nodes for installing other applications
If you cordoned nodes without public IP addresses during the KServe installation, uncordon them:
Set up nginx
- Get the IP addresses of the nodes:
- Install nginx:
- In your DNS provider, link your domain name (DOMAIN_NAME) to the public IP address of one of the nodes.
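Once the DNS record exists, you can check that it points at one of the nodes. The sketch below is an assumption-laden illustration: it relies on the dig utility being installed, on DNS changes having propagated, and on hypothetical function names.

```shell
# Print the public (ExternalIP) address of every node, one per line.
node_public_ips() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="ExternalIP")].address}{"\n"}{end}'
}

# Succeed if the first A record of the given domain matches a node IP.
domain_points_to_node() {
  resolved="$(dig +short "$1" A | head -n 1)"
  [ -n "$resolved" ] && node_public_ips | grep -Fxq "$resolved"
}
```

For example, domain_points_to_node "$DOMAIN_NAME" returns a zero exit status when the record matches.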
Install and configure Prometheus
(Optional) Install cert-manager
- If you do not have public TLS certificates, install cert-manager:
- Create certs.yaml with the certificate resources. You will use it later when you create a TLS secret:
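The original contents of certs.yaml are not reproduced here. As an illustration only, the function below writes a typical cert-manager pair: a ClusterIssuer that uses Let's Encrypt with an HTTP-01 solver, and a Certificate for your domain. Every resource name, the namespace, the secret name and the nginx ingress class are assumptions, so adapt them to your installation.

```shell
# Write certs.yaml for the given domain and contact email.
# Usage: write_certs_yaml DOMAIN EMAIL [OUTPUT_FILE]
write_certs_yaml() {
  cat > "${3:-certs.yaml}" <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod            # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: $2
    privateKeySecretRef:
      name: letsencrypt-prod-key    # hypothetical secret name
    solvers:
      - http01:
          ingress:
            class: nginx            # assumes the nginx ingress class
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: runai-tls                   # hypothetical certificate name
  namespace: runai-backend          # assumed namespace
spec:
  secretName: runai-backend-tls     # assumed TLS secret name
  dnsNames:
    - $1
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
EOF
}
```

Calling write_certs_yaml "$DOMAIN_NAME" "$EMAIL" creates certs.yaml in the current directory.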
(Optional) Configure a Managed Service for PostgreSQL cluster
Setting up a Managed Service for PostgreSQL cluster is not strictly necessary for using Run:ai, but it’s strongly recommended for production environments.
- Install the postgresql package.
- Create a Managed PostgreSQL cluster:
  In this command, specify the network ID.
- Configure the Managed Service for PostgreSQL cluster to work with Run:ai:
Create resources
- Create namespaces:
- Use your email and the token received from Run:ai to create a Kubernetes secret with Run:ai credentials:
- If you have not created a Managed Service for PostgreSQL cluster in (Optional) Configure a Managed Service for PostgreSQL cluster, create a password and save it to the PG_PASSWORD environment variable:
- Create a Kubernetes secret with Managed Service for PostgreSQL credentials:
- Create a TLS secret:
  - With cert-manager
  - Without cert-manager
  Make sure that you set up the DOMAIN_NAME and EMAIL variables, and run the following set of commands:
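Two of the steps in this list, creating namespaces and generating PG_PASSWORD, can be sketched in a way that is safe to re-run. This is an illustration under assumptions: the helper names are hypothetical, and the password format (32 hexadecimal characters) is a choice made here, not a Run:ai requirement.

```shell
# Create a namespace only if it does not exist yet (safe to re-run).
ensure_namespace() {
  kubectl create namespace "$1" --dry-run=client -o yaml | kubectl apply -f -
}

# Generate a 32-character hexadecimal password from /dev/urandom.
generate_password() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Only generate a password when PG_PASSWORD is not already set.
export PG_PASSWORD="${PG_PASSWORD:-$(generate_password)}"
```

For example, ensure_namespace runai-backend creates a namespace called runai-backend, which is an assumed name; use the namespaces that your Run:ai version expects.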
Install Run:ai
Update CoreDNS configuration
Managed Kubernetes uses Cilium because it provides eBPF-based networking, load balancing and security policies, and improves overall observability. However, eBPF-based networking can conflict with CoreDNS and cause DNS resolution failures. To ensure proper DNS resolution for Kubernetes services, create and apply a custom ConfigMap:
- Create a custom ConfigMap coredns-custom.yaml that rewrites DNS queries to match your $DOMAIN_NAME:
- Make sure that you set up the DOMAIN_NAME variable and run the following command:
Install control plane
- Without Managed PostgreSQL
- With Managed PostgreSQL
- Install the control plane:
- Wait until all pods are ready. You can check their status by running the following command:
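One way to automate this wait is to poll the pod list until nothing is pending. The sketch below parses the default kubectl get pods columns (NAME, READY, STATUS) and treats Completed pods as done. The function names are hypothetical, and the namespace you pass in depends on where the control plane was installed.

```shell
# Count pods in a namespace that are neither Completed nor fully Ready.
# Usage: pods_not_ready NAMESPACE
pods_not_ready() {
  kubectl get pods -n "$1" --no-headers |
    awk '{
      split($2, ready, "/")                       # READY column, e.g. 1/2
      if ($3 == "Completed") next                 # finished jobs are fine
      if ($3 != "Running" || ready[1] != ready[2]) n++
    }
    END { print n + 0 }'
}

# Poll every 10 seconds until every pod is ready.
wait_for_pods() {
  while [ "$(pods_not_ready "$1")" -gt 0 ]; do
    sleep 10
  done
}
```

Note that an empty pod list also counts as ready in this sketch, so check that kubectl can reach the cluster before relying on it.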
Create a Run:ai cluster
- Go to DOMAIN_NAME in your browser.
- Log in with the default credentials:
  - Username: test@run.ai
  - Password: Abcd!234
- Immediately change the password in the interface.
- Create a new cluster:
  - In the interface, select the Run:ai version 2.18 and the Same as the control plane cluster location.
  - Copy the provided command to install the cluster Helm chart.
  - In your terminal, run the provided command.
How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so that Nebius AI Cloud does not charge for them:
- Delete installed operators:
- Delete the Managed Kubernetes cluster:
- Delete the Managed PostgreSQL cluster.
Postgres, PostgreSQL and the Slonik Logo are trademarks or registered trademarks of the PostgreSQL Community Association of Canada, and used with their permission.