- Go to Observability → Metrics and select the resources you would like to review.
- Go to the page of the cluster you would like to review and switch to the Metrics tab.
Dashboard filters
Keep mlflow selected in the Group list. The other groups in the list are used for internal purposes.
To check the health of Managed MLflow containers separately, select them in the Pod list.
Use time filters to view a specific period of usage.
By default, the data is refreshed every 15 seconds. You can configure this interval to the right of the time filters.
Cluster monitoring metrics
Resource usage
- Service containers: Number of containers.
- Total service CPU usage: Amount of vCPU consumed by the cluster.
- Total service memory usage: Amount of consumed RAM in MiB, including work memory and cache.
- Total disk IO usage: Disk read/write rate, in bytes per second.
- Total service network usage: Network receive/transmit rate, in bytes per second.
- Total service disk usage: Amount of consumed storage in KiB.
- Throttling containers in %: Percentage of CPU periods during which containers were throttled.
- Throttling containers in seconds: Percentage of CPU seconds during which containers were throttled.
- CPU usage: Amount of vCPU consumed per container, including the limit.
- Memory usage: Amount of memory that containers used over a given period, including work memory and cache. Measured in bytes.
- Network RX/TX: Average speed of data received/sent over the network per container, in bytes per second.
- Disk IO: Average IOPS of the cluster’s disks per container, in operations per second.
- Disk used space: Amount of disk space consumed per container, including the limit.
- Disk inodes used: Number of inodes (file metadata records) used per container, including the limit. Running out of inodes makes the disk behave as if it were full, especially if you generate many small files in an experiment.
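To make the throttling percentage and the inode metric more concrete, here is a minimal Python sketch. It assumes the throttling figure is a ratio of throttled CPU periods to all CPU periods, as exposed by the Linux cgroup counters `nr_throttled` and `nr_periods`; the dashboard's actual queries are not documented here, so treat this as an illustration rather than its implementation. The function names and sample numbers are hypothetical. The inode check uses `os.statvfs`, which is available on Unix-like systems only.

```python
import os


def throttled_period_percent(nr_periods: int, nr_throttled: int) -> float:
    """Share of CPU enforcement periods in which a container was throttled."""
    if nr_periods <= 0:
        # No periods elapsed yet, so nothing could have been throttled.
        return 0.0
    return 100.0 * nr_throttled / nr_periods


def inode_usage_percent(path: str = "/") -> float:
    """Percentage of inodes in use on the filesystem that holds `path`."""
    st = os.statvfs(path)
    if st.f_files == 0:
        # Some filesystems do not report inode totals.
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


if __name__ == "__main__":
    # Hypothetical sample counters: 1200 periods, 90 of them throttled.
    print(f"throttled periods: {throttled_period_percent(1200, 90):.1f}%")
    print(f"inodes used on /: {inode_usage_percent('/'):.1f}%")
```

Running the inode check inside a training container against the directory where an experiment writes its artifacts can show whether many small files are approaching the limit before writes start to fail.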
MLflow
- Experiments count: Number of experiments created in Managed MLflow.
- Runs count: Number of training runs.
- Models count: Number of registered models, including versioned models.
- Users count: Number of unique users who are running experiments.
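If you want to cross-check these counters from a notebook, the sketch below tallies similar numbers through the MLflow tracking API. It is an approximation under stated assumptions, not the way the dashboard computes its values: `fetch_all` and `summarize_tracking_server` are hypothetical helper names, the `client` argument is expected to be an `mlflow.MlflowClient` (or any object with the same `search_experiments`, `search_runs`, and `search_registered_models` methods), and users are identified by the `mlflow.user` run tag.

```python
def fetch_all(search, **kwargs):
    """Collect every page returned by a paginated MLflow search method."""
    items, token = [], None
    while True:
        page = search(page_token=token, **kwargs)
        items.extend(page)
        # MLflow paged results carry a `token` for the next page, if any.
        token = getattr(page, "token", None)
        if not token:
            return items


def summarize_tracking_server(client) -> dict:
    """Count experiments, runs, registered models, and distinct run users."""
    experiments = fetch_all(client.search_experiments)
    experiment_ids = [e.experiment_id for e in experiments]

    runs = []
    if experiment_ids:
        runs = fetch_all(client.search_runs, experiment_ids=experiment_ids)

    # Assumption: the user who started a run is stored in the `mlflow.user` tag.
    users = {r.data.tags.get("mlflow.user") for r in runs}
    users.discard(None)

    models = fetch_all(client.search_registered_models)

    return {
        "experiments": len(experiments),
        "runs": len(runs),
        "models": len(models),
        "users": len(users),
    }
```

With the `mlflow` package installed and the tracking URI configured, a call such as `summarize_tracking_server(mlflow.MlflowClient())` should return a dictionary of counts. This sketch counts registered models only, not their individual versions, and by default the search methods return active experiments and runs, so the numbers may differ from the dashboard.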