Issue
On virtual machines with GPUs, CUDA may fail to initialize in rare cases, leading to errors when you run your workloads and tests. This can occur on any VM with GPUs, regardless of the specific platform, number of GPUs, or boot disk image. For example, executing PyTorch code may fail with a CUDA initialization error.
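The original error message is not reproduced on this page. As a hedged sketch, one quick way to surface the failure, assuming Python and PyTorch are installed on the VM, is to ask PyTorch whether CUDA initialized:

```shell
# Minimal CUDA initialization check via PyTorch (assumes python3 and torch
# are installed on the VM). On an affected VM this typically prints False
# or raises a CUDA initialization error instead of printing True:
cuda_check=$(python3 -c "import torch; print(torch.cuda.is_available())" 2>/dev/null)
cuda_check=${cuda_check:-check-failed}
echo "CUDA available: $cuda_check"
```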
Possible cause
One possible cause of CUDA initialization failures is an issue with the NVIDIA Fabric Manager, a component that provides NVLink and NVSwitch support for multi-GPU VMs. In rare cases, its service, nvidia-fabricmanager, does not initialize on VM startup because of a race condition or other timing issues. This can cause initialization issues for CUDA and the GPUs on the VM. However, CUDA initialization failures can also occur for other reasons.
How to check nvidia-fabricmanager
- Connect to the VM.
- Run the following command:
If the service is not running, the Active line in the output indicates that.
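The exact command is not preserved on this page. Assuming a systemd-based image, which is typical for Linux GPU VMs, the service can be checked like this:

```shell
# Check the state of the NVIDIA Fabric Manager service (assumes systemd):
sudo systemctl status nvidia-fabricmanager || true  # status exits nonzero when the service is down

# For scripting, read the state alone: "active" means running,
# "inactive" or "failed" means the service did not start:
state=$(systemctl is-active nvidia-fabricmanager 2>/dev/null)
state=${state:-unknown}
echo "nvidia-fabricmanager is: $state"
```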
Solutions
You can solve the issue on your existing VM without having to create a new VM. Try the following steps in order:
- Start the NVIDIA Fabric Manager service:
- Connect to the VM.
- Run the following command:
- Check whether starting the service worked. To do this, you can try running your workload or get diagnostic information about the GPUs on the VM. For example, use the NVIDIA System Management Interface (nvidia-smi) to get the fabric states and statuses of the GPUs. In the output, all the GPUs should be in the Completed fabric state with the Success status.
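The commands themselves were not preserved here. A plausible sketch, assuming systemd and a recent NVIDIA driver whose nvidia-smi -q output includes a Fabric section:

```shell
# Start the Fabric Manager service (assumes systemd):
sudo systemctl start nvidia-fabricmanager || true

# Query the fabric state of the GPUs; on a healthy VM every GPU should show
#   State  : Completed
#   Status : Success
fabric_info=$(nvidia-smi -q 2>/dev/null | grep -i -A 2 "Fabric")
fabric_info=${fabric_info:-unavailable}
echo "$fabric_info"
```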
- Restart the GPUs on the VM:
- Stop the services and workloads that use the GPUs. For example, if you are not running any workloads on the VM, you only need to stop the monitoring services:
- Run the nvidia-smi command that restarts the GPUs. This may take several minutes.
- Start the stopped workloads and services again. For example, to start the monitoring services, run the following command:
- Check whether restarting the GPUs worked. For example, you can run your workload or use nvidia-smi. If the step has not worked, that is, if some GPUs are not in the Completed fabric state with the Success status, proceed to the next step.
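The specific commands for this step are not shown above. As a sketch: the service name nvidia-dcgm below is only illustrative and depends on your image, and the --gpu-reset form that resets all GPUs at once assumes a recent driver:

```shell
# Stop services that use the GPUs (nvidia-dcgm is an example;
# substitute the monitoring services present on your image):
sudo systemctl stop nvidia-dcgm 2>/dev/null || true

# Reset the GPUs; this may take several minutes:
sudo nvidia-smi --gpu-reset 2>/dev/null || true

# Start the stopped services again:
sudo systemctl start nvidia-dcgm 2>/dev/null || true

# Re-check the fabric state (all GPUs should be Completed/Success):
reset_check=$(nvidia-smi -q 2>/dev/null | grep -i -A 2 "Fabric")
reset_check=${reset_check:-unavailable}
echo "$reset_check"
```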
- Stop the services and workloads that use the GPUs. For example, if you are not running any workloads on the VM, you only need to stop the monitoring services:
- Restart the VM.
After that, connect to the VM and check whether restarting it worked. For example, you can run your workload or use nvidia-smi. If the step has not worked, that is, if some GPUs are not in the Completed fabric state with the Success status, contact support.
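The last step can be sketched as follows; again, the monitoring service name is only an example, and the restart can equally be done from your cloud provider's console or CLI:

```shell
# Stop services and workloads that use the GPUs (example service name):
sudo systemctl stop nvidia-dcgm 2>/dev/null || true

# Restart the VM, for example from inside the guest:
#   sudo reboot

# After the VM is back up, reconnect and verify the fabric state;
# all GPUs should show State: Completed and Status: Success:
final_check=$(nvidia-smi -q 2>/dev/null | grep -i -A 2 "Fabric")
final_check=${final_check:-unavailable}
echo "$final_check"
```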