Reinstalled NVIDIA CUDA and started nvidia-fabricmanager.
How to fix "CUDA initialization: Unexpected error from cudaGetDeviceCount()"
$ python mcw.py
/home/mcw/mambaforge/envs/ailme/lib/python3.11/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
torch.cuda.is_available(): False
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get install libnvidia-common-525=525.125.06-0ubuntu1
apt-get install nvidia-kernel-common-525=525.125.06-0ubuntu1
apt-get install --no-install-recommends \
    cuda-drivers-525=525.125.06-1 \
    nvidia-driver-525=525.125.06-0ubuntu1 \
    nvidia-dkms-525=525.125.06-0ubuntu1 \
    nvidia-kernel-source-525=525.125.06-0ubuntu1 \
    libnvidia-gl-525=525.125.06-0ubuntu1 \
    libnvidia-compute-525=525.125.06-0ubuntu1 \
    libnvidia-decode-525=525.125.06-0ubuntu1 \
    libnvidia-extra-525=525.125.06-0ubuntu1 \
    nvidia-compute-utils-525=525.125.06-0ubuntu1 \
    libnvidia-encode-525=525.125.06-0ubuntu1 \
    nvidia-utils-525=525.125.06-0ubuntu1 \
    xserver-xorg-video-nvidia-525=525.125.06-0ubuntu1 \
    libnvidia-cfg1-525=525.125.06-0ubuntu1 \
    libnvidia-fbc1-525=525.125.06-0ubuntu1
nvidia-smi
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sh cuda_11.8.0_520.61.05_linux.run
ls /usr/local/cuda
apt-get install nvidia-fabricmanager-525=525.125.06-1
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
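After the install steps, it may be worth confirming that the fabric manager service is actually running before testing PyTorch. The following is a minimal sketch, not part of the original procedure: it shells out to `systemctl is-active`, and the helper names `service_state` and `is_active_state` are made up for illustration.

```python
import subprocess


def is_active_state(state: str) -> bool:
    """Return True only when systemctl reported the unit as 'active'."""
    return state.strip() == "active"


def service_state(unit: str = "nvidia-fabricmanager") -> str:
    """Ask systemd for the unit state (e.g. 'active', 'inactive', 'failed').

    `systemctl is-active` exits non-zero when the unit is not active,
    so check=True is deliberately not used here.
    """
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

On the server, `is_active_state(service_state())` should be true once `systemctl start nvidia-fabricmanager` has succeeded; if it is not, `systemctl status nvidia-fabricmanager` usually shows why.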
$ python
Python 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
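The same check can be put in a small script so it is easy to rerun after any driver change. This is only a sketch: the contents of mcw.py are not shown in this note, so `cuda_status` below is a hypothetical stand-in, and it returns None when PyTorch is not installed.

```python
def cuda_status():
    """Report whether PyTorch can see CUDA devices.

    Returns None if PyTorch is not importable; otherwise a dict with the
    results of torch.cuda.is_available() and torch.cuda.device_count().
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


def describe(status) -> str:
    """Turn the status dict into a one-line summary."""
    if status is None:
        return "PyTorch is not installed in this environment"
    return (
        f"torch.cuda.is_available(): {status['available']}, "
        f"devices: {status['device_count']}"
    )
```

Calling `print(describe(cuda_status()))` prints a one-line summary. If CUDA is still reported as unavailable, the Error 802 warning shown earlier will appear again.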
Note: on an A100-80G server, do not casually update packages with apt-get or enable Ubuntu automatic system updates.
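The nvidia-fabricmanager package had been updated for some reason, for example by an automatic system update or by running apt-get update followed by apt-get upgrade. This package must be the same version as the driver to work properly.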
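The version constraint and the advice against updates can be sketched in code. The example below makes two assumptions: that the driver version is read with `nvidia-smi --query-gpu=driver_version` and the package version with `dpkg-query`, and that holding the packages with `apt-mark hold` is an acceptable way to stop apt from upgrading them. All function names and the `PINNED_PACKAGES` list are hypothetical; adjust the package names to what is installed.

```python
import subprocess

# Assumed set of packages to pin; taken from the install commands in this note.
PINNED_PACKAGES = [
    "cuda-drivers-525",
    "nvidia-driver-525",
    "nvidia-dkms-525",
    "nvidia-fabricmanager-525",
]


def upstream_version(version: str) -> str:
    """Strip a Debian epoch and revision: '525.125.06-1' -> '525.125.06'."""
    version = version.strip()
    if ":" in version:
        version = version.split(":", 1)[1]
    if "-" in version:
        version = version.rsplit("-", 1)[0]
    return version


def versions_match(driver_version: str, package_version: str) -> bool:
    """True when driver and fabric manager share the same upstream version."""
    return upstream_version(driver_version) == upstream_version(package_version)


def query_driver_version() -> str:
    """Read the loaded driver version from nvidia-smi (first GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()


def query_package_version(package: str) -> str:
    """Read the installed version of a Debian package."""
    return subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", package],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


def build_hold_command(packages):
    """Build the apt-mark command that excludes packages from upgrades."""
    if not packages:
        raise ValueError("no packages given")
    return ["apt-mark", "hold", *packages]
```

On the server, `versions_match(query_driver_version(), query_package_version("nvidia-fabricmanager-525"))` should return True after the reinstall. Running `subprocess.run(build_hold_command(PINNED_PACKAGES), check=True)` as root would then keep apt from upgrading those packages until they are released with `apt-mark unhold`.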
References:
https://blog.csdn.net/k_wenry/article/details/138350564
https://bbs.huaweicloud.com/blogs/401682
Uninstalling already-installed packages: https://blog.csdn.net/qq_41076797/article/details/124909408
How to fix "CUDA initialization: Unexpected error from cudaGetDeviceCount()": https://www.cnblogs.com/huadongw/p/16504137.html
https://developer.nvidia.com/cuda-11-8-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=runfile_local
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/
https://mirrors.cloud.tencent.com/nvidia-cuda/ubuntu2204/x86_64/
Downgrading the system kernel: https://blog.csdn.net/qq_62368277/article/details/134273919