Fixing two problems: nvidia-smi driver version mismatch, and CUDA not available

Essay, posted 2 months ago by 澳洲单创

The fix: reinstall the NVIDIA driver and CUDA, then start nvidia-fabricmanager.

How to fix "CUDA initialization: Unexpected error from cudaGetDeviceCount()"

 

$ python mcw.py
/home/mcw/mambaforge/envs/ailme/lib/python3.11/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
torch.cuda.is_available(): False
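
The number in the warning is a CUDA runtime error code. Code 802 is cudaErrorSystemNotReady, which the CUDA documentation describes as the system not yet being ready to start CUDA work, for instance because a required driver daemon is not running. Below is a minimal sketch for pulling the code out of a captured warning. The helper name `extract_cuda_error` and the hint table are illustrative assumptions, not part of PyTorch.

```python
import re

# Hint table for illustration only; 802 is cudaErrorSystemNotReady in the CUDA runtime API.
HINTS = {
    802: "cudaErrorSystemNotReady: check that required driver daemons "
         "(e.g. nvidia-fabricmanager) are running",
}


def extract_cuda_error(message: str):
    """Return (code, description) parsed from an 'Error NNN: text' fragment, or None."""
    match = re.search(r"Error (\d+): ([^(]+)", message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


# Shortened copy of the warning shown above.
warning = (
    "CUDA initialization: Unexpected error from cudaGetDeviceCount(). "
    "Error 802: system not yet initialized (Triggered internally at "
    "../c10/cuda/CUDAFunctions.cpp:108.)"
)

parsed = extract_cuda_error(warning)
if parsed is not None:
    code, text = parsed
    print(code, text)
    print(HINTS.get(code, "no hint recorded for this code"))
```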

 

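# These commands need root privileges.
# Register NVIDIA's CUDA apt repository, then refresh the package index so apt can see it.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update

# Pin the shared driver packages to 525.125.06.
apt-get install libnvidia-common-525=525.125.06-0ubuntu1
apt-get install nvidia-kernel-common-525=525.125.06-0ubuntu1

# Install the remaining driver packages, all pinned to the same version.
apt-get install --no-install-recommends cuda-drivers-525=525.125.06-1 nvidia-driver-525=525.125.06-0ubuntu1 nvidia-dkms-525=525.125.06-0ubuntu1 nvidia-kernel-source-525=525.125.06-0ubuntu1 libnvidia-gl-525=525.125.06-0ubuntu1 libnvidia-compute-525=525.125.06-0ubuntu1 libnvidia-decode-525=525.125.06-0ubuntu1 libnvidia-extra-525=525.125.06-0ubuntu1 nvidia-compute-utils-525=525.125.06-0ubuntu1 libnvidia-encode-525=525.125.06-0ubuntu1 nvidia-utils-525=525.125.06-0ubuntu1 xserver-xorg-video-nvidia-525=525.125.06-0ubuntu1 libnvidia-cfg1-525=525.125.06-0ubuntu1 libnvidia-fbc1-525=525.125.06-0ubuntu1
# Confirm the driver loads and note the version it reports.
nvidia-smi
# Install the CUDA 11.8 toolkit from the runfile. The runfile also bundles driver 520.61.05;
# you will likely want to deselect the driver component in the installer so that 525.125.06 stays in place.
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sh cuda_11.8.0_520.61.05_linux.run
ls /usr/local/cuda
# Install Fabric Manager at exactly the driver version, then enable and start the service.
apt-get install nvidia-fabricmanager-525=525.125.06-1
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager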
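
Before retrying PyTorch, it can help to confirm that the driver and Fabric Manager versions really agree and that the service is running. The sketch below is one way to do that. It assumes the package name nvidia-fabricmanager-525 from the commands above; the helper names are hypothetical, and the comparison simply drops the Debian revision suffix (such as `-1`) from the package version.

```python
import subprocess


def upstream_version(package_version: str) -> str:
    """Drop the Debian revision: '525.125.06-1' -> '525.125.06'."""
    return package_version.strip().split("-", 1)[0]


def versions_match(driver_version: str, package_version: str) -> bool:
    """True when the driver version equals the package's upstream version."""
    return driver_version.strip() == upstream_version(package_version)


def run(cmd):
    """Run a command and return its stdout; empty string if the tool is missing."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout.strip()


def check_host(package="nvidia-fabricmanager-525"):
    """Collect driver version, package version and service state on a GPU host."""
    lines = run(["nvidia-smi", "--query-gpu=driver_version",
                 "--format=csv,noheader"]).splitlines()
    driver = lines[0].strip() if lines else ""  # one line per GPU; take the first
    pkg = run(["dpkg-query", "-W", "-f=${Version}", package])
    service = run(["systemctl", "is-active", "nvidia-fabricmanager"])
    return {
        "driver": driver,
        "fabricmanager": pkg,
        "service": service,
        "match": bool(driver) and versions_match(driver, pkg),
    }


# Pure comparison using the versions installed above.
print(versions_match("525.125.06", "525.125.06-1"))
# A hypothetical newer package version, to show a mismatch.
print(versions_match("525.125.06", "535.54.03-1"))
```

On the GPU server itself, `print(check_host())` gathers the live values; a `match` of False or a `service` other than `active` points back at the Fabric Manager install step.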

 

$ python
Python 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> 

 

Note: on an A100-80G server, do not casually upgrade packages with apt-get or turn on Ubuntu's automatic system updates.

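The nvidia-fabricmanager package had been updated for some reason, for example by an automatic system update or by running apt-get update followed by apt-get upgrade. This package must be the same version as the driver to work; when the two differ, the service does not start, which appears to be what caused the Error 802 above.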
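
One way to guard against this kind of drift is to put the driver and Fabric Manager packages on hold with `apt-mark hold`, so that `apt-get upgrade` skips them. The sketch below only builds and prints the command. The package selection is an assumption based on the install commands above, and `hold_command` is a hypothetical helper.

```python
import shlex

# Assumed selection, taken from the packages installed earlier in this post.
PINNED = [
    "nvidia-driver-525",
    "nvidia-dkms-525",
    "cuda-drivers-525",
    "nvidia-fabricmanager-525",
]


def hold_command(packages):
    """Build an 'apt-mark hold' invocation for the given package names."""
    if not packages:
        raise ValueError("no packages given")
    return ["apt-mark", "hold", *packages]


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(hold_command(PINNED)))
```

Run the printed command as root. `apt-mark unhold` with the same package names reverses it when a deliberate upgrade is planned.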

 

 

Reference links:

https://blog.csdn.net/k_wenry/article/details/138350564
https://bbs.huaweicloud.com/blogs/401682
Uninstalling packages that are already installed: https://blog.csdn.net/qq_41076797/article/details/124909408

How to fix "CUDA initialization: Unexpected error from cudaGetDeviceCount()": https://www.cnblogs.com/huadongw/p/16504137.html
https://developer.nvidia.com/cuda-11-8-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=runfile_local
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/
https://mirrors.cloud.tencent.com/nvidia-cuda/ubuntu2204/x86_64/

 

Downgrading the system kernel: https://blog.csdn.net/qq_62368277/article/details/134273919

 
