『智谱清言』Common Deep Learning Environment Problems

Continuously updated

Problem 1

Environment: Ubuntu 22.04, 8×A100, nvidia-smi reports CUDA 12.2, installed CUDA toolkit version 10.1

CUDA had been working normally, but today the GPU suddenly became unusable, with the following error:

import torch
'''
/home/yangbowen/anaconda3/envs/yangbowen/lib/python3.10/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
'''

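At first I suspected a CUDA version mismatch, but CUDA on this machine was set up by the administrator and processes were still running on it, so uninstalling and reinstalling was not an option.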

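After a lot of searching, the cause turned out to be that an apt update followed by a package upgrade had inadvertently upgraded nvidia-fabricmanager. The nvidia-fabricmanager version must match the installed NVIDIA driver version exactly, so the upgraded package could no longer start. The fix is as follows: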

Run systemctl status nvidia-fabricmanager. The service state is failed, which confirms that this is where the problem lies.
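The version match can also be checked directly. The following is a minimal sketch, assuming the package is named nvidia-fabricmanager-535 (as in the dpkg log of this machine) and that the package version is the driver version plus a Debian revision suffix such as `-1`. The helper names `strip_revision` and `versions_match` are hypothetical, made up for this example.

```shell
#!/bin/sh
# Sketch: compare the NVIDIA driver version with the Fabric Manager package version.
# Assumption: the package version is "<driver version>-<revision>".

# Strip the revision suffix (everything from the first "-"),
# e.g. "535.129.03-1" -> "535.129.03".
strip_revision() {
    printf '%s\n' "${1%%-*}"
}

# Return 0 when the driver version ($1) equals the upstream part
# of the package version ($2).
versions_match() {
    [ "$1" = "$(strip_revision "$2")" ]
}

# Only query the system when both tools are available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v dpkg-query >/dev/null 2>&1; then
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    package="$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-535 2>/dev/null)"
    if versions_match "$driver" "$package"; then
        echo "OK: driver $driver matches fabricmanager $package"
    else
        echo "MISMATCH: driver $driver vs fabricmanager $package"
    fi
fi
```

On a machine with a different driver branch, the package name in the dpkg-query call needs to be adjusted accordingly.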

Check the package log for version changes. It shows that the package was upgraded from 535.129.03-1 to 535.183.06-0ubuntu0.20.04.1:

cat /var/log/dpkg.log | grep nvidia

'''
2024-10-08 15:03:54 status installed nvidia-fabricmanager-535:amd64 535.129.03-1
2024-10-14 12:10:19 upgrade nvidia-fabricmanager-535:amd64 535.129.03-1 535.183.06-0ubuntu0.20.04.1
'''
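To pull only the upgrade entries out of a long log, a small awk filter can help. This is a sketch that assumes the dpkg.log line layout shown above (date, time, action, package, old version, new version). The function name `list_upgrades` is hypothetical.

```shell
#!/bin/sh
# Sketch: list upgrade entries for packages whose name starts with a prefix.
# $1: package name prefix, $2: path to a dpkg log file.
list_upgrades() {
    awk -v prefix="$1" '
        # Field 3 is the action, field 4 the package, fields 5 and 6 the versions.
        $3 == "upgrade" && index($4, prefix) == 1 {
            print $4, $5, "->", $6, "(" $1 ")"
        }
    ' "$2"
}

# Only read the system log when it exists and is readable.
if [ -r /var/log/dpkg.log ]; then
    list_upgrades nvidia-fabricmanager /var/log/dpkg.log
fi
```

For the log excerpt above, this prints a single line showing the package, the old and new versions, and the date of the upgrade.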

Install the matching version from the official NVIDIA repository and restart the service:

# Download; replace the parts in braces with your own distribution and version
wget https://developer.download.nvidia.cn/compute/cuda/repos/{ubuntu2204}/x86_64/nvidia-fabricmanager-{535_535.129.03-1}_amd64.deb

# Install
sudo apt-get install ./nvidia-fabricmanager-535_535.129.03-1_amd64.deb

# Enable the service at boot and start it
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
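To keep a later upgrade from causing the same breakage, the package can be put on hold with apt-mark. This is an optional extra step, not part of the original fix. It is a sketch that assumes the package name nvidia-fabricmanager-535 from the log above. The `hold_packages` wrapper and the `DRY_RUN` variable are hypothetical, added so the command can be previewed first.

```shell
#!/bin/sh
# Sketch: stop apt from upgrading the given packages.
# With DRY_RUN=1 the command is only printed, nothing is changed.
hold_packages() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo apt-mark hold $*"
    else
        sudo apt-mark hold "$@"
    fi
}

# Preview the command without touching the system.
DRY_RUN=1 hold_packages nvidia-fabricmanager-535
```

A held package stays at its current version until it is released again with apt-mark unhold, so it has to be upgraded by hand whenever the driver itself is upgraded.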

Check the service status again and re-run import torch. The problem is resolved.

Problem 2: ImportError: libGL.so.1: cannot open shared object file: No such file or directory

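This error comes up often when deploying YOLO-series models inside Docker. It is caused by shared libraries missing from the Docker environment and can be fixed with the following commands.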

apt update

# Fixes ImportError: libGL.so.1: cannot open shared object file: No such file or directory
apt-get install libgl1-mesa-glx

# Fixes ImportError: libgthread-2.0.so.0: cannot open shared object file: No such file or directory
apt-get install libglib2.0-0
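Before installing anything, it can be useful to check which of these libraries are actually missing in the container. The sketch below assumes a glibc-based image where ldconfig is available. The helper names `package_for_library` and `library_present` are hypothetical, and the library-to-package mapping only covers the two cases listed above. On newer Debian/Ubuntu releases where libgl1-mesa-glx is no longer packaged, libgl1 is, as far as I know, the package that provides libGL.so.1.

```shell
#!/bin/sh
# Sketch: report which of the two shared libraries are missing.

# Map a shared library name to the package named in this document.
package_for_library() {
    case "$1" in
        libGL.so.1)          echo "libgl1-mesa-glx" ;;
        libgthread-2.0.so.0) echo "libglib2.0-0" ;;
        *)                   return 1 ;;
    esac
}

# Return 0 when the dynamic linker cache lists the library.
library_present() {
    ldconfig -p 2>/dev/null | grep -q -F "$1"
}

for lib in libGL.so.1 libgthread-2.0.so.0; do
    if library_present "$lib"; then
        echo "$lib: found"
    else
        echo "$lib: missing, try: apt-get install -y $(package_for_library "$lib")"
    fi
done
```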