harry's blog

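After building my Docker image a while ago, I ran a program in it once, but I have been busy lately and never used it to debug PyTorch. This post records how I finally got PyTorch running successfully inside Docker.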

1. cv2 error

First, start the container:

docker run -it --gpus all -v /data4/wangyh:/res nvidia/cuda:v5 /bin/bash

Once inside the container, change to the res folder, where the code from the host is mounted, and start the program with torch.distributed.launch:

python -m torch.distributed.launch --nproc_per_node 8 train.py
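For reference, here is a minimal sketch of how a script started this way can read the values the launcher hands to each process. It assumes the launcher's classic behaviour of appending a --local_rank argument and exporting the WORLD_SIZE and RANK environment variables. parse_launch_args is a hypothetical helper for illustration, not part of the train.py used in this post.

```python
import argparse
import os


def parse_launch_args(argv=None):
    """Read the per-process values supplied by torch.distributed.launch.

    Hypothetical helper for illustration; the real train.py may differ.
    """
    parser = argparse.ArgumentParser()
    # The launcher appends --local_rank=<n> to each process's arguments.
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other options the training script defines.
    args, _unknown = parser.parse_known_args(argv)
    # WORLD_SIZE and RANK are exported by the launcher; fall back to one process.
    args.world_size = int(os.environ.get("WORLD_SIZE", "1"))
    args.rank = int(os.environ.get("RANK", "0"))
    return args
```

In a real training script, args.local_rank would typically be passed to torch.cuda.set_device before calling torch.distributed.init_process_group(backend="nccl").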

Running this command produced the first error: the cv2 module was not installed. I ran pip install opencv-python and started the program again, but it still failed:

Traceback (most recent call last):
  File "train.py", line 2, in <module>
    from utils.config import Config
  File "/res/restoration/MPRNet/Deraining/utils/__init__.py", line 1, in <module>
    from .image_utils import *
  File "/res/restoration/MPRNet/Deraining/utils/image_utils.py", line 3, in <module>
    import cv2
  File "/root/miniconda3/envs/torch/lib/python3.8/site-packages/cv2/__init__.py", line 8, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
Killing subprocess 477
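The last line of the traceback says the dynamic loader cannot open libGL.so.1. That is a system library rather than a Python package, which is why installing opencv-python with pip did not fix it. A small sketch for checking this from Python inside the container follows; shared_library_available is a hypothetical helper name.

```python
import ctypes


def shared_library_available(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library cannot be found, as with libGL.so.1 above.
        return False
    return True


print("libGL.so.1 available:", shared_library_available("libGL.so.1"))
```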

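I then searched Google and found several answers, including these: link 1, link 2

In summary, they suggest the following fixes (lines starting with RUN are Dockerfile instructions; inside a running container, drop the RUN prefix):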

# solution 1
RUN apt-get update
RUN apt-get install ffmpeg libsm6 libxext6  -y

# solution 2
RUN apt-get update && apt-get install -y python3-opencv
RUN pip install opencv-python

# solution 3
apt-get update && apt-get install libgl1

# solution 4
RUN apt-get update
RUN apt install -y libgl1-mesa-glx

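When I ran these commands, however, either the original error persisted after installation, or the installation itself failed near the end with the errors below. Retrying with --fix-missing, as the message suggests, did not help either.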

E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/universe/f/fyba/libfyba0_4.1.1-3_amd64.deb  Connection failed [IP: 91.189.88.152 80]
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/universe/f/freexl/libfreexl1_1.0.5-1_amd64.deb  Connection failed [IP: 91.189.88.152 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

Final solution

I eventually found the fix on CSDN: link
Only the following two commands are needed:

pip uninstall opencv-python
pip install opencv-python-headless

After that, start python and import cv2; the import now succeeds.
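To double-check the result, a small sketch like the following lists which OpenCV wheels are installed and whether cv2 imports. The package names are the four OpenCV distributions published on PyPI; the helper names are hypothetical and purely illustrative. It assumes Python 3.8 or newer for importlib.metadata.

```python
from importlib import metadata

# The four OpenCV wheels on PyPI; normally only one should be installed.
OPENCV_DISTS = (
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "opencv-contrib-python-headless",
)


def installed_opencv_packages():
    """Map each installed OpenCV distribution name to its version."""
    found = {}
    for name in OPENCV_DISTS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this variant is not installed
    return found


def cv2_import_status():
    """Try to import cv2 and describe the outcome as a short string."""
    try:
        import cv2
    except ImportError as exc:  # also covers a missing libGL.so.1
        return f"failed: {exc}"
    return f"ok: cv2 {cv2.__version__}"


print(installed_opencv_packages())
print(cv2_import_status())
```

If both opencv-python and opencv-python-headless show up in the first line, uninstalling the non-headless one as shown above is the likely next step.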

2. distributed error

With the cv2 problem solved, I ran python -m torch.distributed.launch --nproc_per_node 8 train.py to train the model, but it failed with the following error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8

Solution: link

That is, add the --ipc=host option when starting the container.
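One likely reason this helps: without --ipc=host a container gets its own small /dev/shm (64 MB by default in Docker), and multi-process training uses shared memory heavily. The sketch below, assuming a Linux container, reports the size of /dev/shm so the two configurations can be compared; shm_size_mib is a hypothetical helper.

```python
import os
import shutil


def shm_size_mib(path="/dev/shm"):
    """Return the total size of the shared-memory mount in MiB, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / (1024 * 1024)


size = shm_size_mib()
print("no /dev/shm found" if size is None else f"/dev/shm size: {size:.0f} MiB")
```

Setting the environment variable NCCL_DEBUG=INFO before launching is another way to get more detail than "unhandled system error".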

3. Starting, stopping, and running Docker in the background

3.1 Starting

1. Normal start

docker run -it <image:tag> /bin/bash

2. Start and name the container

docker run --name <container_name> -it <image:tag> /bin/bash

3. Attach GPUs

docker run --name <container_name> -it --gpus all <image:tag> /bin/bash

4. Mount a folder

docker run -it -v /data4/wangyh:/res <image:tag> /bin/bash

5. Mount multiple folders

docker run -it -v /data4/wangyh/code:/code -v /data4/wangyh/data:/dataset <image:tag> /bin/bash

Final version

docker run --name wyh -p 5678:22 --ipc=host --gpus all -it -v /data4/wangyh/FSDNet:/code wangyh/cuda:10.1 /bin/bash
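The final command combines every option from the list above. As an illustration, the same command can be assembled programmatically. docker_run_command is a hypothetical helper, and it only builds the argument list without running Docker.

```python
import shlex


def docker_run_command(image, name=None, gpus=None, ipc_host=False,
                       ports=(), volumes=(), command="/bin/bash"):
    """Build a `docker run` argument list from the options used in this post."""
    args = ["docker", "run"]
    if name:
        args += ["--name", name]
    for host_port, container_port in ports:
        args += ["-p", f"{host_port}:{container_port}"]
    if ipc_host:
        args.append("--ipc=host")  # share the host's IPC namespace
    if gpus:
        args += ["--gpus", gpus]
    args.append("-it")
    for host_dir, container_dir in volumes:
        args += ["-v", f"{host_dir}:{container_dir}"]
    args += [image, command]
    return args


cmd = docker_run_command(
    "wangyh/cuda:10.1",
    name="wyh",
    gpus="all",
    ipc_host=True,
    ports=[(5678, 22)],
    volumes=[("/data4/wangyh/FSDNet", "/code")],
)
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting issues when a path contains spaces.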

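3.2 Running Docker in the background

Once the program is running normally and we want the container to keep running in the background, press Ctrl+P followed by Ctrl+Q to detach from it. The GPUs remain occupied afterwards, which shows that the program is still running inside the container.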

List containers: docker ps -a
Remove a container: docker rm -f <id>

To get back into the container, run either of the following:

docker attach <name> or docker attach <container id>
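If you manage several containers, the listing can also be read from a script. The sketch assumes the docker CLI is on PATH and uses its --format option with the .ID, .Names and .Status fields; parse_ps_output and list_containers are hypothetical helpers.

```python
import shutil
import subprocess

# Tab-separated fields requested from `docker ps`.
PS_FORMAT = "{{.ID}}\t{{.Names}}\t{{.Status}}"


def parse_ps_output(text):
    """Turn tab-separated `docker ps` lines into a list of dicts."""
    containers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        container_id, name, status = line.split("\t", 2)
        containers.append({"id": container_id, "name": name, "status": status})
    return containers


def list_containers():
    """Return all containers (running or stopped), or [] if docker is unavailable."""
    if shutil.which("docker") is None:
        return []
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", PS_FORMAT],
        capture_output=True, text=True, check=False,
    )
    return parse_ps_output(result.stdout) if result.returncode == 0 else []
```

The name field of each entry is what docker attach <name> expects.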

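This article was written by Yonghui Wang and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, all articles on this site are original or translated by the site; please credit the author before reposting.
Last edited: Dec 19, 2024 12:13 pm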

A record of running PyTorch in Docker