# Troubleshooting Slow Multi-GPU Speed
# 1. Check how the GPUs communicate (per reference [1])
```
> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NODE    0-15,32-47      0
GPU1    NODE     X      0-15,32-47      0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
On a dual-card TITAN RTX workstation, GPU0 and GPU1 communicate via NODE.
```
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PIX     SYS     SYS     0-19            N/A
GPU1    PIX      X      SYS     SYS     0-19            N/A
GPU2    SYS     SYS      X      PIX     0-19            N/A
GPU3    SYS     SYS     PIX      X      0-19            N/A

(legend as above)
```
On a four-card RTX 2080Ti server the output above shows that:
- GPU0 and GPU1 communicate via PIX,
- GPU2 and GPU3 communicate via PIX,
- GPU0 reaches GPU2 and GPU3 via SYS,
- GPU1 reaches GPU2 and GPU3 via SYS.
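The same pairing information can be queried from code. Below is a minimal sketch of my own (it mirrors what `nvidia-smi topo` reports, rather than reproducing any exact tool) that asks the CUDA runtime whether each device pair supports direct peer access:

```c
// p2p_check.cu -- print the CUDA peer-access matrix for all GPU pairs.
// Build: nvcc p2p_check.cu -o p2p_check
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Found %d CUDA devices\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // Asks the driver whether device i can directly address
            // memory on device j (needs a common PCIe root or NVLink).
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j, can ? "P2P OK" : "no P2P");
        }
    }
    return 0;
}
```

On the four-card server above, one would expect "P2P OK" only within the PIX pairs (0-1 and 2-3).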
# 2. Measure the communication bandwidth
Use the P2P bandwidth test provided by the CUDA Samples:
- https://github.com/NVIDIA/cuda-samples/tree/master/Samples/p2pBandwidthLatencyTest
Test procedure:
- Check your CUDA version with `nvcc -V` (mine is CUDA 10.2 locally), find the matching CUDA Samples release at https://github.com/NVIDIA/cuda-samples/releases, download and unzip it, then build and run as follows:

```
> cd cuda-samples-10.2/Samples/p2pBandwidthLatencyTest
> make
> ./p2pBandwidthLatencyTest
```
On the dual-card TITAN RTX workstation:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 556.92   8.65
     1   8.58 457.95
```
On the four-card RTX 2080Ti server:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 526.19   8.27  10.05  16.41
     1   8.37 527.06  10.12  10.53
     2  10.47   9.67 526.39   5.80
     3  12.30   8.93   8.20 436.79
```
The inter-GPU bandwidth is indeed quite low; let's try to fix it by installing an NVLink bridge.
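For intuition on where such numbers come from, here is a rough, hand-rolled unidirectional bandwidth measurement between GPU0 and GPU1 using `cudaMemcpyPeer` and CUDA events. It is a simplified sketch, not the methodology of `p2pBandwidthLatencyTest` (the buffer size and repetition count are arbitrary choices of mine):

```c
// p2p_bw.cu -- rough unidirectional GPU0 -> GPU1 copy bandwidth.
// Build: nvcc p2p_bw.cu -o p2p_bw
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 256ull << 20;  // 256 MiB test buffer (arbitrary)
    const int reps = 20;
    void *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    // Warm-up copy so lazy context initialization is not timed.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);

    cudaEventRecord(beg);
    for (int r = 0; r < reps; ++r)
        // Without enabled peer access this stages through host memory,
        // which is exactly why the off-diagonal numbers above are low.
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("GPU0 -> GPU1: %.2f GB/s\n",
           (double)bytes * reps / (ms / 1e3) / 1e9);
    return 0;
}
```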
------------ Update 2021/09/19 ------------
Tested again after moving the cards to different PCIe slots:
```
> nvidia-smi topo --matrix
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-15,32-47      0
GPU1    SYS      X      16-31,48-63     1
```
```
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 554.55  11.42
     1  11.30  558.10
```
After changing slots, the two GPUs are bound to different CPU groups (separate NUMA nodes), and communication got somewhat faster (8.6 → 11.4 GB/s).
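A related software-side knob (my addition; the fix in this post was purely physical): NVML can pin the calling process to the CPU cores that are NUMA-local to a given GPU, which is what the `CPU Affinity` column above describes. A minimal sketch:

```c
/* bind_numa.c -- pin the current process to the CPUs local to GPU0.
 * Build: gcc bind_numa.c -I/usr/local/cuda/include -lnvidia-ml -o bind_numa
 */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);  /* GPU0 assumed */
    /* Sets this process's CPU affinity to the cores nvidia-smi lists
     * under "CPU Affinity" for this GPU (0-15,32-47 above). */
    if (nvmlDeviceSetCpuAffinity(dev) == NVML_SUCCESS)
        printf("process pinned to GPU0's NUMA-local CPUs\n");
    nvmlShutdown();
    return 0;
}
```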
# 3. Check NVLink status
```
> nvidia-smi nvlink --status
GPU 0: TITAN RTX (UUID: GPU-486a9b0f-d80e-668e-fa93-cf5988109248)
         Link 0: <inactive>
         Link 1: <inactive>
GPU 1: TITAN RTX (UUID: GPU-726723cc-04b1-d7a2-7638-0b73154449bc)
         Link 0: <inactive>
         Link 1: <inactive>
```
All links on both cards are inactive.
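The same link states can be read through NVML; a small sketch, assuming the NVML development header shipped with CUDA is available:

```c
/* nvlink_state.c -- query NVLink link state via NVML.
 * Build: gcc nvlink_state.c -I/usr/local/cuda/include -lnvidia-ml -o nvlink_state
 */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int n = 0;
    nvmlInit();
    nvmlDeviceGetCount(&n);
    for (unsigned int d = 0; d < n; ++d) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(d, &dev);
        /* TITAN RTX exposes 2 NVLink links; probe indices until the
         * query reports the link as unsupported. */
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
                break;
            printf("GPU %u link %u: %s\n", d, link,
                   active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }
    nvmlShutdown();
    return 0;
}
```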
# 4. Install NCCL
4.1 Download NCCL
- https://developer.nvidia.com/nccl
4.2 Install NCCL

```
> sudo dpkg -i nccl-local-repo-ubuntu1804-2.10.3-cuda10.2_1.0-1_amd64.deb
> sudo apt update
> sudo apt install libnccl2 libnccl-dev
```
4.3 Verify the NCCL installation
```
> dpkg -l | grep nccl
ii  libnccl-dev                                 2.10.3-1+cuda10.2       amd64  NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                    2.10.3-1+cuda10.2       amd64  NVIDIA Collective Communication Library (NCCL) Runtime
ii  libvncclient1:amd64                         0.9.11+dfsg-1ubuntu1.4  amd64  API to write one's own VNC server - client library
ii  nccl-local-repo-ubuntu1804-2.10.3-cuda10.2  1.0-1                   amd64  nccl-local repository configuration files
ii  nccl-repo-ubuntu1804-2.7.8-ga-cuda10.2      1-1                     amd64  nccl repository configuration files
```
Re-running the bandwidth test afterwards shows no change, so NCCL is not the cause:
```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 555.65   8.89
     1   8.74 453.78
```
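One extra sanity check (not part of the original steps) is to confirm that the installed NCCL is linkable and report its version at runtime:

```c
/* nccl_version.c -- verify the installed NCCL is usable from code.
 * Build: gcc nccl_version.c -lnccl -o nccl_version
 */
#include <stdio.h>
#include <nccl.h>

int main(void) {
    int version = 0;
    ncclResult_t rc = ncclGetVersion(&version);
    if (rc != ncclSuccess) {
        printf("ncclGetVersion failed: %s\n", ncclGetErrorString(rc));
        return 1;
    }
    /* 2.10.3 is encoded as 21003 (major*10000 + minor*100 + patch). */
    printf("NCCL version: %d\n", version);
    return 0;
}
```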
# 5. Possible causes
Due to practical constraints I could not pin down the exact cause. Here are two likely culprits, in case they help:
- Overheating: according to reference [2], at 85-90°C the card hits its thermal limit and throttles its clocks. A possible fix is converting to water cooling or a blade-style server to keep GPU temperatures down (see the monitoring sketch after this list).
- Missing NVLink: `nvidia-smi nvlink --status` shows the links are inactive, so a possible fix is to buy an NVLink bridge and see whether it increases inter-GPU bandwidth.
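For the overheating hypothesis, a simple NVML polling loop can reveal throttling: a temperature sitting near 85-90°C while the SM clock drops suggests the card is hitting its thermal limit. A sketch of mine (sampling interval and duration are arbitrary):

```c
/* gpu_throttle_watch.c -- poll temperature and SM clock to spot
 * thermal throttling.
 * Build: gcc gpu_throttle_watch.c -I/usr/local/cuda/include -lnvidia-ml -o gpu_throttle_watch
 */
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void) {
    unsigned int n = 0;
    nvmlInit();
    nvmlDeviceGetCount(&n);
    for (int tick = 0; tick < 60; ++tick) {      /* ~1 minute of samples */
        for (unsigned int d = 0; d < n; ++d) {
            nvmlDevice_t dev;
            unsigned int temp = 0, sm_mhz = 0;
            nvmlDeviceGetHandleByIndex(d, &dev);
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
            /* Sustained high temperature plus a falling SM clock
             * indicates thermal throttling. */
            printf("GPU %u: %u C, SM %u MHz\n", d, temp, sm_mhz);
        }
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```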
# 6. Troubleshooting per reference [3]
------------ Update 2021/09/19 ------------
6.1 Check NVLink P2P support
```
> nvidia-smi topo -p2p n
        GPU0    GPU1
GPU0     X      NS
GPU1    NS       X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
```
The result is Not supported.
6.2 Check P2P communication
```
> cd ~/NVIDIA_CUDA-10.2_Samples/0_Simple/simpleP2P
> make
> ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN RTX (GPU0) -> TITAN RTX (GPU1) : No
> Peer access from TITAN RTX (GPU1) -> TITAN RTX (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
```
The result shows that without a bridge, the two cards cannot communicate peer-to-peer.
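Condensed, the core of what `simpleP2P` checks looks roughly like the sketch below (my simplification, not the sample's full source). On this machine the first check already fails, so `cudaMemcpyPeer` falls back to staging transfers through host memory:

```c
// enable_p2p.cu -- try to turn on direct peer access between GPU0 and GPU1.
// Build: nvcc enable_p2p.cu -o enable_p2p
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        // The case hit above: without an NVLink bridge these TITAN RTX
        // cards report no peer access in either direction.
        printf("P2P not available between GPU0 and GPU1\n");
        return 0;
    }
    cudaSetDevice(0);
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    printf("enable 0->1: %s\n", cudaGetErrorString(err));
    cudaSetDevice(1);
    err = cudaDeviceEnablePeerAccess(0, 0);
    printf("enable 1->0: %s\n", cudaGetErrorString(err));
    return 0;
}
```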
6.3 Enable TCC compute mode

```
> nvidia-smi -i 0 -dm TCC
```

Note that the TCC driver model is only available on Windows, so this step from reference [3] does not apply on a Linux machine like this one.
# 7. References
[1] https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/
[2] https://bizon-tech.com/blog/bizon-z9000-8gpu-deeplearning-server-rtx2080ti-titan-benchamarks-review#:~:text=Idle%20temperature%3A%2040C%20Max%20load%20noise%20level%20%28deep,%28deep%20learning%29%3A%206X%20times%20lower%20%28150%20vs.%20900%29
[3] https://www.cnblogs.com/devilmaycry812839668/p/13264080.html