# Troubleshooting Slow Multi-GPU Speed
# 1. Check how the GPUs communicate with each other (per reference [1])
nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	NODE	0-15,32-47	0
GPU1	NODE	 X 	0-15,32-47	0
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
On the dual-GPU RTX TITAN workstation, GPU0 and GPU1 communicate via NODE. Running the same command on a four-GPU server gives:
GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity
GPU0	 X 	PIX	SYS	SYS	0-19		N/A
GPU1	PIX	 X 	SYS	SYS	0-19		N/A
GPU2	SYS	SYS	 X 	PIX	0-19		N/A
GPU3	SYS	SYS	PIX	 X 	0-19		N/A
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks 
On this four-GPU RTX 2080Ti server:
- GPU0 and GPU1 communicate via PIX,
- GPU2 and GPU3 communicate via PIX,
- GPU0 communicates with GPU2 and GPU3 via SYS,
- GPU1 communicates with GPU2 and GPU3 via SYS.
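As a cross-check on the topology report, the CUDA runtime can also be asked directly whether each GPU pair supports peer-to-peer access. The sketch below is illustrative and not from the original post; the file name `p2p_check.cu` is my own (build with something like `nvcc p2p_check.cu -o p2p_check`):

```cpp
// p2p_check.cu -- illustrative sketch: ask the CUDA runtime whether each
// GPU pair supports direct peer-to-peer (P2P) access.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j,
                   canAccess ? "P2P supported" : "no P2P");
        }
    }
    return 0;
}
```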
# 2. Test the communication bandwidth
Use the P2P bandwidth test provided by CUDA Samples:
- https://github.com/NVIDIA/cuda-samples/tree/master/Samples/p2pBandwidthLatencyTest
How to test:
- Find the CUDA Samples release matching your CUDA version at https://github.com/NVIDIA/cuda-samples/releases
- Check the CUDA version with nvcc -V
- For example, my local version is CUDA 10.2; download and unzip that release, then build and run with the commands below. The two outputs that follow are from the dual-GPU workstation and the four-GPU server, respectively.
> cd cuda-samples-10.2/Samples/p2pBandwidthLatencyTest
> make
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 556.92   8.65 
     1   8.58 457.95
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 526.19   8.27  10.05  16.41 
     1   8.37 527.06  10.12  10.53 
     2  10.47   9.67 526.39   5.80 
     3  12.30   8.93   8.20 436.79
The inter-GPU communication bandwidth is indeed quite low; I will try to fix it by adding an NVLink bridge.
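Outside of the official sample, the unidirectional copy bandwidth between two GPUs can be estimated with a short CUDA sketch. This is only a rough approximation of what p2pBandwidthLatencyTest measures; the buffer size and iteration count below are arbitrary choices of mine:

```cpp
// bw_sketch.cu -- rough GPU0 -> GPU1 copy-bandwidth estimate (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB test buffer
    const int iters = 20;
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // Try to enable direct P2P in both directions; if this fails, the copies
    // silently fall back to staging through host memory.
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);   // default stream on GPU0
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (bytes * iters) / (ms * 1e6));
    return 0;
}
```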
------------ Updated 2021/09/19 ------------
After moving one GPU to a different PCIe slot, test again:
> nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	0-15,32-47	0
GPU1	SYS	 X 	16-31,48-63	1
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 554.55  11.42 
     1  11.30 558.10
After changing the slot, the two GPUs are bound to different CPU/NUMA groups, and the communication speed improves somewhat.
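The CPU Affinity column above can also be read programmatically through NVML, which is handy when pinning data-loading processes to the right cores. A minimal sketch, assuming the nvml.h header shipped with the CUDA toolkit and linking with -lnvidia-ml:

```cpp
// cpu_affinity.cu -- illustrative: print each GPU's ideal CPU affinity mask via NVML.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        unsigned long cpuSet[2] = {0, 0};          // room for 128 logical CPUs
        nvmlDeviceGetCpuAffinity(dev, 2, cpuSet);
        printf("GPU%u ideal CPU affinity mask: 0x%016lx%016lx\n",
               i, cpuSet[1], cpuSet[0]);
    }
    nvmlShutdown();
    return 0;
}
```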
# 3. Check the NVLink status
nvidia-smi nvlink --status
GPU 0: TITAN RTX (UUID: GPU-486a9b0f-d80e-668e-fa93-cf5988109248)
	 Link 0: <inactive>
	 Link 1: <inactive>
GPU 1: TITAN RTX (UUID: GPU-726723cc-04b1-d7a2-7638-0b73154449bc)
	 Link 0: <inactive>
	 Link 1: <inactive>
Both links on both cards are in the inactive state.
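The same per-link state can also be read through NVML; a minimal sketch (again linking with -lnvidia-ml) that skips links the hardware does not have:

```cpp
// nvlink_state.cu -- illustrative: query each NVLink's state, mirroring
// what nvidia-smi nvlink --status prints.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            // Non-existent links return an error and are skipped.
            if (nvmlDeviceGetNvLinkState(dev, link, &active) == NVML_SUCCESS)
                printf("GPU%u link %u: %s\n", i, link,
                       active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }
    nvmlShutdown();
    return 0;
}
```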
# 4. Install NCCL
4.1 Download NCCL
- https://developer.nvidia.com/nccl
4.2 Install NCCL
sudo dpkg -i nccl-local-repo-ubuntu1804-2.10.3-cuda10.2_1.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
4.3 Check the NCCL installation
> dpkg -l|grep nccl
ii  libnccl-dev                                2.10.3-1+cuda10.2                                   amd64        NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                   2.10.3-1+cuda10.2                                   amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  libvncclient1:amd64                        0.9.11+dfsg-1ubuntu1.4                              amd64        API to write one's own VNC server - client library
ii  nccl-local-repo-ubuntu1804-2.10.3-cuda10.2 1.0-1                                               amd64        nccl-local repository configuration files
ii  nccl-repo-ubuntu1804-2.7.8-ga-cuda10.2     1-1                                                 amd64        nccl repository configuration files
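To confirm the freshly installed NCCL actually runs across both GPUs, a minimal single-process all-reduce can be used. The sketch below is illustrative and not part of the original post (build with something like nvcc nccl_check.cu -lnccl):

```cpp
// nccl_check.cu -- illustrative: in-place sum all-reduce across 2 GPUs in one process.
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;
    int devs[nDev] = {0, 1};
    const size_t count = 1 << 20;                  // 1M floats per GPU

    float *buf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);            // one communicator per GPU

    // All ranks must be issued inside a group call when driven from one thread.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("NCCL all-reduce across %d GPUs completed\n", nDev);
    return 0;
}
```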
Testing the communication bandwidth again afterwards shows no change, so NCCL is not the cause:
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 555.65   8.89 
     1   8.74 453.78
# 5. Possible causes
Due to practical constraints the exact root cause was not identified; two likely causes are listed below in the hope that they help.
- Temperature too high: according to reference [2], at 85-90°C the card hits its thermal limit and starts to downclock. Possible fixes are converting to water cooling or a blade-style server to keep GPU temperatures down (see the monitoring sketch after this list).
- Missing NVLink: as shown by nvidia-smi nvlink --status. The fix would be to buy an NVLink bridge and check whether it increases the inter-GPU bandwidth.
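As mentioned in the first bullet, GPU temperature can be monitored programmatically through NVML to see whether the cards really are near the thermal limit under load; a minimal sketch (linking with -lnvidia-ml):

```cpp
// gpu_temp.cu -- illustrative: poll each GPU's core temperature via NVML.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        unsigned int temp = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU%u temperature: %u C\n", i, temp);
    }
    nvmlShutdown();
    return 0;
}
```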
# 6. Further troubleshooting based on reference [3]
------------ Updated 2021/09/19 ------------
6.1 Check NVLink support
> nvidia-smi topo -p2p n
 	GPU0	GPU1	
 GPU0	X	NS	
 GPU1	NS	X	
Legend:
  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
The result shows Not supported.
6.2 Check P2P communication
> cd ~/NVIDIA_CUDA-10.2_Samples/0_Simple/simpleP2P
> make
> ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN RTX (GPU0) -> TITAN RTX (GPU1) : No
> Peer access from TITAN RTX (GPU1) -> TITAN RTX (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
The result shows that without an NVLink bridge, the two cards cannot use P2P communication.
6.3 Enable TCC compute mode
nvidia-smi -i 0 -dm TCC
# 7. References
- [1] https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/ 
- [2] https://bizon-tech.com/blog/bizon-z9000-8gpu-deeplearning-server-rtx2080ti-titan-benchamarks-review#:~:text=Idle%20temperature%3A%2040C%20Max%20load%20noise%20level%20%28deep,%28deep%20learning%29%3A%206X%20times%20lower%20%28150%20vs.%20900%29 
- [3] https://www.cnblogs.com/devilmaycry812839668/p/13264080.html 
