# Troubleshooting Slow Multi-GPU Speed
# 1. Check how the GPUs communicate with each other (per reference [1])
nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	NODE	0-15,32-47	0
GPU1	NODE	 X 	0-15,32-47	0
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
On the dual-GPU RTX TITAN workstation, GPU0 and GPU1 communicate via NODE. Running the same command on a four-GPU server gives:
GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity
GPU0	 X 	PIX	SYS	SYS	0-19		N/A
GPU1	PIX	 X 	SYS	SYS	0-19		N/A
GPU2	SYS	SYS	 X 	PIX	0-19		N/A
GPU3	SYS	SYS	PIX	 X 	0-19		N/A
 
Legend:
 
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks 
On this four-GPU RTX 2080Ti server:
- GPU0 and GPU1 communicate via PIX,
- GPU2 and GPU3 communicate via PIX,
- GPU0 communicates with GPU2 and GPU3 via SYS,
- GPU1 communicates with GPU2 and GPU3 via SYS.
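As a cross-check on the topology report, the CUDA runtime can also be asked directly whether each GPU pair supports peer-to-peer access. The sketch below is illustrative and not from the original post; the file name `p2p_check.cu` is my own (build with something like `nvcc p2p_check.cu -o p2p_check`):

```cpp
// p2p_check.cu -- illustrative sketch: ask the CUDA runtime whether each
// GPU pair supports direct peer-to-peer (P2P) access.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j,
                   canAccess ? "P2P supported" : "no P2P");
        }
    }
    return 0;
}
```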
# 2. Test the communication bandwidth
Use the P2P bandwidth test provided by CUDA Samples:
- https://github.com/NVIDIA/cuda-samples/tree/master/Samples/p2pBandwidthLatencyTest
How to test:
- Find the CUDA Samples release matching your CUDA version at https://github.com/NVIDIA/cuda-samples/releases
- Check the CUDA version with nvcc -V
- For example, my local version is CUDA 10.2; download and unzip that release, then build and run with the commands below. The two outputs that follow are from the dual-GPU workstation and the four-GPU server, respectively.
> cd cuda-samples-10.2/Samples/p2pBandwidthLatencyTest
> make
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 556.92   8.65 
     1   8.58 457.95
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 526.19   8.27  10.05  16.41 
     1   8.37 527.06  10.12  10.53 
     2  10.47   9.67 526.39   5.80 
     3  12.30   8.93   8.20 436.79
The inter-GPU communication bandwidth is indeed quite low; I will try to fix it by adding an NVLink bridge.
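Outside of the official sample, the unidirectional copy bandwidth between two GPUs can be estimated with a short CUDA sketch. This is only a rough approximation of what p2pBandwidthLatencyTest measures; the buffer size and iteration count below are arbitrary choices of mine:

```cpp
// bw_sketch.cu -- rough GPU0 -> GPU1 copy-bandwidth estimate (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB test buffer
    const int iters = 20;
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // Try to enable direct P2P in both directions; if this fails, the copies
    // silently fall back to staging through host memory.
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);   // default stream on GPU0
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU0 -> GPU1: %.2f GB/s\n", (bytes * iters) / (ms * 1e6));
    return 0;
}
```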
------------ Updated 2021/09/19 ------------
After moving one GPU to a different PCIe slot, test again:
> nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	0-15,32-47	0
GPU1	SYS	 X 	16-31,48-63	1
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 554.55  11.42 
     1  11.30 558.10
After changing the slot, the two GPUs are bound to different CPU/NUMA groups, and the communication speed improves somewhat.
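The CPU Affinity column above can also be read programmatically through NVML, which is handy when pinning data-loading processes to the right cores. A minimal sketch, assuming the nvml.h header shipped with the CUDA toolkit and linking with -lnvidia-ml:

```cpp
// cpu_affinity.cu -- illustrative: print each GPU's ideal CPU affinity mask via NVML.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        unsigned long cpuSet[2] = {0, 0};          // room for 128 logical CPUs
        nvmlDeviceGetCpuAffinity(dev, 2, cpuSet);
        printf("GPU%u ideal CPU affinity mask: 0x%016lx%016lx\n",
               i, cpuSet[1], cpuSet[0]);
    }
    nvmlShutdown();
    return 0;
}
```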
# 3. Check the NVLink status
nvidia-smi nvlink --status
GPU 0: TITAN RTX (UUID: GPU-486a9b0f-d80e-668e-fa93-cf5988109248)
	 Link 0: <inactive>
	 Link 1: <inactive>
GPU 1: TITAN RTX (UUID: GPU-726723cc-04b1-d7a2-7638-0b73154449bc)
	 Link 0: <inactive>
	 Link 1: <inactive>
Both links on both cards are in the inactive state.
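The same per-link state can also be read through NVML; a minimal sketch (again linking with -lnvidia-ml) that skips links the hardware does not have:

```cpp
// nvlink_state.cu -- illustrative: query each NVLink's state, mirroring
// what nvidia-smi nvlink --status prints.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            // Non-existent links return an error and are skipped.
            if (nvmlDeviceGetNvLinkState(dev, link, &active) == NVML_SUCCESS)
                printf("GPU%u link %u: %s\n", i, link,
                       active == NVML_FEATURE_ENABLED ? "active" : "inactive");
        }
    }
    nvmlShutdown();
    return 0;
}
```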
# 4. Install NCCL
4.1 Download NCCL
- https://developer.nvidia.com/nccl
4.2 Install NCCL
sudo dpkg -i nccl-local-repo-ubuntu1804-2.10.3-cuda10.2_1.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
4.3 Check the NCCL installation
> dpkg -l|grep nccl
ii  libnccl-dev                                2.10.3-1+cuda10.2                                   amd64        NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                   2.10.3-1+cuda10.2                                   amd64        NVIDIA Collective Communication Library (NCCL) Runtime
ii  libvncclient1:amd64                        0.9.11+dfsg-1ubuntu1.4                              amd64        API to write one's own VNC server - client library
ii  nccl-local-repo-ubuntu1804-2.10.3-cuda10.2 1.0-1                                               amd64        nccl-local repository configuration files
ii  nccl-repo-ubuntu1804-2.7.8-ga-cuda10.2     1-1                                                 amd64        nccl repository configuration files
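To confirm the freshly installed NCCL actually runs across both GPUs, a minimal single-process all-reduce can be used. The sketch below is illustrative and not part of the original post (build with something like nvcc nccl_check.cu -lnccl):

```cpp
// nccl_check.cu -- illustrative: in-place sum all-reduce across 2 GPUs in one process.
#include <cstdio>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    const int nDev = 2;
    int devs[nDev] = {0, 1};
    const size_t count = 1 << 20;                  // 1M floats per GPU

    float *buf[nDev];
    cudaStream_t streams[nDev];
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);            // one communicator per GPU

    // All ranks must be issued inside a group call when driven from one thread.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("NCCL all-reduce across %d GPUs completed\n", nDev);
    return 0;
}
```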
Testing the communication bandwidth again afterwards shows no change, so NCCL is not the cause:
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 555.65   8.89 
     1   8.74 453.78
# 5. Possible causes
Due to practical constraints the exact root cause was not identified; two likely causes are listed below in the hope that they help.
- Temperature too high: according to reference [2], at 85-90°C the card hits its thermal limit and starts to downclock. Possible fixes are converting to water cooling or a blade-style server to keep GPU temperatures down (see the monitoring sketch after this list).
- Missing NVLink: as shown by nvidia-smi nvlink --status. The fix would be to buy an NVLink bridge and check whether it increases the inter-GPU bandwidth.
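As mentioned in the first bullet, GPU temperature can be monitored programmatically through NVML to see whether the cards really are near the thermal limit under load; a minimal sketch (linking with -lnvidia-ml):

```cpp
// gpu_temp.cu -- illustrative: poll each GPU's core temperature via NVML.
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(i, &dev);
        unsigned int temp = 0;
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU%u temperature: %u C\n", i, temp);
    }
    nvmlShutdown();
    return 0;
}
```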
# 6. Further troubleshooting based on reference [3]
------------ Updated 2021/09/19 ------------
6.1 Check NVLink support
> nvidia-smi topo -p2p n
 	GPU0	GPU1	
 GPU0	X	NS	
 GPU1	NS	X	
Legend:
  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
The result shows Not supported.
6.2 Check P2P communication
> cd ~/NVIDIA_CUDA-10.2_Samples/0_Simple/simpleP2P
> make
> ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN RTX (GPU0) -> TITAN RTX (GPU1) : No
> Peer access from TITAN RTX (GPU1) -> TITAN RTX (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
The result shows that without an NVLink bridge, the two cards cannot use P2P communication.
6.3 Enable TCC compute mode
nvidia-smi -i 0 -dm TCC
# 7. References
- [1] https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/ 
- [2] https://bizon-tech.com/blog/bizon-z9000-8gpu-deeplearning-server-rtx2080ti-titan-benchamarks-review#:~:text=Idle%20temperature%3A%2040C%20Max%20load%20noise%20level%20%28deep,%28deep%20learning%29%3A%206X%20times%20lower%20%28150%20vs.%20900%29 
- [3] https://www.cnblogs.com/devilmaycry812839668/p/13264080.html 
