Muyun99
2021-09-07


# Troubleshooting Slow Multi-GPU Communication

# 1. Check how the GPUs communicate, following reference [1]

```
nvidia-smi topo --matrix

	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	NODE	0-15,32-47	0
GPU1	NODE	 X 	0-15,32-47	0

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
On the dual TITAN RTX workstation, this shows that GPU0 and GPU1 communicate via NODE.

```
	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity
GPU0	 X 	PIX	SYS	SYS	0-19		N/A
GPU1	PIX	 X 	SYS	SYS	0-19		N/A
GPU2	SYS	SYS	 X 	PIX	0-19		N/A
GPU3	SYS	SYS	PIX	 X 	0-19		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

The output above is from the four-card RTX 2080Ti server, and it shows (a quick cross-check from Python follows this list):

  • GPU0 and GPU1 communicate via PIX
  • GPU2 and GPU3 communicate via PIX
  • GPU0 communicates with GPU2 and GPU3 via SYS
  • GPU1 communicates with GPU2 and GPU3 via SYS
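
The same pairing can be cross-checked at the driver level from PyTorch, which reports whether each GPU pair supports direct peer-to-peer access. A minimal sketch, assuming a working PyTorch CUDA install:

```python
# Query P2P capability for every GPU pair; this should line up with
# what `nvidia-smi topo --matrix` reports for each pair.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: P2P {'supported' if ok else 'not supported'}")
```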

# 2. Test the communication bandwidth

Use the P2P bandwidth test program provided with the CUDA Samples:

  • https://github.com/NVIDIA/cuda-samples/tree/master/Samples/p2pBandwidthLatencyTest

Test procedure:

  • Find the CUDA Samples release matching your CUDA version at https://github.com/NVIDIA/cuda-samples/releases
  • Check your CUDA version with nvcc -V
  • For example, my machine runs CUDA 10.2; download and unzip the matching release, then build and run it with the commands below
```
> cd cuda-samples-10.2/Samples/p2pBandwidthLatencyTest
> make
> ./p2pBandwidthLatencyTest
```

On the dual TITAN RTX workstation:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 556.92   8.65 
     1   8.58 457.95
```

On the four-card RTX 2080Ti server:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 526.19   8.27  10.05  16.41 
     1   8.37 527.06  10.12  10.53 
     2  10.47   9.67 526.39   5.80 
     3  12.30   8.93   8.20 436.79
```

The inter-GPU bandwidth really is rather low; I will try installing an NVLink bridge to fix it.
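
If compiling the CUDA Samples is inconvenient, a rough device-to-device number can also be measured from PyTorch. A minimal sketch, assuming two visible GPUs; it times a plain `Tensor.to` copy rather than the samples' P2P path, so treat the figure as indicative only:

```python
# Time a 1 GiB device-to-device copy and report GB/s.
import time
import torch

size_mb = 1024
x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda:0")

y = x.to("cuda:1")                 # warm-up copy
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")

t0 = time.perf_counter()
y = x.to("cuda:1")
torch.cuda.synchronize("cuda:0")
torch.cuda.synchronize("cuda:1")
elapsed = time.perf_counter() - t0

print(f"GPU0 -> GPU1: {size_mb / 1024 / elapsed:.2f} GB/s")
```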

------------ Update 2021/09/19 ------------

After moving the cards to different PCIe slots, test again:

```
> nvidia-smi topo --matrix
	GPU0	GPU1	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	0-15,32-47	0
GPU1	SYS	 X 	16-31,48-63	1
```

```
> ./p2pBandwidthLatencyTest
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 554.55  11.42 
     1  11.30 558.10
```

After the slot change, the two GPUs are attached to different CPU/NUMA nodes (note the distinct CPU affinity sets above), and communication is somewhat faster.
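
With per-GPU CPU affinities like those above, it can also help to pin each worker process to the CPU set local to its GPU. A minimal sketch (Linux only), using GPU0's affinity range 0-15,32-47 from the matrix above:

```python
# Pin the current process to the CPUs local to GPU0's NUMA node,
# per the "CPU Affinity" column of `nvidia-smi topo --matrix`.
import os

gpu0_cpus = set(range(0, 16)) | set(range(32, 48))
os.sched_setaffinity(0, gpu0_cpus)   # 0 = this process
print(os.sched_getaffinity(0))       # confirm the new affinity mask
```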

# 3. Check NVLink status

```
nvidia-smi nvlink --status

GPU 0: TITAN RTX (UUID: GPU-486a9b0f-d80e-668e-fa93-cf5988109248)
	 Link 0: <inactive>
	 Link 1: <inactive>
GPU 1: TITAN RTX (UUID: GPU-726723cc-04b1-d7a2-7638-0b73154449bc)
	 Link 0: <inactive>
	 Link 1: <inactive>
```

Both links on both cards are inactive.
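
The same per-link state can be read programmatically through NVML. A hedged sketch using the pynvml bindings (assumes the nvidia-ml-py package is installed; on some driver versions an inactive link raises an NVML error instead of returning a disabled state):

```python
# Query NVLink link state for GPU 0, mirroring `nvidia-smi nvlink --status`.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for link in range(2):  # TITAN RTX exposes two NVLink links
    try:
        state = pynvml.nvmlDeviceGetNvLinkState(handle, link)
        print(f"Link {link}: {'active' if state else 'inactive'}")
    except pynvml.NVMLError as err:
        print(f"Link {link}: <unavailable> ({err})")
pynvml.nvmlShutdown()
```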

# 4. Install NCCL


4.1 Download NCCL

  • https://developer.nvidia.com/nccl

4.2 Install NCCL

```
sudo dpkg -i nccl-local-repo-ubuntu1804-2.10.3-cuda10.2_1.0-1_amd64.deb
sudo apt update
sudo apt install libnccl2 libnccl-dev
```
4.3 Check the NCCL installation

```
> dpkg -l | grep nccl

ii  libnccl-dev                                2.10.3-1+cuda10.2   amd64   NVIDIA Collective Communication Library (NCCL) Development Files
ii  libnccl2                                   2.10.3-1+cuda10.2   amd64   NVIDIA Collective Communication Library (NCCL) Runtime
ii  libvncclient1:amd64                        0.9.11+dfsg-1ubuntu1.4   amd64   API to write one's own VNC server - client library
ii  nccl-local-repo-ubuntu1804-2.10.3-cuda10.2 1.0-1               amd64   nccl-local repository configuration files
ii  nccl-repo-ubuntu1804-2.7.8-ga-cuda10.2     1-1                 amd64   nccl repository configuration files
```
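
Note that PyTorch links against its own NCCL build, which may differ from the apt packages. A quick sketch to see which version the training stack actually uses:

```python
# Print the NCCL version PyTorch was built against.
import torch

# Returns a tuple like (2, 10, 3) on recent builds
# (older PyTorch versions return an integer such as 2708).
print(torch.cuda.nccl.version())
```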

Re-testing afterwards, the bandwidth is unchanged, so NCCL is not the cause:

```
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 555.65   8.89 
     1   8.74 453.78
```

# 5. Possible causes

For practical reasons I could not pin down the exact cause. Here are two possibilities, which I hope are helpful (a temperature-check sketch follows this list):

  • Overheating: according to reference [2], at 85-90 °C the cards hit their thermal limit and throttle. A possible fix is switching to water cooling or a blade-style server chassis to keep GPU temperatures down.
  • Missing NVLink: as nvidia-smi nvlink --status showed, the links are inactive. A possible fix is buying an NVLink bridge and checking whether it increases inter-GPU bandwidth.
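
To watch for the throttling scenario, GPU core temperature can be polled through NVML. A minimal sketch, assuming the nvidia-ml-py package:

```python
# Poll each GPU's core temperature; sustained 85-90 C suggests throttling.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU{i}: {temp} C")
pynvml.nvmlShutdown()
```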

# 6. Further checks following reference [3]

------------ Update 2021/09/19 ------------

6.1 Check NVLink support

```
> nvidia-smi topo -p2p n

 	GPU0	GPU1	
 GPU0	X	NS	
 GPU1	NS	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
```

The result is NS (Not supported).

6.2 Check P2P communication

```
> cd ~/NVIDIA_CUDA-10.2_Samples/0_Simple/simpleP2P
> make
> ./simpleP2P

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from TITAN RTX (GPU0) -> TITAN RTX (GPU1) : No
> Peer access from TITAN RTX (GPU1) -> TITAN RTX (GPU0) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
```

The result shows that without a bridge, the two cards cannot communicate peer-to-peer.

6.3 Enable TCC compute mode

```
nvidia-smi -i 0 -dm TCC
```

Note that the TCC driver model only exists on Windows; on a Linux machine like this one, switching the driver model with nvidia-smi -dm does not apply.

# 7. References

  • [1] https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/

  • [2] https://bizon-tech.com/blog/bizon-z9000-8gpu-deeplearning-server-rtx2080ti-titan-benchamarks-review#:~:text=Idle%20temperature%3A%2040C%20Max%20load%20noise%20level%20%28deep,%28deep%20learning%29%3A%206X%20times%20lower%20%28150%20vs.%20900%29

  • [3] https://www.cnblogs.com/devilmaycry812839668/p/13264080.html

Last updated: 2021/09/26, 00:09:41