为自己的 inicls 框架集成 Horovod
# 为自己的 inicls 框架集成 Horovod
1、改动代码
import torch
import horovod.torch as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(hvd.local_rank())
# Define dataset...
train_dataset = ...
# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=..., sampler=train_sampler)
# Build model...
model = ...
model.cuda()
optimizer = optim.SGD(model.parameters())
# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Broadcast parameters from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
for epoch in range(100):
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % args.log_interval == 0:
print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
epoch, batch_idx * len(data), len(train_sampler), loss.item()))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
2、训练脚本
export CUDA_VISIBLE_DEVICES=0,1
python train.py config/resnet/resnet18_b16x8_kaggle_leaves.py --tag resnet_fp16_lr_0.1_horovod --options "fp16=True" "data.train.ann_file=train_fold0.csv" "data.val.ann_file=valid_fold0.csv"
1
2
3
2
3
3、实验结果
实验名称 | Batch_size | 学习率 | 验证集上最优精度 | 训练时间 | 显存占用 |
---|---|---|---|---|---|
resnet_fp16_lr_0.1_horovod | 64 | 0.1 | 0.9327 | 19m30s | 每张卡 1600MiB |
resnet_fp16_lr_0.1_horovod_gpu0 | 64 | 0.1 | 0.9349 | 12m53s | GPU0 占用2100MiB |
resnet_fp16_lr_0.1_horovod_gpu1 | 64 | 0.1 | 0.9324 | 16m6s | GPU1 占用2100MiB |
# 参考资料
- https://github.com/horovod/horovod/blob/master/docs/pytorch.rst
上次更新: 2021/08/02, 21:04:52
- 02
- README 美化05-20
- 03
- 常见 Tricks 代码片段05-12